libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-04-30 19:09:18 +08:00

Author	SHA1	Message	Date
Frank Barchard	f2ac6db694	RAWToNV21 using SME, SVE, I8MM or Neon Pixel 9 Now SVE2 2 pass LibYUVConvertTest.RAWToNV21_Opt (364 ms) 31.76% libyuv::ARGBToUVMatrixRow_SVE_SC() 30.38% RAWToARGBRow_SVE2 26.81% ARGBToYMatrixRow_NEON_DotProd 3.26% MergeUVRow_NEON Was NEON 1 pass LibYUVConvertTest.RAWToJNV21_Opt (295 ms) 44.14% RAWToYJRow_NEON 41.91% RAWToUVJRow_NEON 5.11% MergeUVRow_NEON Clang on Intel Skylake clang [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (301 ms) visual c (row_win) [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (2056 ms) clang [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (275 ms) visual c [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (365 ms) Bug: libyuv:42280902 Change-Id: Iaba558ebe96ce6b9881ee9335ba72b8aac390cde Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7802432 Commit-Queue: Frank Barchard <fbarchard@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-29 13:11:04 -07:00
Frank Barchard	4afb965416	RAWToARGB use AVX512BW Bug: libyuv:42280902 Change-Id: I7a80fd64d97b6d411316819df0fd917d609a173b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787163 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-22 16:56:46 -07:00
Frank Barchard	bd2c4c76ec	RAWToARGB AVX512VBMI Bug: libyuv:42280902 Change-Id: I1c7f432f004079357a00515785bc524c459ed4b9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787160 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-22 14:48:29 -07:00
Frank Barchard	d445250d8b	Replace RAWToY/RGB24ToY with RGBToYMatrix Bug: libyuv:42280902 Change-Id: I6ddebd492036c416550fc045eb39493dea73246b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7784094 Commit-Queue: Frank Barchard <fbarchard@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-21 17:11:14 -07:00
Frank Barchard	81f698829b	Add RGBToNV21Matrix function - implement wrappers with RAW, RGB24, NV21 and JNV21 to call it. Zen5 Was [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1146 ms) Now [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1446 ms) reason - the new code uses 1 pass for RAWToY but 2 pass for RAWToARGB,ARGBToUV. needs 1 RGBToUV Bug: libyuv:42280902 Change-Id: Ife6fbed0829484045409e6d42b85cec1d1fd6052 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7780026 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-20 18:03:34 -07:00
Frank Barchard	9f13b2814d	add RGBToYMatrixRow_AVX2 Adds RGBToYMatrixRow_AVX2 which reads 24 bit RGB values by reading 3 vectors instead of 4 and permutes them into 4 ARGB vectors before conversion. Also adds RGBToYMatrixRow_Opt and RGBToYMatrixRow_2Step_Opt to convert_argb_test.cc to benchmark and compare the direct AVX2 conversion vs a 2-step approach. ./libyuv_test '--gunit_filter=*RAWToJ400_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1 AMD Zen 5 Was LibYUVConvertTest.RAWToJ400_Opt (757 ms) Now LibYUVConvertTest.RAWToJ400_Opt (699 ms) Intel Skylake Was LibYUVConvertTest.RAWToJ400_Opt (1705 ms) Now LibYUVConvertTest.RAWToJ400_Opt (1426 ms) Bug: 477295731 Change-Id: I29866baf4ad5fe7a3725e4a01f2fe24649510a7d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7777325 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-20 12:52:44 -07:00
Frank Barchard	ace7c4573c	Add ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and wrappers This change implements ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and their wrappers (ARGBToUVRow_RVV, ARGBToUVJRow_RVV, etc.) using RVV intrinsics, mirroring the NEON/AVX2 designs. It wires them into the build and dispatch systems. LIBYUV_RVV_HAS_TUPLE_TYPE is always true on new compilers. This macro has been removed, assuming it is true everywhere, reducing the amount of code in row_rvv.cc, scale_rvv.cc, and row.h. Tested via: ~/bin/doyuv3v && ~/bin/runyuv3v TestARGBToI444Matrix ~/bin/doyuv3av Bug: libyuv:42280902 Change-Id: I36d305386b297d69023c068aa9c62ab6b2ad039c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7769956 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-16 20:52:43 -07:00
Frank Barchard	e034c41661	Port ARGBToUVMatrixRow from AVX2 to AVX512BW Benchmark on Icelake Xeon Now AVX512BW: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (1723 ms) Was AVX2: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2144 ms) - Added `ARGBToUVMatrixRow_AVX512BW` implementation in `source/row_gcc.cc`. - Added corresponding `ARGBToUVRow_AVX512BW` and `ABGRToUVRow_AVX512BW` functions. - Added unaligned wrappers `ARGBToUVRow_Any_AVX512BW` and `ABGRToUVRow_Any_AVX512BW` in `source/row_any.cc`. - Updated `source/row_any.cc` to correctly size `vin` and `vout` buffers for AVX512BW width and adjusted the `ANY12MS` and `ANY12S` macros to handle `MASK=63`. - Updated `include/libyuv/row.h` with the required AVX512BW headers and definitions, scoped appropriately. - Wired all callers of `ARGBToUVRow_AVX2` and related functions in `source/convert.cc` and `source/convert_from_argb.cc` to dynamically use the `AVX512BW` implementations if the CPU flag indicates AVX-512BW support. - Optimized AVX-512 code to generate the `-1` multiplier in a single instruction (`vpternlogd`) and reused it across word (`vpmaddwd`) dot products. Handled the resulting negation by replacing a subtraction with `vpaddw` offset adjustment. Bug: 477295731 R=dalecurtis@chromium.org, rrwinterton@gmail.com Change-Id: Ida5fb27e59ae4c1c3824737f009b80549cd20a06 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7763257 Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-14 16:15:31 -07:00
Frank Barchard	cbc64c353c	Port ARGBToYRow_AVX2 usages to dynamically use ARGBToYRow_AVX512BW I have successfully ported the usage of ARGBToYRow_AVX2 to dynamically detect and utilize ARGBToYRow_AVX512BW when available. Here's a summary of the changes: 1. Source Modifications: In both source/convert.cc and source/convert_from_argb.cc, I searched for all references where ARGBToYRow_AVX2 was being conditionally used (which operates on 32 pixels). 2. AVX512BW Detection: Immediately following those blocks, I injected a new check for kCpuHasAVX512BW. If the CPU flag is present, the logic now utilizes ARGBToYRow_Any_AVX512BW by default, falling back to the fully aligned ARGBToYRow_AVX512BW when the width is aligned to 64 bytes. 3. Profiling: After building and compiling the tests (doyuv3x), I validated the change using perfyuv3 ARGBToNV12_Opt \| cat. The test successfully executed and the performance profile indicated that ARGBToYRow_AVX512BW successfully executed (taking up ~18% of CPU cycles, replacing the previous AVX2 specific instruction overhead for the Y row extraction). The HAS_ARGBTOYROW_AVX512BW macro implementation now fully supports all AVX2 conversion paths to utilize AVX512BW when the system processor flags allow it! R=richard, rrwinterton@gmail.com Change-Id: Iad811e12d301f5621e6f6d039105420861ade43e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760779 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-14 11:42:59 -07:00
Frank Barchard	893eacf9b4	ARGBToY for AVX512 - add ARGBToYMatrixRow_AVX512BW - refactor SSE and AVX to use Matrix functions, making old functions call the new ones. Zen5 1280x720 Was AVX2 LibYUVConvertTest.ARGBToI444_Opt (1125 ms) Now AVX512 LibYUVConvertTest.ARGBToI444_Opt (641 ms) Details by Gemini: 1. Created 3 new Matrix functions: Added ARGBToYMatrixRow_SSSE3, ARGBToYMatrixRow_AVX2, and ARGBToYMatrixRow_AVX512BW to source/row_gcc.cc. These take the const struct ArgbConstants* c parameter similarly to ARGBToUV444MatrixRow_. The x86 vector instructions dynamically calculate the needed values using the properties of the constants struct, including using vpmaddwd inside the AVX512 code to offset the lack of a native vphaddw. 2. Replaced Old Functions with Wrappers: Modified the existing implementations of ARGBToYRow_SSSE3, ARGBToYJRow_SSSE3, ABGRToYRow_SSSE3, ABGRToYJRow_SSSE3, RGBAToYRow_SSSE3, RGBAToYJRow_SSSE3, BGRAToYRow_SSSE3 (and their _AVX2 equivalents) in source/row_gcc.cc to act as inline wrappers calling the new ARGBToYMatrixRow_ functions, passing the right matrix parameters (e.g. &kArgbI601Constants, &kArgbJPEGConstants, &kAbgrI601Constants). 3. Added row_any.cc Handlers: Added ANY11MC definitions to source/row_any.cc to autogenerate ARGBToYMatrixRow_Any_SSSE3, ARGBToYMatrixRow_Any_AVX2, and ARGBToYMatrixRow_Any_AVX512BW which safely handles non-aligned tails. 4. Updated include/libyuv/row.h: Updated the headers with the proper void declarations for all newly generated Matrix and Any_ variants. Also defined HAS_ARGBTOYROW_AVX512BW in the CPU macros. 5. Tested the Implementations: Compiled and tested on Linux x86, which resulted in all tests passing cleanly. Also successfully completed all Windows 32-bit build checks ensuring 32-bit regression prevention without issues. Bug: 477295731 Change-Id: I4f5eec9a961e24a9d760d0a1c0810fb5e29a0bd1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7759494 Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-13 17:26:07 -07:00
Frank Barchard	4c3d7d517a	ARGBToUV444 for AVX512 1.27x faster on AMD Zen5 (turin) Now AVX512 perf record ./libyuv_test '--gunit_filter=*ARGBToI444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1071 ms) Overhead Symbol 53.49% ARGBToYRow_AVX2 44.70% ARGBToUV444Row_AVX512BW Was AVX2 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1369 ms) 61.06% ARGBToUV444Row_AVX2 37.67% ARGBToYRow_AVX2 Bug: libyuv:42280902 Change-Id: I306fbac656d6f7834ce1559e86d01eb34931ec3c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7738362 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-08 19:25:41 -07:00
Dale Curtis	1170363ce5	Add Gemini implementation for NEON32 RGB to YUV matrix operations These are about 25% faster than the C versions. Bug: libyuv:42280902 Change-Id: I8b298670ee5f3ed5db35527fc41d6d9a51b020a1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7573682 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-23 16:30:44 -07:00
Frank Barchard	4183733af5	Rename MergeUVRow_ variable to MergeUVRow Bug: libyuv:42280902 Change-Id: I9935bf958b901ddf84cf91b2097c8cd5d6efadde Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7683070 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-03-18 17:18:25 -07:00
Dale Curtis	b1cacfb38f	Unify X86/X64 versions of ARGBToI4xxMatrix functions Change-Id: Iead13414414543e5f10ba9ba47a6ceaeb3113dee Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7562443 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-18 16:27:07 -07:00
Dale Curtis	f69a479f04	Add ARGBToNV12Matrix implementation This one reuses the SIMD implementations for MergeUVRow_ from the existing ARGBToNV12 functions. Bug: libyuv:42280902 Change-Id: If0a4be133d657ed0262f29fdd568dac90b49636c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564317 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-18 16:26:59 -07:00
Dale Curtis	2c21d57319	Add ABGR versions of the ArgbConstants structures This allows for ABGR conversion using the same methods Bug: libyuv:42280902 Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-17 17:28:51 -07:00
Dale Curtis	30809ff64a	Add ARGBToI4xxMatrix variants This was implemented by Gemini followed by manual review and some tweaking for style. The 601 and JPEG constants are fully verified against the existing non-matrix implementations. On x86 the C-only versions appear to be about 25% slower than the optimized ones. Bug: libyuv:42280902 Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-04 10:55:06 -08:00
Frank Barchard	2b4453d46f	Deprecate MIPS and MSA support. - Remove *_msa.cc source files - Update build files - Update header references, planar ifdefs for row functions - Update documentation on supported platforms - Version bumped to 1921 - clang-format applied Bug: 434383432 Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953 Reviewed-by: Justin Green <greenjustin@google.com>	2025-10-16 12:20:40 -07:00
George Steed	007b920232	[AArch64] Add SME implementation of ARGBToUVRow and similar Mostly just a straightforward copy of the existing SVE2 code ported to Streaming-SVE. Introduce new "any" kernels for non-multiple of two cases, matching what we already do for SVE2. The existing SVE2 code makes use of the Neon MOVI instruction that is not supported in Streaming-SVE, so adjust the code to use FMOV instead which has the same performance characteristics. Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-30 09:20:23 -07:00
George Steed	61bdaee13a	Add Neon I8MM implementations of ARGB to UV and variants The maximum coefficient is 128, so store constants negated to take advantage of -128 being representable in 8-bit integers. This allows us to use the I8MM USDOT instructions. Reduction in time taken observed compared to the existing Neon implementation, as a geomean of all ARGBToUV variants: Cortex-A510: -7.1% Cortex-A520: -2.1% Cortex-A710: -8.4% Cortex-A715: -0.3% Cortex-A720: -0.3% Cortex-X2: -40.0% Cortex-X3: -43.3% Cortex-X4: -11.3% Cortex-X925: -2.5% Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-05-12 11:14:00 -07:00
Frank Barchard	61354d2671	ARGBToUV Matrix for AVX2 and SSSE3 - Round before shifting to 8 bit to match NEON - RAWToARGB use unaligned loads and port to AVX2 Was C/SSSE/AVX2 ARGBToI444_Opt (343 ms) ARGBToJ444_Opt (677 ms) RAWToI444_Opt (405 ms) RAWToJ444_Opt (803 ms) Now AVX2 ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (284 ms) RAWToI444_Opt (316 ms) RAWToJ444_Opt (339 ms) Profile Now AVX2 38.31% ARGBToUVJ444Row_AVX2 32.31% RAWToARGBRow_AVX2 23.99% ARGBToYJRow_AVX2 Profile Was C/SSSE/AVX2 73.15% ARGBToUVJ444Row_C 15.74% RAWToARGBRow_SSSE3 8.87% ARGBToYJRow_AVX2 Bug: 381138208 Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-10 18:36:18 -08:00
Frank Barchard	c1bac9e6a5	RAWToJ444 and ARGBToJ444 - ARGBToJ444 implements ARGBToUVJ444Row_C - RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444 libyuv_test '--gunit_filter=RTo?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 (with bit exact off) Samsung S23 RAWToJ444_Opt (437 ms) ARGBToJ444_Opt (337 ms) ARGBToI444_Opt (196 ms) Skylake Xeon RAWToJ444_Opt (1699 ms) ARGBToJ444_Opt (1559 ms) ARGBToI444_Opt (346 ms) Bug: 390247964 Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-29 15:18:38 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE; we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	a4ccf9940e	[AArch64] Add I8MM implementation of ARGBToUV444Row We cannot use the standard dot-product instructions since the coefficients multiplication results are both added and subtracted, but I8MM supports mixed-sign dot products which work well here. We need to add an additional variant of the coefficient structs since we need negative constants for the elements that were previously subtracted. Reduction in runtimes observed compared to the previous Neon implementation: Cortex-A510: -37.3% Cortex-A520: -31.1% Cortex-A715: -37.1% Cortex-A720: -37.0% Cortex-X2: -62.1% Cortex-X3: -62.2% Cortex-X4: -40.4% Bug: libyuv:977 Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-16 17:32:52 +00:00
George Steed	899bc48327	[AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD4 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementations: \| ARGBToRAWRow \| ARGBToRGB24Row Cortex-A510 \| -50.8% \| -19.9% Cortex-A720 \| -39.8% \| -39.1% Cortex-X2 \| -66.5% \| -51.9% Bug: libyuv:973 Change-Id: Iaead678715a3d70d54cf823391272a6196836769 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:27:54 +00:00
George Steed	250e1e1ba3	[AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -21.7% Cortex-A720: -49.2% Cortex-X2: -62.6% Bug: libyuv:973 Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-03 23:15:04 +00:00
George Steed	bce3392830	[AArch64] Add SVE2 implementation of ARGBToRGB565Row Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -27.1% Cortex-A720: -49.4% Cortex-X2: -67.9% Bug: libyuv:973 Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-31 17:42:27 +00:00
George Steed	9fac9a4a82	[AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row There are existing x86 implementations for these kernels but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| ABGRToAR30Row \| ARGBToAR30Row Cortex-A55 \| -55.1% \| -55.1% Cortex-A510 \| -39.3% \| -40.1% Cortex-A76 \| -62.3% \| -63.6% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:35:07 +00:00
George Steed	6f1d8b1e11	[AArch64] Add SVE2 implementations for ARGBToUVRow and similar By maintaining the interleaved format of the data we can use a common kernel for all input channel orderings and simply pass a different vector of constants instead. A similar approach is possible with only Neon by making use of multiplies and repeated application of ADDP to combine channels, however this is slower on older cores like Cortex-A53 so is not pursued further. For odd problem sizes we need a slightly different implementation for the final element, so introduce an "any" kernel to address that rather than bloating the code for the common case. Observed effect on runtimes compared to the existing Neon kernels: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 ABGRToUVJRow \| -15.5% \| +5.4% \| -33.1% ABGRToUVRow \| -15.6% \| +5.3% \| -35.9% ARGBToUVJRow \| -10.1% \| +5.4% \| -32.7% ARGBToUVRow \| -10.1% \| +5.4% \| -29.3% BGRAToUVRow \| -15.5% \| +4.6% \| -32.8% RGBAToUVRow \| -10.1% \| +4.2% \| -36.0% Bug: libyuv:973 Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-01 19:46:43 +00:00
George Steed	f2e78e1304	[AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow Using the dot-product instructions here allows us to avoid needing LD4 for loading individual colour channels, which gives a big benefit on some micro-architectures where such instructions perform significantly worse than LD1. In addition the dot-product instructions have higher throughput compared to the Neon Observed reduction in runtimes for selected kernels moving from _NEON to _NEON_DotProd: Kernel \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 \| Cortex-X2 ABGRToYJRow \| -6.5% \| -22.5% \| -43.5% \| -71.2% ABGRToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.3% ARGBToYJRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% ARGBToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% BGRAToYRow \| -6.5% \| -22.5% \| -42.3% \| -68.4% RGBAToYJRow \| -6.5% \| -22.5% \| -42.2% \| -73.7% RGBAToYRow \| -6.5% \| -22.5% \| -42.3% \| -64.9% Bug: libyuv:977 Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:09:36 +00:00
Frank Barchard	def473f501	malloc return 1 for failures and assert for internal functions Bug: libyuv:968 Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-12-04 22:55:20 +00:00
Frank Barchard	31e1d6f896	Check allocations that return NULL and return early BUG=libyuv:968 Change-Id: I9e8594440a6035958511f9c50072820131331fc8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-10-27 17:41:36 +00:00
Bruce Lai	ec2e9ca000	[RVV] Support AR64ToAB64 and RGBA-family color conversions Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB. They were originally implemented via ARGBShuffle. This CL implements them independently and enables them only for RISC-V for now. This CL also adds RVV implementations for `RGBA-family <-> RGBA-family` color conversions. * Run on SiFive internal FPGA(VLEN=128): Test Case Speedup AR64ToAB64_Opt x4.6 ARGBToRGBA_Opt x6 ARGBToBGRA_Opt x6 ARGBToABGR_Opt x6 RGBAToARGB_Opt x6 Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-09-05 22:44:48 +00:00
Frank Barchard	157b153b60	Fix tidy warning that uint32_t dither4 should not be const - Remove const from uint32_t dither4 parameter to fix clang-tidy warning - Apply clang format - Bump version - Remove unused MMI source; superseded by MSA Bug: None Change-Id: Id49991db25bca4e99590b415312542d917471c62 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-06-02 00:42:02 +00:00
Darren Hsieh	d14bd701c8	[RVV] Enable CopyRow_RVV, InterpolateRow_RVV, {Merge,Split}UVRow_RVV * Run on SiFive internal FPGA: MergeUVPlane_Opt(~6x vs scalar) SplitUVPlane_Opt(~6x vs scalar) TestCopyPlane(~8x vs scalar) ARGBInterpolate0_Opt(~10x vs scalar) ARGBInterpolate64_Opt(~9x vs scalar) ARGBInterpolate168_Opt(~9x vs scalar) ARGBInterpolate192_Opt(~8.5x vs scalar) ARGBInterpolate255_Opt(~8x vs scalar) Bug: libyuv:956 Change-Id: I8372341865f75f42e30371ef943d5c2e4be7b79a Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4574186 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-30 09:10:35 +00:00
Lu Wang	8670bcf17f	Optimize the following 19 functions with LSX in row_lsx.cc. UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX, ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX, ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX, ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX, ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX, ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX, ARGBSepiaRow_LSX Bug: libyuv:913 Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-19 18:55:58 +00:00
Frank Barchard	6a68b18a96	Bump version and apply clang format Bug: libyuv:956 Change-Id: I2375a02583789af2a5f13f8dba6c663d5975aaa9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4522352 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-11 11:27:28 +00:00
Bruce Lai	59eae49f17	Enable ARGBToYMatrixRow_RVV/RGBAToYMatrixRow_RVV/RGBToYMatrixRow_RVV Run on SiFive internal FPGA: ARGBToJ400_Opt (~6x vs scalar) RGBAToJ400_Opt (~6x vs scalar) RGB24ToJ400_Opt (~5.5x vs scalar) LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10 Change-Id: Ia3ce8cea7962fbd8618cc23e850a7913c9cabf4f Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4521783 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-11 10:17:51 +00:00
Lu Wang	1d940cc570	Optimize the following functions with LSX. MirrorRow_LSX, MirrorUVRow_LSX, ARGBMirrorRow_LSX, I422ToYUY2Row_LSX, I422ToUYVYRow_LSX, I422ToARGBRow_LSX, I422ToRGBARow_LSX, I422AlphaToARGBRow_LSX, I422ToRGB24Row_LSX, I422ToRGB565Row_LSX, I422ToARGB4444Row_LSX, I422ToARGB1555Row_LSX, YUY2ToYRow_LSX, YUY2ToUVRow_LSX, YUY2ToUV422Row_LSX Bug: libyuv:913 Change-Id: I46cec605001d7ddd73846eed6d0a77f936b6dc53 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4515191 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-05-10 00:25:48 +00:00
Bruce Lai	1330a79e9f	Optimized AR64/AB64 <-> ARGB with RVV * Run on SiFive internal FPGA: ARGBToAR64_Opt (~13.7x vs scalar) ARGBToAB64_Opt (~5.81x vs scalar) AR64ToARGB_Opt (~15.8x vs scalar) AB64ToARGB_Opt (~2.40x vs scalar) LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10 Bug: libyuv:956 Change-Id: Ida642a5077f59d25fb7c5328f671956b2293dadd Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4442913 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-04-20 19:49:55 +00:00
Darren Hsieh	44396e6e9a	Add ARGBToRAWRow_RVV, ARGBToRGB24Row_RVV, RGB24ToARGBRow_RVV * Run on SiFive internal FPGA: ARGBToRAW_Opt (~1.55x vs scalar) ARGBToRGB24_Opt (~1.44x vs scalar) RGB24ToARGB_Opt (~1.77x vs scalar) LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10 Bug: libyuv:956 Change-Id: I26722f6848cd68684d95d9a7ee06ce0416e7985d Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4413083 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-04-13 19:33:16 +00:00
Frank Barchard	88b050f337	MergeUV AVX512BW use assembly - Convert MergeUVRow_AVX512BW to assembly - Enable MergeUVRow_AVX512BW for Windows with clangcl - MergeUVRow_AVX2 use vpmovzxbw and vpsllw - MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V AMD Zen 4 640x360 100000 iterations Was AVX512 MergeUVPlane_Opt (884 ms) AVX2 MergeUVPlane_Opt (945 ms) AVX2 MergeUVPlane_16_Opt (2167 ms) Now AVX512 MergeUVPlane_Opt (865 ms) AVX2 MergeUVPlane_Opt (943 ms) SSE2 MergeUVPlane_Opt (973 ms) AVX2 MergeUVPlane_16_Opt (2102 ms) Bug: None Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230 Reviewed-by: Fritz Koenig <frkoenig@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2023-02-22 21:19:08 +00:00
Frank Barchard	2bdc210be9	MergeUV_AVX512BW for I420ToNV12 On Skylake Xeon 640x360 100000 iterations AVX512 MergeUVPlane_Opt (1196 ms) AVX2 MergeUVPlane_Opt (1565 ms) SSE2 MergeUVPlane_Opt (1780 ms) Pixel 7 MergeUVPlane_Opt (1177 ms) Bug: None Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2023-02-13 20:14:57 +00:00
Hao Chen	0809713775	Refine some functions on the LoongArch platform. Add ARGBToYMatrixRow_LSX/LASX, RGBAToYMatrixRow_LSX/LASX and RGBToYMatrixRow_LSX/LASX functions with RgbConstants argument. Bug: libyuv:912 Change-Id: I956e639d1f0da4a47a55b79c9d41dcd29e29bdc5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4167860 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-01-18 18:54:14 +00:00
Frank Barchard	f71c83552d	I420ToRGB24MatrixFilter function added - Implemented as 3 steps: Upsample UV to 4:4:4, I444ToARGB, ARGBToRGB24 - Fix some build warnings for missing prototypes. Pixel 4 I420ToRGB24_Opt (743 ms) I420ToRGB24Filter_Opt (1331 ms) Windows with skylake xeon: x86 32 bit I420ToRGB24_Opt (387 ms) I420ToRGB24Filter_Opt (571 ms) x64 64 bit I420ToRGB24_Opt (384 ms) I420ToRGB24Filter_Opt (582 ms) Bug: libyuv:938, libyuv:830 Change-Id: Ie27f70816ec084437014f8a1c630ae011ee2348c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3900298 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2022-09-16 19:46:47 +00:00
Frank Barchard	65e7c9d570	MM21ToYUY2 and ABGRToJ420 conversion MM21 to YUY2 use zip1 for performance Cortex A510 Was MM21ToYUY2 (612 ms) Now MM21ToYUY2 (573 ms) Prefetches help Cortex A53 Was MM21ToYUY2 (4998 ms) Now MM21ToYUY2 (1900 ms) Pixel 4 Cortex A76 Was MM21ToYUY2 (215 ms) Now MM21ToYUY2 (173 ms) ABGRToJ420 - NEON, SSSE3 and AVX2 row functions - J400, J420 and J422 formats. - Added AVX2 for UV on ARGBToJ420. Was SSSE3 Same code/performance as ARGBToJ420 but with constants re-ordered. Pixel 4 ABGRToJ420_Opt (623 ms) ABGRToJ422_Opt (702 ms) ABGRToJ400_Opt (238 ms) Skylake Xeon With LIBYUV_BIT_EXACT which uses C for UV ABGRToJ420_Opt (988 ms) ABGRToJ422_Opt (1872 ms) ABGRToJ400_Opt (186 ms) Skylake Xeon using AVX2 ABGRToJ420_Opt (251 ms) ABGRToJ422_Opt (245 ms) ABGRToJ400_Opt (184 ms) Skylake Xeon using SSSE3 ABGRToJ420_Opt (328 ms) ABGRToJ422_Opt (362 ms) ABGRToJ400_Opt (185 ms) Bug: b/238137982 Change-Id: I559c3fe3fb80fa2ce5be3d8218736f9cbc627666 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832111 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2022-08-16 22:07:38 +00:00
Frank Barchard	95b14b2446	RAWToJ400 faster version for ARM - Unrolled to 16 pixels - Take constants via structure, allowing different colorspace and channel order - Use ADDHN to add 16.5 and take upper 8 bits of 16 bit values, narrowing to 8 bits - clang-format applied, affecting mips code On Cortex A510 Was RAWToJ400_Opt (1623 ms) Now RAWToJ400_Opt (862 ms) C RAWToJ400_Opt (1627 ms) Bug: b/220171611 Change-Id: I06a9baf9650ebe2802fb6ff6dfbd524e2c06ada0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3534023 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2022-03-18 07:22:36 +00:00
Hao Chen	91bae707e1	Optimize functions for LASX in row_lasx.cc. 1. Optimize 18 functions in source/row_lasx.cc file. 2. Make small modifications to LSX. 3. Remove some unnecessary content. Bug: libyuv:912 Change-Id: Ifd1d85366efb9cdb3b99491e30fa450ff1848661 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3507640 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2022-03-09 08:52:54 +00:00
Frank Barchard	42d76a342f	RAWToJNV21 function with 2 step conversion RAWToJ420 + J420ToNV21 on row level Pixel 6 RAWToJNV21_Opt (320 ms) Skylake Xeon RAWToJNV21_Opt (302 ms) Bug: b/220171611 Change-Id: I39dcce9cf56c576b95666bb4fb1baccf9fbc7f7a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3495876 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2022-03-01 19:33:49 +00:00
Frank Barchard	2c6bfc02d5	Remove MMI support Bug: libyuv:916 Change-Id: I345b7e271ceb4b32fe91e292915e66be40812810 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3415817 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2022-01-26 08:41:33 +00:00

141 Commits