libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-05-01 03:19:18 +08:00

Author	SHA1	Message	Date
Frank Barchard	62c19d062d	[libyuv] Remove all x86 SSE optimizations Removed all SSE functions, macros, dispatching logic, and related unit tests across the repository to reduce code size and complexity. Left cpuid detection intact. Supported architectures like AVX2, NEON, SVE, etc. are unaffected. R=rrwinterton@gmail.com Bug: None Test: Build and run libyuv_unittest Change-Id: Id19608dba35b79c4c8fc31f920a6a968883d300f	2026-04-29 16:56:03 -07:00
Frank Barchard	f2ac6db694	RAWToNV21 using SME, SVE, I8MM or Neon Pixel 9 Now SVE2 2 pass LibYUVConvertTest.RAWToNV21_Opt (364 ms) 31.76% libyuv::ARGBToUVMatrixRow_SVE_SC() 30.38% RAWToARGBRow_SVE2 26.81% ARGBToYMatrixRow_NEON_DotProd 3.26% MergeUVRow_NEON Was NEON 1 pass LibYUVConvertTest.RAWToJNV21_Opt (295 ms) 44.14% RAWToYJRow_NEON 41.91% RAWToUVJRow_NEON 5.11% MergeUVRow_NEON Clang on Intel Skylake clang [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (301 ms) visual c (row_win) [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (2056 ms) clang [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (275 ms) visual c [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (365 ms) Bug: libyuv:42280902 Change-Id: Iaba558ebe96ce6b9881ee9335ba72b8aac390cde Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7802432 Commit-Queue: Frank Barchard <fbarchard@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-29 13:11:04 -07:00
Frank Barchard	4afb965416	RAWToARGB use AVX512BW Bug: libyuv:42280902 Change-Id: I7a80fd64d97b6d411316819df0fd917d609a173b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787163 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-22 16:56:46 -07:00
Frank Barchard	bd2c4c76ec	RAWToARGB AVX512VBMI Bug: libyuv:42280902 Change-Id: I1c7f432f004079357a00515785bc524c459ed4b9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787160 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-22 14:48:29 -07:00
Frank Barchard	d445250d8b	Replace RAWToY/RGB24ToY with RGBToYMatrix Bug: libyuv:42280902 Change-Id: I6ddebd492036c416550fc045eb39493dea73246b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7784094 Commit-Queue: Frank Barchard <fbarchard@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-21 17:11:14 -07:00
Frank Barchard	81f698829b	Add RGBToNV21Matrix function - implement wrappers with RAW, RGB24, NV21 and JNV21 to call it. Zen5 Was [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1146 ms) Now [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1446 ms) reason - the new code uses 1 pass for RAWToY but 2 pass for RAWToARGB,ARGBToUV. needs 1 RGBToUV Bug: libyuv:42280902 Change-Id: Ife6fbed0829484045409e6d42b85cec1d1fd6052 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7780026 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@google.com>	2026-04-20 18:03:34 -07:00
Frank Barchard	9f13b2814d	add RGBToYMatrixRow_AVX2 Adds RGBToYMatrixRow_AVX2 which reads 24 bit RGB values by reading 3 vectors instead of 4 and permutes them into 4 ARGB vectors before conversion. Also adds RGBToYMatrixRow_Opt and RGBToYMatrixRow_2Step_Opt to convert_argb_test.cc to benchmark and compare the direct AVX2 conversion vs a 2-step approach. ./libyuv_test '--gunit_filter=*RAWToJ400_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1 AMD Zen 5 Was LibYUVConvertTest.RAWToJ400_Opt (757 ms) Now LibYUVConvertTest.RAWToJ400_Opt (699 ms) Intel Skylake Was LibYUVConvertTest.RAWToJ400_Opt (1705 ms) Now LibYUVConvertTest.RAWToJ400_Opt (1426 ms) Bug: 477295731 Change-Id: I29866baf4ad5fe7a3725e4a01f2fe24649510a7d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7777325 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-20 12:52:44 -07:00
Frank Barchard	ace7c4573c	Add ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and wrappers This change implements ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and their wrappers (ARGBToUVRow_RVV, ARGBToUVJRow_RVV, etc.) using RVV intrinsics, mirroring the NEON/AVX2 designs. It wires them into the build and dispatch systems. LIBYUV_RVV_HAS_TUPLE_TYPE is always true on new compilers. This macro has been removed, assuming it is true everywhere, reducing the amount of code in row_rvv.cc, scale_rvv.cc, and row.h. Tested via: ~/bin/doyuv3v && ~/bin/runyuv3v TestARGBToI444Matrix ~/bin/doyuv3av Bug: libyuv:42280902 Change-Id: I36d305386b297d69023c068aa9c62ab6b2ad039c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7769956 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-16 20:52:43 -07:00
Frank Barchard	e034c41661	Port ARGBToUVMatrixRow from AVX2 to AVX512BW Benchmark on Icelake Xeon Now AVX512BW: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (1723 ms) Was AVX2: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2144 ms) - Added `ARGBToUVMatrixRow_AVX512BW` implementation in `source/row_gcc.cc`. - Added corresponding `ARGBToUVRow_AVX512BW` and `ABGRToUVRow_AVX512BW` functions. - Added unaligned wrappers `ARGBToUVRow_Any_AVX512BW` and `ABGRToUVRow_Any_AVX512BW` in `source/row_any.cc`. - Updated `source/row_any.cc` to correctly size `vin` and `vout` buffers for AVX512BW width and adjusted the `ANY12MS` and `ANY12S` macros to handle `MASK=63`. - Updated `include/libyuv/row.h` with the required AVX512BW headers and definitions, scoped appropriately. - Wired all callers of `ARGBToUVRow_AVX2` and related functions in `source/convert.cc` and `source/convert_from_argb.cc` to dynamically use the `AVX512BW` implementations if the CPU flag indicates AVX-512BW support. - Optimized AVX-512 code to generate the `-1` multiplier in a single instruction (`vpternlogd`) and reused it across word (`vpmaddwd`) dot products. Handled the resulting negation by replacing a subtraction with `vpaddw` offset adjustment. Bug: 477295731 R=dalecurtis@chromium.org, rrwinterton@gmail.com Change-Id: Ida5fb27e59ae4c1c3824737f009b80549cd20a06 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7763257 Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-14 16:15:31 -07:00
Frank Barchard	cbc64c353c	Port ARGBToYRow_AVX2 usages to dynamically use ARGBToYRow_AVX512BW I have successfully ported the usage of ARGBToYRow_AVX2 to dynamically detect and utilize ARGBToYRow_AVX512BW when available. Here's a summary of the changes: 1. Source Modifications: In both source/convert.cc and source/convert_from_argb.cc, I searched for all references where ARGBToYRow_AVX2 was being conditionally used (which operates on 32 pixels). 2. AVX512BW Detection: Immediately following those blocks, I injected a new check for kCpuHasAVX512BW. If the CPU flag is present, the logic now utilizes ARGBToYRow_Any_AVX512BW by default, falling back to the fully aligned ARGBToYRow_AVX512BW when the width is aligned to 64 bytes. 3. Profiling: After building and compiling the tests (doyuv3x), I validated the change using perfyuv3 ARGBToNV12_Opt \| cat. The test successfully executed and the performance profile indicated that ARGBToYRow_AVX512BW successfully executed (taking up ~18% of CPU cycles, replacing the previous AVX2 specific instruction overhead for the Y row extraction). The HAS_ARGBTOYROW_AVX512BW macro implementation now fully supports all AVX2 conversion paths to utilize AVX512BW when the system processor flags allow it! R=richard, rrwinterton@gmail.com Change-Id: Iad811e12d301f5621e6f6d039105420861ade43e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760779 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-14 11:42:59 -07:00
Frank Barchard	893eacf9b4	ARGBToY for AVX512 - add ARGBToYMatrixRow_AVX512BW - refactor SSE and AVX to use Matrix functions, making old functions call the new ones. Zen5 1280x720 Was AVX2 LibYUVConvertTest.ARGBToI444_Opt (1125 ms) Now AVX512 LibYUVConvertTest.ARGBToI444_Opt (641 ms) Details by Gemini: 1. Created 3 new Matrix functions: Added ARGBToYMatrixRow_SSSE3, ARGBToYMatrixRow_AVX2, and ARGBToYMatrixRow_AVX512BW to source/row_gcc.cc. These take the const struct ArgbConstants* c parameter similarly to ARGBToUV444MatrixRow_. The x86 vector instructions dynamically calculate the needed values using the properties of the constants struct, including using vpmaddwd inside the AVX512 code to offset the lack of a native vphaddw. 2. Replaced Old Functions with Wrappers: Modified the existing implementations of ARGBToYRow_SSSE3, ARGBToYJRow_SSSE3, ABGRToYRow_SSSE3, ABGRToYJRow_SSSE3, RGBAToYRow_SSSE3, RGBAToYJRow_SSSE3, BGRAToYRow_SSSE3 (and their _AVX2 equivalents) in source/row_gcc.cc to act as inline wrappers calling the new ARGBToYMatrixRow_ functions, passing the right matrix parameters (e.g. &kArgbI601Constants, &kArgbJPEGConstants, &kAbgrI601Constants). 3. Added row_any.cc Handlers: Added ANY11MC definitions to source/row_any.cc to autogenerate ARGBToYMatrixRow_Any_SSSE3, ARGBToYMatrixRow_Any_AVX2, and ARGBToYMatrixRow_Any_AVX512BW which safely handles non-aligned tails. 4. Updated include/libyuv/row.h: Updated the headers with the proper void declarations for all newly generated Matrix and Any_ variants. Also defined HAS_ARGBTOYROW_AVX512BW in the CPU macros. 5. Tested the Implementations: Compiled and tested on Linux x86, which resulted in all tests passing cleanly. Also successfully completed all Windows 32-bit build checks ensuring 32-bit regression prevention without issues. 
Bug: 477295731 Change-Id: I4f5eec9a961e24a9d760d0a1c0810fb5e29a0bd1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7759494 Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-13 17:26:07 -07:00
Frank Barchard	4c3d7d517a	ARGBToUV444 for AVX512 1.27x faster on AMD Zen5 (turin) Now AVX512 perf record ./libyuv_test '--gunit_filter=*ARGBToI444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1071 ms) Overhead Symbol 53.49% ARGBToYRow_AVX2 44.70% ARGBToUV444Row_AVX512BW Was AVX2 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1369 ms) 61.06% ARGBToUV444Row_AVX2 37.67% ARGBToYRow_AVX2 Bug: libyuv:42280902 Change-Id: I306fbac656d6f7834ce1559e86d01eb34931ec3c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7738362 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-08 19:25:41 -07:00
Dale Curtis	1170363ce5	Add Gemini implementation for NEON32 RGB to YUV matrix operations These are about 25% faster than the C versions. Bug: libyuv:42280902 Change-Id: I8b298670ee5f3ed5db35527fc41d6d9a51b020a1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7573682 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-23 16:30:44 -07:00
Dale Curtis	b1cacfb38f	Unify X86/X64 versions of ARGBToI4xxMatrix functions Change-Id: Iead13414414543e5f10ba9ba47a6ceaeb3113dee Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7562443 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-18 16:27:07 -07:00
Dale Curtis	2c21d57319	Add ABGR versions of the ArgbConstants structures This allows for ABGR conversion using the same methods Bug: libyuv:42280902 Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-17 17:28:51 -07:00
Dale Curtis	30809ff64a	Add ARGBToI4xxMatrix variants This was implemented by Gemini followed by manual review and some tweaking for style. The 601 and JPEG constants are fully verified against the existing non-matrix implementations. On x86 the C-only versions appear to be about 25% slower than the optimized ones. Bug: libyuv:42280902 Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-04 10:55:06 -08:00
Frank Barchard	2b4453d46f	Deprecate MIPS and MSA support. - Remove *_msa.cc source files - Update build files - Update header references, planar ifdefs for row functions - Update documentation on supported platforms - Version bumped to 1921 - clang-format applied Bug: 434383432 Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953 Reviewed-by: Justin Green <greenjustin@google.com>	2025-10-16 12:20:40 -07:00
George Steed	007b920232	[AArch64] Add SME implementation of ARGBToUVRow and similar Mostly just a straightforward copy of the existing SVE2 code ported to Streaming-SVE. Introduce new "any" kernels for non-multiple of two cases, matching what we already do for SVE2. The existing SVE2 code makes use of the Neon MOVI instruction that is not supported in Streaming-SVE, so adjust the code to use FMOV instead which has the same performance characteristics. Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-30 09:20:23 -07:00
George Steed	61bdaee13a	Add Neon I8MM implementations of ARGB to UV and variants The maximum coefficient is 128, so store constants negated to take advantage of -128 being representable in 8-bit integers. This allows us to use the I8MM USDOT instructions. Reduction in time taken observed compared to the existing Neon implementation, as a geomean of all ARGBToUV variants: Cortex-A510: -7.1% Cortex-A520: -2.1% Cortex-A710: -8.4% Cortex-A715: -0.3% Cortex-A720: -0.3% Cortex-X2: -40.0% Cortex-X3: -43.3% Cortex-X4: -11.3% Cortex-X925: -2.5% Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-05-12 11:14:00 -07:00
Frank Barchard	918329caee	Make constant 0x0101 using vpcmpeqb+vpabsb Was vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpsrlw $0xf,%%ymm4,%%ymm4 vpackuswb %%ymm4,%%ymm4,%%ymm4 Now vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpabsb %%ymm4,%%ymm4 Bug: 381138208 Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-03-10 13:25:16 -07:00
Frank Barchard	c060118bea	ARGBToJ444 use 256 for fixed point scale UV - use negative coefficients for UV to allow -128 - change shift to truncate instead of round for UV - adapt all row_gcc RGB to UV into matrix functions - add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc Bug: 381138208 Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-27 13:04:15 -08:00
Frank Barchard	61354d2671	ARGBToUV Matrix for AVX2 and SSSE3 - Round before shifting to 8 bit to match NEON - RAWToARGB use unaligned loads and port to AVX2 Was C/SSSE/AVX2 ARGBToI444_Opt (343 ms) ARGBToJ444_Opt (677 ms) RAWToI444_Opt (405 ms) RAWToJ444_Opt (803 ms) Now AVX2 ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (284 ms) RAWToI444_Opt (316 ms) RAWToJ444_Opt (339 ms) Profile Now AVX2 38.31% ARGBToUVJ444Row_AVX2 32.31% RAWToARGBRow_AVX2 23.99% ARGBToYJRow_AVX2 Profile Was C/SSSE/AVX2 73.15% ARGBToUVJ444Row_C 15.74% RAWToARGBRow_SSSE3 8.87% ARGBToYJRow_AVX2 Bug: 381138208 Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-10 18:36:18 -08:00
Frank Barchard	5a9a6ea936	Add RAWToI444 Skylake Xeon RAWToI444_Opt (433 ms) RAWToJ444_Opt (1781 ms) ARGBToI444_Opt (352 ms) ARGBToJ444_Opt (1577 ms) Samsung S22 Exynos ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (209 ms) RAWToI444_Opt (294 ms) RAWToJ444_Opt (293 ms) Profiling on Samsung S22 Exynos 37.62%, ARGBToUV444Row_NEON_I8MM 29.42%, RAWToARGBRow_SVE2 19.61%, ARGBToYRow_NEON_DotProd Passing different --libyuv_cpu_info=N etc we can compare each ISA C 1 RAWToI444_Opt (781 ms) NEON 511 RAWToI444_Opt (757 ms) NEONDOT 1023 RAWToI444_Opt (571 ms) NEONI8MM 2047 RAWToI444_Opt (334 ms) SVE2 8191 RAWToI444_Opt (307 ms) Bug: 390247964 Change-Id: I0316fedd32222588455afa751f5b854f46bce024 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-03 16:13:03 -08:00
Frank Barchard	c1bac9e6a5	RAWToJ444 and ARGBToJ444 - ARGBToJ444 implements ARGBToUVJ444Row_C - RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444 libyuv_test '--gunit_filter=RTo?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 (with bit exact off) Samsung S23 RAWToJ444_Opt (437 ms) ARGBToJ444_Opt (337 ms) ARGBToI444_Opt (196 ms) Skylake Xeon RAWToJ444_Opt (1699 ms) ARGBToJ444_Opt (1559 ms) ARGBToI444_Opt (346 ms) Bug: 390247964 Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-29 15:18:38 -08:00
Frank Barchard	26277baf96	J420ToI420 using planar 8 bit scaling - Add Convert8To8Plane which scale and add 8 bit values allowing full range YUV to be converted to limited range YUV libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Samsung S23 J420ToI420_Opt (45 ms) I420ToI420_Opt (37 ms) Skylake J420ToI420_Opt (596 ms) I420ToI420_Opt (99 ms) Bug: 381327032 Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-22 02:50:24 -08:00
Frank Barchard	47ddac2996	Sub sampling conversions use CopyPlane for Y channel - Replace ScalePlane with CopyPlane for Y channel - Vertical mirroring is supported, but not horizontal mirroring. - Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y. - Apply clang-format - Bump version to 1899 Bug: None Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:34:11 -08:00
Frank Barchard	e0040eb318	Apply clang format Bug: None Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:31:20 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
George Steed	418b6df0de	[AArch64] Add SME implementation of Convert16To8Row Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel. SVE has a "widening multiply get high half" instruction in UMULH, however using the same technique as the Neon code to avoid the need for a widening multiply at all is more performant here. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 03:01:55 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	772f0fde1c	[AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON The existing code only makes use of half of the vector lanes in the RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more data to allow using full vectors, adjusting the "any" kernel macros to match. For the RGB565ToUVRow kernel we already have plenty of data but currently call the macro twice as much as needed, so refactor the code to only call it once but operating with full vectors instead. Reduction in runtimes observed for selected micro-architectures: \| RGB565ToARGBRow \| RGB565ToUVRow \| RGB565ToYRow Cortex-A53 \| -35.2% \| -28.8% \| -31.1% Cortex-A55 \| -32.5% \| -34.4% \| -42.9% Cortex-A510 \| -21.6% \| -27.7% \| -47.2% Cortex-A76 \| -0.9% \| -42.0% \| -21.4% Cortex-A720 \| -28.6% \| -37.2% \| -26.1% Cortex-X1 \| -3.2% \| -42.3% \| -23.4% Bug: b/42280945 Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:35:47 +00:00
Frank Barchard	679e851f65	Convert16To8Row_AVX512BW using vpmovuswb - avx2 is pack/perm is mutating order - cvt method maintains channel order on avx512 Sapphire Rapids Benchmark of 640x360 on Sapphire Rapids AVX512BW [ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms) AVX2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms) SSE2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms) Skylake Xeon Now vpmovuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms) Was vpackuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms) Switch from vpunpcklwd to vpbroadcastw for scale value parameter Was vpunpcklwd %%xmm2,%%xmm2,%%xmm2 vbroadcastss %%xmm2,%%ymm2 Now vpbroadcastw %%xmm2,%%ymm2 Bug: 357439226, 357721018 Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-15 20:13:33 +00:00
Frank Barchard	336e6fd25b	I010ToNV12 conversion using 2 step row function for UV - convert full Y plane with row coalescing if possible - convert rows of UV from 10 bit to 8 bit then call MergeUV libyuv_test '--gunit_filter=010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Note: Google Test filter = 010ToNV12_Opt Skylake Xeon Was 2 pass planes [ OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms) Now 2 pass rows [ OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms) On Samsung S23 libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000' Was [ OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms) Now [ OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms [ OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms [ OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms [ OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms <- now [ OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms [ OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms [ OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms [ OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms) [ OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms) [ OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms) Bug: b/357439226, b/357721018 Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-13 19:30:16 +00:00
Frank Barchard	a97746349b	Add test for I010ToNV12 - Add support for negative height to invert - Fix off by 1 on odd width and height - Bump version to 1895 Initial I010 is 2 step planar conversion libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon [ OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms) Pixel 7 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms) Bug: b/357721018, b/357439226 Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-12 18:57:56 +00:00
Chunbo Hua	fc94178260	Implement I010ToNV12 conversion I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes. Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an interleaved UV plane. Bug: 357721018 Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-06 17:36:13 +00:00
Frank Barchard	32ccd53bb3	Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar - P010 and NV12 have the same layout: Full size Y plane and half size UV plane. P010 and NV12 are 4:2:0 subsampling - P010 uses upper 10 bits of 16 bit elements - NV12 uses 8 bit elements - The Convert16To8 used internally will discard the low 2 bits. - UV order is the same - U first in memory, followed by V, interleaved - UV plane is rounded up in size to allow odd size Y to have UV values - Similar code could be used to convert P210ToNV16, P410ToNV24, with the size of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants. Bug: b/357439226 Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-05 18:55:44 +00:00
George Steed	004352ba16	[AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row These kernels are mostly identical to each other except for the order of the results, so we can use a single macro to parameterize the pairwise addition and use the same macro for both implementations, just with the register order flipped. Similar to other 2x2 kernels the implementation here differs slightly for the last element if the problem size is odd, so use an "any" kernel to avoid needing to handle this in the common code path. Observed reduction in runtime compared to the existing Neon code: \| AYUVToUVRow \| AYUVToVURow Cortex-A510 \| -33.1% \| -33.0% Cortex-A720 \| -25.1% \| -25.1% Cortex-X2 \| -59.5% \| -53.9% Cortex-X4 \| -39.2% \| -39.4% Bug: libyuv:973 Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	6f1d8b1e11	[AArch64] Add SVE2 implementations for ARGBToUVRow and similar By maintaining the interleaved format of the data we can use a common kernel for all input channel orderings and simply pass a different vector of constants instead. A similar approach is possible with only Neon by making use of multiplies and repeated application of ADDP to combine channels, however this is slower on older cores like Cortex-A53 so is not pursued further. For odd problem sizes we need a slightly different implementation for the final element, so introduce an "any" kernel to address that rather than bloating the code for the common case. Observed effect on runtimes compared to the existing Neon kernels: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 ABGRToUVJRow \| -15.5% \| +5.4% \| -33.1% ABGRToUVRow \| -15.6% \| +5.3% \| -35.9% ARGBToUVJRow \| -10.1% \| +5.4% \| -32.7% ARGBToUVRow \| -10.1% \| +5.4% \| -29.3% BGRAToUVRow \| -15.5% \| +4.6% \| -32.8% RGBAToUVRow \| -10.1% \| +4.2% \| -36.0% Bug: libyuv:973 Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-01 19:46:43 +00:00
George Steed	f2e78e1304	[AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow Using the dot-product instructions here allows us to avoid needing LD4 for loading individual colour channels, which gives a big benefit on some micro-architectures where such instructions perform significantly worse than LD1. In addition the dot-product instructions have higher throughput compared to the Neon Observed reduction in runtimes for selected kernels moving from _NEON to _NEON_DotProd: Kernel \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 \| Cortex-X2 ABGRToYJRow \| -6.5% \| -22.5% \| -43.5% \| -71.2% ABGRToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.3% ARGBToYJRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% ARGBToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% BGRAToYRow \| -6.5% \| -22.5% \| -42.3% \| -68.4% RGBAToYJRow \| -6.5% \| -22.5% \| -42.2% \| -73.7% RGBAToYRow \| -6.5% \| -22.5% \| -42.3% \| -64.9% Bug: libyuv:977 Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:09:36 +00:00
Frank Barchard	5625f42424	I444ToI420 and I422ToI420 check U and V pointers and return -1 if NULL. - Add detect linux kernel version number in util/cpuid adbrun -- blaze-bin/third_party/libyuv/cpuid Kernel Version 4.14 Cpu Flags 0x7 Has ARM 0x2 Bug: libyuv:970 Change-Id: I655ed598db3655ca8448be08f1d71fbc328ced66 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5207990 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-01-18 21:56:11 +00:00
Frank Barchard	def473f501	malloc return 1 for failures and assert for internal functions Bug: libyuv:968 Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-12-04 22:55:20 +00:00
Wan-Teh Chang	fb6341d326	Change ScalePlane,ScalePlane_16,... to return int Change ScalePlane(), ScalePlane_16(), and ScalePlane_12() to return int so that they can report memory allocation failures (by returning 1). BUG=libyuv:968 Change-Id: Ie5c183ee42e3d595302671f9ecb7b3472dc8fdb5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5005031 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-11-03 23:53:24 +00:00
Frank Barchard	31e1d6f896	Check allocations that return NULL and return early BUG=libyuv:968 Change-Id: I9e8594440a6035958511f9c50072820131331fc8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-10-27 17:41:36 +00:00
Bruce Lai	04821d1e7d	[RVV] Enable ARGBExtractAlphaRow/ARGBCopyYToAlphaRow * Run on SiFive internal FPGA: TestARGBExtractAlpha(~3.2x vs scalar) TestARGBCopyYToAlpha(~1.6x vs scalar) Change-Id: I36525c67e8ac3f71ea9d1a58c7dc15a4009d9da1 Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617955 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-06-15 23:45:24 +00:00
Frank Barchard	157b153b60	Fix tidy warning that uint32_t dither4 should not be const - Remove const from uint32_t dither4 parameter to fix clang-tidy warning - Apply clang format - Bump version - Remove unused MMI source; superseded by MSA Bug: None Change-Id: Id49991db25bca4e99590b415312542d917471c62 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-06-02 00:42:02 +00:00
Vignesh Venkatasubramanian	c0f64c14ca	Add I412/I212 to I420 functions They re-use the same method as I410/I210 to I420 with a depth value of 12 instead of 10. Bug: b/268505204 Change-Id: I299862b4556461d8c95f0fc1dcd5260e1c1f25cd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581867 Commit-Queue: Vignesh Venkatasubramanian <vigneshv@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-06-01 19:50:16 +00:00
Darren Hsieh	d14bd701c8	[RVV] Enable CopyRow_RVV, InterpolateRow_RVV, {Merge,Split}UVRow_RVV * Run on SiFive internal FPGA: MergeUVPlane_Opt(~6x vs scalar) SplitUVPlane_Opt(~6x vs scalar) TestCopyPlane(~8x vs scalar) ARGBInterpolate0_Opt(~10x vs scalar) ARGBInterpolate64_Opt(~9x vs scalar) ARGBInterpolate168_Opt(~9x vs scalar) ARGBInterpolate192_Opt(~8.5x vs scalar) ARGBInterpolate255_Opt(~8x vs scalar) Bug: libyuv:956 Change-Id: I8372341865f75f42e30371ef943d5c2e4be7b79a Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4574186 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-30 09:10:35 +00:00
Lu Wang	8670bcf17f	Optimize the following 19 functions with LSX in row_lsx.cc. UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX, ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX, ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX, ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX, ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX, ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX, ARGBSepiaRow_LSX Bug: libyuv:913 Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-19 18:55:58 +00:00
Frank Barchard	a37799344d	ARGBToI420Alpha function to convert ARGB to I420 with Alpha Bug: b/281866362 Change-Id: Ic1093a887fb483f134c78909cf1ee7495e7345ba Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4534100 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2023-05-17 00:23:24 +00:00
Bruce Lai	59eae49f17	Enable ARGBToYMatrixRow_RVV/RGBAToYMatrixRow_RVV/RGBToYMatrixRow_RVV Run on SiFive internal FPGA: ARGBToJ400_Opt (~6x vs scalar) RGBAToJ400_Opt (~6x vs scalar) RGB24ToJ400_Opt (~5.5x vs scalar) LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10 Change-Id: Ia3ce8cea7962fbd8618cc23e850a7913c9cabf4f Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4521783 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-05-11 10:17:51 +00:00

283 Commits