libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-04-30 19:09:18 +08:00

Author	SHA1	Message	Date
Frank Barchard	ddc6764d13	ARGBToUVMatrixRow_RVV replace vlseg8 with vlseg4, implementing horizontal paired adds and accumulation to improve performance on SiFive x280, and fixes the remainder logic to use valid vlseg4 loads. Adds TestARGBToUVRow_Any to test odd-width remainder handling. Also fixes a build break for non-RVV compilations by ensuring all RVV functions and their closing cplusplus braces are correctly wrapped in #if !defined(LIBYUV_DISABLE_RVV). Also adds NV12ToNV21 as a macro alias for NV21ToNV12 in planar_functions.h, as the conversion is bidirectional (swapping byte pairs in the interleaved chroma plane). (Patch from https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7762904) Bug: libyuv:42280902 Change-Id: If2d6cbb3e232d63d43e32aba33fa9b2eee8190e5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7772164 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-17 15:04:45 -07:00
Frank Barchard	ace7c4573c	Add ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and wrappers This change implements ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV, and their wrappers (ARGBToUVRow_RVV, ARGBToUVJRow_RVV, etc.) using RVV intrinsics, mirroring the NEON/AVX2 designs. It wires them into the build and dispatch systems. LIBYUV_RVV_HAS_TUPLE_TYPE is always true on new compilers. This macro has been removed, assuming it is true everywhere, reducing the amount of code in row_rvv.cc, scale_rvv.cc, and row.h. Tested via: ~/bin/doyuv3v && ~/bin/runyuv3v TestARGBToI444Matrix ~/bin/doyuv3av Bug: libyuv:42280902 Change-Id: I36d305386b297d69023c068aa9c62ab6b2ad039c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7769956 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-16 20:52:43 -07:00
Frank Barchard	94644361b4	row_win.cc rewrite into intrinsics - remove inline asm which was only for 32 bit - add ARGBToYMatrixRow_AVX2 - add gn flag libyuv_enable_rowwin=true Example of building with GN and Ninja: Without the new flag: gn gen out/Release "--args=is_debug=false" ninja -C out/Release With the new flag: gn gen out/Release "--args=is_debug=false libyuv_enable_rowwin=true" ninja -C out/Release Bug: libyuv:42280806, 477295731, libyuv:42280902, libyuv:439628764 R=dalecurtis@chromium.org, rrwinterton@gmail.com Change-Id: I451bf814622fba690005c02fbf5816819c6a08c2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7765790 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-15 19:53:16 -07:00
Frank Barchard	e034c41661	Port ARGBToUVMatrixRow from AVX2 to AVX512BW Benchmark on Icelake Xeon Now AVX512BW: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (1723 ms) Was AVX2: [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2144 ms) - Added `ARGBToUVMatrixRow_AVX512BW` implementation in `source/row_gcc.cc`. - Added corresponding `ARGBToUVRow_AVX512BW` and `ABGRToUVRow_AVX512BW` functions. - Added unaligned wrappers `ARGBToUVRow_Any_AVX512BW` and `ABGRToUVRow_Any_AVX512BW` in `source/row_any.cc`. - Updated `source/row_any.cc` to correctly size `vin` and `vout` buffers for AVX512BW width and adjusted the `ANY12MS` and `ANY12S` macros to handle `MASK=63`. - Updated `include/libyuv/row.h` with the required AVX512BW headers and definitions, scoped appropriately. - Wired all callers of `ARGBToUVRow_AVX2` and related functions in `source/convert.cc` and `source/convert_from_argb.cc` to dynamically use the `AVX512BW` implementations if the CPU flag indicates AVX-512BW support. - Optimized AVX-512 code to generate the `-1` multiplier in a single instruction (`vpternlogd`) and reused it across word (`vpmaddwd`) dot products. Handled the resulting negation by replacing a subtraction with `vpaddw` offset adjustment. Bug: 477295731 R=dalecurtis@chromium.org, rrwinterton@gmail.com Change-Id: Ida5fb27e59ae4c1c3824737f009b80549cd20a06 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7763257 Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-14 16:15:31 -07:00
Frank Barchard	59ca5d8074	Fix parameter names and comments for ARGB/BGRA/RGBA/ABGR functions In all functions that start with ARGB, BGRA, RGBA or ABGR in the include/libyuv/ headers, make sure the parameter variable name has the same 4 letters, but lower case, and the comment before the function should have the same matching name. Then make sure the implementation in source/ folder has the same variable names. Change-Id: Idadbbbb993156eea16e318719f4888cb3bed5f6a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760057 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-13 18:28:37 -07:00
Frank Barchard	893eacf9b4	ARGBToY for AVX512 - add ARGBToYMatrixRow_AVX512BW - refactor SSE and AVX to use Matrix functions, making old functions call the new ones. Zen5 1280x720 Was AVX2 LibYUVConvertTest.ARGBToI444_Opt (1125 ms) Now AVX512 LibYUVConvertTest.ARGBToI444_Opt (641 ms) Details by Gemini: 1. Created 3 new Matrix functions: Added ARGBToYMatrixRow_SSSE3, ARGBToYMatrixRow_AVX2, and ARGBToYMatrixRow_AVX512BW to source/row_gcc.cc. These take the const struct ArgbConstants* c parameter similarly to ARGBToUV444MatrixRow_. The x86 vector instructions dynamically calculate the needed values using the properties of the constants struct, including using vpmaddwd inside the AVX512 code to offset the lack of a native vphaddw. 2. Replaced Old Functions with Wrappers: Modified the existing implementations of ARGBToYRow_SSSE3, ARGBToYJRow_SSSE3, ABGRToYRow_SSSE3, ABGRToYJRow_SSSE3, RGBAToYRow_SSSE3, RGBAToYJRow_SSSE3, BGRAToYRow_SSSE3 (and their _AVX2 equivalents) in source/row_gcc.cc to act as inline wrappers calling the new ARGBToYMatrixRow_ functions, passing the right matrix parameters (e.g. &kArgbI601Constants, &kArgbJPEGConstants, &kAbgrI601Constants). 3. Added row_any.cc Handlers: Added ANY11MC definitions to source/row_any.cc to autogenerate ARGBToYMatrixRow_Any_SSSE3, ARGBToYMatrixRow_Any_AVX2, and ARGBToYMatrixRow_Any_AVX512BW which safely handles non-aligned tails. 4. Updated include/libyuv/row.h: Updated the headers with the proper void declarations for all newly generated Matrix and Any_ variants. Also defined HAS_ARGBTOYROW_AVX512BW in the CPU macros. 5. Tested the Implementations: Compiled and tested on Linux x86, which resulted in all tests passing cleanly. Also successfully completed all Windows 32-bit build checks ensuring 32-bit regression prevention without issues. 
Bug: 477295731 Change-Id: I4f5eec9a961e24a9d760d0a1c0810fb5e29a0bd1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7759494 Reviewed-by: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2026-04-13 17:26:07 -07:00
Frank Barchard	4f4e1ac553	Fix 2 failing golden tests - Add ifdef for LIBYUV_UNLIMITED_DATA Fixed by Gemini just telling it how to build and run the test and to fix it. Bug: libyuv:353545922 Change-Id: I117a25b75b9616ee2ce6122aa163c2085ed4dc7d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7742120 Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2026-04-09 11:51:13 -07:00
Sam Maier	e3ceea1e67	Forward-declare ArgbConstants in convert.h to fix visibility error The libyuv into Chromium roller is currently broken, see bug 500795092. This change adds a forward declaration for struct ArgbConstants in include/libyuv/convert.h. This resolves a -Wvisibility error where the struct was being declared within a function prototype, making it invisible outside that scope and breaking automated binding generation (e.g., for crabbyavif). Verified building crabbyavif_libyuv_bindings locally and this patch fixed it. Bug: 500795092 Change-Id: Ie0126650ab346940f4610bd4d2e8a5b3ef9ce103 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7739974 Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-09 08:53:56 -07:00
Frank Barchard	4c3d7d517a	ARGBToUV444 for AVX512 1.27x faster on AMD Zen5 (turin) Now AVX512 perf record ./libyuv_test '--gunit_filter=*ARGBToI444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1071 ms) Overhead Symbol 53.49% ARGBToYRow_AVX2 44.70% ARGBToUV444Row_AVX512BW Was AVX2 [ OK ] LibYUVConvertTest.ARGBToI444_Opt (1369 ms) 61.06% ARGBToUV444Row_AVX2 37.67% ARGBToYRow_AVX2 Bug: libyuv:42280902 Change-Id: I306fbac656d6f7834ce1559e86d01eb34931ec3c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7738362 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Dale Curtis <dalecurtis@chromium.org>	2026-04-08 19:25:41 -07:00
Dale Curtis	1170363ce5	Add Gemini implementation for NEON32 RGB to YUV matrix operations These are about 25% faster than the C versions. Bug: libyuv:42280902 Change-Id: I8b298670ee5f3ed5db35527fc41d6d9a51b020a1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7573682 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-23 16:30:44 -07:00
Dale Curtis	b1cacfb38f	Unify X86/X64 versions of ARGBToI4xxMatrix functions Change-Id: Iead13414414543e5f10ba9ba47a6ceaeb3113dee Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7562443 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-18 16:27:07 -07:00
Dale Curtis	f69a479f04	Add ARGBToNV12Matrix implementation This one reuses the SIMD implementations for MergeUVRow_ from the existing ARGBToNV12 functions. Bug: libyuv:42280902 Change-Id: If0a4be133d657ed0262f29fdd568dac90b49636c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564317 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-18 16:26:59 -07:00
Dale Curtis	2c21d57319	Add ABGR versions of the ArgbConstants structures This allows for ABGR conversion using the same methods Bug: libyuv:42280902 Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2026-03-17 17:28:51 -07:00
Dale Curtis	30809ff64a	Add ARGBToI4xxMatrix variants This was implemented by Gemini followed by manual review and some tweaking for style. The 601 and JPEG constants are fully verified against the existing non-matrix implementations. On x86 the C-only versions appear to be about 25% slower than the optimized ones. Bug: libyuv:42280902 Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Dale Curtis <dalecurtis@chromium.org>	2026-03-04 10:55:06 -08:00
Frank Barchard	900da61d3c	Experimental SVE FMMLA detect Detect if the Arm CPU supports the FMMLA instruction Bug: None Change-Id: Ia7b83bf2735ddeeb8a85da44177e708c34e4b1fb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7085486 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-10-27 14:34:55 -07:00
Frank Barchard	500f45652c	Fix for ARM32 build when built with __SOFTFP__ planar_test.cc failed with Error: selected processor does not support `vmrs r3,fpscr' in ARM mode Error: selected processor does not support `vmsr fpscr,r3' in ARM mode Bug: None Change-Id: I2ee0e7191c372277901c94e29d9ed91bbac71af2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7063737 Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-10-20 11:54:25 -07:00
Mark Zhuang	e237e8d7fb	RVV: Enable some functions for intrinsics >= v1.0 According to the README of rvv-intrinsic-doc, Clang 19 and GCC 14 support the v1.0 version. But __riscv_v_intrinsic is 12000 on Clang 19, so Clang >= 20 is needed to test this patch. I tested it with Clang 21. Change-Id: I0e75efcdab3e7bc0ce1acd19eca3568b47c84cbf Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6995438 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-10-17 11:44:14 -07:00
Wan-Teh Chang	fcd7060e0d	Bump LIBYUV_VERSION for removal of MIPS support Bump LIBYUV_VERSION to 1921. Missed in https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953. Bug: 434383432 Change-Id: If51122f1b744718551b0b601ead7cacb8c46c20d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7050411 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-10-16 13:32:52 -07:00
Frank Barchard	2b4453d46f	Deprecate MIPS and MSA support. - Remove *_msa.cc source files - Update build files - Update header references, planar ifdefs for row functions - Update documentation on supported platforms - Version bumped to 1921 - clang-format applied Bug: 434383432 Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953 Reviewed-by: Justin Green <greenjustin@google.com>	2025-10-16 12:20:40 -07:00
Frank Barchard	94417b9d21	Pass rgbconstants via struct pointer instead of elements with m
Now 66 instructions
SYM ARGBToUVRow_SSSE3:
62ccd0: BASE push ebp
62ccd1: BASE mov ebp, esp
62ccd3: BASE push ebx
62ccd4: BASE push edi
62ccd5: BASE push esi
62ccd6: BASE and esp, 0xfffffffc
62ccd9: BASE sub esp, 0xc
62ccdc: BASE call 0x62cce1 <ARGBToUVRow_SSSE3+0x11>
62cce1: BASE pop eax
62cce2: BASE add eax, 0xe1c27
62cce8: BASE mov ecx, dword ptr [ebp+0xc]
62cceb: BASE mov edx, dword ptr [ebp+0x8]
62ccee: BASE mov esi, dword ptr [ebp+0x10]
62ccf1: BASE mov edi, dword ptr [ebp+0x18]
62ccf4: BASE mov dword ptr [esp+0x8], edi
62ccf8: BASE mov edi, dword ptr [ebp+0x14]
62ccfb: BASE lea ebx, ptr [eax-0x5ecf88]
62cd01: SSE2 movdqa xmm4, xmmword ptr [ebx]
62cd05: SSE2 movdqa xmm5, xmmword ptr [ebx+0x10]
62cd0a: SSE2 pcmpeqb xmm6, xmm6
62cd0e: SSSE3 pabsb xmm6, xmm6
62cd13: SSE2 movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd1b: BASE sub edi, esi
62cd1d: SSE2 movdqu xmm0, xmmword ptr [edx]
62cd21: SSE2 movdqu xmm1, xmmword ptr [edx+0x10]
62cd26: SSE2 movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd2b: SSE2 movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd31: SSSE3 pshufb xmm0, xmm7
62cd36: SSSE3 pshufb xmm1, xmm7
62cd3b: SSSE3 pshufb xmm2, xmm7
62cd40: SSSE3 pshufb xmm3, xmm7
62cd45: SSSE3 pmaddubsw xmm0, xmm6
62cd4a: SSSE3 pmaddubsw xmm1, xmm6
62cd4f: SSSE3 pmaddubsw xmm2, xmm6
62cd54: SSSE3 pmaddubsw xmm3, xmm6
62cd59: SSE2 paddw xmm0, xmm2
62cd5d: SSE2 paddw xmm1, xmm3
62cd61: SSE2 pxor xmm2, xmm2
62cd65: SSE2 psrlw xmm0, 0x1
62cd6a: SSE2 psrlw xmm1, 0x1
62cd6f: SSE2 pavgw xmm0, xmm2
62cd73: SSE2 pavgw xmm1, xmm2
62cd77: SSE2 packuswb xmm0, xmm1
62cd7b: SSE2 movdqa xmm2, xmm6
62cd7f: SSE2 psllw xmm2, 0xf
62cd84: SSE2 movdqa xmm1, xmm0
62cd88: SSSE3 pmaddubsw xmm1, xmm5
62cd8d: SSSE3 pmaddubsw xmm0, xmm4
62cd92: SSSE3 phaddw xmm0, xmm1
62cd97: SSE2 psubw xmm2, xmm0
62cd9b: SSE2 psrlw xmm2, 0x8
62cda0: SSE2 packuswb xmm2, xmm2
62cda4: SSE2 movd dword ptr [esi], xmm2
62cda8: SSE2 pshufd xmm2, xmm2, 0x55
62cdad: SSE2 movd dword ptr [esi+edi*1], xmm2
62cdb2: BASE lea edx, ptr [edx+0x20]
62cdb5: BASE lea esi, ptr [esi+0x4]
62cdb8: BASE sub dword ptr [esp+0x8], 0x8
62cdbd: BASE jnle 0x62cd1d <ARGBToUVRow_SSSE3+0x4d>
62cdc3: BASE lea esp, ptr [ebp-0xc]
62cdc6: BASE pop esi
62cdc7: BASE pop edi
62cdc8: BASE pop ebx
62cdc9: BASE pop ebp
62cdca: BASE ret
Was 68 instructions
ARGBToUVRow_SSSE3:
62ccd0: BASE push ebp
62ccd1: BASE mov ebp, esp
62ccd3: BASE push edi
62ccd4: BASE push esi
62ccd5: BASE and esp, 0xfffffff0
62ccd8: BASE sub esp, 0x30
62ccdb: BASE call 0x62cce0 <ARGBToUVRow_SSSE3+0x10>
62cce0: BASE pop eax
62cce1: BASE add eax, 0xe1c28
62cce7: BASE mov ecx, dword ptr [ebp+0xc]
62ccea: BASE mov edx, dword ptr [ebp+0x8]
62cced: BASE mov esi, dword ptr [ebp+0x10]
62ccf0: BASE mov edi, dword ptr [ebp+0x18]
62ccf3: BASE mov dword ptr [esp+0xc], edi
62ccf7: BASE mov edi, dword ptr [ebp+0x14]
62ccfa: SSE movaps xmm0, xmmword ptr [eax-0x5ecf88]
62cd01: SSE movaps xmmword ptr [esp+0x20], xmm0
62cd06: SSE movaps xmm0, xmmword ptr [eax-0x5ecf78]
62cd0d: SSE movaps xmmword ptr [esp+0x10], xmm0
62cd12: SSE2 movdqa xmm4, xmmword ptr [esp+0x20]
62cd18: SSE2 movdqa xmm5, xmmword ptr [esp+0x10]
62cd1e: SSE2 pcmpeqb xmm6, xmm6
62cd22: SSSE3 pabsb xmm6, xmm6
62cd27: SSE2 movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd2f: BASE sub edi, esi
62cd31: SSE2 movdqu xmm0, xmmword ptr [edx]
62cd35: SSE2 movdqu xmm1, xmmword ptr [edx+0x10]
62cd3a: SSE2 movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd3f: SSE2 movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd45: SSSE3 pshufb xmm0, xmm7
62cd4a: SSSE3 pshufb xmm1, xmm7
62cd4f: SSSE3 pshufb xmm2, xmm7
62cd54: SSSE3 pshufb xmm3, xmm7
62cd59: SSSE3 pmaddubsw xmm0, xmm6
62cd5e: SSSE3 pmaddubsw xmm1, xmm6
62cd63: SSSE3 pmaddubsw xmm2, xmm6
62cd68: SSSE3 pmaddubsw xmm3, xmm6
62cd6d: SSE2 paddw xmm0, xmm2
62cd71: SSE2 paddw xmm1, xmm3
62cd75: SSE2 pxor xmm2, xmm2
62cd79: SSE2 psrlw xmm0, 0x1
62cd7e: SSE2 psrlw xmm1, 0x1
62cd83: SSE2 pavgw xmm0, xmm2
62cd87: SSE2 pavgw xmm1, xmm2
62cd8b: SSE2 packuswb xmm0, xmm1
62cd8f: SSE2 movdqa xmm2, xmm6
62cd93: SSE2 psllw xmm2, 0xf
62cd98: SSE2 movdqa xmm1, xmm0
62cd9c: SSSE3 pmaddubsw xmm1, xmm5
62cda1: SSSE3 pmaddubsw xmm0, xmm4
62cda6: SSSE3 phaddw xmm0, xmm1
62cdab: SSE2 psubw xmm2, xmm0
62cdaf: SSE2 psrlw xmm2, 0x8
62cdb4: SSE2 packuswb xmm2, xmm2
62cdb8: SSE2 movd dword ptr [esi], xmm2
62cdbc: SSE2 pshufd xmm2, xmm2, 0x55
62cdc1: SSE2 movd dword ptr [esi+edi*1], xmm2
62cdc6: BASE lea edx, ptr [edx+0x20]
62cdc9: BASE lea esi, ptr [esi+0x4]
62cdcc: BASE sub dword ptr [esp+0xc], 0x8
62cdd1: BASE jnle 0x62cd31 <ARGBToUVRow_SSSE3+0x61>
62cdd7: BASE lea esp, ptr [ebp-0x8]
62cdda: BASE pop esi
62cddb: BASE pop edi
62cddc: BASE pop ebp
62cddd: BASE ret
62cdde: BASE int3
BUG=444157316 Change-Id: Iad044f851359f5b052091c7bdab9b96946fc3682 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6987370 Reviewed-by: Justin Green <greenjustin@google.com>	2025-09-29 12:34:36 -07:00
Frank Barchard	7155afc5ca	ARGBToUV AVX2 for x86 32 bit - Reduce to 10 ymm registers - 2 constants generated on the fly Change-Id: Ib25a0cf7c93e5048270735410ccf6723b3949454 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967319 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-18 13:14:45 -07:00
Frank Barchard	142db12947	ARGBToUV use AVX2 for 64 bit x86 Skylake Was ARGBToJ420_Opt (312 ms) Now ARGBToJ420_Opt (242 ms) Icelake Was ARGBToJ420_Opt (302 ms) Now ARGBToJ420_Opt (220 ms) AMD Zen3 on Windows Was ARGBToJ420_Opt (305 ms) Now ARGBToJ420_Opt (216 ms) 32 bit x86 uses SSE Now ARGBToJ420_Opt (326 ms) MCA analysis of new AVX, SSE and old AVX https://godbolt.org/z/37bdazWYr Bug: None Change-Id: I72f5504407751e164c3558aebe836dd15223d65f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477 Reviewed-by: Justin Green <greenjustin@google.com>	2025-09-17 14:39:53 -07:00
Mark Zhuang	b33794a586	RVV: Don't disable all rvv optimize when RVV >= v0.12 Disabled since Patch v2 of https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788 Change-Id: Id30a62c8f164830204dde02a443f5e4f04d757db Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6953818 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-09-16 18:17:02 -07:00
Frank Barchard	a61882c049	ARGBToUV AVX2 for x86_64 Icelake Was SSSE3+SSSE3 ARGBToJ420_Opt (356 ms) Was SSSE3+AVX2 ARGBToJ420_Opt (301 ms) Now AVX2+AVX2 ARGBToJ420_Opt (227 ms) Change-Id: I2cb427bc164b225b3ad4c5f43c09d6da6ca496d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6943036 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-16 11:33:54 -07:00
Frank Barchard	0f795672ae	Reduce ARGBToUV SSSE3 register usage for clang build error on x64 Bug: 444157316 Change-Id: I2ae9f3dbfb373bb874a3d9699987f7d5b63f2610 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6937665 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-10 18:40:06 -07:00
Frank Barchard	d71cda1bb0	Rollback util cpuid hybrid detect due to android build errors Bug: 438241552 Change-Id: Ie56aa7296e796e44e63d0dd913120b897b12cc9b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6843504 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-08-12 14:13:24 -07:00
Frank Barchard	cdd3bae848	TestI400LargeSize fix for warning message build error - change %ld to %zd for size_t printf warnings - disable TestI400LargeSize when disabling SLOW_TESTS - disable cpuid tests that read proc/cpuinfo test data files - add ifdef around timers to allow hexagon build - remove faulty hybrid detect - remove old mips LIBYUV_DISABLE_DSPR2 reference in gyp build - apply clang-format Bug: 434382656 Change-Id: Id74812e6ef29d4a8d0ff967a9189d249b80816d4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6812825 Reviewed-by: Jeremy Leconte <jleconte@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-08-01 12:03:11 -07:00
Frank Barchard	3ff31b2a5f	Make LibYUVConvertTest.TestI400LargeSize skip test on low end arm cpu - detect lack of dot product instruction to infer the cpu is low end - only run the test on higher end arm Bug: 416842099 Change-Id: Idd2dd16a624bbba280cf531644440024b12f7ecf Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6804632 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2025-07-31 02:41:17 -07:00
George Steed	007b920232	[AArch64] Add SME implementation of ARGBToUVRow and similar Mostly just a straightforward copy of the existing SVE2 code ported to Streaming-SVE. Introduce new "any" kernels for non-multiple of two cases, matching what we already do for SVE2. The existing SVE2 code makes use of the Neon MOVI instruction that is not supported in Streaming-SVE, so adjust the code to use FMOV instead which has the same performance characteristics. Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-30 09:20:23 -07:00
George Steed	88798bcd63	[AArch64] Add SME implementation of Convert8To16Row_SME Mostly just a straightforward copy of the Neon code ported to Streaming-SVE. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-23 11:32:56 -07:00
Frank Barchard	6f729fbe65	ARGBToUV SSE use average of 4 pixels - Was using avgb twice for non-exact and C for exact. On Skylake Xeon: Now SSE3 ARGBToJ420_Opt (326 ms) Was Exact C ARGBToJ420_Opt (871 ms) Not exact AVX2 ARGBToJ420_Opt (237 ms) Not exact SSSE3 ARGBToJ420_Opt (312 ms) Bug: 381138208 Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Ben Weiss <bweiss@google.com>	2025-06-17 11:55:27 -07:00
Frank Barchard	889613683a	Add hybrid detect for Intel laptop cpus - Add +i8mm build option for sve ARGBToUV which uses usdot - util/cpuid Get cpu count (windows, macos, linux) - For each x86 cpu, detect hybrid (e-core) - Includes a comment fix for ubsan unittest - Bump version - Apply clang format to util/*.c as well as all *.cc/*.h Bug: 424637372 Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-13 13:22:54 -07:00
George Steed	1b2f6cdbe8	[AArch64] Unroll I210ToAR30Row_{SVE2,SME} Now that we have a STOREAR30_SVE_2X implementation, we can use this to unroll other kernels. The predication on I210ToAR30Row needs adjusting to allow loading two vectors of Y compared to one vector of U/V, and additionally UZP is needed to ensure the data arrangement in vector lanes matches the U/V layout. LD2H could also be used, however this provides no performance improvement on most cores and would necessitate the addition of an "any" kernel to handle the case where width % 2 != 0. Reduction in run times of I210ToAR30Row_SVE2 observed compared to the previous SVE2 implementation: (note that even in the observed slowdowns, the SVE2 implementation still outperforms the existing Neon code) Cortex-A510: -37.1% Cortex-A520: -39.1% Cortex-A710: +1.6% (!) Cortex-A715: +6.5% (!) Cortex-A720: +6.5% (!) Cortex-X2: -2.9% Cortex-X3: -2.2% Cortex-X4: -8.8% Cortex-X925: -3.5% Change-Id: I2ff285b48105883526eceb8be1fcbe0e033a553b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640989 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-06-12 14:10:21 -07:00
George Steed	867bdc51ed	[AArch64] Unroll I422ToAR30Row_{SVE2,SME} The existing STOREAR30_SVE macro works fine for out of order cores, however for in-order cores the number of dependent vector instructions laid out consecutively impacts performance. We can improve this by unrolling the loop to process two sets of vectors at a time, allowing little cores to process two independent streams of vector instructions at the same time to improve performance. Using one set of ZIP instructions at the end allows us to (a) avoid ST4 which we know is slow on some micro-architectures, and (b) enable the use of predication and avoid the need for separate "any" kernels. Reduction in run times of I422ToAR30Row_SVE2 observed compared to the previous SVE2 implementation: Cortex-A510: -37.7% Cortex-A520: -38.8% Cortex-A710: -14.8% Cortex-A715: -17.1% Cortex-A720: -16.9% Cortex-X2: -10.3% Cortex-X3: -6.7% Cortex-X4: -9.4% Cortex-X925: -7.1% Change-Id: I160fb41300d2d08fce2e6eb92181324fd723a02d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6632916 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-06-12 14:09:49 -07:00
Frank Barchard	4ac0a3ae3d	ubsan compliant '_any' functions using ptrdiff_t for pointer math Bug: 416842099 Change-Id: I1e3c7bc1b363c11baeb3b529ee78e5ac8878c359 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6634217 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-10 15:01:52 -07:00
George Steed	cd0ae0a222	row_sve.h: Add missing z21 clobber The z21 register is used in the I444TORGB_SVE_2X macro and other places, so add it to the clobber list macro that is used throughout this file. Change-Id: If4277c1ffcac0fa68cc44263acc6f41a9e82ec8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6619508 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-08 19:41:44 -07:00
George Steed	998bec7ca9	Sort row.h #define *_NEON lists Sort the Arm Neon and Neon DotProd #define lists to match the alphabetical ordering used for the SVE2 and SME lists. Change-Id: Ibeb380f477d5476d0018d20a754557a5f93f2190 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6613686 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-08 19:38:30 -07:00
George Steed	ef9833fc70	Add Neon implementation of Convert8To16Row Add a Neon implementation of the Convert8To16Row kernel. Compared to the C implementation we can take advantage of knowing that the "scale" parameter is always an unsigned power of two and fits in 16-bits, allowing us to combine this with the shift and avoid needing to widen the input data. Reduction in run times observed compared to the existing C implementation: Cortex-A55: -44.5% Cortex-A510: -26.1% Cortex-A520: -30.6% Cortex-A76: -61.6% Cortex-A710: -57.6% Cortex-X1: -46.5% Cortex-X2: -54.4% Cortex-X3: -57.1% Cortex-X4: -55.0% Cortex-X925: -49.3% Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-29 13:37:48 -07:00
George Steed	7e5863ae5a	Add SVE2 and SME implementations of I422ToAR30Row This can make use of the existing load/convert/store macros that are already present for other kernels, so add I422ToAR30Row_SVE2 and I422ToAR30Row_SME to match the existing kernels. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -9.1% Cortex-A520: +6.8% (!) Cortex-A710: -4.0% Cortex-A715: -1.1% Cortex-A720: -1.1% Cortex-X2: -5.7% Cortex-X3: -5.9% Cortex-X4: -2.8% Cortex-X925: -4.0% Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-27 11:39:00 -07:00
George Steed	949cb623bf	Add SVE2 and SME implementations of I444ToRGB24Row Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so they are usable in both SVE2 and SME implementations, and use them to add new I444ToRGB24Row implementations for SVE2 and SME. We need to use the unrolled versions here to use the ST3B interleaving store instructions, since there is no partial vector version of this store instruction. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -57.6% Cortex-A520: -38.1% Cortex-A710: -15.5% Cortex-A715: -9.2% Cortex-A720: -9.2% Cortex-X2: -25.8% Cortex-X3: -26.2% Cortex-X4: -23.2% Cortex-X925: -17.8% Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-22 13:33:06 -07:00
Frank Barchard	0853c9353f	ARGBToUV 64 bit use ymm8 for shuffler Bug: 381138208 Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-05-12 15:09:40 -07:00
George Steed	61bdaee13a	Add Neon I8MM implementations of ARGB to UV and variants The maximum coefficient is 128, so store constants negated to take advantage of -128 being representable in 8-bit integers. This allows us to use the I8MM USDOT instructions. Reduction in time taken observed compared to the existing Neon implementation, as a geomean of all ARGBToUV variants: Cortex-A510: -7.1% Cortex-A520: -2.1% Cortex-A710: -8.4% Cortex-A715: -0.3% Cortex-A720: -0.3% Cortex-X2: -40.0% Cortex-X3: -43.3% Cortex-X4: -11.3% Cortex-X925: -2.5% Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-05-12 11:14:00 -07:00
Frank Barchard	9f9b5cf660	ARGBToUV allow 32 bit x86 build - make width loop count on stack - set YMM constants in its own asm block - make struct for shuffle and add constants - disable clang format on row_neon.cc function Bug: 413781394 Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-04-28 12:11:00 -07:00
Wan-Teh Chang	8c48036d15	Remove duplicate code in planar_functions.h The declarations of ARGBAffineRow_C and ARGBAffineRow_SSE2 and the code to support those declarations are duplicated in planar_functions.h. They are already in row.h, so we can simply remove them. Change-Id: I9b522fdd201ca530f1268bf4200cd2e18b806ba5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434733 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Wan-Teh Chang <wtc@google.com>	2025-04-04 15:48:23 -07:00
Wan-Teh Chang	b7a857659f	Disable Arm SME and SVE assembly code under MSan The code that disables Arm and Intel assembly code under MSan is duplicated in cpu_support.h and planar_functions.h. This CL does not address the code duplication. Bug: b:407277484, b:407278016, b:407278132 Change-Id: If70fd8d3382916041d75efabcc84010ea3f1e60e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6430806 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-04-03 11:27:31 -07:00
Frank Barchard	23d416d6f3	Detect SME without SVE dependency Bug: None Change-Id: Ibe29488e893a493699ea3fae1a1a54a4fff5969c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6418571 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-31 17:27:40 -07:00
Frank Barchard	f145aa26da	Add SME2 detect Bug: None Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373 Reviewed-by: George Steed <george.steed@arm.com>	2025-03-27 11:08:08 -07:00
George Steed	64ac2d8f0f	Avoid odd width stores in I422ToRGB565Row_{SVE2,SME} The existing code for creating RGB565 data in SVE2 and SME produces two vectors of interleaved 16-bit elements due to the nature of how SVE widening instructions operate. This means that the indices of the 16-bit data created appear in the two result vectors as such: z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...] z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...] This is problematic for the final (predicated) iteration of the conversion since the p1 predicate input to the ST2H instruction controls storing the four bytes corresponding to the first two elements, in the first two bytes of z18 and z19. This means that in the case that the width is an odd number there is no way of storing just elem0 in z18 individually. This patch addresses this by permuting the z18/z19 data such that the two bytes from each element are split evenly across the two vectors: z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...] z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...] Since we would now always store the same lanes from both vectors we can continue to use the same predicate without further changes. The existing (non-tail) loop body utilizes an all-true predicate so we can avoid the extra permutes in this case, avoiding any performance degradation. Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-03-26 04:08:46 -07:00
Frank Barchard	5f284054cb	RVV disable 64 bit elements and vcombine_v Bug: 405451074 Change-Id: I8e4437be92934b3c367c94d867d7967c32747260 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 12:51:25 -07:00
Frank Barchard	c060118bea	ARGBToJ444 use 256 for fixed point scale UV - use negative coefficients for UV to allow -128 - change shift to truncate instead of round for UV - adapt all row_gcc RGB to UV into matrix functions - add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc Bug: 381138208 Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-27 13:04:15 -08:00

1906 Commits