libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-01-01 03:12:16 +08:00

Author	SHA1	Message	Date
Frank Barchard	94417b9d21	Pass rgbconstants via struct pointer instead of elements with m Now 66 instructions SYM ARGBToUVRow_SSSE3: 62ccd0: BASE push ebp 62ccd1: BASE mov ebp, esp 62ccd3: BASE push ebx 62ccd4: BASE push edi 62ccd5: BASE push esi 62ccd6: BASE and esp, 0xfffffffc 62ccd9: BASE sub esp, 0xc 62ccdc: BASE call 0x62cce1 <ARGBToUVRow_SSSE3+0x11> 62cce1: BASE pop eax 62cce2: BASE add eax, 0xe1c27 62cce8: BASE mov ecx, dword ptr [ebp+0xc] 62cceb: BASE mov edx, dword ptr [ebp+0x8] 62ccee: BASE mov esi, dword ptr [ebp+0x10] 62ccf1: BASE mov edi, dword ptr [ebp+0x18] 62ccf4: BASE mov dword ptr [esp+0x8], edi 62ccf8: BASE mov edi, dword ptr [ebp+0x14] 62ccfb: BASE lea ebx, ptr [eax-0x5ecf88] 62cd01: SSE2 movdqa xmm4, xmmword ptr [ebx] 62cd05: SSE2 movdqa xmm5, xmmword ptr [ebx+0x10] 62cd0a: SSE2 pcmpeqb xmm6, xmm6 62cd0e: SSSE3 pabsb xmm6, xmm6 62cd13: SSE2 movdqa xmm7, xmmword ptr [eax-0x5ecfa8] 62cd1b: BASE sub edi, esi 62cd1d: SSE2 movdqu xmm0, xmmword ptr [edx] 62cd21: SSE2 movdqu xmm1, xmmword ptr [edx+0x10] 62cd26: SSE2 movdqu xmm2, xmmword ptr [edx+ecx*1] 62cd2b: SSE2 movdqu xmm3, xmmword ptr [edx+ecx*1+0x10] 62cd31: SSSE3 pshufb xmm0, xmm7 62cd36: SSSE3 pshufb xmm1, xmm7 62cd3b: SSSE3 pshufb xmm2, xmm7 62cd40: SSSE3 pshufb xmm3, xmm7 62cd45: SSSE3 pmaddubsw xmm0, xmm6 62cd4a: SSSE3 pmaddubsw xmm1, xmm6 62cd4f: SSSE3 pmaddubsw xmm2, xmm6 62cd54: SSSE3 pmaddubsw xmm3, xmm6 62cd59: SSE2 paddw xmm0, xmm2 62cd5d: SSE2 paddw xmm1, xmm3 62cd61: SSE2 pxor xmm2, xmm2 62cd65: SSE2 psrlw xmm0, 0x1 62cd6a: SSE2 psrlw xmm1, 0x1 62cd6f: SSE2 pavgw xmm0, xmm2 62cd73: SSE2 pavgw xmm1, xmm2 62cd77: SSE2 packuswb xmm0, xmm1 62cd7b: SSE2 movdqa xmm2, xmm6 62cd7f: SSE2 psllw xmm2, 0xf 62cd84: SSE2 movdqa xmm1, xmm0 62cd88: SSSE3 pmaddubsw xmm1, xmm5 62cd8d: SSSE3 pmaddubsw xmm0, xmm4 62cd92: SSSE3 phaddw xmm0, xmm1 62cd97: SSE2 psubw xmm2, xmm0 62cd9b: SSE2 psrlw xmm2, 0x8 62cda0: SSE2 packuswb xmm2, xmm2 62cda4: SSE2 movd dword ptr [esi], xmm2 62cda8: SSE2 pshufd xmm2, 
xmm2, 0x55 62cdad: SSE2 movd dword ptr [esi+edi*1], xmm2 62cdb2: BASE lea edx, ptr [edx+0x20] 62cdb5: BASE lea esi, ptr [esi+0x4] 62cdb8: BASE sub dword ptr [esp+0x8], 0x8 62cdbd: BASE jnle 0x62cd1d <ARGBToUVRow_SSSE3+0x4d> 62cdc3: BASE lea esp, ptr [ebp-0xc] 62cdc6: BASE pop esi 62cdc7: BASE pop edi 62cdc8: BASE pop ebx 62cdc9: BASE pop ebp 62cdca: BASE ret Was 68 instructions ARGBToUVRow_SSSE3: 62ccd0: BASE push ebp 62ccd1: BASE mov ebp, esp 62ccd3: BASE push edi 62ccd4: BASE push esi 62ccd5: BASE and esp, 0xfffffff0 62ccd8: BASE sub esp, 0x30 62ccdb: BASE call 0x62cce0 <ARGBToUVRow_SSSE3+0x10> 62cce0: BASE pop eax 62cce1: BASE add eax, 0xe1c28 62cce7: BASE mov ecx, dword ptr [ebp+0xc] 62ccea: BASE mov edx, dword ptr [ebp+0x8] 62cced: BASE mov esi, dword ptr [ebp+0x10] 62ccf0: BASE mov edi, dword ptr [ebp+0x18] 62ccf3: BASE mov dword ptr [esp+0xc], edi 62ccf7: BASE mov edi, dword ptr [ebp+0x14] 62ccfa: SSE movaps xmm0, xmmword ptr [eax-0x5ecf88] 62cd01: SSE movaps xmmword ptr [esp+0x20], xmm0 62cd06: SSE movaps xmm0, xmmword ptr [eax-0x5ecf78] 62cd0d: SSE movaps xmmword ptr [esp+0x10], xmm0 62cd12: SSE2 movdqa xmm4, xmmword ptr [esp+0x20] 62cd18: SSE2 movdqa xmm5, xmmword ptr [esp+0x10] 62cd1e: SSE2 pcmpeqb xmm6, xmm6 62cd22: SSSE3 pabsb xmm6, xmm6 62cd27: SSE2 movdqa xmm7, xmmword ptr [eax-0x5ecfa8] 62cd2f: BASE sub edi, esi 62cd31: SSE2 movdqu xmm0, xmmword ptr [edx] 62cd35: SSE2 movdqu xmm1, xmmword ptr [edx+0x10] 62cd3a: SSE2 movdqu xmm2, xmmword ptr [edx+ecx*1] 62cd3f: SSE2 movdqu xmm3, xmmword ptr [edx+ecx*1+0x10] 62cd45: SSSE3 pshufb xmm0, xmm7 62cd4a: SSSE3 pshufb xmm1, xmm7 62cd4f: SSSE3 pshufb xmm2, xmm7 62cd54: SSSE3 pshufb xmm3, xmm7 62cd59: SSSE3 pmaddubsw xmm0, xmm6 62cd5e: SSSE3 pmaddubsw xmm1, xmm6 62cd63: SSSE3 pmaddubsw xmm2, xmm6 62cd68: SSSE3 pmaddubsw xmm3, xmm6 62cd6d: SSE2 paddw xmm0, xmm2 62cd71: SSE2 paddw xmm1, xmm3 62cd75: SSE2 pxor xmm2, xmm2 62cd79: SSE2 psrlw xmm0, 0x1 62cd7e: SSE2 psrlw xmm1, 0x1 62cd83: SSE2 pavgw xmm0, xmm2 62cd87: 
SSE2 pavgw xmm1, xmm2 62cd8b: SSE2 packuswb xmm0, xmm1 62cd8f: SSE2 movdqa xmm2, xmm6 62cd93: SSE2 psllw xmm2, 0xf 62cd98: SSE2 movdqa xmm1, xmm0 62cd9c: SSSE3 pmaddubsw xmm1, xmm5 62cda1: SSSE3 pmaddubsw xmm0, xmm4 62cda6: SSSE3 phaddw xmm0, xmm1 62cdab: SSE2 psubw xmm2, xmm0 62cdaf: SSE2 psrlw xmm2, 0x8 62cdb4: SSE2 packuswb xmm2, xmm2 62cdb8: SSE2 movd dword ptr [esi], xmm2 62cdbc: SSE2 pshufd xmm2, xmm2, 0x55 62cdc1: SSE2 movd dword ptr [esi+edi*1], xmm2 62cdc6: BASE lea edx, ptr [edx+0x20] 62cdc9: BASE lea esi, ptr [esi+0x4] 62cdcc: BASE sub dword ptr [esp+0xc], 0x8 62cdd1: BASE jnle 0x62cd31 <ARGBToUVRow_SSSE3+0x61> 62cdd7: BASE lea esp, ptr [ebp-0x8] 62cdda: BASE pop esi 62cddb: BASE pop edi 62cddc: BASE pop ebp 62cddd: BASE ret 62cdde: BASE int3 BUG=444157316 Change-Id: Iad044f851359f5b052091c7bdab9b96946fc3682 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6987370 Reviewed-by: Justin Green <greenjustin@google.com>	2025-09-29 12:34:36 -07:00
Daniel.L (Byoungchan Lee)	5b22f31cb5	Fix compilation issue for 32bit PIC build Currently, ARGBToUVMatrixRow_AVX2 and ARGBToUVMatrixRow_SSSE3 fail to compile with clang on 32bit PIC build with the error message: inline assembly requires more registers than available This is because in PIC code EBX is reserved for the GOT and with a frame pointer EBP is also unavailable. Fix this by copying the RGB-to-UV constants to stack locals first and let the asm use simple stack-relative addressing. Bug: 444157316 Change-Id: Ica90f0c35039303ecaa145534683f59659fb5d7f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6980714 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-25 13:49:02 -07:00
Frank Barchard	1b1c058787	ARGBToUV for SSE use pshufb/pmaddubsw Was ARGBToJ420_Opt (377 ms) Now ARGBToJ420_Opt (340 ms) Bug: None Change-Id: Iada2d6e9ecdb141b9e2acbdf343f890e4aaebe34 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967754 Reviewed-by: Justin Green <greenjustin@google.com>	2025-09-19 12:39:39 -07:00
Frank Barchard	7155afc5ca	ARGBToUV AVX2 for x86 32 bit - Reduce to 10 ymm registers - 2 constants generated on the fly Change-Id: Ib25a0cf7c93e5048270735410ccf6723b3949454 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967319 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-18 13:14:45 -07:00
Frank Barchard	142db12947	ARGBToUV use AVX2 for 64 bit x86 Skylake Was ARGBToJ420_Opt (312 ms) Now ARGBToJ420_Opt (242 ms) Icelake Was ARGBToJ420_Opt (302 ms) Now ARGBToJ420_Opt (220 ms) AMD Zen3 on Windows Was ARGBToJ420_Opt (305 ms) Now ARGBToJ420_Opt (216 ms) 32 bit x86 uses SSE Now ARGBToJ420_Opt (326 ms) MCA analysis of new AVX, SSE and old AVX https://godbolt.org/z/37bdazWYr Bug: None Change-Id: I72f5504407751e164c3558aebe836dd15223d65f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477 Reviewed-by: Justin Green <greenjustin@google.com>	2025-09-17 14:39:53 -07:00
Frank Barchard	a61882c049	ARGBToUV AVX2 for x86_64 Icelake Was SSSE3+SSSE3 ARGBToJ420_Opt (356 ms) Was SSSE3+AVX2 ARGBToJ420_Opt (301 ms) Now AVX2+AVX2 ARGBToJ420_Opt (227 ms) Change-Id: I2cb427bc164b225b3ad4c5f43c09d6da6ca496d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6943036 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-16 11:33:54 -07:00
Frank Barchard	0f795672ae	Reduce ARGBToUV SSSE3 register usage for clang build error on x64 Bug: 444157316 Change-Id: I2ae9f3dbfb373bb874a3d9699987f7d5b63f2610 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6937665 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-09-10 18:40:06 -07:00
yuanhecai	eb4e4736a4	loong64: UV subsample's 4-pixel rounding average and ARGBToJ444 fixed-point scaling The UV subsample's 4-pixel rounding average and ARGBToJ444 fixed-point scaling were updated in d32d19cc and c060118b. The LoongArch optimization is updated now. Bug: 381138208 Change-Id: I3585d72564e4fffe514599b1a9b4fee8fbbd0266 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6878364 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2025-09-03 12:22:44 -07:00
George Steed	b7d97d5f3f	[AArch64] Fix compilation due to incorrect register constraint The y0_fraction and y1_fraction variables in InterpolateRow_NEON were marked as modified by the inline-asm block, however 5eea7812826c551559fdcd4a6988fcf1fbe341f6 marked these variables as `const` which caused both LLVM and GCC to emit errors about modification of const variables. There is no need for these variables to be modified in the loop since they are read-only, so simply update the inline asm block constraints to match. Change-Id: I94ca3696c4163ede6ad27d645f0f445fcfb0a1c3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6818289 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-08-05 11:23:20 -07:00
Frank Barchard	48943bb378	Convert8To16 use VPSRLW instead of VPMULHUW for better lunarlake performance - MCA says old version was 4 cycles and new version is 2.5 cycles/loop - lunarlake is the only known cpu mca -mcpu=lunarlake 100 iterations Was vpmulhu Iterations: 100 Instructions: 1200 Total Cycles: 426 Total uOps: 1200 Dispatch Width: 8 uOps Per Cycle: 2.82 IPC: 2.82 Block RThroughput: 4.0 Now vpsrlw Iterations: 100 Instructions: 1200 Total Cycles: 279 Total uOps: 1400 Dispatch Width: 8 uOps Per Cycle: 5.02 IPC: 4.30 Block RThroughput: 2.5 Bug: None Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-08-04 12:42:50 -07:00
Xi Ruoyao	dd9ced1c6d	loong64: Use HWCAP instead of CPUCFG to detect LSX/LASX Per the Software Development and Build Convention for LoongArch™ Architectures manual, on Linux we should use HWCAP instead of CPUCFG to detect if LSX/LASX is available. The reason is the kernel may be configured to disable them, and CPUCFG cannot provide info about the kernel support. Change-Id: I3f1b23e6d4c91c7da81311fbbe294e36ff178121 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-07-24 23:43:54 -07:00
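The HWCAP check described in dd9ced1c6d can be sketched roughly as below. This is not the libyuv code: the helper names and the `SKETCH_` macros are hypothetical, and the bit positions are assumed to match the Linux LoongArch `HWCAP_LOONGARCH_LSX` and `HWCAP_LOONGARCH_LASX` definitions.

```c
#include <stdbool.h>

/* Assumed to mirror the Linux uapi HWCAP bits for LoongArch. */
#define SKETCH_HWCAP_LOONGARCH_LSX (1UL << 4)
#define SKETCH_HWCAP_LOONGARCH_LASX (1UL << 5)

/* Hypothetical helper: true if the kernel reports 128-bit LSX as usable. */
static bool sketch_has_lsx(unsigned long hwcap) {
  return (hwcap & SKETCH_HWCAP_LOONGARCH_LSX) != 0;
}

/* Hypothetical helper: true if the kernel reports 256-bit LASX as usable. */
static bool sketch_has_lasx(unsigned long hwcap) {
  return (hwcap & SKETCH_HWCAP_LOONGARCH_LASX) != 0;
}
```

On a LoongArch Linux system the argument would come from `getauxval(AT_HWCAP)` in `<sys/auxv.h>`. Unlike CPUCFG, that value reflects what the kernel has enabled rather than only what the hardware implements.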
George Steed	007b920232	[AArch64] Add SME implementation of ARGBToUVRow and similar Mostly just a straightforward copy of the existing SVE2 code ported to Streaming-SVE. Introduce new "any" kernels for non-multiple of two cases, matching what we already do for SVE2. The existing SVE2 code makes use of the Neon MOVI instruction that is not supported in Streaming-SVE, so adjust the code to use FMOV instead which has the same performance characteristics. Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-30 09:20:23 -07:00
George Steed	88798bcd63	[AArch64] Add SME implementation of Convert8To16Row_SME Mostly just a straightforward copy of the Neon code ported to Streaming-SVE. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-23 11:32:56 -07:00
Frank Barchard	6f729fbe65	ARGBToUV SSE use average of 4 pixels - Was using avgb twice for non-exact and C for exact. On Skylake Xeon: Now SSE3 ARGBToJ420_Opt (326 ms) Was Exact C ARGBToJ420_Opt (871 ms) Not exact AVX2 ARGBToJ420_Opt (237 ms) Not exact SSSE3 ARGBToJ420_Opt (312 ms) Bug: 381138208 Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Ben Weiss <bweiss@google.com>	2025-06-17 11:55:27 -07:00
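To illustrate why applying a byte average twice is "not exact", here is a scalar sketch comparing it with a single rounding average of 4 pixels. The function names are hypothetical; the two-step form assumes each step rounds up the way `pavgb` does.

```c
#include <stdint.h>

/* Rounding average of two bytes, as pavgb computes it: (a + b + 1) >> 1. */
static uint8_t sketch_avg2(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b + 1) >> 1);
}

/* Non-exact form: average each pair, then average the two results.
   Rounding is applied twice, so the result can be biased upward. */
static uint8_t sketch_avg4_two_step(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return sketch_avg2(sketch_avg2(a, b), sketch_avg2(c, d));
}

/* Exact form: sum all four pixels and round once. */
static uint8_t sketch_avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return (uint8_t)((a + b + c + d + 2) >> 2);
}
```

For the 2x2 block (0, 1, 0, 0) the two-step form gives 1 while the exact form gives 0. For most inputs, such as (10, 20, 30, 40), both agree.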
Frank Barchard	889613683a	Add hybrid detect for Intel laptop cpus - Add +i8mm build option for sve ARGBToUV which uses usdot - util/cpuid Get cpu count (windows, macos, linux) - For each x86 cpu, detect hybrid (e-core) - Includes a comment fix for ubsan unittest - Bump version - Apply clang format to util/*.c as well as all *.cc/*.h Bug: 424637372 Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-13 13:22:54 -07:00
George Steed	3d66e94fb5	[AArch64] Improve ARGBToUVRow_SVE2 and related kernels This commit reworks the implementation of ARGBToUVMatrixRow_SVE2, using an approach similar to that recently used in 61bdaee13a701d2b52c6dc943ccc5c888077a591. In particular we can rework these SVE2 implementations to use 8-bit dot-product instructions instead of 16-bit, allowing us to process more data in a single vector. To ensure that the input values fit in 8-bits, negate the UV constants arrays passed to the kernel and undo the now-unnecessary flipping of the middle two component values. This commit mostly reverses the performance inversion where the Neon I8MM implementation was previously faster than the SVE2 implementation. The reduction in runtime observed compared to the existing Neon I8MM implementation is now: Cortex-A510: +5.6% (!) Cortex-A520: -3.0% Cortex-A710: -12.6% Cortex-A715: -10.9% Cortex-A720: -10.8% Cortex-X2: -3.8% Cortex-X3: -10.3% Cortex-X4: -9.5% Cortex-X925: -6.7% Change-Id: I30253976dc8e3651cfb5fd39b63a6763975d41e3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640990 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-06-12 14:10:44 -07:00
Frank Barchard	843cda7e7b	TestI400LargeSize test __x86_64__, _M_X64, or __aarch64__ - apply clang-format to row_neon64.cc Bug: 416842099 Change-Id: Ic21f08d8b65bb86cf72eba82d45591f6558170ec Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6634515 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-10 15:53:02 -07:00
Frank Barchard	4ac0a3ae3d	ubsan compliant '_any' functions using ptrdiff_t for pointer math Bug: 416842099 Change-Id: I1e3c7bc1b363c11baeb3b529ee78e5ac8878c359 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6634217 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-10 15:01:52 -07:00
George Steed	ef9833fc70	Add Neon implementation of Convert8To16Row Add a Neon implementation of the Convert8To16Row kernel. Compared to the C implementation we can take advantage of knowing that the "scale" parameter is always an unsigned power of two and fits in 16-bits, allowing us to combine this with the shift and avoid needing to widen the input data. Reduction in run times observed compared to the existing C implementation: Cortex-A55: -44.5% Cortex-A510: -26.1% Cortex-A520: -30.6% Cortex-A76: -61.6% Cortex-A710: -57.6% Cortex-X1: -46.5% Cortex-X2: -54.4% Cortex-X3: -57.1% Cortex-X4: -55.0% Cortex-X925: -49.3% Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-29 13:37:48 -07:00
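The power-of-two observation in ef9833fc70 can be sketched in scalar C. This assumes the reference behaviour is "replicate the byte to 16 bits, multiply by scale, keep the high 16 bits"; the function names are hypothetical and this is not the Neon kernel itself. When scale is a power of two, the multiply-high collapses into a single right shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reference form: replicate the byte (v * 0x0101), multiply by
   scale, and keep the high 16 bits of the 32-bit product. */
static uint16_t sketch_convert8to16_mulhi(uint8_t v, uint32_t scale) {
  uint32_t replicated = (uint32_t)v * 0x0101u;
  return (uint16_t)((replicated * scale) >> 16);
}

/* Shift form: valid only when scale is a power of two that fits in 16 bits. */
static uint16_t sketch_convert8to16_shift(uint8_t v, uint32_t scale) {
  unsigned log2_scale = 0;
  uint32_t replicated = (uint32_t)v * 0x0101u;
  assert(scale != 0 && (scale & (scale - 1)) == 0 && scale <= 0x8000u);
  while ((1u << log2_scale) < scale) {
    ++log2_scale;
  }
  /* (x * 2^k) >> 16 equals x >> (16 - k) for unsigned x. */
  return (uint16_t)(replicated >> (16 - log2_scale));
}
```

With scale 1024 (10-bit output) the input 255 maps to 1023 in both forms. The same identity is presumably what lets the x86 path in 48943bb378 replace VPMULHUW with VPSRLW.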
George Steed	7e5863ae5a	Add SVE2 and SME implementations of I422ToAR30Row This can make use of the existing load/convert/store macros that are already present for other kernels, so add I422ToAR30Row_SVE2 and I422ToAR30Row_SME to match the existing kernels. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -9.1% Cortex-A520: +6.8% (!) Cortex-A710: -4.0% Cortex-A715: -1.1% Cortex-A720: -1.1% Cortex-X2: -5.7% Cortex-X3: -5.9% Cortex-X4: -2.8% Cortex-X925: -4.0% Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-27 11:39:00 -07:00
George Steed	949cb623bf	Add SVE2 and SME implementations of I444ToRGB24Row Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so they are usable in both SVE2 and SME implementations, and use them to add new I444ToRGB24Row implementations for SVE2 and SME. We need to use the unrolled versions here to use the ST3B interleaving store instructions, since there is no partial vector version of this store instruction. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -57.6% Cortex-A520: -38.1% Cortex-A710: -15.5% Cortex-A715: -9.2% Cortex-A720: -9.2% Cortex-X2: -25.8% Cortex-X3: -26.2% Cortex-X4: -23.2% Cortex-X925: -17.8% Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-22 13:33:06 -07:00
Frank Barchard	0853c9353f	ARGBToUV 64 bit use ymm8 for shuffler Bug: 381138208 Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-05-12 15:09:40 -07:00
George Steed	61bdaee13a	Add Neon I8MM implementations of ARGB to UV and variants The maximum coefficient is 128, so store constants negated to take advantage of -128 being representable in 8-bit integers. This allows us to use the I8MM USDOT instructions. Reduction in time taken observed compared to the existing Neon implementation, as a geomean of all ARGBToUV variants: Cortex-A510: -7.1% Cortex-A520: -2.1% Cortex-A710: -8.4% Cortex-A715: -0.3% Cortex-A720: -0.3% Cortex-X2: -40.0% Cortex-X3: -43.3% Cortex-X4: -11.3% Cortex-X925: -2.5% Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-05-12 11:14:00 -07:00
Frank Barchard	9f9b5cf660	ARGBToUV allow 32 bit x86 build - make width loop count on stack - set YMM constants in its own asm block - make struct for shuffle and add constants - disable clang format on row_neon.cc function Bug: 413781394 Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-04-28 12:11:00 -07:00
WANG Xuerui	55a708e226	Fix unified sources build for LoongArch LASX Several consumers of libyuv do unified sources build where many source files are #include'd together to make compilation units larger and allow for more optimization chances. But for LoongArch there is a wrinkle: LASX and LSX code paths are implemented in separate files, unlike the other currently supported architectures, and some definitions are duplicated e.g. struct RgbConstants. Since the duplicated content is identical across the two files, short of some bigger refactoring, we can simply place #ifdef guards around the definitions to fix unified sources build for LoongArch. Change-Id: I952e8e0210221ec8bcc113f75fa1b9ba515ec323 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6272801 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2025-04-01 09:48:19 -07:00
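The guard approach from 55a708e226 might look like the sketch below. It simulates two source files being #include'd into one compilation unit. The guard macro name and the struct fields are illustrative assumptions, not the actual libyuv definitions.

```c
#include <stdint.h>

/* --- as it might appear in the first source file (e.g. the LSX one) --- */
#ifndef SKETCH_RGBCONSTANTS_DEFINED
#define SKETCH_RGBCONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4]; /* illustrative field layout */
  uint16_t kAddY;
};
#endif

/* --- identical copy as it might appear in the second file (e.g. LASX) --- */
#ifndef SKETCH_RGBCONSTANTS_DEFINED
#define SKETCH_RGBCONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4];
  uint16_t kAddY;
};
#endif

/* Both "files" can use the type; only the first definition is compiled. */
static unsigned sketch_add_y(const struct RgbConstants* c) {
  return c->kAddY;
}
```

Without the guards, a unified build would see `struct RgbConstants` defined twice in one translation unit and fail to compile.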
Frank Barchard	23d416d6f3	Detect SME without SVE dependency Bug: None Change-Id: Ibe29488e893a493699ea3fae1a1a54a4fff5969c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6418571 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-31 17:27:40 -07:00
Frank Barchard	f145aa26da	Add SME2 detect Bug: None Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373 Reviewed-by: George Steed <george.steed@arm.com>	2025-03-27 11:08:08 -07:00
Frank Barchard	5f284054cb	RVV disable 64 bit elements and vcombine_v Bug: 405451074 Change-Id: I8e4437be92934b3c367c94d867d7967c32747260 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 12:51:25 -07:00
Frank Barchard	0c07032182	clang format applies to git repo Bug: None Change-Id: Ida65a0033e8c783230cadf6912416ffd9bbf90e1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6393515 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 11:49:25 -07:00
Frank Barchard	918329caee	Make constant 0x0101 using vpcmpeqb+vpabsb Was vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpsrlw $0xf,%%ymm4,%%ymm4 vpackuswb %%ymm4,%%ymm4,%%ymm4 Now vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpabsb %%ymm4,%%ymm4 Bug: 381138208 Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-03-10 13:25:16 -07:00
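A scalar model of the two ways to build the 0x0101 constant in 918329caee, looking at one 16-bit lane only. The function names are hypothetical, and plain C arithmetic stands in for the vector instructions.

```c
#include <stdint.h>
#include <stdlib.h>

/* New sequence: compare-equal sets every byte to 0xFF (-1 as a signed
   byte), then a per-byte absolute value turns each byte into 1. */
static uint16_t sketch_ones_via_abs(void) {
  int8_t bytes[2] = {-1, -1};          /* vpcmpeqb x, x, x */
  uint8_t lo = (uint8_t)abs(bytes[0]); /* vpabsb */
  uint8_t hi = (uint8_t)abs(bytes[1]);
  return (uint16_t)((hi << 8) | lo);
}

/* Old sequence: all-ones words, shift each word right by 15 to get 1,
   then pack two words into two bytes with unsigned saturation. */
static uint16_t sketch_ones_via_shift_pack(void) {
  uint16_t w0 = 0xFFFF, w1 = 0xFFFF;        /* vpcmpeqb x, x, x */
  w0 = (uint16_t)(w0 >> 15);                /* vpsrlw $0xf */
  w1 = (uint16_t)(w1 >> 15);
  return (uint16_t)(((uint8_t)w1 << 8) | (uint8_t)w0); /* vpackuswb: 1 stays 1 */
}
```

Both return 0x0101, so the two-instruction form yields the same constant with one instruction fewer.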
Frank Barchard	c060118bea	ARGBToJ444 use 256 for fixed point scale UV - use negative coefficients for UV to allow -128 - change shift to truncate instead of round for UV - adapt all row_gcc RGB to UV into matrix functions - add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc Bug: 381138208 Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-27 13:04:15 -08:00
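The negated-coefficient idea in c060118bea can be sketched for a single pixel as follows. The coefficient values are illustrative assumptions (JPEG-style U weights rounded at a scale of 256) and the function name is hypothetical. The point is that +128 does not fit in a signed byte but -128 does, so the weights are stored negated and the dot product is subtracted from the 0x8000 bias.

```c
#include <stdint.h>

/* Illustrative U weights at fixed-point scale 256, stored negated:
   U = (128*B - 85*G - 43*R + 0x8000) >> 8 becomes
   U = (0x8000 - (-128*B + 85*G + 43*R)) >> 8. */
static const int8_t kSketchNegU[3] = {-128, 85, 43}; /* B, G, R */

static uint8_t sketch_rgb_to_u(uint8_t b, uint8_t g, uint8_t r) {
  int32_t dot = kSketchNegU[0] * b + kSketchNegU[1] * g + kSketchNegU[2] * r;
  /* The shift truncates; no extra rounding term is added here. */
  return (uint8_t)((0x8000 - dot) >> 8);
}
```

Any gray pixel gives 128, pure blue gives 255 and pure yellow gives 0, so the full 8-bit range is covered without any weight overflowing int8_t.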
Frank Barchard	5257ba4db0	Apply clang format Bug: None Change-Id: Ibd694d0351966a2b5812445de74bbced9c881a79 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6302317 Reviewed-by: James Zern <jzern@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-25 11:39:19 -08:00
Frank Barchard	3a7e0ba671	Apply format with no code changes Bug: None Change-Id: I8923bacb9af7e7d4f13e210c8b3d7ea6b81568a5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6301086 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2025-02-24 23:57:01 -08:00
Frank Barchard	61354d2671	ARGBToUV Matrix for AVX2 and SSSE3 - Round before shifting to 8 bit to match NEON - RAWToARGB use unaligned loads and port to AVX2 Was C/SSSE/AVX2 ARGBToI444_Opt (343 ms) ARGBToJ444_Opt (677 ms) RAWToI444_Opt (405 ms) RAWToJ444_Opt (803 ms) Now AVX2 ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (284 ms) RAWToI444_Opt (316 ms) RAWToJ444_Opt (339 ms) Profile Now AVX2 38.31% ARGBToUVJ444Row_AVX2 32.31% RAWToARGBRow_AVX2 23.99% ARGBToYJRow_AVX2 Profile Was C/SSSE/AVX2 73.15% ARGBToUVJ444Row_C 15.74% RAWToARGBRow_SSSE3 8.87% ARGBToYJRow_AVX2 Bug: 381138208 Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-10 18:36:18 -08:00
Frank Barchard	d32d19ccf2	UV subsample on ARM use rounding average of 4 pixels Performance on Samsung S22 Exynos (SVE2+I8MM+DOTPROD+Neon) AArch64 ARGBToI400_Opt (168 ms) ARGBToJ400_Opt (103 ms) ABGRToJ400_Opt (81 ms) RGBAToJ400_Opt (82 ms) RGB24ToJ400_Opt (176 ms) RAWToJ400_Opt (176 ms) ABGRToI420_Opt (258 ms) ARGBToI420_Opt (259 ms) ARGBToI422_Opt (403 ms) ARGBToI444_Opt (213 ms) ARGBToJ420_Opt (257 ms) ARGBToJ422_Opt (403 ms) ARGBToJ444_Opt (214 ms) ABGRToJ420_Opt (255 ms) ABGRToJ422_Opt (399 ms) ARGB4444ToI420_Opt (285 ms) RGB565ToI420_Opt (316 ms) ARGB1555ToI420_Opt (324 ms) BGRAToI420_Opt (260 ms) RAWToI420_Opt (303 ms) RAWToI444_Opt (303 ms) RAWToJ420_Opt (335 ms) RAWToJ444_Opt (308 ms) RGB24ToI420_Opt (372 ms) RGB24ToJ420_Opt (365 ms) RGBAToI420_Opt (259 ms) AArch32 (Neon) ARGBToI400_Opt (496 ms) ARGBToJ400_Opt (478 ms) ABGRToJ400_Opt (483 ms) RGBAToJ400_Opt (493 ms) RGB24ToJ400_Opt (343 ms) RAWToJ400_Opt (341 ms) ABGRToI420_Opt (993 ms) ARGBToI420_Opt (992 ms) ARGBToI422_Opt (1503 ms) ARGBToI444_Opt (1257 ms) ARGBToJ420_Opt (1006 ms) ARGBToJ422_Opt (1521 ms) ARGBToJ444_Opt (1267 ms) ABGRToJ420_Opt (1002 ms) ABGRToJ422_Opt (1504 ms) ARGB4444ToI420_Opt (1180 ms) RGB565ToI420_Opt (1112 ms) ARGB1555ToI420_Opt (1115 ms) BGRAToI420_Opt (993 ms) RAWToI420_Opt (703 ms) RAWToI444_Opt (1717 ms) RAWToJ420_Opt (704 ms) RAWToJ444_Opt (1739 ms) RGB24ToI420_Opt (703 ms) RGB24ToJ420_Opt (703 ms) RGBAToI420_Opt (993 ms) Bug: 381138208 Change-Id: I33728d5237f357362b0bfc509a9ebe6fe46f45d4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6228987 Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-04 15:19:19 -08:00
Frank Barchard	5a9a6ea936	Add RAWToI444 Skylake Xeon RAWToI444_Opt (433 ms) RAWToJ444_Opt (1781 ms) ARGBToI444_Opt (352 ms) ARGBToJ444_Opt (1577 ms) Samsung S22 Exynos ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (209 ms) RAWToI444_Opt (294 ms) RAWToJ444_Opt (293 ms) Profiling on Samsung S22 Exynos 37.62%, ARGBToUV444Row_NEON_I8MM 29.42%, RAWToARGBRow_SVE2 19.61%, ARGBToYRow_NEON_DotProd Passing different --libyuv_cpu_info=N etc we can compare each ISA C 1 RAWToI444_Opt (781 ms) NEON 511 RAWToI444_Opt (757 ms) NEONDOT 1023 RAWToI444_Opt (571 ms) NEONI8MM 2047 RAWToI444_Opt (334 ms) SVE2 8191 RAWToI444_Opt (307 ms) Bug: 390247964 Change-Id: I0316fedd32222588455afa751f5b854f46bce024 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-03 16:13:03 -08:00
Frank Barchard	b3fd3f3f3b	Fix ARGBToUV444Row_NEON - constants passed in are signed and need to be negated to positive. Bug: 394127527 Change-Id: I531e475d2ddd4583922d4abef13b9282d002dd7a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6226854 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-02-03 13:33:39 -08:00
Frank Barchard	96f98f6915	ARGBToJ444 and RAWToJ444 NEON - Pass JPEG matrix to ARGBToUV444MatrixRow_NEON - Remove NEON unsigned constants in favor of DOTPROD signed constants Samsung S23: Was C for UV ARGBToJ444_Opt (320 ms) RAWToJ444_Opt (411 ms) Now I8MM ARGBToJ444_Opt (196 ms) RAWToJ444_Opt (301 ms) NEON ARGBToJ444_Opt (505 ms) RAWToJ444_Opt (596 ms) 32 bit ARM NEON ARGBToJ444_Opt (1135 ms) RAWToJ444_Opt (1546 ms) Profile of RAWToJ444 37.72% ARGBToUVJ444Row_NEON_I8MM 34.48% RAWToARGBRow_NEON 14.65% ARGBToYJRow_NEON_DotProd Bug: 390247964 Change-Id: Ia26240bee974a0baf502548f2fc896b193c3006c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6220890 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-01-31 16:46:29 -08:00
Frank Barchard	c1bac9e6a5	RAWToJ444 and ARGBToJ444 - ARGBToJ444 implements ARGBToUVJ444Row_C - RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444 libyuv_test '--gunit_filter=RTo?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 (with bit exact off) Samsung S23 RAWToJ444_Opt (437 ms) ARGBToJ444_Opt (337 ms) ARGBToI444_Opt (196 ms) Skylake Xeon RAWToJ444_Opt (1699 ms) ARGBToJ444_Opt (1559 ms) ARGBToI444_Opt (346 ms) Bug: 390247964 Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-29 15:18:38 -08:00
George Steed	c4a0c8d34a	[AArch64] Add SVE2 and SME implementations for Convert8To8Row SVE can make use of the UMULH instruction to avoid needing separate widening multiply and narrowing steps for the scale application. Reduction in runtime for Convert8To8Row_SVE2 observed compared to the existing Neon implementation: Cortex-A510: -13.2% Cortex-A520: -16.4% Cortex-A710: -37.1% Cortex-A715: -38.5% Cortex-A720: -38.4% Cortex-X2: -33.2% Cortex-X3: -31.8% Cortex-X4: -31.8% Cortex-X925: -13.9% Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-01-28 15:53:26 -08:00
Frank Barchard	6c2415bfab	J420ToI420 AVX2 libyuv_test '--gunit_filter=J420ToI420' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon AVX2 J420ToI420_Opt (114 ms) C J420ToI420_Opt (596 ms) Sapphire Rapids AVX2 J420ToI420_Opt (126 ms) C J420ToI420_Opt (717 ms) Samsung S23 NEON J420ToI420_Opt (46 ms) C J420ToI420_Opt (95 ms) Bug: 381327032 Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-27 11:23:44 -08:00
Frank Barchard	67f3f17d9a	aarch32 J420ToI420 benchmark on medium core adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Now Neon J420ToI420_Opt (159 ms) Was C J420ToI420_Opt (215 ms) AArch64 J420ToI420_Opt (93 ms) C version does this: vld1.8 {d20, d21}, [r6]! vorr q12, q8, q8 subs r4, #16 vmovl.u8 q11, d21 vmovl.u8 q10, d20 vmul.i16 q11, q9, q11 vmul.i16 q10, q9, q10 vsra.u16 q12, q11, #8 vorr q11, q8, q8 vsra.u16 q11, q10, #8 vmovn.i16 d21, q12 vmovn.i16 d20, q11 vst1.8 {d20, d21}, [r5]! bne 0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54 Explanation of above C code vorr moves 16 into register vsra does shift + accumulate to that register Compared to aarch64 instead of mull, C uses movl+mul instead of uzp2, C uses sra #8 + movn. takes 2 movn vs 1 uzp2 instead of add, C does vorr + sra Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-22 13:47:09 -08:00
Frank Barchard	26277baf96	J420ToI420 using planar 8 bit scaling - Add Convert8To8Plane which scale and add 8 bit values allowing full range YUV to be converted to limited range YUV libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Samsung S23 J420ToI420_Opt (45 ms) I420ToI420_Opt (37 ms) Skylake J420ToI420_Opt (596 ms) I420ToI420_Opt (99 ms) Bug: 381327032 Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-22 02:50:24 -08:00
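A minimal sketch of the "scale and add" step that 26277baf96 describes for mapping full range to limited range. The function names are hypothetical, and the scale of 220 and bias of 16 are assumptions chosen so that luma 0..255 lands on 16..235; the values libyuv actually passes are not stated in the commit message.

```c
#include <stdint.h>

/* Scale an 8-bit value by scale/256, then add a bias. */
static uint8_t sketch_scale_add(uint8_t v, int scale, int bias) {
  return (uint8_t)(((v * scale) >> 8) + bias);
}

/* Row form: apply the same scale and bias to every sample in a row. */
static void sketch_convert8to8_row(const uint8_t* src, uint8_t* dst,
                                   int scale, int bias, int width) {
  int x;
  for (x = 0; x < width; ++x) {
    dst[x] = sketch_scale_add(src[x], scale, bias);
  }
}
```

With scale 220 and bias 16, an input of 0 maps to 16 and an input of 255 maps to 235, the limited-range luma endpoints.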
Frank Barchard	ef52c1658a	avx10_2 detect Run with sde only -dmr reports AVX10.2 emr:Has AVX10_2 0x0 adl:Has AVX10_2 0x0 icx:Has AVX10_2 0x0 snb:Has AVX10_2 0x0 tnt:Has AVX10_2 0x0 icl:Has AVX10_2 0x0 slm:Has AVX10_2 0x0 dmr:Has AVX10_2 0x2000000 cwf:Has AVX10_2 0x0 mrm:Has AVX10_2 0x0 skx:Has AVX10_2 0x0 wsm:Has AVX10_2 0x0 gnr:Has AVX10_2 0x0 gnr256:Has AVX10_2 0x0 bdw:Has AVX10_2 0x0 cpx:Has AVX10_2 0x0 rpl:Has AVX10_2 0x0 snr:Has AVX10_2 0x0 ptl:Has AVX10_2 0x0 slt:Has AVX10_2 0x0 ivb:Has AVX10_2 0x0 spr:Has AVX10_2 0x0 tgl:Has AVX10_2 0x0 arl:Has AVX10_2 0x0 srf:Has AVX10_2 0x0 nhm:Has AVX10_2 0x0 skl:Has AVX10_2 0x0 mtl:Has AVX10_2 0x0 pnr:Has AVX10_2 0x0 glp:Has AVX10_2 0x0 lnl:Has AVX10_2 0x0 cnl:Has AVX10_2 0x0 hsw:Has AVX10_2 0x0 clx:Has AVX10_2 0x0 glm:Has AVX10_2 0x0 sde -dmr -- libyuv_test --gunit_filter=Cpu [ RUN ] LibYUVBaseTest.TestCpuId Cpu Vendor: GenuineIntel 0x756e6547 0x49656e69 0x6c65746e Cpu Family 6 (0x6), Model 214 (0xd6) [ OK ] LibYUVBaseTest.TestCpuId (34 ms) [ RUN ] LibYUVBaseTest.TestCpuHas Kernel Version 6.10 Has X86 0x8 Has SSE2 0x100 Has SSSE3 0x200 Has SSE4.1 0x400 Has SSE4.2 0x800 Has AVX 0x1000 Has AVX2 0x2000 Has ERMS 0x4000 Has FSMR 0x8000 Has FMA3 0x10000 Has F16C 0x20000 Has AVX512BW 0x40000 Has AVX512VL 0x80000 Has AVX512VNNI 0x100000 Has AVX512VBMI 0x200000 Has AVX512VBMI2 0x400000 Has AVX512VBITALG 0x800000 Has AVX10 0x1000000 Has AVX10_2 0x2000000 HAS AVXVNNI 0x4000000 Has AVXVNNIINT8 0x8000000 Has AMXINT8 0x10000000 [ OK ] LibYUVBaseTest.TestCpuHas (10 ms) This is how oneDNN does avx10 version: `e15d2c220f/src/cpu/x64/xbyak/xbyak_util.h (L698-L701)` Bug: b/350318244 Change-Id: I6f78402fecc38a92019d137b3439d7bce950510c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6172267 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-01-21 13:53:19 -08:00
Frank Barchard	47ddac2996	Sub sampling conversions use CopyPlane for Y channel - Replace ScalePlane with CopyPlane for Y channel - Vertical mirroring is supported, but not horizontal mirroring. - Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y. - Apply clang-format - Bump version to 1899 Bug: None Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:34:11 -08:00
Frank Barchard	e0040eb318	Apply clang format Bug: None Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:31:20 -08:00
Darren Hsieh	b5a18f9d93	[RVV] Optimize ScaleARGBFilterCols with RVV * Run on SiFive internal FPGA: Test Case Speedup ARGBScaleDownBy3by8_Linear x2.05 ARGBScaleDownBy3by8_Bilinear x1.76 ARGBScaleDownBy3by8_Box x1.76 Bug: 42280924 Co-Developed-by: Bruce Lai <bruce.lai@sifive.com> Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8 Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-29 17:32:00 -08:00
George Steed	db5a71c528	[AArch64] Remove unused variables in HalfRow_{16To8,16}_SME The HalfRow kernels assume that the fraction is exactly half, so there is no need to calculate it. No-Try: True Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2024-12-13 08:00:58 -08:00
George Steed	7fd0bd197e	[AArch64] Port YUVToRGB color conversions to SME Some of the color conversion kernels already have Streaming-SVE implementations however many do not. We can re-use the existing SVE implementation by moving it to a new shared row_sve.h header and marking it with a "streaming-compatible" attribute to ensure it can be called from both streaming and non-streaming execution modes. As part of this move to a common header we also add duplicated streaming-mode implementations of the following kernels that did not previously have an SME implementation: - I210AlphaToARGBRow_SME - I210ToAR30Row_SME - I210ToARGBRow_SME - I212ToAR30Row_SME - I212ToARGBRow_SME - I400ToARGBRow_SME - I410AlphaToARGBRow_SME - I410ToAR30Row_SME - I410ToARGBRow_SME - I422AlphaToARGBRow_SME - I422ToARGB1555Row_SME - I422ToARGB4444Row_SME - I422ToRGB24Row_SME - I422ToRGB565Row_SME - I422ToRGBARow_SME - I444AlphaToARGBRow_SME - NV12ToARGBRow_SME - NV12ToRGB24Row_SME - NV21ToARGBRow_SME - NV21ToRGB24Row_SME - P210ToAR30Row_SME - P210ToARGBRow_SME - P410ToAR30Row_SME - P410ToARGBRow_SME - UYVYToARGBRow_SME - YUY2ToARGBRow_SME Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:07:54 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
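
The special case described in c2e7f8389a can be illustrated in plain C. This is a minimal sketch, not libyuv's SME kernel: the name `interpolate_row_sketch`, its parameter order, and the rounding term are assumptions chosen for illustration. The blend weight runs from 0 to 256, and 256 does not fit in a byte, so the sketch copies the second row outright instead of multiplying by it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of libyuv's API.
 * Blends two rows of 8-bit samples: `src` and the row `src_stride` bytes
 * after it. `source_y_fraction` is the weight of the second row in 1/256
 * units, so valid values are 0..256 inclusive. */
void interpolate_row_sketch(uint8_t* dst, const uint8_t* src,
                            ptrdiff_t src_stride, int width,
                            int source_y_fraction) {
  const uint8_t* src1 = src + src_stride;
  int y0, y1, x;

  assert(width >= 0);
  assert(source_y_fraction >= 0 && source_y_fraction <= 256);

  if (source_y_fraction == 0) {
    /* Weight 0: the output is exactly the first row. */
    memcpy(dst, src, (size_t)width);
    return;
  }
  if (source_y_fraction == 256) {
    /* Weight 256 would overflow a byte-sized multiplier; the output is
     * exactly the second row, so a copy is enough. */
    memcpy(dst, src1, (size_t)width);
    return;
  }

  /* General case: both weights are now in 1..255 and fit in a byte. */
  y1 = source_y_fraction;
  y0 = 256 - y1;
  for (x = 0; x < width; ++x) {
    dst[x] = (uint8_t)((src[x] * y0 + src1[x] * y1 + 128) >> 8);
  }
}
```

A vector kernel that keeps the weights in byte lanes needs the same early exit, which is why the commit adds it to InterpolateRow_SME and InterpolateRow_16_SME but not to InterpolateRow_16To8_SME, which is never called with a fraction of 256.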

1995 Commits