1892 Commits

Author SHA1 Message Date
Frank Barchard
900da61d3c Experimental SVE FMMLA detect
Detect whether the Arm CPU supports the FMMLA instruction

Bug: None
Change-Id: Ia7b83bf2735ddeeb8a85da44177e708c34e4b1fb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7085486
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-10-27 14:34:55 -07:00
Frank Barchard
500f45652c Fix for ARM32 build when built with __SOFTFP__
planar_test.cc failed with:
  Error: selected processor does not support `vmrs r3,fpscr' in ARM mode
  Error: selected processor does not support `vmsr fpscr,r3' in ARM mode

Bug: None
Change-Id: I2ee0e7191c372277901c94e29d9ed91bbac71af2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7063737
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-10-20 11:54:25 -07:00
Mark Zhuang
e237e8d7fb RVV: Enable some functions for intrinsics >= v1.0
According to the README of rvv-intrinsic-doc,
Clang 19 and GCC 14 support the v1.0 version.
But __riscv_v_intrinsic is 12000 on Clang 19,
so Clang >= 20 is needed to test this patch.
I tested it with Clang 21.
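
The version gate described above can be sketched in C. This assumes the `__riscv_v_intrinsic` encoding of major * 1000000 + minor * 1000 + patch, which matches the 12000 value quoted for v0.12; the helper names are hypothetical and are not libyuv's.

```c
#include <assert.h>

/* Encode an intrinsic spec version the way __riscv_v_intrinsic is assumed
   to: major * 1000000 + minor * 1000 + patch. */
long rvv_intrinsic_version(int major, int minor, int patch) {
  assert(major >= 0 && minor >= 0 && patch >= 0);
  return major * 1000000L + minor * 1000L + patch;
}

/* Hypothetical helper: nonzero when a reported value is at least v1.0. */
int rvv_intrinsic_is_v1(long reported) {
  return reported >= rvv_intrinsic_version(1, 0, 0);
}

/* Value reported by the current compiler, or 0 when RVV intrinsics are
   not available in this build. */
long rvv_reported_version(void) {
#if defined(__riscv_v_intrinsic)
  return __riscv_v_intrinsic;
#else
  return 0;
#endif
}
```

With this encoding, a compiler reporting 12000 (v0.12) fails the check, while one reporting 1000000 or more passes it.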

Change-Id: I0e75efcdab3e7bc0ce1acd19eca3568b47c84cbf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6995438
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-10-17 11:44:14 -07:00
Wan-Teh Chang
fcd7060e0d Bump LIBYUV_VERSION for removal of MIPS support
Bump LIBYUV_VERSION to 1921. Missed in
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953.

Bug: 434383432
Change-Id: If51122f1b744718551b0b601ead7cacb8c46c20d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7050411
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-10-16 13:32:52 -07:00
Frank Barchard
2b4453d46f Deprecate MIPS and MSA support.
- Remove *_msa.cc source files
- Update build files
- Update header references, planar ifdefs for row functions
- Update documentation on supported platforms
- Version bumped to 1921
- clang-format applied

Bug: 434383432
Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953
Reviewed-by: Justin Green <greenjustin@google.com>
2025-10-16 12:20:40 -07:00
Frank Barchard
94417b9d21 Pass rgbconstants via struct pointer instead of elements with m
Now 66 instructions
SYM ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push ebx
62ccd4: BASE       push edi
62ccd5: BASE       push esi
62ccd6: BASE       and esp, 0xfffffffc
62ccd9: BASE       sub esp, 0xc
62ccdc: BASE       call 0x62cce1 <ARGBToUVRow_SSSE3+0x11>
62cce1: BASE       pop eax
62cce2: BASE       add eax, 0xe1c27
62cce8: BASE       mov ecx, dword ptr [ebp+0xc]
62cceb: BASE       mov edx, dword ptr [ebp+0x8]
62ccee: BASE       mov esi, dword ptr [ebp+0x10]
62ccf1: BASE       mov edi, dword ptr [ebp+0x18]
62ccf4: BASE       mov dword ptr [esp+0x8], edi
62ccf8: BASE       mov edi, dword ptr [ebp+0x14]
62ccfb: BASE       lea ebx, ptr [eax-0x5ecf88]
62cd01: SSE2       movdqa xmm4, xmmword ptr [ebx]
62cd05: SSE2       movdqa xmm5, xmmword ptr [ebx+0x10]
62cd0a: SSE2       pcmpeqb xmm6, xmm6
62cd0e: SSSE3      pabsb xmm6, xmm6
62cd13: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd1b: BASE       sub edi, esi

62cd1d: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd21: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd26: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd2b: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd31: SSSE3      pshufb xmm0, xmm7
62cd36: SSSE3      pshufb xmm1, xmm7
62cd3b: SSSE3      pshufb xmm2, xmm7
62cd40: SSSE3      pshufb xmm3, xmm7
62cd45: SSSE3      pmaddubsw xmm0, xmm6
62cd4a: SSSE3      pmaddubsw xmm1, xmm6
62cd4f: SSSE3      pmaddubsw xmm2, xmm6
62cd54: SSSE3      pmaddubsw xmm3, xmm6
62cd59: SSE2       paddw xmm0, xmm2
62cd5d: SSE2       paddw xmm1, xmm3
62cd61: SSE2       pxor xmm2, xmm2
62cd65: SSE2       psrlw xmm0, 0x1
62cd6a: SSE2       psrlw xmm1, 0x1
62cd6f: SSE2       pavgw xmm0, xmm2
62cd73: SSE2       pavgw xmm1, xmm2
62cd77: SSE2       packuswb xmm0, xmm1
62cd7b: SSE2       movdqa xmm2, xmm6
62cd7f: SSE2       psllw xmm2, 0xf
62cd84: SSE2       movdqa xmm1, xmm0
62cd88: SSSE3      pmaddubsw xmm1, xmm5
62cd8d: SSSE3      pmaddubsw xmm0, xmm4
62cd92: SSSE3      phaddw xmm0, xmm1
62cd97: SSE2       psubw xmm2, xmm0
62cd9b: SSE2       psrlw xmm2, 0x8
62cda0: SSE2       packuswb xmm2, xmm2
62cda4: SSE2       movd dword ptr [esi], xmm2
62cda8: SSE2       pshufd xmm2, xmm2, 0x55
62cdad: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdb2: BASE       lea edx, ptr [edx+0x20]
62cdb5: BASE       lea esi, ptr [esi+0x4]
62cdb8: BASE       sub dword ptr [esp+0x8], 0x8
62cdbd: BASE       jnle 0x62cd1d <ARGBToUVRow_SSSE3+0x4d>

62cdc3: BASE       lea esp, ptr [ebp-0xc]
62cdc6: BASE       pop esi
62cdc7: BASE       pop edi
62cdc8: BASE       pop ebx
62cdc9: BASE       pop ebp
62cdca: BASE       ret

Was 68 instructions
ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push edi
62ccd4: BASE       push esi
62ccd5: BASE       and esp, 0xfffffff0
62ccd8: BASE       sub esp, 0x30
62ccdb: BASE       call 0x62cce0 <ARGBToUVRow_SSSE3+0x10>
62cce0: BASE       pop eax
62cce1: BASE       add eax, 0xe1c28
62cce7: BASE       mov ecx, dword ptr [ebp+0xc]
62ccea: BASE       mov edx, dword ptr [ebp+0x8]
62cced: BASE       mov esi, dword ptr [ebp+0x10]
62ccf0: BASE       mov edi, dword ptr [ebp+0x18]
62ccf3: BASE       mov dword ptr [esp+0xc], edi
62ccf7: BASE       mov edi, dword ptr [ebp+0x14]
62ccfa: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf88]
62cd01: SSE        movaps xmmword ptr [esp+0x20], xmm0
62cd06: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf78]
62cd0d: SSE        movaps xmmword ptr [esp+0x10], xmm0
62cd12: SSE2       movdqa xmm4, xmmword ptr [esp+0x20]
62cd18: SSE2       movdqa xmm5, xmmword ptr [esp+0x10]
62cd1e: SSE2       pcmpeqb xmm6, xmm6
62cd22: SSSE3      pabsb xmm6, xmm6
62cd27: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd2f: BASE       sub edi, esi

62cd31: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd35: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd3a: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd3f: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd45: SSSE3      pshufb xmm0, xmm7
62cd4a: SSSE3      pshufb xmm1, xmm7
62cd4f: SSSE3      pshufb xmm2, xmm7
62cd54: SSSE3      pshufb xmm3, xmm7
62cd59: SSSE3      pmaddubsw xmm0, xmm6
62cd5e: SSSE3      pmaddubsw xmm1, xmm6
62cd63: SSSE3      pmaddubsw xmm2, xmm6
62cd68: SSSE3      pmaddubsw xmm3, xmm6
62cd6d: SSE2       paddw xmm0, xmm2
62cd71: SSE2       paddw xmm1, xmm3
62cd75: SSE2       pxor xmm2, xmm2
62cd79: SSE2       psrlw xmm0, 0x1
62cd7e: SSE2       psrlw xmm1, 0x1
62cd83: SSE2       pavgw xmm0, xmm2
62cd87: SSE2       pavgw xmm1, xmm2
62cd8b: SSE2       packuswb xmm0, xmm1
62cd8f: SSE2       movdqa xmm2, xmm6
62cd93: SSE2       psllw xmm2, 0xf
62cd98: SSE2       movdqa xmm1, xmm0
62cd9c: SSSE3      pmaddubsw xmm1, xmm5
62cda1: SSSE3      pmaddubsw xmm0, xmm4
62cda6: SSSE3      phaddw xmm0, xmm1
62cdab: SSE2       psubw xmm2, xmm0
62cdaf: SSE2       psrlw xmm2, 0x8
62cdb4: SSE2       packuswb xmm2, xmm2
62cdb8: SSE2       movd dword ptr [esi], xmm2
62cdbc: SSE2       pshufd xmm2, xmm2, 0x55
62cdc1: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdc6: BASE       lea edx, ptr [edx+0x20]
62cdc9: BASE       lea esi, ptr [esi+0x4]
62cdcc: BASE       sub dword ptr [esp+0xc], 0x8
62cdd1: BASE       jnle 0x62cd31 <ARGBToUVRow_SSSE3+0x61>

62cdd7: BASE       lea esp, ptr [ebp-0x8]
62cdda: BASE       pop esi
62cddb: BASE       pop edi
62cddc: BASE       pop ebp
62cddd: BASE       ret
62cdde: BASE       int3
BUG=444157316

Change-Id: Iad044f851359f5b052091c7bdab9b96946fc3682
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6987370
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-29 12:34:36 -07:00
Frank Barchard
7155afc5ca ARGBToUV AVX2 for x86 32 bit
- Reduce to 10 ymm registers - 2 constants generated on the fly

Change-Id: Ib25a0cf7c93e5048270735410ccf6723b3949454
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967319
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-18 13:14:45 -07:00
Frank Barchard
142db12947 ARGBToUV use AVX2 for 64 bit x86
Skylake
Was ARGBToJ420_Opt (312 ms)
Now ARGBToJ420_Opt (242 ms)

Icelake
Was ARGBToJ420_Opt (302 ms)
Now ARGBToJ420_Opt (220 ms)

AMD Zen3 on Windows
Was ARGBToJ420_Opt (305 ms)
Now ARGBToJ420_Opt (216 ms)
32 bit x86 uses SSE
Now ARGBToJ420_Opt (326 ms)

MCA analysis of new AVX, SSE and old AVX
https://godbolt.org/z/37bdazWYr

Bug: None
Change-Id: I72f5504407751e164c3558aebe836dd15223d65f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-17 14:39:53 -07:00
Mark Zhuang
b33794a586 RVV: Don't disable all RVV optimizations when RVV >= v0.12
Disabled since Patch v2 of
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788

Change-Id: Id30a62c8f164830204dde02a443f5e4f04d757db
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6953818
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-09-16 18:17:02 -07:00
Frank Barchard
a61882c049 ARGBToUV AVX2 for x86_64
Icelake
Was SSSE3+SSSE3 ARGBToJ420_Opt (356 ms)
Was SSSE3+AVX2  ARGBToJ420_Opt (301 ms)
Now AVX2+AVX2   ARGBToJ420_Opt (227 ms)

Change-Id: I2cb427bc164b225b3ad4c5f43c09d6da6ca496d5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6943036
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-16 11:33:54 -07:00
Frank Barchard
0f795672ae Reduce ARGBToUV SSSE3 register usage for clang build error on x64
Bug: 444157316
Change-Id: I2ae9f3dbfb373bb874a3d9699987f7d5b63f2610
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6937665
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-10 18:40:06 -07:00
Frank Barchard
d71cda1bb0 Rollback util cpuid hybrid detect due to android build errors
Bug: 438241552
Change-Id: Ie56aa7296e796e44e63d0dd913120b897b12cc9b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6843504
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-08-12 14:13:24 -07:00
Frank Barchard
cdd3bae848 TestI400LargeSize fix for warning message build error
- change %ld to %zd for size_t printf warnings
- disable TestI400LargeSize when disabling SLOW_TESTS
- disable cpuid tests that read proc/cpuinfo test data files
- add ifdef around timers to allow hexagon build
- remove faulty hybrid detect
- remove old mips LIBYUV_DISABLE_DSPR2 reference in gyp build
- apply clang-format
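
The printf change in the first bullet can be illustrated with a small sketch. It uses `%zu`, the conversion C99 defines for `size_t` (`%zd` is the signed counterpart); `format_size` is a hypothetical helper, not part of the test code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format a size_t without assuming it has the same width as long.
   "%ld" is only correct where long and size_t happen to match. */
int format_size(char* buf, size_t buf_len, size_t value) {
  assert(buf != NULL && buf_len > 0);
  return snprintf(buf, buf_len, "%zu", value);
}
```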

Bug: 434382656
Change-Id: Id74812e6ef29d4a8d0ff967a9189d249b80816d4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6812825
Reviewed-by: Jeremy Leconte <jleconte@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-08-01 12:03:11 -07:00
Frank Barchard
3ff31b2a5f Make LibYUVConvertTest.TestI400LargeSize skip test on low end arm cpu
- detect lack of dot product instruction to infer the cpu is low end
- only run the test on higher end arm

Bug: 416842099
Change-Id: Idd2dd16a624bbba280cf531644440024b12f7ecf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6804632
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2025-07-31 02:41:17 -07:00
George Steed
007b920232 [AArch64] Add SME implementation of ARGBToUVRow and similar
Mostly just a straightforward copy of the existing SVE2 code ported to
Streaming-SVE. Introduce new "any" kernels for non-multiple of two
cases, matching what we already do for SVE2.

The existing SVE2 code makes use of the Neon MOVI instruction that is
not supported in Streaming-SVE, so adjust the code to use FMOV instead
which has the same performance characteristics.

Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-06-30 09:20:23 -07:00
George Steed
88798bcd63 [AArch64] Add SME implementation of Convert8To16Row_SME
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. There is no benefit from this kernel when the SVE vector
length is only 128 bits, so skip writing a non-streaming SVE
implementation.

Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-06-23 11:32:56 -07:00
Frank Barchard
6f729fbe65 ARGBToUV SSE use average of 4 pixels
- Was using avgb twice for non-exact and C for exact.

On Skylake Xeon:

Now SSE3
ARGBToJ420_Opt (326 ms)

Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)
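
A scalar sketch of why averaging twice is "not exact": each byte-wise rounding average rounds up, so applying it twice can differ from a single rounded average of all 4 pixels. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding average of two bytes, the per-lane operation of pavgb. */
uint8_t avg2(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b + 1) >> 1);
}

/* Two-step average: rounds twice, so the result can be biased upward. */
uint8_t avg4_two_step(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return avg2(avg2(a, b), avg2(c, d));
}

/* Average of 4 with a single rounding step. */
uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return (uint8_t)((a + b + c + d + 2) >> 2);
}

/* Usage example: the inputs 0, 1, 0, 0 expose the double rounding. */
void avg4_demo(void) {
  assert(avg4_two_step(0, 1, 0, 0) == 1);
  assert(avg4_exact(0, 1, 0, 0) == 0);
}
```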

Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-06-17 11:55:27 -07:00
Frank Barchard
889613683a Add hybrid detect for Intel laptop cpus
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h

Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-06-13 13:22:54 -07:00
George Steed
1b2f6cdbe8 [AArch64] Unroll I210ToAR30Row_{SVE2,SME}
Now that we have a STOREAR30_SVE_2X implementation, we can use this to
unroll other kernels. The predication on I210ToAR30Row needs adjusting
to allow loading two vectors of Y compared to one vector of U/V, and
additionally UZP is needed to ensure the data arrangement in vector
lanes matches the U/V layout. LD2H could also be used, however this
provides no performance improvement on most cores and would necessitate
the addition of an "any" kernel to handle the case where width % 2 != 0.

Reduction in run times of I210ToAR30Row_SVE2 observed compared to the
previous SVE2 implementation: (note that even in the observed slowdowns,
the SVE2 implementation still outperforms the existing Neon code)

Cortex-A510: -37.1%
Cortex-A520: -39.1%
Cortex-A710: +1.6% (!)
Cortex-A715: +6.5% (!)
Cortex-A720: +6.5% (!)
  Cortex-X2: -2.9%
  Cortex-X3: -2.2%
  Cortex-X4: -8.8%
Cortex-X925: -3.5%

Change-Id: I2ff285b48105883526eceb8be1fcbe0e033a553b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640989
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-06-12 14:10:21 -07:00
George Steed
867bdc51ed [AArch64] Unroll I422ToAR30Row_{SVE2,SME}
The existing STOREAR30_SVE macro works fine for out of order cores,
however for in-order cores the number of dependent vector instructions
laid out consecutively impacts performance.

We can improve this by unrolling the loop to process two sets of vectors
at a time, allowing little cores to process two independent streams of
vector instructions at the same time to improve performance. Using one
set of ZIP instructions at the end allows us to (a) avoid ST4 which we
know is slow on some micro-architectures, and (b) enable the use of
predication and avoid the need for separate "any" kernels.

Reduction in run times of I422ToAR30Row_SVE2 observed compared to the
previous SVE2 implementation:

Cortex-A510: -37.7%
Cortex-A520: -38.8%
Cortex-A710: -14.8%
Cortex-A715: -17.1%
Cortex-A720: -16.9%
  Cortex-X2: -10.3%
  Cortex-X3:  -6.7%
  Cortex-X4:  -9.4%
Cortex-X925:  -7.1%

Change-Id: I160fb41300d2d08fce2e6eb92181324fd723a02d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6632916
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-06-12 14:09:49 -07:00
Frank Barchard
4ac0a3ae3d ubsan compliant '_any' functions using ptrdiff_t for pointer math
Bug: 416842099
Change-Id: I1e3c7bc1b363c11baeb3b529ee78e5ac8878c359
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6634217
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-06-10 15:01:52 -07:00
George Steed
cd0ae0a222 row_sve.h: Add missing z21 clobber
The z21 register is used in the I444TORGB_SVE_2X macro and other places,
so add it to the clobber list macro that is used throughout this file.

Change-Id: If4277c1ffcac0fa68cc44263acc6f41a9e82ec8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6619508
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-06-08 19:41:44 -07:00
George Steed
998bec7ca9 Sort row.h #define *_NEON lists
Sort the Arm Neon and Neon DotProd #define lists to match the
alphabetical ordering used for the SVE2 and SME lists.

Change-Id: Ibeb380f477d5476d0018d20a754557a5f93f2190
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6613686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-06-08 19:38:30 -07:00
George Steed
ef9833fc70 Add Neon implementation of Convert8To16Row
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16-bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.
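
A scalar sketch of how a power-of-two scale can fold into the shift. It assumes the C reference replicates each byte with a 0x0101 multiply, applies `scale`, and shifts right by 16; that shape is an assumption, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reference shape: replicate the byte into 16 bits, multiply by
   scale, then keep the top 16 bits of the product. */
uint16_t convert8to16_reference(uint8_t v, uint32_t scale) {
  return (uint16_t)(((uint32_t)v * 0x0101u * scale) >> 16);
}

/* When scale == 1 << k, the multiply by scale is a left shift by k, so it
   can be merged with the right shift by 16 into one shift by (16 - k). */
uint16_t convert8to16_shift(uint8_t v, unsigned k) {
  assert(k <= 15); /* scale fits in 16 bits */
  return (uint16_t)(((uint32_t)v * 0x0101u) >> (16 - k));
}
```

For example, `k = 10` (scale 1024) maps 255 to 1023, the largest 10-bit value.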

Reduction in run times observed compared to the existing C
implementation:

 Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
 Cortex-A76: -61.6%
Cortex-A710: -57.6%
  Cortex-X1: -46.5%
  Cortex-X2: -54.4%
  Cortex-X3: -57.1%
  Cortex-X4: -55.0%
Cortex-X925: -49.3%

Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-29 13:37:48 -07:00
George Steed
7e5863ae5a Add SVE2 and SME implementations of I422ToAR30Row
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
  Cortex-X2: -5.7%
  Cortex-X3: -5.9%
  Cortex-X4: -2.8%
Cortex-X925: -4.0%

Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-27 11:39:00 -07:00
George Steed
949cb623bf Add SVE2 and SME implementations of I444ToRGB24Row
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715:  -9.2%
Cortex-A720:  -9.2%
  Cortex-X2: -25.8%
  Cortex-X3: -26.2%
  Cortex-X4: -23.2%
Cortex-X925: -17.8%

Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-22 13:33:06 -07:00
Frank Barchard
0853c9353f ARGBToUV 64 bit use ymm8 for shuffler
Bug: 381138208
Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-05-12 15:09:40 -07:00
George Steed
61bdaee13a Add Neon I8MM implementations of ARGB to UV and variants
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.
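
A scalar sketch of the negated-constant trick. The weights below are illustrative only, not libyuv's actual matrix, and `usdot3` is a hypothetical model of an unsigned-by-signed dot product rather than the USDOT instruction itself.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an unsigned-by-signed dot product over three channels:
   pixels are uint8_t, coefficients are int8_t. */
int32_t usdot3(const uint8_t px[3], const int8_t coeff[3]) {
  return px[0] * coeff[0] + px[1] * coeff[1] + px[2] * coeff[2];
}

/* Illustrative weights {+128, -85, -43}. +128 does not fit in int8_t, but
   -128 does, so the table holds the negated weights and the sign is
   flipped back after the dot product. */
int32_t weighted_sum_negated(const uint8_t px[3]) {
  static const int8_t kNegCoeff[3] = {-128, 85, 43};
  assert(INT8_MAX == 127 && INT8_MIN == -128);
  return -usdot3(px, kNegCoeff);
}
```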

Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:

Cortex-A510:  -7.1%
Cortex-A520:  -2.1%
Cortex-A710:  -8.4%
Cortex-A715:  -0.3%
Cortex-A720:  -0.3%
  Cortex-X2: -40.0%
  Cortex-X3: -43.3%
  Cortex-X4: -11.3%
Cortex-X925:  -2.5%

Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-05-12 11:14:00 -07:00
Frank Barchard
9f9b5cf660 ARGBToUV allow 32 bit x86 build
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function

Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-04-28 12:11:00 -07:00
Wan-Teh Chang
8c48036d15 Remove duplicate code in planar_functions.h
The declarations of ARGBAffineRow_C and ARGBAffineRow_SSE2 and the code
to support those declarations are duplicated in planar_functions.h. They
are already in row.h, so we can simply remove them.

Change-Id: I9b522fdd201ca530f1268bf4200cd2e18b806ba5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434733
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
2025-04-04 15:48:23 -07:00
Wan-Teh Chang
b7a857659f Disable Arm SME and SVE assembly code under MSan
The code that disables Arm and Intel assembly code under MSan is
duplicated in cpu_support.h and planar_functions.h. This CL does not
address the code duplication.

Bug: b:407277484, b:407278016, b:407278132
Change-Id: If70fd8d3382916041d75efabcc84010ea3f1e60e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6430806
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-04-03 11:27:31 -07:00
Frank Barchard
23d416d6f3 Detect SME without SVE dependency
Bug: None
Change-Id: Ibe29488e893a493699ea3fae1a1a54a4fff5969c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6418571
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-31 17:27:40 -07:00
Frank Barchard
f145aa26da Add SME2 detect
Bug: None
Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373
Reviewed-by: George Steed <george.steed@arm.com>
2025-03-27 11:08:08 -07:00
George Steed
64ac2d8f0f Avoid odd width stores in I422ToRGB565Row_{SVE2,SME}
The existing code for creating RGB565 data in SVE2 and SME produces two
vectors of interleaved 16-bit elements due to the nature of how SVE
widening instructions operate. This means that the indices of the 16-bit
data created appear in the two result vectors as such:

    z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...]
    z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...]

This is problematic for the final (predicated) iteration of the
conversion since the p1 predicate input to the ST2H instruction controls
storing the four bytes corresponding to the first two elements, in the
first two bytes of z18 and z19. This means that in the case that the
width is an odd number there is no way of storing just elem0 in z18
individually.

This patch addresses this by permuting the z18/z19 data such that the
two bytes from each element are split evenly across the two vectors:

    z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...]
    z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...]

Since we would now always store the same lanes from both vectors we can
continue to use the same predicate without further changes.

The existing (non-tail) loop body utilizes an all-true predicate so we
can avoid the extra permutes in this case, avoiding any performance
degradation.
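
The permuted layout can be modelled in scalar C. This is a sketch of the data arrangement only, not SVE code: `split_bytes` stands in for the permute that produces the z20/z21 arrangement, `store_interleaved_bytes` stands in for a predicated two-vector interleaving byte store, and both names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split 16-bit elements into a vector of low bytes (byte0) and a vector of
   high bytes (byte1), so lane i of both vectors belongs to element i. */
void split_bytes(const uint16_t* elems, size_t lanes,
                 uint8_t* lo, uint8_t* hi) {
  for (size_t i = 0; i < lanes; ++i) {
    lo[i] = (uint8_t)(elems[i] & 0xff);
    hi[i] = (uint8_t)(elems[i] >> 8);
  }
}

/* Model of a predicated interleaving store: only the first `active` lanes
   are written, two bytes per lane. Because each lane now holds exactly one
   element, an odd `active` count stores an odd number of elements. */
void store_interleaved_bytes(uint8_t* dst, const uint8_t* lo,
                             const uint8_t* hi, size_t lanes, size_t active) {
  assert(active <= lanes);
  for (size_t i = 0; i < active; ++i) {
    dst[2 * i] = lo[i];
    dst[2 * i + 1] = hi[i];
  }
}
```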

Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-03-26 04:08:46 -07:00
Frank Barchard
5f284054cb RVV disable 64 bit elements and vcombine_v
Bug: 405451074
Change-Id: I8e4437be92934b3c367c94d867d7967c32747260
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-25 12:51:25 -07:00
Frank Barchard
c060118bea ARGBToJ444 use 256 for fixed point scale UV
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc

Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-27 13:04:15 -08:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
d32d19ccf2 UV subsample on ARM use rounding average of 4 pixels
Performance on Samsung S22 Exynos (SVE2+I8MM+DOTPROD+Neon)
AArch64
ARGBToI400_Opt (168 ms)
ARGBToJ400_Opt (103 ms)
ABGRToJ400_Opt (81 ms)
RGBAToJ400_Opt (82 ms)
RGB24ToJ400_Opt (176 ms)
RAWToJ400_Opt (176 ms)
ABGRToI420_Opt (258 ms)
ARGBToI420_Opt (259 ms)
ARGBToI422_Opt (403 ms)
ARGBToI444_Opt (213 ms)
ARGBToJ420_Opt (257 ms)
ARGBToJ422_Opt (403 ms)
ARGBToJ444_Opt (214 ms)
ABGRToJ420_Opt (255 ms)
ABGRToJ422_Opt (399 ms)
ARGB4444ToI420_Opt (285 ms)
RGB565ToI420_Opt (316 ms)
ARGB1555ToI420_Opt (324 ms)
BGRAToI420_Opt (260 ms)
RAWToI420_Opt (303 ms)
RAWToI444_Opt (303 ms)
RAWToJ420_Opt (335 ms)
RAWToJ444_Opt (308 ms)
RGB24ToI420_Opt (372 ms)
RGB24ToJ420_Opt (365 ms)
RGBAToI420_Opt (259 ms)

AArch32 (Neon)
ARGBToI400_Opt (496 ms)
ARGBToJ400_Opt (478 ms)
ABGRToJ400_Opt (483 ms)
RGBAToJ400_Opt (493 ms)
RGB24ToJ400_Opt (343 ms)
RAWToJ400_Opt (341 ms)
ABGRToI420_Opt (993 ms)
ARGBToI420_Opt (992 ms)
ARGBToI422_Opt (1503 ms)
ARGBToI444_Opt (1257 ms)
ARGBToJ420_Opt (1006 ms)
ARGBToJ422_Opt (1521 ms)
ARGBToJ444_Opt (1267 ms)
ABGRToJ420_Opt (1002 ms)
ABGRToJ422_Opt (1504 ms)
ARGB4444ToI420_Opt (1180 ms)
RGB565ToI420_Opt (1112 ms)
ARGB1555ToI420_Opt (1115 ms)
BGRAToI420_Opt (993 ms)
RAWToI420_Opt (703 ms)
RAWToI444_Opt (1717 ms)
RAWToJ420_Opt (704 ms)
RAWToJ444_Opt (1739 ms)
RGB24ToI420_Opt (703 ms)
RGB24ToJ420_Opt (703 ms)
RGBAToI420_Opt (993 ms)

Bug: 381138208
Change-Id: I33728d5237f357362b0bfc509a9ebe6fe46f45d4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6228987
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-02-04 15:19:19 -08:00
George Steed
ccdf870348 [AArch64] Fix up inline asm name in Convert8To8Row_SVE_SC
The existing implementation mistakenly refers to the parameter as %2.
This works fine; however, the parameter is already named %[width], and
using the name should be preferred.

Change-Id: Ifaf8fc83cdfc9b15c79d52e7e47cb72b53270a12
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6225753
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-02-04 10:28:17 -08:00
Frank Barchard
5a9a6ea936 Add RAWToI444
Skylake Xeon
  RAWToI444_Opt (433 ms)
  RAWToJ444_Opt (1781 ms)
  ARGBToI444_Opt (352 ms)
  ARGBToJ444_Opt (1577 ms)

Samsung S22 Exynos
  ARGBToI444_Opt (283 ms)
  ARGBToJ444_Opt (209 ms)
  RAWToI444_Opt (294 ms)
  RAWToJ444_Opt (293 ms)

Profiling on Samsung S22 Exynos
37.62%,  ARGBToUV444Row_NEON_I8MM
29.42%,  RAWToARGBRow_SVE2
19.61%,  ARGBToYRow_NEON_DotProd

Passing different --libyuv_cpu_info=N etc we can compare each ISA
C           1  RAWToI444_Opt (781 ms)
NEON      511  RAWToI444_Opt (757 ms)
NEONDOT  1023  RAWToI444_Opt (571 ms)
NEONI8MM 2047  RAWToI444_Opt (334 ms)
SVE2     8191  RAWToI444_Opt (307 ms)
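
Each value passed above is one less than a power of two, which suggests a cumulative bitmask that enables every flag up to a given bit. That reading is inferred from the numbers alone; the flag bit positions are not stated here, and `mask_through_bit` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits 0..top_bit set, i.e. (1 << (top_bit + 1)) - 1. */
uint32_t mask_through_bit(unsigned top_bit) {
  assert(top_bit < 31);
  return (UINT32_C(1) << (top_bit + 1)) - 1;
}
```

Under that assumption, 511 sets bits 0 through 8, 1023 adds bit 9, 2047 adds bit 10, and 8191 sets bits 0 through 12.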



Bug: 390247964
Change-Id: I0316fedd32222588455afa751f5b854f46bce024
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-02-03 16:13:03 -08:00
Frank Barchard
b3fd3f3f3b Fix ARGBToUV444Row_NEON
- constants passed in are signed and need to be negated to positive.

Bug: 394127527
Change-Id: I531e475d2ddd4583922d4abef13b9282d002dd7a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6226854
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-02-03 13:33:39 -08:00
Frank Barchard
96f98f6915 ARGBToJ444 and RAWToJ444 NEON
- Pass JPEG matrix to ARGBToUV444MatrixRow_NEON
- Remove NEON unsigned constants in favor of DOTPROD signed constants

Samsung S23:
Was C for UV
  ARGBToJ444_Opt (320 ms)
  RAWToJ444_Opt (411 ms)
Now I8MM
  ARGBToJ444_Opt (196 ms)
  RAWToJ444_Opt (301 ms)
NEON
  ARGBToJ444_Opt (505 ms)
  RAWToJ444_Opt (596 ms)

32 bit ARM NEON
  ARGBToJ444_Opt (1135 ms)
  RAWToJ444_Opt (1546 ms)

Profile of RAWToJ444
  37.72%  ARGBToUVJ444Row_NEON_I8MM
  34.48%  RAWToARGBRow_NEON
  14.65%  ARGBToYJRow_NEON_DotProd

Bug: 390247964
Change-Id: Ia26240bee974a0baf502548f2fc896b193c3006c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6220890
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-01-31 16:46:29 -08:00
Frank Barchard
c1bac9e6a5 RAWToJ444 and ARGBToJ444
- ARGBToJ444 implements ARGBToUVJ444Row_C
- RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444

libyuv_test '--gunit_filter=*R*To?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
(with bit exact off)

Samsung S23
RAWToJ444_Opt (437 ms)
ARGBToJ444_Opt (337 ms)
ARGBToI444_Opt (196 ms)

Skylake Xeon
RAWToJ444_Opt (1699 ms)
ARGBToJ444_Opt (1559 ms)
ARGBToI444_Opt (346 ms)

Bug: 390247964
Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-29 15:18:38 -08:00
George Steed
c4a0c8d34a [AArch64] Add SVE2 and SME implementations for Convert8To8Row
SVE can make use of the UMULH instruction to avoid needing separate
widening multiply and narrowing steps for the scale application.
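
A scalar C model of why this works, assuming 8 bit lanes: the high half of the 8x8 bit product is the same value that a widening multiply followed by a narrowing shift by 8 produces. The helper names are hypothetical and not part of libyuv, and the bias add that follows the scale is left out.

```c
#include <stdint.h>

// Hypothetical helper: widen to 16 bits, multiply, then narrow by keeping
// the top 8 bits. Models the separate widen, multiply and narrow steps.
uint8_t scale_widen_narrow(uint8_t x, uint8_t scale) {
  uint16_t wide = (uint16_t)((uint16_t)x * (uint16_t)scale);
  return (uint8_t)(wide >> 8);
}

// Hypothetical helper: models an unsigned multiply-high on 8 bit lanes,
// which returns the upper half of the double width product directly.
uint8_t umulh_u8(uint8_t x, uint8_t scale) {
  return (uint8_t)(((unsigned)x * (unsigned)scale) >> 8);
}

// Returns 1 when both formulations agree for every pair of 8 bit inputs.
int umulh_matches_widen_narrow(void) {
  for (unsigned x = 0; x < 256; ++x) {
    for (unsigned s = 0; s < 256; ++s) {
      if (scale_widen_narrow((uint8_t)x, (uint8_t)s) !=
          umulh_u8((uint8_t)x, (uint8_t)s)) {
        return 0;
      }
    }
  }
  return 1;
}
```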

Reduction in runtime for Convert8To8Row_SVE2 observed compared to the
existing Neon implementation:

        Cortex-A510: -13.2%
        Cortex-A520: -16.4%
        Cortex-A710: -37.1%
        Cortex-A715: -38.5%
        Cortex-A720: -38.4%
          Cortex-X2: -33.2%
          Cortex-X3: -31.8%
          Cortex-X4: -31.8%
        Cortex-X925: -13.9%

Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-01-28 15:53:26 -08:00
Frank Barchard
6c2415bfab J420ToI420 AVX2
libyuv_test '--gunit_filter=*J420ToI420*' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
AVX2 J420ToI420_Opt (114 ms)
C    J420ToI420_Opt (596 ms)

Sapphire Rapids
AVX2 J420ToI420_Opt (126 ms)
C    J420ToI420_Opt (717 ms)

Samsung S23
NEON J420ToI420_Opt (46 ms)
C    J420ToI420_Opt (95 ms)

Bug: 381327032
Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-27 11:23:44 -08:00
Frank Barchard
67f3f17d9a aarch32 J420ToI420
benchmark on medium core
adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Now Neon
J420ToI420_Opt (159 ms)
Was C
J420ToI420_Opt (215 ms)

AArch64
J420ToI420_Opt (93 ms)

C version does this:
vld1.8	{d20, d21}, [r6]!
vorr	q12, q8, q8
subs	r4, #16
vmovl.u8	q11, d21
vmovl.u8	q10, d20
vmul.i16	q11, q9, q11
vmul.i16	q10, q9, q10
vsra.u16	q12, q11, #8
vorr	q11, q8, q8
vsra.u16	q11, q10, #8
vmovn.i16	d21, q12
vmovn.i16	d20, q11
vst1.8	{d20, d21}, [r5]!
bne	0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54

Explanation of above C code
vorr moves 16 into register
vsra does shift + accumulate to that register

Compared to aarch64
instead of mull, C uses movl+mul
instead of uzp2, C uses sra #8 + movn, which takes 2 movn vs 1 uzp2
instead of add, C does vorr + sra
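
A scalar C model of one lane of the aarch32 loop above, with each statement annotated with the instruction it stands for. The function and parameter names are assumptions; the listing itself only shows that the constant 16 is copied in with vorr before the accumulate.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar model of the vector loop: dst = bias + ((src * scale) >> 8).
// Assumes scale fits in 8 bits so the product fits in 16 bits.
void convert8to8_row_sketch(const uint8_t* src, uint8_t* dst, int scale,
                            int bias, size_t width) {
  for (size_t i = 0; i < width; ++i) {
    uint16_t wide = src[i];                    // vmovl.u8: widen to 16 bits
    uint16_t prod = (uint16_t)(wide * scale);  // vmul.i16: 16 bit multiply
    uint16_t acc = (uint16_t)bias;             // vorr: copy the bias register
    acc = (uint16_t)(acc + (prod >> 8));       // vsra.u16 #8: shift and accumulate
    dst[i] = (uint8_t)acc;                     // vmovn.i16: narrow to 8 bits
  }
}
```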

Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-22 13:47:09 -08:00
Frank Barchard
26277baf96 J420ToI420 using planar 8 bit scaling
- Add Convert8To8Plane, which scales and offsets 8 bit values, allowing full
  range YUV to be converted to limited range YUV
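
A sketch of the plane level operation, for illustration only. The function name and the constants are assumptions: the scales are 219/255 and 224/255 in 8.8 fixed point, rounded to the nearest integer, and the bias is the limited range black level of 16. The values libyuv actually passes may differ.

```c
#include <stdint.h>

// Assumed constants derived from the nominal ranges, not taken from libyuv.
enum { kScaleY = 220, kScaleUV = 225, kBias = 16 };

// Hypothetical plane helper: scale each 8 bit sample, then add the bias.
// Strides are in bytes and may be larger than width.
void convert8to8_plane_sketch(const uint8_t* src, int src_stride,
                              uint8_t* dst, int dst_stride, int scale,
                              int bias, int width, int height) {
  for (int y = 0; y < height; ++y) {
    for (int x = 0; x < width; ++x) {
      dst[x] = (uint8_t)(((src[x] * scale) >> 8) + bias);
    }
    src += src_stride;
    dst += dst_stride;
  }
}
```

With these assumed constants, full range 0 and 255 map to 16 and 235 for Y, and chroma 128 and 255 map to 128 and 240.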

libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Samsung S23
J420ToI420_Opt (45 ms)
I420ToI420_Opt (37 ms)

Skylake
J420ToI420_Opt (596 ms)
I420ToI420_Opt (99 ms)

Bug: 381327032
Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-22 02:50:24 -08:00
Frank Barchard
ef52c1658a avx10_2 detect
Run with sde: only -dmr reports AVX10.2
emr:Has AVX10_2 0x0
adl:Has AVX10_2 0x0
icx:Has AVX10_2 0x0
snb:Has AVX10_2 0x0
tnt:Has AVX10_2 0x0
icl:Has AVX10_2 0x0
slm:Has AVX10_2 0x0
dmr:Has AVX10_2 0x2000000
cwf:Has AVX10_2 0x0
mrm:Has AVX10_2 0x0
skx:Has AVX10_2 0x0
wsm:Has AVX10_2 0x0
gnr:Has AVX10_2 0x0
gnr256:Has AVX10_2 0x0
bdw:Has AVX10_2 0x0
cpx:Has AVX10_2 0x0
rpl:Has AVX10_2 0x0
snr:Has AVX10_2 0x0
ptl:Has AVX10_2 0x0
slt:Has AVX10_2 0x0
ivb:Has AVX10_2 0x0
spr:Has AVX10_2 0x0
tgl:Has AVX10_2 0x0
arl:Has AVX10_2 0x0
srf:Has AVX10_2 0x0
nhm:Has AVX10_2 0x0
skl:Has AVX10_2 0x0
mtl:Has AVX10_2 0x0
pnr:Has AVX10_2 0x0
glp:Has AVX10_2 0x0
lnl:Has AVX10_2 0x0
cnl:Has AVX10_2 0x0
hsw:Has AVX10_2 0x0
clx:Has AVX10_2 0x0
glm:Has AVX10_2 0x0

sde -dmr -- libyuv_test --gunit_filter=*Cpu*
[ RUN      ] LibYUVBaseTest.TestCpuId
Cpu Vendor: GenuineIntel 0x756e6547 0x49656e69 0x6c65746e
Cpu Family 6 (0x6), Model 214 (0xd6)
[       OK ] LibYUVBaseTest.TestCpuId (34 ms)
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.10
Has X86 0x8
Has SSE2 0x100
Has SSSE3 0x200
Has SSE4.1 0x400
Has SSE4.2 0x800
Has AVX 0x1000
Has AVX2 0x2000
Has ERMS 0x4000
Has FSMR 0x8000
Has FMA3 0x10000
Has F16C 0x20000
Has AVX512BW 0x40000
Has AVX512VL 0x80000
Has AVX512VNNI 0x100000
Has AVX512VBMI 0x200000
Has AVX512VBMI2 0x400000
Has AVX512VBITALG 0x800000
Has AVX10 0x1000000
Has AVX10_2 0x2000000
HAS AVXVNNI 0x4000000
Has AVXVNNIINT8 0x8000000
Has AMXINT8 0x10000000
[       OK ] LibYUVBaseTest.TestCpuHas (10 ms)

This is how oneDNN does avx10 version:
e15d2c220f/src/cpu/x64/xbyak/xbyak_util.h (L698-L701)
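
A sketch of version based detection, written against raw register values so it runs on any host. It assumes that bit 19 of EDX from CPUID(EAX=7, ECX=1) reports AVX10 and that the low 8 bits of EBX from CPUID(EAX=0x24, ECX=0) hold the AVX10 version. The function name is hypothetical and this is not necessarily the exact logic in libyuv or oneDNN; the two flag values are copied from the TestCpuHas output above.

```c
#include <stdint.h>

// Flag values as printed by TestCpuHas in the log above.
#define FLAG_AVX10 0x1000000
#define FLAG_AVX10_2 0x2000000

// Hypothetical decoder for the two CPUID registers described above.
// Returns the AVX10 related flag bits, or 0 when AVX10 is not reported.
int avx10_flags_sketch(uint32_t leaf7_1_edx, uint32_t leaf24_ebx) {
  int flags = 0;
  if (leaf7_1_edx & (1u << 19)) {  // assumed AVX10 support bit
    flags |= FLAG_AVX10;
    if ((leaf24_ebx & 0xffu) >= 2) {  // assumed version field, 2 means AVX10.2
      flags |= FLAG_AVX10_2;
    }
  }
  return flags;
}
```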

Bug: b/350318244
Change-Id: I6f78402fecc38a92019d137b3439d7bce950510c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6172267
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-01-21 13:53:19 -08:00
Frank Barchard
47ddac2996 Sub sampling conversions use CopyPlane for Y channel
- Replace ScalePlane with CopyPlane for Y channel
- Vertical mirroring is supported, but not horizontal mirroring.
- Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y.
- Apply clang-format
- Bump version to 1899
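
A minimal sketch of the Y channel handling described above, assuming the usual convention that a negative height requests a vertical flip. The function name and return codes are hypothetical and this is not the libyuv implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical Y plane copy. Returns -1 on invalid arguments, 0 otherwise.
// A NULL dst_y means the caller does not want the Y plane, which is allowed.
// A non-NULL dst_y with a NULL src_y is rejected.
int copy_y_plane_sketch(const uint8_t* src_y, int src_stride_y,
                        uint8_t* dst_y, int dst_stride_y, int width,
                        int height) {
  if (width <= 0 || height == 0) {
    return -1;
  }
  if (!dst_y) {
    return 0;
  }
  if (!src_y) {
    return -1;
  }
  if (height < 0) {  // negative height: read source rows bottom up
    height = -height;
    src_y += (ptrdiff_t)(height - 1) * src_stride_y;
    src_stride_y = -src_stride_y;
  }
  for (int y = 0; y < height; ++y) {
    memcpy(dst_y, src_y, (size_t)width);  // rows are copied, never scaled
    src_y += src_stride_y;
    dst_y += dst_stride_y;
  }
  return 0;
}
```

Each row is copied left to right, so only the row order can change, which matches vertical but not horizontal mirroring.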

Bug: None
Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:34:11 -08:00
Frank Barchard
e0040eb318 Apply clang format
Bug: None
Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:31:20 -08:00