230 Commits

Author SHA1 Message Date
Frank Barchard
cda55fcf53 Mirrow AVX2 functions for Visual C
Bug: libyuv:42280902
Change-Id: Iabbec9af3a4f4dd89294e60145823c7fc4dd6ec6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7843378
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2026-05-15 15:05:31 -07:00
Frank Barchard
cb061d0378 Unittests use ASSERT instead of EXPECT
Bug: libyuv:508639302
Change-Id: I22c35e08f3b6db1a656192877c1fb1bf4e96d6f5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7838659
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
2026-05-11 19:10:47 -07:00
Frank Barchard
e23282704f ARGBToYRow_AVX512BW preserve XMM6-XMM15 due to Windows stack alignment
Bug: 505124541
Change-Id: Id5ae539f57b314980182bec76a788e33273b2392
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7835639
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
2026-05-11 13:12:22 -07:00
Frank Barchard
561a9780e2 YUV to RGB avoid avx assist
Here are the functions flagged for mixing both SSE and AVX (or AVX-512)
instructions, which can trigger an AVX transition/assist performance
penalty:

Libyuv Functions addressed in this CL
   * I422ToARGBRow_AVX512BW
   * HalfFloatRow_SSE2

Not addressed:
   * ScaleFilterCols_SSSE3

Bug: libyuv:509681367
Change-Id: I8ced6065dfe0c516d05857086393782c8590062a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7814945
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2026-05-05 12:57:55 -07:00
Wan-Teh Chang
b438739c8b Use ptrdiff_t for buffer offsets
Use ptrdiff_t instead of intptr_t for buffer offsets, such as stride,
width_temp, and src_step*.

Change-Id: I64e6701fa71ab59c94325a6dad8762d040035208
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7800070
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
2026-04-28 18:21:42 -07:00
Frank Barchard
4afb965416 RAWToARGB use AVX512BW
Bug: libyuv:42280902
Change-Id: I7a80fd64d97b6d411316819df0fd917d609a173b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787163
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
2026-04-22 16:56:46 -07:00
Frank Barchard
bd2c4c76ec RAWToARGB AVX512VBMI
Bug: libyuv:42280902
Change-Id: I1c7f432f004079357a00515785bc524c459ed4b9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7787160
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
2026-04-22 14:48:29 -07:00
Frank Barchard
d445250d8b Replace RAWToY/RGB24ToY with RGBToYMatrix
Bug: libyuv:42280902
Change-Id: I6ddebd492036c416550fc045eb39493dea73246b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7784094
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2026-04-21 17:11:14 -07:00
Frank Barchard
9f13b2814d add RGBToYMatrixRow_AVX2
Adds RGBToYMatrixRow_AVX2 which reads 24 bit RGB values by reading 3 vectors instead of 4 and permutes them into 4 ARGB vectors before conversion.
Also adds RGBToYMatrixRow_Opt and RGBToYMatrixRow_2Step_Opt to convert_argb_test.cc to benchmark and compare the direct AVX2 conversion vs a 2-step approach.

./libyuv_test '--gunit_filter=*RAWToJ400_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1

AMD Zen 5
Was LibYUVConvertTest.RAWToJ400_Opt (757 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (699 ms)

Intel Skylake
Was LibYUVConvertTest.RAWToJ400_Opt (1705 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (1426 ms)

Bug: 477295731
Change-Id: I29866baf4ad5fe7a3725e4a01f2fe24649510a7d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7777325
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2026-04-20 12:52:44 -07:00
Frank Barchard
e034c41661 Port ARGBToUVMatrixRow from AVX2 to AVX512BW
Benchmark on Icelake Xeon
Now AVX512BW:
[       OK ] LibYUVConvertTest.ARGBToNV12_Opt (1723 ms)
Was AVX2:
[       OK ] LibYUVConvertTest.ARGBToNV12_Opt (2144 ms)

- Added `ARGBToUVMatrixRow_AVX512BW` implementation in `source/row_gcc.cc`.
- Added corresponding `ARGBToUVRow_AVX512BW` and `ABGRToUVRow_AVX512BW` functions.
- Added unaligned wrappers `ARGBToUVRow_Any_AVX512BW` and `ABGRToUVRow_Any_AVX512BW` in `source/row_any.cc`.
- Updated `source/row_any.cc` to correctly size `vin` and `vout` buffers for AVX512BW width and adjusted the `ANY12MS` and `ANY12S` macros to handle `MASK=63`.
- Updated `include/libyuv/row.h` with the required AVX512BW headers and definitions, scoped appropriately.
- Wired all callers of `ARGBToUVRow_AVX2` and related functions in `source/convert.cc` and `source/convert_from_argb.cc` to dynamically use the `AVX512BW` implementations if the CPU flag indicates AVX-512BW support.
- Optimized AVX-512 code to generate the `-1` multiplier in a single instruction (`vpternlogd`) and reused it across word (`vpmaddwd`) dot products. Handled the resulting negation by replacing a subtraction with `vpaddw` offset adjustment.

Bug: 477295731
R=dalecurtis@chromium.org, rrwinterton@gmail.com

Change-Id: Ida5fb27e59ae4c1c3824737f009b80549cd20a06
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7763257
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2026-04-14 16:15:31 -07:00
Frank Barchard
59ca5d8074 Fix parameter names and comments for ARGB/BGRA/RGBA/ABGR functions
In all functions that start with ARGB, BGRA, RGBA or ABGR in the include/libyuv/ headers, make sure the parameter variable name has the same 4 letters, but lower case, and the comment before the function should have the same matching name. Then make sure the implementation in source/ folder has the same variable names.

Change-Id: Idadbbbb993156eea16e318719f4888cb3bed5f6a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760057
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2026-04-13 18:28:37 -07:00
Frank Barchard
893eacf9b4 ARGBToY for AVX512
- add ARGBToYMatrixRow_AVX512BW
- refactor SSE and AVX to use Matrix functions, making old functions
  call the new ones.

Zen5 1280x720
Was AVX2   LibYUVConvertTest.ARGBToI444_Opt (1125 ms)
Now AVX512 LibYUVConvertTest.ARGBToI444_Opt (641 ms)

Details by Gemini:
  1. Created 3 new Matrix functions:
    Added ARGBToYMatrixRow_SSSE3, ARGBToYMatrixRow_AVX2, and
    ARGBToYMatrixRow_AVX512BW to source/row_gcc.cc. These take the
    const struct ArgbConstants* c parameter similarly to
    ARGBToUV444MatrixRow_*. The x86 vector instructions dynamically
    calculate the needed values using the properties of the constants
    struct, including using vpmaddwd inside the AVX512 code to offset
    the lack of a native vphaddw.

  2. Replaced Old Functions with Wrappers:
    Modified the existing implementations of ARGBToYRow_SSSE3,
    ARGBToYJRow_SSSE3, ABGRToYRow_SSSE3, ABGRToYJRow_SSSE3,
    RGBAToYRow_SSSE3, RGBAToYJRow_SSSE3, BGRAToYRow_SSSE3 (and their
    _AVX2 equivalents) in source/row_gcc.cc to act as inline wrappers
    calling the new ARGBToYMatrixRow_* functions, passing the right
    matrix parameters (e.g. &kArgbI601Constants, &kArgbJPEGConstants,
    &kAbgrI601Constants).

  3. Added row_any.cc Handlers:
    Added ANY11MC definitions to source/row_any.cc to autogenerate
    ARGBToYMatrixRow_Any_SSSE3, ARGBToYMatrixRow_Any_AVX2, and
    ARGBToYMatrixRow_Any_AVX512BW which safely handles non-aligned
    tails.

  4. Updated include/libyuv/row.h:
    Updated the headers with the proper void declarations for all newly
    generated Matrix and Any_ variants. Also defined
    HAS_ARGBTOYROW_AVX512BW in the CPU macros.

  5. Tested the Implementations:
    Compiled and tested on Linux x86, which resulted in all tests passing
    cleanly. Also successfully completed all Windows 32-bit build checks
    ensuring 32-bit regression prevention without issues.

Bug: 477295731
Change-Id: I4f5eec9a961e24a9d760d0a1c0810fb5e29a0bd1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7759494
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2026-04-13 17:26:07 -07:00
Frank Barchard
4c3d7d517a ARGBToUV444 for AVX512
1.27x faster on AMD Zen5 (turin)

Now AVX512
perf record ./libyuv_test '--gunit_filter=*ARGBToI444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1

[       OK ] LibYUVConvertTest.ARGBToI444_Opt (1071 ms)
Overhead  Symbol
  53.49%  ARGBToYRow_AVX2
  44.70%  ARGBToUV444Row_AVX512BW

Was AVX2
[       OK ] LibYUVConvertTest.ARGBToI444_Opt (1369 ms)
  61.06%  ARGBToUV444Row_AVX2
  37.67%  ARGBToYRow_AVX2

Bug:  libyuv:42280902
Change-Id: I306fbac656d6f7834ce1559e86d01eb34931ec3c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7738362
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
2026-04-08 19:25:41 -07:00
Dale Curtis
b1cacfb38f Unify X86/X64 versions of ARGBToI4xxMatrix functions
Change-Id: Iead13414414543e5f10ba9ba47a6ceaeb3113dee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7562443
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2026-03-18 16:27:07 -07:00
Frank Barchard
94417b9d21 Pass rgbconstants via struct pointer instead of elements with m
Now 66 instructions
SYM ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push ebx
62ccd4: BASE       push edi
62ccd5: BASE       push esi
62ccd6: BASE       and esp, 0xfffffffc
62ccd9: BASE       sub esp, 0xc
62ccdc: BASE       call 0x62cce1 <ARGBToUVRow_SSSE3+0x11>
62cce1: BASE       pop eax
62cce2: BASE       add eax, 0xe1c27
62cce8: BASE       mov ecx, dword ptr [ebp+0xc]
62cceb: BASE       mov edx, dword ptr [ebp+0x8]
62ccee: BASE       mov esi, dword ptr [ebp+0x10]
62ccf1: BASE       mov edi, dword ptr [ebp+0x18]
62ccf4: BASE       mov dword ptr [esp+0x8], edi
62ccf8: BASE       mov edi, dword ptr [ebp+0x14]
62ccfb: BASE       lea ebx, ptr [eax-0x5ecf88]
62cd01: SSE2       movdqa xmm4, xmmword ptr [ebx]
62cd05: SSE2       movdqa xmm5, xmmword ptr [ebx+0x10]
62cd0a: SSE2       pcmpeqb xmm6, xmm6
62cd0e: SSSE3      pabsb xmm6, xmm6
62cd13: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd1b: BASE       sub edi, esi

62cd1d: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd21: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd26: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd2b: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd31: SSSE3      pshufb xmm0, xmm7
62cd36: SSSE3      pshufb xmm1, xmm7
62cd3b: SSSE3      pshufb xmm2, xmm7
62cd40: SSSE3      pshufb xmm3, xmm7
62cd45: SSSE3      pmaddubsw xmm0, xmm6
62cd4a: SSSE3      pmaddubsw xmm1, xmm6
62cd4f: SSSE3      pmaddubsw xmm2, xmm6
62cd54: SSSE3      pmaddubsw xmm3, xmm6
62cd59: SSE2       paddw xmm0, xmm2
62cd5d: SSE2       paddw xmm1, xmm3
62cd61: SSE2       pxor xmm2, xmm2
62cd65: SSE2       psrlw xmm0, 0x1
62cd6a: SSE2       psrlw xmm1, 0x1
62cd6f: SSE2       pavgw xmm0, xmm2
62cd73: SSE2       pavgw xmm1, xmm2
62cd77: SSE2       packuswb xmm0, xmm1
62cd7b: SSE2       movdqa xmm2, xmm6
62cd7f: SSE2       psllw xmm2, 0xf
62cd84: SSE2       movdqa xmm1, xmm0
62cd88: SSSE3      pmaddubsw xmm1, xmm5
62cd8d: SSSE3      pmaddubsw xmm0, xmm4
62cd92: SSSE3      phaddw xmm0, xmm1
62cd97: SSE2       psubw xmm2, xmm0
62cd9b: SSE2       psrlw xmm2, 0x8
62cda0: SSE2       packuswb xmm2, xmm2
62cda4: SSE2       movd dword ptr [esi], xmm2
62cda8: SSE2       pshufd xmm2, xmm2, 0x55
62cdad: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdb2: BASE       lea edx, ptr [edx+0x20]
62cdb5: BASE       lea esi, ptr [esi+0x4]
62cdb8: BASE       sub dword ptr [esp+0x8], 0x8
62cdbd: BASE       jnle 0x62cd1d <ARGBToUVRow_SSSE3+0x4d>

62cdc3: BASE       lea esp, ptr [ebp-0xc]
62cdc6: BASE       pop esi
62cdc7: BASE       pop edi
62cdc8: BASE       pop ebx
62cdc9: BASE       pop ebp
62cdca: BASE       ret

Was 68 instructions
ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push edi
62ccd4: BASE       push esi
62ccd5: BASE       and esp, 0xfffffff0
62ccd8: BASE       sub esp, 0x30
62ccdb: BASE       call 0x62cce0 <ARGBToUVRow_SSSE3+0x10>
62cce0: BASE       pop eax
62cce1: BASE       add eax, 0xe1c28
62cce7: BASE       mov ecx, dword ptr [ebp+0xc]
62ccea: BASE       mov edx, dword ptr [ebp+0x8]
62cced: BASE       mov esi, dword ptr [ebp+0x10]
62ccf0: BASE       mov edi, dword ptr [ebp+0x18]
62ccf3: BASE       mov dword ptr [esp+0xc], edi
62ccf7: BASE       mov edi, dword ptr [ebp+0x14]
62ccfa: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf88]
62cd01: SSE        movaps xmmword ptr [esp+0x20], xmm0
62cd06: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf78]
62cd0d: SSE        movaps xmmword ptr [esp+0x10], xmm0
62cd12: SSE2       movdqa xmm4, xmmword ptr [esp+0x20]
62cd18: SSE2       movdqa xmm5, xmmword ptr [esp+0x10]
62cd1e: SSE2       pcmpeqb xmm6, xmm6
62cd22: SSSE3      pabsb xmm6, xmm6
62cd27: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd2f: BASE       sub edi, esi

62cd31: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd35: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd3a: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd3f: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd45: SSSE3      pshufb xmm0, xmm7
62cd4a: SSSE3      pshufb xmm1, xmm7
62cd4f: SSSE3      pshufb xmm2, xmm7
62cd54: SSSE3      pshufb xmm3, xmm7
62cd59: SSSE3      pmaddubsw xmm0, xmm6
62cd5e: SSSE3      pmaddubsw xmm1, xmm6
62cd63: SSSE3      pmaddubsw xmm2, xmm6
62cd68: SSSE3      pmaddubsw xmm3, xmm6
62cd6d: SSE2       paddw xmm0, xmm2
62cd71: SSE2       paddw xmm1, xmm3
62cd75: SSE2       pxor xmm2, xmm2
62cd79: SSE2       psrlw xmm0, 0x1
62cd7e: SSE2       psrlw xmm1, 0x1
62cd83: SSE2       pavgw xmm0, xmm2
62cd87: SSE2       pavgw xmm1, xmm2
62cd8b: SSE2       packuswb xmm0, xmm1
62cd8f: SSE2       movdqa xmm2, xmm6
62cd93: SSE2       psllw xmm2, 0xf
62cd98: SSE2       movdqa xmm1, xmm0
62cd9c: SSSE3      pmaddubsw xmm1, xmm5
62cda1: SSSE3      pmaddubsw xmm0, xmm4
62cda6: SSSE3      phaddw xmm0, xmm1
62cdab: SSE2       psubw xmm2, xmm0
62cdaf: SSE2       psrlw xmm2, 0x8
62cdb4: SSE2       packuswb xmm2, xmm2
62cdb8: SSE2       movd dword ptr [esi], xmm2
62cdbc: SSE2       pshufd xmm2, xmm2, 0x55
62cdc1: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdc6: BASE       lea edx, ptr [edx+0x20]
62cdc9: BASE       lea esi, ptr [esi+0x4]
62cdcc: BASE       sub dword ptr [esp+0xc], 0x8
62cdd1: BASE       jnle 0x62cd31 <ARGBToUVRow_SSSE3+0x61>

62cdd7: BASE       lea esp, ptr [ebp-0x8]
62cdda: BASE       pop esi
62cddb: BASE       pop edi
62cddc: BASE       pop ebp
62cddd: BASE       ret
62cdde: BASE       int3
BUG=444157316

Change-Id: Iad044f851359f5b052091c7bdab9b96946fc3682
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6987370
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-29 12:34:36 -07:00
Daniel.L (Byoungchan Lee)
5b22f31cb5 Fix compilation issue for 32bit PIC build
Currently, ARGBToUVMatrixRow_AVX2 and ARGBToUVMatrixRow_SSSE3 fail to
compile with clang on 32bit PIC build with the error message: inline
assembly requires more registers than available

This is because in PIC code EBX is reserved for the GOT and with a frame
pointer EBP is also unavailable.

Fix this by copying the RGB-to-UV constants to stack locals first and
let the asm use simple stack-relative addressing.

Bug: 444157316
Change-Id: Ica90f0c35039303ecaa145534683f59659fb5d7f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6980714
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-25 13:49:02 -07:00
Frank Barchard
1b1c058787 ARGBToUV for SSE use pshufb/pmaddubsw
Was
ARGBToJ420_Opt (377 ms)
Now
ARGBToJ420_Opt (340 ms)

Bug: None
Change-Id: Iada2d6e9ecdb141b9e2acbdf343f890e4aaebe34
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967754
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-19 12:39:39 -07:00
Frank Barchard
7155afc5ca ARGBToUV AVX2 for x86 32 bit
- Reduce to 10 ymm registers - 2 constants generated on the fly

Change-Id: Ib25a0cf7c93e5048270735410ccf6723b3949454
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967319
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-18 13:14:45 -07:00
Frank Barchard
142db12947 ARGBToUV use AVX2 for 64 bit x86
Skylake
Was ARGBToJ420_Opt (312 ms)
Now ARGBToJ420_Opt (242 ms)

Icelake
Was ARGBToJ420_Opt (302 ms)
Now ARGBToJ420_Opt (220 ms)

AMD Zen3 on Windows
Was ARGBToJ420_Opt (305 ms)
Now ARGBToJ420_Opt (216 ms)
32 bit x86 uses SSE
Now ARGBToJ420_Opt (326 ms)

MCA analysis of new AVX, SSE and old AVX
https://godbolt.org/z/37bdazWYr

Bug: None
Change-Id: I72f5504407751e164c3558aebe836dd15223d65f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-17 14:39:53 -07:00
Frank Barchard
a61882c049 ARGBToUV AVX2 for x86_64
Icelake
Was SSSE3+SSSE3 ARGBToJ420_Opt (356 ms)
Was SSSE3+AVX2  ARGBToJ420_Opt (301 ms)
Now AVX2+AVX2   ARGBToJ420_Opt (227 ms)

Change-Id: I2cb427bc164b225b3ad4c5f43c09d6da6ca496d5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6943036
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-16 11:33:54 -07:00
Frank Barchard
0f795672ae Reduce ARGBToUV SSSE3 register usage for clang build error on x64
Bug: 444157316
Change-Id: I2ae9f3dbfb373bb874a3d9699987f7d5b63f2610
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6937665
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-10 18:40:06 -07:00
Frank Barchard
48943bb378 Convert8To16 use VPSRLW instead of VPMULHUW for better lunarlake performance
- MCA says old version was 4 cycles and new version is 2.5 cycles/loop
- lunarlake is the only known cpu
mca -mcpu=lunarlake 100 iterations

Was vpmulhu
  Iterations:        100
  Instructions:      1200
  Total Cycles:      426
  Total uOps:        1200

  Dispatch Width:    8
  uOps Per Cycle:    2.82
  IPC:               2.82
  Block RThroughput: 4.0

Now vpsrlw
  Iterations:        100
  Instructions:      1200
  Total Cycles:      279
  Total uOps:        1400

  Dispatch Width:    8
  uOps Per Cycle:    5.02
  IPC:               4.30
  Block RThroughput: 2.5

Bug: None
Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-08-04 12:42:50 -07:00
Frank Barchard
6f729fbe65 ARGBToUV SSE use average of 4 pixels
- Was using avgb twice for non-exact and C for exact.

On Skylake Xeon:

Now SSE3
ARGBToJ420_Opt (326 ms)

Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)

Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-06-17 11:55:27 -07:00
Frank Barchard
889613683a Add hybrid detect for Intel laptop cpus
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h

Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-06-13 13:22:54 -07:00
Frank Barchard
0853c9353f ARGBToUV 64 bit use ymm8 for shuffler
Bug: 381138208
Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-05-12 15:09:40 -07:00
Frank Barchard
9f9b5cf660 ARGBToUV allow 32 bit x86 build
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function

Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-04-28 12:11:00 -07:00
Frank Barchard
0c07032182 clang format applies to git repo
Bug: None
Change-Id: Ida65a0033e8c783230cadf6912416ffd9bbf90e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6393515
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-25 11:49:25 -07:00
Frank Barchard
918329caee Make constant 0x0101 using vpcmpeqb+vpabsb
Was
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpsrlw      $0xf,%%ymm4,%%ymm4
      vpackuswb   %%ymm4,%%ymm4,%%ymm4
Now
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpabsb      %%ymm4,%%ymm4

Bug: 381138208
Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-03-10 13:25:16 -07:00
Frank Barchard
c060118bea ARGBToJ444 use 256 for fixed point scale UV
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc

Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-27 13:04:15 -08:00
Frank Barchard
5257ba4db0 Apply clang format
Bug: None
Change-Id: Ibd694d0351966a2b5812445de74bbced9c881a79
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6302317
Reviewed-by: James Zern <jzern@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-25 11:39:19 -08:00
Frank Barchard
3a7e0ba671 Apply format with no code changes
Bug: None
Change-Id: I8923bacb9af7e7d4f13e210c8b3d7ea6b81568a5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6301086
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2025-02-24 23:57:01 -08:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
6c2415bfab J420ToI420 AVX2
libyuv_test '--gunit_filter=*J420ToI420*' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
AVX2 J420ToI420_Opt (114 ms)
C    J420ToI420_Opt (596 ms)

Sapphire Rapids
AVX2 J420ToI420_Opt (126 ms)
C    J420ToI420_Opt (717 ms)

Samsung S23
NEON J420ToI420_Opt (46 ms)
C    J420ToI420_Opt (95 ms)

Bug: 381327032
Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-27 11:23:44 -08:00
Frank Barchard
e0040eb318 Apply clang format
Bug: None
Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:31:20 -08:00
Frank Barchard
1c501a8f3f CpuId test FSMR - Fast Short Rep Movsb
- Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features

Intel CPUs starting at Icelake support FSMR
adl:Has FSMR 0x8000
arl:Has FSMR 0x0
bdw:Has FSMR 0x0
clx:Has FSMR 0x0
cnl:Has FSMR 0x0
cpx:Has FSMR 0x0
emr:Has FSMR 0x8000
glm:Has FSMR 0x0
glp:Has FSMR 0x0
gnr:Has FSMR 0x8000
gnr256:Has FSMR 0x8000
hsw:Has FSMR 0x0
icl:Has FSMR 0x8000
icx:Has FSMR 0x8000
ivb:Has FSMR 0x0
knl:Has FSMR 0x0
knm:Has FSMR 0x0
lnl:Has FSMR 0x8000
mrm:Has FSMR 0x0
mtl:Has FSMR 0x8000
nhm:Has FSMR 0x0
pnr:Has FSMR 0x0
rpl:Has FSMR 0x8000
skl:Has FSMR 0x0
skx:Has FSMR 0x0
slm:Has FSMR 0x0
slt:Has FSMR 0x0
snb:Has FSMR 0x0
snr:Has FSMR 0x0
spr:Has FSMR 0x8000
srf:Has FSMR 0x0
tgl:Has FSMR 0x8000
tnt:Has FSMR 0x0
wsm:Has FSMR 0x0

Intel CPUs starting at Ivybridge support ERMS

adl:Has ERMS 0x4000
arl:Has ERMS 0x4000
bdw:Has ERMS 0x4000
clx:Has ERMS 0x4000
cnl:Has ERMS 0x4000
cpx:Has ERMS 0x4000
emr:Has ERMS 0x4000
glm:Has ERMS 0x4000
glp:Has ERMS 0x4000
gnr:Has ERMS 0x4000
gnr256:Has ERMS 0x4000
hsw:Has ERMS 0x4000
icl:Has ERMS 0x4000
icx:Has ERMS 0x4000
ivb:Has ERMS 0x4000
knl:Has ERMS 0x4000
knm:Has ERMS 0x4000
lnl:Has ERMS 0x4000
mrm:Has ERMS 0x0
mtl:Has ERMS 0x4000
nhm:Has ERMS 0x0
pnr:Has ERMS 0x0
rpl:Has ERMS 0x4000
skl:Has ERMS 0x4000
skx:Has ERMS 0x4000
slm:Has ERMS 0x4000
slt:Has ERMS 0x0
snb:Has ERMS 0x0
snr:Has ERMS 0x4000
spr:Has ERMS 0x4000
srf:Has ERMS 0x4000
tgl:Has ERMS 0x4000
tnt:Has ERMS 0x4000
wsm:Has ERMS 0x0
Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-18 17:56:45 +00:00
Frank Barchard
75f7cfdde5 SplitRGB for SSE4 and AVX2
libyuv_test '--gunit_filter=*SplitRGB*' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *SplitRGB*

Skylake Xeon x86 32 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms)

Skylake Xeon x86 64 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms)

Bug: 379186682
Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-15 00:46:25 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Frank Barchard
611806a155 [AArch64] Fix SVE/SME vector length printing in cpuid
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.

- Add volatile asm in row_gcc and row_neon64

Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 19:44:41 +00:00
Frank Barchard
fa16ddbb9f cpuid show vector length on ARM and RISCV
- additional asm volatile changes from github
- rotate mips remove C function - moved to common

Run on Samsung S22
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 5.10
Has Arm 0x2
Has Neon 0x4
Has Neon DotProd 0x10
Has Neon I8MM 0x20
Has SVE 0x40
Has SVE2 0x80
Has SME 0x0
SVE vector length: 16 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (0 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 201703
__clang_major__ 17
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__aarch64__ 1
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Run on RISCV qemu emulating SiFive X280:
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.6
Has RISCV 0x10000000
Has RVV 0x20000000
RVV vector length: 64 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (4 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 202002
__clang_major__ 9999
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__riscv 1
__riscv_vector 1
__riscv_v_intrinsic 12000
__riscv_zve64x 1000000
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Bug: b/42280943
Change-Id: I53cf0450be4965a28942e113e4c77295ace70999
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5672088
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 18:10:56 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
Frank Barchard
b0dfa70114 RVV remove unused variables
- ARM Planar test use regular asm volatile syntax
- x86 row functions remove volatile from asm

Bug: 347111119, 347112532
Change-Id: I535b3dfa1a7a19824503bd95584a63b047b0e9a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5637058
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-17 20:25:31 +00:00
Frank Barchard
3e435fe6d4 YUY2ToARGB use ymm6/7 for shuffle constants
- 1 load and 2 shuffles from registers replaces 2 loads and 2 memory shuffles
- vbroadcastf128 16 byte shuffler replaces 32 byte shufflers
- bump version and apply clang-format

libyuv_test '--gunit_filter=*.???2ToARGB_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

AMD Zen2
I422ToARGB_Opt (272 ms)
NV12ToARGB_Opt (255 ms)
YUY2ToARGB_Opt (208 ms)

Was
YUY2ToARGB_Opt (214 ms)

Change-Id: I1fa4d462d04536c877d1cab1a14586be8ed1b2f2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5218447
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-01-22 21:47:23 +00:00
Frank Barchard
a366ad714a ARGBAttenuate use (a + b + 255) >> 8
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate

Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-06-16 21:37:53 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Frank Barchard
ea26d7adb1 DetilePlane_16 AVX version
- fix ifdefs for DetilePlane_16 to use 16 bit versions, not 8 bit.  (no functional change)

Bug: b/258474032
Change-Id: Ic07e02d9801e21126ebee0ceb5779aa712a493ce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4034812
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-18 23:59:06 +00:00
Frank Barchard
8713ba3f0b Add vzeroupper to AVX row functions
- move power of two macro to planar functions source
- revert row.h IS_ALIGNED change

Bug: b/258474032
Change-Id: If87bb8d55c9b9930dd3e378614f8e4faae0870e9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4035166
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-11-17 23:00:08 +00:00
Frank Barchard
2d2cee418a Add Detile_16 planar function for 10 bit MT2T format
- Neon and SSE2
- Any for odd widths

Pixel 2 little core AArch32 build
C
TestDetilePlane_16 (1275 ms)
TestDetilePlane (1203 ms)
Neon
TestDetilePlane_16 (693 ms)
TestDetilePlane (660 ms)

Bug: b/258474032
Change-Id: Idbd09c5e9324e4deef5f1d54090d4b63cc7db812
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4031848
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-17 02:47:57 +00:00
Frank Barchard
00950840d1 YUY2ToNV12 using YUY2ToY and YUY2ToNVUV
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
  - Was SplitUV, memcpy Y, InterpolateUV
  - Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA

3840x2160 1000 iterations:

Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)

AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)

Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)

SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)

Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)

Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-30 22:41:21 +00:00