216 Commits

Author SHA1 Message Date
Frank Barchard
94417b9d21 Pass rgbconstants via struct pointer instead of elements with m
Now 66 instructions
SYM ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push ebx
62ccd4: BASE       push edi
62ccd5: BASE       push esi
62ccd6: BASE       and esp, 0xfffffffc
62ccd9: BASE       sub esp, 0xc
62ccdc: BASE       call 0x62cce1 <ARGBToUVRow_SSSE3+0x11>
62cce1: BASE       pop eax
62cce2: BASE       add eax, 0xe1c27
62cce8: BASE       mov ecx, dword ptr [ebp+0xc]
62cceb: BASE       mov edx, dword ptr [ebp+0x8]
62ccee: BASE       mov esi, dword ptr [ebp+0x10]
62ccf1: BASE       mov edi, dword ptr [ebp+0x18]
62ccf4: BASE       mov dword ptr [esp+0x8], edi
62ccf8: BASE       mov edi, dword ptr [ebp+0x14]
62ccfb: BASE       lea ebx, ptr [eax-0x5ecf88]
62cd01: SSE2       movdqa xmm4, xmmword ptr [ebx]
62cd05: SSE2       movdqa xmm5, xmmword ptr [ebx+0x10]
62cd0a: SSE2       pcmpeqb xmm6, xmm6
62cd0e: SSSE3      pabsb xmm6, xmm6
62cd13: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd1b: BASE       sub edi, esi

62cd1d: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd21: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd26: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd2b: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd31: SSSE3      pshufb xmm0, xmm7
62cd36: SSSE3      pshufb xmm1, xmm7
62cd3b: SSSE3      pshufb xmm2, xmm7
62cd40: SSSE3      pshufb xmm3, xmm7
62cd45: SSSE3      pmaddubsw xmm0, xmm6
62cd4a: SSSE3      pmaddubsw xmm1, xmm6
62cd4f: SSSE3      pmaddubsw xmm2, xmm6
62cd54: SSSE3      pmaddubsw xmm3, xmm6
62cd59: SSE2       paddw xmm0, xmm2
62cd5d: SSE2       paddw xmm1, xmm3
62cd61: SSE2       pxor xmm2, xmm2
62cd65: SSE2       psrlw xmm0, 0x1
62cd6a: SSE2       psrlw xmm1, 0x1
62cd6f: SSE2       pavgw xmm0, xmm2
62cd73: SSE2       pavgw xmm1, xmm2
62cd77: SSE2       packuswb xmm0, xmm1
62cd7b: SSE2       movdqa xmm2, xmm6
62cd7f: SSE2       psllw xmm2, 0xf
62cd84: SSE2       movdqa xmm1, xmm0
62cd88: SSSE3      pmaddubsw xmm1, xmm5
62cd8d: SSSE3      pmaddubsw xmm0, xmm4
62cd92: SSSE3      phaddw xmm0, xmm1
62cd97: SSE2       psubw xmm2, xmm0
62cd9b: SSE2       psrlw xmm2, 0x8
62cda0: SSE2       packuswb xmm2, xmm2
62cda4: SSE2       movd dword ptr [esi], xmm2
62cda8: SSE2       pshufd xmm2, xmm2, 0x55
62cdad: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdb2: BASE       lea edx, ptr [edx+0x20]
62cdb5: BASE       lea esi, ptr [esi+0x4]
62cdb8: BASE       sub dword ptr [esp+0x8], 0x8
62cdbd: BASE       jnle 0x62cd1d <ARGBToUVRow_SSSE3+0x4d>

62cdc3: BASE       lea esp, ptr [ebp-0xc]
62cdc6: BASE       pop esi
62cdc7: BASE       pop edi
62cdc8: BASE       pop ebx
62cdc9: BASE       pop ebp
62cdca: BASE       ret

Was 68 instructions
ARGBToUVRow_SSSE3:
62ccd0: BASE       push ebp
62ccd1: BASE       mov ebp, esp
62ccd3: BASE       push edi
62ccd4: BASE       push esi
62ccd5: BASE       and esp, 0xfffffff0
62ccd8: BASE       sub esp, 0x30
62ccdb: BASE       call 0x62cce0 <ARGBToUVRow_SSSE3+0x10>
62cce0: BASE       pop eax
62cce1: BASE       add eax, 0xe1c28
62cce7: BASE       mov ecx, dword ptr [ebp+0xc]
62ccea: BASE       mov edx, dword ptr [ebp+0x8]
62cced: BASE       mov esi, dword ptr [ebp+0x10]
62ccf0: BASE       mov edi, dword ptr [ebp+0x18]
62ccf3: BASE       mov dword ptr [esp+0xc], edi
62ccf7: BASE       mov edi, dword ptr [ebp+0x14]
62ccfa: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf88]
62cd01: SSE        movaps xmmword ptr [esp+0x20], xmm0
62cd06: SSE        movaps xmm0, xmmword ptr [eax-0x5ecf78]
62cd0d: SSE        movaps xmmword ptr [esp+0x10], xmm0
62cd12: SSE2       movdqa xmm4, xmmword ptr [esp+0x20]
62cd18: SSE2       movdqa xmm5, xmmword ptr [esp+0x10]
62cd1e: SSE2       pcmpeqb xmm6, xmm6
62cd22: SSSE3      pabsb xmm6, xmm6
62cd27: SSE2       movdqa xmm7, xmmword ptr [eax-0x5ecfa8]
62cd2f: BASE       sub edi, esi

62cd31: SSE2       movdqu xmm0, xmmword ptr [edx]
62cd35: SSE2       movdqu xmm1, xmmword ptr [edx+0x10]
62cd3a: SSE2       movdqu xmm2, xmmword ptr [edx+ecx*1]
62cd3f: SSE2       movdqu xmm3, xmmword ptr [edx+ecx*1+0x10]
62cd45: SSSE3      pshufb xmm0, xmm7
62cd4a: SSSE3      pshufb xmm1, xmm7
62cd4f: SSSE3      pshufb xmm2, xmm7
62cd54: SSSE3      pshufb xmm3, xmm7
62cd59: SSSE3      pmaddubsw xmm0, xmm6
62cd5e: SSSE3      pmaddubsw xmm1, xmm6
62cd63: SSSE3      pmaddubsw xmm2, xmm6
62cd68: SSSE3      pmaddubsw xmm3, xmm6
62cd6d: SSE2       paddw xmm0, xmm2
62cd71: SSE2       paddw xmm1, xmm3
62cd75: SSE2       pxor xmm2, xmm2
62cd79: SSE2       psrlw xmm0, 0x1
62cd7e: SSE2       psrlw xmm1, 0x1
62cd83: SSE2       pavgw xmm0, xmm2
62cd87: SSE2       pavgw xmm1, xmm2
62cd8b: SSE2       packuswb xmm0, xmm1
62cd8f: SSE2       movdqa xmm2, xmm6
62cd93: SSE2       psllw xmm2, 0xf
62cd98: SSE2       movdqa xmm1, xmm0
62cd9c: SSSE3      pmaddubsw xmm1, xmm5
62cda1: SSSE3      pmaddubsw xmm0, xmm4
62cda6: SSSE3      phaddw xmm0, xmm1
62cdab: SSE2       psubw xmm2, xmm0
62cdaf: SSE2       psrlw xmm2, 0x8
62cdb4: SSE2       packuswb xmm2, xmm2
62cdb8: SSE2       movd dword ptr [esi], xmm2
62cdbc: SSE2       pshufd xmm2, xmm2, 0x55
62cdc1: SSE2       movd dword ptr [esi+edi*1], xmm2
62cdc6: BASE       lea edx, ptr [edx+0x20]
62cdc9: BASE       lea esi, ptr [esi+0x4]
62cdcc: BASE       sub dword ptr [esp+0xc], 0x8
62cdd1: BASE       jnle 0x62cd31 <ARGBToUVRow_SSSE3+0x61>

62cdd7: BASE       lea esp, ptr [ebp-0x8]
62cdda: BASE       pop esi
62cddb: BASE       pop edi
62cddc: BASE       pop ebp
62cddd: BASE       ret
62cdde: BASE       int3
BUG=444157316

Change-Id: Iad044f851359f5b052091c7bdab9b96946fc3682
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6987370
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-29 12:34:36 -07:00
Daniel.L (Byoungchan Lee)
5b22f31cb5 Fix compilation issue for 32bit PIC build
Currently, ARGBToUVMatrixRow_AVX2 and ARGBToUVMatrixRow_SSSE3 fail to
compile with clang on 32bit PIC build with the error message: inline
assembly requires more registers than available

This is because in PIC code EBX is reserved for the GOT and with a frame
pointer EBP is also unavailable.

Fix this by copying the RGB-to-UV constants to stack locals first and
let the asm use simple stack-relative addressing.

Bug: 444157316
Change-Id: Ica90f0c35039303ecaa145534683f59659fb5d7f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6980714
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-25 13:49:02 -07:00
Frank Barchard
1b1c058787 ARGBToUV for SSE use pshufb/pmaddubsw
Was
ARGBToJ420_Opt (377 ms)
Now
ARGBToJ420_Opt (340 ms)

Bug: None
Change-Id: Iada2d6e9ecdb141b9e2acbdf343f890e4aaebe34
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967754
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-19 12:39:39 -07:00
Frank Barchard
7155afc5ca ARGBToUV AVX2 for x86 32 bit
- Reduce to 10 ymm registers - 2 constants generated on the fly

Change-Id: Ib25a0cf7c93e5048270735410ccf6723b3949454
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6967319
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-18 13:14:45 -07:00
Frank Barchard
142db12947 ARGBToUV use AVX2 for 64 bit x86
Skylake
Was ARGBToJ420_Opt (312 ms)
Now ARGBToJ420_Opt (242 ms)

Icelake
Was ARGBToJ420_Opt (302 ms)
Now ARGBToJ420_Opt (220 ms)

AMD Zen3 on Windows
Was ARGBToJ420_Opt (305 ms)
Now ARGBToJ420_Opt (216 ms)
32 bit x86 uses SSE
Now ARGBToJ420_Opt (326 ms)

MCA analysis of new AVX, SSE and old AVX
https://godbolt.org/z/37bdazWYr

Bug: None
Change-Id: I72f5504407751e164c3558aebe836dd15223d65f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477
Reviewed-by: Justin Green <greenjustin@google.com>
2025-09-17 14:39:53 -07:00
Frank Barchard
a61882c049 ARGBToUV AVX2 for x86_64
Icelake
Was SSSE3+SSSE3 ARGBToJ420_Opt (356 ms)
Was SSSE3+AVX2  ARGBToJ420_Opt (301 ms)
Now AVX2+AVX2   ARGBToJ420_Opt (227 ms)

Change-Id: I2cb427bc164b225b3ad4c5f43c09d6da6ca496d5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6943036
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-16 11:33:54 -07:00
Frank Barchard
0f795672ae Reduce ARGBToUV SSSE3 register usage for clang build error on x64
Bug: 444157316
Change-Id: I2ae9f3dbfb373bb874a3d9699987f7d5b63f2610
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6937665
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-09-10 18:40:06 -07:00
Frank Barchard
48943bb378 Convert8To16 use VPSRLW instead of VPMULHUW for better lunarlake performance
- MCA says old version was 4 cycles and new version is 2.5 cycles/loop
- lunarlake is the only known cpu
mca -mcpu=lunarlake 100 iterations

Was vpmulhu
  Iterations:        100
  Instructions:      1200
  Total Cycles:      426
  Total uOps:        1200

  Dispatch Width:    8
  uOps Per Cycle:    2.82
  IPC:               2.82
  Block RThroughput: 4.0

Now vpsrlw
  Iterations:        100
  Instructions:      1200
  Total Cycles:      279
  Total uOps:        1400

  Dispatch Width:    8
  uOps Per Cycle:    5.02
  IPC:               4.30
  Block RThroughput: 2.5

Bug: None
Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-08-04 12:42:50 -07:00
Frank Barchard
6f729fbe65 ARGBToUV SSE use average of 4 pixels
- Was using avgb twice for non-exact and C for exact.

On Skylake Xeon:

Now SSE3
ARGBToJ420_Opt (326 ms)

Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)

Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-06-17 11:55:27 -07:00
Frank Barchard
889613683a Add hybrid detect for Intel laptop cpus
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h

Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-06-13 13:22:54 -07:00
Frank Barchard
0853c9353f ARGBToUV 64 bit use ymm8 for shuffler
Bug: 381138208
Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-05-12 15:09:40 -07:00
Frank Barchard
9f9b5cf660 ARGBToUV allow 32 bit x86 build
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function

Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-04-28 12:11:00 -07:00
Frank Barchard
0c07032182 clang format applies to git repo
Bug: None
Change-Id: Ida65a0033e8c783230cadf6912416ffd9bbf90e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6393515
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-25 11:49:25 -07:00
Frank Barchard
918329caee Make constant 0x0101 using vpcmpeqb+vpabsb
Was
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpsrlw      $0xf,%%ymm4,%%ymm4
      vpackuswb   %%ymm4,%%ymm4,%%ymm4
Now
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpabsb      %%ymm4,%%ymm4

Bug: 381138208
Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-03-10 13:25:16 -07:00
Frank Barchard
c060118bea ARGBToJ444 use 256 for fixed point scale UV
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc

Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-27 13:04:15 -08:00
Frank Barchard
5257ba4db0 Apply clang format
Bug: None
Change-Id: Ibd694d0351966a2b5812445de74bbced9c881a79
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6302317
Reviewed-by: James Zern <jzern@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-25 11:39:19 -08:00
Frank Barchard
3a7e0ba671 Apply format with no code changes
Bug: None
Change-Id: I8923bacb9af7e7d4f13e210c8b3d7ea6b81568a5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6301086
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2025-02-24 23:57:01 -08:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
6c2415bfab J420ToI420 AVX2
libyuv_test '--gunit_filter=*J420ToI420*' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
AVX2 J420ToI420_Opt (114 ms)
C    J420ToI420_Opt (596 ms)

Sapphire Rapids
AVX2 J420ToI420_Opt (126 ms)
C    J420ToI420_Opt (717 ms)

Samsung S23
NEON J420ToI420_Opt (46 ms)
C    J420ToI420_Opt (95 ms)

Bug: 381327032
Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-27 11:23:44 -08:00
Frank Barchard
e0040eb318 Apply clang format
Bug: None
Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:31:20 -08:00
Frank Barchard
1c501a8f3f CpuId test FSMR - Fast Short Rep Movsb
- Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features

Intel CPUs starting at Icelake support FSMR
adl:Has FSMR 0x8000
arl:Has FSMR 0x0
bdw:Has FSMR 0x0
clx:Has FSMR 0x0
cnl:Has FSMR 0x0
cpx:Has FSMR 0x0
emr:Has FSMR 0x8000
glm:Has FSMR 0x0
glp:Has FSMR 0x0
gnr:Has FSMR 0x8000
gnr256:Has FSMR 0x8000
hsw:Has FSMR 0x0
icl:Has FSMR 0x8000
icx:Has FSMR 0x8000
ivb:Has FSMR 0x0
knl:Has FSMR 0x0
knm:Has FSMR 0x0
lnl:Has FSMR 0x8000
mrm:Has FSMR 0x0
mtl:Has FSMR 0x8000
nhm:Has FSMR 0x0
pnr:Has FSMR 0x0
rpl:Has FSMR 0x8000
skl:Has FSMR 0x0
skx:Has FSMR 0x0
slm:Has FSMR 0x0
slt:Has FSMR 0x0
snb:Has FSMR 0x0
snr:Has FSMR 0x0
spr:Has FSMR 0x8000
srf:Has FSMR 0x0
tgl:Has FSMR 0x8000
tnt:Has FSMR 0x0
wsm:Has FSMR 0x0

Intel CPUs starting at Ivybridge support ERMS

adl:Has ERMS 0x4000
arl:Has ERMS 0x4000
bdw:Has ERMS 0x4000
clx:Has ERMS 0x4000
cnl:Has ERMS 0x4000
cpx:Has ERMS 0x4000
emr:Has ERMS 0x4000
glm:Has ERMS 0x4000
glp:Has ERMS 0x4000
gnr:Has ERMS 0x4000
gnr256:Has ERMS 0x4000
hsw:Has ERMS 0x4000
icl:Has ERMS 0x4000
icx:Has ERMS 0x4000
ivb:Has ERMS 0x4000
knl:Has ERMS 0x4000
knm:Has ERMS 0x4000
lnl:Has ERMS 0x4000
mrm:Has ERMS 0x0
mtl:Has ERMS 0x4000
nhm:Has ERMS 0x0
pnr:Has ERMS 0x0
rpl:Has ERMS 0x4000
skl:Has ERMS 0x4000
skx:Has ERMS 0x4000
slm:Has ERMS 0x4000
slt:Has ERMS 0x0
snb:Has ERMS 0x0
snr:Has ERMS 0x4000
spr:Has ERMS 0x4000
srf:Has ERMS 0x4000
tgl:Has ERMS 0x4000
tnt:Has ERMS 0x4000
wsm:Has ERMS 0x0
Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-18 17:56:45 +00:00
Frank Barchard
75f7cfdde5 SplitRGB for SSE4 and AVX2
libyuv_test '--gunit_filter=*SplitRGB*' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *SplitRGB*

Skylake Xeon x86 32 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms)

Skylake Xeon x86 64 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms)

Bug: 379186682
Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-15 00:46:25 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Frank Barchard
611806a155 [AArch64] Fix SVE/SME vector length printing in cpuid
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.

- Add volatile asm in row_gcc and row_neon64

Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 19:44:41 +00:00
Frank Barchard
fa16ddbb9f cpuid show vector length on ARM and RISCV
- additional asm volatile changes from github
- rotate mips remove C function - moved to common

Run on Samsung S22
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 5.10
Has Arm 0x2
Has Neon 0x4
Has Neon DotProd 0x10
Has Neon I8MM 0x20
Has SVE 0x40
Has SVE2 0x80
Has SME 0x0
SVE vector length: 16 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (0 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 201703
__clang_major__ 17
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__aarch64__ 1
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Run on RISCV qemu emulating SiFive X280:
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.6
Has RISCV 0x10000000
Has RVV 0x20000000
RVV vector length: 64 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (4 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 202002
__clang_major__ 9999
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__riscv 1
__riscv_vector 1
__riscv_v_intrinsic 12000
__riscv_zve64x 1000000
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Bug: b/42280943
Change-Id: I53cf0450be4965a28942e113e4c77295ace70999
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5672088
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 18:10:56 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
Frank Barchard
b0dfa70114 RVV remove unused variables
- ARM Planar test use regular asm volatile syntax
- x86 row functions remove volatile from asm

Bug: 347111119, 347112532
Change-Id: I535b3dfa1a7a19824503bd95584a63b047b0e9a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5637058
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-17 20:25:31 +00:00
Frank Barchard
3e435fe6d4 YUY2ToARGB use ymm6/7 for shuffle constants
- 1 load and 2 shuffles from registers replaces 2 loads and 2 memory shuffles
- vbroadcastf128 16 byte shuffler replaces 32 byte shufflers
- bump version and apply clang-format

libyuv_test '--gunit_filter=*.???2ToARGB_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

AMD Zen2
I422ToARGB_Opt (272 ms)
NV12ToARGB_Opt (255 ms)
YUY2ToARGB_Opt (208 ms)

Was
YUY2ToARGB_Opt (214 ms)

Change-Id: I1fa4d462d04536c877d1cab1a14586be8ed1b2f2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5218447
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-01-22 21:47:23 +00:00
Frank Barchard
a366ad714a ARGBAttenuate use (a + b + 255) >> 8
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate

Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-06-16 21:37:53 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Frank Barchard
ea26d7adb1 DetilePlane_16 AVX version
- fix ifdefs for DetilePlane_16 to use 16 bit versions, not 8 bit.  (no functional change)

Bug: b/258474032
Change-Id: Ic07e02d9801e21126ebee0ceb5779aa712a493ce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4034812
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-18 23:59:06 +00:00
Frank Barchard
8713ba3f0b Add vzeroupper to AVX row functions
- move power of two macro to planar functions source
- revert row.h IS_ALIGNED change

Bug: b/258474032
Change-Id: If87bb8d55c9b9930dd3e378614f8e4faae0870e9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4035166
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-11-17 23:00:08 +00:00
Frank Barchard
2d2cee418a Add Detile_16 planar function for 10 bit MT2T format
- Neon and SSE2
- Any for odd widths

Pixel 2 little core AArch32 build
C
TestDetilePlane_16 (1275 ms)
TestDetilePlane (1203 ms)
Neon
TestDetilePlane_16 (693 ms)
TestDetilePlane (660 ms)

Bug: b/258474032
Change-Id: Idbd09c5e9324e4deef5f1d54090d4b63cc7db812
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4031848
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-17 02:47:57 +00:00
Frank Barchard
00950840d1 YUY2ToNV12 using YUY2ToY and YUY2ToNVUV
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
  - Was SplitUV, memcpy Y, InterpolateUV
  - Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA

3840x2160 1000 iterations:

Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)

AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)

Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)

SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)

Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)

Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-30 22:41:21 +00:00
Frank Barchard
f9fda6e7d8 Fix shift amount for SSSE3 assembly for I012 format conversions
Bug: libyuv:938, libyuv:942
Change-Id: I6fb6e7e17fa941785e398bc630f465baf72fcabd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906091
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-20 23:07:53 +00:00
Frank Barchard
8fc02134c8 10/12 bit YUV replicate upper bits to low bits before converting to RGB
- shift high bits of 10 and 12 bit into lower bits

Bug: libyuv:941, libyuv:942,
Change-Id: I14381dbf226ef27dcce06893ea88860835639baa
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906085
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-20 20:56:43 +00:00
Frank Barchard
248172e2ba I422ToRGB24, I422ToRAW, I422ToRGB24MatrixFilter conversion functions added.
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits

Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)

AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)

Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)

Bug: libyuv:938


Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-20 02:00:52 +00:00
Frank Barchard
f71c83552d I420ToRGB24MatrixFilter function added
- Implemented as 3 steps: Upsample UV to 4:4:4, I444ToARGB, ARGBToRGB24
- Fix some build warnings for missing prototypes.

Pixel 4
I420ToRGB24_Opt (743 ms)
I420ToRGB24Filter_Opt (1331 ms)

Windows with skylake xeon:
x86 32 bit
I420ToRGB24_Opt (387 ms)
I420ToRGB24Filter_Opt (571 ms)
x64 64 bit
I420ToRGB24_Opt (384 ms)
I420ToRGB24Filter_Opt (582 ms)


Bug: libyuv:938, libyuv:830
Change-Id: Ie27f70816ec084437014f8a1c630ae011ee2348c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3900298
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-16 19:46:47 +00:00
Frank Barchard
3e38ce5058 SSE2 MM21->YUY2 conversion
Add SSE2 optimization for MM21ToYUY2 conversion.

Bug: b/238137982
Change-Id: I189f712514308322f651b082b496bce9c015c4ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832525
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-08-17 18:39:05 +00:00
Frank Barchard
65e7c9d570 MM21ToYUY2 and ABGRToJ420 conversion
MM21 to YUY2 use zip1 for performance

Cortex A510
Was MM21ToYUY2 (612 ms)
Now MM21ToYUY2 (573 ms)

Prefetches help Cortex A53
Was MM21ToYUY2 (4998 ms)
Now MM21ToYUY2 (1900 ms)

Pixel 4 Cortex A76
Was MM21ToYUY2 (215 ms)
Now MM21ToYUY2 (173 ms)

ABGRToJ420
- NEON, SSSE3 and AVX2 row functions
- J400, J420 and J422 formats.
- Added AVX2 for UV on ARGBToJ420.  Was SSSE3

Same code/performance as ARGBToJ420 but with constants re-ordered.
Pixel 4
ABGRToJ420_Opt (623 ms)
ABGRToJ422_Opt (702 ms)
ABGRToJ400_Opt (238 ms)

Skylake Xeon
With LIBYUV_BIT_EXACT which uses C for UV
ABGRToJ420_Opt (988 ms)
ABGRToJ422_Opt (1872 ms)
ABGRToJ400_Opt (186 ms)
Skylake Xeon using AVX2
ABGRToJ420_Opt (251 ms)
ABGRToJ422_Opt (245 ms)
ABGRToJ400_Opt (184 ms)
Skylake Xeon using SSSE3
ABGRToJ420_Opt (328 ms)
ABGRToJ422_Opt (362 ms)
ABGRToJ400_Opt (185 ms)

Bug: b/238137982
Change-Id: I559c3fe3fb80fa2ce5be3d8218736f9cbc627666
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832111
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-08-16 22:07:38 +00:00
Frank Barchard
6900494d90 Merge/SplitRGB fix -mcmodel=large x86 and InterpolateRow_16To8_NEON
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.

- fixes an out of registers error with -mcmodel=large

InterpolateRow_16To8_NEON improves performance for I210ToI420:

On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)

On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)


Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-06-29 00:00:46 +00:00
Justin Green
b4ddbaf549 Add support for MM21.
Add support for MM21 to NV12 and I420 conversion, and add SIMD
optimizations for arm, aarch64, SSE2, and SSSE3 machines.

Bug: libyuv:915, b/215425056
Change-Id: Iecb0c33287f35766a6169d4adf3b7397f1ba8b5d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3433269
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Justin Green <greenjustin@google.com>
2022-02-03 17:01:49 +00:00
Frank Barchard
90ffd5cba9 I420ToARGB for AVX512
On Skylake Xeon
AVX512  I420ToARGB_Opt (2050 ms)
AVX2    I420ToARGB_Opt (2533 ms)
SSSE3   I420ToARGB_Opt (3688 ms)

Bug: libyuv:911
Change-Id: I2214cc15dec24b06541895ca59d88990edbb2216
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3382100
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-14 09:17:33 +00:00
Frank Barchard
78625492cb InterpolateRow_AVX2 use AVX2 instead of ERMS for 100%
Bug: b/210066781
Change-Id: I709e403f03bd6b9f8fe693b165b242b784076fe0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3329072
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-12-15 03:54:18 +00:00
Frank Barchard
fdc71956bd InterpolateRow_AVX2 - extend width count to 64 bits
Bug: b/210066781
Change-Id: Ib9052d8edfce29b95ca02a6f7254d3ff35d2b64d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3329070
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-12-10 04:11:18 +00:00
Frank Barchard
d7a2d5da87 J400ToARGB optimized for Exynos using ZIP+ST1
Bug: 204562143
Change-Id: I56c98198c02bd0dd1283f1c14837730c92832c39
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3328702
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-12-10 01:00:07 +00:00
Frank Barchard
000806f373 NV21ToYUV24 replace ST3 with ST1. ARGBToAR64 replace ST2 with ST1
On Samsung S8 Exynos M2
Was ST3 NV21ToYUV24_Opt (769 ms)
Now ST1 NV21ToYUV24_Opt (473 ms)
Was ST2 ARGBToAR64_Opt (1759 ms)
Now ST1 ARGBToAR64_Opt (987 ms)

Skylake Xeon, AVX2 version:
Was NV21ToYUV24_Opt (885 ms)
Now NV21ToYUV24_Opt (194 ms)

Bug: b/204562143, b/124413599
Change-Id: Icc9cb64d822cd11937789a4e04fbb773b3e33aa3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3290664
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-11-24 07:38:49 +00:00
Frank Barchard
55b97cb48f BIT_EXACT for unattenuate and attenuate.
- reenable Intel SIMD unaffected by BIT_EXACT
- add bit exact version of ARGBAttenuate, which uses ARM version of formula.
- add bit exact version of ARGBUnatenuate, which mimics the AVX code.

Apply clang format to cleanup code.

Bug: libyuv:908, b/202888439
Change-Id: Ie842b1b3956b48f4190858e61c02998caedc2897
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3224702
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-10-15 19:46:02 +00:00