2027 Commits

Author SHA1 Message Date
George Steed
ef9833fc70 Add Neon implementation of Convert8To16Row
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16-bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.

Reduction in run times observed compared to the existing C
implementation:

 Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
 Cortex-A76: -61.6%
Cortex-A710: -57.6%
  Cortex-X1: -46.5%
  Cortex-X2: -54.4%
  Cortex-X3: -57.1%
  Cortex-X4: -55.0%
Cortex-X925: -49.3%

Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-29 13:37:48 -07:00
George Steed
7e5863ae5a Add SVE2 and SME implementations of I422ToAR30Row
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
  Cortex-X2: -5.7%
  Cortex-X3: -5.9%
  Cortex-X4: -2.8%
Cortex-X925: -4.0%

Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-27 11:39:00 -07:00
George Steed
949cb623bf Add SVE2 and SME implementations of I444ToRGB24Row
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715:  -9.2%
Cortex-A720:  -9.2%
  Cortex-X2: -25.8%
  Cortex-X3: -26.2%
  Cortex-X4: -23.2%
Cortex-X925: -17.8%

Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-22 13:33:06 -07:00
Frank Barchard
0853c9353f ARGBToUV 64 bit use ymm8 for shuffler
Bug: 381138208
Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-05-12 15:09:40 -07:00
George Steed
61bdaee13a Add Neon I8MM implementations of ARGB to UV and variants
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.

Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:

Cortex-A510:  -7.1%
Cortex-A520:  -2.1%
Cortex-A710:  -8.4%
Cortex-A715:  -0.3%
Cortex-A720:  -0.3%
  Cortex-X2: -40.0%
  Cortex-X3: -43.3%
  Cortex-X4: -11.3%
Cortex-X925:  -2.5%

Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-05-12 11:14:00 -07:00
Frank Barchard
9f9b5cf660 ARGBToUV allow 32 bit x86 build
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function

Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-04-28 12:11:00 -07:00
WANG Xuerui
55a708e226 Fix unified sources build for LoongArch LASX
Several consumers of libyuv do unified sources build where many source
files are #include'd together to make compilation units larger and allow
for more optimization chances. But for LoongArch there is a wrinkle:
LASX and LSX code paths are implemented in separate files, unlike the
other currently supported architectures, and some definitions are
duplicated e.g. struct RgbConstants.

Since the duplicated content is identical across the two files, short of
some bigger refactoring, we can simply place #ifdef guards around the
definitions to fix unified sources build for LoongArch.

Change-Id: I952e8e0210221ec8bcc113f75fa1b9ba515ec323
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6272801
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
2025-04-01 09:48:19 -07:00
Frank Barchard
23d416d6f3 Detect SME without SVE dependency
Bug: None
Change-Id: Ibe29488e893a493699ea3fae1a1a54a4fff5969c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6418571
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-31 17:27:40 -07:00
Frank Barchard
f145aa26da Add SME2 detect
Bug: None
Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373
Reviewed-by: George Steed <george.steed@arm.com>
2025-03-27 11:08:08 -07:00
Frank Barchard
5f284054cb RVV disable 64 bit elements and vcombine_v
Bug: 405451074
Change-Id: I8e4437be92934b3c367c94d867d7967c32747260
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-25 12:51:25 -07:00
Frank Barchard
0c07032182 clang format applies to git repo
Bug: None
Change-Id: Ida65a0033e8c783230cadf6912416ffd9bbf90e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6393515
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-03-25 11:49:25 -07:00
Frank Barchard
918329caee Make constant 0x0101 using vpcmpeqb+vpabsb
Was
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpsrlw      $0xf,%%ymm4,%%ymm4
      vpackuswb   %%ymm4,%%ymm4,%%ymm4
Now
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpabsb      %%ymm4,%%ymm4

Bug: 381138208
Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-03-10 13:25:16 -07:00
Frank Barchard
c060118bea ARGBToJ444 use 256 for fixed point scale UV
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc

Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-27 13:04:15 -08:00
Frank Barchard
5257ba4db0 Apply clang format
Bug: None
Change-Id: Ibd694d0351966a2b5812445de74bbced9c881a79
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6302317
Reviewed-by: James Zern <jzern@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-25 11:39:19 -08:00
Frank Barchard
3a7e0ba671 Apply format with no code changes
Bug: None
Change-Id: I8923bacb9af7e7d4f13e210c8b3d7ea6b81568a5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6301086
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2025-02-24 23:57:01 -08:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
d32d19ccf2 UV subsample on ARM use rounding average of 4 pixels
Performance on Samsung S22 Exynos (SVE2+I8MM+DOTPROD+Neon)
AArch64
ARGBToI400_Opt (168 ms)
ARGBToJ400_Opt (103 ms)
ABGRToJ400_Opt (81 ms)
RGBAToJ400_Opt (82 ms)
RGB24ToJ400_Opt (176 ms)
RAWToJ400_Opt (176 ms)
ABGRToI420_Opt (258 ms)
ARGBToI420_Opt (259 ms)
ARGBToI422_Opt (403 ms)
ARGBToI444_Opt (213 ms)
ARGBToJ420_Opt (257 ms)
ARGBToJ422_Opt (403 ms)
ARGBToJ444_Opt (214 ms)
ABGRToJ420_Opt (255 ms)
ABGRToJ422_Opt (399 ms)
ARGB4444ToI420_Opt (285 ms)
RGB565ToI420_Opt (316 ms)
ARGB1555ToI420_Opt (324 ms)
BGRAToI420_Opt (260 ms)
RAWToI420_Opt (303 ms)
RAWToI444_Opt (303 ms)
RAWToJ420_Opt (335 ms)
RAWToJ444_Opt (308 ms)
RGB24ToI420_Opt (372 ms)
RGB24ToJ420_Opt (365 ms)
RGBAToI420_Opt (259 ms)

AArch32 (Neon)
ARGBToI400_Opt (496 ms)
ARGBToJ400_Opt (478 ms)
ABGRToJ400_Opt (483 ms)
RGBAToJ400_Opt (493 ms)
RGB24ToJ400_Opt (343 ms)
RAWToJ400_Opt (341 ms)
ABGRToI420_Opt (993 ms)
ARGBToI420_Opt (992 ms)
ARGBToI422_Opt (1503 ms)
ARGBToI444_Opt (1257 ms)
ARGBToJ420_Opt (1006 ms)
ARGBToJ422_Opt (1521 ms)
ARGBToJ444_Opt (1267 ms)
ABGRToJ420_Opt (1002 ms)
ABGRToJ422_Opt (1504 ms)
ARGB4444ToI420_Opt (1180 ms)
RGB565ToI420_Opt (1112 ms)
ARGB1555ToI420_Opt (1115 ms)
BGRAToI420_Opt (993 ms)
RAWToI420_Opt (703 ms)
RAWToI444_Opt (1717 ms)
RAWToJ420_Opt (704 ms)
RAWToJ444_Opt (1739 ms)
RGB24ToI420_Opt (703 ms)
RGB24ToJ420_Opt (703 ms)
RGBAToI420_Opt (993 ms)

Bug: 381138208
Change-Id: I33728d5237f357362b0bfc509a9ebe6fe46f45d4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6228987
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-02-04 15:19:19 -08:00
Frank Barchard
5a9a6ea936 Add RAWToI444
Skylake Xeon
  RAWToI444_Opt (433 ms)
  RAWToJ444_Opt (1781 ms)
  ARGBToI444_Opt (352 ms)
  ARGBToJ444_Opt (1577 ms)

Samsung S22 Exynos
  ARGBToI444_Opt (283 ms)
  ARGBToJ444_Opt (209 ms)
  RAWToI444_Opt (294 ms)
  RAWToJ444_Opt (293 ms)

Profiling on Samsung S22 Exynos
37.62%,  ARGBToUV444Row_NEON_I8MM
29.42%,  RAWToARGBRow_SVE2
19.61%,  ARGBToYRow_NEON_DotProd

Passing different --libyuv_cpu_info=N etc we can compare each ISA
C           1  RAWToI444_Opt (781 ms)
NEON      511  RAWToI444_Opt (757 ms)
NEONDOT  1023  RAWToI444_Opt (571 ms)
NEONI8MM 2047  RAWToI444_Opt (334 ms)
SVE2     8191  RAWToI444_Opt (307 ms)



Bug: 390247964
Change-Id: I0316fedd32222588455afa751f5b854f46bce024
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-02-03 16:13:03 -08:00
Frank Barchard
b3fd3f3f3b Fix ARGBToUV444Row_NEON
- constants passed in are signed and need to be negated to positive.

Bug: 394127527
Change-Id: I531e475d2ddd4583922d4abef13b9282d002dd7a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6226854
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-02-03 13:33:39 -08:00
Frank Barchard
96f98f6915 ARGBToJ444 and RAWToJ444 NEON
- Pass JPEG matrix to ARGBToUV444MatrixRow_NEON
- Remove NEON unsigned constants in favor of DOTPROD signed constants

Samsung S23:
Was C for UV
  ARGBToJ444_Opt (320 ms)
  RAWToJ444_Opt (411 ms)
Now I8MM
  ARGBToJ444_Opt (196 ms)
  RAWToJ444_Opt (301 ms)
NEON
  ARGBToJ444_Opt (505 ms)
  RAWToJ444_Opt (596 ms)

32 bit ARM NEON
  ARGBToJ444_Opt (1135 ms)
  RAWToJ444_Opt (1546 ms)

Profile of RAWToJ444
  37.72%  ARGBToUVJ444Row_NEON_I8MM
  34.48%  RAWToARGBRow_NEON
  14.65%  ARGBToYJRow_NEON_DotProd

Bug: 390247964
Change-Id: Ia26240bee974a0baf502548f2fc896b193c3006c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6220890
Reviewed-by: Ben Weiss <bweiss@google.com>
2025-01-31 16:46:29 -08:00
Frank Barchard
c1bac9e6a5 RAWToJ444 and ARGBToJ444
- ARGBToJ444 implements ARGBToUVJ444Row_C
- RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444

libyuv_test '--gunit_filter=*R*To?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
(with bit exact off)

Samsung S23
RAWToJ444_Opt (437 ms)
ARGBToJ444_Opt (337 ms)
ARGBToI444_Opt (196 ms)

Skylake Xeon
RAWToJ444_Opt (1699 ms)
ARGBToJ444_Opt (1559 ms)
ARGBToI444_Opt (346 ms)

Bug: 390247964
Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-29 15:18:38 -08:00
George Steed
c4a0c8d34a [AArch64] Add SVE2 and SME implementations for Convert8To8Row
SVE can make use of the UMULH instruction to avoid needing separate
widening multiply and narrowing steps for the scale application.

Reduction in runtime for Convert8To8Row_SVE2 observed compared to the
existing Neon implementation:

        Cortex-A510: -13.2%
        Cortex-A520: -16.4%
        Cortex-A710: -37.1%
        Cortex-A715: -38.5%
        Cortex-A720: -38.4%
          Cortex-X2: -33.2%
          Cortex-X3: -31.8%
          Cortex-X4: -31.8%
        Cortex-X925: -13.9%

Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-01-28 15:53:26 -08:00
Frank Barchard
6c2415bfab J420ToI420 AVX2
libyuv_test '--gunit_filter=*J420ToI420*' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
AVX2 J420ToI420_Opt (114 ms)
C    J420ToI420_Opt (596 ms)

Sapphire Rapids
AVX2 J420ToI420_Opt (126 ms)
C    J420ToI420_Opt (717 ms)

Samsung S23
NEON J420ToI420_Opt (46 ms)
C    J420ToI420_Opt (95 ms)

Bug: 381327032
Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-27 11:23:44 -08:00
Frank Barchard
67f3f17d9a aarch32 J420ToI420
benchmark on medium core
adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Now Neon
J420ToI420_Opt (159 ms)
Was C
J420ToI420_Opt (215 ms)

AArch64
J420ToI420_Opt (93 ms)

C version does this:
vld1.8	{d20, d21}, [r6]!
vorr	q12, q8, q8
subs	r4, #16
vmovl.u8	q11, d21
vmovl.u8	q10, d20
vmul.i16	q11, q9, q11
vmul.i16	q10, q9, q10
vsra.u16	q12, q11, #8
vorr	q11, q8, q8
vsra.u16	q11, q10, #8
vmovn.i16	d21, q12
vmovn.i16	d20, q11
vst1.8	{d20, d21}, [r5]!
bne	0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54

Explanation of above C code
vorr moves 16 into register
vsra does shift + accumulate to that register

Compared to aarch64
instead of mull, C uses movl+mul
instead of uzp2, C uses sra #8 + movn. takes 2 movn vs 1 uzp2
instead of add, C does vorr + sra

Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497
Reviewed-by: Justin Green <greenjustin@google.com>
2025-01-22 13:47:09 -08:00
Frank Barchard
26277baf96 J420ToI420 using planar 8 bit scaling
- Add Convert8To8Plane which scale and add 8 bit values allowing full range
  YUV to be converted to limited range YUV

libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Samsung S23
J420ToI420_Opt (45 ms)
I420ToI420_Opt (37 ms)

Skylake
J420ToI420_Opt (596 ms)
I420ToI420_Opt (99 ms)

Bug: 381327032
Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-22 02:50:24 -08:00
Frank Barchard
ef52c1658a avx10_2 detect
Run with sde only -dmr reports AVX10.2
emr:Has AVX10_2 0x0
adl:Has AVX10_2 0x0
icx:Has AVX10_2 0x0
snb:Has AVX10_2 0x0
tnt:Has AVX10_2 0x0
icl:Has AVX10_2 0x0
slm:Has AVX10_2 0x0
dmr:Has AVX10_2 0x2000000
cwf:Has AVX10_2 0x0
mrm:Has AVX10_2 0x0
skx:Has AVX10_2 0x0
wsm:Has AVX10_2 0x0
gnr:Has AVX10_2 0x0
gnr256:Has AVX10_2 0x0
bdw:Has AVX10_2 0x0
cpx:Has AVX10_2 0x0
rpl:Has AVX10_2 0x0
snr:Has AVX10_2 0x0
ptl:Has AVX10_2 0x0
slt:Has AVX10_2 0x0
ivb:Has AVX10_2 0x0
spr:Has AVX10_2 0x0
tgl:Has AVX10_2 0x0
arl:Has AVX10_2 0x0
srf:Has AVX10_2 0x0
nhm:Has AVX10_2 0x0
skl:Has AVX10_2 0x0
mtl:Has AVX10_2 0x0
pnr:Has AVX10_2 0x0
glp:Has AVX10_2 0x0
lnl:Has AVX10_2 0x0
cnl:Has AVX10_2 0x0
hsw:Has AVX10_2 0x0
clx:Has AVX10_2 0x0
glm:Has AVX10_2 0x0

sde -dmr -- libyuv_test --gunit_filter=*Cpu*
[ RUN      ] LibYUVBaseTest.TestCpuId
Cpu Vendor: GenuineIntel 0x756e6547 0x49656e69 0x6c65746e
Cpu Family 6 (0x6), Model 214 (0xd6)
[       OK ] LibYUVBaseTest.TestCpuId (34 ms)
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.10
Has X86 0x8
Has SSE2 0x100
Has SSSE3 0x200
Has SSE4.1 0x400
Has SSE4.2 0x800
Has AVX 0x1000
Has AVX2 0x2000
Has ERMS 0x4000
Has FSMR 0x8000
Has FMA3 0x10000
Has F16C 0x20000
Has AVX512BW 0x40000
Has AVX512VL 0x80000
Has AVX512VNNI 0x100000
Has AVX512VBMI 0x200000
Has AVX512VBMI2 0x400000
Has AVX512VBITALG 0x800000
Has AVX10 0x1000000
Has AVX10_2 0x2000000
HAS AVXVNNI 0x4000000
Has AVXVNNIINT8 0x8000000
Has AMXINT8 0x10000000
[       OK ] LibYUVBaseTest.TestCpuHas (10 ms)

This is how oneDNN does avx10 version:
e15d2c220f/src/cpu/x64/xbyak/xbyak_util.h (L698-L701)

Bug: b/350318244
Change-Id: I6f78402fecc38a92019d137b3439d7bce950510c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6172267
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2025-01-21 13:53:19 -08:00
Frank Barchard
47ddac2996 Sub sampling conversions use CopyPlane for Y channel
- Replace ScalePlane with CopyPlane for Y channel
- Vertical mirroring is supported, but not horizontal mirroring.
- Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y.
- Apply clang-format
- Bump version to 1899

Bug: None
Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:34:11 -08:00
Frank Barchard
e0040eb318 Apply clang format
Bug: None
Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:31:20 -08:00
Darren Hsieh
b5a18f9d93 [RVV] Optimize ScaleARGBFilterCols with RVV
* Run on SiFive internal FPGA:

Test Case	                Speedup
ARGBScaleDownBy3by8_Linear      x2.05
ARGBScaleDownBy3by8_Bilinear    x1.76
ARGBScaleDownBy3by8_Box         x1.76

Bug: 42280924
Co-Developed-by: Bruce Lai <bruce.lai@sifive.com>
Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-29 17:32:00 -08:00
George Steed
db5a71c528 [AArch64] Remove unused variables in HalfRow_{16To8,16}_SME
The HalfRow kernels assume that the fraction is exactly half, so there
is no need to calculate it.

No-Try: True
Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
2024-12-13 08:00:58 -08:00
George Steed
7fd0bd197e [AArch64] Port YUVToRGB color conversions to SME
Some of the color conversion kernels already have Streaming-SVE
implementations however many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.

As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:

- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME

Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:07:54 -08:00
George Steed
c2e7f8389a [AArch64] Add SME implementations of InterpolateRow{,_16,_16To8}
InterpolateRow_SME and InterpolateRow_16_SME need special cases to
handle if source_y_fraction is 256 since this would overflow a byte and
can just be a call to memcpy instead.

InterpolateRow_16To8_SME is never called with a source_y_fraction value
of 256 so there is no need for a special case here.

Change-Id: I67805b5db2c411acb93ada626cf414b35620f467
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:03:41 -08:00
George Steed
2d8652f3e7 [AArch64] Add SME implementation of CopyRow
Add a streaming-SVE implementation of CopyRow using normal vector
load/store instructions.

Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:02:07 -08:00
George Steed
418b6df0de [AArch64] Add SME implementation of Convert16To8Row
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 03:01:55 -08:00
runzezhang
192b8c2238 Add NV24 scaling support to libyuv
Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing.

Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 02:46:11 -08:00
George Steed
85331e00cc [AArch64] Add SME impls of ScaleRowDown2{,Linear,Box}_16
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
SME kernels, but operating on 16-bit data rather than 8-bit.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 01:21:08 -08:00
George Steed
15f2ae7d70 [AArch64] Add SME impls of ScaleARGBRowDown2{,Linear,Box}
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples
rather than 8-bit data or 16-bit UV tuples.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 01:19:20 -08:00
George Steed
7391559cb4 [AArch64] Add SME implementation of MergeUVRow{,_16}
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 01:16:19 -08:00
George Steed
3e75e41e79 [AArch64] Add "limit" variable explanations in SVE *AR30 kernels
As requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583/1/source/row_sve.cc#1973

Change-Id: I15d8ca1f724a7123fbf52ac60b18c850e4004e64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-11 23:50:27 -08:00
George Steed
11ef227b6d [AArch64] Clean up formatting in row_sve.cc
Force macros onto empty lines with empty comments and adjust some other
comments to be consistent with the rest of the file.

Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-11 23:48:57 -08:00
George Steed
8f659daffd [AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.

Reduction in runtimes observed compared to the existing Neon
implementations:

            | NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 |         -60.7% |         -60.7%
Cortex-A520 |         -46.0% |         -46.0%
Cortex-A715 |         -25.2% |         -25.2%
Cortex-A720 |         -25.2% |         -25.2%
  Cortex-X2 |         -28.9% |         -29.0%
  Cortex-X3 |         -28.2% |         -28.1%
  Cortex-X4 |         -30.8% |         -30.7%
Cortex-X925 |         -28.8% |         -28.9%

Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-04 17:51:08 +00:00
George Steed
233f859e3c [AArch64] Remove redundant increments in ScaleRowDown2_16_NEON
These were mistakenly copied from the main loop body, however this
particular block of the code is only executed at most once so we do not
need to perform the address updates.

Also adjust formatting with clang-format to match other kernels.

Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-04 17:48:11 +00:00
George Steed
9144583f22 [AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow
Mostly just a translation of the existing Neon code to SME.

Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-03 22:11:19 +00:00
George Steed
88a3472f52 [AArch64] Unroll SVE2 impls of NV{12,21}ToARGBRow
We can reuse most of the logic from the existing I422TORGB_SVE_2X macro
and simply amend the existing READNV_SVE macro to read twice as much
data.

Unrolling is primarily beneficial for little cores but also provides
some smaller benefits to larger cores as well.

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -48.0% |             -47.9%
Cortex-A520 |             -48.1% |             -48.2%
Cortex-A715 |             -20.4% |             -20.4%
Cortex-A720 |             -20.6% |             -20.6%
  Cortex-X2 |              -7.1% |              -7.3%
  Cortex-X3 |              -4.0% |              -4.3%
  Cortex-X4 |             -14.1% |             -14.3%
Cortex-X925 |              -8.2% |              -8.6%

Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-03 22:03:42 +00:00
George Steed
03a935493d [AArch64] Simplify predicate width calculations
Several of the existing SVE kernels used calculations of the form:

        remainder = width & (vl - 1) == 0 ? vl : width & (vl - 1);

This is due to initial SVE contributed code unconditionally using the
predicated tail for the final iteration even if the width was a perfect
multiple of the vector length.

In the current code the fully-predicated main body loop will instead
iterate through the width completely and simply skip over the tail
entirely. Skipping over the tail means that the case handled by the
ternary condition now never occurs, and the remainder calculation can
now simply be:

        remainder = width & (vl - 1);

This avoids the need for a compare and conditional select in the
function prologue.

Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-03 21:54:32 +00:00
George Steed
2c32b689e4 [AArch64] Improve instruction interleaving in READI212_SVE
The existing instruction arrangement is sub-optimal on little cores
since it has instructions with dependencies next to each other, so
spread them out to improve performance.

No significant change observed on bigger cores, but little cores do show
some small improvements except for the *Alpha* kernels which regress
slightly.

Runtimes observed compared to the previous SVE implementation:

                   | Cortex-A510 | Cortex-A520
I210AlphaToARGBRow |   (!) +7.0% |   (!) +6.8%
     I210ToAR30Row |      -10.3% |       -9.9%
     I210ToARGBRow |       -2.4% |       -2.3%
     I212ToAR30Row |      -10.3% |       -9.9%
     I212ToARGBRow |       -2.4% |       -2.3%

Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-03 21:50:47 +00:00
Hao Chen
532126bf70 Fix bugs in ARGBAttenuateRow_LASX/LSX function
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.

Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
2024-11-30 23:09:04 +00:00
George Steed
9a9752134e [AArch64] Add Neon implementation of ScaleRowDown2Linear_16
Reduction in runtime observed relative to the auto-vectorized C
implementation compiled with LLVM 19:

  Cortex-A55: -13.7%
 Cortex-A510: -49.0%
 Cortex-A520: -32.0%
  Cortex-A76: -34.3%
 Cortex-A710: -56.7%
 Cortex-A715: -45.4%
 Cortex-A720: -44.7%
   Cortex-X1: -70.6%
   Cortex-X2: -67.9%
   Cortex-X3: -72.2%
   Cortex-X4: -40.0%
 Cortex-X925: -24.1%

Bug: b/42280942
Change-Id: I977899a2239e752400c9901f4d8482a76841269a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040154
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-25 21:10:26 +00:00
George Steed
11c57f4f12 [AArch64] Add Neon implementation of ScaleRowDown2_16_NEON
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.

Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:

 Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
 Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
  Cortex-X1: -48.5%
  Cortex-X2: -47.8%
  Cortex-X3: -47.6%
  Cortex-X4: -51.1%
Cortex-X925: -14.6%

Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-25 21:10:05 +00:00
George Steed
952d6a282f [AArch64] Enable use of ScaleRowDown2Box_16_NEON
The #ifdef surrounding the use of this kernel is never defined and
ScaleRowDown2_16_NEON does not exist, so add the missing #define and
remove the use of ScaleRowDown2_16_NEON for now. Additionally since
there is no implementation of this kernel for 32-bit Arm, restrict the
define to only be present on AArch64.

Bug: b/42280942
Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-24 19:58:00 +00:00
George Steed
9ed07258c7 [AArch64] Add SVE2 implementation of I410ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -18.1%
Cortex-A520:  -6.0%
Cortex-A715: -22.0%
Cortex-A720: -21.1%
  Cortex-X2:  -9.4%
  Cortex-X3: -12.0%
  Cortex-X4:  -7.6%
Cortex-X925:  -5.8%

Bug: b/42280942
Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:59:55 +00:00
George Steed
3dd047733e [AArch64] Add SVE2 implementation of I410AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.2%
Cortex-A520:  -6.9%
Cortex-A715: -14.8%
Cortex-A720: -16.0%
  Cortex-X2: -14.8%
  Cortex-X3: -17.5%
  Cortex-X4: -12.8%
Cortex-X925: -13.0%

Bug: b/42280942
Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:58:11 +00:00
George Steed
e84d809348 [AArch64] Add SVE2 implementation of I410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.9%
Cortex-A520:  -9.2%
Cortex-A715: -14.3%
Cortex-A720: -14.2%
  Cortex-X2: -10.9%
  Cortex-X3: -11.1%
  Cortex-X4: -12.5%
Cortex-X925: -10.6%

Bug: b/42280942
Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:54:48 +00:00
George Steed
7c9c72ab4b [AArch64] Add SVE2 implementation of I210ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.5%
Cortex-A520:  -3.8%
Cortex-A715: -15.8%
Cortex-A720: -15.8%
  Cortex-X2:  -7.9%
  Cortex-X3:  -6.5%
  Cortex-X4:  -5.0%
Cortex-X925:  -5.3%

Bug: b/42280942
Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-23 00:53:16 +00:00
George Steed
fc3569ad27 [AArch64] Add SVE2 implementation of I210AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -33.9%
Cortex-A520:  -4.2%
Cortex-A715: -22.0%
Cortex-A720: -22.4%
  Cortex-X2: -14.6%
  Cortex-X3: -14.5%
  Cortex-X4: -11.6%
Cortex-X925: -12.6%

Bug: b/42280942
Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:47:32 +00:00
George Steed
50108f29fb [AArch64] Add SVE2 implementation of I212ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.4%
Cortex-A520:  -3.8%
Cortex-A715: -15.7%
Cortex-A720: -15.6%
  Cortex-X2:  -7.9%
  Cortex-X3:  -5.7%
  Cortex-X4:  -5.3%
Cortex-X925:  -4.8%

Bug: b/42280942
Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:27:57 +00:00
George Steed
305a7a4ede [AArch64] Add SVE2 implementation of I212ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -16.1%
  Cortex-X2: -11.9%
  Cortex-X3: -11.9%
  Cortex-X4:  -9.3%
Cortex-X925: -11.2%

Bug: b/42280942
Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:21:27 +00:00
Frank Barchard
595146434a HalfFloat fix SigIll on aarch64
- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value

Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-11-22 22:08:00 +00:00
Frank Barchard
307b951229 Add CopyPlane_Unaligned, _Any and _Invert tests/benchmarksCpuId test
- Add AMD_ERMSB detect for ERMS on AMD

Bug: 379457420
Change-Id: I608568556024faf19abe4d0662aeeee553a0a349
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6032852
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-11-19 23:53:05 +00:00
Frank Barchard
1c501a8f3f CpuId test FSMR - Fast Short Rep Movsb
- Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features

Intel CPUs starting at Icelake support FSMR
adl:Has FSMR 0x8000
arl:Has FSMR 0x0
bdw:Has FSMR 0x0
clx:Has FSMR 0x0
cnl:Has FSMR 0x0
cpx:Has FSMR 0x0
emr:Has FSMR 0x8000
glm:Has FSMR 0x0
glp:Has FSMR 0x0
gnr:Has FSMR 0x8000
gnr256:Has FSMR 0x8000
hsw:Has FSMR 0x0
icl:Has FSMR 0x8000
icx:Has FSMR 0x8000
ivb:Has FSMR 0x0
knl:Has FSMR 0x0
knm:Has FSMR 0x0
lnl:Has FSMR 0x8000
mrm:Has FSMR 0x0
mtl:Has FSMR 0x8000
nhm:Has FSMR 0x0
pnr:Has FSMR 0x0
rpl:Has FSMR 0x8000
skl:Has FSMR 0x0
skx:Has FSMR 0x0
slm:Has FSMR 0x0
slt:Has FSMR 0x0
snb:Has FSMR 0x0
snr:Has FSMR 0x0
spr:Has FSMR 0x8000
srf:Has FSMR 0x0
tgl:Has FSMR 0x8000
tnt:Has FSMR 0x0
wsm:Has FSMR 0x0

Intel CPUs starting at Ivybridge support ERMS

adl:Has ERMS 0x4000
arl:Has ERMS 0x4000
bdw:Has ERMS 0x4000
clx:Has ERMS 0x4000
cnl:Has ERMS 0x4000
cpx:Has ERMS 0x4000
emr:Has ERMS 0x4000
glm:Has ERMS 0x4000
glp:Has ERMS 0x4000
gnr:Has ERMS 0x4000
gnr256:Has ERMS 0x4000
hsw:Has ERMS 0x4000
icl:Has ERMS 0x4000
icx:Has ERMS 0x4000
ivb:Has ERMS 0x4000
knl:Has ERMS 0x4000
knm:Has ERMS 0x4000
lnl:Has ERMS 0x4000
mrm:Has ERMS 0x0
mtl:Has ERMS 0x4000
nhm:Has ERMS 0x0
pnr:Has ERMS 0x0
rpl:Has ERMS 0x4000
skl:Has ERMS 0x4000
skx:Has ERMS 0x4000
slm:Has ERMS 0x4000
slt:Has ERMS 0x0
snb:Has ERMS 0x0
snr:Has ERMS 0x4000
spr:Has ERMS 0x4000
srf:Has ERMS 0x4000
tgl:Has ERMS 0x4000
tnt:Has ERMS 0x4000
wsm:Has ERMS 0x0
Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-18 17:56:45 +00:00
Frank Barchard
75f7cfdde5 SplitRGB for SSE4 and AVX2
libyuv_test '--gunit_filter=*SplitRGB*' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *SplitRGB*

Skylake Xeon x86 32 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms)

Skylake Xeon x86 64 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms)

Bug: 379186682
Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-15 00:46:25 +00:00
George Steed
823d960afc [AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row
Observed reductions in runtime compared to the existing Neon code:

            | P210ToAR30Row | P410ToAR30Row
Cortex-A510 |       -16.5%  |        -21.2%
Cortex-A520 | (!)    +2.7%  |         -8.7%
Cortex-A715 |        -6.1%  |         -6.1%
Cortex-A720 |        -6.2%  |         -5.9%
  Cortex-X2 |        -4.1%  |         -4.2%
  Cortex-X3 |        -4.2%  |         -4.2%
  Cortex-X4 |        -1.2%  |         -1.2%
Cortex-X925 |        -3.6%  |         -2.8%

Bug: b/42280942
Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:52:21 +00:00
George Steed
0ddf3f7b90 [AArch64] Add SVE2 implementation of I210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -13.9%
  Cortex-X2: -11.9%
  Cortex-X3: -11.6%
  Cortex-X4:  -9.5%
Cortex-X925: -11.5%

Bug: b/42280942
Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:36:41 +00:00
George Steed
5b906a0ec8 [AArch64] Add SVE2 implementation of P410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.7%
Cortex-A520:  -2.4%
Cortex-A715: -18.7%
Cortex-A720: -18.8%
  Cortex-X2:  -7.7%
  Cortex-X3:  -8.9%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.3%

Bug: b/42280942
Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:34:56 +00:00
George Steed
b753822d47 [AArch64] Add SVE2 implementation of P210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -32.8%
Cortex-A520:  +8.7% (!)
Cortex-A715: -18.9%
Cortex-A720: -18.9%
  Cortex-X2:  -7.9%
  Cortex-X3:  -8.8%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.6%

Bug: b/42280942
Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-12 18:32:55 +00:00
George Steed
721ad4aa18 [AArch64] Add SME implementation of ScaleUVRowDown2Box
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:30:30 +00:00
George Steed
576218dbce [AArch64] Add SME implementation of ScaleUVRowDown2Linear
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:28:57 +00:00
George Steed
551cee7845 [AArch64] Add SME implementation of ScaleUVRowDown2
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:24:28 +00:00
George Steed
5c12e0b2de [AArch64] Add SVE2 implementations of HalfFloat{,1}Row
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.

For HalfFloatRow, SVE zero-extending loads avoid the need for seperate
UXTL(2) instructions.

Observed reductions in runtime compared to the existing Neon code:

            | HalfFloat1Row | HalfFloatRow
Cortex-A510 |        -38.3% |       -17.3%
Cortex-A520 |        -37.6% |       -18.8%
Cortex-A720 |        -50.1% |        -7.8%
  Cortex-X2 |        -50.2% |        -0.4%
  Cortex-X4 |        -51.5% |       -12.5%

Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:53:00 +00:00
George Steed
7d383c2f1a [AArch64] Add comments to ScaleRowDown38_{2,3}_Box_NEON impls
Add a few comments to help illustrate the permute operations.

As requested here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803

Change-Id: I8596ef63af5fae4dba1e6fdb548742ba7e191ab9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975317
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:47:12 +00:00
George Steed
f27b983f38 [AArch64] Add SVE2 implementation of DivideRow_16
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
  Cortex-X2:  -5.2%
  Cortex-X3:  -2.6%
  Cortex-X4: -32.4%
Cortex-X925:  -1.5%

Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:46:02 +00:00
George Steed
aec4b4e22e [AArch64] Add SME implementation of ScaleRowDown2Box
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:42:21 +00:00
George Steed
b0f72309c6 Remove duplicate kernel assignment from scale_uv.cc
The assignment of ScaleUVRowDown2Box_NEON is already done in the block
immediately below this one, so just remove this code.

Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-01 15:42:21 +00:00
George Steed
f00c43f4d6 [AArch64] Unroll HalfFloat{,1}Row_NEON
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.

Reduction in cycle counts observed compared to the existing Neon
implementation:

            | HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 |             -37.1% |            -40.8%
Cortex-A520 |             -32.3% |            -37.4%
Cortex-A720 |               0.0% |            -10.6%
  Cortex-X2 |               0.0% |             -7.8%
  Cortex-X4 |              +0.3% |             -6.9%

Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-30 17:58:29 +00:00
George Steed
51d07554a0 [AArch64] Add SME implementation of ScaleRowDown2Linear
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-30 17:57:15 +00:00
George Steed
593965cea2 [AArch64] Add SME implementation of ScaleRowDown2
Including associated changes for adding a new scale_sme.cc file.

There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-30 17:56:41 +00:00
George Steed
237f39cb8c [AArch64] Add SME implementation of I444ToARGBRow
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-29 18:10:23 +00:00
George Steed
22c5c18778 [AArch64] Add SME implementation of I422ToARGBRow
Including addition of a new row_sme.cc file and associated
infrastructure.

The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-29 05:49:28 +00:00
George Steed
775fd92e59 [AArch64] Optimize ScaleRowDown38_3_Box_NEON
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary
permutes.

Reduction in run times:

 Cortex-A55: -24.8%
Cortex-A510: -32.7%
Cortex-A520: -37.7%
 Cortex-A76: -51.8%
Cortex-A715: -58.9%
Cortex-A720: -58.9%
  Cortex-X1: -54.8%
  Cortex-X2: -50.3%
  Cortex-X3: -57.1%
  Cortex-X4: -49.8%
Cortex-X925: -52.0%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: Ie96bac30fffbe41f8d1501ee289795830ab127e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-28 17:04:22 +00:00
George Steed
0bce5120f6 [AArch64] Optimize ScaleRowDown38_2_Box_NEON
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary permutes.

Reduction in run times:

 Cortex-A55: -17.9%
Cortex-A510: -28.7%
Cortex-A520: -31.8%
 Cortex-A76: -40.8%
Cortex-A715: -46.1%
Cortex-A720: -46.1%
  Cortex-X1: -44.3%
  Cortex-X2: -40.1%
  Cortex-X3: -46.3%
  Cortex-X4: -40.2%
Cortex-X925: -42.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: I84e2cd04912fc11d59b4407a1836f047b74a4c92
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872802
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-28 17:03:54 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
4621b0cc7f [AArch64] Rework data loading in ScaleFilterCols_NEON
Lane-indexed LD2 instructions are slow and introduce an unnecessary
dependency on the previous iteration of the loop. To avoid this
dependency use a scalar load for the first iteration and lane-indexed
LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd
elements.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  -6.7%
Cortex-A510: -13.2%
Cortex-A520: -13.1%
 Cortex-A76: -54.5%
Cortex-A715: -60.3%
Cortex-A720: -61.0%
  Cortex-X1: -69.1%
  Cortex-X2: -68.6%
  Cortex-X3: -73.9%
  Cortex-X4: -73.8%
Cortex-X925: -69.0%

Bug: b/42280945
Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:25:23 +00:00
George Steed
faade2f73f [AArch64] Avoid partial vector stores in ScaleRowDown38_NEON
The existing code performs a pair of stores since there is no AArch64
instruction in Neon to store exactly 12 bytes from a vector register.

It is guaranteed to be safe to write full vectors until the last
iteration of the loop, since the extra four bytes will be over-written
by subsequent iterations. This allows us to avoid duplicating the store
instruction and address arithmetic.

Reduction in runtime observed relative to the existing Neon
implementation:

 Cortex-A55:  +2.0%
Cortex-A510: -25.3%
Cortex-A520: -15.1%
 Cortex-A76: -32.2%
Cortex-A715: -19.7%
Cortex-A720: -19.6%
  Cortex-X1: -31.6%
  Cortex-X2: -27.1%
  Cortex-X3: -25.9%
  Cortex-X4: -24.7%
Cortex-X925: -35.8%

Bug: b/42280945
Change-Id: I222ed662f169d82f5f472bebb1bcfe6d428ccae2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872843
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 20:52:08 +00:00
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
Frank Barchard
7633328b5f Make functions that malloc check for ubsan math overflow
- add support for negative heights
- sanity check null pointers and invalid width/height

Bug: b/371615496
Change-Id: Icbefcb1ccc5cdf90e417c73440c6fad3b63ed7df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917072
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-10-08 21:08:34 +00:00
Wan-Teh Chang
364b7fa81b Remove redundant unsigned integer overflow tests
Bug: b/371615496
Change-Id: I28df888942085138a54e18c7e939300d959c68b0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5914872
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-08 01:14:35 +00:00
Frank Barchard
ffd791f749 Check malloc allocation sizes are less than SIZE_MAX
Bug: b/371615496
Change-Id: I75a94b08469d6d6b6fd55a8659031cbcb3d48eed
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5912039
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-10-07 21:34:15 +00:00
Wan-Teh Chang
77f3acade4 ScalePlaneDown34: test dst_width%24 == 0 for armv7
In ScalePlaneDown34(), check if dst_width % 24 == 0 for armv7, and check
if dst_width % 48 == 0 for aarch64.

No-Try: True
Bug: b/369963535, b/366045177
Change-Id: I7dc1227517c83c97a1d1052ef2230d5cec41da10
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5896492
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2024-09-27 23:00:19 +00:00
Frank Barchard
61bf0b61f7 Fix for ARGB scaling down by 4x horizontally but not vertically
Add test ARGBScaleTo50x1_Box
libyuv_test '--gunit_filter=*ARGBScaleTo50x1*' --libyuv_width=200 --libyuv_height=50

Bug:  chromium:361611480
Change-Id: Ic984951d74eb0c377c6746f61e91593a8a7d1a66
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5884656
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-09-24 18:00:47 +00:00
George Steed
02c6e8baca Change ARGBMultiplyRow_C to match Neon
The existing behaviour does not round correctly in all cases, so adjust
it to match the existing Neon implementation.

Update the tests to require bit-exactness and disable other
implementations that do not round correctly.

Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-23 21:48:33 +00:00
George Steed
a37e6bc81b [AArch64] Re-enable SME only for Linux and new versions of Clang
This was previously disabled in
679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for
Linux where SME is known to work correctly.

Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-23 09:29:53 +00:00
Wan-Teh Chang
85e55115f0 Untangle arm and aarch64 #ifdefs in GetCpuFlags()
Change-Id: I5df39c20a700aee38954bc9288fdee116138645d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5879350
Reviewed-by: George Steed <george.steed@arm.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 23:40:19 +00:00
Alex Richardson
f1b28b3510 Avoid reading /proc/cpuinfo for non-Linux Arm platforms
While we will return kCpuHasNEON if the file fails to open, this does
unnecessarily introduce filesystem operations which are not needed e.g.
on embedded non-Linux platforms. When not building for Linux, we can
simply rely on the compiler flags to determine whether NEON support is
present for Arm32.

Change-Id: Ifb0eab2a46969fca5f733ce624abdf54da9b32a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778479
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: George Steed <george.steed@arm.com>
2024-09-20 22:22:03 +00:00
George Steed
7eb552c891 [AArch64] Avoid unnecessary MOVs in ScaleARGBRowDownEvenBox_NEON
The existing code uses three MOV instructions through a temporary
register to swap the low and high halves of a vector register, however
this can be done with a pair of ZIP instructions instead.

Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in
parallel on little cores.

Reduction in runtime observed compared to the existing Neon
implementation:

 Cortex-A55:  -8.3%
Cortex-A510: -20.6%
Cortex-A520: -16.6%
 Cortex-A76:  -6.8%
Cortex-A715:  -6.2%
Cortex-A720:  -6.2%
  Cortex-X1: -22.0%
  Cortex-X2: -18.7%
  Cortex-X3: -21.1%
  Cortex-X4: -25.8%
Cortex-X925: -21.9%

Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 00:28:12 +00:00
George Steed
23a6a412e5 [AArch64] Unroll and use TBL in ScaleRowDown34_NEON
ST3 is known to be slow on a number of modern micro-architectures. By
unrolling the code we are able to use TBL to shuffle elements into the
correct indices without needing to use LD4 and ST3, giving a good
improvement in performance across the board.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -14.4%
Cortex-A510: -66.0%
Cortex-A520: -50.8%
 Cortex-A76: -60.5%
Cortex-A715: -63.9%
Cortex-A720: -64.2%
  Cortex-X1: -74.3%
  Cortex-X2: -75.4%
  Cortex-X3: -75.5%
  Cortex-X4: -48.1%

Bug: b/42280945
Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 15:37:27 +00:00
George Steed
d5303f4f77 [AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.

The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.

Reductions in runtime observed for ARGB1555ToARGBRow_NEON:

 Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
 Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
  Cortex-X1:  -0.9%
  Cortex-X2:  -4.6%
  Cortex-X3:  -3.6%
  Cortex-X4: -20.8%

Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:36:43 +00:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
George Steed
2dfb84b311 [AArch64] Unroll to use full vectors in ARGBToARGB1555Row_NEON
By loading packed 16-bit AR/GB data and operating on that directly we
avoid the need to perform a separate widening step before the
conversion.

Reduction in runtime observed compared to the existing Neon code:

 Cortex-A55: -13.2%
Cortex-A510:  -5.4%
 Cortex-A76: -21.5%
Cortex-A720: -25.2%
  Cortex-X1: -50.6%
  Cortex-X2: -36.8%

Bug: b/42280945
Change-Id: I780c71fdff1d017464c6e4e38f86979dda0e43ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790973
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:33:22 +00:00