224 Commits

Author SHA1 Message Date
Frank Barchard
2b4453d46f Deprecate MIPS and MSA support.
- Remove *_msa.cc source files
- Update build files
- Update header references, planar ifdefs for row functions
- Update documentation on supported platforms
- Version bumped to 1921
- clang-format applied

Bug: 434383432
Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953
Reviewed-by: Justin Green <greenjustin@google.com>
2025-10-16 12:20:40 -07:00
George Steed
7e5863ae5a Add SVE2 and SME implementations of I422ToAR30Row
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
  Cortex-X2: -5.7%
  Cortex-X3: -5.9%
  Cortex-X4: -2.8%
Cortex-X925: -4.0%

Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-27 11:39:00 -07:00
George Steed
949cb623bf Add SVE2 and SME implementations of I444ToRGB24Row
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.

Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:

Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715:  -9.2%
Cortex-A720:  -9.2%
  Cortex-X2: -25.8%
  Cortex-X3: -26.2%
  Cortex-X4: -23.2%
Cortex-X925: -17.8%

Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-05-22 13:33:06 -07:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
George Steed
7fd0bd197e [AArch64] Port YUVToRGB color conversions to SME
Some of the color conversion kernels already have Streaming-SVE
implementations however many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.

As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:

- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME

Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:07:54 -08:00
George Steed
8f659daffd [AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.

Reduction in runtimes observed compared to the existing Neon
implementations:

            | NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 |         -60.7% |         -60.7%
Cortex-A520 |         -46.0% |         -46.0%
Cortex-A715 |         -25.2% |         -25.2%
Cortex-A720 |         -25.2% |         -25.2%
  Cortex-X2 |         -28.9% |         -29.0%
  Cortex-X3 |         -28.2% |         -28.1%
  Cortex-X4 |         -30.8% |         -30.7%
Cortex-X925 |         -28.8% |         -28.9%

Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-04 17:51:08 +00:00
Hao Chen
532126bf70 Fix bugs in ARGBAttenuateRow_LASX/LSX function
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.

Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
2024-11-30 23:09:04 +00:00
George Steed
9ed07258c7 [AArch64] Add SVE2 implementation of I410ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -18.1%
Cortex-A520:  -6.0%
Cortex-A715: -22.0%
Cortex-A720: -21.1%
  Cortex-X2:  -9.4%
  Cortex-X3: -12.0%
  Cortex-X4:  -7.6%
Cortex-X925:  -5.8%

Bug: b/42280942
Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:59:55 +00:00
George Steed
3dd047733e [AArch64] Add SVE2 implementation of I410AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.2%
Cortex-A520:  -6.9%
Cortex-A715: -14.8%
Cortex-A720: -16.0%
  Cortex-X2: -14.8%
  Cortex-X3: -17.5%
  Cortex-X4: -12.8%
Cortex-X925: -13.0%

Bug: b/42280942
Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:58:11 +00:00
George Steed
e84d809348 [AArch64] Add SVE2 implementation of I410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.9%
Cortex-A520:  -9.2%
Cortex-A715: -14.3%
Cortex-A720: -14.2%
  Cortex-X2: -10.9%
  Cortex-X3: -11.1%
  Cortex-X4: -12.5%
Cortex-X925: -10.6%

Bug: b/42280942
Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:54:48 +00:00
George Steed
7c9c72ab4b [AArch64] Add SVE2 implementation of I210ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.5%
Cortex-A520:  -3.8%
Cortex-A715: -15.8%
Cortex-A720: -15.8%
  Cortex-X2:  -7.9%
  Cortex-X3:  -6.5%
  Cortex-X4:  -5.0%
Cortex-X925:  -5.3%

Bug: b/42280942
Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-23 00:53:16 +00:00
George Steed
fc3569ad27 [AArch64] Add SVE2 implementation of I210AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -33.9%
Cortex-A520:  -4.2%
Cortex-A715: -22.0%
Cortex-A720: -22.4%
  Cortex-X2: -14.6%
  Cortex-X3: -14.5%
  Cortex-X4: -11.6%
Cortex-X925: -12.6%

Bug: b/42280942
Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:47:32 +00:00
George Steed
50108f29fb [AArch64] Add SVE2 implementation of I212ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.4%
Cortex-A520:  -3.8%
Cortex-A715: -15.7%
Cortex-A720: -15.6%
  Cortex-X2:  -7.9%
  Cortex-X3:  -5.7%
  Cortex-X4:  -5.3%
Cortex-X925:  -4.8%

Bug: b/42280942
Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:27:57 +00:00
George Steed
305a7a4ede [AArch64] Add SVE2 implementation of I212ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -16.1%
  Cortex-X2: -11.9%
  Cortex-X3: -11.9%
  Cortex-X4:  -9.3%
Cortex-X925: -11.2%

Bug: b/42280942
Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:21:27 +00:00
George Steed
823d960afc [AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row
Observed reductions in runtime compared to the existing Neon code:

            | P210ToAR30Row | P410ToAR30Row
Cortex-A510 |       -16.5%  |        -21.2%
Cortex-A520 | (!)    +2.7%  |         -8.7%
Cortex-A715 |        -6.1%  |         -6.1%
Cortex-A720 |        -6.2%  |         -5.9%
  Cortex-X2 |        -4.1%  |         -4.2%
  Cortex-X3 |        -4.2%  |         -4.2%
  Cortex-X4 |        -1.2%  |         -1.2%
Cortex-X925 |        -3.6%  |         -2.8%

Bug: b/42280942
Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:52:21 +00:00
George Steed
0ddf3f7b90 [AArch64] Add SVE2 implementation of I210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -13.9%
  Cortex-X2: -11.9%
  Cortex-X3: -11.6%
  Cortex-X4:  -9.5%
Cortex-X925: -11.5%

Bug: b/42280942
Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:36:41 +00:00
George Steed
5b906a0ec8 [AArch64] Add SVE2 implementation of P410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.7%
Cortex-A520:  -2.4%
Cortex-A715: -18.7%
Cortex-A720: -18.8%
  Cortex-X2:  -7.7%
  Cortex-X3:  -8.9%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.3%

Bug: b/42280942
Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:34:56 +00:00
George Steed
b753822d47 [AArch64] Add SVE2 implementation of P210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -32.8%
Cortex-A520:  +8.7% (!)
Cortex-A715: -18.9%
Cortex-A720: -18.9%
  Cortex-X2:  -7.9%
  Cortex-X3:  -8.8%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.6%

Bug: b/42280942
Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-12 18:32:55 +00:00
George Steed
237f39cb8c [AArch64] Add SME implementation of I444ToARGBRow
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-29 18:10:23 +00:00
George Steed
22c5c18778 [AArch64] Add SME implementation of I422ToARGBRow
Including addition of a new row_sme.cc file and associated
infrastructure.

The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-29 05:49:28 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
George Steed
d5303f4f77 [AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.

The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.

Reductions in runtime observed for ARGB1555ToARGBRow_NEON:

 Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
 Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
  Cortex-X1:  -0.9%
  Cortex-X2:  -4.6%
  Cortex-X3:  -3.6%
  Cortex-X4: -20.8%

Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:36:43 +00:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
George Steed
555f80f3ce [AArch64] Add SVE2 implementation of RGB24ToARGBRow
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.

Observed reduction in runtime compared to the existing Neon kernel:

Cortex-A510: -25.6%
Cortex-A720: -15.2%
  Cortex-X2: -10.2%
  Cortex-X4: -30.2%

Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-08 20:12:05 +00:00
George Steed
c613c3f102 [AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.

Reduction in runtimes observed compared to the existing Neon
implementation:

            | RAWToARGBRow | RAWToRGBARow
Cortex-A510 |       -32.4% |       -31.9%
Cortex-A720 |       -15.7% |       -15.6%
  Cortex-X2 |       -24.6% |       -24.4%

Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-06 22:40:15 +00:00
George Steed
d1ec694ad3 [AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

              | Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow |     -59.8% |      -16.8% |     -53.2%
P210ToAR30Row |     -48.1% |      -21.8% |     -54.0%
P410ToARGBRow |     -56.5% |      -32.2% |     -54.1%
P410ToAR30Row |     -42.4% |       -4.5% |     -50.4%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-06 22:37:08 +00:00
George Steed
d32436e8f8 [AArch64] Add Neon implementation for I422ToAR30Row_NEON
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 Cortex-A55: -43.1%
Cortex-A510: -22.3%
 Cortex-A76: -54.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 18:16:33 +00:00
George Steed
bbd9cedc4f [AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToAR30Row | I210ToARGBRow
 Cortex-A55 |        -40.8% |        -54.4%
Cortex-A510 |        -26.2% |        -22.7%
 Cortex-A76 |        -49.2% |        -44.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-02 18:16:33 +00:00
George Steed
367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:06:46 +00:00
George Steed
cd4113f4e8 [AArch64] Add SVE2 implementation of I400ToARGBRow
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.

Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.

Reduction in runtime observed compared to the existing Neon code:

Cortex-A510: -26.1%
Cortex-A520:  -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
  Cortex-X2: -22.5%
  Cortex-X3: -23.5%
  Cortex-X4: -21.6%

Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:02:06 +00:00
George Steed
34abe98fe2 [AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.

Observed reductions in runtimes compared to the existing Neon code:

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -29.1% |             -29.1%
Cortex-A720 |              -4.8% |              -4.8%
  Cortex-X2 |              -9.2% |              -9.2%

Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:24:40 +00:00
George Steed
96bbdb53ed [AArch64] Add SVE2 implementation of I422ToRGBARow
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.

The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.

Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.

Reductions in runtime observed compared to the existing Neon code:

Cortex-A510: -27.0%
Cortex-A720:  -5.3%
  Cortex-X2: -14.7%

Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
d0da5a3298 [AArch64] Add SVE2 implementation of ARGB1555ToARGBRow
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.

Observed reduction in runtime relative to the existing Neon code:

Cortex-A510: -69.7%
Cortex-A720:  -7.7%
  Cortex-X2: -41.9%
  Cortex-X4: -14.5%

Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
Wan-Teh Chang
ec6f15079f Remove unneeded #ifdef HAVE_JPEG code
Change-Id: Ic7e1393b48bec735625197243b3d436ea01cfb07
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5529467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-09 23:02:18 +00:00
George Steed
f483007b9a [AArch64] Add SVE implementation for I422AlphaToARGBRow
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -32.1%
Cortex-A720:  -5.1%
  Cortex-X2: -10.1%

Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:07 +00:00
George Steed
b53b27d6bf [AArch64] Add SVE implementation for I444AlphaToARGBRow
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -34.2%
Cortex-A720: -17.6%
  Cortex-X2:  -9.6%

Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:02 +00:00
George Steed
6ac90403a1 [AArch64] Add SVE implementation for I422ToARGBRow
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -25.0%
Cortex-A720:  -5.0%
  Cortex-X2: -10.8%

Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-27 18:26:11 +00:00
George Steed
e52007eff9 [AArch64] Add SVE2 implementation for I444ToARGBRow
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:

* For the Y component calculation we are able to use UMULH, versus the
  existing UMULL x2 + UZP2 sequence in Neon.

* For the RGBTORGBA8 calculation we are able to take advantage of
  interleaving narrowing instructions, allowing us to use ST2 rather
  than ST4 for the store. This is a big performance win on some
  micro-architectures where ST4 is costly.

* The use of predication means we do not need to add "any" kernels, we
  can simply rerun the calculation with a not-full predicate for the
  final iteration.

To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.

Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:

Cortex-A510: -36.5%
Cortex-A720: -17.3%
  Cortex-X2: -11.3%

Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:11:01 +00:00
Frank Barchard
914624f0b8 YUY2ToARGBMatrix and UYVYToARGBMatrix added to allow any color matrix
Bug: libyuv:971
Change-Id: If15d4598d75500a3717f07d02c0c295fdc58254e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5214453
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-01-19 21:21:37 +00:00
Frank Barchard
def473f501 malloc return 1 for failures and assert for internal functions
Bug: libyuv:968
Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-12-04 22:55:20 +00:00
Frank Barchard
31e1d6f896 Check allocations that return NULL and return early
BUG=libyuv:968

Change-Id: I9e8594440a6035958511f9c50072820131331fc8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-10-27 17:41:36 +00:00
Bruce Lai
ec2e9ca000 [RVV] Support AR64ToAB64 and RGBA-family color conversions
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.

* Run on SiFive internal FPGA(VLEN=128):

Test Case	Speedup
AR64ToAB64_Opt  x4.6
ARGBToRGBA_Opt  x6
ARGBToBGRA_Opt  x6
ARGBToABGR_Opt  x6
RGBAToARGB_Opt  x6

Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-09-05 22:44:48 +00:00
Darren Hsieh
10de943a12 [RVV] Enable ScaleRowUp2_(Bi)linear_RVV/ScaleUVRowUp2_(Bi)linear_RVV
ScaleUVRowUp2_(Bi)linear_RVV function is equal to other platforms' ScaleRowUp2_(Bi)linear_Any_XXX.
We process entire row in this function.
Other platforms only implement non-edge part of image and process edge with scalar.
ScaleRowUp2_(Bi)linear_Any_XXX: Combine ScaleRowUp2_(Bi)linear_XXX(non-edge) + ScaleRowUp2_(Bi)linear_C(edge) by SBUH2LANY/SU2BLANY.

* Run on SiFive internal FPGA:

Test case                       RVV function			Speedup
I444ScaleFrom640x360_Bilinear	ScaleRowUp2_Bilinear_RVV	8.21
I444ScaleFrom640x360_Linear	ScaleRowUp2_Linear_RVV	        8.08
UVScaleFrom640x360_Bilinear	ScaleUVRowUp2_Bilinear_RVV	7.80
UVScaleFrom640x360_Linear	ScaleUVRowUp2_Linear_RVV	7.03

Change-Id: I539245ce51858f077506a78f0e7e82377ac6a95d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4666062
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-07-26 18:05:50 +00:00