1841 Commits

Author SHA1 Message Date
George Steed
11ff6067a5 [AArch64] Add SVE2 implementation of RAWToRGB24Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD3 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementation:

Cortex-A510: -39.2%
Cortex-A720: -34.5%
  Cortex-X2: -31.0%

Bug: libyuv:973
Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 15:55:14 +00:00
George Steed
c613c3f102 [AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.

Reduction in runtimes observed compared to the existing Neon
implementation:

            | RAWToARGBRow | RAWToRGBARow
Cortex-A510 |       -32.4% |       -31.9%
Cortex-A720 |       -15.7% |       -15.6%
  Cortex-X2 |       -24.6% |       -24.4%

Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-06 22:40:15 +00:00
George Steed
d1ec694ad3 [AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

              | Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow |     -59.8% |      -16.8% |     -53.2%
P210ToAR30Row |     -48.1% |      -21.8% |     -54.0%
P410ToARGBRow |     -56.5% |      -32.2% |     -54.1%
P410ToAR30Row |     -42.4% |       -4.5% |     -50.4%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-06 22:37:08 +00:00
Frank Barchard
611806a155 [AArch64] Fix SVE/SME vector length printing in cpuid
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.

- Add volatile asm in row_gcc and row_neon64

Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 19:44:41 +00:00
George Steed
d32436e8f8 [AArch64] Add Neon implementation for I422ToAR30Row_NEON
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 Cortex-A55: -43.1%
Cortex-A510: -22.3%
 Cortex-A76: -54.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 18:16:33 +00:00
George Steed
bbd9cedc4f [AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToAR30Row | I210ToARGBRow
 Cortex-A55 |        -40.8% |        -54.4%
Cortex-A510 |        -26.2% |        -22.7%
 Cortex-A76 |        -49.2% |        -44.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-02 18:16:33 +00:00
Frank Barchard
fa16ddbb9f cpuid show vector length on ARM and RISCV
- additional asm volatile changes from github
- rotate mips remove C function - moved to common

Run on Samsung S22
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 5.10
Has Arm 0x2
Has Neon 0x4
Has Neon DotProd 0x10
Has Neon I8MM 0x20
Has SVE 0x40
Has SVE2 0x80
Has SME 0x0
SVE vector length: 16 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (0 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 201703
__clang_major__ 17
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__aarch64__ 1
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Run on RISCV qemu emulating SiFive X280:
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.6
Has RISCV 0x10000000
Has RVV 0x20000000
RVV vector length: 64 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (4 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 202002
__clang_major__ 9999
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__riscv 1
__riscv_vector 1
__riscv_v_intrinsic 12000
__riscv_zve64x 1000000
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Bug: b/42280943
Change-Id: I53cf0450be4965a28942e113e4c77295ace70999
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5672088
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 18:10:56 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
Frank Barchard
efd164d64e Disable RVV ScaleDownBy4 if compiler option is not enabled
- Some configs have int64 elements off by default.
  Disable ScaleDownBy4 row function to avoid compile error

Bug: 344954354
Change-Id: Ie0d74daea72375eff6438ab54cb2803d68d67e52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598460
Reviewed-by: James Zern <jzern@google.com>
2024-06-18 01:52:40 +00:00
Frank Barchard
b0dfa70114 RVV remove unused variables
- ARM Planar test use regular asm volatile syntax
- x86 row functions remove volatile from asm

Bug: 347111119, 347112532
Change-Id: I535b3dfa1a7a19824503bd95584a63b047b0e9a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5637058
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-17 20:25:31 +00:00
Bruce Lai
7758c961c5 Support RVV v0.12 intrinsics for row_rvv.cc & scale_rvv.cc
1. Add two defined marco LIBYUV_RVV_HAS_TUPLE_TYPE & LIBYUV_RVV_HAS_VXRM_ARG

Intrinsic v0.12 introduces
- tuple type in segment load & store
- vxrm argument in fixed-point intrinsics (e.g vnclip)

These two marcos are controled by __riscv_v_intrinsic.

2. Support RVV v0.12 intrinsics in row_rvv.cc & scale_rvv.cc

Change-Id: I921f91d9dc8fdda031e7b6647d0e296aa2793c39
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4767120
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-17 18:01:49 +00:00
George Steed
367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:06:46 +00:00
George Steed
cd4113f4e8 [AArch64] Add SVE2 implementation of I400ToARGBRow
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.

Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.

Reduction in runtime observed compared to the existing Neon code:

Cortex-A510: -26.1%
Cortex-A520:  -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
  Cortex-X2: -22.5%
  Cortex-X3: -23.5%
  Cortex-X4: -21.6%

Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:02:06 +00:00
George Steed
34abe98fe2 [AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.

Observed reductions in runtimes compared to the existing Neon code:

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -29.1% |             -29.1%
Cortex-A720 |              -4.8% |              -4.8%
  Cortex-X2 |              -9.2% |              -9.2%

Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:24:40 +00:00
George Steed
a758a15dbf [AArch64] Add I8MM implementation of ARGBColorMatrixRow
We cannot use the standard dot-product instructions since the matrix of
coefficients are signed, but I8MM supports mixed-sign products which
work well here.

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -50.8%
Cortex-A520: -33.3%
Cortex-A715: -38.6%
Cortex-A720: -38.5%
  Cortex-X2: -43.2%
  Cortex-X3: -40.0%
  Cortex-X4: -55.0%

Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:17:59 +00:00
George Steed
89cf221baa [AArch64] Avoid unnecessary widening in I422ToARGB1555Row_NEON
The existing code first widens the component vectors from 8-bit elements
to 16-bits to construct the final ARGB1555 result, however this is
unnecessary since the inputs to the widening are themselves the result
of having just been narrowed in the RGBTORGB8 macro.

By making use of the new RGBTORGB8_TOP macro we can get rid of both the
widening as well as the prior narrowing step.

Also remove volatile from the asm, it is unnecessary.

Reduction in runtime observed for I422ToARGB1555Row_NEON:

 Cortex-A55:  -7.8%
 Cortex-A76: -15.0%
Cortex-A720: -20.3%
  Cortex-X1: -20.2%
  Cortex-X2: -20.3%

Bug: libyuv:976
Change-Id: Id031c5d4d788828297adcc2fe2c2cd8d99b45433
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5616050
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-11 23:36:13 +00:00
George Steed
e6c4b9ad2e [Arm][AArch64] Remove unused ARGBToUVJ444Row_NEON definition
There is no corresponding declaration in a header file and it appears to
be unused, so remove from both the Arm and AArch64.

Change-Id: I4de9fb7ce8e8dff6e76f4a99fdd93c743f92bf18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5587507
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-10 18:36:31 +00:00
George Steed
c8974cf8d4 [AArch64] Add SME feature detection on Linux
This commit just adds the kCpuHasSME to represent that the CPU has the
Arm Scalable Matrix Extension enabled, but this commit does not
introduce any code to actually use it yet.

Add a test to check that the HWCAP value is interpreted correctly.

Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-08 23:34:22 +00:00
George Steed
910f8e3645 [AArch64] Remove redundant semicolons after ANY41CT
Introduced by 5b4160b9c322fda98e2208d80c2ea75dd7e7f25f.

Bug: 345650115
Change-Id: I68c4c34ad9701f62729590ad137d743324497d28
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5604588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-08 23:33:54 +00:00
George Steed
a68b959873 [AArch64] Add initial build system support for SME
Extend both the CMake and BUILD.gn configurations to support building a
library with the Arm Scalable Matrix Extension (SME). Add an initial
(empty) rotate_sme.cc source file to populate the library for now.

Change-Id: Icd4bd6a8ce72ba132299b00c99478a18a85d869a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5588664
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-08 23:32:41 +00:00
George Steed
3f657221f0 [AArch64] Remove unused vars in I{210,410}{,Alpha}ToARGBRow_NEON
The elements of the YUV constants are passed directly to the inline asm
block, so no need to pull them out into variables first.

Also remove "volatile" from inline asm blocks, it is unnecessary.

Bug: 344998222
Change-Id: I7d97dec8c7495651e5a31c10eda2d4aeed36fe6a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598764
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-07 02:39:20 +00:00
George Steed
96bbdb53ed [AArch64] Add SVE2 implementation of I422ToRGBARow
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.

The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.

Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.

Reductions in runtime observed compared to the existing Neon code:

Cortex-A510: -27.0%
Cortex-A720:  -5.3%
  Cortex-X2: -14.7%

Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
004352ba16 [AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.

Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.

Observed reduction in runtime compared to the existing Neon code:

            | AYUVToUVRow | AYUVToVURow
Cortex-A510 |      -33.1% |      -33.0%
Cortex-A720 |      -25.1% |      -25.1%
  Cortex-X2 |      -59.5% |      -53.9%
  Cortex-X4 |      -39.2% |      -39.4%

Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
d0da5a3298 [AArch64] Add SVE2 implementation of ARGB1555ToARGBRow
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.

Observed reduction in runtime relative to the existing Neon code:

Cortex-A510: -69.7%
Cortex-A720:  -7.7%
  Cortex-X2: -41.9%
  Cortex-X4: -14.5%

Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
dff7bad43d [AArch64] Use full Neon vectors in ARGB4444ToARGBRow_NEON
The existing Neon code narrows the input 16-bit packed data to 8-bit
elements and separates the color channels, causing us to only process
half a Neon vector per instruction for the channel widening from 4-bit
color data to 8-bits.

We can note that the processing being done is identical for all color
channels and therefore we can keep them partially interleaved during the
widening step. This allows us to use full Neon vectors for the whole
loop body.

Reductions in runtimes observed for ARGB4444ToARGBRow_NEON:

 Cortex-A55: -30.7%
Cortex-A510: -44.3%
 Cortex-A76: -51.6%
  Cortex-X2: -54.2%

Bug: libyuv:976
Change-Id: I9d9cda7e16eb07619c6d7f1de2e6b8c0fb6d64cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:52:33 +00:00
George Steed
7633c818ec [AArch64] Remove pointless MOVI in ARGB1555ToARGBRow_NEON
This function takes the alpha component from the loaded data rather than
hard-coding it to 255, so initialising v3 to 255 is unused here.

Change-Id: I668825e0eeb317d1365035ce3bb47f3d92081c6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594388
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:47:01 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
214b4a25c7 [Arm] Clean up rotate_neon.cc kernels
Get rid of unused tail loops, since they are already handled by the
"any" kernels.

Also remove unnecessary "volatile" specifier from asm blocks.

Bug: libyuv:976
Change-Id: I4676fc807bcaedbb5f0f52b1bed20a172fef4ed6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553719
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:23:40 +00:00
George Steed
bce3392830 [AArch64] Add SVE2 implementation of ARGBToRGB565Row
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -27.1%
Cortex-A720: -49.4%
  Cortex-X2: -67.9%

Bug: libyuv:973
Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 17:42:27 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
George Steed
e348995a92 [AArch64] Optimize MergeXR30Row_10_NEON
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.

We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.

Reduction in runtime observed for MergeXR30Row_10_NEON:

 Cortex-A55: -34.2%
Cortex-A510: -35.6%
 Cortex-A76: -44.9%
  Cortex-X2: -48.3%

Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:32:55 +00:00
George Steed
56258c125b [AArch64] Avoid redundant shift around RGB565 conversion
The existing code performs a narrowing shift right (in RGBTORGB8)
followed by a widening left shift (in ARGBTORGB565). This is redundant
since we could have simply not performed a narrowing operation to begin
with and instead done a saturating left shift to saturate against the
top of the 16-bit lanes rather than the narrowed 8-bit lanes.

To enable this we introduce new RGBTORGB8_TOP and ARGBTORGB565_FROM_TOP
macros which produce and consume values from the high half of each
16-bit lane rather than a narrowed 8-bit intermediate.

Reduction in runtime for selected kernels:

                     | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
I422ToRGB565Row_NEON |     -10.8% |       -6.1% |     -17.2% |    -23.6%
NV12ToRGB565Row_NEON |     -11.4% |       -4.9% |     -20.4% |    -17.4%

Bug: libyuv:976
Change-Id: I3337b8f41ff62a7af1b70a56b774239bdb55d0f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509197
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:29:54 +00:00
George Steed
c5f9583b1c [AArch64] Avoid extracting alpha in ARGB1555ToYRow_NEON
The existing implementation of this kernel uses the ARGB1555TOARGB macro
which extracts and sign-extends the alpha component into v3, however
this particular kernel does not need the alpha component. We can avoid
calculating the alpha component completely by using the existing
RGB555TOARGB macro, so use that instead.

Reduction in runtimes observed for ARGB1555ToYRow_NEON (no noticeable
improvement observed on Cortex-A510):

Cortex-A55: -3.6%
Cortex-A76: -20.9%
 Cortex-X2: -15.1%

Bug: libyuv:976
Change-Id: I2cf2729c8297c53dcd32d0df28e64d4d5c7f6def
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509200
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 08:28:27 +00:00
George Steed
7c122e8859 [AArch64] Use ST2 to avoid TRN step in TransposeWx16_NEON
ST2 with 64-bit lanes has good performance on all micro-architectures of
interest and saves us 8 TRN instructions, so use that instead.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  -8.6%
Cortex-A510:  -4.9%
Cortex-A520:  -6.0%
 Cortex-A76: -14.4%
Cortex-A720:  -5.3%
  Cortex-X1: -13.6%
  Cortex-X2:  -5.8%

Bug: libyuv:976
Change-Id: I08bb5517bbdc54c4784fce42a885b12f91e7a982
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5581597
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:27:05 +00:00
George Steed
6b9604dffc [AArch64] Remove unused code from TransposeUVWx8_NEON
We already have an "any" helper function set up for this kernel, so use
it to match the other existing architecture paths. This change also
affects the 32-bit Arm paths, which will be cleaned up in a later
commit.

With this change the kernel is now only entered with width as a multiple
of eight, so remove the now-unneeded tail loops.

Also remove volatile specifier from the asm block, it is unnecessary.

Change-Id: If37428ac2d6035a8c27eec9bd80d014a98ac3eb1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553717
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-27 21:52:56 +00:00
George Steed
d0c28db56c [AArch64] Optimize Merge{ARGB,XRGB}16To8Row_NEON
Rather than shifting the data into the low half of each lane and then
using a saturating narrowing operation, we can do the saturation as part
of a shift into the highest half of the lane and then use a simpler TRN2
instruction to extract pairs of high halves into full vectors. This also
has the nice advantage of allowing us to use ST2 rather than ST4 for
storing the result, since ST4 is known to be slow on some
micro-architectures.

Reduction in runtimes observed for the two kernels:

             | MergeARGB16To8Row_NEON | MergeXRGB16To8Row_NEON
  Cortex-A55 |                  -8.0% |                 -12.2%
 Cortex-A510 |                 -29.9% |                 -31.4%
  Cortex-A76 |                 -29.0% |                 -32.0%
   Cortex-X2 |                 -33.5% |                 -43.4%

Bug: libyuv:976
Change-Id: I9da3beedc27ab43527b3642aa6d4decf3b5b6683
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509198
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-21 07:55:03 +00:00
George Steed
4f7fd808b7 [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON
The existing Neon code only makes use of 64-bit vectors throughout which
limits the performance on larger cores. To avoid this, swap the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.

The original code also handled widths that were not exact multiples of
16, however this should already be handled by the "any" kernel so it is
removed.

Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.

Observed speedups for TransposePlane across a range of
micro-architectures:

 Cortex-A53: -40.0%
 Cortex-A55: -20.7%
 Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
  Cortex-X2: -38.3%
  Cortex-X4: -43.6%

Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:46:42 +00:00
George Steed
9fac9a4a82 [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row
There are existing x86 implementations for these kernels but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | ABGRToAR30Row | ARGBToAR30Row
 Cortex-A55 |        -55.1% |        -55.1%
Cortex-A510 |        -39.3% |        -40.1%
 Cortex-A76 |        -62.3% |        -63.6%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:35:07 +00:00
George Steed
83c48c782a [AArch64] Improve ARGB4444TOARGB using SRI instructions
Also avoid constructing the alpha component when it isn't needed by
introducing a new ARGB4444TORGB macro.

Reduction in runtime for selected kernels:

                       | Cortex-A55 | Cortex-A510 | Cortex-A76
ARGB4444ToARGBRow_NEON |     -27.5% |      -27.9% |     -29.1%
  ARGB4444ToUVRow_NEON |     -20.2% |      -25.2% |     -21.7%
   ARGB4444ToYRow_NEON |     -16.0% |      -20.2% |     -21.3%

Bug: libyuv:976
Change-Id: Ida061e1c49ba228b02c2f691a067b58edad073a8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509196
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:29:11 +00:00
George Steed
5618a5c762 [AArch64] Use REV16 rather than TBL in SwapUVRow_NEON
We don't need a general-purpose purmute here, REV16 does exactly what we
want and saves us needing to load the permute indices array.

Bug: libyuv:976
Change-Id: Ib3bc2e4d21b00d53aeda6a11c6e6f1016ca6029e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509201
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-21 07:26:54 +00:00
George Steed
c6632d43ae [AArch64] Impose feature dependencies in detection code
The strict architectural requirements between features are reasonably
relaxed and difficult to map out fully, in particular:

* FEAT_DotProd is architecturally available from Armv8.1-A and becomes
  mandatory from Armv8.4-A.

* FEAT_I8MM is architecturally available from Armv8.1-A and becomes
  mandatory from Armv8.6-A. It does not strictly depend on FEAT_DotProd
  being implemented however I am not aware of a micro-architecture where
  FEAT_I8MM is implemented without FEAT_DotProd also being implemented.

* FEAT_SVE is architecturally available from Armv8.2-A. It does not
  strictly depend on either of FEAT_DotProd or FEAT_I8MM being
  implemented. The only micro-architecture I am aware of where FEAT_SVE
  is implemented without FEAT_DotProd and FEAT_I8MM both also being
  implemented is the Fujitsu A64FX.

* FEAT_SVE2 is architecturally available from Armv9.0-A. If FEAT_SVE2 is
  implemented then FEAT_SVE must also be implemented. Since Armv9.0-A is
  based on Armv8.5-A this implies that FEAT_DotProd is also implemented.
  Interestingly this means that FEAT_I8MM is not mandatory since it only
  becomes mandatory from Armv8.6-A (Armv9.1-A), however I am not aware
  of a micro-architecture where FEAT_SVE2 is implemented without all
  three of the above features also being implemented.

Additionally, when testing under emulation there are sometimes bugs
where even mandatory architecture relationships are broken. For example
there is one known case where SVE2 may be reported as available even
when SVE is explicitly disabled.

To simplify these dependencies, don't try to enable later extensions
unless earlier extensions are reported implemented. This notably
penalises code if it were to run on a Fujitsu A64FX, however this is not
a likely target for libyuv deployment.

Change-Id: Ifa32f7a43043641f99afb120e591945e136c9fd1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5546385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:21:49 +00:00
Wan-Teh Chang
ec6f15079f Remove unneeded #ifdef HAVE_JPEG code
Change-Id: Ic7e1393b48bec735625197243b3d436ea01cfb07
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5529467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-09 23:02:18 +00:00
George Steed
ee830a5f77 [AArch64] Enable feature detection on Windows and Apple Silicon
Using the platform-specific functions IsProcessorFeaturePresent and
sysctlbyname to check individual features.

Bug: libyuv:980
Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-03 18:42:51 +00:00
George Steed
a114f85e50 [AArch64] Fix naming in ARGBToUVMatrixRow_SVE2 etc constants
Avoid abbreviations and capitalize ARGB and UV naming, as suggested
here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537

Bug: libyuv:973
Change-Id: I0d0143154594c03e6aca7c859b874e39634ca54f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5513544
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-03 17:25:14 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
67e5e79dbe [AArch64] Add Neon implementation of HashDjb2
Reduction in runtime observed compared to the existing C code compiled
with LLVM 18:

 Cortex-A55: -46.2%
Cortex-A510: -60.4%
 Cortex-A76: -82.9%
Cortex-A720: -87.4%
  Cortex-X1: -90.0%
  Cortex-X2: -91.7%

Change-Id: I39a4479f78299508043a864e64fb40578c66ce19
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494094
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:37:31 +00:00
George Steed
1eae2efbc7 [AArch64] Use LD1/ST1 rather than LD4/ST4 in ARGBShadeRow_NEON
The use of LD4 and ST4 to de-interleave ARGB color channels is
unnecessary here since we can just adjust the scale multiplicand to
match the interleaved layout. LD4 and ST4 are known to perform poorly on
some micro-architectures so using LD1 and ST1 here should be preferred.

Reduction in runtime for ARGBShadeRow_NEON:

  Cortex-A55: -19.9%
 Cortex-A510: -50.8%
  Cortex-A76: -36.0%
   Cortex-X2: -46.4%

Bug: libyuv:976
Change-Id: I10a0e6a0a62242826d39b1e963063770f084226a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494093
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-30 00:48:35 +00:00
George Steed
ce32eb773f [AArch64] Avoid extraneous CMP in I{444,422}ToARGBRow_SVE2 impl
We can use subs to set condition flags as part of the subtract, no need
for a separate compare instruction. No performance difference observed
from this change, but it now matches the other SVE2 kernels.

Also remove unnecessary volatile from asm blocks.

Bug: libyuv:973
Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:56:22 +00:00