8 Commits

Author SHA1 Message Date
George Steed
a114f85e50 [AArch64] Fix naming in ARGBToUVMatrixRow_SVE2 etc constants
Avoid abbreviations and capitalize ARGB and UV naming, as suggested
here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537

Bug: libyuv:973
Change-Id: I0d0143154594c03e6aca7c859b874e39634ca54f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5513544
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-03 17:25:14 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
ce32eb773f [AArch64] Avoid extraneous CMP in I{444,422}ToARGBRow_SVE2 impl
We can use subs to set condition flags as part of the subtract, no need
for a separate compare instruction. No performance difference observed
from this change, but it now matches the other SVE2 kernels.

Also remove unnecessary volatile from asm blocks.

Bug: libyuv:973
Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:56:22 +00:00
George Steed
f483007b9a [AArch64] Add SVE implementation for I422AlphaToARGBRow
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -32.1%
Cortex-A720:  -5.1%
  Cortex-X2: -10.1%

Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:07 +00:00
George Steed
b53b27d6bf [AArch64] Add SVE implementation for I444AlphaToARGBRow
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -34.2%
Cortex-A720: -17.6%
  Cortex-X2:  -9.6%

Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:02 +00:00
George Steed
6ac90403a1 [AArch64] Add SVE implementation for I422ToARGBRow
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -25.0%
Cortex-A720:  -5.0%
  Cortex-X2: -10.8%

Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-27 18:26:11 +00:00
George Steed
e52007eff9 [AArch64] Add SVE2 implementation for I444ToARGBRow
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:

* For the Y component calculation we are able to use UMULH, versus the
  existing UMULL x2 + UZP2 sequence in Neon.

* For the RGBTORGBA8 calculation we are able to take advantage of
  interleaving narrowing instructions, allowing us to use ST2 rather
  than ST4 for the store. This is a big performance win on some
  micro-architectures where ST4 is costly.

* The use of predication means we do not need to add "any" kernels, we
  can simply rerun the calculation with a not-full predicate for the
  final iteration.

To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.

Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:

Cortex-A510: -36.5%
Cortex-A720: -17.3%
  Cortex-X2: -11.3%

Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:11:01 +00:00
George Steed
9a8be20def [AArch64] Add :libyuv_sve library in preparation for SVE kernels
This commit only adds the bare minimum to get the new library building
through GN, the actual content of row_sve.cc is empty for now until we
start porting some kernels across.

Bug: libyuv:973
Change-Id: Ibdf4fc258761f3e507d700f27a405099c667ac75
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424738
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:10:01 +00:00