34 Commits

Author SHA1 Message Date
George Steed
b753822d47 [AArch64] Add SVE2 implementation of P210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -32.8%
Cortex-A520:  +8.7% (!)
Cortex-A715: -18.9%
Cortex-A720: -18.9%
  Cortex-X2:  -7.9%
  Cortex-X3:  -8.8%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.6%

Bug: b/42280942
Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-12 18:32:55 +00:00
George Steed
5c12e0b2de [AArch64] Add SVE2 implementations of HalfFloat{,1}Row
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.

For HalfFloatRow, SVE zero-extending loads avoid the need for seperate
UXTL(2) instructions.

Observed reductions in runtime compared to the existing Neon code:

            | HalfFloat1Row | HalfFloatRow
Cortex-A510 |        -38.3% |       -17.3%
Cortex-A520 |        -37.6% |       -18.8%
Cortex-A720 |        -50.1% |        -7.8%
  Cortex-X2 |        -50.2% |        -0.4%
  Cortex-X4 |        -51.5% |       -12.5%

Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:53:00 +00:00
George Steed
f27b983f38 [AArch64] Add SVE2 implementation of DivideRow_16
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
  Cortex-X2:  -5.2%
  Cortex-X3:  -2.6%
  Cortex-X4: -32.4%
Cortex-X925:  -1.5%

Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:46:02 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
Wan-Teh Chang
6157cc4583 Remove the ' separators in hex integer constants
They are a C++14 feature, not supported in C++11 mode (-std=c++11).

Change-Id: I618020342d4964b994aefa06af83b2e8d553a032
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786607
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 20:50:28 +00:00
Wan-Teh Chang
3cf54e90d3 Fix -Wmissing-prototypes warnings
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.

Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-12 19:08:24 +00:00
George Steed
42d33341d3 [AArch64] Unroll {RAW,RGB24}To{ARGB,RGBA}Row_SVE2
Unrolling gives a nice improvement to the little cores and even a small
improvement to the big cores thanks to avoiding the loop control
overhead.

Observed performance improvement relative to the existing SVE2 code.

                    | Cortex-A510 | Cortex-A720 | Cortex-X2
  RAWToARGBRow_SVE2 |      -28.4% |      -10.1% |     -3.5%
  RAWToRGBARow_SVE2 |      -28.5% |      -10.1% |     -4.4%
RGB24ToARGBRow_SVE2 |      -28.5% |      -10.4% |     -5.5%

Bug: libyuv:973
Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 16:01:56 +00:00
George Steed
4ad050b5ec [AArch64] Unroll {I422,I422Alpha}ToARGBRow_SVE2
Since the UV components are duplicated in I422 we end up wasting half of
the vector bandwidth processing the same elements twice. By unrolling
the kernel to process two vectors of Y per iteration we can fill a whole
vector of U/V components.

Rather than packing RGBA components into pairs during the narrowing we
now just narrow into individual component vectors and use ST4B instead.
This by itself is slower on some micro-architectures like Cortex-A510
but the benefit from unrolling significantly outweights this.

            | I422AlphaToARGBRow_SVE2 | I422ToARGBRow_SVE2
Cortex-A510 |                  -46.2% |             -48.8%
Cortex-A720 |                  -20.8% |             -21.0%
  Cortex-X2 |                  -11.3% |              -7.5%
  Cortex-X4 |                  -15.4% |             -15.5%

Bug: libyuv:973
Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 15:55:59 +00:00
George Steed
a64fffe632 Revert "Disable NV12ToARGB_SVE2 which fails the 'any' test"
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.

This code has a number of small issues:

* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
  all-true, however the existing kernel does not initialise p0 until
  after this macro is called, so flip the order.

* The p2 register is missing from the clobber list, so add it.

* The existing code uses the wrong condition flags when determining
  whether to do the tail iteration using WHILE instructions or not.
  Additionally the number of tail iterations is incorrect, as it was
  incorrectly not changed from when the tail code was always executed.

While we are here, make another few small improvements:

* Remove the single-quote digit separators as requested here:
  https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133

* Remove "volatile" from the asm block counting the vector length.  This
  particular asm block cannot be removed by the compiler since the
  output register is consumed by subsequent code, so "volatile" is
  unnecessary here and we remove it.

* Add some additional empty comments to force clang-format to put macros
  into the next line rather than on the same line as other asm.

Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-15 18:13:42 +00:00
George Steed
899bc48327 [AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD4 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementations:

            | ARGBToRAWRow | ARGBToRGB24Row
Cortex-A510 |       -50.8% |         -19.9%
Cortex-A720 |       -39.8% |         -39.1%
  Cortex-X2 |       -66.5% |         -51.9%

Bug: libyuv:973
Change-Id: Iaead678715a3d70d54cf823391272a6196836769
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 20:27:54 +00:00
George Steed
5236846b64 [AArch64] Keep UV interleaved in some *ToARGBRow_SVE2 kernels
The existing I4XXTORGB_SVE macro operates only on even byte lanes of the
loaded U/V vectors. This is sub-optimal since we are effectively wasting
half of the vector in any pre-processing steps before the conversion.
In particular, where the UV components are loaded from interleaved data
we can save a TBL instruction by maintaining the interleaved format.

This commit introduces a new NVTORGB_SVE macro to handle the case where
U/V components are interleaved into even/odd bytes of a vector,
mirroring a similar macro in the AArch64 Neon implementation.

Reduction in runtimes observed compared to the existing SVE2 code:

                   | Cortex-A510 | Cortex-A720 | Cortex-X2
NV12ToARGBRow_SVE2 |       -5.3% |       -0.2% |     -4.4%
NV21ToARGBRow_SVE2 |       -5.3% |       -0.2% |     -4.4%
UYVYToARGBRow_SVE2 |       -5.6% |        0.0% |     -4.6%
YUY2ToARGBRow_SVE2 |       -5.5% |       -0.1% |     -4.2%

Bug: libyuv:973
Change-Id: I418de2e684e0b6b0b9e41c39b564438531e44671
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 20:26:23 +00:00
George Steed
555f80f3ce [AArch64] Add SVE2 implementation of RGB24ToARGBRow
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.

Observed reduction in runtime compared to the existing Neon kernel:

Cortex-A510: -25.6%
Cortex-A720: -15.2%
  Cortex-X2: -10.2%
  Cortex-X4: -30.2%

Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-08 20:12:05 +00:00
George Steed
11ff6067a5 [AArch64] Add SVE2 implementation of RAWToRGB24Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD3 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementation:

Cortex-A510: -39.2%
Cortex-A720: -34.5%
  Cortex-X2: -31.0%

Bug: libyuv:973
Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 15:55:14 +00:00
George Steed
c613c3f102 [AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.

Reduction in runtimes observed compared to the existing Neon
implementation:

            | RAWToARGBRow | RAWToRGBARow
Cortex-A510 |       -32.4% |       -31.9%
Cortex-A720 |       -15.7% |       -15.6%
  Cortex-X2 |       -24.6% |       -24.4%

Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-06 22:40:15 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
George Steed
367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:06:46 +00:00
George Steed
cd4113f4e8 [AArch64] Add SVE2 implementation of I400ToARGBRow
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.

Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.

Reduction in runtime observed compared to the existing Neon code:

Cortex-A510: -26.1%
Cortex-A520:  -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
  Cortex-X2: -22.5%
  Cortex-X3: -23.5%
  Cortex-X4: -21.6%

Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:02:06 +00:00
George Steed
34abe98fe2 [AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.

Observed reductions in runtimes compared to the existing Neon code:

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -29.1% |             -29.1%
Cortex-A720 |              -4.8% |              -4.8%
  Cortex-X2 |              -9.2% |              -9.2%

Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:24:40 +00:00
George Steed
96bbdb53ed [AArch64] Add SVE2 implementation of I422ToRGBARow
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.

The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.

Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.

Reductions in runtime observed compared to the existing Neon code:

Cortex-A510: -27.0%
Cortex-A720:  -5.3%
  Cortex-X2: -14.7%

Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
004352ba16 [AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.

Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.

Observed reduction in runtime compared to the existing Neon code:

            | AYUVToUVRow | AYUVToVURow
Cortex-A510 |      -33.1% |      -33.0%
Cortex-A720 |      -25.1% |      -25.1%
  Cortex-X2 |      -59.5% |      -53.9%
  Cortex-X4 |      -39.2% |      -39.4%

Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
d0da5a3298 [AArch64] Add SVE2 implementation of ARGB1555ToARGBRow
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.

Observed reduction in runtime relative to the existing Neon code:

Cortex-A510: -69.7%
Cortex-A720:  -7.7%
  Cortex-X2: -41.9%
  Cortex-X4: -14.5%

Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
bce3392830 [AArch64] Add SVE2 implementation of ARGBToRGB565Row
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -27.1%
Cortex-A720: -49.4%
  Cortex-X2: -67.9%

Bug: libyuv:973
Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 17:42:27 +00:00
George Steed
a114f85e50 [AArch64] Fix naming in ARGBToUVMatrixRow_SVE2 etc constants
Avoid abbreviations and capitalize ARGB and UV naming, as suggested
here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537

Bug: libyuv:973
Change-Id: I0d0143154594c03e6aca7c859b874e39634ca54f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5513544
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-03 17:25:14 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
ce32eb773f [AArch64] Avoid extraneous CMP in I{444,422}ToARGBRow_SVE2 impl
We can use subs to set condition flags as part of the subtract, no need
for a separate compare instruction. No performance difference observed
from this change, but it now matches the other SVE2 kernels.

Also remove unnecessary volatile from asm blocks.

Bug: libyuv:973
Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:56:22 +00:00
George Steed
f483007b9a [AArch64] Add SVE implementation for I422AlphaToARGBRow
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -32.1%
Cortex-A720:  -5.1%
  Cortex-X2: -10.1%

Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:07 +00:00
George Steed
b53b27d6bf [AArch64] Add SVE implementation for I444AlphaToARGBRow
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -34.2%
Cortex-A720: -17.6%
  Cortex-X2:  -9.6%

Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:02 +00:00
George Steed
6ac90403a1 [AArch64] Add SVE implementation for I422ToARGBRow
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -25.0%
Cortex-A720:  -5.0%
  Cortex-X2: -10.8%

Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-27 18:26:11 +00:00
George Steed
e52007eff9 [AArch64] Add SVE2 implementation for I444ToARGBRow
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:

* For the Y component calculation we are able to use UMULH, versus the
  existing UMULL x2 + UZP2 sequence in Neon.

* For the RGBTORGBA8 calculation we are able to take advantage of
  interleaving narrowing instructions, allowing us to use ST2 rather
  than ST4 for the store. This is a big performance win on some
  micro-architectures where ST4 is costly.

* The use of predication means we do not need to add "any" kernels, we
  can simply rerun the calculation with a not-full predicate for the
  final iteration.

To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.

Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:

Cortex-A510: -36.5%
Cortex-A720: -17.3%
  Cortex-X2: -11.3%

Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:11:01 +00:00
George Steed
9a8be20def [AArch64] Add :libyuv_sve library in preparation for SVE kernels
This commit only adds the bare minimum to get the new library building
through GN, the actual content of row_sve.cc is empty for now until we
start porting some kernels across.

Bug: libyuv:973
Change-Id: Ibdf4fc258761f3e507d700f27a405099c667ac75
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424738
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:10:01 +00:00