libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-08-01 09:16:25 +08:00

History

libyuv	[AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow	2024-06-13 22:06:46 +00:00
libyuv.h	NV12 Copy, include scale_uv.h	2020-12-08 18:54:16 +00:00

George Steed 367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow

This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.
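
As an illustration of the de-interleave described above, here is a minimal scalar sketch in C. It is not the SVE2 kernel from this change. It assumes the standard packed 4:2:2 byte orders (YUY2 is Y0 U Y1 V, UYVY is U Y0 V Y1), and the names PackedFormat and deinterleave_row are hypothetical, not libyuv API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical names for illustration only; not part of libyuv.
typedef enum { PACKED_YUY2, PACKED_UYVY } PackedFormat;

// Split a packed 4:2:2 row into a Y array and an interleaved UV array.
// Each 4-byte element of src covers two pixels:
//   YUY2: Y0 U Y1 V      UYVY: U Y0 V Y1
static void deinterleave_row(const uint8_t* src, PackedFormat fmt,
                             uint8_t* y, uint8_t* uv, size_t width) {
  assert(width % 2 == 0);  // this sketch handles whole pixel pairs only
  const size_t y_off = (fmt == PACKED_YUY2) ? 0 : 1;
  const size_t c_off = 1 - y_off;
  for (size_t x = 0; x < width; x += 2) {
    const uint8_t* p = src + 2 * x;  // 2 bytes per pixel on average
    y[x] = p[y_off];
    y[x + 1] = p[y_off + 2];
    uv[x] = p[c_off];      // U shared by both pixels
    uv[x + 1] = p[c_off + 2];  // V shared by both pixels
  }
}
```

A vector kernel would do the same selection with a table lookup over a whole register rather than one byte at a time; only the byte offsets differ between the two formats.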

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE because at a vector length
of 2048 bits all possible byte values (0-255) are valid indices into the
vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.
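
The reasoning in the paragraph above can be modelled in scalar C. This is a rough sketch, not the code from this change: it assumes that a byte-wise TBL returns zero for any index at or beyond the number of bytes in the vector (16 for Neon, 16 to 256 for SVE), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Scalar model of a byte-wise TBL over one vector of vl_bytes bytes.
// An index at or beyond vl_bytes selects zero.
static void tbl_bytes_model(const uint8_t* table, const uint8_t* idx,
                            uint8_t* out, size_t vl_bytes) {
  assert(vl_bytes >= 16 && vl_bytes <= 256);
  for (size_t i = 0; i < vl_bytes; ++i) {
    out[i] = (idx[i] < vl_bytes) ? table[idx[i]] : 0;
  }
}

// Nonzero if some byte index (0-255) is out of range for this vector
// length, i.e. if the "out-of-range index gives zero" trick is usable.
static int has_zeroing_index(size_t vl_bytes) {
  return vl_bytes <= UINT8_MAX;  // false only at 256 bytes (2048 bits)
}

// Widening multiply that reads only the low byte of a 16-bit lane, so
// the high byte never has to be cleared beforehand.
static uint16_t mul_low_byte(uint16_t lane, uint8_t coeff) {
  return (uint16_t)((lane & 0xffu) * coeff);  // at most 255 * 255
}
```

With 16-byte vectors an index of 255 yields zero, but with 256-byte vectors the same index selects the last byte of the table, so the zero-extension cannot be folded into the lookup. Multiplying from the low byte only makes the contents of the high byte irrelevant.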

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>

2024-06-13 22:06:46 +00:00
