George Steed 1b2f6cdbe8 [AArch64] Unroll I210ToAR30Row_{SVE2,SME}
Now that we have a STOREAR30_SVE_2X implementation, we can use it to
unroll other kernels. The predication in I210ToAR30Row needs adjusting
to load two vectors of Y for each vector of U/V, and UZP is needed so
that the data arrangement in vector lanes matches the U/V layout.
LD2H could be used instead, but it provides no performance improvement
on most cores and would require adding an "any" kernel to handle the
case where width % 2 != 0.
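
The lane arrangement described above can be illustrated with a scalar
model. This is only a sketch under stated assumptions, not the kernel
from this change: kLanes, Uzp1Model, Uzp2Model and ActiveLanes are
hypothetical names, the vector length is fixed at 8 lanes for
readability, and the way the real kernel pairs UZP outputs with U/V is
assumed from the 4:2:2 layout (one U/V sample per two Y samples).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vector length in 16-bit lanes; real SVE lengths vary. */
enum { kLanes = 8 };

/* Scalar model of UZP1: even-numbered elements of the pair lo:hi. */
static void Uzp1Model(const uint16_t* lo, const uint16_t* hi,
                      uint16_t* out) {
  for (int i = 0; i < kLanes / 2; ++i) {
    out[i] = lo[2 * i];
    out[i + kLanes / 2] = hi[2 * i];
  }
}

/* Scalar model of UZP2: odd-numbered elements of the pair lo:hi. */
static void Uzp2Model(const uint16_t* lo, const uint16_t* hi,
                      uint16_t* out) {
  for (int i = 0; i < kLanes / 2; ++i) {
    out[i] = lo[2 * i + 1];
    out[i + kLanes / 2] = hi[2 * i + 1];
  }
}

/* Scalar model of a WHILELT predicate: lane i is active when
 * start + i < limit. Returns the number of active lanes. */
static int ActiveLanes(int start, int limit) {
  assert(start >= 0 && limit >= 0);
  int n = limit - start;
  if (n < 0) n = 0;
  if (n > kLanes) n = kLanes;
  return n;
}
```

With two consecutive Y vectors as lo and hi, lane i of the UZP1 result
holds Y[2i] and lane i of the UZP2 result holds Y[2i+1], so both line
up with lane i of a single U/V vector. In this model an odd width such
as 13 gives ActiveLanes(0, 13) and ActiveLanes(8, 13) active Y lanes
and ActiveLanes(0, (13 + 1) / 2) active U/V lanes, which is one way
predication can cover width % 2 != 0 without a separate "any" kernel.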

Change in run time of I210ToAR30Row_SVE2 compared to the previous SVE2
implementation (negative is faster; even where a slowdown is observed,
the SVE2 implementation still outperforms the existing Neon code):

Cortex-A510: -37.1%
Cortex-A520: -39.1%
Cortex-A710: +1.6% (!)
Cortex-A715: +6.5% (!)
Cortex-A720: +6.5% (!)
  Cortex-X2: -2.9%
  Cortex-X3: -2.2%
  Cortex-X4: -8.8%
Cortex-X925: -3.5%

Change-Id: I2ff285b48105883526eceb8be1fcbe0e033a553b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640989
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-06-12 14:10:21 -07:00
libyuv [AArch64] Unroll I210ToAR30Row_{SVE2,SME} 2025-06-12 14:10:21 -07:00
libyuv.h NV12 Copy, include scale_uv.h 2020-12-08 18:54:16 +00:00