George Steed 3d66e94fb5 [AArch64] Improve ARGBToUVRow_SVE2 and related kernels
This commit reworks the implementation of ARGBToUVMatrixRow_SVE2, using
an approach similar to that recently used in
61bdaee13a701d2b52c6dc943ccc5c888077a591.

In particular we can rework these SVE2 implementations to use 8-bit
dot-product instructions instead of 16-bit, allowing us to process more
data in a single vector.

To ensure that the input values fit in 8-bits, negate the UV constants
arrays passed to the kernel and undo the now-unnecessary flipping of the
middle two component values.

This commit mostly reverses the performance inversion where the Neon
I8MM implementation was previously faster than the SVE2 implementation.
The reduction in runtime observed compared to the existing Neon I8MM
implementation is now:

Cortex-A510:  +5.6% (!)
Cortex-A520:  -3.0%
Cortex-A710: -12.6%
Cortex-A715: -10.9%
Cortex-A720: -10.8%
  Cortex-X2:  -3.8%
  Cortex-X3: -10.3%
  Cortex-X4:  -9.5%
Cortex-X925:  -6.7%

Change-Id: I30253976dc8e3651cfb5fd39b63a6763975d41e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640990
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-06-12 14:10:44 -07:00
..
compare_common.cc clang-tidy applied 2021-04-01 21:42:47 +00:00
compare_gcc.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
compare_msa.cc use unix line endings 2018-06-20 23:19:59 +00:00
compare_neon64.cc ARGBToUV 64 bit use ymm8 for shuffler 2025-05-12 15:09:40 -07:00
compare_neon.cc Apply format with no code changes 2025-02-24 23:57:01 -08:00
compare_win.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
compare.cc [AArch64] Add Neon implementation of HashDjb2 2024-05-01 19:37:31 +00:00
convert_argb.cc Add SVE2 and SME implementations of I422ToAR30Row 2025-05-27 11:39:00 -07:00
convert_from_argb.cc Add Neon I8MM implementations of ARGB to UV and variants 2025-05-12 11:14:00 -07:00
convert_from.cc Sub sampling conversions use CopyPlane for Y channel 2025-01-02 13:34:11 -08:00
convert_jpeg.cc PlaneScale, UVScale and ARGBScale test 3x and 4x down sample. 2020-10-28 20:41:59 +00:00
convert_to_argb.cc Apply clang format 2025-01-02 13:31:20 -08:00
convert_to_i420.cc Apply clang format 2025-01-02 13:31:20 -08:00
convert.cc Add Neon I8MM implementations of ARGB to UV and variants 2025-05-12 11:14:00 -07:00
cpu_id.cc Detect SME without SVE dependency 2025-03-31 17:27:40 -07:00
mjpeg_decoder.cc Add AMXINT8 cpu detect 2024-02-15 21:44:47 +00:00
mjpeg_validate.cc Update to r1732 for more robust jpeg 2019-07-01 22:32:36 +00:00
planar_functions.cc Add Neon implementation of Convert8To16Row 2025-05-29 13:37:48 -07:00
rotate_any.cc [AArch64] Fix rotate by odd sizes 2024-07-15 18:13:31 +00:00
rotate_argb.cc Apply clang format 2025-01-02 13:31:20 -08:00
rotate_common.cc [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON 2024-05-21 07:46:42 +00:00
rotate_gcc.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
rotate_lsx.cc [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON 2024-05-21 07:46:42 +00:00
rotate_msa.cc cpuid show vector length on ARM and RISCV 2024-07-02 18:10:56 +00:00
rotate_neon64.cc Apply format with no code changes 2025-02-24 23:57:01 -08:00
rotate_neon.cc Apply format with no code changes 2025-02-24 23:57:01 -08:00
rotate_sme.cc [AArch64] Re-enable SME only for Linux and new versions of Clang 2024-09-23 09:29:53 +00:00
rotate_win.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
rotate.cc [AArch64] Add SME implementation of CopyRow 2024-12-12 03:02:07 -08:00
row_any.cc ubsan compliant '_any' functions using ptrdiff_t for pointer math 2025-06-10 15:01:52 -07:00
row_common.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
row_gcc.cc ARGBToUV 64 bit use ymm8 for shuffler 2025-05-12 15:09:40 -07:00
row_lasx.cc Fix unified sources build for LoongArch LASX 2025-04-01 09:48:19 -07:00
row_lsx.cc Fix unified sources build for LoongArch LASX 2025-04-01 09:48:19 -07:00
row_msa.cc Fix Bugs on mips platform V2. 2022-03-01 13:16:31 +00:00
row_neon64.cc TestI400LargeSize test __x86_64__, _M_X64, or __aarch64__ 2025-06-10 15:53:02 -07:00
row_neon.cc ARGBToUV allow 32 bit x86 build 2025-04-28 12:11:00 -07:00
row_rvv.cc Apply clang format 2025-01-02 13:31:20 -08:00
row_sme.cc Add SVE2 and SME implementations of I422ToAR30Row 2025-05-27 11:39:00 -07:00
row_sve.cc [AArch64] Improve ARGBToUVRow_SVE2 and related kernels 2025-06-12 14:10:44 -07:00
row_win.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
scale_any.cc [AArch64] Unroll and use TBL in ScaleRowDown34_NEON 2024-09-16 15:37:27 +00:00
scale_argb.cc RVV disable 64 bit elements and vcombine_v 2025-03-25 12:51:25 -07:00
scale_common.cc [AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} 2024-12-12 03:03:41 -08:00
scale_gcc.cc ARGBToUV 64 bit use ymm8 for shuffler 2025-05-12 15:09:40 -07:00
scale_lsx.cc DetilePlane and unittest for NEON 2022-01-31 20:05:55 +00:00
scale_msa.cc Switch to C99 types 2018-01-23 19:16:05 +00:00
scale_neon64.cc Apply format with no code changes 2025-02-24 23:57:01 -08:00
scale_neon.cc Apply format with no code changes 2025-02-24 23:57:01 -08:00
scale_rgb.cc Apply clang format 2025-01-02 13:31:20 -08:00
scale_rvv.cc RVV disable 64 bit elements and vcombine_v 2025-03-25 12:51:25 -07:00
scale_sme.cc Apply clang format 2025-01-02 13:31:20 -08:00
scale_uv.cc Apply clang format 2025-01-02 13:31:20 -08:00
scale_win.cc ARGBToJ444 use 256 for fixed point scale UV 2025-02-27 13:04:15 -08:00
scale.cc J420ToI420 using planar 8 bit scaling 2025-01-22 02:50:24 -08:00
test.sh Optimze ABGRToI420 for AVX2 2020-06-04 18:24:45 +00:00
video_common.cc Lint cleanup after C99 change CL 2018-01-24 19:16:03 +00:00