George Steed 11c57f4f12 [AArch64] Add Neon implementation of ScaleRowDown2_16_NEON
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.
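
As an illustration only, the unrolling described above can be sketched in portable C. This is not the Neon code added by this change. The function name `scale_row_down2_16_sketch` is hypothetical, and two points are assumptions: that the row function keeps the second 16-bit sample of each input pair, and that "32 elements per iteration" counts output elements.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch, not the libyuv implementation.
 * Assumption: the 2x downscale keeps the odd-indexed 16-bit samples. */
static void scale_row_down2_16_sketch(const uint16_t* src,
                                      uint16_t* dst,
                                      int dst_width) {
  int x = 0;
  /* Main loop: 32 output elements (64 input elements) per iteration.
   * Assumption: this mirrors the unroll factor named in the message. */
  for (; x + 32 <= dst_width; x += 32) {
    for (int i = 0; i < 32; ++i) {
      dst[x + i] = src[2 * (x + i) + 1];
    }
  }
  /* Tail loop: handle widths that are not a multiple of 32. */
  for (; x < dst_width; ++x) {
    dst[x] = src[2 * x + 1];
  }
}
```

Calling the sketch with `dst_width` set to 40 on an 80-element input runs one full 32-element iteration followed by 8 tail elements. A Neon version would replace the inner loop with vector loads, an unzip of the odd lanes, and vector stores.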

Change in runtime relative to the auto-vectorized C implementation
compiled with LLVM 19 (negative is faster):

 Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
 Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
  Cortex-X1: -48.5%
  Cortex-X2: -47.8%
  Cortex-X3: -47.6%
  Cortex-X4: -51.1%
Cortex-X925: -14.6%

Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-25 21:10:05 +00:00
libyuv [AArch64] Add Neon implementation of ScaleRowDown2_16_NEON 2024-11-25 21:10:05 +00:00
libyuv.h NV12 Copy, include scale_uv.h 2020-12-08 18:54:16 +00:00