[AArch64] Replace SHRN{,2} pair by UZP2 in DivideRow_16_NEON

Shift instructions have worse throughput than other permute instructions on
some micro-architectures, and we can avoid the need for two separate
narrowing instructions by taking the high halves of each lane directly
through use of the UZP2 instruction.

Reduction in runtime for DivideRow_16_NEON:

Cortex-A55: -6.2%
Cortex-A510: -30.0%
Cortex-A76: -11.9%
Cortex-X2: -46.8%

Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2026-02-16 07:09:53 +08:00 · 2024-03-13 16:40:33 +00:00 · 2024-03-13 16:40:33 +00:00 · 4f52235a67
commit 4f52235a67
parent 53b65220da
1 changed file with 2 additions and 4 deletions
--- a/source/row_neon64.cc
+++ b/source/row_neon64.cc
@@ -4680,10 +4680,8 @@ void DivideRow_16_NEON(const uint16_t* src_y,
      "umull       v2.4s, v3.4h, v4.4h           \n"
      "umull2      v3.4s, v3.8h, v4.8h           \n"
      "prfm        pldl1keep, [%0, 448]          \n"
-      "shrn        v0.4h, v0.4s, #16             \n"
-      "shrn2       v0.8h, v1.4s, #16             \n"
-      "shrn        v1.4h, v2.4s, #16             \n"
-      "shrn2       v1.8h, v3.4s, #16             \n"
+      "uzp2        v0.8h, v0.8h, v1.8h           \n"
+      "uzp2        v1.8h, v2.8h, v3.8h           \n"
      "stp         q0, q1, [%1], #32             \n"  // store 16 pixels
      "subs        %w2, %w2, #16                 \n"  // 16 src pixels per loop
      "b.gt        1b                            \n"
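
Why the replacement is safe: a narrowing shift right by exactly 16 keeps the upper 16 bits of each 32-bit lane, and those are the odd-numbered 16-bit lanes of the same register, which is what UZP2 collects from its two sources. The sketch below is a portable scalar model of that equivalence, not code from libyuv. The names `Vec4S`, `Vec8H`, `AsHalves`, `ShrnPair16`, `Uzp2`, `SameResult` and `SpotCheck` are hypothetical, and the model assumes the usual register lane numbering in which 32-bit lane i overlays 16-bit lanes 2i (low half) and 2i+1 (high half).

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4S = std::array<uint32_t, 4>;  // a 128-bit register viewed as .4s
using Vec8H = std::array<uint16_t, 8>;  // the same register viewed as .8h

// Reinterpret four 32-bit lanes as eight 16-bit lanes:
// lane 2i holds the low half of s[i], lane 2i+1 holds the high half.
Vec8H AsHalves(const Vec4S& v) {
  Vec8H h{};
  for (int i = 0; i < 4; ++i) {
    h[2 * i] = static_cast<uint16_t>(v[i] & 0xFFFFu);
    h[2 * i + 1] = static_cast<uint16_t>(v[i] >> 16);
  }
  return h;
}

// Model of "shrn d.4h, lo.4s, #16" followed by "shrn2 d.8h, hi.4s, #16":
// shift each 32-bit lane right by 16 and keep the low 16 bits of the result.
Vec8H ShrnPair16(const Vec4S& lo, const Vec4S& hi) {
  Vec8H d{};
  for (int i = 0; i < 4; ++i) {
    d[i] = static_cast<uint16_t>(lo[i] >> 16);      // shrn fills lanes 0..3
    d[i + 4] = static_cast<uint16_t>(hi[i] >> 16);  // shrn2 fills lanes 4..7
  }
  return d;
}

// Model of "uzp2 d.8h, n.8h, m.8h": odd lanes of n, then odd lanes of m.
Vec8H Uzp2(const Vec8H& n, const Vec8H& m) {
  Vec8H d{};
  for (int i = 0; i < 4; ++i) {
    d[i] = n[2 * i + 1];
    d[i + 4] = m[2 * i + 1];
  }
  return d;
}

// True when both instruction sequences produce the same eight 16-bit lanes.
bool SameResult(const Vec4S& lo, const Vec4S& hi) {
  return ShrnPair16(lo, hi) == Uzp2(AsHalves(lo), AsHalves(hi));
}

// Spot-check one hand-picked input pair; aborts if the sequences disagree.
void SpotCheck() {
  const Vec4S lo = {0x12345678u, 0xFFFF0000u, 0x0000FFFFu, 0x80000001u};
  const Vec4S hi = {0x00000000u, 0xDEADBEEFu, 0x00010000u, 0xFFFFFFFFu};
  assert(SameResult(lo, hi));
}
```

The equivalence depends on the shift amount matching the 16-bit half-lane width. For any other shift count the selected bits straddle two 16-bit lanes and UZP2 is not a drop-in substitute.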