[AArch64] Replace SHRN{,2} pair by UZP2 in DivideRow_16_NEON

Narrowing shift instructions have worse throughput than permute
instructions on some micro-architectures. Each SHRN/SHRN2 pair can be
replaced by a single UZP2, which takes the high half of each 32-bit
lane directly.
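
The equivalence can be checked with a scalar model of the two sequences. This is only an illustrative sketch, not libyuv code: the types `vec4s` and `vec8h` and all helper names are hypothetical, and the model assumes the usual little-endian lane layout, where 32-bit lane i overlaps 16-bit lanes 2i (low half) and 2i+1 (high half).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical scalar models of one 128-bit NEON register.
typedef struct { uint32_t s[4]; } vec4s;  // viewed as 4 x 32-bit lanes
typedef struct { uint16_t h[8]; } vec8h;  // viewed as 8 x 16-bit lanes

// Old sequence: shrn vd.4h, lo.4s, #16 followed by shrn2 vd.8h, hi.4s, #16.
static vec8h shrn_pair_16(vec4s lo, vec4s hi) {
  vec8h out;
  for (int i = 0; i < 4; i++) {
    out.h[i] = (uint16_t)(lo.s[i] >> 16);      // shrn fills the low half
    out.h[i + 4] = (uint16_t)(hi.s[i] >> 16);  // shrn2 fills the high half
  }
  return out;
}

// Reinterpret 4 x 32-bit lanes as 8 x 16-bit lanes (assumed lane layout).
static vec8h as_8h(vec4s v) {
  vec8h out;
  for (int i = 0; i < 4; i++) {
    out.h[2 * i] = (uint16_t)(v.s[i] & 0xffff);
    out.h[2 * i + 1] = (uint16_t)(v.s[i] >> 16);
  }
  return out;
}

// New sequence: uzp2 vd.8h, vn.8h, vm.8h takes the odd-numbered
// 16-bit elements of vn, then the odd-numbered elements of vm.
static vec8h uzp2_8h(vec8h n, vec8h m) {
  vec8h out;
  for (int i = 0; i < 4; i++) {
    out.h[i] = n.h[2 * i + 1];
    out.h[i + 4] = m.h[2 * i + 1];
  }
  return out;
}

// Returns 1 when both sequences produce the same 8 x 16-bit result.
static int sequences_match(vec4s lo, vec4s hi) {
  vec8h a = shrn_pair_16(lo, hi);
  vec8h b = uzp2_8h(as_8h(lo), as_8h(hi));
  return memcmp(a.h, b.h, sizeof a.h) == 0;
}

// Small self-check over a few representative 32-bit products.
static void self_check(void) {
  vec4s lo = {{0x091A0000u, 0xFFFFFFFFu, 0x0001FFFFu, 0x80000000u}};
  vec4s hi = {{0x00000000u, 0x12345678u, 0xFFFF0000u, 0x0000FFFFu}};
  assert(sequences_match(lo, hi));
}
```

Because the odd-numbered 16-bit elements are exactly the high halves of the 32-bit products, one UZP2 yields the same eight results as the SHRN/SHRN2 pair.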

Change in runtime for DivideRow_16_NEON:

 Cortex-A55:  -6.2%
Cortex-A510: -30.0%
 Cortex-A76: -11.9%
  Cortex-X2: -46.8%

Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
George Steed 2024-03-13 16:40:33 +00:00 committed by Frank Barchard
parent 53b65220da
commit 4f52235a67

@@ -4680,10 +4680,8 @@ void DivideRow_16_NEON(const uint16_t* src_y,
 "umull v2.4s, v3.4h, v4.4h \n"
 "umull2 v3.4s, v3.8h, v4.8h \n"
 "prfm pldl1keep, [%0, 448] \n"
-"shrn v0.4h, v0.4s, #16 \n"
-"shrn2 v0.8h, v1.4s, #16 \n"
-"shrn v1.4h, v2.4s, #16 \n"
-"shrn2 v1.8h, v3.4s, #16 \n"
+"uzp2 v0.8h, v0.8h, v1.8h \n"
+"uzp2 v1.8h, v2.8h, v3.8h \n"
 "stp q0, q1, [%1], #32 \n" // store 16 pixels
 "subs %w2, %w2, #16 \n" // 16 src pixels per loop
 "b.gt 1b \n"