195 Commits

Author SHA1 Message Date
George Steed
d1ec694ad3 [AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

              | Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow |     -59.8% |      -16.8% |     -53.2%
P210ToAR30Row |     -48.1% |      -21.8% |     -54.0%
P410ToARGBRow |     -56.5% |      -32.2% |     -54.1%
P410ToAR30Row |     -42.4% |       -4.5% |     -50.4%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-06 22:37:08 +00:00
Frank Barchard
611806a155 [AArch64] Fix SVE/SME vector length printing in cpuid
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.

- Add volatile asm in row_gcc and row_neon64

Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 19:44:41 +00:00
George Steed
d32436e8f8 [AArch64] Add Neon implementation for I422ToAR30Row_NEON
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 Cortex-A55: -43.1%
Cortex-A510: -22.3%
 Cortex-A76: -54.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 18:16:33 +00:00
George Steed
bbd9cedc4f [AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToAR30Row | I210ToARGBRow
 Cortex-A55 |        -40.8% |        -54.4%
Cortex-A510 |        -26.2% |        -22.7%
 Cortex-A76 |        -49.2% |        -44.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-02 18:16:33 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
George Steed
a758a15dbf [AArch64] Add I8MM implementation of ARGBColorMatrixRow
We cannot use the standard dot-product instructions since the matrix of
coefficients are signed, but I8MM supports mixed-sign products which
work well here.

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -50.8%
Cortex-A520: -33.3%
Cortex-A715: -38.6%
Cortex-A720: -38.5%
  Cortex-X2: -43.2%
  Cortex-X3: -40.0%
  Cortex-X4: -55.0%

Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:17:59 +00:00
George Steed
89cf221baa [AArch64] Avoid unnecessary widening in I422ToARGB1555Row_NEON
The existing code first widens the component vectors from 8-bit elements
to 16-bits to construct the final ARGB1555 result, however this is
unnecessary since the inputs to the widening are themselves the result
of having just been narrowed in the RGBTORGB8 macro.

By making use of the new RGBTORGB8_TOP macro we can get rid of both the
widening as well as the prior narrowing step.

Also remove volatile from the asm, it is unnecessary.

Reduction in runtime observed for I422ToARGB1555Row_NEON:

 Cortex-A55:  -7.8%
 Cortex-A76: -15.0%
Cortex-A720: -20.3%
  Cortex-X1: -20.2%
  Cortex-X2: -20.3%

Bug: libyuv:976
Change-Id: Id031c5d4d788828297adcc2fe2c2cd8d99b45433
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5616050
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-11 23:36:13 +00:00
George Steed
e6c4b9ad2e [Arm][AArch64] Remove unused ARGBToUVJ444Row_NEON definition
There is no corresponding declaration in a header file and it appears to
be unused, so remove from both the Arm and AArch64.

Change-Id: I4de9fb7ce8e8dff6e76f4a99fdd93c743f92bf18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5587507
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-10 18:36:31 +00:00
George Steed
3f657221f0 [AArch64] Remove unused vars in I{210,410}{,Alpha}ToARGBRow_NEON
The elements of the YUV constants are passed directly to the inline asm
block, so no need to pull them out into variables first.

Also remove "volatile" from inline asm blocks, it is unnecessary.

Bug: 344998222
Change-Id: I7d97dec8c7495651e5a31c10eda2d4aeed36fe6a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598764
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-07 02:39:20 +00:00
George Steed
dff7bad43d [AArch64] Use full Neon vectors in ARGB4444ToARGBRow_NEON
The existing Neon code narrows the input 16-bit packed data to 8-bit
elements and separates the color channels, causing us to only process
half a Neon vector per instruction for the channel widening from 4-bit
color data to 8-bits.

We can note that the processing being done is identical for all color
channels and therefore we can keep them partially interleaved during the
widening step. This allows us to use full Neon vectors for the whole
loop body.

Reductions in runtimes observed for ARGB4444ToARGBRow_NEON:

 Cortex-A55: -30.7%
Cortex-A510: -44.3%
 Cortex-A76: -51.6%
  Cortex-X2: -54.2%

Bug: libyuv:976
Change-Id: I9d9cda7e16eb07619c6d7f1de2e6b8c0fb6d64cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:52:33 +00:00
George Steed
7633c818ec [AArch64] Remove pointless MOVI in ARGB1555ToARGBRow_NEON
This function takes the alpha component from the loaded data rather than
hard-coding it to 255, so initialising v3 to 255 is unused here.

Change-Id: I668825e0eeb317d1365035ce3bb47f3d92081c6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594388
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:47:01 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
George Steed
e348995a92 [AArch64] Optimize MergeXR30Row_10_NEON
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.

We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.

Reduction in runtime observed for MergeXR30Row_10_NEON:

 Cortex-A55: -34.2%
Cortex-A510: -35.6%
 Cortex-A76: -44.9%
  Cortex-X2: -48.3%

Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:32:55 +00:00
George Steed
56258c125b [AArch64] Avoid redundant shift around RGB565 conversion
The existing code performs a narrowing shift right (in RGBTORGB8)
followed by a widening left shift (in ARGBTORGB565). This is redundant
since we could have simply not performed a narrowing operation to begin
with and instead done a saturating left shift to saturate against the
top of the 16-bit lanes rather than the narrowed 8-bit lanes.

To enable this we introduce new RGBTORGB8_TOP and ARGBTORGB565_FROM_TOP
macros which produce and consume values from the high half of each
16-bit lane rather than a narrowed 8-bit intermediate.

Reduction in runtime for selected kernels:

                     | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
I422ToRGB565Row_NEON |     -10.8% |       -6.1% |     -17.2% |    -23.6%
NV12ToRGB565Row_NEON |     -11.4% |       -4.9% |     -20.4% |    -17.4%

Bug: libyuv:976
Change-Id: I3337b8f41ff62a7af1b70a56b774239bdb55d0f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509197
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:29:54 +00:00
George Steed
c5f9583b1c [AArch64] Avoid extracting alpha in ARGB1555ToYRow_NEON
The existing implementation of this kernel uses the ARGB1555TOARGB macro
which extracts and sign-extends the alpha component into v3, however
this particular kernel does not need the alpha component. We can avoid
calculating the alpha component completely by using the existing
RGB555TOARGB macro, so use that instead.

Reduction in runtimes observed for ARGB1555ToYRow_NEON (no noticeable
improvement observed on Cortex-A510):

Cortex-A55: -3.6%
Cortex-A76: -20.9%
 Cortex-X2: -15.1%

Bug: libyuv:976
Change-Id: I2cf2729c8297c53dcd32d0df28e64d4d5c7f6def
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509200
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 08:28:27 +00:00
George Steed
d0c28db56c [AArch64] Optimize Merge{ARGB,XRGB}16To8Row_NEON
Rather than shifting the data into the low half of each lane and then
using a saturating narrowing operation, we can do the saturation as part
of a shift into the highest half of the lane and then use a simpler TRN2
instruction to extract pairs of high halves into full vectors. This also
has the nice advantage of allowing us to use ST2 rather than ST4 for
storing the result, since ST4 is known to be slow on some
micro-architectures.

Reduction in runtimes observed for the two kernels:

             | MergeARGB16To8Row_NEON | MergeXRGB16To8Row_NEON
  Cortex-A55 |                  -8.0% |                 -12.2%
 Cortex-A510 |                 -29.9% |                 -31.4%
  Cortex-A76 |                 -29.0% |                 -32.0%
   Cortex-X2 |                 -33.5% |                 -43.4%

Bug: libyuv:976
Change-Id: I9da3beedc27ab43527b3642aa6d4decf3b5b6683
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509198
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-21 07:55:03 +00:00
George Steed
9fac9a4a82 [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row
There are existing x86 implementations for these kernels but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | ABGRToAR30Row | ARGBToAR30Row
 Cortex-A55 |        -55.1% |        -55.1%
Cortex-A510 |        -39.3% |        -40.1%
 Cortex-A76 |        -62.3% |        -63.6%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:35:07 +00:00
George Steed
83c48c782a [AArch64] Improve ARGB4444TOARGB using SRI instructions
Also avoid constructing the alpha component when it isn't needed by
introducing a new ARGB4444TORGB macro.

Reduction in runtime for selected kernels:

                       | Cortex-A55 | Cortex-A510 | Cortex-A76
ARGB4444ToARGBRow_NEON |     -27.5% |      -27.9% |     -29.1%
  ARGB4444ToUVRow_NEON |     -20.2% |      -25.2% |     -21.7%
   ARGB4444ToYRow_NEON |     -16.0% |      -20.2% |     -21.3%

Bug: libyuv:976
Change-Id: Ida061e1c49ba228b02c2f691a067b58edad073a8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509196
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:29:11 +00:00
George Steed
5618a5c762 [AArch64] Use REV16 rather than TBL in SwapUVRow_NEON
We don't need a general-purpose purmute here, REV16 does exactly what we
want and saves us needing to load the permute indices array.

Bug: libyuv:976
Change-Id: Ib3bc2e4d21b00d53aeda6a11c6e6f1016ca6029e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509201
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-21 07:26:54 +00:00
George Steed
1eae2efbc7 [AArch64] Use LD1/ST1 rather than LD4/ST4 in ARGBShadeRow_NEON
The use of LD4 and ST4 to de-interleave ARGB color channels is
unnecessary here since we can just adjust the scale multiplicand to
match the interleaved layout. LD4 and ST4 are known to perform poorly on
some micro-architectures so using LD1 and ST1 here should be preferred.

Reduction in runtime for ARGBShadeRow_NEON:

  Cortex-A55: -19.9%
 Cortex-A510: -50.8%
  Cortex-A76: -36.0%
   Cortex-X2: -46.4%

Bug: libyuv:976
Change-Id: I10a0e6a0a62242826d39b1e963063770f084226a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494093
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-30 00:48:35 +00:00
George Steed
356232b687 [AArch64] Replace UQXTN{,2} with UZP2 in Convert16To8Row_NEON
The existing code makes use of a pair of shifts to put the bits we want
in the low part of each vector lane and then a pair of UQXTN and UQXTN2
instructions to perform a saturating cast down from 16-bit elements to
8-bit elements. We can instead achieve the same thing by adding eight to
the first shift amount so that the bits we want appear in the high half
of the lane, doing the saturation at the same time, and then simply use
UZP2 to pull out the high halves of each lane in a single instruction.

Reduction in runtime for Convert16To8Row_NEON:

  Cortex-A55: -19.7%
 Cortex-A510: -23.5%
  Cortex-A76: -35.4%
   Cortex-X2: -34.1%

Bug: libyuv:976
Change-Id: I9a80c0f4f2c6b5203f23e422c0970d3167052f91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-25 21:23:55 +00:00
George Steed
4f52235a67 [AArch64] Replace SHRN{,2} pair by UZP2 in DivideRow_16_NEON
Shift instructions have worse throughput than other permute instructions
on some micro-architectures, and we can avoid the need for two separate
narrowing instructions by taking the high halves of each lane directly
through use of the UZP2 instruction.

Reduction in runtime for DivideRow_16_NEON:

 Cortex-A55:  -6.2%
Cortex-A510: -30.0%
 Cortex-A76: -11.9%
  Cortex-X2: -46.8%

Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-25 21:21:52 +00:00
George Steed
9e223c3fc0 [AArch64] Replace instances of ORR with MOV where possible
The MOV instruction is an alias of ORR where both registers are the
same and should be preferred.

Both ORR and MOV are not zero-cost instructions on all
micro-architectures so there may be better ways to express these
kernels, but this is left for a later commit.

Bug: libyuv:975
Change-Id: I29b7f182a57a61855cb7f8a867691080f153b10b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-25 20:48:16 +00:00
George Steed
4838e7a194 [AArch64] Load full vectors in ARGB{Add,Subtract}Row
Using full vectors for Add and Subtract is a win across the board. Using
full vectors for the multiply is less obviously a win, especially for
smaller cores like Cortex-A53 or Cortex-A57, so is not considered for
this change.

Observed changes in performance with this change compared to the
existing Neon code:

            | ARGBAddRow_NEON | ARGBSubtractRow_NEON
 Cortex-A55 |           -5.1% |                -5.1%
Cortex-A510 |          -18.4% |               -18.4%
 Cortex-A76 |          -28.9% |               -28.7%
Cortex-A720 |          -36.1% |               -36.2%
  Cortex-X1 |          -14.2% |               -14.4%
  Cortex-X2 |          -12.5% |               -12.5%

Bug: libyuv:976
Change-Id: I85316d4399c93b53baa62d0d43b2fa453517f5b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457433
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-18 19:02:43 +00:00
George Steed
90070986ae [AArch64] Improve RGB565TOARGB using SRI instructions
The existing code performs a lot of shifts and combines the R and B
components into a single vector unnecessarily. We can express this much
more cleanly by making use of the SRI instruction to insert and replace
shifted bits into the original data, performing the 5/6-bit to 8-bit
expansion in a single instruction if the source bits are already in the
high bits of the byte. We still need a single separate XTN instruction
to narrow the B component before the left shift since Neon does not have
a narrowing left shift instruction.

Reduction in runtime for selected kernels:

               Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
    RGB565ToYRow_NEON |     -22.1% |     -23.4% |    -25.1%
   RGB565ToUVRow_NEON |     -26.8% |     -20.5% |    -18.8%
 RGB565ToARGBRow_NEON |     -38.9% |     -32.0% |    -23.5%

Bug: libyuv:976
Change-Id: I77b8d58287b70dbb9549451fc15ed3dd0d2a4dda
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5374286
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-04-18 19:01:26 +00:00
George Steed
1ca7c4e1cc [AArch64] Avoid lane-indexed loads for UV when loading I444/I422
Most micro-architectures seem to prefer an additional ZIP1 instruction
in READYUV422 to needing a lane-indexed LD1 load instruction.

We introduce a new macro to handle the YUV to RGB conversion where the U
and V components are in separate vectors. This avoids causing a slowdown
for the UV-interleaved input format kernels (NV12 and NV21) where we do
not want to separate them.

Reduction in runtime for selected kernels on Cortex cores (no
performance difference observed on Cortex-A55):

                           A510     A76    A720      X1      X2
 I422AlphaToARGBRow_NEON  -4.3%   -7.3%  -10.1%   -4.0%   -4.4%
  I422ToARGB1555Row_NEON  -4.5%   +0.4%   -7.9%   -4.8%   -3.9%
  I422ToARGB4444Row_NEON  -7.7%   -2.6%   -4.1%   -1.9%   -1.3%
      I422ToARGBRow_NEON  -3.7%   -2.9%  -10.2%   -3.8%   -4.4%
     I422ToRGB24Row_NEON  -5.9%   +5.4%   -3.2%   -4.3%   -4.3%
    I422ToRGB565Row_NEON  -4.8%   -2.8%   -8.5%   -3.8%   -4.6%
      I422ToRGBARow_NEON  -3.7%   +4.6%  -10.5%   -3.0%   -4.5%
 I444AlphaToARGBRow_NEON  -3.5%   +2.7%   -3.7%   -5.0%   -8.2%
      I444ToARGBRow_NEON  -1.8%  -15.1%   -3.5%   -6.5%   -8.1%
     I444ToRGB24Row_NEON  -2.0%   -6.8%   +0.1%   -4.7%   +1.2%

There are a few cases which are slower on Cortex-A76, but significant
speedups elsewhere.

Bug: libyuv:976
Change-Id: Ib3b4ef81f7bfc1d7ff9c4c24aef9ad86741410ff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465580
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-18 18:46:59 +00:00
George Steed
bfedc8bc11 [AArch64] Improve ARGB{,1}555TOARGB using SRI instructions
The existing transformations can be more cleanly expressed by using SRI
instructions to perform a shift and simultaneously merge in to an
existing value.

Reduction in runtime for selected kernels:

                Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
   ARGB1555ToYRow_NEON |     -26.2% |     -14.9% |    -28.2%
  ARGB1555ToUVRow_NEON |     -25.2% |     -18.4% |    -20.9%
ARGB1555ToARGBRow_NEON |     -43.6% |     -32.8% |    -19.7%

Bug: libyuv:976
Change-Id: Id07ac6f2cd3eb9bb70f9e29fc1f4b29fe26156ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383444
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-18 18:46:10 +00:00
George Steed
95b0a3326c [AArch64] Improve ARGBTOARGB4444 using SRI instructions
The existing sequence to convert from 8-bit ARGB to 4-bit ARGB4444 makes
use of a lot of shifts and bit-clears before ORR'ing the pairs together.
This is unnecessary since we can do the same with the SRI instruction,
so use that instead.

Reduction in runtime for selected kernels:

                Kernel | Cortex-A55 | Cortex-A76
ARGBToARGB4444Row_NEON |     -15.3% |     -16.6%
I422ToARGB4444Row_NEON |      -2.7% |     -11.9%

Bug: libyuv:976
Change-Id: I86cd86c7adf1105558787a679272179821f31a9d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383443
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-18 18:26:27 +00:00
George Steed
b265c311b7 [AArch64] Avoid unnecessary work in READYUV400
The value of UV components in the vector are known and the vectors are
never overwritten, so we can hoist the UV-specific parts of the
calculation out of the loop.

Reduction in runtimes for I400ToARGBRow_NEON:

 Cortex-A55: -10.0%
Cortex-A510:  -3.7%
 Cortex-A76: -19.3%
  Cortex-X2: -14.4%

Bug: libyuv:976
Change-Id: I17d6de4e1790f71407e12ff84548568cc3ebbe1a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457434
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-17 16:47:58 +00:00
George Steed
ea56460300 [AArch64] Use LD1/ST1 rather than LD4/ST4 in ARGBMultiplyRow_NEON
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.

Reduction in runtimes observed for ARGBMultiplyRow_NEON:

 Cortex-A55: -22.3%
Cortex-A510: -56.6%
 Cortex-A76: -45.5%
  Cortex-X2: -54.6%

Change-Id: I9103111a109a4d87d358e06eb513746314aaf66a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454832
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-16 07:28:56 +00:00
George Steed
7266cda79c [AArch64] Use LD1/ST1 rather than LD4/ST4 in ARGBSubtractRow_NEON
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.

Reduction in runtimes observed for ARGBSubtractRow_NEON:

 Cortex-A55: -15.0%
Cortex-A510: -59.8%
 Cortex-A76: -54.4%
  Cortex-X2: -70.4%

Change-Id: Ifbfce9e6a45159932c09d9b0229215a36fa22f43
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454833
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-16 07:04:43 +00:00
George Steed
e646991347 [AArch64] Use LD1/ST1 rather than LD4/ST4 in ARGBAddRow_NEON
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.

Reduction in runtimes observed for ARGBAddRow_NEON:

 Cortex-A55: -15.0%
Cortex-A510: -59.8%
 Cortex-A76: -54.4%
  Cortex-X2: -70.4%

Change-Id: Id04e5259d8e5e7511dad5df85cdf9759b392cb99
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454831
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-16 07:03:44 +00:00
George Steed
f2e78e1304 [AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow
Using the dot-product instructions here allows us to avoid needing LD4
for loading individual colour channels, which gives a big benefit on
some micro-architectures where such instructions perform significantly
worse than LD1. In addition the dot-product instructions have higher
throughput compared to the Neon

Observed reduction in runtimes for selected kernels moving from *_NEON
to *_NEON_DotProd:

     Kernel | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
ABGRToYJRow |      -6.5% |      -22.5% |     -43.5% |    -71.2%
 ABGRToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.3%
ARGBToYJRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 ARGBToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 BGRAToYRow |      -6.5% |      -22.5% |     -42.3% |    -68.4%
RGBAToYJRow |      -6.5% |      -22.5% |     -42.2% |    -73.7%
 RGBAToYRow |      -6.5% |      -22.5% |     -42.3% |    -64.9%

Bug: libyuv:977
Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:09:36 +00:00
George Steed
ba796a32e7 [AArch64] Remove out of date TODO around ARGBMultiplyRow_NEON
The comment refers to the code needing to be re-enabled but as far as I
can tell it is already enabled, so simply remove the comment.

Change-Id: Id014e8b7f5cd43c8211e1d38758299de2fad49de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5387650
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-03-25 22:44:45 +00:00
George Steed
5d694bec38 [AArch64] Replace UQSHRN{,2} pair by UZP2 in YUVTORGB
The existing Neon code makes use of a pair of UQSHRN and UQSHRN2
instructions to extract the top half of a widened multiply result.

These instructions would ordinarily saturate, however saturation can
never happen in this case since we are shifting by 16 to get the top
half of each element, the top bits remain as-is.

We could move this to using a slightly simpler non-saturating shift,
however in this case it is simpler and faster to just use UZP2 to
extract the top half of each 32-bit lane directly.

Reduction in runtime for selected kernels:

                  Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
      I400ToARGBRow_NEON |      -9.4% |     -14.9% |    -13.9%
 I422AlphaToARGBRow_NEON |      -7.9% |     -11.4% |    -11.5%
  I422ToARGB1555Row_NEON |      -7.3% |     -17.2% |    -14.7%
  I422ToARGB4444Row_NEON |      -7.6% |     -17.9% |    -13.7%
      I422ToARGBRow_NEON |      -8.2% |      -9.8% |    -11.9%
     I422ToRGB24Row_NEON |      -8.0% |     -13.3% |    -12.8%
    I422ToRGB565Row_NEON |      -7.5% |     -15.1% |    -14.6%
      I422ToRGBARow_NEON |      -8.3% |     -13.1% |    -12.2%
 I444AlphaToARGBRow_NEON |      -8.3% |      -7.6% |    -12.7%
      I444ToARGBRow_NEON |      -8.6% |      -3.5% |    -13.5%
     I444ToRGB24Row_NEON |      -8.5% |      -7.8% |    -13.4%
      NV12ToARGBRow_NEON |      -8.8% |      -1.4% |    -12.0%
     NV12ToRGB24Row_NEON |      -8.5% |     -11.5% |    -12.3%
    NV12ToRGB565Row_NEON |      -7.9% |     -15.0% |    -15.7%
      NV21ToARGBRow_NEON |      -8.7% |      -1.6% |    -12.3%
     NV21ToRGB24Row_NEON |      -8.4% |     -11.5% |    -12.0%
      UYVYToARGBRow_NEON |      -8.8% |      -8.9% |    -11.9%
      YUY2ToARGBRow_NEON |      -8.7% |     -10.8% |    -13.3%

Bug: libyuv:976
Change-Id: I6c505fe722e5f91f93718b85fe881ad056d8602d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366653
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-03-14 20:04:46 +00:00
George Steed
8d0d885c2f [AArch64] Avoid LD2 in YUY2ToARGBRow_NEON
In this case we have an LD2 instruction followed by a pair of permutes
(ZIP1 and TBL). On some micro-architectures LD2 involves use of the
vector pipelines, so in these cases it is preferable to do an LD1 and
then a different pair of permutes (TRN + TBL) instead to avoid the extra
vector pipeline usage.

Reduction in runtime on selected kernels (no observed performance delta
on Cortex-A55):

            Kernel | Cortex-A76 | Cortex-X2
UYVYToARGBRow_NEON |      -2.6% |     -8.8%
YUY2ToARGBRow_NEON |      -6.2% |     -4.9%

Bug: libyuv:976
Change-Id: I7ca45e0c7bf7cb50cc5ab37c6a01215d9689039a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366652
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-03-14 19:51:05 +00:00
George Steed
188e4e3afb [AArch64] Avoid unnecessary lane-indexed loads in READYUV
The existing code makes use of a pair of lane-indexed load instructions
to fill the two halves of the input vector, however this has the effect
of introducing an unnecessary dependency on the value of the vector from
the previous loop iteration.

This doesn't really seem to affect little core performance since these
cores never execute enough work concurrently to hit the bottleneck,
however we can improve performance on mid and big cores quite a bit by
using LDR instead of LD1 to load the low lane, zeroing the upper portion
of the vector rather than keeping the previous value.

Reduction in runtime for select kernels (no observed performance delta
on Cortex-A55):

                  Kernel | Cortex-A76 | Cortex-X2
  I422ToARGB4444Row_NEON |     -23.1% |    -49.3%
      I422ToARGBRow_NEON |      -1.2% |     -2.5%
     I422ToRGB24Row_NEON |     -11.7% |     -7.0%
      I422ToRGBARow_NEON |      -4.7% |     -3.4%
 I444AlphaToARGBRow_NEON |      -1.1% |     -2.4%
      I444ToARGBRow_NEON |      -1.6% |     -3.2%
     I444ToRGB24Row_NEON |      -9.6% |     -6.8%

Bug: libyuv:976
Change-Id: I8c9413e0e6ed97b8f060ce42b6e8abdfb77914b9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5365868
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-03-13 18:35:31 +00:00
Frank Barchard
650be7496f Fix warnings for missing prototypes
- Add static to internal scale and rotate functions
- Remove unittest that tested an internal scale function
- Remove unused private functions
- Include missing scale_argb.h header
- Bump version and apply clang format

Bug: libyuv:830
Change-Id: I45bab0423b86334f9707f935aedd0c6efc442dd4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4658956
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2023-06-30 17:46:56 +00:00
Frank Barchard
a366ad714a ARGBAttenuate use (a + b + 255) >> 8
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate

Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-06-16 21:37:53 +00:00
Frank Barchard
b08ccb6a83 FP16 to FP32 float conversion row function
Bug: None
Change-Id: I97aab6aafd41c3bf36bfbf33fdcc424e5b3fd6e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4590225
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2023-06-07 00:02:40 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Frank Barchard
464c51a035 AArch32 YUVTORGB_SETUP use load and dup to avoid modifying pointer
- Allows code to be optimized with clang 17 -flto-thin
- Bump version number to 1864 to allow detection of fix
- Apply clang format to standardize formatting; No impact on code generated

Bug: chromium:1424089
Change-Id: Ib745836b27915a5e4cb1d7d928ee52659360612b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4370052
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2023-03-24 19:32:30 +00:00
Frank Barchard
1a971f8cc3 clang 17 -flto-thin bug fix for Neon YUVtoRGB and ARGBToRGB565Dither
- YUV to RGB AArch32 kRGBCoeffBias rewind pointer
- ARGBToRGB565Dither declare width and source pointers as modified


Bug: chromium:1424089
Change-Id: I987180652331bab16ce27d8d166399a687ee890e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4370099
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-03-24 10:59:40 +00:00
Frank Barchard
3f219a3501 GCC warning fix for MT2T
- Fix redundent assignment compile warning in GCC
- Apply clang-format
- Bump version to 1863

Bug: libyuv:955
Change-Id: If2b6588cd5a7f068a1745fe7763e90caa7277101
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4344729
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2023-03-16 06:57:20 +00:00
Justin Green
76468711d5 M2T2 Unpack fixes
Fix the algorithm for unpacking the lower 2 bits of M2T2 pixels.

Bug: b:258474032
Change-Id: Iea1d63f26e3f127a70ead26bc04ea3d939e793e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4337978
Commit-Queue: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-03-14 14:59:26 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Frank Barchard
0faf8dd0e0 Fix for DivideRow_NEON functions
- was dup of 8h but mul of 4s.  now use umull

Bug: libyuv:951
Change-Id: If6cb01f5f006c2235886b81ce120642d7e24a9bb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4166563
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-18 00:30:05 +00:00