Some of the color conversion kernels already have Streaming-SVE
implementations however many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.
As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:
- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME
Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
InterpolateRow_SME and InterpolateRow_16_SME need special cases to
handle if source_y_fraction is 256 since this would overflow a byte and
can just be a call to memcpy instead.
InterpolateRow_16To8_SME is never called with a source_y_fraction value
of 256 so there is no need for a special case here.
Change-Id: I67805b5db2c411acb93ada626cf414b35620f467
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Add a streaming-SVE implementation of CopyRow using normal vector
load/store instructions.
Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing.
Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
SME kernels, but operating on 16-bit data rather than 8-bit.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples
rather than 8-bit data or 16-bit UV tuples.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Force macros onto empty lines with empty comments and adjust some other
comments to be consistent with the rest of the file.
Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.
Reduction in runtimes observed compared to the existing Neon
implementations:
| NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 | -60.7% | -60.7%
Cortex-A520 | -46.0% | -46.0%
Cortex-A715 | -25.2% | -25.2%
Cortex-A720 | -25.2% | -25.2%
Cortex-X2 | -28.9% | -29.0%
Cortex-X3 | -28.2% | -28.1%
Cortex-X4 | -30.8% | -30.7%
Cortex-X925 | -28.8% | -28.9%
Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These were mistakenly copied from the main loop body, however this
particular block of the code is only executed at most once so we do not
need to perform the address updates.
Also adjust formatting with clang-format to match other kernels.
Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can reuse most of the logic from the existing I422TORGB_SVE_2X macro
and simply amend the existing READNV_SVE macro to read twice as much
data.
Unrolling is primarily beneficial for little cores but also provides
some smaller benefits to larger cores as well.
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -48.0% | -47.9%
Cortex-A520 | -48.1% | -48.2%
Cortex-A715 | -20.4% | -20.4%
Cortex-A720 | -20.6% | -20.6%
Cortex-X2 | -7.1% | -7.3%
Cortex-X3 | -4.0% | -4.3%
Cortex-X4 | -14.1% | -14.3%
Cortex-X925 | -8.2% | -8.6%
Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Several of the existing SVE kernels used calculations of the form:
remainder = width & (vl - 1) == 0 ? vl : width & (vl - 1);
This is due to initial SVE contributed code unconditionally using the
predicated tail for the final iteration even if the width was a perfect
multiple of the vector length.
In the current code the fully-predicated main body loop will instead
iterate through the width completely and simply skip over the tail
entirely. Skipping over the tail means that the case handled by the
ternary condition now never occurs, and the remainder calculation can
now simply be:
remainder = width & (vl - 1);
This avoids the need for a compare and conditional select in the
function prologue.
Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing instruction arrangement is sub-optimal on little cores
since it has instructions with dependencies next to each other, so
spread them out to improve performance.
No significant change observed on bigger cores, but little cores do show
some small improvements except for the *Alpha* kernels which regress
slightly.
Runtimes observed compared to the previous SVE implementation:
| Cortex-A510 | Cortex-A520
I210AlphaToARGBRow | (!) +7.0% | (!) +6.8%
I210ToAR30Row | -10.3% | -9.9%
I210ToARGBRow | -2.4% | -2.3%
I212ToAR30Row | -10.3% | -9.9%
I212ToARGBRow | -2.4% | -2.3%
Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.
Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.
Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:
Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
Cortex-X1: -48.5%
Cortex-X2: -47.8%
Cortex-X3: -47.6%
Cortex-X4: -51.1%
Cortex-X925: -14.6%
Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The #ifdef surrounding the use of this kernel is never defined and
ScaleRowDown2_16_NEON does not exist, so add the missing #define and
remove the use of ScaleRowDown2_16_NEON for now. Additionally since
there is no implementation of this kernel for 32-bit Arm, restrict the
define to only be present on AArch64.
Bug: b/42280942
Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value
Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.
For HalfFloatRow, SVE zero-extending loads avoid the need for seperate
UXTL(2) instructions.
Observed reductions in runtime compared to the existing Neon code:
| HalfFloat1Row | HalfFloatRow
Cortex-A510 | -38.3% | -17.3%
Cortex-A520 | -37.6% | -18.8%
Cortex-A720 | -50.1% | -7.8%
Cortex-X2 | -50.2% | -0.4%
Cortex-X4 | -51.5% | -12.5%
Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
Cortex-X2: -5.2%
Cortex-X3: -2.6%
Cortex-X4: -32.4%
Cortex-X925: -1.5%
Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The assignment of ScaleUVRowDown2Box_NEON is already done in the block
immediately below this one, so just remove this code.
Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.
Reduction in cycle counts observed compared to the existing Neon
implementation:
| HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 | -37.1% | -40.8%
Cortex-A520 | -32.3% | -37.4%
Cortex-A720 | 0.0% | -10.6%
Cortex-X2 | 0.0% | -7.8%
Cortex-X4 | +0.3% | -6.9%
Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Including associated changes for adding a new scale_sme.cc file.
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Including addition of a new row_sme.cc file and associated
infrastructure.
The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary
permutes.
Reduction in run times:
Cortex-A55: -24.8%
Cortex-A510: -32.7%
Cortex-A520: -37.7%
Cortex-A76: -51.8%
Cortex-A715: -58.9%
Cortex-A720: -58.9%
Cortex-X1: -54.8%
Cortex-X2: -50.3%
Cortex-X3: -57.1%
Cortex-X4: -49.8%
Cortex-X925: -52.0%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: Ie96bac30fffbe41f8d1501ee289795830ab127e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary permutes.
Reduction in run times:
Cortex-A55: -17.9%
Cortex-A510: -28.7%
Cortex-A520: -31.8%
Cortex-A76: -40.8%
Cortex-A715: -46.1%
Cortex-A720: -46.1%
Cortex-X1: -44.3%
Cortex-X2: -40.1%
Cortex-X3: -46.3%
Cortex-X4: -40.2%
Cortex-X925: -42.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: I84e2cd04912fc11d59b4407a1836f047b74a4c92
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872802
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
Cortex-X2: -24.2%
Cortex-X3: -24.1%
Cortex-X4: -21.6%
Cortex-X925: -19.5%
Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
Cortex-X2: -22.7%
Cortex-X3: -22.4%
Cortex-X4: -19.4%
Cortex-X925: -27.0%
Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
Cortex-X2: -21.6%
Cortex-X3: -22.0%
Cortex-X4: -23.5%
Cortex-X925: -21.7%
Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Lane-indexed LD2 instructions are slow and introduce an unnecessary
dependency on the previous iteration of the loop. To avoid this
dependency use a scalar load for the first iteration and lane-indexed
LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd
elements.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -6.7%
Cortex-A510: -13.2%
Cortex-A520: -13.1%
Cortex-A76: -54.5%
Cortex-A715: -60.3%
Cortex-A720: -61.0%
Cortex-X1: -69.1%
Cortex-X2: -68.6%
Cortex-X3: -73.9%
Cortex-X4: -73.8%
Cortex-X925: -69.0%
Bug: b/42280945
Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a pair of stores since there is no AArch64
instruction in Neon to store exactly 12 bytes from a vector register.
It is guaranteed to be safe to write full vectors until the last
iteration of the loop, since the extra four bytes will be over-written
by subsequent iterations. This allows us to avoid duplicating the store
instruction and address arithmetic.
Reduction in runtime observed relative to the existing Neon
implementation:
Cortex-A55: +2.0%
Cortex-A510: -25.3%
Cortex-A520: -15.1%
Cortex-A76: -32.2%
Cortex-A715: -19.7%
Cortex-A720: -19.6%
Cortex-X1: -31.6%
Cortex-X2: -27.1%
Cortex-X3: -25.9%
Cortex-X4: -24.7%
Cortex-X925: -35.8%
Bug: b/42280945
Change-Id: I222ed662f169d82f5f472bebb1bcfe6d428ccae2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872843
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing behaviour does not round correctly in all cases, so adjust
it to match the existing Neon implementation.
Update the tests to require bit-exactness and disable other
implementations that do not round correctly.
Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This was previously disabled in
679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for
Linux where SME is known to work correctly.
Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
While we will return kCpuHasNEON if the file fails to open, this does
unnecessarily introduce filesystem operations which are not needed e.g.
on embedded non-Linux platforms. When not building for Linux, we can
simply rely on the compiler flags to determine whether NEON support is
present for Arm32.
Change-Id: Ifb0eab2a46969fca5f733ce624abdf54da9b32a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778479
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: George Steed <george.steed@arm.com>
The existing code uses three MOV instructions through a temporary
register to swap the low and high halves of a vector register, however
this can be done with a pair of ZIP instructions instead.
Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in
parallel on little cores.
Reduction in runtime observed compared to the existing Neon
implementation:
Cortex-A55: -8.3%
Cortex-A510: -20.6%
Cortex-A520: -16.6%
Cortex-A76: -6.8%
Cortex-A715: -6.2%
Cortex-A720: -6.2%
Cortex-X1: -22.0%
Cortex-X2: -18.7%
Cortex-X3: -21.1%
Cortex-X4: -25.8%
Cortex-X925: -21.9%
Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
ST3 is known to be slow on a number of modern micro-architectures. By
unrolling the code we are able to use TBL to shuffle elements into the
correct indices without needing to use LD4 and ST3, giving a good
improvement in performance across the board.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -14.4%
Cortex-A510: -66.0%
Cortex-A520: -50.8%
Cortex-A76: -60.5%
Cortex-A715: -63.9%
Cortex-A720: -64.2%
Cortex-X1: -74.3%
Cortex-X2: -75.4%
Cortex-X3: -75.5%
Cortex-X4: -48.1%
Bug: b/42280945
Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.
The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.
Reductions in runtime observed for ARGB1555ToARGBRow_NEON:
Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
Cortex-X1: -0.9%
Cortex-X2: -4.6%
Cortex-X3: -3.6%
Cortex-X4: -20.8%
Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.
Reduction in runtimes observed for selected micro-architectures:
| RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
Cortex-A53 | -35.2% | -28.8% | -31.1%
Cortex-A55 | -32.5% | -34.4% | -42.9%
Cortex-A510 | -21.6% | -27.7% | -47.2%
Cortex-A76 | -0.9% | -42.0% | -21.4%
Cortex-A720 | -28.6% | -37.2% | -26.1%
Cortex-X1 | -3.2% | -42.3% | -23.4%
Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
By loading packed 16-bit AR/GB data and operating on that directly we
avoid the need to perform a separate widening step before the
conversion.
Reduction in runtime observed compared to the existing Neon code:
Cortex-A55: -13.2%
Cortex-A510: -5.4%
Cortex-A76: -21.5%
Cortex-A720: -25.2%
Cortex-X1: -50.6%
Cortex-X2: -36.8%
Bug: b/42280945
Change-Id: I780c71fdff1d017464c6e4e38f86979dda0e43ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790973
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.
ST4 is also known to be slow on more modern micro-architectures, however
avoiding this is left for a future SVE implementation where we can make
use of interleaving-narrowing instructions.
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -5.8%
Cortex-A510: -18.9%
Cortex-A76: -21.8%
Cortex-A720: -30.2%
Cortex-X1: -28.6%
Cortex-X2: -23.4%
Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -12.6%
Cortex-A510: -48.6%
Cortex-A76: -39.7%
Cortex-A720: -52.3%
Cortex-X1: -63.5%
Cortex-X2: -67.0%
Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use wider load/store instructions here which is mostly an
improvement across the board.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: +4.9% (!)
Cortex-A510: -46.3%
Cortex-A520: -49.0%
Cortex-A76: -12.2%
Cortex-A715: -15.5%
Cortex-A720: -15.0%
Cortex-X1: -12.4%
Cortex-X2: -12.5%
Cortex-X3: -12.3%
Cortex-X4: +0.3%
Bug: b/42280945
Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Replace LD4 with a pair of LD2 instructions to avoid needing an ST2
instruction for storing the result, since ST2 instructions are known to
be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -23.3%
Cortex-A510: -49.6%
Cortex-A520: -31.1%
Cortex-A76: -44.5%
Cortex-A715: -45.8%
Cortex-A720: -46.0%
Cortex-X1: -74.5%
Cortex-X2: -72.4%
Cortex-X3: -76.8%
Cortex-X4: -39.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use separate permute instructions to avoid using LD4/ST2 as these
instructions are known to be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -12.4%
Cortex-A510: -44.8%
Cortex-A520: -31.1%
Cortex-A76: -55.3%
Cortex-A715: -63.7%
Cortex-A720: -62.3%
Cortex-X1: -79.0%
Cortex-X2: -78.9%
Cortex-X3: -79.6%
Cortex-X4: -59.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Scaling 48 pixels at a time, but calling code checked for 24 pixels
- Added test for scaling to 1080x1920
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
Was
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):
Now
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 - 6741 us C - 3566 us OPT
[ OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)
Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512
Sapphire Rapids
Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)
AVX2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)
SSE2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)
Skylake Xeon
Now vpmovuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)
Was vpackuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)
Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2
Now
vpbroadcastw %%xmm2,%%ymm2
Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Add (void) casts to the 'src_width' parameters that are only used in
assertions.
Change-Id: I72d1b55f50a9b02b07b206e40e5583005b27928b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- convert full Y plane with row coalescing if possible
- convert rows of UV from 10 bit to 8 bit then call MergeUV
libyuv_test '--gunit_filter=*010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *010ToNV12_Opt
Skylake Xeon Was 2 pass planes
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms)
Now 2 pass rows
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms)
On Samsung S23
libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000'
Was
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms)
Now
[ OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms
[ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms
[ OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms <- now
[ OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms
[ OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms
[ OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms
[ OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms)
[ OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms)
[ OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms)
Bug: b/357439226, b/357721018
Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.
Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.
Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is be rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.
Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
We can use wider load/store instructions and avoid the need to waste
half of the ADDP/RSHRN vector data. The duplicated UADDLP and UADALP
instructions also provide a good improvement on little cores due to
their limited out-of-order capability.
The mask in the "any" kernel definition is already set up to handle an
unrolling of eight so no change to scale_any.cc is needed.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -19.5%
Cortex-A520: -38.3%
Cortex-A76: -36.0%
Cortex-A715: -18.1%
Cortex-A720: -17.9%
Cortex-X1: -25.4%
Cortex-X2: -18.5%
Cortex-X3: -8.2%
Cortex-X4: -3.8%
Bug: b/42280945
Change-Id: Iebba5da4db5e25af4b9fa5651c7396364dedffba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725172
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of wider instructions for the loads and stores as well
as the URHADD instructions. In addition the duplicated instructions of
the code from the unrolling provides a further small improvement for
little cores with limited out-of-order capability.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -23.5%
Cortex-A510: -35.4%
Cortex-A520: -40.5%
Cortex-A76: -15.1%
Cortex-A715: -6.2%
Cortex-A720: -6.2%
Cortex-X1: -17.9%
Cortex-X2: -18.4%
Cortex-X3: -18.3%
Cortex-X4: -14.0%
Bug: b/42280945
Change-Id: I5905e026a0507870bfc580b702906d6acb4ed6f4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725170
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Unrolling gives a nice improvement to the little cores and even a small
improvement to the big cores thanks to avoiding the loop control
overhead.
Observed performance improvement relative to the existing SVE2 code.
| Cortex-A510 | Cortex-A720 | Cortex-X2
RAWToARGBRow_SVE2 | -28.4% | -10.1% | -3.5%
RAWToRGBARow_SVE2 | -28.5% | -10.1% | -4.4%
RGB24ToARGBRow_SVE2 | -28.5% | -10.4% | -5.5%
Bug: libyuv:973
Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Since the UV components are duplicated in I422 we end up wasting half of
the vector bandwidth processing the same elements twice. By unrolling
the kernel to process two vectors of Y per iteration we can fill a whole
vector of U/V components.
Rather than packing RGBA components into pairs during the narrowing we
now just narrow into individual component vectors and use ST4B instead.
This by itself is slower on some micro-architectures like Cortex-A510
but the benefit from unrolling significantly outweights this.
| I422AlphaToARGBRow_SVE2 | I422ToARGBRow_SVE2
Cortex-A510 | -46.2% | -48.8%
Cortex-A720 | -20.8% | -21.0%
Cortex-X2 | -11.3% | -7.5%
Cortex-X4 | -15.4% | -15.5%
Bug: libyuv:973
Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of the ZA tile register to do the transpose and
de-interleaving of UV components without any explicit permute
instructions: the tile is loaded horizontally placing UV components into
alternative columns, then we can just store the independent components
vertically.
Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of the ZA tile register to do the transpose without any
explicit permute instructions: just load the tile horizontally and store
it vertically.
Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We cannot use the standard dot-product instructions since the
coefficients multiplication results are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here. We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
Cortex-X2: -62.1%
Cortex-X3: -62.2%
Cortex-X4: -40.4%
Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.
This code has a number of small issues:
* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
all-true, however the existing kernel does not initialise p0 until
after this macro is called, so flip the order.
* The p2 register is missing from the clobber list, so add it.
* The existing code uses the wrong condition flags when determining
whether to do the tail iteration using WHILE instructions or not.
Additionally the number of tail iterations is incorrect, as it was
incorrectly not changed from when the tail code was always executed.
While we are here, make another few small improvements:
* Remove the single-quote digit separators as requested here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133
* Remove "volatile" from the asm block counting the vector length. This
particular asm block cannot be removed by the compiler since the
output register is consumed by subsequent code, so "volatile" is
unnecessary here and we remove it.
* Add some additional empty comments to force clang-format to put macros
into the next line rather than on the same line as other asm.
Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing disabled gtest rotate tests fail because the existing "any"
kernels always assume we are processing height=8 rows at a time. This
was recently changed to 16 on AArch64 which triggered this bug.
To fix this, amend the TANY macro to explicitly specify the fallback
kernel, such that we can use the height=16 kernel to match the SIMD
optimized version where necessary. Also change other architecture
versions to match.
Bug: b/352351302
Change-Id: I8080fa8f44c7c67fa970a78fb426f2f801a9a00e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703585
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.
For ARGB4444ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.
Reduction in runtimes observed compared to the existing Neon kernels:
| ARGB4444ToUVRow | ARGB4444ToYRow
Cortex-A55 | -27.8% | -34.6%
Cortex-A510 | -37.0% | -44.4%
Cortex-A76 | -40.2% | -22.0%
Cortex-A720 | -33.4% | -35.5%
Cortex-X1 | -34.1% | -19.7%
Cortex-X2 | -32.1% | -26.3%
Bug: libyuv:976
Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of lane-indexed LD2 instructions to load the
input data however this creates a strong dependency chain between
consecutive load instructions. We can reduce this dependency chain by
instead loading two vectors with wider lane-indexed LD1 instructions and
then performing a permute to unzip the data.
We can also avoid the need for a complex sequence of DUP + EXT
instructions by using TBL to permute the data exactly as we want it.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: =0.0%
Cortex-A510: -44.2%
Cortex-A520: -47.6%
Cortex-A76: -45.8%
Cortex-A715: -58.3%
Cortex-A720: -58.4%
Cortex-X1: -66.7%
Cortex-X2: -68.0%
Cortex-X3: -67.9%
Cortex-X4: -70.0%
Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126
Reviewed-by: Frank Barchard <fbarchard@chromium.org>