Add a streaming-SVE implementation of CopyRow using normal vector
load/store instructions.
Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing.
Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
SME kernels, but operating on 16-bit data rather than 8-bit.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples
rather than 8-bit data or 16-bit UV tuples.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.
These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Force macros onto empty lines with empty comments and adjust some other
comments to be consistent with the rest of the file.
Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing tests reuse the intermediate buffers between the reference
and optimized implementations. In particular the existing tests appear
to pass even if the optimized implementation is completely empty, so
long as it does not modify the desintation buffers since these are
already filled with correct values from the reference code.
To avoid this, allocate separate buffers for optimized and reference
implementations to store intermediate data between function calls.
Additionally remove unused buffers from HalfMergeUVPlane_Opt tests.
Change-Id: I7e9ea21fc193e7be21cc24e2be0d7a122e068f6e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074941
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.
Reduction in runtimes observed compared to the existing Neon
implementations:
| NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 | -60.7% | -60.7%
Cortex-A520 | -46.0% | -46.0%
Cortex-A715 | -25.2% | -25.2%
Cortex-A720 | -25.2% | -25.2%
Cortex-X2 | -28.9% | -29.0%
Cortex-X3 | -28.2% | -28.1%
Cortex-X4 | -30.8% | -30.7%
Cortex-X925 | -28.8% | -28.9%
Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These were mistakenly copied from the main loop body, however this
particular block of the code is only executed at most once so we do not
need to perform the address updates.
Also adjust formatting with clang-format to match other kernels.
Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can reuse most of the logic from the existing I422TORGB_SVE_2X macro
and simply amend the existing READNV_SVE macro to read twice as much
data.
Unrolling is primarily beneficial for little cores but also provides
some smaller benefits to larger cores as well.
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -48.0% | -47.9%
Cortex-A520 | -48.1% | -48.2%
Cortex-A715 | -20.4% | -20.4%
Cortex-A720 | -20.6% | -20.6%
Cortex-X2 | -7.1% | -7.3%
Cortex-X3 | -4.0% | -4.3%
Cortex-X4 | -14.1% | -14.3%
Cortex-X925 | -8.2% | -8.6%
Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Several of the existing SVE kernels used calculations of the form:
remainder = width & (vl - 1) == 0 ? vl : width & (vl - 1);
This is due to initial SVE contributed code unconditionally using the
predicated tail for the final iteration even if the width was a perfect
multiple of the vector length.
In the current code the fully-predicated main body loop will instead
iterate through the width completely and simply skip over the tail
entirely. Skipping over the tail means that the case handled by the
ternary condition now never occurs, and the remainder calculation can
now simply be:
remainder = width & (vl - 1);
This avoids the need for a compare and conditional select in the
function prologue.
Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing instruction arrangement is sub-optimal on little cores
since it has instructions with dependencies next to each other, so
spread them out to improve performance.
No significant change observed on bigger cores, but little cores do show
some small improvements except for the *Alpha* kernels which regress
slightly.
Runtimes observed compared to the previous SVE implementation:
| Cortex-A510 | Cortex-A520
I210AlphaToARGBRow | (!) +7.0% | (!) +6.8%
I210ToAR30Row | -10.3% | -9.9%
I210ToARGBRow | -2.4% | -2.3%
I212ToAR30Row | -10.3% | -9.9%
I212ToARGBRow | -2.4% | -2.3%
Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
(Initially uploaded here https://crrev.com/c/5726652)
This logic was copied from the login in chromium/src at https://chromium-review.googlesource.com/c/chromium/src/+/4666325 as that is the current version of buildtools that libyuv uses
This is needed to be able to remove the old path of downloading remote exec configs on ci builders
Test: CQ tryjobs
No-Try: true
Bug: b/292501270
Change-Id: Idea22e9a499e57d86f1e1e8ed9c0ca346aa162b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6055341
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Christoffer Dewerin <jansson@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.
Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.
Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:
Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
Cortex-X1: -48.5%
Cortex-X2: -47.8%
Cortex-X3: -47.6%
Cortex-X4: -51.1%
Cortex-X925: -14.6%
Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The #ifdef surrounding the use of this kernel is never defined and
ScaleRowDown2_16_NEON does not exist, so add the missing #define and
remove the use of ScaleRowDown2_16_NEON for now. Additionally since
there is no implementation of this kernel for 32-bit Arm, restrict the
define to only be present on AArch64.
Bug: b/42280942
Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value
Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.
For HalfFloatRow, SVE zero-extending loads avoid the need for seperate
UXTL(2) instructions.
Observed reductions in runtime compared to the existing Neon code:
| HalfFloat1Row | HalfFloatRow
Cortex-A510 | -38.3% | -17.3%
Cortex-A520 | -37.6% | -18.8%
Cortex-A720 | -50.1% | -7.8%
Cortex-X2 | -50.2% | -0.4%
Cortex-X4 | -51.5% | -12.5%
Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
Cortex-X2: -5.2%
Cortex-X3: -2.6%
Cortex-X4: -32.4%
Cortex-X925: -1.5%
Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The assignment of ScaleUVRowDown2Box_NEON is already done in the block
immediately below this one, so just remove this code.
Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.
Reduction in cycle counts observed compared to the existing Neon
implementation:
| HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 | -37.1% | -40.8%
Cortex-A520 | -32.3% | -37.4%
Cortex-A720 | 0.0% | -10.6%
Cortex-X2 | 0.0% | -7.8%
Cortex-X4 | +0.3% | -6.9%
Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Including associated changes for adding a new scale_sme.cc file.
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Including addition of a new row_sme.cc file and associated
infrastructure.
The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>