This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
Cortex-X2: -5.7%
Cortex-X3: -5.9%
Cortex-X4: -2.8%
Cortex-X925: -4.0%
Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715: -9.2%
Cortex-A720: -9.2%
Cortex-X2: -25.8%
Cortex-X3: -26.2%
Cortex-X4: -23.2%
Cortex-X925: -17.8%
Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Some of the color conversion kernels already have Streaming-SVE
implementations however many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.
As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:
- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME
Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.
Reduction in runtimes observed compared to the existing Neon
implementations:
| NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 | -60.7% | -60.7%
Cortex-A520 | -46.0% | -46.0%
Cortex-A715 | -25.2% | -25.2%
Cortex-A720 | -25.2% | -25.2%
Cortex-X2 | -28.9% | -29.0%
Cortex-X3 | -28.2% | -28.1%
Cortex-X4 | -30.8% | -30.7%
Cortex-X925 | -28.8% | -28.9%
Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.
Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Including addition of a new row_sme.cc file and associated
infrastructure.
The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.
Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
Cortex-X2: -24.2%
Cortex-X3: -24.1%
Cortex-X4: -21.6%
Cortex-X925: -19.5%
Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
Cortex-X2: -22.7%
Cortex-X3: -22.4%
Cortex-X4: -19.4%
Cortex-X925: -27.0%
Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
Cortex-X2: -21.6%
Cortex-X3: -22.0%
Cortex-X4: -23.5%
Cortex-X925: -21.7%
Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.
The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.
Reductions in runtime observed for ARGB1555ToARGBRow_NEON:
Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
Cortex-X1: -0.9%
Cortex-X2: -4.6%
Cortex-X3: -3.6%
Cortex-X4: -20.8%
Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.
Reduction in runtimes observed for selected micro-architectures:
| RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
Cortex-A53 | -35.2% | -28.8% | -31.1%
Cortex-A55 | -32.5% | -34.4% | -42.9%
Cortex-A510 | -21.6% | -27.7% | -47.2%
Cortex-A76 | -0.9% | -42.0% | -21.4%
Cortex-A720 | -28.6% | -37.2% | -26.1%
Cortex-X1 | -3.2% | -42.3% | -23.4%
Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.
Observed reduction in runtime compared to the existing Neon kernel:
Cortex-A510: -25.6%
Cortex-A720: -15.2%
Cortex-X2: -10.2%
Cortex-X4: -30.2%
Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.
Reduction in runtimes observed compared to the existing Neon
implementation:
| RAWToARGBRow | RAWToRGBARow
Cortex-A510 | -32.4% | -31.9%
Cortex-A720 | -15.7% | -15.6%
Cortex-X2 | -24.6% | -24.4%
Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow | -59.8% | -16.8% | -53.2%
P210ToAR30Row | -48.1% | -21.8% | -54.0%
P410ToARGBRow | -56.5% | -32.2% | -54.1%
P410ToAR30Row | -42.4% | -4.5% | -50.4%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
Cortex-A55: -43.1%
Cortex-A510: -22.3%
Cortex-A76: -54.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToAR30Row | I210ToARGBRow
Cortex-A55 | -40.8% | -54.4%
Cortex-A510 | -26.2% | -22.7%
Cortex-A76 | -49.2% | -44.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.
Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.
Observed reductions in runtimes compared to the existing Neon code:
| UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 | -30.2% | -30.2%
Cortex-A720 | -4.8% | -4.7%
Cortex-X2 | -9.6% | -10.1%
Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.
Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.
Reduction in runtime observed compared to the existing Neon code:
Cortex-A510: -26.1%
Cortex-A520: -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
Cortex-X2: -22.5%
Cortex-X3: -23.5%
Cortex-X4: -21.6%
Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.
Observed reductions in runtimes compared to the existing Neon code:
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -29.1% | -29.1%
Cortex-A720 | -4.8% | -4.8%
Cortex-X2 | -9.2% | -9.2%
Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.
The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.
Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.
Reductions in runtime observed compared to the existing Neon code:
Cortex-A510: -27.0%
Cortex-A720: -5.3%
Cortex-X2: -14.7%
Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.
Observed reduction in runtime relative to the existing Neon code:
Cortex-A510: -69.7%
Cortex-A720: -7.7%
Cortex-X2: -41.9%
Cortex-X4: -14.5%
Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
I210ToAR30Row on Cortex-A76: -50.4%
I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
I410ToAR30Row on Cortex-A76: -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToARGBRow | I410ToARGBRow
Cortex-A55 | -55.6% | -56.2%
Cortex-A510 | -22.6% | -35.6%
Cortex-A76 | -48.1% | -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210AlphaToARGBRow | I410AlphaToARGBRow
Cortex-A55 | -55.3% | -56.1%
Cortex-A510 | -27.9% | -42.6%
Cortex-A76 | -54.9% | -60.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -32.1%
Cortex-A720: -5.1%
Cortex-X2: -10.1%
Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -34.2%
Cortex-A720: -17.6%
Cortex-X2: -9.6%
Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -25.0%
Cortex-A720: -5.0%
Cortex-X2: -10.8%
Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:
* For the Y component calculation we are able to use UMULH, versus the
existing UMULL x2 + UZP2 sequence in Neon.
* For the RGBTORGBA8 calculation we are able to take advantage of
interleaving narrowing instructions, allowing us to use ST2 rather
than ST4 for the store. This is a big performance win on some
micro-architectures where ST4 is costly.
* The use of predication means we do not need to add "any" kernels, we
can simply rerun the calculation with a not-full predicate for the
final iteration.
To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.
Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:
Cortex-A510: -36.5%
Cortex-A720: -17.3%
Cortex-X2: -11.3%
Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.
* Run on SiFive internal FPGA(VLEN=128):
Test Case Speedup
AR64ToAB64_Opt x4.6
ARGBToRGBA_Opt x6
ARGBToBGRA_Opt x6
ARGBToABGR_Opt x6
RGBAToARGB_Opt x6
Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
ScaleUVRowUp2_(Bi)linear_RVV function is equal to other platforms' ScaleRowUp2_(Bi)linear_Any_XXX.
We process entire row in this function.
Other platforms only implement non-edge part of image and process edge with scalar.
ScaleRowUp2_(Bi)linear_Any_XXX: Combine ScaleRowUp2_(Bi)linear_XXX(non-edge) + ScaleRowUp2_(Bi)linear_C(edge) by SBUH2LANY/SU2BLANY.
* Run on SiFive internal FPGA:
Test case RVV function Speedup
I444ScaleFrom640x360_Bilinear ScaleRowUp2_Bilinear_RVV 8.21
I444ScaleFrom640x360_Linear ScaleRowUp2_Linear_RVV 8.08
UVScaleFrom640x360_Bilinear ScaleUVRowUp2_Bilinear_RVV 7.80
UVScaleFrom640x360_Linear ScaleUVRowUp2_Linear_RVV 7.03
Change-Id: I539245ce51858f077506a78f0e7e82377ac6a95d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4666062
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>