The existing Neon code narrows the input 16-bit packed data to 8-bit
elements and separates the color channels, causing us to only process
half a Neon vector per instruction for the channel widening from 4-bit
color data to 8-bits.
We can note that the processing being done is identical for all color
channels and therefore we can keep them partially interleaved during the
widening step. This allows us to use full Neon vectors for the whole
loop body.
Reductions in runtimes observed for ARGB4444ToARGBRow_NEON:
Cortex-A55: -30.7%
Cortex-A510: -44.3%
Cortex-A76: -51.6%
Cortex-X2: -54.2%
Bug: libyuv:976
Change-Id: I9d9cda7e16eb07619c6d7f1de2e6b8c0fb6d64cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This function takes the alpha component from the loaded data rather than
hard-coding it to 255, so initialising v3 to 255 is unused here.
Change-Id: I668825e0eeb317d1365035ce3bb47f3d92081c6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594388
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
I210ToAR30Row on Cortex-A76: -50.4%
I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
I410ToAR30Row on Cortex-A76: -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Get rid of unused tail loops, since they are already handled by the
"any" kernels.
Also remove unnecessary "volatile" specifier from asm blocks.
Bug: libyuv:976
Change-Id: I4676fc807bcaedbb5f0f52b1bed20a172fef4ed6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553719
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToARGBRow | I410ToARGBRow
Cortex-A55 | -55.6% | -56.2%
Cortex-A510 | -22.6% | -35.6%
Cortex-A76 | -48.1% | -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210AlphaToARGBRow | I410AlphaToARGBRow
Cortex-A55 | -55.3% | -56.1%
Cortex-A510 | -27.9% | -42.6%
Cortex-A76 | -54.9% | -60.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.
We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.
Reduction in runtime observed for MergeXR30Row_10_NEON:
Cortex-A55: -34.2%
Cortex-A510: -35.6%
Cortex-A76: -44.9%
Cortex-X2: -48.3%
Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a narrowing shift right (in RGBTORGB8)
followed by a widening left shift (in ARGBTORGB565). This is redundant
since we could have simply not performed a narrowing operation to begin
with and instead done a saturating left shift to saturate against the
top of the 16-bit lanes rather than the narrowed 8-bit lanes.
To enable this we introduce new RGBTORGB8_TOP and ARGBTORGB565_FROM_TOP
macros which produce and consume values from the high half of each
16-bit lane rather than a narrowed 8-bit intermediate.
Reduction in runtime for selected kernels:
| Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
I422ToRGB565Row_NEON | -10.8% | -6.1% | -17.2% | -23.6%
NV12ToRGB565Row_NEON | -11.4% | -4.9% | -20.4% | -17.4%
Bug: libyuv:976
Change-Id: I3337b8f41ff62a7af1b70a56b774239bdb55d0f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509197
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing implementation of this kernel uses the ARGB1555TOARGB macro
which extracts and sign-extends the alpha component into v3, however
this particular kernel does not need the alpha component. We can avoid
calculating the alpha component completely by using the existing
RGB555TOARGB macro, so use that instead.
Reduction in runtimes observed for ARGB1555ToYRow_NEON (no noticeable
improvement observed on Cortex-A510):
Cortex-A55: -3.6%
Cortex-A76: -20.9%
Cortex-X2: -15.1%
Bug: libyuv:976
Change-Id: I2cf2729c8297c53dcd32d0df28e64d4d5c7f6def
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509200
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
ST2 with 64-bit lanes has good performance on all micro-architectures of
interest and saves us 8 TRN instructions, so use that instead.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -8.6%
Cortex-A510: -4.9%
Cortex-A520: -6.0%
Cortex-A76: -14.4%
Cortex-A720: -5.3%
Cortex-X1: -13.6%
Cortex-X2: -5.8%
Bug: libyuv:976
Change-Id: I08bb5517bbdc54c4784fce42a885b12f91e7a982
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5581597
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We already have an "any" helper function set up for this kernel, so use
it to match the other existing architecture paths. This change also
affects the 32-bit Arm paths, which will be cleaned up in a later
commit.
With this change the kernel is now only entered with width as a multiple
of eight, so remove the now-unneeded tail loops.
Also remove volatile specifier from the asm block, it is unnecessary.
Change-Id: If37428ac2d6035a8c27eec9bd80d014a98ac3eb1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553717
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Rather than shifting the data into the low half of each lane and then
using a saturating narrowing operation, we can do the saturation as part
of a shift into the highest half of the lane and then use a simpler TRN2
instruction to extract pairs of high halves into full vectors. This also
has the nice advantage of allowing us to use ST2 rather than ST4 for
storing the result, since ST4 is known to be slow on some
micro-architectures.
Reduction in runtimes observed for the two kernels:
| MergeARGB16To8Row_NEON | MergeXRGB16To8Row_NEON
Cortex-A55 | -8.0% | -12.2%
Cortex-A510 | -29.9% | -31.4%
Cortex-A76 | -29.0% | -32.0%
Cortex-X2 | -33.5% | -43.4%
Bug: libyuv:976
Change-Id: I9da3beedc27ab43527b3642aa6d4decf3b5b6683
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509198
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing Neon code only makes use of 64-bit vectors throughout which
limits the performance on larger cores. To avoid this, swap the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.
The original code also handled widths that were not exact multiples of
16, however this should already be handled by the "any" kernel so it is
removed.
Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.
Observed speedups for TransposePlane across a range of
micro-architectures:
Cortex-A53: -40.0%
Cortex-A55: -20.7%
Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
Cortex-X2: -38.3%
Cortex-X4: -43.6%
Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| ABGRToAR30Row | ARGBToAR30Row
Cortex-A55 | -55.1% | -55.1%
Cortex-A510 | -39.3% | -40.1%
Cortex-A76 | -62.3% | -63.6%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We don't need a general-purpose purmute here, REV16 does exactly what we
want and saves us needing to load the permute indices array.
Bug: libyuv:976
Change-Id: Ib3bc2e4d21b00d53aeda6a11c6e6f1016ca6029e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509201
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The strict architectural requirements between features are reasonably
relaxed and difficult to map out fully, in particular:
* FEAT_DotProd is architecturally available from Armv8.1-A and becomes
mandatory from Armv8.4-A.
* FEAT_I8MM is architecturally available from Armv8.1-A and becomes
mandatory from Armv8.6-A. It does not strictly depend on FEAT_DotProd
being implemented however I am not aware of a micro-architecture where
FEAT_I8MM is implemented without FEAT_DotProd also being implemented.
* FEAT_SVE is architecturally available from Armv8.2-A. It does not
strictly depend on either of FEAT_DotProd or FEAT_I8MM being
implemented. The only micro-architecture I am aware of where FEAT_SVE
is implemented without FEAT_DotProd and FEAT_I8MM both also being
implemented is the Fujitsu A64FX.
* FEAT_SVE2 is architecturally available from Armv9.0-A. If FEAT_SVE2 is
implemented then FEAT_SVE must also be implemented. Since Armv9.0-A is
based on Armv8.5-A this implies that FEAT_DotProd is also implemented.
Interestingly this means that FEAT_I8MM is not mandatory since it only
becomes mandatory from Armv8.6-A (Armv9.1-A), however I am not aware
of a micro-architecture where FEAT_SVE2 is implemented without all
three of the above features also being implemented.
Additionally, when testing under emulation there are sometimes bugs
where even mandatory architecture relationships are broken. For example
there is one known case where SVE2 may be reported as available even
when SVE is explicitly disabled.
To simplify these dependencies, don't try to enable later extensions
unless earlier extensions are reported implemented. This notably
penalises code if it were to run on a Fujitsu A64FX, however this is not
a likely target for libyuv deployment.
Change-Id: Ifa32f7a43043641f99afb120e591945e136c9fd1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5546385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using the platform-specific functions IsProcessorFeaturePresent and
sysctlbyname to check individual features.
Bug: libyuv:980
Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.
A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.
For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.
Observed affect on runtimes compared to the existing Neon kernels:
| Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow | -15.5% | +5.4% | -33.1%
ABGRToUVRow | -15.6% | +5.3% | -35.9%
ARGBToUVJRow | -10.1% | +5.4% | -32.7%
ARGBToUVRow | -10.1% | +5.4% | -29.3%
BGRAToUVRow | -15.5% | +4.6% | -32.8%
RGBAToUVRow | -10.1% | +4.2% | -36.0%
Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The use of LD4 and ST4 to de-interleave ARGB color channels is
unnecessary here since we can just adjust the scale multiplicand to
match the interleaved layout. LD4 and ST4 are known to perform poorly on
some micro-architectures so using LD1 and ST1 here should be preferred.
Reduction in runtime for ARGBShadeRow_NEON:
Cortex-A55: -19.9%
Cortex-A510: -50.8%
Cortex-A76: -36.0%
Cortex-X2: -46.4%
Bug: libyuv:976
Change-Id: I10a0e6a0a62242826d39b1e963063770f084226a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494093
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use subs to set condition flags as part of the subtract, no need
for a separate compare instruction. No performance difference observed
from this change, but it now matches the other SVE2 kernels.
Also remove unnecessary volatile from asm blocks.
Bug: libyuv:973
Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -32.1%
Cortex-A720: -5.1%
Cortex-X2: -10.1%
Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -34.2%
Cortex-A720: -17.6%
Cortex-X2: -9.6%
Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -25.0%
Cortex-A720: -5.0%
Cortex-X2: -10.8%
Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
We can use the Neon dot-product instructions as a slightly faster
widening accumulation. This also has the advantage of widening to 32
bits so avoids the risk of overflow present in the original Neon code.
Reduction in runtimes observed for HammingDistance compared to the
existing Neon code:
Cortex-A55: -4.4%
Cortex-A510: -26.5%
Cortex-A76: -8.1%
Cortex-A720: -15.5%
Cortex-X1: -4.1%
Cortex-X2: -5.1%
Bug: libyuv:977
Change-Id: I9e5e10d228c339d905cb2e668a9811ff0a6af5de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490049
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The kernel is only ever called with count as a multiple of 32 so it is
safe to unroll this and maintain two accumulators.
Reduction in runtime observed compared to the existing
SumSquareError_NEON_DotProd implementation:
Cortex-A55: -28.2%
Cortex-A510: -27.6%
Cortex-A76: -33.0%
Cortex-A720: -35.3%
Cortex-X1: -16.9%
Cortex-X2: -13.3%
Bug: libyuv:977
Change-Id: Iee423106c38e97cc38007d73fa80e8374dd96721
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490048
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This re-lands commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Now with additional #ifdef __linux__ guards to avoid compiling
Linux-specific code on non-Linux platforms. Non-linux feature detection
will be added in a separate patch.
Using getauxval(AT_HWCAP{,2}) has the advantage of also working under
emulation where faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at time of writing is 21 so we should
be able to use getauxval unconditionally. On the off-chance the call
fails it will return 0 and we will correctly fall-back to using only
Neon.
If we want to read the current CPU implementer or part number we could
do this by checking HWCAP_CPUID and then reading MIDR_EL1. This will
cause a kernel trap to emulate the EL1 read but should still be a lot
faster than reading the whole of /proc/cpuinfo.
Bug: libyuv:980
Change-Id: I8ae103ea7e32ef44db72f3c9896417bfe97ff5c5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465590
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of a pair of shifts to put the bits we want
in the low part of each vector lane and then a pair of UQXTN and UQXTN2
instructions to perform a saturating cast down from 16-bit elements to
8-bit elements. We can instead achieve the same thing by adding eight to
the first shift amount so that the bits we want appear in the high half
of the lane, doing the saturation at the same time, and then simply use
UZP2 to pull out the high halves of each lane in a single instruction.
Reduction in runtime for Convert16To8Row_NEON:
Cortex-A55: -19.7%
Cortex-A510: -23.5%
Cortex-A76: -35.4%
Cortex-X2: -34.1%
Bug: libyuv:976
Change-Id: I9a80c0f4f2c6b5203f23e422c0970d3167052f91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Shift instructions have worse throughput than other permute instructions
on some micro-architectures, and we can avoid the need for two separate
narrowing instructions by taking the high halves of each lane directly
through use of the UZP2 instruction.
Reduction in runtime for DivideRow_16_NEON:
Cortex-A55: -6.2%
Cortex-A510: -30.0%
Cortex-A76: -11.9%
Cortex-X2: -46.8%
Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The Neon dot-product instructions perform two widening steps rather than
one, saving us the need to widen the absolute difference to 16-bits
before accumulating. Additionally, the dot-product instructions tend to
have better performance characteristics than traditional widening
multiply instructions like SMLAL used in the existing
SumSquareError_NEON code.
Observed reduction in runtimes compared to the existing Neon kernel:
Cortex-A55: -9.1%
Cortex-A510: -36.7%
Cortex-A76: -37.6%
Cortex-A720: -48.8%
Cortex-X1: -56.1%
Cortex-X2: -42.6%
Bug: libyuv:977
Change-Id: Ie20c69040cc47a803d8e95620d31e0bf1e1dac12
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463945
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The MOV instruction is an alias of ORR where both registers are the
same and should be preferred.
Both ORR and MOV are not zero-cost instructions on all
micro-architectures so there may be better ways to express these
kernels, but this is left for a later commit.
Bug: libyuv:975
Change-Id: I29b7f182a57a61855cb7f8a867691080f153b10b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This reverts commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Reason for revert: breaks builds on windows and mac
Step _compile_ failed. Error logs are shown below:
[1/104] CXX obj/libyuv_internal/cpu_id.o
FAILED: obj/libyuv_internal/cpu_id.o
../../buildtools/reclient/rewrapper -cfg=../../buildtools/reclient_cfgs/chromium-browser-clang/rewra...(too long)
../../source/cpu_id.cc:25:10: fatal error: 'sys/auxv.h' file not found
25 | #include // For getauxval()
| ^~~~~~~~~~~~
1 error generated.
More information in raw_io.output_text[failure_summary]
Original change's description:
> [AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection
>
> This has the advantage of also working under emulation where
> faking /proc/cpuinfo is not supported.
>
> For the Chromium sandbox, getauxval is supported since API version 18.
> The minimum supported API version at time of writing is 21 so we should
> be able to use getauxval unconditionally. On the off-chance the call
> fails it will return 0 and we will correctly fall-back to using only
> Neon.
>
> Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
> Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Change-Id: Ic0f764217af7b4d998f19a8f78fc04ca85a45a3b
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463918
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The "memory" clobber needs to be present even if the asm does not store
anything to memory, since otherwise the compiler would be allowed to
reorder earlier stores to the pointers after they would be needed by the
asm.
Also fix up the zero-initialisation of accumulators in
SumSquareError_NEON, since EOR'ing a register by itself is not a
recognised zeroing idiom on most AArch64 micro-architectures.
Bug: libyuv:976
Change-Id: I3175367abf6f59db8371b4478f1156950277d7c5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5378705
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This has the advantage of also working under emulation where
faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at time of writing is 21 so we should
be able to use getauxval unconditionally. On the off-chance the call
fails it will return 0 and we will correctly fall-back to using only
Neon.
Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using full vectors for Add and Subtract is a win across the board. Using
full vectors for the multiply is less obviously a win, especially for
smaller cores like Cortex-A53 or Cortex-A57, so is not considered for
this change.
Observed changes in performance with this change compared to the
existing Neon code:
| ARGBAddRow_NEON | ARGBSubtractRow_NEON
Cortex-A55 | -5.1% | -5.1%
Cortex-A510 | -18.4% | -18.4%
Cortex-A76 | -28.9% | -28.7%
Cortex-A720 | -36.1% | -36.2%
Cortex-X1 | -14.2% | -14.4%
Cortex-X2 | -12.5% | -12.5%
Bug: libyuv:976
Change-Id: I85316d4399c93b53baa62d0d43b2fa453517f5b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457433
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a lot of shifts and combines the R and B
components into a single vector unnecessarily. We can express this much
more cleanly by making use of the SRI instruction to insert and replace
shifted bits into the original data, performing the 5/6-bit to 8-bit
expansion in a single instruction if the source bits are already in the
high bits of the byte. We still need a single separate XTN instruction
to narrow the B component before the left shift since Neon does not have
a narrowing left shift instruction.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
RGB565ToYRow_NEON | -22.1% | -23.4% | -25.1%
RGB565ToUVRow_NEON | -26.8% | -20.5% | -18.8%
RGB565ToARGBRow_NEON | -38.9% | -32.0% | -23.5%
Bug: libyuv:976
Change-Id: I77b8d58287b70dbb9549451fc15ed3dd0d2a4dda
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5374286
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Most micro-architectures seem to prefer an additional ZIP1 instruction
in READYUV422 to needing a lane-indexed LD1 load instruction.
We introduce a new macro to handle the YUV to RGB conversion where the U
and V components are in separate vectors. This avoids causing a slowdown
for the UV-interleaved input format kernels (NV12 and NV21) where we do
not want to separate them.
Reduction in runtime for selected kernels on Cortex cores (no
performance difference observed on Cortex-A55):
A510 A76 A720 X1 X2
I422AlphaToARGBRow_NEON -4.3% -7.3% -10.1% -4.0% -4.4%
I422ToARGB1555Row_NEON -4.5% +0.4% -7.9% -4.8% -3.9%
I422ToARGB4444Row_NEON -7.7% -2.6% -4.1% -1.9% -1.3%
I422ToARGBRow_NEON -3.7% -2.9% -10.2% -3.8% -4.4%
I422ToRGB24Row_NEON -5.9% +5.4% -3.2% -4.3% -4.3%
I422ToRGB565Row_NEON -4.8% -2.8% -8.5% -3.8% -4.6%
I422ToRGBARow_NEON -3.7% +4.6% -10.5% -3.0% -4.5%
I444AlphaToARGBRow_NEON -3.5% +2.7% -3.7% -5.0% -8.2%
I444ToARGBRow_NEON -1.8% -15.1% -3.5% -6.5% -8.1%
I444ToRGB24Row_NEON -2.0% -6.8% +0.1% -4.7% +1.2%
There are a few cases which are slower on Cortex-A76, but significant
speedups elsewhere.
Bug: libyuv:976
Change-Id: Ib3b4ef81f7bfc1d7ff9c4c24aef9ad86741410ff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465580
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing transformations can be more cleanly expressed by using SRI
instructions to perform a shift and simultaneously merge in to an
existing value.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
ARGB1555ToYRow_NEON | -26.2% | -14.9% | -28.2%
ARGB1555ToUVRow_NEON | -25.2% | -18.4% | -20.9%
ARGB1555ToARGBRow_NEON | -43.6% | -32.8% | -19.7%
Bug: libyuv:976
Change-Id: Id07ac6f2cd3eb9bb70f9e29fc1f4b29fe26156ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383444
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing sequence to convert from 8-bit ARGB to 4-bit ARGB4444 makes
use of a lot of shifts and bit-clears before ORR'ing the pairs together.
This is unnecessary since we can do the same with the SRI instruction,
so use that instead.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76
ARGBToARGB4444Row_NEON | -15.3% | -16.6%
I422ToARGB4444Row_NEON | -2.7% | -11.9%
Bug: libyuv:976
Change-Id: I86cd86c7adf1105558787a679272179821f31a9d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383443
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The value of UV components in the vector are known and the vectors are
never overwritten, so we can hoist the UV-specific parts of the
calculation out of the loop.
Reduction in runtimes for I400ToARGBRow_NEON:
Cortex-A55: -10.0%
Cortex-A510: -3.7%
Cortex-A76: -19.3%
Cortex-X2: -14.4%
Bug: libyuv:976
Change-Id: I17d6de4e1790f71407e12ff84548568cc3ebbe1a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457434
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBMultiplyRow_NEON:
Cortex-A55: -22.3%
Cortex-A510: -56.6%
Cortex-A76: -45.5%
Cortex-X2: -54.6%
Change-Id: I9103111a109a4d87d358e06eb513746314aaf66a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454832
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBSubtractRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Ifbfce9e6a45159932c09d9b0229215a36fa22f43
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454833
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBAddRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Id04e5259d8e5e7511dad5df85cdf9759b392cb99
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454831
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use a pair of LD2s to load data interleaved and perform a couple of
additions on the registers in order to avoid needing LD4 and ST4
instructions, since these are costly on some micro-architectures.
Reduction in run times:
Cortex-A55: -20.5%
Cortex-A510: -28.3%
Cortex-A76: -21.5%
Bug: libyuv:976
Change-Id: If66e1e148b031c2cd288ff412f351d7a0b9b91e7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371774
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Replace indexed LD1 instructions with LDRs to avoid loop-carried
dependencies on unused lanes between consecutive iterations of the loop.
Reduction in run times:
Cortex-A55: -10.9%
Cortex-A510: -70.7%
Cortex-A76: -56.8%
Bug: libyuv:976
Change-Id: Ia767e76002c7823177e80163ebf034e023e9a6cc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371771
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>