1892 Commits

Author SHA1 Message Date
Darren Hsieh
b5a18f9d93 [RVV] Optimize ScaleARGBFilterCols with RVV
* Run on SiFive internal FPGA:

Test Case	                Speedup
ARGBScaleDownBy3by8_Linear      x2.05
ARGBScaleDownBy3by8_Bilinear    x1.76
ARGBScaleDownBy3by8_Box         x1.76

Bug: 42280924
Co-Developed-by: Bruce Lai <bruce.lai@sifive.com>
Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-29 17:32:00 -08:00
George Steed
cce8950816 [AArch64] Remove unused SVE INDEX instrs from NV{12,21} kernels
When reading subsampled UV data in NV{12,21} we previously needed to
permute the data to both (a) duplicate each element into the
corresponding pair of lanes for the Y elements; and (b) arrange the UV
components in the correct lanes. This was done in a vector-length
agnostic way by generating the permute indices dynamically at runtime
through an SVE INDEX instruction.

Now that we are using the READNV_SVE_2X macro everywhere these
instructions are now redundant: the multiplications are done on the
subsampled UV data before the duplication and the conversion macro takes
arguments that adjust whether we need to operate on the even or odd
lanes of the vector.

Since the permute indices generated by these INDEX instructions are now
unused, remove them.

Change-Id: I3298a83aadfda52c4cc89bc4fd6518b06765a187
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6089957
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-26 14:47:00 -08:00
George Steed
45c7107f95 [AArch64] Fix compilation when SME is not supported
The STREAMING_COMPATIBLE macro is designed to enable use of the
__arm_streaming_compatible attribute with the intent that this macro
expanded to empty if SME is not supported by the compiler or platform
being compiled for, however in reality this macro remained undefined
causing compilation to fail. Fix this by defining the macro to empty as
originally intended.

No-Try: True
Change-Id: I8f5a8a606289b7c045fa1cce609f5a6d644891ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087913
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-12-13 08:16:50 -08:00
George Steed
7fd0bd197e [AArch64] Port YUVToRGB color conversions to SME
Some of the color conversion kernels already have Streaming-SVE
implementations however many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.

As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:

- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME

Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:07:54 -08:00
George Steed
c2e7f8389a [AArch64] Add SME implementations of InterpolateRow{,_16,_16To8}
InterpolateRow_SME and InterpolateRow_16_SME need special cases to
handle if source_y_fraction is 256 since this would overflow a byte and
can just be a call to memcpy instead.

InterpolateRow_16To8_SME is never called with a source_y_fraction value
of 256 so there is no need for a special case here.

Change-Id: I67805b5db2c411acb93ada626cf414b35620f467
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:03:41 -08:00
George Steed
2d8652f3e7 [AArch64] Add SME implementation of CopyRow
Add a streaming-SVE implementation of CopyRow using normal vector
load/store instructions.

Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:02:07 -08:00
George Steed
418b6df0de [AArch64] Add SME implementation of Convert16To8Row
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 03:01:55 -08:00
runzezhang
192b8c2238 Add NV24 scaling support to libyuv
Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing.

Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 02:46:11 -08:00
George Steed
85331e00cc [AArch64] Add SME impls of ScaleRowDown2{,Linear,Box}_16
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
SME kernels, but operating on 16-bit data rather than 8-bit.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 01:21:08 -08:00
George Steed
15f2ae7d70 [AArch64] Add SME impls of ScaleARGBRowDown2{,Linear,Box}
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2
and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples
rather than 8-bit data or 16-bit UV tuples.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 01:19:20 -08:00
George Steed
7391559cb4 [AArch64] Add SME implementation of MergeUVRow{,_16}
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 01:16:19 -08:00
George Steed
8f659daffd [AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.

Reduction in runtimes observed compared to the existing Neon
implementations:

            | NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 |         -60.7% |         -60.7%
Cortex-A520 |         -46.0% |         -46.0%
Cortex-A715 |         -25.2% |         -25.2%
Cortex-A720 |         -25.2% |         -25.2%
  Cortex-X2 |         -28.9% |         -29.0%
  Cortex-X3 |         -28.2% |         -28.1%
  Cortex-X4 |         -30.8% |         -30.7%
Cortex-X925 |         -28.8% |         -28.9%

Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-04 17:51:08 +00:00
George Steed
9144583f22 [AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow
Mostly just a translation of the existing Neon code to SME.

Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-03 22:11:19 +00:00
George Steed
9a9752134e [AArch64] Add Neon implementation of ScaleRowDown2Linear_16
Reduction in runtime observed relative to the auto-vectorized C
implementation compiled with LLVM 19:

  Cortex-A55: -13.7%
 Cortex-A510: -49.0%
 Cortex-A520: -32.0%
  Cortex-A76: -34.3%
 Cortex-A710: -56.7%
 Cortex-A715: -45.4%
 Cortex-A720: -44.7%
   Cortex-X1: -70.6%
   Cortex-X2: -67.9%
   Cortex-X3: -72.2%
   Cortex-X4: -40.0%
 Cortex-X925: -24.1%

Bug: b/42280942
Change-Id: I977899a2239e752400c9901f4d8482a76841269a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040154
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-25 21:10:26 +00:00
George Steed
11c57f4f12 [AArch64] Add Neon implementation of ScaleRowDown2_16_NEON
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.

Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:

 Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
 Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
  Cortex-X1: -48.5%
  Cortex-X2: -47.8%
  Cortex-X3: -47.6%
  Cortex-X4: -51.1%
Cortex-X925: -14.6%

Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-25 21:10:05 +00:00
George Steed
952d6a282f [AArch64] Enable use of ScaleRowDown2Box_16_NEON
The #ifdef surrounding the use of this kernel is never defined and
ScaleRowDown2_16_NEON does not exist, so add the missing #define and
remove the use of ScaleRowDown2_16_NEON for now. Additionally since
there is no implementation of this kernel for 32-bit Arm, restrict the
define to only be present on AArch64.

Bug: b/42280942
Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-24 19:58:00 +00:00
George Steed
9ed07258c7 [AArch64] Add SVE2 implementation of I410ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -18.1%
Cortex-A520:  -6.0%
Cortex-A715: -22.0%
Cortex-A720: -21.1%
  Cortex-X2:  -9.4%
  Cortex-X3: -12.0%
  Cortex-X4:  -7.6%
Cortex-X925:  -5.8%

Bug: b/42280942
Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:59:55 +00:00
George Steed
3dd047733e [AArch64] Add SVE2 implementation of I410AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.2%
Cortex-A520:  -6.9%
Cortex-A715: -14.8%
Cortex-A720: -16.0%
  Cortex-X2: -14.8%
  Cortex-X3: -17.5%
  Cortex-X4: -12.8%
Cortex-X925: -13.0%

Bug: b/42280942
Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:58:11 +00:00
George Steed
e84d809348 [AArch64] Add SVE2 implementation of I410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -37.9%
Cortex-A520:  -9.2%
Cortex-A715: -14.3%
Cortex-A720: -14.2%
  Cortex-X2: -10.9%
  Cortex-X3: -11.1%
  Cortex-X4: -12.5%
Cortex-X925: -10.6%

Bug: b/42280942
Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:54:48 +00:00
George Steed
7c9c72ab4b [AArch64] Add SVE2 implementation of I210ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.5%
Cortex-A520:  -3.8%
Cortex-A715: -15.8%
Cortex-A720: -15.8%
  Cortex-X2:  -7.9%
  Cortex-X3:  -6.5%
  Cortex-X4:  -5.0%
Cortex-X925:  -5.3%

Bug: b/42280942
Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-23 00:53:16 +00:00
George Steed
fc3569ad27 [AArch64] Add SVE2 implementation of I210AlphaToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -33.9%
Cortex-A520:  -4.2%
Cortex-A715: -22.0%
Cortex-A720: -22.4%
  Cortex-X2: -14.6%
  Cortex-X3: -14.5%
  Cortex-X4: -11.6%
Cortex-X925: -12.6%

Bug: b/42280942
Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:47:32 +00:00
George Steed
50108f29fb [AArch64] Add SVE2 implementation of I212ToAR30Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -15.4%
Cortex-A520:  -3.8%
Cortex-A715: -15.7%
Cortex-A720: -15.6%
  Cortex-X2:  -7.9%
  Cortex-X3:  -5.7%
  Cortex-X4:  -5.3%
Cortex-X925:  -4.8%

Bug: b/42280942
Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:27:57 +00:00
George Steed
305a7a4ede [AArch64] Add SVE2 implementation of I212ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -16.1%
  Cortex-X2: -11.9%
  Cortex-X3: -11.9%
  Cortex-X4:  -9.3%
Cortex-X925: -11.2%

Bug: b/42280942
Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-23 00:21:27 +00:00
Frank Barchard
595146434a HalfFloat fix SigIll on aarch64
- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value

Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-11-22 22:08:00 +00:00
Frank Barchard
307b951229 Add CopyPlane_Unaligned, _Any and _Invert tests/benchmarksCpuId test
- Add AMD_ERMSB detect for ERMS on AMD

Bug: 379457420
Change-Id: I608568556024faf19abe4d0662aeeee553a0a349
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6032852
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-11-19 23:53:05 +00:00
Frank Barchard
1c501a8f3f CpuId test FSMR - Fast Short Rep Movsb
- Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features

Intel CPUs starting at Icelake support FSMR
adl:Has FSMR 0x8000
arl:Has FSMR 0x0
bdw:Has FSMR 0x0
clx:Has FSMR 0x0
cnl:Has FSMR 0x0
cpx:Has FSMR 0x0
emr:Has FSMR 0x8000
glm:Has FSMR 0x0
glp:Has FSMR 0x0
gnr:Has FSMR 0x8000
gnr256:Has FSMR 0x8000
hsw:Has FSMR 0x0
icl:Has FSMR 0x8000
icx:Has FSMR 0x8000
ivb:Has FSMR 0x0
knl:Has FSMR 0x0
knm:Has FSMR 0x0
lnl:Has FSMR 0x8000
mrm:Has FSMR 0x0
mtl:Has FSMR 0x8000
nhm:Has FSMR 0x0
pnr:Has FSMR 0x0
rpl:Has FSMR 0x8000
skl:Has FSMR 0x0
skx:Has FSMR 0x0
slm:Has FSMR 0x0
slt:Has FSMR 0x0
snb:Has FSMR 0x0
snr:Has FSMR 0x0
spr:Has FSMR 0x8000
srf:Has FSMR 0x0
tgl:Has FSMR 0x8000
tnt:Has FSMR 0x0
wsm:Has FSMR 0x0

Intel CPUs starting at Ivybridge support ERMS

adl:Has ERMS 0x4000
arl:Has ERMS 0x4000
bdw:Has ERMS 0x4000
clx:Has ERMS 0x4000
cnl:Has ERMS 0x4000
cpx:Has ERMS 0x4000
emr:Has ERMS 0x4000
glm:Has ERMS 0x4000
glp:Has ERMS 0x4000
gnr:Has ERMS 0x4000
gnr256:Has ERMS 0x4000
hsw:Has ERMS 0x4000
icl:Has ERMS 0x4000
icx:Has ERMS 0x4000
ivb:Has ERMS 0x4000
knl:Has ERMS 0x4000
knm:Has ERMS 0x4000
lnl:Has ERMS 0x4000
mrm:Has ERMS 0x0
mtl:Has ERMS 0x4000
nhm:Has ERMS 0x0
pnr:Has ERMS 0x0
rpl:Has ERMS 0x4000
skl:Has ERMS 0x4000
skx:Has ERMS 0x4000
slm:Has ERMS 0x4000
slt:Has ERMS 0x0
snb:Has ERMS 0x0
snr:Has ERMS 0x4000
spr:Has ERMS 0x4000
srf:Has ERMS 0x4000
tgl:Has ERMS 0x4000
tnt:Has ERMS 0x4000
wsm:Has ERMS 0x0
Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-18 17:56:45 +00:00
Frank Barchard
75f7cfdde5 SplitRGB for SSE4 and AVX2
libyuv_test '--gunit_filter=*SplitRGB*' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *SplitRGB*

Skylake Xeon x86 32 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms)

Skylake Xeon x86 64 bit
AVX2  LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms)
SSE4  LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms)
SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms)
C     LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms)

Bug: 379186682
Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2024-11-15 00:46:25 +00:00
George Steed
823d960afc [AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row
Observed reductions in runtime compared to the existing Neon code:

            | P210ToAR30Row | P410ToAR30Row
Cortex-A510 |       -16.5%  |        -21.2%
Cortex-A520 | (!)    +2.7%  |         -8.7%
Cortex-A715 |        -6.1%  |         -6.1%
Cortex-A720 |        -6.2%  |         -5.9%
  Cortex-X2 |        -4.1%  |         -4.2%
  Cortex-X3 |        -4.2%  |         -4.2%
  Cortex-X4 |        -1.2%  |         -1.2%
Cortex-X925 |        -3.6%  |         -2.8%

Bug: b/42280942
Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:52:21 +00:00
George Steed
0ddf3f7b90 [AArch64] Add SVE2 implementation of I210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.5%
Cortex-A520:  -6.5%
Cortex-A715: -10.1%
Cortex-A720: -13.9%
  Cortex-X2: -11.9%
  Cortex-X3: -11.6%
  Cortex-X4:  -9.5%
Cortex-X925: -11.5%

Bug: b/42280942
Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-14 16:36:41 +00:00
George Steed
5b906a0ec8 [AArch64] Add SVE2 implementation of P410ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -34.7%
Cortex-A520:  -2.4%
Cortex-A715: -18.7%
Cortex-A720: -18.8%
  Cortex-X2:  -7.7%
  Cortex-X3:  -8.9%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.3%

Bug: b/42280942
Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:34:56 +00:00
George Steed
b753822d47 [AArch64] Add SVE2 implementation of P210ToARGBRow
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -32.8%
Cortex-A520:  +8.7% (!)
Cortex-A715: -18.9%
Cortex-A720: -18.9%
  Cortex-X2:  -7.9%
  Cortex-X3:  -8.8%
  Cortex-X4:  +1.0% (!)
Cortex-X925:  -8.6%

Bug: b/42280942
Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-11-12 18:32:55 +00:00
George Steed
721ad4aa18 [AArch64] Add SME implementation of ScaleUVRowDown2Box
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:30:30 +00:00
George Steed
576218dbce [AArch64] Add SME implementation of ScaleUVRowDown2Linear
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:28:57 +00:00
George Steed
551cee7845 [AArch64] Add SME implementation of ScaleUVRowDown2
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-12 18:24:28 +00:00
George Steed
5c12e0b2de [AArch64] Add SVE2 implementations of HalfFloat{,1}Row
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.

For HalfFloatRow, SVE zero-extending loads avoid the need for seperate
UXTL(2) instructions.

Observed reductions in runtime compared to the existing Neon code:

            | HalfFloat1Row | HalfFloatRow
Cortex-A510 |        -38.3% |       -17.3%
Cortex-A520 |        -37.6% |       -18.8%
Cortex-A720 |        -50.1% |        -7.8%
  Cortex-X2 |        -50.2% |        -0.4%
  Cortex-X4 |        -51.5% |       -12.5%

Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:53:00 +00:00
George Steed
f27b983f38 [AArch64] Add SVE2 implementation of DivideRow_16
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
  Cortex-X2:  -5.2%
  Cortex-X3:  -2.6%
  Cortex-X4: -32.4%
Cortex-X925:  -1.5%

Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:46:02 +00:00
George Steed
aec4b4e22e [AArch64] Add SME implementation of ScaleRowDown2Box
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:42:21 +00:00
George Steed
51d07554a0 [AArch64] Add SME implementation of ScaleRowDown2Linear
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-30 17:57:15 +00:00
George Steed
593965cea2 [AArch64] Add SME implementation of ScaleRowDown2
Including associated changes for adding a new scale_sme.cc file.

There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead.  We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-30 17:56:41 +00:00
George Steed
237f39cb8c [AArch64] Add SME implementation of I444ToARGBRow
This is based on an unrolled version of the existing SVE2 code. The
implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-29 18:10:23 +00:00
George Steed
22c5c18778 [AArch64] Add SME implementation of I422ToARGBRow
Including addition of a new row_sme.cc file and associated
infrastructure.

The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-29 05:49:28 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
Frank Barchard
ffd791f749 Check malloc allocation sizes are less than SIZE_MAX
Bug: b/371615496
Change-Id: I75a94b08469d6d6b6fd55a8659031cbcb3d48eed
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5912039
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-10-07 21:34:15 +00:00
George Steed
dfa279fc65 Re-enable SME when building for AArch64 Android
Now that SME has been re-enabled for Linux for a while, also re-enable
it for Android when building with a sufficiently new version of LLVM.

Bug: b/359006069
Change-Id: Ibaa47e31826cf20136a11d551621fd62c1abab3c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5908389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-10-04 17:43:26 +00:00
George Steed
02c6e8baca Change ARGBMultiplyRow_C to match Neon
The existing behaviour does not round correctly in all cases, so adjust
it to match the existing Neon implementation.

Update the tests to require bit-exactness and disable other
implementations that do not round correctly.

Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-23 21:48:33 +00:00
George Steed
a37e6bc81b [AArch64] Re-enable SME only for Linux and new versions of Clang
This was previously disabled in
679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for
Linux where SME is known to work correctly.

Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-23 09:29:53 +00:00
George Steed
8315fa1d3a Avoid duplication of CPU feature disable macros
The same conditions are repeated across all *_row.h headers which makes
it harder than necessary to guard enabling new architecture features
depending on compiler versions etc.

Avoid this duplication by merging the conditions into a new
cpu_support.h header.

Change-Id: Ibe7dfcef138edca6cc36870f1cfbb1bb108083e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802957
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-23 09:28:24 +00:00
George Steed
432d186116 [AArch64] Add Neon dot-product implementation for ARGBSepiaRow
We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.

ST4 is also known to be slow on more modern micro-architectures, however
avoiding this is left for a future SVE implementation where we can make
use of interleaving-narrowing instructions.

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55:  -5.8%
Cortex-A510: -18.9%
 Cortex-A76: -21.8%
Cortex-A720: -30.2%
  Cortex-X1: -28.6%
  Cortex-X2: -23.4%

Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:35 +00:00
George Steed
1c31461771 [AArch64] Add Neon dot-product implementation for ARGBGrayRow
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55: -12.6%
Cortex-A510: -48.6%
 Cortex-A76: -39.7%
Cortex-A720: -52.3%
  Cortex-X1: -63.5%
  Cortex-X2: -67.0%

Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:11 +00:00
Frank Barchard
4620f17058 ScalePlane crash fix for 3/4 scaling
- Scaling 48 pixels at a time, but calling code checked for 24 pixels
- Added test for scaling to 1080x1920
  libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560

Was
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):

Now
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 -     6741 us C -     3566 us OPT
[       OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)

Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-09-13 01:20:39 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Wan-Teh Chang
0c2cf03c5c Fix a -Wundef warning on macOS with Apple silicon
Change-Id: Ia78dcc913e06dd8876119a96bd7760c1d2af4341
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5788821
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 22:10:43 +00:00
Wan-Teh Chang
02e2ff4745 Note stride params of HalfFloatPlane are in bytes
The HalfFloatPlane() function does not follow libyuv's convention of buffer
stride in units of the corresponding buffer pointer. Document that.

Change-Id: Id8d466ccc2df263a49ad788ab349bc3993a48259
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5770639
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-12 20:17:23 +00:00
Wan-Teh Chang
3cf54e90d3 Fix -Wmissing-prototypes warnings
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.

Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-12 19:08:24 +00:00
Frank Barchard
a97746349b Add test for I010ToNV12
- Add support for negative height to invert
- Fix off by 1 on odd width and height
- Bump version to 1895

Initial I010 is 2 step planar conversion

libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms)
Pixel 7
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms)

Bug: b/357721018, b/357439226
Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-12 18:57:56 +00:00
Chunbo Hua
e23bc72e8e Bump version number in order to expose new API
Bug: 357721018
Change-Id: I2c6e115cd049db2038631195305c5907764d5c7b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5768078
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-07 22:10:05 +00:00
Chunbo Hua
fc94178260 Implement I010ToNV12 conversion
I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.

Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-06 17:36:13 +00:00
Frank Barchard
32ccd53bb3 Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
  P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is be rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
  of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.

Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-05 18:55:44 +00:00
Frank Barchard
4cd90347e7 Rotate use NULL for C compatability
Bug: b/353323977
Change-Id: I2472f23ce8fcc0bc09a292bd6fb758304c6c2b18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5735714
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-07-23 18:02:47 +00:00
George Steed
b5f9d7cb76 [AArch64] Add SME implementation of TransposeUVWxH
We can make use of the ZA tile register to do the transpose and
de-interleaving of UV components without any explicit permute
instructions: the tile is loaded horizontally placing UV components into
alternative columns, then we can just store the independent components
vertically.

Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 12:15:40 +00:00
George Steed
15ecca81f7 [AArch64] Add SME implementation of TransposeWxH
We can make use of the ZA tile register to do the transpose without any
explicit permute instructions: just load the tile horizontally and store
it vertically.

Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 12:14:39 +00:00
George Steed
a4ccf9940e [AArch64] Add I8MM implementation of ARGBToUV444Row
We cannot use the standard dot-product instructions since the
coefficients multiplication results are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here.  We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
  Cortex-X2: -62.1%
  Cortex-X3: -62.2%
  Cortex-X4: -40.4%

Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-16 17:32:52 +00:00
George Steed
a64fffe632 Revert "Disable NV12ToARGB_SVE2 which fails the 'any' test"
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.

This code has a number of small issues:

* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
  all-true, however the existing kernel does not initialise p0 until
  after this macro is called, so flip the order.

* The p2 register is missing from the clobber list, so add it.

* The existing code uses the wrong condition flags when determining
  whether to do the tail iteration using WHILE instructions or not.
  Additionally the number of tail iterations is incorrect, as it was
  incorrectly not changed from when the tail code was always executed.

While we are here, make another few small improvements:

* Remove the single-quote digit separators as requested here:
  https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133

* Remove "volatile" from the asm block counting the vector length.  This
  particular asm block cannot be removed by the compiler since the
  output register is consumed by subsequent code, so "volatile" is
  unnecessary here and we remove it.

* Add some additional empty comments to force clang-format to put macros
  into the next line rather than on the same line as other asm.

Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-15 18:13:42 +00:00
Frank Barchard
ec9a8781c7 Disable NV12ToARGB_SVE2 which fails the 'any' test
Bug: b/352371649
Change-Id: I9cd332432c1baa1fb64d4040fa5f207cc54dc82c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5698374
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-07-11 23:32:13 +00:00
George Steed
899bc48327 [AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD4 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementations:

            | ARGBToRAWRow | ARGBToRGB24Row
Cortex-A510 |       -50.8% |         -19.9%
Cortex-A720 |       -39.8% |         -39.1%
  Cortex-X2 |       -66.5% |         -51.9%

Bug: libyuv:973
Change-Id: Iaead678715a3d70d54cf823391272a6196836769
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 20:27:54 +00:00
George Steed
555f80f3ce [AArch64] Add SVE2 implementation of RGB24ToARGBRow
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.

Observed reduction in runtime compared to the existing Neon kernel:

Cortex-A510: -25.6%
Cortex-A720: -15.2%
  Cortex-X2: -10.2%
  Cortex-X4: -30.2%

Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-08 20:12:05 +00:00
George Steed
11ff6067a5 [AArch64] Add SVE2 implementation of RAWToRGB24Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD3 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementation:

Cortex-A510: -39.2%
Cortex-A720: -34.5%
  Cortex-X2: -31.0%

Bug: libyuv:973
Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 15:55:14 +00:00
George Steed
c613c3f102 [AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.

Reduction in runtimes observed compared to the existing Neon
implementation:

            | RAWToARGBRow | RAWToRGBARow
Cortex-A510 |       -32.4% |       -31.9%
Cortex-A720 |       -15.7% |       -15.6%
  Cortex-X2 |       -24.6% |       -24.4%

Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-06 22:40:15 +00:00
George Steed
d1ec694ad3 [AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

              | Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow |     -59.8% |      -16.8% |     -53.2%
P210ToAR30Row |     -48.1% |      -21.8% |     -54.0%
P410ToARGBRow |     -56.5% |      -32.2% |     -54.1%
P410ToAR30Row |     -42.4% |       -4.5% |     -50.4%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-06 22:37:08 +00:00
Frank Barchard
611806a155 [AArch64] Fix SVE/SME vector length printing in cpuid
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.

- Add volatile asm in row_gcc and row_neon64

Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 19:44:41 +00:00
George Steed
d32436e8f8 [AArch64] Add Neon implementation for I422ToAR30Row_NEON
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 Cortex-A55: -43.1%
Cortex-A510: -22.3%
 Cortex-A76: -54.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 18:16:33 +00:00
George Steed
bbd9cedc4f [AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToAR30Row | I210ToARGBRow
 Cortex-A55 |        -40.8% |        -54.4%
Cortex-A510 |        -26.2% |        -22.7%
 Cortex-A76 |        -49.2% |        -44.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-02 18:16:33 +00:00
Frank Barchard
fa16ddbb9f cpuid show vector length on ARM and RISCV
- additional asm volatile changes from github
- rotate mips remove C function - moved to common

Run on Samsung S22
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 5.10
Has Arm 0x2
Has Neon 0x4
Has Neon DotProd 0x10
Has Neon I8MM 0x20
Has SVE 0x40
Has SVE2 0x80
Has SME 0x0
SVE vector length: 16 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (0 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 201703
__clang_major__ 17
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__aarch64__ 1
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Run on RISCV qemu emulating SiFive X280:
[ RUN      ] LibYUVBaseTest.TestCpuHas
Kernel Version 6.6
Has RISCV 0x10000000
Has RVV 0x20000000
RVV vector length: 64 bytes
[       OK ] LibYUVBaseTest.TestCpuHas (4 ms)
[ RUN      ] LibYUVBaseTest.TestCompilerMacros
__ATOMIC_RELAXED 0
__cplusplus 202002
__clang_major__ 9999
__clang_minor__ 0
__GNUC__ 4
__GNUC_MINOR__ 2
__riscv 1
__riscv_vector 1
__riscv_v_intrinsic 12000
__riscv_zve64x 1000000
__clang__ 1
__llvm__ 1
__pic__ 2
INT_TYPES_DEFINED
__has_feature

Bug: b/42280943
Change-Id: I53cf0450be4965a28942e113e4c77295ace70999
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5672088
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 18:10:56 +00:00
Frank Barchard
616bee5420 Add volatile for gcc inline to avoid being removed
Bug: b/42280943
Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187
Reviewed-by: David Gao <davidgao@google.com>
2024-07-02 01:25:24 +00:00
Frank Barchard
efd164d64e Disable RVV ScaleDownBy4 if compiler option is not enabled
- Some configs have int64 elements off by default.
  Disable ScaleDownBy4 row function to avoid compile error

Bug: 344954354
Change-Id: Ie0d74daea72375eff6438ab54cb2803d68d67e52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598460
Reviewed-by: James Zern <jzern@google.com>
2024-06-18 01:52:40 +00:00
Bruce Lai
7758c961c5 Support RVV v0.12 intrinsics for row_rvv.cc & scale_rvv.cc
1. Add two defined marco LIBYUV_RVV_HAS_TUPLE_TYPE & LIBYUV_RVV_HAS_VXRM_ARG

Intrinsic v0.12 introduces
- tuple type in segment load & store
- vxrm argument in fixed-point intrinsics (e.g vnclip)

These two marcos are controled by __riscv_v_intrinsic.

2. Support RVV v0.12 intrinsics in row_rvv.cc & scale_rvv.cc

Change-Id: I921f91d9dc8fdda031e7b6647d0e296aa2793c39
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4767120
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-17 18:01:49 +00:00
George Steed
367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:06:46 +00:00
George Steed
cd4113f4e8 [AArch64] Add SVE2 implementation of I400ToARGBRow
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.

Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.

Reduction in runtime observed compared to the existing Neon code:

Cortex-A510: -26.1%
Cortex-A520:  -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
  Cortex-X2: -22.5%
  Cortex-X3: -23.5%
  Cortex-X4: -21.6%

Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:02:06 +00:00
George Steed
34abe98fe2 [AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.

Observed reductions in runtimes compared to the existing Neon code:

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -29.1% |             -29.1%
Cortex-A720 |              -4.8% |              -4.8%
  Cortex-X2 |              -9.2% |              -9.2%

Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:24:40 +00:00
George Steed
a758a15dbf [AArch64] Add I8MM implementation of ARGBColorMatrixRow
We cannot use the standard dot-product instructions since the matrix of
coefficients are signed, but I8MM supports mixed-sign products which
work well here.

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -50.8%
Cortex-A520: -33.3%
Cortex-A715: -38.6%
Cortex-A720: -38.5%
  Cortex-X2: -43.2%
  Cortex-X3: -40.0%
  Cortex-X4: -55.0%

Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:17:59 +00:00
George Steed
c8974cf8d4 [AArch64] Add SME feature detection on Linux
This commit just adds the kCpuHasSME to represent that the CPU has the
Arm Scalable Matrix Extension enabled, but this commit does not
introduce any code to actually use it yet.

Add a test to check that the HWCAP value is interpreted correctly.

Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-08 23:34:22 +00:00
George Steed
96bbdb53ed [AArch64] Add SVE2 implementation of I422ToRGBARow
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.

The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.

Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.

Reductions in runtime observed compared to the existing Neon code:

Cortex-A510: -27.0%
Cortex-A720:  -5.3%
  Cortex-X2: -14.7%

Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
004352ba16 [AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.

Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.

Observed reduction in runtime compared to the existing Neon code:

            | AYUVToUVRow | AYUVToVURow
Cortex-A510 |      -33.1% |      -33.0%
Cortex-A720 |      -25.1% |      -25.1%
  Cortex-X2 |      -59.5% |      -53.9%
  Cortex-X4 |      -39.2% |      -39.4%

Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
d0da5a3298 [AArch64] Add SVE2 implementation of ARGB1555ToARGBRow
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.

Observed reduction in runtime relative to the existing Neon code:

Cortex-A510: -69.7%
Cortex-A720:  -7.7%
  Cortex-X2: -41.9%
  Cortex-X4: -14.5%

Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
bce3392830 [AArch64] Add SVE2 implementation of ARGBToRGB565Row
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -27.1%
Cortex-A720: -49.4%
  Cortex-X2: -67.9%

Bug: libyuv:973
Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 17:42:27 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
George Steed
4f7fd808b7 [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON
The existing Neon code only makes use of 64-bit vectors throughout which
limits the performance on larger cores. To avoid this, swap the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.

The original code also handled widths that were not exact multiples of
16, however this should already be handled by the "any" kernel so it is
removed.

Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.

Observed speedups for TransposePlane across a range of
micro-architectures:

 Cortex-A53: -40.0%
 Cortex-A55: -20.7%
 Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
  Cortex-X2: -38.3%
  Cortex-X4: -43.6%

Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:46:42 +00:00
George Steed
9fac9a4a82 [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row
There are existing x86 implementations for these kernels but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | ABGRToAR30Row | ARGBToAR30Row
 Cortex-A55 |        -55.1% |        -55.1%
Cortex-A510 |        -39.3% |        -40.1%
 Cortex-A76 |        -62.3% |        -63.6%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:35:07 +00:00
George Steed
ee830a5f77 [AArch64] Enable feature detection on Windows and Apple Silicon
Using the platform-specific functions IsProcessorFeaturePresent and
sysctlbyname to check individual features.

Bug: libyuv:980
Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-03 18:42:51 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
67e5e79dbe [AArch64] Add Neon implementation of HashDjb2
Reduction in runtime observed compared to the existing C code compiled
with LLVM 18:

 Cortex-A55: -46.2%
Cortex-A510: -60.4%
 Cortex-A76: -82.9%
Cortex-A720: -87.4%
  Cortex-X1: -90.0%
  Cortex-X2: -91.7%

Change-Id: I39a4479f78299508043a864e64fb40578c66ce19
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494094
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:37:31 +00:00
George Steed
f483007b9a [AArch64] Add SVE implementation for I422AlphaToARGBRow
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -32.1%
Cortex-A720:  -5.1%
  Cortex-X2: -10.1%

Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:07 +00:00
George Steed
b53b27d6bf [AArch64] Add SVE implementation for I444AlphaToARGBRow
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -34.2%
Cortex-A720: -17.6%
  Cortex-X2:  -9.6%

Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:02 +00:00
George Steed
6ac90403a1 [AArch64] Add SVE implementation for I422ToARGBRow
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -25.0%
Cortex-A720:  -5.0%
  Cortex-X2: -10.8%

Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-27 18:26:11 +00:00