2724 Commits

Author SHA1 Message Date
George Steed
aec4b4e22e [AArch64] Add SME implementation of ScaleRowDown2Box
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128 bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-07 18:42:21 +00:00
George Steed
b0f72309c6 Remove duplicate kernel assignment from scale_uv.cc
The assignment of ScaleUVRowDown2Box_NEON is already done in the block
immediately below this one, so just remove this code.

Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-11-01 15:42:21 +00:00
George Steed
f00c43f4d6 [AArch64] Unroll HalfFloat{,1}Row_NEON
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.

Reduction in cycle counts observed compared to the existing Neon
implementation:

            | HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 |             -37.1% |            -40.8%
Cortex-A520 |             -32.3% |            -37.4%
Cortex-A720 |               0.0% |            -10.6%
  Cortex-X2 |               0.0% |             -7.8%
  Cortex-X4 |              +0.3% |             -6.9%

Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-30 17:58:29 +00:00
George Steed
51d07554a0 [AArch64] Add SME implementation of ScaleRowDown2Linear
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128 bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-30 17:57:15 +00:00
George Steed
593965cea2 [AArch64] Add SME implementation of ScaleRowDown2
Including associated changes for adding a new scale_sme.cc file.

There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128 bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.

Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-30 17:56:41 +00:00
George Steed
237f39cb8c [AArch64] Add SME implementation of I444ToARGBRow
This is a pure streaming-SVE (SSVE) implementation based on an unrolled
version of the existing SVE2 code; we do not use the ZA tile.

Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-29 18:10:23 +00:00
George Steed
22c5c18778 [AArch64] Add SME implementation of I422ToARGBRow
Including addition of a new row_sme.cc file and associated
infrastructure.

The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation; we do not use
the ZA tile.

Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-29 05:49:28 +00:00
George Steed
775fd92e59 [AArch64] Optimize ScaleRowDown38_3_Box_NEON
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary
permutes.

Reduction in run times:

 Cortex-A55: -24.8%
Cortex-A510: -32.7%
Cortex-A520: -37.7%
 Cortex-A76: -51.8%
Cortex-A715: -58.9%
Cortex-A720: -58.9%
  Cortex-X1: -54.8%
  Cortex-X2: -50.3%
  Cortex-X3: -57.1%
  Cortex-X4: -49.8%
Cortex-X925: -52.0%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: Ie96bac30fffbe41f8d1501ee289795830ab127e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-28 17:04:22 +00:00
George Steed
0bce5120f6 [AArch64] Optimize ScaleRowDown38_2_Box_NEON
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary permutes.

Reduction in run times:

 Cortex-A55: -17.9%
Cortex-A510: -28.7%
Cortex-A520: -31.8%
 Cortex-A76: -40.8%
Cortex-A715: -46.1%
Cortex-A720: -46.1%
  Cortex-X1: -44.3%
  Cortex-X2: -40.1%
  Cortex-X3: -46.3%
  Cortex-X4: -40.2%
Cortex-X925: -42.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: I84e2cd04912fc11d59b4407a1836f047b74a4c92
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872802
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-28 17:03:54 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This uses the same approach as the Neon code to avoid redundant
narrowing and then widening shifts: the values are placed in the top
portion of the lanes and shifted down from there.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This uses the same approach as the Neon code to avoid redundant
narrowing and then widening shifts: the values are placed in the top
portion of the lanes and shifted down from there.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This uses the same approach as the Neon code to avoid redundant
narrowing and then widening shifts: the values are placed in the top
portion of the lanes and shifted down from there.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
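
The "top of the lane, then shift down" idea in f40042533c can be illustrated with a scalar model. This is only a sketch under assumptions: the helper names are hypothetical, the shift-right-insert step models one 16-bit lane, and the actual SVE2 kernel may order its instructions differently.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a "shift right and insert" on one 16-bit lane: keep the
// top `n` bits of `dst` and fill the remaining bits with `src >> n`.
static inline uint16_t ShiftRightInsert(uint16_t dst, uint16_t src, int n) {
  const uint16_t low_mask = static_cast<uint16_t>(0xFFFFu >> n);
  return static_cast<uint16_t>((dst & ~low_mask) | (src >> n));
}

// Hypothetical helper (not a libyuv function): each channel is held in the
// top 8 bits of a 16-bit lane, so no separate narrowing step is needed.
uint16_t PackRGB565TopAligned(uint8_t r, uint8_t g, uint8_t b) {
  const uint16_t r16 = static_cast<uint16_t>(r << 8);
  const uint16_t g16 = static_cast<uint16_t>(g << 8);
  const uint16_t b16 = static_cast<uint16_t>(b << 8);
  uint16_t out = r16;                    // R occupies bits 15..8
  out = ShiftRightInsert(out, g16, 5);   // keep R[7:3], insert G below it
  out = ShiftRightInsert(out, b16, 11);  // keep R5 and G6, insert B[7:3]
  return out;
}

// Straightforward reference: narrow each channel, then shift into place.
uint16_t PackRGB565Reference(uint8_t r, uint8_t g, uint8_t b) {
  return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Both functions produce the same 16-bit value for every input; the first simply never leaves the 16-bit lane width.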
George Steed
4621b0cc7f [AArch64] Rework data loading in ScaleFilterCols_NEON
Lane-indexed LD2 instructions are slow and introduce an unnecessary
dependency on the previous iteration of the loop. To avoid this
dependency use a scalar load for the first iteration and lane-indexed
LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd
elements.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  -6.7%
Cortex-A510: -13.2%
Cortex-A520: -13.1%
 Cortex-A76: -54.5%
Cortex-A715: -60.3%
Cortex-A720: -61.0%
  Cortex-X1: -69.1%
  Cortex-X2: -68.6%
  Cortex-X3: -73.9%
  Cortex-X4: -73.8%
Cortex-X925: -69.0%

Bug: b/42280945
Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:25:23 +00:00
George Steed
faade2f73f [AArch64] Avoid partial vector stores in ScaleRowDown38_NEON
The existing code performs a pair of stores since there is no AArch64
instruction in Neon to store exactly 12 bytes from a vector register.

It is guaranteed to be safe to write full vectors until the last
iteration of the loop, since the extra four bytes will be over-written
by subsequent iterations. This allows us to avoid duplicating the store
instruction and address arithmetic.

Reduction in runtime observed relative to the existing Neon
implementation:

 Cortex-A55:  +2.0%
Cortex-A510: -25.3%
Cortex-A520: -15.1%
 Cortex-A76: -32.2%
Cortex-A715: -19.7%
Cortex-A720: -19.6%
  Cortex-X1: -31.6%
  Cortex-X2: -27.1%
  Cortex-X3: -25.9%
  Cortex-X4: -24.7%
Cortex-X925: -35.8%

Bug: b/42280945
Change-Id: I222ed662f169d82f5f472bebb1bcfe6d428ccae2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872843
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 20:52:08 +00:00
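
The store strategy described in faade2f73f can be sketched in scalar C++. This is a model under assumptions, not the kernel itself: `ProduceVector` is a hypothetical stand-in for the real arithmetic, and `memcpy` stands in for the vector store.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kValid = 12;   // useful bytes produced per loop iteration
constexpr int kVector = 16;  // bytes in one 128-bit vector

// Hypothetical stand-in for the kernel's arithmetic: fills a 16-byte
// "vector" whose first 12 bytes are the real output of this iteration.
static void ProduceVector(int iteration, uint8_t out[kVector]) {
  for (int i = 0; i < kVector; ++i) {
    out[i] = i < kValid ? static_cast<uint8_t>(iteration * kValid + i) : 0xEE;
  }
}

// Writes exactly iterations * 12 bytes to dst. Every iteration except the
// last stores a full vector; the 4 surplus bytes are overwritten by the
// next iteration. Only the last iteration stores 12 bytes.
void StoreRow(uint8_t* dst, int iterations) {
  uint8_t vec[kVector];
  for (int it = 0; it < iterations; ++it) {
    ProduceVector(it, vec);
    const bool last = (it == iterations - 1);
    std::memcpy(dst + it * kValid, vec, last ? kValid : kVector);
  }
}
```

The bytes past the end of the row are never touched, which is the property that makes the full-width stores safe.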
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
Wan-Teh Chang
6ac7c8f251 Revert "Do not enable libyuv_use_sme for is_android"
This reverts commit 51e2e12b9b59452b1ad16c33a88bbcdd085b5450.

Reason for revert: The llvm bug fix
https://github.com/llvm/llvm-project/pull/102979 has been rolled into
Chrome in https://chromium-review.googlesource.com/5921462.

Original change's description:
> Do not enable libyuv_use_sme for is_android
>
> Revert the changes to libyuv.gni in commit dfa279f.
>
> The linker error "undefined symbol: __getauxval" referenced by
> sme-abi-init.c:26 on Android, previously reported in
> https://libyuv.g-issues.chromium.org/issues/359006069#comment2, has not
> been fixed yet. See
> https://chromium-review.googlesource.com/c/chromium/src/+/5918245?tab=checks.
>
> Change-Id: I94bd243e2863b9c316909f63f757fd95ec55dc18
> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917455
> Reviewed-by: Frank Barchard <fbarchard@chromium.org>

Bug: 359006069
Change-Id: Ic801c1bcb65894fdfe718ba6454669c8623a2e15
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5935026
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Reviewed-by: George Steed <george.steed@arm.com>
2024-10-15 18:20:36 +00:00
Wan-Teh Chang
a8e59d2074 Fix the test case
The test case should have the dst width and height, and the src width
and height should be specified by the --libyuv_width and --libyuv_height
options to libyuv_unittest.

Tested:
libyuv_unittest --gtest_filter=LibYUVScaleTest.I420ScaleTo264x216_Box \
  --libyuv_width=352 --libyuv_height=288

Bug: b/369963535, b/366045177
Change-Id: I8166a264c9c4840e0d16c0d3c1818c18aebc1b2e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5896466
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-09 08:26:10 +00:00
Wan-Teh Chang
51e2e12b9b Do not enable libyuv_use_sme for is_android
Revert the changes to libyuv.gni in commit dfa279f.

The linker error "undefined symbol: __getauxval" referenced by
sme-abi-init.c:26 on Android, previously reported in
https://libyuv.g-issues.chromium.org/issues/359006069#comment2, has not
been fixed yet. See
https://chromium-review.googlesource.com/c/chromium/src/+/5918245?tab=checks.

Change-Id: I94bd243e2863b9c316909f63f757fd95ec55dc18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917455
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-09 08:24:00 +00:00
Frank Barchard
7633328b5f Make functions that malloc check for ubsan math overflow
- add support for negative heights
- sanity check null pointers and invalid width/height

Bug: b/371615496
Change-Id: Icbefcb1ccc5cdf90e417c73440c6fad3b63ed7df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917072
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-10-08 21:08:34 +00:00
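
A minimal sketch of the kind of checks 7633328b5f describes, assuming a hypothetical helper `AllocPlane` (not a libyuv function) and assuming that a negative height only requests a vertical flip, so the buffer size depends on its magnitude:

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns a malloc'ed buffer of width * |height| bytes,
// or nullptr when the arguments are invalid or the size cannot be
// represented as a size_t.
uint8_t* AllocPlane(int width, int height) {
  if (width <= 0 || height == 0 || height == INT_MIN) {
    return nullptr;  // INT_MIN cannot be negated without overflow
  }
  if (height < 0) {
    height = -height;
  }
  // Multiply in 64 bits so the product of two 31-bit values cannot wrap.
  const uint64_t size =
      static_cast<uint64_t>(width) * static_cast<uint64_t>(height);
  if (size > SIZE_MAX) {
    return nullptr;  // only reachable where size_t is narrower than 64 bits
  }
  return static_cast<uint8_t*>(std::malloc(static_cast<size_t>(size)));
}
```

The caller still has to check the returned pointer, since `malloc` itself can fail.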
Wan-Teh Chang
364b7fa81b Remove redundant unsigned integer overflow tests
Bug: b/371615496
Change-Id: I28df888942085138a54e18c7e939300d959c68b0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5914872
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-08 01:14:35 +00:00
Frank Barchard
ffd791f749 Check malloc allocation sizes are less than SIZE_MAX
Bug: b/371615496
Change-Id: I75a94b08469d6d6b6fd55a8659031cbcb3d48eed
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5912039
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-10-07 21:34:15 +00:00
George Steed
dfa279fc65 Re-enable SME when building for AArch64 Android
Now that SME has been re-enabled for Linux for a while, also re-enable
it for Android when building with a sufficiently new version of LLVM.

Bug: b/359006069
Change-Id: Ibaa47e31826cf20136a11d551621fd62c1abab3c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5908389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-10-04 17:43:26 +00:00
Wan-Teh Chang
77f3acade4 ScalePlaneDown34: test dst_width%24 == 0 for armv7
In ScalePlaneDown34(), check if dst_width % 24 == 0 for armv7, and check
if dst_width % 48 == 0 for aarch64.

No-Try: True
Bug: b/369963535, b/366045177
Change-Id: I7dc1227517c83c97a1d1052ef2230d5cec41da10
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5896492
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2024-09-27 23:00:19 +00:00
Frank Barchard
61bf0b61f7 Fix for ARGB scaling down by 4x horizontally but not vertically
Add test ARGBScaleTo50x1_Box
libyuv_test '--gunit_filter=*ARGBScaleTo50x1*' --libyuv_width=200 --libyuv_height=50

Bug: chromium:361611480
Change-Id: Ic984951d74eb0c377c6746f61e91593a8a7d1a66
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5884656
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-09-24 18:00:47 +00:00
George Steed
02c6e8baca Change ARGBMultiplyRow_C to match Neon
The existing behaviour does not round correctly in all cases, so adjust
it to match the existing Neon implementation.

Update the tests to require bit-exactness and disable other
implementations that do not round correctly.

Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-23 21:48:33 +00:00
George Steed
a37e6bc81b [AArch64] Re-enable SME only for Linux and new versions of Clang
This was previously disabled in
679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for
Linux where SME is known to work correctly.

Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-23 09:29:53 +00:00
George Steed
8315fa1d3a Avoid duplication of CPU feature disable macros
The same conditions are repeated across all *_row.h headers which makes
it harder than necessary to guard enabling new architecture features
depending on compiler versions etc.

Avoid this duplication by merging the conditions into a new
cpu_support.h header.

Change-Id: Ibe7dfcef138edca6cc36870f1cfbb1bb108083e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802957
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-23 09:28:24 +00:00
Wan-Teh Chang
85e55115f0 Untangle arm and aarch64 #ifdefs in GetCpuFlags()
Change-Id: I5df39c20a700aee38954bc9288fdee116138645d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5879350
Reviewed-by: George Steed <george.steed@arm.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 23:40:19 +00:00
Alex Richardson
f1b28b3510 Avoid reading /proc/cpuinfo for non-Linux Arm platforms
While we will return kCpuHasNEON if the file fails to open, reading it
introduces filesystem operations that are not needed, e.g. on embedded
non-Linux platforms. When not building for Linux, we can simply rely on
the compiler flags to determine whether NEON support is present for
Arm32.

Change-Id: Ifb0eab2a46969fca5f733ce624abdf54da9b32a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778479
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: George Steed <george.steed@arm.com>
2024-09-20 22:22:03 +00:00
George Steed
0d5a31eccb Update README.md and environment_variables.md for Arm
Now that there are newer architecture extensions used, update the
documentation to reflect this.

Also add missing empty lines after headers in environment_variables.md
to ensure the file is valid markdown.

Change-Id: I61d5616e1f815f80186440f27dd68ac5460c38b1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5868021
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 00:29:33 +00:00
George Steed
7eb552c891 [AArch64] Avoid unnecessary MOVs in ScaleARGBRowDownEvenBox_NEON
The existing code uses three MOV instructions through a temporary
register to swap the low and high halves of a vector register, however
this can be done with a pair of ZIP instructions instead.

Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in
parallel on little cores.

Reduction in runtime observed compared to the existing Neon
implementation:

 Cortex-A55:  -8.3%
Cortex-A510: -20.6%
Cortex-A520: -16.6%
 Cortex-A76:  -6.8%
Cortex-A715:  -6.2%
Cortex-A720:  -6.2%
  Cortex-X1: -22.0%
  Cortex-X2: -18.7%
  Cortex-X3: -21.1%
  Cortex-X4: -25.8%
Cortex-X925: -21.9%

Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 00:28:12 +00:00
George Steed
23a6a412e5 [AArch64] Unroll and use TBL in ScaleRowDown34_NEON
ST3 is known to be slow on a number of modern micro-architectures. By
unrolling the code we are able to use TBL to shuffle elements into the
correct indices without needing to use LD4 and ST3, giving a good
improvement in performance across the board.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -14.4%
Cortex-A510: -66.0%
Cortex-A520: -50.8%
 Cortex-A76: -60.5%
Cortex-A715: -63.9%
Cortex-A720: -64.2%
  Cortex-X1: -74.3%
  Cortex-X2: -75.4%
  Cortex-X3: -75.5%
  Cortex-X4: -48.1%

Bug: b/42280945
Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 15:37:27 +00:00
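
The role of TBL in 23a6a412e5 can be modelled in scalar code. The sketch below makes several assumptions: the function names are hypothetical, the 3/4 point-sampling pattern is taken to keep bytes 0, 1 and 3 of each group of four, and it handles 16 source bytes per step, whereas the real kernel is unrolled further.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of a byte table lookup: out[i] = table[idx[i]], with
// out-of-range indices producing zero.
void TableLookup(const uint8_t* table, int table_len, const uint8_t* idx,
                 uint8_t* out, int n) {
  for (int i = 0; i < n; ++i) {
    out[i] = idx[i] < table_len ? table[idx[i]] : 0;
  }
}

// Index pattern that keeps bytes 0, 1 and 3 of each group of four
// (assumed pattern, for illustration only).
void BuildDown34Indices(uint8_t* idx, int n_out) {
  static const uint8_t kKeep[3] = {0, 1, 3};
  for (int i = 0; i < n_out; ++i) {
    idx[i] = static_cast<uint8_t>((i / 3) * 4 + kKeep[i % 3]);
  }
}

// Hypothetical model: dst_width must be a multiple of 12. Plain contiguous
// loads and stores plus one lookup replace the structured load/store pair.
void ScaleRowDown34Model(const uint8_t* src, uint8_t* dst, int dst_width) {
  uint8_t idx[12];
  BuildDown34Indices(idx, 12);
  for (int x = 0; x < dst_width; x += 12) {
    TableLookup(src, 16, idx, dst, 12);
    src += 16;
    dst += 12;
  }
}
```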
George Steed
d5303f4f77 [AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.

ST4 .16b with a post-increment is slow on the early Cortex-X cores, so
split out the pointer increment to a separate instruction to avoid this
bottleneck.

Reductions in runtime observed for ARGB1555ToARGBRow_NEON:

 Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
 Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
  Cortex-X1:  -0.9%
  Cortex-X2:  -4.6%
  Cortex-X3:  -3.6%
  Cortex-X4: -20.8%

Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:36:43 +00:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as often as needed, so refactor the code
to call it only once, operating on full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
George Steed
2dfb84b311 [AArch64] Unroll to use full vectors in ARGBToARGB1555Row_NEON
By loading packed 16-bit AR/GB data and operating on that directly we
avoid the need to perform a separate widening step before the
conversion.

Reduction in runtime observed compared to the existing Neon code:

 Cortex-A55: -13.2%
Cortex-A510:  -5.4%
 Cortex-A76: -21.5%
Cortex-A720: -25.2%
  Cortex-X1: -50.6%
  Cortex-X2: -36.8%

Bug: b/42280945
Change-Id: I780c71fdff1d017464c6e4e38f86979dda0e43ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790973
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:33:22 +00:00
George Steed
432d186116 [AArch64] Add Neon dot-product implementation for ARGBSepiaRow
We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.

ST4 is also known to be slow on more modern micro-architectures, however
avoiding this is left for a future SVE implementation where we can make
use of interleaving-narrowing instructions.

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55:  -5.8%
Cortex-A510: -18.9%
 Cortex-A76: -21.8%
Cortex-A720: -30.2%
  Cortex-X1: -28.6%
  Cortex-X2: -23.4%

Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:35 +00:00
George Steed
1c31461771 [AArch64] Add Neon dot-product implementation for ARGBGrayRow
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55: -12.6%
Cortex-A510: -48.6%
 Cortex-A76: -39.7%
Cortex-A720: -52.3%
  Cortex-X1: -63.5%
  Cortex-X2: -67.0%

Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:11 +00:00
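
A scalar model of the dot-product approach in 1c31461771, as a sketch only. The coefficient values, the rounding step and the function names are assumptions made for illustration; the point is that one dot product over the four interleaved bytes of a pixel yields the gray value without de-interleaving the channels first.

```cpp
#include <cassert>
#include <cstdint>

// Assumed luma weights in B, G, R, A byte order (they sum to 256, and the
// alpha weight is zero). Illustrative values, not read from the commit.
constexpr uint8_t kGrayCoeffs[4] = {29, 150, 77, 0};

// Scalar model of one 32-bit lane of an unsigned dot product: four byte
// products accumulated into a single sum.
static inline uint32_t Dot4(const uint8_t* px, const uint8_t* coeffs) {
  uint32_t sum = 0;
  for (int i = 0; i < 4; ++i) {
    sum += static_cast<uint32_t>(px[i]) * coeffs[i];
  }
  return sum;
}

// Hypothetical model of the row function: the gray value is replicated
// into B, G and R, and the original alpha byte is carried through.
void GrayRowModel(const uint8_t* src_argb, uint8_t* dst_argb, int width) {
  for (int x = 0; x < width; ++x) {
    const uint8_t* px = src_argb + 4 * x;
    uint8_t* out = dst_argb + 4 * x;
    const uint8_t y =
        static_cast<uint8_t>((Dot4(px, kGrayCoeffs) + 128) >> 8);
    const uint8_t alpha = px[3];
    out[0] = y;
    out[1] = y;
    out[2] = y;
    out[3] = alpha;
  }
}
```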
George Steed
2d62d8d22a [AArch64] Unroll ScaleRowDown4_NEON
We can use wider load/store instructions here which is mostly an
improvement across the board.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  +4.9% (!)
Cortex-A510: -46.3%
Cortex-A520: -49.0%
 Cortex-A76: -12.2%
Cortex-A715: -15.5%
Cortex-A720: -15.0%
  Cortex-X1: -12.4%
  Cortex-X2: -12.5%
  Cortex-X3: -12.3%
  Cortex-X4:  +0.3%

Bug: b/42280945
Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:30:04 +00:00
George Steed
e6297afd14 [AArch64] Optimize ScaleARGBRowDown2Linear_NEON
Replace LD4 with a pair of LD2 instructions to avoid needing an ST2
instruction for storing the result, since ST2 instructions are known to
be slow on some micro-architectures.

Observed reduction in runtimes compared to the existing Neon code:

 Cortex-A55: -23.3%
Cortex-A510: -49.6%
Cortex-A520: -31.1%
 Cortex-A76: -44.5%
Cortex-A715: -45.8%
Cortex-A720: -46.0%
  Cortex-X1: -74.5%
  Cortex-X2: -72.4%
  Cortex-X3: -76.8%
  Cortex-X4: -39.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:28:25 +00:00
George Steed
00886670bb [AArch64] Avoid LD4/ST2 in ScaleARGBRowDown2_NEON
Use separate permute instructions to avoid using LD4/ST2 as these
instructions are known to be slow on some micro-architectures.

Observed reduction in runtimes compared to the existing Neon code:

 Cortex-A55: -12.4%
Cortex-A510: -44.8%
Cortex-A520: -31.1%
 Cortex-A76: -55.3%
Cortex-A715: -63.7%
Cortex-A720: -62.3%
  Cortex-X1: -79.0%
  Cortex-X2: -78.9%
  Cortex-X3: -79.6%
  Cortex-X4: -59.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:27:39 +00:00
Frank Barchard
4620f17058 ScalePlane crash fix for 3/4 scaling
- Scaling 48 pixels at a time, but calling code checked for 24 pixels
- Added test for scaling to 1080x1920
  libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560

Was
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):

Now
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 -     6741 us C -     3566 us OPT
[       OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)

Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-09-13 01:20:39 +00:00
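
The mismatch behind 4620f17058 comes down to simple arithmetic: the destination width in the new test, 1080, is a multiple of 24 but not of 48. A sketch with a hypothetical predicate (the name is not from libyuv):

```cpp
#include <cassert>

// Hypothetical predicate: true when a row of dst_width pixels is covered
// by a whole number of loop iterations of the given size.
constexpr bool FitsWholeIterations(int dst_width, int pixels_per_iteration) {
  return dst_width % pixels_per_iteration == 0;
}
```

A caller that tests `FitsWholeIterations(1080, 24)` gets `true`, while the kernel needs `FitsWholeIterations(1080, 48)`, which is `false` (1080 = 22 * 48 + 24). The check and the kernel therefore have to agree on the step size.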
Wan-Teh Chang
41d0cd3360 Install yuvconvert with install(TARGETS)
The original code
  INSTALL ( PROGRAMS ${CMAKE_BINARY_DIR}/yuvconvert ...)
fails on Windows because it is missing the .exe file extension. Change
it to install( TARGETS yuvconvert ...) based on CMake documentation:
  [The PROGRAMS form] is intended to install programs that are not
  targets, such as shell scripts. Use the TARGETS form to install
  targets built within the project.

Note that this change was first made in the cmake-mingw.patch file in
the mingw-w64-libyuv package:
https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-libyuv

Change-Id: Ia571aa61e136cef477f05e051fef2cfb1db4b77d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5840469
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-05 20:57:59 +00:00
Wan-Teh Chang
552d775b43 Also install the DLL import library
For shared library targets on Windows, the target artifact of the
ARCHIVE kind is the DLL import library. See
https://cmake.org/cmake/help/latest/command/install.html#targets

Change-Id: Id22e362b39648d5b155c6f2359876ee9d90786a3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5837740
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-04 19:22:07 +00:00
Chunbo Hua
874f391dbf Validate memory right after malloc
When malloc fails it returns a NULL pointer. If an offset is then
applied to that pointer (e.g. through a reinterpret_cast), the result is
non-NULL and looks valid, even though accessing it causes an access
violation.

Bug: 359949838
Change-Id: Ie73bca426671ee85315b96f187a6de8c955cada6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5789885
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-21 00:07:37 +00:00
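
The hazard described in 874f391dbf can be sketched with an aligned allocation. The names below (`AlignedBuffer`, `AllocAligned64`, `FailingAlloc`) are hypothetical and the 64-byte alignment is an assumption; the point is that the raw pointer is checked before any arithmetic is applied to it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical pair of pointers: `raw` is what gets freed, `aligned` is
// what gets used.
struct AlignedBuffer {
  void* raw;
  uint8_t* aligned;
};

// Stand-in allocator that always fails, to exercise the failure path.
static void* FailingAlloc(size_t) { return nullptr; }

// Validates the result of the allocator immediately. Without the early
// return, rounding a NULL pointer up to the next 64-byte boundary could
// produce a non-NULL value that passes later "!= nullptr" checks.
AlignedBuffer AllocAligned64(size_t size,
                             void* (*alloc)(size_t) = std::malloc) {
  AlignedBuffer buf = {nullptr, nullptr};
  buf.raw = alloc(size + 63);
  if (buf.raw == nullptr) {
    return buf;
  }
  const uintptr_t address = reinterpret_cast<uintptr_t>(buf.raw);
  buf.aligned = reinterpret_cast<uint8_t*>(
      (address + 63) & ~static_cast<uintptr_t>(63));
  return buf;
}
```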
Chunbo Hua
e434b8c5ae Fix build script instructions for Windows
Bug: 359296629
Change-Id: I118650a89d3e3c9500b592f3f62f058e378f514e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785268
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-21 00:05:21 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- the AVX2 pack/perm method mutates order
- the cvt method maintains channel order on AVX512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Wan-Teh Chang
c21dda06dd Spell CMake commands in lowercase
CMake commands are spelled in lowercase in the CMake Reference
Documentation at https://cmake.org/cmake/help/latest/index.html.

Change-Id: I5fc39c8fa65e83785f9c776cb5bef94a498c15f0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787519
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-15 01:49:29 +00:00
Wan-Teh Chang
0c2cf03c5c Fix a -Wundef warning on macOS with Apple silicon
Change-Id: Ia78dcc913e06dd8876119a96bd7760c1d2af4341
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5788821
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 22:10:43 +00:00
Wan-Teh Chang
6157cc4583 Remove the ' separators in hex integer constants
They are a C++14 feature, not supported in C++11 mode (-std=c++11).

Change-Id: I618020342d4964b994aefa06af83b2e8d553a032
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786607
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 20:50:28 +00:00
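
For reference, the language feature that 6157cc4583 removes looks like this. The constant names and values are illustrative only, not taken from the libyuv sources:

```cpp
#include <cassert>
#include <cstdint>

#if __cplusplus >= 201402L
// Digit separators are accepted from C++14 onwards.
constexpr uint64_t kWithSeparators = 0x00FF'00FF'00FF'00FFull;
#endif

// The same value written without separators also compiles with -std=c++11.
constexpr uint64_t kPortable = 0x00FF00FF00FF00FFull;

#if __cplusplus >= 201402L
static_assert(kWithSeparators == kPortable, "separators do not change value");
#endif
```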
Wan-Teh Chang
2707098fb1 Fix -Wunused-parameter warnings in release builds
Add (void) casts to the 'src_width' parameters that are only used in
assertions.

Change-Id: I72d1b55f50a9b02b07b206e40e5583005b27928b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 20:49:37 +00:00
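
A small sketch of the pattern used in 2707098fb1, with a hypothetical function (`CopyEvenPixels` is not a libyuv name). When `NDEBUG` is defined, `assert` expands to nothing, so a parameter referenced only inside it would otherwise be reported as unused:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical row function: src_width is needed only for the sanity check.
void CopyEvenPixels(int src_width, int dst_width, const uint8_t* src,
                    uint8_t* dst) {
  assert(dst_width * 2 <= src_width);
  (void)src_width;  // silences -Wunused-parameter when NDEBUG is defined
  for (int x = 0; x < dst_width; ++x) {
    dst[x] = src[2 * x];
  }
}
```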