2894 Commits

Author SHA1 Message Date
George Steed
7eb552c891 [AArch64] Avoid unnecessary MOVs in ScaleARGBRowDownEvenBox_NEON
The existing code uses three MOV instructions through a temporary
register to swap the low and high halves of a vector register. This
can be done with a pair of ZIP instructions instead.

Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in
parallel on little cores.

Reduction in runtime observed compared to the existing Neon
implementation:

 Cortex-A55:  -8.3%
Cortex-A510: -20.6%
Cortex-A520: -16.6%
 Cortex-A76:  -6.8%
Cortex-A715:  -6.2%
Cortex-A720:  -6.2%
  Cortex-X1: -22.0%
  Cortex-X2: -18.7%
  Cortex-X3: -21.1%
  Cortex-X4: -25.8%
Cortex-X925: -21.9%

Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-20 00:28:12 +00:00
George Steed
23a6a412e5 [AArch64] Unroll and use TBL in ScaleRowDown34_NEON
ST3 is known to be slow on a number of modern micro-architectures. By
unrolling the code we are able to use TBL to shuffle elements into the
correct indices without needing to use LD4 and ST3, giving a good
improvement in performance across the board.
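
As a rough illustration of the table-lookup idea, the sketch below
models a byte TBL in scalar C and uses it to point-sample a row to 3/4
width. It assumes the kernel keeps elements 0, 1 and 3 of each group of
four; the index table and function names are hypothetical and are not
taken from the Neon code.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a byte table lookup: out[i] = table[idx[i]].
 * Out-of-range indices produce 0, which is how TBL behaves. */
static void tbl_model(const uint8_t* table, size_t table_len,
                      const uint8_t* idx, size_t n, uint8_t* out) {
  for (size_t i = 0; i < n; ++i) {
    out[i] = idx[i] < table_len ? table[idx[i]] : 0;
  }
}

/* Hypothetical index table: keep bytes 0, 1 and 3 of every group of 4. */
static const uint8_t kKeep3Of4[12] = {0, 1, 3, 4, 5, 7, 8, 9, 11, 12, 13, 15};

/* Point-sample a row to 3/4 width, 16 source bytes -> 12 output bytes.
 * dst_width is assumed to be a multiple of 12. */
static void scale_row_down34_model(const uint8_t* src, uint8_t* dst,
                                   int dst_width) {
  for (int x = 0; x < dst_width; x += 12) {
    tbl_model(src, 16, kKeep3Of4, 12, dst);
    src += 16;
    dst += 12;
  }
}
```

Because the shuffle works on plain contiguous loads and stores, no
de-interleaving load or interleaving store is needed in this model.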

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -14.4%
Cortex-A510: -66.0%
Cortex-A520: -50.8%
 Cortex-A76: -60.5%
Cortex-A715: -63.9%
Cortex-A720: -64.2%
  Cortex-X1: -74.3%
  Cortex-X2: -75.4%
  Cortex-X3: -75.5%
  Cortex-X4: -48.1%

Bug: b/42280945
Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 15:37:27 +00:00
George Steed
d5303f4f77 [AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.
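
The equivalence between UZP1 and an XTN + XTN2 pair can be checked with
a scalar model. This is only a sketch: the function names are
hypothetical, and the byte view assumes the little-endian lane layout
used on AArch64.

```c
#include <stdint.h>

/* Scalar model of UZP1 on bytes: concatenate a and b (16 bytes each)
 * and keep the even-numbered bytes. */
static void uzp1_bytes_model(const uint8_t a[16], const uint8_t b[16],
                             uint8_t out[16]) {
  for (int i = 0; i < 8; ++i) {
    out[i] = a[2 * i];
    out[8 + i] = b[2 * i];
  }
}

/* Scalar model of XTN followed by XTN2: truncate the eight 16-bit lanes
 * of lo into the low half, then the lanes of hi into the high half. */
static void xtn_pair_model(const uint16_t lo[8], const uint16_t hi[8],
                           uint8_t out[16]) {
  for (int i = 0; i < 8; ++i) {
    out[i] = (uint8_t)lo[i];
    out[8 + i] = (uint8_t)hi[i];
  }
}
```

When a and b are the byte views of lo and hi, both functions produce the
same 16 bytes, since the even bytes are the low bytes of each 16-bit
lane.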

The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.

Reductions in runtime observed for ARGB1555ToARGBRow_NEON:

 Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
 Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
  Cortex-X1:  -0.9%
  Cortex-X2:  -4.6%
  Cortex-X3:  -3.6%
  Cortex-X4: -20.8%

Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:36:43 +00:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as often as needed, so refactor the code
to only call it once but operating with full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
George Steed
2dfb84b311 [AArch64] Unroll to use full vectors in ARGBToARGB1555Row_NEON
By loading packed 16-bit AR/GB data and operating on that directly we
avoid the need to perform a separate widening step before the
conversion.
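
For reference, the bit layout such a kernel has to produce can be
written in scalar C. This is a sketch of the format only, assuming
B, G, R, A byte order in memory for ARGB and alpha in the top bit of the
16-bit result; the function names are hypothetical and it does not
reflect how the Neon code arranges its loads.

```c
#include <stdint.h>

/* Pack one pixel: 5 bits each of B, G, R plus the top bit of alpha. */
static uint16_t pack_argb1555(uint8_t b, uint8_t g, uint8_t r, uint8_t a) {
  return (uint16_t)((b >> 3) | ((g >> 3) << 5) | ((r >> 3) << 10) |
                    ((a >> 7) << 15));
}

/* Convert a row of ARGB (4 bytes per pixel) to ARGB1555. */
static void argb_to_argb1555_row_model(const uint8_t* src_argb,
                                       uint16_t* dst, int width) {
  for (int x = 0; x < width; ++x) {
    dst[x] = pack_argb1555(src_argb[0], src_argb[1], src_argb[2],
                           src_argb[3]);
    src_argb += 4;
  }
}
```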

Reduction in runtime observed compared to the existing Neon code:

 Cortex-A55: -13.2%
Cortex-A510:  -5.4%
 Cortex-A76: -21.5%
Cortex-A720: -25.2%
  Cortex-X1: -50.6%
  Cortex-X2: -36.8%

Bug: b/42280945
Change-Id: I780c71fdff1d017464c6e4e38f86979dda0e43ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790973
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:33:22 +00:00
George Steed
432d186116 [AArch64] Add Neon dot-product implementation for ARGBSepiaRow
We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.

ST4 is also known to be slow on more modern micro-architectures, however
avoiding this is left for a future SVE implementation where we can make
use of interleaving-narrowing instructions.

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55:  -5.8%
Cortex-A510: -18.9%
 Cortex-A76: -21.8%
Cortex-A720: -30.2%
  Cortex-X1: -28.6%
  Cortex-X2: -23.4%

Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:35 +00:00
George Steed
1c31461771 [AArch64] Add Neon dot-product implementation for ARGBGrayRow
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.
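
A scalar model of the approach, as a sketch: each 32-bit lane of an
unsigned dot product sums four byte products, so one ARGB pixel times
one coefficient vector yields the gray value directly. The coefficient
values and the rounding below are assumptions (common full-range luma
weights), and the function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of an unsigned dot product:
 * acc += sum of four byte-by-byte products. */
static uint32_t udot_lane_model(uint32_t acc, const uint8_t px[4],
                                const uint8_t coeff[4]) {
  for (int i = 0; i < 4; ++i) {
    acc += (uint32_t)px[i] * coeff[i];
  }
  return acc;
}

/* Assumed weights for B, G, R (sum is 256); alpha gets weight 0. */
static const uint8_t kGrayCoeffs[4] = {29, 150, 77, 0};

/* Replace B, G and R with the gray value and keep the original alpha. */
static void argb_gray_row_model(const uint8_t* src, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) {
    /* Start the accumulator at 128 so the shift by 8 rounds. */
    uint8_t y = (uint8_t)(udot_lane_model(128, src, kGrayCoeffs) >> 8);
    uint8_t a = src[3];
    dst[0] = y;
    dst[1] = y;
    dst[2] = y;
    dst[3] = a; /* This is the step the commit does with TBL. */
    src += 4;
    dst += 4;
  }
}
```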

Reduction in cycle counts observed compared to existing Neon code:

 Cortex-A55: -12.6%
Cortex-A510: -48.6%
 Cortex-A76: -39.7%
Cortex-A720: -52.3%
  Cortex-X1: -63.5%
  Cortex-X2: -67.0%

Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:31:11 +00:00
George Steed
2d62d8d22a [AArch64] Unroll ScaleRowDown4_NEON
We can use wider load/store instructions here which is mostly an
improvement across the board.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  +4.9% (!)
Cortex-A510: -46.3%
Cortex-A520: -49.0%
 Cortex-A76: -12.2%
Cortex-A715: -15.5%
Cortex-A720: -15.0%
  Cortex-X1: -12.4%
  Cortex-X2: -12.5%
  Cortex-X3: -12.3%
  Cortex-X4:  +0.3%

Bug: b/42280945
Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:30:04 +00:00
George Steed
e6297afd14 [AArch64] Optimize ScaleARGBRowDown2Linear_NEON
Replace LD4 with a pair of LD2 instructions to avoid needing an ST2
instruction for storing the result, since ST2 instructions are known to
be slow on some micro-architectures.
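
What the kernel computes can be sketched in scalar C, assuming the
linear filter is a rounded average of each horizontal pair of ARGB
pixels. The function name is hypothetical.

```c
#include <stdint.h>

/* Halve the width of an ARGB row by averaging adjacent pixel pairs,
 * channel by channel, with rounding. */
static void scale_argb_row_down2_linear_model(const uint8_t* src_argb,
                                              uint8_t* dst_argb,
                                              int dst_width) {
  for (int x = 0; x < dst_width; ++x) {
    for (int c = 0; c < 4; ++c) {
      dst_argb[c] = (uint8_t)((src_argb[c] + src_argb[4 + c] + 1) >> 1);
    }
    src_argb += 8; /* Two source pixels consumed. */
    dst_argb += 4; /* One output pixel produced. */
  }
}
```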

Observed reduction in runtimes compared to the existing Neon code:

 Cortex-A55: -23.3%
Cortex-A510: -49.6%
Cortex-A520: -31.1%
 Cortex-A76: -44.5%
Cortex-A715: -45.8%
Cortex-A720: -46.0%
  Cortex-X1: -74.5%
  Cortex-X2: -72.4%
  Cortex-X3: -76.8%
  Cortex-X4: -39.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:28:25 +00:00
George Steed
00886670bb [AArch64] Avoid LD4/ST2 in ScaleARGBRowDown2_NEON
Use separate permute instructions to avoid using LD4/ST2 as these
instructions are known to be slow on some micro-architectures.

Observed reduction in runtimes compared to the existing Neon code:

 Cortex-A55: -12.4%
Cortex-A510: -44.8%
Cortex-A520: -31.1%
 Cortex-A76: -55.3%
Cortex-A715: -63.7%
Cortex-A720: -62.3%
  Cortex-X1: -79.0%
  Cortex-X2: -78.9%
  Cortex-X3: -79.6%
  Cortex-X4: -59.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:27:39 +00:00
Frank Barchard
4620f17058 ScalePlane crash fix for 3/4 scaling
- Scaling 48 pixels at a time, but calling code checked for 24 pixels
- Added test for scaling to 1080x1920
  libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560

Was
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):

Now
[ RUN      ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 -     6741 us C -     3566 us OPT
[       OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)

Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-09-13 01:20:39 +00:00
Wan-Teh Chang
41d0cd3360 Install yuvconvert with install(TARGETS)
The original code
  INSTALL ( PROGRAMS ${CMAKE_BINARY_DIR}/yuvconvert ...)
fails on Windows because it is missing the .exe file extension. Change
it to install( TARGETS yuvconvert ...) based on CMake documentation:
  [The PROGRAMS form] is intended to install programs that are not
  targets, such as shell scripts. Use the TARGETS form to install
  targets built within the project.

Note that this change was first made in the cmake-mingw.patch file in
the mingw-w64-libyuv package:
https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-libyuv

Change-Id: Ia571aa61e136cef477f05e051fef2cfb1db4b77d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5840469
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-05 20:57:59 +00:00
Wan-Teh Chang
552d775b43 Also install the DLL import library
The target artifact of the ARCHIVE kind means DLL import libraries for
shared library targets on Windows. See
https://cmake.org/cmake/help/latest/command/install.html#targets

Change-Id: Id22e362b39648d5b155c6f2359876ee9d90786a3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5837740
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-04 19:22:07 +00:00
Chunbo Hua
874f391dbf Validate memory right after malloc
If malloc fails it returns a NULL pointer. If the code then applies an
offset to that pointer (for example through a reinterpret_cast), the
result is a non-NULL pointer that looks valid but points into memory
that causes an access violation when it is dereferenced.
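
A minimal sketch of the pattern, assuming a 64-byte aligned allocation
built on plain malloc. The helper name and its signature are
hypothetical and are not the library's own allocation macro.

```c
#include <stdint.h>
#include <stdlib.h>

/* Return a 64-byte aligned pointer inside a malloc'd block, or NULL on
 * failure. *raw_out receives the pointer that must be passed to free(). */
static uint8_t* alloc_aligned64(size_t size, void** raw_out) {
  *raw_out = NULL;
  if (size > SIZE_MAX - 63) {
    return NULL; /* size + 63 would overflow. */
  }
  void* raw = malloc(size + 63);
  if (raw == NULL) {
    return NULL; /* Check before doing any arithmetic on the pointer. */
  }
  *raw_out = raw;
  return (uint8_t*)(((uintptr_t)raw + 63) & ~(uintptr_t)63);
}
```

The caller then only needs to test the returned pointer against NULL,
since a failed allocation can no longer be turned into a non-NULL value
by the alignment step.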

Bug: 359949838
Change-Id: Ie73bca426671ee85315b96f187a6de8c955cada6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5789885
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-21 00:07:37 +00:00
Chunbo Hua
e434b8c5ae Fix build script instructions for Windows
Bug: 359296629
Change-Id: I118650a89d3e3c9500b592f3f62f058e378f514e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785268
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-21 00:05:21 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- the AVX2 pack/perm sequence mutates the channel order
- the cvt method maintains channel order on AVX512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Wan-Teh Chang
c21dda06dd Spell CMake commands in lowercase
CMake commands are spelled in lowercase in the CMake Reference
Documentation at https://cmake.org/cmake/help/latest/index.html.

Change-Id: I5fc39c8fa65e83785f9c776cb5bef94a498c15f0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787519
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-15 01:49:29 +00:00
Wan-Teh Chang
0c2cf03c5c Fix a -Wundef warning on macOS with Apple silicon
Change-Id: Ia78dcc913e06dd8876119a96bd7760c1d2af4341
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5788821
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 22:10:43 +00:00
Wan-Teh Chang
6157cc4583 Remove the ' separators in hex integer constants
They are a C++14 feature, not supported in C++11 mode (-std=c++11).

Change-Id: I618020342d4964b994aefa06af83b2e8d553a032
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786607
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 20:50:28 +00:00
Wan-Teh Chang
2707098fb1 Fix -Wunused-parameter warnings in release builds
Add (void) casts to the 'src_width' parameters that are only used in
assertions.
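
The pattern looks roughly like this. The function is a hypothetical
example and not one of the functions touched by the change.

```c
#include <assert.h>
#include <stdint.h>

/* Copy every second byte of a row. src_width is only needed to validate
 * the arguments in debug builds. */
static void copy_even_bytes_model(const uint8_t* src, int src_width,
                                  uint8_t* dst, int dst_width) {
  (void)src_width; /* Only read by assert(), which vanishes under NDEBUG. */
  assert(dst_width * 2 <= src_width);
  for (int x = 0; x < dst_width; ++x) {
    dst[x] = src[2 * x];
  }
}
```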

Change-Id: I72d1b55f50a9b02b07b206e40e5583005b27928b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-14 20:49:37 +00:00
Frank Barchard
336e6fd25b I010ToNV12 conversion using 2 step row function for UV
- convert full Y plane with row coalescing if possible
- convert rows of UV from 10 bit to 8 bit then call MergeUV
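
The row-based UV step can be sketched as follows, assuming I010 stores
10-bit values in the low bits of each 16-bit element. The function names
and the temporary row buffers are hypothetical.

```c
#include <stdint.h>

/* Narrow one row of 10-bit samples to 8 bits by dropping the low 2 bits. */
static void convert10to8_row_model(const uint16_t* src, uint8_t* dst,
                                   int width) {
  for (int x = 0; x < width; ++x) {
    unsigned v = (unsigned)src[x] >> 2;
    dst[x] = (uint8_t)(v > 255 ? 255 : v); /* Clamp out-of-range input. */
  }
}

/* Interleave one U row and one V row into a UV row, U first. */
static void merge_uv_row_model(const uint8_t* src_u, const uint8_t* src_v,
                               uint8_t* dst_uv, int width) {
  for (int x = 0; x < width; ++x) {
    dst_uv[2 * x] = src_u[x];
    dst_uv[2 * x + 1] = src_v[x];
  }
}

/* Two steps per row: narrow U and V into small temporaries, then merge.
 * tmp_u and tmp_v must each hold half_width bytes. */
static void i010_uv_row_to_nv12_model(const uint16_t* src_u,
                                      const uint16_t* src_v, uint8_t* dst_uv,
                                      uint8_t* tmp_u, uint8_t* tmp_v,
                                      int half_width) {
  convert10to8_row_model(src_u, tmp_u, half_width);
  convert10to8_row_model(src_v, tmp_v, half_width);
  merge_uv_row_model(tmp_u, tmp_v, dst_uv, half_width);
}
```

Working a row at a time keeps the temporaries small, which is one
plausible reason for the speedup over a two-pass conversion of whole
planes.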

libyuv_test '--gunit_filter=*010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *010ToNV12_Opt

Skylake Xeon Was 2 pass planes
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms)
Now 2 pass rows
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms)

On Samsung S23
libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000

Was
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms)

Now
[       OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms)
[       OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms)
[       OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms)
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms) <- now
[       OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms)
[       OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms)
[       OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms)
[       OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms)
[       OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms)
[       OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms)


Bug: b/357439226, b/357721018
Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-13 19:30:16 +00:00
Wan-Teh Chang
5dfa75670d scale_neon.cc: Fix -Wmissing-prototypes warnings
Somehow I missed this file in
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601.

Change-Id: Ibd8ed7102d1af12fb929e2ec9bcc87da7d8be306
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785253
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-13 03:50:51 +00:00
Wan-Teh Chang
02e2ff4745 Note stride params of HalfFloatPlane are in bytes
The HalfFloatPlane() function does not follow libyuv's convention of buffer
stride in units of the corresponding buffer pointer. Document that.

Change-Id: Id8d466ccc2df263a49ad788ab349bc3993a48259
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5770639
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-12 20:17:23 +00:00
Wan-Teh Chang
3cf54e90d3 Fix -Wmissing-prototypes warnings
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.

Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-12 19:08:24 +00:00
Frank Barchard
a97746349b Add test for I010ToNV12
- Add support for negative height to invert
- Fix off by 1 on odd width and height
- Bump version to 1895

Initial I010 is 2 step planar conversion

libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms)
Pixel 7
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms)

Bug: b/357721018, b/357439226
Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-12 18:57:56 +00:00
Wan-Teh Chang
5045476744 Restrict libyuv_use_sme to is_linux
When libyuv is used in Chrome, there are linker errors or other build
errors on the following platforms:
- Android: undefined symbol: __getauxval
- Fuchsia: undefined symbol: __aarch64_sme_accessible
- macOS: undefined symbol: __arm_tpidr2_save
- Windows: Incorrect size for TransposeWxH_SME prologue: 52 bytes of
  instructions in range, but .seh directives corresponding to 40 bytes

Restrict libyuv_use_sme to is_linux (which excludes Android and
ChromeOS) to work around these errors.

Bug: libyuv:359006069
Change-Id: Ia3ffd6e4ce4859ae7f811836cb1d8d61f6943b6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5779858
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-11 20:37:04 +00:00
Wan-Teh Chang
1fad3ab1fa Run "gn format" on BUILD.gn and libyuv.gni
Change-Id: I2f201383f0a8b91d5a97c9ec4556de4288aa6696
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5779859
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-11 20:32:05 +00:00
Wan-Teh Chang
0ffb0cb220 Link libyuv with libyuv_sme in BUILD.gn
This was missed in
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5588664.

Change-Id: I217e8ce3847de8d455973fd7cdf7daf53f2b3e83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778972
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-10 17:03:30 +00:00
Chunbo Hua
e23bc72e8e Bump version number in order to expose new API
Bug: 357721018
Change-Id: I2c6e115cd049db2038631195305c5907764d5c7b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5768078
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-07 22:10:05 +00:00
Chunbo Hua
fc94178260 Implement I010ToNV12 conversion
I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.

Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-06 17:36:13 +00:00
Frank Barchard
32ccd53bb3 Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
  P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
  of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.
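
The points above can be sketched in scalar C. This is a model under
stated assumptions, not the library function: the name is hypothetical,
strides are counted in elements of the corresponding pointer type, and
each sample is narrowed by keeping the upper 8 bits of the 16-bit
element.

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the upper 8 bits of each 16-bit element. */
static void narrow_row_model(const uint16_t* src, uint8_t* dst, int width) {
  for (int x = 0; x < width; ++x) {
    dst[x] = (uint8_t)(src[x] >> 8);
  }
}

/* Returns 0 on success, -1 on invalid arguments. */
static int p010_to_nv12_model(const uint16_t* src_y, int src_stride_y,
                              const uint16_t* src_uv, int src_stride_uv,
                              uint8_t* dst_y, int dst_stride_y,
                              uint8_t* dst_uv, int dst_stride_uv,
                              int width, int height) {
  if (!src_y || !src_uv || !dst_y || !dst_uv || width <= 0 || height <= 0) {
    return -1;
  }
  /* Round up so an odd-sized Y plane still has UV values. */
  int half_width = (width + 1) / 2;
  int half_height = (height + 1) / 2;
  for (int y = 0; y < height; ++y) {
    narrow_row_model(src_y + (ptrdiff_t)y * src_stride_y,
                     dst_y + (ptrdiff_t)y * dst_stride_y, width);
  }
  /* Each UV row holds half_width interleaved U, V pairs. */
  for (int y = 0; y < half_height; ++y) {
    narrow_row_model(src_uv + (ptrdiff_t)y * src_stride_uv,
                     dst_uv + (ptrdiff_t)y * dst_stride_uv, half_width * 2);
  }
  return 0;
}
```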

Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-05 18:55:44 +00:00
Wan-Teh Chang
e462de319c Fix -Wundef warnings
Change-Id: I803b70f66ca938665ba39b961bdb31625c6bc503
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5758156
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-02 17:39:59 +00:00
Frank Barchard
4cd90347e7 Rotate use NULL for C compatibility
Bug: b/353323977
Change-Id: I2472f23ce8fcc0bc09a292bd6fb758304c6c2b18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5735714
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-07-23 18:02:47 +00:00
Martin Storsjö
36abe81e92 cmake: Check whether SME functions can be compiled
This fixes builds with top-of-tree Clang for Windows; SME functions
require backing up/restoring things that Clang can't express in
Windows unwind information - see
https://github.com/llvm/llvm-project/issues/80009 for more
context (primarily about SVE).

Similar checks would also be needed for SVE and dotprod functions,
if building with older toolchains.

Change-Id: Iab3eeb0a125c3fac9814648288261a056bffc900
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5729969
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-23 16:58:50 +00:00
George Steed
8f039f639c [AArch64] Unroll ScaleRowDown4Box_NEON
We can use wider load/store instructions and avoid the need to waste
half of the ADDP/RSHRN vector data. The duplicated UADDLP and UADALP
instructions also provide a good improvement on little cores due to
their limited out-of-order capability.
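
For context, the arithmetic behind a 4x4 box downscale can be modelled
in scalar C as below. The name is hypothetical and the rounding (add 8,
shift right by 4) is an assumption about the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Each output byte is the rounded average of a 4x4 block of input bytes.
 * src_stride is the distance in bytes between consecutive source rows. */
static void scale_row_down4_box_model(const uint8_t* src,
                                      ptrdiff_t src_stride, uint8_t* dst,
                                      int dst_width) {
  for (int x = 0; x < dst_width; ++x) {
    unsigned sum = 0;
    for (int r = 0; r < 4; ++r) {
      for (int c = 0; c < 4; ++c) {
        sum += src[r * src_stride + c];
      }
    }
    dst[x] = (uint8_t)((sum + 8) >> 4);
    src += 4;
  }
}
```

In this model the pairwise widening adds (UADDLP, UADALP) correspond to
the inner sums and the rounding narrowing shift (RSHRN) to the final
line.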

The mask in the "any" kernel definition is already set up to handle an
unrolling of eight so no change to scale_any.cc is needed.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -19.5%
Cortex-A520: -38.3%
 Cortex-A76: -36.0%
Cortex-A715: -18.1%
Cortex-A720: -17.9%
  Cortex-X1: -25.4%
  Cortex-X2: -18.5%
  Cortex-X3:  -8.2%
  Cortex-X4:  -3.8%

Bug: b/42280945
Change-Id: Iebba5da4db5e25af4b9fa5651c7396364dedffba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725172
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 19:52:21 +00:00
George Steed
dc392094fc [AArch64] Unroll ScaleRowDown34_0_Box_NEON
The additional parallel instruction streams provide a good benefit to
little cores with limited out-of-order capability.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -19.1%
Cortex-A510: -31.6%
Cortex-A520: -35.2%
 Cortex-A76: -14.3%
Cortex-A715:  +0.1%
Cortex-A720:  =0.0%
  Cortex-X1:  -6.6%
  Cortex-X2:  -0.1%
  Cortex-X3:  -0.2%
  Cortex-X4:  -7.2%

Bug: b/42280945
Change-Id: Idca21a5af1dc6f189e644a81537d41f50ef66498
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725171
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 19:52:01 +00:00
George Steed
776a509891 [AArch64] Unroll ScaleRowDown34_1_Box_NEON
We can make use of wider instructions for the loads and stores as well
as the URHADD instructions. In addition, the duplicated instructions
produced by the unrolling provide a further small improvement for
little cores with limited out-of-order capability.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -23.5%
Cortex-A510: -35.4%
Cortex-A520: -40.5%
 Cortex-A76: -15.1%
Cortex-A715:  -6.2%
Cortex-A720:  -6.2%
  Cortex-X1: -17.9%
  Cortex-X2: -18.4%
  Cortex-X3: -18.3%
  Cortex-X4: -14.0%

Bug: b/42280945
Change-Id: I5905e026a0507870bfc580b702906d6acb4ed6f4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725170
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 19:51:45 +00:00
George Steed
be5de19db3 [AArch64] Unroll ScaleRowUp2_Linear_NEON
On little cores with limited out-of-order capability this gives a good
improvement.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55: -21.3%
Cortex-A520: -33.6%
 Cortex-A76:  +1.1%
Cortex-A715:  =0.0%
Cortex-A720:  =0.0%
  Cortex-X1: +10.4% (!)
  Cortex-X2:  -5.3%
  Cortex-X3:  -4.3%
  Cortex-X4:  -9.9%

Bug: b/42280945
Change-Id: I45b3510f13c05b19d61052e2f8e447199dbd0551
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725169
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 19:51:17 +00:00
George Steed
42d33341d3 [AArch64] Unroll {RAW,RGB24}To{ARGB,RGBA}Row_SVE2
Unrolling gives a nice improvement to the little cores and even a small
improvement to the big cores thanks to avoiding the loop control
overhead.

Observed performance improvement relative to the existing SVE2 code.

                    | Cortex-A510 | Cortex-A720 | Cortex-X2
  RAWToARGBRow_SVE2 |      -28.4% |      -10.1% |     -3.5%
  RAWToRGBARow_SVE2 |      -28.5% |      -10.1% |     -4.4%
RGB24ToARGBRow_SVE2 |      -28.5% |      -10.4% |     -5.5%

Bug: libyuv:973
Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 16:01:56 +00:00
George Steed
4ad050b5ec [AArch64] Unroll {I422,I422Alpha}ToARGBRow_SVE2
Since the UV components are duplicated in I422 we end up wasting half of
the vector bandwidth processing the same elements twice. By unrolling
the kernel to process two vectors of Y per iteration we can fill a whole
vector of U/V components.

Rather than packing RGBA components into pairs during the narrowing we
now just narrow into individual component vectors and use ST4B instead.
This by itself is slower on some micro-architectures like Cortex-A510
but the benefit from unrolling significantly outweighs this.

            | I422AlphaToARGBRow_SVE2 | I422ToARGBRow_SVE2
Cortex-A510 |                  -46.2% |             -48.8%
Cortex-A720 |                  -20.8% |             -21.0%
  Cortex-X2 |                  -11.3% |              -7.5%
  Cortex-X4 |                  -15.4% |             -15.5%

Bug: libyuv:973
Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 15:55:59 +00:00
George Steed
b5f9d7cb76 [AArch64] Add SME implementation of TransposeUVWxH
We can make use of the ZA tile register to do the transpose and
de-interleaving of UV components without any explicit permute
instructions: the tile is loaded horizontally placing UV components into
alternating columns, then we can just store the independent components
vertically.
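
The data movement can be modelled with an ordinary 2D array standing in
for the ZA tile. This is a scalar sketch only: the 8x8 tile size and the
function name are hypothetical, and the real tile dimensions depend on
the streaming vector length.

```c
#include <stdint.h>

enum { kTile = 8 }; /* Hypothetical tile size in bytes per side. */

/* Transpose a block of kTile rows by kTile/2 interleaved UV pairs into
 * separate U and V planes of kTile/2 rows by kTile columns each. */
static void transpose_uv_tile_model(const uint8_t* src_uv, int src_stride,
                                    uint8_t* dst_u, int dst_stride_u,
                                    uint8_t* dst_v, int dst_stride_v) {
  uint8_t tile[kTile][kTile];
  /* Horizontal loads: each source row fills one tile row, so U lands in
   * the even columns and V in the odd columns. */
  for (int r = 0; r < kTile; ++r) {
    for (int c = 0; c < kTile; ++c) {
      tile[r][c] = src_uv[r * src_stride + c];
    }
  }
  /* Vertical stores: each tile column becomes one output row. */
  for (int c = 0; c < kTile; ++c) {
    uint8_t* out = (c & 1) ? dst_v + (c / 2) * dst_stride_v
                           : dst_u + (c / 2) * dst_stride_u;
    for (int r = 0; r < kTile; ++r) {
      out[r] = tile[r][c];
    }
  }
}
```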

Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 12:15:40 +00:00
George Steed
15ecca81f7 [AArch64] Add SME implementation of TransposeWxH
We can make use of the ZA tile register to do the transpose without any
explicit permute instructions: just load the tile horizontally and store
it vertically.

Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-19 12:14:39 +00:00
George Steed
a4ccf9940e [AArch64] Add I8MM implementation of ARGBToUV444Row
We cannot use the standard dot-product instructions since the
coefficient products are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here.  We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.
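
A scalar model of the idea, as a sketch: a mixed-sign dot product
multiplies unsigned pixel bytes by signed coefficients, so the
subtracted terms become negative constants. The coefficient values,
bias and function names below are assumptions (BT.601-style weights in
B, G, R, A order) and are not copied from the coefficient structs.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of a mixed-sign dot product:
 * unsigned bytes times signed bytes, accumulated into a signed sum. */
static int32_t usdot_lane_model(int32_t acc, const uint8_t px[4],
                                const int8_t coeff[4]) {
  for (int i = 0; i < 4; ++i) {
    acc += (int32_t)px[i] * coeff[i];
  }
  return acc;
}

/* Assumed coefficients for B, G, R, A. Each row sums to zero. */
static const int8_t kUCoeffs[4] = {112, -74, -38, 0};
static const int8_t kVCoeffs[4] = {-18, -94, 112, 0};

/* Convert a row of ARGB pixels to full-resolution U and V rows. */
static void argb_to_uv444_row_model(const uint8_t* src_argb, uint8_t* dst_u,
                                    uint8_t* dst_v, int width) {
  for (int x = 0; x < width; ++x) {
    /* 0x8080 adds the 128 offset plus rounding; sums stay non-negative. */
    dst_u[x] = (uint8_t)(usdot_lane_model(0x8080, src_argb, kUCoeffs) >> 8);
    dst_v[x] = (uint8_t)(usdot_lane_model(0x8080, src_argb, kVCoeffs) >> 8);
    src_argb += 4;
  }
}
```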

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
  Cortex-X2: -62.1%
  Cortex-X3: -62.2%
  Cortex-X4: -40.4%

Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-16 17:32:52 +00:00
George Steed
302d29d1a8 [AArch64] Add missing feature disable flags to unit_test.cc
Allow users to set LIBYUV_DISABLE_${FEATURE} environment variables to
disable individual architecture extensions.
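
One way such a check could be written is sketched below. The helper
names and the rule for which values count as "disabled" are assumptions
and may differ from what unit_test.cc does.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed rule: any non-empty value that does not start with '0'
 * disables the feature. */
static int env_value_disables(const char* value) {
  return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* Look up LIBYUV_DISABLE_<feature> in the environment. */
static int feature_disabled_by_env(const char* feature) {
  char name[64];
  int n = snprintf(name, sizeof(name), "LIBYUV_DISABLE_%s", feature);
  if (n < 0 || (size_t)n >= sizeof(name)) {
    return 0; /* Name too long to be a real feature flag. */
  }
  return env_value_disables(getenv(name));
}
```

A test harness could then clear the matching CPU feature bit whenever
feature_disabled_by_env() returns non-zero for that feature.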

Change-Id: I555dd64311789bd6d760e48045ac6734177a730b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712929
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-16 17:27:13 +00:00
George Steed
a64fffe632 Revert "Disable NV12ToARGB_SVE2 which fails the 'any' test"
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.

This code has a number of small issues:

* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
  all-true, however the existing kernel does not initialise p0 until
  after this macro is called, so flip the order.

* The p2 register is missing from the clobber list, so add it.

* The existing code uses the wrong condition flags when determining
  whether to do the tail iteration using WHILE instructions or not.
  Additionally the number of tail iterations is incorrect, as it was
  not updated from when the tail code was always executed.

While we are here, make another few small improvements:

* Remove the single-quote digit separators as requested here:
  https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133

* Remove "volatile" from the asm block counting the vector length.  This
  particular asm block cannot be removed by the compiler since the
  output register is consumed by subsequent code, so "volatile" is
  unnecessary here and we remove it.

* Add some additional empty comments to force clang-format to put macros
  into the next line rather than on the same line as other asm.

Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-15 18:13:42 +00:00
George Steed
e1a93c79fc [AArch64] Fix rotate by odd sizes
The existing disabled gtest rotate tests fail because the existing "any"
kernels always assume we are processing height=8 rows at a time. This
was recently changed to 16 on AArch64 which triggered this bug.

To fix this, amend the TANY macro to explicitly specify the fallback
kernel, such that we can use the height=16 kernel to match the SIMD
optimized version where necessary. Also change other architecture
versions to match.

Bug: b/352351302
Change-Id: I8080fa8f44c7c67fa970a78fb426f2f801a9a00e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703585
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-15 18:13:31 +00:00
Frank Barchard
ec9a8781c7 Disable NV12ToARGB_SVE2 which fails the 'any' test
Bug: b/352371649
Change-Id: I9cd332432c1baa1fb64d4040fa5f207cc54dc82c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5698374
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-07-11 23:32:13 +00:00
George Steed
c1fe5663f5 [AArch64] Use full vectors in ARGB4444To{Y,UV}Row_NEON
The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.

For ARGB4444ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.

Reduction in runtimes observed compared to the existing Neon kernels:

            | ARGB4444ToUVRow | ARGB4444ToYRow
 Cortex-A55 |          -27.8% |         -34.6%
Cortex-A510 |          -37.0% |         -44.4%
 Cortex-A76 |          -40.2% |         -22.0%
Cortex-A720 |          -33.4% |         -35.5%
  Cortex-X1 |          -34.1% |         -19.7%
  Cortex-X2 |          -32.1% |         -26.3%

Bug: libyuv:976
Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-10 23:12:43 +00:00
George Steed
5bac99fe09 [AArch64] Rework data loading in ScaleARGBFilterCols_NEON
The existing code makes use of lane-indexed LD2 instructions to load the
input data however this creates a strong dependency chain between
consecutive load instructions. We can reduce this dependency chain by
instead loading two vectors with wider lane-indexed LD1 instructions and
then performing a permute to unzip the data.

We can also avoid the need for a complex sequence of DUP + EXT
instructions by using TBL to permute the data exactly as we want it.

Reduction in runtimes observed compared to the existing Neon
implementation:

 Cortex-A55:  =0.0%
Cortex-A510: -44.2%
Cortex-A520: -47.6%
 Cortex-A76: -45.8%
Cortex-A715: -58.3%
Cortex-A720: -58.4%
  Cortex-X1: -66.7%
  Cortex-X2: -68.0%
  Cortex-X3: -67.9%
  Cortex-X4: -70.0%

Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-10 23:10:43 +00:00
George Steed
a425b559bd [AArch64] Use full vectors in ARGB1555To{Y,UV}Row_NEON
The existing RGB555TOARGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.

For ARGB1555ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB1555ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.

Reduction in runtimes observed compared to the existing Neon kernels:

            | ARGB1555ToUVRow | ARGB1555ToYRow
 Cortex-A55 |          -28.8% |         -35.3%
Cortex-A510 |          -34.0% |         -48.5%
 Cortex-A76 |          -36.7% |         -25.1%
Cortex-A720 |          -29.7% |         -31.1%
  Cortex-X1 |          -31.6% |         -19.7%
  Cortex-X2 |          -27.6% |         -22.7%

Bug: libyuv:976
Change-Id: Idd745c133b5fb65001652a59f01ac1aa3bb42067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631540
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-10 23:09:53 +00:00