These were mistakenly copied from the main loop body, however this
particular block of the code is only executed at most once so we do not
need to perform the address updates.
Also adjust formatting with clang-format to match other kernels.
Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.
Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:
Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
Cortex-X1: -48.5%
Cortex-X2: -47.8%
Cortex-X3: -47.6%
Cortex-X4: -51.1%
Cortex-X925: -14.6%
Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary
permutes.
Reduction in run times:
Cortex-A55: -24.8%
Cortex-A510: -32.7%
Cortex-A520: -37.7%
Cortex-A76: -51.8%
Cortex-A715: -58.9%
Cortex-A720: -58.9%
Cortex-X1: -54.8%
Cortex-X2: -50.3%
Cortex-X3: -57.1%
Cortex-X4: -49.8%
Cortex-X925: -52.0%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: Ie96bac30fffbe41f8d1501ee289795830ab127e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to
be slow on some micro-architectures, and remove other unnecessary permutes.
Reduction in run times:
Cortex-A55: -17.9%
Cortex-A510: -28.7%
Cortex-A520: -31.8%
Cortex-A76: -40.8%
Cortex-A715: -46.1%
Cortex-A720: -46.1%
Cortex-X1: -44.3%
Cortex-X2: -40.1%
Cortex-X3: -46.3%
Cortex-X4: -40.2%
Cortex-X925: -42.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: b/42280945
Change-Id: I84e2cd04912fc11d59b4407a1836f047b74a4c92
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872802
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Lane-indexed LD2 instructions are slow and introduce an unnecessary
dependency on the previous iteration of the loop. To avoid this
dependency use a scalar load for the first iteration and lane-indexed
LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd
elements.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -6.7%
Cortex-A510: -13.2%
Cortex-A520: -13.1%
Cortex-A76: -54.5%
Cortex-A715: -60.3%
Cortex-A720: -61.0%
Cortex-X1: -69.1%
Cortex-X2: -68.6%
Cortex-X3: -73.9%
Cortex-X4: -73.8%
Cortex-X925: -69.0%
Bug: b/42280945
Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a pair of stores since there is no AArch64
instruction in Neon to store exactly 12 bytes from a vector register.
It is guaranteed to be safe to write full vectors until the last
iteration of the loop, since the extra four bytes will be over-written
by subsequent iterations. This allows us to avoid duplicating the store
instruction and address arithmetic.
Reduction in runtime observed relative to the existing Neon
implementation:
Cortex-A55: +2.0%
Cortex-A510: -25.3%
Cortex-A520: -15.1%
Cortex-A76: -32.2%
Cortex-A715: -19.7%
Cortex-A720: -19.6%
Cortex-X1: -31.6%
Cortex-X2: -27.1%
Cortex-X3: -25.9%
Cortex-X4: -24.7%
Cortex-X925: -35.8%
Bug: b/42280945
Change-Id: I222ed662f169d82f5f472bebb1bcfe6d428ccae2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872843
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code uses three MOV instructions through a temporary
register to swap the low and high halves of a vector register, however
this can be done with a pair of ZIP instructions instead.
Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in
parallel on little cores.
Reduction in runtime observed compared to the existing Neon
implementation:
Cortex-A55: -8.3%
Cortex-A510: -20.6%
Cortex-A520: -16.6%
Cortex-A76: -6.8%
Cortex-A715: -6.2%
Cortex-A720: -6.2%
Cortex-X1: -22.0%
Cortex-X2: -18.7%
Cortex-X3: -21.1%
Cortex-X4: -25.8%
Cortex-X925: -21.9%
Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
ST3 is known to be slow on a number of modern micro-architectures. By
unrolling the code we are able to use TBL to shuffle elements into the
correct indices without needing to use LD4 and ST3, giving a good
improvement in performance across the board.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -14.4%
Cortex-A510: -66.0%
Cortex-A520: -50.8%
Cortex-A76: -60.5%
Cortex-A715: -63.9%
Cortex-A720: -64.2%
Cortex-X1: -74.3%
Cortex-X2: -75.4%
Cortex-X3: -75.5%
Cortex-X4: -48.1%
Bug: b/42280945
Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use wider load/store instructions here which is mostly an
improvement across the board.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: +4.9% (!)
Cortex-A510: -46.3%
Cortex-A520: -49.0%
Cortex-A76: -12.2%
Cortex-A715: -15.5%
Cortex-A720: -15.0%
Cortex-X1: -12.4%
Cortex-X2: -12.5%
Cortex-X3: -12.3%
Cortex-X4: +0.3%
Bug: b/42280945
Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Replace LD4 with a pair of LD2 instructions to avoid needing an ST2
instruction for storing the result, since ST2 instructions are known to
be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -23.3%
Cortex-A510: -49.6%
Cortex-A520: -31.1%
Cortex-A76: -44.5%
Cortex-A715: -45.8%
Cortex-A720: -46.0%
Cortex-X1: -74.5%
Cortex-X2: -72.4%
Cortex-X3: -76.8%
Cortex-X4: -39.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use separate permute instructions to avoid using LD4/ST2 as these
instructions are known to be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -12.4%
Cortex-A510: -44.8%
Cortex-A520: -31.1%
Cortex-A76: -55.3%
Cortex-A715: -63.7%
Cortex-A720: -62.3%
Cortex-X1: -79.0%
Cortex-X2: -78.9%
Cortex-X3: -79.6%
Cortex-X4: -59.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.
Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use wider load/store instructions and avoid the need to waste
half of the ADDP/RSHRN vector data. The duplicated UADDLP and UADALP
instructions also provide a good improvement on little cores due to
their limited out-of-order capability.
The mask in the "any" kernel definition is already set up to handle an
unrolling of eight so no change to scale_any.cc is needed.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -19.5%
Cortex-A520: -38.3%
Cortex-A76: -36.0%
Cortex-A715: -18.1%
Cortex-A720: -17.9%
Cortex-X1: -25.4%
Cortex-X2: -18.5%
Cortex-X3: -8.2%
Cortex-X4: -3.8%
Bug: b/42280945
Change-Id: Iebba5da4db5e25af4b9fa5651c7396364dedffba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725172
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of wider instructions for the loads and stores as well
as the URHADD instructions. In addition the duplicated instructions of
the code from the unrolling provides a further small improvement for
little cores with limited out-of-order capability.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -23.5%
Cortex-A510: -35.4%
Cortex-A520: -40.5%
Cortex-A76: -15.1%
Cortex-A715: -6.2%
Cortex-A720: -6.2%
Cortex-X1: -17.9%
Cortex-X2: -18.4%
Cortex-X3: -18.3%
Cortex-X4: -14.0%
Bug: b/42280945
Change-Id: I5905e026a0507870bfc580b702906d6acb4ed6f4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725170
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of lane-indexed LD2 instructions to load the
input data however this creates a strong dependency chain between
consecutive load instructions. We can reduce this dependency chain by
instead loading two vectors with wider lane-indexed LD1 instructions and
then performing a permute to unzip the data.
We can also avoid the need for a complex sequence of DUP + EXT
instructions by using TBL to permute the data exactly as we want it.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: =0.0%
Cortex-A510: -44.2%
Cortex-A520: -47.6%
Cortex-A76: -45.8%
Cortex-A715: -58.3%
Cortex-A720: -58.4%
Cortex-X1: -66.7%
Cortex-X2: -68.0%
Cortex-X3: -67.9%
Cortex-X4: -70.0%
Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The MOV instruction is an alias of ORR where both registers are the
same and should be preferred.
Both ORR and MOV are not zero-cost instructions on all
micro-architectures so there may be better ways to express these
kernels, but this is left for a later commit.
Bug: libyuv:975
Change-Id: I29b7f182a57a61855cb7f8a867691080f153b10b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use a pair of LD2s to load data interleaved and perform a couple of
additions on the registers in order to avoid needing LD4 and ST4
instructions, since these are costly on some micro-architectures.
Reduction in run times:
Cortex-A55: -20.5%
Cortex-A510: -28.3%
Cortex-A76: -21.5%
Bug: libyuv:976
Change-Id: If66e1e148b031c2cd288ff412f351d7a0b9b91e7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371774
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Replace indexed LD1 instructions with LDRs to avoid loop-carried
dependencies on unused lanes between consecutive iterations of the loop.
Reduction in run times:
Cortex-A55: -10.9%
Cortex-A510: -70.7%
Cortex-A76: -56.8%
Bug: libyuv:976
Change-Id: Ia767e76002c7823177e80163ebf034e023e9a6cc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371771
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are a few functions in source/scale_neon64.cc which write memory
and set condition flags despite not declaring this in the asm clobber
list, so add the missing clobbers.
Also move a couple of memory/cc clobbers to the start of the clobber
list to match other kernels.
Bug: libyuv:974
Change-Id: I85f5ff5718e78a4481f7bc53cedaeceb14438895
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5309254
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- update cpu_id to use "re" for fopen to avoid leaking handles if a thread is started while the file is open.
Bug: libyuv:958
Change-Id: I1af9de68fce12e440e1226fc8070634ccb1bf090
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4417176
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
MOV Vy.4s, Vx.4s is not a valid instruction form (even though LLVM allows it).
It should be MOV Vy.16b, Vx.16b (.8b for 64-bit variants)
Bug: None
Change-Id: I3c3b42288a0ebc275962fa3adad707b351d00d4c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2780155
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
These are 16 bit bi-planar convert functions to scale UV plane to
Y plane's size using (bi)linear filter.
libyuv_unittest --gtest_filter=*ToP41*
R=fbarchard@chromium.org
Bug: libyuv:872
Change-Id: I3cb4fafe2b2c9eedd0d91cf4c619abb9ee107bc1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2690102
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These are bi-planar convert functions to scale UV plane to Y plane's size using (bi)linear filter.
libyuv_unittest --gtest_filter=*ToNV24*
R=fbarchard@chromium.org
Change-Id: I3d98f833feeef00af3c903ac9ad0e41bdcbcb51f
Bug: libyuv:872
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2682152
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
new color util to compute constants needed based on white point.
[ RUN ] LibYUVColorTest.TestFullYUVV
hist -2 -1 0 1 2
red 0 1627136 13670144 1479936 0
green 319285 3456836 9243059 3440771 317265
blue 0 1561088 14202112 1014016 0
Bug: libyuv:877, b/178283356
Change-Id: If432ebfab76b01302fdb416a153c4f26ca0832d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2678859
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
These functions use (bi)linear filter, to scale U and V planes to the size of Y plane.
This will help enhance the quality of YUV to RGB conversion.
Also added 10bit and 12bit version:
I010ToI410
I210ToI410
I012ToI412
I212ToI412
libyuv_unittest --gtest_filter=LibYUVConvertTest.I42*ToI444*:LibYUVConvertTest.I*1*ToI41*
R=fbarchard@chromium.org
Change-Id: Ie4a711a5ba28f2ff1f44c021f7a5c149022264c5
Bug: libyuv:872
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2658097
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This allows the linker to move the variables from the .data section to
the .rodata section.
Bug: libyuv:254
Test: out/Release/libyuv_unittest --gtest_filter=* --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=999 --libyuv_flags=-1 --libyuv_cpu_info=-1
Change-Id: I6998570f1af4337d7b80313d9e18e36aa20d6ec0
Reviewed-on: https://chromium-review.googlesource.com/777033
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>