This one reuses the SIMD implementations for MergeUVRow_ from the
existing ARGBToNV12 functions.
Bug: libyuv:42280902
Change-Id: If0a4be133d657ed0262f29fdd568dac90b49636c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564317
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
This allows for ABGR conversion using the same methods.
Bug: libyuv:42280902
Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This was implemented by Gemini followed by manual review and some
tweaking for style. The 601 and JPEG constants are fully verified
against the existing non-matrix implementations. On x86 the C-only
versions appear to be about 25% slower than the optimized ones.
Bug: libyuv:42280902
Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
GCC now supports vector segment loads and stores. Their absence was the
reason it was disabled.
Change-Id: I923fd8a15476de8dcc2103bb8335d4fcc3ca96a9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7241606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Detect whether the Arm CPU supports the FMMLA instruction.
Bug: None
Change-Id: Ia7b83bf2735ddeeb8a85da44177e708c34e4b1fb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7085486
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Currently, ARGBToUVMatrixRow_AVX2 and ARGBToUVMatrixRow_SSSE3 fail to
compile with clang in 32-bit PIC builds, with the error message "inline
assembly requires more registers than available".
This is because in PIC code EBX is reserved for the GOT and with a frame
pointer EBP is also unavailable.
Fix this by copying the RGB-to-UV constants to stack locals first and
let the asm use simple stack-relative addressing.
Bug: 444157316
Change-Id: Ica90f0c35039303ecaa145534683f59659fb5d7f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6980714
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Skylake
Was ARGBToJ420_Opt (312 ms)
Now ARGBToJ420_Opt (242 ms)
Icelake
Was ARGBToJ420_Opt (302 ms)
Now ARGBToJ420_Opt (220 ms)
AMD Zen3 on Windows
Was ARGBToJ420_Opt (305 ms)
Now ARGBToJ420_Opt (216 ms)
32 bit x86 uses SSE
Now ARGBToJ420_Opt (326 ms)
MCA analysis of new AVX, SSE and old AVX
https://godbolt.org/z/37bdazWYr
Bug: None
Change-Id: I72f5504407751e164c3558aebe836dd15223d65f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477
Reviewed-by: Justin Green <greenjustin@google.com>
The UV subsample's 4-pixel rounding average and ARGBToJ444 fixed-point scaling
were updated in d32d19cc and c060118b. The LoongArch optimization is updated now.
Bug: 381138208
Change-Id: I3585d72564e4fffe514599b1a9b4fee8fbbd0266
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6878364
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
The y0_fraction and y1_fraction variables in InterpolateRow_NEON were
marked as modified by the inline-asm block. However,
5eea7812826c551559fdcd4a6988fcf1fbe341f6 marked these variables as
`const`, which caused both LLVM and GCC to emit errors about modification
of const variables.
There is no need for these variables to be modified in the loop since
they are read-only, so simply update the inline asm block constraints to
match.
Change-Id: I94ca3696c4163ede6ad27d645f0f445fcfb0a1c3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6818289
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- MCA says old version was 4 cycles and new version is 2.5 cycles/loop
- lunarlake is the only known cpu
mca -mcpu=lunarlake 100 iterations
Was vpmulhu
Iterations: 100
Instructions: 1200
Total Cycles: 426
Total uOps: 1200
Dispatch Width: 8
uOps Per Cycle: 2.82
IPC: 2.82
Block RThroughput: 4.0
Now vpsrlw
Iterations: 100
Instructions: 1200
Total Cycles: 279
Total uOps: 1400
Dispatch Width: 8
uOps Per Cycle: 5.02
IPC: 4.30
Block RThroughput: 2.5
Bug: None
Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Per the Software Development and Build Convention for LoongArch™
Architectures manual, on Linux we should use HWCAP instead of CPUCFG to
detect if LSX/LASX is available. The reason is the kernel may be
configured to disable them, and CPUCFG cannot provide info about the
kernel support.
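A minimal sketch of the HWCAP-based check described above, assuming a
Linux target where getauxval() is available. The function names and the
returned flag values are hypothetical (they are not libyuv's kCpuHas*
values), and the two HWCAP bit positions are written out as assumptions
about the LoongArch definitions in the Linux uapi headers. On any other
target the detection function reports nothing.

```c
#if defined(__linux__) && defined(__loongarch__)
#include <sys/auxv.h>
#endif

// Hypothetical result flags for this sketch only.
enum { kSketchHasLSX = 1, kSketchHasLASX = 2 };

// Assumed bit positions of HWCAP_LOONGARCH_LSX and HWCAP_LOONGARCH_LASX.
#define SKETCH_HWCAP_LSX (1UL << 4)
#define SKETCH_HWCAP_LASX (1UL << 5)

// Pure helper: map an AT_HWCAP word to the sketch's flags.
int SketchFlagsFromHwcap(unsigned long hwcap) {
  int flags = 0;
  if (hwcap & SKETCH_HWCAP_LSX) flags |= kSketchHasLSX;
  if (hwcap & SKETCH_HWCAP_LASX) flags |= kSketchHasLASX;
  return flags;
}

int SketchDetectLoongArchSimd(void) {
#if defined(__linux__) && defined(__loongarch__)
  // The kernel only sets these bits when it has enabled the extension,
  // which is the information CPUCFG alone cannot provide.
  return SketchFlagsFromHwcap(getauxval(AT_HWCAP));
#else
  return 0;
#endif
}
```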
Change-Id: I3f1b23e6d4c91c7da81311fbbe294e36ff178121
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Mostly just a straightforward copy of the existing SVE2 code ported to
Streaming-SVE. Introduce new "any" kernels for non-multiple of two
cases, matching what we already do for SVE2.
The existing SVE2 code makes use of the Neon MOVI instruction that is
not supported in Streaming-SVE, so adjust the code to use FMOV instead
which has the same performance characteristics.
Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. There is no benefit from this kernel when the SVE vector
length is only 128 bits, so skip writing a non-streaming SVE
implementation.
Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Was using avgb twice for non-exact and C for exact.
On Skylake Xeon:
Now SSE3
ARGBToJ420_Opt (326 ms)
Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)
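For reference, a scalar sketch of why applying avgb twice is "not
exact". It assumes the usual pavgb definition (a + b + 1) >> 1 and a
single-rounding 4-pixel average for the exact form; the function names
are illustrative and are not the libyuv kernels.

```c
#include <stdint.h>

// Rounding average of two bytes, as computed by the x86 pavgb instruction.
static uint8_t AvgB(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b + 1) >> 1);
}

// "Not exact": two cascaded 2-pixel averages, so rounding is applied twice.
uint8_t Avg4TwoStep(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return AvgB(AvgB(a, b), AvgB(c, d));
}

// "Exact": one 4-pixel average with a single rounding step.
uint8_t Avg4Exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return (uint8_t)((a + b + c + d + 2) >> 2);
}
```

For the 2x2 block (0, 1, 0, 0) the cascaded form returns 1 while the
exact form returns 0.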
Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h
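As an illustration of the CPU count item in the list above, one
portable way to query the number of logical CPUs is sketched below. This
is not the util/cpuid code; SketchCpuCount is a hypothetical helper that
assumes sysconf(_SC_NPROCESSORS_ONLN) on Linux and macOS and
GetSystemInfo() on Windows.

```c
#if defined(_WIN32)
#include <windows.h>
#else
#include <unistd.h>
#endif

// Hypothetical helper: number of logical CPUs currently online, or 1 if
// the query fails.
int SketchCpuCount(void) {
#if defined(_WIN32)
  SYSTEM_INFO info;
  GetSystemInfo(&info);
  return info.dwNumberOfProcessors > 0 ? (int)info.dwNumberOfProcessors : 1;
#else
  long n = sysconf(_SC_NPROCESSORS_ONLN);
  return n > 0 ? (int)n : 1;
#endif
}
```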
Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This commit reworks the implementation of ARGBToUVMatrixRow_SVE2, using
an approach similar to that recently used in
61bdaee13a701d2b52c6dc943ccc5c888077a591.
In particular we can rework these SVE2 implementations to use 8-bit
dot-product instructions instead of 16-bit, allowing us to process more
data in a single vector.
To ensure that the input values fit in 8-bits, negate the UV constants
arrays passed to the kernel and undo the now-unnecessary flipping of the
middle two component values.
This commit mostly reverses the performance inversion where the Neon
I8MM implementation was previously faster than the SVE2 implementation.
The reduction in runtime observed compared to the existing Neon I8MM
implementation is now:
Cortex-A510: +5.6% (!)
Cortex-A520: -3.0%
Cortex-A710: -12.6%
Cortex-A715: -10.9%
Cortex-A720: -10.8%
Cortex-X2: -3.8%
Cortex-X3: -10.3%
Cortex-X4: -9.5%
Cortex-X925: -6.7%
Change-Id: I30253976dc8e3651cfb5fd39b63a6763975d41e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640990
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16-bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.
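A scalar sketch of the idea, under the assumption that the C reference
replicates the input byte (v * 0x0101), multiplies by the scale and
keeps the upper 16 bits of the 32-bit product. When the scale is 1 << k
the multiply folds into the shift, so a 16-bit lane is sufficient. The
function names are illustrative only.

```c
#include <stdint.h>

// Assumed shape of the C reference: needs a 32-bit intermediate.
uint16_t Convert8To16_Widening(uint8_t v, int scale) {
  return (uint16_t)(((uint32_t)v * 0x0101u * (uint32_t)scale) >> 16);
}

// Power-of-two scale: v * 0x0101 always fits in 16 bits (max 65535), and
// multiplying by 1 << k before shifting right by 16 equals shifting right
// by (16 - k), so no widening is required.
uint16_t Convert8To16_Shift(uint8_t v, int log2_scale) {
  uint16_t replicated = (uint16_t)(v * 0x0101);
  return (uint16_t)(replicated >> (16 - log2_scale));
}

// Compares both forms for every input byte and every scale 2^1..2^15.
int CountConvertMismatches(void) {
  int mismatches = 0;
  for (int k = 1; k <= 15; ++k) {
    for (int v = 0; v < 256; ++v) {
      if (Convert8To16_Widening((uint8_t)v, 1 << k) !=
          Convert8To16_Shift((uint8_t)v, k)) {
        ++mismatches;
      }
    }
  }
  return mismatches;
}
```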
Reduction in run times observed compared to the existing C
implementation:
Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
Cortex-A76: -61.6%
Cortex-A710: -57.6%
Cortex-X1: -46.5%
Cortex-X2: -54.4%
Cortex-X3: -57.1%
Cortex-X4: -55.0%
Cortex-X925: -49.3%
Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
Cortex-X2: -5.7%
Cortex-X3: -5.9%
Cortex-X4: -2.8%
Cortex-X925: -4.0%
Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715: -9.2%
Cortex-A720: -9.2%
Cortex-X2: -25.8%
Cortex-X3: -26.2%
Cortex-X4: -23.2%
Cortex-X925: -17.8%
Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.
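A scalar sketch of the negation trick, using illustrative weights that
are not libyuv's tables. USDOT multiplies unsigned bytes (the pixels) by
signed bytes (the coefficients), so every coefficient must lie in
[-128, 127]. A weight of +128 does not fit, but its negation does, and
the sign can be undone when the bias is applied.

```c
#include <stdint.h>

// Illustrative B, G, R weights; +128 is outside the int8_t range.
static const int kWeights[3] = {128, -84, -44};
// The negated weights all fit in int8_t.
static const int8_t kNegWeights[3] = {-128, 84, 44};

uint8_t ToU_Direct(uint8_t b, uint8_t g, uint8_t r) {
  int sum = kWeights[0] * b + kWeights[1] * g + kWeights[2] * r;
  return (uint8_t)((sum + 0x8000) >> 8);
}

uint8_t ToU_Negated(uint8_t b, uint8_t g, uint8_t r) {
  // Dot product of unsigned pixels with signed 8-bit weights.
  int neg_sum = kNegWeights[0] * b + kNegWeights[1] * g + kNegWeights[2] * r;
  // Subtract instead of add to undo the negation.
  return (uint8_t)((0x8000 - neg_sum) >> 8);
}

// Compares both forms over a grid of pixel values (step 3 includes 255).
int CountNegatedMismatches(void) {
  int mismatches = 0;
  for (int b = 0; b < 256; b += 3) {
    for (int g = 0; g < 256; g += 3) {
      for (int r = 0; r < 256; r += 3) {
        if (ToU_Direct((uint8_t)b, (uint8_t)g, (uint8_t)r) !=
            ToU_Negated((uint8_t)b, (uint8_t)g, (uint8_t)r)) {
          ++mismatches;
        }
      }
    }
  }
  return mismatches;
}
```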
Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:
Cortex-A510: -7.1%
Cortex-A520: -2.1%
Cortex-A710: -8.4%
Cortex-A715: -0.3%
Cortex-A720: -0.3%
Cortex-X2: -40.0%
Cortex-X3: -43.3%
Cortex-X4: -11.3%
Cortex-X925: -2.5%
Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function
Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Several consumers of libyuv do unified sources build where many source
files are #include'd together to make compilation units larger and allow
for more optimization chances. But for LoongArch there is a wrinkle:
LASX and LSX code paths are implemented in separate files, unlike the
other currently supported architectures, and some definitions are
duplicated e.g. struct RgbConstants.
Since the duplicated content is identical across the two files, short of
some bigger refactoring, we can simply place #ifdef guards around the
definitions to fix unified sources build for LoongArch.
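A minimal sketch of the guard pattern, with both copies placed in one
block to mimic a unified build. The guard macro name and the struct
fields are hypothetical and may not match the actual LSX/LASX sources.

```c
#include <stdint.h>

// --- copy that would live in the LSX source file ---
#ifndef SKETCH_RGB_CONSTANTS_DEFINED
#define SKETCH_RGB_CONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4];
  uint16_t kAddY;
};
#endif

// --- identical copy that would live in the LASX source file ---
// Skipped by the preprocessor when both files share a compilation unit.
#ifndef SKETCH_RGB_CONSTANTS_DEFINED
#define SKETCH_RGB_CONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4];
  uint16_t kAddY;
};
#endif

// Either file can use the single surviving definition.
uint16_t SketchAddY(const struct RgbConstants* c) {
  return c->kAddY;
}
```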
Change-Id: I952e8e0210221ec8bcc113f75fa1b9ba515ec323
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6272801
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc
Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
SVE can make use of the UMULH instruction to avoid needing separate
widening multiply and narrowing steps for the scale application.
Reduction in runtime for Convert8To8Row_SVE2 observed compared to the
existing Neon implementation:
Cortex-A510: -13.2%
Cortex-A520: -16.4%
Cortex-A710: -37.1%
Cortex-A715: -38.5%
Cortex-A720: -38.4%
Cortex-X2: -33.2%
Cortex-X3: -31.8%
Cortex-X4: -31.8%
Cortex-X925: -13.9%
Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Replace ScalePlane with CopyPlane for Y channel
- Vertical mirroring is supported, but not horizontal mirroring.
- Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y.
- Apply clang-format
- Bump version to 1899
Bug: None
Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599
Reviewed-by: Wan-Teh Chang <wtc@google.com>
* Run on SiFive internal FPGA:
Test Case Speedup
ARGBScaleDownBy3by8_Linear x2.05
ARGBScaleDownBy3by8_Bilinear x1.76
ARGBScaleDownBy3by8_Box x1.76
Bug: 42280924
Co-Developed-by: Bruce Lai <bruce.lai@sifive.com>
Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The HalfRow kernels assume that the fraction is exactly half, so there
is no need to calculate it.
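A scalar sketch of why the fraction is redundant for the half case,
assuming the general blend is (a * (256 - f) + b * f + 128) >> 8 and
the half blend is (a + b + 1) >> 1. With f = 128 the two are identical
for all byte inputs. The function names are illustrative.

```c
#include <stdint.h>

// General two-row blend; fraction is the weight of row 1 in 1/256 units.
uint8_t BlendGeneral(uint8_t a, uint8_t b, int fraction) {
  return (uint8_t)((a * (256 - fraction) + b * fraction + 128) >> 8);
}

// Fixed half blend; no fraction is needed.
uint8_t BlendHalf(uint8_t a, uint8_t b) {
  return (uint8_t)((a + b + 1) >> 1);
}

// Compares both forms for every pair of byte values.
int CountHalfMismatches(void) {
  int mismatches = 0;
  for (int a = 0; a < 256; ++a) {
    for (int b = 0; b < 256; ++b) {
      if (BlendGeneral((uint8_t)a, (uint8_t)b, 128) !=
          BlendHalf((uint8_t)a, (uint8_t)b)) {
        ++mismatches;
      }
    }
  }
  return mismatches;
}
```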
No-Try: True
Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
Some of the color conversion kernels already have Streaming-SVE
implementations; however, many do not. We can re-use the existing SVE
implementation by moving it to a new shared row_sve.h header and marking
it with a "streaming-compatible" attribute to ensure it can be called
from both streaming and non-streaming execution modes.
As part of this move to a common header we also add duplicated
streaming-mode implementations of the following kernels that did not
previously have an SME implementation:
- I210AlphaToARGBRow_SME
- I210ToAR30Row_SME
- I210ToARGBRow_SME
- I212ToAR30Row_SME
- I212ToARGBRow_SME
- I400ToARGBRow_SME
- I410AlphaToARGBRow_SME
- I410ToAR30Row_SME
- I410ToARGBRow_SME
- I422AlphaToARGBRow_SME
- I422ToARGB1555Row_SME
- I422ToARGB4444Row_SME
- I422ToRGB24Row_SME
- I422ToRGB565Row_SME
- I422ToRGBARow_SME
- I444AlphaToARGBRow_SME
- NV12ToARGBRow_SME
- NV12ToRGB24Row_SME
- NV21ToARGBRow_SME
- NV21ToRGB24Row_SME
- P210ToAR30Row_SME
- P210ToARGBRow_SME
- P410ToAR30Row_SME
- P410ToARGBRow_SME
- UYVYToARGBRow_SME
- YUY2ToARGBRow_SME
Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
InterpolateRow_SME and InterpolateRow_16_SME need special cases to
handle if source_y_fraction is 256 since this would overflow a byte and
can just be a call to memcpy instead.
InterpolateRow_16To8_SME is never called with a source_y_fraction value
of 256 so there is no need for a special case here.
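A scalar sketch of the special case, assuming that a fraction of 256
puts the full weight on the second row. InterpolateRowSketch and its
parameter layout are hypothetical and do not mirror the SME kernel.

```c
#include <stdint.h>
#include <string.h>

void InterpolateRowSketch(uint8_t* dst,
                          const uint8_t* row0,
                          const uint8_t* row1,
                          int width,
                          int source_y_fraction) {
  if (source_y_fraction == 256) {
    // 256 does not fit in a byte-sized weight, and the result is exactly
    // row 1, so copy it instead of blending.
    memcpy(dst, row1, (size_t)width);
    return;
  }
  // From here on the row-1 weight is in [0, 255] and fits in a byte.
  const uint8_t y1 = (uint8_t)source_y_fraction;
  const int y0 = 256 - y1;
  for (int x = 0; x < width; ++x) {
    dst[x] = (uint8_t)((row0[x] * y0 + row1[x] * y1 + 128) >> 8);
  }
}
```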
Change-Id: I67805b5db2c411acb93ada626cf414b35620f467
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Add a streaming-SVE implementation of CopyRow using normal vector
load/store instructions.
Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. We can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.
There is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing.
Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE. These follow the same pattern as the prior ScaleRowDown2
SME kernels, but operate on 16-bit data rather than 8-bit.
There is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just straightforward copies of the Neon code ported to
Streaming-SVE. These follow the same pattern as the prior ScaleRowDown2
and ScaleUVRowDown2 SME kernels, but operate on 32-bit ARGB tuples
rather than 8-bit data or 16-bit UV tuples.
There is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. We can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.
There is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.
Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Force macros onto empty lines with empty comments and adjust some other
comments to be consistent with the rest of the file.
Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Now that we have the `_2X` versions of the macros we can use these to
implement `ToRGB24` kernels. These cannot use the bottom/top approach
previously used by other SVE kernels since there are three rather than
two or four elements each.
Reduction in runtimes observed compared to the existing Neon
implementations:
| NV12ToRGB24Row | NV21ToRGB24Row
Cortex-A510 | -60.7% | -60.7%
Cortex-A520 | -46.0% | -46.0%
Cortex-A715 | -25.2% | -25.2%
Cortex-A720 | -25.2% | -25.2%
Cortex-X2 | -28.9% | -29.0%
Cortex-X3 | -28.2% | -28.1%
Cortex-X4 | -30.8% | -30.7%
Cortex-X925 | -28.8% | -28.9%
Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These were mistakenly copied from the main loop body. However, this
particular block of the code is only executed at most once so we do not
need to perform the address updates.
Also adjust formatting with clang-format to match other kernels.
Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can reuse most of the logic from the existing I422TORGB_SVE_2X macro
and simply amend the existing READNV_SVE macro to read twice as much
data.
Unrolling is primarily beneficial for little cores but also provides
some smaller benefits to larger cores as well.
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -48.0% | -47.9%
Cortex-A520 | -48.1% | -48.2%
Cortex-A715 | -20.4% | -20.4%
Cortex-A720 | -20.6% | -20.6%
Cortex-X2 | -7.1% | -7.3%
Cortex-X3 | -4.0% | -4.3%
Cortex-X4 | -14.1% | -14.3%
Cortex-X925 | -8.2% | -8.6%
Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Several of the existing SVE kernels used calculations of the form:
remainder = (width & (vl - 1)) == 0 ? vl : width & (vl - 1);
This is due to initial SVE contributed code unconditionally using the
predicated tail for the final iteration even if the width was a perfect
multiple of the vector length.
In the current code the fully-predicated main body loop will instead
iterate through the width completely and simply skip over the tail
entirely. Skipping over the tail means that the case handled by the
ternary condition now never occurs, and the remainder calculation can
now simply be:
remainder = width & (vl - 1);
This avoids the need for a compare and conditional select in the
function prologue.
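A scalar emulation of the loop structure described above, assuming vl is
a power of two. CopyRowSketch is a hypothetical stand-in for a kernel:
the main loop handles whole vectors only and the tail is skipped when
the remainder is zero, so there is no need to map a remainder of 0 to
vl.

```c
#include <stdint.h>

// Copies width bytes using a whole-vector main loop plus an optional tail.
// Returns the number of elements handled by the tail.
int CopyRowSketch(const uint8_t* src, uint8_t* dst, int width, int vl) {
  int remainder = width & (vl - 1);  // 0 when width is a multiple of vl.
  int x = 0;
  // Main loop: one full "vector" of vl elements per iteration.
  for (; x + vl <= width; x += vl) {
    for (int i = 0; i < vl; ++i) {
      dst[x + i] = src[x + i];
    }
  }
  // Tail: only entered when some elements are left over.
  if (remainder != 0) {
    for (int i = 0; i < remainder; ++i) {
      dst[x + i] = src[x + i];
    }
  }
  return remainder;
}
```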
Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing instruction arrangement is sub-optimal on little cores
since it has instructions with dependencies next to each other, so
spread them out to improve performance.
No significant change observed on bigger cores, but little cores do show
some small improvements except for the *Alpha* kernels which regress
slightly.
Runtimes observed compared to the previous SVE implementation:
| Cortex-A510 | Cortex-A520
I210AlphaToARGBRow | (!) +7.0% | (!) +6.8%
I210ToAR30Row | -10.3% | -9.9%
I210ToARGBRow | -2.4% | -2.3%
I212ToAR30Row | -10.3% | -9.9%
I212ToARGBRow | -2.4% | -2.3%
Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions
caused by changes in calculation methods.
In addition, add the option to automatically add "-mlsx" and "-mlasx" to
enable SIMD optimization when compiling with cmake on LoongArch
platform.
Bug: libyuv:913
Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
The auto-vectorized implementation unrolls to process 32 elements per
iteration, so unroll the new Neon implementation to match and avoid a
performance regression on little cores.
Performance relative to the auto-vectorized C implementation compiled
with LLVM 19:
Cortex-A55: -35.8%
Cortex-A510: -20.4%
Cortex-A520: -22.1%
Cortex-A76: -54.8%
Cortex-A710: -44.5%
Cortex-A715: -31.1%
Cortex-A720: -31.4%
Cortex-X1: -48.5%
Cortex-X2: -47.8%
Cortex-X3: -47.6%
Cortex-X4: -51.1%
Cortex-X925: -14.6%
Bug: b/42280942
Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The #ifdef surrounding the use of this kernel is never defined and
ScaleRowDown2_16_NEON does not exist, so add the missing #define and
remove the use of ScaleRowDown2_16_NEON for now. Additionally since
there is no implementation of this kernel for 32-bit Arm, restrict the
define to only be present on AArch64.
Bug: b/42280942
Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value
Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We do
not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
For HalfFloat1Row, SVE has direct 16-bit integer to half-float
conversion instructions so there is no need to widen to 32-bits.
For HalfFloatRow, SVE zero-extending loads avoid the need for separate
UXTL(2) instructions.
Observed reductions in runtime compared to the existing Neon code:
| HalfFloat1Row | HalfFloatRow
Cortex-A510 | -38.3% | -17.3%
Cortex-A520 | -37.6% | -18.8%
Cortex-A720 | -50.1% | -7.8%
Cortex-X2 | -50.2% | -0.4%
Cortex-X4 | -51.5% | -12.5%
Bug: b/42280942
Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
SVE contains the UMULH instruction which allows us to multiply and take
the high half of the result in a single instruction rather than needing
separate widening multiply and then narrowing shift steps.
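A scalar model of what a multiply-high operation computes per lane,
assuming 16-bit lanes. It is written here in the two-step form (widen,
multiply, narrow by shifting) that UMULH replaces with one instruction.
The function names are illustrative, not the kernel's.

```c
#include <stdint.h>

// Per-lane model: upper 16 bits of the 32-bit product of two 16-bit values.
uint16_t MulHigh16(uint16_t a, uint16_t b) {
  uint32_t wide = (uint32_t)a * (uint32_t)b;  // Widening multiply.
  return (uint16_t)(wide >> 16);              // Narrowing shift.
}

// Applies the same scale to every element of a row.
void ScaleRowHighSketch(const uint16_t* src, uint16_t* dst, uint16_t scale,
                        int width) {
  for (int x = 0; x < width; ++x) {
    dst[x] = MulHigh16(src[x], scale);
  }
}
```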
Observed reduction in runtime compared to the existing Neon code:
Cortex-A510: -21.2%
Cortex-A520: -20.9%
Cortex-A715: -47.9%
Cortex-A720: -47.6%
Cortex-X2: -5.2%
Cortex-X3: -2.6%
Cortex-X4: -32.4%
Cortex-X925: -1.5%
Bug: b/42280942
Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no benefit from an SVE version of this kernel for devices with
an SVE vector length of 128-bits, so skip directly to SME instead. We
do not use the ZA tile here, so this is a purely streaming-SVE (SSVE)
implementation.
Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The assignment of ScaleUVRowDown2Box_NEON is already done in the block
immediately below this one, so just remove this code.
Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.
Reduction in cycle counts observed compared to the existing Neon
implementation:
| HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 | -37.1% | -40.8%
Cortex-A520 | -32.3% | -37.4%
Cortex-A720 | 0.0% | -10.6%
Cortex-X2 | 0.0% | -7.8%
Cortex-X4 | +0.3% | -6.9%
Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>