The z21 register is used in the I444TORGB_SVE_2X macro and other places,
so add it to the clobber list macro that is used throughout this file.
Change-Id: If4277c1ffcac0fa68cc44263acc6f41a9e82ec8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6619508
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Sort the Arm Neon and Neon DotProd #define lists to match the
alphabetical ordering used for the SVE2 and SME lists.
Change-Id: Ibeb380f477d5476d0018d20a754557a5f93f2190
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6613686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This CL switches libyuv builders from Ninja to Siso. Reclient will still
be used.
https://crrev.com/c/6605972 is the corresponding recipe change.
No-Try: true
Bug: chromium:412968361
Change-Id: I6ba063d0aa954185284a44d0b353278d71953e4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6589372
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Junji Watanabe <jwata@google.com>
Reviewed-by: Christoffer Dewerin <jansson@chromium.org>
And enable LASX by default for LoongArch builds, because LASX is
widely supported among LoongArch desktops and servers, and performance
is better than with LSX alone.
Because the LoongArch SIMD code is written to only compile if the
respective codegen option is enabled, but the defaults and availability
differ between compiler versions and target `-march` setting, the
codegen flags are explicitly added to CFLAGS for wider compatibility.
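
As a rough illustration of why the flags matter: to my understanding, GCC
and Clang define __loongarch_sx when LSX codegen is enabled (for example via
-mlsx) and __loongarch_asx when LASX codegen is enabled (for example via
-mlasx). The helper below is a hypothetical sketch, not code from libyuv,
showing how a source file gated on those macros sees the build configuration.

```c
// Reports the widest LoongArch SIMD extension enabled at compile time:
// 256 for LASX, 128 for LSX, 0 when neither codegen option is active
// (including on every non-LoongArch target).
static int loongarch_simd_width(void) {
#if defined(__loongarch_asx)
  return 256;
#elif defined(__loongarch_sx)
  return 128;
#else
  return 0;
#endif
}
```

If the flags are missing from CFLAGS and the compiler default does not
enable them, a helper like this returns 0 and the SIMD code paths are
compiled out.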
Bug: None
Change-Id: I735ceac0f6b46eea2155e58ecf3630383ef5b728
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6241804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16 bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.
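
A minimal scalar sketch of the idea, assuming the reference computation has
the form (v * 0x0101 * scale) >> 16 (that form and all function names below
are assumptions for illustration, not quoted from the source). When scale is
1 << k, the multiply by scale folds into the shift, and the replicated byte
still fits in 16 bits.

```c
#include <stdint.h>

// Assumed reference form: replicate the byte (multiply by 0x0101), apply the
// scale, then keep bits 16 and above of the 32-bit product.
static uint16_t convert8to16_ref(uint8_t v, uint32_t scale) {
  return (uint16_t)(((uint32_t)v * 0x0101u * scale) >> 16);
}

// Shift-only form: valid only when scale == 1 << k with 0 <= k <= 15.
static uint16_t convert8to16_shift(uint8_t v, int k) {
  uint16_t replicated = (uint16_t)(v * 0x0101u);  // at most 0xFFFF
  return (uint16_t)(replicated >> (16 - k));
}

// Returns 1 if both forms agree for every input byte and every k in [0, 15].
static int convert8to16_forms_agree(void) {
  for (int k = 0; k <= 15; ++k) {
    for (int v = 0; v < 256; ++v) {
      if (convert8to16_ref((uint8_t)v, 1u << k) !=
          convert8to16_shift((uint8_t)v, k)) {
        return 0;
      }
    }
  }
  return 1;
}
```

The shift-only form never needs more than 16 bits per element, which is why
no widening of the input data is required.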
Reduction in run times observed compared to the existing C
implementation:
Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
Cortex-A76: -61.6%
Cortex-A710: -57.6%
Cortex-X1: -46.5%
Cortex-X2: -54.4%
Cortex-X3: -57.1%
Cortex-X4: -55.0%
Cortex-X925: -49.3%
Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
Cortex-X2: -5.7%
Cortex-X3: -5.9%
Cortex-X4: -2.8%
Cortex-X925: -4.0%
Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715: -9.2%
Cortex-A720: -9.2%
Cortex-X2: -25.8%
Cortex-X3: -26.2%
Cortex-X4: -23.2%
Cortex-X925: -17.8%
Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.
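
A small scalar sketch of the negation trick. The coefficient values, the
three-element dot product (USDOT works on groups of four bytes), and the way
the sign is restored by subtracting from a bias are all illustrative
assumptions, not the kernel's actual constants or structure.

```c
#include <stdint.h>

// Illustrative coefficients only: +128 does not fit in int8_t, whose range
// is [-128, 127], but every value in the negated set does.
static const int8_t kNegCoeff[3] = {-128, 85, 43};  // negated {128, -85, -43}

// Unsigned bytes multiplied by signed 8-bit coefficients and summed, which
// is the operand shape of an unsigned-by-signed dot product.
static int32_t usdot3(const uint8_t px[3], const int8_t c[3]) {
  return (int32_t)px[0] * c[0] + (int32_t)px[1] * c[1] +
         (int32_t)px[2] * c[2];
}

// Uses the negated coefficients and restores the sign by subtracting.
static int32_t weighted_sum_negated(const uint8_t px[3], int32_t bias) {
  return bias - usdot3(px, kNegCoeff);
}

// Direct form with the original signs, which needs 16-bit coefficients.
static int32_t weighted_sum_direct(const uint8_t px[3], int32_t bias) {
  static const int16_t kCoeff[3] = {128, -85, -43};
  return bias + (int32_t)px[0] * kCoeff[0] + (int32_t)px[1] * kCoeff[1] +
         (int32_t)px[2] * kCoeff[2];
}
```

Both functions return the same value for any input, so negating the
constants changes only how they are stored, not the result.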
Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:
Cortex-A510: -7.1%
Cortex-A520: -2.1%
Cortex-A710: -8.4%
Cortex-A715: -0.3%
Cortex-A720: -0.3%
Cortex-X2: -40.0%
Cortex-X3: -43.3%
Cortex-X4: -11.3%
Cortex-X925: -2.5%
Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The flag was deprecated by https://crrev.com/c/6414748 and
has no effect besides telling the user that it has no effect.
Bug: 414826937
Change-Id: Idd0ee2e7a3cab0f49c4f87da0f3901713f9ebf00
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6509300
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
- make width loop count on stack
- set YMM constants in their own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function
Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There are a few added source files since the (re-)addition of GYP build
support, for better SIMD optimization support (AArch64 SME & SVE,
LoongArch LSX & LASX, RISC-V RVV). This CL covers the LoongArch part in
preparation for fixing GYP builds for this architecture.
The files' arch-specific contents are all gated behind preprocessor
macro checks, so it is safe to have everything included in the build
unconditionally.
Bug: None
Change-Id: I2da37c1db79c2d8316ae42079e79efed2a2030a9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6241803
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The declarations of ARGBAffineRow_C and ARGBAffineRow_SSE2 and the code
to support those declarations are duplicated in planar_functions.h. They
are already in row.h, so we can simply remove them.
Change-Id: I9b522fdd201ca530f1268bf4200cd2e18b806ba5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434733
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
The ENABLE_ROW_TESTS macro is not used in convert_test.cc.
Change-Id: Icc50ec465beca81e14a9683a717680e179a541dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434620
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
The code that disables Arm and Intel assembly code under MSan is
duplicated in cpu_support.h and planar_functions.h. This CL does not
address the code duplication.
Bug: b:407277484, b:407278016, b:407278132
Change-Id: If70fd8d3382916041d75efabcc84010ea3f1e60e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6430806
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Several consumers of libyuv use unified-source builds, where many source
files are #include'd together to make compilation units larger and allow
more optimization opportunities. For LoongArch there is a wrinkle: the
LASX and LSX code paths are implemented in separate files, unlike the
other currently supported architectures, and some definitions are
duplicated, e.g. struct RgbConstants.
Since the duplicated content is identical across the two files, short of
some bigger refactoring, we can simply place #ifdef guards around the
definitions to fix unified-source builds for LoongArch.
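
A sketch of the guard pattern, with both "files" shown in one compilation
unit as a unified build would see them. The guard macro name and the struct
fields are hypothetical and chosen only for illustration.

```c
#include <stdint.h>

// Contents of the first file: define the shared struct unless another file
// in the same compilation unit has already done so.
#ifndef EXAMPLE_RGB_CONSTANTS_DEFINED
#define EXAMPLE_RGB_CONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4];
  uint16_t kAddY;
};
#endif  // EXAMPLE_RGB_CONSTANTS_DEFINED

// Contents of the second file: the identical guarded block. The guard macro
// is already defined here, so the preprocessor skips the definition and the
// compiler never sees a redefinition.
#ifndef EXAMPLE_RGB_CONSTANTS_DEFINED
#define EXAMPLE_RGB_CONSTANTS_DEFINED
struct RgbConstants {
  uint8_t kRGBToY[4];
  uint16_t kAddY;
};
#endif  // EXAMPLE_RGB_CONSTANTS_DEFINED
```

Without the guards, the second definition would be a redefinition error as
soon as both files land in the same compilation unit.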
Change-Id: I952e8e0210221ec8bcc113f75fa1b9ba515ec323
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6272801
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
We started to get the following error in libavif's GitHub CI workflows:
CMake Error at CMakeLists.txt:8 (cmake_minimum_required):
Compatibility with CMake < 3.5 has been removed from CMake.
Change-Id: If2490208cc3e7da22ff67557c5cdd4bd9f2499ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6416369
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Adds the sanitizer for the libyuv static library to enable CFI assembly
support.
Bug: 400789169
Change-Id: I9be82d90d60535fdf59e4e729778a455e946e4cc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6414818
Reviewed-by: James Zern <jzern@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code for creating RGB565 data in SVE2 and SME produces two
vectors of interleaved 16-bit elements due to the nature of how SVE
widening instructions operate. This means that the indices of the 16-bit
data created appear in the two result vectors as such:
z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...]
z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...]
This is problematic for the final (predicated) iteration of the
conversion since the p1 predicate input to the ST2H instruction controls
storing the four bytes corresponding to the first two elements, in the
first two bytes of z18 and z19. This means that in the case that the
width is an odd number there is no way of storing just elem0 in z18
individually.
This patch addresses this by permuting the z18/z19 data such that the
two bytes from each element are split evenly across the two vectors:
z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...]
z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...]
Since we would now always store the same lanes from both vectors we can
continue to use the same predicate without further changes.
The existing (non-tail) loop body utilizes an all-true predicate so we
can avoid the extra permutes in this case, avoiding any performance
degradation.
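
A simplified scalar model of the byte-split layout. This is a sketch of the
idea only, not SVE instruction semantics: the lane count, the function names,
and the modelling of the store as a per-lane interleaving byte store are
assumptions. The arrays are named after the registers mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

enum { kLanes = 8 };  // hypothetical number of elements per vector

// Model of a predicated two-register interleaving store: for each active
// lane i, write z20[i] and then z21[i] to consecutive bytes of dst.
static void store2_bytes_predicated(uint8_t* dst, const uint8_t* z20,
                                    const uint8_t* z21, size_t active) {
  for (size_t i = 0; i < (size_t)kLanes && i < active; ++i) {
    dst[2 * i + 0] = z20[i];
    dst[2 * i + 1] = z21[i];
  }
}

// Split each 16-bit element into a low-byte vector and a high-byte vector,
// then store only the first `width` elements.
static void store_rgb565_tail(uint8_t* dst, const uint16_t* elems,
                              size_t width) {
  uint8_t z20[kLanes];
  uint8_t z21[kLanes];
  for (size_t i = 0; i < (size_t)kLanes; ++i) {
    z20[i] = (uint8_t)(elems[i] & 0xFF);  // byte0 of element i
    z21[i] = (uint8_t)(elems[i] >> 8);    // byte1 of element i
  }
  store2_bytes_predicated(dst, z20, z21, width);
}
```

Because lane i of both vectors belongs to the same element, one predicate
bit per lane selects a whole element, so an odd width such as 3 stores
exactly 6 bytes.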
Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This CL enables the CFI checks for libyuv to be used as a
shared library.
Bug: 400789169
Change-Id: I8c71df235ad6962d02740c976972d8f9dcea6c52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6353950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Hang Nguyen <hnt@chromium.org>
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc
Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing implementation mistakenly refers to the operand by its
positional form, %2. This works, but the operand is already named
%[width], so the name should be used instead.
Change-Id: Ifaf8fc83cdfc9b15c79d52e7e47cb72b53270a12
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6225753
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
SVE can make use of the UMULH instruction to avoid needing separate
widening multiply and narrowing steps for the scale application.
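
A scalar sketch of what a multiply-high buys, assuming 8-bit lanes and a
scale that fits in 8 bits (both are assumptions for illustration; the
source does not state the lane width or the range of the scale).

```c
#include <stdint.h>

// Three-step form: widen the input, multiply, then narrow to the high byte.
static uint8_t scale_widen_narrow(uint8_t v, uint8_t scale) {
  uint16_t wide = (uint16_t)v;                   // widen to 16 bits
  uint16_t product = (uint16_t)(wide * scale);   // 8x8 -> 16-bit product
  return (uint8_t)(product >> 8);                // narrow: keep the high byte
}

// Model of an unsigned multiply-high on 8-bit lanes: returns the high half
// of the double-width product directly, with no separate widen or narrow.
static uint8_t umulh8(uint8_t a, uint8_t b) {
  return (uint8_t)(((unsigned)a * b) >> 8);
}

// Returns 1 if both forms agree for every pair of 8-bit inputs.
static int umulh8_matches_widen_narrow(void) {
  for (int a = 0; a < 256; ++a) {
    for (int b = 0; b < 256; ++b) {
      if (umulh8((uint8_t)a, (uint8_t)b) !=
          scale_widen_narrow((uint8_t)a, (uint8_t)b)) {
        return 0;
      }
    }
  }
  return 1;
}
```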
Reduction in runtime for Convert8To8Row_SVE2 observed compared to the
existing Neon implementation:
Cortex-A510: -13.2%
Cortex-A520: -16.4%
Cortex-A710: -37.1%
Cortex-A715: -38.5%
Cortex-A720: -38.4%
Cortex-X2: -33.2%
Cortex-X3: -31.8%
Cortex-X4: -31.8%
Cortex-X925: -13.9%
Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>