- detect lack of dot product instruction to infer the cpu is low end
- only run the test on higher end arm
Bug: 416842099
Change-Id: Idd2dd16a624bbba280cf531644440024b12f7ecf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6804632
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Per the Software Development and Build Convention for LoongArch™
Architectures manual, on Linux we should use HWCAP instead of CPUCFG to
detect if LSX/LASX is available. The reason is the kernel may be
configured to disable them, and CPUCFG cannot provide info about the
kernel support.
Change-Id: I3f1b23e6d4c91c7da81311fbbe294e36ff178121
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
The -march arguments used for NEON and SVE builds in libyuv are
incompatible with libc++ modules. This change disables libc++ modules
for these build configurations to fix the build.
Bug: 425535758
Change-Id: I578a0d9929c10177903c567bc268407470b45034
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6695664
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Mostly just a straightforward copy of the existing SVE2 code ported to
Streaming-SVE. Introduce new "any" kernels for non-multiple of two
cases, matching what we already do for SVE2.
The existing SVE2 code makes use of the Neon MOVI instruction that is
not supported in Streaming-SVE, so adjust the code to use FMOV instead
which has the same performance characteristics.
Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This CL enables use_siso=true by default.
Developer builds will switch to Siso, or get suggestion to run
`gn clean` to switch.
No-Try: true
Bug: chromium:412968361
Change-Id: I1913d6735d835c614dca863ca7781f9154a4e42a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651381
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. There is no benefit from this kernel when the SVE vector
length is only 128 bits, so skip writing a non-streaming SVE
implementation.
Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
FEAT_I8MM is not unconditionally enabled with -march=armv9-a since it
only becomes mandatory from Armv9.1-A, so explicitly specify it in both
BUILD.gn and CMakeLists.txt.
Also flip the order of +sve2+i8mm => +i8mm+sve2 to match occurrences
elsewhere.
Change-Id: I8c37580d3718f380b772cdb726d8c30bcd5b9e2c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6656718
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Was using avgb twice for non-exact and C for exact.
On Skylake Xeon:
Now SSE3
ARGBToJ420_Opt (326 ms)
Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)
Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h
Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This commit reworks the implementation of ARGBToUVMatrixRow_SVE2, using
an approach similar to that recently used in
61bdaee13a701d2b52c6dc943ccc5c888077a591.
In particular we can rework these SVE2 implementations to use 8-bit
dot-product instructions instead of 16-bit, allowing us to process more
data in a single vector.
To ensure that the input values fit in 8-bits, negate the UV constants
arrays passed to the kernel and undo the now-unnecessary flipping of the
middle two component values.
This commit mostly reverses the performance inversion where the Neon
I8MM implementation was previously faster than the SVE2 implementation.
The reduction in runtime observed compared to the existing Neon I8MM
implementation is now:
Cortex-A510: +5.6% (!)
Cortex-A520: -3.0%
Cortex-A710: -12.6%
Cortex-A715: -10.9%
Cortex-A720: -10.8%
Cortex-X2: -3.8%
Cortex-X3: -10.3%
Cortex-X4: -9.5%
Cortex-X925: -6.7%
Change-Id: I30253976dc8e3651cfb5fd39b63a6763975d41e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640990
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Now that we have a STOREAR30_SVE_2X implementation, we can use this to
unroll other kernels. The predication on I210ToAR30Row needs adjusting
to allow loading two vectors of Y compared to one vector of U/V, and
additionally UZP is needed to ensure the data arrangement in vector
lanes matches the U/V layout. LD2H could also be used, however this
provides no performance improvement on most cores and would necessitate
the addition of an "any" kernel to handle the case where width % 2 != 0.
Reduction in run times of I210ToAR30Row_SVE2 observed compared to the
previous SVE2 implementation: (note that even in the observed slowdowns,
the SVE2 implementation still outperforms the existing Neon code)
Cortex-A510: -37.1%
Cortex-A520: -39.1%
Cortex-A710: +1.6% (!)
Cortex-A715: +6.5% (!)
Cortex-A720: +6.5% (!)
Cortex-X2: -2.9%
Cortex-X3: -2.2%
Cortex-X4: -8.8%
Cortex-X925: -3.5%
Change-Id: I2ff285b48105883526eceb8be1fcbe0e033a553b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640989
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing STOREAR30_SVE macro works fine for out of order cores,
however for in-order cores the number of dependent vector instructions
laid out consecutively impacts performance.
We can improve this by unrolling the loop to process two sets of vectors
at a time, allowing little cores to process two independent streams of
vector instructions at the same time to improve performance. Using one
set of ZIP instructions at the end allows us to (a) avoid ST4 which we
know is slow on some micro-architectures, and (b) enable the use of
predication and avoid the need for separate "any" kernels.
Reduction in run times of I422ToAR30Row_SVE2 observed compared to the
previous SVE2 implementation:
Cortex-A510: -37.7%
Cortex-A520: -38.8%
Cortex-A710: -14.8%
Cortex-A715: -17.1%
Cortex-A720: -16.9%
Cortex-X2: -10.3%
Cortex-X3: -6.7%
Cortex-X4: -9.4%
Cortex-X925: -7.1%
Change-Id: I160fb41300d2d08fce2e6eb92181324fd723a02d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6632916
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The z21 register is used in the I444TORGB_SVE_2X macro and other places,
so add it to the clobber list macro that is used throughout this file.
Change-Id: If4277c1ffcac0fa68cc44263acc6f41a9e82ec8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6619508
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Sort the Arm Neon and Neon DotProd #define lists to match the
alphabetical ordering used for the SVE2 and SME lists.
Change-Id: Ibeb380f477d5476d0018d20a754557a5f93f2190
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6613686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This CL switches libyuv builders from Ninja to Siso. Reclient will still
be used.
https://crrev.com/c/6605972 is the corresponding recipe change.
No-Try: true
Bug: chromium:412968361
Change-Id: I6ba063d0aa954185284a44d0b353278d71953e4b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6589372
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Junji Watanabe <jwata@google.com>
Reviewed-by: Christoffer Dewerin <jansson@chromium.org>
And enable LASX by default for LoongArch builds, because LASX is
widely supported among LoongArch desktops and servers, and performance
is better than with LSX alone.
Because the LoongArch SIMD code is written to only compile if the
respective codegen option is enabled, but the defaults and availability
differ between compiler versions and target `-march` setting, the
codegen flags are explicitly added to CFLAGS for wider compatibility.
Bug: None
Change-Id: I735ceac0f6b46eea2155e58ecf3630383ef5b728
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6241804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16-bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.
Reduction in run times observed compared to the existing C
implementation:
Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
Cortex-A76: -61.6%
Cortex-A710: -57.6%
Cortex-X1: -46.5%
Cortex-X2: -54.4%
Cortex-X3: -57.1%
Cortex-X4: -55.0%
Cortex-X925: -49.3%
Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing load/convert/store macros that are
already present for other kernels, so add I422ToAR30Row_SVE2 and
I422ToAR30Row_SME to match the existing kernels.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -9.1%
Cortex-A520: +6.8% (!)
Cortex-A710: -4.0%
Cortex-A715: -1.1%
Cortex-A720: -1.1%
Cortex-X2: -5.7%
Cortex-X3: -5.9%
Cortex-X4: -2.8%
Cortex-X925: -4.0%
Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so
they are usable in both SVE2 and SME implementations, and use them to
add new I444ToRGB24Row implementations for SVE2 and SME. We need to use
the unrolled versions here to use the ST3B interleaving store
instructions, since there is no partial vector version of this store
instruction.
Reduction in time taken observed for the new SVE2 implementation,
compared to the existing Neon implementation:
Cortex-A510: -57.6%
Cortex-A520: -38.1%
Cortex-A710: -15.5%
Cortex-A715: -9.2%
Cortex-A720: -9.2%
Cortex-X2: -25.8%
Cortex-X3: -26.2%
Cortex-X4: -23.2%
Cortex-X925: -17.8%
Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.
Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:
Cortex-A510: -7.1%
Cortex-A520: -2.1%
Cortex-A710: -8.4%
Cortex-A715: -0.3%
Cortex-A720: -0.3%
Cortex-X2: -40.0%
Cortex-X3: -43.3%
Cortex-X4: -11.3%
Cortex-X925: -2.5%
Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The flag was deprecated by https://crrev.com/c/6414748 and
has no effect besides telling the user that it has no effect.
Bug: 414826937
Change-Id: Idd0ee2e7a3cab0f49c4f87da0f3901713f9ebf00
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6509300
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
- make width loop count on stack
- set YMM constants in its own asm block
- make struct for shuffle and add constants
- disable clang format on row_neon.cc function
Bug: 413781394
Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There are a few added source files since the (re-)addition of GYP build
support, for better SIMD optimization support (AArch64 SME & SVE,
LoongArch LSX & LASX, RISC-V RVV). This CL covers the LoongArch part in
preparation of fixing GYP builds for this architecture.
The files' arch-specific contents are all gated behind preprocessor
macro checks, so it is safe to have everything included in the build
unconditionally.
Bug: None
Change-Id: I2da37c1db79c2d8316ae42079e79efed2a2030a9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6241803
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The declarations of ARGBAffineRow_C and ARGBAffineRow_SSE2 and the code
to support those declarations are duplicated in planar_functions.h. They
are already in row.h, so we can simply remove them.
Change-Id: I9b522fdd201ca530f1268bf4200cd2e18b806ba5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434733
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
The ENABLE_ROW_TESTS macro is not used in convert_test.cc.
Change-Id: Icc50ec465beca81e14a9683a717680e179a541dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434620
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
The code that disables Arm and Intel assembly code under MSan is
duplicated in cpu_support.h and planar_functions.h. This CL does not
address the code duplication.
Bug: b:407277484, b:407278016, b:407278132
Change-Id: If70fd8d3382916041d75efabcc84010ea3f1e60e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6430806
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Several consumers of libyuv do unified sources build where many source
files are #include'd together to make compilation units larger and allow
for more optimization chances. But for LoongArch there is a wrinkle:
LASX and LSX code paths are implemented in separate files, unlike the
other currently supported architectures, and some definitions are
duplicated e.g. struct RgbConstants.
Since the duplicated content is identical across the two files, short of
some bigger refactoring, we can simply place #ifdef guards around the
definitions to fix unified sources build for LoongArch.
Change-Id: I952e8e0210221ec8bcc113f75fa1b9ba515ec323
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6272801
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
We started to get the following error in libavif's GitHub CI workflows:
CMake Error at CMakeLists.txt:8 (cmake_minimum_required):
Compatibility with CMake < 3.5 has been removed from CMake.
Change-Id: If2490208cc3e7da22ff67557c5cdd4bd9f2499ad
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6416369
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Adds the sanitizer for the static library libyuv to enable CFI assembly
support
Bug: 400789169
Change-Id: I9be82d90d60535fdf59e4e729778a455e946e4cc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6414818
Reviewed-by: James Zern <jzern@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code for creating RGB565 data in SVE2 and SME produces two
vectors of interleaved 16-bit elements due to the nature of how SVE
widening instructions operate. This means that the indices of the 16-bit
data created appear in the two result vectors as such:
z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...]
z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...]
This is problematic for the final (predicated) iteration of the
conversion since the p1 predicate input to the ST2H instruction controls
storing the four bytes corresponding to the first two elements, in the
first two bytes of z18 and z19. This means that in the case that the
width is an odd number there is no way of storing just elem0 in z18
individually.
This patch addresses this by permuting the z18/z19 data such that the
two bytes from each element are split evenly across the two vectors:
z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...]
z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...]
Since we would now always store the same lanes from both vectors we can
continue to use the same predicate without further changes.
The existing (non-tail) loop body utilizes an all-true predicate so we
can avoid the extra permutes in this case, avoiding any performance
degradation.
Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This CL enables the CFI checks for libyuv to be used as a
shared library.
Bug: 400789169
Change-Id: I8c71df235ad6962d02740c976972d8f9dcea6c52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6353950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Hang Nguyen <hnt@chromium.org>