Use ptrdiff_t instead of intptr_t for buffer offsets, such as stride,
width_temp, and src_step*.
Change-Id: I64e6701fa71ab59c94325a6dad8762d040035208
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7800070
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Fix int overflow of yi * src_stride overflow in ScalePlaneVertical(),
ScalePlaneVertical_16(), and ScalePlaneVertical_16To8() by casting the
operand src_stride to ptrdiff_t.
Adapted from the patches by Victor Miura <vmiura@google.com>.
Bug: 505814332
Change-Id: I4a4751041a213f7208b01eb18c43c9e196a36261
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7796558
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
ptrdiff_t is the appropriate type for a buffer offset. intptr_t is
intended for a different purpose.
Change-Id: I475c548338b61f573fb11766c24cde6d31fbbed8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7796559
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
- implement wrappers with RAW, RGB24, NV21 and JNV21 to call it.
Zen5
Was [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1146 ms)
Now [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1446 ms)
reason - the new code uses 1 pass for RAWToY but 2 pass for RAWToARGB,ARGBToUV. needs 1 RGBToUV
Bug: libyuv:42280902
Change-Id: Ife6fbed0829484045409e6d42b85cec1d1fd6052
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7780026
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Adds RGBToYMatrixRow_AVX2 which reads 24 bit RGB values by reading 3 vectors instead of 4 and permutes them into 4 ARGB vectors before conversion.
Also adds RGBToYMatrixRow_Opt and RGBToYMatrixRow_2Step_Opt to convert_argb_test.cc to benchmark and compare the direct AVX2 conversion vs a 2-step approach.
./libyuv_test '--gunit_filter=*RAWToJ400_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1
AMD Zen 5
Was LibYUVConvertTest.RAWToJ400_Opt (757 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (699 ms)
Intel Skylake
Was LibYUVConvertTest.RAWToJ400_Opt (1705 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (1426 ms)
Bug: 477295731
Change-Id: I29866baf4ad5fe7a3725e4a01f2fe24649510a7d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7777325
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
implementing horizontal paired adds and accumulation to improve
performance on SiFive x280, and fixes the remainder logic to use valid
vlseg4 loads. Adds TestARGBToUVRow_Any to test odd-width remainder
handling.
Also fixes a build break for non-RVV compilations by ensuring all RVV
functions and their closing cplusplus braces are correctly wrapped in
#if !defined(LIBYUV_DISABLE_RVV).
Also adds NV12ToNV21 as a macro alias for NV21ToNV12 in
planar_functions.h, as the conversion is bidirectional (swapping byte
pairs in the interleaved chroma plane). (Patch from
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7762904)
Bug: libyuv:42280902
Change-Id: If2d6cbb3e232d63d43e32aba33fa9b2eee8190e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7772164
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
This change implements ARGBToUV444MatrixRow_RVV, ARGBToUVMatrixRow_RVV,
and their wrappers (ARGBToUVRow_RVV, ARGBToUVJRow_RVV, etc.) using RVV
intrinsics, mirroring the NEON/AVX2 designs. It wires them into the
build and dispatch systems.
LIBYUV_RVV_HAS_TUPLE_TYPE is always true on new compilers. This macro
has been removed, assuming it is true everywhere, reducing the amount of
code in row_rvv.cc, scale_rvv.cc, and row.h.
Tested via: ~/bin/doyuv3v && ~/bin/runyuv3v TestARGBToI444Matrix
~/bin/doyuv3av
Bug: libyuv:42280902
Change-Id: I36d305386b297d69023c068aa9c62ab6b2ad039c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7769956
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- remove inline asm which was only for 32 bit
- add ARGBToYMatrixRow_AVX2
- add gn flag libyuv_enable_rowwin=true
Example of building with GN and Ninja:
Without the new flag:
gn gen out/Release "--args=is_debug=false"
ninja -C out/Release
With the new flag:
gn gen out/Release "--args=is_debug=false libyuv_enable_rowwin=true"
ninja -C out/Release
Bug: libyuv:42280806, 477295731, libyuv:42280902, libyuv:439628764
R=dalecurtis@chromium.org, rrwinterton@gmail.com
Change-Id: I451bf814622fba690005c02fbf5816819c6a08c2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7765790
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Benchmark on Icelake Xeon
Now AVX512BW:
[ OK ] LibYUVConvertTest.ARGBToNV12_Opt (1723 ms)
Was AVX2:
[ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2144 ms)
- Added `ARGBToUVMatrixRow_AVX512BW` implementation in `source/row_gcc.cc`.
- Added corresponding `ARGBToUVRow_AVX512BW` and `ABGRToUVRow_AVX512BW` functions.
- Added unaligned wrappers `ARGBToUVRow_Any_AVX512BW` and `ABGRToUVRow_Any_AVX512BW` in `source/row_any.cc`.
- Updated `source/row_any.cc` to correctly size `vin` and `vout` buffers for AVX512BW width and adjusted the `ANY12MS` and `ANY12S` macros to handle `MASK=63`.
- Updated `include/libyuv/row.h` with the required AVX512BW headers and definitions, scoped appropriately.
- Wired all callers of `ARGBToUVRow_AVX2` and related functions in `source/convert.cc` and `source/convert_from_argb.cc` to dynamically use the `AVX512BW` implementations if the CPU flag indicates AVX-512BW support.
- Optimized AVX-512 code to generate the `-1` multiplier in a single instruction (`vpternlogd`) and reused it across word (`vpmaddwd`) dot products. Handled the resulting negation by replacing a subtraction with `vpaddw` offset adjustment.
Bug: 477295731
R=dalecurtis@chromium.org, rrwinterton@gmail.com
Change-Id: Ida5fb27e59ae4c1c3824737f009b80549cd20a06
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7763257
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
I have successfully ported the usage of ARGBToYRow_AVX2 to dynamically detect and utilize ARGBToYRow_AVX512BW when available.
Here's a summary of the changes:
1. Source Modifications: In both source/convert.cc and source/convert_from_argb.cc, I searched for all references where ARGBToYRow_AVX2 was
being conditionally used (which operates on 32 pixels).
2. AVX512BW Detection: Immediately following those blocks, I injected a new check for kCpuHasAVX512BW. If the CPU flag is present, the logic
now utilizes ARGBToYRow_Any_AVX512BW by default, falling back to the fully aligned ARGBToYRow_AVX512BW when the width is aligned to 64
bytes.
3. Profiling: After building and compiling the tests (doyuv3x), I validated the change using perfyuv3 ARGBToNV12_Opt | cat. The test
successfully executed and the performance profile indicated that ARGBToYRow_AVX512BW successfully executed (taking up ~18% of CPU cycles,
replacing the previous AVX2 specific instruction overhead for the Y row extraction).
The HAS_ARGBTOYROW_AVX512BW macro implementation now fully supports all AVX2 conversion paths to utilize AVX512BW when the system processor
flags allow it!
R=richard, rrwinterton@gmail.com
Change-Id: Iad811e12d301f5621e6f6d039105420861ade43e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760779
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
In all functions that start with ARGB, BGRA, RGBA or ABGR in the include/libyuv/ headers, make sure the parameter variable name has the same 4 letters, but lower case, and the comment before the function should have the same matching name. Then make sure the implementation in source/ folder has the same variable names.
Change-Id: Idadbbbb993156eea16e318719f4888cb3bed5f6a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7760057
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- add ARGBToYMatrixRow_AVX512BW
- refactor SSE and AVX to use Matrix functions, making old functions
call the new ones.
Zen5 1280x720
Was AVX2 LibYUVConvertTest.ARGBToI444_Opt (1125 ms)
Now AVX512 LibYUVConvertTest.ARGBToI444_Opt (641 ms)
Details by Gemini:
1. Created 3 new Matrix functions:
Added ARGBToYMatrixRow_SSSE3, ARGBToYMatrixRow_AVX2, and
ARGBToYMatrixRow_AVX512BW to source/row_gcc.cc. These take the
const struct ArgbConstants* c parameter similarly to
ARGBToUV444MatrixRow_*. The x86 vector instructions dynamically
calculate the needed values using the properties of the constants
struct, including using vpmaddwd inside the AVX512 code to offset
the lack of a native vphaddw.
2. Replaced Old Functions with Wrappers:
Modified the existing implementations of ARGBToYRow_SSSE3,
ARGBToYJRow_SSSE3, ABGRToYRow_SSSE3, ABGRToYJRow_SSSE3,
RGBAToYRow_SSSE3, RGBAToYJRow_SSSE3, BGRAToYRow_SSSE3 (and their
_AVX2 equivalents) in source/row_gcc.cc to act as inline wrappers
calling the new ARGBToYMatrixRow_* functions, passing the right
matrix parameters (e.g. &kArgbI601Constants, &kArgbJPEGConstants,
&kAbgrI601Constants).
3. Added row_any.cc Handlers:
Added ANY11MC definitions to source/row_any.cc to autogenerate
ARGBToYMatrixRow_Any_SSSE3, ARGBToYMatrixRow_Any_AVX2, and
ARGBToYMatrixRow_Any_AVX512BW which safely handles non-aligned
tails.
4. Updated include/libyuv/row.h:
Updated the headers with the proper void declarations for all newly
generated Matrix and Any_ variants. Also defined
HAS_ARGBTOYROW_AVX512BW in the CPU macros.
5. Tested the Implementations:
Compiled and tested on Linux x86, which resulted in all tests passing
cleanly. Also successfully completed all Windows 32-bit build checks
ensuring 32-bit regression prevention without issues.
Bug: 477295731
Change-Id: I4f5eec9a961e24a9d760d0a1c0810fb5e29a0bd1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7759494
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Increases buffer sizes from 128 to 256 in ANY11, ANY11C, ANY11MC, ANY12,
and ANY12M macros to safely accommodate AVX512BW processing which can
write up to 256 bytes per operation.
Bug: libyuv:42280902, libyuv:502250231, 501882928
Change-Id: Icfba1982dc5fb6545255464f7decb2baec7be90f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7758060
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This fixes a build failure on bare-metal toolchains like
riscv64-unknown-elf-clang++ where strtok_r may be undeclared.
Bug: 477295731
Change-Id: If4edd6c6d2e975ae34278f479700ef9b996c0a3e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7744872
Reviewed-by: James Zern <jzern@google.com>
Adds a check for the AVX512F feature bit (cpu_info7[1] & 0x00010000)
before enabling AVX512 features. Alder Lake CPUs can report OS support
for YMM/ZMM but not actually support AVX512F, leading to incorrect
capability detection and crashes.
Bug: libyuv:500318522
Change-Id: I84167ee3fcfc7a2572afba148bbb275bd3ccb1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7746229
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Dale Curtis <dalecurtis@chromium.org>
The latest Android NDK marks strtok as deprecated and suggests using
strtok_r instead.
Bug: 477295731
Change-Id: I2b20a2ae0a9e19ec93e31669ec380802e6902090
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7739107
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These are about 25% faster than the C versions.
Bug: libyuv:42280902
Change-Id: I8b298670ee5f3ed5db35527fc41d6d9a51b020a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7573682
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
This one reuses the SIMD implementations for MergeUVRow_ from the
existing ARGBToNV12 functions.
Bug: libyuv:42280902
Change-Id: If0a4be133d657ed0262f29fdd568dac90b49636c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564317
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
This allows for ABGR conversion using the same methods
Bug: libyuv:42280902
Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This was implemented by Gemini followed by manual review and some
tweaking for style. The 601 and JPEG constants are fully verified
against the existing non-matrix implementations. On x86 the C-only
versions appear to be about 25% slower than the optimized ones.
Bug: libyuv:42280902
Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
GCC now supports vector segment load and store, which
was previously missing; and the reason why it was disabled.
Change-Id: I923fd8a15476de8dcc2103bb8335d4fcc3ca96a9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7241606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Detect if arm cpu support FMMLA instruction
Bug: None
Change-Id: Ia7b83bf2735ddeeb8a85da44177e708c34e4b1fb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7085486
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Currently, ARGBToUVMatrixRow_AVX2 and ARGBToUVMatrixRow_SSSE3 fail to
compile with clang on 32bit PIC build with the error message: inline
assembly requires more registers than available
This is because in PIC code EBX is reserved for the GOT and with a frame
pointer EBP is also unavailable.
Fix this by copying the RGB-to-UV constants to stack locals first and
let the asm use simple stack-relative addressing.
Bug: 444157316
Change-Id: Ica90f0c35039303ecaa145534683f59659fb5d7f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6980714
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Skylake
Was ARGBToJ420_Opt (312 ms)
Now ARGBToJ420_Opt (242 ms)
Icelake
Was ARGBToJ420_Opt (302 ms)
Now ARGBToJ420_Opt (220 ms)
AMD Zen3 on Windows
Was ARGBToJ420_Opt (305 ms)
Now ARGBToJ420_Opt (216 ms)
32 bit x86 uses SSE
Now ARGBToJ420_Opt (326 ms)
MCA analysis of new AVX, SSE and old AVX
https://godbolt.org/z/37bdazWYr
Bug: None
Change-Id: I72f5504407751e164c3558aebe836dd15223d65f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6957477
Reviewed-by: Justin Green <greenjustin@google.com>
The UV subsample's 4-pixel rounding average and ARGBToJ444 fixed-point scaling
were updated in d32d19cc and c060118b. The LoongArch optimization is updated now.
Bug: 381138208
Change-Id: I3585d72564e4fffe514599b1a9b4fee8fbbd0266
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6878364
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
The y0_fraction and y1_fraction variables in InterpolateRow_NEON were
marked as modified by the inline-asm block, however
5eea7812826c551559fdcd4a6988fcf1fbe341f6 marked these variables as
`const` which caused both LLVM and GCC to emit errors about modification
of const variables.
There is no need for these variables to be modified in the loop since
they are read-only, so simply update the inline asm block constraints to
match.
Change-Id: I94ca3696c4163ede6ad27d645f0f445fcfb0a1c3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6818289
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- MCA says old version was 4 cycles and new version is 2.5 cycles/loop
- lunarlake is the only known cpu
mca -mcpu=lunarlake 100 iterations
Was vpmulhu
Iterations: 100
Instructions: 1200
Total Cycles: 426
Total uOps: 1200
Dispatch Width: 8
uOps Per Cycle: 2.82
IPC: 2.82
Block RThroughput: 4.0
Now vpsrlw
Iterations: 100
Instructions: 1200
Total Cycles: 279
Total uOps: 1400
Dispatch Width: 8
uOps Per Cycle: 5.02
IPC: 4.30
Block RThroughput: 2.5
Bug: None
Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Per the Software Development and Build Convention for LoongArch™
Architectures manual, on Linux we should use HWCAP instead of CPUCFG to
detect if LSX/LASX is available. The reason is the kernel may be
configured to disable them, and CPUCFG cannot provide info about the
kernel support.
Change-Id: I3f1b23e6d4c91c7da81311fbbe294e36ff178121
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Mostly just a straightforward copy of the existing SVE2 code ported to
Streaming-SVE. Introduce new "any" kernels for non-multiple of two
cases, matching what we already do for SVE2.
The existing SVE2 code makes use of the Neon MOVI instruction that is
not supported in Streaming-SVE, so adjust the code to use FMOV instead
which has the same performance characteristics.
Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE. There is no benefit from this kernel when the SVE vector
length is only 128 bits, so skip writing a non-streaming SVE
implementation.
Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Was using avgb twice for non-exact and C for exact.
On Skylake Xeon:
Now SSE3
ARGBToJ420_Opt (326 ms)
Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)
Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
- Add +i8mm build option for sve ARGBToUV which uses usdot
- util/cpuid Get cpu count (windows, macos, linux)
- For each x86 cpu, detect hybrid (e-core)
- Includes a comment fix for ubsan unittest
- Bump version
- Apply clang format to util/*.c as well as all *.cc/*.h
Bug: 424637372
Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This commit reworks the implementation of ARGBToUVMatrixRow_SVE2, using
an approach similar to that recently used in
61bdaee13a701d2b52c6dc943ccc5c888077a591.
In particular we can rework these SVE2 implementations to use 8-bit
dot-product instructions instead of 16-bit, allowing us to process more
data in a single vector.
To ensure that the input values fit in 8-bits, negate the UV constants
arrays passed to the kernel and undo the now-unnecessary flipping of the
middle two component values.
This commit mostly reverses the performance inversion where the Neon
I8MM implementation was previously faster than the SVE2 implementation.
The reduction in runtime observed compared to the existing Neon I8MM
implementation is now:
Cortex-A510: +5.6% (!)
Cortex-A520: -3.0%
Cortex-A710: -12.6%
Cortex-A715: -10.9%
Cortex-A720: -10.8%
Cortex-X2: -3.8%
Cortex-X3: -10.3%
Cortex-X4: -9.5%
Cortex-X925: -6.7%
Change-Id: I30253976dc8e3651cfb5fd39b63a6763975d41e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640990
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Add a Neon implementation of the Convert8To16Row kernel. Compared to the
C implementation we can take advantage of knowing that the "scale"
parameter is always an unsigned power of two and fits in 16-bits,
allowing us to combine this with the shift and avoid needing to widen
the input data.
Reduction in run times observed compared to the existing C
implementation:
Cortex-A55: -44.5%
Cortex-A510: -26.1%
Cortex-A520: -30.6%
Cortex-A76: -61.6%
Cortex-A710: -57.6%
Cortex-X1: -46.5%
Cortex-X2: -54.4%
Cortex-X3: -57.1%
Cortex-X4: -55.0%
Cortex-X925: -49.3%
Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>