We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.
ST4 is also known to be slow on more modern micro-architectures, however
avoiding this is left for a future SVE implementation where we can make
use of interleaving-narrowing instructions.
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -5.8%
Cortex-A510: -18.9%
Cortex-A76: -21.8%
Cortex-A720: -30.2%
Cortex-X1: -28.6%
Cortex-X2: -23.4%
Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -12.6%
Cortex-A510: -48.6%
Cortex-A76: -39.7%
Cortex-A720: -52.3%
Cortex-X1: -63.5%
Cortex-X2: -67.0%
Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Scaling 48 pixels at a time, but calling code checked for 24 pixels
- Added test for scaling to 1080x1920
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
Was
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):
Now
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 - 6741 us C - 3566 us OPT
[ OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)
Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512
Sapphire Rapids
Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)
AVX2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)
SSE2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)
Skylake Xeon
Now vpmovuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)
Was vpackuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)
Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2
Now
vpbroadcastw %%xmm2,%%ymm2
Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
The HalfFloatPlane() function does not follow libyuv's convention of buffer
stride in units of the corresponding buffer pointer. Document that.
Change-Id: Id8d466ccc2df263a49ad788ab349bc3993a48259
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5770639
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.
Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.
Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is be rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.
Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
We can make use of the ZA tile register to do the transpose and
de-interleaving of UV components without any explicit permute
instructions: the tile is loaded horizontally placing UV components into
alternative columns, then we can just store the independent components
vertically.
Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of the ZA tile register to do the transpose without any
explicit permute instructions: just load the tile horizontally and store
it vertically.
Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We cannot use the standard dot-product instructions since the
coefficients multiplication results are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here. We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
Cortex-X2: -62.1%
Cortex-X3: -62.2%
Cortex-X4: -40.4%
Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.
This code has a number of small issues:
* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
all-true, however the existing kernel does not initialise p0 until
after this macro is called, so flip the order.
* The p2 register is missing from the clobber list, so add it.
* The existing code uses the wrong condition flags when determining
whether to do the tail iteration using WHILE instructions or not.
Additionally the number of tail iterations is incorrect, as it was
incorrectly not changed from when the tail code was always executed.
While we are here, make another few small improvements:
* Remove the single-quote digit separators as requested here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133
* Remove "volatile" from the asm block counting the vector length. This
particular asm block cannot be removed by the compiler since the
output register is consumed by subsequent code, so "volatile" is
unnecessary here and we remove it.
* Add some additional empty comments to force clang-format to put macros
into the next line rather than on the same line as other asm.
Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD4 or ST3 instructions.
Reduction in runtime observed compared to the existing Neon
implementations:
| ARGBToRAWRow | ARGBToRGB24Row
Cortex-A510 | -50.8% | -19.9%
Cortex-A720 | -39.8% | -39.1%
Cortex-X2 | -66.5% | -51.9%
Bug: libyuv:973
Change-Id: Iaead678715a3d70d54cf823391272a6196836769
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.
Observed reduction in runtime compared to the existing Neon kernel:
Cortex-A510: -25.6%
Cortex-A720: -15.2%
Cortex-X2: -10.2%
Cortex-X4: -30.2%
Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD3 or ST3 instructions.
Reduction in runtime observed compared to the existing Neon
implementation:
Cortex-A510: -39.2%
Cortex-A720: -34.5%
Cortex-X2: -31.0%
Bug: libyuv:973
Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.
Reduction in runtimes observed compared to the existing Neon
implementation:
| RAWToARGBRow | RAWToRGBARow
Cortex-A510 | -32.4% | -31.9%
Cortex-A720 | -15.7% | -15.6%
Cortex-X2 | -24.6% | -24.4%
Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow | -59.8% | -16.8% | -53.2%
P210ToAR30Row | -48.1% | -21.8% | -54.0%
P410ToARGBRow | -56.5% | -32.2% | -54.1%
P410ToAR30Row | -42.4% | -4.5% | -50.4%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.
- Add volatile asm in row_gcc and row_neon64
Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
Cortex-A55: -43.1%
Cortex-A510: -22.3%
Cortex-A76: -54.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToAR30Row | I210ToARGBRow
Cortex-A55 | -40.8% | -54.4%
Cortex-A510 | -26.2% | -22.7%
Cortex-A76 | -49.2% | -44.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Some configs have int64 elements off by default.
Disable ScaleDownBy4 row function to avoid compile error
Bug: 344954354
Change-Id: Ie0d74daea72375eff6438ab54cb2803d68d67e52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598460
Reviewed-by: James Zern <jzern@google.com>
1. Add two defined marco LIBYUV_RVV_HAS_TUPLE_TYPE & LIBYUV_RVV_HAS_VXRM_ARG
Intrinsic v0.12 introduces
- tuple type in segment load & store
- vxrm argument in fixed-point intrinsics (e.g vnclip)
These two marcos are controled by __riscv_v_intrinsic.
2. Support RVV v0.12 intrinsics in row_rvv.cc & scale_rvv.cc
Change-Id: I921f91d9dc8fdda031e7b6647d0e296aa2793c39
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4767120
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.
Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.
Observed reductions in runtimes compared to the existing Neon code:
| UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 | -30.2% | -30.2%
Cortex-A720 | -4.8% | -4.7%
Cortex-X2 | -9.6% | -10.1%
Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.
Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.
Reduction in runtime observed compared to the existing Neon code:
Cortex-A510: -26.1%
Cortex-A520: -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
Cortex-X2: -22.5%
Cortex-X3: -23.5%
Cortex-X4: -21.6%
Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.
Observed reductions in runtimes compared to the existing Neon code:
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -29.1% | -29.1%
Cortex-A720 | -4.8% | -4.8%
Cortex-X2 | -9.2% | -9.2%
Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We cannot use the standard dot-product instructions since the matrix of
coefficients are signed, but I8MM supports mixed-sign products which
work well here.
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -50.8%
Cortex-A520: -33.3%
Cortex-A715: -38.6%
Cortex-A720: -38.5%
Cortex-X2: -43.2%
Cortex-X3: -40.0%
Cortex-X4: -55.0%
Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This commit just adds the kCpuHasSME to represent that the CPU has the
Arm Scalable Matrix Extension enabled, but this commit does not
introduce any code to actually use it yet.
Add a test to check that the HWCAP value is interpreted correctly.
Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.
The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.
Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.
Reductions in runtime observed compared to the existing Neon code:
Cortex-A510: -27.0%
Cortex-A720: -5.3%
Cortex-X2: -14.7%
Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.
Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.
Observed reduction in runtime compared to the existing Neon code:
| AYUVToUVRow | AYUVToVURow
Cortex-A510 | -33.1% | -33.0%
Cortex-A720 | -25.1% | -25.1%
Cortex-X2 | -59.5% | -53.9%
Cortex-X4 | -39.2% | -39.4%
Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.
Observed reduction in runtime relative to the existing Neon code:
Cortex-A510: -69.7%
Cortex-A720: -7.7%
Cortex-X2: -41.9%
Cortex-X4: -14.5%
Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
I210ToAR30Row on Cortex-A76: -50.4%
I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
I410ToAR30Row on Cortex-A76: -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToARGBRow | I410ToARGBRow
Cortex-A55 | -55.6% | -56.2%
Cortex-A510 | -22.6% | -35.6%
Cortex-A76 | -48.1% | -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210AlphaToARGBRow | I410AlphaToARGBRow
Cortex-A55 | -55.3% | -56.1%
Cortex-A510 | -27.9% | -42.6%
Cortex-A76 | -54.9% | -60.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing Neon code only makes use of 64-bit vectors throughout which
limits the performance on larger cores. To avoid this, swap the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.
The original code also handled widths that were not exact multiples of
16, however this should already be handled by the "any" kernel so it is
removed.
Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.
Observed speedups for TransposePlane across a range of
micro-architectures:
Cortex-A53: -40.0%
Cortex-A55: -20.7%
Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
Cortex-X2: -38.3%
Cortex-X4: -43.6%
Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| ABGRToAR30Row | ARGBToAR30Row
Cortex-A55 | -55.1% | -55.1%
Cortex-A510 | -39.3% | -40.1%
Cortex-A76 | -62.3% | -63.6%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using the platform-specific functions IsProcessorFeaturePresent and
sysctlbyname to check individual features.
Bug: libyuv:980
Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.
A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.
For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.
Observed affect on runtimes compared to the existing Neon kernels:
| Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow | -15.5% | +5.4% | -33.1%
ABGRToUVRow | -15.6% | +5.3% | -35.9%
ARGBToUVJRow | -10.1% | +5.4% | -32.7%
ARGBToUVRow | -10.1% | +5.4% | -29.3%
BGRAToUVRow | -15.5% | +4.6% | -32.8%
RGBAToUVRow | -10.1% | +4.2% | -36.0%
Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -32.1%
Cortex-A720: -5.1%
Cortex-X2: -10.1%
Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -34.2%
Cortex-A720: -17.6%
Cortex-X2: -9.6%
Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -25.0%
Cortex-A720: -5.0%
Cortex-X2: -10.8%
Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
We can use the Neon dot-product instructions as a slightly faster
widening accumulation. This also has the advantage of widening to 32
bits so avoids the risk of overflow present in the original Neon code.
Reduction in runtimes observed for HammingDistance compared to the
existing Neon code:
Cortex-A55: -4.4%
Cortex-A510: -26.5%
Cortex-A76: -8.1%
Cortex-A720: -15.5%
Cortex-X1: -4.1%
Cortex-X2: -5.1%
Bug: libyuv:977
Change-Id: I9e5e10d228c339d905cb2e668a9811ff0a6af5de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490049
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This re-lands commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Now with additional #ifdef __linux__ guards to avoid compiling
Linux-specific code on non-Linux platforms. Non-linux feature detection
will be added in a separate patch.
Using getauxval(AT_HWCAP{,2}) has the advantage of also working under
emulation where faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at time of writing is 21 so we should
be able to use getauxval unconditionally. On the off-chance the call
fails it will return 0 and we will correctly fall-back to using only
Neon.
If we want to read the current CPU implementer or part number we could
do this by checking HWCAP_CPUID and then reading MIDR_EL1. This will
cause a kernel trap to emulate the EL1 read but should still be a lot
faster than reading the whole of /proc/cpuinfo.
Bug: libyuv:980
Change-Id: I8ae103ea7e32ef44db72f3c9896417bfe97ff5c5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465590
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The Neon dot-product instructions perform two widening steps rather than
one, saving us the need to widen the absolute difference to 16-bits
before accumulating. Additionally, the dot-product instructions tend to
have better performance characteristics than traditional widening
multiply instructions like SMLAL used in the existing
SumSquareError_NEON code.
Observed reduction in runtimes compared to the existing Neon kernel:
Cortex-A55: -9.1%
Cortex-A510: -36.7%
Cortex-A76: -37.6%
Cortex-A720: -48.8%
Cortex-X1: -56.1%
Cortex-X2: -42.6%
Bug: libyuv:977
Change-Id: Ie20c69040cc47a803d8e95620d31e0bf1e1dac12
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463945
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This reverts commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Reason for revert: breaks builds on windows and mac
Step _compile_ failed. Error logs are shown below:
[1/104] CXX obj/libyuv_internal/cpu_id.o
FAILED: obj/libyuv_internal/cpu_id.o
../../buildtools/reclient/rewrapper -cfg=../../buildtools/reclient_cfgs/chromium-browser-clang/rewra...(too long)
../../source/cpu_id.cc:25:10: fatal error: 'sys/auxv.h' file not found
25 | #include // For getauxval()
| ^~~~~~~~~~~~
1 error generated.
More information in raw_io.output_text[failure_summary]
Original change's description:
> [AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection
>
> This has the advantage of also working under emulation where
> faking /proc/cpuinfo is not supported.
>
> For the Chromium sandbox, getauxval is supported since API version 18.
> The minimum supported API version at time of writing is 21 so we should
> be able to use getauxval unconditionally. On the off-chance the call
> fails it will return 0 and we will correctly fall-back to using only
> Neon.
>
> Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
> Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Change-Id: Ic0f764217af7b4d998f19a8f78fc04ca85a45a3b
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463918
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This has the advantage of also working under emulation where
faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at time of writing is 21 so we should
be able to use getauxval unconditionally. On the off-chance the call
fails it will return 0 and we will correctly fall-back to using only
Neon.
Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Don't define HAS_*_NEON_DOTPROD for 32-bit Arm platforms, since they are
only defined in *_neon64.cc for now.
Also define -DLIBYUV_NEON=1 and pass -mfpu=neon to *_neon.cc for 32-bit
Arm platforms, since otherwise __ARM_NEON__ is not defined.
Also fix a typo: ly_lib_static should be ly_lib_name in the name of the
common object files. The existing code happens to work since they are
defined to the same thing.
Change-Id: Ibdc9e5d0391f7ff8db1ca83384e5bd45ac9950a2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5439562
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:
* For the Y component calculation we are able to use UMULH, versus the
existing UMULL x2 + UZP2 sequence in Neon.
* For the RGBTORGBA8 calculation we are able to take advantage of
interleaving narrowing instructions, allowing us to use ST2 rather
than ST4 for the store. This is a big performance win on some
micro-architectures where ST4 is costly.
* The use of predication means we do not need to add "any" kernels, we
can simply rerun the calculation with a not-full predicate for the
final iteration.
To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.
Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:
Cortex-A510: -36.5%
Cortex-A720: -17.3%
Cortex-X2: -11.3%
Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
In particular there are a few extensions that are interesting for us:
* FEAT_DotProd adds 4-way dot-product instructions which are useful in
e.g. ARGBToY.
* FEAT_I8MM adds additional mixed-sign dot-product instructions which
could be useful in e.g. ARGBToUV.
* FEAT_SVE and FEAT_SVE2 add support for the Scalable Vector Extension,
which adds an array of new instructions including new widening loads
and narrowing stores for dealing with mixed-width integer arithmetic
efficiently and predication for avoiding the need for "any" cleanup
loops.
This commit simply adds support for detecting the presence of these
features by extending the existing /proc/cpuinfo parsing, splitting it
into separate Arm and AArch64 functions for simplicity.
Since we have no space left in the bitset entries between Arm and X86
entries, we reuse some of the X86 entries for new AArch64 extensions.
This doesn't seem obviously problematic as long as we avoid setting
kCpuHasX86.
Bug: libyuv:973
Bug: libyuv:977
Change-Id: I8e256225fe12a4ba5da24460f54061e16eab6c57
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5378150
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These declarations appear to exist in error since there is no
corresponding implementation of these kernels and nothing calls them. So
simply remove the declarations until we get around to adding an
implementation.
Change-Id: I0a9d72e7e13398b689e3e47ef101f672082c4795
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5387649
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
sde -spr -- libyuv_test -- --gunit_filter=*Cpu*
Note: Google Test filter = *Cpu*
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 3 tests from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x57fff9
Has X86 0x8
Has SSE2 0x10
Has SSSE3 0x20
Has SSE41 0x40
Has SSE42 0x80
Has AVX 0x100
Has AVX2 0x200
Has ERMS 0x400
Has FMA3 0x800
Has F16C 0x1000
Has AVX512BW 0x2000
Has AVX512VL 0x4000
Has AVX512VNNI 0x8000
Has AVX512VBMI 0x10000
Has AVX512VBMI2 0x20000
Has AVX512VBITALG 0x40000
Has AVX10 0x0
HAS AVXVNNI 0x100000
Has AVXVNNIINT8 0x0
Has AMXINT8 0x400000
[ OK ] LibYUVBaseTest.TestCpuHas (34 ms)
Bug: b/324356616
Change-Id: I5129b8946363a501bdd570e6dba3936c54aacd6c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5283433
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Add detect linux kernel version number in util/cpuid
adbrun -- blaze-bin/third_party/libyuv/cpuid
Kernel Version 4.14
Cpu Flags 0x7
Has ARM 0x2
Bug: libyuv:970
Change-Id: I655ed598db3655ca8448be08f1d71fbc328ced66
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5207990
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Update build files to include the new tests and source code.
Bug: libyuv:956
Change-Id: I6ec0beb6dc9570f0597d7df1835d616489dbaece
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103585
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
HAS_SCALEARGBROWDOWNEVEN_RVV wasn't defined,
so we cannot use ScaleARGBRowDownEven_RVV & ScaleARGBRowDownEvenBox_RVV.
- Seperate to two conditional statements when selecting DownEven or DownEvenBox.
- Also, add HAS_SCALEARGBROWDOWNEVEN_RVV and disable it by default.
Bug: libyuv:965
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Change-Id: Ic7ec40520b64131a456c6f3eea0639b3620f11ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4882441
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Change ScalePlane(), ScalePlane_16(), and ScalePlane_12() to return int
so that they can report memory allocation failures (by returning 1).
BUG=libyuv:968
Change-Id: Ie5c183ee42e3d595302671f9ecb7b3472dc8fdb5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5005031
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Add kCpuHasAVXVNNI flag
- Remove deprecated GFNI detect to make space.
Meteor Lake has AVX-VNNI but not AVX512
~/intelsde/sde -mtl -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=*CpuHas
doyuv3
Note: Google Test filter = *CpuHas
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x203ff1
Has X86 0x10
Has SSE2 0x20
Has SSSE3 0x40
Has SSE41 0x80
Has SSE42 0x100
Has AVX 0x200
Has AVX2 0x400
Has ERMS 0x800
Has FMA3 0x1000
Has F16C 0x2000
Has AVX512BW 0x0
Has AVX512VL 0x0
Has AVX512VNNI 0x0
Has AVX512VBMI 0x0
Has AVX512VBMI2 0x0
Has AVX512VBITALG 0x0
Has AVX512VPOPCNTDQ 0x0
HAS AVXVNNI 0x200000
Has AVXVNNIINT8 0x0
AVX-VNNI detect
- Add kCpuHasAVXVNNI flag
- Remove deprecated GFNI detect to make space.
https://bugs.chromium.org/p/libyuv/issues/detail?id=967
Meteor Lake has AVX-VNNI but not AVX512
~/intelsde/sde -mtl -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=*CpuHas
doyuv3
Note: Google Test filter = *CpuHas
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x203ff1
Has X86 0x10
Has SSE2 0x20
Has SSSE3 0x40
Has SSE41 0x80
Has SSE42 0x100
Has AVX 0x200
Has AVX2 0x400
Has ERMS 0x800
Has FMA3 0x1000
Has F16C 0x2000
Has AVX512BW 0x0
Has AVX512VL 0x0
Has AVX512VNNI 0x0
Has AVX512VBMI 0x0
Has AVX512VBMI2 0x0
Has AVX512VBITALG 0x0
Has AVX512VPOPCNTDQ 0x0
HAS AVXVNNI 0x200000
Has AVXVNNIINT8 0x0
Running on all cpus the following report avx-vnni
grep 'AVXVNNI 0x2' */*
adl/libyuv64.txt:HAS AVXVNNI 0x200000
gnr/libyuv64.txt:HAS AVXVNNI 0x200000
grr/libyuv64.txt:HAS AVXVNNI 0x200000
mtl/libyuv64.txt:HAS AVXVNNI 0x200000
rpl/libyuv64.txt:HAS AVXVNNI 0x200000
spr/libyuv64.txt:HAS AVXVNNI 0x200000
srf/libyuv64.txt:HAS AVXVNNI 0x200000
while these support avx512 vnni
grep 'VNNI 0x1' */*
clx/libyuv64.txt:Has AVX512VNNI 0x10000
cpx/libyuv64.txt:Has AVX512VNNI 0x10000
gnr/libyuv64.txt:Has AVX512VNNI 0x10000
icl/libyuv64.txt:Has AVX512VNNI 0x10000
icx/libyuv64.txt:Has AVX512VNNI 0x10000
spr/libyuv64.txt:Has AVX512VNNI 0x10000
tgl/libyuv64.txt:Has AVX512VNNI 0x10000
and these support avx-vnni-int8
grep AVXVNNIINT8.0x4 */*
grr/libyuv64.txt:Has AVXVNNIINT8 0x400000
srf/libyuv64.txt:Has AVXVNNIINT8 0x400000
Bug: libyuv:967
Change-Id: I84cd71d1b320e7c284173eb695fc1d3b72d14ddb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912017
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
- Add kCpuHasAVXVNNIINT8 flag
- Move mips flags up a bit to make space.
~/intelsde/sde -srf -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=*CpuHas
Note: Google Test filter = *CpuHas
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x403ff1
Has X86 0x10
Has SSE2 0x20
Has SSSE3 0x40
Has SSE41 0x80
Has SSE42 0x100
Has AVX 0x200
Has AVX2 0x400
Has ERMS 0x800
Has FMA3 0x1000
Has F16C 0x2000
Has AVX512BW 0x0
Has AVX512VL 0x0
Has AVX512VNNI 0x0
Has AVX512VBMI 0x0
Has AVX512VBMI2 0x0
Has AVX512VBITALG 0x0
Has AVX512VPOPCNTDQ 0x0
Has AVXVNNIINT8 0x400000
Has GFNI 0x0
[ OK ] LibYUVBaseTest.TestCpuHas (32 ms)
INT8 supported on srf and grr
-srf Set chip-check and CPUID for Intel(R) Sierra Forest CPU
-grr Set chip-check and CPUID for Intel(R) Grand Ridge CPU
Bug: b/303434603
Change-Id: I628007929ff0518b2b36e1469b4d9aed71a9fa8f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912015
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.
* Run on SiFive internal FPGA(VLEN=128):
Test Case Speedup
AR64ToAB64_Opt x4.6
ARGBToRGBA_Opt x6
ARGBToBGRA_Opt x6
ARGBToABGR_Opt x6
RGBAToARGB_Opt x6
Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- MSAN fails on most inline assembly, unaware of what the load and store instructions do.
- MSAN is also failing on row_any functions, which memcpy a correct number of pixels into a buffer that is SIMD vector sized, apply SIMD to the full vector, and then memcpy the exact number of resulting pixels to the output buffer. MSAN wants the temporary buffer to be initialized. Which genenerally is done with a memset(buf, 0, sizeof(buf)); to satisify MSAN.
- RVV may not require disabling MSAN, since row functions are all 'any' number of elements, and implementation is intrinsics.
Bug: b/297979878
Change-Id: Ic21200689c0c7d2c85bb1de3eef38570137d3d8b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4832740
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
ScaleUVRowUp2_(Bi)linear_RVV function is equal to other platforms' ScaleRowUp2_(Bi)linear_Any_XXX.
We process entire row in this function.
Other platforms only implement non-edge part of image and process edge with scalar.
ScaleRowUp2_(Bi)linear_Any_XXX: Combine ScaleRowUp2_(Bi)linear_XXX(non-edge) + ScaleRowUp2_(Bi)linear_C(edge) by SBUH2LANY/SU2BLANY.
* Run on SiFive internal FPGA:
Test case RVV function Speedup
I444ScaleFrom640x360_Bilinear ScaleRowUp2_Bilinear_RVV 8.21
I444ScaleFrom640x360_Linear ScaleRowUp2_Linear_RVV 8.08
UVScaleFrom640x360_Bilinear ScaleUVRowUp2_Bilinear_RVV 7.80
UVScaleFrom640x360_Linear ScaleUVRowUp2_Linear_RVV 7.03
Change-Id: I539245ce51858f077506a78f0e7e82377ac6a95d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4666062
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
Test case Speedup
ARGBBlend_Opt 4.60
BlendPlane_Opt 5.96
I420Blend_Opt 5.83
- Also, add code to use ScaleRowDown2Box_RVV in I420Blend
Change-Id: Icc75e05d26b3427a98269d2a33c4474074033264
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4681100
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Add static to internal scale and rotate functions
- Remove unittest that tested an internal scale function
- Remove unused private functions
- Include missing scale_argb.h header
- Bump version and apply clang format
Bug: libyuv:830
Change-Id: I45bab0423b86334f9707f935aedd0c6efc442dd4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4658956
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate
Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
TestARGBExtractAlpha(~3.2x vs scalar)
TestARGBCopyYToAlpha(~1.6x vs scalar)
Change-Id: I36525c67e8ac3f71ea9d1a58c7dc15a4009d9da1
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617955
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA:
Test case RVV function Speedup
I444ScaleDownBy3by4_None ScaleRowDown34_RVV 5.8
I444ScaleDownBy3by4_Linear ScaleRowDown34_0/1_Box_RVV 6.5
I444ScaleDownBy3by4_Bilinear ScaleRowDown34_0/1_Box_RVV 6.3
Bug: libyuv:956
Change-Id: I8ef221ab14d631e14f1ba1aaa25d2b30d4e710db
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4607777
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Uses I012ToAR30Matrix with u and v swapped and with VU suffixed
constants.
Bug: b/268505204
Change-Id: If0d189891be3053da776feb48d49fa68a9866037
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581869
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
They re-use the same method as I410/I210 to I420 with a depth
value of 12 instead of 10.
Bug: b/268505204
Change-Id: I299862b4556461d8c95f0fc1dcd5260e1c1f25cd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581867
Commit-Queue: Vignesh Venkatasubramanian <vigneshv@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
MergeUVPlane_Opt(~6x vs scalar)
SplitUVPlane_Opt(~6x vs scalar)
TestCopyPlane(~8x vs scalar)
ARGBInterpolate0_Opt(~10x vs scalar)
ARGBInterpolate64_Opt(~9x vs scalar)
ARGBInterpolate168_Opt(~9x vs scalar)
ARGBInterpolate192_Opt(~8.5x vs scalar)
ARGBInterpolate255_Opt(~8x vs scalar)
Bug: libyuv:956
Change-Id: I8372341865f75f42e30371ef943d5c2e4be7b79a
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4574186
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA*:
I400ToARGB_Opt (~8x vs scalar)
J400ToARGB_Opt (~10x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Bug: libyuv:956, libyuv:961
Change-Id: If4e21ec85c4ff79083ec16a6faae0e457129a8de
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4544972
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Run on SiFive internal FPGA:
I444AlphaToARGB_Opt (~16x vs scalar)
I422AlphaToARGB_Opt (~10x vs scalar)
ARGBAttenuate_Opt (~3x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Change-Id: I0046eb7af8104bc8e13cee1cb91a19f90940d5b0
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4535657
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA:
ARGBToJ400_Opt (~6x vs scalar)
RGBAToJ400_Opt (~6x vs scalar)
RGB24ToJ400_Opt (~5.5x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Change-Id: Ia3ce8cea7962fbd8618cc23e850a7913c9cabf4f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4521783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA:
I444ToARGB_Opt (~16x vs scalar)
I444ToRGB24_Opt (~10x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Change-Id: Idae7dc46ef648beaa14b58ba3eb56b67b17c9b3b
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4520761
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA:
I422ToARGB_Opt (~10x vs scalar)
I422ToRGBA_Opt (~10x vs scalar)
I420ToRGB24_Opt (~8x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
This CL manually sets rounding mode,
since we use fixed-point vector narrowing clip.
There is no definition about default value for fixed-point rounding mode.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#38-vector-fixed-point-rounding-mode-register-vxrm
The behavior could be different on differet paltforms. To avoid unexpected behavior, we set rounding mode manually.
Change-Id: I90f0dcb90c37f7da7caab8eb1df6c9c7a3c874a8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4512373
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
1. Fix compile error when build riscv without using vector
2. Fix run_qemu.sh misused v=true for USE_RVV=OFF case
3. [cmake] Fix warning by rename TEST to UNIT_TEST
Warning log:
CMake Warning (dev) at CMakeLists.txt:57 (if): [54/1931]
Policy CMP0064 is not set: Support new TEST if() operator. Run "cmake
--help-policy CMP0064" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.
TEST will be interpreted as an operator when the policy is set to NEW.
Since the policy is not set the OLD behavior will be used.
This warning is for project developers. Use -Wno-dev to suppress it.
4. [cmake] Simplify logic for cross-build
Bug: libyuv:956
Change-Id: I120402fc7d6d86403e7d974180b81f4f9c663e36
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4486239
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
SplitRGBPlane_Opt (~6.87x vs scalar)
SplitARGBPlane_Opt (~10.77x vs scalar)
SplitXRGBPlane_Opt (~18.69x vs scalar)
MergeRGBPlane_Opt (~3.63x vs scalar)
MergeARGBPlane_Opt (~3.50x vs scalar)
MergeXRGBPlane_Opt (~2.90x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
- include a fix to avoid implict conversion warning between size_t & int.
Bug: libyuv:956
Change-Id: Icd79b282b04ea3981e7fd4e6d547da6708d82516
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4443411
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
ARGBToAR64_Opt (~13.7x vs scalar)
ARGBToAB64_Opt (~5.81x vs scalar)
AR64ToARGB_Opt (~15.8x vs scalar)
AB64ToARGB_Opt (~2.40x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Bug: libyuv:956
Change-Id: Ida642a5077f59d25fb7c5328f671956b2293dadd
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4442913
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
ARGBToRAW_Opt (~1.55x vs scalar)
ARGBToRGB24_Opt (~1.44x vs scalar)
RGB24ToARGB_Opt (~1.77x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Bug: libyuv:956
Change-Id: I26722f6848cd68684d95d9a7ee06ce0416e7985d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4413083
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- update cpu_id to use "re" for fopen to avoid leaking handles if a thread is started while the file is open.
Bug: libyuv:958
Change-Id: I1af9de68fce12e440e1226fc8070634ccb1bf090
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4417176
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
RAWToARGB_Opt (~2x vs scalar)
RAWToRGBA_Opt (~2x vs scalar)
RAWToRGB24_Opt (~1.5x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
Change-Id: I21a13d646589ea2aa3822cb9225f5191068c285b
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4408357
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Allows code to be optimized with clang 17 -flto-thin
- Bump version number to 1864 to allow detection of fix
- Apply clang format to standardize formatting; No impact on code generated
Bug: chromium:1424089
Change-Id: Ib745836b27915a5e4cb1d7d928ee52659360612b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4370052
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Fix the algorithm for unpacking the lower 2 bits of M2T2 pixels.
Bug: b:258474032
Change-Id: Iea1d63f26e3f127a70ead26bc04ea3d939e793e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4337978
Commit-Queue: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V
AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2 MergeUVPlane_Opt (945 ms)
AVX2 MergeUVPlane_16_Opt (2167 ms)
Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2 MergeUVPlane_Opt (943 ms)
SSE2 MergeUVPlane_Opt (973 ms)
AVX2 MergeUVPlane_16_Opt (2102 ms)
Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
- Convert 10 and 12 bit biplanar formats to planar.
- Shift 10 MSB to 10 LSB
- P010 is similar to NV12 in layout, but uses 10 MSB of 16 bit values.
- I010 is similar to I420 in layout, but uses 10 LSB of 16 bit values.
Bug: libyuv:951
Change-Id: I16a1bc64239d0fa4f41810910da448bf5720935f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4166560
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Minor variable name changes first/last to top/bottom
- Comments explaining rotate temporary buffers usage
- Add asserts for scale parameter
- Use NULL and stddef.h instead of 0
- Use void * for allocation in row.h
- Add () around size parameter in macros
Bug: libyuv:926, libyuv:949
Change-Id: Ib55417570926ccada0a0f8abd1753dc12e5b162e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4136762
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This initial implementation is based on current unoptimized code in webrtc using just plain for loops.
Bug: libyuv:949
Change-Id: Ic87ee49c3a0b62edbaaa4255c263c1f7be4ea02b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4110782
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The I410To420 implementation does a two step approach for scaling down and 10-to-8 bit conversion using the Y plane as temporal storage.
Bug: libyuv:950
Change-Id: I3d35fad4b99e17253230456233fbd947e013c0ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4110783
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- MT2T support for source strides added, but only works for positive values.
- Reduced casting in row_common - one cast per assignment.
- scaling functions use intptr_t for intermediate calculations, then cast strides to ptrdiff_t
Bug: libyuv:948, b/257266635, b/262468594
Change-Id: I0409a0ce916b777da2a01c0ab0b56dccefed3b33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4102203
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ernest Hua <ernesthua@google.com>
- fix ifdefs for DetilePlane_16 to use 16 bit versions, not 8 bit. (no functional change)
Bug: b/258474032
Change-Id: Ic07e02d9801e21126ebee0ceb5779aa712a493ce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4034812
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Preserve xmm7 in ScaleRowUp2_Bilinear_12_SSSE3
- Previously xmm7 was used in ScaleRowUp2_Bilinear_12_SSSE3 without being preserved, which violates the Windows x64 calling conventions and can cause undefined behavior.
Bug: libyuv:945, 1218384
Change-Id: If18b292b588573355f9b4ba8c5b9c3fbe143d36b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3972137
Reviewed-by: Bruce Dawson <brucedawson@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- move RGB to UV into BIT_EXACT ifdefs for each compiler
- move RGB to Y to always enabled
Bug: b/253491233
Change-Id: I7f1b40c0977ba1349578ee1ff987bda845be15f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3953663
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Previously was C for both Y and UV.
Was BGRAToI420_Opt (17780 ms)
Now BGRAToI420_Opt (9546 ms)
Bug: b/253491233
Change-Id: Id103d8d5ba0fed0f7a427dd5955e1830275eff6b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3953131
Reviewed-by: Wan-Teh Chang <wtc@google.com>
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
- Was SplitUV, memcpy Y, InterpolateUV
- Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA
3840x2160 1000 iterations:
Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)
AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)
Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)
SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)
Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)
Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits
Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)
AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)
Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)
Bug: libyuv:938
Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- add tests for all single plane formats that reduce or stay same in size
Bug: b/242233673
Change-Id: Ic25d808114f11995ac56ea9c31b99f66ba36d345
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3828485
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit a5a1102a added a function to the public ABI. Update the
version number to 1838.
Bug: b/241451603
Change-Id: I615792672c0dc097e2b1b2637ec5c3a1586d9f09
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3821166
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The code already exists to use a specific matrix. This CL simply
adds a function to use a generic YUV matrix for the conversion.
Bug: b/241451603
Change-Id: I0eea7e96a891d045905a9c963b56c053097029ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3820903
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- fix crash when width is not a multiple of 16
- apply clang format
- bump version
Bug: libyuv:940, b/240094327
Change-Id: Ic18e5b7b64f78f26e8b7d8440bf490a679bda200
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3812594
Reviewed-by: Wan-Teh Chang <wtc@google.com>
- Define HAS_SCALEROWUP2_BILINEAR_16_SSE2: it's now fixed.
- Correct function name to ScaleRowUp2_Bilinear_16_Any_SSE2:
this row function uses only SSE2 instructions.
Bug: libyuv:882
Change-Id: Ib1c7ac5b09997cb5b32bc54109d8c566af762433
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3800842
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- This test used to fail on ARM, but is passing now, so re-enable
- Kept behind a flag so it can be disabled with /DDISABLE_SLOW_TESTS
Bug: libyuv:905, b/197551385
Change-Id: Iff3c75c1778610c136621b595adee4b1004df9a5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3758943
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.
- fixes an out of registers error with -mcmodel=large
InterpolateRow_16To8_NEON improves performance for I210ToI420:
On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)
On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)
Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
- Avoid stepping to height + 1 for bilinear filter 2nd row for last row of source
- Box filter ubsan fix for 3/4 and 3/8 scaling for 16 bit planar
- Height 1 asan fixes
Bug: libyuv:935, b/206716399
Change-Id: I56088520f2a884a37b987ee5265def175047673e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717263
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>