- Remove special case Scale of 1 which used fp16 cvt but requires cpuid
- Port aarch64 to aarch32
- Use C for aarch32 with small (denormal) scale value
Bug: 377693555
Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392
Reviewed-by: Wan-Teh Chang <wtc@google.com>
The existing C implementation compiled with a recent LLVM is
auto-vectorised and unrolled to process four vectors per loop iteration,
making the Neon implementation slower than the C implementation on
little cores. To avoid this, unroll the Neon implementation to also
process four vectors per iteration.
Reduction in cycle counts observed compared to the existing Neon
implementation:
| HalfFloat1Row_NEON | HalfFloatRow_NEON
Cortex-A510 | -37.1% | -40.8%
Cortex-A520 | -32.3% | -37.4%
Cortex-A720 | 0.0% | -10.6%
Cortex-X2 | 0.0% | -7.8%
Cortex-X4 | +0.3% | -6.9%
Bug: b/42280945
Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.
The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.
Reductions in runtime observed for ARGB1555ToARGBRow_NEON:
Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
Cortex-X1: -0.9%
Cortex-X2: -4.6%
Cortex-X3: -3.6%
Cortex-X4: -20.8%
Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.
Reduction in runtimes observed for selected micro-architectures:
| RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
Cortex-A53 | -35.2% | -28.8% | -31.1%
Cortex-A55 | -32.5% | -34.4% | -42.9%
Cortex-A510 | -21.6% | -27.7% | -47.2%
Cortex-A76 | -0.9% | -42.0% | -21.4%
Cortex-A720 | -28.6% | -37.2% | -26.1%
Cortex-X1 | -3.2% | -42.3% | -23.4%
Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512
Sapphire Rapids
Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)
AVX2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)
SSE2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)
Skylake Xeon
Now vpmovuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)
Was vpackuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)
Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2
Now
vpbroadcastw %%xmm2,%%ymm2
Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
We cannot use the standard dot-product instructions since the
coefficients multiplication results are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here. We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
Cortex-X2: -62.1%
Cortex-X3: -62.2%
Cortex-X4: -40.4%
Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.
For ARGB4444ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.
Reduction in runtimes observed compared to the existing Neon kernels:
| ARGB4444ToUVRow | ARGB4444ToYRow
Cortex-A55 | -27.8% | -34.6%
Cortex-A510 | -37.0% | -44.4%
Cortex-A76 | -40.2% | -22.0%
Cortex-A720 | -33.4% | -35.5%
Cortex-X1 | -34.1% | -19.7%
Cortex-X2 | -32.1% | -26.3%
Bug: libyuv:976
Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing RGB555TOARGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.
For ARGB1555ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB1555ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.
Reduction in runtimes observed compared to the existing Neon kernels:
| ARGB1555ToUVRow | ARGB1555ToYRow
Cortex-A55 | -28.8% | -35.3%
Cortex-A510 | -34.0% | -48.5%
Cortex-A76 | -36.7% | -25.1%
Cortex-A720 | -29.7% | -31.1%
Cortex-X1 | -31.6% | -19.7%
Cortex-X2 | -27.6% | -22.7%
Bug: libyuv:976
Change-Id: Idd745c133b5fb65001652a59f01ac1aa3bb42067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631540
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow | -59.8% | -16.8% | -53.2%
P210ToAR30Row | -48.1% | -21.8% | -54.0%
P410ToARGBRow | -56.5% | -32.2% | -54.1%
P410ToAR30Row | -42.4% | -4.5% | -50.4%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
Cortex-A55: -43.1%
Cortex-A510: -22.3%
Cortex-A76: -54.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToAR30Row | I210ToARGBRow
Cortex-A55 | -40.8% | -54.4%
Cortex-A510 | -26.2% | -22.7%
Cortex-A76 | -49.2% | -44.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.
Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.
Observed reduction in runtime compared to the existing Neon code:
| AYUVToUVRow | AYUVToVURow
Cortex-A510 | -33.1% | -33.0%
Cortex-A720 | -25.1% | -25.1%
Cortex-X2 | -59.5% | -53.9%
Cortex-X4 | -39.2% | -39.4%
Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
I210ToAR30Row on Cortex-A76: -50.4%
I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
I410ToAR30Row on Cortex-A76: -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToARGBRow | I410ToARGBRow
Cortex-A55 | -55.6% | -56.2%
Cortex-A510 | -22.6% | -35.6%
Cortex-A76 | -48.1% | -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210AlphaToARGBRow | I410AlphaToARGBRow
Cortex-A55 | -55.3% | -56.1%
Cortex-A510 | -27.9% | -42.6%
Cortex-A76 | -54.9% | -60.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.
We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.
Reduction in runtime observed for MergeXR30Row_10_NEON:
Cortex-A55: -34.2%
Cortex-A510: -35.6%
Cortex-A76: -44.9%
Cortex-X2: -48.3%
Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| ABGRToAR30Row | ARGBToAR30Row
Cortex-A55 | -55.1% | -55.1%
Cortex-A510 | -39.3% | -40.1%
Cortex-A76 | -62.3% | -63.6%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.
A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.
For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.
Observed affect on runtimes compared to the existing Neon kernels:
| Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow | -15.5% | +5.4% | -33.1%
ABGRToUVRow | -15.6% | +5.3% | -35.9%
ARGBToUVJRow | -10.1% | +5.4% | -32.7%
ARGBToUVRow | -10.1% | +5.4% | -29.3%
BGRAToUVRow | -15.5% | +4.6% | -32.8%
RGBAToUVRow | -10.1% | +4.2% | -36.0%
Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Fix -Wmemset-elt-size warning for GCC
- Use vin for inputs and vout for outputs
Bug: None
Change-Id: Iefd418dc884b4d062e1fdd9215319c8838c49eaa
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4412065
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
under gcc 12.2.0 using -Wall:
source/row_any.cc: In function ‘void libyuv::DetileRow_16_Any_SSE2(const
uint16_t*, ptrdiff_t, uint16_t*, int)’:
source/row_any.cc:2287:11: warning: ‘memset’ used with length equal to
number of elements without multiplication by element size
[-Wmemset-elt-size]
2287 | memset(temp, 0, 16 * BPP); /* for msan */
| ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
source/row_any.cc:2308:1: note: in expansion of macro ‘ANYDETILE’
2308 | ANYDETILE(DetileRow_16_Any_SSE2, DetileRow_16_SSE2, uint16_t, 2, 15)
This increases the memset to the full buffer size, which may not be
strictly necessary.
Change-Id: Iea2fc649990ee84ea9aa8020d6f6b25e012b18fb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4406599
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V
AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2 MergeUVPlane_Opt (945 ms)
AVX2 MergeUVPlane_16_Opt (2167 ms)
Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2 MergeUVPlane_Opt (943 ms)
SSE2 MergeUVPlane_Opt (973 ms)
AVX2 MergeUVPlane_16_Opt (2102 ms)
Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
- fix ifdefs for DetilePlane_16 to use 16 bit versions, not 8 bit. (no functional change)
Bug: b/258474032
Change-Id: Ic07e02d9801e21126ebee0ceb5779aa712a493ce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4034812
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
- Was SplitUV, memcpy Y, InterpolateUV
- Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA
3840x2160 1000 iterations:
Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)
AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)
Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)
SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)
Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)
Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits
Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)
AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)
Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)
Bug: libyuv:938
Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.
- fixes an out of registers error with -mcmodel=large
InterpolateRow_16To8_NEON improves performance for I210ToI420:
On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)
On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)
Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
- Add I210ToI420 to convert 10 bit 4:2:2 YUV to 4:2:0 8 bit
- Add NEON InterpolateRow_16 for fast 10 bit scaling
- When scaling up, set step to interpolate toward height - 1 to avoid buffer overread
- When scaling down, center the 2 rows used for source to achieve filtering.
- CopyPlane check for 0 size and return
Bug: libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
Change-Id: I63e8580710a57812b683c2fe40583ac5a179c4f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3687552
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Pixel 3
Was C I010ToI420_Opt (749 ms)
Now NEON I010ToI420_Opt (356 ms)
Pixel 4
Was C I010ToI420_Opt (581 ms)
Now NEON I010ToI420_Opt (163 ms)
Bug: b/233233302, b/233634772
Change-Id: I60a84648a66f77d97c0a7822b29bd18b8e3a3355
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3661401
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Unrolled to 16 pixels
- Take constants via structure, allowing different colorspace and channel order
- Use ADDHN to add 16.5 and take upper 8 bits of 16 bit values, narrowing to 8 bits
- clang-format applied, affecting mips code
On Cortex A510
Was RAWToJ400_Opt (1623 ms)
Now RAWToJ400_Opt (862 ms)
C RAWToJ400_Opt (1627 ms)
Bug: b/220171611
Change-Id: I06a9baf9650ebe2802fb6ff6dfbd524e2c06ada0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3534023
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Add support for MM21 to NV12 and I420 conversion, and add SIMD
optimizations for arm, aarch64, SSE2, and SSSE3 machines.
Bug: libyuv:915, b/215425056
Change-Id: Iecb0c33287f35766a6169d4adf3b7397f1ba8b5d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3433269
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Justin Green <greenjustin@google.com>
Optimize 44 functions in source/row_lsx.cc file.
All test cases passed on loongarch platform.
Bug: libyuv:913
Change-Id: Ic80a5751314adc2e9bd435f2bbd928ab017a90f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3351467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
1. Add supports for LSX and LASX.
2. Three optimization functions are added in loongarch/row_lasx.cc file:
I422ToARGBRow_LASX,I422ToRGBARow_LASX,I422AlphaToARGBRow_LASX.
Bug: libyuv:912, Bug: libyuv:913
Change-Id: I043c2704f99a5215724b5c0b7f97e6bf5f7a199b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3329189
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>