Replaced the 2-pass conversion (I420 -> ARGB -> RGB24/RAW) with a
highly optimized 1-pass AVX2 implementation. This avoids intermediate
stack buffering and significantly reduces memory bandwidth.
Implemented `I422ToRGB24Row_AVX2` in:
- `row_gcc.cc`: Inline assembly for GCC/Clang.
- `row_win.cc`: C++ intrinsics for MSVC (also verified with Clang).
Optimized the width alignment requirement: changed from 32-pixel to
16-pixel alignment in `convert_argb.cc` and `row_any.cc`. This allows
the optimized AVX2 path to be used for more common video resolutions.
Performance results (1080p, 100 iterations):
- C Reference: ~18.5 ms
- AVX2 2-Pass (Baseline): ~412 us (~45x speedup)
- AVX2 1-Pass (GCC Assembly): ~411 us (~s45x speedup)
- AVX2 1-Pass (Intrinsics): ~365 us (~50x speedup, 11% faster than asm)
Test: libyuv_unittest --gunit_filter=*I420ToRGB24*
Test: libyuv_unittest --gunit_filter=*I420ToRAW*
Bug: 42280902
Change-Id: I07c0505c95410ea16a6218c858844791a11ef073
This CL addresses two security findings related to integer overflows:
1. Input validation in ScalePlane, ScalePlane_16, and ScalePlane_12:
Added checks to reject invalid dimensions (e.g. width <= 0, height
== 0) and dimensions larger than 32768 (or smaller than -32768 for
height). This prevents FixedDiv signed integer overflows that can
lead to division by zero/overflow crashes (SIGFPE on x86) or
incorrect step calculations.
2. Stride overflow in ARGBAffineRow_C:
Casted pointer arithmetic operands to ptrdiff_t before multiplication
(y * stride and x * 4) to ensure 64-bit calculations, preventing
signed 32-bit integer overflow when calculating source pixel offsets.
Added unit tests to verify the input validation in ScalePlane functions.
Test: libyuv_unittest --gtest_filter=*InvalidInputs*
Test: libyuv_unittest --gtest_filter=*Scale*
Test: libyuv_unittest --gtest_filter=*TestAffine*
Bug: None
TAG=agy
CONV=0e990960-611b-4f38-94ec-24e79b66242e
R=wtc@google.com
Change-Id: I252af47a98e45dff8bb5f06308c3739c6eead741
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7886217
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
- Standardize libyuv ARGB-family (ARGB, ABGR, RGBA, BGRA) to YUV conversion by utilizing the generic MatrixRow architecture and explicit ArgbConstants.
- Consolidated ARGBToI420, ABGRToI420, BGRAToI420, and RGBAToI420 as wrappers for ARGBToI420Matrix.
- Refactored ABGRToJ420, ABGRToJ422, and ABGRToI422 to use generic matrix functions.
- Added matrix-based versions for NV21, I400, YUY2, and UYVY.
- Updated RAW and RGB24 to I420/I422/I444 dispatchers to use MatrixRow logic and explicit constants.
- Fixed parameter swap bugs in ARGBToI422, ARGBToJ422, and ABGRToJ422.
- Fixed a bug in the generic C implementation of matrix row functions ensuring all 4 channels are processed correctly for all ARGB-family formats.
- Moved kShuffleAARRGGBB in row_gcc.cc to the top of the libyuv namespace for visibility.
- Cleaned up redundant format-specific row implementations.
Bug: libyuv:42280902
Change-Id: I67ffa4c476abc0d2dcc4650510d7bda91b65988e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7830291
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
- implement wrappers with RAW, RGB24, NV21 and JNV21 to call it.
Zen5
Was [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1146 ms)
Now [ OK ] LibYUVConvertTest.RAWToJNV21_Opt (1446 ms)
reason - the new code uses 1 pass for RAWToY but 2 pass for RAWToARGB,ARGBToUV. needs 1 RGBToUV
Bug: libyuv:42280902
Change-Id: Ife6fbed0829484045409e6d42b85cec1d1fd6052
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7780026
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Adds RGBToYMatrixRow_AVX2 which reads 24 bit RGB values by reading 3 vectors instead of 4 and permutes them into 4 ARGB vectors before conversion.
Also adds RGBToYMatrixRow_Opt and RGBToYMatrixRow_2Step_Opt to convert_argb_test.cc to benchmark and compare the direct AVX2 conversion vs a 2-step approach.
./libyuv_test '--gunit_filter=*RAWToJ400_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=10000 --libyuv_flags=-1 --libyuv_cpu_info=-1
AMD Zen 5
Was LibYUVConvertTest.RAWToJ400_Opt (757 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (699 ms)
Intel Skylake
Was LibYUVConvertTest.RAWToJ400_Opt (1705 ms)
Now LibYUVConvertTest.RAWToJ400_Opt (1426 ms)
Bug: 477295731
Change-Id: I29866baf4ad5fe7a3725e4a01f2fe24649510a7d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7777325
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
These are about 25% faster than the C versions.
Bug: libyuv:42280902
Change-Id: I8b298670ee5f3ed5db35527fc41d6d9a51b020a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7573682
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
This allows for ABGR conversion using the same methods
Bug: libyuv:42280902
Change-Id: I5566e3150b30573a2326a900ce31ab095f8935f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7564316
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
This was implemented by Gemini followed by manual review and some
tweaking for style. The 601 and JPEG constants are fully verified
against the existing non-matrix implementations. On x86 the C-only
versions appear to be about 25% slower than the optimized ones.
Bug: libyuv:42280902
Change-Id: Ia5b7cb499bad5c76faec53f36086ebb18f2b530f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7512030
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Dale Curtis <dalecurtis@chromium.org>
- Was using avgb twice for non-exact and C for exact.
On Skylake Xeon:
Now SSE3
ARGBToJ420_Opt (326 ms)
Was
Exact C
ARGBToJ420_Opt (871 ms)
Not exact AVX2
ARGBToJ420_Opt (237 ms)
Not exact SSSE3
ARGBToJ420_Opt (312 ms)
Bug: 381138208
Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ben Weiss <bweiss@google.com>
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc
Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing behaviour does not round correctly in all cases, so adjust
it to match the existing Neon implementation.
Update the tests to require bit-exactness and disable other
implementations that do not round correctly.
Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.
* Run on SiFive internal FPGA(VLEN=128):
Test Case Speedup
AR64ToAB64_Opt x4.6
ARGBToRGBA_Opt x6
ARGBToBGRA_Opt x6
ARGBToABGR_Opt x6
RGBAToARGB_Opt x6
Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Save the value of a common subexpression in a local variable.
Change-Id: I5724fcf341900cb2a65eb37b505194b8d3c3da9a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4735651
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate
Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Run on SiFive internal FPGA:
I422ToARGB_Opt (~10x vs scalar)
I422ToRGBA_Opt (~10x vs scalar)
I420ToRGB24_Opt (~8x vs scalar)
LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10
This CL manually sets rounding mode,
since we use fixed-point vector narrowing clip.
There is no definition about default value for fixed-point rounding mode.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#38-vector-fixed-point-rounding-mode-register-vxrm
The behavior could be different on differet paltforms. To avoid unexpected behavior, we set rounding mode manually.
Change-Id: I90f0dcb90c37f7da7caab8eb1df6c9c7a3c874a8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4512373
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Fix the algorithm for unpacking the lower 2 bits of M2T2 pixels.
Bug: b:258474032
Change-Id: Iea1d63f26e3f127a70ead26bc04ea3d939e793e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4337978
Commit-Queue: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This initial implementation is based on current unoptimized code in webrtc using just plain for loops.
Bug: libyuv:949
Change-Id: Ic87ee49c3a0b62edbaaa4255c263c1f7be4ea02b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4110782
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- MT2T support for source strides added, but only works for positive values.
- Reduced casting in row_common - one cast per assignment.
- scaling functions use intptr_t for intermediate calculations, then cast strides to ptrdiff_t
Bug: libyuv:948, b/257266635, b/262468594
Change-Id: I0409a0ce916b777da2a01c0ab0b56dccefed3b33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4102203
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ernest Hua <ernesthua@google.com>
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
- Was SplitUV, memcpy Y, InterpolateUV
- Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA
3840x2160 1000 iterations:
Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)
AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)
Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)
SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)
Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)
Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits
Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)
AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)
Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)
Bug: libyuv:938
Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- add tests for all single plane formats that reduce or stay same in size
Bug: b/242233673
Change-Id: Ic25d808114f11995ac56ea9c31b99f66ba36d345
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3828485
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Fix the following MSVC warnings:
src\source\row_win.cc(117): warning C4309: 'argument': truncation of
constant value
src\source\row_win.cc(136): warning C4309: 'argument': truncation of
constant value
src\source\row_win.cc(155): warning C4309: 'argument': truncation of
constant value
src\source\row_win.cc(174): warning C4309: 'argument': truncation of
constant value
src\source\row_common.cc(1712): warning C4244: 'initializing':
conversion from 'uint16_t' to 'int8_t', possible loss of data
src\source\row_common.cc(1731): warning C4244: 'initializing':
conversion from 'int16_t' to 'int8_t', possible loss of data
src\source\row_common.cc(1786): warning C4244: 'initializing':
conversion from 'uint16_t' to 'int8_t', possible loss of data
src\source\row_common.cc(1805): warning C4244: 'initializing':
conversion from 'uint16_t' to 'int8_t', possible loss of data
Bug: libyuv:939
Change-Id: Ie87ba6e716732d1ff1ae5c236dfd9cfdac13439d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3807105
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.
- fixes an out of registers error with -mcmodel=large
InterpolateRow_16To8_NEON improves performance for I210ToI420:
On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)
On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)
Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
- Avoid stepping to height + 1 for bilinear filter 2nd row for last row of source
- Box filter ubsan fix for 3/4 and 3/8 scaling for 16 bit planar
- Height 1 asan fixes
Bug: libyuv:935, b/206716399
Change-Id: I56088520f2a884a37b987ee5265def175047673e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717263
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>