This version of the H010ToAR30 provides a 3 step conversion
Convert16To8Row_AVX2
H420ToARGB_AVX2
ARGBToAR30_AVX2
Low level function added to convert 16 bit to 8 bit using multiply
to adjust 10 bit or other bit depths and then save the upper 16 bits.
Bug: libyuv:751
Test: LibYUVPlanarTest.Convert16To8Row_Opt unittest added
Change-Id: I9cc576fda8afa1003cb961d03e0e656e0b478f03
Reviewed-on: https://chromium-review.googlesource.com/783554
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
When converting from lsb 10 bit formats to msb, the values
need to be shifted to the top 10 bits. Using a multiply
allows the different numbers of bits to be copied:
// 128 = 9 bits
// 64 = 10 bits
// 16 = 12 bits
// 1 = 16 bits
Bug: libyuv:751
Test: LibYUVPlanarTest.MultiplyRow_16_Opt
Change-Id: I9cf226053a164baa14155215cb175065b1c4f169
Reviewed-on: https://chromium-review.googlesource.com/762951
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
clang does not require -msse2 or -msse for inline, except
the "x" parameter. So change this to "m" for 32 bit. 64 bit
requires sse2 so use "x" for 64 bit.
gcc requires -msse for xmm registers in clobber list.
Reduce compiler requirement from -msse2 to -msse for enabling
assembly.
Bug: libyuv:754, libyuv:757
Test: CC=clang CXX=clang++ CFLAGS="-m32" CXXFLAGS="-m32 -mno-sse -O2" make -f linux.mk
Change-Id: I86df72cfee80b7d349561c1fd7c97ad360767255
Reviewed-on: https://chromium-review.googlesource.com/759303
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
H010 is 10 bit planar format with 10 bits in lower bits.
P010 is 10 bit biplanar format with 10 bits in upper bits.
This function weaves the U and V channels and shifts the bits
into the upper bits.
Bug: libyuv:751
Test: LibYUVPlanarTest.MergeUV10Row_Opt
Change-Id: I4a0bac0ef1ff95aa1b8d68261ec8e8e86f2d1fbf
Reviewed-on: https://chromium-review.googlesource.com/752692
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Uses 1 add instead of 2 leas to reduce port pressure on ports 1 and 5
used for SIMD instructions.
BUG=libyuv:670
TEST=~/iaca-lin64/bin/iaca.sh -arch HSW out/Release/obj/libyuv/row_gcc.o
Change-Id: I3965ee5dcb49941a535efa611b5988d977f5b65c
Reviewed-on: https://chromium-review.googlesource.com/433391
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
add macros to enable/disable code analyst around blocks of code.
Normally these macros should not be used, but if performance
details are wanted for intel code, enable them around the code
and then run via the iaca tool, available on the intel website.
BUG=libyuv:670
TEST=~/iaca-lin64/bin/iaca.sh -64 out/Release/libyuv_unittest
R=wangcheng@google.com
Review-Url: https://codereview.chromium.org/2626193002 .
Debug builds of x86 gcc/clang can run out of register.
Previously NDEBUG or _DEBUG was used to detect a debug build.
But those macros are not set by gentoo builds.
This CL switches to the compiler predefine __OPTIMIZE__ which is
built into clang and gcc.
BUG=libyuv:602
TEST=untested
R=wangcheng@google.com
Review URL: https://codereview.chromium.org/2451503002 .
YUV 411 is very uncommon format. Remove support.
Update documentation to reflect that 411 is deprecated.
Simplify tests for YUV to only test with the new side by side YUV but keep old 3 plane test around with a macro for now.
BUG=libyuv:645
R=kjellander@chromium.org
Review URL: https://codereview.chromium.org/2406123002 .
Work around for android full debug build runnign out of registers.
5 functions were running out of registers causing the compiler error
error: 'asm' operand has impossible constraints
These functions mostly have 4 pointers, a counter (width) and a tempory
eax register. With fpic and debug using stackframes, 2 registers are
unavailable. So a total of 8 registers are used.
Although fpic and stack frame dont apply to assembly, the compiler
reserves 2 registers. The optimized version builds, so its likely
freeing up the registers once it knows they are not used.
These functions used to build, so compile options and/or compiler may
have updated.. likely fpic was turned on.
An attribute can be done to disable each, and will avoid using the
2 GPR registers, but they are still reserved and unavailable in debug
builds on current compilers (gcc 4.9 and clang 3.8).
R=dhrosa@google.com
BUG=libyuv:602
Review URL: https://codereview.chromium.org/2066933002 .
Inline that uses temporary variables is currently initializing them
to 0 and passing in as output "+r".
This CL replaces the output constraint to "=&r" for most meaning an
output with early write (before inputs). This allows the initialize
to zero step to be removed, saving 1 instruction.
BUG=libyuv:580
TESTED=local libyuv build on gcc/linux and try bots
R=harryjin@google.com
Review URL: https://codereview.chromium.org/1895743008 .
An SSSE3 version already exists, and an AVX2 version is available for
Visual C. This ports the function to AVX2 completing the AVX2 ports of
all YUV to RGB functions for AVX2 on gcc.
TBR=harryjin@google.com
BUG=libyuv:555
Review URL: https://codereview.chromium.org/1687253002 .
This is an UBSan error reported by libjingle
[ RUN ] WebRtcVideoFrameTest.ConvertToYUY2BufferStride
[000:000] (videoframe.cc:375): Validate frame passed. format: I420 bpp: 12 size: 1280x720 bytes: 1382400 expected: 1382400 sample[0..3]: 73, 73, 73, 73
../../chromium/src/third_party/libyuv/source/row_gcc.cc:2903:25: runtime error: signed integer overflow: 128 * 16843009 cannot be represented in type 'int'
[8/614] WebRtcVideoFrameTest.ConvertToYUY2BufferStride returned/aborted with exit code 1 (32 ms)
[9/614] WebRtcVideoFrameTest.ConvertToYUY2BufferInverted (29 ms)
Note: Google Test filter = WebRtcVideoFrameTest.ConvertToYUY2BufferInverted
The source is uint8 and the multiply is by 0x01010101 to replicate the byte to 4 bytes.
Changing the constant to 0x01010101u should avoid overflow.
R=harryjin@google.comTBR=harryjin@google.com
BUG=libyuv:563
Review URL: https://codereview.chromium.org/1657533005 .
Remove inaccurate specializations for 1/4 and 3/4, since they round
incorrectly. Specialize for 100% and 50% are kept due to performance.
Make C and ARM code match SSSE3.
Make unittests expect zero difference.
BUG=libyuv:535
R=harryjin@google.com
Review URL: https://codereview.chromium.org/1533643005 .
previously the I411 format used movd to read U, V pixels.
But this reads 4 bytes, and can cause a memory exception.
pinsrw can be used, but fails on drmemory 1.5, and is slow.
So in this change a movzxw is used to read 2 bytes into EBX,
then copy to xmm0 with movd.
Slightly slower, but no memory exception
Was LibYUVConvertTest.I411ToARGB_Opt (577 ms)
Now LibYUVConvertTest.I411ToARGB_Opt (608 ms)
TBR=harryjin@google.com
BUG=libyuv:525
Review URL: https://codereview.chromium.org/1457783004 .
SSSE3
Note: Google Test filter = *I444ToARGB*
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from LibYUVConvertTest
[ RUN ] LibYUVConvertTest.I444ToARGB_Any
[ OK ] LibYUVConvertTest.I444ToARGB_Any (435 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Unaligned
[ OK ] LibYUVConvertTest.I444ToARGB_Unaligned (418 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Invert
[ OK ] LibYUVConvertTest.I444ToARGB_Invert (417 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Opt
[ OK ] LibYUVConvertTest.I444ToARGB_Opt (411 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Any
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Any (419 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Unaligned
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Unaligned (432 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Invert
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Invert (435 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Opt
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Opt (421 ms)
[----------] 8 tests from LibYUVConvertTest (3389 ms total)
AVX2
Note: Google Test filter = *I444ToARGB*
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from LibYUVConvertTest
[ RUN ] LibYUVConvertTest.I444ToARGB_Any
[ OK ] LibYUVConvertTest.I444ToARGB_Any (340 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Unaligned
[ OK ] LibYUVConvertTest.I444ToARGB_Unaligned (325 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Invert
[ OK ] LibYUVConvertTest.I444ToARGB_Invert (316 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_Opt
[ OK ] LibYUVConvertTest.I444ToARGB_Opt (316 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Any
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Any (315 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Unaligned
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Unaligned (341 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Invert
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Invert (331 ms)
[ RUN ] LibYUVConvertTest.I444ToARGB_ARGB_Opt
[ OK ] LibYUVConvertTest.I444ToARGB_ARGB_Opt (329 ms)
[----------] 8 tests from LibYUVConvertTest (2615 ms total)
TBR=harryjin@google.com
BUG=libyuv:492
Review URL: https://codereview.chromium.org/1445893002 .
On Arm the YVU to RGB conversions move constants into registers.
This change does the same for 64 bit intel builds where additional
registers are available.
The AVX2 saves 3 instructions by because the 2nd argument needs to be a register, so a vmovdqu was avoided.
x64 builds using memory:
AVX2 I420ToARGB_Opt (3059 ms)
SSSE3 I420ToARGB_Opt (3959 ms)
Now using registers
AVX2 I420ToARGB_Opt (2906 ms)
SSSE3 I420ToARGB_Opt (3928 ms)
TBR=harryjin@google.com
BUG=libyuv:520
Review URL: https://codereview.chromium.org/1407353010 .
Removes low levels for I420ToBGRA and I420ToRAW and reimplements them as I420ToRGBA and I420ToRGB24 with transposed color matrix.
Adds unittests that do 1 step conversion vs 2 steps to test end swapping versions match direct conversions.
R=harryjin@google.com
BUG=libyuv:518
Review URL: https://codereview.chromium.org/1427993004 .
In some methods with 7 arguments gcc fails to find enough registers
to compile the assembler code when compiling debug. Simplest solution
is to skip the assembler version in debug of those particular functions
(I422Alpha -> ARBG/ABGR)
R=harryjin@google.com,bratell@opera.com
BUG=libyuv:517
Review URL: https://codereview.chromium.org/1423283002 .
U contributes to B and G. V contributes to R and G.
By swapping U and V, they contribute to the opposite channels. Adjust the matrix so the U contribution is in the matrix location such that it till contribute to the
new B channel and vice versa.
This allows ABGR versions of YUV conversion to use the same low level code as ARGB, just using a different matrix and swapping U and V pointers.
As a result the existing I444ToABGRRow functions are no longer needed and are removed.
Previously this function was only Intel AVX2 optimized for Windwos. Now it is also optimized for Arm and GCC.
ARMv7 Neon
Was LibYUVConvertTest.I444ToABGR_Opt (75971 ms)
Now LibYUVConvertTest.I444ToABGR_Opt (3672 ms)
20.6 times faster.
R=xhwang@chromium.org
BUG=libyuv:515
Review URL: https://codereview.chromium.org/1414133006 .
yuv constants for bt.601 were previously ported to neon64, as well
as the code to respect other color spaces. But the jpeg and bt.709
colour conversion constants were still in armv7 form. This changes
the constants for aarch64 builds to be compatible with the code.
yuv constants are now passed as const *
Remove Yvu constants which were used for older version on nv21 but not new code.
TBR=harryjin@google.com
BUG=none
Review URL: https://codereview.chromium.org/1398623002 .
Previously the assembly code was only available to Windows.
This CL ports the SSE2 code to GCC syntax.
When running a profiler on all the unittests, this function
was the slowest of all functions that still ran in C code.
3.71% libyuv_unittest libyuv_unittest [.] ARGBToRGB565DitherRow_C
Was
ARGBToRGB565Dither_Opt (2894 ms)
Now
ARGBToRGB565Dither_Opt (432 ms)
TBR=harryjin@google.com
BUG=libyuv:492
Review URL: https://codereview.chromium.org/1397673002 .
Low level for NV21ToARGB written to accept yuv matrix used by
other YUV to ARGB functions.
Previously NV21 was implemented for Windows using NV12 with a different
matrix that swapped U and V. But the Arm version of the low level does
not allow the matrix U and V contributions to be swapped.
Using a new low level function that reads NV21 and uses the same
yuvconstants as other YUV conversion functions allows an Arm port of
this function.
TBR=harryjin@google.com
BUG=libyuv:500
Review URL: https://codereview.chromium.org/1388273002 .
ARGBBlendRow_SSE2, ARGBAttenuateRow_SSE2, and MirrorRow_SSE2
Since vast majority of CPUs have SSSE3 now, removing the SSE2
improves the performance of CPU dispatching.
R=harryjin@google.com
BUG=none
Review URL: https://codereview.chromium.org/1377053003 .