Key instruction sets added for each microarchitecture:
AVX512BW, AVX512VL, AVX512DQ - Skylake Server or later
AVX512_VBMI, AVX512_IFMA - Cannon Lake or later
AVX512_BITALG, AVX512_VBMI2, AVX512_VPOPCNTDQ, AVX512_VNNI, GFNI, VAES, VPCLMULQDQ - Ice Lake or later
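As a rough illustration of how these sets can be told apart, the sketch below decodes the EBX and ECX values returned by CPUID leaf 7, subleaf 0. The flag names and the function name are hypothetical and are not libyuv's actual constants. A real detector would also have to confirm that the OS saves AVX-512 state (via XGETBV), which is omitted here.

```c
#include <stdint.h>

/* Hypothetical flag values for this sketch only. */
enum {
  kSketchAVX512BW = 1 << 0,
  kSketchAVX512VL = 1 << 1,
  kSketchAVX512DQ = 1 << 2,
  kSketchAVX512IFMA = 1 << 3,
  kSketchAVX512VBMI = 1 << 4,
  kSketchAVX512VBMI2 = 1 << 5,
  kSketchGFNI = 1 << 6,
  kSketchVAES = 1 << 7,
  kSketchVPCLMULQDQ = 1 << 8,
  kSketchAVX512VNNI = 1 << 9,
  kSketchAVX512BITALG = 1 << 10,
  kSketchAVX512VPOPCNTDQ = 1 << 11
};

/* Decode EBX and ECX of CPUID leaf 7, subleaf 0 into the flags above. */
uint32_t DecodeLeaf7Sketch(uint32_t ebx, uint32_t ecx) {
  uint32_t flags = 0;
  /* EBX bits: Skylake Server and Cannon Lake era features. */
  if (ebx & (1u << 17)) flags |= kSketchAVX512DQ;
  if (ebx & (1u << 21)) flags |= kSketchAVX512IFMA;
  if (ebx & (1u << 30)) flags |= kSketchAVX512BW;
  if (ebx & (1u << 31)) flags |= kSketchAVX512VL;
  /* ECX bits: Cannon Lake and Ice Lake era features. */
  if (ecx & (1u << 1)) flags |= kSketchAVX512VBMI;
  if (ecx & (1u << 6)) flags |= kSketchAVX512VBMI2;
  if (ecx & (1u << 8)) flags |= kSketchGFNI;
  if (ecx & (1u << 9)) flags |= kSketchVAES;
  if (ecx & (1u << 10)) flags |= kSketchVPCLMULQDQ;
  if (ecx & (1u << 11)) flags |= kSketchAVX512VNNI;
  if (ecx & (1u << 12)) flags |= kSketchAVX512BITALG;
  if (ecx & (1u << 14)) flags |= kSketchAVX512VPOPCNTDQ;
  return flags;
}
```

Keeping the decode separate from the CPUID call itself lets it be checked with fixed register values.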
Bug: libyuv:752
Test: ~/intelsde/sde -icl -- out/Release/libyuv_unittest --gtest_filter=*Cpu*
Change-Id: I9ee28904c90009d66721b9f805a440c5fc2da122
Reviewed-on: https://chromium-review.googlesource.com/755617
Reviewed-by: Frank Barchard <fbarchard@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
vmovd is an AVX instruction, so it will crash on an older CPU that
has SSSE3 but not AVX. Use movd instead.
Bug: libyuv:753
Test: ~/intelsde/sde -mrm -- out/Release/libyuv_unittest --gtest_filter=LibYUVCompareTest.BenchmarkHammingDistance_Opt
Change-Id: I1fb0026039d5f83d124f5d03fed7dc0d2d723e49
Reviewed-on: https://chromium-review.googlesource.com/756200
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
H010 is 10 bit planar format with 10 bits in lower bits.
P010 is 10 bit biplanar format with 10 bits in upper bits.
This function weaves the U and V channels and shifts the bits
into the upper bits.
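A scalar sketch of the weave and shift described above, assuming 16 bit storage per sample so that moving 10 bits to the top is a left shift by 6. The function name is hypothetical and this is not the optimized row function itself.

```c
#include <stdint.h>

/* Interleave 10 bit U and V samples (held in the low bits of 16 bit words)
   into one UV row, with the 10 bits moved into the upper bits. */
void MergeUV10Sketch(const uint16_t* src_u,
                     const uint16_t* src_v,
                     uint16_t* dst_uv,
                     int width) {
  for (int x = 0; x < width; ++x) {
    dst_uv[2 * x + 0] = (uint16_t)(src_u[x] << 6); /* U first */
    dst_uv[2 * x + 1] = (uint16_t)(src_v[x] << 6); /* then V */
  }
}
```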
Bug: libyuv:751
Test: LibYUVPlanarTest.MergeUV10Row_Opt
Change-Id: I4a0bac0ef1ff95aa1b8d68261ec8e8e86f2d1fbf
Reviewed-on: https://chromium-review.googlesource.com/752692
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Bug: libyuv:701
Test: objdump to confirm code gen
Change-Id: Ibdcb2cc6bc9bf14b4ccb874c49fc9ff664650e1a
Reviewed-on: https://chromium-review.googlesource.com/745390
Reviewed-by: Frank Barchard <fbarchard@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
popcnt has a fake dependency on the destination.
This assembly avoids the dependency by using a different
register for each popcnt.
Bug: libyuv:701
Test: LIBYUV_DISABLE_SSSE3=1 out/Release/libyuv_unittest --gtest_filter=*Ham*Opt --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=9999 --libyuv_flags=-1 --libyuv_cpu_info=-1
Change-Id: Ie1d202e2613b7fa8a3c02acd433940e92c80eafa
Reviewed-on: https://chromium-review.googlesource.com/731826
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
mingw gcc sets the macro _M_IX86, which is normally set only by
Visual C and clangcl. Those compilers use Visual C style source
code for assembly, but gcc is not Visual C compatible.
Add _MSC_VER to most ifdefs to detect that it's really Visual C
or clangcl and not mingw gcc, so the gcc source code will be used.
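The guard pattern looks roughly like the sketch below. The macro and function names are hypothetical and only illustrate the combined check. The actual ifdefs in the source carry additional conditions.

```c
/* Select Visual C style assembly only when the compiler is really
   Visual C or clangcl. Checking _M_IX86 alone would also match mingw gcc. */
#if defined(_M_IX86) && defined(_MSC_VER)
#define SKETCH_USE_MSVC_STYLE_ASM 1
#else
#define SKETCH_USE_MSVC_STYLE_ASM 0
#endif

/* Returns 1 when the Visual C style path would be compiled, else 0. */
int UsesMsvcStyleAsmSketch(void) {
  return SKETCH_USE_MSVC_STYLE_ASM;
}
```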
Bug: libyuv:744
Test: CXXFLAGS=-m32 CXX=~/prebuilts/gcc/linux-x86/host/x86_64-w64-mingw32-4.8/bin/x86_64-w64-mingw32-g++ make -f linux.mk
Change-Id: I3431aa486eb769b145faa8d5eb75ed639f9d6f5e
Reviewed-on: https://chromium-review.googlesource.com/722319
Reviewed-by: Cheng Wang <wangcheng@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Bug: libyuv:701
Test: HammingDistance unittest with large size
Change-Id: Id41a2c27eb8922d03b3a21dab32fa2e7b015ba38
Reviewed-on: https://chromium-review.googlesource.com/708335
Reviewed-by: Cheng Wang <wangcheng@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
This reverts commit ec75df5894845b8d6b1341885a78db1de83decd8.
Reason for revert: <INSERT REASONING HERE>
Original change's description:
> ComputeHammingDistance reduce SIMD loop to 1 call when possible.
>
> 32 bit x86 has high overhead due to -fpic. So this reduces the
> number of calls by 1.
>
> TBR=kjellander@chromium.org
> Bug: libyuv:701
> Test: BenchmarkHammingDistance
> Change-Id: I7f557ef047920db65eab362a5f93abbd274ca051
> Reviewed-on: https://chromium-review.googlesource.com/701755
> Reviewed-by: Frank Barchard <fbarchard@google.com>
> Reviewed-by: Cheng Wang <wangcheng@google.com>
TBR=rrwinterton@gmail.com,fbarchard@google.com,wangcheng@google.com
Change-Id: Ia61e8558a8f083c14be5f51e0e141550b6f2b5c1
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: libyuv:701
Reviewed-on: https://chromium-review.googlesource.com/707823
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
32 bit x86 has high overhead due to -fpic. So this reduces the
number of calls by 1.
TBR=kjellander@chromium.org
Bug: libyuv:701
Test: BenchmarkHammingDistance
Change-Id: I7f557ef047920db65eab362a5f93abbd274ca051
Reviewed-on: https://chromium-review.googlesource.com/701755
Reviewed-by: Frank Barchard <fbarchard@google.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
If the length passed to HammingDistance was not a multiple of 4,
the result was incorrect. The old tests did not catch this,
so a new test counts 1s.
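A portable sketch of the behavior the new test expects, assuming a 4 byte main loop followed by a byte-wise tail. The names are hypothetical and this is not the library's implementation. Comparing an all 0xff buffer against zeros should give 8 times the length for any length.

```c
#include <stdint.h>
#include <string.h>

/* Count the 1 bits in a 32 bit word by clearing the lowest set bit. */
uint32_t PopCount32Sketch(uint32_t x) {
  uint32_t n = 0;
  while (x) {
    x &= x - 1;
    ++n;
  }
  return n;
}

/* Hamming distance over count bytes, for any count >= 0. */
uint32_t HammingDistanceTailSketch(const uint8_t* a,
                                   const uint8_t* b,
                                   int count) {
  uint32_t diff = 0;
  int i = 0;
  /* Main loop: 4 bytes at a time. */
  for (; i + 4 <= count; i += 4) {
    uint32_t wa, wb;
    memcpy(&wa, a + i, 4);
    memcpy(&wb, b + i, 4);
    diff += PopCount32Sketch(wa ^ wb);
  }
  /* Tail: the remaining 0 to 3 bytes. */
  for (; i < count; ++i) {
    diff += PopCount32Sketch((uint32_t)(a[i] ^ b[i]));
  }
  return diff;
}
```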
Bug: libyuv:740
Test: LibYUVCompareTest.TestHammingDistance
Change-Id: I93db5437821c597f1f162ac263d4a594bb83231f
Reviewed-on: https://chromium-review.googlesource.com/699614
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Under cache thrashing circumstances, ldp/stp perform better than
ld1/st1 on QC820/QC821 CPUs. Same performance when hitting cache.
Bug: libyuv:738
Test: LibYUVPlanarTest.TestCopySamples_Opt (445 ms)
Change-Id: Ib6a0a5d5e6a1b7ef667b9bb2edb39d681cf3614c
Reviewed-on: https://chromium-review.googlesource.com/691281
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
Full color test is the slowest of the unittests, and is not catching any
additional bugs at the moment. Step through the range of 0 to 255 in steps
of 5 to speed up the test. 255 is 3 * 5 * 17, so a step of any of those
primes hits both 0 and 255 exactly.
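The arithmetic can be checked with a small sketch (hypothetical helper names): a step lands on 255 only if it divides 255, and a step of 5 visits 52 values per channel instead of 256.

```c
/* Returns 1 if stepping from 0 by step lands exactly on 255. */
int StepHits255Sketch(int step) {
  return step > 0 && 255 % step == 0;
}

/* Number of values visited when stepping from 0 up to 255 by step.
   Assumes step > 0. */
int SamplesPerChannelSketch(int step) {
  return 255 / step + 1;
}
```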
Was LibYUVColorTest.TestFullYUV (896 ms)
Now LibYUVColorTest.TestFullYUV (212 ms)
TBR=kjellander@chromium.org
Bug: libyuv:736
Test: LibYUVColorTest.TestFullYUV
Change-Id: I5b55fb07ada0dc7bdc3c3c20569d36bf09bb3804
Reviewed-on: https://chromium-review.googlesource.com/672064
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Use ld2 to load even and odd pixels into different registers
and hadd to halving-add them to each other.
Previously used a paired add and shift.
TBR=kjellander@chromium.org
BUG=libyuv:723
TEST=ScaleDownBy2_Linear
Change-Id: I3ec72bcf7d4c746837217496c301eb4e4ad963cf
Reviewed-on: https://chromium-review.googlesource.com/644113
Reviewed-by: Cheng Wang <wangcheng@google.com>
urhadd is a rounded average. Linear filter wants to average
horizontally, so use ld2 to separate even and odd pixels.
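In scalar terms the rounded horizontal average is roughly the following sketch. The function name is hypothetical. The even and odd source pixels correspond to the two registers that ld2 fills.

```c
#include <stdint.h>

/* Halve a row horizontally: each output pixel is the rounded average
   of an even source pixel and the odd pixel that follows it. */
void ScaleRowDown2LinearSketch(const uint8_t* src,
                               uint8_t* dst,
                               int dst_width) {
  for (int x = 0; x < dst_width; ++x) {
    int even = src[2 * x];
    int odd = src[2 * x + 1];
    dst[x] = (uint8_t)((even + odd + 1) >> 1); /* +1 rounds to nearest */
  }
}
```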
TBR=jkellander@chromium.org
BUG=None
TEST=LibYUVScaleTest.*ScaleDownBy2*
Change-Id: Id667288a030e72ce8e1c1d6719b69c555c0db063
Reviewed-on: https://chromium-review.googlesource.com/642448
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
Roughly, instead of 4 loads and 8 multiplies, use 1 load and 2 multiplies
4 times over. The original code, as with the C code from clang and gcc,
did all the loads, then all the math, then the store. The new code
does a load, then the math, then the next load, etc.
This schedules better on current ARM64 CPUs.
The number of registers is also reduced by reusing the same registers.
HiSilicon ARM A73:
Now
TestGaussRow_Opt (890 ms)
TestGaussCol_Opt (571 ms)
Was
TestGaussRow_Opt (1061 ms)
TestGaussCol_Opt (595 ms)
Qualcomm 821 (Pixel):
Now
TestGaussRow_Opt (571 ms)
TestGaussCol_Opt (474 ms)
Was
TestGaussRow_Opt (751 ms)
TestGaussCol_Opt (520 ms)
TBR=kjellander@chromium.org
BUG=libyuv:719
TEST=LibYUVPlanarTest.TestGaussRow_Opt
Reviewed-on: https://chromium-review.googlesource.com/627478
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Change-Id: I5ec81191d460801f0d4a89f0384f89925ff036de
Reviewed-on: https://chromium-review.googlesource.com/634448
Commit-Queue: Frank Barchard <fbarchard@google.com>
Downsample 16x2 to 8x1 with box filtering
[ RUN ] LibYUVScaleTest.TestScaleRowUp2_16
[ OK ] LibYUVScaleTest.TestScaleRowUp2_16 (579 ms)
[ RUN ] LibYUVScaleTest.TestScaleRowDown2Box_16
[ OK ] LibYUVScaleTest.TestScaleRowDown2Box_16 (329 ms)
[----------] 2 tests from LibYUVScaleTest (909 ms total)
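A scalar sketch of the box filter, assuming 16 bit samples (suggested by the _16 suffix) and a rounded average of each 2x2 block. The function name and the stride parameter, counted in samples, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Downsample two source rows to one row of half the width by averaging
   each 2x2 block of 16 bit samples with rounding. */
void ScaleRowDown2Box16Sketch(const uint16_t* src,
                              ptrdiff_t src_stride, /* in samples */
                              uint16_t* dst,
                              int dst_width) {
  const uint16_t* s = src;              /* top row */
  const uint16_t* t = src + src_stride; /* bottom row */
  for (int x = 0; x < dst_width; ++x) {
    int sum = s[2 * x] + s[2 * x + 1] + t[2 * x] + t[2 * x + 1];
    dst[x] = (uint16_t)((sum + 2) >> 2); /* +2 rounds to nearest */
  }
}
```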
TBR=kjellander@chromium.org
BUG=libyuv:718
TEST=LibYUVScaleTest.TestScaleRowUp2_16 and LibYUVScaleTest.TestScaleRowDown2Box_16
Change-Id: I457d44123f2751e5f71bf3935401fff74b8e9db2
Reviewed-on: https://chromium-review.googlesource.com/608876
Reviewed-by: Cheng Wang <wangcheng@google.com>
Add ScaleMaxSamples_NEON function with the max
computed on the original (unscaled) values.
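A scalar sketch of that behavior: the scaled samples are written to the destination while the maximum is tracked on the source values. The function name is hypothetical, and starting the running maximum at 0 is an assumption of this sketch.

```c
/* Scale each sample and return the maximum of the unscaled inputs.
   The running maximum starts at 0 (an assumption of this sketch). */
float ScaleMaxSamplesSketch(const float* src,
                            float* dst,
                            float scale,
                            int width) {
  float max_value = 0.f;
  for (int i = 0; i < width; ++i) {
    float v = src[i];
    if (v > max_value) {
      max_value = v; /* max is taken before scaling */
    }
    dst[i] = v * scale;
  }
  return max_value;
}
```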
TBR=kjellander@chromium.org
BUG=libyuv:717
TEST=LibYUVPlanarTest.TestScaleMaxSamples_Opt
Change-Id: Id99338860782b10ffd24f66242eb42014c2e229e
Reviewed-on: https://chromium-review.googlesource.com/614685
Reviewed-by: Frank Barchard <fbarchard@google.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
This reverts commit 1dda4cb0b7bd564e646d6ec2efee497fcd7146ca.
Reason for revert: build error on jpeg FILE
Original change's description:
> include <new> header for benefit of new clang builds
>
> TBR=kjellander@chromium.org
> BUG=libyuv:712
> TEST=local builds still work
>
> Change-Id: I040e8edc40aafd820d2a29629fe7aec5c049bc6b
> Reviewed-on: https://chromium-review.googlesource.com/576971
> Reviewed-by: Frank Barchard <fbarchard@google.com>
> Commit-Queue: Frank Barchard <fbarchard@google.com>
TBR=kjellander@chromium.org,fbarchard@google.com
# Not skipping CQ checks because original CL landed > 1 day ago.
Bug: libyuv:712
Change-Id: I4cf4e26eadb476017dc95e6c9578092204f088a3
Reviewed-on: https://chromium-review.googlesource.com/601211
Commit-Queue: Frank Barchard <fbarchard@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
NaCl has been disabled for a while, so the code
will still build, but only with C versions.
This change removes the MEMACCESS() macros from
Neon and Neon64 source.
BUG=libyuv:702
TEST=try bots build for arm.
R=kjellander@chromium.org
Change-Id: Id581a5c8ff71e18cc69595e7fee9337f97c44a19
Reviewed-on: https://chromium-review.googlesource.com/528332
Reviewed-by: Cheng Wang <wangcheng@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
Instead of casting int to int64, pass the int
and use the %w modifier to select the word version of the register.
TBR=kjellander@chromium.org
BUG=libyuv:706
TEST=git cl lint
R=wangcheng@google.com
Change-Id: Iee5a70f04d928903ca8efac00066b8821a465e36
Reviewed-on: https://chromium-review.googlesource.com/528381
Reviewed-by: Cheng Wang <wangcheng@google.com>
Reviewed-by: Frank Barchard <fbarchard@google.com>
Summing 16 bit hamming codes restricts the maximum length,
but saves an inner loop instruction. The outer loop can sum the
values.
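The trade-off can be sketched in portable C as below. The names and the block size are hypothetical and this is not the Neon code. Each byte contributes at most 8 to the count, so a 16 bit sum is safe for up to 8191 bytes per block, and a wider outer sum combines the blocks.

```c
#include <stdint.h>

/* Count the 1 bits in a byte. */
unsigned PopCount8Sketch(uint8_t x) {
  unsigned n = 0;
  while (x) {
    x &= (uint8_t)(x - 1);
    ++n;
  }
  return n;
}

/* Hamming distance using a 16 bit inner sum and a 32 bit outer sum. */
uint32_t HammingDistance16Sketch(const uint8_t* a,
                                 const uint8_t* b,
                                 int count) {
  /* 8191 bytes * 8 bits = 65528, which still fits in 16 bits. */
  const int kMaxBlock = 8191;
  uint32_t total = 0;
  while (count > 0) {
    int n = count < kMaxBlock ? count : kMaxBlock;
    uint16_t block_sum = 0; /* narrow accumulator for the inner loop */
    for (int i = 0; i < n; ++i) {
      block_sum = (uint16_t)(block_sum + PopCount8Sketch((uint8_t)(a[i] ^ b[i])));
    }
    total += block_sum; /* outer loop widens the sum */
    a += n;
    b += n;
    count -= n;
  }
  return total;
}
```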
32 bit Neon
Now BenchmarkHammingDistance_Opt (78 ms)
Was BenchmarkHammingDistance_Opt (92 ms)
64 bit Neon
Now BenchmarkHammingDistance_Opt (85 ms)
Was BenchmarkHammingDistance_Opt (92 ms)
R=wangcheng@google.com
TBR=kjellander@chromium.org
BUG=libyuv:701
TEST=BenchmarkHammingDistance
Change-Id: Ie40f0eac2f3339c33b833b42af5d394b122066ae
Reviewed-on: https://chromium-review.googlesource.com/526932
Reviewed-by: Frank Barchard <fbarchard@google.com>
Reviewed-by: Cheng Wang <wangcheng@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>
The 32 bit version of HammingDistance_NEON accumulates
using vertical add and paired adds, which takes 3 instructions
instead of 4.
The instructions are also portable between 32 and 64 bit.
Was BenchmarkHammingDistance_Opt (105 ms)
Now BenchmarkHammingDistance_Opt (90 ms)
TBR=kjellander@chromium.org
BUG=libyuv:701
TEST=BenchmarkHammingDistance
BenchmarkHammingDistance_Opt (90 ms)
Change-Id: If9e621e0bd2fe2492a1532056f8a1b451ba53d7e
Reviewed-on: https://chromium-review.googlesource.com/526365
Reviewed-by: Frank Barchard <fbarchard@google.com>
Commit-Queue: Frank Barchard <fbarchard@google.com>