122 Commits

Author SHA1 Message Date
George Steed
61bdaee13a Add Neon I8MM implementations of ARGB to UV and variants
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.

Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:

Cortex-A510:  -7.1%
Cortex-A520:  -2.1%
Cortex-A710:  -8.4%
Cortex-A715:  -0.3%
Cortex-A720:  -0.3%
  Cortex-X2: -40.0%
  Cortex-X3: -43.3%
  Cortex-X4: -11.3%
Cortex-X925:  -2.5%

Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-05-12 11:14:00 -07:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
c1bac9e6a5 RAWToJ444 and ARGBToJ444
- ARGBToJ444 implements ARGBToUVJ444Row_C
- RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444

libyuv_test '--gunit_filter=*R*To?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
(with bit exact off)

Samsung S23
RAWToJ444_Opt (437 ms)
ARGBToJ444_Opt (337 ms)
ARGBToI444_Opt (196 ms)

Skylake Xeon
RAWToJ444_Opt (1699 ms)
ARGBToJ444_Opt (1559 ms)
ARGBToI444_Opt (346 ms)

Bug: 390247964
Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-29 15:18:38 -08:00
George Steed
7391559cb4 [AArch64] Add SME implementation of MergeUVRow{,_16}
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 01:16:19 -08:00
George Steed
a4ccf9940e [AArch64] Add I8MM implementation of ARGBToUV444Row
We cannot use the standard dot-product instructions since the
coefficients multiplication results are both added and subtracted, but
I8MM supports mixed-sign dot products which work well here.  We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.

Reduction in runtimes observed compared to the previous Neon
implementation:

Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
  Cortex-X2: -62.1%
  Cortex-X3: -62.2%
  Cortex-X4: -40.4%

Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-16 17:32:52 +00:00
George Steed
899bc48327 [AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD4 or ST3 instructions.

Reduction in runtime observed compared to the existing Neon
implementations:

            | ARGBToRAWRow | ARGBToRGB24Row
Cortex-A510 |       -50.8% |         -19.9%
Cortex-A720 |       -39.8% |         -39.1%
  Cortex-X2 |       -66.5% |         -51.9%

Bug: libyuv:973
Change-Id: Iaead678715a3d70d54cf823391272a6196836769
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-08 20:27:54 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
bce3392830 [AArch64] Add SVE2 implementation of ARGBToRGB565Row
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -27.1%
Cortex-A720: -49.4%
  Cortex-X2: -67.9%

Bug: libyuv:973
Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-05-31 17:42:27 +00:00
George Steed
9fac9a4a82 [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row
There are existing x86 implementations for these kernels but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | ABGRToAR30Row | ARGBToAR30Row
 Cortex-A55 |        -55.1% |        -55.1%
Cortex-A510 |        -39.3% |        -40.1%
 Cortex-A76 |        -62.3% |        -63.6%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:35:07 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
f2e78e1304 [AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow
Using the dot-product instructions here allows us to avoid needing LD4
for loading individual colour channels, which gives a big benefit on
some micro-architectures where such instructions perform significantly
worse than LD1. In addition the dot-product instructions have higher
throughput compared to the Neon

Observed reduction in runtimes for selected kernels moving from *_NEON
to *_NEON_DotProd:

     Kernel | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
ABGRToYJRow |      -6.5% |      -22.5% |     -43.5% |    -71.2%
 ABGRToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.3%
ARGBToYJRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 ARGBToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 BGRAToYRow |      -6.5% |      -22.5% |     -42.3% |    -68.4%
RGBAToYJRow |      -6.5% |      -22.5% |     -42.2% |    -73.7%
 RGBAToYRow |      -6.5% |      -22.5% |     -42.3% |    -64.9%

Bug: libyuv:977
Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:09:36 +00:00
Frank Barchard
def473f501 malloc return 1 for failures and assert for internal functions
Bug: libyuv:968
Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-12-04 22:55:20 +00:00
Frank Barchard
31e1d6f896 Check allocations that return NULL and return early
BUG=libyuv:968

Change-Id: I9e8594440a6035958511f9c50072820131331fc8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-10-27 17:41:36 +00:00
Bruce Lai
ec2e9ca000 [RVV] Support AR64ToAB64 and RGBA-family color conversions
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.

* Run on SiFive internal FPGA(VLEN=128):

Test Case	Speedup
AR64ToAB64_Opt  x4.6
ARGBToRGBA_Opt  x6
ARGBToBGRA_Opt  x6
ARGBToABGR_Opt  x6
RGBAToARGB_Opt  x6

Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-09-05 22:44:48 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Darren Hsieh
d14bd701c8 [RVV] Enable CopyRow_RVV, InterpolateRow_RVV, {Merge,Split}UVRow_RVV
* Run on SiFive internal FPGA:

MergeUVPlane_Opt(~6x vs scalar)
SplitUVPlane_Opt(~6x vs scalar)
TestCopyPlane(~8x vs scalar)
ARGBInterpolate0_Opt(~10x vs scalar)
ARGBInterpolate64_Opt(~9x vs scalar)
ARGBInterpolate168_Opt(~9x vs scalar)
ARGBInterpolate192_Opt(~8.5x vs scalar)
ARGBInterpolate255_Opt(~8x vs scalar)

Bug: libyuv:956
Change-Id: I8372341865f75f42e30371ef943d5c2e4be7b79a
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4574186
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-30 09:10:35 +00:00
Lu Wang
8670bcf17f Optimize the following 19 functions with LSX in row_lsx.cc.
UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX,
ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX,
ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX,
ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX,
ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX,
ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX,
ARGBSepiaRow_LSX

Bug: libyuv:913
Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-19 18:55:58 +00:00
Frank Barchard
6a68b18a96 Bump version and apply clang format
Bug: libyuv:956
Change-Id: I2375a02583789af2a5f13f8dba6c663d5975aaa9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4522352
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-11 11:27:28 +00:00
Bruce Lai
59eae49f17 Enable ARGBToYMatrixRow_RVV/RGBAToYMatrixRow_RVV/RGBToYMatrixRow_RVV
Run on SiFive internal FPGA:

ARGBToJ400_Opt (~6x vs scalar)
RGBAToJ400_Opt (~6x vs scalar)
RGB24ToJ400_Opt (~5.5x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Change-Id: Ia3ce8cea7962fbd8618cc23e850a7913c9cabf4f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4521783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-11 10:17:51 +00:00
Lu Wang
1d940cc570 Optimize the following functions with LSX.
MirrorRow_LSX, MirrorUVRow_LSX, ARGBMirrorRow_LSX,
I422ToYUY2Row_LSX, I422ToUYVYRow_LSX, I422ToARGBRow_LSX,
I422ToRGBARow_LSX, I422AlphaToARGBRow_LSX, I422ToRGB24Row_LSX,
I422ToRGB565Row_LSX, I422ToARGB4444Row_LSX, I422ToARGB1555Row_LSX,
YUY2ToYRow_LSX, YUY2ToUVRow_LSX, YUY2ToUV422Row_LSX

Bug: libyuv:913
Change-Id: I46cec605001d7ddd73846eed6d0a77f936b6dc53
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4515191
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 00:25:48 +00:00
Bruce Lai
1330a79e9f Optimized AR64/AB64 <-> ARGB with RVV
* Run on SiFive internal FPGA:

ARGBToAR64_Opt (~13.7x vs scalar)
ARGBToAB64_Opt (~5.81x vs scalar)
AR64ToARGB_Opt (~15.8x vs scalar)
AB64ToARGB_Opt (~2.40x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Bug: libyuv:956

Change-Id: Ida642a5077f59d25fb7c5328f671956b2293dadd
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4442913
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-04-20 19:49:55 +00:00
Darren Hsieh
44396e6e9a Add ARGBToRAWRow_RVV, ARGBToRGB24Row_RVV, RGB24ToARGBRow_RVV
* Run on SiFive internal FPGA:

ARGBToRAW_Opt (~1.55x vs scalar)

ARGBToRGB24_Opt (~1.44x vs scalar)

RGB24ToARGB_Opt (~1.77x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Bug: libyuv:956

Change-Id: I26722f6848cd68684d95d9a7ee06ce0416e7985d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4413083
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-04-13 19:33:16 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Hao Chen
0809713775 Refine some functions on the Longarch platform.
Add ARGBToYMatrixRow_LSX/LASX, RGBAToYMatrixRow_LSX/LASX and
RGBToYMatrixRow_LSX/LASX functions with RgbConstants argument.

Bug: libyuv:912
Change-Id: I956e639d1f0da4a47a55b79c9d41dcd29e29bdc5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4167860
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-18 18:54:14 +00:00
Frank Barchard
f71c83552d I420ToRGB24MatrixFilter function added
- Implemented as 3 steps: Upsample UV to 4:4:4, I444ToARGB, ARGBToRGB24
- Fix some build warnings for missing prototypes.

Pixel 4
I420ToRGB24_Opt (743 ms)
I420ToRGB24Filter_Opt (1331 ms)

Windows with skylake xeon:
x86 32 bit
I420ToRGB24_Opt (387 ms)
I420ToRGB24Filter_Opt (571 ms)
x64 64 bit
I420ToRGB24_Opt (384 ms)
I420ToRGB24Filter_Opt (582 ms)


Bug: libyuv:938, libyuv:830
Change-Id: Ie27f70816ec084437014f8a1c630ae011ee2348c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3900298
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-16 19:46:47 +00:00
Frank Barchard
65e7c9d570 MM21ToYUY2 and ABGRToJ420 conversion
MM21 to YUY2 use zip1 for performance

Cortex A510
Was MM21ToYUY2 (612 ms)
Now MM21ToYUY2 (573 ms)

Prefetches help Cortex A53
Was MM21ToYUY2 (4998 ms)
Now MM21ToYUY2 (1900 ms)

Pixel 4 Cortex A76
Was MM21ToYUY2 (215 ms)
Now MM21ToYUY2 (173 ms)

ABGRToJ420
- NEON, SSSE3 and AVX2 row functions
- J400, J420 and J422 formats.
- Added AVX2 for UV on ARGBToJ420.  Was SSSE3

Same code/performance as ARGBToJ420 but with constants re-ordered.
Pixel 4
ABGRToJ420_Opt (623 ms)
ABGRToJ422_Opt (702 ms)
ABGRToJ400_Opt (238 ms)

Skylake Xeon
With LIBYUV_BIT_EXACT which uses C for UV
ABGRToJ420_Opt (988 ms)
ABGRToJ422_Opt (1872 ms)
ABGRToJ400_Opt (186 ms)
Skylake Xeon using AVX2
ABGRToJ420_Opt (251 ms)
ABGRToJ422_Opt (245 ms)
ABGRToJ400_Opt (184 ms)
Skylake Xeon using SSSE3
ABGRToJ420_Opt (328 ms)
ABGRToJ422_Opt (362 ms)
ABGRToJ400_Opt (185 ms)

Bug: b/238137982
Change-Id: I559c3fe3fb80fa2ce5be3d8218736f9cbc627666
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832111
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-08-16 22:07:38 +00:00
Frank Barchard
95b14b2446 RAWToJ400 faster version for ARM
- Unrolled to 16 pixels
- Take constants via structure, allowing different colorspace and channel order
- Use ADDHN to add 16.5 and take upper 8 bits of 16 bit values, narrowing to 8 bits
- clang-format applied, affecting mips code

On Cortex A510
Was RAWToJ400_Opt (1623 ms)
Now RAWToJ400_Opt (862 ms)

C   RAWToJ400_Opt (1627 ms)

Bug: b/220171611
Change-Id: I06a9baf9650ebe2802fb6ff6dfbd524e2c06ada0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3534023
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-03-18 07:22:36 +00:00
Hao Chen
91bae707e1 Optimize functions for LASX in row_lasx.cc.
1. Optimize 18 functions in source/row_lasx.cc file.
2. Make small modifications to LSX.
3. Remove some unnecessary content.

Bug: libyuv:912
Change-Id: Ifd1d85366efb9cdb3b99491e30fa450ff1848661
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3507640
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-03-09 08:52:54 +00:00
Frank Barchard
42d76a342f RAWToJNV21 function with 2 step conversion
RAWToJ420 + J420ToNV21 on row level

Pixel 6
RAWToJNV21_Opt (320 ms)

Skylake Xeon
RAWToJNV21_Opt (302 ms)

Bug: b/220171611
Change-Id: I39dcce9cf56c576b95666bb4fb1baccf9fbc7f7a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3495876
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-03-01 19:33:49 +00:00
Frank Barchard
2c6bfc02d5 Remove MMI support
Bug: libyuv:916
Change-Id: I345b7e271ceb4b32fe91e292915e66be40812810
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3415817
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-26 08:41:33 +00:00
Hao Chen
dfe046d272 Add optimization functions in row_lsx.cc file.
Optimize 44 functions in source/row_lsx.cc file.
All test cases passed on loongarch platform.

Bug: libyuv:913
Change-Id: Ic80a5751314adc2e9bd435f2bbd928ab017a90f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3351467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-01-21 01:34:38 +00:00
Hao Chen
de8ae8c679 Add optimization functions in row_lasx.cc file.
Optimize 32 functions in source/row_lasx.cc file.
All test cases passed on loongarch platform.

Bug: libyuv:912
Signed-off-by: Hao Chen <chenhao@loongson.cn>
Change-Id: I7d3f649f753f72ca9bd052d5e0562dbc6f6ccfed
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3351466
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-21 01:34:38 +00:00
Frank Barchard
b179f1847a Enable SIMD for exact RGB to Y conversions
Bug: libyuv:908, b/202888439
Change-Id: Icc5470b85d91b441ded9958ee04b4f32246646f0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3230489
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2021-10-19 07:54:50 +00:00
Frank Barchard
55b97cb48f BIT_EXACT for unattenuate and attenuate.
- reenable Intel SIMD unaffected by BIT_EXACT
- add bit exact version of ARGBAttenuate, which uses ARM version of formula.
- add bit exact version of ARGBUnatenuate, which mimics the AVX code.

Apply clang format to cleanup code.

Bug: libyuv:908, b/202888439
Change-Id: Ie842b1b3956b48f4190858e61c02998caedc2897
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3224702
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-10-15 19:46:02 +00:00
Frank Barchard
daf9778a24 Fix for failed compile with armv-7a neon gcc
Bug: libyuv:907
Change-Id: I955e83c72b57ce5ba45730030b32f337be610a21
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3216739
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-10-12 18:17:50 +00:00
Yuan Tong
f37014fcff Add support for AR64 format
Add following conversions:
ARGB,ABGR <-> AR64,AB64
AR64 <-> AB64

R=fbarchard@chromium.org

Change-Id: I5ca5b40a98bffea11981e136afae4a511ba6c564
Bug: libyuv:886
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2746780
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-03-13 20:55:21 +00:00
Shiyou Yin
d9681c53b3 Refine conditional compilation for MSA and MMI.
This patch is a complement for commit bed9292f2cbba2f8f9ff0f1635a8aa17a311f2f9.
1. Supplement inspection for macro HAS_***TOUV*ROW_MMI/MSA.
2. Reduce calls to function TestCpuFlag().
3. Fix a mistake in source/convert.cc: line 1105.

Change-Id: I5e7f9fe367fa0f6d1db6f7644c5b48d4ad85fedb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2169342
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2020-04-29 19:13:23 +00:00
Shiyou Yin
bed9292f2c Move init process of msa after mmi.
Some processors support both MSA and MMI.
when they are enabled together, MSA will be preferd.
This patch move MSA initialization after MMI, so that
MSA can overide MMI and be setted to effective.

Change-Id: I8a52cce83ee4ec9727d47c99b287c9580329b149
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2155944
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2020-04-28 11:01:51 +00:00
Frank Barchard
fce0fed542 ARGBToY use 8 bit precision instead of 7 bit.
Neon and GCC Intel optimized, but win32 and mips not optimized.

BUG=libyuv:842, b/141482243

Change-Id: Ia56fa85c8cc1db51f374bd0c89b56d21ec94afa7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/1825642
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Miguel Casas <mcasas@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2019-10-07 23:01:10 +00:00
Frank Barchard
9b63884a3e Add ABGRToNV21 and ABGRToNV12
Fix ARGBToUVJRow_AVX2 constants for win32

BUG=libyuv:833, libyuv:839

Change-Id: Id4731a573d40d7a9b46fcc31c2fee295483e1ff6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/1739509
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Hirokazu Honda <hiroh@chromium.org>
2019-08-07 01:29:13 +00:00
Martin Storsjö
9b772abf97 Restore the file mode for source files
This was changed in 21be9122aadf7824efe3fc19b2a09ff253a688e1.

Change-Id: I6c04dc92f673557e10c231bd090ec8aa88b6bee4
Reviewed-on: https://chromium-review.googlesource.com/1146183
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2018-08-06 18:53:32 +00:00
lixia zhang
21be9122aa libyuv:loongson optimize compare/row/scale/rotate files with mmi.
Currently, libyuv supports MIPS SIMD Arch(MSA),
but libyuv does not supports MultiMedia Instruction(MMI)(such as loongson3a platform).

In order to improve performance of libyuv on loongson3a platform,
this provides optimize 98 functions with mmi.

BUG=libyuv:804

Change-Id: I8947626009efad769b3103a867363ece25d79629
Reviewed-on: https://chromium-review.googlesource.com/1122064
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2018-07-20 22:53:04 +00:00
Johann
b8696fde84 add const to casts
When casting a const value, ensure the cast is const as well.

BUG=webm:1509

Change-Id: I5b597fdcc148d111e9824bc7cf918fc5f24e970f
Reviewed-on: https://chromium-review.googlesource.com/996553
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2018-04-13 22:52:52 +00:00
Frank Barchard
83aa7512c1 AVX512 VMBI version of ARGBToRGB24
Use VMBI instructions but on AVX2 registers to avoid clockrate change.

Bug: libyuv:778
Test: LibYUVConvertTest.NV21ToRGB24_Opt
Change-Id: Id4f8ad1e0e142a380c8a46c5eab90ce145a10edd
Reviewed-on: https://chromium-review.googlesource.com/956609
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2018-03-10 02:04:48 +00:00
Frank Barchard
1d509f2178 ARGBToRGB24_AVX2 version
AVX2 port of SSSE3 conversion to output 24 bit RGB

Bug: libyuv:778
Test: LibYUVConvertTest.NV21ToRGB24_Opt
Change-Id: I14f7815522d1b790ecd2bb39d9a3441e803b694a
Reviewed-on: https://chromium-review.googlesource.com/953303
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2018-03-08 02:38:21 +00:00
Frank Barchard
e1f6c1c0b5 tidy applied with readability-inconsistent-declaration-parameter-name
Bug: libyuv:750
Test: builds and runs and passes more tidy tests
Change-Id: I023699a7aa61ea3f5e4a21647112691ea5739281
Reviewed-on: https://chromium-review.googlesource.com/902170
Reviewed-by: Weiyong Yao <braveyao@chromium.org>
2018-02-07 00:24:25 +00:00
Frank Barchard
664c735677 I420ToYUY2_AVX2 port
I420 and I422 To YUY2 and UYVY ported from SSE2 to AVX2.

Was SSE2
I420ToYUY2_Opt (135 ms)
I420ToUYVY_Opt (148 ms)
I422ToYUY2_Opt (145 ms)
I422ToUYVY_Opt (142 ms)

Now AVX2
I420ToYUY2_Opt (133 ms)
I420ToUYVY_Opt (130 ms)
I422ToYUY2_Opt (127 ms)
I422ToUYVY_Opt (137 ms)

Bug: libyuv:556
Test: out/Release/libyuv_unittest --sandbox_unittests --gtest_filter=*I42?To*UY*Opt
Change-Id: Ic35f97cee02dc009fd98785589ba17c7cf50bb35
Reviewed-on: https://chromium-review.googlesource.com/892493
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2018-02-01 00:33:25 +00:00
Frank Barchard
ffec313dbe ABGRToAR30 used AVX2 with reversed shuffler
vpshufb is used to reverse R and B channels;
Code is otherwise the same as ARGBToAR30.

Bug: libyuv:751
Test: ABGRToAR30 unittest
Change-Id: I30e02925f5c729e4496c5963ba4ba4af16633b3b
Reviewed-on: https://chromium-review.googlesource.com/891807
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2018-01-29 22:31:31 +00:00
Frank Barchard
92e22cf5b6 Lint cleanup after C99 change CL
TBR=braveyao@chromium.org
Bug: libyuv:774
Test: git cl lint
Change-Id: I51cf8107a8db17fbc9952d610f3e4d7aac5aa743
Reviewed-on: https://chromium-review.googlesource.com/882217
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2018-01-24 19:16:03 +00:00