289 Commits

Author SHA1 Message Date
George Steed
004352ba16 [AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.

Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.

Observed reduction in runtime compared to the existing Neon code:

            | AYUVToUVRow | AYUVToVURow
Cortex-A510 |      -33.1% |      -33.0%
Cortex-A720 |      -25.1% |      -25.1%
  Cortex-X2 |      -59.5% |      -53.9%
  Cortex-X4 |      -39.2% |      -39.4%

Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
George Steed
e348995a92 [AArch64] Optimize MergeXR30Row_10_NEON
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.

We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.

Reduction in runtime observed for MergeXR30Row_10_NEON:

 Cortex-A55: -34.2%
Cortex-A510: -35.6%
 Cortex-A76: -44.9%
  Cortex-X2: -48.3%

Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:32:55 +00:00
George Steed
9fac9a4a82 [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row
There are existing x86 implementations for these kernels but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | ABGRToAR30Row | ARGBToAR30Row
 Cortex-A55 |        -55.1% |        -55.1%
Cortex-A510 |        -39.3% |        -40.1%
 Cortex-A76 |        -62.3% |        -63.6%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:35:07 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
f2e78e1304 [AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow
Using the dot-product instructions here allows us to avoid needing LD4
for loading individual colour channels, which gives a big benefit on
some micro-architectures where such instructions perform significantly
worse than LD1. In addition the dot-product instructions have higher
throughput compared to the Neon

Observed reduction in runtimes for selected kernels moving from *_NEON
to *_NEON_DotProd:

     Kernel | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
ABGRToYJRow |      -6.5% |      -22.5% |     -43.5% |    -71.2%
 ABGRToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.3%
ARGBToYJRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 ARGBToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 BGRAToYRow |      -6.5% |      -22.5% |     -42.3% |    -68.4%
RGBAToYJRow |      -6.5% |      -22.5% |     -42.2% |    -73.7%
 RGBAToYRow |      -6.5% |      -22.5% |     -42.3% |    -64.9%

Bug: libyuv:977
Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:09:36 +00:00
Lu Wang
8670bcf17f Optimize the following 19 functions with LSX in row_lsx.cc.
UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX,
ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX,
ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX,
ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX,
ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX,
ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX,
ARGBSepiaRow_LSX

Bug: libyuv:913
Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-19 18:55:58 +00:00
Lu Wang
1d940cc570 Optimize the following functions with LSX.
MirrorRow_LSX, MirrorUVRow_LSX, ARGBMirrorRow_LSX,
I422ToYUY2Row_LSX, I422ToUYVYRow_LSX, I422ToARGBRow_LSX,
I422ToRGBARow_LSX, I422AlphaToARGBRow_LSX, I422ToRGB24Row_LSX,
I422ToRGB565Row_LSX, I422ToARGB4444Row_LSX, I422ToARGB1555Row_LSX,
YUY2ToYRow_LSX, YUY2ToUVRow_LSX, YUY2ToUV422Row_LSX

Bug: libyuv:913
Change-Id: I46cec605001d7ddd73846eed6d0a77f936b6dc53
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4515191
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 00:25:48 +00:00
Frank Barchard
ee3e71c7ce Any functions use memset(vin, 0, sizeof(vin)) for GCC warning fix
- Fix -Wmemset-elt-size warning for GCC
- Use vin for inputs and vout for outputs

Bug: None
Change-Id: Iefd418dc884b4d062e1fdd9215319c8838c49eaa
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4412065
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2023-04-10 20:44:10 +00:00
James Zern
0200037a5a row_any,ANYDETILE: fix -Wmemset-elt-size warning
under gcc 12.2.0 using -Wall:

source/row_any.cc: In function ‘void libyuv::DetileRow_16_Any_SSE2(const
                       uint16_t*, ptrdiff_t, uint16_t*, int)’:
source/row_any.cc:2287:11: warning: ‘memset’ used with length equal to
number of elements without multiplication by element size
[-Wmemset-elt-size]
 2287 |     memset(temp, 0, 16 * BPP); /* for msan */
      |     ~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
source/row_any.cc:2308:1: note: in expansion of macro ‘ANYDETILE’
 2308 | ANYDETILE(DetileRow_16_Any_SSE2, DetileRow_16_SSE2, uint16_t, 2, 15)

This increases the memset to the full buffer size, which may not be
strictly necessary.

Change-Id: Iea2fc649990ee84ea9aa8020d6f6b25e012b18fb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4406599
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-04-08 19:01:02 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Hao Chen
0809713775 Refine some functions on the Longarch platform.
Add ARGBToYMatrixRow_LSX/LASX, RGBAToYMatrixRow_LSX/LASX and
RGBToYMatrixRow_LSX/LASX functions with RgbConstants argument.

Bug: libyuv:912
Change-Id: I956e639d1f0da4a47a55b79c9d41dcd29e29bdc5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4167860
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-18 18:54:14 +00:00
Frank Barchard
ea26d7adb1 DetilePlane_16 AVX version
- fix ifdefs for DetilePlane_16 to use 16 bit versions, not 8 bit.  (no functional change)

Bug: b/258474032
Change-Id: Ic07e02d9801e21126ebee0ceb5779aa712a493ce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4034812
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-18 23:59:06 +00:00
Frank Barchard
2d2cee418a Add Detile_16 planar function for 10 bit MT2T format
- Neon and SSE2
- Any for odd widths

Pixel 2 little core AArch32 build
C
TestDetilePlane_16 (1275 ms)
TestDetilePlane (1203 ms)
Neon
TestDetilePlane_16 (693 ms)
TestDetilePlane (660 ms)

Bug: b/258474032
Change-Id: Idbd09c5e9324e4deef5f1d54090d4b63cc7db812
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4031848
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-11-17 02:47:57 +00:00
Frank Barchard
00950840d1 YUY2ToNV12 using YUY2ToY and YUY2ToNVUV
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
  - Was SplitUV, memcpy Y, InterpolateUV
  - Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA

3840x2160 1000 iterations:

Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)

AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)

Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)

SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)

Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)

Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-30 22:41:21 +00:00
Frank Barchard
248172e2ba I422ToRGB24, I422ToRAW, I422ToRGB24MatrixFilter conversion functions added.
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits

Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)

AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)

Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)

Bug: libyuv:938


Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-20 02:00:52 +00:00
Frank Barchard
3e38ce5058 SSE2 MM21->YUY2 conversion
Add SSE2 optimization for MM21ToYUY2 conversion.

Bug: b/238137982
Change-Id: I189f712514308322f651b082b496bce9c015c4ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832525
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-08-17 18:39:05 +00:00
Frank Barchard
65e7c9d570 MM21ToYUY2 and ABGRToJ420 conversion
MM21 to YUY2 use zip1 for performance

Cortex A510
Was MM21ToYUY2 (612 ms)
Now MM21ToYUY2 (573 ms)

Prefetches help Cortex A53
Was MM21ToYUY2 (4998 ms)
Now MM21ToYUY2 (1900 ms)

Pixel 4 Cortex A76
Was MM21ToYUY2 (215 ms)
Now MM21ToYUY2 (173 ms)

ABGRToJ420
- NEON, SSSE3 and AVX2 row functions
- J400, J420 and J422 formats.
- Added AVX2 for UV on ARGBToJ420.  Was SSSE3

Same code/performance as ARGBToJ420 but with constants re-ordered.
Pixel 4
ABGRToJ420_Opt (623 ms)
ABGRToJ422_Opt (702 ms)
ABGRToJ400_Opt (238 ms)

Skylake Xeon
With LIBYUV_BIT_EXACT which uses C for UV
ABGRToJ420_Opt (988 ms)
ABGRToJ422_Opt (1872 ms)
ABGRToJ400_Opt (186 ms)
Skylake Xeon using AVX2
ABGRToJ420_Opt (251 ms)
ABGRToJ422_Opt (245 ms)
ABGRToJ400_Opt (184 ms)
Skylake Xeon using SSSE3
ABGRToJ420_Opt (328 ms)
ABGRToJ422_Opt (362 ms)
ABGRToJ400_Opt (185 ms)

Bug: b/238137982
Change-Id: I559c3fe3fb80fa2ce5be3d8218736f9cbc627666
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832111
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-08-16 22:07:38 +00:00
Frank Barchard
6900494d90 Merge/SplitRGB fix -mcmodel=large x86 and InterpolateRow_16To8_NEON
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.

- fixes an out of registers error with -mcmodel=large

InterpolateRow_16To8_NEON improves performance for I210ToI420:

On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)

On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)


Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-06-29 00:00:46 +00:00
Frank Barchard
e906ba9fe9 InterpolateRow_Any test if fraction is 0 and dont memcpy 2nd row.
Bug: b/228605787
Change-Id: Ia8912e4c1599401320ee82882a2593e78bf56582
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3708833
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-06-17 18:15:09 +00:00
Frank Barchard
30f9b28048 Add I210ToI420
Bug: libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
Change-Id: Ib135d0b4ff17665f6a4ab60edb782a7b314219a4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3696042
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2022-06-09 08:07:50 +00:00
Frank Barchard
d011314f14 Revert "I210ToI420, InterpolatePlane_16, and ScalePlane Vertical-only asan fix"
This reverts commit 60254a1d846a93a4d7559009004cdd91bcc04d82.

Reason for revert: breaks PaintCanvasVideoRendererTest.HighBitDepth

Original change's description:
> I210ToI420, InterpolatePlane_16, and ScalePlane Vertical-only asan fix
>
> - Add I210ToI420 to convert 10 bit 4:2:2 YUV to 4:2:0 8 bit
> - Add NEON InterpolateRow_16 for fast 10 bit scaling
> - When scaling up, set step to interpolate toward height - 1 to avoid buffer overread
> - When scaling down, center the 2 rows used for source to achieve filtering.
> - CopyPlane check for 0 size and return
>
> Bug:  libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
> Change-Id: I63e8580710a57812b683c2fe40583ac5a179c4f1
> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3687552
> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
> Reviewed-by: richard winterton <rrwinterton@gmail.com>

Bug: libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
Change-Id: Icc05bb340db0e7fe864061fb501d0a861c764116
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3692886
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2022-06-07 09:16:05 +00:00
Frank Barchard
60254a1d84 I210ToI420, InterpolatePlane_16, and ScalePlane Vertical-only asan fix
- Add I210ToI420 to convert 10 bit 4:2:2 YUV to 4:2:0 8 bit
- Add NEON InterpolateRow_16 for fast 10 bit scaling
- When scaling up, set step to interpolate toward height - 1 to avoid buffer overread
- When scaling down, center the 2 rows used for source to achieve filtering.
- CopyPlane check for 0 size and return

Bug:  libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
Change-Id: I63e8580710a57812b683c2fe40583ac5a179c4f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3687552
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2022-06-07 01:41:56 +00:00
Frank Barchard
eb2c88e499 Convert16To8 NEON
Pixel 3
Was C    I010ToI420_Opt (749 ms)
Now NEON I010ToI420_Opt (356 ms)

Pixel 4
Was C    I010ToI420_Opt (581 ms)
Now NEON I010ToI420_Opt (163 ms)

Bug: b/233233302, b/233634772
Change-Id: I60a84648a66f77d97c0a7822b29bd18b8e3a3355
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3661401
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-05-24 18:07:16 +00:00
Frank Barchard
95b14b2446 RAWToJ400 faster version for ARM
- Unrolled to 16 pixels
- Take constants via structure, allowing different colorspace and channel order
- Use ADDHN to add 16.5 and take upper 8 bits of 16 bit values, narrowing to 8 bits
- clang-format applied, affecting mips code

On Cortex A510
Was RAWToJ400_Opt (1623 ms)
Now RAWToJ400_Opt (862 ms)

C   RAWToJ400_Opt (1627 ms)

Bug: b/220171611
Change-Id: I06a9baf9650ebe2802fb6ff6dfbd524e2c06ada0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3534023
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-03-18 07:22:36 +00:00
Hao Chen
91bae707e1 Optimize functions for LASX in row_lasx.cc.
1. Optimize 18 functions in source/row_lasx.cc file.
2. Make small modifications to LSX.
3. Remove some unnecessary content.

Bug: libyuv:912
Change-Id: Ifd1d85366efb9cdb3b99491e30fa450ff1848661
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3507640
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-03-09 08:52:54 +00:00
Justin Green
b4ddbaf549 Add support for MM21.
Add support for MM21 to NV12 and I420 conversion, and add SIMD
optimizations for arm, aarch64, SSE2, and SSSE3 machines.

Bug: libyuv:915, b/215425056
Change-Id: Iecb0c33287f35766a6169d4adf3b7397f1ba8b5d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3433269
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Justin Green <greenjustin@google.com>
2022-02-03 17:01:49 +00:00
Frank Barchard
2c6bfc02d5 Remove MMI support
Bug: libyuv:916
Change-Id: I345b7e271ceb4b32fe91e292915e66be40812810
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3415817
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-26 08:41:33 +00:00
Hao Chen
dfe046d272 Add optimization functions in row_lsx.cc file.
Optimize 44 functions in source/row_lsx.cc file.
All test cases passed on loongarch platform.

Bug: libyuv:913
Change-Id: Ic80a5751314adc2e9bd435f2bbd928ab017a90f9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3351467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-01-21 01:34:38 +00:00
Hao Chen
de8ae8c679 Add optimization functions in row_lasx.cc file.
Optimize 32 functions in source/row_lasx.cc file.
All test cases passed on loongarch platform.

Bug: libyuv:912
Signed-off-by: Hao Chen <chenhao@loongson.cn>
Change-Id: I7d3f649f753f72ca9bd052d5e0562dbc6f6ccfed
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3351466
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-21 01:34:38 +00:00
Hao Chen
51de1e16f2 Add supports for loongarch LSX and LASX.
1. Add supports for LSX and LASX.
2. Three optimization functions are added in loongarch/row_lasx.cc file:
   I422ToARGBRow_LASX,I422ToRGBARow_LASX,I422AlphaToARGBRow_LASX.

Bug: libyuv:912, Bug: libyuv:913
Change-Id: I043c2704f99a5215724b5c0b7f97e6bf5f7a199b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3329189
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-20 19:25:38 +00:00
Frank Barchard
90ffd5cba9 I420ToARGB for AVX512
On Skylake Xeon
AVX512  I420ToARGB_Opt (2050 ms)
AVX2    I420ToARGB_Opt (2533 ms)
SSSE3   I420ToARGB_Opt (3688 ms)

Bug: libyuv:911
Change-Id: I2214cc15dec24b06541895ca59d88990edbb2216
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3382100
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-01-14 09:17:33 +00:00
Frank Barchard
d7a2d5da87 J400ToARGB optimized for Exynos using ZIP+ST1
Bug: 204562143
Change-Id: I56c98198c02bd0dd1283f1c14837730c92832c39
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3328702
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-12-10 01:00:07 +00:00
Frank Barchard
000806f373 NV21ToYUV24 replace ST3 with ST1. ARGBToAR64 replace ST2 with ST1
On Samsung S8 Exynos M2
Was ST3 NV21ToYUV24_Opt (769 ms)
Now ST1 NV21ToYUV24_Opt (473 ms)
Was ST2 ARGBToAR64_Opt (1759 ms)
Now ST1 ARGBToAR64_Opt (987 ms)

Skylake Xeon, AVX2 version:
Was NV21ToYUV24_Opt (885 ms)
Now NV21ToYUV24_Opt (194 ms)

Bug: b/204562143, b/124413599
Change-Id: Icc9cb64d822cd11937789a4e04fbb773b3e33aa3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3290664
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-11-24 07:38:49 +00:00
Frank Barchard
b179f1847a Enable SIMD for exact RGB to Y conversions
Bug: libyuv:908, b/202888439
Change-Id: Icc5470b85d91b441ded9958ee04b4f32246646f0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3230489
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2021-10-19 07:54:50 +00:00
Frank Barchard
55b97cb48f BIT_EXACT for unattenuate and attenuate.
- reenable Intel SIMD unaffected by BIT_EXACT
- add bit exact version of ARGBAttenuate, which uses ARM version of formula.
- add bit exact version of ARGBUnatenuate, which mimics the AVX code.

Apply clang format to cleanup code.

Bug: libyuv:908, b/202888439
Change-Id: Ie842b1b3956b48f4190858e61c02998caedc2897
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3224702
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-10-15 19:46:02 +00:00
Frank Barchard
daf9778a24 Fix for failed compile with armv-7a neon gcc
Bug: libyuv:907
Change-Id: I955e83c72b57ce5ba45730030b32f337be610a21
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3216739
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-10-12 18:17:50 +00:00
Frank Barchard
287158925b use width + 1 for odd width tests
Bug: libyuv:894, libyuv:898, libyuv:899
Change-Id: Ieba8eaeb8b06f0323824967776673e339b263220
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2809701
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2021-04-09 20:17:55 +00:00
Yuan Tong
2cd098f83b Fix MergeAR64Plane on odd width
R=fbarchard@chromium.org

Bug: libyuv:898
Change-Id: I031e008ea91baba1c7598efa0eda70750cbfce85
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2810066
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2021-04-08 09:18:34 +00:00
Frank Barchard
60db98b6fa clang-tidy applied
Bug: libyuv:886, libyuv:889
Change-Id: I2d14d03c19402381256d3c6d988e0b7307bdffd8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2800147
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-04-01 21:42:47 +00:00
Yuan Tong
8a13626e42 Add MergeAR30Plane, MergeAR64Plane, MergeARGB16To8Plane
These functions merge high bit depth planar RGB pixels into packed format.

Change-Id: I506935a164b069e6b2fed8bf152cb874310c0916
Bug: libyuv:886, libyuv:889
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2780468
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-03-31 20:46:02 +00:00
Yuan Tong
f37014fcff Add support for AR64 format
Add following conversions:
ARGB,ABGR <-> AR64,AB64
AR64 <-> AB64

R=fbarchard@chromium.org

Change-Id: I5ca5b40a98bffea11981e136afae4a511ba6c564
Bug: libyuv:886
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2746780
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2021-03-13 20:55:21 +00:00
Frank Barchard
ba033a11e3 Add 12 bit YUV to 10 bit RGB
Bug: libyuv:843
Change-Id: I0104c8fcaeed09e83d2fd654c6a5e7d41bcb74cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2727775
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2021-03-05 01:09:37 +00:00
Yuan Tong
cdabad5bfa Add more 10 bit YUV To RGB function
The following functions are added:
planar YUV:
 I410ToAR30, I410ToARGB
planar YUVA:
 I010AlphaToARGB, I210AlphaToARGB, I410AlphaToARGB
biplanar YUV:
 P010ToARGB, P210ToARGB
 P010ToAR30, P210ToAR30

biplanar functions can also handle 12 bit and 16 bit samples.

libyuv_unittest --gtest_filter=LibYUVConvertTest.*10*ToA*:LibYUVConvertTest.*P?1?ToA*

R=fbarchard@chromium.org

Bug: libyuv:751, libyuv:844
Change-Id: I2be02244dfa23335e1e7bc241fb0613990208de5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2707003
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2021-03-03 15:48:47 +00:00
Yuan Tong
a8c181050c Add 10/12 bit YUV To YUV functions
The following functions (and their 12 bit variant) are added:

planar, 10->10:
 I410ToI010, I210ToI010

planar, 10->8:
 I410ToI444, I210ToI422

planar<->biplanar, 10->10:
 I010ToP010, I210ToP210, I410ToP410
 P010ToI010, P210ToI210, P410ToI410

R=fbarchard@chromium.org

Change-Id: I9aa2bafa0d6a6e1e38ce4e20cbb437e10f9b0158
Bug: libyuv:834, libyuv:873
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2709822
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2021-02-25 23:16:54 +00:00
Frank Barchard
c28d404936 win32 build fix for I422ToRGBA
Bug:  libyuv:877, b/178713286
Change-Id: Iad55df99083b9a4bb9306e052e0e687e58570d96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2657701
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2021-01-29 10:07:08 +00:00
Yuan Tong
a85cc26fde Add MergeARGBPlane and SplitARGBPlane
These functions convert between planar and interleaved ARGB,
optionally fill 255 to alpha / discard alpha.

This can help handle YUV(A) with Identity matrix, which is
basically planar ARGB.

libyuv_unittest --gtest_filter=LibYUVPlanarTest.*ARGBPlane*:LibYUVPlanarTest.*XRGBPlane*

R=fbarchard@google.com

Change-Id: I522a189b434f490ba1723ce51317727e7c5eb112
Bug: libyuv:877
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/2649887
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2021-01-27 19:33:51 +00:00