205 Commits

Author SHA1 Message Date
George Steed
22c5c18778 [AArch64] Add SME implementation of I422ToARGBRow
Including addition of a new row_sme.cc file and associated
infrastructure.

The actual implementation in this case is a pure streaming-SVE (SSVE)
implementation based on the existing SVE2 implementation, we do not use
the ZA tile.

Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-29 05:49:28 +00:00
George Steed
22ac86800e [AArch64] Add SVE2 implementation of I422ToARGB4444Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -35.5%
Cortex-A520: -38.2%
Cortex-A715: -19.8%
Cortex-A720: -19.8%
  Cortex-X2: -24.2%
  Cortex-X3: -24.1%
  Cortex-X4: -21.6%
Cortex-X925: -19.5%

Bug: b/42280942
Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f4eaeca22a [AArch64] Add SVE2 implementation of I422ToARGB1555Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.8%
Cortex-A520: -42.6%
Cortex-A715: -22.5%
Cortex-A720: -22.6%
  Cortex-X2: -22.7%
  Cortex-X3: -22.4%
  Cortex-X4: -19.4%
Cortex-X925: -27.0%

Bug: b/42280942
Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
f40042533c [AArch64] Add SVE2 implementation of I422ToRGB565Row
This makes use of the same approach as the Neon code to avoid redundant
narrowing and then widening shifts by instead placing the values at the
top portion of the lanes and then shifting down from there instead.

Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -41.1%
Cortex-A520: -38.2%
Cortex-A715: -21.5%
Cortex-A720: -21.6%
  Cortex-X2: -21.6%
  Cortex-X3: -22.0%
  Cortex-X4: -23.5%
Cortex-X925: -21.7%

Bug: b/42280942
Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-10-24 21:27:39 +00:00
George Steed
0dce974ca0 [AArch64] Add SVE2 implementation of I422ToRGB24Row
Observed reduction in runtime compared to the existing Neon code:

Cortex-A510: -57.8%
Cortex-A520: -41.7%
Cortex-A715: -28.0%
Cortex-A720: -28.1%
  Cortex-X2: -29.7%
  Cortex-X3: -28.7%
  Cortex-X4: -30.5%
Cortex-X925: -30.3%

Bug: b/42280942
Change-Id: I328bd16babda75fb089c8da8f2714465f658187e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-10-24 02:17:32 +00:00
George Steed
d5303f4f77 [AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors
Processing more data per loop iteration means that we can use the full
128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN
+ XTN2 in a single instruction.

The early Cortex-X cores are not a fan of ST4 .16b with a
post-increment, so split out the pointer increment to a separate
instruction to avoid this bottleneck.

Reductions in runtime observed for ARGB1555ToARGBRow_NEON:

 Cortex-A55: -18.1%
Cortex-A510: -11.2%
Cortex-A520: -39.5%
 Cortex-A76: -18.0%
Cortex-A715: -34.8%
Cortex-A720: -34.8%
  Cortex-X1:  -0.9%
  Cortex-X2:  -4.6%
  Cortex-X3:  -3.6%
  Cortex-X4: -20.8%

Bug: libyuv:976
Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-09-16 04:36:43 +00:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
George Steed
555f80f3ce [AArch64] Add SVE2 implementation of RGB24ToARGBRow
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar, we just need to
adjust the TBL constants to match the different input layout.

Observed reduction in runtime compared to the existing Neon kernel:

Cortex-A510: -25.6%
Cortex-A720: -15.2%
  Cortex-X2: -10.2%
  Cortex-X4: -30.2%

Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-08 20:12:05 +00:00
George Steed
c613c3f102 [AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.

Reduction in runtimes observed compared to the existing Neon
implementation:

            | RAWToARGBRow | RAWToRGBARow
Cortex-A510 |       -32.4% |       -31.9%
Cortex-A720 |       -15.7% |       -15.6%
  Cortex-X2 |       -24.6% |       -24.4%

Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-06 22:40:15 +00:00
George Steed
d1ec694ad3 [AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

              | Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow |     -59.8% |      -16.8% |     -53.2%
P210ToAR30Row |     -48.1% |      -21.8% |     -54.0%
P410ToARGBRow |     -56.5% |      -32.2% |     -54.1%
P410ToAR30Row |     -42.4% |       -4.5% |     -50.4%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-06 22:37:08 +00:00
George Steed
d32436e8f8 [AArch64] Add Neon implementation for I422ToAR30Row_NEON
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 Cortex-A55: -43.1%
Cortex-A510: -22.3%
 Cortex-A76: -54.8%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-07-02 18:16:33 +00:00
George Steed
bbd9cedc4f [AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToAR30Row | I210ToARGBRow
 Cortex-A55 |        -40.8% |        -54.4%
Cortex-A510 |        -26.2% |        -22.7%
 Cortex-A76 |        -49.2% |        -44.5%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-07-02 18:16:33 +00:00
George Steed
367dd50755 [AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.

Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits since all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.

Observed reductions in runtimes compared to the existing Neon code:

            | UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 |        -30.2% |        -30.2%
Cortex-A720 |         -4.8% |         -4.7%
  Cortex-X2 |         -9.6% |        -10.1%

Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:06:46 +00:00
George Steed
cd4113f4e8 [AArch64] Add SVE2 implementation of I400ToARGBRow
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.

Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.

Reduction in runtime observed compared to the existing Neon code:

Cortex-A510: -26.1%
Cortex-A520:  -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
  Cortex-X2: -22.5%
  Cortex-X3: -23.5%
  Cortex-X4: -21.6%

Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-13 22:02:06 +00:00
George Steed
34abe98fe2 [AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.

Observed reductions in runtimes compared to the existing Neon code:

            | NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 |             -29.1% |             -29.1%
Cortex-A720 |              -4.8% |              -4.8%
  Cortex-X2 |              -9.2% |              -9.2%

Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-12 16:24:40 +00:00
George Steed
96bbdb53ed [AArch64] Add SVE2 implementation of I422ToRGBARow
This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we
just need to interleave differently for the output.

The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.

Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.

Reductions in runtime observed compared to the existing Neon code:

Cortex-A510: -27.0%
Cortex-A720:  -5.3%
  Cortex-X2: -14.7%

Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
d0da5a3298 [AArch64] Add SVE2 implementation of ARGB1555ToARGBRow
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.

Observed reduction in runtime relative to the existing Neon code:

Cortex-A510: -69.7%
Cortex-A720:  -7.7%
  Cortex-X2: -41.9%
  Cortex-X4: -14.5%

Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
250e1e1ba3 [AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow
Observed performance improvements compared to the existing Neon
implementation:

Cortex-A510: -21.7%
Cortex-A720: -49.2%
  Cortex-X2: -62.6%

Bug: libyuv:973
Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-06-03 23:15:04 +00:00
George Steed
6c70eb2819 [AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

 I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
 I210ToAR30Row on Cortex-A76: -50.4%
 I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
 I410ToAR30Row on Cortex-A76: -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-06-03 22:46:12 +00:00
George Steed
812b4955b2 [AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON
There is are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210ToARGBRow | I410ToARGBRow
 Cortex-A55 |        -55.6% |        -56.2%
Cortex-A510 |        -22.6% |        -35.6%
 Cortex-A76 |        -48.1% |        -57.2%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 17:40:48 +00:00
George Steed
5b4160b9c3 [AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.

Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:

            | I210AlphaToARGBRow | I410AlphaToARGBRow
 Cortex-A55 |             -55.3% |             -56.1%
Cortex-A510 |             -27.9% |             -42.6%
 Cortex-A76 |             -54.9% |             -60.3%

Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-31 08:41:31 +00:00
Wan-Teh Chang
ec6f15079f Remove unneeded #ifdef HAVE_JPEG code
Change-Id: Ic7e1393b48bec735625197243b3d436ea01cfb07
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5529467
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-09 23:02:18 +00:00
George Steed
f483007b9a [AArch64] Add SVE implementation for I422AlphaToARGBRow
This is mostly identical to the existing I422ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -32.1%
Cortex-A720:  -5.1%
  Cortex-X2: -10.1%

Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:07 +00:00
George Steed
b53b27d6bf [AArch64] Add SVE implementation for I444AlphaToARGBRow
This is mostly identical to the existing I444ToARGBRow_SVE
implementation, we just need to make sure to load the alpha component
rather than hard-coding it to 255.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -34.2%
Cortex-A720: -17.6%
  Cortex-X2:  -9.6%

Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-29 18:54:02 +00:00
George Steed
6ac90403a1 [AArch64] Add SVE implementation for I422ToARGBRow
We need a new macro for reading I422 data, but is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.

Reduction in runtimes observed compared to the existing Neon code:

Cortex-A510: -25.0%
Cortex-A720:  -5.0%
  Cortex-X2: -10.8%

Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-04-27 18:26:11 +00:00
George Steed
e52007eff9 [AArch64] Add SVE2 implementation for I444ToARGBRow
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:

* For the Y component calculation we are able to use UMULH, versus the
  existing UMULL x2 + UZP2 sequence in Neon.

* For the RGBTORGBA8 calculation we are able to take advantage of
  interleaving narrowing instructions, allowing us to use ST2 rather
  than ST4 for the store. This is a big performance win on some
  micro-architectures where ST4 is costly.

* The use of predication means we do not need to add "any" kernels, we
  can simply rerun the calculation with a not-full predicate for the
  final iteration.

To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligable impact on other
micro-architectures.

Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:

Cortex-A510: -36.5%
Cortex-A720: -17.3%
  Cortex-X2: -11.3%

Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:11:01 +00:00
Frank Barchard
914624f0b8 YUY2ToARGBMatrix and UYVYToARGBMatrix added to allow any color matrix
Bug: libyuv:971
Change-Id: If15d4598d75500a3717f07d02c0c295fdc58254e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5214453
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-01-19 21:21:37 +00:00
Frank Barchard
def473f501 malloc return 1 for failures and assert for internal functions
Bug: libyuv:968
Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-12-04 22:55:20 +00:00
Frank Barchard
31e1d6f896 Check allocations that return NULL and return early
BUG=libyuv:968

Change-Id: I9e8594440a6035958511f9c50072820131331fc8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-10-27 17:41:36 +00:00
Bruce Lai
ec2e9ca000 [RVV] Support AR64ToAB64 and RGBA-family color conversions
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
They are originally implemented by ARGBShffle.
This CL independetly implements them, and only enables for risc-v now.
This CL also add RVV implementation for `RGBA-family <-> RGBA-family` color conversions.

* Run on SiFive internal FPGA(VLEN=128):

Test Case	Speedup
AR64ToAB64_Opt  x4.6
ARGBToRGBA_Opt  x6
ARGBToBGRA_Opt  x6
ARGBToABGR_Opt  x6
RGBAToARGB_Opt  x6

Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-09-05 22:44:48 +00:00
Darren Hsieh
10de943a12 [RVV] Enable ScaleRowUp2_(Bi)linear_RVV/ScaleUVRowUp2_(Bi)linear_RVV
ScaleUVRowUp2_(Bi)linear_RVV function is equal to other platforms' ScaleRowUp2_(Bi)linear_Any_XXX.
We process entire row in this function.
Other platforms only implement non-edge part of image and process edge with scalar.
ScaleRowUp2_(Bi)linear_Any_XXX: Combine ScaleRowUp2_(Bi)linear_XXX(non-edge) + ScaleRowUp2_(Bi)linear_C(edge) by SBUH2LANY/SU2BLANY.

* Run on SiFive internal FPGA:

Test case                       RVV function			Speedup
I444ScaleFrom640x360_Bilinear	ScaleRowUp2_Bilinear_RVV	8.21
I444ScaleFrom640x360_Linear	ScaleRowUp2_Linear_RVV	        8.08
UVScaleFrom640x360_Bilinear	ScaleUVRowUp2_Bilinear_RVV	7.80
UVScaleFrom640x360_Linear	ScaleUVRowUp2_Linear_RVV	7.03

Change-Id: I539245ce51858f077506a78f0e7e82377ac6a95d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4666062
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-07-26 18:05:50 +00:00
Darren Hsieh
aed6dbef17 [RVV] Enable NV{12,21}To{ARGB,RGB24}Row_RVV
* Run on SiFive internal FPGA(w/ -march=rv64gcv):

Test Case	Speedup
NV12ToARGB_Opt	12.0
NV21ToARGB_Opt	12.1
NV12ToABGR_Opt	12.6
NV21ToABGR_Opt	12.0
NV12ToRGB24_Opt	12.5
NV21ToRGB24_Opt	11.7
NV12ToRAW_Opt	12.1
NV21ToRAW_Opt	11.4

Change-Id: Icae2bac2b4ebbd4c5a89e847fde9a74fe6481878
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4707804
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-07-24 17:07:01 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Wan-Teh Chang
179b0203e5 Enable {J400/I400}ToARGBRow_RVV
Run on SiFive internal FPGA*:

I400ToARGB_Opt (~8x vs scalar)
J400ToARGB_Opt (~10x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Bug: libyuv:956, libyuv:961
Change-Id: If4e21ec85c4ff79083ec16a6faae0e457129a8de
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4544972
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
2023-05-20 23:29:33 +00:00
Lu Wang
8670bcf17f Optimize the following 19 functions with LSX in row_lsx.cc.
UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX,
ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX,
ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX,
ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX,
ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX,
ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX,
ARGBSepiaRow_LSX

Bug: libyuv:913
Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-19 18:55:58 +00:00
Frank Barchard
a37799344d ARGBToI420Alpha function to convert ARGB to I420 with Alpha
Bug: b/281866362
Change-Id: Ic1093a887fb483f134c78909cf1ee7495e7345ba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4534100
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2023-05-17 00:23:24 +00:00
Bruce Lai
11d4536002 Enable I{422,444}AlphaToARGBRow_RVV & ARGBAttentuateRow_RVV
Run on SiFive internal FPGA:

I444AlphaToARGB_Opt (~16x vs scalar)
I422AlphaToARGB_Opt (~10x vs scalar)
ARGBAttenuate_Opt (~3x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Change-Id: I0046eb7af8104bc8e13cee1cb91a19f90940d5b0
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4535657
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-16 19:20:49 +00:00
Darren Hsieh
497ea35688 Enable I444To{ARGB,RGB24}Row_RVV
Run on SiFive internal FPGA:

I444ToARGB_Opt (~16x vs scalar)
I444ToRGB24_Opt (~10x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Change-Id: Idae7dc46ef648beaa14b58ba3eb56b67b17c9b3b
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4520761
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 19:50:56 +00:00
Darren Hsieh
964d963afb Enable I422To{ARGB,RGBA,RGB24}Row_RVV
Run on SiFive internal FPGA:

I422ToARGB_Opt (~10x vs scalar)
I422ToRGBA_Opt (~10x vs scalar)
I420ToRGB24_Opt (~8x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

This CL manually sets rounding mode,
since we use fixed-point vector narrowing clip.
There is no definition about default value for fixed-point rounding mode.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#38-vector-fixed-point-rounding-mode-register-vxrm
The behavior could be different on differet paltforms. To avoid unexpected behavior, we set rounding mode manually.

Change-Id: I90f0dcb90c37f7da7caab8eb1df6c9c7a3c874a8
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4512373
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 00:29:20 +00:00
Lu Wang
1d940cc570 Optimize the following functions with LSX.
MirrorRow_LSX, MirrorUVRow_LSX, ARGBMirrorRow_LSX,
I422ToYUY2Row_LSX, I422ToUYVYRow_LSX, I422ToARGBRow_LSX,
I422ToRGBARow_LSX, I422AlphaToARGBRow_LSX, I422ToRGB24Row_LSX,
I422ToRGB565Row_LSX, I422ToARGB4444Row_LSX, I422ToARGB1555Row_LSX,
YUY2ToYRow_LSX, YUY2ToUVRow_LSX, YUY2ToUV422Row_LSX

Bug: libyuv:913
Change-Id: I46cec605001d7ddd73846eed6d0a77f936b6dc53
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4515191
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 00:25:48 +00:00
Bruce Lai
1330a79e9f Optimized AR64/AB64 <-> ARGB with RVV
* Run on SiFive internal FPGA:

ARGBToAR64_Opt (~13.7x vs scalar)
ARGBToAB64_Opt (~5.81x vs scalar)
AR64ToARGB_Opt (~15.8x vs scalar)
AB64ToARGB_Opt (~2.40x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Bug: libyuv:956

Change-Id: Ida642a5077f59d25fb7c5328f671956b2293dadd
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4442913
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-04-20 19:49:55 +00:00
Darren Hsieh
44396e6e9a Add ARGBToRAWRow_RVV, ARGBToRGB24Row_RVV, RGB24ToARGBRow_RVV
* Run on SiFive internal FPGA:

ARGBToRAW_Opt (~1.55x vs scalar)

ARGBToRGB24_Opt (~1.44x vs scalar)

RGB24ToARGB_Opt (~1.77x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Bug: libyuv:956

Change-Id: I26722f6848cd68684d95d9a7ee06ce0416e7985d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4413083
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-04-13 19:33:16 +00:00
Darren Hsieh
e8af6cb2e4 Add RAWToARGBRow_RVV,RAWToRGBARow_RVV,RAWToRGB24Row_RVV
* Run on SiFive internal FPGA:

RAWToARGB_Opt (~2x vs scalar)

RAWToRGBA_Opt (~2x vs scalar)

RAWToRGB24_Opt (~1.5x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Change-Id: I21a13d646589ea2aa3822cb9225f5191068c285b
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4408357
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-04-07 18:45:08 +00:00
Frank Barchard
248172e2ba I422ToRGB24, I422ToRAW, I422ToRGB24MatrixFilter conversion functions added.
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits

Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)

AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)

Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)

Bug: libyuv:938


Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-09-20 02:00:52 +00:00
Frank Barchard
f71c83552d I420ToRGB24MatrixFilter function added
- Implemented as 3 steps: Upsample UV to 4:4:4, I444ToARGB, ARGBToRGB24
- Fix some build warnings for missing prototypes.

Pixel 4
I420ToRGB24_Opt (743 ms)
I420ToRGB24Filter_Opt (1331 ms)

Windows with skylake xeon:
x86 32 bit
I420ToRGB24_Opt (387 ms)
I420ToRGB24Filter_Opt (571 ms)
x64 64 bit
I420ToRGB24_Opt (384 ms)
I420ToRGB24Filter_Opt (582 ms)


Bug: libyuv:938, libyuv:830
Change-Id: Ie27f70816ec084437014f8a1c630ae011ee2348c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3900298
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-16 19:46:47 +00:00
Vignesh Venkatasubramanian
a5a1102a60 Add I422ToRGB565Matrix
The code already exists to use a specific matrix. This CL simply
adds a function to use a generic YUV matrix for the conversion.

Bug: b/241451603
Change-Id: I0eea7e96a891d045905a9c963b56c053097029ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3820903
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-08-09 20:15:44 +00:00
Frank Barchard
b028453ba6 Disable bilinear 16 bit scale up for SSE2
- Undefine HAS_SCALEROWUP2_BILINEAR_16_SSE2
- Save XMM7 in ScaleRowUp2_Bilinear_16_SSE2().
- Rename HAS_SCALEROWUP2LINEAR_xxx to HAS_SCALEROWUP2_LINEAR_xxx
- DetileSplitUVRow_C() is implemented using SplitUVRow_C().
- Changes to unit_test/planar_test.cc.

Bug: libyuv:882
Change-Id: I0a8e8e5fb43bdf58ded87244e802343eacb789f2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3795063
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-08-01 22:54:48 +00:00
Frank Barchard
eec8dd37e8 Change ScaleUVRowUp2_Biinear_16_SSE2 to SSE41
Bug: libyuv:928

xed -i scale_gcc.o:
SYM ScaleUVRowUp2_Linear_16_SSE2:
XDIS 0: LOGICAL   SSE2       660FEFED                 pxor xmm5, xmm5
XDIS 4: SSE       SSE2       660F76E4                 pcmpeqd xmm4, xmm4
XDIS 8: SSE       SSE2       660F72D41F               psrld xmm4, 0x1f
XDIS d: SSE       SSE2       660F72F401               pslld xmm4, 0x1
XDIS 12: DATAXFER  SSE2       F30F7E07                 movq xmm0, qword ptr [rdi]
XDIS 16: DATAXFER  SSE2       F30F7E4F04               movq xmm1, qword ptr [rdi+0x4]
XDIS 1b: SSE       SSE2       660F61C5                 punpcklwd xmm0, xmm5
XDIS 1f: SSE       SSE2       660F61CD                 punpcklwd xmm1, xmm5
XDIS 23: DATAXFER  SSE2       660F6FD0                 movdqa xmm2, xmm0
XDIS 27: DATAXFER  SSE2       660F6FD9                 movdqa xmm3, xmm1
XDIS 2b: SSE       SSE2       660F70D24E               pshufd xmm2, xmm2, 0x4e
XDIS 30: SSE       SSE2       660F70DB4E               pshufd xmm3, xmm3, 0x4e
XDIS 35: SSE       SSE2       660FFED4                 paddd xmm2, xmm4
XDIS 39: SSE       SSE2       660FFEDC                 paddd xmm3, xmm4
XDIS 3d: SSE       SSE2       660FFED0                 paddd xmm2, xmm0
XDIS 41: SSE       SSE2       660FFED9                 paddd xmm3, xmm1
XDIS 45: SSE       SSE2       660FFEC0                 paddd xmm0, xmm0
XDIS 49: SSE       SSE2       660FFEC9                 paddd xmm1, xmm1
XDIS 4d: SSE       SSE2       660FFEC2                 paddd xmm0, xmm2
XDIS 51: SSE       SSE2       660FFECB                 paddd xmm1, xmm3
XDIS 55: SSE       SSE2       660F72D002               psrld xmm0, 0x2
XDIS 5a: SSE       SSE2       660F72D102               psrld xmm1, 0x2
XDIS 5f: SSE       SSE4       660F382BC1               packusdw xmm0, xmm1
XDIS 64: DATAXFER  SSE2       F30F7F06                 movdqu xmmword ptr [rsi], xmm0
XDIS 68: MISC      BASE       488D7F08                 lea rdi, ptr [rdi+0x8]
XDIS 6c: MISC      BASE       488D7610                 lea rsi, ptr [rsi+0x10]
XDIS 70: BINARY    BASE       83EA04                   sub edx, 0x4
XDIS 73: COND_BR   BASE       7F9D                     jnle 0x12 <ScaleUVRowUp2_Linear_16_SSE2+0x12>
XDIS 75: RET       BASE       C3                       ret

SYM ScaleUVRowUp2_Bilinear_16_SSE2:
XDIS 0: LOGICAL   SSE2       660FEFFF                 pxor xmm7, xmm7
XDIS 4: SSE       SSE2       660F76F6                 pcmpeqd xmm6, xmm6
XDIS 8: SSE       SSE2       660F72D61F               psrld xmm6, 0x1f
XDIS d: SSE       SSE2       660F72F603               pslld xmm6, 0x3
XDIS 12: DATAXFER  SSE2       F30F7E07                 movq xmm0, qword ptr [rdi]
XDIS 16: DATAXFER  SSE2       F30F7E4F04               movq xmm1, qword ptr [rdi+0x4]
XDIS 1b: SSE       SSE2       660F61C7                 punpcklwd xmm0, xmm7
XDIS 1f: SSE       SSE2       660F61CF                 punpcklwd xmm1, xmm7
XDIS 23: DATAXFER  SSE2       660F6FD0                 movdqa xmm2, xmm0
XDIS 27: DATAXFER  SSE2       660F6FD9                 movdqa xmm3, xmm1
XDIS 2b: SSE       SSE2       660F70D24E               pshufd xmm2, xmm2, 0x4e
XDIS 30: SSE       SSE2       660F70DB4E               pshufd xmm3, xmm3, 0x4e
XDIS 35: SSE       SSE2       660FFED0                 paddd xmm2, xmm0
XDIS 39: SSE       SSE2       660FFED9                 paddd xmm3, xmm1
XDIS 3d: SSE       SSE2       660FFEC0                 paddd xmm0, xmm0
XDIS 41: SSE       SSE2       660FFEC9                 paddd xmm1, xmm1
XDIS 45: SSE       SSE2       660FFEC2                 paddd xmm0, xmm2
XDIS 49: SSE       SSE2       660FFECB                 paddd xmm1, xmm3
XDIS 4d: DATAXFER  SSE2       F30F7E1477               movq xmm2, qword ptr [rdi+rsi*2]
XDIS 52: DATAXFER  SSE2       F30F7E5C7704             movq xmm3, qword ptr [rdi+rsi*2+0x4]
XDIS 58: SSE       SSE2       660F61D7                 punpcklwd xmm2, xmm7
XDIS 5c: SSE       SSE2       660F61DF                 punpcklwd xmm3, xmm7
XDIS 60: DATAXFER  SSE2       660F6FE2                 movdqa xmm4, xmm2
XDIS 64: DATAXFER  SSE2       660F6FEB                 movdqa xmm5, xmm3
XDIS 68: SSE       SSE2       660F70E44E               pshufd xmm4, xmm4, 0x4e
XDIS 6d: SSE       SSE2       660F70ED4E               pshufd xmm5, xmm5, 0x4e
XDIS 72: SSE       SSE2       660FFEE2                 paddd xmm4, xmm2
XDIS 76: SSE       SSE2       660FFEEB                 paddd xmm5, xmm3
XDIS 7a: SSE       SSE2       660FFED2                 paddd xmm2, xmm2
XDIS 7e: SSE       SSE2       660FFEDB                 paddd xmm3, xmm3
XDIS 82: SSE       SSE2       660FFED4                 paddd xmm2, xmm4
XDIS 86: SSE       SSE2       660FFEDD                 paddd xmm3, xmm5
XDIS 8a: DATAXFER  SSE2       660F6FE0                 movdqa xmm4, xmm0
XDIS 8e: DATAXFER  SSE2       660F6FEA                 movdqa xmm5, xmm2
XDIS 92: SSE       SSE2       660FFEE0                 paddd xmm4, xmm0
XDIS 96: SSE       SSE2       660FFEEE                 paddd xmm5, xmm6
XDIS 9a: SSE       SSE2       660FFEE0                 paddd xmm4, xmm0
XDIS 9e: SSE       SSE2       660FFEE5                 paddd xmm4, xmm5
XDIS a2: SSE       SSE2       660F72D404               psrld xmm4, 0x4
XDIS a7: DATAXFER  SSE2       660F6FEA                 movdqa xmm5, xmm2
XDIS ab: SSE       SSE2       660FFEEA                 paddd xmm5, xmm2
XDIS af: SSE       SSE2       660FFEC6                 paddd xmm0, xmm6
XDIS b3: SSE       SSE2       660FFEEA                 paddd xmm5, xmm2
XDIS b7: SSE       SSE2       660FFEE8                 paddd xmm5, xmm0
XDIS bb: SSE       SSE2       660F72D504               psrld xmm5, 0x4
XDIS c0: DATAXFER  SSE2       660F6FC1                 movdqa xmm0, xmm1
XDIS c4: DATAXFER  SSE2       660F6FD3                 movdqa xmm2, xmm3
XDIS c8: SSE       SSE2       660FFEC1                 paddd xmm0, xmm1
XDIS cc: SSE       SSE2       660FFED6                 paddd xmm2, xmm6
XDIS d0: SSE       SSE2       660FFEC1                 paddd xmm0, xmm1
XDIS d4: SSE       SSE2       660FFEC2                 paddd xmm0, xmm2
XDIS d8: SSE       SSE2       660F72D004               psrld xmm0, 0x4
XDIS dd: DATAXFER  SSE2       660F6FD3                 movdqa xmm2, xmm3
XDIS e1: SSE       SSE2       660FFED3                 paddd xmm2, xmm3
XDIS e5: SSE       SSE2       660FFECE                 paddd xmm1, xmm6
XDIS e9: SSE       SSE2       660FFED3                 paddd xmm2, xmm3
XDIS ed: SSE       SSE2       660FFED1                 paddd xmm2, xmm1
XDIS f1: SSE       SSE2       660F72D204               psrld xmm2, 0x4
XDIS f6: SSE       SSE4       660F382BE0               packusdw xmm4, xmm0
XDIS fb: DATAXFER  SSE2       F30F7F22                 movdqu xmmword ptr [rdx], xmm4
XDIS ff: SSE       SSE4       660F382BEA               packusdw xmm5, xmm2
XDIS 104: DATAXFER  SSE2       F30F7F2C4A               movdqu xmmword ptr [rdx+rcx*2], xmm5
XDIS 109: MISC      BASE       488D7F08                 lea rdi, ptr [rdi+0x8]
XDIS 10d: MISC      BASE       488D5210                 lea rdx, ptr [rdx+0x10]
XDIS 111: BINARY    BASE       4183E804                 sub r8d, 0x4
XDIS 115: COND_BR   BASE       0F8FF7FEFFFF             jnle 0x12 <ScaleUVRowUp2_Bilinear_16_SSE2+0x12>
XDIS 11b: RET       BASE       C3                       ret

Change-Id: Ia20860e9c3c45368822cfd8877167ff0bf973dcc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3587602
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2022-04-15 18:46:09 +00:00
Wan-Teh Chang
ebd9e130f0 Fix bugs in I010AlphaToARGBMatrixBilinear()
Add a missing increment of src_a and ARGBAttenuateRow() call.

Bug: libyuv:922
Change-Id: I26e04e70c6a1a231cbe54b60c249f4c2e8af112a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3549976
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
2022-03-25 15:39:52 +00:00
Wan-Teh Chang
173ed374c0 Add null pointer checks for the src_a parameters
Change-Id: Icc96e18eab07080c18b6542171a340c97f059c78
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3550016
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>
2022-03-25 15:14:38 +00:00