267 Commits

Author SHA1 Message Date
Frank Barchard
2b4453d46f Deprecate MIPS and MSA support.
- Remove *_msa.cc source files
- Update build files
- Update header references, planar ifdefs for row functions
- Update documentation on supported platforms
- Version bumped to 1921
- clang-format applied

Bug: 434383432
Change-Id: I072d6aac4956f0ed668e64614ac8557612171f76
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/7045953
Reviewed-by: Justin Green <greenjustin@google.com>
2025-10-16 12:20:40 -07:00
George Steed
007b920232 [AArch64] Add SME implementation of ARGBToUVRow and similar
Mostly just a straightforward copy of the existing SVE2 code ported to
Streaming-SVE. Introduce new "any" kernels for non-multiple of two
cases, matching what we already do for SVE2.

The existing SVE2 code makes use of the Neon MOVI instruction that is
not supported in Streaming-SVE, so adjust the code to use FMOV instead
which has the same performance characteristics.

Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2025-06-30 09:20:23 -07:00
George Steed
61bdaee13a Add Neon I8MM implementations of ARGB to UV and variants
The maximum coefficient is 128, so store constants negated to take
advantage of -128 being representable in 8-bit integers. This allows us
to use the I8MM USDOT instructions.

Reduction in time taken observed compared to the existing Neon
implementation, as a geomean of all ARGBToUV variants:

Cortex-A510:  -7.1%
Cortex-A520:  -2.1%
Cortex-A710:  -8.4%
Cortex-A715:  -0.3%
Cortex-A720:  -0.3%
  Cortex-X2: -40.0%
  Cortex-X3: -43.3%
  Cortex-X4: -11.3%
Cortex-X925:  -2.5%

Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-05-12 11:14:00 -07:00
Frank Barchard
918329caee Make constant 0x0101 using vpcmpeqb+vpabsb
Was
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpsrlw      $0xf,%%ymm4,%%ymm4
      vpackuswb   %%ymm4,%%ymm4,%%ymm4
Now
      vpcmpeqb    %%ymm4,%%ymm4,%%ymm4
      vpabsb      %%ymm4,%%ymm4

Bug: 381138208
Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2025-03-10 13:25:16 -07:00
Frank Barchard
c060118bea ARGBToJ444 use 256 for fixed point scale UV
- use negative coefficients for UV to allow -128
- change shift to truncate instead of round for UV
- adapt all row_gcc RGB to UV into matrix functions
- add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc

Bug: 381138208
Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: James Zern <jzern@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-27 13:04:15 -08:00
Frank Barchard
61354d2671 ARGBToUV Matrix for AVX2 and SSSE3
- Round before shifting to 8 bit to match NEON
  - RAWToARGB use unaligned loads and port to AVX2

Was C/SSSE/AVX2
ARGBToI444_Opt (343 ms)
ARGBToJ444_Opt (677 ms)
RAWToI444_Opt (405 ms)
RAWToJ444_Opt (803 ms)

Now AVX2
ARGBToI444_Opt (283 ms)
ARGBToJ444_Opt (284 ms)
RAWToI444_Opt (316 ms)
RAWToJ444_Opt (339 ms)

Profile Now AVX2
  38.31%  ARGBToUVJ444Row_AVX2
  32.31%  RAWToARGBRow_AVX2
  23.99%  ARGBToYJRow_AVX2

Profile Was C/SSSE/AVX2
    73.15%  ARGBToUVJ444Row_C
    15.74%  RAWToARGBRow_SSSE3
     8.87%  ARGBToYJRow_AVX2

Bug: 381138208
Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Ben Weiss <bweiss@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2025-02-10 18:36:18 -08:00
Frank Barchard
5a9a6ea936 Add RAWToI444
Skylake Xeon
  RAWToI444_Opt (433 ms)
  RAWToJ444_Opt (1781 ms)
  ARGBToI444_Opt (352 ms)
  ARGBToJ444_Opt (1577 ms)

Samsung S22 Exynos
  ARGBToI444_Opt (283 ms)
  ARGBToJ444_Opt (209 ms)
  RAWToI444_Opt (294 ms)
  RAWToJ444_Opt (293 ms)

Profiling on Samsung S22 Exynos
37.62%,  ARGBToUV444Row_NEON_I8MM
29.42%,  RAWToARGBRow_SVE2
19.61%,  ARGBToYRow_NEON_DotProd

Passing different --libyuv_cpu_info=N etc we can compare each ISA
C           1  RAWToI444_Opt (781 ms)
NEON      511  RAWToI444_Opt (757 ms)
NEONDOT  1023  RAWToI444_Opt (571 ms)
NEONI8MM 2047  RAWToI444_Opt (334 ms)
SVE2     8191  RAWToI444_Opt (307 ms)



Bug: 390247964
Change-Id: I0316fedd32222588455afa751f5b854f46bce024
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-02-03 16:13:03 -08:00
Frank Barchard
c1bac9e6a5 RAWToJ444 and ARGBToJ444
- ARGBToJ444 implements ARGBToUVJ444Row_C
- RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444

libyuv_test '--gunit_filter=*R*To?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
(with bit exact off)

Samsung S23
RAWToJ444_Opt (437 ms)
ARGBToJ444_Opt (337 ms)
ARGBToI444_Opt (196 ms)

Skylake Xeon
RAWToJ444_Opt (1699 ms)
ARGBToJ444_Opt (1559 ms)
ARGBToI444_Opt (346 ms)

Bug: 390247964
Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-29 15:18:38 -08:00
Frank Barchard
26277baf96 J420ToI420 using planar 8 bit scaling
- Add Convert8To8Plane which scale and add 8 bit values allowing full range
  YUV to be converted to limited range YUV

libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Samsung S23
J420ToI420_Opt (45 ms)
I420ToI420_Opt (37 ms)

Skylake
J420ToI420_Opt (596 ms)
I420ToI420_Opt (99 ms)

Bug: 381327032
Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-22 02:50:24 -08:00
Frank Barchard
47ddac2996 Sub sampling conversions use CopyPlane for Y channel
- Replace ScalePlane with CopyPlane for Y channel
- Vertical mirroring is supported, but not horizontal mirroring.
- Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y.
- Apply clang-format
- Bump version to 1899

Bug: None
Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:34:11 -08:00
Frank Barchard
e0040eb318 Apply clang format
Bug: None
Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2025-01-02 13:31:20 -08:00
George Steed
c2e7f8389a [AArch64] Add SME implementations of InterpolateRow{,_16,_16To8}
InterpolateRow_SME and InterpolateRow_16_SME need special cases to
handle if source_y_fraction is 256 since this would overflow a byte and
can just be a call to memcpy instead.

InterpolateRow_16To8_SME is never called with a source_y_fraction value
of 256 so there is no need for a special case here.

Change-Id: I67805b5db2c411acb93ada626cf414b35620f467
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 03:03:41 -08:00
George Steed
418b6df0de [AArch64] Add SME implementation of Convert16To8Row
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel.
SVE has a "widening multiply get high half" instruction in UMULH,
however using the same technique as the Neon code to avoid the need for
a widening multiply at all is more performant here.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2024-12-12 03:01:55 -08:00
George Steed
7391559cb4 [AArch64] Add SME implementation of MergeUVRow{,_16}
Mostly just a straightforward copy of the Neon code ported to
Streaming-SVE, we can use predication to avoid needing an `Any` kernel
and use ST2 to avoid needing a separate ZIP instruction.

These is no benefit from this kernel when the SVE vector length is only
128 bits, so skip writing a non-streaming SVE implementation.

Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-12-12 01:16:19 -08:00
George Steed
772f0fde1c [AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON
The existing code only makes use of half of the vector lanes in the
RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more
data to allow using full vectors, adjusting the "any" kernel macros to
match. For the RGB565ToUVRow kernel we already have plenty of data but
currently call the macro twice as much as needed, so refactor the code
to only call it once but operating with full vectors instead.

Reduction in runtimes observed for selected micro-architectures:

            | RGB565ToARGBRow | RGB565ToUVRow | RGB565ToYRow
 Cortex-A53 |          -35.2% |        -28.8% |       -31.1%
 Cortex-A55 |          -32.5% |        -34.4% |       -42.9%
Cortex-A510 |          -21.6% |        -27.7% |       -47.2%
 Cortex-A76 |           -0.9% |        -42.0% |       -21.4%
Cortex-A720 |          -28.6% |        -37.2% |       -26.1%
  Cortex-X1 |           -3.2% |        -42.3% |       -23.4%

Bug: b/42280945
Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-09-16 04:35:47 +00:00
Frank Barchard
679e851f65 Convert16To8Row_AVX512BW using vpmovuswb
- avx2 is pack/perm is mutating order
- cvt method maintains channel order on avx512

Sapphire Rapids

Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)

AVX2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)

SSE2
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)

Skylake Xeon
Now vpmovuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)

Was vpackuswb
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)

Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd  %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2

Now
vpbroadcastw %%xmm2,%%ymm2

Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-15 20:13:33 +00:00
Frank Barchard
336e6fd25b I010ToNV12 conversion using 2 step row function for UV
- convert full Y plane with row coalescing if possible
- convert rows of UV from 10 bit to 8 bit then call MergeUV

libyuv_test '--gunit_filter=*010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *010ToNV12_Opt

Skylake Xeon Was 2 pass planes
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms)
Now 2 pass rows
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms)

On Samsung S23
libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000'

Was
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms)

Now
[       OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms
[       OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms
[       OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms <- now
[       OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms
[       OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms
[       OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms
[       OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms)
[       OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms)
[       OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms)


Bug: b/357439226, b/357721018
Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-13 19:30:16 +00:00
Frank Barchard
a97746349b Add test for I010ToNV12
- Add support for negative height to invert
- Fix off by 1 on odd width and height
- Bump version to 1895

Initial I010 is 2 step planar conversion

libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1

Skylake Xeon
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms)
Pixel 7
[       OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms)
[       OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms)

Bug: b/357721018, b/357439226
Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-12 18:57:56 +00:00
Chunbo Hua
fc94178260 Implement I010ToNV12 conversion
I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.

Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-08-06 17:36:13 +00:00
Frank Barchard
32ccd53bb3 Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
  P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is be rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
  of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.

Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2024-08-05 18:55:44 +00:00
George Steed
004352ba16 [AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.

Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.

Observed reduction in runtime compared to the existing Neon code:

            | AYUVToUVRow | AYUVToVURow
Cortex-A510 |      -33.1% |      -33.0%
Cortex-A720 |      -25.1% |      -25.1%
  Cortex-X2 |      -59.5% |      -53.9%
  Cortex-X4 |      -39.2% |      -39.4%

Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-06-04 18:18:07 +00:00
George Steed
6f1d8b1e11 [AArch64] Add SVE2 implementations for ARGBToUVRow and similar
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.

A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.

For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.

Observed affect on runtimes compared to the existing Neon kernels:

             | Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow |      -15.5% |       +5.4% |    -33.1%
 ABGRToUVRow |      -15.6% |       +5.3% |    -35.9%
ARGBToUVJRow |      -10.1% |       +5.4% |    -32.7%
 ARGBToUVRow |      -10.1% |       +5.4% |    -29.3%
 BGRAToUVRow |      -15.5% |       +4.6% |    -32.8%
 RGBAToUVRow |      -10.1% |       +4.2% |    -36.0%

Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-05-01 19:46:43 +00:00
George Steed
f2e78e1304 [AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow
Using the dot-product instructions here allows us to avoid needing LD4
for loading individual colour channels, which gives a big benefit on
some micro-architectures where such instructions perform significantly
worse than LD1. In addition the dot-product instructions have higher
throughput compared to the Neon

Observed reduction in runtimes for selected kernels moving from *_NEON
to *_NEON_DotProd:

     Kernel | Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
ABGRToYJRow |      -6.5% |      -22.5% |     -43.5% |    -71.2%
 ABGRToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.3%
ARGBToYJRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 ARGBToYRow |      -6.5% |      -22.5% |     -43.5% |    -68.1%
 BGRAToYRow |      -6.5% |      -22.5% |     -42.3% |    -68.4%
RGBAToYJRow |      -6.5% |      -22.5% |     -42.2% |    -73.7%
 RGBAToYRow |      -6.5% |      -22.5% |     -42.3% |    -64.9%

Bug: libyuv:977
Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-04-09 03:09:36 +00:00
Frank Barchard
5625f42424 I444ToI420 and I422ToI420 check U and V pointers and return -1 if NULL.
- Add detect linux kernel version number in util/cpuid

adbrun -- blaze-bin/third_party/libyuv/cpuid
Kernel Version 4.14
Cpu Flags 0x7
Has ARM 0x2

Bug: libyuv:970
Change-Id: I655ed598db3655ca8448be08f1d71fbc328ced66
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5207990
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2024-01-18 21:56:11 +00:00
Frank Barchard
def473f501 malloc return 1 for failures and assert for internal functions
Bug: libyuv:968
Change-Id: Iea2f907061532d2e00347996124bc80d079a7bdc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5010874
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-12-04 22:55:20 +00:00
Wan-Teh Chang
fb6341d326 Change ScalePlane,ScalePlane_16,... to return int
Change ScalePlane(), ScalePlane_16(), and ScalePlane_12() to return int
so that they can report memory allocation failures (by returning 1).

BUG=libyuv:968

Change-Id: Ie5c183ee42e3d595302671f9ecb7b3472dc8fdb5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5005031
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-11-03 23:53:24 +00:00
Frank Barchard
31e1d6f896 Check allocations that return NULL and return early
BUG=libyuv:968

Change-Id: I9e8594440a6035958511f9c50072820131331fc8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-10-27 17:41:36 +00:00
Bruce Lai
04821d1e7d [RVV] Enable ARGBExtractAlphaRow/ARGBCopyYToAlphaRow
* Run on SiFive internal FPGA:

TestARGBExtractAlpha(~3.2x vs scalar)
TestARGBCopyYToAlpha(~1.6x vs scalar)

Change-Id: I36525c67e8ac3f71ea9d1a58c7dc15a4009d9da1
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617955
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-15 23:45:24 +00:00
Frank Barchard
157b153b60 Fix tidy warning that uint32_t dither4 should not be const
- Remove const from uint32_t dither4 parameter to fix clang-tidy warning
- Apply clang format
- Bump version
- Remove unused MMI source; superceded by MSA

Bug: None
Change-Id: Id49991db25bca4e99590b415312542d917471c62
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581882
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-02 00:42:02 +00:00
Vignesh Venkatasubramanian
c0f64c14ca Add I412/I212 to I420 functions
They re-use the same method as I410/I210 to I420 with a depth
value of 12 instead of 10.

Bug: b/268505204
Change-Id: I299862b4556461d8c95f0fc1dcd5260e1c1f25cd
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4581867
Commit-Queue: Vignesh Venkatasubramanian <vigneshv@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-06-01 19:50:16 +00:00
Darren Hsieh
d14bd701c8 [RVV] Enable CopyRow_RVV, InterpolateRow_RVV, {Merge,Split}UVRow_RVV
* Run on SiFive internal FPGA:

MergeUVPlane_Opt(~6x vs scalar)
SplitUVPlane_Opt(~6x vs scalar)
TestCopyPlane(~8x vs scalar)
ARGBInterpolate0_Opt(~10x vs scalar)
ARGBInterpolate64_Opt(~9x vs scalar)
ARGBInterpolate168_Opt(~9x vs scalar)
ARGBInterpolate192_Opt(~8.5x vs scalar)
ARGBInterpolate255_Opt(~8x vs scalar)

Bug: libyuv:956
Change-Id: I8372341865f75f42e30371ef943d5c2e4be7b79a
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4574186
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-30 09:10:35 +00:00
Lu Wang
8670bcf17f Optimize the following 19 functions with LSX in row_lsx.cc.
UYVYToYRow_LSX, UYVYToUVRow_LSX, UYVYToUV422Row_LSX,
ARGBToUVRow_LSX, ARGBToRGB24Row_LSX, ARGBToRAWRow_LSX,
ARGBToRGB565Row_LSX, ARGBToARGB1555Row_LSX, ARGBToARGB4444Row_LSX,
ARGBToUV444Row_LSX, ARGBMultiplyRow_LSX, ARGBAddRow_LSX,
ARGBSubtractRow_LSX, ARGBAttenuateRow_LSX, ARGBToRGB565DitherRow_LSX,
ARGBShuffleRow_LSX, ARGBShadeRow_LSX, ARGBGrayRow_LSX,
ARGBSepiaRow_LSX

Bug: libyuv:913
Change-Id: I02c0c9d68b229c4a66c96837e9b928c2f5dda1f3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4546814
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-19 18:55:58 +00:00
Frank Barchard
a37799344d ARGBToI420Alpha function to convert ARGB to I420 with Alpha
Bug: b/281866362
Change-Id: Ic1093a887fb483f134c78909cf1ee7495e7345ba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4534100
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2023-05-17 00:23:24 +00:00
Bruce Lai
59eae49f17 Enable ARGBToYMatrixRow_RVV/RGBAToYMatrixRow_RVV/RGBToYMatrixRow_RVV
Run on SiFive internal FPGA:

ARGBToJ400_Opt (~6x vs scalar)
RGBAToJ400_Opt (~6x vs scalar)
RGB24ToJ400_Opt (~5.5x vs scalar)

LIBYUV_WIDTH=1280 LIBYUV_HEIGHT=720 LIBYUV_REPEAT=10

Change-Id: Ia3ce8cea7962fbd8618cc23e850a7913c9cabf4f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4521783
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-05-11 10:17:51 +00:00
Lu Wang
1d940cc570 Optimize the following functions with LSX.
MirrorRow_LSX, MirrorUVRow_LSX, ARGBMirrorRow_LSX,
I422ToYUY2Row_LSX, I422ToUYVYRow_LSX, I422ToARGBRow_LSX,
I422ToRGBARow_LSX, I422AlphaToARGBRow_LSX, I422ToRGB24Row_LSX,
I422ToRGB565Row_LSX, I422ToARGB4444Row_LSX, I422ToARGB1555Row_LSX,
YUY2ToYRow_LSX, YUY2ToUVRow_LSX, YUY2ToUV422Row_LSX

Bug: libyuv:913
Change-Id: I46cec605001d7ddd73846eed6d0a77f936b6dc53
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4515191
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-05-10 00:25:48 +00:00
Frank Barchard
3f219a3501 GCC warning fix for MT2T
- Fix redundent assignment compile warning in GCC
- Apply clang-format
- Bump version to 1863

Bug: libyuv:955
Change-Id: If2b6588cd5a7f068a1745fe7763e90caa7277101
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4344729
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
2023-03-16 06:57:20 +00:00
Frank Barchard
88b050f337 MergeUV AVX512BW use assembly
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V

AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2   MergeUVPlane_Opt (945 ms)
AVX2   MergeUVPlane_16_Opt (2167 ms)

Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2   MergeUVPlane_Opt (943 ms)
SSE2   MergeUVPlane_Opt (973 ms)
AVX2   MergeUVPlane_16_Opt (2102 ms)

Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-22 21:19:08 +00:00
Frank Barchard
2bdc210be9 MergeUV_AVX512BW for I420ToNV12
On Skylake Xeon 640x360 100000 iterations
AVX512   MergeUVPlane_Opt (1196 ms)
AVX2     MergeUVPlane_Opt (1565 ms)
SSE2     MergeUVPlane_Opt (1780 ms)
Pixel 7  MergeUVPlane_Opt (1177 ms)

Bug: None
Change-Id: If47d4fa957cf27781bba5fd6a2f0bf554101a5c6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4242247
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
2023-02-13 20:14:57 +00:00
Hao Chen
0809713775 Refine some functions on the Longarch platform.
Add ARGBToYMatrixRow_LSX/LASX, RGBAToYMatrixRow_LSX/LASX and
RGBToYMatrixRow_LSX/LASX functions with RgbConstants argument.

Bug: libyuv:912
Change-Id: I956e639d1f0da4a47a55b79c9d41dcd29e29bdc5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4167860
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-18 18:54:14 +00:00
Frank Barchard
541d8efbaf Fix for divide row functions used by P010ToI010
Bug: libyuv:951
Change-Id: Id323656cb6f99b1be0be7aaa854d3cc15feeba69
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4166562
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-17 21:40:45 +00:00
Frank Barchard
d5aa3d4a76 P010ToI010 and P012ToI012 conversion functions
- Convert 10 and 12 bit biplanar formats to planar.
- Shift 10 MSB to 10 LSB
- P010 is similar to NV12 in layout, but uses 10 MSB of 16 bit values.
- I010 is similar to I420 in layout, but uses 10 LSB of 16 bit values.

Bug: libyuv:951
Change-Id: I16a1bc64239d0fa4f41810910da448bf5720935f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4166560
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-13 19:20:12 +00:00
Frank Barchard
6e4b0acb4b I422Rotate take stride for temporary buffers
- Minor variable name changes first/last to top/bottom
- Comments explaining rotate temporary buffers usage
- Add asserts for scale parameter
- Use NULL and stddef.h instead of 0
- Use void * for allocation in row.h
- Add () around size parameter in macros


Bug: libyuv:926, libyuv:949
Change-Id: Ib55417570926ccada0a0f8abd1753dc12e5b162e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4136762
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-04 23:11:52 +00:00
Sergio Garcia Murillo
22a579c438 Use ScalePlaneDown2_16To8 for avoiding the 2 step process
Bug: libyuv:950
Change-Id: I5a77bca9a0230fe00abd810939e217833a14683f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4134524
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2023-01-03 21:41:21 +00:00
Sergio Garcia Murillo
f583b1b4b8 Add I410Copy and I410ToI420 methods
The I410To420 implementation does a two step approach for scaling down and 10-to-8 bit conversion using the Y plane as temporal storage.

Bug: libyuv:950
Change-Id: I3d35fad4b99e17253230456233fbd947e013c0ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4110783
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2023-01-03 20:27:28 +00:00
Frank Barchard
3abd6f36b6 Casting for scale functions
- MT2T support for source strides added, but only works for positive values.
- Reduced casting in row_common - one cast per assignment.
- scaling functions use intptr_t for intermediate calculations, then cast strides to ptrdiff_t

Bug: libyuv:948, b/257266635, b/262468594
Change-Id: I0409a0ce916b777da2a01c0ab0b56dccefed3b33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4102203
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ernest Hua <ernesthua@google.com>
2022-12-15 22:34:22 +00:00
Frank Barchard
610e0cdead MT2T Warning fixes for fuchsia
Bug: b/258474032, b/257266635
Change-Id: Ic5cbbc60e2e1463361e359a2fe3e97976c1ea929
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4081348
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-12-06 19:54:40 +00:00
Frank Barchard
cb35d5f90e BGRAToI420 use SSSE3 for Y but C for UV when LIBYUV_BIT_EXACT enabled
- Previously was C for both Y and UV.

Was BGRAToI420_Opt (17780 ms)
Now BGRAToI420_Opt (9546 ms)

Bug: b/253491233
Change-Id: Id103d8d5ba0fed0f7a427dd5955e1830275eff6b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3953131
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-10-14 03:09:56 +00:00
Frank Barchard
f71c83552d I420ToRGB24MatrixFilter function added
- Implemented as 3 steps: Upsample UV to 4:4:4, I444ToARGB, ARGBToRGB24
- Fix some build warnings for missing prototypes.

Pixel 4
I420ToRGB24_Opt (743 ms)
I420ToRGB24Filter_Opt (1331 ms)

Windows with skylake xeon:
x86 32 bit
I420ToRGB24_Opt (387 ms)
I420ToRGB24Filter_Opt (571 ms)
x64 64 bit
I420ToRGB24_Opt (384 ms)
I420ToRGB24Filter_Opt (582 ms)


Bug: libyuv:938, libyuv:830
Change-Id: Ie27f70816ec084437014f8a1c630ae011ee2348c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3900298
Reviewed-by: Wan-Teh Chang <wtc@google.com>
2022-09-16 19:46:47 +00:00
Frank Barchard
3e38ce5058 SSE2 MM21->YUY2 conversion
Add SSE2 optimization for MM21ToYUY2 conversion.

Bug: b/238137982
Change-Id: I189f712514308322f651b082b496bce9c015c4ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832525
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
2022-08-17 18:39:05 +00:00
Frank Barchard
65e7c9d570 MM21ToYUY2 and ABGRToJ420 conversion
MM21 to YUY2 use zip1 for performance

Cortex A510
Was MM21ToYUY2 (612 ms)
Now MM21ToYUY2 (573 ms)

Prefetches help Cortex A53
Was MM21ToYUY2 (4998 ms)
Now MM21ToYUY2 (1900 ms)

Pixel 4 Cortex A76
Was MM21ToYUY2 (215 ms)
Now MM21ToYUY2 (173 ms)

ABGRToJ420
- NEON, SSSE3 and AVX2 row functions
- J400, J420 and J422 formats.
- Added AVX2 for UV on ARGBToJ420.  Was SSSE3

Same code/performance as ARGBToJ420 but with constants re-ordered.
Pixel 4
ABGRToJ420_Opt (623 ms)
ABGRToJ422_Opt (702 ms)
ABGRToJ400_Opt (238 ms)

Skylake Xeon
With LIBYUV_BIT_EXACT which uses C for UV
ABGRToJ420_Opt (988 ms)
ABGRToJ422_Opt (1872 ms)
ABGRToJ400_Opt (186 ms)
Skylake Xeon using AVX2
ABGRToJ420_Opt (251 ms)
ABGRToJ422_Opt (245 ms)
ABGRToJ400_Opt (184 ms)
Skylake Xeon using SSSE3
ABGRToJ420_Opt (328 ms)
ABGRToJ422_Opt (362 ms)
ABGRToJ400_Opt (185 ms)

Bug: b/238137982
Change-Id: I559c3fe3fb80fa2ce5be3d8218736f9cbc627666
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3832111
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2022-08-16 22:07:38 +00:00