libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2025-12-06 16:56:55 +08:00

Author	SHA1	Message	Date
Frank Barchard	48943bb378	Convert8To16 use VPSRLW instead of VPMULHUW for better lunarlake performance - MCA says old version was 4 cycles and new version is 2.5 cycles/loop - lunarlake is the only known cpu mca -mcpu=lunarlake 100 iterations Was vpmulhu Iterations: 100 Instructions: 1200 Total Cycles: 426 Total uOps: 1200 Dispatch Width: 8 uOps Per Cycle: 2.82 IPC: 2.82 Block RThroughput: 4.0 Now vpsrlw Iterations: 100 Instructions: 1200 Total Cycles: 279 Total uOps: 1400 Dispatch Width: 8 uOps Per Cycle: 5.02 IPC: 4.30 Block RThroughput: 2.5 Bug: None Change-Id: I5a49e1cf1ed3dfb59fe9861a871df9862417c6a6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6697745 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-08-04 12:42:50 -07:00
George Steed	007b920232	[AArch64] Add SME implementation of ARGBToUVRow and similar Mostly just a straightforward copy of the existing SVE2 code ported to Streaming-SVE. Introduce new "any" kernels for non-multiple of two cases, matching what we already do for SVE2. The existing SVE2 code makes use of the Neon MOVI instruction that is not supported in Streaming-SVE, so adjust the code to use FMOV instead which has the same performance characteristics. Change-Id: I74b7ea1fe8e6af75dfaf92826a4de775a1559f77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6663806 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-30 09:20:23 -07:00
George Steed	88798bcd63	[AArch64] Add SME implementation of Convert8To16Row_SME Mostly just a straightforward copy of the Neon code ported to Streaming-SVE. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-23 11:32:56 -07:00
George Steed	7e5863ae5a	Add SVE2 and SME implementations of I422ToAR30Row This can make use of the existing load/convert/store macros that are already present for other kernels, so add I422ToAR30Row_SVE2 and I422ToAR30Row_SME to match the existing kernels. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -9.1% Cortex-A520: +6.8% (!) Cortex-A710: -4.0% Cortex-A715: -1.1% Cortex-A720: -1.1% Cortex-X2: -5.7% Cortex-X3: -5.9% Cortex-X4: -2.8% Cortex-X925: -4.0% Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-27 11:39:00 -07:00
George Steed	949cb623bf	Add SVE2 and SME implementations of I444ToRGB24Row Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so they are usable in both SVE2 and SME implementations, and use them to add new I444ToRGB24Row implementations for SVE2 and SME. We need to use the unrolled versions here to use the ST3B interleaving store instructions, since there is no partial vector version of this store instruction. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -57.6% Cortex-A520: -38.1% Cortex-A710: -15.5% Cortex-A715: -9.2% Cortex-A720: -9.2% Cortex-X2: -25.8% Cortex-X3: -26.2% Cortex-X4: -23.2% Cortex-X925: -17.8% Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-22 13:33:06 -07:00
George Steed	c4a0c8d34a	[AArch64] Add SVE2 and SME implementations for Convert8To8Row SVE can make use of the UMULH instruction to avoid needing separate widening multiply and narrowing steps for the scale application. Reduction in runtime for Convert8To8Row_SVE2 observed compared to the existing Neon implementation: Cortex-A510: -13.2% Cortex-A520: -16.4% Cortex-A710: -37.1% Cortex-A715: -38.5% Cortex-A720: -38.4% Cortex-X2: -33.2% Cortex-X3: -31.8% Cortex-X4: -31.8% Cortex-X925: -13.9% Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-01-28 15:53:26 -08:00
George Steed	db5a71c528	[AArch64] Remove unused variables in HalfRow_{16To8,16}_SME The HalfRow kernels assume that the fraction is exactly half, so there is no need to calculate it. No-Try: True Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2024-12-13 08:00:58 -08:00
George Steed	7fd0bd197e	[AArch64] Port YUVToRGB color conversions to SME Some of the color conversion kernels already have Streaming-SVE implementations however many do not. We can re-use the existing SVE implementation by moving it to a new shared row_sve.h header and marking it with a "streaming-compatible" attribute to ensure it can be called from both streaming and non-streaming execution modes. As part of this move to a common header we also add duplicated streaming-mode implementations of the following kernels that did not previously have an SME implementation: - I210AlphaToARGBRow_SME - I210ToAR30Row_SME - I210ToARGBRow_SME - I212ToAR30Row_SME - I212ToARGBRow_SME - I400ToARGBRow_SME - I410AlphaToARGBRow_SME - I410ToAR30Row_SME - I410ToARGBRow_SME - I422AlphaToARGBRow_SME - I422ToARGB1555Row_SME - I422ToARGB4444Row_SME - I422ToRGB24Row_SME - I422ToRGB565Row_SME - I422ToRGBARow_SME - I444AlphaToARGBRow_SME - NV12ToARGBRow_SME - NV12ToRGB24Row_SME - NV21ToARGBRow_SME - NV21ToRGB24Row_SME - P210ToAR30Row_SME - P210ToARGBRow_SME - P410ToAR30Row_SME - P410ToARGBRow_SME - UYVYToARGBRow_SME - YUY2ToARGBRow_SME Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:07:54 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
George Steed	2d8652f3e7	[AArch64] Add SME implementation of CopyRow Add a streaming-SVE implementation of CopyRow using normal vector load/store instructions. Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:02:07 -08:00
George Steed	418b6df0de	[AArch64] Add SME implementation of Convert16To8Row Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel. SVE has a "widening multiply get high half" instruction in UMULH, however using the same technique as the Neon code to avoid the need for a widening multiply at all is more performant here. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 03:01:55 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	9144583f22	[AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow Mostly just a translation of the existing Neon code to SME. Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:11:19 +00:00
George Steed	237f39cb8c	[AArch64] Add SME implementation of I444ToARGBRow This is based on an unrolled version of the existing SVE2 code. The implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-29 18:10:23 +00:00
George Steed	22c5c18778	[AArch64] Add SME implementation of I422ToARGBRow Including addition of a new row_sme.cc file and associated infrastructure. The actual implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-29 05:49:28 +00:00

15 Commits
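
The InterpolateRow commit above (c2e7f8389a) notes that a source_y_fraction of 256 would overflow a byte and can be handled with memcpy instead. The scalar sketch below illustrates that idea only; it is not libyuv's code. The function name `interpolate_row_sketch`, the choice of which row is copied at each endpoint, and the `+ 128` rounding term are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar helper, not a libyuv function.
 * Blends two rows of bytes. `fraction` is in [0, 256]:
 * 0 is assumed to select row0 and 256 to select row1. */
static void interpolate_row_sketch(uint8_t* dst,
                                   const uint8_t* row0,
                                   const uint8_t* row1,
                                   int width,
                                   int fraction) {
  assert(fraction >= 0 && fraction <= 256);

  /* At either endpoint one of the two weights would be 256, which does
   * not fit in a byte, so handle these cases as plain copies. */
  if (fraction == 0) {
    memcpy(dst, row0, (size_t)width);
    return;
  }
  if (fraction == 256) {
    memcpy(dst, row1, (size_t)width);
    return;
  }

  /* For fractions 1..255 both weights fit in uint8_t. */
  const uint8_t w1 = (uint8_t)fraction;
  const uint8_t w0 = (uint8_t)(256 - fraction);
  for (int i = 0; i < width; ++i) {
    /* Weighted sum with rounding (assumed); the maximum is
     * 255 * 256 + 128, so the shifted result fits in a byte. */
    dst[i] = (uint8_t)((row0[i] * w0 + row1[i] * w1 + 128) >> 8);
  }
}
```

Keeping the endpoint cases outside the loop means a vectorized version of the loop could hold both weights as byte lanes, which appears to be the constraint the commit message describes.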