libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2025-12-06 16:56:55 +08:00

Author	SHA1	Message	Date
George Steed	db5a71c528	[AArch64] Remove unused variables in HalfRow_{16To8,16}_SME The HalfRow kernels assume that the fraction is exactly half, so there is no need to calculate it. No-Try: True Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2024-12-13 08:00:58 -08:00
George Steed	7fd0bd197e	[AArch64] Port YUVToRGB color conversions to SME Some of the color conversion kernels already have Streaming-SVE implementations however many do not. We can re-use the existing SVE implementation by moving it to a new shared row_sve.h header and marking it with a "streaming-compatible" attribute to ensure it can be called from both streaming and non-streaming execution modes. As part of this move to a common header we also add duplicated streaming-mode implementations of the following kernels that did not previously have an SME implementation: - I210AlphaToARGBRow_SME - I210ToAR30Row_SME - I210ToARGBRow_SME - I212ToAR30Row_SME - I212ToARGBRow_SME - I400ToARGBRow_SME - I410AlphaToARGBRow_SME - I410ToAR30Row_SME - I410ToARGBRow_SME - I422AlphaToARGBRow_SME - I422ToARGB1555Row_SME - I422ToARGB4444Row_SME - I422ToRGB24Row_SME - I422ToRGB565Row_SME - I422ToRGBARow_SME - I444AlphaToARGBRow_SME - NV12ToARGBRow_SME - NV12ToRGB24Row_SME - NV21ToARGBRow_SME - NV21ToRGB24Row_SME - P210ToAR30Row_SME - P210ToARGBRow_SME - P410ToAR30Row_SME - P410ToARGBRow_SME - UYVYToARGBRow_SME - YUY2ToARGBRow_SME Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:07:54 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
George Steed	2d8652f3e7	[AArch64] Add SME implementation of CopyRow Add a streaming-SVE implementation of CopyRow using normal vector load/store instructions. Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:02:07 -08:00
George Steed	418b6df0de	[AArch64] Add SME implementation of Convert16To8Row Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel. SVE has a "widening multiply get high half" instruction in UMULH, however using the same technique as the Neon code to avoid the need for a widening multiply at all is more performant here. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 03:01:55 -08:00
runzezhang	192b8c2238	Add NV24 scaling support to libyuv Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing. Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 02:46:11 -08:00
George Steed	85331e00cc	[AArch64] Add SME impls of ScaleRowDown2{,Linear,Box}_16 Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 SME kernels, but operating on 16-bit data rather than 8-bit. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:21:08 -08:00
George Steed	15f2ae7d70	[AArch64] Add SME impls of ScaleARGBRowDown2{,Linear,Box} Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples rather than 8-bit data or 16-bit UV tuples. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:19:20 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	3e75e41e79	[AArch64] Add "limit" variable explanations in SVE *AR30 kernels As requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583/1/source/row_sve.cc#1973 Change-Id: I15d8ca1f724a7123fbf52ac60b18c850e4004e64 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067153 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-11 23:50:27 -08:00
George Steed	11ef227b6d	[AArch64] Clean up formatting in row_sve.cc Force macros onto empty lines with empty comments and adjust some other comments to be consistent with the rest of the file. Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-11 23:48:57 -08:00
George Steed	3a0ad00ed3	Use separate intermediate RGBA buffers in planar function tests The existing tests reuse the intermediate buffers between the reference and optimized implementations. In particular the existing tests appear to pass even if the optimized implementation is completely empty, so long as it does not modify the destination buffers since these are already filled with correct values from the reference code. To avoid this, allocate separate buffers for optimized and reference implementations to store intermediate data between function calls. Additionally remove unused buffers from HalfMergeUVPlane_Opt tests. Change-Id: I7e9ea21fc193e7be21cc24e2be0d7a122e068f6e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074941 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-05 11:47:17 +00:00
George Steed	8f659daffd	[AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row Now that we have the `_2X` versions of the macros we can use these to implement `ToRGB24` kernels. These cannot use the bottom/top approach previously used by other SVE kernels since there are three rather than two or four elements each. Reduction in runtimes observed compared to the existing Neon implementations: \| NV12ToRGB24Row \| NV21ToRGB24Row Cortex-A510 \| -60.7% \| -60.7% Cortex-A520 \| -46.0% \| -46.0% Cortex-A715 \| -25.2% \| -25.2% Cortex-A720 \| -25.2% \| -25.2% Cortex-X2 \| -28.9% \| -29.0% Cortex-X3 \| -28.2% \| -28.1% Cortex-X4 \| -30.8% \| -30.7% Cortex-X925 \| -28.8% \| -28.9% Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-04 17:51:08 +00:00
George Steed	233f859e3c	[AArch64] Remove redundant increments in ScaleRowDown2_16_NEON These were mistakenly copied from the main loop body, however this particular block of the code is only executed at most once so we do not need to perform the address updates. Also adjust formatting with clang-format to match other kernels. Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-04 17:48:11 +00:00
George Steed	9144583f22	[AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow Mostly just a translation of the existing Neon code to SME. Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:11:19 +00:00
George Steed	88a3472f52	[AArch64] Unroll SVE2 impls of NV{12,21}ToARGBRow We can reuse most of the logic from the existing I422TORGB_SVE_2X macro and simply amend the existing READNV_SVE macro to read twice as much data. Unrolling is primarily beneficial for little cores but also provides some smaller benefits to larger cores as well. \| NV12ToARGBRow_SVE2 \| NV21ToARGBRow_SVE2 Cortex-A510 \| -48.0% \| -47.9% Cortex-A520 \| -48.1% \| -48.2% Cortex-A715 \| -20.4% \| -20.4% Cortex-A720 \| -20.6% \| -20.6% Cortex-X2 \| -7.1% \| -7.3% Cortex-X3 \| -4.0% \| -4.3% Cortex-X4 \| -14.1% \| -14.3% Cortex-X925 \| -8.2% \| -8.6% Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:03:42 +00:00
George Steed	03a935493d	[AArch64] Simplify predicate width calculations Several of the existing SVE kernels used calculations of the form: remainder = width & (vl - 1) == 0 ? vl : width & (vl - 1); This is due to initial SVE contributed code unconditionally using the predicated tail for the final iteration even if the width was a perfect multiple of the vector length. In the current code the fully-predicated main body loop will instead iterate through the width completely and simply skip over the tail entirely. Skipping over the tail means that the case handled by the ternary condition now never occurs, and the remainder calculation can now simply be: remainder = width & (vl - 1); This avoids the need for a compare and conditional select in the function prologue. Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 21:54:32 +00:00
George Steed	2c32b689e4	[AArch64] Improve instruction interleaving in READI212_SVE The existing instruction arrangement is sub-optimal on little cores since it has instructions with dependencies next to each other, so spread them out to improve performance. No significant change observed on bigger cores, but little cores do show some small improvements except for the Alpha kernels which regress slightly. Runtimes observed compared to the previous SVE implementation: \| Cortex-A510 \| Cortex-A520 I210AlphaToARGBRow \| (!) +7.0% \| (!) +6.8% I210ToAR30Row \| -10.3% \| -9.9% I210ToARGBRow \| -2.4% \| -2.3% I212ToAR30Row \| -10.3% \| -9.9% I212ToARGBRow \| -2.4% \| -2.3% Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 21:50:47 +00:00
Junji Watanabe	a729ba686a	Add hook to fetch reclient config files (Initially uploaded here https://crrev.com/c/5726652) This logic was copied from the logic in chromium/src at https://chromium-review.googlesource.com/c/chromium/src/+/4666325 as that is the current version of buildtools that libyuv uses This is needed to be able to remove the old path of downloading remote exec configs on ci builders Test: CQ tryjobs No-Try: true Bug: b/292501270 Change-Id: Idea22e9a499e57d86f1e1e8ed9c0ca346aa162b6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6055341 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Christoffer Dewerin <jansson@chromium.org> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2024-12-03 09:24:16 +00:00
Hao Chen	532126bf70	Fix bugs in ARGBAttenuateRow_LASX/LSX function Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions caused by changes in calculation methods. In addition, add the option to automatically add "-mlsx" and "-mlasx" to enable SIMD optimization when compiling with cmake on LoongArch platform. Bug: libyuv:913 Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Ben Weiss <bweiss@google.com>	2024-11-30 23:09:04 +00:00
George Steed	9a9752134e	[AArch64] Add Neon implementation of ScaleRowDown2Linear_16 Reduction in runtime observed relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -13.7% Cortex-A510: -49.0% Cortex-A520: -32.0% Cortex-A76: -34.3% Cortex-A710: -56.7% Cortex-A715: -45.4% Cortex-A720: -44.7% Cortex-X1: -70.6% Cortex-X2: -67.9% Cortex-X3: -72.2% Cortex-X4: -40.0% Cortex-X925: -24.1% Bug: b/42280942 Change-Id: I977899a2239e752400c9901f4d8482a76841269a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040154 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:26 +00:00
George Steed	11c57f4f12	[AArch64] Add Neon implementation of ScaleRowDown2_16_NEON The auto-vectorized implementation unrolls to process 32 elements per iteration, so unroll the new Neon implementation to match and avoid a performance regression on little cores. Performance relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -35.8% Cortex-A510: -20.4% Cortex-A520: -22.1% Cortex-A76: -54.8% Cortex-A710: -44.5% Cortex-A715: -31.1% Cortex-A720: -31.4% Cortex-X1: -48.5% Cortex-X2: -47.8% Cortex-X3: -47.6% Cortex-X4: -51.1% Cortex-X925: -14.6% Bug: b/42280942 Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:05 +00:00
George Steed	952d6a282f	[AArch64] Enable use of ScaleRowDown2Box_16_NEON The #ifdef surrounding the use of this kernel is never defined and ScaleRowDown2_16_NEON does not exist, so add the missing #define and remove the use of ScaleRowDown2_16_NEON for now. Additionally since there is no implementation of this kernel for 32-bit Arm, restrict the define to only be present on AArch64. Bug: b/42280942 Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-24 19:58:00 +00:00
George Steed	9ed07258c7	[AArch64] Add SVE2 implementation of I410ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -18.1% Cortex-A520: -6.0% Cortex-A715: -22.0% Cortex-A720: -21.1% Cortex-X2: -9.4% Cortex-X3: -12.0% Cortex-X4: -7.6% Cortex-X925: -5.8% Bug: b/42280942 Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:59:55 +00:00
George Steed	3dd047733e	[AArch64] Add SVE2 implementation of I410AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.2% Cortex-A520: -6.9% Cortex-A715: -14.8% Cortex-A720: -16.0% Cortex-X2: -14.8% Cortex-X3: -17.5% Cortex-X4: -12.8% Cortex-X925: -13.0% Bug: b/42280942 Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:58:11 +00:00
George Steed	e84d809348	[AArch64] Add SVE2 implementation of I410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.9% Cortex-A520: -9.2% Cortex-A715: -14.3% Cortex-A720: -14.2% Cortex-X2: -10.9% Cortex-X3: -11.1% Cortex-X4: -12.5% Cortex-X925: -10.6% Bug: b/42280942 Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:54:48 +00:00
George Steed	7c9c72ab4b	[AArch64] Add SVE2 implementation of I210ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.5% Cortex-A520: -3.8% Cortex-A715: -15.8% Cortex-A720: -15.8% Cortex-X2: -7.9% Cortex-X3: -6.5% Cortex-X4: -5.0% Cortex-X925: -5.3% Bug: b/42280942 Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-23 00:53:16 +00:00
George Steed	fc3569ad27	[AArch64] Add SVE2 implementation of I210AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -33.9% Cortex-A520: -4.2% Cortex-A715: -22.0% Cortex-A720: -22.4% Cortex-X2: -14.6% Cortex-X3: -14.5% Cortex-X4: -11.6% Cortex-X925: -12.6% Bug: b/42280942 Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:47:32 +00:00
George Steed	50108f29fb	[AArch64] Add SVE2 implementation of I212ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.4% Cortex-A520: -3.8% Cortex-A715: -15.7% Cortex-A720: -15.6% Cortex-X2: -7.9% Cortex-X3: -5.7% Cortex-X4: -5.3% Cortex-X925: -4.8% Bug: b/42280942 Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:27:57 +00:00
George Steed	305a7a4ede	[AArch64] Add SVE2 implementation of I212ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -16.1% Cortex-X2: -11.9% Cortex-X3: -11.9% Cortex-X4: -9.3% Cortex-X925: -11.2% Bug: b/42280942 Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:21:27 +00:00
Frank Barchard	595146434a	HalfFloat fix SigIll on aarch64 - Remove special case Scale of 1 which used fp16 cvt but requires cpuid - Port aarch64 to aarch32 - Use C for aarch32 with small (denormal) scale value Bug: 377693555 Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-22 22:08:00 +00:00
Frank Barchard	307b951229	Add CopyPlane_Unaligned, _Any and _Invert tests/benchmarks CpuId test - Add AMD_ERMSB detect for ERMS on AMD Bug: 379457420 Change-Id: I608568556024faf19abe4d0662aeeee553a0a349 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6032852 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-19 23:53:05 +00:00
Frank Barchard	1c501a8f3f	CpuId test FSMR - Fast Short Rep Movsb - Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features Intel CPUs starting at Icelake support FSMR adl:Has FSMR 0x8000 arl:Has FSMR 0x0 bdw:Has FSMR 0x0 clx:Has FSMR 0x0 cnl:Has FSMR 0x0 cpx:Has FSMR 0x0 emr:Has FSMR 0x8000 glm:Has FSMR 0x0 glp:Has FSMR 0x0 gnr:Has FSMR 0x8000 gnr256:Has FSMR 0x8000 hsw:Has FSMR 0x0 icl:Has FSMR 0x8000 icx:Has FSMR 0x8000 ivb:Has FSMR 0x0 knl:Has FSMR 0x0 knm:Has FSMR 0x0 lnl:Has FSMR 0x8000 mrm:Has FSMR 0x0 mtl:Has FSMR 0x8000 nhm:Has FSMR 0x0 pnr:Has FSMR 0x0 rpl:Has FSMR 0x8000 skl:Has FSMR 0x0 skx:Has FSMR 0x0 slm:Has FSMR 0x0 slt:Has FSMR 0x0 snb:Has FSMR 0x0 snr:Has FSMR 0x0 spr:Has FSMR 0x8000 srf:Has FSMR 0x0 tgl:Has FSMR 0x8000 tnt:Has FSMR 0x0 wsm:Has FSMR 0x0 Intel CPUs starting at Ivybridge support ERMS adl:Has ERMS 0x4000 arl:Has ERMS 0x4000 bdw:Has ERMS 0x4000 clx:Has ERMS 0x4000 cnl:Has ERMS 0x4000 cpx:Has ERMS 0x4000 emr:Has ERMS 0x4000 glm:Has ERMS 0x4000 glp:Has ERMS 0x4000 gnr:Has ERMS 0x4000 gnr256:Has ERMS 0x4000 hsw:Has ERMS 0x4000 icl:Has ERMS 0x4000 icx:Has ERMS 0x4000 ivb:Has ERMS 0x4000 knl:Has ERMS 0x4000 knm:Has ERMS 0x4000 lnl:Has ERMS 0x4000 mrm:Has ERMS 0x0 mtl:Has ERMS 0x4000 nhm:Has ERMS 0x0 pnr:Has ERMS 0x0 rpl:Has ERMS 0x4000 skl:Has ERMS 0x4000 skx:Has ERMS 0x4000 slm:Has ERMS 0x4000 slt:Has ERMS 0x0 snb:Has ERMS 0x0 snr:Has ERMS 0x4000 spr:Has ERMS 0x4000 srf:Has ERMS 0x4000 tgl:Has ERMS 0x4000 tnt:Has ERMS 0x4000 wsm:Has ERMS 0x0 Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-18 17:56:45 +00:00
Frank Barchard	75f7cfdde5	SplitRGB for SSE4 and AVX2 libyuv_test '--gunit_filter=SplitRGB' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Note: Google Test filter = SplitRGB Skylake Xeon x86 32 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms) Skylake Xeon x86 64 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms) Bug: 379186682 Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-15 00:46:25 +00:00
George Steed	823d960afc	[AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row Observed reductions in runtime compared to the existing Neon code: \| P210ToAR30Row \| P410ToAR30Row Cortex-A510 \| -16.5% \| -21.2% Cortex-A520 \| (!) +2.7% \| -8.7% Cortex-A715 \| -6.1% \| -6.1% Cortex-A720 \| -6.2% \| -5.9% Cortex-X2 \| -4.1% \| -4.2% Cortex-X3 \| -4.2% \| -4.2% Cortex-X4 \| -1.2% \| -1.2% Cortex-X925 \| -3.6% \| -2.8% Bug: b/42280942 Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:52:21 +00:00
George Steed	0ddf3f7b90	[AArch64] Add SVE2 implementation of I210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -13.9% Cortex-X2: -11.9% Cortex-X3: -11.6% Cortex-X4: -9.5% Cortex-X925: -11.5% Bug: b/42280942 Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:36:41 +00:00
Frank Barchard	74bd6d93c6	Use grep extended regex for version - Uses grep extended regex to extract version information rather than perl regex, which isn't supported on macOS Co-authored-by: trevormcguire@google.com Bug: 277348774 Change-Id: Ifa37207ae360350f0a96c1248bf6407005c00096 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6011548 Reviewed-by: Ben Weiss <bweiss@google.com>	2024-11-13 02:11:17 +00:00
George Steed	5b906a0ec8	[AArch64] Add SVE2 implementation of P410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.7% Cortex-A520: -2.4% Cortex-A715: -18.7% Cortex-A720: -18.8% Cortex-X2: -7.7% Cortex-X3: -8.9% Cortex-X4: +1.0% (!) Cortex-X925: -8.3% Bug: b/42280942 Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:34:56 +00:00
George Steed	b753822d47	[AArch64] Add SVE2 implementation of P210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -32.8% Cortex-A520: +8.7% (!) Cortex-A715: -18.9% Cortex-A720: -18.9% Cortex-X2: -7.9% Cortex-X3: -8.8% Cortex-X4: +1.0% (!) Cortex-X925: -8.6% Bug: b/42280942 Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-12 18:32:55 +00:00
George Steed	721ad4aa18	[AArch64] Add SME implementation of ScaleUVRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:30:30 +00:00
George Steed	576218dbce	[AArch64] Add SME implementation of ScaleUVRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:28:57 +00:00
George Steed	551cee7845	[AArch64] Add SME implementation of ScaleUVRowDown2 There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:24:28 +00:00
George Steed	de6b47370f	CMakeLists.txt: Fix typo: OLD_CMAKE_{REQURED => REQUIRED}_FLAGS Change-Id: Ib09316dfda4182a860d2f1db985b15ebeabba5ba Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6012824 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:22:45 +00:00
George Steed	5c12e0b2de	[AArch64] Add SVE2 implementations of HalfFloat{,1}Row For HalfFloat1Row, SVE has direct 16-bit integer to half-float conversion instructions so there is no need to widen to 32-bits. For HalfFloatRow, SVE zero-extending loads avoid the need for separate UXTL(2) instructions. Observed reductions in runtime compared to the existing Neon code: \| HalfFloat1Row \| HalfFloatRow Cortex-A510 \| -38.3% \| -17.3% Cortex-A520 \| -37.6% \| -18.8% Cortex-A720 \| -50.1% \| -7.8% Cortex-X2 \| -50.2% \| -0.4% Cortex-X4 \| -51.5% \| -12.5% Bug: b/42280942 Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:53:00 +00:00
George Steed	7d383c2f1a	[AArch64] Add comments to ScaleRowDown38_{2,3}_Box_NEON impls Add a few comments to help illustrate the permute operations. As requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803 Change-Id: I8596ef63af5fae4dba1e6fdb548742ba7e191ab9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975317 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:47:12 +00:00
George Steed	f27b983f38	[AArch64] Add SVE2 implementation of DivideRow_16 SVE contains the UMULH instruction which allows us to multiply and take the high half of the result in a single instruction rather than needing separate widening multiply and then narrowing shift steps. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -21.2% Cortex-A520: -20.9% Cortex-A715: -47.9% Cortex-A720: -47.6% Cortex-X2: -5.2% Cortex-X3: -2.6% Cortex-X4: -32.4% Cortex-X925: -1.5% Bug: b/42280942 Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:46:02 +00:00
George Steed	aec4b4e22e	[AArch64] Add SME implementation of ScaleRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:42:21 +00:00
George Steed	b0f72309c6	Remove duplicate kernel assignment from scale_uv.cc The assignment of ScaleUVRowDown2Box_NEON is already done in the block immediately below this one, so just remove this code. Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-01 15:42:21 +00:00
George Steed	f00c43f4d6	[AArch64] Unroll HalfFloat{,1}Row_NEON The existing C implementation compiled with a recent LLVM is auto-vectorised and unrolled to process four vectors per loop iteration, making the Neon implementation slower than the C implementation on little cores. To avoid this, unroll the Neon implementation to also process four vectors per iteration. Reduction in cycle counts observed compared to the existing Neon implementation: \| HalfFloat1Row_NEON \| HalfFloatRow_NEON Cortex-A510 \| -37.1% \| -40.8% Cortex-A520 \| -32.3% \| -37.4% Cortex-A720 \| 0.0% \| -10.6% Cortex-X2 \| 0.0% \| -7.8% Cortex-X4 \| +0.3% \| -6.9% Bug: b/42280945 Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-30 17:58:29 +00:00
George Steed	51d07554a0	[AArch64] Add SME implementation of ScaleRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-30 17:57:15 +00:00
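
The predicate-width simplification described in commit 03a935493d can be illustrated with a small scalar sketch. This is not code from libyuv: the function names are hypothetical, `vl` is assumed to be a power of two (the vector length in elements), and the loop only models the "main body, then optional tail" shape the commit message describes.

```c
#include <assert.h>
#include <stddef.h>

/* Old form from the commit message (parenthesised so it parses as intended):
 * an exact multiple of vl is reported as a full-width tail. */
static size_t tail_width_old(size_t width, size_t vl) {
  return (width & (vl - 1)) == 0 ? vl : (width & (vl - 1));
}

/* Simplified form: an exact multiple of vl yields 0, meaning "no tail". */
static size_t tail_width_new(size_t width, size_t vl) {
  return width & (vl - 1);
}

/* Hypothetical model of the loop shape: the main body consumes every full
 * vector, and the tail runs only when the remainder is non-zero. */
static size_t elements_processed(size_t width, size_t vl) {
  size_t done = 0;
  while (width - done >= vl) {
    done += vl; /* one full vector per iteration */
  }
  size_t remainder = tail_width_new(width, vl);
  if (remainder != 0) {
    done += remainder; /* predicated tail for the leftover elements */
  }
  return done;
}
```

Under these assumptions, a width of 32 with `vl` of 16 gives a tail of 16 in the old form and 0 in the new one, and every element is still covered because the main loop consumes both full vectors.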

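Commit c2e7f8389a notes that a source_y_fraction of 256 would overflow a byte and can be handled with memcpy. A scalar sketch of why follows. It is not the SME kernel or the libyuv C reference: `interpolate_row_sketch`, its parameter names, the rounding constant, and the choice to copy the second row at 256 are assumptions that follow from the blend formula used here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar blend of two rows with an 8.8 fixed-point fraction
 * in the range [0, 256]. */
static void interpolate_row_sketch(uint8_t* dst, const uint8_t* row0,
                                   const uint8_t* row1, int width,
                                   int fraction) {
  if (fraction == 0) { /* weight 256 on row0 does not fit in a byte */
    memcpy(dst, row0, (size_t)width);
    return;
  }
  if (fraction == 256) { /* weight 256 on row1 does not fit in a byte */
    memcpy(dst, row1, (size_t)width);
    return;
  }
  /* Both weights are now in [1, 255], so each fits in a byte lane. */
  uint8_t f1 = (uint8_t)fraction;
  uint8_t f0 = (uint8_t)(256 - fraction);
  for (int i = 0; i < width; ++i) {
    dst[i] = (uint8_t)((row0[i] * f0 + row1[i] * f1 + 128) >> 8);
  }
}
```

With the special cases in place, the per-element weights can be held as bytes, which is the constraint the commit message refers to.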

2870 Commits