libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2025-12-07 09:16:48 +08:00

Author	SHA1	Message	Date
George Steed	e1a93c79fc	[AArch64] Fix rotate by odd sizes The existing disabled gtest rotate tests fail because the existing "any" kernels always assume we are processing height=8 rows at a time. This was recently changed to 16 on AArch64 which triggered this bug. To fix this, amend the TANY macro to explicitly specify the fallback kernel, such that we can use the height=16 kernel to match the SIMD optimized version where necessary. Also change other architecture versions to match. Bug: b/352351302 Change-Id: I8080fa8f44c7c67fa970a78fb426f2f801a9a00e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703585 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-15 18:13:31 +00:00
George Steed	c1fe5663f5	[AArch64] Use full vectors in ARGB4444To{Y,UV}Row_NEON The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors rather than the full 128 bits available, so unroll it to allow us to process more data per instruction. For ARGB4444ToUVRow_NEON we already have enough data available each iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we also need to adjust the "any" kernel to allow us to process 16 elements per iteration. Reduction in runtimes observed compared to the existing Neon kernels: \| ARGB4444ToUVRow \| ARGB4444ToYRow Cortex-A55 \| -27.8% \| -34.6% Cortex-A510 \| -37.0% \| -44.4% Cortex-A76 \| -40.2% \| -22.0% Cortex-A720 \| -33.4% \| -35.5% Cortex-X1 \| -34.1% \| -19.7% Cortex-X2 \| -32.1% \| -26.3% Bug: libyuv:976 Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-10 23:12:43 +00:00
George Steed	5bac99fe09	[AArch64] Rework data loading in ScaleARGBFilterCols_NEON The existing code makes use of lane-indexed LD2 instructions to load the input data however this creates a strong dependency chain between consecutive load instructions. We can reduce this dependency chain by instead loading two vectors with wider lane-indexed LD1 instructions and then performing a permute to unzip the data. We can also avoid the need for a complex sequence of DUP + EXT instructions by using TBL to permute the data exactly as we want it. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: =0.0% Cortex-A510: -44.2% Cortex-A520: -47.6% Cortex-A76: -45.8% Cortex-A715: -58.3% Cortex-A720: -58.4% Cortex-X1: -66.7% Cortex-X2: -68.0% Cortex-X3: -67.9% Cortex-X4: -70.0% Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-10 23:10:43 +00:00
George Steed	a425b559bd	[AArch64] Use full vectors in ARGB1555To{Y,UV}Row_NEON The existing RGB555TOARGB macro only makes use of 64 bit wide vectors rather than the full 128 bits available, so unroll it to allow us to process more data per instruction. For ARGB1555ToUVRow_NEON we already have enough data available each iteration to make use of full vectors, but for ARGB1555ToYRow_NEON we also need to adjust the "any" kernel to allow us to process 16 elements per iteration. Reduction in runtimes observed compared to the existing Neon kernels: \| ARGB1555ToUVRow \| ARGB1555ToYRow Cortex-A55 \| -28.8% \| -35.3% Cortex-A510 \| -34.0% \| -48.5% Cortex-A76 \| -36.7% \| -25.1% Cortex-A720 \| -29.7% \| -31.1% Cortex-X1 \| -31.6% \| -19.7% Cortex-X2 \| -27.6% \| -22.7% Bug: libyuv:976 Change-Id: Idd745c133b5fb65001652a59f01ac1aa3bb42067 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631540 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-10 23:09:53 +00:00
Frank Barchard	3902eaaf86	Fix for source/row_neon64.cc:551:12: error: unused variable 'alpha' [-Werror,-Wunused-variable] 551 \| uint16_t alpha = 0xc000; \| ^~~~~ 1 error generated. Bug: None Change-Id: Ifdfe39f75c003921e4f759bcbbbffe0e766039bd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5690260 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-07-09 22:51:33 +00:00
George Steed	899bc48327	[AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD4 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementations: \| ARGBToRAWRow \| ARGBToRGB24Row Cortex-A510 \| -50.8% \| -19.9% Cortex-A720 \| -39.8% \| -39.1% Cortex-X2 \| -66.5% \| -51.9% Bug: libyuv:973 Change-Id: Iaead678715a3d70d54cf823391272a6196836769 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:27:54 +00:00
George Steed	5236846b64	[AArch64] Keep UV interleaved in some *ToARGBRow_SVE2 kernels The existing I4XXTORGB_SVE macro operates only on even byte lanes of the loaded U/V vectors. This is sub-optimal since we are effectively wasting half of the vector in any pre-processing steps before the conversion. In particular, where the UV components are loaded from interleaved data we can save a TBL instruction by maintaining the interleaved format. This commit introduces a new NVTORGB_SVE macro to handle the case where U/V components are interleaved into even/odd bytes of a vector, mirroring a similar macro in the AArch64 Neon implementation. Reduction in runtimes observed compared to the existing SVE2 code: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 NV12ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% NV21ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% UYVYToARGBRow_SVE2 \| -5.6% \| 0.0% \| -4.6% YUY2ToARGBRow_SVE2 \| -5.5% \| -0.1% \| -4.2% Bug: libyuv:973 Change-Id: I418de2e684e0b6b0b9e41c39b564438531e44671 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:26:23 +00:00
George Steed	555f80f3ce	[AArch64] Add SVE2 implementation of RGB24ToARGBRow This can make use of the existing helper functions for RAWToARGBRow_SVE2 and RAWToRGBARow_SVE2 since the layouts are similar, we just need to adjust the TBL constants to match the different input layout. Observed reduction in runtime compared to the existing Neon kernel: Cortex-A510: -25.6% Cortex-A720: -15.2% Cortex-X2: -10.2% Cortex-X4: -30.2% Bug: libyuv:973 Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-08 20:12:05 +00:00
George Steed	fcbe22c59c	[AArch64] Enable SME feature detection on Apple Silicon Check for availability of SME and SME2 by looking for the hw.optional.arm.FEAT_SME2 feature string in sysctlbyname. Non-streaming SVE is not supported but for our purposes the features can be treated as orthogonal since our SME code will only ever run in streaming mode. Change-Id: I7e9d242e0f581217b625d74c7c3b0c76a0fe03da Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683128 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 16:19:27 +00:00
George Steed	11ff6067a5	[AArch64] Add SVE2 implementation of RAWToRGB24Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD3 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementation: Cortex-A510: -39.2% Cortex-A720: -34.5% Cortex-X2: -31.0% Bug: libyuv:973 Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 15:55:14 +00:00
George Steed	c613c3f102	[AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row We can construct particular predicates to load only up to 3/4 of a full vector, allowing us to use TBL to shuffle elements into the correct place rather than needing to rely on more expensive LD3 or ST4 instructions. Reduction in runtimes observed compared to the existing Neon implementation: \| RAWToARGBRow \| RAWToRGBARow Cortex-A510 \| -32.4% \| -31.9% Cortex-A720 \| -15.7% \| -15.6% Cortex-X2 \| -24.6% \| -24.4% Bug: libyuv:973 Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-06 22:40:15 +00:00
George Steed	d1ec694ad3	[AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 P210ToARGBRow \| -59.8% \| -16.8% \| -53.2% P210ToAR30Row \| -48.1% \| -21.8% \| -54.0% P410ToARGBRow \| -56.5% \| -32.2% \| -54.1% P410ToAR30Row \| -42.4% \| -4.5% \| -50.4% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-06 22:37:08 +00:00
Frank Barchard	611806a155	[AArch64] Fix SVE/SME vector length printing in cpuid A semicolon is treated as the start of a comment by some assemblers causing the vector length to be reported incorrectly, so use a newline instead. - Add volatile asm in row_gcc and row_neon64 Bug: b/5631539 Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060 Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-02 19:44:41 +00:00
George Steed	d32436e8f8	[AArch64] Add Neon implementation for I422ToAR30Row_NEON There is an existing x86 implementation for this kernel, but not for AArch64, so add one. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: Cortex-A55: -43.1% Cortex-A510: -22.3% Cortex-A76: -54.8% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-02 18:16:33 +00:00
George Steed	bbd9cedc4f	[AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210ToAR30Row \| I210ToARGBRow Cortex-A55 \| -40.8% \| -54.4% Cortex-A510 \| -26.2% \| -22.7% Cortex-A76 \| -49.2% \| -44.5% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-02 18:16:33 +00:00
Frank Barchard	fa16ddbb9f	cpuid show vector length on ARM and RISCV - additional asm volatile changes from github - rotate mips remove C function - moved to common Run on Samsung S22 [ RUN ] LibYUVBaseTest.TestCpuHas Kernel Version 5.10 Has Arm 0x2 Has Neon 0x4 Has Neon DotProd 0x10 Has Neon I8MM 0x20 Has SVE 0x40 Has SVE2 0x80 Has SME 0x0 SVE vector length: 16 bytes [ OK ] LibYUVBaseTest.TestCpuHas (0 ms) [ RUN ] LibYUVBaseTest.TestCompilerMacros __ATOMIC_RELAXED 0 __cplusplus 201703 __clang_major__ 17 __clang_minor__ 0 __GNUC__ 4 __GNUC_MINOR__ 2 __aarch64__ 1 __clang__ 1 __llvm__ 1 __pic__ 2 INT_TYPES_DEFINED __has_feature Run on RISCV qemu emulating SiFive X280: [ RUN ] LibYUVBaseTest.TestCpuHas Kernel Version 6.6 Has RISCV 0x10000000 Has RVV 0x20000000 RVV vector length: 64 bytes [ OK ] LibYUVBaseTest.TestCpuHas (4 ms) [ RUN ] LibYUVBaseTest.TestCompilerMacros __ATOMIC_RELAXED 0 __cplusplus 202002 __clang_major__ 9999 __clang_minor__ 0 __GNUC__ 4 __GNUC_MINOR__ 2 __riscv 1 __riscv_vector 1 __riscv_v_intrinsic 12000 __riscv_zve64x 1000000 __clang__ 1 __llvm__ 1 __pic__ 2 INT_TYPES_DEFINED __has_feature Bug: b/42280943 Change-Id: I53cf0450be4965a28942e113e4c77295ace70999 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5672088 Reviewed-by: David Gao <davidgao@google.com>	2024-07-02 18:10:56 +00:00
Frank Barchard	616bee5420	Add volatile for gcc inline to avoid being removed Bug: b/42280943 Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187 Reviewed-by: David Gao <davidgao@google.com>	2024-07-02 01:25:24 +00:00
Frank Barchard	efd164d64e	Disable RVV ScaleDownBy4 if compiler option is not enabled - Some configs have int64 elements off by default. Disable ScaleDownBy4 row function to avoid compile error Bug: 344954354 Change-Id: Ie0d74daea72375eff6438ab54cb2803d68d67e52 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598460 Reviewed-by: James Zern <jzern@google.com>	2024-06-18 01:52:40 +00:00
Frank Barchard	b0dfa70114	RVV remove unused variables - ARM Planar test use regular asm volatile syntax - x86 row functions remove volatile from asm Bug: 347111119, 347112532 Change-Id: I535b3dfa1a7a19824503bd95584a63b047b0e9a1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5637058 Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-17 20:25:31 +00:00
Bruce Lai	7758c961c5	Support RVV v0.12 intrinsics for row_rvv.cc & scale_rvv.cc 1. Add two defined marco LIBYUV_RVV_HAS_TUPLE_TYPE & LIBYUV_RVV_HAS_VXRM_ARG Intrinsic v0.12 introduces - tuple type in segment load & store - vxrm argument in fixed-point intrinsics (e.g vnclip) These two marcos are controled by __riscv_v_intrinsic. 2. Support RVV v0.12 intrinsics in row_rvv.cc & scale_rvv.cc Change-Id: I921f91d9dc8fdda031e7b6647d0e296aa2793c39 Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4767120 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-17 18:01:49 +00:00
George Steed	367dd50755	[AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels except reading the YUV components all from the same interleaved input array. We load four-byte elements and then use TBL to de-interleave the UV components. Unlike the NV{12,21} cases we need to de-interleave bytes rather than widened 16-bit elements. Since we need a TBL instruction already it would ordinarily be possible to perform the zero-extension from bytes to 16-bit elements by setting the index for every other byte to be out of range. Such an approach does not work in SVE since at a vector length of 2048 bits since all possible byte values (0-255) are valid indices into the vector. We instead get around this by rewriting the I4XXTORGB_SVE macro to perform widening multiplies, operating on the low byte of each 16-bit UV element instead of the full value and therefore eliminating the need for a zero-extension. Observed reductions in runtimes compared to the existing Neon code: \| UYVYToARGBRow \| YUY2ToARGBRow Cortex-A510 \| -30.2% \| -30.2% Cortex-A720 \| -4.8% \| -4.7% Cortex-X2 \| -9.6% \| -10.1% Bug: libyuv:973 Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:06:46 +00:00
George Steed	cd4113f4e8	[AArch64] Add SVE2 implementation of I400ToARGBRow This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we can pre-calculate the UV component results before the loop body. Unlike in the Neon version of the code we can make use of MOVPRFX and USQADD to avoid needing to apply the bias separately from the UV coefficient multiply additions. Reduction in runtime observed compared to the existing Neon code: Cortex-A510: -26.1% Cortex-A520: -5.9% Cortex-A715: -49.5% Cortex-A720: -49.4% Cortex-X2: -22.5% Cortex-X3: -23.5% Cortex-X4: -21.6% Bug: libyuv:973 Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:02:06 +00:00
George Steed	34abe98fe2	[AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow We need a permute to duplicate the UV components, so we can share a common implementation for both NV12 and NV21 by varying the inputs to the INDEX instruction that generates the TBL indices. Observed reductions in runtimes compared to the existing Neon code: \| NV12ToARGBRow_SVE2 \| NV21ToARGBRow_SVE2 Cortex-A510 \| -29.1% \| -29.1% Cortex-A720 \| -4.8% \| -4.8% Cortex-X2 \| -9.2% \| -9.2% Bug: libyuv:973 Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-12 16:24:40 +00:00
George Steed	a758a15dbf	[AArch64] Add I8MM implementation of ARGBColorMatrixRow We cannot use the standard dot-product instructions since the matrix of coefficients are signed, but I8MM supports mixed-sign products which work well here. Reduction in runtimes observed compared to the previous Neon implementation: Cortex-A510: -50.8% Cortex-A520: -33.3% Cortex-A715: -38.6% Cortex-A720: -38.5% Cortex-X2: -43.2% Cortex-X3: -40.0% Cortex-X4: -55.0% Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-12 16:17:59 +00:00
George Steed	89cf221baa	[AArch64] Avoid unnecessary widening in I422ToARGB1555Row_NEON The existing code first widens the component vectors from 8-bit elements to 16-bits to construct the final ARGB1555 result, however this is unnecessary since the inputs to the widening are themselves the result of having just been narrowed in the RGBTORGB8 macro. By making use of the new RGBTORGB8_TOP macro we can get rid of both the widening as well as the prior narrowing step. Also remove volatile from the asm, it is unnecessary. Reduction in runtime observed for I422ToARGB1555Row_NEON: Cortex-A55: -7.8% Cortex-A76: -15.0% Cortex-A720: -20.3% Cortex-X1: -20.2% Cortex-X2: -20.3% Bug: libyuv:976 Change-Id: Id031c5d4d788828297adcc2fe2c2cd8d99b45433 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5616050 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-11 23:36:13 +00:00
George Steed	e6c4b9ad2e	[Arm][AArch64] Remove unused ARGBToUVJ444Row_NEON definition There is no corresponding declaration in a header file and it appears to be unused, so remove from both the Arm and AArch64. Change-Id: I4de9fb7ce8e8dff6e76f4a99fdd93c743f92bf18 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5587507 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-10 18:36:31 +00:00
George Steed	c8974cf8d4	[AArch64] Add SME feature detection on Linux This commit just adds the kCpuHasSME to represent that the CPU has the Arm Scalable Matrix Extension enabled, but this commit does not introduce any code to actually use it yet. Add a test to check that the HWCAP value is interpreted correctly. Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-08 23:34:22 +00:00
George Steed	910f8e3645	[AArch64] Remove redundant semicolons after ANY41CT Introduced by 5b4160b9c322fda98e2208d80c2ea75dd7e7f25f. Bug: 345650115 Change-Id: I68c4c34ad9701f62729590ad137d743324497d28 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5604588 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-08 23:33:54 +00:00
George Steed	a68b959873	[AArch64] Add initial build system support for SME Extend both the CMake and BUILD.gn configurations to support building a library with the Arm Scalable Matrix Extension (SME). Add an initial (empty) rotate_sme.cc source file to populate the library for now. Change-Id: Icd4bd6a8ce72ba132299b00c99478a18a85d869a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5588664 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-08 23:32:41 +00:00
George Steed	3f657221f0	[AArch64] Remove unused vars in I{210,410}{,Alpha}ToARGBRow_NEON The elements of the YUV constants are passed directly to the inline asm block, so no need to pull them out into variables first. Also remove "volatile" from inline asm blocks, it is unnecessary. Bug: 344998222 Change-Id: I7d97dec8c7495651e5a31c10eda2d4aeed36fe6a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598764 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-07 02:39:20 +00:00
George Steed	96bbdb53ed	[AArch64] Add SVE2 implementation of I422ToRGBARow This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we just need to interleave differently for the output. The RGBA format actually saves us an instruction compared to ARGB since there is no need to merge in the alpha component, we can just replace the odd elements of the alpha vector itself during the narrowing. Also rename some existing macros to make more sense when distinguishing between ARGB and RGBA. Reductions in runtime observed compared to the existing Neon code: Cortex-A510: -27.0% Cortex-A720: -5.3% Cortex-X2: -14.7% Bug: libyuv:973 Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	004352ba16	[AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row These kernels are mostly identical to each other except for the order of the results, so we can use a single macro to parameterize the pairwise addition and use the same macro for both implementations, just with the register order flipped. Similar to other 2x2 kernels the implementation here differs slightly for the last element if the problem size is odd, so use an "any" kernel to avoid needing to handle this in the common code path. Observed reduction in runtime compared to the existing Neon code: \| AYUVToUVRow \| AYUVToVURow Cortex-A510 \| -33.1% \| -33.0% Cortex-A720 \| -25.1% \| -25.1% Cortex-X2 \| -59.5% \| -53.9% Cortex-X4 \| -39.2% \| -39.4% Bug: libyuv:973 Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	d0da5a3298	[AArch64] Add SVE2 implementation of ARGB1555ToARGBRow Avoiding LD4 and unrolling gives a good perf improvement for the little core especially. Observed reduction in runtime relative to the existing Neon code: Cortex-A510: -69.7% Cortex-A720: -7.7% Cortex-X2: -41.9% Cortex-X4: -14.5% Bug: libyuv:973 Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	250e1e1ba3	[AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -21.7% Cortex-A720: -49.2% Cortex-X2: -62.6% Bug: libyuv:973 Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-03 23:15:04 +00:00
George Steed	dff7bad43d	[AArch64] Use full Neon vectors in ARGB4444ToARGBRow_NEON The existing Neon code narrows the input 16-bit packed data to 8-bit elements and separates the color channels, causing us to only process half a Neon vector per instruction for the channel widening from 4-bit color data to 8-bits. We can note that the processing being done is identical for all color channels and therefore we can keep them partially interleaved during the widening step. This allows us to use full Neon vectors for the whole loop body. Reductions in runtimes observed for ARGB4444ToARGBRow_NEON: Cortex-A55: -30.7% Cortex-A510: -44.3% Cortex-A76: -51.6% Cortex-X2: -54.2% Bug: libyuv:976 Change-Id: I9d9cda7e16eb07619c6d7f1de2e6b8c0fb6d64cf Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594389 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-03 22:52:33 +00:00
George Steed	7633c818ec	[AArch64] Remove pointless MOVI in ARGB1555ToARGBRow_NEON This function takes the alpha component from the loaded data rather than hard-coding it to 255, so initialising v3 to 255 is unused here. Change-Id: I668825e0eeb317d1365035ce3bb47f3d92081c6f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594388 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-03 22:47:01 +00:00
George Steed	6c70eb2819	[AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: I210ToAR30Row on Cortex-A55: -43.8% I210ToAR30Row on Cortex-A510: -27.0% I210ToAR30Row on Cortex-A76: -50.4% I410ToAR30Row on Cortex-A55: -44.3% I410ToAR30Row on Cortex-A510: -17.5% I410ToAR30Row on Cortex-A76: -57.2% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-03 22:46:12 +00:00
George Steed	214b4a25c7	[Arm] Clean up rotate_neon.cc kernels Get rid of unused tail loops, since they are already handled by the "any" kernels. Also remove unnecessary "volatile" specifier from asm blocks. Bug: libyuv:976 Change-Id: I4676fc807bcaedbb5f0f52b1bed20a172fef4ed6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553719 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-03 22:23:40 +00:00
George Steed	bce3392830	[AArch64] Add SVE2 implementation of ARGBToRGB565Row Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -27.1% Cortex-A720: -49.4% Cortex-X2: -67.9% Bug: libyuv:973 Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-31 17:42:27 +00:00
George Steed	812b4955b2	[AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON There is are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210ToARGBRow \| I410ToARGBRow Cortex-A55 \| -55.6% \| -56.2% Cortex-A510 \| -22.6% \| -35.6% Cortex-A76 \| -48.1% \| -57.2% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 17:40:48 +00:00
George Steed	5b4160b9c3	[AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210AlphaToARGBRow \| I410AlphaToARGBRow Cortex-A55 \| -55.3% \| -56.1% Cortex-A510 \| -27.9% \| -42.6% Cortex-A76 \| -54.9% \| -60.3% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 08:41:31 +00:00
George Steed	e348995a92	[AArch64] Optimize MergeXR30Row_10_NEON By keeping intermediate data as 16-bits wide we can compute twice as much and use ST2 to store the final result. This appears to be much better even on micro-architectures where ST2 is slightly slower than ST1. We save a couple of instructions by taking advantage of multiply-add instructions to perform an effective shift-left and bitwise-or, since we know the set of nonzero bits are disjoint after the UMIN. Reduction in runtime observed for MergeXR30Row_10_NEON: Cortex-A55: -34.2% Cortex-A510: -35.6% Cortex-A76: -44.9% Cortex-X2: -48.3% Bug: libyuv:976 Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 08:32:55 +00:00
George Steed	56258c125b	[AArch64] Avoid redundant shift around RGB565 conversion The existing code performs a narrowing shift right (in RGBTORGB8) followed by a widening left shift (in ARGBTORGB565). This is redundant since we could have simply not performed a narrowing operation to begin with and instead done a saturating left shift to saturate against the top of the 16-bit lanes rather than the narrowed 8-bit lanes. To enable this we introduce new RGBTORGB8_TOP and ARGBTORGB565_FROM_TOP macros which produce and consume values from the high half of each 16-bit lane rather than a narrowed 8-bit intermediate. Reduction in runtime for selected kernels: \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 \| Cortex-X2 I422ToRGB565Row_NEON \| -10.8% \| -6.1% \| -17.2% \| -23.6% NV12ToRGB565Row_NEON \| -11.4% \| -4.9% \| -20.4% \| -17.4% Bug: libyuv:976 Change-Id: I3337b8f41ff62a7af1b70a56b774239bdb55d0f1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509197 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 08:29:54 +00:00
George Steed	c5f9583b1c	[AArch64] Avoid extracting alpha in ARGB1555ToYRow_NEON The existing implementation of this kernel uses the ARGB1555TOARGB macro which extracts and sign-extends the alpha component into v3, however this particular kernel does not need the alpha component. We can avoid calculating the alpha component completely by using the existing RGB555TOARGB macro, so use that instead. Reduction in runtimes observed for ARGB1555ToYRow_NEON (no noticeable improvement observed on Cortex-A510): Cortex-A55: -3.6% Cortex-A76: -20.9% Cortex-X2: -15.1% Bug: libyuv:976 Change-Id: I2cf2729c8297c53dcd32d0df28e64d4d5c7f6def Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509200 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-31 08:28:27 +00:00
George Steed	7c122e8859	[AArch64] Use ST2 to avoid TRN step in TransposeWx16_NEON ST2 with 64-bit lanes has good performance on all micro-architectures of interest and saves us 8 TRN instructions, so use that instead. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -8.6% Cortex-A510: -4.9% Cortex-A520: -6.0% Cortex-A76: -14.4% Cortex-A720: -5.3% Cortex-X1: -13.6% Cortex-X2: -5.8% Bug: libyuv:976 Change-Id: I08bb5517bbdc54c4784fce42a885b12f91e7a982 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5581597 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 08:27:05 +00:00
George Steed	6b9604dffc	[AArch64] Remove unused code from TransposeUVWx8_NEON We already have an "any" helper function set up for this kernel, so use it to match the other existing architecture paths. This change also affects the 32-bit Arm paths, which will be cleaned up in a later commit. With this change the kernel is now only entered with width as a multiple of eight, so remove the now-unneeded tail loops. Also remove volatile specifier from the asm block, it is unnecessary. Change-Id: If37428ac2d6035a8c27eec9bd80d014a98ac3eb1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553717 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-27 21:52:56 +00:00
George Steed	d0c28db56c	[AArch64] Optimize Merge{ARGB,XRGB}16To8Row_NEON Rather than shifting the data into the low half of each lane and then using a saturating narrowing operation, we can do the saturation as part of a shift into the highest half of the lane and then use a simpler TRN2 instruction to extract pairs of high halves into full vectors. This also has the nice advantage of allowing us to use ST2 rather than ST4 for storing the result, since ST4 is known to be slow on some micro-architectures. Reduction in runtimes observed for the two kernels: \| MergeARGB16To8Row_NEON \| MergeXRGB16To8Row_NEON Cortex-A55 \| -8.0% \| -12.2% Cortex-A510 \| -29.9% \| -31.4% Cortex-A76 \| -29.0% \| -32.0% Cortex-X2 \| -33.5% \| -43.4% Bug: libyuv:976 Change-Id: I9da3beedc27ab43527b3642aa6d4decf3b5b6683 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509198 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-21 07:55:03 +00:00
George Steed	4f7fd808b7	[AArch64] Use full vectors in TransposeWx{8 => 16}_NEON The existing Neon code only makes use of 64-bit vectors throughout which limits the performance on larger cores. To avoid this, swap the Neon code from a Wx8 implementation to a Wx16 implementation and process blocks of 16 full vectors at a time. The original code also handled widths that were not exact multiples of 16, however this should already be handled by the "any" kernel so it is removed. Finally, avoid duplicating the TransposeWx16_C fallback kernel definition in all architectures that need it, and just put it once in rotate_common.cc instead. Observed speedups for TransposePlane across a range of micro-architectures: Cortex-A53: -40.0% Cortex-A55: -20.7% Cortex-A57: -43.9% Cortex-A510: -43.5% Cortex-A520: -43.9% Cortex-A720: -31.1% Cortex-X2: -38.3% Cortex-X4: -43.6% Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:46:42 +00:00
George Steed	9fac9a4a82	[AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row There are existing x86 implementations for these kernels but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| ABGRToAR30Row \| ARGBToAR30Row Cortex-A55 \| -55.1% \| -55.1% Cortex-A510 \| -39.3% \| -40.1% Cortex-A76 \| -62.3% \| -63.6% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:35:07 +00:00
George Steed	83c48c782a	[AArch64] Improve ARGB4444TOARGB using SRI instructions Also avoid constructing the alpha component when it isn't needed by introducing a new ARGB4444TORGB macro. Reduction in runtime for selected kernels: \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 ARGB4444ToARGBRow_NEON \| -27.5% \| -27.9% \| -29.1% ARGB4444ToUVRow_NEON \| -20.2% \| -25.2% \| -21.7% ARGB4444ToYRow_NEON \| -16.0% \| -20.2% \| -21.3% Bug: libyuv:976 Change-Id: Ida061e1c49ba228b02c2f691a067b58edad073a8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509196 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:29:11 +00:00

1 2 3 4 5 ...

1850 Commits