libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2025-12-08 01:36:47 +08:00

Author	SHA1	Message	Date
George Steed	b753822d47	[AArch64] Add SVE2 implementation of P210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -32.8% Cortex-A520: +8.7% (!) Cortex-A715: -18.9% Cortex-A720: -18.9% Cortex-X2: -7.9% Cortex-X3: -8.8% Cortex-X4: +1.0% (!) Cortex-X925: -8.6% Bug: b/42280942 Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-12 18:32:55 +00:00
George Steed	5c12e0b2de	[AArch64] Add SVE2 implementations of HalfFloat{,1}Row For HalfFloat1Row, SVE has direct 16-bit integer to half-float conversion instructions so there is no need to widen to 32 bits. For HalfFloatRow, SVE zero-extending loads avoid the need for separate UXTL(2) instructions. Observed reductions in runtime compared to the existing Neon code: \| HalfFloat1Row \| HalfFloatRow Cortex-A510 \| -38.3% \| -17.3% Cortex-A520 \| -37.6% \| -18.8% Cortex-A720 \| -50.1% \| -7.8% Cortex-X2 \| -50.2% \| -0.4% Cortex-X4 \| -51.5% \| -12.5% Bug: b/42280942 Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:53:00 +00:00
George Steed	f27b983f38	[AArch64] Add SVE2 implementation of DivideRow_16 SVE contains the UMULH instruction which allows us to multiply and take the high half of the result in a single instruction rather than needing separate widening multiply and then narrowing shift steps. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -21.2% Cortex-A520: -20.9% Cortex-A715: -47.9% Cortex-A720: -47.6% Cortex-X2: -5.2% Cortex-X3: -2.6% Cortex-X4: -32.4% Cortex-X925: -1.5% Bug: b/42280942 Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:46:02 +00:00
George Steed	22ac86800e	[AArch64] Add SVE2 implementation of I422ToARGB4444Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -35.5% Cortex-A520: -38.2% Cortex-A715: -19.8% Cortex-A720: -19.8% Cortex-X2: -24.2% Cortex-X3: -24.1% Cortex-X4: -21.6% Cortex-X925: -19.5% Bug: b/42280942 Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f4eaeca22a	[AArch64] Add SVE2 implementation of I422ToARGB1555Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.8% Cortex-A520: -42.6% Cortex-A715: -22.5% Cortex-A720: -22.6% Cortex-X2: -22.7% Cortex-X3: -22.4% Cortex-X4: -19.4% Cortex-X925: -27.0% Bug: b/42280942 Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f40042533c	[AArch64] Add SVE2 implementation of I422ToRGB565Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.1% Cortex-A520: -38.2% Cortex-A715: -21.5% Cortex-A720: -21.6% Cortex-X2: -21.6% Cortex-X3: -22.0% Cortex-X4: -23.5% Cortex-X925: -21.7% Bug: b/42280942 Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	0dce974ca0	[AArch64] Add SVE2 implementation of I422ToRGB24Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -57.8% Cortex-A520: -41.7% Cortex-A715: -28.0% Cortex-A720: -28.1% Cortex-X2: -29.7% Cortex-X3: -28.7% Cortex-X4: -30.5% Cortex-X925: -30.3% Bug: b/42280942 Change-Id: I328bd16babda75fb089c8da8f2714465f658187e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-24 02:17:32 +00:00
Wan-Teh Chang	6157cc4583	Remove the ' separators in hex integer constants They are a C++14 feature, not supported in C++11 mode (-std=c++11). Change-Id: I618020342d4964b994aefa06af83b2e8d553a032 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786607 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-14 20:50:28 +00:00
Wan-Teh Chang	3cf54e90d3	Fix -Wmissing-prototypes warnings Declare functions as static. Declare functions in a header. Include the header that declares the functions. Delete undeclared and unused functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete unused function ScaleY() in psnr_main.cc. Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-12 19:08:24 +00:00
George Steed	42d33341d3	[AArch64] Unroll {RAW,RGB24}To{ARGB,RGBA}Row_SVE2 Unrolling gives a nice improvement to the little cores and even a small improvement to the big cores thanks to avoiding the loop control overhead. Observed performance improvement relative to the existing SVE2 code. \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 RAWToARGBRow_SVE2 \| -28.4% \| -10.1% \| -3.5% RAWToRGBARow_SVE2 \| -28.5% \| -10.1% \| -4.4% RGB24ToARGBRow_SVE2 \| -28.5% \| -10.4% \| -5.5% Bug: libyuv:973 Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 16:01:56 +00:00
George Steed	4ad050b5ec	[AArch64] Unroll {I422,I422Alpha}ToARGBRow_SVE2 Since the UV components are duplicated in I422 we end up wasting half of the vector bandwidth processing the same elements twice. By unrolling the kernel to process two vectors of Y per iteration we can fill a whole vector of U/V components. Rather than packing RGBA components into pairs during the narrowing we now just narrow into individual component vectors and use ST4B instead. This by itself is slower on some micro-architectures like Cortex-A510 but the benefit from unrolling significantly outweighs this. \| I422AlphaToARGBRow_SVE2 \| I422ToARGBRow_SVE2 Cortex-A510 \| -46.2% \| -48.8% Cortex-A720 \| -20.8% \| -21.0% Cortex-X2 \| -11.3% \| -7.5% Cortex-X4 \| -15.4% \| -15.5% Bug: libyuv:973 Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 15:55:59 +00:00
George Steed	a64fffe632	Revert "Disable NV12ToARGB_SVE2 which fails the 'any' test" This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17. This code has a number of small issues: * The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to all-true, however the existing kernel does not initialise p0 until after this macro is called, so flip the order. * The p2 register is missing from the clobber list, so add it. * The existing code uses the wrong condition flags when determining whether to do the tail iteration using WHILE instructions or not. Additionally the number of tail iterations is incorrect, as it was incorrectly not changed from when the tail code was always executed. While we are here, make another few small improvements: * Remove the single-quote digit separators as requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133 * Remove "volatile" from the asm block counting the vector length. This particular asm block cannot be removed by the compiler since the output register is consumed by subsequent code, so "volatile" is unnecessary here and we remove it. * Add some additional empty comments to force clang-format to put macros into the next line rather than on the same line as other asm. Bug: b/352371649 Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-15 18:13:42 +00:00
George Steed	899bc48327	[AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD4 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementations: \| ARGBToRAWRow \| ARGBToRGB24Row Cortex-A510 \| -50.8% \| -19.9% Cortex-A720 \| -39.8% \| -39.1% Cortex-X2 \| -66.5% \| -51.9% Bug: libyuv:973 Change-Id: Iaead678715a3d70d54cf823391272a6196836769 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:27:54 +00:00
George Steed	5236846b64	[AArch64] Keep UV interleaved in some *ToARGBRow_SVE2 kernels The existing I4XXTORGB_SVE macro operates only on even byte lanes of the loaded U/V vectors. This is sub-optimal since we are effectively wasting half of the vector in any pre-processing steps before the conversion. In particular, where the UV components are loaded from interleaved data we can save a TBL instruction by maintaining the interleaved format. This commit introduces a new NVTORGB_SVE macro to handle the case where U/V components are interleaved into even/odd bytes of a vector, mirroring a similar macro in the AArch64 Neon implementation. Reduction in runtimes observed compared to the existing SVE2 code: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 NV12ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% NV21ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% UYVYToARGBRow_SVE2 \| -5.6% \| 0.0% \| -4.6% YUY2ToARGBRow_SVE2 \| -5.5% \| -0.1% \| -4.2% Bug: libyuv:973 Change-Id: I418de2e684e0b6b0b9e41c39b564438531e44671 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:26:23 +00:00
George Steed	555f80f3ce	[AArch64] Add SVE2 implementation of RGB24ToARGBRow This can make use of the existing helper functions for RAWToARGBRow_SVE2 and RAWToRGBARow_SVE2 since the layouts are similar, we just need to adjust the TBL constants to match the different input layout. Observed reduction in runtime compared to the existing Neon kernel: Cortex-A510: -25.6% Cortex-A720: -15.2% Cortex-X2: -10.2% Cortex-X4: -30.2% Bug: libyuv:973 Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-08 20:12:05 +00:00
George Steed	11ff6067a5	[AArch64] Add SVE2 implementation of RAWToRGB24Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD3 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementation: Cortex-A510: -39.2% Cortex-A720: -34.5% Cortex-X2: -31.0% Bug: libyuv:973 Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 15:55:14 +00:00
George Steed	c613c3f102	[AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row We can construct particular predicates to load only up to 3/4 of a full vector, allowing us to use TBL to shuffle elements into the correct place rather than needing to rely on more expensive LD3 or ST4 instructions. Reduction in runtimes observed compared to the existing Neon implementation: \| RAWToARGBRow \| RAWToRGBARow Cortex-A510 \| -32.4% \| -31.9% Cortex-A720 \| -15.7% \| -15.6% Cortex-X2 \| -24.6% \| -24.4% Bug: libyuv:973 Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-06 22:40:15 +00:00
Frank Barchard	616bee5420	Add volatile for gcc inline to avoid being removed Bug: b/42280943 Change-Id: I4439077a92ffa6dff91d2d10accd5251b76f7544 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5671187 Reviewed-by: David Gao <davidgao@google.com>	2024-07-02 01:25:24 +00:00
George Steed	367dd50755	[AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels except reading the YUV components all from the same interleaved input array. We load four-byte elements and then use TBL to de-interleave the UV components. Unlike the NV{12,21} cases we need to de-interleave bytes rather than widened 16-bit elements. Since we need a TBL instruction already it would ordinarily be possible to perform the zero-extension from bytes to 16-bit elements by setting the index for every other byte to be out of range. Such an approach does not work in SVE since at a vector length of 2048 bits all possible byte values (0-255) are valid indices into the vector. We instead get around this by rewriting the I4XXTORGB_SVE macro to perform widening multiplies, operating on the low byte of each 16-bit UV element instead of the full value and therefore eliminating the need for a zero-extension. Observed reductions in runtimes compared to the existing Neon code: \| UYVYToARGBRow \| YUY2ToARGBRow Cortex-A510 \| -30.2% \| -30.2% Cortex-A720 \| -4.8% \| -4.7% Cortex-X2 \| -9.6% \| -10.1% Bug: libyuv:973 Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:06:46 +00:00
George Steed	cd4113f4e8	[AArch64] Add SVE2 implementation of I400ToARGBRow This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we can pre-calculate the UV component results before the loop body. Unlike in the Neon version of the code we can make use of MOVPRFX and USQADD to avoid needing to apply the bias separately from the UV coefficient multiply additions. Reduction in runtime observed compared to the existing Neon code: Cortex-A510: -26.1% Cortex-A520: -5.9% Cortex-A715: -49.5% Cortex-A720: -49.4% Cortex-X2: -22.5% Cortex-X3: -23.5% Cortex-X4: -21.6% Bug: libyuv:973 Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:02:06 +00:00
George Steed	34abe98fe2	[AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow We need a permute to duplicate the UV components, so we can share a common implementation for both NV12 and NV21 by varying the inputs to the INDEX instruction that generates the TBL indices. Observed reductions in runtimes compared to the existing Neon code: \| NV12ToARGBRow_SVE2 \| NV21ToARGBRow_SVE2 Cortex-A510 \| -29.1% \| -29.1% Cortex-A720 \| -4.8% \| -4.8% Cortex-X2 \| -9.2% \| -9.2% Bug: libyuv:973 Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-12 16:24:40 +00:00
George Steed	96bbdb53ed	[AArch64] Add SVE2 implementation of I422ToRGBARow This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we just need to interleave differently for the output. The RGBA format actually saves us an instruction compared to ARGB since there is no need to merge in the alpha component, we can just replace the odd elements of the alpha vector itself during the narrowing. Also rename some existing macros to make more sense when distinguishing between ARGB and RGBA. Reductions in runtime observed compared to the existing Neon code: Cortex-A510: -27.0% Cortex-A720: -5.3% Cortex-X2: -14.7% Bug: libyuv:973 Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	004352ba16	[AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row These kernels are mostly identical to each other except for the order of the results, so we can use a single macro to parameterize the pairwise addition and use the same macro for both implementations, just with the register order flipped. Similar to other 2x2 kernels the implementation here differs slightly for the last element if the problem size is odd, so use an "any" kernel to avoid needing to handle this in the common code path. Observed reduction in runtime compared to the existing Neon code: \| AYUVToUVRow \| AYUVToVURow Cortex-A510 \| -33.1% \| -33.0% Cortex-A720 \| -25.1% \| -25.1% Cortex-X2 \| -59.5% \| -53.9% Cortex-X4 \| -39.2% \| -39.4% Bug: libyuv:973 Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	d0da5a3298	[AArch64] Add SVE2 implementation of ARGB1555ToARGBRow Avoiding LD4 and unrolling gives a good perf improvement for the little core especially. Observed reduction in runtime relative to the existing Neon code: Cortex-A510: -69.7% Cortex-A720: -7.7% Cortex-X2: -41.9% Cortex-X4: -14.5% Bug: libyuv:973 Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	250e1e1ba3	[AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -21.7% Cortex-A720: -49.2% Cortex-X2: -62.6% Bug: libyuv:973 Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-03 23:15:04 +00:00
George Steed	bce3392830	[AArch64] Add SVE2 implementation of ARGBToRGB565Row Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -27.1% Cortex-A720: -49.4% Cortex-X2: -67.9% Bug: libyuv:973 Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-31 17:42:27 +00:00
George Steed	a114f85e50	[AArch64] Fix naming in ARGBToUVMatrixRow_SVE2 etc constants Avoid abbreviations and capitalize ARGB and UV naming, as suggested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537 Bug: libyuv:973 Change-Id: I0d0143154594c03e6aca7c859b874e39634ca54f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5513544 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-03 17:25:14 +00:00
George Steed	6f1d8b1e11	[AArch64] Add SVE2 implementations for ARGBToUVRow and similar By maintaining the interleaved format of the data we can use a common kernel for all input channel orderings and simply pass a different vector of constants instead. A similar approach is possible with only Neon by making use of multiplies and repeated application of ADDP to combine channels, however this is slower on older cores like Cortex-A53 so is not pursued further. For odd problem sizes we need a slightly different implementation for the final element, so introduce an "any" kernel to address that rather than bloating the code for the common case. Observed effect on runtimes compared to the existing Neon kernels: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 ABGRToUVJRow \| -15.5% \| +5.4% \| -33.1% ABGRToUVRow \| -15.6% \| +5.3% \| -35.9% ARGBToUVJRow \| -10.1% \| +5.4% \| -32.7% ARGBToUVRow \| -10.1% \| +5.4% \| -29.3% BGRAToUVRow \| -15.5% \| +4.6% \| -32.8% RGBAToUVRow \| -10.1% \| +4.2% \| -36.0% Bug: libyuv:973 Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-01 19:46:43 +00:00
George Steed	ce32eb773f	[AArch64] Avoid extraneous CMP in I{444,422}ToARGBRow_SVE2 impl We can use subs to set condition flags as part of the subtract, no need for a separate compare instruction. No performance difference observed from this change, but it now matches the other SVE2 kernels. Also remove unnecessary volatile from asm blocks. Bug: libyuv:973 Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-29 18:56:22 +00:00
George Steed	f483007b9a	[AArch64] Add SVE implementation for I422AlphaToARGBRow This is mostly identical to the existing I422ToARGBRow_SVE implementation, we just need to make sure to load the alpha component rather than hard-coding it to 255. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -32.1% Cortex-A720: -5.1% Cortex-X2: -10.1% Bug: libyuv:973 Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-29 18:54:07 +00:00
George Steed	b53b27d6bf	[AArch64] Add SVE implementation for I444AlphaToARGBRow This is mostly identical to the existing I444ToARGBRow_SVE implementation, we just need to make sure to load the alpha component rather than hard-coding it to 255. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -34.2% Cortex-A720: -17.6% Cortex-X2: -9.6% Bug: libyuv:973 Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-29 18:54:02 +00:00
George Steed	6ac90403a1	[AArch64] Add SVE implementation for I422ToARGBRow We need a new macro for reading I422 data, but this is otherwise mostly identical to the existing I444ToARGBRow_SVE implementation. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -25.0% Cortex-A720: -5.0% Cortex-X2: -10.8% Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-27 18:26:11 +00:00
George Steed	e52007eff9	[AArch64] Add SVE2 implementation for I444ToARGBRow Being able to use SVE2 functionality for these kernels has a number of performance wins compared to the existing Neon code: * For the Y component calculation we are able to use UMULH, versus the existing UMULL x2 + UZP2 sequence in Neon. * For the RGBTORGBA8 calculation we are able to take advantage of interleaving narrowing instructions, allowing us to use ST2 rather than ST4 for the store. This is a big performance win on some micro-architectures where ST4 is costly. * The use of predication means we do not need to add "any" kernels, we can simply rerun the calculation with a not-full predicate for the final iteration. To avoid the overhead of generating a predicate register on every iteration we duplicate the loop body and only generate a predicate on the final iteration of the loop. This costs a small amount on the final iteration but should still be significantly quicker than the overhead of a function call needed by the "any" cases. Duplicating the loop body to reduce the use of the WHILELT instruction improves little core performance by ~12% by itself but has negligible impact on other micro-architectures. Reduction in runtime for the new SVE2 implementation compared to the existing Neon implementation on selected micro-architectures: Cortex-A510: -36.5% Cortex-A720: -17.3% Cortex-X2: -11.3% Bug: libyuv:973 Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:11:01 +00:00
George Steed	9a8be20def	[AArch64] Add :libyuv_sve library in preparation for SVE kernels This commit only adds the bare minimum to get the new library building through GN, the actual content of row_sve.cc is empty for now until we start porting some kernels across. Bug: libyuv:973 Change-Id: Ibdf4fc258761f3e507d700f27a405099c667ac75 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424738 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:10:01 +00:00
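
Two entries in this log (f27b983f38 and e52007eff9) credit UMULH with replacing a widening multiply followed by a narrowing step. The sketch below is only a scalar C++ model of what one 16-bit lane of such a "multiply, keep the high half" operation computes; it is not the libyuv kernel. The names `MulHigh16Sketch` and `DivideRowSketch` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of one 16-bit lane of a multiply-high operation:
// form the full 32-bit product, then keep only bits 31..16.
// (Hypothetical helper; not part of libyuv.)
inline uint16_t MulHigh16Sketch(uint16_t a, uint16_t b) {
  const uint32_t product = static_cast<uint32_t>(a) * static_cast<uint32_t>(b);
  return static_cast<uint16_t>(product >> 16);
}

// Applies the multiply-high to every element of a row with a fixed scale.
// With scale = 16384 this maps 10-bit samples (0..1023) to 8-bit range,
// because x * 16384 / 65536 == x / 4. (Hypothetical helper.)
inline std::vector<uint16_t> DivideRowSketch(const std::vector<uint16_t>& src,
                                             uint16_t scale) {
  std::vector<uint16_t> dst(src.size());
  for (size_t i = 0; i < src.size(); ++i) {
    dst[i] = MulHigh16Sketch(src[i], scale);
  }
  return dst;
}
```

In a vector instruction set without a multiply-high, the same result takes a widening multiply into 32-bit lanes and a separate shift or unzip to get back to 16-bit lanes, which is the extra work the commit messages describe avoiding.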

34 Commits
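
Commit 367dd50755 above explains why the usual "out-of-range TBL index yields zero" trick cannot be relied on for zero-extension in SVE: byte indices only reach 255, and a 2048-bit vector has 256 bytes, so every byte index is in range. The scalar C++ model below illustrates that reasoning under the assumption that a table lookup returns zero for any index at or beyond the table length. `TableLookupSketch` and `ZeroExtendIndicesSketch` are hypothetical names, not libyuv functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a byte table lookup: each output byte is table[index],
// or zero when the index is outside the table. (Hypothetical helper.)
inline std::vector<uint8_t> TableLookupSketch(
    const std::vector<uint8_t>& table, const std::vector<uint8_t>& indices) {
  std::vector<uint8_t> out(indices.size());
  for (size_t i = 0; i < indices.size(); ++i) {
    out[i] = indices[i] < table.size() ? table[indices[i]] : 0;
  }
  return out;
}

// Builds indices intended to zero-extend bytes to 16-bit elements:
// even output bytes pick source byte i/2, odd output bytes use index 255
// in the hope that it is out of range. (Hypothetical helper.)
inline std::vector<uint8_t> ZeroExtendIndicesSketch(size_t vector_bytes) {
  std::vector<uint8_t> indices(vector_bytes);
  for (size_t i = 0; i < vector_bytes; ++i) {
    indices[i] = (i % 2 == 0) ? static_cast<uint8_t>(i / 2) : 255;
  }
  return indices;
}
```

With a 16-byte (128-bit) table, index 255 is out of range and the odd bytes come back as zero. With a 256-byte (2048-bit) table, index 255 selects the last byte of the table, so the odd bytes are no longer guaranteed to be zero, which matches the commit's reason for switching to widening multiplies on the low byte of each element instead.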