libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-01-01 03:12:16 +08:00

Author	SHA1	Message	Date
George Steed	367dd50755	[AArch64] Add SVE2 impls for {UYVY,YUY2}ToARGBRow This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels except reading the YUV components all from the same interleaved input array. We load four-byte elements and then use TBL to de-interleave the UV components. Unlike the NV{12,21} cases we need to de-interleave bytes rather than widened 16-bit elements. Since we need a TBL instruction already it would ordinarily be possible to perform the zero-extension from bytes to 16-bit elements by setting the index for every other byte to be out of range. Such an approach does not work in SVE since, at a vector length of 2048 bits, all possible byte values (0-255) are valid indices into the vector. We instead get around this by rewriting the I4XXTORGB_SVE macro to perform widening multiplies, operating on the low byte of each 16-bit UV element instead of the full value and therefore eliminating the need for a zero-extension. Observed reductions in runtimes compared to the existing Neon code: \| UYVYToARGBRow \| YUY2ToARGBRow Cortex-A510 \| -30.2% \| -30.2% Cortex-A720 \| -4.8% \| -4.7% Cortex-X2 \| -9.6% \| -10.1% Bug: libyuv:973 Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:06:46 +00:00
George Steed	cd4113f4e8	[AArch64] Add SVE2 implementation of I400ToARGBRow This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we can pre-calculate the UV component results before the loop body. Unlike in the Neon version of the code we can make use of MOVPRFX and USQADD to avoid needing to apply the bias separately from the UV coefficient multiply additions. Reduction in runtime observed compared to the existing Neon code: Cortex-A510: -26.1% Cortex-A520: -5.9% Cortex-A715: -49.5% Cortex-A720: -49.4% Cortex-X2: -22.5% Cortex-X3: -23.5% Cortex-X4: -21.6% Bug: libyuv:973 Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-13 22:02:06 +00:00
George Steed	34abe98fe2	[AArch64] Add SVE2 implementations for NV{12,21}ToARGBRow We need a permute to duplicate the UV components, so we can share a common implementation for both NV12 and NV21 by varying the inputs to the INDEX instruction that generates the TBL indices. Observed reductions in runtimes compared to the existing Neon code: \| NV12ToARGBRow_SVE2 \| NV21ToARGBRow_SVE2 Cortex-A510 \| -29.1% \| -29.1% Cortex-A720 \| -4.8% \| -4.8% Cortex-X2 \| -9.2% \| -9.2% Bug: libyuv:973 Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-12 16:24:40 +00:00
George Steed	a758a15dbf	[AArch64] Add I8MM implementation of ARGBColorMatrixRow We cannot use the standard dot-product instructions since the matrix of coefficients is signed, but I8MM supports mixed-sign products which work well here. Reduction in runtimes observed compared to the previous Neon implementation: Cortex-A510: -50.8% Cortex-A520: -33.3% Cortex-A715: -38.6% Cortex-A720: -38.5% Cortex-X2: -43.2% Cortex-X3: -40.0% Cortex-X4: -55.0% Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-12 16:17:59 +00:00
George Steed	c8974cf8d4	[AArch64] Add SME feature detection on Linux This commit just adds the kCpuHasSME to represent that the CPU has the Arm Scalable Matrix Extension enabled, but this commit does not introduce any code to actually use it yet. Add a test to check that the HWCAP value is interpreted correctly. Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-08 23:34:22 +00:00
George Steed	96bbdb53ed	[AArch64] Add SVE2 implementation of I422ToRGBARow This is almost identical to the existing I422ToARGBRow_SVE2 kernel, we just need to interleave differently for the output. The RGBA format actually saves us an instruction compared to ARGB since there is no need to merge in the alpha component, we can just replace the odd elements of the alpha vector itself during the narrowing. Also rename some existing macros to make more sense when distinguishing between ARGB and RGBA. Reductions in runtime observed compared to the existing Neon code: Cortex-A510: -27.0% Cortex-A720: -5.3% Cortex-X2: -14.7% Bug: libyuv:973 Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	004352ba16	[AArch64] Add SVE2 implementations for AYUVTo{UV,VU}Row These kernels are mostly identical to each other except for the order of the results, so we can use a single macro to parameterize the pairwise addition and use the same macro for both implementations, just with the register order flipped. Similar to other 2x2 kernels the implementation here differs slightly for the last element if the problem size is odd, so use an "any" kernel to avoid needing to handle this in the common code path. Observed reduction in runtime compared to the existing Neon code: \| AYUVToUVRow \| AYUVToVURow Cortex-A510 \| -33.1% \| -33.0% Cortex-A720 \| -25.1% \| -25.1% Cortex-X2 \| -59.5% \| -53.9% Cortex-X4 \| -39.2% \| -39.4% Bug: libyuv:973 Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	d0da5a3298	[AArch64] Add SVE2 implementation of ARGB1555ToARGBRow Avoiding LD4 and unrolling gives a good perf improvement for the little core especially. Observed reduction in runtime relative to the existing Neon code: Cortex-A510: -69.7% Cortex-A720: -7.7% Cortex-X2: -41.9% Cortex-X4: -14.5% Bug: libyuv:973 Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-06-04 18:18:07 +00:00
George Steed	250e1e1ba3	[AArch64] Add SVE2 implementation of ARGBToRGB565DitherRow Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -21.7% Cortex-A720: -49.2% Cortex-X2: -62.6% Bug: libyuv:973 Change-Id: I2c7ae483c0b488a122bb3b80a745412ed44622df Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505539 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-06-03 23:15:04 +00:00
George Steed	6c70eb2819	[AArch64] Add Neon impls for I{210,410}ToAR30Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: I210ToAR30Row on Cortex-A55: -43.8% I210ToAR30Row on Cortex-A510: -27.0% I210ToAR30Row on Cortex-A76: -50.4% I410ToAR30Row on Cortex-A55: -44.3% I410ToAR30Row on Cortex-A510: -17.5% I410ToAR30Row on Cortex-A76: -57.2% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-06-03 22:46:12 +00:00
George Steed	bce3392830	[AArch64] Add SVE2 implementation of ARGBToRGB565Row Observed performance improvements compared to the existing Neon implementation: Cortex-A510: -27.1% Cortex-A720: -49.4% Cortex-X2: -67.9% Bug: libyuv:973 Change-Id: I321dc080a6e89301cd959c2ee18bc6680f749312 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505538 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-05-31 17:42:27 +00:00
George Steed	812b4955b2	[AArch64] Add Neon impls for I{210,410}ToARGBRow_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210ToARGBRow \| I410ToARGBRow Cortex-A55 \| -55.6% \| -56.2% Cortex-A510 \| -22.6% \| -35.6% Cortex-A76 \| -48.1% \| -57.2% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 17:40:48 +00:00
George Steed	5b4160b9c3	[AArch64] Add Neon impls for I{210,410}AlphaToARGBRow_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210AlphaToARGBRow \| I410AlphaToARGBRow Cortex-A55 \| -55.3% \| -56.1% Cortex-A510 \| -27.9% \| -42.6% Cortex-A76 \| -54.9% \| -60.3% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-31 08:41:31 +00:00
George Steed	4f7fd808b7	[AArch64] Use full vectors in TransposeWx{8 => 16}_NEON The existing Neon code only makes use of 64-bit vectors throughout which limits the performance on larger cores. To avoid this, swap the Neon code from a Wx8 implementation to a Wx16 implementation and process blocks of 16 full vectors at a time. The original code also handled widths that were not exact multiples of 16, however this should already be handled by the "any" kernel so it is removed. Finally, avoid duplicating the TransposeWx16_C fallback kernel definition in all architectures that need it, and just put it once in rotate_common.cc instead. Observed speedups for TransposePlane across a range of micro-architectures: Cortex-A53: -40.0% Cortex-A55: -20.7% Cortex-A57: -43.9% Cortex-A510: -43.5% Cortex-A520: -43.9% Cortex-A720: -31.1% Cortex-X2: -38.3% Cortex-X4: -43.6% Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:46:42 +00:00
George Steed	9fac9a4a82	[AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row There are existing x86 implementations for these kernels but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| ABGRToAR30Row \| ARGBToAR30Row Cortex-A55 \| -55.1% \| -55.1% Cortex-A510 \| -39.3% \| -40.1% Cortex-A76 \| -62.3% \| -63.6% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-21 07:35:07 +00:00
George Steed	ee830a5f77	[AArch64] Enable feature detection on Windows and Apple Silicon Using the platform-specific functions IsProcessorFeaturePresent and sysctlbyname to check individual features. Bug: libyuv:980 Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-03 18:42:51 +00:00
George Steed	6f1d8b1e11	[AArch64] Add SVE2 implementations for ARGBToUVRow and similar By maintaining the interleaved format of the data we can use a common kernel for all input channel orderings and simply pass a different vector of constants instead. A similar approach is possible with only Neon by making use of multiplies and repeated application of ADDP to combine channels, however this is slower on older cores like Cortex-A53 so is not pursued further. For odd problem sizes we need a slightly different implementation for the final element, so introduce an "any" kernel to address that rather than bloating the code for the common case. Observed effect on runtimes compared to the existing Neon kernels: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 ABGRToUVJRow \| -15.5% \| +5.4% \| -33.1% ABGRToUVRow \| -15.6% \| +5.3% \| -35.9% ARGBToUVJRow \| -10.1% \| +5.4% \| -32.7% ARGBToUVRow \| -10.1% \| +5.4% \| -29.3% BGRAToUVRow \| -15.5% \| +4.6% \| -32.8% RGBAToUVRow \| -10.1% \| +4.2% \| -36.0% Bug: libyuv:973 Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-05-01 19:46:43 +00:00
George Steed	67e5e79dbe	[AArch64] Add Neon implementation of HashDjb2 Reduction in runtime observed compared to the existing C code compiled with LLVM 18: Cortex-A55: -46.2% Cortex-A510: -60.4% Cortex-A76: -82.9% Cortex-A720: -87.4% Cortex-X1: -90.0% Cortex-X2: -91.7% Change-Id: I39a4479f78299508043a864e64fb40578c66ce19 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494094 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-05-01 19:37:31 +00:00
George Steed	f483007b9a	[AArch64] Add SVE implementation for I422AlphaToARGBRow This is mostly identical to the existing I422ToARGBRow_SVE implementation, we just need to make sure to load the alpha component rather than hard-coding it to 255. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -32.1% Cortex-A720: -5.1% Cortex-X2: -10.1% Bug: libyuv:973 Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-29 18:54:07 +00:00
George Steed	b53b27d6bf	[AArch64] Add SVE implementation for I444AlphaToARGBRow This is mostly identical to the existing I444ToARGBRow_SVE implementation, we just need to make sure to load the alpha component rather than hard-coding it to 255. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -34.2% Cortex-A720: -17.6% Cortex-X2: -9.6% Bug: libyuv:973 Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-29 18:54:02 +00:00
George Steed	6ac90403a1	[AArch64] Add SVE implementation for I422ToARGBRow We need a new macro for reading I422 data, but this is otherwise mostly identical to the existing I444ToARGBRow_SVE implementation. Reduction in runtimes observed compared to the existing Neon code: Cortex-A510: -25.0% Cortex-A720: -5.0% Cortex-X2: -10.8% Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-27 18:26:11 +00:00
George Steed	95eed2b75f	[AArch64] Add Neon dot-product implementation of HammingDistance We can use the Neon dot-product instructions as a slightly faster widening accumulation. This also has the advantage of widening to 32 bits so avoids the risk of overflow present in the original Neon code. Reduction in runtimes observed for HammingDistance compared to the existing Neon code: Cortex-A55: -4.4% Cortex-A510: -26.5% Cortex-A76: -8.1% Cortex-A720: -15.5% Cortex-X1: -4.1% Cortex-X2: -5.1% Bug: libyuv:977 Change-Id: I9e5e10d228c339d905cb2e668a9811ff0a6af5de Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490049 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-26 18:39:00 +00:00
George Steed	f5882ed1c5	[AArch64] getauxval(AT_HWCAP{,2}) feature detection, attempt #2 This re-lands commit ba0bba5b2b7e38c9365a5d152b4efa0458863213. Now with additional #ifdef __linux__ guards to avoid compiling Linux-specific code on non-Linux platforms. Non-linux feature detection will be added in a separate patch. Using getauxval(AT_HWCAP{,2}) has the advantage of also working under emulation where faking /proc/cpuinfo is not supported. For the Chromium sandbox, getauxval is supported since API version 18. The minimum supported API version at time of writing is 21 so we should be able to use getauxval unconditionally. On the off-chance the call fails it will return 0 and we will correctly fall-back to using only Neon. If we want to read the current CPU implementer or part number we could do this by checking HWCAP_CPUID and then reading MIDR_EL1. This will cause a kernel trap to emulate the EL1 read but should still be a lot faster than reading the whole of /proc/cpuinfo. Bug: libyuv:980 Change-Id: I8ae103ea7e32ef44db72f3c9896417bfe97ff5c5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465590 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-25 21:26:31 +00:00
George Steed	53b65220da	[AArch64] Add Neon dot-product implementation of SumSquareError The Neon dot-product instructions perform two widening steps rather than one, saving us the need to widen the absolute difference to 16-bits before accumulating. Additionally, the dot-product instructions tend to have better performance characteristics than traditional widening multiply instructions like SMLAL used in the existing SumSquareError_NEON code. Observed reduction in runtimes compared to the existing Neon kernel: Cortex-A55: -9.1% Cortex-A510: -36.7% Cortex-A76: -37.6% Cortex-A720: -48.8% Cortex-X1: -56.1% Cortex-X2: -42.6% Bug: libyuv:977 Change-Id: Ie20c69040cc47a803d8e95620d31e0bf1e1dac12 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463945 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-25 20:54:48 +00:00
Frank Barchard	9d660a0f3b	Fix environment variable LIBYUV_CPU_INFO for unittests - Also bump version number Bug: libyuv:979 Change-Id: I2903f15f9b9f3cd1b556eba95b01c4c58d1733b7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5466641 Reviewed-by: James Zern <jzern@google.com>	2024-04-20 17:41:56 +00:00
Frank Barchard	fe51553f5f	Revert "[AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection" This reverts commit ba0bba5b2b7e38c9365a5d152b4efa0458863213. Reason for revert: breaks builds on windows and mac Step _compile_ failed. Error logs are shown below: [1/104] CXX obj/libyuv_internal/cpu_id.o FAILED: obj/libyuv_internal/cpu_id.o ../../buildtools/reclient/rewrapper -cfg=../../buildtools/reclient_cfgs/chromium-browser-clang/rewra...(too long) ../../source/cpu_id.cc:25:10: fatal error: 'sys/auxv.h' file not found 25 \| #include <sys/auxv.h> // For getauxval() \| ^~~~~~~~~~~~ 1 error generated. More information in raw_io.output_text[failure_summary] Original change's description: > [AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection > > This has the advantage of also working under emulation where > faking /proc/cpuinfo is not supported. > > For the Chromium sandbox, getauxval is supported since API version 18. > The minimum supported API version at time of writing is 21 so we should > be able to use getauxval unconditionally. On the off-chance the call > fails it will return 0 and we will correctly fall-back to using only > Neon. > > Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7 > Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620 > Reviewed-by: Frank Barchard <fbarchard@chromium.org> Change-Id: Ic0f764217af7b4d998f19a8f78fc04ca85a45a3b No-Presubmit: true No-Tree-Checks: true No-Try: true Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463918 Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-19 06:52:22 +00:00
George Steed	ba0bba5b2b	[AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection This has the advantage of also working under emulation where faking /proc/cpuinfo is not supported. For the Chromium sandbox, getauxval is supported since API version 18. The minimum supported API version at time of writing is 21 so we should be able to use getauxval unconditionally. On the off-chance the call fails it will return 0 and we will correctly fall-back to using only Neon. Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-19 06:37:04 +00:00
George Steed	3af6cafe8d	[Arm] Don't expose DotProd kernels, fix CMakeLists.txt Don't define HAS_*_NEON_DOTPROD for 32-bit Arm platforms, since they are only defined in *_neon64.cc for now. Also define -DLIBYUV_NEON=1 and pass -mfpu=neon to *_neon.cc for 32-bit Arm platforms, since otherwise __ARM_NEON__ is not defined. Also fix a typo: ly_lib_static should be ly_lib_name in the name of the common object files. The existing code happens to work since they are defined to the same thing. Change-Id: Ibdc9e5d0391f7ff8db1ca83384e5bd45ac9950a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5439562 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-04-10 20:17:49 +00:00
George Steed	e52007eff9	[AArch64] Add SVE2 implementation for I444ToARGBRow Being able to use SVE2 functionality for these kernels has a number of performance wins compared to the existing Neon code: * For the Y component calculation we are able to use UMULH, versus the existing UMULL x2 + UZP2 sequence in Neon. * For the RGBTORGBA8 calculation we are able to take advantage of interleaving narrowing instructions, allowing us to use ST2 rather than ST4 for the store. This is a big performance win on some micro-architectures where ST4 is costly. * The use of predication means we do not need to add "any" kernels, we can simply rerun the calculation with a not-full predicate for the final iteration. To avoid the overhead of generating a predicate register on every iteration we duplicate the loop body and only generate a predicate on the final iteration of the loop. This costs a small amount on the final iteration but should still be significantly quicker than the overhead of a function call needed by the "any" cases. Duplicating the loop body to reduce the use of the WHILELT instruction improves little core performance by ~12% by itself but has negligible impact on other micro-architectures. Reduction in runtime for the new SVE2 implementation compared to the existing Neon implementation on selected micro-architectures: Cortex-A510: -36.5% Cortex-A720: -17.3% Cortex-X2: -11.3% Bug: libyuv:973 Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:11:01 +00:00
George Steed	f2e78e1304	[AArch64] Use Neon dot-product instructions in ARGBToYMatrixRow Using the dot-product instructions here allows us to avoid needing LD4 for loading individual colour channels, which gives a big benefit on some micro-architectures where such instructions perform significantly worse than LD1. In addition the dot-product instructions have higher throughput compared to the Neon Observed reduction in runtimes for selected kernels moving from _NEON to _NEON_DotProd: Kernel \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 \| Cortex-X2 ABGRToYJRow \| -6.5% \| -22.5% \| -43.5% \| -71.2% ABGRToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.3% ARGBToYJRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% ARGBToYRow \| -6.5% \| -22.5% \| -43.5% \| -68.1% BGRAToYRow \| -6.5% \| -22.5% \| -42.3% \| -68.4% RGBAToYJRow \| -6.5% \| -22.5% \| -42.2% \| -73.7% RGBAToYRow \| -6.5% \| -22.5% \| -42.3% \| -64.9% Bug: libyuv:977 Change-Id: If244190a7bdacf7e6e6b16af7e6853ee13ff6585 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424737 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-09 03:09:36 +00:00
George Steed	a038cda7b8	[AArch64] Enable detection of additional architecture features In particular there are a few extensions that are interesting for us: * FEAT_DotProd adds 4-way dot-product instructions which are useful in e.g. ARGBToY. * FEAT_I8MM adds additional mixed-sign dot-product instructions which could be useful in e.g. ARGBToUV. * FEAT_SVE and FEAT_SVE2 add support for the Scalable Vector Extension, which adds an array of new instructions including new widening loads and narrowing stores for dealing with mixed-width integer arithmetic efficiently and predication for avoiding the need for "any" cleanup loops. This commit simply adds support for detecting the presence of these features by extending the existing /proc/cpuinfo parsing, splitting it into separate Arm and AArch64 functions for simplicity. Since we have no space left in the bitset entries between Arm and X86 entries, we reuse some of the X86 entries for new AArch64 extensions. This doesn't seem obviously problematic as long as we avoid setting kCpuHasX86. Bug: libyuv:973 Bug: libyuv:977 Change-Id: I8e256225fe12a4ba5da24460f54061e16eab6c57 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5378150 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-04-05 17:48:22 +00:00
George Steed	549e798ac7	[AArch64] Remove declarations of P{210,410}To{ARGB,AR30}Row_NEON These declarations appear to exist in error since there is no corresponding implementation of these kernels and nothing calls them. So simply remove the declarations until we get around to adding an implementation. Change-Id: I0a9d72e7e13398b689e3e47ef101f672082c4795 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5387649 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-03-25 18:15:55 +00:00
Frank Barchard	b66c42d4a8	Revert "AMX detect OS support for linux kernel" This reverts commit 8c8a33762d64b916ae8469cc3fc602a64080a23a. Reason for revert: breaks sandbox Original change's description: > AMX detect OS support for linux kernel > > Bug: b/327013106 > Change-Id: Ie1784249f3a121c52e6504ff502bdc3eb245d858 > Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5329907 > Commit-Queue: Frank Barchard <fbarchard@chromium.org> > Reviewed-by: richard winterton <rrwinterton@gmail.com> Bug: b/327013106 Change-Id: If54bb84bc1167177c1869763f6ccfdf1f92fbe09 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332617 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-02-29 00:33:29 +00:00
Frank Barchard	8c8a33762d	AMX detect OS support for linux kernel Bug: b/327013106 Change-Id: Ie1784249f3a121c52e6504ff502bdc3eb245d858 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5329907 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-02-28 03:13:44 +00:00
Frank Barchard	a6a2ec654b	Add AMXINT8 cpu detect sde -spr -- libyuv_test -- --gunit_filter=Cpu Note: Google Test filter = Cpu [==========] Running 4 tests from 2 test suites. [----------] Global test environment set-up. [----------] 3 tests from LibYUVBaseTest [ RUN ] LibYUVBaseTest.TestCpuHas Cpu Flags 0x57fff9 Has X86 0x8 Has SSE2 0x10 Has SSSE3 0x20 Has SSE41 0x40 Has SSE42 0x80 Has AVX 0x100 Has AVX2 0x200 Has ERMS 0x400 Has FMA3 0x800 Has F16C 0x1000 Has AVX512BW 0x2000 Has AVX512VL 0x4000 Has AVX512VNNI 0x8000 Has AVX512VBMI 0x10000 Has AVX512VBMI2 0x20000 Has AVX512VBITALG 0x40000 Has AVX10 0x0 HAS AVXVNNI 0x100000 Has AVXVNNIINT8 0x0 Has AMXINT8 0x400000 [ OK ] LibYUVBaseTest.TestCpuHas (34 ms) Bug: b/324356616 Change-Id: I5129b8946363a501bdd570e6dba3936c54aacd6c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5283433 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-02-15 21:44:47 +00:00
Frank Barchard	3e435fe6d4	YUY2ToARGB use ymm6/7 for shuffle constants - 1 load and 2 shuffles from registers replaces 2 loads and 2 memory shuffles - vbroadcastf128 16 byte shuffler replaces 32 byte shufflers - bump version and apply clang-format libyuv_test '--gunit_filter=*.???2ToARGB_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 AMD Zen2 I422ToARGB_Opt (272 ms) NV12ToARGB_Opt (255 ms) YUY2ToARGB_Opt (208 ms) Was YUY2ToARGB_Opt (214 ms) Change-Id: I1fa4d462d04536c877d1cab1a14586be8ed1b2f2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5218447 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-01-22 21:47:23 +00:00
Frank Barchard	914624f0b8	YUY2ToARGBMatrix and UYVYToARGBMatrix added to allow any color matrix Bug: libyuv:971 Change-Id: If15d4598d75500a3717f07d02c0c295fdc58254e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5214453 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-01-19 21:21:37 +00:00
Frank Barchard	5625f42424	I444ToI420 and I422ToI420 check U and V pointers and return -1 if NULL. - Add detect linux kernel version number in util/cpuid adbrun -- blaze-bin/third_party/libyuv/cpuid Kernel Version 4.14 Cpu Flags 0x7 Has ARM 0x2 Bug: libyuv:970 Change-Id: I655ed598db3655ca8448be08f1d71fbc328ced66 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5207990 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-01-18 21:56:11 +00:00
Frank Barchard	af6ac8265b	AVX10 cpuid detect added Replace unused popcount feature bit Bug: libyuv:911 Change-Id: Icd88fcc732751d39b0950d5f09a58bc9ac2c4e30 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5179911 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-01-10 00:08:22 +00:00
Frank Barchard	6dc03dacbf	Split scale_test and scale_plane_test to allow building on small devices Bug: libyuv:956 Change-Id: I1903aa616243e891440ed92836dfb0992d31d4cd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5107257 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2023-12-09 18:39:41 +00:00
Frank Barchard	9e61d7f9c1	Split convert_test and convert_argb_test to allow building on small systems that run out of memory compiling unittests. Update build files to include the new tests and source code. Bug: libyuv:956 Change-Id: I6ec0beb6dc9570f0597d7df1835d616489dbaece Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103585 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-12-08 13:39:56 +00:00
Bruce Lai	1dcbc30553	Add HAS_SCALEARGBROWDOWNEVEN_RVV macro and disable it by default HAS_SCALEARGBROWDOWNEVEN_RVV wasn't defined, so we cannot use ScaleARGBRowDownEven_RVV & ScaleARGBRowDownEvenBox_RVV. - Separate into two conditional statements when selecting DownEven or DownEvenBox. - Also, add HAS_SCALEARGBROWDOWNEVEN_RVV and disable it by default. Bug: libyuv:965 Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Change-Id: Ic7ec40520b64131a456c6f3eea0639b3620f11ae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4882441 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-12-07 22:54:23 +00:00
Wan-Teh Chang	fb6341d326	Change ScalePlane,ScalePlane_16,... to return int Change ScalePlane(), ScalePlane_16(), and ScalePlane_12() to return int so that they can report memory allocation failures (by returning 1). BUG=libyuv:968 Change-Id: Ie5c183ee42e3d595302671f9ecb7b3472dc8fdb5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5005031 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-11-03 23:53:24 +00:00
Frank Barchard	31e1d6f896	Check allocations that return NULL and return early BUG=libyuv:968 Change-Id: I9e8594440a6035958511f9c50072820131331fc8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4977552 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-10-27 17:41:36 +00:00
Frank Barchard	331c361581	AVX-VNNI detect - Add kCpuHasAVXVNNI flag - Remove deprecated GFNI detect to make space. https://bugs.chromium.org/p/libyuv/issues/detail?id=967 Meteor Lake has AVX-VNNI but not AVX512 ~/intelsde/sde -mtl -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=CpuHas doyuv3 Note: Google Test filter = CpuHas [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from LibYUVBaseTest [ RUN ] LibYUVBaseTest.TestCpuHas Cpu Flags 0x203ff1 Has X86 0x10 Has SSE2 0x20 Has SSSE3 0x40 Has SSE41 0x80 Has SSE42 0x100 Has AVX 0x200 Has AVX2 0x400 Has ERMS 0x800 Has FMA3 0x1000 Has F16C 0x2000 Has AVX512BW 0x0 Has AVX512VL 0x0 Has AVX512VNNI 0x0 Has AVX512VBMI 0x0 Has AVX512VBMI2 0x0 Has AVX512VBITALG 0x0 Has AVX512VPOPCNTDQ 0x0 HAS AVXVNNI 0x200000 Has AVXVNNIINT8 0x0 Running on all cpus the following report avx-vnni grep 'AVXVNNI 0x2' / adl/libyuv64.txt:HAS AVXVNNI 0x200000 gnr/libyuv64.txt:HAS AVXVNNI 0x200000 grr/libyuv64.txt:HAS AVXVNNI 0x200000 mtl/libyuv64.txt:HAS AVXVNNI 0x200000 rpl/libyuv64.txt:HAS AVXVNNI 0x200000 spr/libyuv64.txt:HAS AVXVNNI 0x200000 srf/libyuv64.txt:HAS AVXVNNI 0x200000 while these support avx512 vnni grep 'VNNI 0x1' / clx/libyuv64.txt:Has AVX512VNNI 0x10000 cpx/libyuv64.txt:Has AVX512VNNI 0x10000 gnr/libyuv64.txt:Has AVX512VNNI 0x10000 icl/libyuv64.txt:Has AVX512VNNI 0x10000 icx/libyuv64.txt:Has AVX512VNNI 0x10000 spr/libyuv64.txt:Has AVX512VNNI 0x10000 tgl/libyuv64.txt:Has AVX512VNNI 0x10000 and these support avx-vnni-int8 grep AVXVNNIINT8.0x4 / grr/libyuv64.txt:Has AVXVNNIINT8 0x400000 srf/libyuv64.txt:Has AVXVNNIINT8 0x400000 Bug: libyuv:967 Change-Id: I84cd71d1b320e7c284173eb695fc1d3b72d14ddb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912017 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2023-10-05 21:24:09 +00:00
Frank Barchard	709d60e6ee	VNNI-INT8 detect - Add kCpuHasAVXVNNIINT8 flag - Move mips flags up a bit to make space. ~/intelsde/sde -srf -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=CpuHas Note: Google Test filter = CpuHas [==========] Running 1 test from 1 test suite. [----------] Global test environment set-up. [----------] 1 test from LibYUVBaseTest [ RUN ] LibYUVBaseTest.TestCpuHas Cpu Flags 0x403ff1 Has X86 0x10 Has SSE2 0x20 Has SSSE3 0x40 Has SSE41 0x80 Has SSE42 0x100 Has AVX 0x200 Has AVX2 0x400 Has ERMS 0x800 Has FMA3 0x1000 Has F16C 0x2000 Has AVX512BW 0x0 Has AVX512VL 0x0 Has AVX512VNNI 0x0 Has AVX512VBMI 0x0 Has AVX512VBMI2 0x0 Has AVX512VBITALG 0x0 Has AVX512VPOPCNTDQ 0x0 Has AVXVNNIINT8 0x400000 Has GFNI 0x0 [ OK ] LibYUVBaseTest.TestCpuHas (32 ms) INT8 supported on srf and grr -srf Set chip-check and CPUID for Intel(R) Sierra Forest CPU -grr Set chip-check and CPUID for Intel(R) Grand Ridge CPU Bug: b/303434603 Change-Id: I628007929ff0518b2b36e1469b4d9aed71a9fa8f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912015 Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2023-10-04 16:31:36 +00:00
Bruce Lai	ec2e9ca000	[RVV] Support AR64ToAB64 and RGBA-family color conversions Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB. They were originally implemented by ARGBShuffle. This CL implements them independently and enables them only for RISC-V for now. This CL also adds RVV implementations for `RGBA-family <-> RGBA-family` color conversions. * Run on SiFive internal FPGA(VLEN=128): Test Case Speedup AR64ToAB64_Opt x4.6 ARGBToRGBA_Opt x6 ARGBToBGRA_Opt x6 ARGBToABGR_Opt x6 RGBAToARGB_Opt x6 Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-09-05 22:44:48 +00:00
Frank Barchard	f0921806a2	Disable NEON if memory sanitizer is enabled - MSAN fails on most inline assembly, unaware of what the load and store instructions do. - MSAN is also failing on row_any functions, which memcpy a correct number of pixels into a buffer that is SIMD vector sized, apply SIMD to the full vector, and then memcpy the exact number of resulting pixels to the output buffer. MSAN wants the temporary buffer to be initialized, which is generally done with a memset(buf, 0, sizeof(buf)); to satisfy MSAN. - RVV may not require disabling MSAN, since row functions are all 'any' number of elements, and implementation is intrinsics. Bug: b/297979878 Change-Id: Ic21200689c0c7d2c85bb1de3eef38570137d3d8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4832740 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2023-08-31 18:07:42 +00:00
Frank Barchard	696e619571	RVV check __riscv_v_intrinsic version Bug: libyuv:965 Change-Id: I9b02abd13ab3345288655fa7a16383f59cf66bb8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4750230 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2023-08-04 18:39:27 +00:00
Bruce Lai	c60ac4025c	[RVV] Enable ScaleRowDown38_RVV & ScaleRowDown38_{2,3}_Box_RVV * Run on SiFive internal FPGA: Test Case Speedup I420ScaleDownBy3by8_None 4.2 I420ScaleDownBy3by8_Linear 1.7 I420ScaleDownBy3by8_Bilinear 1.7 I420ScaleDownBy3by8_Box 1.7 I444ScaleDownBy3by8_None 4.2 I444ScaleDownBy3by8_Linear 1.8 I444ScaleDownBy3by8_Bilinear 1.8 I444ScaleDownBy3by8_Box 1.8 Change-Id: Ic2e98de2494d9e7b25f5db115a7f21c618eaefed Signed-off-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4711857 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2023-07-27 02:59:47 +00:00

1763 Commits