libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-02-08 02:36:43 +08:00

Author	SHA1	Message	Date
Frank Barchard	595146434a	HalfFloat fix SigIll on aarch64 - Remove special case Scale of 1 which used fp16 cvt but requires cpuid - Port aarch64 to aarch32 - Use C for aarch32 with small (denormal) scale value Bug: 377693555 Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-22 22:08:00 +00:00
Frank Barchard	307b951229	Add CopyPlane_Unaligned, _Any and _Invert tests/benchmarks CpuId test - Add AMD_ERMSB detect for ERMS on AMD Bug: 379457420 Change-Id: I608568556024faf19abe4d0662aeeee553a0a349 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6032852 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-19 23:53:05 +00:00
Frank Barchard	1c501a8f3f	CpuId test FSMR - Fast Short Rep Movsb - Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features Intel CPUs starting at Icelake support FSMR adl:Has FSMR 0x8000 arl:Has FSMR 0x0 bdw:Has FSMR 0x0 clx:Has FSMR 0x0 cnl:Has FSMR 0x0 cpx:Has FSMR 0x0 emr:Has FSMR 0x8000 glm:Has FSMR 0x0 glp:Has FSMR 0x0 gnr:Has FSMR 0x8000 gnr256:Has FSMR 0x8000 hsw:Has FSMR 0x0 icl:Has FSMR 0x8000 icx:Has FSMR 0x8000 ivb:Has FSMR 0x0 knl:Has FSMR 0x0 knm:Has FSMR 0x0 lnl:Has FSMR 0x8000 mrm:Has FSMR 0x0 mtl:Has FSMR 0x8000 nhm:Has FSMR 0x0 pnr:Has FSMR 0x0 rpl:Has FSMR 0x8000 skl:Has FSMR 0x0 skx:Has FSMR 0x0 slm:Has FSMR 0x0 slt:Has FSMR 0x0 snb:Has FSMR 0x0 snr:Has FSMR 0x0 spr:Has FSMR 0x8000 srf:Has FSMR 0x0 tgl:Has FSMR 0x8000 tnt:Has FSMR 0x0 wsm:Has FSMR 0x0 Intel CPUs starting at Ivybridge support ERMS adl:Has ERMS 0x4000 arl:Has ERMS 0x4000 bdw:Has ERMS 0x4000 clx:Has ERMS 0x4000 cnl:Has ERMS 0x4000 cpx:Has ERMS 0x4000 emr:Has ERMS 0x4000 glm:Has ERMS 0x4000 glp:Has ERMS 0x4000 gnr:Has ERMS 0x4000 gnr256:Has ERMS 0x4000 hsw:Has ERMS 0x4000 icl:Has ERMS 0x4000 icx:Has ERMS 0x4000 ivb:Has ERMS 0x4000 knl:Has ERMS 0x4000 knm:Has ERMS 0x4000 lnl:Has ERMS 0x4000 mrm:Has ERMS 0x0 mtl:Has ERMS 0x4000 nhm:Has ERMS 0x0 pnr:Has ERMS 0x0 rpl:Has ERMS 0x4000 skl:Has ERMS 0x4000 skx:Has ERMS 0x4000 slm:Has ERMS 0x4000 slt:Has ERMS 0x0 snb:Has ERMS 0x0 snr:Has ERMS 0x4000 spr:Has ERMS 0x4000 srf:Has ERMS 0x4000 tgl:Has ERMS 0x4000 tnt:Has ERMS 0x4000 wsm:Has ERMS 0x0 Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-18 17:56:45 +00:00
Frank Barchard	75f7cfdde5	SplitRGB for SSE4 and AVX2 libyuv_test '--gunit_filter=SplitRGB' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Note: Google Test filter = SplitRGB Skylake Xeon x86 32 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms) Skylake Xeon x86 64 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms) Bug: 379186682 Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-15 00:46:25 +00:00
George Steed	823d960afc	[AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row Observed reductions in runtime compared to the existing Neon code: \| P210ToAR30Row \| P410ToAR30Row Cortex-A510 \| -16.5% \| -21.2% Cortex-A520 \| (!) +2.7% \| -8.7% Cortex-A715 \| -6.1% \| -6.1% Cortex-A720 \| -6.2% \| -5.9% Cortex-X2 \| -4.1% \| -4.2% Cortex-X3 \| -4.2% \| -4.2% Cortex-X4 \| -1.2% \| -1.2% Cortex-X925 \| -3.6% \| -2.8% Bug: b/42280942 Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:52:21 +00:00
George Steed	0ddf3f7b90	[AArch64] Add SVE2 implementation of I210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -13.9% Cortex-X2: -11.9% Cortex-X3: -11.6% Cortex-X4: -9.5% Cortex-X925: -11.5% Bug: b/42280942 Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:36:41 +00:00
Frank Barchard	74bd6d93c6	Use grep extended regex for version - Uses grep extended regex to extract version information rather than perl regex, which isn't supported on macOS Co-authored-by: trevormcguire@google.com Bug: 277348774 Change-Id: Ifa37207ae360350f0a96c1248bf6407005c00096 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6011548 Reviewed-by: Ben Weiss <bweiss@google.com>	2024-11-13 02:11:17 +00:00
George Steed	5b906a0ec8	[AArch64] Add SVE2 implementation of P410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.7% Cortex-A520: -2.4% Cortex-A715: -18.7% Cortex-A720: -18.8% Cortex-X2: -7.7% Cortex-X3: -8.9% Cortex-X4: +1.0% (!) Cortex-X925: -8.3% Bug: b/42280942 Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:34:56 +00:00
George Steed	b753822d47	[AArch64] Add SVE2 implementation of P210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -32.8% Cortex-A520: +8.7% (!) Cortex-A715: -18.9% Cortex-A720: -18.9% Cortex-X2: -7.9% Cortex-X3: -8.8% Cortex-X4: +1.0% (!) Cortex-X925: -8.6% Bug: b/42280942 Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-12 18:32:55 +00:00
George Steed	721ad4aa18	[AArch64] Add SME implementation of ScaleUVRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:30:30 +00:00
George Steed	576218dbce	[AArch64] Add SME implementation of ScaleUVRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:28:57 +00:00
George Steed	551cee7845	[AArch64] Add SME implementation of ScaleUVRowDown2 There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:24:28 +00:00
George Steed	de6b47370f	CMakeLists.txt: Fix typo: OLD_CMAKE_{REQURED => REQUIRED}_FLAGS Change-Id: Ib09316dfda4182a860d2f1db985b15ebeabba5ba Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6012824 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:22:45 +00:00
George Steed	5c12e0b2de	[AArch64] Add SVE2 implementations of HalfFloat{,1}Row For HalfFloat1Row, SVE has direct 16-bit integer to half-float conversion instructions so there is no need to widen to 32-bits. For HalfFloatRow, SVE zero-extending loads avoid the need for separate UXTL(2) instructions. Observed reductions in runtime compared to the existing Neon code: \| HalfFloat1Row \| HalfFloatRow Cortex-A510 \| -38.3% \| -17.3% Cortex-A520 \| -37.6% \| -18.8% Cortex-A720 \| -50.1% \| -7.8% Cortex-X2 \| -50.2% \| -0.4% Cortex-X4 \| -51.5% \| -12.5% Bug: b/42280942 Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:53:00 +00:00
George Steed	7d383c2f1a	[AArch64] Add comments to ScaleRowDown38_{2,3}_Box_NEON impls Add a few comments to help illustrate the permute operations. As requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803 Change-Id: I8596ef63af5fae4dba1e6fdb548742ba7e191ab9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975317 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:47:12 +00:00
George Steed	f27b983f38	[AArch64] Add SVE2 implementation of DivideRow_16 SVE contains the UMULH instruction which allows us to multiply and take the high half of the result in a single instruction rather than needing separate widening multiply and then narrowing shift steps. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -21.2% Cortex-A520: -20.9% Cortex-A715: -47.9% Cortex-A720: -47.6% Cortex-X2: -5.2% Cortex-X3: -2.6% Cortex-X4: -32.4% Cortex-X925: -1.5% Bug: b/42280942 Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:46:02 +00:00
George Steed	aec4b4e22e	[AArch64] Add SME implementation of ScaleRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:42:21 +00:00
George Steed	b0f72309c6	Remove duplicate kernel assignment from scale_uv.cc The assignment of ScaleUVRowDown2Box_NEON is already done in the block immediately below this one, so just remove this code. Change-Id: I83c0f18dbe66e908cd4fbce73e20e96a137860cf Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979723 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-01 15:42:21 +00:00
George Steed	f00c43f4d6	[AArch64] Unroll HalfFloat{,1}Row_NEON The existing C implementation compiled with a recent LLVM is auto-vectorised and unrolled to process four vectors per loop iteration, making the Neon implementation slower than the C implementation on little cores. To avoid this, unroll the Neon implementation to also process four vectors per iteration. Reduction in cycle counts observed compared to the existing Neon implementation: \| HalfFloat1Row_NEON \| HalfFloatRow_NEON Cortex-A510 \| -37.1% \| -40.8% Cortex-A520 \| -32.3% \| -37.4% Cortex-A720 \| 0.0% \| -10.6% Cortex-X2 \| 0.0% \| -7.8% Cortex-X4 \| +0.3% \| -6.9% Bug: b/42280945 Change-Id: I12b474c970fc4355d75ed924c4ca6169badda2bc Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872805 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-30 17:58:29 +00:00
George Steed	51d07554a0	[AArch64] Add SME implementation of ScaleRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-30 17:57:15 +00:00
George Steed	593965cea2	[AArch64] Add SME implementation of ScaleRowDown2 Including associated changes for adding a new scale_sme.cc file. There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-30 17:56:41 +00:00
George Steed	237f39cb8c	[AArch64] Add SME implementation of I444ToARGBRow This is based on an unrolled version of the existing SVE2 code. The implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-29 18:10:23 +00:00
George Steed	22c5c18778	[AArch64] Add SME implementation of I422ToARGBRow Including addition of a new row_sme.cc file and associated infrastructure. The actual implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-29 05:49:28 +00:00
George Steed	775fd92e59	[AArch64] Optimize ScaleRowDown38_3_Box_NEON Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to be slow on some micro-architectures, and remove other unnecessary permutes. Reduction in run times: Cortex-A55: -24.8% Cortex-A510: -32.7% Cortex-A520: -37.7% Cortex-A76: -51.8% Cortex-A715: -58.9% Cortex-A720: -58.9% Cortex-X1: -54.8% Cortex-X2: -50.3% Cortex-X3: -57.1% Cortex-X4: -49.8% Cortex-X925: -52.0% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: b/42280945 Change-Id: Ie96bac30fffbe41f8d1501ee289795830ab127e5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872803 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-28 17:04:22 +00:00
George Steed	0bce5120f6	[AArch64] Optimize ScaleRowDown38_2_Box_NEON Replace LD4 and TRN instructions with LD1s and TBL since LD4 is known to be slow on some micro-architectures, and remove other unnecessary permutes. Reduction in run times: Cortex-A55: -17.9% Cortex-A510: -28.7% Cortex-A520: -31.8% Cortex-A76: -40.8% Cortex-A715: -46.1% Cortex-A720: -46.1% Cortex-X1: -44.3% Cortex-X2: -40.1% Cortex-X3: -46.3% Cortex-X4: -40.2% Cortex-X925: -42.3% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: b/42280945 Change-Id: I84e2cd04912fc11d59b4407a1836f047b74a4c92 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872802 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-28 17:03:54 +00:00
George Steed	22ac86800e	[AArch64] Add SVE2 implementation of I422ToARGB4444Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -35.5% Cortex-A520: -38.2% Cortex-A715: -19.8% Cortex-A720: -19.8% Cortex-X2: -24.2% Cortex-X3: -24.1% Cortex-X4: -21.6% Cortex-X925: -19.5% Bug: b/42280942 Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f4eaeca22a	[AArch64] Add SVE2 implementation of I422ToARGB1555Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.8% Cortex-A520: -42.6% Cortex-A715: -22.5% Cortex-A720: -22.6% Cortex-X2: -22.7% Cortex-X3: -22.4% Cortex-X4: -19.4% Cortex-X925: -27.0% Bug: b/42280942 Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f40042533c	[AArch64] Add SVE2 implementation of I422ToRGB565Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.1% Cortex-A520: -38.2% Cortex-A715: -21.5% Cortex-A720: -21.6% Cortex-X2: -21.6% Cortex-X3: -22.0% Cortex-X4: -23.5% Cortex-X925: -21.7% Bug: b/42280942 Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	4621b0cc7f	[AArch64] Rework data loading in ScaleFilterCols_NEON Lane-indexed LD2 instructions are slow and introduce an unnecessary dependency on the previous iteration of the loop. To avoid this dependency use a scalar load for the first iteration and lane-indexed LD1 for the remainder, then TRN1 and TRN2 to split out the even and odd elements. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -6.7% Cortex-A510: -13.2% Cortex-A520: -13.1% Cortex-A76: -54.5% Cortex-A715: -60.3% Cortex-A720: -61.0% Cortex-X1: -69.1% Cortex-X2: -68.6% Cortex-X3: -73.9% Cortex-X4: -73.8% Cortex-X925: -69.0% Bug: b/42280945 Change-Id: I1c4adfb82a43bdcf2dd4cc212088fc21a5812244 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872804 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:25:23 +00:00
George Steed	faade2f73f	[AArch64] Avoid partial vector stores in ScaleRowDown38_NEON The existing code performs a pair of stores since there is no AArch64 instruction in Neon to store exactly 12 bytes from a vector register. It is guaranteed to be safe to write full vectors until the last iteration of the loop, since the extra four bytes will be over-written by subsequent iterations. This allows us to avoid duplicating the store instruction and address arithmetic. Reduction in runtime observed relative to the existing Neon implementation: Cortex-A55: +2.0% Cortex-A510: -25.3% Cortex-A520: -15.1% Cortex-A76: -32.2% Cortex-A715: -19.7% Cortex-A720: -19.6% Cortex-X1: -31.6% Cortex-X2: -27.1% Cortex-X3: -25.9% Cortex-X4: -24.7% Cortex-X925: -35.8% Bug: b/42280945 Change-Id: I222ed662f169d82f5f472bebb1bcfe6d428ccae2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872843 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 20:52:08 +00:00
George Steed	0dce974ca0	[AArch64] Add SVE2 implementation of I422ToRGB24Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -57.8% Cortex-A520: -41.7% Cortex-A715: -28.0% Cortex-A720: -28.1% Cortex-X2: -29.7% Cortex-X3: -28.7% Cortex-X4: -30.5% Cortex-X925: -30.3% Bug: b/42280942 Change-Id: I328bd16babda75fb089c8da8f2714465f658187e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-24 02:17:32 +00:00
Wan-Teh Chang	6ac7c8f251	Revert "Do not enable libyuv_use_sme for is_android" This reverts commit 51e2e12b9b59452b1ad16c33a88bbcdd085b5450. Reason for revert: The llvm bug fix https://github.com/llvm/llvm-project/pull/102979 has been rolled into Chrome in https://chromium-review.googlesource.com/5921462. Original change's description: > Do not enable libyuv_use_sme for is_android > > Revert the changes to libyuv.gni in commit dfa279f. > > The linker error "undefined symbol: __getauxval" referenced by > sme-abi-init.c:26 on Android, previously reported in > https://libyuv.g-issues.chromium.org/issues/359006069#comment2, has not > been fixed yet. See > https://chromium-review.googlesource.com/c/chromium/src/+/5918245?tab=checks. > > Change-Id: I94bd243e2863b9c316909f63f757fd95ec55dc18 > Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917455 > Reviewed-by: Frank Barchard <fbarchard@chromium.org> Bug: 359006069 Change-Id: Ic801c1bcb65894fdfe718ba6454669c8623a2e15 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5935026 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com> Reviewed-by: George Steed <george.steed@arm.com>	2024-10-15 18:20:36 +00:00
Wan-Teh Chang	a8e59d2074	Fix the test case The test case should have the dst width and height, and the src width and height should be specified by the --libyuv_width and --libyuv_height options to libyuv_unittest. Tested: libyuv_unittest --gtest_filter=LibYUVScaleTest.I420ScaleTo264x216_Box \ --libyuv_width=352 --libyuv_height=288 Bug: b/369963535, b/366045177 Change-Id: I8166a264c9c4840e0d16c0d3c1818c18aebc1b2e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5896466 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-09 08:26:10 +00:00
Wan-Teh Chang	51e2e12b9b	Do not enable libyuv_use_sme for is_android Revert the changes to libyuv.gni in commit dfa279f. The linker error "undefined symbol: __getauxval" referenced by sme-abi-init.c:26 on Android, previously reported in https://libyuv.g-issues.chromium.org/issues/359006069#comment2, has not been fixed yet. See https://chromium-review.googlesource.com/c/chromium/src/+/5918245?tab=checks. Change-Id: I94bd243e2863b9c316909f63f757fd95ec55dc18 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917455 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-09 08:24:00 +00:00
Frank Barchard	7633328b5f	Make functions that malloc check for ubsan math overflow - add support for negative heights - sanity check null pointers and invalid width/height Bug: b/371615496 Change-Id: Icbefcb1ccc5cdf90e417c73440c6fad3b63ed7df Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5917072 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-10-08 21:08:34 +00:00
Wan-Teh Chang	364b7fa81b	Remove redundant unsigned integer overflow tests Bug: b/371615496 Change-Id: I28df888942085138a54e18c7e939300d959c68b0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5914872 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-08 01:14:35 +00:00
Frank Barchard	ffd791f749	Check malloc allocation sizes are less than SIZE_MAX Bug: b/371615496 Change-Id: I75a94b08469d6d6b6fd55a8659031cbcb3d48eed Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5912039 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-10-07 21:34:15 +00:00
George Steed	dfa279fc65	Re-enable SME when building for AArch64 Android Now that SME has been re-enabled for Linux for a while, also re-enable it for Android when building with a sufficiently new version of LLVM. Bug: b/359006069 Change-Id: Ibaa47e31826cf20136a11d551621fd62c1abab3c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5908389 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-10-04 17:43:26 +00:00
Wan-Teh Chang	77f3acade4	ScalePlaneDown34: test dst_width%24 == 0 for armv7 In ScalePlaneDown34(), check if dst_width % 24 == 0 for armv7, and check if dst_width % 48 == 0 for aarch64. No-Try: True Bug: b/369963535, b/366045177 Change-Id: I7dc1227517c83c97a1d1052ef2230d5cec41da10 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5896492 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2024-09-27 23:00:19 +00:00
Frank Barchard	61bf0b61f7	Fix for ARGB scaling down by 4x horizontally but not vertically Add test ARGBScaleTo50x1_Box libyuv_test '--gunit_filter=ARGBScaleTo50x1' --libyuv_width=200 --libyuv_height=50 Bug: chromium:361611480 Change-Id: Ic984951d74eb0c377c6746f61e91593a8a7d1a66 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5884656 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-09-24 18:00:47 +00:00
George Steed	02c6e8baca	Change ARGBMultiplyRow_C to match Neon The existing behaviour does not round correctly in all cases, so adjust it to match the existing Neon implementation. Update the tests to require bit-exactness and disable other implementations that do not round correctly. Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-23 21:48:33 +00:00
George Steed	a37e6bc81b	[AArch64] Re-enable SME only for Linux and new versions of Clang This was previously disabled in 679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for Linux where SME is known to work correctly. Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-23 09:29:53 +00:00
George Steed	8315fa1d3a	Avoid duplication of CPU feature disable macros The same conditions are repeated across all *_row.h headers which makes it harder than necessary to guard enabling new architecture features depending on compiler versions etc. Avoid this duplication by merging the conditions into a new cpu_support.h header. Change-Id: Ibe7dfcef138edca6cc36870f1cfbb1bb108083e3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802957 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-23 09:28:24 +00:00
Wan-Teh Chang	85e55115f0	Untangle arm and aarch64 #ifdefs in GetCpuFlags() Change-Id: I5df39c20a700aee38954bc9288fdee116138645d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5879350 Reviewed-by: George Steed <george.steed@arm.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-20 23:40:19 +00:00
Alex Richardson	f1b28b3510	Avoid reading /proc/cpuinfo for non-Linux Arm platforms While we will return kCpuHasNEON if the file fails to open, this does unnecessarily introduce filesystem operations which are not needed e.g. on embedded non-Linux platforms. When not building for Linux, we can simply rely on the compiler flags to determine whether NEON support is present for Arm32. Change-Id: Ifb0eab2a46969fca5f733ce624abdf54da9b32a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778479 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: George Steed <george.steed@arm.com>	2024-09-20 22:22:03 +00:00
George Steed	0d5a31eccb	Update README.md and environment_variables.md for Arm Now that there are newer architecture extensions used, update the documentation to reflect this. Also add missing empty lines after headers in environment_variables.md to ensure the file is valid markdown. Change-Id: I61d5616e1f815f80186440f27dd68ac5460c38b1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5868021 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-20 00:29:33 +00:00
George Steed	7eb552c891	[AArch64] Avoid unnecessary MOVs in ScaleARGBRowDownEvenBox_NEON The existing code uses three MOV instructions through a temporary register to swap the low and high halves of a vector register, however this can be done with a pair of ZIP instructions instead. Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in parallel on little cores. Reduction in runtime observed compared to the existing Neon implementation: Cortex-A55: -8.3% Cortex-A510: -20.6% Cortex-A520: -16.6% Cortex-A76: -6.8% Cortex-A715: -6.2% Cortex-A720: -6.2% Cortex-X1: -22.0% Cortex-X2: -18.7% Cortex-X3: -21.1% Cortex-X4: -25.8% Cortex-X925: -21.9% Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-20 00:28:12 +00:00
George Steed	23a6a412e5	[AArch64] Unroll and use TBL in ScaleRowDown34_NEON ST3 is known to be slow on a number of modern micro-architectures. By unrolling the code we are able to use TBL to shuffle elements into the correct indices without needing to use LD4 and ST3, giving a good improvement in performance across the board. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -14.4% Cortex-A510: -66.0% Cortex-A520: -50.8% Cortex-A76: -60.5% Cortex-A715: -63.9% Cortex-A720: -64.2% Cortex-X1: -74.3% Cortex-X2: -75.4% Cortex-X3: -75.5% Cortex-X4: -48.1% Bug: b/42280945 Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 15:37:27 +00:00
George Steed	d5303f4f77	[AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors Processing more data per loop iteration means that we can use the full 128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN + XTN2 in a single instruction. The early Cortex-X cores are not a fan of ST4 .16b with a post-increment, so split out the pointer increment to a separate instruction to avoid this bottleneck. Reductions in runtime observed for ARGB1555ToARGBRow_NEON: Cortex-A55: -18.1% Cortex-A510: -11.2% Cortex-A520: -39.5% Cortex-A76: -18.0% Cortex-A715: -34.8% Cortex-A720: -34.8% Cortex-X1: -0.9% Cortex-X2: -4.6% Cortex-X3: -3.6% Cortex-X4: -20.8% Bug: libyuv:976 Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-16 04:36:43 +00:00
George Steed	772f0fde1c	[AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON The existing code only makes use of half of the vector lanes in the RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more data to allow using full vectors, adjusting the "any" kernel macros to match. For the RGB565ToUVRow kernel we already have plenty of data but currently call the macro twice as much as needed, so refactor the code to only call it once but operating with full vectors instead. Reduction in runtimes observed for selected micro-architectures: \| RGB565ToARGBRow \| RGB565ToUVRow \| RGB565ToYRow Cortex-A53 \| -35.2% \| -28.8% \| -31.1% Cortex-A55 \| -32.5% \| -34.4% \| -42.9% Cortex-A510 \| -21.6% \| -27.7% \| -47.2% Cortex-A76 \| -0.9% \| -42.0% \| -21.4% Cortex-A720 \| -28.6% \| -37.2% \| -26.1% Cortex-X1 \| -3.2% \| -42.3% \| -23.4% Bug: b/42280945 Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:35:47 +00:00
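
Note on commit f27b983f38 (DivideRow_16): the message describes replacing a widening multiply followed by a narrowing shift with a single multiply-high instruction (UMULH). The scalar C below is only a sketch of that arithmetic. It is not libyuv's implementation; the names mul_high_u16 and divide_row_16_sketch are hypothetical, and it assumes the kernel keeps the top 16 bits of a 16-bit by 16-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of libyuv: returns the high 16 bits of the
   32-bit product of two unsigned 16-bit values. */
static uint16_t mul_high_u16(uint16_t value, uint16_t scale) {
  /* Step 1: widening multiply (16 x 16 -> 32 bits).
     Step 2: narrowing shift that keeps only the high half.
     A multiply-high instruction performs both steps at once. */
  uint32_t product = (uint32_t)value * (uint32_t)scale;
  return (uint16_t)(product >> 16);
}

/* Hypothetical row loop: applies the same scale factor to every sample.
   Assumes src and dst each hold at least `width` elements. */
static void divide_row_16_sketch(const uint16_t* src, uint16_t* dst,
                                 uint16_t scale, int width) {
  for (int x = 0; x < width; ++x) {
    dst[x] = mul_high_u16(src[x], scale);
  }
}
```

With scale = 32768 (0x8000) each sample is halved, rounding down. A vector version would process many lanes per iteration, but the per-lane arithmetic would be the same under the assumption above.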

2740 Commits