libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-01-01 03:12:16 +08:00

Author	SHA1	Message	Date
George Steed	a37e6bc81b	[AArch64] Re-enable SME only for Linux and new versions of Clang This was previously disabled in 679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for Linux where SME is known to work correctly. Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-23 09:29:53 +00:00
Wan-Teh Chang	85e55115f0	Untangle arm and aarch64 #ifdefs in GetCpuFlags() Change-Id: I5df39c20a700aee38954bc9288fdee116138645d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5879350 Reviewed-by: George Steed <george.steed@arm.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-20 23:40:19 +00:00
Alex Richardson	f1b28b3510	Avoid reading /proc/cpuinfo for non-Linux Arm platforms While we will return kCpuHasNEON if the file fails to open, this does unnecessarily introduce filesystem operations which are not needed e.g. on embedded non-Linux platforms. When not building for Linux, we can simply rely on the compiler flags to determine whether NEON support is present for Arm32. Change-Id: Ifb0eab2a46969fca5f733ce624abdf54da9b32a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778479 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: George Steed <george.steed@arm.com>	2024-09-20 22:22:03 +00:00
George Steed	7eb552c891	[AArch64] Avoid unnecessary MOVs in ScaleARGBRowDownEvenBox_NEON The existing code uses three MOV instructions through a temporary register to swap the low and high halves of a vector register, however this can be done with a pair of ZIP instructions instead. Also use a pair of RSHRN rather than RSHRN2 to allow these to execute in parallel on little cores. Reduction in runtime observed compared to the existing Neon implementation: Cortex-A55: -8.3% Cortex-A510: -20.6% Cortex-A520: -16.6% Cortex-A76: -6.8% Cortex-A715: -6.2% Cortex-A720: -6.2% Cortex-X1: -22.0% Cortex-X2: -18.7% Cortex-X3: -21.1% Cortex-X4: -25.8% Cortex-X925: -21.9% Change-Id: I87ae133be86c3c9f850d5848ec19d9b71ebda4d9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5872801 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-20 00:28:12 +00:00
George Steed	23a6a412e5	[AArch64] Unroll and use TBL in ScaleRowDown34_NEON ST3 is known to be slow on a number of modern micro-architectures. By unrolling the code we are able to use TBL to shuffle elements into the correct indices without needing to use LD4 and ST3, giving a good improvement in performance across the board. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -14.4% Cortex-A510: -66.0% Cortex-A520: -50.8% Cortex-A76: -60.5% Cortex-A715: -63.9% Cortex-A720: -64.2% Cortex-X1: -74.3% Cortex-X2: -75.4% Cortex-X3: -75.5% Cortex-X4: -48.1% Bug: b/42280945 Change-Id: Ia1efb03af2d6ec00bc5a4b72168963fede9f0c83 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785971 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 15:37:27 +00:00
George Steed	d5303f4f77	[AArch64] Unroll ARGB1555ToARGBRow_NEON to use full Neon vectors Processing more data per loop iteration means that we can use the full 128-bit Neon vectors and also allows us to use e.g. UZP1 to perform XTN + XTN2 in a single instruction. The early Cortex-X cores are not a fan of ST4 .16b with a post-increment, so split out the pointer increment to a separate instruction to avoid this bottleneck. Reductions in runtime observed for ARGB1555ToARGBRow_NEON: Cortex-A55: -18.1% Cortex-A510: -11.2% Cortex-A520: -39.5% Cortex-A76: -18.0% Cortex-A715: -34.8% Cortex-A720: -34.8% Cortex-X1: -0.9% Cortex-X2: -4.6% Cortex-X3: -3.6% Cortex-X4: -20.8% Bug: libyuv:976 Change-Id: Iae2ac24ffdbc718cd1e05bb77191f8d1df3fcf6f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790975 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-16 04:36:43 +00:00
George Steed	772f0fde1c	[AArch64] Use full Neon vectors in RGB565To{ARGB,UV,Y}Row_NEON The existing code only makes use of half of the vector lanes in the RGB565TOARGB macro. In the RGB565To{ARGB,Y} kernels we can load more data to allow using full vectors, adjusting the "any" kernel macros to match. For the RGB565ToUVRow kernel we already have plenty of data but currently call the macro twice as much as needed, so refactor the code to only call it once but operating with full vectors instead. Reduction in runtimes observed for selected micro-architectures: \| RGB565ToARGBRow \| RGB565ToUVRow \| RGB565ToYRow Cortex-A53 \| -35.2% \| -28.8% \| -31.1% Cortex-A55 \| -32.5% \| -34.4% \| -42.9% Cortex-A510 \| -21.6% \| -27.7% \| -47.2% Cortex-A76 \| -0.9% \| -42.0% \| -21.4% Cortex-A720 \| -28.6% \| -37.2% \| -26.1% Cortex-X1 \| -3.2% \| -42.3% \| -23.4% Bug: b/42280945 Change-Id: Ib1f68e5b87cc05a1485bbe96cfef87e6ac119fc3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790974 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:35:47 +00:00
George Steed	2dfb84b311	[AArch64] Unroll to use full vectors in ARGBToARGB1555Row_NEON By loading packed 16-bit AR/GB data and operating on that directly we avoid the need to perform a separate widening step before the conversion. Reduction in runtime observed compared to the existing Neon code: Cortex-A55: -13.2% Cortex-A510: -5.4% Cortex-A76: -21.5% Cortex-A720: -25.2% Cortex-X1: -50.6% Cortex-X2: -36.8% Bug: b/42280945 Change-Id: I780c71fdff1d017464c6e4e38f86979dda0e43ad Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790973 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-16 04:33:22 +00:00
George Steed	432d186116	[AArch64] Add Neon dot-product implementation for ARGBSepiaRow We can use the dot product instructions to apply the coefficients directly without the need for LD4 de-interleaving load instructions, since these are known to be slow on some micro-architectures. ST4 is also known to be slow on more modern micro-architectures, however avoiding this is left for a future SVE implementation where we can make use of interleaving-narrowing instructions. Reduction in cycle counts observed compared to existing Neon code: Cortex-A55: -5.8% Cortex-A510: -18.9% Cortex-A76: -21.8% Cortex-A720: -30.2% Cortex-X1: -28.6% Cortex-X2: -23.4% Bug: b/42280946 Change-Id: I5887559649cc805a810d867b652c85d48285657d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:31:35 +00:00
George Steed	1c31461771	[AArch64] Add Neon dot-product implementation for ARGBGrayRow We can use dot product instructions to apply the coefficients without needing to use LD4 deinterleaving load instructions, and then TBL to mix in the original alpha component. This is significantly faster on some micro-architectures where LD4 instructions are known to be slow compared to normal loads. Reduction in cycle counts observed compared to existing Neon code: Cortex-A55: -12.6% Cortex-A510: -48.6% Cortex-A76: -39.7% Cortex-A720: -52.3% Cortex-X1: -63.5% Cortex-X2: -67.0% Bug: b/42280946 Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:31:11 +00:00
George Steed	2d62d8d22a	[AArch64] Unroll ScaleRowDown4_NEON We can use wider load/store instructions here which is mostly an improvement across the board. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: +4.9% (!) Cortex-A510: -46.3% Cortex-A520: -49.0% Cortex-A76: -12.2% Cortex-A715: -15.5% Cortex-A720: -15.0% Cortex-X1: -12.4% Cortex-X2: -12.5% Cortex-X3: -12.3% Cortex-X4: +0.3% Bug: b/42280945 Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:30:04 +00:00
George Steed	e6297afd14	[AArch64] Optimize ScaleARGBRowDown2Linear_NEON Replace LD4 with a pair of LD2 instructions to avoid needing an ST2 instruction for storing the result, since ST2 instructions are known to be slow on some micro-architectures. Observed reduction in runtimes compared to the existing Neon code: Cortex-A55: -23.3% Cortex-A510: -49.6% Cortex-A520: -31.1% Cortex-A76: -44.5% Cortex-A715: -45.8% Cortex-A720: -46.0% Cortex-X1: -74.5% Cortex-X2: -72.4% Cortex-X3: -76.8% Cortex-X4: -39.5% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:28:25 +00:00
George Steed	00886670bb	[AArch64] Avoid LD4/ST2 in ScaleARGBRowDown2_NEON Use separate permute instructions to avoid using LD4/ST2 as these instructions are known to be slow on some micro-architectures. Observed reduction in runtimes compared to the existing Neon code: Cortex-A55: -12.4% Cortex-A510: -44.8% Cortex-A520: -31.1% Cortex-A76: -55.3% Cortex-A715: -63.7% Cortex-A720: -62.3% Cortex-X1: -79.0% Cortex-X2: -78.9% Cortex-X3: -79.6% Cortex-X4: -59.8% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:27:39 +00:00
Frank Barchard	4620f17058	ScalePlane crash fix for 3/4 scaling - Scaling 48 pixels at a time, but calling code checked for 24 pixels - Added test for scaling to 1080x1920 libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560 Was libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560 [ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box Segmentation fault Traceback (most recent call last): Now [ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box filter 3 - 6741 us C - 3566 us OPT [ OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms) Bug: b/366045177 Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-09-13 01:20:39 +00:00
Frank Barchard	679e851f65	Convert16To8Row_AVX512BW using vpmovuswb - avx2 is pack/perm is mutating order - cvt method maintains channel order on avx512 Sapphire Rapids Benchmark of 640x360 on Sapphire Rapids AVX512BW [ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms) AVX2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms) SSE2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms) Skylake Xeon Now vpmovuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms) Was vpackuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms) Switch from vpunpcklwd to vpbroadcastw for scale value parameter Was vpunpcklwd %%xmm2,%%xmm2,%%xmm2 vbroadcastss %%xmm2,%%ymm2 Now vpbroadcastw %%xmm2,%%ymm2 Bug: 357439226, 357721018 Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-15 20:13:33 +00:00
Wan-Teh Chang	6157cc4583	Remove the ' separators in hex integer constants They are a C++14 feature, not supported in C++11 mode (-std=c++11). Change-Id: I618020342d4964b994aefa06af83b2e8d553a032 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786607 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-14 20:50:28 +00:00
Wan-Teh Chang	2707098fb1	Fix -Wunused-parameter warnings in release builds Add (void) casts to the 'src_width' parameters that are only used in assertions. Change-Id: I72d1b55f50a9b02b07b206e40e5583005b27928b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786606 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-14 20:49:37 +00:00
Frank Barchard	336e6fd25b	I010ToNV12 conversion using 2 step row function for UV - convert full Y plane with row coalescing if possible - convert rows of UV from 10 bit to 8 bit then call MergeUV libyuv_test '--gunit_filter=010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Note: Google Test filter = 010ToNV12_Opt Skylake Xeon Was 2 pass planes [ OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms) Now 2 pass rows [ OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms) On Samsung S23 libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000' Was [ OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms) Now [ OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms [ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms [ OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms [ OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms [ OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms <- now [ OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms [ OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms [ OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms [ OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms) [ OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms) [ OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms) Bug: b/357439226, b/357721018 Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-13 19:30:16 +00:00
Wan-Teh Chang	5dfa75670d	scale_neon.cc: Fix -Wmissing-prototypes warnings Somehow I missed this file in https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601. Change-Id: Ibd8ed7102d1af12fb929e2ec9bcc87da7d8be306 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785253 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-13 03:50:51 +00:00
Wan-Teh Chang	3cf54e90d3	Fix -Wmissing-prototypes warnings Declare functions as static. Declare functions in a header. Include the header that declares the functions. Delete undeclared and unused functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete unused function ScaleY() in psnr_main.cc. Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-12 19:08:24 +00:00
Frank Barchard	a97746349b	Add test for I010ToNV12 - Add support for negative height to invert - Fix off by 1 on odd width and height - Bump version to 1895 Initial I010 is 2 step planar conversion libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon [ OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms) Pixel 7 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms) Bug: b/357721018, b/357439226 Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-12 18:57:56 +00:00
Chunbo Hua	fc94178260	Implement I010ToNV12 conversion I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes. Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an interleaved UV plane. Bug: 357721018 Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-06 17:36:13 +00:00
Frank Barchard	32ccd53bb3	Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar - P010 and NV12 have the same layout: Full size Y plane and half size UV plane. P010 and NV12 are 4:2:0 subsampling - P010 uses upper 10 bits of 16 bit elements - NV12 uses 8 bit elements - The Convert16To8 used internally will discard the low 2 bits. - UV order is the same - U first in memory, followed by V, interleaved - UV plane is rounded up in size to allow odd size Y to have UV values - Similar code could be used to convert P210ToNV16, P410ToNV24, with the size of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants. Bug: b/357439226 Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-05 18:55:44 +00:00
Wan-Teh Chang	e462de319c	Fix -Wundef warnings Change-Id: I803b70f66ca938665ba39b961bdb31625c6bc503 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5758156 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-02 17:39:59 +00:00
Frank Barchard	4cd90347e7	Rotate use NULL for C compatibility Bug: b/353323977 Change-Id: I2472f23ce8fcc0bc09a292bd6fb758304c6c2b18 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5735714 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-07-23 18:02:47 +00:00
George Steed	8f039f639c	[AArch64] Unroll ScaleRowDown4Box_NEON We can use wider load/store instructions and avoid the need to waste half of the ADDP/RSHRN vector data. The duplicated UADDLP and UADALP instructions also provide a good improvement on little cores due to their limited out-of-order capability. The mask in the "any" kernel definition is already set up to handle an unrolling of eight so no change to scale_any.cc is needed. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -19.5% Cortex-A520: -38.3% Cortex-A76: -36.0% Cortex-A715: -18.1% Cortex-A720: -17.9% Cortex-X1: -25.4% Cortex-X2: -18.5% Cortex-X3: -8.2% Cortex-X4: -3.8% Bug: b/42280945 Change-Id: Iebba5da4db5e25af4b9fa5651c7396364dedffba Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725172 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 19:52:21 +00:00
George Steed	dc392094fc	[AArch64] Unroll ScaleRowDown34_0_Box_NEON The additional parallel instruction streams provide a good benefit to little cores with limited out-of-order capability. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -19.1% Cortex-A510: -31.6% Cortex-A520: -35.2% Cortex-A76: -14.3% Cortex-A715: +0.1% Cortex-A720: =0.0% Cortex-X1: -6.6% Cortex-X2: -0.1% Cortex-X3: -0.2% Cortex-X4: -7.2% Bug: b/42280945 Change-Id: Idca21a5af1dc6f189e644a81537d41f50ef66498 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725171 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 19:52:01 +00:00
George Steed	776a509891	[AArch64] Unroll ScaleRowDown34_1_Box_NEON We can make use of wider instructions for the loads and stores as well as the URHADD instructions. In addition the duplicated instructions of the code from the unrolling provides a further small improvement for little cores with limited out-of-order capability. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -23.5% Cortex-A510: -35.4% Cortex-A520: -40.5% Cortex-A76: -15.1% Cortex-A715: -6.2% Cortex-A720: -6.2% Cortex-X1: -17.9% Cortex-X2: -18.4% Cortex-X3: -18.3% Cortex-X4: -14.0% Bug: b/42280945 Change-Id: I5905e026a0507870bfc580b702906d6acb4ed6f4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725170 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 19:51:45 +00:00
George Steed	be5de19db3	[AArch64] Unroll ScaleRowUp2_Linear_NEON On little cores with limited out-of-order capability this gives a good improvement. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: -21.3% Cortex-A520: -33.6% Cortex-A76: +1.1% Cortex-A715: =0.0% Cortex-A720: =0.0% Cortex-X1: +10.4% (!) Cortex-X2: -5.3% Cortex-X3: -4.3% Cortex-X4: -9.9% Bug: b/42280945 Change-Id: I45b3510f13c05b19d61052e2f8e447199dbd0551 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725169 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 19:51:17 +00:00
George Steed	42d33341d3	[AArch64] Unroll {RAW,RGB24}To{ARGB,RGBA}Row_SVE2 Unrolling gives a nice improvement to the little cores and even a small improvement to the big cores thanks to avoiding the loop control overhead. Observed performance improvement relative to the existing SVE2 code. \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 RAWToARGBRow_SVE2 \| -28.4% \| -10.1% \| -3.5% RAWToRGBARow_SVE2 \| -28.5% \| -10.1% \| -4.4% RGB24ToARGBRow_SVE2 \| -28.5% \| -10.4% \| -5.5% Bug: libyuv:973 Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 16:01:56 +00:00
George Steed	4ad050b5ec	[AArch64] Unroll {I422,I422Alpha}ToARGBRow_SVE2 Since the UV components are duplicated in I422 we end up wasting half of the vector bandwidth processing the same elements twice. By unrolling the kernel to process two vectors of Y per iteration we can fill a whole vector of U/V components. Rather than packing RGBA components into pairs during the narrowing we now just narrow into individual component vectors and use ST4B instead. This by itself is slower on some micro-architectures like Cortex-A510 but the benefit from unrolling significantly outweighs this. \| I422AlphaToARGBRow_SVE2 \| I422ToARGBRow_SVE2 Cortex-A510 \| -46.2% \| -48.8% Cortex-A720 \| -20.8% \| -21.0% Cortex-X2 \| -11.3% \| -7.5% Cortex-X4 \| -15.4% \| -15.5% Bug: libyuv:973 Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 15:55:59 +00:00
George Steed	b5f9d7cb76	[AArch64] Add SME implementation of TransposeUVWxH We can make use of the ZA tile register to do the transpose and de-interleaving of UV components without any explicit permute instructions: the tile is loaded horizontally placing UV components into alternative columns, then we can just store the independent components vertically. Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 12:15:40 +00:00
George Steed	15ecca81f7	[AArch64] Add SME implementation of TransposeWxH We can make use of the ZA tile register to do the transpose without any explicit permute instructions: just load the tile horizontally and store it vertically. Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 12:14:39 +00:00
George Steed	a4ccf9940e	[AArch64] Add I8MM implementation of ARGBToUV444Row We cannot use the standard dot-product instructions since the coefficients multiplication results are both added and subtracted, but I8MM supports mixed-sign dot products which work well here. We need to add an additional variant of the coefficient structs since we need negative constants for the elements that were previously subtracted. Reduction in runtimes observed compared to the previous Neon implementation: Cortex-A510: -37.3% Cortex-A520: -31.1% Cortex-A715: -37.1% Cortex-A720: -37.0% Cortex-X2: -62.1% Cortex-X3: -62.2% Cortex-X4: -40.4% Bug: libyuv:977 Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-16 17:32:52 +00:00
George Steed	a64fffe632	Revert "Disable NV12ToARGB_SVE2 which fails the 'any' test" This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17. This code has a number of small issues: * The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to all-true, however the existing kernel does not initialise p0 until after this macro is called, so flip the order. * The p2 register is missing from the clobber list, so add it. * The existing code uses the wrong condition flags when determining whether to do the tail iteration using WHILE instructions or not. Additionally the number of tail iterations is incorrect, as it was incorrectly not changed from when the tail code was always executed. While we are here, make another few small improvements: * Remove the single-quote digit separators as requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133 * Remove "volatile" from the asm block counting the vector length. This particular asm block cannot be removed by the compiler since the output register is consumed by subsequent code, so "volatile" is unnecessary here and we remove it. * Add some additional empty comments to force clang-format to put macros into the next line rather than on the same line as other asm. Bug: b/352371649 Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-15 18:13:42 +00:00
George Steed	e1a93c79fc	[AArch64] Fix rotate by odd sizes The existing disabled gtest rotate tests fail because the existing "any" kernels always assume we are processing height=8 rows at a time. This was recently changed to 16 on AArch64 which triggered this bug. To fix this, amend the TANY macro to explicitly specify the fallback kernel, such that we can use the height=16 kernel to match the SIMD optimized version where necessary. Also change other architecture versions to match. Bug: b/352351302 Change-Id: I8080fa8f44c7c67fa970a78fb426f2f801a9a00e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703585 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-15 18:13:31 +00:00
George Steed	c1fe5663f5	[AArch64] Use full vectors in ARGB4444To{Y,UV}Row_NEON The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors rather than the full 128 bits available, so unroll it to allow us to process more data per instruction. For ARGB4444ToUVRow_NEON we already have enough data available each iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we also need to adjust the "any" kernel to allow us to process 16 elements per iteration. Reduction in runtimes observed compared to the existing Neon kernels: \| ARGB4444ToUVRow \| ARGB4444ToYRow Cortex-A55 \| -27.8% \| -34.6% Cortex-A510 \| -37.0% \| -44.4% Cortex-A76 \| -40.2% \| -22.0% Cortex-A720 \| -33.4% \| -35.5% Cortex-X1 \| -34.1% \| -19.7% Cortex-X2 \| -32.1% \| -26.3% Bug: libyuv:976 Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-10 23:12:43 +00:00
George Steed	5bac99fe09	[AArch64] Rework data loading in ScaleARGBFilterCols_NEON The existing code makes use of lane-indexed LD2 instructions to load the input data however this creates a strong dependency chain between consecutive load instructions. We can reduce this dependency chain by instead loading two vectors with wider lane-indexed LD1 instructions and then performing a permute to unzip the data. We can also avoid the need for a complex sequence of DUP + EXT instructions by using TBL to permute the data exactly as we want it. Reduction in runtimes observed compared to the existing Neon implementation: Cortex-A55: =0.0% Cortex-A510: -44.2% Cortex-A520: -47.6% Cortex-A76: -45.8% Cortex-A715: -58.3% Cortex-A720: -58.4% Cortex-X1: -66.7% Cortex-X2: -68.0% Cortex-X3: -67.9% Cortex-X4: -70.0% Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-10 23:10:43 +00:00
George Steed	a425b559bd	[AArch64] Use full vectors in ARGB1555To{Y,UV}Row_NEON The existing RGB555TOARGB macro only makes use of 64 bit wide vectors rather than the full 128 bits available, so unroll it to allow us to process more data per instruction. For ARGB1555ToUVRow_NEON we already have enough data available each iteration to make use of full vectors, but for ARGB1555ToYRow_NEON we also need to adjust the "any" kernel to allow us to process 16 elements per iteration. Reduction in runtimes observed compared to the existing Neon kernels: \| ARGB1555ToUVRow \| ARGB1555ToYRow Cortex-A55 \| -28.8% \| -35.3% Cortex-A510 \| -34.0% \| -48.5% Cortex-A76 \| -36.7% \| -25.1% Cortex-A720 \| -29.7% \| -31.1% Cortex-X1 \| -31.6% \| -19.7% Cortex-X2 \| -27.6% \| -22.7% Bug: libyuv:976 Change-Id: Idd745c133b5fb65001652a59f01ac1aa3bb42067 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631540 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-10 23:09:53 +00:00
Frank Barchard	3902eaaf86	Fix for source/row_neon64.cc:551:12: error: unused variable 'alpha' [-Werror,-Wunused-variable] 551 \| uint16_t alpha = 0xc000; \| ^~~~~ 1 error generated. Bug: None Change-Id: Ifdfe39f75c003921e4f759bcbbbffe0e766039bd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5690260 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-07-09 22:51:33 +00:00
George Steed	899bc48327	[AArch64] Add SVE2 implementations of ARGBTo{RAW,RGB24}Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD4 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementations: \| ARGBToRAWRow \| ARGBToRGB24Row Cortex-A510 \| -50.8% \| -19.9% Cortex-A720 \| -39.8% \| -39.1% Cortex-X2 \| -66.5% \| -51.9% Bug: libyuv:973 Change-Id: Iaead678715a3d70d54cf823391272a6196836769 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:27:54 +00:00
George Steed	5236846b64	[AArch64] Keep UV interleaved in some *ToARGBRow_SVE2 kernels The existing I4XXTORGB_SVE macro operates only on even byte lanes of the loaded U/V vectors. This is sub-optimal since we are effectively wasting half of the vector in any pre-processing steps before the conversion. In particular, where the UV components are loaded from interleaved data we can save a TBL instruction by maintaining the interleaved format. This commit introduces a new NVTORGB_SVE macro to handle the case where U/V components are interleaved into even/odd bytes of a vector, mirroring a similar macro in the AArch64 Neon implementation. Reduction in runtimes observed compared to the existing SVE2 code: \| Cortex-A510 \| Cortex-A720 \| Cortex-X2 NV12ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% NV21ToARGBRow_SVE2 \| -5.3% \| -0.2% \| -4.4% UYVYToARGBRow_SVE2 \| -5.6% \| 0.0% \| -4.6% YUY2ToARGBRow_SVE2 \| -5.5% \| -0.1% \| -4.2% Bug: libyuv:973 Change-Id: I418de2e684e0b6b0b9e41c39b564438531e44671 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 20:26:23 +00:00
George Steed	555f80f3ce	[AArch64] Add SVE2 implementation of RGB24ToARGBRow This can make use of the existing helper functions for RAWToARGBRow_SVE2 and RAWToRGBARow_SVE2 since the layouts are similar, we just need to adjust the TBL constants to match the different input layout. Observed reduction in runtime compared to the existing Neon kernel: Cortex-A510: -25.6% Cortex-A720: -15.2% Cortex-X2: -10.2% Cortex-X4: -30.2% Bug: libyuv:973 Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-08 20:12:05 +00:00
George Steed	fcbe22c59c	[AArch64] Enable SME feature detection on Apple Silicon Check for availability of SME and SME2 by looking for the hw.optional.arm.FEAT_SME2 feature string in sysctlbyname. Non-streaming SVE is not supported but for our purposes the features can be treated as orthogonal since our SME code will only ever run in streaming mode. Change-Id: I7e9d242e0f581217b625d74c7c3b0c76a0fe03da Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683128 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 16:19:27 +00:00
George Steed	11ff6067a5	[AArch64] Add SVE2 implementation of RAWToRGB24Row There is no nice way of forming the TBL permute indices here since we are operating on sets of three bytes at a time, so instead load the appropriate indices from a static array. We can make use of SVE predication to ensure we are operating on a multiple of three bytes for the load/store instructions rather than needing to make use of more expensive LD3 or ST3 instructions. Reduction in runtime observed compared to the existing Neon implementation: Cortex-A510: -39.2% Cortex-A720: -34.5% Cortex-X2: -31.0% Bug: libyuv:973 Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-08 15:55:14 +00:00
George Steed	c613c3f102	[AArch64] Add SVE2 implementations for RAWTo{ARGB,RGBA}Row We can construct particular predicates to load only up to 3/4 of a full vector, allowing us to use TBL to shuffle elements into the correct place rather than needing to rely on more expensive LD3 or ST4 instructions. Reduction in runtimes observed compared to the existing Neon implementation: \| RAWToARGBRow \| RAWToRGBARow Cortex-A510 \| -32.4% \| -31.9% Cortex-A720 \| -15.7% \| -15.6% Cortex-X2 \| -24.6% \| -24.4% Bug: libyuv:973 Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-06 22:40:15 +00:00
George Steed	d1ec694ad3	[AArch64] Add P{210,410}To{ARGB,AR30}Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| Cortex-A55 \| Cortex-A510 \| Cortex-A76 P210ToARGBRow \| -59.8% \| -16.8% \| -53.2% P210ToAR30Row \| -48.1% \| -21.8% \| -54.0% P410ToARGBRow \| -56.5% \| -32.2% \| -54.1% P410ToAR30Row \| -42.4% \| -4.5% \| -50.4% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-06 22:37:08 +00:00
Frank Barchard	611806a155	[AArch64] Fix SVE/SME vector length printing in cpuid A semicolon is treated as the start of a comment by some assemblers causing the vector length to be reported incorrectly, so use a newline instead. - Add volatile asm in row_gcc and row_neon64 Bug: b/5631539 Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060 Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-02 19:44:41 +00:00
George Steed	d32436e8f8	[AArch64] Add Neon implementation for I422ToAR30Row_NEON There is an existing x86 implementation for this kernel, but not for AArch64, so add one. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: Cortex-A55: -43.1% Cortex-A510: -22.3% Cortex-A76: -54.8% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-02 18:16:33 +00:00
George Steed	bbd9cedc4f	[AArch64] Add Neon impls for I212To{ARGB,AR30}Row_NEON There are existing x86 implementations for these kernels, but not for AArch64, so add them. Reduction in runtimes, compared to the existing C code compiled with LLVM 17: \| I210ToAR30Row \| I210ToARGBRow Cortex-A55 \| -40.8% \| -54.4% Cortex-A510 \| -26.2% \| -22.7% Cortex-A76 \| -49.2% \| -44.5% Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com> Bug: libyuv:976 Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-02 18:16:33 +00:00


1885 Commits