We can use the dot product instructions to apply the coefficients
directly without the need for LD4 de-interleaving load instructions,
since these are known to be slow on some micro-architectures.
ST4 is also known to be slow on more modern micro-architectures; avoiding
it is left for a future SVE implementation where we can make use of
interleaving-narrowing instructions.
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -5.8%
Cortex-A510: -18.9%
Cortex-A76: -21.8%
Cortex-A720: -30.2%
Cortex-X1: -28.6%
Cortex-X2: -23.4%
Bug: b/42280946
Change-Id: I5887559649cc805a810d867b652c85d48285657d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use dot product instructions to apply the coefficients without
needing to use LD4 deinterleaving load instructions, and then TBL to mix
in the original alpha component. This is significantly faster on some
micro-architectures where LD4 instructions are known to be slow compared
to normal loads.
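For illustration, a minimal sketch (not the actual libyuv kernel, and with
made-up weights) of how UDOT can apply per-channel coefficients to
interleaved BGRA bytes without any de-interleaving load:
#include <arm_neon.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
// Hypothetical weights: B=32, G=64, R=64, A=0, packed per 4-byte pixel.
void WeightedSumRow(const uint8_t* src_bgra, uint32_t* dst, int width) {
  const uint8x16_t coeff = vreinterpretq_u8_u32(vdupq_n_u32(0x00404020u));
  for (int i = 0; i < width; i += 4) {
    uint8x16_t px = vld1q_u8(src_bgra + 4 * i);  // 4 interleaved pixels
    // UDOT: each 32-bit lane accumulates B*32 + G*64 + R*64 + A*0.
    vst1q_u32(dst + i, vdotq_u32(vdupq_n_u32(0), px, coeff));
  }
}
#endif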
Reduction in cycle counts observed compared to existing Neon code:
Cortex-A55: -12.6%
Cortex-A510: -48.6%
Cortex-A76: -39.7%
Cortex-A720: -52.3%
Cortex-X1: -63.5%
Cortex-X2: -67.0%
Bug: b/42280946
Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use wider load/store instructions here which is mostly an
improvement across the board.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: +4.9% (!)
Cortex-A510: -46.3%
Cortex-A520: -49.0%
Cortex-A76: -12.2%
Cortex-A715: -15.5%
Cortex-A720: -15.0%
Cortex-X1: -12.4%
Cortex-X2: -12.5%
Cortex-X3: -12.3%
Cortex-X4: +0.3%
Bug: b/42280945
Change-Id: Id8af6499c63919924c2a954dfe7765b703ce4820
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785970
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Replace LD4 with a pair of LD2 instructions to avoid needing an ST2
instruction for storing the result, since ST2 instructions are known to
be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -23.3%
Cortex-A510: -49.6%
Cortex-A520: -31.1%
Cortex-A76: -44.5%
Cortex-A715: -45.8%
Cortex-A720: -46.0%
Cortex-X1: -74.5%
Cortex-X2: -72.4%
Cortex-X3: -76.8%
Cortex-X4: -39.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Iab9e802d0784d69b7e970dcc8f1f4036985cd2e1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790972
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use separate permute instructions to avoid using LD4/ST2 as these
instructions are known to be slow on some micro-architectures.
Observed reduction in runtimes compared to the existing Neon code:
Cortex-A55: -12.4%
Cortex-A510: -44.8%
Cortex-A520: -31.1%
Cortex-A76: -55.3%
Cortex-A715: -63.7%
Cortex-A720: -62.3%
Cortex-X1: -79.0%
Cortex-X2: -78.9%
Cortex-X3: -79.6%
Cortex-X4: -59.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I33cf27ae5e16c1ce62f1f343043e6bd9fca92558
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790971
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Scaling 48 pixels at a time, but the calling code checked for 24 pixels
- Added test for scaling to 1080x1920
libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560
Was
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
Segmentation fault
Traceback (most recent call last):
Now
[ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box
filter 3 - 6741 us C - 3566 us OPT
[ OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms)
Bug: b/366045177
Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554
Reviewed-by: Wan-Teh Chang <wtc@google.com>
- avx2 pack/perm mutates the channel order
- cvt method maintains channel order on avx512
Benchmark of 640x360 on Sapphire Rapids
AVX512BW
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms)
AVX2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms)
SSE2
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms)
Skylake Xeon
Now vpmovuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms)
Was vpackuswb
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms)
Switch from vpunpcklwd to vpbroadcastw for scale value parameter
Was
vpunpcklwd %%xmm2,%%xmm2,%%xmm2
vbroadcastss %%xmm2,%%ymm2
Now
vpbroadcastw %%xmm2,%%ymm2
Bug: 357439226, 357721018
Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Add (void) casts to the 'src_width' parameters that are only used in
assertions.
Change-Id: I72d1b55f50a9b02b07b206e40e5583005b27928b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5786606
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- convert full Y plane with row coalescing if possible
- convert rows of UV from 10 bit to 8 bit then call MergeUV
libyuv_test '--gunit_filter=*010ToNV12_Opt' --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Note: Google Test filter = *010ToNV12_Opt
Skylake Xeon Was 2 pass planes
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (4512 ms)
Now 2 pass rows
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (2400 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (2265 ms)
On Samsung S23
libyuv_test --gunit_filter=*.????ToNV12_Opt --libyuv_width=3840 --libyuv_height=2160 --libyuv_repeat=1000
Was
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (3563 ms)
Now
[ OK ] LibYUVConvertTest.AYUVToNV12_Opt (3068 ms)
[ OK ] LibYUVConvertTest.ARGBToNV12_Opt (2990 ms)
[ OK ] LibYUVConvertTest.ABGRToNV12_Opt (2904 ms)
[ OK ] LibYUVConvertTest.P010ToNV12_Opt (1177 ms)
[ OK ] LibYUVConvertTest.I010ToNV12_Opt (1150 ms) <- now
[ OK ] LibYUVConvertTest.I444ToNV12_Opt (1118 ms)
[ OK ] LibYUVConvertTest.MM21ToNV12_Opt (1008 ms)
[ OK ] LibYUVConvertTest.UYVYToNV12_Opt (1007 ms)
[ OK ] LibYUVConvertTest.YUY2ToNV12_Opt (938 ms)
[ OK ] LibYUVConvertTest.NV21ToNV12_Opt (496 ms)
[ OK ] LibYUVConvertTest.I420ToNV12_Opt (466 ms)
Bug: b/357439226, b/357721018
Change-Id: I48405929ae835b171e7d556a16794eac22c50ae9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5782404
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Declare functions as static. Declare functions in a header. Include the
header that declares the functions. Delete undeclared and unused
functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete
unused function ScaleY() in psnr_main.cc.
Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
I010, also known as YUV420P10, is a 10-bit YUV pixel format with 3 planes.
Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an
interleaved UV plane.
Bug: 357721018
Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- P010 and NV12 have the same layout: Full size Y plane and half size UV plane.
P010 and NV12 are 4:2:0 subsampling
- P010 uses upper 10 bits of 16 bit elements
- NV12 uses 8 bit elements
- The Convert16To8 used internally will discard the low 2 bits.
- UV order is the same - U first in memory, followed by V, interleaved
- UV plane is rounded up in size to allow odd size Y to have UV values
- Similar code could be used to convert P210ToNV16, P410ToNV24, with the size
of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants.
Bug: b/357439226
Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958
Reviewed-by: Wan-Teh Chang <wtc@google.com>
We can use wider load/store instructions and avoid the need to waste
half of the ADDP/RSHRN vector data. The duplicated UADDLP and UADALP
instructions also provide a good improvement on little cores due to
their limited out-of-order capability.
The mask in the "any" kernel definition is already set up to handle an
unrolling of eight so no change to scale_any.cc is needed.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -19.5%
Cortex-A520: -38.3%
Cortex-A76: -36.0%
Cortex-A715: -18.1%
Cortex-A720: -17.9%
Cortex-X1: -25.4%
Cortex-X2: -18.5%
Cortex-X3: -8.2%
Cortex-X4: -3.8%
Bug: b/42280945
Change-Id: Iebba5da4db5e25af4b9fa5651c7396364dedffba
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725172
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of wider instructions for the loads and stores as well
as the URHADD instructions. In addition the duplicated instructions of
the code from the unrolling provides a further small improvement for
little cores with limited out-of-order capability.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -23.5%
Cortex-A510: -35.4%
Cortex-A520: -40.5%
Cortex-A76: -15.1%
Cortex-A715: -6.2%
Cortex-A720: -6.2%
Cortex-X1: -17.9%
Cortex-X2: -18.4%
Cortex-X3: -18.3%
Cortex-X4: -14.0%
Bug: b/42280945
Change-Id: I5905e026a0507870bfc580b702906d6acb4ed6f4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725170
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Unrolling gives a nice improvement to the little cores and even a small
improvement to the big cores thanks to avoiding the loop control
overhead.
Observed performance improvement relative to the existing SVE2 code.
| Cortex-A510 | Cortex-A720 | Cortex-X2
RAWToARGBRow_SVE2 | -28.4% | -10.1% | -3.5%
RAWToRGBARow_SVE2 | -28.5% | -10.1% | -4.4%
RGB24ToARGBRow_SVE2 | -28.5% | -10.4% | -5.5%
Bug: libyuv:973
Change-Id: I7aa03fdaa1a24ecfdd13418647a02e5effe8333f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725174
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Since the UV components are duplicated in I422 we end up wasting half of
the vector bandwidth processing the same elements twice. By unrolling
the kernel to process two vectors of Y per iteration we can fill a whole
vector of U/V components.
Rather than packing RGBA components into pairs during the narrowing we
now just narrow into individual component vectors and use ST4B instead.
This by itself is slower on some micro-architectures like Cortex-A510,
but the benefit from unrolling significantly outweighs it.
| I422AlphaToARGBRow_SVE2 | I422ToARGBRow_SVE2
Cortex-A510 | -46.2% | -48.8%
Cortex-A720 | -20.8% | -21.0%
Cortex-X2 | -11.3% | -7.5%
Cortex-X4 | -15.4% | -15.5%
Bug: libyuv:973
Change-Id: I69389c4279861f7a460ae0c28186f023c728c4e8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5725173
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of the ZA tile register to do the transpose and
de-interleaving of UV components without any explicit permute
instructions: the tile is loaded horizontally placing UV components into
alternating columns, then we can just store the independent components
vertically.
Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can make use of the ZA tile register to do the transpose without any
explicit permute instructions: just load the tile horizontally and store
it vertically.
Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We cannot use the standard dot-product instructions since the coefficient
multiplication results are both added and subtracted, but I8MM supports
mixed-sign dot products which work well here. We need to
add an additional variant of the coefficient structs since we need
negative constants for the elements that were previously subtracted.
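A minimal sketch of the mixed-sign dot product, assuming FEAT_I8MM and using
placeholder operands rather than the real coefficient structs:
#include <arm_neon.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_MATMUL_INT8)
// USDOT multiplies unsigned pixel bytes by signed coefficient bytes and
// accumulates into 32-bit lanes, so negative weights need no separate
// subtract step.
int32x4_t MixedSignDot(int32x4_t acc, uint8x16_t pixels, int8x16_t coeffs) {
  return vusdotq_s32(acc, pixels, coeffs);
}
#endif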
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -37.3%
Cortex-A520: -31.1%
Cortex-A715: -37.1%
Cortex-A720: -37.0%
Cortex-X2: -62.1%
Cortex-X3: -62.2%
Cortex-X4: -40.4%
Bug: libyuv:977
Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
This reverts commit f480fa1c4a4af0ce3c34cd7b1ab0d85f1a36ce17.
This code has a number of small issues:
* The YUVTORGB_SVE_SETUP macro requires p0 to be initialized to
all-true, however the existing kernel does not initialise p0 until
after this macro is called, so flip the order.
* The p2 register is missing from the clobber list, so add it.
* The existing code uses the wrong condition flags when determining
whether to do the tail iteration using WHILE instructions or not.
Additionally the number of tail iterations is incorrect, as it was not
updated when the tail code stopped being executed unconditionally.
While we are here, make another few small improvements:
* Remove the single-quote digit separators as requested here:
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133
* Remove "volatile" from the asm block counting the vector length. This
particular asm block cannot be removed by the compiler since the
output register is consumed by subsequent code, so "volatile" is
unnecessary here.
* Add some additional empty comments to force clang-format to put macros
into the next line rather than on the same line as other asm.
Bug: b/352371649
Change-Id: I45676fab95343f588cf11ce2cf9186ffbe87489e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703586
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing disabled gtest rotate tests fail because the existing "any"
kernels always assume we are processing height=8 rows at a time. This
was recently changed to 16 on AArch64 which triggered this bug.
To fix this, amend the TANY macro to explicitly specify the fallback
kernel, such that we can use the height=16 kernel to match the SIMD
optimized version where necessary. Also change other architecture
versions to match.
Bug: b/352351302
Change-Id: I8080fa8f44c7c67fa970a78fb426f2f801a9a00e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703585
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing ARGB4444TORGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.
For ARGB4444ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB4444ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.
Reduction in runtimes observed compared to the existing Neon kernels:
| ARGB4444ToUVRow | ARGB4444ToYRow
Cortex-A55 | -27.8% | -34.6%
Cortex-A510 | -37.0% | -44.4%
Cortex-A76 | -40.2% | -22.0%
Cortex-A720 | -33.4% | -35.5%
Cortex-X1 | -34.1% | -19.7%
Cortex-X2 | -32.1% | -26.3%
Bug: libyuv:976
Change-Id: I08f6286bab0ebf5e24d5d5803f8c45ec6ba776ee
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631541
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of lane-indexed LD2 instructions to load the
input data; however this creates a strong dependency chain between
consecutive load instructions. We can reduce this dependency chain by
instead loading two vectors with wider lane-indexed LD1 instructions and
then performing a permute to unzip the data.
We can also avoid the need for a complex sequence of DUP + EXT
instructions by using TBL to permute the data exactly as we want it.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: =0.0%
Cortex-A510: -44.2%
Cortex-A520: -47.6%
Cortex-A76: -45.8%
Cortex-A715: -58.3%
Cortex-A720: -58.4%
Cortex-X1: -66.7%
Cortex-X2: -68.0%
Cortex-X3: -67.9%
Cortex-X4: -70.0%
Change-Id: I8a1d1fe08d8a2ddb0b86d4a44f0d49b69ab03ece
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683126
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing RGB555TOARGB macro only makes use of 64 bit wide vectors
rather than the full 128 bits available, so unroll it to allow us to
process more data per instruction.
For ARGB1555ToUVRow_NEON we already have enough data available each
iteration to make use of full vectors, but for ARGB1555ToYRow_NEON we
also need to adjust the "any" kernel to allow us to process 16 elements
per iteration.
Reduction in runtimes observed compared to the existing Neon kernels:
| ARGB1555ToUVRow | ARGB1555ToYRow
Cortex-A55 | -28.8% | -35.3%
Cortex-A510 | -34.0% | -48.5%
Cortex-A76 | -36.7% | -25.1%
Cortex-A720 | -29.7% | -31.1%
Cortex-X1 | -31.6% | -19.7%
Cortex-X2 | -27.6% | -22.7%
Bug: libyuv:976
Change-Id: Idd745c133b5fb65001652a59f01ac1aa3bb42067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631540
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD4 or ST3 instructions.
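A minimal sketch of the approach, assuming FEAT_SVE2 and an illustrative
index table (not the real libyuv constants):
#include <arm_sve.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE2)
// Read a full vector of ARGB, use TBL (indices loaded from memory) to pack
// RGB triples, then store only 3/4 of a vector via a predicate, so no ST3
// is needed.
void PackArgbToRgb24Vector(const uint8_t* argb, uint8_t* rgb24,
                           const uint8_t* tbl_indices) {
  uint64_t vl = svcntb();                            // bytes per vector
  svbool_t p_all = svptrue_b8();
  svbool_t p_3_4 = svwhilelt_b8_u64(0, vl * 3 / 4);  // first 3/4 of lanes
  svuint8_t idx = svld1_u8(p_all, tbl_indices);
  svuint8_t src = svld1_u8(p_all, argb);
  svst1_u8(p_3_4, rgb24, svtbl_u8(src, idx));
}
#endif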
Reduction in runtime observed compared to the existing Neon
implementations:
| ARGBToRAWRow | ARGBToRGB24Row
Cortex-A510 | -50.8% | -19.9%
Cortex-A720 | -39.8% | -39.1%
Cortex-X2 | -66.5% | -51.9%
Bug: libyuv:973
Change-Id: Iaead678715a3d70d54cf823391272a6196836769
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631544
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing I4XXTORGB_SVE macro operates only on even byte lanes of the
loaded U/V vectors. This is sub-optimal since we are effectively wasting
half of the vector in any pre-processing steps before the conversion.
In particular, where the UV components are loaded from interleaved data
we can save a TBL instruction by maintaining the interleaved format.
This commit introduces a new NVTORGB_SVE macro to handle the case where
U/V components are interleaved into even/odd bytes of a vector,
mirroring a similar macro in the AArch64 Neon implementation.
Reduction in runtimes observed compared to the existing SVE2 code:
| Cortex-A510 | Cortex-A720 | Cortex-X2
NV12ToARGBRow_SVE2 | -5.3% | -0.2% | -4.4%
NV21ToARGBRow_SVE2 | -5.3% | -0.2% | -4.4%
UYVYToARGBRow_SVE2 | -5.6% | 0.0% | -4.6%
YUY2ToARGBRow_SVE2 | -5.5% | -0.1% | -4.2%
Bug: libyuv:973
Change-Id: I418de2e684e0b6b0b9e41c39b564438531e44671
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622133
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This can make use of the existing helper functions for RAWToARGBRow_SVE2
and RAWToRGBARow_SVE2 since the layouts are similar; we just need to
adjust the TBL constants to match the different input layout.
Observed reduction in runtime compared to the existing Neon kernel:
Cortex-A510: -25.6%
Cortex-A720: -15.2%
Cortex-X2: -10.2%
Cortex-X4: -30.2%
Bug: libyuv:973
Change-Id: Ie3676693286be90d09f0045766c3492cbc04ea64
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5638555
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Check for availability of SME and SME2 by looking for the
hw.optional.arm.FEAT_SME2 feature string in sysctlbyname. Non-streaming
SVE is not supported but for our purposes the features can be treated as
orthogonal since our SME code will only ever run in streaming mode.
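A minimal sketch of the probe, using the feature string named above (the
function name here is illustrative):
#include <stddef.h>
#if defined(__APPLE__)
#include <sys/sysctl.h>
// Query hw.optional.arm.FEAT_SME2; a value of 1 means SME and SME2 can be
// used (our kernels only ever run in streaming mode). Any sysctl failure is
// treated as "not present".
static int HaveSme2(void) {
  int value = 0;
  size_t length = sizeof(value);
  if (sysctlbyname("hw.optional.arm.FEAT_SME2", &value, &length, NULL, 0)) {
    return 0;
  }
  return value == 1;
}
#endif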
Change-Id: I7e9d242e0f581217b625d74c7c3b0c76a0fe03da
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5683128
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no nice way of forming the TBL permute indices here since we
are operating on sets of three bytes at a time, so instead load the
appropriate indices from a static array. We can make use of SVE
predication to ensure we are operating on a multiple of three bytes for
the load/store instructions rather than needing to make use of more
expensive LD3 or ST3 instructions.
Reduction in runtime observed compared to the existing Neon
implementation:
Cortex-A510: -39.2%
Cortex-A720: -34.5%
Cortex-X2: -31.0%
Bug: libyuv:973
Change-Id: I68560bde7a529e5cec150b0e9d3ffe4341038fb8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631543
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can construct particular predicates to load only up to 3/4 of a full
vector, allowing us to use TBL to shuffle elements into the correct
place rather than needing to rely on more expensive LD3 or ST4
instructions.
Reduction in runtimes observed compared to the existing Neon
implementation:
| RAWToARGBRow | RAWToRGBARow
Cortex-A510 | -32.4% | -31.9%
Cortex-A720 | -15.7% | -15.6%
Cortex-X2 | -24.6% | -24.4%
Bug: libyuv:973
Change-Id: I271c625d97bab3b0e08ac1e9d7fcf7d18f3d6894
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5631542
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| Cortex-A55 | Cortex-A510 | Cortex-A76
P210ToARGBRow | -59.8% | -16.8% | -53.2%
P210ToAR30Row | -48.1% | -21.8% | -54.0%
P410ToARGBRow | -56.5% | -32.2% | -54.1%
P410ToAR30Row | -42.4% | -4.5% | -50.4%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I24a5addd2c54c7fdfb9717e2a45ae5acd43d6e96
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607764
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
A semicolon is treated as the start of a comment by some assemblers
causing the vector length to be reported incorrectly, so use a newline
instead.
- Add volatile asm in row_gcc and row_neon64
Bug: b/5631539
Change-Id: I6b0836fcdd9247ef7b9e8ceda01df3150519ecf8
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5666060
Reviewed-by: Justin Green <greenjustin@google.com>
There is an existing x86 implementation for this kernel, but not for
AArch64, so add one.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
Cortex-A55: -43.1%
Cortex-A510: -22.3%
Cortex-A76: -54.8%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ifead36bcb8682a527136223e0dcd210e9abe744a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607763
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToAR30Row | I210ToARGBRow
Cortex-A55 | -40.8% | -54.4%
Cortex-A510 | -26.2% | -22.7%
Cortex-A76 | -49.2% | -44.5%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I967951a6b453ac0023a30d96b754c85c2a3bf14a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5607762
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Some configs have int64 elements off by default.
Disable the ScaleDownBy4 row function to avoid a compile error
Bug: 344954354
Change-Id: Ie0d74daea72375eff6438ab54cb2803d68d67e52
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598460
Reviewed-by: James Zern <jzern@google.com>
- ARM Planar test use regular asm volatile syntax
- x86 row functions remove volatile from asm
Bug: 347111119, 347112532
Change-Id: I535b3dfa1a7a19824503bd95584a63b047b0e9a1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5637058
Reviewed-by: Justin Green <greenjustin@google.com>
1. Add two macros: LIBYUV_RVV_HAS_TUPLE_TYPE & LIBYUV_RVV_HAS_VXRM_ARG
The v0.12 intrinsics introduce
- tuple type in segment load & store
- vxrm argument in fixed-point intrinsics (e.g. vnclip)
These two macros are controlled by __riscv_v_intrinsic.
2. Support RVV v0.12 intrinsics in row_rvv.cc & scale_rvv.cc
Change-Id: I921f91d9dc8fdda031e7b6647d0e296aa2793c39
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4767120
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly similar to the existing NV{12,21}ToARGBRow_SVE2 kernels
except reading the YUV components all from the same interleaved input
array. We load four-byte elements and then use TBL to de-interleave the
UV components.
Unlike the NV{12,21} cases we need to de-interleave bytes rather than
widened 16-bit elements. Since we need a TBL instruction already it
would ordinarily be possible to perform the zero-extension from bytes to
16-bit elements by setting the index for every other byte to be out of
range. Such an approach does not work in SVE since at a vector length of
2048 bits all possible byte values (0-255) are valid indices into
the vector. We instead get around this by rewriting the I4XXTORGB_SVE
macro to perform widening multiplies, operating on the low byte of each
16-bit UV element instead of the full value and therefore eliminating
the need for a zero-extension.
Observed reductions in runtimes compared to the existing Neon code:
| UYVYToARGBRow | YUY2ToARGBRow
Cortex-A510 | -30.2% | -30.2%
Cortex-A720 | -4.8% | -4.7%
Cortex-X2 | -9.6% | -10.1%
Bug: libyuv:973
Change-Id: I841a049aba020d0517563d24d2f14f4d1221ebc6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5622132
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is mostly a copy of the I422ToARGBRow_SVE2 implementation, but we
can pre-calculate the UV component results before the loop body.
Unlike in the Neon version of the code we can make use of MOVPRFX and
USQADD to avoid needing to apply the bias separately from the UV
coefficient multiply additions.
Reduction in runtime observed compared to the existing Neon code:
Cortex-A510: -26.1%
Cortex-A520: -5.9%
Cortex-A715: -49.5%
Cortex-A720: -49.4%
Cortex-X2: -22.5%
Cortex-X3: -23.5%
Cortex-X4: -21.6%
Bug: libyuv:973
Change-Id: Ib9fc52bd53a1c6a1aac8bd865ab88539aca098ea
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598767
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a permute to duplicate the UV components, so we can share a
common implementation for both NV12 and NV21 by varying the inputs to
the INDEX instruction that generates the TBL indices.
Observed reductions in runtimes compared to the existing Neon code:
| NV12ToARGBRow_SVE2 | NV21ToARGBRow_SVE2
Cortex-A510 | -29.1% | -29.1%
Cortex-A720 | -4.8% | -4.8%
Cortex-X2 | -9.2% | -9.2%
Bug: libyuv:973
Change-Id: I40e20f0438cf7bad05a5ecc4db83b4a6168da958
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598766
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We cannot use the standard dot-product instructions since the matrix of
coefficients is signed, but I8MM supports mixed-sign products which
work well here.
Reduction in runtimes observed compared to the previous Neon
implementation:
Cortex-A510: -50.8%
Cortex-A520: -33.3%
Cortex-A715: -38.6%
Cortex-A720: -38.5%
Cortex-X2: -43.2%
Cortex-X3: -40.0%
Cortex-X4: -55.0%
Change-Id: Ia4fe486faf8f43d0b837ad21bb37e2159f3bdb77
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5621577
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code first widens the component vectors from 8-bit elements
to 16-bits to construct the final ARGB1555 result, however this is
unnecessary since the inputs to the widening are themselves the result
of having just been narrowed in the RGBTORGB8 macro.
By making use of the new RGBTORGB8_TOP macro we can get rid of both the
widening as well as the prior narrowing step.
Also remove volatile from the asm, it is unnecessary.
Reduction in runtime observed for I422ToARGB1555Row_NEON:
Cortex-A55: -7.8%
Cortex-A76: -15.0%
Cortex-A720: -20.3%
Cortex-X1: -20.2%
Cortex-X2: -20.3%
Bug: libyuv:976
Change-Id: Id031c5d4d788828297adcc2fe2c2cd8d99b45433
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5616050
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no corresponding declaration in a header file and it appears to
be unused, so remove it from both the Arm and AArch64 implementations.
Change-Id: I4de9fb7ce8e8dff6e76f4a99fdd93c743f92bf18
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5587507
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This commit just adds the kCpuHasSME flag to indicate that the CPU has
the Arm Scalable Matrix Extension enabled, but does not introduce any
code to actually use it yet.
Add a test to check that the HWCAP value is interpreted correctly.
Change-Id: I2de7bca26ca44ff3ee278b59108298a299a171b7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598869
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Extend both the CMake and BUILD.gn configurations to support building a
library with the Arm Scalable Matrix Extension (SME). Add an initial
(empty) rotate_sme.cc source file to populate the library for now.
Change-Id: Icd4bd6a8ce72ba132299b00c99478a18a85d869a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5588664
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The elements of the YUV constants are passed directly to the inline asm
block, so no need to pull them out into variables first.
Also remove "volatile" from inline asm blocks, it is unnecessary.
Bug: 344998222
Change-Id: I7d97dec8c7495651e5a31c10eda2d4aeed36fe6a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5598764
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This is almost identical to the existing I422ToARGBRow_SVE2 kernel; we
just need to interleave differently for the output.
The RGBA format actually saves us an instruction compared to ARGB since
there is no need to merge in the alpha component, we can just replace
the odd elements of the alpha vector itself during the narrowing.
Also rename some existing macros to make more sense when distinguishing
between ARGB and RGBA.
Reductions in runtime observed compared to the existing Neon code:
Cortex-A510: -27.0%
Cortex-A720: -5.3%
Cortex-X2: -14.7%
Bug: libyuv:973
Change-Id: I1e12ff608ee49c25b918097007e16d87b39cb067
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593797
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
These kernels are mostly identical to each other except for the order of
the results, so we can use a single macro to parameterize the pairwise
addition and use the same macro for both implementations, just with the
register order flipped.
Similar to other 2x2 kernels the implementation here differs slightly
for the last element if the problem size is odd, so use an "any" kernel
to avoid needing to handle this in the common code path.
Observed reduction in runtime compared to the existing Neon code:
| AYUVToUVRow | AYUVToVURow
Cortex-A510 | -33.1% | -33.0%
Cortex-A720 | -25.1% | -25.1%
Cortex-X2 | -59.5% | -53.9%
Cortex-X4 | -39.2% | -39.4%
Bug: libyuv:973
Change-Id: I957db9ea31c8830535c243175790db0ff2a3ccae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522316
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Avoiding LD4 and unrolling gives a good perf improvement for the little
core especially.
Observed reduction in runtime relative to the existing Neon code:
Cortex-A510: -69.7%
Cortex-A720: -7.7%
Cortex-X2: -41.9%
Cortex-X4: -14.5%
Bug: libyuv:973
Change-Id: I4b3292fa23a6e866d761dfca035538cb09eba9bc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5522315
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing Neon code narrows the input 16-bit packed data to 8-bit
elements and separates the color channels, causing us to only process
half a Neon vector per instruction for the channel widening from 4-bit
color data to 8-bits.
We can note that the processing being done is identical for all color
channels and therefore we can keep them partially interleaved during the
widening step. This allows us to use full Neon vectors for the whole
loop body.
Reductions in runtimes observed for ARGB4444ToARGBRow_NEON:
Cortex-A55: -30.7%
Cortex-A510: -44.3%
Cortex-A76: -51.6%
Cortex-X2: -54.2%
Bug: libyuv:976
Change-Id: I9d9cda7e16eb07619c6d7f1de2e6b8c0fb6d64cf
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594389
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This function takes the alpha component from the loaded data rather than
hard-coding it to 255, so initialising v3 to 255 is unnecessary here.
Change-Id: I668825e0eeb317d1365035ce3bb47f3d92081c6f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5594388
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
I210ToAR30Row on Cortex-A55: -43.8%
I210ToAR30Row on Cortex-A510: -27.0%
I210ToAR30Row on Cortex-A76: -50.4%
I410ToAR30Row on Cortex-A55: -44.3%
I410ToAR30Row on Cortex-A510: -17.5%
I410ToAR30Row on Cortex-A76: -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ib5fb9b2ce6ef06ec76ecd8473be5fe76d2622fbc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5593931
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Get rid of unused tail loops, since they are already handled by the
"any" kernels.
Also remove unnecessary "volatile" specifier from asm blocks.
Bug: libyuv:976
Change-Id: I4676fc807bcaedbb5f0f52b1bed20a172fef4ed6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553719
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210ToARGBRow | I410ToARGBRow
Cortex-A55 | -55.6% | -56.2%
Cortex-A510 | -22.6% | -35.6%
Cortex-A76 | -48.1% | -57.2%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I2ccae1388760a129c73d2e550b32bb0b5af235d6
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465594
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels, but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| I210AlphaToARGBRow | I410AlphaToARGBRow
Cortex-A55 | -55.3% | -56.1%
Cortex-A510 | -27.9% | -42.6%
Cortex-A76 | -54.9% | -60.3%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: Ieb7ad945abda72babd0cfe1020738d31e3562705
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465593
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
By keeping intermediate data as 16-bits wide we can compute twice as
much and use ST2 to store the final result. This appears to be much
better even on micro-architectures where ST2 is slightly slower than
ST1.
We save a couple of instructions by taking advantage of multiply-add
instructions to perform an effective shift-left and bitwise-or, since we
know the set of nonzero bits are disjoint after the UMIN.
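A minimal sketch of the multiply-add trick, shown here on 32-bit lanes with
hypothetical helper names (the kernel itself works on 16-bit intermediates):
#include <arm_neon.h>
// After UMIN clamps each channel to 10 bits, hi*1024 + lo equals
// (hi << 10) | lo because the non-zero bits are disjoint, so a single MLA
// replaces a shift plus an ORR.
static inline uint32x4_t PackTwoTenBitFields(uint32x4_t lo, uint32x4_t hi) {
  lo = vminq_u32(lo, vdupq_n_u32(1023));
  hi = vminq_u32(hi, vdupq_n_u32(1023));
  return vmlaq_n_u32(lo, hi, 1u << 10);  // effective shift-left and OR
}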
Reduction in runtime observed for MergeXR30Row_10_NEON:
Cortex-A55: -34.2%
Cortex-A510: -35.6%
Cortex-A76: -44.9%
Cortex-X2: -48.3%
Bug: libyuv:976
Change-Id: I6e2627f9aa8e400ea82ff381ed587fcfc0d94648
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509199
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a narrowing shift right (in RGBTORGB8)
followed by a widening left shift (in ARGBTORGB565). This is redundant
since we could have simply not performed a narrowing operation to begin
with and instead done a saturating left shift to saturate against the
top of the 16-bit lanes rather than the narrowed 8-bit lanes.
To enable this we introduce new RGBTORGB8_TOP and ARGBTORGB565_FROM_TOP
macros which produce and consume values from the high half of each
16-bit lane rather than a narrowed 8-bit intermediate.
Reduction in runtime for selected kernels:
| Cortex-A55 | Cortex-A510 | Cortex-A76 | Cortex-X2
I422ToRGB565Row_NEON | -10.8% | -6.1% | -17.2% | -23.6%
NV12ToRGB565Row_NEON | -11.4% | -4.9% | -20.4% | -17.4%
Bug: libyuv:976
Change-Id: I3337b8f41ff62a7af1b70a56b774239bdb55d0f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509197
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing implementation of this kernel uses the ARGB1555TOARGB macro
which extracts and sign-extends the alpha component into v3, however
this particular kernel does not need the alpha component. We can avoid
calculating the alpha component completely by using the existing
RGB555TOARGB macro, so use that instead.
Reduction in runtimes observed for ARGB1555ToYRow_NEON (no noticeable
improvement observed on Cortex-A510):
Cortex-A55: -3.6%
Cortex-A76: -20.9%
Cortex-X2: -15.1%
Bug: libyuv:976
Change-Id: I2cf2729c8297c53dcd32d0df28e64d4d5c7f6def
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509200
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
ST2 with 64-bit lanes has good performance on all micro-architectures of
interest and saves us 8 TRN instructions, so use that instead.
Reduction in runtimes observed compared to the existing Neon
implementation:
Cortex-A55: -8.6%
Cortex-A510: -4.9%
Cortex-A520: -6.0%
Cortex-A76: -14.4%
Cortex-A720: -5.3%
Cortex-X1: -13.6%
Cortex-X2: -5.8%
Bug: libyuv:976
Change-Id: I08bb5517bbdc54c4784fce42a885b12f91e7a982
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5581597
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We already have an "any" helper function set up for this kernel, so use
it to match the other existing architecture paths. This change also
affects the 32-bit Arm paths, which will be cleaned up in a later
commit.
With this change the kernel is now only entered with width as a multiple
of eight, so remove the now-unneeded tail loops.
Also remove volatile specifier from the asm block, it is unnecessary.
Change-Id: If37428ac2d6035a8c27eec9bd80d014a98ac3eb1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5553717
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Rather than shifting the data into the low half of each lane and then
using a saturating narrowing operation, we can do the saturation as part
of a shift into the highest half of the lane and then use a simpler TRN2
instruction to extract pairs of high halves into full vectors. This also
has the nice advantage of allowing us to use ST2 rather than ST4 for
storing the result, since ST4 is known to be slow on some
micro-architectures.
Reduction in runtimes observed for the two kernels:
| MergeARGB16To8Row_NEON | MergeXRGB16To8Row_NEON
Cortex-A55 | -8.0% | -12.2%
Cortex-A510 | -29.9% | -31.4%
Cortex-A76 | -29.0% | -32.0%
Cortex-X2 | -33.5% | -43.4%
Bug: libyuv:976
Change-Id: I9da3beedc27ab43527b3642aa6d4decf3b5b6683
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509198
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The existing Neon code only makes use of 64-bit vectors throughout which
limits the performance on larger cores. To avoid this, swap the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.
The original code also handled widths that were not exact multiples of
16, however this should already be handled by the "any" kernel so it is
removed.
Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.
Observed speedups for TransposePlane across a range of
micro-architectures:
Cortex-A53: -40.0%
Cortex-A55: -20.7%
Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
Cortex-X2: -38.3%
Cortex-X4: -43.6%
Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are existing x86 implementations for these kernels but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
| ABGRToAR30Row | ARGBToAR30Row
Cortex-A55 | -55.1% | -55.1%
Cortex-A510 | -39.3% | -40.1%
Cortex-A76 | -62.3% | -63.6%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We don't need a general-purpose permute here; REV16 does exactly what we
want and saves us needing to load the permute indices array.
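As a sketch of why REV16 is enough (illustrated on interleaved UV data as one
example of the byte-pair swap it performs):
#include <arm_neon.h>
// REV16 swaps the two bytes inside every 16-bit element: U0 V0 U1 V1 ...
// becomes V0 U0 V1 U1 ..., with no index table to load.
static inline uint8x16_t SwapBytePairs(uint8x16_t v) {
  return vrev16q_u8(v);
}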
Bug: libyuv:976
Change-Id: Ib3bc2e4d21b00d53aeda6a11c6e6f1016ca6029e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509201
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The architectural requirements between these features are fairly relaxed
and difficult to map out fully. In particular:
* FEAT_DotProd is architecturally available from Armv8.1-A and becomes
mandatory from Armv8.4-A.
* FEAT_I8MM is architecturally available from Armv8.1-A and becomes
mandatory from Armv8.6-A. It does not strictly depend on FEAT_DotProd
being implemented however I am not aware of a micro-architecture where
FEAT_I8MM is implemented without FEAT_DotProd also being implemented.
* FEAT_SVE is architecturally available from Armv8.2-A. It does not
strictly depend on either of FEAT_DotProd or FEAT_I8MM being
implemented. The only micro-architecture I am aware of where FEAT_SVE
is implemented without FEAT_DotProd and FEAT_I8MM both also being
implemented is the Fujitsu A64FX.
* FEAT_SVE2 is architecturally available from Armv9.0-A. If FEAT_SVE2 is
implemented then FEAT_SVE must also be implemented. Since Armv9.0-A is
based on Armv8.5-A this implies that FEAT_DotProd is also implemented.
Interestingly this means that FEAT_I8MM is not mandatory since it only
becomes mandatory from Armv8.6-A (Armv9.1-A), however I am not aware
of a micro-architecture where FEAT_SVE2 is implemented without all
three of the above features also being implemented.
Additionally, when testing under emulation there are sometimes bugs
where even mandatory architecture relationships are broken. For example
there is one known case where SVE2 may be reported as available even
when SVE is explicitly disabled.
To simplify these dependencies, don't try to enable later extensions
unless earlier extensions are reported implemented. This notably
penalises code if it were to run on a Fujitsu A64FX, however this is not
a likely target for libyuv deployment.
Change-Id: Ifa32f7a43043641f99afb120e591945e136c9fd1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5546385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using the platform-specific functions IsProcessorFeaturePresent and
sysctlbyname to check individual features.
Bug: libyuv:980
Change-Id: I7971238ca72e5df862c30c2e65331c46dc634074
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465591
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
By maintaining the interleaved format of the data we can use a common
kernel for all input channel orderings and simply pass a different
vector of constants instead.
A similar approach is possible with only Neon by making use of
multiplies and repeated application of ADDP to combine channels, however
this is slower on older cores like Cortex-A53 so is not pursued further.
For odd problem sizes we need a slightly different implementation for
the final element, so introduce an "any" kernel to address that rather
than bloating the code for the common case.
Observed effect on runtimes compared to the existing Neon kernels:
| Cortex-A510 | Cortex-A720 | Cortex-X2
ABGRToUVJRow | -15.5% | +5.4% | -33.1%
ABGRToUVRow | -15.6% | +5.3% | -35.9%
ARGBToUVJRow | -10.1% | +5.4% | -32.7%
ARGBToUVRow | -10.1% | +5.4% | -29.3%
BGRAToUVRow | -15.5% | +4.6% | -32.8%
RGBAToUVRow | -10.1% | +4.2% | -36.0%
Bug: libyuv:973
Change-Id: I041ca44db0ae8a2adffcdf24e822eebe962baf33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5505537
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The use of LD4 and ST4 to de-interleave ARGB color channels is
unnecessary here since we can just adjust the scale multiplicand to
match the interleaved layout. LD4 and ST4 are known to perform poorly on
some micro-architectures so using LD1 and ST1 here should be preferred.
Reduction in runtime for ARGBShadeRow_NEON:
Cortex-A55: -19.9%
Cortex-A510: -50.8%
Cortex-A76: -36.0%
Cortex-X2: -46.4%
Bug: libyuv:976
Change-Id: I10a0e6a0a62242826d39b1e963063770f084226a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494093
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We can use subs to set condition flags as part of the subtract, so there is
no need for a separate compare instruction. No performance difference observed
from this change, but it now matches the other SVE2 kernels.
Also remove unnecessary volatile from asm blocks.
Bug: libyuv:973
Change-Id: I9bb4f5f1101086602f7d5223feaeae0fb63b385c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463951
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I422ToARGBRow_SVE
implementation; we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -32.1%
Cortex-A720: -5.1%
Cortex-X2: -10.1%
Bug: libyuv:973
Change-Id: I6f800f3ef59f1dc82b409233017b3cb108da0257
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444426
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This is mostly identical to the existing I444ToARGBRow_SVE
implementation; we just need to make sure to load the alpha component
rather than hard-coding it to 255.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -34.2%
Cortex-A720: -17.6%
Cortex-X2: -9.6%
Bug: libyuv:973
Change-Id: Ief63965f6f1048ea24baf8f4037aabdd184e2925
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444425
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We need a new macro for reading I422 data, but it is otherwise mostly
identical to the existing I444ToARGBRow_SVE implementation.
Reduction in runtimes observed compared to the existing Neon code:
Cortex-A510: -25.0%
Cortex-A720: -5.0%
Cortex-X2: -10.8%
Change-Id: I27ddb604a46a53e61c9bde21f76dbc7bd91e0cef
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5444424
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
We can use the Neon dot-product instructions as a slightly faster
widening accumulation. This also has the advantage of widening to 32
bits so avoids the risk of overflow present in the original Neon code.
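A minimal sketch of the idea, assuming FEAT_DotProd (one 16-byte block; the
real kernel processes larger blocks and keeps vector accumulators):
#include <arm_neon.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
// XOR, per-byte popcount, then UDOT against a vector of ones acts as a
// widening accumulation straight into 32-bit lanes, avoiding 16-bit
// intermediates and their overflow risk.
uint32_t HammingDistance16(const uint8_t* a, const uint8_t* b) {
  uint8x16_t bits = vcntq_u8(veorq_u8(vld1q_u8(a), vld1q_u8(b)));
  uint32x4_t sum = vdotq_u32(vdupq_n_u32(0), bits, vdupq_n_u8(1));
  return vaddvq_u32(sum);
}
#endif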
Reduction in runtimes observed for HammingDistance compared to the
existing Neon code:
Cortex-A55: -4.4%
Cortex-A510: -26.5%
Cortex-A76: -8.1%
Cortex-A720: -15.5%
Cortex-X1: -4.1%
Cortex-X2: -5.1%
Bug: libyuv:977
Change-Id: I9e5e10d228c339d905cb2e668a9811ff0a6af5de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490049
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The kernel is only ever called with count as a multiple of 32 so it is
safe to unroll this and maintain two accumulators.
Reduction in runtime observed compared to the existing
SumSquareError_NEON_DotProd implementation:
Cortex-A55: -28.2%
Cortex-A510: -27.6%
Cortex-A76: -33.0%
Cortex-A720: -35.3%
Cortex-X1: -16.9%
Cortex-X2: -13.3%
Bug: libyuv:977
Change-Id: Iee423106c38e97cc38007d73fa80e8374dd96721
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5490048
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
This re-lands commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Now with additional #ifdef __linux__ guards to avoid compiling
Linux-specific code on non-Linux platforms. Non-linux feature detection
will be added in a separate patch.
Using getauxval(AT_HWCAP{,2}) has the advantage of also working under
emulation where faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at time of writing is 21 so we should
be able to use getauxval unconditionally. On the off-chance the call
fails it will return 0 and we will correctly fall-back to using only
Neon.
If we want to read the current CPU implementer or part number we could
do this by checking HWCAP_CPUID and then reading MIDR_EL1. This will
cause a kernel trap to emulate the EL1 read but should still be a lot
faster than reading the whole of /proc/cpuinfo.
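A minimal sketch of the HWCAP probe (the bit position is written out here for
illustration; the real code uses the <asm/hwcap.h> names such as
HWCAP_ASIMDDP):
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
// getauxval returns 0 if the value is unavailable, which safely reports
// every optional feature as absent and falls back to plain Neon.
static int HasDotProd(void) {
  unsigned long hwcap = getauxval(AT_HWCAP);
  return (hwcap & (1ul << 20)) != 0;  // HWCAP_ASIMDDP
}
#endif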
Bug: libyuv:980
Change-Id: I8ae103ea7e32ef44db72f3c9896417bfe97ff5c5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465590
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of a pair of shifts to put the bits we want
in the low part of each vector lane and then a pair of UQXTN and UQXTN2
instructions to perform a saturating cast down from 16-bit elements to
8-bit elements. We can instead achieve the same thing by adding eight to
the first shift amount so that the bits we want appear in the high half
of the lane, doing the saturation at the same time, and then simply use
UZP2 to pull out the high halves of each lane in a single instruction.
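A minimal sketch of the UZP2 narrowing (a fixed shift is used here for
illustration; the kernel derives the shift amount from its scale parameter):
#include <arm_neon.h>
// Saturating-shift each 16-bit value so the wanted bits land in the high
// byte of the lane, then UZP2 gathers those high bytes from two vectors in
// a single instruction, replacing UQXTN + UQXTN2.
static inline uint8x16_t NarrowHighBytes(uint16x8_t lo, uint16x8_t hi) {
  uint16x8_t lo_top = vqshlq_n_u16(lo, 8);
  uint16x8_t hi_top = vqshlq_n_u16(hi, 8);
  return vuzp2q_u8(vreinterpretq_u8_u16(lo_top), vreinterpretq_u8_u16(hi_top));
}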
Reduction in runtime for Convert16To8Row_NEON:
Cortex-A55: -19.7%
Cortex-A510: -23.5%
Cortex-A76: -35.4%
Cortex-X2: -34.1%
Bug: libyuv:976
Change-Id: I9a80c0f4f2c6b5203f23e422c0970d3167052f91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Shift instructions have worse throughput than other permute instructions
on some micro-architectures, and we can avoid the need for two separate
narrowing instructions by taking the high halves of each lane directly
through use of the UZP2 instruction.
Reduction in runtime for DivideRow_16_NEON:
Cortex-A55: -6.2%
Cortex-A510: -30.0%
Cortex-A76: -11.9%
Cortex-X2: -46.8%
Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The Neon dot-product instructions perform two widening steps rather than
one, saving us the need to widen the absolute difference to 16-bits
before accumulating. Additionally, the dot-product instructions tend to
have better performance characteristics than traditional widening
multiply instructions like SMLAL used in the existing
SumSquareError_NEON code.
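A minimal sketch, assuming FEAT_DotProd (one 16-byte block; the real kernel
is unrolled with vector accumulators):
#include <arm_neon.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__ARM_FEATURE_DOTPROD)
// The absolute difference stays as bytes; UDOT squares and accumulates it
// directly into 32-bit lanes, removing the widen-then-SMLAL step.
uint32_t SumSquareError16(const uint8_t* a, const uint8_t* b) {
  uint8x16_t d = vabdq_u8(vld1q_u8(a), vld1q_u8(b));
  return vaddvq_u32(vdotq_u32(vdupq_n_u32(0), d, d));
}
#endif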
Observed reduction in runtimes compared to the existing Neon kernel:
Cortex-A55: -9.1%
Cortex-A510: -36.7%
Cortex-A76: -37.6%
Cortex-A720: -48.8%
Cortex-X1: -56.1%
Cortex-X2: -42.6%
Bug: libyuv:977
Change-Id: Ie20c69040cc47a803d8e95620d31e0bf1e1dac12
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463945
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The MOV instruction is an alias of ORR where both registers are the
same and should be preferred.
Neither ORR nor MOV is a zero-cost instruction on all
micro-architectures, so there may be better ways to express these
kernels, but this is left for a later commit.
Bug: libyuv:975
Change-Id: I29b7f182a57a61855cb7f8a867691080f153b10b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This reverts commit ba0bba5b2b7e38c9365a5d152b4efa0458863213.
Reason for revert: breaks builds on windows and mac
Step _compile_ failed. Error logs are shown below:
[1/104] CXX obj/libyuv_internal/cpu_id.o
FAILED: obj/libyuv_internal/cpu_id.o
../../buildtools/reclient/rewrapper -cfg=../../buildtools/reclient_cfgs/chromium-browser-clang/rewra...(too long)
../../source/cpu_id.cc:25:10: fatal error: 'sys/auxv.h' file not found
25 | #include <sys/auxv.h> // For getauxval()
| ^~~~~~~~~~~~
1 error generated.
More information in raw_io.output_text[failure_summary]
Original change's description:
> [AArch64] Use getauxval(AT_HWCAP{,2}) for feature detection
>
> This has the advantage of also working under emulation where
> faking /proc/cpuinfo is not supported.
>
> For the Chromium sandbox, getauxval is supported since API version 18.
> The minimum supported API version at time of writing is 21 so we should
> be able to use getauxval unconditionally. On the off-chance the call
> fails it will return 0 and we will correctly fall-back to using only
> Neon.
>
> Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
> Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Change-Id: Ic0f764217af7b4d998f19a8f78fc04ca85a45a3b
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463918
Bot-Commit: Rubber Stamper <rubber-stamper@appspot.gserviceaccount.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The "memory" clobber needs to be present even if the asm does not store
anything to memory, since otherwise the compiler would be allowed to move
earlier stores to the pointed-to memory past the asm that reads it.
Also fix up the zero-initialisation of accumulators in
SumSquareError_NEON, since EOR'ing a register by itself is not a
recognised zeroing idiom on most AArch64 micro-architectures.
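A minimal sketch showing both points (the reduction sequence is illustrative,
not the libyuv kernel):
#include <stdint.h>
#if defined(__aarch64__)
// MOVI is a recognised zeroing idiom, unlike EOR of a register with itself,
// and the "memory" clobber stops the compiler sinking stores to *a or *b
// past the asm even though the asm writes nothing to memory.
static uint32_t AccumulateSketch(const uint8_t* a, const uint8_t* b) {
  uint32_t result;
  asm("movi   v0.16b, #0               \n"  // zero the accumulator
      "ld1    {v1.16b}, [%[a]]         \n"
      "ld1    {v2.16b}, [%[b]]         \n"
      "uabd   v1.16b, v1.16b, v2.16b   \n"
      "uadalp v0.8h, v1.16b            \n"
      "uaddlv s0, v0.8h                \n"
      "fmov   %w[result], s0           \n"
      : [result] "=r"(result)
      : [a] "r"(a), [b] "r"(b)
      : "memory", "v0", "v1", "v2");
  return result;
}
#endif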
Bug: libyuv:976
Change-Id: I3175367abf6f59db8371b4478f1156950277d7c5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5378705
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This has the advantage of also working under emulation where
faking /proc/cpuinfo is not supported.
For the Chromium sandbox, getauxval is supported since API version 18.
The minimum supported API version at the time of writing is 21, so we
should be able to use getauxval unconditionally. On the off chance the
call fails, it will return 0 and we will correctly fall back to using
only Neon.
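A rough sketch of the approach (Linux/Android AArch64; the HWCAP constant
below is the kernel's FEAT_DotProd bit, defined here as a fallback in case
<asm/hwcap.h> is old, and the exact set of flags libyuv checks may differ):
  #include <sys/auxv.h>
  #ifndef HWCAP_ASIMDDP
  #define HWCAP_ASIMDDP (1UL << 20)  // AArch64 AT_HWCAP bit for FEAT_DotProd
  #endif
  static int HasNeonDotProd(void) {
    // getauxval() returns 0 on failure, so this degrades to plain Neon.
    return (getauxval(AT_HWCAP) & HWCAP_ASIMDDP) != 0;
  }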
Change-Id: Ibbaa9caec1915ac0725c42d6cd2abc7ce19786c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5453620
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using full vectors for Add and Subtract is a win across the board. Using
full vectors for the multiply is less obviously a win, especially for
smaller cores like Cortex-A53 or Cortex-A57, so is not considered for
this change.
Observed changes in performance with this change compared to the
existing Neon code:
| ARGBAddRow_NEON | ARGBSubtractRow_NEON
Cortex-A55 | -5.1% | -5.1%
Cortex-A510 | -18.4% | -18.4%
Cortex-A76 | -28.9% | -28.7%
Cortex-A720 | -36.1% | -36.2%
Cortex-X1 | -14.2% | -14.4%
Cortex-X2 | -12.5% | -12.5%
Bug: libyuv:976
Change-Id: I85316d4399c93b53baa62d0d43b2fa453517f5b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457433
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a lot of shifts and combines the R and B
components into a single vector unnecessarily. We can express this much
more cleanly by making use of the SRI instruction to insert and replace
shifted bits into the original data, performing the 5/6-bit to 8-bit
expansion in a single instruction if the source bits are already in the
high bits of the byte. We still need a single separate XTN instruction
to narrow the B component before the left shift since Neon does not have
a narrowing left shift instruction.
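For reference, the 5-bit to 8-bit expansion looks like this in intrinsics
(illustrative helper, not the kernel itself):
  #include <arm_neon.h>
  // Input: each byte holds a 5-bit channel in its top bits (xxxxx000).
  // SRI keeps the top 5 bits and inserts the top 3 bits of the same value
  // into the low 3 bits, giving the standard xxxxxxxx replication.
  static inline uint8x16_t Expand5To8(uint8x16_t top5) {
    return vsriq_n_u8(top5, top5, 5);
  }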
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
RGB565ToYRow_NEON | -22.1% | -23.4% | -25.1%
RGB565ToUVRow_NEON | -26.8% | -20.5% | -18.8%
RGB565ToARGBRow_NEON | -38.9% | -32.0% | -23.5%
Bug: libyuv:976
Change-Id: I77b8d58287b70dbb9549451fc15ed3dd0d2a4dda
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5374286
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Most micro-architectures seem to prefer an additional ZIP1 instruction
in READYUV422 to needing a lane-indexed LD1 load instruction.
We introduce a new macro to handle the YUV to RGB conversion where the U
and V components are in separate vectors. This avoids causing a slowdown
for the UV-interleaved input format kernels (NV12 and NV21) where we do
not want to separate them.
Reduction in runtime for selected kernels on Cortex cores (no
performance difference observed on Cortex-A55):
Kernel                  | A510   | A76    | A720   | X1    | X2
I422AlphaToARGBRow_NEON | -4.3%  | -7.3%  | -10.1% | -4.0% | -4.4%
I422ToARGB1555Row_NEON  | -4.5%  | +0.4%  | -7.9%  | -4.8% | -3.9%
I422ToARGB4444Row_NEON  | -7.7%  | -2.6%  | -4.1%  | -1.9% | -1.3%
I422ToARGBRow_NEON      | -3.7%  | -2.9%  | -10.2% | -3.8% | -4.4%
I422ToRGB24Row_NEON     | -5.9%  | +5.4%  | -3.2%  | -4.3% | -4.3%
I422ToRGB565Row_NEON    | -4.8%  | -2.8%  | -8.5%  | -3.8% | -4.6%
I422ToRGBARow_NEON      | -3.7%  | +4.6%  | -10.5% | -3.0% | -4.5%
I444AlphaToARGBRow_NEON | -3.5%  | +2.7%  | -3.7%  | -5.0% | -8.2%
I444ToARGBRow_NEON      | -1.8%  | -15.1% | -3.5%  | -6.5% | -8.1%
I444ToRGB24Row_NEON     | -2.0%  | -6.8%  | +0.1%  | -4.7% | +1.2%
There are a few cases which are slower on Cortex-A76, but significant
speedups elsewhere.
Bug: libyuv:976
Change-Id: Ib3b4ef81f7bfc1d7ff9c4c24aef9ad86741410ff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465580
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing transformations can be more cleanly expressed by using SRI
instructions to perform a shift and simultaneously merge in to an
existing value.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
ARGB1555ToYRow_NEON | -26.2% | -14.9% | -28.2%
ARGB1555ToUVRow_NEON | -25.2% | -18.4% | -20.9%
ARGB1555ToARGBRow_NEON | -43.6% | -32.8% | -19.7%
Bug: libyuv:976
Change-Id: Id07ac6f2cd3eb9bb70f9e29fc1f4b29fe26156ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383444
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing sequence to convert from 8-bit ARGB to 4-bit ARGB4444 makes
use of a lot of shifts and bit-clears before ORR'ing the pairs together.
This is unnecessary since we can do the same with the SRI instruction,
so use that instead.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76
ARGBToARGB4444Row_NEON | -15.3% | -16.6%
I422ToARGB4444Row_NEON | -2.7% | -11.9%
Bug: libyuv:976
Change-Id: I86cd86c7adf1105558787a679272179821f31a9d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383443
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The values of the UV components in the vector are known and the vectors
are never overwritten, so we can hoist the UV-specific parts of the
calculation out of the loop.
Reduction in runtimes for I400ToARGBRow_NEON:
Cortex-A55: -10.0%
Cortex-A510: -3.7%
Cortex-A76: -19.3%
Cortex-X2: -14.4%
Bug: libyuv:976
Change-Id: I17d6de4e1790f71407e12ff84548568cc3ebbe1a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457434
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBMultiplyRow_NEON:
Cortex-A55: -22.3%
Cortex-A510: -56.6%
Cortex-A76: -45.5%
Cortex-X2: -54.6%
Change-Id: I9103111a109a4d87d358e06eb513746314aaf66a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454832
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBSubtractRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Ifbfce9e6a45159932c09d9b0229215a36fa22f43
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454833
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
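A sketch of the idea in intrinsics (the committed kernel is written in
inline asm; this assumes the saturating byte add it uses and a width that
is a multiple of four pixels):
  #include <arm_neon.h>
  #include <stdint.h>
  void ArgbAddRow_Sketch(const uint8_t* src0, const uint8_t* src1,
                         uint8_t* dst, int width) {
    for (int i = 0; i < width * 4; i += 16) {       // 4 ARGB pixels per step
      uint8x16_t a = vld1q_u8(src0 + i);            // plain LD1, no de-interleave
      uint8x16_t b = vld1q_u8(src1 + i);
      vst1q_u8(dst + i, vqaddq_u8(a, b));           // per-byte saturating add, ST1
    }
  }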
Reduction in runtimes observed for ARGBAddRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Id04e5259d8e5e7511dad5df85cdf9759b392cb99
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454831
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Use a pair of LD2s to load data interleaved and perform a couple of
additions on the registers in order to avoid needing LD4 and ST4
instructions, since these are costly on some micro-architectures.
Reduction in run times:
Cortex-A55: -20.5%
Cortex-A510: -28.3%
Cortex-A76: -21.5%
Bug: libyuv:976
Change-Id: If66e1e148b031c2cd288ff412f351d7a0b9b91e7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371774
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Replace indexed LD1 instructions with LDRs to avoid loop-carried
dependencies on unused lanes between consecutive iterations of the loop.
Reduction in run times:
Cortex-A55: -10.9%
Cortex-A510: -70.7%
Cortex-A76: -56.8%
Bug: libyuv:976
Change-Id: Ia767e76002c7823177e80163ebf034e023e9a6cc
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5371771
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Being able to use SVE2 functionality for these kernels has a number of
performance wins compared to the existing Neon code:
* For the Y component calculation we are able to use UMULH, versus the
existing UMULL x2 + UZP2 sequence in Neon.
* For the RGBTORGBA8 calculation we are able to take advantage of
interleaving narrowing instructions, allowing us to use ST2 rather
than ST4 for the store. This is a big performance win on some
micro-architectures where ST4 is costly.
* The use of predication means we do not need to add "any" kernels; we
can simply rerun the calculation with a not-full predicate for the
final iteration.
To avoid the overhead of generating a predicate register on every
iteration we duplicate the loop body and only generate a predicate on
the final iteration of the loop. This costs a small amount on the final
iteration but should still be significantly quicker than the overhead of
a function call needed by the "any" cases. Duplicating the loop body to
reduce the use of the WHILELT instruction improves little core
performance by ~12% by itself but has negligible impact on other
micro-architectures.
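The loop structure described above looks roughly like this in SVE ACLE
intrinsics (a shape-only sketch with a placeholder body; the real kernel
is written in assembly):
  #include <arm_sve.h>
  #include <stdint.h>
  void Kernel_SVE2_Sketch(const uint8_t* src, uint8_t* dst, int width) {
    int i = 0;
    int vl = (int)svcntb();                  // bytes per vector
    svbool_t pg = svptrue_b8();
    for (; i + vl <= width; i += vl) {       // main loop: full vectors, no WHILELT
      svuint8_t v = svld1_u8(pg, src + i);
      svst1_u8(pg, dst + i, v);              // stand-in for the real computation
    }
    if (i < width) {                         // tail: predicate instead of an "any" helper
      pg = svwhilelt_b8_s32(i, width);
      svuint8_t v = svld1_u8(pg, src + i);
      svst1_u8(pg, dst + i, v);
    }
  }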
Reduction in runtime for the new SVE2 implementation compared to the
existing Neon implementation on selected micro-architectures:
Cortex-A510: -36.5%
Cortex-A720: -17.3%
Cortex-X2: -11.3%
Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
This commit only adds the bare minimum to get the new library building
through GN; the actual content of row_sve.cc is empty for now, until we
start porting some kernels across.
Bug: libyuv:973
Change-Id: Ibdf4fc258761f3e507d700f27a405099c667ac75
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424738
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
In particular there are a few extensions that are interesting for us:
* FEAT_DotProd adds 4-way dot-product instructions which are useful in
e.g. ARGBToY.
* FEAT_I8MM adds additional mixed-sign dot-product instructions which
could be useful in e.g. ARGBToUV.
* FEAT_SVE and FEAT_SVE2 add support for the Scalable Vector Extension,
which adds an array of new instructions including new widening loads
and narrowing stores for dealing with mixed-width integer arithmetic
efficiently and predication for avoiding the need for "any" cleanup
loops.
This commit simply adds support for detecting the presence of these
features by extending the existing /proc/cpuinfo parsing, splitting it
into separate Arm and AArch64 functions for simplicity.
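A simplified sketch of the /proc/cpuinfo side of this (the real parser is
structured differently and checks more features; the feature strings are
the ones the Linux kernel reports):
  #include <stdio.h>
  #include <string.h>
  static int AArch64CpuInfoHas(const char* feature) {
    FILE* f = fopen("/proc/cpuinfo", "r");
    char line[512];
    int found = 0;
    if (!f) return 0;
    while (!found && fgets(line, sizeof(line), f)) {
      if (!strncmp(line, "Features", 8) && strstr(line, feature)) found = 1;
    }
    fclose(f);
    return found;
  }
  // e.g. AArch64CpuInfoHas(" asimddp") for FEAT_DotProd,
  //      AArch64CpuInfoHas(" sve2") for FEAT_SVE2.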
Since we have no space left in the bitset entries between Arm and X86
entries, we reuse some of the X86 entries for new AArch64 extensions.
This doesn't seem obviously problematic as long as we avoid setting
kCpuHasX86.
Bug: libyuv:973
Bug: libyuv:977
Change-Id: I8e256225fe12a4ba5da24460f54061e16eab6c57
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5378150
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The comment refers to the code needing to be re-enabled but as far as I
can tell it is already enabled, so simply remove the comment.
Change-Id: Id014e8b7f5cd43c8211e1d38758299de2fad49de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5387650
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing Neon code makes use of a pair of UQSHRN and UQSHRN2
instructions to extract the top half of a widened multiply result.
These instructions would ordinarily saturate, but saturation can never
happen in this case: since we are shifting by 16 to take the top half of
each 32-bit element, the result always fits and the top bits remain
as-is.
We could switch to a slightly simpler non-saturating shift, but in this
case it is simpler and faster to just use UZP2 to extract the top half
of each 32-bit lane directly.
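In intrinsics terms the replacement looks like this (illustrative only;
the kernel itself is assembly):
  #include <arm_neon.h>
  // Given two vectors of 32-bit products (e.g. from a UMULL/UMULL2 pair),
  // UZP2 over their 16-bit views keeps the odd (high) halfword of every
  // 32-bit lane, i.e. product >> 16, with no shift and no saturation check.
  static inline uint16x8_t HighHalves(uint32x4_t lo, uint32x4_t hi) {
    return vuzp2q_u16(vreinterpretq_u16_u32(lo), vreinterpretq_u16_u32(hi));
  }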
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
I400ToARGBRow_NEON | -9.4% | -14.9% | -13.9%
I422AlphaToARGBRow_NEON | -7.9% | -11.4% | -11.5%
I422ToARGB1555Row_NEON | -7.3% | -17.2% | -14.7%
I422ToARGB4444Row_NEON | -7.6% | -17.9% | -13.7%
I422ToARGBRow_NEON | -8.2% | -9.8% | -11.9%
I422ToRGB24Row_NEON | -8.0% | -13.3% | -12.8%
I422ToRGB565Row_NEON | -7.5% | -15.1% | -14.6%
I422ToRGBARow_NEON | -8.3% | -13.1% | -12.2%
I444AlphaToARGBRow_NEON | -8.3% | -7.6% | -12.7%
I444ToARGBRow_NEON | -8.6% | -3.5% | -13.5%
I444ToRGB24Row_NEON | -8.5% | -7.8% | -13.4%
NV12ToARGBRow_NEON | -8.8% | -1.4% | -12.0%
NV12ToRGB24Row_NEON | -8.5% | -11.5% | -12.3%
NV12ToRGB565Row_NEON | -7.9% | -15.0% | -15.7%
NV21ToARGBRow_NEON | -8.7% | -1.6% | -12.3%
NV21ToRGB24Row_NEON | -8.4% | -11.5% | -12.0%
UYVYToARGBRow_NEON | -8.8% | -8.9% | -11.9%
YUY2ToARGBRow_NEON | -8.7% | -10.8% | -13.3%
Bug: libyuv:976
Change-Id: I6c505fe722e5f91f93718b85fe881ad056d8602d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366653
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
In this case we have an LD2 instruction followed by a pair of permutes
(ZIP1 and TBL). On some micro-architectures LD2 involves use of the
vector pipelines, so in these cases it is preferable to do an LD1 and
then a different pair of permutes (TRN + TBL) instead to avoid the extra
vector pipeline usage.
Reduction in runtime on selected kernels (no observed performance delta
on Cortex-A55):
Kernel | Cortex-A76 | Cortex-X2
UYVYToARGBRow_NEON | -2.6% | -8.8%
YUY2ToARGBRow_NEON | -6.2% | -4.9%
Bug: libyuv:976
Change-Id: I7ca45e0c7bf7cb50cc5ab37c6a01215d9689039a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366652
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of a pair of lane-indexed load instructions
to fill the two halves of the input vector, however this has the effect
of introducing an unnecessary dependency on the value of the vector from
the previous loop iteration.
This doesn't really seem to affect little-core performance, since these
cores never execute enough work concurrently to hit the bottleneck, but
we can improve performance on mid and big cores quite a bit by using LDR
instead of LD1 to load the low lane, zeroing the upper portion of the
vector rather than keeping the previous value.
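A small illustration of the difference with inline asm (hypothetical
wrapper, not the real kernel):
  #include <stdint.h>
  static inline uint32_t LoadLowWord(const uint32_t* src) {
    uint32_t value;
    asm volatile(
        // "ld1  {v0.s}[0], [%[src]]  \n"  // lane insert: result depends on
        //                                 // whatever was in v0 last iteration
        "ldr  s0, [%[src]]            \n"  // whole-register write, upper bits
                                           // zeroed: no stale dependency
        "fmov %w[value], s0           \n"
        : [value] "=r"(value)
        : [src] "r"(src)
        : "v0", "memory");
    return value;
  }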
Reduction in runtime for select kernels (no observed performance delta
on Cortex-A55):
Kernel | Cortex-A76 | Cortex-X2
I422ToARGB4444Row_NEON | -23.1% | -49.3%
I422ToARGBRow_NEON | -1.2% | -2.5%
I422ToRGB24Row_NEON | -11.7% | -7.0%
I422ToRGBARow_NEON | -4.7% | -3.4%
I444AlphaToARGBRow_NEON | -1.1% | -2.4%
I444ToARGBRow_NEON | -1.6% | -3.2%
I444ToRGB24Row_NEON | -9.6% | -6.8%
Bug: libyuv:976
Change-Id: I8c9413e0e6ed97b8f060ce42b6e8abdfb77914b9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5365868
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There are a few functions in source/scale_neon64.cc which write memory
and set condition flags despite not declaring this in the asm clobber
list, so add the missing clobbers.
Also move a couple of memory/cc clobbers to the start of the clobber
list to match other kernels.
Bug: libyuv:974
Change-Id: I85f5ff5718e78a4481f7bc53cedaeceb14438895
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5309254
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
sde -spr -- libyuv_test -- --gunit_filter=*Cpu*
Note: Google Test filter = *Cpu*
[==========] Running 4 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 3 tests from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x57fff9
Has X86 0x8
Has SSE2 0x10
Has SSSE3 0x20
Has SSE41 0x40
Has SSE42 0x80
Has AVX 0x100
Has AVX2 0x200
Has ERMS 0x400
Has FMA3 0x800
Has F16C 0x1000
Has AVX512BW 0x2000
Has AVX512VL 0x4000
Has AVX512VNNI 0x8000
Has AVX512VBMI 0x10000
Has AVX512VBMI2 0x20000
Has AVX512VBITALG 0x40000
Has AVX10 0x0
HAS AVXVNNI 0x100000
Has AVXVNNIINT8 0x0
Has AMXINT8 0x400000
[ OK ] LibYUVBaseTest.TestCpuHas (34 ms)
Bug: b/324356616
Change-Id: I5129b8946363a501bdd570e6dba3936c54aacd6c
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5283433
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Recent versions of Clang always define these TARGET_ macros (to 0 or 1
as appropriate) for Apple targets.
https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5249072
made the code correctly check the *value* of the macro rather than
whether it was defined or not.
However, the code was still broken when actually targeting the iOS
simulator (where the macro is now 1).
It seems the use of this macro was just incorrect, and the code only
worked since it was never defined at all.
The original use of the macro in this file was added in
2c8108e6c2
but it's not quite clear to me why.
All other uses have subsequently been removed, e.g. in
6a1d01220a
This change removes the last instance, and should fix the iOS simulator builds.
Bug: chromium:1519899
Change-Id: Iaf44d2c37086f1153096044df5d9b61797f66a4f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5272224
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The macro may be defined to 0; the code needs to check the value, not
just whether it's defined.
Recent Clang versions will define all Apple "target OS" macros by
default (see bug).
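For illustration (a generic sketch; the exact macro libyuv tests may
differ), the distinction is:
  #if defined(__APPLE__)
  #include <TargetConditionals.h>
  #endif
  // Wrong with recent Clang: the macro is always defined, even when it is 0.
  //   #if defined(TARGET_OS_SIMULATOR)
  // Right: test the value as well.
  #if defined(TARGET_OS_SIMULATOR) && TARGET_OS_SIMULATOR
  // simulator-only path
  #endif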
Bug: chromium:1519899
Change-Id: I3d61f1b23de06d7db7db7916182a789f26345bce
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5249072
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Add detection of the Linux kernel version number in util/cpuid
adbrun -- blaze-bin/third_party/libyuv/cpuid
Kernel Version 4.14
Cpu Flags 0x7
Has ARM 0x2
Bug: libyuv:970
Change-Id: I655ed598db3655ca8448be08f1d71fbc328ced66
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5207990
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Fix the narrowing conversion error from ‘long unsigned int’ to
‘long long int’ that occurs when using the new compiler on
the LoongArch platform.
Bug: libyuv:913
Change-Id: Ic535946a2453bc48840bab05355854670c52114f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5161066
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
HAS_SCALEARGBROWDOWNEVEN_RVV wasn't defined, so ScaleARGBRowDownEven_RVV
and ScaleARGBRowDownEvenBox_RVV could not be used.
- Separate into two conditional statements when selecting DownEven or DownEvenBox.
- Also, add HAS_SCALEARGBROWDOWNEVEN_RVV and disable it by default.
Bug: libyuv:965
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Change-Id: Ic7ec40520b64131a456c6f3eea0639b3620f11ae
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4882441
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Change ScalePlane(), ScalePlane_16(), and ScalePlane_12() to return int
so that they can report memory allocation failures (by returning 1).
BUG=libyuv:968
Change-Id: Ie5c183ee42e3d595302671f9ecb7b3472dc8fdb5
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5005031
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
AVX-VNNI detect
- Add kCpuHasAVXVNNI flag
- Remove deprecated GFNI detect to make space.
https://bugs.chromium.org/p/libyuv/issues/detail?id=967
Meteor Lake has AVX-VNNI but not AVX512
~/intelsde/sde -mtl -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=*CpuHas
doyuv3
Note: Google Test filter = *CpuHas
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x203ff1
Has X86 0x10
Has SSE2 0x20
Has SSSE3 0x40
Has SSE41 0x80
Has SSE42 0x100
Has AVX 0x200
Has AVX2 0x400
Has ERMS 0x800
Has FMA3 0x1000
Has F16C 0x2000
Has AVX512BW 0x0
Has AVX512VL 0x0
Has AVX512VNNI 0x0
Has AVX512VBMI 0x0
Has AVX512VBMI2 0x0
Has AVX512VBITALG 0x0
Has AVX512VPOPCNTDQ 0x0
HAS AVXVNNI 0x200000
Has AVXVNNIINT8 0x0
Running on all CPUs, the following report AVX-VNNI:
grep 'AVXVNNI 0x2' */*
adl/libyuv64.txt:HAS AVXVNNI 0x200000
gnr/libyuv64.txt:HAS AVXVNNI 0x200000
grr/libyuv64.txt:HAS AVXVNNI 0x200000
mtl/libyuv64.txt:HAS AVXVNNI 0x200000
rpl/libyuv64.txt:HAS AVXVNNI 0x200000
spr/libyuv64.txt:HAS AVXVNNI 0x200000
srf/libyuv64.txt:HAS AVXVNNI 0x200000
while these support avx512 vnni
grep 'VNNI 0x1' */*
clx/libyuv64.txt:Has AVX512VNNI 0x10000
cpx/libyuv64.txt:Has AVX512VNNI 0x10000
gnr/libyuv64.txt:Has AVX512VNNI 0x10000
icl/libyuv64.txt:Has AVX512VNNI 0x10000
icx/libyuv64.txt:Has AVX512VNNI 0x10000
spr/libyuv64.txt:Has AVX512VNNI 0x10000
tgl/libyuv64.txt:Has AVX512VNNI 0x10000
and these support avx-vnni-int8
grep AVXVNNIINT8.0x4 */*
grr/libyuv64.txt:Has AVXVNNIINT8 0x400000
srf/libyuv64.txt:Has AVXVNNIINT8 0x400000
Bug: libyuv:967
Change-Id: I84cd71d1b320e7c284173eb695fc1d3b72d14ddb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912017
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
- Add kCpuHasAVXVNNIINT8 flag
- Move mips flags up a bit to make space.
~/intelsde/sde -srf -- blaze-bin/third_party/libyuv/libyuv_test --gunit_filter=*CpuHas
Note: Google Test filter = *CpuHas
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from LibYUVBaseTest
[ RUN ] LibYUVBaseTest.TestCpuHas
Cpu Flags 0x403ff1
Has X86 0x10
Has SSE2 0x20
Has SSSE3 0x40
Has SSE41 0x80
Has SSE42 0x100
Has AVX 0x200
Has AVX2 0x400
Has ERMS 0x800
Has FMA3 0x1000
Has F16C 0x2000
Has AVX512BW 0x0
Has AVX512VL 0x0
Has AVX512VNNI 0x0
Has AVX512VBMI 0x0
Has AVX512VBMI2 0x0
Has AVX512VBITALG 0x0
Has AVX512VPOPCNTDQ 0x0
Has AVXVNNIINT8 0x400000
Has GFNI 0x0
[ OK ] LibYUVBaseTest.TestCpuHas (32 ms)
INT8 supported on srf and grr
-srf Set chip-check and CPUID for Intel(R) Sierra Forest CPU
-grr Set chip-check and CPUID for Intel(R) Grand Ridge CPU
Bug: b/303434603
Change-Id: I628007929ff0518b2b36e1469b4d9aed71a9fa8f
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4912015
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The src_width parameter is only used for assertions and is unused with
NDEBUG. Fix the warning treated as an error when -Wall -Wextra -Werror
is used to build that part of the code.
BUG=libyuv:967
Change-Id: I4c02ab013e8e2684b3bed5ce9693e1493d7751b9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4905033
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
Add scalar code for AR64ToAB64, ARGBToRGBA, ARGBToBGRA, ARGBToABGR, RGBAToARGB, BGRAToARGB, and ABGRToARGB.
These were originally implemented via ARGBShuffle.
This CL implements them independently, and only enables them for RISC-V for now.
This CL also adds an RVV implementation for `RGBA-family <-> RGBA-family` color conversions.
* Run on SiFive internal FPGA(VLEN=128):
Test Case Speedup
AR64ToAB64_Opt x4.6
ARGBToRGBA_Opt x6
ARGBToBGRA_Opt x6
ARGBToABGR_Opt x6
RGBAToARGB_Opt x6
Change-Id: Ie0630901046084aa259699fcdeccc64170d7103f
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4797451
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Save the value of a common subexpression in a local variable.
Change-Id: I5724fcf341900cb2a65eb37b505194b8d3c3da9a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4735651
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Wan-Teh Chang <wtc@google.com>
The ScaleUVRowUp2_(Bi)linear_RVV functions are equivalent to other platforms' ScaleRowUp2_(Bi)linear_Any_XXX:
they process the entire row in one function.
Other platforms only implement the non-edge part of the image and handle the edge with scalar code.
ScaleRowUp2_(Bi)linear_Any_XXX combines ScaleRowUp2_(Bi)linear_XXX (non-edge) with ScaleRowUp2_(Bi)linear_C (edge) via SBUH2LANY/SU2BLANY.
* Run on SiFive internal FPGA:
Test case RVV function Speedup
I444ScaleFrom640x360_Bilinear ScaleRowUp2_Bilinear_RVV 8.21
I444ScaleFrom640x360_Linear ScaleRowUp2_Linear_RVV 8.08
UVScaleFrom640x360_Bilinear ScaleUVRowUp2_Bilinear_RVV 7.80
UVScaleFrom640x360_Linear ScaleUVRowUp2_Linear_RVV 7.03
Change-Id: I539245ce51858f077506a78f0e7e82377ac6a95d
Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4666062
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
Test case Speedup
ARGBBlend_Opt 4.60
BlendPlane_Opt 5.96
I420Blend_Opt 5.83
- Also, add code to use ScaleRowDown2Box_RVV in I420Blend
Change-Id: Icc75e05d26b3427a98269d2a33c4474074033264
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4681100
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Add static to internal scale and rotate functions
- Remove unittest that tested an internal scale function
- Remove unused private functions
- Include missing scale_argb.h header
- Bump version and apply clang format
Bug: libyuv:830
Change-Id: I45bab0423b86334f9707f935aedd0c6efc442dd4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4658956
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Root cause:
InterpolateRow_RVV does not set the rounding mode to round-to-nearest-up when y1_fraction == 128,
and the rounding mode register is left set to round-down by ARGBAttenuateRow_RVV.
This causes InterpolateRow_RVV (y1_fraction == 128) to run in round-down mode,
which makes its output differ from the round-to-nearest-up result.
Solved by: ensure the correct rounding mode is used in InterpolateRow_RVV.
Also, remove the unnecessary rounding mode setup in ARGBAttenuateRow_RVV.
Bug: libyuv:956
Change-Id: Ib5265d42bad76b036e42b8f91ee42a9afe1f768d
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4624492
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Bug: libyuv:956
Change-Id: Ib539c2196767e88fa6e419ed2f22d95b6deaf406
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4623172
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
1. Fix a compile warning in row_rvv.cc
2. Avoid compiling row_rvv.cc/scale_rvv.cc when using GCC
There is no RVV segment load & store support in GCC.
Hence, avoid compiling the RVV code on GCC temporarily.
3. Add several compile options to the CMake build flow
-Wno-sign-compare
-Wno-unused-function
-Wunused-variable
-Wuninitialized
Bug: libyuv:956
Change-Id: I9577f98190fc9b28fb6fde65d82d0c67ce54f9ee
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4615441
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Makes ARM and Intel match and fixes some off-by-one cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv: fix intrinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate
Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
* Run on SiFive internal FPGA:
TestARGBExtractAlpha(~3.2x vs scalar)
TestARGBCopyYToAlpha(~1.6x vs scalar)
Change-Id: I36525c67e8ac3f71ea9d1a58c7dc15a4009d9da1
Signed-off-by: Bruce Lai <bruce.lai@sifive.com>
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617955
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>