libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2025-12-06 16:56:55 +08:00

Author	SHA1	Message	Date
Frank Barchard	f145aa26da	Add SME2 detect Bug: None Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373 Reviewed-by: George Steed <george.steed@arm.com>	2025-03-27 11:08:08 -07:00
Frank Barchard	5f284054cb	RVV disable 64 bit elements and vcombine_v Bug: 405451074 Change-Id: I8e4437be92934b3c367c94d867d7967c32747260 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 12:51:25 -07:00
Frank Barchard	0c07032182	clang format applies to git repo Bug: None Change-Id: Ida65a0033e8c783230cadf6912416ffd9bbf90e1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6393515 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 11:49:25 -07:00
Frank Barchard	918329caee	Make constant 0x0101 using vpcmpeqb+vpabsb Was vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpsrlw $0xf,%%ymm4,%%ymm4 vpackuswb %%ymm4,%%ymm4,%%ymm4 Now vpcmpeqb %%ymm4,%%ymm4,%%ymm4 vpabsb %%ymm4,%%ymm4 Bug: 381138208 Change-Id: Ib70c24ac636fff95a10c7f06ed8f0a3bc7514906 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6312925 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-03-10 13:25:16 -07:00
Frank Barchard	c060118bea	ARGBToJ444 use 256 for fixed point scale UV - use negative coefficients for UV to allow -128 - change shift to truncate instead of round for UV - adapt all row_gcc RGB to UV into matrix functions - add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc Bug: 381138208 Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-27 13:04:15 -08:00
Frank Barchard	5257ba4db0	Apply clang format Bug: None Change-Id: Ibd694d0351966a2b5812445de74bbced9c881a79 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6302317 Reviewed-by: James Zern <jzern@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-25 11:39:19 -08:00
Frank Barchard	3a7e0ba671	Apply format with no code changes Bug: None Change-Id: I8923bacb9af7e7d4f13e210c8b3d7ea6b81568a5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6301086 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>	2025-02-24 23:57:01 -08:00
Frank Barchard	61354d2671	ARGBToUV Matrix for AVX2 and SSSE3 - Round before shifting to 8 bit to match NEON - RAWToARGB use unaligned loads and port to AVX2 Was C/SSSE/AVX2 ARGBToI444_Opt (343 ms) ARGBToJ444_Opt (677 ms) RAWToI444_Opt (405 ms) RAWToJ444_Opt (803 ms) Now AVX2 ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (284 ms) RAWToI444_Opt (316 ms) RAWToJ444_Opt (339 ms) Profile Now AVX2 38.31% ARGBToUVJ444Row_AVX2 32.31% RAWToARGBRow_AVX2 23.99% ARGBToYJRow_AVX2 Profile Was C/SSSE/AVX2 73.15% ARGBToUVJ444Row_C 15.74% RAWToARGBRow_SSSE3 8.87% ARGBToYJRow_AVX2 Bug: 381138208 Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-10 18:36:18 -08:00
Frank Barchard	d32d19ccf2	UV subsample on ARM use rounding average of 4 pixels Performance on Samsung S22 Exynos (SVE2+I8MM+DOTPROD+Neon) AArch64 ARGBToI400_Opt (168 ms) ARGBToJ400_Opt (103 ms) ABGRToJ400_Opt (81 ms) RGBAToJ400_Opt (82 ms) RGB24ToJ400_Opt (176 ms) RAWToJ400_Opt (176 ms) ABGRToI420_Opt (258 ms) ARGBToI420_Opt (259 ms) ARGBToI422_Opt (403 ms) ARGBToI444_Opt (213 ms) ARGBToJ420_Opt (257 ms) ARGBToJ422_Opt (403 ms) ARGBToJ444_Opt (214 ms) ABGRToJ420_Opt (255 ms) ABGRToJ422_Opt (399 ms) ARGB4444ToI420_Opt (285 ms) RGB565ToI420_Opt (316 ms) ARGB1555ToI420_Opt (324 ms) BGRAToI420_Opt (260 ms) RAWToI420_Opt (303 ms) RAWToI444_Opt (303 ms) RAWToJ420_Opt (335 ms) RAWToJ444_Opt (308 ms) RGB24ToI420_Opt (372 ms) RGB24ToJ420_Opt (365 ms) RGBAToI420_Opt (259 ms) AArch32 (Neon) ARGBToI400_Opt (496 ms) ARGBToJ400_Opt (478 ms) ABGRToJ400_Opt (483 ms) RGBAToJ400_Opt (493 ms) RGB24ToJ400_Opt (343 ms) RAWToJ400_Opt (341 ms) ABGRToI420_Opt (993 ms) ARGBToI420_Opt (992 ms) ARGBToI422_Opt (1503 ms) ARGBToI444_Opt (1257 ms) ARGBToJ420_Opt (1006 ms) ARGBToJ422_Opt (1521 ms) ARGBToJ444_Opt (1267 ms) ABGRToJ420_Opt (1002 ms) ABGRToJ422_Opt (1504 ms) ARGB4444ToI420_Opt (1180 ms) RGB565ToI420_Opt (1112 ms) ARGB1555ToI420_Opt (1115 ms) BGRAToI420_Opt (993 ms) RAWToI420_Opt (703 ms) RAWToI444_Opt (1717 ms) RAWToJ420_Opt (704 ms) RAWToJ444_Opt (1739 ms) RGB24ToI420_Opt (703 ms) RGB24ToJ420_Opt (703 ms) RGBAToI420_Opt (993 ms) Bug: 381138208 Change-Id: I33728d5237f357362b0bfc509a9ebe6fe46f45d4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6228987 Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-04 15:19:19 -08:00
Frank Barchard	5a9a6ea936	Add RAWToI444 Skylake Xeon RAWToI444_Opt (433 ms) RAWToJ444_Opt (1781 ms) ARGBToI444_Opt (352 ms) ARGBToJ444_Opt (1577 ms) Samsung S22 Exynos ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (209 ms) RAWToI444_Opt (294 ms) RAWToJ444_Opt (293 ms) Profiling on Samsung S22 Exynos 37.62%, ARGBToUV444Row_NEON_I8MM 29.42%, RAWToARGBRow_SVE2 19.61%, ARGBToYRow_NEON_DotProd Passing different --libyuv_cpu_info=N etc we can compare each ISA C 1 RAWToI444_Opt (781 ms) NEON 511 RAWToI444_Opt (757 ms) NEONDOT 1023 RAWToI444_Opt (571 ms) NEONI8MM 2047 RAWToI444_Opt (334 ms) SVE2 8191 RAWToI444_Opt (307 ms) Bug: 390247964 Change-Id: I0316fedd32222588455afa751f5b854f46bce024 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-03 16:13:03 -08:00
Frank Barchard	b3fd3f3f3b	Fix ARGBToUV444Row_NEON - constants passed in are signed and need to be negated to positive. Bug: 394127527 Change-Id: I531e475d2ddd4583922d4abef13b9282d002dd7a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6226854 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-02-03 13:33:39 -08:00
Frank Barchard	96f98f6915	ARGBToJ444 and RAWToJ444 NEON - Pass JPEG matrix to ARGBToUV444MatrixRow_NEON - Remove NEON unsigned constants in favor of DOTPROD signed constants Samsung S23: Was C for UV ARGBToJ444_Opt (320 ms) RAWToJ444_Opt (411 ms) Now I8MM ARGBToJ444_Opt (196 ms) RAWToJ444_Opt (301 ms) NEON ARGBToJ444_Opt (505 ms) RAWToJ444_Opt (596 ms) 32 bit ARM NEON ARGBToJ444_Opt (1135 ms) RAWToJ444_Opt (1546 ms) Profile of RAWToJ444 37.72% ARGBToUVJ444Row_NEON_I8MM 34.48% RAWToARGBRow_NEON 14.65% ARGBToYJRow_NEON_DotProd Bug: 390247964 Change-Id: Ia26240bee974a0baf502548f2fc896b193c3006c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6220890 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-01-31 16:46:29 -08:00
Frank Barchard	c1bac9e6a5	RAWToJ444 and ARGBToJ444 - ARGBToJ444 implements ARGBToUVJ444Row_C - RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444 libyuv_test '--gunit_filter=RTo?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 (with bit exact off) Samsung S23 RAWToJ444_Opt (437 ms) ARGBToJ444_Opt (337 ms) ARGBToI444_Opt (196 ms) Skylake Xeon RAWToJ444_Opt (1699 ms) ARGBToJ444_Opt (1559 ms) ARGBToI444_Opt (346 ms) Bug: 390247964 Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-29 15:18:38 -08:00
George Steed	c4a0c8d34a	[AArch64] Add SVE2 and SME implementations for Convert8To8Row SVE can make use of the UMULH instruction to avoid needing separate widening multiply and narrowing steps for the scale application. Reduction in runtime for Convert8To8Row_SVE2 observed compared to the existing Neon implementation: Cortex-A510: -13.2% Cortex-A520: -16.4% Cortex-A710: -37.1% Cortex-A715: -38.5% Cortex-A720: -38.4% Cortex-X2: -33.2% Cortex-X3: -31.8% Cortex-X4: -31.8% Cortex-X925: -13.9% Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-01-28 15:53:26 -08:00
Frank Barchard	6c2415bfab	J420ToI420 AVX2 libyuv_test '--gunit_filter=J420ToI420' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon AVX2 J420ToI420_Opt (114 ms) C J420ToI420_Opt (596 ms) Sapphire Rapids AVX2 J420ToI420_Opt (126 ms) C J420ToI420_Opt (717 ms) Samsung S23 NEON J420ToI420_Opt (46 ms) C J420ToI420_Opt (95 ms) Bug: 381327032 Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-27 11:23:44 -08:00
Frank Barchard	67f3f17d9a	aarch32 J420ToI420 benchmark on medium core adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Now Neon J420ToI420_Opt (159 ms) Was C J420ToI420_Opt (215 ms) AArch64 J420ToI420_Opt (93 ms) C version does this: vld1.8 {d20, d21}, [r6]! vorr q12, q8, q8 subs r4, #16 vmovl.u8 q11, d21 vmovl.u8 q10, d20 vmul.i16 q11, q9, q11 vmul.i16 q10, q9, q10 vsra.u16 q12, q11, #8 vorr q11, q8, q8 vsra.u16 q11, q10, #8 vmovn.i16 d21, q12 vmovn.i16 d20, q11 vst1.8 {d20, d21}, [r5]! bne 0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54 Explanation of above C code vorr moves 16 into register vsra does shift + accumulate to that register Compared to aarch64 instead of mull, C uses movl+mul instead of uzp2, C uses sra #8 + movn. takes 2 movn vs 1 uzp2 instead of add, C does vorr + sra Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-22 13:47:09 -08:00
Frank Barchard	26277baf96	J420ToI420 using planar 8 bit scaling - Add Convert8To8Plane, which scales and adds 8 bit values, allowing full range YUV to be converted to limited range YUV libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Samsung S23 J420ToI420_Opt (45 ms) I420ToI420_Opt (37 ms) Skylake J420ToI420_Opt (596 ms) I420ToI420_Opt (99 ms) Bug: 381327032 Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-22 02:50:24 -08:00
Frank Barchard	ef52c1658a	avx10_2 detect Run with sde only -dmr reports AVX10.2 emr:Has AVX10_2 0x0 adl:Has AVX10_2 0x0 icx:Has AVX10_2 0x0 snb:Has AVX10_2 0x0 tnt:Has AVX10_2 0x0 icl:Has AVX10_2 0x0 slm:Has AVX10_2 0x0 dmr:Has AVX10_2 0x2000000 cwf:Has AVX10_2 0x0 mrm:Has AVX10_2 0x0 skx:Has AVX10_2 0x0 wsm:Has AVX10_2 0x0 gnr:Has AVX10_2 0x0 gnr256:Has AVX10_2 0x0 bdw:Has AVX10_2 0x0 cpx:Has AVX10_2 0x0 rpl:Has AVX10_2 0x0 snr:Has AVX10_2 0x0 ptl:Has AVX10_2 0x0 slt:Has AVX10_2 0x0 ivb:Has AVX10_2 0x0 spr:Has AVX10_2 0x0 tgl:Has AVX10_2 0x0 arl:Has AVX10_2 0x0 srf:Has AVX10_2 0x0 nhm:Has AVX10_2 0x0 skl:Has AVX10_2 0x0 mtl:Has AVX10_2 0x0 pnr:Has AVX10_2 0x0 glp:Has AVX10_2 0x0 lnl:Has AVX10_2 0x0 cnl:Has AVX10_2 0x0 hsw:Has AVX10_2 0x0 clx:Has AVX10_2 0x0 glm:Has AVX10_2 0x0 sde -dmr -- libyuv_test --gunit_filter=Cpu [ RUN ] LibYUVBaseTest.TestCpuId Cpu Vendor: GenuineIntel 0x756e6547 0x49656e69 0x6c65746e Cpu Family 6 (0x6), Model 214 (0xd6) [ OK ] LibYUVBaseTest.TestCpuId (34 ms) [ RUN ] LibYUVBaseTest.TestCpuHas Kernel Version 6.10 Has X86 0x8 Has SSE2 0x100 Has SSSE3 0x200 Has SSE4.1 0x400 Has SSE4.2 0x800 Has AVX 0x1000 Has AVX2 0x2000 Has ERMS 0x4000 Has FSMR 0x8000 Has FMA3 0x10000 Has F16C 0x20000 Has AVX512BW 0x40000 Has AVX512VL 0x80000 Has AVX512VNNI 0x100000 Has AVX512VBMI 0x200000 Has AVX512VBMI2 0x400000 Has AVX512VBITALG 0x800000 Has AVX10 0x1000000 Has AVX10_2 0x2000000 HAS AVXVNNI 0x4000000 Has AVXVNNIINT8 0x8000000 Has AMXINT8 0x10000000 [ OK ] LibYUVBaseTest.TestCpuHas (10 ms) This is how oneDNN does avx10 version: `e15d2c220f/src/cpu/x64/xbyak/xbyak_util.h (L698-L701)` Bug: b/350318244 Change-Id: I6f78402fecc38a92019d137b3439d7bce950510c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6172267 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-01-21 13:53:19 -08:00
Frank Barchard	47ddac2996	Sub sampling conversions use CopyPlane for Y channel - Replace ScalePlane with CopyPlane for Y channel - Vertical mirroring is supported, but not horizontal mirroring. - Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y. - Apply clang-format - Bump version to 1899 Bug: None Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:34:11 -08:00
Frank Barchard	e0040eb318	Apply clang format Bug: None Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:31:20 -08:00
Darren Hsieh	b5a18f9d93	[RVV] Optimize ScaleARGBFilterCols with RVV * Run on SiFive internal FPGA: Test Case Speedup ARGBScaleDownBy3by8_Linear x2.05 ARGBScaleDownBy3by8_Bilinear x1.76 ARGBScaleDownBy3by8_Box x1.76 Bug: 42280924 Co-Developed-by: Bruce Lai <bruce.lai@sifive.com> Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8 Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-29 17:32:00 -08:00
George Steed	db5a71c528	[AArch64] Remove unused variables in HalfRow_{16To8,16}_SME The HalfRow kernels assume that the fraction is exactly half, so there is no need to calculate it. No-Try: True Change-Id: I2319d55ba99f202aa22c9693ec44c9891e7f72d5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087914 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org>	2024-12-13 08:00:58 -08:00
George Steed	7fd0bd197e	[AArch64] Port YUVToRGB color conversions to SME Some of the color conversion kernels already have Streaming-SVE implementations however many do not. We can re-use the existing SVE implementation by moving it to a new shared row_sve.h header and marking it with a "streaming-compatible" attribute to ensure it can be called from both streaming and non-streaming execution modes. As part of this move to a common header we also add duplicated streaming-mode implementations of the following kernels that did not previously have an SME implementation: - I210AlphaToARGBRow_SME - I210ToAR30Row_SME - I210ToARGBRow_SME - I212ToAR30Row_SME - I212ToARGBRow_SME - I400ToARGBRow_SME - I410AlphaToARGBRow_SME - I410ToAR30Row_SME - I410ToARGBRow_SME - I422AlphaToARGBRow_SME - I422ToARGB1555Row_SME - I422ToARGB4444Row_SME - I422ToRGB24Row_SME - I422ToRGB565Row_SME - I422ToRGBARow_SME - I444AlphaToARGBRow_SME - NV12ToARGBRow_SME - NV12ToRGB24Row_SME - NV21ToARGBRow_SME - NV21ToRGB24Row_SME - P210ToAR30Row_SME - P210ToARGBRow_SME - P410ToAR30Row_SME - P410ToARGBRow_SME - UYVYToARGBRow_SME - YUY2ToARGBRow_SME Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:07:54 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
George Steed	2d8652f3e7	[AArch64] Add SME implementation of CopyRow Add a streaming-SVE implementation of CopyRow using normal vector load/store instructions. Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:02:07 -08:00
George Steed	418b6df0de	[AArch64] Add SME implementation of Convert16To8Row Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel. SVE has a "widening multiply get high half" instruction in UMULH, however using the same technique as the Neon code to avoid the need for a widening multiply at all is more performant here. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 03:01:55 -08:00
runzezhang	192b8c2238	Add NV24 scaling support to libyuv Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing. Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 02:46:11 -08:00
George Steed	85331e00cc	[AArch64] Add SME impls of ScaleRowDown2{,Linear,Box}_16 Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 SME kernels, but operating on 16-bit data rather than 8-bit. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:21:08 -08:00
George Steed	15f2ae7d70	[AArch64] Add SME impls of ScaleARGBRowDown2{,Linear,Box} Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples rather than 8-bit data or 16-bit UV tuples. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:19:20 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	3e75e41e79	[AArch64] Add "limit" variable explanations in SVE *AR30 kernels As requested here: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583/1/source/row_sve.cc#1973 Change-Id: I15d8ca1f724a7123fbf52ac60b18c850e4004e64 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067153 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-11 23:50:27 -08:00
George Steed	11ef227b6d	[AArch64] Clean up formatting in row_sve.cc Force macros onto empty lines with empty comments and adjust some other comments to be consistent with the rest of the file. Change-Id: I1a35283608b868c53e91b337187ebe0e402c9834 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067152 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-11 23:48:57 -08:00
George Steed	8f659daffd	[AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row Now that we have the `_2X` versions of the macros we can use these to implement `ToRGB24` kernels. These cannot use the bottom/top approach previously used by other SVE kernels since there are three rather than two or four elements each. Reduction in runtimes observed compared to the existing Neon implementations: \| NV12ToRGB24Row \| NV21ToRGB24Row Cortex-A510 \| -60.7% \| -60.7% Cortex-A520 \| -46.0% \| -46.0% Cortex-A715 \| -25.2% \| -25.2% Cortex-A720 \| -25.2% \| -25.2% Cortex-X2 \| -28.9% \| -29.0% Cortex-X3 \| -28.2% \| -28.1% Cortex-X4 \| -30.8% \| -30.7% Cortex-X925 \| -28.8% \| -28.9% Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-04 17:51:08 +00:00
George Steed	233f859e3c	[AArch64] Remove redundant increments in ScaleRowDown2_16_NEON These were mistakenly copied from the main loop body, however this particular block of the code is only executed at most once so we do not need to perform the address updates. Also adjust formatting with clang-format to match other kernels. Change-Id: I8214821417d5e4f455ebe8805e1a37a9728ab8d2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067154 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-04 17:48:11 +00:00
George Steed	9144583f22	[AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow Mostly just a translation of the existing Neon code to SME. Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:11:19 +00:00
George Steed	88a3472f52	[AArch64] Unroll SVE2 impls of NV{12,21}ToARGBRow We can reuse most of the logic from the existing I422TORGB_SVE_2X macro and simply amend the existing READNV_SVE macro to read twice as much data. Unrolling is primarily beneficial for little cores but also provides some smaller benefits to larger cores as well. \| NV12ToARGBRow_SVE2 \| NV21ToARGBRow_SVE2 Cortex-A510 \| -48.0% \| -47.9% Cortex-A520 \| -48.1% \| -48.2% Cortex-A715 \| -20.4% \| -20.4% Cortex-A720 \| -20.6% \| -20.6% Cortex-X2 \| -7.1% \| -7.3% Cortex-X3 \| -4.0% \| -4.3% Cortex-X4 \| -14.1% \| -14.3% Cortex-X925 \| -8.2% \| -8.6% Change-Id: I195005d23e743d7d46319220ad05ee89bb7385ae Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067148 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:03:42 +00:00
George Steed	03a935493d	[AArch64] Simplify predicate width calculations Several of the existing SVE kernels used calculations of the form: remainder = width & (vl - 1) == 0 ? vl : width & (vl - 1); This is due to initial SVE contributed code unconditionally using the predicated tail for the final iteration even if the width was a perfect multiple of the vector length. In the current code the fully-predicated main body loop will instead iterate through the width completely and simply skip over the tail entirely. Skipping over the tail means that the case handled by the ternary condition now never occurs, and the remainder calculation can now simply be: remainder = width & (vl - 1); This avoids the need for a compare and conditional select in the function prologue. Change-Id: Ia73f5f8bc66fad6bea64439dc2beeaccb54622d2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067151 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 21:54:32 +00:00
George Steed	2c32b689e4	[AArch64] Improve instruction interleaving in READI212_SVE The existing instruction arrangement is sub-optimal on little cores since it has instructions with dependencies next to each other, so spread them out to improve performance. No significant change observed on bigger cores, but little cores do show some small improvements except for the Alpha kernels which regress slightly. Runtimes observed compared to the previous SVE implementation: \| Cortex-A510 \| Cortex-A520 I210AlphaToARGBRow \| (!) +7.0% \| (!) +6.8% I210ToAR30Row \| -10.3% \| -9.9% I210ToARGBRow \| -2.4% \| -2.3% I212ToAR30Row \| -10.3% \| -9.9% I212ToARGBRow \| -2.4% \| -2.3% Change-Id: I626942ce02c4610cfac1ea4f8e7890653ee4324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067150 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 21:50:47 +00:00
Hao Chen	532126bf70	Fix bugs in ARGBAttenuateRow_LASX/LSX function Fix errors in ARGBAttenuateRow_LASX and ARGBAttenuateRow_LSX functions caused by changes in calculation methods. In addition, add the option to automatically add "-mlsx" and "-mlasx" to enable SIMD optimization when compiling with cmake on LoongArch platform. Bug: libyuv:913 Change-Id: I7215f5198d3fb94f981d60969dc21a483006023e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802829 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Ben Weiss <bweiss@google.com>	2024-11-30 23:09:04 +00:00
George Steed	9a9752134e	[AArch64] Add Neon implementation of ScaleRowDown2Linear_16 Reduction in runtime observed relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -13.7% Cortex-A510: -49.0% Cortex-A520: -32.0% Cortex-A76: -34.3% Cortex-A710: -56.7% Cortex-A715: -45.4% Cortex-A720: -44.7% Cortex-X1: -70.6% Cortex-X2: -67.9% Cortex-X3: -72.2% Cortex-X4: -40.0% Cortex-X925: -24.1% Bug: b/42280942 Change-Id: I977899a2239e752400c9901f4d8482a76841269a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040154 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:26 +00:00
George Steed	11c57f4f12	[AArch64] Add Neon implementation of ScaleRowDown2_16_NEON The auto-vectorized implementation unrolls to process 32 elements per iteration, so unroll the new Neon implementation to match and avoid a performance regression on little cores. Performance relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -35.8% Cortex-A510: -20.4% Cortex-A520: -22.1% Cortex-A76: -54.8% Cortex-A710: -44.5% Cortex-A715: -31.1% Cortex-A720: -31.4% Cortex-X1: -48.5% Cortex-X2: -47.8% Cortex-X3: -47.6% Cortex-X4: -51.1% Cortex-X925: -14.6% Bug: b/42280942 Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:05 +00:00
George Steed	952d6a282f	[AArch64] Enable use of ScaleRowDown2Box_16_NEON The #ifdef surrounding the use of this kernel is never defined and ScaleRowDown2_16_NEON does not exist, so add the missing #define and remove the use of ScaleRowDown2_16_NEON for now. Additionally since there is no implementation of this kernel for 32-bit Arm, restrict the define to only be present on AArch64. Bug: b/42280942 Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-24 19:58:00 +00:00
George Steed	9ed07258c7	[AArch64] Add SVE2 implementation of I410ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -18.1% Cortex-A520: -6.0% Cortex-A715: -22.0% Cortex-A720: -21.1% Cortex-X2: -9.4% Cortex-X3: -12.0% Cortex-X4: -7.6% Cortex-X925: -5.8% Bug: b/42280942 Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:59:55 +00:00
George Steed	3dd047733e	[AArch64] Add SVE2 implementation of I410AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.2% Cortex-A520: -6.9% Cortex-A715: -14.8% Cortex-A720: -16.0% Cortex-X2: -14.8% Cortex-X3: -17.5% Cortex-X4: -12.8% Cortex-X925: -13.0% Bug: b/42280942 Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:58:11 +00:00
George Steed	e84d809348	[AArch64] Add SVE2 implementation of I410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.9% Cortex-A520: -9.2% Cortex-A715: -14.3% Cortex-A720: -14.2% Cortex-X2: -10.9% Cortex-X3: -11.1% Cortex-X4: -12.5% Cortex-X925: -10.6% Bug: b/42280942 Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:54:48 +00:00
George Steed	7c9c72ab4b	[AArch64] Add SVE2 implementation of I210ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.5% Cortex-A520: -3.8% Cortex-A715: -15.8% Cortex-A720: -15.8% Cortex-X2: -7.9% Cortex-X3: -6.5% Cortex-X4: -5.0% Cortex-X925: -5.3% Bug: b/42280942 Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-23 00:53:16 +00:00
George Steed	fc3569ad27	[AArch64] Add SVE2 implementation of I210AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -33.9% Cortex-A520: -4.2% Cortex-A715: -22.0% Cortex-A720: -22.4% Cortex-X2: -14.6% Cortex-X3: -14.5% Cortex-X4: -11.6% Cortex-X925: -12.6% Bug: b/42280942 Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:47:32 +00:00
George Steed	50108f29fb	[AArch64] Add SVE2 implementation of I212ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.4% Cortex-A520: -3.8% Cortex-A715: -15.7% Cortex-A720: -15.6% Cortex-X2: -7.9% Cortex-X3: -5.7% Cortex-X4: -5.3% Cortex-X925: -4.8% Bug: b/42280942 Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:27:57 +00:00
George Steed	305a7a4ede	[AArch64] Add SVE2 implementation of I212ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -16.1% Cortex-X2: -11.9% Cortex-X3: -11.9% Cortex-X4: -9.3% Cortex-X925: -11.2% Bug: b/42280942 Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:21:27 +00:00
Frank Barchard	595146434a	HalfFloat fix SigIll on aarch64 - Remove special case Scale of 1 which used fp16 cvt but requires cpuid - Port aarch64 to aarch32 - Use C for aarch32 with small (denormal) scale value Bug: 377693555 Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-22 22:08:00 +00:00
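
Commit 03a935493d in the log above replaces a ternary remainder calculation with a plain mask. A minimal sketch of why the two forms cover the same elements, assuming a power-of-two vector length and a positive width; the function names and the scalar loop models are hypothetical illustrations, not code from libyuv. In real C the ternary needs explicit parentheses, because `==` binds tighter than `&`.

```c
#include <assert.h>

/* Hypothetical helper: the mask trick only works when vl is a power of two. */
static int is_pow2(int v) { return v > 0 && (v & (v - 1)) == 0; }

/* Earlier form: an exact multiple of vl yields a full-vector tail. */
static int remainder_old(int width, int vl) {
  assert(is_pow2(vl));
  return (width & (vl - 1)) == 0 ? vl : (width & (vl - 1));
}

/* Simplified form: an exact multiple of vl yields 0, so the tail is skipped. */
static int remainder_new(int width, int vl) {
  assert(is_pow2(vl));
  return width & (vl - 1);
}

/* Model of the earlier loop shape: the tail always runs (width > 0 assumed). */
static int elements_old(int width, int vl) {
  int done = 0;
  while (width - done > vl) done += vl;   /* main body stops short of the end */
  return done + remainder_old(width, vl); /* unconditional predicated tail */
}

/* Model of the current loop shape: the tail runs only for a partial vector. */
static int elements_new(int width, int vl) {
  int done = 0;
  int rem = remainder_new(width, vl);
  while (width - done >= vl) done += vl;  /* main body covers all full vectors */
  if (rem != 0) done += rem;              /* tail only when something is left */
  return done;
}
```

Both models process exactly `width` elements for any positive width; the simplified form just drops the compare and select from the prologue, as the commit message describes.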


1969 Commits
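
Commits 26277baf96 and 67f3f17d9a describe the 8-bit conversion behind J420ToI420 as a multiply, a shift right by 8, and an add of a constant (16 in the quoted disassembly). A rough scalar sketch of that arithmetic follows. The function name, the scale value of 220 used in the note after the code, and the final clamp are assumptions made for illustration and are not taken from the libyuv source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: dst = ((src * scale) >> 8) + bias, per sample. */
static void convert_8_to_8_row(const uint8_t* src,
                               uint8_t* dst,
                               int scale,
                               int bias,
                               int width) {
  assert(scale >= 0 && scale <= 256); /* 8.8 fixed point scale, assumed range */
  for (int x = 0; x < width; ++x) {
    int v = ((src[x] * scale) >> 8) + bias;
    /* Defensive clamp (an assumption): keeps the result in the 8-bit range. */
    dst[x] = (uint8_t)(v > 255 ? 255 : v);
  }
}
```

With an assumed scale of 220 and a bias of 16, full range inputs 0, 128, and 255 map to 16, 126, and 235, which matches the usual limited range luma bounds of 16 to 235.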