libyuv

mirror of https://chromium.googlesource.com/libyuv/libyuv synced 2026-06-15 08:26:06 +08:00

Author	SHA1	Message	Date
George Steed	88798bcd63	[AArch64] Add SME implementation of Convert8To16Row_SME Mostly just a straightforward copy of the Neon code ported to Streaming-SVE. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ide34dbb7125b5f2a1edda6ef7111a1a49aad324f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6651565 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-23 11:32:56 -07:00
Frank Barchard	6f729fbe65	ARGBToUV SSE use average of 4 pixels - Was using avgb twice for non-exact and C for exact. On Skylake Xeon: Now SSE3 ARGBToJ420_Opt (326 ms) Was Exact C ARGBToJ420_Opt (871 ms) Not exact AVX2 ARGBToJ420_Opt (237 ms) Not exact SSSE3 ARGBToJ420_Opt (312 ms) Bug: 381138208 Change-Id: I6d1081bb52e36f06736c0c6575fa82bb2268629b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6629605 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Ben Weiss <bweiss@google.com>	2025-06-17 11:55:27 -07:00
Frank Barchard	889613683a	Add hybrid detect for Intel laptop cpus - Add +i8mm build option for sve ARGBToUV which uses usdot - util/cpuid Get cpu count (windows, macos, linux) - For each x86 cpu, detect hybrid (e-core) - Includes a comment fix for ubsan unittest - Bump version - Apply clang format to util/*.c as well as all *.cc/*.h Bug: 424637372 Change-Id: I08310e18051fff62c9e4e4a10d1e4361871119ac Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6635640 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-13 13:22:54 -07:00
George Steed	1b2f6cdbe8	[AArch64] Unroll I210ToAR30Row_{SVE2,SME} Now that we have a STOREAR30_SVE_2X implementation, we can use this to unroll other kernels. The predication on I210ToAR30Row needs adjusting to allow loading two vectors of Y compared to one vector of U/V, and additionally UZP is needed to ensure the data arrangement in vector lanes matches the U/V layout. LD2H could also be used, however this provides no performance improvement on most cores and would necessitate the addition of an "any" kernel to handle the case where width % 2 != 0. Reduction in run times of I210ToAR30Row_SVE2 observed compared to the previous SVE2 implementation: (note that even in the observed slowdowns, the SVE2 implementation still outperforms the existing Neon code) Cortex-A510: -37.1% Cortex-A520: -39.1% Cortex-A710: +1.6% (!) Cortex-A715: +6.5% (!) Cortex-A720: +6.5% (!) Cortex-X2: -2.9% Cortex-X3: -2.2% Cortex-X4: -8.8% Cortex-X925: -3.5% Change-Id: I2ff285b48105883526eceb8be1fcbe0e033a553b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6640989 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-06-12 14:10:21 -07:00
George Steed	867bdc51ed	[AArch64] Unroll I422ToAR30Row_{SVE2,SME} The existing STOREAR30_SVE macro works fine for out of order cores, however for in-order cores the number of dependent vector instructions laid out consecutively impacts performance. We can improve this by unrolling the loop to process two sets of vectors at a time, allowing little cores to process two independent streams of vector instructions at the same time to improve performance. Using one set of ZIP instructions at the end allows us to (a) avoid ST4 which we know is slow on some micro-architectures, and (b) enable the use of predication and avoid the need for separate "any" kernels. Reduction in run times of I422ToAR30Row_SVE2 observed compared to the previous SVE2 implementation: Cortex-A510: -37.7% Cortex-A520: -38.8% Cortex-A710: -14.8% Cortex-A715: -17.1% Cortex-A720: -16.9% Cortex-X2: -10.3% Cortex-X3: -6.7% Cortex-X4: -9.4% Cortex-X925: -7.1% Change-Id: I160fb41300d2d08fce2e6eb92181324fd723a02d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6632916 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2025-06-12 14:09:49 -07:00
Frank Barchard	4ac0a3ae3d	ubsan compliant '_any' functions using ptrdiff_t for pointer math Bug: 416842099 Change-Id: I1e3c7bc1b363c11baeb3b529ee78e5ac8878c359 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6634217 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-06-10 15:01:52 -07:00
George Steed	cd0ae0a222	row_sve.h: Add missing z21 clobber The z21 register is used in the I444TORGB_SVE_2X macro and other places, so add it to the clobber list macro that is used throughout this file. Change-Id: If4277c1ffcac0fa68cc44263acc6f41a9e82ec8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6619508 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-08 19:41:44 -07:00
George Steed	998bec7ca9	Sort row.h #define *_NEON lists Sort the Arm Neon and Neon DotProd #define lists to match the alphabetical ordering used for the SVE2 and SME lists. Change-Id: Ibeb380f477d5476d0018d20a754557a5f93f2190 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6613686 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-06-08 19:38:30 -07:00
George Steed	ef9833fc70	Add Neon implementation of Convert8To16Row Add a Neon implementation of the Convert8To16Row kernel. Compared to the C implementation we can take advantage of knowing that the "scale" parameter is always an unsigned power of two and fits in 16-bits, allowing us to combine this with the shift and avoid needing to widen the input data. Reduction in run times observed compared to the existing C implementation: Cortex-A55: -44.5% Cortex-A510: -26.1% Cortex-A520: -30.6% Cortex-A76: -61.6% Cortex-A710: -57.6% Cortex-X1: -46.5% Cortex-X2: -54.4% Cortex-X3: -57.1% Cortex-X4: -55.0% Cortex-X925: -49.3% Change-Id: I34b858605ece47e46588c0680a1d2afa7a90d7a0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6516186 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-29 13:37:48 -07:00
George Steed	7e5863ae5a	Add SVE2 and SME implementations of I422ToAR30Row This can make use of the existing load/convert/store macros that are already present for other kernels, so add I422ToAR30Row_SVE2 and I422ToAR30Row_SME to match the existing kernels. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -9.1% Cortex-A520: +6.8% (!) Cortex-A710: -4.0% Cortex-A715: -1.1% Cortex-A720: -1.1% Cortex-X2: -5.7% Cortex-X3: -5.9% Cortex-X4: -2.8% Cortex-X925: -4.0% Change-Id: Ibf8bfaaeaba51f426649ded621cb0c8948dd9ee1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6592332 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-27 11:39:00 -07:00
George Steed	949cb623bf	Add SVE2 and SME implementations of I444ToRGB24Row Move the READYUV444_SVE_2X and I444TORGB_SVE_2X macros to row_sve.h so they are usable in both SVE2 and SME implementations, and use them to add new I444ToRGB24Row implementations for SVE2 and SME. We need to use the unrolled versions here to use the ST3B interleaving store instructions, since there is no partial vector version of this store instruction. Reduction in time taken observed for the new SVE2 implementation, compared to the existing Neon implementation: Cortex-A510: -57.6% Cortex-A520: -38.1% Cortex-A710: -15.5% Cortex-A715: -9.2% Cortex-A720: -9.2% Cortex-X2: -25.8% Cortex-X3: -26.2% Cortex-X4: -23.2% Cortex-X925: -17.8% Change-Id: I6acd0b798a35e5352d4fad664769f12d3d938ed7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6530646 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-05-22 13:33:06 -07:00
Frank Barchard	0853c9353f	ARGBToUV 64 bit use ymm8 for shuffler Bug: 381138208 Change-Id: I5e69bc1610bd6269bf9a4113e729cf307dd36f60 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6536833 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-05-12 15:09:40 -07:00
George Steed	61bdaee13a	Add Neon I8MM implementations of ARGB to UV and variants The maximum coefficient is 128, so store constants negated to take advantage of -128 being representable in 8-bit integers. This allows us to use the I8MM USDOT instructions. Reduction in time taken observed compared to the existing Neon implementation, as a geomean of all ARGBToUV variants: Cortex-A510: -7.1% Cortex-A520: -2.1% Cortex-A710: -8.4% Cortex-A715: -0.3% Cortex-A720: -0.3% Cortex-X2: -40.0% Cortex-X3: -43.3% Cortex-X4: -11.3% Cortex-X925: -2.5% Change-Id: Id06dc17d101b66975b84b93e5abe91c0032921dd Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6535686 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-05-12 11:14:00 -07:00
Frank Barchard	9f9b5cf660	ARGBToUV allow 32 bit x86 build - make width loop count on stack - set YMM constants in its own asm block - make struct for shuffle and add constants - disable clang format on row_neon.cc function Bug: 413781394 Change-Id: I263f6862cb7589dc31ac65d118f7ebeb65dbb24a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6495259 Reviewed-by: Wan-Teh Chang <wtc@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-04-28 12:11:00 -07:00
Wan-Teh Chang	8c48036d15	Remove duplicate code in planar_functions.h The declarations of ARGBAffineRow_C and ARGBAffineRow_SSE2 and the code to support those declarations are duplicated in planar_functions.h. They are already in row.h, so we can simply remove them. Change-Id: I9b522fdd201ca530f1268bf4200cd2e18b806ba5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6434733 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Wan-Teh Chang <wtc@google.com>	2025-04-04 15:48:23 -07:00
Wan-Teh Chang	b7a857659f	Disable Arm SME and SVE assembly code under MSan The code that disables Arm and Intel assembly code under MSan is duplicated in cpu_support.h and planar_functions.h. This CL does not address the code duplication. Bug: b:407277484, b:407278016, b:407278132 Change-Id: If70fd8d3382916041d75efabcc84010ea3f1e60e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6430806 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-04-03 11:27:31 -07:00
Frank Barchard	23d416d6f3	Detect SME without SVE dependency Bug: None Change-Id: Ibe29488e893a493699ea3fae1a1a54a4fff5969c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6418571 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-31 17:27:40 -07:00
Frank Barchard	f145aa26da	Add SME2 detect Bug: None Change-Id: I36e576de1cf468049faaf3923b6c21fc9ad14271 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6401373 Reviewed-by: George Steed <george.steed@arm.com>	2025-03-27 11:08:08 -07:00
George Steed	64ac2d8f0f	Avoid odd width stores in I422ToRGB565Row_{SVE2,SME} The existing code for creating RGB565 data in SVE2 and SME produces two vectors of interleaved 16-bit elements due to the nature of how SVE widening instructions operate. This means that the indices of the 16-bit data created appear in the two result vectors as such: z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...] z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...] This is problematic for the final (predicated) iteration of the conversion since the p1 predicate input to the ST2H instruction controls storing the four bytes corresponding to the first two elements, in the first two bytes of z18 and z19. This means that in the case that the width is an odd number there is no way of storing just elem0 in z18 individually. This patch addresses this by permuting the z18/z19 data such that the two bytes from each element are split evenly across the two vectors: z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...] z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...] Since we would now always store the same lanes from both vectors we can continue to use the same predicate without further changes. The existing (non-tail) loop body utilizes an all-true predicate so we can avoid the extra permutes in this case, avoiding any performance degradation. Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-03-26 04:08:46 -07:00
Frank Barchard	5f284054cb	RVV disable 64 bit elements and vcombine_v Bug: 405451074 Change-Id: I8e4437be92934b3c367c94d867d7967c32747260 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6385788 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-03-25 12:51:25 -07:00
Frank Barchard	c060118bea	ARGBToJ444 use 256 for fixed point scale UV - use negative coefficients for UV to allow -128 - change shift to truncate instead of round for UV - adapt all row_gcc RGB to UV into matrix functions - add -DLIBYUV_ENABLE_ROWWIN to allow clang on Windows to use row_win.cc Bug: 381138208 Change-Id: I6016062c859faf147a8a2cdea6c09976cbf2963c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6277710 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: James Zern <jzern@google.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-27 13:04:15 -08:00
Frank Barchard	61354d2671	ARGBToUV Matrix for AVX2 and SSSE3 - Round before shifting to 8 bit to match NEON - RAWToARGB use unaligned loads and port to AVX2 Was C/SSSE/AVX2 ARGBToI444_Opt (343 ms) ARGBToJ444_Opt (677 ms) RAWToI444_Opt (405 ms) RAWToJ444_Opt (803 ms) Now AVX2 ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (284 ms) RAWToI444_Opt (316 ms) RAWToJ444_Opt (339 ms) Profile Now AVX2 38.31% ARGBToUVJ444Row_AVX2 32.31% RAWToARGBRow_AVX2 23.99% ARGBToYJRow_AVX2 Profile Was C/SSSE/AVX2 73.15% ARGBToUVJ444Row_C 15.74% RAWToARGBRow_SSSE3 8.87% ARGBToYJRow_AVX2 Bug: 381138208 Change-Id: I696b2d83435bc985aa38df831e01ff1a658da56e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6231592 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: richard winterton <rrwinterton@gmail.com> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-02-10 18:36:18 -08:00
Frank Barchard	d32d19ccf2	UV subsample on ARM use rounding average of 4 pixels Performance on Samsung S22 Exynos (SVE2+I8MM+DOTPROD+Neon) AArch64 ARGBToI400_Opt (168 ms) ARGBToJ400_Opt (103 ms) ABGRToJ400_Opt (81 ms) RGBAToJ400_Opt (82 ms) RGB24ToJ400_Opt (176 ms) RAWToJ400_Opt (176 ms) ABGRToI420_Opt (258 ms) ARGBToI420_Opt (259 ms) ARGBToI422_Opt (403 ms) ARGBToI444_Opt (213 ms) ARGBToJ420_Opt (257 ms) ARGBToJ422_Opt (403 ms) ARGBToJ444_Opt (214 ms) ABGRToJ420_Opt (255 ms) ABGRToJ422_Opt (399 ms) ARGB4444ToI420_Opt (285 ms) RGB565ToI420_Opt (316 ms) ARGB1555ToI420_Opt (324 ms) BGRAToI420_Opt (260 ms) RAWToI420_Opt (303 ms) RAWToI444_Opt (303 ms) RAWToJ420_Opt (335 ms) RAWToJ444_Opt (308 ms) RGB24ToI420_Opt (372 ms) RGB24ToJ420_Opt (365 ms) RGBAToI420_Opt (259 ms) AArch32 (Neon) ARGBToI400_Opt (496 ms) ARGBToJ400_Opt (478 ms) ABGRToJ400_Opt (483 ms) RGBAToJ400_Opt (493 ms) RGB24ToJ400_Opt (343 ms) RAWToJ400_Opt (341 ms) ABGRToI420_Opt (993 ms) ARGBToI420_Opt (992 ms) ARGBToI422_Opt (1503 ms) ARGBToI444_Opt (1257 ms) ARGBToJ420_Opt (1006 ms) ARGBToJ422_Opt (1521 ms) ARGBToJ444_Opt (1267 ms) ABGRToJ420_Opt (1002 ms) ABGRToJ422_Opt (1504 ms) ARGB4444ToI420_Opt (1180 ms) RGB565ToI420_Opt (1112 ms) ARGB1555ToI420_Opt (1115 ms) BGRAToI420_Opt (993 ms) RAWToI420_Opt (703 ms) RAWToI444_Opt (1717 ms) RAWToJ420_Opt (704 ms) RAWToJ444_Opt (1739 ms) RGB24ToI420_Opt (703 ms) RGB24ToJ420_Opt (703 ms) RGBAToI420_Opt (993 ms) Bug: 381138208 Change-Id: I33728d5237f357362b0bfc509a9ebe6fe46f45d4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6228987 Reviewed-by: Ben Weiss <bweiss@google.com> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-04 15:19:19 -08:00
George Steed	ccdf870348	[AArch64] Fix up inline asm name in Convert8To8Row_SVE_SC The existing implementation mistakenly refers to the parameter %2. This works fine however the parameter is already named %[width], and using the name should be preferred. Change-Id: Ifaf8fc83cdfc9b15c79d52e7e47cb72b53270a12 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6225753 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2025-02-04 10:28:17 -08:00
Frank Barchard	5a9a6ea936	Add RAWToI444 Skylake Xeon RAWToI444_Opt (433 ms) RAWToJ444_Opt (1781 ms) ARGBToI444_Opt (352 ms) ARGBToJ444_Opt (1577 ms) Samsung S22 Exynos ARGBToI444_Opt (283 ms) ARGBToJ444_Opt (209 ms) RAWToI444_Opt (294 ms) RAWToJ444_Opt (293 ms) Profiling on Samsung S22 Exynos 37.62%, ARGBToUV444Row_NEON_I8MM 29.42%, RAWToARGBRow_SVE2 19.61%, ARGBToYRow_NEON_DotProd Passing different --libyuv_cpu_info=N etc we can compare each ISA C 1 RAWToI444_Opt (781 ms) NEON 511 RAWToI444_Opt (757 ms) NEONDOT 1023 RAWToI444_Opt (571 ms) NEONI8MM 2047 RAWToI444_Opt (334 ms) SVE2 8191 RAWToI444_Opt (307 ms) Bug: 390247964 Change-Id: I0316fedd32222588455afa751f5b854f46bce024 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6223658 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-02-03 16:13:03 -08:00
Frank Barchard	b3fd3f3f3b	Fix ARGBToUV444Row_NEON - constants passed in are signed and need to be negated to positive. Bug: 394127527 Change-Id: I531e475d2ddd4583922d4abef13b9282d002dd7a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6226854 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-02-03 13:33:39 -08:00
Frank Barchard	96f98f6915	ARGBToJ444 and RAWToJ444 NEON - Pass JPEG matrix to ARGBToUV444MatrixRow_NEON - Remove NEON unsigned constants in favor of DOTPROD signed constants Samsung S23: Was C for UV ARGBToJ444_Opt (320 ms) RAWToJ444_Opt (411 ms) Now I8MM ARGBToJ444_Opt (196 ms) RAWToJ444_Opt (301 ms) NEON ARGBToJ444_Opt (505 ms) RAWToJ444_Opt (596 ms) 32 bit ARM NEON ARGBToJ444_Opt (1135 ms) RAWToJ444_Opt (1546 ms) Profile of RAWToJ444 37.72% ARGBToUVJ444Row_NEON_I8MM 34.48% RAWToARGBRow_NEON 14.65% ARGBToYJRow_NEON_DotProd Bug: 390247964 Change-Id: Ia26240bee974a0baf502548f2fc896b193c3006c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6220890 Reviewed-by: Ben Weiss <bweiss@google.com>	2025-01-31 16:46:29 -08:00
Frank Barchard	c1bac9e6a5	RAWToJ444 and ARGBToJ444 - ARGBToJ444 implements ARGBToUVJ444Row_C - RAWToJ444 implemented as 2 steps - RAWToARGB and ARGBToJ444 libyuv_test '--gunit_filter=RTo?444_Opt' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 (with bit exact off) Samsung S23 RAWToJ444_Opt (437 ms) ARGBToJ444_Opt (337 ms) ARGBToI444_Opt (196 ms) Skylake Xeon RAWToJ444_Opt (1699 ms) ARGBToJ444_Opt (1559 ms) ARGBToI444_Opt (346 ms) Bug: 390247964 Change-Id: Id1b1b45a5e4512ab50830aadf62f780fbe631575 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207845 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-29 15:18:38 -08:00
George Steed	c4a0c8d34a	[AArch64] Add SVE2 and SME implementations for Convert8To8Row SVE can make use of the UMULH instruction to avoid needing separate widening multiply and narrowing steps for the scale application. Reduction in runtime for Convert8To8Row_SVE2 observed compared to the existing Neon implementation: Cortex-A510: -13.2% Cortex-A520: -16.4% Cortex-A710: -37.1% Cortex-A715: -38.5% Cortex-A720: -38.4% Cortex-X2: -33.2% Cortex-X3: -31.8% Cortex-X4: -31.8% Cortex-X925: -13.9% Change-Id: I17c0cb81661c5fbce786b47cdf481549cfdcbfc7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6207692 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2025-01-28 15:53:26 -08:00
Frank Barchard	6c2415bfab	J420ToI420 AVX2 libyuv_test '--gunit_filter=J420ToI420' --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon AVX2 J420ToI420_Opt (114 ms) C J420ToI420_Opt (596 ms) Sapphire Rapids AVX2 J420ToI420_Opt (126 ms) C J420ToI420_Opt (717 ms) Samsung S23 NEON J420ToI420_Opt (46 ms) C J420ToI420_Opt (95 ms) Bug: 381327032 Change-Id: I2b551507c2a8b1da4f04651b622fc9247a75050d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6201239 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-27 11:23:44 -08:00
Frank Barchard	67f3f17d9a	aarch32 J420ToI420 benchmark on medium core adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Now Neon J420ToI420_Opt (159 ms) Was C J420ToI420_Opt (215 ms) AArch64 J420ToI420_Opt (93 ms) C version does this: vld1.8 {d20, d21}, [r6]! vorr q12, q8, q8 subs r4, #16 vmovl.u8 q11, d21 vmovl.u8 q10, d20 vmul.i16 q11, q9, q11 vmul.i16 q10, q9, q10 vsra.u16 q12, q11, #8 vorr q11, q8, q8 vsra.u16 q11, q10, #8 vmovn.i16 d21, q12 vmovn.i16 d20, q11 vst1.8 {d20, d21}, [r5]! bne 0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54 Explanation of above C code vorr moves 16 into register vsra does shift + accumulate to that register Compared to aarch64 instead of mull, C uses movl+mul instead of uzp2, C uses sra #8 + movn. takes 2 movn vs 1 uzp2 instead of add, C does vorr + sra Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497 Reviewed-by: Justin Green <greenjustin@google.com>	2025-01-22 13:47:09 -08:00
Frank Barchard	26277baf96	J420ToI420 using planar 8 bit scaling - Add Convert8To8Plane which scale and add 8 bit values allowing full range YUV to be converted to limited range YUV libyuv_test '--gunit_filter=J420ToI420' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Samsung S23 J420ToI420_Opt (45 ms) I420ToI420_Opt (37 ms) Skylake J420ToI420_Opt (596 ms) I420ToI420_Opt (99 ms) Bug: 381327032 Change-Id: I380c3fa783491f2e3727af28b0ea9ce16d2bb8a4 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6182631 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-22 02:50:24 -08:00
Frank Barchard	ef52c1658a	avx10_2 detect Run with sde only -dmr reports AVX10.2 emr:Has AVX10_2 0x0 adl:Has AVX10_2 0x0 icx:Has AVX10_2 0x0 snb:Has AVX10_2 0x0 tnt:Has AVX10_2 0x0 icl:Has AVX10_2 0x0 slm:Has AVX10_2 0x0 dmr:Has AVX10_2 0x2000000 cwf:Has AVX10_2 0x0 mrm:Has AVX10_2 0x0 skx:Has AVX10_2 0x0 wsm:Has AVX10_2 0x0 gnr:Has AVX10_2 0x0 gnr256:Has AVX10_2 0x0 bdw:Has AVX10_2 0x0 cpx:Has AVX10_2 0x0 rpl:Has AVX10_2 0x0 snr:Has AVX10_2 0x0 ptl:Has AVX10_2 0x0 slt:Has AVX10_2 0x0 ivb:Has AVX10_2 0x0 spr:Has AVX10_2 0x0 tgl:Has AVX10_2 0x0 arl:Has AVX10_2 0x0 srf:Has AVX10_2 0x0 nhm:Has AVX10_2 0x0 skl:Has AVX10_2 0x0 mtl:Has AVX10_2 0x0 pnr:Has AVX10_2 0x0 glp:Has AVX10_2 0x0 lnl:Has AVX10_2 0x0 cnl:Has AVX10_2 0x0 hsw:Has AVX10_2 0x0 clx:Has AVX10_2 0x0 glm:Has AVX10_2 0x0 sde -dmr -- libyuv_test --gunit_filter=Cpu [ RUN ] LibYUVBaseTest.TestCpuId Cpu Vendor: GenuineIntel 0x756e6547 0x49656e69 0x6c65746e Cpu Family 6 (0x6), Model 214 (0xd6) [ OK ] LibYUVBaseTest.TestCpuId (34 ms) [ RUN ] LibYUVBaseTest.TestCpuHas Kernel Version 6.10 Has X86 0x8 Has SSE2 0x100 Has SSSE3 0x200 Has SSE4.1 0x400 Has SSE4.2 0x800 Has AVX 0x1000 Has AVX2 0x2000 Has ERMS 0x4000 Has FSMR 0x8000 Has FMA3 0x10000 Has F16C 0x20000 Has AVX512BW 0x40000 Has AVX512VL 0x80000 Has AVX512VNNI 0x100000 Has AVX512VBMI 0x200000 Has AVX512VBMI2 0x400000 Has AVX512VBITALG 0x800000 Has AVX10 0x1000000 Has AVX10_2 0x2000000 HAS AVXVNNI 0x4000000 Has AVXVNNIINT8 0x8000000 Has AMXINT8 0x10000000 [ OK ] LibYUVBaseTest.TestCpuHas (10 ms) This is how oneDNN does avx10 version: `e15d2c220f/src/cpu/x64/xbyak/xbyak_util.h (L698-L701)` Bug: b/350318244 Change-Id: I6f78402fecc38a92019d137b3439d7bce950510c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6172267 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com>	2025-01-21 13:53:19 -08:00
Frank Barchard	47ddac2996	Sub sampling conversions use CopyPlane for Y channel - Replace ScalePlane with CopyPlane for Y channel - Vertical mirroring is supported, but not horizontal mirroring. - Check src_y is not null when dst_y is not null for all libyuv functions that allow a null dst_y. - Apply clang-format - Bump version to 1899 Bug: None Change-Id: Id1805b52b8024ba95a7f1b098dabf45af48670eb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6128599 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:34:11 -08:00
Frank Barchard	e0040eb318	Apply clang format Bug: None Change-Id: I0d9db4b384144523e61ae32b6ab3f72e93a0c265 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6138934 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2025-01-02 13:31:20 -08:00
Darren Hsieh	b5a18f9d93	[RVV] Optimize ScaleARGBFilterCols with RVV * Run on SiFive internal FPGA: Test Case Speedup ARGBScaleDownBy3by8_Linear x2.05 ARGBScaleDownBy3by8_Bilinear x1.76 ARGBScaleDownBy3by8_Box x1.76 Bug: 42280924 Co-Developed-by: Bruce Lai <bruce.lai@sifive.com> Change-Id: Ib9979b1f2ca92d2ef5aa373f9b2459c246ded6c8 Signed-off-by: Darren Hsieh <darren.hsieh@sifive.com> Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5103572 Reviewed-by: Wan-Teh Chang <wtc@google.com> Reviewed-by: Bruce Lai <bruce.lai@sifive.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-29 17:32:00 -08:00
George Steed	cce8950816	[AArch64] Remove unused SVE INDEX instrs from NV{12,21} kernels When reading subsampled UV data in NV{12,21} we previously needed to permute the data to both (a) duplicate each element into the corresponding pair of lanes for the Y elements; and (b) arrange the UV components in the correct lanes. This was done in a vector-length agnostic way by generating the permute indices dynamically at runtime through an SVE INDEX instruction. Now that we are using the READNV_SVE_2X macro everywhere these instructions are now redundant: the multiplications are done on the subsampled UV data before the duplication and the conversion macro takes arguments that adjust whether we need to operate on the even or odd lanes of the vector. Since the permute indices generated by these INDEX instructions are now unused, remove them. Change-Id: I3298a83aadfda52c4cc89bc4fd6518b06765a187 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6089957 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-26 14:47:00 -08:00
George Steed	45c7107f95	[AArch64] Fix compilation when SME is not supported The STREAMING_COMPATIBLE macro is designed to enable use of the __arm_streaming_compatible attribute with the intent that this macro expanded to empty if SME is not supported by the compiler or platform being compiled for, however in reality this macro remained undefined causing compilation to fail. Fix this by defining the macro to empty as originally intended. No-Try: True Change-Id: I8f5a8a606289b7c045fa1cce609f5a6d644891ac Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087913 Reviewed-by: Justin Green <greenjustin@google.com> Commit-Queue: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-12-13 08:16:50 -08:00
George Steed	7fd0bd197e	[AArch64] Port YUVToRGB color conversions to SME Some of the color conversion kernels already have Streaming-SVE implementations however many do not. We can re-use the existing SVE implementation by moving it to a new shared row_sve.h header and marking it with a "streaming-compatible" attribute to ensure it can be called from both streaming and non-streaming execution modes. As part of this move to a common header we also add duplicated streaming-mode implementations of the following kernels that did not previously have an SME implementation: - I210AlphaToARGBRow_SME - I210ToAR30Row_SME - I210ToARGBRow_SME - I212ToAR30Row_SME - I212ToARGBRow_SME - I400ToARGBRow_SME - I410AlphaToARGBRow_SME - I410ToAR30Row_SME - I410ToARGBRow_SME - I422AlphaToARGBRow_SME - I422ToARGB1555Row_SME - I422ToARGB4444Row_SME - I422ToRGB24Row_SME - I422ToRGB565Row_SME - I422ToRGBARow_SME - I444AlphaToARGBRow_SME - NV12ToARGBRow_SME - NV12ToRGB24Row_SME - NV21ToARGBRow_SME - NV21ToRGB24Row_SME - P210ToAR30Row_SME - P210ToARGBRow_SME - P410ToAR30Row_SME - P410ToARGBRow_SME - UYVYToARGBRow_SME - YUY2ToARGBRow_SME Change-Id: I84583478e465351cbe6fc0ec65254c3009922e84 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6087804 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:07:54 -08:00
George Steed	c2e7f8389a	[AArch64] Add SME implementations of InterpolateRow{,_16,_16To8} InterpolateRow_SME and InterpolateRow_16_SME need special cases to handle if source_y_fraction is 256 since this would overflow a byte and can just be a call to memcpy instead. InterpolateRow_16To8_SME is never called with a source_y_fraction value of 256 so there is no need for a special case here. Change-Id: I67805b5db2c411acb93ada626cf414b35620f467 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074375 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:03:41 -08:00
George Steed	2d8652f3e7	[AArch64] Add SME implementation of CopyRow Add a streaming-SVE implementation of CopyRow using normal vector load/store instructions. Change-Id: Ia551413f9740a96473fa2e8a0958953be2f4b04e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6074374 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 03:02:07 -08:00
George Steed	418b6df0de	[AArch64] Add SME implementation of Convert16To8Row Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel. SVE has a "widening multiply get high half" instruction in UMULH, however using the same technique as the Neon code to avoid the need for a widening multiply at all is more performant here. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: Ib12699c5b8b168d004ebc74c0281ea3772ca8d32 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070786 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 03:01:55 -08:00
runzezhang	192b8c2238	Add NV24 scaling support to libyuv Some projects require scaling support for the NV24 format, but libyuv currently lacks this functionality. This commit adds a scaling function for NV24, enabling its use in projects that require NV24 format processing. Change-Id: I6e6b2bea342e1df7f387056ab3bc5003da983bb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6068715 Reviewed-by: Mirko Bonadei <mbonadei@chromium.org> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 02:46:11 -08:00
George Steed	85331e00cc	[AArch64] Add SME impls of ScaleRowDown2{,Linear,Box}_16 Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 SME kernels, but operating on 16-bit data rather than 8-bit. These is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I7bad0719d24cdb1760d1039c63c0e77726b28a54 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070784 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:21:08 -08:00
George Steed	15f2ae7d70	[AArch64] Add SME impls of ScaleARGBRowDown2{,Linear,Box} Mostly just straightforward copies of the Neon code ported to Streaming-SVE, these follow the same pattern as the prior ScaleRowDown2 and ScaleUVRowDown2 SME kernels, but operating on 32-bit ARGB tuples rather than 8-bit data or 16-bit UV tuples. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I15600c2498cc592f5ea1d97b78fafec327de7947 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070783 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-12-12 01:19:20 -08:00
George Steed	7391559cb4	[AArch64] Add SME implementation of MergeUVRow{,_16} Mostly just a straightforward copy of the Neon code ported to Streaming-SVE, we can use predication to avoid needing an `Any` kernel and use ST2 to avoid needing a separate ZIP instruction. There is no benefit from this kernel when the SVE vector length is only 128 bits, so skip writing a non-streaming SVE implementation. Change-Id: I5ae36afe699b88f119dc545e49c59c5d85e98742 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6070785 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-12 01:16:19 -08:00
George Steed	8f659daffd	[AArch64] Add SVE2 implementations of NV{12,21}ToRGB24Row Now that we have the `_2X` versions of the macros we can use these to implement `ToRGB24` kernels. These cannot use the bottom/top approach previously used by other SVE kernels since there are three rather than two or four elements each. Reduction in runtimes observed compared to the existing Neon implementations: \| NV12ToRGB24Row \| NV21ToRGB24Row Cortex-A510 \| -60.7% \| -60.7% Cortex-A520 \| -46.0% \| -46.0% Cortex-A715 \| -25.2% \| -25.2% Cortex-A720 \| -25.2% \| -25.2% Cortex-X2 \| -28.9% \| -29.0% Cortex-X3 \| -28.2% \| -28.1% Cortex-X4 \| -30.8% \| -30.7% Cortex-X925 \| -28.8% \| -28.9% Change-Id: I39853d124bfdcac38584109870b398b8ecd5b632 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067149 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-04 17:51:08 +00:00
George Steed	9144583f22	[AArch64] Add SME impls of MultiplyRow_16 and ARGBMultiplyRow Mostly just a translation of the existing Neon code to SME. Change-Id: Ic3d6b8ac774c9a1bb9204ed6c78c8802668bffe9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6067147 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-12-03 22:11:19 +00:00
George Steed	9a9752134e	[AArch64] Add Neon implementation of ScaleRowDown2Linear_16 Reduction in runtime observed relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -13.7% Cortex-A510: -49.0% Cortex-A520: -32.0% Cortex-A76: -34.3% Cortex-A710: -56.7% Cortex-A715: -45.4% Cortex-A720: -44.7% Cortex-X1: -70.6% Cortex-X2: -67.9% Cortex-X3: -72.2% Cortex-X4: -40.0% Cortex-X925: -24.1% Bug: b/42280942 Change-Id: I977899a2239e752400c9901f4d8482a76841269a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040154 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:26 +00:00
George Steed	11c57f4f12	[AArch64] Add Neon implementation of ScaleRowDown2_16_NEON The auto-vectorized implementation unrolls to process 32 elements per iteration, so unroll the new Neon implementation to match and avoid a performance regression on little cores. Performance relative to the auto-vectorized C implementation compiled with LLVM 19: Cortex-A55: -35.8% Cortex-A510: -20.4% Cortex-A520: -22.1% Cortex-A76: -54.8% Cortex-A710: -44.5% Cortex-A715: -31.1% Cortex-A720: -31.4% Cortex-X1: -48.5% Cortex-X2: -47.8% Cortex-X3: -47.6% Cortex-X4: -51.1% Cortex-X925: -14.6% Bug: b/42280942 Change-Id: Ib4e89ba230d554f2717052e934ca0e8a109ccc42 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040153 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-25 21:10:05 +00:00
George Steed	952d6a282f	[AArch64] Enable use of ScaleRowDown2Box_16_NEON The #ifdef surrounding the use of this kernel is never defined and ScaleRowDown2_16_NEON does not exist, so add the missing #define and remove the use of ScaleRowDown2_16_NEON for now. Additionally since there is no implementation of this kernel for 32-bit Arm, restrict the define to only be present on AArch64. Bug: b/42280942 Change-Id: Icc35c145c1bad1c0df2933a2d8bc7dcf7fe63cb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6040152 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-24 19:58:00 +00:00
George Steed	9ed07258c7	[AArch64] Add SVE2 implementation of I410ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -18.1% Cortex-A520: -6.0% Cortex-A715: -22.0% Cortex-A720: -21.1% Cortex-X2: -9.4% Cortex-X3: -12.0% Cortex-X4: -7.6% Cortex-X925: -5.8% Bug: b/42280942 Change-Id: I853a028e08f1f1076ac20cd9c7f4f8ac8a211ac1 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023584 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:59:55 +00:00
George Steed	3dd047733e	[AArch64] Add SVE2 implementation of I410AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.2% Cortex-A520: -6.9% Cortex-A715: -14.8% Cortex-A720: -16.0% Cortex-X2: -14.8% Cortex-X3: -17.5% Cortex-X4: -12.8% Cortex-X925: -13.0% Bug: b/42280942 Change-Id: I1977fd1e1dfac25021724483fd89c6ff3e227d8b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023582 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:58:11 +00:00
George Steed	e84d809348	[AArch64] Add SVE2 implementation of I410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -37.9% Cortex-A520: -9.2% Cortex-A715: -14.3% Cortex-A720: -14.2% Cortex-X2: -10.9% Cortex-X3: -11.1% Cortex-X4: -12.5% Cortex-X925: -10.6% Bug: b/42280942 Change-Id: I6720b07c900c7dfbd849ee38e413e98b9374dac2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023581 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:54:48 +00:00
George Steed	7c9c72ab4b	[AArch64] Add SVE2 implementation of I210ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.5% Cortex-A520: -3.8% Cortex-A715: -15.8% Cortex-A720: -15.8% Cortex-X2: -7.9% Cortex-X3: -6.5% Cortex-X4: -5.0% Cortex-X925: -5.3% Bug: b/42280942 Change-Id: I5171537fd125b3214d25a0ae503a8f40dbeb6042 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023583 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-23 00:53:16 +00:00
George Steed	fc3569ad27	[AArch64] Add SVE2 implementation of I210AlphaToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -33.9% Cortex-A520: -4.2% Cortex-A715: -22.0% Cortex-A720: -22.4% Cortex-X2: -14.6% Cortex-X3: -14.5% Cortex-X4: -11.6% Cortex-X925: -12.6% Bug: b/42280942 Change-Id: Ifb4ed7a865c369d584af498cc65b84d065cfb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023580 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:47:32 +00:00
George Steed	50108f29fb	[AArch64] Add SVE2 implementation of I212ToAR30Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -15.4% Cortex-A520: -3.8% Cortex-A715: -15.7% Cortex-A720: -15.6% Cortex-X2: -7.9% Cortex-X3: -5.7% Cortex-X4: -5.3% Cortex-X925: -4.8% Bug: b/42280942 Change-Id: I99846820682687c8e0f52d05f5aa3d50369fe0a2 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025829 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:27:57 +00:00
George Steed	305a7a4ede	[AArch64] Add SVE2 implementation of I212ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -16.1% Cortex-X2: -11.9% Cortex-X3: -11.9% Cortex-X4: -9.3% Cortex-X925: -11.2% Bug: b/42280942 Change-Id: Idc30e69552f7d227217ac7011a786210b11e4752 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6025828 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-23 00:21:27 +00:00
Frank Barchard	595146434a	HalfFloat fix SigIll on aarch64 - Remove special case Scale of 1 which used fp16 cvt but requires cpuid - Port aarch64 to aarch32 - Use C for aarch32 with small (denormal) scale value Bug: 377693555 Change-Id: I38e207e79ac54907ed6e65118b8109288fddb207 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6043392 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-22 22:08:00 +00:00
Frank Barchard	307b951229	Add CopyPlane_Unaligned, _Any and _Invert tests/benchmarks CpuId test - Add AMD_ERMSB detect for ERMS on AMD Bug: 379457420 Change-Id: I608568556024faf19abe4d0662aeeee553a0a349 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6032852 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-11-19 23:53:05 +00:00
Frank Barchard	1c501a8f3f	CpuId test FSMR - Fast Short Rep Movsb - Renumber cpuid bits to use low byte to ID the type of CPU and upper 24 bits for features Intel CPUs starting at Icelake support FSMR adl:Has FSMR 0x8000 arl:Has FSMR 0x0 bdw:Has FSMR 0x0 clx:Has FSMR 0x0 cnl:Has FSMR 0x0 cpx:Has FSMR 0x0 emr:Has FSMR 0x8000 glm:Has FSMR 0x0 glp:Has FSMR 0x0 gnr:Has FSMR 0x8000 gnr256:Has FSMR 0x8000 hsw:Has FSMR 0x0 icl:Has FSMR 0x8000 icx:Has FSMR 0x8000 ivb:Has FSMR 0x0 knl:Has FSMR 0x0 knm:Has FSMR 0x0 lnl:Has FSMR 0x8000 mrm:Has FSMR 0x0 mtl:Has FSMR 0x8000 nhm:Has FSMR 0x0 pnr:Has FSMR 0x0 rpl:Has FSMR 0x8000 skl:Has FSMR 0x0 skx:Has FSMR 0x0 slm:Has FSMR 0x0 slt:Has FSMR 0x0 snb:Has FSMR 0x0 snr:Has FSMR 0x0 spr:Has FSMR 0x8000 srf:Has FSMR 0x0 tgl:Has FSMR 0x8000 tnt:Has FSMR 0x0 wsm:Has FSMR 0x0 Intel CPUs starting at Ivybridge support ERMS adl:Has ERMS 0x4000 arl:Has ERMS 0x4000 bdw:Has ERMS 0x4000 clx:Has ERMS 0x4000 cnl:Has ERMS 0x4000 cpx:Has ERMS 0x4000 emr:Has ERMS 0x4000 glm:Has ERMS 0x4000 glp:Has ERMS 0x4000 gnr:Has ERMS 0x4000 gnr256:Has ERMS 0x4000 hsw:Has ERMS 0x4000 icl:Has ERMS 0x4000 icx:Has ERMS 0x4000 ivb:Has ERMS 0x4000 knl:Has ERMS 0x4000 knm:Has ERMS 0x4000 lnl:Has ERMS 0x4000 mrm:Has ERMS 0x0 mtl:Has ERMS 0x4000 nhm:Has ERMS 0x0 pnr:Has ERMS 0x0 rpl:Has ERMS 0x4000 skl:Has ERMS 0x4000 skx:Has ERMS 0x4000 slm:Has ERMS 0x4000 slt:Has ERMS 0x0 snb:Has ERMS 0x0 snr:Has ERMS 0x4000 spr:Has ERMS 0x4000 srf:Has ERMS 0x4000 tgl:Has ERMS 0x4000 tnt:Has ERMS 0x4000 wsm:Has ERMS 0x0 Change-Id: I18e5a3905f2691ab66d4d0cb6f668c0a0ff72d37 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6027541 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-18 17:56:45 +00:00
Frank Barchard	75f7cfdde5	SplitRGB for SSE4 and AVX2 libyuv_test '--gunit_filter=SplitRGB' --libyuv_width=640 --libyuv_height=360 --libyuv_repeat=100000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Note: Google Test filter = SplitRGB Skylake Xeon x86 32 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4143 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4543 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5346 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (22965 ms) Skylake Xeon x86 64 bit AVX2 LibYUVPlanarTest.SplitRGBPlane_Opt (4470 ms) SSE4 LibYUVPlanarTest.SplitRGBPlane_Opt (4723 ms) SSSE3 LibYUVPlanarTest.SplitRGBPlane_Opt (5465 ms) C LibYUVPlanarTest.SplitRGBPlane_Opt (4707 ms) Bug: 379186682 Change-Id: Idce67a4ded836f2ee31854aa06f3903e7bcb7791 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6024314 Reviewed-by: richard winterton <rrwinterton@gmail.com>	2024-11-15 00:46:25 +00:00
George Steed	823d960afc	[AArch64] Add SVE2 implementations of {P210,P410}ToAR30Row Observed reductions in runtime compared to the existing Neon code: \| P210ToAR30Row \| P410ToAR30Row Cortex-A510 \| -16.5% \| -21.2% Cortex-A520 \| (!) +2.7% \| -8.7% Cortex-A715 \| -6.1% \| -6.1% Cortex-A720 \| -6.2% \| -5.9% Cortex-X2 \| -4.1% \| -4.2% Cortex-X3 \| -4.2% \| -4.2% Cortex-X4 \| -1.2% \| -1.2% Cortex-X925 \| -3.6% \| -2.8% Bug: b/42280942 Change-Id: I40723a370fad1ccb53f8ccd9d32cddb502500dd6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023036 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:52:21 +00:00
George Steed	0ddf3f7b90	[AArch64] Add SVE2 implementation of I210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.5% Cortex-A520: -6.5% Cortex-A715: -10.1% Cortex-A720: -13.9% Cortex-X2: -11.9% Cortex-X3: -11.6% Cortex-X4: -9.5% Cortex-X925: -11.5% Bug: b/42280942 Change-Id: Ie97dc3b5efd021ecfea14d4c477cc205191e09c3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6023037 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-14 16:36:41 +00:00
George Steed	5b906a0ec8	[AArch64] Add SVE2 implementation of P410ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -34.7% Cortex-A520: -2.4% Cortex-A715: -18.7% Cortex-A720: -18.8% Cortex-X2: -7.7% Cortex-X3: -8.9% Cortex-X4: +1.0% (!) Cortex-X925: -8.3% Bug: b/42280942 Change-Id: I90dca0573887a9a24e2172378a9e0eb6812e2131 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975321 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:34:56 +00:00
George Steed	b753822d47	[AArch64] Add SVE2 implementation of P210ToARGBRow Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -32.8% Cortex-A520: +8.7% (!) Cortex-A715: -18.9% Cortex-A720: -18.9% Cortex-X2: -7.9% Cortex-X3: -8.8% Cortex-X4: +1.0% (!) Cortex-X925: -8.6% Bug: b/42280942 Change-Id: Ibe557500c3788b4fb39372c92b2f42ba216e6fea Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975320 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-11-12 18:32:55 +00:00
George Steed	721ad4aa18	[AArch64] Add SME implementation of ScaleUVRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie15bb4e7484b61e78f405ad4e8a8a7bbb66b7edb Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979727 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:30:30 +00:00
George Steed	576218dbce	[AArch64] Add SME implementation of ScaleUVRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I401eb6ad14b3159917c8e3a79ab20dde318d28b6 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979726 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:28:57 +00:00
George Steed	551cee7845	[AArch64] Add SME implementation of ScaleUVRowDown2 There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ic4ba5f97dc57afc558c08a57e9b5009d6e487e0f Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5979725 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-12 18:24:28 +00:00
George Steed	5c12e0b2de	[AArch64] Add SVE2 implementations of HalfFloat{,1}Row For HalfFloat1Row, SVE has direct 16-bit integer to half-float conversion instructions so there is no need to widen to 32-bits. For HalfFloatRow, SVE zero-extending loads avoid the need for separate UXTL(2) instructions. Observed reductions in runtime compared to the existing Neon code: \| HalfFloat1Row \| HalfFloatRow Cortex-A510 \| -38.3% \| -17.3% Cortex-A520 \| -37.6% \| -18.8% Cortex-A720 \| -50.1% \| -7.8% Cortex-X2 \| -50.2% \| -0.4% Cortex-X4 \| -51.5% \| -12.5% Bug: b/42280942 Change-Id: I445071ccd453113144ce42d465ba03c9ee89ec9e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975319 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:53:00 +00:00
George Steed	f27b983f38	[AArch64] Add SVE2 implementation of DivideRow_16 SVE contains the UMULH instruction which allows us to multiply and take the high half of the result in a single instruction rather than needing separate widening multiply and then narrowing shift steps. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -21.2% Cortex-A520: -20.9% Cortex-A715: -47.9% Cortex-A720: -47.6% Cortex-X2: -5.2% Cortex-X3: -2.6% Cortex-X4: -32.4% Cortex-X925: -1.5% Bug: b/42280942 Change-Id: I25154699b17772db1fb5cb84c049919181d86f4b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5975318 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:46:02 +00:00
George Steed	aec4b4e22e	[AArch64] Add SME implementation of ScaleRowDown2Box There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I5021aeda30f4c5f1aa4cc6326c8d7886851d2c09 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913885 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-11-07 18:42:21 +00:00
George Steed	51d07554a0	[AArch64] Add SME implementation of ScaleRowDown2Linear There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: Ie6b91bd4407130ba2653838088e81e72e4460f68 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913884 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-30 17:57:15 +00:00
George Steed	593965cea2	[AArch64] Add SME implementation of ScaleRowDown2 Including associated changes for adding a new scale_sme.cc file. There is no benefit from an SVE version of this kernel for devices with an SVE vector length of 128-bits, so skip directly to SME instead. We do not use the ZA tile here, so this is a purely streaming-SVE (SSVE) implementation. Change-Id: I47d149613fbabd8c203605a809811f1a668e8fb7 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913883 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-30 17:56:41 +00:00
George Steed	237f39cb8c	[AArch64] Add SME implementation of I444ToARGBRow This is based on an unrolled version of the existing SVE2 code. The implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: I83d8e58aafd814125b3446fb1c9ec4a5fb56fe3e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913882 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-29 18:10:23 +00:00
George Steed	22c5c18778	[AArch64] Add SME implementation of I422ToARGBRow Including addition of a new row_sme.cc file and associated infrastructure. The actual implementation in this case is a pure streaming-SVE (SSVE) implementation based on the existing SVE2 implementation, we do not use the ZA tile. Change-Id: Ibc132c55de8d41a107e563b95f842323fef94444 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5913881 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-29 05:49:28 +00:00
George Steed	22ac86800e	[AArch64] Add SVE2 implementation of I422ToARGB4444Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -35.5% Cortex-A520: -38.2% Cortex-A715: -19.8% Cortex-A720: -19.8% Cortex-X2: -24.2% Cortex-X3: -24.1% Cortex-X4: -21.6% Cortex-X925: -19.5% Bug: b/42280942 Change-Id: I0a916600e7bdee0f5480ea843b44ab046bb3d082 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802968 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f4eaeca22a	[AArch64] Add SVE2 implementation of I422ToARGB1555Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.8% Cortex-A520: -42.6% Cortex-A715: -22.5% Cortex-A720: -22.6% Cortex-X2: -22.7% Cortex-X3: -22.4% Cortex-X4: -19.4% Cortex-X925: -27.0% Bug: b/42280942 Change-Id: I24b092bb352d9858e3d969d82b55940bb00ac7e0 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802967 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	f40042533c	[AArch64] Add SVE2 implementation of I422ToRGB565Row This makes use of the same approach as the Neon code to avoid redundant narrowing and then widening shifts by instead placing the values at the top portion of the lanes and then shifting down from there instead. Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -41.1% Cortex-A520: -38.2% Cortex-A715: -21.5% Cortex-A720: -21.6% Cortex-X2: -21.6% Cortex-X3: -22.0% Cortex-X4: -23.5% Cortex-X925: -21.7% Bug: b/42280942 Change-Id: Id84872141435566bbf94a4bbf0227554b5b5fb91 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802966 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-10-24 21:27:39 +00:00
George Steed	0dce974ca0	[AArch64] Add SVE2 implementation of I422ToRGB24Row Observed reduction in runtime compared to the existing Neon code: Cortex-A510: -57.8% Cortex-A520: -41.7% Cortex-A715: -28.0% Cortex-A720: -28.1% Cortex-X2: -29.7% Cortex-X3: -28.7% Cortex-X4: -30.5% Cortex-X925: -30.3% Bug: b/42280942 Change-Id: I328bd16babda75fb089c8da8f2714465f658187e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802965 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-10-24 02:17:32 +00:00
Frank Barchard	ffd791f749	Check malloc allocation sizes are less than SIZE_MAX Bug: b/371615496 Change-Id: I75a94b08469d6d6b6fd55a8659031cbcb3d48eed Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5912039 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-10-07 21:34:15 +00:00
George Steed	dfa279fc65	Re-enable SME when building for AArch64 Android Now that SME has been re-enabled for Linux for a while, also re-enable it for Android when building with a sufficiently new version of LLVM. Bug: b/359006069 Change-Id: Ibaa47e31826cf20136a11d551621fd62c1abab3c Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5908389 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Commit-Queue: Frank Barchard <fbarchard@chromium.org>	2024-10-04 17:43:26 +00:00
George Steed	02c6e8baca	Change ARGBMultiplyRow_C to match Neon The existing behaviour does not round correctly in all cases, so adjust it to match the existing Neon implementation. Update the tests to require bit-exactness and disable other implementations that do not round correctly. Change-Id: Ie790fb4b4805b555d74d689d83802e1dd4f33df5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5869115 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-23 21:48:33 +00:00
George Steed	a37e6bc81b	[AArch64] Re-enable SME only for Linux and new versions of Clang This was previously disabled in 679e851f653866a49e21f69fe8380bd20123f0ee, so re-enable it but only for Linux where SME is known to work correctly. Change-Id: I2626b03f3854b27162df1b55fc6767e02ffe318d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802958 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-23 09:29:53 +00:00
George Steed	8315fa1d3a	Avoid duplication of CPU feature disable macros The same conditions are repeated across all *_row.h headers which makes it harder than necessary to guard enabling new architecture features depending on compiler versions etc. Avoid this duplication by merging the conditions into a new cpu_support.h header. Change-Id: Ibe7dfcef138edca6cc36870f1cfbb1bb108083e3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5802957 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-09-23 09:28:24 +00:00
George Steed	432d186116	[AArch64] Add Neon dot-product implementation for ARGBSepiaRow We can use the dot product instructions to apply the coefficients directly without the need for LD4 de-interleaving load instructions, since these are known to be slow on some micro-architectures. ST4 is also known to be slow on more modern micro-architectures, however avoiding this is left for a future SVE implementation where we can make use of interleaving-narrowing instructions. Reduction in cycle counts observed compared to existing Neon code: Cortex-A55: -5.8% Cortex-A510: -18.9% Cortex-A76: -21.8% Cortex-A720: -30.2% Cortex-X1: -28.6% Cortex-X2: -23.4% Bug: b/42280946 Change-Id: I5887559649cc805a810d867b652c85d48285657d Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5790970 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:31:35 +00:00
George Steed	1c31461771	[AArch64] Add Neon dot-product implementation for ARGBGrayRow We can use dot product instructions to apply the coefficients without needing to use LD4 deinterleaving load instructions, and then TBL to mix in the original alpha component. This is significantly faster on some micro-architectures where LD4 instructions are known to be slow compared to normal loads. Reduction in cycle counts observed compared to existing Neon code: Cortex-A55: -12.6% Cortex-A510: -48.6% Cortex-A76: -39.7% Cortex-A720: -52.3% Cortex-X1: -63.5% Cortex-X2: -67.0% Bug: b/42280946 Change-Id: I3641785e74873438acc00d675f5bc490dfa95b50 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5785972 Reviewed-by: Justin Green <greenjustin@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-09-16 04:31:11 +00:00
Frank Barchard	4620f17058	ScalePlane crash fix for 3/4 scaling - Scaling 48 pixels at a time, but calling code checked for 24 pixels - Added test for scaling to 1080x1920 libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560 Was libyuv_test --gunit_filter=LibYUVScaleTest.I420ScaleTo1080x1920_Box* --libyuv_width=1440 --libyuv_height=2560 [ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box Segmentation fault Traceback (most recent call last): Now [ RUN ] LibYUVScaleTest.I420ScaleTo1080x1920_Box filter 3 - 6741 us C - 3566 us OPT [ OK ] LibYUVScaleTest.I420ScaleTo1080x1920_Box (43 ms) Bug: b/366045177 Change-Id: I0ea6c2d6a32b2e7ca44cd030abc9f248115be44a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5857554 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-09-13 01:20:39 +00:00
Frank Barchard	679e851f65	Convert16To8Row_AVX512BW using vpmovuswb - avx2 pack/perm is mutating order - cvt method maintains channel order on avx512 Sapphire Rapids Benchmark of 640x360 on Sapphire Rapids AVX512BW [ OK ] LibYUVConvertTest.I010ToNV12_Opt (3547 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3186 ms) AVX2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (4000 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (3190 ms) SSE2 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (5433 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (4840 ms) Skylake Xeon Now vpmovuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7946 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7071 ms) Was vpackuswb [ OK ] LibYUVConvertTest.I010ToNV12_Opt (7684 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (7059 ms) Switch from vpunpcklwd to vpbroadcastw for scale value parameter Was vpunpcklwd %%xmm2,%%xmm2,%%xmm2 vbroadcastss %%xmm2,%%ymm2 Now vpbroadcastw %%xmm2,%%ymm2 Bug: 357439226, 357721018 Change-Id: Ifc9c82ab70dba58af6efa0f57f5f7a344014652e Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5787040 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-15 20:13:33 +00:00
Wan-Teh Chang	0c2cf03c5c	Fix a -Wundef warning on macOS with Apple silicon Change-Id: Ia78dcc913e06dd8876119a96bd7760c1d2af4341 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5788821 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-14 22:10:43 +00:00
Wan-Teh Chang	02e2ff4745	Note stride params of HalfFloatPlane are in bytes The HalfFloatPlane() function does not follow libyuv's convention of buffer stride in units of the corresponding buffer pointer. Document that. Change-Id: Id8d466ccc2df263a49ad788ab349bc3993a48259 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5770639 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-12 20:17:23 +00:00
Wan-Teh Chang	3cf54e90d3	Fix -Wmissing-prototypes warnings Declare functions as static. Declare functions in a header. Include the header that declares the functions. Delete undeclared and unused functions ScaleFilterRows_NEON() and ScaleRowUp2_16_NEON(). Delete unused function ScaleY() in psnr_main.cc. Change-Id: I182ec30611df83c61ffd01bbab595cd61fb5f1e5 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778601 Commit-Queue: Wan-Teh Chang <wtc@google.com> Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-12 19:08:24 +00:00
Frank Barchard	a97746349b	Add test for I010ToNV12 - Add support for negative height to invert - Fix off by 1 on odd width and height - Bump version to 1895 Initial I010 is 2 step planar conversion libyuv_test '--gunit_filter=*010ToNV12_Opt' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1 Skylake Xeon [ OK ] LibYUVConvertTest.I010ToNV12_Opt (2675 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (1547 ms) Pixel 7 [ OK ] LibYUVConvertTest.I010ToNV12_Opt (464 ms) [ OK ] LibYUVConvertTest.P010ToNV12_Opt (125 ms) Bug: b/357721018, b/357439226 Change-Id: I2ae59783cf328a6592d0ab80c374ae4dc281daf3 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5778595 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-12 18:57:56 +00:00
Chunbo Hua	e23bc72e8e	Bump version number in order to expose new API Bug: 357721018 Change-Id: I2c6e115cd049db2038631195305c5907764d5c7b Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5768078 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-07 22:10:05 +00:00
Chunbo Hua	fc94178260	Implement I010ToNV12 conversion I010, also known as YUV420P10, is 10 bit YUV pixel format with 3 planes. Both I010 and NV12 are 4:2:0 subsampling. NV12 has a Y plane, and an interleaved UV plane. Bug: 357721018 Change-Id: If215529b9eda8e0fb32aed666ca179c90244aaff Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5764823 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-08-06 17:36:13 +00:00
Frank Barchard	32ccd53bb3	Add P010ToNV12 to convert 10 bit biplanar to 8 bit biplanar - P010 and NV12 have the same layout: Full size Y plane and half size UV plane. P010 and NV12 are 4:2:0 subsampling - P010 uses upper 10 bits of 16 bit elements - NV12 uses 8 bit elements - The Convert16To8 used internally will discard the low 2 bits. - UV order is the same - U first in memory, followed by V, interleaved - UV plane is rounded up in size to allow odd size Y to have UV values - Similar code could be used to convert P210ToNV16, P410ToNV24, with the size of the UV plane affected by subsampling 4:2:2 and 4:4:4 variants. Bug: b/357439226 Change-Id: I5d6ec84d97d0e0cc4008eeb18a929ea28570d6d9 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5761958 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-08-05 18:55:44 +00:00
Frank Barchard	4cd90347e7	Rotate use NULL for C compatibility Bug: b/353323977 Change-Id: I2472f23ce8fcc0bc09a292bd6fb758304c6c2b18 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5735714 Reviewed-by: Wan-Teh Chang <wtc@google.com>	2024-07-23 18:02:47 +00:00
George Steed	b5f9d7cb76	[AArch64] Add SME implementation of TransposeUVWxH We can make use of the ZA tile register to do the transpose and de-interleaving of UV components without any explicit permute instructions: the tile is loaded horizontally placing UV components into alternative columns, then we can just store the independent components vertically. Change-Id: I67bd82dc840a43888290be1c9db8a3c05f16d730 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703588 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 12:15:40 +00:00
George Steed	15ecca81f7	[AArch64] Add SME implementation of TransposeWxH We can make use of the ZA tile register to do the transpose without any explicit permute instructions: just load the tile horizontally and store it vertically. Change-Id: I1c31e89af52a408e3491e62d6c9e6fee41b1b80a Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5703587 Reviewed-by: Frank Barchard <fbarchard@chromium.org>	2024-07-19 12:14:39 +00:00
George Steed	a4ccf9940e	[AArch64] Add I8MM implementation of ARGBToUV444Row We cannot use the standard dot-product instructions since the coefficients multiplication results are both added and subtracted, but I8MM supports mixed-sign dot products which work well here. We need to add an additional variant of the coefficient structs since we need negative constants for the elements that were previously subtracted. Reduction in runtimes observed compared to the previous Neon implementation: Cortex-A510: -37.3% Cortex-A520: -31.1% Cortex-A715: -37.1% Cortex-A720: -37.0% Cortex-X2: -62.1% Cortex-X3: -62.2% Cortex-X4: -40.4% Bug: libyuv:977 Change-Id: Idc3d9a6408c30e1bce3816a1ed926ecd76792236 Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5712928 Reviewed-by: Frank Barchard <fbarchard@chromium.org> Reviewed-by: Justin Green <greenjustin@google.com>	2024-07-16 17:32:52 +00:00
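
The P010ToNV12 entry (32ccd53bb3) describes the per-sample arithmetic in words: P010 keeps each 10-bit sample in the upper 10 bits of a 16-bit element, the 8-bit result discards the low 2 bits, and the UV plane is rounded up for odd sizes. A minimal scalar sketch of that arithmetic follows. The function names are hypothetical and this is not libyuv's actual code, which uses Convert16To8 and SIMD row kernels.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a libyuv function.
   P010 holds a 10-bit sample in the upper 10 bits of a 16-bit element,
   so keeping the top 8 bits drops the 2 least significant sample bits. */
static uint8_t p010_sample_to_8bit(uint16_t sample) {
  return (uint8_t)(sample >> 8);
}

/* Hypothetical helper: chroma width or height for 4:2:0 subsampling,
   rounded up so an odd luma size still has UV values for its last
   column or row. */
static int chroma_dim_420(int luma_dim) {
  return (luma_dim + 1) / 2;
}

/* Hypothetical helper: convert a run of 16-bit P010 elements to 8 bits.
   The same loop serves a Y row (count = width) and an interleaved UV row
   (count = 2 * chroma width), since the U, V order is unchanged. */
static void p010_row_to_8bit(const uint16_t* src, uint8_t* dst, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    dst[i] = p010_sample_to_8bit(src[i]);
  }
}
```

For a 5x5 image under these assumptions, the Y plane has 5 elements per row and the interleaved UV plane has 2 * chroma_dim_420(5) = 6 elements per row across chroma_dim_420(5) = 3 rows.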

1927 Commits