George Steed 4f7fd808b7 [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON
The existing Neon code uses only 64-bit vectors throughout, which
limits performance on larger cores. To avoid this, switch the Neon
code from a Wx8 implementation to a Wx16 implementation and process
blocks of 16 full vectors at a time.

The original code also handled widths that were not exact multiples of
16. However, the "any" kernel should already handle this case, so that
handling is removed.
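
As a rough illustration of why the kernel itself no longer needs remainder
handling, the following C sketch splits the width into a multiple-of-16 part
and a leftover part. All names here are hypothetical and the stand-in
full-block kernel is plain scalar code; libyuv's actual "any" wrapper may be
structured differently.

```c
#include <stdint.h>

/* Hypothetical scalar helper: transposes 16 source rows by `width` columns.
 * In this sketch it stands in for both the SIMD kernel and the C fallback. */
static void SketchTranspose16Rows(const uint8_t* src, int src_stride,
                                  uint8_t* dst, int dst_stride, int width) {
  for (int x = 0; x < width; ++x) {
    for (int y = 0; y < 16; ++y) {
      dst[x * dst_stride + y] = src[y * src_stride + x];
    }
  }
}

/* Hypothetical "any"-style wrapper: the first call only ever sees a width
 * that is a multiple of 16; the remaining columns go to the scalar path. */
void SketchTransposeWx16_Any(const uint8_t* src, int src_stride,
                             uint8_t* dst, int dst_stride, int width) {
  int r = width & 15; /* leftover columns */
  int n = width - r;  /* largest multiple of 16 not exceeding width */
  if (n > 0) {
    /* A real build would call the Neon kernel here (assumption). */
    SketchTranspose16Rows(src, src_stride, dst, dst_stride, n);
  }
  if (r > 0) {
    /* Source column n maps to destination row n after transposition. */
    SketchTranspose16Rows(src + n, src_stride, dst + n * dst_stride,
                          dst_stride, r);
  }
}
```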

Finally, avoid duplicating the TransposeWx16_C fallback kernel
definition in all architectures that need it, and just put it once in
rotate_common.cc instead.
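
For readers unfamiliar with the Wx16 naming, a minimal scalar sketch of such
a kernel is shown below. It assumes "Wx16" means a block of 16 source rows by
`width` columns, and the function name is hypothetical; the actual
TransposeWx16_C in rotate_common.cc may be written differently.

```c
#include <stdint.h>

/* Hypothetical scalar sketch: transpose a block of 16 source rows by `width`
 * columns. Each 16-byte source column becomes one 16-byte destination row. */
void SketchTransposeWx16_C(const uint8_t* src, int src_stride,
                           uint8_t* dst, int dst_stride, int width) {
  for (int x = 0; x < width; ++x) {
    for (int y = 0; y < 16; ++y) {
      dst[x * dst_stride + y] = src[y * src_stride + x];
    }
  }
}
```

A vector implementation does the same data movement, but loads and stores
whole 128-bit registers (16 bytes) instead of single bytes.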

Observed speedups for TransposePlane across a range of
micro-architectures:

 Cortex-A53: -40.0%
 Cortex-A55: -20.7%
 Cortex-A57: -43.9%
Cortex-A510: -43.5%
Cortex-A520: -43.9%
Cortex-A720: -31.1%
  Cortex-X2: -38.3%
  Cortex-X4: -43.6%

Change-Id: Ic7c4d5f24eb27091d743ddc00cd95ef178b6984e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5545459
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
2024-05-21 07:46:42 +00:00
basic_types.h Disable old int types by default. 2018-07-09 21:16:47 +00:00
compare_row.h [AArch64] Add Neon implementation of HashDjb2 2024-05-01 19:37:31 +00:00
compare.h Lint cleanup after C99 change CL 2018-01-24 19:16:03 +00:00
convert_argb.h YUY2ToARGBMatrix and UYVYToARGBMatrix added to allow any color matrix 2024-01-19 21:21:37 +00:00
convert_from_argb.h MM21ToYUY2 and ABGRToJ420 conversion 2022-08-16 22:07:38 +00:00
convert_from.h Add 10/12 bit YUV To YUV functions 2021-02-25 23:16:54 +00:00
convert.h Fix tidy warning that uint32_t dither4 should not be const 2023-06-02 00:42:02 +00:00
cpu_id.h [AArch64] Enable feature detection on Windows and Apple Silicon 2024-05-03 18:42:51 +00:00
loongson_intrinsics.h RAWToJ400 faster version for ARM 2022-03-18 07:22:36 +00:00
macros_msa.h RAWToJ400 faster version for ARM 2022-03-18 07:22:36 +00:00
mjpeg_decoder.h add YUV24 and AYUV formats 2019-03-05 02:53:56 +00:00
planar_functions.h Disable NEON if memory sanitizer is enabled 2023-08-31 18:07:42 +00:00
rotate_argb.h Switch to C99 types 2018-01-23 19:16:05 +00:00
rotate_row.h [AArch64] Use full vectors in TransposeWx{8 => 16}_NEON 2024-05-21 07:46:42 +00:00
rotate.h Add 10 bit rotate methods. 2023-01-04 21:10:01 +00:00
row.h [AArch64] Add Neon implementations for {ARGB,ABGR}ToAR30Row 2024-05-21 07:35:07 +00:00
scale_argb.h Switch to C99 types 2018-01-23 19:16:05 +00:00
scale_rgb.h RGBScale function using 3 steps: RGB24ToARGB, ARGBScale, ARGBToRGB24 2022-03-19 01:44:06 +00:00
scale_row.h Split convert_test and convert_argb_test to allow building on small systems that run out of memory compiling unittests. 2023-12-08 13:39:56 +00:00
scale_uv.h add yuvconvstants util 2021-02-12 19:45:16 +00:00
scale.h Change ScalePlane,ScalePlane_16,... to return int 2023-11-03 23:53:24 +00:00
version.h Fix environment variable LIBYUV_CPU_INFO for unittests 2024-04-20 17:41:56 +00:00
video_common.h Add support for AR64 format 2021-03-13 20:55:21 +00:00