mirror of
https://chromium.googlesource.com/libyuv/libyuv
synced 2025-12-07 09:16:48 +08:00
Being able to use SVE2 functionality for these kernels has a number of performance wins compared to the existing Neon code:

* For the Y component calculation we are able to use UMULH, versus the existing UMULL x2 + UZP2 sequence in Neon.
* For the RGBTORGBA8 calculation we are able to take advantage of interleaving narrowing instructions, allowing us to use ST2 rather than ST4 for the store. This is a big performance win on some micro-architectures where ST4 is costly.
* The use of predication means we do not need to add "any" kernels; we can simply rerun the calculation with a not-full predicate for the final iteration.

To avoid the overhead of generating a predicate register on every iteration, we duplicate the loop body and only generate a predicate on the final iteration of the loop. This costs a small amount on the final iteration but should still be significantly quicker than the overhead of a function call needed by the "any" cases.

Duplicating the loop body to reduce the use of the WHILELT instruction improves little-core performance by ~12% by itself but has negligible impact on other micro-architectures.

Reduction in runtime for the new SVE2 implementation compared to the existing Neon implementation on selected micro-architectures:

    Cortex-A510: -36.5%
    Cortex-A720: -17.3%
    Cortex-X2: -11.3%

Bug: libyuv:973
Change-Id: I2a485f0dfa077a56f96b80a667ad38bbea47b4b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5424739
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
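The loop structure described in the commit message (an unpredicated main loop, plus one predicated pass for the leftover pixels) can be illustrated with a portable scalar model. This is only a sketch under stated assumptions, not the libyuv code and not SVE intrinsics: `kLanes`, `Mask`, `WhileLt`, `MulHigh16`, `ScaleBlock` and `ScaleRowSketch` are hypothetical names, the fixed lane count stands in for the hardware vector length, and the per-lane multiply-high merely stands in for the real kernel arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical fixed lane count; real SVE code uses the hardware vector length.
constexpr std::size_t kLanes = 8;
using Mask = std::array<bool, kLanes>;

// Models a WHILELT-style predicate: lane i is active while index + i < width.
inline Mask WhileLt(std::size_t index, std::size_t width) {
  Mask m{};
  for (std::size_t i = 0; i < kLanes; ++i) m[i] = (index + i) < width;
  return m;
}

// High 16 bits of a 16x16-bit unsigned product, i.e. what a per-lane
// multiply-high returns. Used here only as a stand-in for the kernel math.
inline std::uint16_t MulHigh16(std::uint16_t a, std::uint16_t b) {
  return static_cast<std::uint16_t>((static_cast<std::uint32_t>(a) * b) >> 16);
}

// One "vector" worth of work. Inactive lanes are neither read nor written.
inline void ScaleBlock(const std::uint16_t* src, std::uint16_t* dst,
                       std::uint16_t coeff, const Mask& active) {
  for (std::size_t i = 0; i < kLanes; ++i) {
    if (active[i]) dst[i] = MulHigh16(src[i], coeff);
  }
}

void ScaleRowSketch(const std::uint16_t* src, std::uint16_t* dst,
                    std::size_t width, std::uint16_t coeff) {
  Mask all_true;
  all_true.fill(true);  // built once, outside the loop

  std::size_t x = 0;
  // Main loop: full blocks only, so no predicate is generated per iteration.
  for (; x + kLanes <= width; x += kLanes) {
    ScaleBlock(src + x, dst + x, coeff, all_true);
  }
  // Duplicated body for the tail: the predicate is generated exactly once,
  // and only if some pixels are left over.
  if (x < width) {
    const Mask tail = WhileLt(x, width);
    ScaleBlock(src + x, dst + x, coeff, tail);
  }
}
```

Because the tail pass masks off the lanes past `width`, no separate "any"-style wrapper is needed for widths that are not a multiple of the lane count; the caller can pass any width directly.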
||
|---|---|---|
| compare_common.cc | ||
| compare_gcc.cc | ||
| compare_msa.cc | ||
| compare_neon64.cc | ||
| compare_neon.cc | ||
| compare_win.cc | ||
| compare.cc | ||
| convert_argb.cc | ||
| convert_from_argb.cc | ||
| convert_from.cc | ||
| convert_jpeg.cc | ||
| convert_to_argb.cc | ||
| convert_to_i420.cc | ||
| convert.cc | ||
| cpu_id.cc | ||
| mjpeg_decoder.cc | ||
| mjpeg_validate.cc | ||
| planar_functions.cc | ||
| rotate_any.cc | ||
| rotate_argb.cc | ||
| rotate_common.cc | ||
| rotate_gcc.cc | ||
| rotate_lsx.cc | ||
| rotate_msa.cc | ||
| rotate_neon64.cc | ||
| rotate_neon.cc | ||
| rotate_win.cc | ||
| rotate.cc | ||
| row_any.cc | ||
| row_common.cc | ||
| row_gcc.cc | ||
| row_lasx.cc | ||
| row_lsx.cc | ||
| row_msa.cc | ||
| row_neon64.cc | ||
| row_neon.cc | ||
| row_rvv.cc | ||
| row_sve.cc | ||
| row_win.cc | ||
| scale_any.cc | ||
| scale_argb.cc | ||
| scale_common.cc | ||
| scale_gcc.cc | ||
| scale_lsx.cc | ||
| scale_msa.cc | ||
| scale_neon64.cc | ||
| scale_neon.cc | ||
| scale_rgb.cc | ||
| scale_rvv.cc | ||
| scale_uv.cc | ||
| scale_win.cc | ||
| scale.cc | ||
| test.sh | ||
| video_common.cc | ||