mirror of
https://chromium.googlesource.com/libyuv/libyuv
synced 2025-12-09 18:26:49 +08:00
I422ToUYVYRow_AVX2 optimized from 7 cycles per 32 pixels to 4.6 cycles. Instead of 2 vpermq and vpunpcklbw: vmovdqu (%1),%%xmm2 vmovdqu 0x00(%1,%2,1),%%xmm3 vpermq $0xd8,%%ymm2,%%ymm2 vpermq $0xd8,%%ymm3,%%ymm3 vpunpcklbw %%ymm3,%%ymm2,%%ymm2 ..use vpmovzxbd to expand the bytes to shorts, then vpslld and vpor vpmovzxbd (%1),%%ymm2 vpmovzxbd 0x00(%1,%2,1),%%ymm3 vpslld $0x10,%%ymm3,%%ymm3 vpor %%ymm3,%%ymm2,%%ymm2 which reduces the port 5 bottleneck by 1 cycle. Bug: libyuv:556 Test: out/Release/libyuv_unittest --gtest_filter=*I42?To*UY*Opt Change-Id: I53799e53cc6b090a1a695c839094c193be3eecaf Reviewed-on: https://chromium-review.googlesource.com/899873 Commit-Queue: Frank Barchard <fbarchard@chromium.org> Reviewed-by: richard winterton <rrwinterton@gmail.com> Reviewed-by: Cheng Wang <wangcheng@google.com> |
||
|---|---|---|
| .. | ||
| basic_types.h | ||
| compare_row.h | ||
| compare.h | ||
| convert_argb.h | ||
| convert_from_argb.h | ||
| convert_from.h | ||
| convert.h | ||
| cpu_id.h | ||
| macros_msa.h | ||
| mjpeg_decoder.h | ||
| planar_functions.h | ||
| rotate_argb.h | ||
| rotate_row.h | ||
| rotate.h | ||
| row.h | ||
| scale_argb.h | ||
| scale_row.h | ||
| scale.h | ||
| version.h | ||
| video_common.h | ||