mirror of
https://chromium.googlesource.com/libyuv/libyuv
synced 2025-12-08 01:36:47 +08:00
benchmark on medium core
adbrun -- taskset 10 blaze-bin/third_party/libyuv/libyuv_test '--gunit_filter=*J420ToI420*' --gunit_also_run_disabled_tests --libyuv_width=1280 --libyuv_height=720 --libyuv_repeat=1000 --libyuv_flags=-1 --libyuv_cpu_info=-1
Now Neon
J420ToI420_Opt (159 ms)
Was C
J420ToI420_Opt (215 ms)
AArch64
J420ToI420_Opt (93 ms)
C version does this:
vld1.8 {d20, d21}, [r6]!
vorr q12, q8, q8
subs r4, #16
vmovl.u8 q11, d21
vmovl.u8 q10, d20
vmul.i16 q11, q9, q11
vmul.i16 q10, q9, q10
vsra.u16 q12, q11, #8
vorr q11, q8, q8
vsra.u16 q11, q10, #8
vmovn.i16 d21, q12
vmovn.i16 d20, q11
vst1.8 {d20, d21}, [r5]!
bne 0x3d9078 <Convert8To8Row_C+0x36> @ imm = #-54
Explanation of the C code above
vorr copies the bias constant (16, held in q8) into the accumulator register
vsra shifts the 16-bit product right by 8 and accumulates it into that register
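For reference, a scalar sketch of what the disassembled loop computes per pixel. This is inferred from the instruction sequence above, not copied from the libyuv source; the function name convert8to8_row_sketch and its parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the row loop, inferred from the disassembly.
   scale is treated as an 8-bit fraction (0..255); bias is added afterwards. */
static void convert8to8_row_sketch(const uint8_t* src, uint8_t* dst,
                                   int scale, int bias, int width) {
  assert(scale >= 0 && scale <= 255);
  for (int x = 0; x < width; ++x) {
    /* vmovl.u8 + vmul.i16: widen to 16 bits, multiply by scale. */
    uint16_t prod = (uint16_t)(src[x] * scale);
    /* vorr + vsra.u16 #8: start from bias, add (prod >> 8). */
    uint16_t acc = (uint16_t)(bias + (prod >> 8));
    /* vmovn.i16: keep the low 8 bits, no clamping. */
    dst[x] = (uint8_t)acc;
  }
}
```

The vector loop does the same work for 16 pixels per iteration (subs r4, #16), split into two 8-lane halves.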
Compared to the AArch64 version
instead of mull, C uses movl + mul (widen, then multiply)
instead of uzp2, C uses sra #8 + movn; that takes 2 movn per 16 pixels vs 1 uzp2
instead of add, C uses vorr + sra (copy the bias, then shift-accumulate)
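The two sequences should produce the same bytes. A scalar sketch of why, under the assumption that the AArch64 path is mull, then uzp2 to take the high byte of each 16-bit product, then an 8-bit add of the bias, as the comparison above describes. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Compiler-generated path: 16-bit shift-accumulate, then narrow
   (vorr + vsra #8 + vmovn). */
static uint8_t widen_path(uint8_t src, uint8_t scale, uint8_t bias) {
  uint16_t prod = (uint16_t)(src * scale);
  uint16_t acc = (uint16_t)(bias + (prod >> 8));
  return (uint8_t)acc;
}

/* AArch64-style path as described above: widening multiply (mull),
   keep the high byte of each product (uzp2), then 8-bit add. */
static uint8_t high_byte_path(uint8_t src, uint8_t scale, uint8_t bias) {
  uint16_t prod = (uint16_t)(src * scale);
  uint8_t hi = (uint8_t)(prod >> 8);
  return (uint8_t)(hi + bias);
}

/* Exhaustively compare both paths for one bias value.
   Returns 1 if every (src, scale) pair matches, 0 otherwise. */
static int paths_agree(uint8_t bias) {
  for (int s = 0; s < 256; ++s) {
    for (int k = 0; k < 256; ++k) {
      if (widen_path((uint8_t)s, (uint8_t)k, bias) !=
          high_byte_path((uint8_t)s, (uint8_t)k, bias)) {
        return 0;
      }
    }
  }
  return 1;
}
```

Both paths reduce to (bias + (prod >> 8)) modulo 256, so the difference is only in instruction count, not in output.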
Change-Id: I9648f06e52ccbafaecf07bd89f8ffff27565d025
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6189497
Reviewed-by: Justin Green <greenjustin@google.com>