There are existing x86 implementations for these kernels but not for
AArch64, so add them.
Reduction in runtimes, compared to the existing C code compiled with
LLVM 17:
Core | ABGRToAR30Row | ARGBToAR30Row
Cortex-A55 | -55.1% | -55.1%
Cortex-A510 | -39.3% | -40.1%
Cortex-A76 | -62.3% | -63.6%
Co-authored-by: Cosmina Dunca <cosmina.dunca@arm.com>
Bug: libyuv:976
Change-Id: I307f03bddcbe5429c2d3ab2f42aa023a3539ddd0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465592
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
We don't need a general-purpose permute here: REV16 does exactly what we
want and saves us needing to load the permute indices array.
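As a rough illustration of why the two are interchangeable, the scalar
sketch below models REV16 (swap the two bytes of every 16-bit lane) next
to a table-driven permute in the style of TBL. The helper names are
hypothetical and are not libyuv functions, and the index table shown in
the usage note is an assumption about what the old permute encoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of REV16: swap the two bytes inside every 16-bit lane. */
static void rev16_model(const uint8_t* src, uint8_t* dst, size_t n) {
  assert(n % 2 == 0);
  for (size_t i = 0; i < n; i += 2) {
    dst[i] = src[i + 1];
    dst[i + 1] = src[i];
  }
}

/* Scalar model of a TBL-style permute: every output byte is selected by an
 * index that has to be loaded from a table first. Out-of-range indices
 * produce zero, as TBL does. */
static void tbl_model(const uint8_t* src, const uint8_t* idx, uint8_t* dst,
                      size_t n) {
  for (size_t i = 0; i < n; ++i) {
    dst[i] = idx[i] < n ? src[idx[i]] : 0;
  }
}
```

With an index table of {1, 0, 3, 2, ...} both helpers give the same
bytes, but only the table-driven version needs the extra table load.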
Bug: libyuv:976
Change-Id: Ib3bc2e4d21b00d53aeda6a11c6e6f1016ca6029e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5509201
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
The use of LD4 and ST4 to de-interleave ARGB color channels is
unnecessary here since we can just adjust the scale multiplicand to
match the interleaved layout. LD4 and ST4 are known to perform poorly on
some micro-architectures so using LD1 and ST1 here should be preferred.
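The idea can be sketched in scalar C as follows. This is only a model:
the per-channel formula is simplified to (v * s) >> 8, which may not
match the kernel's exact rounding, and all function names are
hypothetical. The point is that repeating the four per-channel scales
across a 16-byte pattern lets us walk the bytes in memory order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified per-channel scaling; the real kernel's rounding may differ. */
static uint8_t scale_channel(uint8_t v, uint8_t s) {
  return (uint8_t)((v * s) >> 8);
}

/* LD4/ST4 style: treat each channel as its own plane. */
static void shade_deinterleaved(const uint8_t* src, uint8_t* dst,
                                size_t pixels, const uint8_t scale[4]) {
  for (size_t c = 0; c < 4; ++c) {
    for (size_t p = 0; p < pixels; ++p) {
      dst[4 * p + c] = scale_channel(src[4 * p + c], scale[c]);
    }
  }
}

/* LD1/ST1 style: keep the bytes interleaved and repeat the four scales
 * across a 16-byte pattern that matches the memory layout. */
static void shade_interleaved(const uint8_t* src, uint8_t* dst, size_t pixels,
                              const uint8_t scale[4]) {
  uint8_t pattern[16];
  assert(pixels % 4 == 0); /* this sketch handles whole 16-byte vectors only */
  for (size_t i = 0; i < 16; ++i) {
    pattern[i] = scale[i % 4];
  }
  for (size_t i = 0; i < pixels * 4; ++i) {
    dst[i] = scale_channel(src[i], pattern[i % 16]);
  }
}
```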
Reduction in runtime for ARGBShadeRow_NEON:
Cortex-A55: -19.9%
Cortex-A510: -50.8%
Cortex-A76: -36.0%
Cortex-X2: -46.4%
Bug: libyuv:976
Change-Id: I10a0e6a0a62242826d39b1e963063770f084226a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5494093
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of a pair of shifts to put the bits we want
in the low part of each vector lane and then a pair of UQXTN and UQXTN2
instructions to perform a saturating cast down from 16-bit elements to
8-bit elements. We can instead achieve the same thing by adding eight to
the first shift amount so that the bits we want appear in the high half
of the lane, doing the saturation at the same time, and then simply use
UZP2 to pull out the high halves of each lane in a single instruction.
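A scalar model of the equivalence, assuming the original sequence
amounts to a right shift by s (with s between 0 and 8) followed by a
saturating narrow. The function names are hypothetical and only model
one 16-bit lane.

```c
#include <assert.h>
#include <stdint.h>

/* Old sequence: plain shift right by s, then a UQXTN-style saturating
 * narrow from 16 bits to 8 bits. */
static uint8_t shift_then_saturating_narrow(uint16_t x, unsigned s) {
  assert(s <= 8);
  uint16_t t = (uint16_t)(x >> s);
  return t > 255 ? 255 : (uint8_t)t;
}

/* New sequence: the shift amount is eight larger (a left shift by 8 - s),
 * saturating at 16 bits, and the high byte is taken as UZP2 would. */
static uint8_t saturating_shift_then_high_half(uint16_t x, unsigned s) {
  assert(s <= 8);
  uint32_t wide = (uint32_t)x << (8 - s);
  uint16_t sat = wide > 0xFFFFu ? 0xFFFFu : (uint16_t)wide;
  return (uint8_t)(sat >> 8);
}
```

Any input that would overflow the 16-bit saturating shift is also one
whose shifted value exceeds 255, so both paths clamp to 255 together.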
Reduction in runtime for Convert16To8Row_NEON:
Cortex-A55: -19.7%
Cortex-A510: -23.5%
Cortex-A76: -35.4%
Cortex-X2: -34.1%
Bug: libyuv:976
Change-Id: I9a80c0f4f2c6b5203f23e422c0970d3167052f91
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463950
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Shift instructions have worse throughput than other permute instructions
on some micro-architectures, and we can avoid the need for two separate
narrowing instructions by taking the high halves of each lane directly
through use of the UZP2 instruction.
Reduction in runtime for DivideRow_16_NEON:
Cortex-A55: -6.2%
Cortex-A510: -30.0%
Cortex-A76: -11.9%
Cortex-X2: -46.8%
Bug: libyuv:976
Change-Id: I4aa06eab06ab6134bb80bc3af5328a1a83b3d249
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5463949
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The MOV instruction is an alias of ORR with both source registers the
same, and is the preferred form.
Neither ORR nor MOV is guaranteed to be zero-cost on every
micro-architecture, so there may be better ways to express these
kernels, but this is left for a later commit.
Bug: libyuv:975
Change-Id: I29b7f182a57a61855cb7f8a867691080f153b10b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5332385
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Using full vectors for Add and Subtract is a win across the board. Using
full vectors for the multiply is less obviously a win, especially for
smaller cores like Cortex-A53 or Cortex-A57, so is not considered for
this change.
Observed changes in performance with this change compared to the
existing Neon code:
Core | ARGBAddRow_NEON | ARGBSubtractRow_NEON
Cortex-A55 | -5.1% | -5.1%
Cortex-A510 | -18.4% | -18.4%
Cortex-A76 | -28.9% | -28.7%
Cortex-A720 | -36.1% | -36.2%
Cortex-X1 | -14.2% | -14.4%
Cortex-X2 | -12.5% | -12.5%
Bug: libyuv:976
Change-Id: I85316d4399c93b53baa62d0d43b2fa453517f5b4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457433
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing code performs a lot of shifts and combines the R and B
components into a single vector unnecessarily. We can express this much
more cleanly by making use of the SRI instruction to insert and replace
shifted bits into the original data, performing the 5/6-bit to 8-bit
expansion in a single instruction if the source bits are already in the
high bits of the byte. We still need a single separate XTN instruction
to narrow the B component before the left shift since Neon does not have
a narrowing left shift instruction.
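The expansion can be modelled on a single byte as below. This is a
scalar sketch of SRI, not the Neon code itself; sri8, expand5 and
expand6 are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of SRI on one 8-bit lane: shift n right by `shift` and
 * insert it into d, keeping only the top `shift` bits of d. */
static uint8_t sri8(uint8_t d, uint8_t n, unsigned shift) {
  assert(shift >= 1 && shift <= 8);
  uint8_t keep = (uint8_t)~(0xFFu >> shift);
  return (uint8_t)((d & keep) | (n >> shift));
}

/* 5-bit to 8-bit expansion: with the value already in the high bits, one
 * SRI by 5 replicates its top bits into the low 3 bits. */
static uint8_t expand5(uint8_t v5) {
  assert(v5 < 32);
  uint8_t hi = (uint8_t)(v5 << 3);
  return sri8(hi, hi, 5);
}

/* 6-bit to 8-bit expansion, the same idea with a shift of 6. */
static uint8_t expand6(uint8_t v6) {
  assert(v6 < 64);
  uint8_t hi = (uint8_t)(v6 << 2);
  return sri8(hi, hi, 6);
}
```

The result matches the usual (v << 3) | (v >> 2) and (v << 2) | (v >> 4)
formulas, so full-scale inputs map to 255 and zero stays zero.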
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
RGB565ToYRow_NEON | -22.1% | -23.4% | -25.1%
RGB565ToUVRow_NEON | -26.8% | -20.5% | -18.8%
RGB565ToARGBRow_NEON | -38.9% | -32.0% | -23.5%
Bug: libyuv:976
Change-Id: I77b8d58287b70dbb9549451fc15ed3dd0d2a4dda
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5374286
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Most micro-architectures seem to prefer an additional ZIP1 instruction
in READYUV422 to needing a lane-indexed LD1 load instruction.
We introduce a new macro to handle the YUV to RGB conversion where the U
and V components are in separate vectors. This avoids causing a slowdown
for the UV-interleaved input format kernels (NV12 and NV21) where we do
not want to separate them.
Reduction in runtime for selected kernels on Cortex cores (no
performance difference observed on Cortex-A55):
Kernel | Cortex-A510 | Cortex-A76 | Cortex-A720 | Cortex-X1 | Cortex-X2
I422AlphaToARGBRow_NEON | -4.3% | -7.3% | -10.1% | -4.0% | -4.4%
I422ToARGB1555Row_NEON | -4.5% | +0.4% | -7.9% | -4.8% | -3.9%
I422ToARGB4444Row_NEON | -7.7% | -2.6% | -4.1% | -1.9% | -1.3%
I422ToARGBRow_NEON | -3.7% | -2.9% | -10.2% | -3.8% | -4.4%
I422ToRGB24Row_NEON | -5.9% | +5.4% | -3.2% | -4.3% | -4.3%
I422ToRGB565Row_NEON | -4.8% | -2.8% | -8.5% | -3.8% | -4.6%
I422ToRGBARow_NEON | -3.7% | +4.6% | -10.5% | -3.0% | -4.5%
I444AlphaToARGBRow_NEON | -3.5% | +2.7% | -3.7% | -5.0% | -8.2%
I444ToARGBRow_NEON | -1.8% | -15.1% | -3.5% | -6.5% | -8.1%
I444ToRGB24Row_NEON | -2.0% | -6.8% | +0.1% | -4.7% | +1.2%
A few cases are slower on Cortex-A76, but there are significant
speedups elsewhere.
Bug: libyuv:976
Change-Id: Ib3b4ef81f7bfc1d7ff9c4c24aef9ad86741410ff
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5465580
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing transformations can be more cleanly expressed by using SRI
instructions to perform a shift and simultaneously merge into an
existing value.
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
ARGB1555ToYRow_NEON | -26.2% | -14.9% | -28.2%
ARGB1555ToUVRow_NEON | -25.2% | -18.4% | -20.9%
ARGB1555ToARGBRow_NEON | -43.6% | -32.8% | -19.7%
Bug: libyuv:976
Change-Id: Id07ac6f2cd3eb9bb70f9e29fc1f4b29fe26156ec
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383444
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The existing sequence to convert from 8-bit ARGB to 4-bit ARGB4444 makes
use of a lot of shifts and bit-clears before ORR'ing the pairs together.
This is unnecessary since we can do the same with the SRI instruction,
so use that instead.
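A scalar sketch of the packing for one pixel, assuming the usual
B, G, R, A byte order in memory and a 4-bit-per-channel output with B in
the lowest nibble. The helper names are hypothetical, and the
shift-and-mask variant is a simplified stand-in for the old sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of SRI on one 8-bit lane: shift n right by `shift` and
 * insert it into d, keeping only the top `shift` bits of d. */
static uint8_t sri8(uint8_t d, uint8_t n, unsigned shift) {
  assert(shift >= 1 && shift <= 8);
  uint8_t keep = (uint8_t)~(0xFFu >> shift);
  return (uint8_t)((d & keep) | (n >> shift));
}

/* Shift, clear and OR each pair of channels separately. */
static void pack4444_shifts(const uint8_t argb[4], uint8_t out[2]) {
  out[0] = (uint8_t)(((argb[1] >> 4) << 4) | (argb[0] >> 4)); /* G:B */
  out[1] = (uint8_t)(((argb[3] >> 4) << 4) | (argb[2] >> 4)); /* A:R */
}

/* One SRI per output byte does the shift and the merge together. */
static void pack4444_sri(const uint8_t argb[4], uint8_t out[2]) {
  out[0] = sri8(argb[1], argb[0], 4); /* high nibble G, low nibble B */
  out[1] = sri8(argb[3], argb[2], 4); /* high nibble A, low nibble R */
}
```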
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76
ARGBToARGB4444Row_NEON | -15.3% | -16.6%
I422ToARGB4444Row_NEON | -2.7% | -11.9%
Bug: libyuv:976
Change-Id: I86cd86c7adf1105558787a679272179821f31a9d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5383443
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The values of the UV components in the vector are known and the vectors
are never overwritten, so we can hoist the UV-specific parts of the
calculation out of the loop.
Reduction in runtimes for I400ToARGBRow_NEON:
Cortex-A55: -10.0%
Cortex-A510: -3.7%
Cortex-A76: -19.3%
Cortex-X2: -14.4%
Bug: libyuv:976
Change-Id: I17d6de4e1790f71407e12ff84548568cc3ebbe1a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5457434
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBMultiplyRow_NEON:
Cortex-A55: -22.3%
Cortex-A510: -56.6%
Cortex-A76: -45.5%
Cortex-X2: -54.6%
Change-Id: I9103111a109a4d87d358e06eb513746314aaf66a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454832
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBSubtractRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Ifbfce9e6a45159932c09d9b0229215a36fa22f43
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454833
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
There is no need to de-interleave channels here since we are applying
the same operation across all lanes. LD4 and ST4 are known to be
significantly slower than LD1/ST1 on some micro-architectures so we
should prefer to avoid them where possible.
Reduction in runtimes observed for ARGBAddRow_NEON:
Cortex-A55: -15.0%
Cortex-A510: -59.8%
Cortex-A76: -54.4%
Cortex-X2: -70.4%
Change-Id: Id04e5259d8e5e7511dad5df85cdf9759b392cb99
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5454831
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
The comment refers to the code needing to be re-enabled but as far as I
can tell it is already enabled, so simply remove the comment.
Change-Id: Id014e8b7f5cd43c8211e1d38758299de2fad49de
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5387650
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing Neon code makes use of a pair of UQSHRN and UQSHRN2
instructions to extract the top half of a widened multiply result.
These instructions would ordinarily saturate; however, saturation can
never happen in this case: since we are shifting by 16 to get the top
half of each element, the top bits remain as-is.
We could move this to using a slightly simpler non-saturating shift;
however, in this case it is simpler and faster to just use UZP2 to
extract the top half of each 32-bit lane directly.
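A scalar model of one lane shows why the saturation is dead code. The
function names are hypothetical; the first models the shift-by-16
saturating narrow and the second models taking the high half of the
32-bit product directly, as UZP2 does across a vector.

```c
#include <assert.h>
#include <stdint.h>

/* UQSHRN-style path: shift the 32-bit product right by 16, then saturate
 * to 16 bits. The shifted value always fits, so the clamp never fires. */
static uint16_t mul_high_saturating(uint16_t a, uint16_t b) {
  uint32_t product = (uint32_t)a * b;
  uint32_t shifted = product >> 16;
  assert(shifted <= 0xFFFFu); /* a 32-bit value shifted by 16 fits 16 bits */
  return shifted > 0xFFFFu ? 0xFFFFu : (uint16_t)shifted;
}

/* UZP2-style path: simply take the high 16 bits of the 32-bit product. */
static uint16_t mul_high_direct(uint16_t a, uint16_t b) {
  uint32_t product = (uint32_t)a * b;
  return (uint16_t)(product >> 16);
}
```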
Reduction in runtime for selected kernels:
Kernel | Cortex-A55 | Cortex-A76 | Cortex-X2
I400ToARGBRow_NEON | -9.4% | -14.9% | -13.9%
I422AlphaToARGBRow_NEON | -7.9% | -11.4% | -11.5%
I422ToARGB1555Row_NEON | -7.3% | -17.2% | -14.7%
I422ToARGB4444Row_NEON | -7.6% | -17.9% | -13.7%
I422ToARGBRow_NEON | -8.2% | -9.8% | -11.9%
I422ToRGB24Row_NEON | -8.0% | -13.3% | -12.8%
I422ToRGB565Row_NEON | -7.5% | -15.1% | -14.6%
I422ToRGBARow_NEON | -8.3% | -13.1% | -12.2%
I444AlphaToARGBRow_NEON | -8.3% | -7.6% | -12.7%
I444ToARGBRow_NEON | -8.6% | -3.5% | -13.5%
I444ToRGB24Row_NEON | -8.5% | -7.8% | -13.4%
NV12ToARGBRow_NEON | -8.8% | -1.4% | -12.0%
NV12ToRGB24Row_NEON | -8.5% | -11.5% | -12.3%
NV12ToRGB565Row_NEON | -7.9% | -15.0% | -15.7%
NV21ToARGBRow_NEON | -8.7% | -1.6% | -12.3%
NV21ToRGB24Row_NEON | -8.4% | -11.5% | -12.0%
UYVYToARGBRow_NEON | -8.8% | -8.9% | -11.9%
YUY2ToARGBRow_NEON | -8.7% | -10.8% | -13.3%
Bug: libyuv:976
Change-Id: I6c505fe722e5f91f93718b85fe881ad056d8602d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366653
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
In this case we have an LD2 instruction followed by a pair of permutes
(ZIP1 and TBL). On some micro-architectures LD2 involves use of the
vector pipelines, so in these cases it is preferable to do an LD1 and
then a different pair of permutes (TRN + TBL) instead to avoid the extra
vector pipeline usage.
Reduction in runtime on selected kernels (no observed performance delta
on Cortex-A55):
Kernel | Cortex-A76 | Cortex-X2
UYVYToARGBRow_NEON | -2.6% | -8.8%
YUY2ToARGBRow_NEON | -6.2% | -4.9%
Bug: libyuv:976
Change-Id: I7ca45e0c7bf7cb50cc5ab37c6a01215d9689039a
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5366652
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
The existing code makes use of a pair of lane-indexed load instructions
to fill the two halves of the input vector, however this has the effect
of introducing an unnecessary dependency on the value of the vector from
the previous loop iteration.
This doesn't really seem to affect little core performance since these
cores never execute enough work concurrently to hit the bottleneck,
however we can improve performance on mid and big cores quite a bit by
using LDR instead of LD1 to load the low lane, zeroing the upper portion
of the vector rather than keeping the previous value.
Reduction in runtime for select kernels (no observed performance delta
on Cortex-A55):
Kernel | Cortex-A76 | Cortex-X2
I422ToARGB4444Row_NEON | -23.1% | -49.3%
I422ToARGBRow_NEON | -1.2% | -2.5%
I422ToRGB24Row_NEON | -11.7% | -7.0%
I422ToRGBARow_NEON | -4.7% | -3.4%
I444AlphaToARGBRow_NEON | -1.1% | -2.4%
I444ToARGBRow_NEON | -1.6% | -3.2%
I444ToRGB24Row_NEON | -9.6% | -6.8%
Bug: libyuv:976
Change-Id: I8c9413e0e6ed97b8f060ce42b6e8abdfb77914b9
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5365868
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Add static to internal scale and rotate functions
- Remove unittest that tested an internal scale function
- Remove unused private functions
- Include missing scale_argb.h header
- Bump version and apply clang format
Bug: libyuv:830
Change-Id: I45bab0423b86334f9707f935aedd0c6efc442dd4
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4658956
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
- Makes ARM and Intel match and fixes some off by 1 cases
- Add ARGBToUV444MatrixRow_NEON
- Add ConvertFP16ToFP32Column_NEON
- scale_rvv fix intrinsic build error
- disable row_win version of ARGBAttenuate/Unattenuate
Bug: libyuv:936, libyuv:956
Change-Id: Ied99aaad3a11a8eb69212b628c58f86ec0723c38
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4617013
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Allows code to be optimized with clang 17 -flto-thin
- Bump version number to 1864 to allow detection of fix
- Apply clang format to standardize formatting; No impact on code generated
Bug: chromium:1424089
Change-Id: Ib745836b27915a5e4cb1d7d928ee52659360612b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4370052
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
Fix the algorithm for unpacking the lower 2 bits of MT2T pixels.
Bug: b:258474032
Change-Id: Iea1d63f26e3f127a70ead26bc04ea3d939e793e3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4337978
Commit-Queue: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Convert MergeUVRow_AVX512BW to assembly
- Enable MergeUVRow_AVX512BW for Windows with clangcl
- MergeUVRow_AVX2 use vpmovzxbw and vpsllw
- MergeUVRow_16_AVX2 use vpmovzxbw and vpsllw with different shift for U and V
AMD Zen 4 640x360 100000 iterations
Was
AVX512 MergeUVPlane_Opt (884 ms)
AVX2 MergeUVPlane_Opt (945 ms)
AVX2 MergeUVPlane_16_Opt (2167 ms)
Now
AVX512 MergeUVPlane_Opt (865 ms)
AVX2 MergeUVPlane_Opt (943 ms)
SSE2 MergeUVPlane_Opt (973 ms)
AVX2 MergeUVPlane_16_Opt (2102 ms)
Bug: None
Change-Id: I658ada2a75d44c3f93be8bd3ed96f83d5fa2ab8d
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4271230
Reviewed-by: Fritz Koenig <frkoenig@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
- Was a dup of 8h lanes but a mul of 4s lanes; now use umull
Bug: libyuv:951
Change-Id: If6cb01f5f006c2235886b81ce120642d7e24a9bb
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4166563
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- MT2T support for source strides added, but only works for positive values.
- Reduced casting in row_common - one cast per assignment.
- scaling functions use intptr_t for intermediate calculations, then cast strides to ptrdiff_t
Bug: libyuv:948, b/257266635, b/262468594
Change-Id: I0409a0ce916b777da2a01c0ab0b56dccefed3b33
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/4102203
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: Justin Green <greenjustin@google.com>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Ernest Hua <ernesthua@google.com>
- Optimized YUY2ToNV12 that reduces it from 3 steps to 2 steps
- Was SplitUV, memcpy Y, InterpolateUV
- Now YUY2ToY, YUY2ToNVUV
- rollback LIBYUV_UNLIMITED_DATA
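The two-step path can be sketched in scalar C as below. The function
names are hypothetical stand-ins for the row functions named above, and
the rounding average of the two chroma rows is an assumption about how
the vertical subsampling is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 1: copy the luma bytes. YUY2 stores each pixel pair as Y0 U Y1 V. */
static void yuy2_to_y_sketch(const uint8_t* src_yuy2, uint8_t* dst_y,
                             size_t width) {
  for (size_t x = 0; x < width; ++x) {
    dst_y[x] = src_yuy2[2 * x];
  }
}

/* Step 2: average the chroma of two source rows and write it out as
 * interleaved UV, which is the NV12 chroma layout. */
static void yuy2_to_nvuv_sketch(const uint8_t* row0, const uint8_t* row1,
                                uint8_t* dst_uv, size_t width) {
  assert(width % 2 == 0);
  for (size_t x = 0; x < width; x += 2) {
    dst_uv[x] = (uint8_t)((row0[2 * x + 1] + row1[2 * x + 1] + 1) >> 1);
    dst_uv[x + 1] = (uint8_t)((row0[2 * x + 3] + row1[2 * x + 3] + 1) >> 1);
  }
}
```

Each source byte is read once per step and nothing is staged through an
intermediate split UV buffer.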
3840x2160 1000 iterations:
Pixel 2 Cortex A73
Was YUY2ToNV12_Opt (6515 ms)
Now YUY2ToNV12_Opt (3350 ms)
AB7 Mediatek P35 Cortex A53
Was YUY2ToNV12_Opt (6435 ms)
Now YUY2ToNV12_Opt (3301 ms)
Skylake AVX2 x64
Was YUY2ToNV12_Opt (1872 ms)
Now YUY2ToNV12_Opt (1657 ms)
SSE2 x64
Was YUY2ToNV12_Opt (2008 ms)
Now YUY2ToNV12_Opt (1691 ms)
Windows Skylake AVX2 32 bit x86
Was YUY2ToNV12_Opt (2161 ms)
Now YUY2ToNV12_Opt (1628 ms)
Bug: libyuv:943
Change-Id: I6c2ba2ae765413426baf770b837de114f808f6d0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3929843
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- YUV to RGB use linear for first and last row.
- add assert(yuvconstants)
- rename pointers to match row functions.
- use macros that match row functions.
- use 12 bit upsampler for conversions of 10 and 12 bits
Cortex A53 AArch32
I420ToRGB24_Opt (3627 ms)
I422ToRGB24_Opt (4099 ms)
I444ToRGB24_Opt (4186 ms)
I420ToRGB24Filter_Opt (5451 ms)
I422ToRGB24Filter_Opt (5430 ms)
AVX2
Was I420ToRGB24Filter_Opt (583 ms)
Now I420ToRGB24Filter_Opt (560 ms)
Neon Cortex A7
Was I420ToRGB24Filter_Opt (5447 ms)
Now I420ToRGB24Filter_Opt (5439 ms)
Bug: libyuv:938
Change-Id: I1731f2dd591073ae11a756f06574103ba0f803c7
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3906082
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Explicitly initialize the 'pad' field of RgbConstants to 0. This
prevents the following warning/error in some compilers:
error: missing field 'pad' initializer [-Werror,-Wmissing-field-initializers]
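A minimal sketch of the pattern, using a hypothetical struct that only
approximates the RgbConstants layout; the field names other than pad and
the coefficient values are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the constants structure. */
struct RgbConstantsSketch {
  uint8_t coeffs[4];
  uint16_t add;
  uint16_t pad;
};

/* Writing {{25, 129, 66, 0}, 0x1080} would still zero `pad` implicitly, but
 * it is what -Wmissing-field-initializers complains about. Spelling out the
 * trailing 0 keeps the same value and silences the warning. */
static const struct RgbConstantsSketch kSketchConstants = {
    {25, 129, 66, 0}, 0x1080, 0};

/* Returns nonzero when every field, including pad, matches. */
static int rgb_constants_equal(const struct RgbConstantsSketch* a,
                               const struct RgbConstantsSketch* b) {
  assert(a != NULL && b != NULL);
  return memcmp(a, b, sizeof(*a)) == 0;
}
```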
Bug: b/241008246
Change-Id: Id6a0beb75c5c709404290c75915049f8a3898c83
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3808044
Reviewed-by: Wan-Teh Chang <wtc@google.com>
MergeRGB and SplitRGB use a register to point to 9 shuffle tables.
- fixes an out of registers error with -mcmodel=large
InterpolateRow_16To8_NEON improves performance for I210ToI420:
On Pixel 4 for 720p x1000 images
Was I210ToI420_Opt (608 ms)
Now I210ToI420_Opt (336 ms)
On Skylake Xeon
Was I210ToI420_Opt (259 ms)
Now I210ToI420_Opt (209 ms)
Bug: libyuv:931, libyuv:930
Change-Id: I20f8244803f06da511299bf1a2ffc7945eb35221
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717054
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Justin Green <greenjustin@google.com>
- Avoid stepping to height + 1 for bilinear filter 2nd row for last row of source
- Box filter ubsan fix for 3/4 and 3/8 scaling for 16 bit planar
- Height 1 asan fixes
Bug: libyuv:935, b/206716399
Change-Id: I56088520f2a884a37b987ee5265def175047673e
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3717263
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Fixes chromium PaintCanvasVideoRendererTest.HighBitDepth
sqdmulh was creating a 9 bit value with rounding, and then shifted it right 1 with no rounding. The rounding had an off by 1 impact in some tests.
Pixel 3
C I010ToI420_Opt (749 ms)
Was sqdmulh I010ToI420_Opt (370 ms)
Now ushl I010ToI420_Opt (324 ms)
Pixel 4
C I010ToI420_Opt (581 ms)
Was sqdmulh I010ToI420_Opt (240 ms)
Now ushl I010ToI420_Opt (231 ms)
Bug: b/216321733, b/233233302
Change-Id: I26f673bb411401d1e4a8126bf22d61c649223e9b
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3694143
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
- Add I210ToI420 to convert 10 bit 4:2:2 YUV to 4:2:0 8 bit
- Add NEON InterpolateRow_16 for fast 10 bit scaling
- When scaling up, set step to interpolate toward height - 1 to avoid buffer overread
- When scaling down, center the 2 rows used for source to achieve filtering.
- CopyPlane check for 0 size and return
Bug: libyuv:931, b/228605787, b/233233302, b/233634772, b/234558395, b/234340482
Change-Id: I63e8580710a57812b683c2fe40583ac5a179c4f1
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3687552
Reviewed-by: Mirko Bonadei <mbonadei@chromium.org>
Reviewed-by: richard winterton <rrwinterton@gmail.com>
Pixel 3
Was C I010ToI420_Opt (749 ms)
Now NEON I010ToI420_Opt (356 ms)
Pixel 4
Was C I010ToI420_Opt (581 ms)
Now NEON I010ToI420_Opt (163 ms)
Bug: b/233233302, b/233634772
Change-Id: I60a84648a66f77d97c0a7822b29bd18b8e3a3355
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3661401
Reviewed-by: Justin Green <greenjustin@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
- Unrolled to 16 pixels
- Take constants via structure, allowing different colorspace and channel order
- Use ADDHN to add 16.5 and take upper 8 bits of 16 bit values, narrowing to 8 bits
- clang-format applied, affecting mips code
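The ADDHN step can be modelled on one lane as below. This is a scalar
sketch, not the Neon kernel: the struct, the function names and the
limited-range weights are illustrative assumptions. 16.5 in 8.8 fixed
point is 16 * 256 + 128 = 0x1080.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants structure: per-channel weights plus a bias in
 * 8.8 fixed point, so colorspace and channel order can be swapped. */
struct YConstantsSketch {
  uint8_t kb, kg, kr;
  uint16_t add;
};

/* Scalar model of ADDHN on one 16-bit lane: add, then keep the upper
 * 8 bits of the 16-bit sum. */
static uint8_t addhn16(uint16_t a, uint16_t b) {
  return (uint8_t)((uint16_t)(a + b) >> 8);
}

static uint8_t rgb_to_y_sketch(uint8_t r, uint8_t g, uint8_t b,
                               const struct YConstantsSketch* c) {
  int sum = c->kr * r + c->kg * g + c->kb * b;
  assert(sum + c->add <= 0xFFFF); /* the biased sum must fit in 16 bits */
  return addhn16((uint16_t)sum, c->add);
}

/* Illustrative weights with a bias of 16.5 in 8.8 fixed point. */
static const struct YConstantsSketch kLimitedRangeSketch = {25, 129, 66,
                                                            0x1080};
```

The bias addition, the rounding and the narrowing to 8 bits all happen
in the one ADDHN step.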
On Cortex A510
Was RAWToJ400_Opt (1623 ms)
Now RAWToJ400_Opt (862 ms)
C RAWToJ400_Opt (1627 ms)
Bug: b/220171611
Change-Id: I06a9baf9650ebe2802fb6ff6dfbd524e2c06ada0
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/3534023
Reviewed-by: Wan-Teh Chang <wtc@google.com>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>