Mirror of https://chromium.googlesource.com/libyuv/libyuv
(synced 2025-12-06 08:46:47 +08:00)
Avoid odd width stores in I422ToRGB565Row_{SVE2,SME}
The existing code for creating RGB565 data in SVE2 and SME produces two
vectors of interleaved 16-bit elements due to the nature of how SVE
widening instructions operate. This means that the indices of the 16-bit
data created are distributed across the two result vectors as follows:
z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...]
z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...]
This is problematic for the final (predicated) iteration of the
conversion, since each active lane of the p1 predicate passed to the ST2H
instruction stores one 16-bit element from each of z18 and z19. The first
active lane therefore always writes four bytes, covering both elem0 and
elem1. This means that when the width is an odd number there is no way of
storing just the last element from z18 without also storing its neighbour
from z19.
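
As an illustration only, the over-store can be modelled in scalar C. This
is a sketch, not libyuv code: the function name model_st2h_byte_pred and
the fixed 16-byte vector length VL_BYTES are assumptions made for the
example (real SVE vector lengths vary). It assumes the predicate was built
per byte, as "whilelt p1.b, wzr, width" does, so a 16-bit lane i is
active when predicate bit 2*i is set, that is, when 2*i < width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL_BYTES 16 /* assumed vector length in bytes, for illustration */

/* Scalar model of "st2h {za.h, zb.h}, p, [dst]" where p was generated
 * per byte for `width` elements. Each active 16-bit lane stores one
 * element from za and one from zb, interleaved.
 * Returns the number of 16-bit elements written to dst. */
size_t model_st2h_byte_pred(const uint16_t* za, const uint16_t* zb,
                            int width, uint16_t* dst) {
  assert(width >= 0 && width <= VL_BYTES);
  size_t stored = 0;
  for (int lane = 0; lane < VL_BYTES / 2; ++lane) {
    if (2 * lane < width) {  /* governing bit of this .h lane is set */
      dst[2 * lane] = za[lane];      /* even-numbered element */
      dst[2 * lane + 1] = zb[lane];  /* odd-numbered element */
      stored += 2;
    }
  }
  return stored;
}
```

In this model a width of 3 writes four elements, one more than requested,
while an even width writes exactly the requested number.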
This patch addresses this by permuting the z18/z19 data such that the
two bytes from each element are split evenly across the two vectors:
z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...]
z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...]
Since we would now always store the same lanes from both vectors we can
continue to use the same predicate without further changes.
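
The permute and store described above can be sketched in scalar C in the
same spirit. Again this is an illustrative model rather than libyuv code:
the names model_trn1_b, model_trn2_b and model_st2b, and the 16-byte
VL_BYTES, are assumptions for the example. It models TRN1/TRN2 on byte
lanes followed by ST2B, where byte lane i is active when i < width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VL_BYTES 16 /* assumed vector length in bytes, for illustration */

/* TRN1 on bytes: interleave the even-numbered lanes of a and b. */
void model_trn1_b(const uint8_t* a, const uint8_t* b, uint8_t* out) {
  for (int i = 0; i < VL_BYTES; i += 2) {
    out[i] = a[i];
    out[i + 1] = b[i];
  }
}

/* TRN2 on bytes: interleave the odd-numbered lanes of a and b. */
void model_trn2_b(const uint8_t* a, const uint8_t* b, uint8_t* out) {
  for (int i = 0; i < VL_BYTES; i += 2) {
    out[i] = a[i + 1];
    out[i + 1] = b[i + 1];
  }
}

/* ST2B with a per-byte predicate covering `width` lanes: each active
 * lane stores one byte from lo and one from hi, interleaved.
 * Returns the number of bytes written to dst. */
size_t model_st2b(const uint8_t* lo, const uint8_t* hi, int width,
                  uint8_t* dst) {
  assert(width >= 0 && width <= VL_BYTES);
  size_t stored = 0;
  for (int lane = 0; lane < VL_BYTES && lane < width; ++lane) {
    dst[2 * lane] = lo[lane];      /* byte0 of element `lane` */
    dst[2 * lane + 1] = hi[lane];  /* byte1 of element `lane` */
    stored += 2;
  }
  return stored;
}
```

Because each active byte lane now corresponds to exactly one 16-bit
element, an odd width in this model writes 2 * width bytes and nothing
more.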
The existing (non-tail) loop body utilizes an all-true predicate so we
can avoid the extra permutes in this case, avoiding any performance
degradation.
Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
parent 5f284054cb
commit 64ac2d8f0f
@@ -501,7 +501,11 @@ static inline void I422ToRGB565Row_SVE_SC(
       "whilelt p1.b, wzr, %w[width] \n" //
       READYUV422_SVE_2X I422TORGB_SVE_2X RGBTOARGB8_SVE_TOP_2X
       RGB8TORGB565_SVE_FROM_TOP_2X
-      "st2h {z18.h, z19.h}, p1, [%[dst]] \n"
+      // Need to permute the data on the final iteration such that the
+      // predicates (.b) line up with the 16-bit element data.
+      "trn1 z20.b, z18.b, z19.b \n"
+      "trn2 z21.b, z18.b, z19.b \n"
+      "st2b {z20.b, z21.b}, p1, [%[dst]] \n"
 
       "99: \n"
       : [src_y] "+r"(src_y), // %[src_y]