Avoid odd width stores in I422ToRGB565Row_{SVE2,SME}

The existing code for creating RGB565 data in SVE2 and SME produces two vectors of interleaved 16-bit elements due to the nature of how SVE widening instructions operate. This means that the indices of the 16-bit data created appear in the two result vectors as such:

    z18.b: [elem0 byte0, elem0 byte1, elem2 byte0, elem2 byte1, ...]
    z19.b: [elem1 byte0, elem1 byte1, elem3 byte0, elem3 byte1, ...]

This is problematic for the final (predicated) iteration of the conversion since the p1 predicate input to the ST2H instruction controls storing the four bytes corresponding to the first two elements, in the first two bytes of z18 and z19. This means that in the case that the width is an odd number there is no way of storing just elem0 in z18 individually.

This patch addresses this by permuting the z18/z19 data such that the two bytes from each element are split evenly across the two vectors:

    z20.b: [elem0 byte0, elem1 byte0, elem2 byte0, elem3 byte0, ...]
    z21.b: [elem0 byte1, elem1 byte1, elem2 byte1, elem3 byte1, ...]

Since we would now always store the same lanes from both vectors we can continue to use the same predicate without further changes.

The existing (non-tail) loop body utilizes an all-true predicate so we can avoid the extra permutes in this case, avoiding any performance degradation.

Change-Id: I7d2be27c84cd9eb02cebac54a14c3498911f21d3
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6395137
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
Commit-Queue: Frank Barchard <fbarchard@chromium.org>
2026-01-01 03:12:16 +08:00 · 2025-03-24 13:21:42 +00:00 · 2025-03-24 13:21:42 +00:00 · 64ac2d8f0f
commit 64ac2d8f0f
parent 5f284054cb
1 changed file with 5 additions and 1 deletion
--- a/include/libyuv/row_sve.h
+++ b/include/libyuv/row_sve.h
@@ -501,7 +501,11 @@ static inline void I422ToRGB565Row_SVE_SC(
      "whilelt  p1.b, wzr, %w[width]                    \n"  //
      READYUV422_SVE_2X I422TORGB_SVE_2X RGBTOARGB8_SVE_TOP_2X
          RGB8TORGB565_SVE_FROM_TOP_2X
-      "st2h     {z18.h, z19.h}, p1, [%[dst]] \n"
+      // Need to permute the data on the final iteration such that the
+      // predicates (.b) line up with the 16-bit element data.
+      "trn1     z20.b, z18.b, z19.b                     \n"
+      "trn2     z21.b, z18.b, z19.b                     \n"
+      "st2b     {z20.b, z21.b}, p1, [%[dst]]            \n"

      "99:                                              \n"
      : [src_y] "+r"(src_y),                               // %[src_y]