From cc823114a1b6bca2f44d6e5ee7d6642f20bca7c5 Mon Sep 17 00:00:00 2001
From: George Steed <george.steed@arm.com>
Date: Wed, 22 May 2024 10:58:33 +0100
Subject: [PATCH] [docs] Add documentation on AArch64 SME for feature detection

Give a brief explanation of the Scalable Matrix Extension and where we
believe it will be beneficial, in line with the existing documentation
for Neon and SVE.

Change-Id: I477b7f293c00740ce8346a96a9a0ad133f4ef1c2
Reviewed-on: https://chromium-review.googlesource.com/c/libyuv/libyuv/+/5587508
Reviewed-by: Frank Barchard <fbarchard@chromium.org>
---
 docs/feature_detection.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/docs/feature_detection.md b/docs/feature_detection.md
index 5f494d709..d32e84bfe 100644
--- a/docs/feature_detection.md
+++ b/docs/feature_detection.md
@@ -18,7 +18,7 @@ Neon is available and mandatory in AArch64 from the base Armv8.0-A
 architecture. Neon can be used even if later extensions like the Scalable
 Vector Extension (SVE) are also present. The exception to this is if the CPU is
 currently operating in streaming mode as introduced by the Scalable Matrix
-Extension, which is not currently used in libyuv.
+Extension, described later.
 
 There are also a couple of architecture extensions present for Neon that we can
 take advantage of in libyuv:
@@ -64,6 +64,27 @@ Armv8.6-A or Armv9.1-A, however there is no micro-architecture at time of
 writing where SVE2 is implemented without all previously-mentioned features
 also being implemented.
 
+### The Scalable Matrix Extension (SME)
+
+The Scalable Matrix Extension (SME) is an optional feature introduced from
+Armv9.2-A. SME exists alongside SVE and introduces new execution modes for
+applications performing extended periods of data processing. In particular, SME
+introduces a few new components of interest:
+
+* Access to a scalable two-dimensional ZA tile register and new instructions to
+  interact with rows and columns of the ZA tiles. This can be useful for data
+  transformations like transposes.
+
+* A streaming SVE (SSVE) mode, during which the SVE vector length matches the
+  ZA tile register width. In typical systems where the ZA tile register width
+  is greater than the core SVE vector length, SSVE allows for faster data
+  processing, even if the ZA tile register is unused. While the CPU is
+  executing in streaming mode, Neon instructions are unavailable.
+
+* When both SSVE and the ZA tile register are enabled, there are additional
+  outer-product instructions accumulating into a whole ZA tile, suitable for
+  accelerating matrix arithmetic. This is likely less useful in libyuv.
+
 ## Linux and Android
 
 On AArch64 running under Linux and Android, features are detected by inspecting