Add a 4-digit SWAR follow-up to loop_parse_if_eight_digits (clang)

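After the 8-digit SWAR block loop, if at least 4 bytes remain and the next 4
are all digits, consume them in one read4_to_u32 + parse_four_digits_unrolled
step instead of byte-by-byte (reusing the existing 4-digit helpers). This
covers the first 4 digits of a remaining 4-7 digit run; the last 0-3 digits
still go through the existing byte-by-byte path. The parsed result is
identical; this is purely a faster way to consume the same digits.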
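
As an illustration of the step described above, here is a minimal,
self-contained sketch of a 4-digit SWAR check and parse. The names load4_le,
all_four_digits, four_digits_value, consume_four_digits and
consume_digits_bytewise are hypothetical stand-ins written for this note, not
the library's helpers, and the byte load is assembled by hand so that the first
character always sits in the lowest byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 4-byte load: builds the word explicitly so the
// first character ends up in the lowest byte regardless of host endianness.
inline uint32_t load4_le(const char *p) {
  return uint32_t(uint8_t(p[0])) | (uint32_t(uint8_t(p[1])) << 8) |
         (uint32_t(uint8_t(p[2])) << 16) | (uint32_t(uint8_t(p[3])) << 24);
}

// True when every byte is in '0'..'9': adding 0x46 sets the high bit of any
// byte above '9', subtracting 0x30 sets it for any byte below '0'.
inline bool all_four_digits(uint32_t v) {
  return (((v + 0x46464646u) | (v - 0x30303030u)) & 0x80808080u) == 0;
}

// Converts four ASCII digits (first digit in the lowest byte) to 0..9999.
inline uint32_t four_digits_value(uint32_t v) {
  v -= 0x30303030u;       // ASCII -> digit value in each byte
  v = v * 10 + (v >> 8);  // bytes 0 and 2 now hold two-digit pairs
  return (((v & 0x00FF00FFu) * 0x00640001u) >> 16) & 0xFFFFu;
}

// Consumes exactly four digits in one step if possible; otherwise leaves
// p and i untouched and returns false.
inline bool consume_four_digits(const char *&p, const char *pend,
                                uint64_t &i) {
  assert(p <= pend);
  if (pend - p < 4) {
    return false;
  }
  uint32_t const v = load4_le(p);
  if (!all_four_digits(v)) {
    return false;
  }
  i = i * 10000 + four_digits_value(v);
  p += 4;
  return true;
}

// Byte-by-byte reference path, used here for the leftover 0-3 digits.
inline void consume_digits_bytewise(const char *&p, const char *pend,
                                    uint64_t &i) {
  while (p != pend && *p >= '0' && *p <= '9') {
    i = i * 10 + uint64_t(*p - '0');
    ++p;
  }
}
```

Calling consume_four_digits and then consume_digits_bytewise on a run should
give the same accumulator as the byte-by-byte loop alone; only the number of
steps differs.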

Gated to clang: on gcc the extra 4-digit check regresses inputs whose remainder
is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3% on 'random'),
because the check becomes pure overhead there; clang does not show that.
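
The remainder argument above can be made concrete with a little arithmetic. The
sketch below is only an illustration under a simplifying assumption: the digit
run is exactly n digits long and is followed by a non-digit or the end of
input. DigitPlan and plan_digit_run are hypothetical names, not part of the
patch.

```cpp
#include <cassert>
#include <cstddef>

// How an n-digit run splits across the three paths (hypothetical model).
struct DigitPlan {
  std::size_t blocks8;  // iterations of the 8-digit SWAR loop
  std::size_t blocks4;  // 1 if the 4-digit follow-up fires, else 0
  std::size_t bytewise; // digits left for the byte-by-byte path
};

inline DigitPlan plan_digit_run(std::size_t n) {
  DigitPlan plan;
  std::size_t const rem = n % 8;  // digits left after the 8-digit loop
  plan.blocks8 = n / 8;
  plan.blocks4 = (rem >= 4) ? 1 : 0;
  plan.bytewise = rem % 4;        // rem - 4 if the follow-up fired, else rem
  return plan;
}
```

For a 17-digit fraction this gives two 8-digit blocks, no 4-digit step, and one
byte-wise digit, so the extra check never pays off on such inputs. A 12-digit
run gives one 8-digit block plus one 4-digit step and nothing left over.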

m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double, clang 18, base vs patch back-to-back (2 samples):
  canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.
fcostaoliveira 2026-06-01 10:55:04 +01:00
parent 7790aa6231
commit 7589a4fea5

@@ -266,6 +266,21 @@ loop_parse_if_eight_digits(char const *&p, char const *const pend,
            p)); // in rare cases, this will overflow, but that's ok
    p += 8;
  }
  // Consume a remaining 4-7 digit run in a single SWAR step instead of
  // byte-by-byte (reuses the existing 4-digit helpers). The parsed result is
  // identical either way. Gated to clang: on gcc the extra 4-digit check
  // regresses inputs whose remainder is shorter than 4 digits (it becomes pure
  // overhead there); clang does not show that.
#if defined(__clang__)
  if ((pend - p) >= 4) {
    uint32_t const val4 = read4_to_u32(p);
    if (is_made_of_four_digits_fast(val4)) {
      i = i * 10000 +
          parse_four_digits_unrolled(val4); // may overflow, that's ok
      p += 4;
    }
  }
#endif
}
enum class parse_error {