parsed_number_string_t carries two span<UC const> members (integer, fraction)
that are only read on the rare slow paths (digit_comp, and the >19-significant-
digit truncation recompute). Materializing them on every parse forces the ~56/64-
byte struct to be written out and marshaled through the by-value return, which
shows up as backend/store pressure on the hot path.
This adds a runtime `store_spans` flag (default true, so all existing callers are
unchanged) to parse_number_string; from_chars_float_advanced parses with it false,
attempts the Clinger and Eisel-Lemire fast paths inline, and only re-parses with
spans on the two rare slow branches. The re-parse is pushed into a single
`fastfloat_noinline` (noinline+cold) helper so the force-inlined hot scanner is
emitted once rather than duplicated into the caller (without this the extra inline
copies regress some targets, e.g. ARM gcc, by bloating the hot frame and lengthening
the loop-carried dependency chain).
A runtime flag is used deliberately rather than a template parameter: a template
would create a second instantiation of the whole scanner whose icache cost wipes
out the gain.
Measured (per-parser microbench, median of 5, pinned core), fast_float from_chars
<double>/<float>, vs the current tip:
- Intel Ice Lake (Xeon 8360Y): +17-19% (gcc), Intel TMA shows backend-bound
26.0% -> 2.2% and retiring 60.3% -> 77.3% on short floats (the eliminated span
spill), with -36% pipeline slots.
- Intel Cascade Lake (Xeon 6248): +18-22% (gcc), +13-23% (clang).
- ARM Neoverse-V2 (Graviton4): +73-196% (gcc), +8-11% (clang) -- the struct spill
dominated the gcc hot loop there.
Correctness: the full float exhaustive suite (exhaustive32, exhaustive32_64,
exhaustive32_midpoint, random64) passes, and a 2^32 sweep is byte-identical to the
current tip. Public from_chars / from_chars_advanced / parsed_number_string_t are
unchanged.
Same std::thread split as exhaustive32_midpoint; preserves each test's existing
failure behavior (abort for exhaustive32, stop-flag for exhaustive32_64).
After the 8-digit SWAR block loop, consume a remaining 4-7 digit run in one
read4_to_u32 + parse_four_digits_unrolled step instead of byte-by-byte (reusing
the existing 4-digit helpers). The parsed result is identical; this is purely a
faster way to consume the same digits.
Gated to clang: on gcc the extra 4-digit check regresses inputs whose remainder
is < 4 digits (e.g. the 17-digit fraction of uniform [0,1] -> -3% on 'random'),
because the check becomes pure overhead there; clang does not show that.
m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double, clang 18, base vs patch back-to-back (2 samples):
canada.txt +11.7%, mesh.txt +7.4%, random ~flat. No regression.
parse_number_string scans the integer part one byte at a time in a while loop,
while the fraction already uses the 8-digit SWAR loop. Most integer parts are
1-5 digits, so the loop back-edge dominates. Peel the first five iterations into
nested ifs, falling through to the original while for longer runs. Semantics are
identical (i = 10*i + digit, advancing p); no behavior change.
AWS m8g.metal-24xl (Graviton4), -O3 -march=native, simple_fastfloat_benchmark,
from_chars->double. base vs patch measured back-to-back, mean of 2 runs:
canada: gcc +3.1%, clang +2.8%
mesh: gcc +5.4%, clang +5.1%
random: ~flat (1-digit integer part)
No regression; gcc and clang agree.
Alternatives benchmarked and rejected: reusing loop_parse_if_eight_digits for the
integer part regressed 5-8% (integer parts are too short for 8-digit SWAR setup);
a counted for(k<5) loop matched on gcc but clang optimized it worse (canada -0.9%).
The explicit peel is the only form solidly positive on both compilers.
Replaces uses of std::min with ternary operators in ascii_number.h, digit_comparison.h, and float_common.h to remove the dependency on the <algorithm> header in those files.
With bzlmod, native rules like cc_library are no longer implicitly available
and must be explicitly loaded from rules_cc. Add the rules_cc dependency to
MODULE.bazel and the corresponding load statement to BUILD.bazel.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>