The previous commit detects multi-wrap u64 overflow at the max_digits
boundary by re-parsing the digits through a checked multiply-add loop
(O(max_digits)). Replace that with the constant-time check used in
simdjson: the leading digit plus a single threshold comparison.
For a max_digits-length value, ms = min_safe_u64(base), which equals
base^(max_digits-1), is the smallest such value and also the width of
each leading-digit band [d*ms, (d+1)*ms), where d is the leading digit.
Since that width is < 2^64, the only band that can
straddle 2^64 is d == dmax (the largest leading digit that still fits),
and there it straddles at most once, so a single threshold dmax*ms
separates wrapped from non-wrapped values. A leading digit above dmax
always overflows; below dmax always fits. dmax and the threshold derive
from the existing min_safe_u64 table, so no new tables are needed and
dmax*ms cannot itself overflow.
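
The band argument above can be sketched as follows. This is a minimal
illustration, not the committed code: limits_for(), overflows_u64() and
digit_value() are hypothetical names, ms and dmax are computed by a loop
here rather than read from the min_safe_u64 table, and the input is
assumed to be non-empty, valid for the base, and free of leading zeros.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: value of an ASCII digit or letter, 255 if neither.
inline unsigned digit_value(char c) {
  if (c >= '0' && c <= '9') return unsigned(c - '0');
  if (c >= 'a' && c <= 'z') return unsigned(c - 'a') + 10;
  if (c >= 'A' && c <= 'Z') return unsigned(c - 'A') + 10;
  return 255;
}

struct BaseLimits {
  std::size_t max_digits; // digit count of UINT64_MAX in this base
  uint64_t ms;            // base^(max_digits-1), the band width
  uint64_t dmax;          // largest leading digit that can still fit
};

// Assumption: computed on the fly; the real code would use a table.
inline BaseLimits limits_for(unsigned base) {
  uint64_t ms = 1;
  std::size_t digits = 1;
  while (ms <= UINT64_MAX / base) { ms *= base; ++digits; }
  return {digits, ms, UINT64_MAX / ms};
}

// True if the digit string does not fit in 64 bits.
inline bool overflows_u64(const std::string &s, unsigned base) {
  const BaseLimits lim = limits_for(base);
  if (s.size() < lim.max_digits) return false; // always fits
  if (s.size() > lim.max_digits) return true;  // always overflows
  uint64_t v = 0;
  for (char c : s) v = v * base + digit_value(c); // wraps mod 2^64
  const uint64_t lead = digit_value(s[0]);
  if (lead != lim.dmax) return lead > lim.dmax; // band fully in or out
  return v < lim.dmax * lim.ms; // lead == dmax: wrapped iff below threshold
}
```

For base 10 this gives ms = 10^19 and dmax = 1, so "99999999999999999999"
is rejected by its leading digit alone even though it wraps several times.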
Add a programmatic, self-verifying test for parse_int_string overflow
detection covering bases 2..36, complementing the hand-picked strings
added earlier. Every generated input is cross-checked against an
independent trusted oracle (a plain 64-bit checked multiply-add); on
success the parsed value is also compared exactly and full consumption
of the input is asserted.
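
A checked multiply-add oracle of the kind described might look like the
sketch below. The names oracle_parse_u64() and oracle_digit() are
hypothetical and the test's actual oracle may differ in detail; the point
is only that every step is checked before it is performed, so the result
does not depend on the code under test.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: value of an ASCII digit or letter, 255 if neither.
inline unsigned oracle_digit(char c) {
  if (c >= '0' && c <= '9') return unsigned(c - '0');
  if (c >= 'a' && c <= 'z') return unsigned(c - 'a') + 10;
  if (c >= 'A' && c <= 'Z') return unsigned(c - 'A') + 10;
  return 255;
}

// Returns true and stores the value if s is a valid base-`base` number
// that fits in 64 bits; returns false on an empty string, an invalid
// digit, or overflow.
inline bool oracle_parse_u64(const std::string &s, unsigned base,
                             uint64_t &out) {
  if (s.empty()) return false;
  uint64_t v = 0;
  for (char c : s) {
    const unsigned d = oracle_digit(c);
    if (d >= base) return false;
    // v*base + d <= UINT64_MAX  <=>  v <= (UINT64_MAX - d) / base
    if (v > (UINT64_MAX - d) / base) return false;
    v = v * base + d;
  }
  out = v;
  return true;
}
```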
Per base it exercises:
- an exact-boundary sweep of the 64 values straddling 2^64
  (UINT64_MAX-31 .. 2^64+31), built by walking the digit string;
- UINT64_MAX, 2^64 and the all-max-digit value, each also with
  leading zeros;
- random max_digits-length values across every leading digit, with
  the heaviest sampling on the lead == dmax band that straddles 2^64,
  and full coverage of lead > dmax (the multi-wrap region the naive
  min_safe check accepted by mistake);
- max_digits-1 (never overflows) and max_digits+1 (always overflows).
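
Values at or above 2^64 cannot be held in a uint64_t, so the boundary
sweep has to step through them as text. One way to "walk the digit
string" is an in-place increment with carry, sketched below; the name
increment_digits() is hypothetical and the test may build its strings
differently. The input is assumed to contain only valid lowercase digits
for the base.

```cpp
#include <cstddef>
#include <string>

// Adds one to a number written in base `base` (2..36, lowercase digits),
// propagating the carry from the last digit towards the first.
inline std::string increment_digits(std::string s, unsigned base) {
  static const char kDigits[] = "0123456789abcdefghijklmnopqrstuvwxyz";
  for (std::size_t i = s.size(); i-- > 0;) {
    const char c = s[i];
    const unsigned d = (c >= '0' && c <= '9') ? unsigned(c - '0')
                                              : unsigned(c - 'a') + 10;
    if (d + 1 < base) {
      s[i] = kDigits[d + 1]; // no carry needed, done
      return s;
    }
    s[i] = '0'; // digit was base-1: wrap to 0 and carry left
  }
  s.insert(s.begin(), '1'); // carry out of the top digit
  return s;
}
```

Starting from the string for UINT64_MAX-31 and calling this 63 times
yields the 64 consecutive values of the sweep without any 64-bit
arithmetic.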
A small signed (int64_t) section checks that the exact INT64_MIN and
INT64_MAX limits round-trip and that INT64_MAX+1 and INT64_MIN-1 are
rejected in every base.
Use the same std::thread split as exhaustive32_midpoint, preserving each
test's existing failure behavior (abort for exhaustive32, stop-flag for
exhaustive32_64).
When ch_to_digit() is called with a UTF-16 or UTF-32 code unit, it simply
truncates away any data stored in the non-low byte(s) of the code unit.
It then uses a lookup table to determine whether the low byte
corresponds to an ASCII digit. This is incorrect because as soon as any
bit outside the low byte is set, the code unit can no longer correspond
to an ASCII digit.
To fix this, we produce a mask that is all zeroes if any bit outside the
low byte is set in the code unit, and all ones otherwise. ANDing this
mask with the original code unit forces the table lookup to return the
sentinel value at index zero if any high bit was set, so the code unit
is not parsed as an integer.
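
A minimal sketch of the masking idea, under stated assumptions:
DigitTable, kDigitTable and to_digit_masked() are hypothetical stand-ins
for the real lookup table and ch_to_digit(), and 255 is assumed as the
"not a digit" sentinel.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical 256-entry table; 255 marks "not a digit". Index 0 (NUL)
// is not a digit, so it holds the sentinel.
struct DigitTable {
  uint8_t v[256];
  DigitTable() {
    for (int i = 0; i < 256; ++i) v[i] = 255;
    for (int i = 0; i < 10; ++i) v['0' + i] = uint8_t(i);
    for (int i = 0; i < 26; ++i) {
      v['a' + i] = uint8_t(10 + i);
      v['A' + i] = uint8_t(10 + i);
    }
  }
};
static const DigitTable kDigitTable;

template <typename UC> uint8_t to_digit_masked(UC c) {
  using U = typename std::make_unsigned<UC>::type;
  const U u = static_cast<U>(c);
  // fits is 1 if no bit outside the low byte is set, else 0.
  const U fits = static_cast<U>((u & static_cast<U>(0xFF)) == u);
  // Negating turns 1 into all ones and leaves 0 as all zeroes.
  const U mask = static_cast<U>(-fits);
  // A code unit with high bits set collapses to index 0, the sentinel.
  return kDigitTable.v[static_cast<unsigned char>(u & mask)];
}
```

With plain truncation, char32_t(0x1F631) would index the table at 0x31
('1') and come back as the digit 1; with the mask it indexes at 0 and
yields the sentinel.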
This bug was discovered when loading Mastodon posts inside the Ladybird
browser, where some of Mastodon's JavaScript would trigger the code path
that erroneously parsed emoji as integers. It had the visible effect
that some digits inside the posts would get rendered as one of the
emojis that parsed to that digit. For more details see this issue:
https://github.com/LadybirdBrowser/ladybird/issues/6205
The emojis in the test case are simply all the emojis used on Mastodon
that caused the bug. They can be found here:
06803422da/app/javascript/mastodon/features/emoji/emoji_map.json