Rather than using a huge frequency table driven by some state machine
code that I don't even understand, I noticed that the Big5 encoding is
organized from the start into frequent and less-frequent character
ranges (per the Wikipedia page on Big5). This makes it very easy to
gather character statistics by simply checking which class each
character falls into.
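A minimal sketch of the classification, assuming the range boundaries
given by the Wikipedia article (frequently used characters in
0xA440..0xC67E, less frequent ones in 0xC940..0xF9D5); the function
and enum names are illustrative, not uchardet's actual API:

    enum class Big5Class { Frequent, Rare, Other };

    Big5Class ClassifyBig5(unsigned char lead, unsigned char trail)
    {
        // Valid Big5 trail bytes are 0x40..0x7E and 0xA1..0xFE.
        bool validTrail = (trail >= 0x40 && trail <= 0x7E) ||
                          (trail >= 0xA1 && trail <= 0xFE);
        if (!validTrail)
            return Big5Class::Other;

        unsigned int code = ((unsigned int)lead << 8) | trail;
        if (code >= 0xA440 && code <= 0xC67E)
            return Big5Class::Frequent;  // frequently used characters
        if (code >= 0xC940 && code <= 0xF9D5)
            return Big5Class::Rare;      // less frequently used characters
        return Big5Class::Other;         // symbols, reserved, user-defined
    }

A prober can then favor Big5 when the ratio of frequent to rare
characters looks like what real Chinese text produces.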
In a few tests with random Chinese text converted to Big5, this seems
to work pretty well (and fixes the test which was broken by the
previous commit), and it doesn't slow down detection in any
significant way either.
This may also be the next step towards improving the various other
multi-byte encoding detectors, which still rely on generated coding
state machines that mostly elude me.
nsEscCharsetProber will still return only a single candidate, since
detection here is done by a state machine rather than language
statistics. It will now also return the language attached to the
encoding.
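As an illustration only (the type and field names here are
hypothetical, not the actual uchardet interface), each candidate now
conceptually pairs the two pieces of information:

    // Hypothetical shape of a detection candidate: the charset
    // name together with the language the prober attaches to it.
    struct Candidate {
        const char* encoding;  // e.g. "ISO-2022-JP"
        const char* language;  // e.g. "ja"
    };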
... in unspecified behavior.
When compiled with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states depend on each charset's own state machine,
it is not possible to simply extend the enum with more generic values.
Instead, let's make the state an unsigned int and define the 3 generic
states as constants.
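A sketch of what this amounts to, assuming the three generic states
are eStart, eError and eItsMe as in the existing code (the exact
declaration style in the tree may differ):

    /* Before: a plain enum, so loading a charset-specific state
       such as 5 into an nsSMState is undefined behavior. */
    /* typedef enum { eStart = 0, eError = 1, eItsMe = 2 } nsSMState; */

    /* After: the state is just an unsigned int; the 3 generic
       states become constants, and charset-specific machines can
       use any larger values without leaving the valid range. */
    typedef unsigned int nsSMState;
    static const nsSMState eStart = 0;
    static const nsSMState eError = 1;
    static const nsSMState eItsMe = 2;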