2 Commits

Author SHA1 Message Date
Jehan
9518f4d7a2 Rebuild a bunch of language models.
Adding generic language model (see coming commit), which uses the same
data as specific single-byte encoding statistics model, except that it
applies it to unicode code points.
For this to work, instead of the CharToOrderMap which was mapping
directly from encoded byte (always 256 values) to order, now we add an
array of frequent characters, ordered by generic unicode code points to
the order of frequency (which can be used on the same sequence mapping
array).

This of course means that each prober where we will want to use these
generic models will have to implement their own byte to code point
decoder, as this is per-encoding logics anyway. This will come in a
subsequent commit.
2021-03-16 12:35:18 +01:00
Jehan
9c3c37517c LangModels: add Arabic support.
Models constructed for ISO-8859-6 and Windows-1256.
2015-12-13 18:42:16 +01:00