I.e. horizontally or "breadth first" rather than vertical tree traversal.
This allows to make sure all the start pages in particular are searched,
when using max_page option.
Previous technical text about charsets themselves were not relevant
to identify a language. In particular the special characters different
between ISO-8859-1 and ISO-8859-15 were used by themselves, out of a
char sequence context. Therefore without language understanding, they
could have as well been representing the ISO-8859-15 letters or the
ISO-8859-1 symbols at the corresponding codepoints.
Replacing with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This uses some of these same characters (in particular 'œ') but in
contextual character sequences, making it relevant for our algorithm.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some language, and would allow to enter some rarer
characters (but still in the main alphabet) inside the frequent
character list. For instance 'œ' and 'Œ' in French.
The lib used to return "" for both properly detected ASCII and
detection failure. And the tool would return "ascii/unknown".
Make a proper distinction between the 2 cases.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
Control characters, carriage, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
Some charsets are simply not supported (ex: fr:iso-8859-1), some are
temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or not efficient) from the start, and there is no
need to pollute the `make test` output with these, which may make us
miss when actual regressions will occur. So let's hide these away for
now until we can improve the situation.
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
I realize that the language information a text has been written in is
very important since it would completely change the character
distribution. Our test files should take this into account, and we
should create several test files in different languages for encoding
used in various languages.
Taken from French page about ISO-8859-1:
https://fr.wikipedia.org/wiki/ISO_8859-1
... and Hungarian Wikipedia page about ISO-8859-2:
https://hu.wikipedia.org/wiki/ISO/IEC_8859-2
We don't have support for ISO-8859-1, and both these files are detected
as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since
Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is
disabled because the ISO-8859-1 file would be detected as ISO-8859-2,
which would in turn be a clear error.