The previous test text was about the charsets themselves and was not relevant
for identifying a language. In particular, the special characters that differ
between ISO-8859-1 and ISO-8859-15 were used in isolation, outside of any
character-sequence context. Without language understanding, they could just as
well have represented the ISO-8859-15 letters or the ISO-8859-1 symbols at the
corresponding codepoints.
Replace it with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This text uses some of the same characters (in particular 'œ'), but within
contextual character sequences, which makes it relevant for our algorithm.
I realize that the language a text has been written in is very important,
since it completely changes the character distribution. Our test files should
take this into account: for encodings used across several languages, we should
create several test files, one per language.
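As a rough illustration (the sample sentences below are hypothetical, not our
actual test data), a short Python sketch of how the character distribution a
detector relies on shifts with the language, even within the same encoding:

    from collections import Counter

    def char_distribution(text: str) -> Counter:
        # Count letters only, lower-cased; this is the kind of per-language
        # statistic a charset/language detector builds its models from.
        return Counter(c for c in text.lower() if c.isalpha())

    french = "l'œuf est un aliment de grande consommation"
    german = "das Ei ist ein weit verbreitetes Lebensmittel"

    print(char_distribution(french).most_common(5))
    print(char_distribution(german).most_common(5))

The most frequent letters (and sequences such as 'œu') come out quite
differently per language, which is why one test file per language makes sense
for encodings shared by several languages.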