This way it always shows up in ccmake, even if not defined.
A string is used instead of path because I personally think it makes more
sense in the following use-cases:
STRING:
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
    installs everything to /home/user/{lib,etc,share,(...)} and
    executables to ${CMAKE_INSTALL_PREFIX}/bins
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
    installs everything to /home/user/{lib,etc,share,(...)} and
    executables to /opt/bin
PATH:
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=bins
    installs everything to /home/user/{lib,etc,share,(...)} and
    executables to $(pwd)/bins (!)
  -DCMAKE_INSTALL_PREFIX=/home/user -DCMAKE_INSTALL_BINDIR=/opt/bin
    same as STRING
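
The resolution rules in the two cases above can be mimicked in a few
lines of Python. This is only an analogy, not CMake code: posixpath.join
stands in for the way a relative install destination is resolved against
the prefix, and the cwd argument is a hypothetical stand-in for the
directory cmake was run from.

```python
import posixpath

def resolve_as_string(prefix, bindir):
    # STRING: the value is kept verbatim. A relative destination is
    # resolved against the install prefix, an absolute one is used as is.
    return posixpath.join(prefix, bindir)

def resolve_as_path(bindir, cwd):
    # PATH: a relative value is made absolute against the directory
    # cmake was invoked from (cwd is a stand-in for $(pwd)).
    return posixpath.normpath(posixpath.join(cwd, bindir))

print(resolve_as_string("/home/user", "bins"))      # /home/user/bins
print(resolve_as_string("/home/user", "/opt/bin"))  # /opt/bin
print(resolve_as_path("bins", "/tmp/build"))        # /tmp/build/bins
print(resolve_as_path("/opt/bin", "/tmp/build"))    # /opt/bin
```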
I was planning on adding VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate the proper
statistics data with the current version of the script.
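
A quick way to check whether a given Python installation knows a charset
before trying to generate data for it. This is a minimal sketch;
codec_available is a hypothetical helper name, and the result for
"viscii" depends on the Python version in use.

```python
import codecs

def codec_available(name):
    # codecs.lookup() raises LookupError for unknown encodings.
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

for name in ("iso-8859-1", "cp1258", "viscii"):
    print(name, codec_available(name))
```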
I disable only ISO-8859-15, which is similar to ISO-8859-1 for all
Spanish letters. Unfortunately, illegal codepoints are similar too.
The distinction should probably be made on symbols (like the euro
symbol), but our current algorithm does not take symbols into account
when comparing charsets.
Text from https://es.wikipedia.org/wiki/España
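
To see how little separates the two charsets, the bytes on which they
decode differently can be listed with Python's own codecs. This is an
illustrative sketch only, not part of the detector; differing_bytes is a
hypothetical helper.

```python
def differing_bytes(charset_a, charset_b):
    # Compare how each high byte decodes in the two charsets.
    diffs = {}
    for value in range(0xA0, 0x100):
        raw = bytes([value])
        a, b = raw.decode(charset_a), raw.decode(charset_b)
        if a != b:
            diffs[value] = (a, b)
    return diffs

diffs = differing_bytes("iso-8859-1", "iso-8859-15")
for value, (latin1, latin9) in diffs.items():
    print(f"0x{value:02X}: {latin1!r} vs {latin9!r}")

# Spanish letters and punctuation encode to the same bytes in both.
spanish = "áéíóúüñÁÉÍÓÚÜÑ¿¡"
print(spanish.encode("iso-8859-1") == spanish.encode("iso-8859-15"))
```

Only a handful of positions differ, the euro symbol at 0xA4 among them,
and none of them is a Spanish letter.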
ISO-8859-2 and Windows-1250 are identical for all letters in the
Hungarian alphabet. So for most texts, it is not an error to return
one charset or the other.
What can make the difference is, for instance, that Windows-1250 has
symbols (quotes, dashes, the euro symbol…) where ISO-8859-2 has
control characters.
Since control characters now have a negative impact on confidence,
detection of texts with such symbols will tend towards Windows-1250.
The new test file has such quote symbols.
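
Both claims can be checked with Python's codecs. This is a sketch using
a made-up sample string, not the actual test file:

```python
import unicodedata

# All accented Hungarian letters get the same bytes in both charsets.
hungarian = "áéíóöőúüűÁÉÍÓÖŐÚÜŰ"
same_letters = (hungarian.encode("iso-8859-2")
                == hungarian.encode("windows-1250"))
print(same_letters)

# Typographic quotes exist in Windows-1250 only. Read back as
# ISO-8859-2, their bytes become C1 control characters.
quoted = "„idézet”".encode("windows-1250")
as_latin2 = quoted.decode("iso-8859-2")
controls = [c for c in as_latin2 if unicodedata.category(c) == "Cc"]
print(len(controls))
```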
There is no "exception" in encoding. The non-breaking space 0xA0 is not
ASCII, and therefore returning "ASCII" will later create issues (for
instance trying to re-encode with iconv produces an error).
This was obviously an explicit decision in the original code (according
to code comments), probably tied to the specifics of the original
program from Mozilla. Now we want strict detection.
I will return "ISO-8859-1" for "nearly-ASCII texts with NBSP as only
exception" (note that I could have returned any ISO-8859 charset since
they all have this character in common).
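
The problem is easy to reproduce. In this small sketch Python's strict
decoder stands in for iconv, and is_strict_ascii is a hypothetical
helper:

```python
data = b"nearly\xa0ASCII"

def is_strict_ascii(data):
    # ASCII only covers 0x00-0x7F, so 0xA0 disqualifies the text.
    return all(byte < 0x80 for byte in data)

print(is_strict_ascii(data))

try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)

# The same byte is NBSP (U+00A0) in the ISO-8859 charsets tried here.
for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    print(charset, data.decode(charset)[6] == "\u00a0")
```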
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
It also broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE would create a text without a BOM, and a
conversion from UTF-16LE/BE to UTF-8 would create a UTF-8 text with a
BOM, which changed existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
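
The labelling rule and the round-trip problem can be sketched in Python.
label_from_bom is a hypothetical helper written for illustration, not
uchardet code, and it ignores the corner case of UTF-16LE text that
starts with a BOM followed by U+0000.

```python
import codecs

def label_from_bom(data):
    # UTF-32 is tested first because its little-endian BOM starts
    # with the same two bytes as the UTF-16 little-endian BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None

with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
print(label_from_bom(with_bom))

# Decoded as "UTF-16", the BOM is consumed.
print(repr(with_bom.decode("utf-16")))
# Decoded as "UTF-16LE", the BOM stays in the text as U+FEFF ...
print(repr(with_bom.decode("utf-16-le")))
# ... and ends up as a UTF-8 BOM after conversion.
print(with_bom.decode("utf-16-le").encode("utf-8"))
```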
ISO-8859-11 is identical to TIS-620, with the addition of the
non-breaking space character.
In practice our detection will always return TIS-620, except in the
rare cases where a text contains a non-breaking space.
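
A small check with Python's codecs, as a sketch. encodable is a
hypothetical helper, and whether the "tis-620" codec accepts NBSP
depends on Python's own mapping table, so that last result is printed
rather than assumed.

```python
def encodable(text, charset):
    # True if every character of text exists in charset.
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Thai text yields the same bytes in both charsets.
thai = "สวัสดี"
print(thai.encode("tis-620") == thai.encode("iso8859_11"))

# The non-breaking space is where the two may differ.
print(encodable("\u00a0", "iso8859_11"))
print(encodable("\u00a0", "tis-620"))
```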
… for langs for which Python lower() algorithm fails.
In particular, the Turkish dotted/dotless 'i' does not follow the same
rules as in common western languages.
The lowercase of 'I' is not 'i' but 'ı'.
The uppercase of 'i' is not 'I' but 'İ'.
I hit this with the Turkish dotted 'İ': lowercasing it with Python's
algorithm returned a decomposed character sequence that Python was not
able to recompose. Therefore ord() raised a TypeError because the
string length was 2.
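
The failure and one possible workaround, as a minimal sketch. The
TURKISH_LOWER table and turkish_lower() are hypothetical names for
illustration, not necessarily what the script ends up using.

```python
# Language-specific overrides applied before falling back to lower().
TURKISH_LOWER = {"I": "\u0131",   # 'I' -> dotless 'ı'
                 "\u0130": "i"}   # dotted 'İ' -> 'i'

def turkish_lower(char):
    return TURKISH_LOWER.get(char, char.lower())

# Python's default lowercasing yields 'i' + U+0307 (combining dot).
default = "\u0130".lower()
print(len(default))

try:
    ord(default)
except TypeError as error:
    print("ord() failed:", error)

# With the override, the result is a single code point again.
print(ord(turkish_lower("\u0130")) == ord("i"))
```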
Control characters are not an error per se. Nevertheless they are clearly
infrequent in single-byte charset texts, so it is only normal for them to
lower confidence in a charset. In particular, a higher ctrl-per-letter
ratio means a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
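
To illustrate the idea only: the sketch below computes a ctrl-per-letter
ratio for a byte string under two candidate charsets and applies a
made-up linear penalty. The function names and the formula are
assumptions for the example; the actual weighting in the detector is
different code.

```python
import unicodedata

def ctrl_per_letter(data, charset):
    text = data.decode(charset)
    # Count control characters, ignoring ordinary whitespace controls.
    ctrl = sum(1 for c in text
               if unicodedata.category(c) == "Cc" and c not in "\t\n\r")
    letters = sum(1 for c in text if c.isalpha())
    return ctrl / letters if letters else 0.0

def penalized(confidence, ratio):
    # Hypothetical penalty: scale confidence down linearly.
    return confidence * max(0.0, 1.0 - ratio)

# German quotes are symbols in Windows-1252, C1 controls in ISO-8859-1.
data = "„Straße“".encode("windows-1252")
for charset in ("windows-1252", "iso-8859-1"):
    ratio = ctrl_per_letter(data, charset)
    print(charset, ratio, penalized(0.9, ratio))
```

With the same starting confidence, the ISO-8859-1 reading is penalized
for the two control characters while the Windows-1252 reading is not.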