103 Commits

Author SHA1 Message Date
Jehan
d1bc09e4d7 Update authors.
I think I deserved being listed in the authors by now. ;-)
2015-12-03 19:44:13 +01:00
Jehan
c4fa728e7a Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master
Let's shortcut Single Byte charset detection on invalid codepoints.
Merging and fixing the contributor's commit conflicts after code
redesign: in particular we added an illegal character concept (illegal
characters were mixed with control characters in the current charmaps, yet
control characters are NOT to be considered invalid) and constants instead
of hardcoded numbers ('ILL' rather than 255).
2015-12-03 19:26:19 +01:00
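The shortcut described in this commit can be illustrated with a minimal Python sketch (the project itself is C++). Only the name 'ILL' and the value 255 come from the commit message; the 'CTR' value, the function name and the charmap layout are assumptions made for illustration.

```python
from typing import Sequence

ILL = 255  # illegal codepoint (name and value taken from the commit message)
CTR = 254  # control character; this value is an assumption for the sketch


def may_be_charset(data: bytes, char_to_order: Sequence[int]) -> bool:
    """Return False as soon as one byte maps to an illegal codepoint.

    Control characters are valid, so they do not stop the detection.
    """
    for byte in data:
        if char_to_order[byte] == ILL:
            return False  # shortcut: this charset cannot be the right one
    return True


# Hypothetical charmap: everything is a plain letter (order 0), except
# one control character and one codepoint left undefined by the charset.
charmap = [0] * 256
charmap[0x0A] = CTR
charmap[0x81] = ILL
```

With this hypothetical charmap, a text containing a line feed is still a candidate, while a text containing byte 0x81 is rejected immediately.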
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. 2015-12-03 19:04:07 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
f4f9fc3f28 test: reenable Windows-1251 test for Russian.
Commit 4f1c3ff actually fixed it!
2015-12-02 21:53:27 +01:00
Jehan
9dd6b34e93 test: add French UTF-8 test.
Text from:
https://fr.wikipedia.org/wiki/UTF-8
2015-11-30 20:03:33 +01:00
Jehan
4f1c3ff85e nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.
If all sequences in a text are positive sequences, the ratio of positive
sequences cannot make the difference between 2 very close charsets.
A ratio of positive sequences per letter, on the other hand, will
break a tie between 2 encodings. If, while adding a letter, the number
of positive sequences does not increase, the confidence will decrease
(corresponding to the fact that it was likely not a letter).
On the other hand, if the number of positive sequences increases, so
will the confidence.
For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
2015-11-30 19:52:07 +01:00
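The effect of the extra factor can be sketched in Python as follows. This is a simplified illustration, not the actual nsSBCharSetProber code: the function name, the parameter names and the 0.99 cap are assumptions.

```python
def confidence(positive_seqs, total_seqs, total_chars, typical_positive_ratio):
    """Sketch of a confidence weighted by positive sequences per character."""
    if total_seqs == 0 or total_chars == 0:
        return 0.0  # nothing to judge from (assumed fallback value)
    # Classic part: share of positive sequences, normalized by the ratio
    # expected for this language model.
    r = positive_seqs / total_seqs / typical_positive_ratio
    # Added part: positive sequences per character. A character that adds
    # no positive sequence lowers the result.
    r *= positive_seqs / total_chars
    return min(r, 0.99)  # assumed cap so the result stays below 1


# Two close charsets reading the same 12 characters. In both cases every
# sequence is positive, so the first factor alone would give a tie.
close_a = confidence(10, 10, 12, 0.98)
close_b = confidence(11, 11, 12, 0.98)
```

Here `close_b` ends up higher than `close_a`, because the second charset turned one more character into a positive sequence.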
Jehan
9cb5764b73 LangModels: update the French language models.
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dc5caa46bc BuildLangModel: fix hardcoded file names. 2015-11-30 19:18:25 +01:00
Jehan
3e5d37a6b5 BuildLangModel: process pages level per level.
I.e. horizontally or "breadth first" rather than vertical tree traversal.
This makes sure that all the start pages in particular are searched
when using the max_page option.
2015-11-30 19:12:04 +01:00
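A level-per-level traversal with a page limit could look like the sketch below. It is not the BuildLangModel.py code: `crawl`, `get_links` and the in-memory link graph are hypothetical stand-ins for the Wikipedia queries.

```python
from collections import deque


def crawl(start_pages, get_links, max_depth, max_page=None):
    """Visit pages breadth first, stopping after max_page pages."""
    start_pages = list(dict.fromkeys(start_pages))  # drop duplicates, keep order
    seen = set(start_pages)
    queue = deque((title, 0) for title in start_pages)
    visited = []
    while queue:
        if max_page is not None and len(visited) >= max_page:
            break
        title, depth = queue.popleft()
        visited.append(title)
        if depth < max_depth:
            for link in get_links(title):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited


# Hypothetical link graph standing in for real pages.
graph = {"A": ["A1", "A2"], "B": ["B1"], "A1": ["A11"]}
pages = crawl(["A", "B"], lambda title: graph.get(title, []),
              max_depth=2, max_page=3)
```

Because the queue is processed first in, first out, both start pages "A" and "B" are visited before any linked page, even with a limit of 3 pages. A depth-first traversal would have spent the budget on "A", "A1" and "A11".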
Jehan
04f9309932 tests: update ISO-8859-15 French test file.
The previous technical text about charsets themselves was not relevant
for identifying a language. In particular the special characters that differ
between ISO-8859-1 and ISO-8859-15 were used by themselves, out of a
char sequence context. Therefore, without language understanding, they
could just as well have represented the ISO-8859-15 letters or the
ISO-8859-1 symbols at the corresponding codepoints.
Replacing with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This uses some of these same characters (in particular 'œ') but in
contextual character sequences, making it relevant for our algorithm.
2015-11-30 00:19:15 +01:00
Jehan
d9d347099e BuildLangModel: fix some minor comment from a previous spec. 2015-11-30 00:09:23 +01:00
Jehan
192f8de165 BuildLangModel: build models with computed frequent characters count. 2015-11-30 00:04:44 +01:00
Jehan
429448199f French language model: fix a start page.
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
b64831ff89 BuildLangModel: allow a list of start pages...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631 BuildLangModel: the SequenceModel naming must include the language name. 2015-11-29 15:49:56 +01:00
Jehan
c59465adfc BuildLangModel: save lang model directly in the right directory. 2015-11-29 13:26:10 +01:00
Jehan
72fbd33dec Add a .gitignore. 2015-11-29 02:27:42 +01:00
Jehan
290fbd2e2e BuildLangModel: add the licensing header to generated files. 2015-11-29 02:26:33 +01:00
Jehan
7f290975ba BuildLangModel: map different cases of the same character together.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer
characters (but still in the main alphabet) to enter the frequent
character list. For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
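The idea of case mapping can be sketched as follows. Only the `case_mapping` property name comes from the commit message; the function and its use of `str.lower()` are assumptions for illustration.

```python
from collections import Counter


def count_characters(text, case_mapping=False):
    """Count letters, optionally merging upper and lower case together."""
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # skip spaces, digits and punctuation
        if case_mapping:
            # Assumption: lowercasing is a good enough mapping here. Some
            # characters lowercase to several codepoints, which a real
            # script would have to handle.
            char = char.lower()
        counts[char] += 1
    return counts


separate = count_characters("Œuf œuf")
merged = count_characters("Œuf œuf", case_mapping=True)
```

Without case mapping, 'Œ' and 'œ' are counted once each. With case mapping they add up under 'œ', which gives the character a better chance of reaching the frequent character list.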
Jehan
00a78faa1d BuildLangModel: the max_depth should be a script option...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424 BuildLangModel: add a --max-page option to limit data size.
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
Jehan
0314f98ece BuildLangModel.py: some in-progress script to build language models. 2015-11-29 01:30:04 +01:00
Jehan
a8e9de307b Add UTF-16 test files without BOM...
... and disable the tests for these for now, since uchardet is not yet
able to detect UTF-16 without a BOM.
2015-11-28 19:50:18 +01:00
Jehan
92efc0b0b0 Update README: Unicode is "International". 2015-11-28 19:44:13 +01:00
Jehan
573b303fe3 Add an ASCII test file for English...
... with escape characters because even with ESC, a file is ASCII
unless proven otherwise.
2015-11-28 17:49:13 +01:00
Jehan
0289c2a232 Differentiate ASCII and detection failure.
The lib used to return "" for both properly detected ASCII and
detection failure. And the tool would return "ascii/unknown".
Make a proper distinction between the 2 cases.
2015-11-28 17:04:52 +01:00
Jehan
4dbc6e7ab3 Update README with French support. 2015-11-28 02:20:57 +01:00
Jehan
50588ba375 Add an ISO-8859-15 test file for French. 2015-11-28 02:18:57 +01:00
Jehan
005fd98086 Add initial support for French with ISO-8859-1 and ISO-8859-15.
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
2106173546 Move all Single-Byte language models to a subdirectory. 2015-11-27 23:11:23 +01:00
Jehan
b67370230b Update README and manual...
... to indicate several files can be specified on command line.
2015-11-27 18:27:11 +01:00
Jehan
984d8f7b09 Add language information in model names when they were missing.
Models are language specific (there could be several models for the same
charset but different languages). Let's have a clear naming scheme.
2015-11-27 18:21:13 +01:00
Jehan
c61e65aeb3 s/MACCYRILLIC/MAC-CYRILLIC/
Write encoding names in README same as what uchardet returns.
2015-11-27 18:19:02 +01:00
Jehan
942ac05ff5 Add some Russian test files.
Texts from:
IBM855: https://ru.wikipedia.org/wiki/CP855
IBM866: https://ru.wikipedia.org/wiki/Альтернативная_кодировка
MAC-CYRILLIC: https://ru.wikipedia.org/wiki/MacCyrillic
2015-11-27 18:17:20 +01:00
Jehan
42b91898da Create 3-letter constants for special charmap characters.
Control characters, carriage, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
2015-11-27 17:41:54 +01:00
Jehan
7fa0fefef8 Add UTF-16 and UTF-32 test files in French, with BOM.
Unfortunately uchardet currently seems unable to detect UTF-16/32
text without a BOM.
2015-11-26 02:45:00 +01:00
Ophir LOJKINE
5ef60164fc Stop detection early on control characters 2015-11-24 22:07:41 +03:00
Jehan
e8dd55995a Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info...
... and add UTF-32 BOM detection.
2015-11-24 18:50:23 +01:00
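BOM-based detection with endianness information can be sketched in Python using the standard `codecs` constants. The function name and the exact result strings are assumptions; the commit only says that an "LE/BE" suffix is added to the "UTF-16" result.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 little endian BOM
# (FF FE 00 00) starts with the UTF-16 little endian BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def detect_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A text without a BOM returns None in this sketch, which matches the limitation noted in nearby commits: UTF-16 and UTF-32 without a BOM are not detected.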
Jehan
9a74d08b3c Fix minor space issues. 2015-11-24 00:15:44 +01:00
Jehan
d082704fec Add Mageia command and specify Mint compatibility. 2015-11-23 17:46:01 +01:00
Jehan
ff5fd5eff9 Release: version 0.0.3. v0.0.3 2015-11-19 15:18:11 +01:00
Jehan
5dcff7b241 Hide away tests known to fail.
Some charsets are simply not supported (ex: fr:iso-8859-1), some are
temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or not efficient) from the start, and there is no
need to pollute the `make test` output with these, which may make us
miss actual regressions when they occur. So let's hide these away for
now until we can improve the situation.
2015-11-18 20:02:58 +01:00
Jehan
4b38e68aa2 CMake tests: separate the lang and charset with colon...
... rather than a hyphen. It makes it easier to read.
2015-11-18 19:42:35 +01:00
Jehan
35153b1e50 Fixes boolean operation precedence warnings...
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
2015-11-18 19:38:12 +01:00
Jehan
0d70a36910 Adding some more test files for Russian and Chinese.
Taken from:
https://zh.wikipedia.org/wiki/EUC
https://ru.wikipedia.org/wiki/КОИ-8
And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.
2015-11-18 19:27:38 +01:00
Jehan
eb727d3aca Add automatic testing against every test file. 2015-11-18 18:18:27 +01:00
Jehan
f303a41735 Add Thai test file for UTF-8.
Text from Thai Wikipedia:
https://th.wikipedia.org/wiki/ยูนิโคด
2015-11-18 03:26:34 +01:00
Jehan
9d9257072a s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming. 2015-11-18 03:21:34 +01:00