127 Commits

Author SHA1 Message Date
Jehan
886e03a523 Release: version 0.0.5. v0.0.5 2015-12-04 22:45:26 +01:00
Jehan
fe7bf3e994 test: update UTF-16 and UTF-32 tests after label changing. 2015-12-04 19:46:51 +01:00
Jehan
e5234d6b61 Stating endianness of UTF-16 and UTF-32 was an error when BOM present.
According to RFC 2781, section 3.3: "Systems labelling UTF-16BE/LE text
MUST NOT prepend a BOM to the text."
Since uchardet cannot (and should not, obviously, it's not its role)
modify input text, when a BOM is present, we should always label the
encoding as "UTF-16" only.
It also broke unit tests in programs using uchardet, since a conversion
from UTF-8 to UTF-16LE/BE would create a text without BOM, and a
conversion from UTF-16LE/BE to UTF-8 creates a UTF-8 text with a BOM,
which changed existing behaviours.
Same goes for UTF-32.
See also Unicode 5.0.0 standard, section 3.10 (tables 3.8 and 3.9 in
particular).
2015-12-04 19:19:39 +01:00
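
The labelling rule described in this commit can be sketched as follows. This is a minimal Python illustration, not uchardet's C++ code; the function name `label_from_bom` is hypothetical. It only shows that a text starting with a BOM gets the generic "UTF-16" or "UTF-32" label, with no endianness suffix.

```python
import codecs

def label_from_bom(data: bytes):
    """Return a generic Unicode label when a BOM is present, else None.

    Hypothetical helper: the endianness is deliberately not stated,
    since a BOM-prefixed text must not be labelled UTF-16LE/BE.
    """
    # UTF-32 must be checked first: the UTF-32LE BOM (FF FE 00 00)
    # starts with the UTF-16LE BOM (FF FE).
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    return None
```

Python's own codecs behave the same way as the conversions mentioned above: encoding to "utf-16" prepends a BOM, while encoding to "utf-16-le" or "utf-16-be" does not.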
Jehan
2856e68aac README: reorganize support list by alphabetic order.
(Except for "International" and "Others")
2015-12-04 03:33:22 +01:00
Jehan
5691dc59a1 LangModels: rename Cyrillic models to Russian models.
Our language models are per-lang, not per script.
2015-12-04 03:27:29 +01:00
Jehan
569509f844 BuildLangModel: forgot to add logs for Thai models generation. 2015-12-04 03:26:52 +01:00
Jehan
dc03ea002f README: supports are per-language rather than per script system.
In particular separate "Cyrillic" into "Russian" and "Bulgarian"
(currently our only 2 supported languages using Cyrillic script).
2015-12-04 03:22:05 +01:00
Jehan
fb3c47a073 LangModels: add ISO-8859-11 and regenerate TIS-620 Thai models.
ISO-8859-11 is identical to TIS-620, with the added non-breaking space
character.
Our detection will therefore always return TIS-620, except for the
exceptional cases when a text has a non-breaking space.
2015-12-04 03:14:52 +01:00
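
The consequence stated in this commit can be pictured with a small Python sketch. It assumes the input has already been recognized as Thai single-byte text; the function name `label_thai_charset` is hypothetical, and the real detection goes through the language models rather than a byte test like this one.

```python
NBSP = 0xA0  # non-breaking space in ISO-8859-11, unassigned in TIS-620

def label_thai_charset(data: bytes) -> str:
    """Pick between the two Thai labels (illustrative assumption only)."""
    # The two charsets differ only by this one byte, so its presence
    # is the only thing that can tilt the result to ISO-8859-11.
    return "ISO-8859-11" if NBSP in data else "TIS-620"
```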
Jehan
ffcd85f709 script: forgot to commit ISO-8859-9 and Turkish files. 2015-12-04 02:40:54 +01:00
Jehan
5ee1c3ee39 LangModels: adding Turkish models for ISO-8859-3 and ISO-8859-9. 2015-12-04 02:35:09 +01:00
Jehan
22b9ed2d4f BuildLangModel: add concept of custom_case_mapping…
… for langs for which Python lower() algorithm fails.
In particular Turkish dotted/dotless 'i' does not follow same rules
as common western languages.
Lowercase for 'I' is indeed not 'i' but 'ı'.
Uppercase for 'i' is indeed not 'I' but 'İ'.
2015-12-04 02:29:40 +01:00
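
The idea of a per-language override can be sketched like this. The dictionary contents follow the Turkish rule quoted above; the names `TURKISH_CASE_MAPPING` and `lower_char` are hypothetical and are not taken from BuildLangModel.py.

```python
# Hypothetical per-language override table: upper case -> lower case.
TURKISH_CASE_MAPPING = {
    'I': 'ı',  # dotless pair
    'İ': 'i',  # dotted pair
}

def lower_char(ch, custom_case_mapping=None):
    """Lowercase one character, preferring the language's own mapping."""
    if custom_case_mapping and ch in custom_case_mapping:
        return custom_case_mapping[ch]
    # Fall back to Python's default algorithm for everything else.
    return ch.lower()
```

Without the override, `lower_char('I')` gives 'i', which is the wrong letter for Turkish.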
Jehan
f0e122b506 LangModels: add Esperanto ISO-8859-3 language model. 2015-12-04 01:35:56 +01:00
Jehan
a167bd5e42 BuildLangModel: lowercase only when resulting char has a composed form.
With the Turkish dotted 'İ', lowercasing with the Python algorithm
returned a decomposed character that it was not able to recompose.
Therefore ord() raised a TypeError because the string length was 2.
2015-12-04 01:30:21 +01:00
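
The failure and the guard can be reproduced in a few lines of Python. This is a sketch of the approach under the assumption that NFC normalization is used for recomposition; `safe_lower` is a hypothetical name, not the script's actual function.

```python
import unicodedata

def safe_lower(ch):
    """Lowercase ch only if the result is a single composed character."""
    lowered = unicodedata.normalize('NFC', ch.lower())
    # 'İ'.lower() is 'i' followed by U+0307 (combining dot above), which
    # has no precomposed form, so the length stays 2 and ord() would fail.
    return lowered if len(lowered) == 1 else ch
```

When the lowercase form cannot be recomposed, the character is kept as it is, so a later `ord()` call always receives a string of length 1.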
Jehan
b56a3c7b84 README: add German support. 2015-12-04 00:07:03 +01:00
Jehan
55b4f23971 Single Byte charsets: high ctrl character ratio lowers confidence.
Control characters are not an error per se. Nevertheless they are clearly not
frequent in single-byte charset texts. It is only normal for them to lower
confidence in a charset. In particular a higher ctrl-per-letter ratio means
a lower confidence.
This fixes for instance our Windows-1252 German test (otherwise detected as
ISO-8859-1).
2015-12-04 00:04:43 +01:00
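
A rough Python sketch of why this helps the Windows-1252 case: bytes 0x80 to 0x9F are printable in Windows-1252 but are control characters in ISO-8859-1. The penalty formula below (confidence multiplied by one minus the ratio) is an assumption chosen for illustration, and both function names are hypothetical; the commit only states that a higher ctrl-per-letter ratio means a lower confidence.

```python
import unicodedata

def ctrl_per_letter_ratio(data: bytes, charset: str) -> float:
    """Count control characters per letter when data is read as charset."""
    text = data.decode(charset, errors='replace')
    ctrl = sum(1 for ch in text
               if unicodedata.category(ch) == 'Cc' and ch not in '\t\n\r')
    letters = sum(1 for ch in text if ch.isalpha())
    return ctrl / letters if letters else 0.0

def penalize(confidence: float, ratio: float) -> float:
    """Illustrative penalty only: not uchardet's actual formula."""
    return confidence * max(0.0, 1.0 - ratio)

# German quotation marks are 0x84 and 0x93 in Windows-1252.
sample = '„Hallo“'.encode('cp1252')
as_cp1252 = penalize(0.9, ctrl_per_letter_ratio(sample, 'cp1252'))
as_latin1 = penalize(0.9, ctrl_per_letter_ratio(sample, 'latin-1'))
```

Read as ISO-8859-1, the two quotation mark bytes become control characters, so `as_latin1` ends up lower than `as_cp1252`.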
Jehan
aa587a64bd LangModels: adding German models for ISO-8859-1 and Windows-1252. 2015-12-03 23:58:41 +01:00
Jehan
90728e4068 README: update with Windows-1252 support information. 2015-12-03 21:25:53 +01:00
Jehan
0270b1e856 Adding French Windows-1252 support. 2015-12-03 21:22:30 +01:00
Jehan
5d3fb3dc2f test: add a Windows-1252 French test.
Text from https://fr.wikipedia.org/wiki/Œuf_(cuisine)
2015-12-03 21:20:15 +01:00
Jehan
15afc5c593 test: add a Hungarian Windows-1250 test but skip it for now.
Text from: https://hu.wikipedia.org/wiki/Magyar_nyelv
2015-12-03 21:18:55 +01:00
Jehan
ea34e8b1bd Update doc comment.
We do not return an empty string on ASCII anymore. An empty string now
means only detection failure. ASCII will get a proper "ASCII" return.
2015-12-03 20:36:09 +01:00
Jehan
60f641bf37 Update README to mark independence with original Mozilla code. 2015-12-03 20:32:57 +01:00
Jehan
e4260f4a39 Release: version 0.0.4. v0.0.4 2015-12-03 19:48:58 +01:00
Jehan
ba56d91808 Update uchardet URL in various places. 2015-12-03 19:48:29 +01:00
Jehan
d1bc09e4d7 Update authors.
I think I deserved being listed in the authors by now. ;-)
2015-12-03 19:44:13 +01:00
Jehan
c4fa728e7a Merge branch 'master' of https://github.com/lovasoa/uchardet into lovasoa-master
Let's shortcut Single Byte charset detection on invalid codepoints.
Merging and fixing the contributor's commit conflicts after code
redesign: in particular we added an illegal character concept (they were
mixed with control characters in current charmaps, yet ctrl characters
are NOT to be considered invalid) and constants instead of hardcoded
numbers ('ILL' rather than 255).
2015-12-03 19:26:19 +01:00
Jehan
d686fcc1cd LangModels: add illegal codepoints information on single byte charmaps. 2015-12-03 19:04:07 +01:00
Jehan
683255278d Re-enable Hungarian language models.
Now that we have at least one model for ISO-8859-1, the risk of
detecting all ISO-8859-1 texts as ISO-8859-2 is lessened.
2015-12-02 22:24:36 +01:00
Jehan
f4f9fc3f28 test: reenable Windows-1251 test for Russian.
Commit 4f1c3ff actually fixed it!
2015-12-02 21:53:27 +01:00
Jehan
9dd6b34e93 test: add French UTF-8 test.
Text from:
https://fr.wikipedia.org/wiki/UTF-8
2015-11-30 20:03:33 +01:00
Jehan
4f1c3ff85e nsSBCharSetProber: multiply confidence by ratio of positive seqs per chars.
If all sequences in a text are positive sequences, the ratio of positive
sequences cannot make the difference between 2 very close charsets.
A ratio of positive sequences per letters on the other hand will
break a tie between 2 encodings. If while adding a letter, the number
of positive sequences does not increase, the confidence will decrease
(corresponding to the fact it was likely not a letter).
On the other hand, if the number of positive sequences increases, so
will the confidence.
For instance this fixes wrong detections of ISO-8859-1 and ISO-8859-15.
When letters only available in ISO-8859-15 appear in a text, we expect
confidence to tilt towards the close yet slightly different ISO-8859-15.
2015-11-30 19:52:07 +01:00
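
The tie-breaking effect can be shown with a simplified Python sketch. It keeps only the two ratios discussed in this commit and leaves out every other factor of the real prober; the function name and the sequence counts are made up for illustration.

```python
def confidence(positive_seqs: int, total_seqs: int, total_chars: int) -> float:
    """Simplified confidence: positive-sequence ratio times positives per char."""
    if total_seqs == 0 or total_chars == 0:
        return 0.0
    base = positive_seqs / total_seqs            # cannot break a tie when it is 1.0
    return base * (positive_seqs / total_chars)  # the factor added by this commit

# Hypothetical counts for the same 100-byte text under two close charsets.
# Under ISO-8859-15 every byte is a letter; under ISO-8859-1 one byte is a
# symbol, which removes the sequences it would have taken part in.
iso_8859_15 = confidence(positive_seqs=99, total_seqs=99, total_chars=100)
iso_8859_1 = confidence(positive_seqs=97, total_seqs=97, total_chars=100)
```

Both charsets have a positive-sequence ratio of 1.0, so only the second factor separates them, in favour of ISO-8859-15.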
Jehan
9cb5764b73 LangModels: update the French language models.
Fully built with the script.
2015-11-30 19:20:55 +01:00
Jehan
dc5caa46bc BuildLangModel: fix hardcoded file names. 2015-11-30 19:18:25 +01:00
Jehan
3e5d37a6b5 BuildLangModel: process pages level per level.
I.e. horizontally or "breadth first" rather than vertical tree traversal.
This makes sure that all the start pages in particular are searched
when using the max_page option.
2015-11-30 19:12:04 +01:00
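
The traversal order can be sketched in Python as below. The `get_links` callback stands in for whatever fetches a page's links; it and the function name `crawl_breadth_first` are hypothetical, and the in-memory link table is made-up sample data, not Wikipedia content.

```python
from collections import deque

def crawl_breadth_first(start_pages, get_links, max_depth, max_page=None):
    """Visit pages level by level, stopping after max_page pages."""
    seen = set()
    queue = deque()
    for page in start_pages:
        if page not in seen:
            seen.add(page)
            queue.append((page, 0))
    visited = []
    while queue:
        if max_page is not None and len(visited) >= max_page:
            break
        page, depth = queue.popleft()
        visited.append(page)
        if depth < max_depth:
            for link in get_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return visited

# Made-up link graph for illustration.
LINKS = {'A': ['A1', 'A2'], 'B': ['B1'], 'A1': ['A1a']}
first_three = crawl_breadth_first(['A', 'B'], lambda p: LINKS.get(p, []),
                                  max_depth=2, max_page=3)
```

With a depth-first traversal and the same limit of 3 pages, start page 'B' would never be reached; here both start pages are processed before any linked page.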
Jehan
04f9309932 tests: update ISO-8859-15 French test file.
The previous technical text about charsets themselves was not relevant
to identify a language. In particular the special characters that differ
between ISO-8859-1 and ISO-8859-15 were used by themselves, out of a
char sequence context. Therefore without language understanding, they
could just as well have been representing the ISO-8859-15 letters or the
ISO-8859-1 symbols at the corresponding codepoints.
Replacing with text from this Wikipedia page:
https://fr.wikipedia.org/wiki/Œuf_(cuisine)
This uses some of these same characters (in particular 'œ') but in
contextual character sequences, making it relevant for our algorithm.
2015-11-30 00:19:15 +01:00
Jehan
d9d347099e BuildLangModel: fix some minor comment from a previous spec. 2015-11-30 00:09:23 +01:00
Jehan
192f8de165 BuildLangModel: build models with computed frequent characters count. 2015-11-30 00:04:44 +01:00
Jehan
429448199f French language model: fix a start page.
Because of a bug in the Wikipedia querying Python library.
2015-11-29 23:55:03 +01:00
Jehan
dbb4c1d2ff nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
b64831ff89 BuildLangModel: allow a list of start pages...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631 BuildLangModel: the SequenceModel naming must include the language name. 2015-11-29 15:49:56 +01:00
Jehan
c59465adfc BuildLangModel: save lang model directly in the right directory. 2015-11-29 13:26:10 +01:00
Jehan
72fbd33dec Add a .gitignore. 2015-11-29 02:27:42 +01:00
Jehan
290fbd2e2e BuildLangModel: add the licensing header to generated files. 2015-11-29 02:26:33 +01:00
Jehan
7f290975ba BuildLangModel: map different cases of the same character together.
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and would allow some rarer
characters (but still in the main alphabet) to enter the frequent
character list. For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
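
The effect on character statistics can be sketched in Python. The function name `count_letters` and its boolean `case_mapping` parameter are hypothetical simplifications of the lang property mentioned above.

```python
from collections import Counter

def count_letters(text: str, case_mapping: bool) -> Counter:
    """Count letters, optionally merging upper and lower case together."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        counts[ch.lower() if case_mapping else ch] += 1
    return counts

merged = count_letters("Œuf œuf", case_mapping=True)
separate = count_letters("Œuf œuf", case_mapping=False)
```

With the mapping enabled, 'Œ' and 'œ' add up to a single entry, which gives a rare letter a better chance of ranking among the frequent characters.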
Jehan
00a78faa1d BuildLangModel: the max_depth should be a script option...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424 BuildLangModel: add a --max-page option to limit data size.
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
Jehan
0314f98ece BuildLangModel.py: some in-progress script to build language models. 2015-11-29 01:30:04 +01:00
Jehan
a8e9de307b Add UTF-16 test files without BOM...
... and disable the tests for these for now, since uchardet is not yet
able to detect UTF-16 without a BOM.
2015-11-28 19:50:18 +01:00
Jehan
92efc0b0b0 Update README: Unicode is "International". 2015-11-28 19:44:13 +01:00