Jehan
dbb4c1d2ff
nsSBCharSetProber: replace the fixed 64 SAMPLE_SIZE...
...
... with per-language model "frequent character" count.
2015-11-29 23:51:55 +01:00
Jehan
b64831ff89
BuildLangModel: allow a list of start pages...
...
... and add a page with a word with œ in French to make sure
we have such words in our stats.
2015-11-29 15:51:23 +01:00
Jehan
dce79a6631
BuildLangModel: the SequenceModel naming must include the language name.
2015-11-29 15:49:56 +01:00
Jehan
c59465adfc
BuildLangModel: save lang model directly in the right directory.
2015-11-29 13:26:10 +01:00
Jehan
72fbd33dec
Add a .gitignore.
2015-11-29 02:27:42 +01:00
Jehan
290fbd2e2e
BuildLangModel: add the licensing header to generated files.
2015-11-29 02:26:33 +01:00
Jehan
7f290975ba
BuildLangModel: map different cases of the same character together.
...
With the new case_mapping lang property, we can consider upper and lower
case versions of the same character as one character.
This makes sense in some languages, and allows some rarer characters
(still in the main alphabet) to enter the frequent character list.
For instance 'œ' and 'Œ' in French.
2015-11-29 02:14:48 +01:00
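The effect of merging cases on character statistics can be illustrated with a minimal Python sketch. The function name `count_characters` and its `case_mapping` flag are hypothetical and chosen for illustration only; this is not the code from BuildLangModel.py.

```python
from collections import Counter


def count_characters(text, case_mapping=True):
    """Count letters in text, optionally merging upper and lower case.

    Hypothetical helper: with case_mapping enabled, 'Œ' and 'œ' are
    counted as the same character.
    """
    counts = Counter()
    for char in text:
        if not char.isalpha():
            continue  # ignore spaces, punctuation and digits
        counts[char.lower() if case_mapping else char] += 1
    return counts


sample = "Œuvre, œuf, Œil"
merged = count_characters(sample, case_mapping=True)
separate = count_characters(sample, case_mapping=False)
print(merged["œ"], separate["œ"], separate["Œ"])
```

With merging enabled, the occurrences of both cases add up under a single key, so a rare letter has a better chance of reaching the frequent character list.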
Jehan
00a78faa1d
BuildLangModel: the max_depth should be a script option...
...
... rather than a language property.
2015-11-29 01:59:28 +01:00
Jehan
274386f424
BuildLangModel: add a --max-page option to limit data size.
...
This is mostly useful for debugging, when we don't want to wait forever
to test the script.
2015-11-29 01:42:36 +01:00
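A page limit of this kind could be wired up roughly as follows. This is only a sketch: the use of `argparse`, the helper `process_pages`, and the page names are assumptions for illustration, and the real script may parse its options differently.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the --max-page option name comes from the log.
    parser = argparse.ArgumentParser(description="Illustrative model builder")
    parser.add_argument("--max-page", type=int, default=None,
                        help="stop after processing this many pages")
    return parser


def process_pages(pages, max_page=None):
    """Return the pages that would be processed under the given limit."""
    processed = []
    for page in pages:
        if max_page is not None and len(processed) >= max_page:
            break  # limit reached: stop early to keep the run short
        processed.append(page)
    return processed


options = build_parser().parse_args(["--max-page", "2"])
print(process_pages(["A", "B", "C"], options.max_page))
```

Leaving the default at `None` keeps the unlimited behavior when the option is not given.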
Jehan
0314f98ece
BuildLangModel.py: some in-progress script to build language models.
2015-11-29 01:30:04 +01:00
Jehan
a8e9de307b
Add UTF-16 test files without BOM...
...
... and disable the tests for these for now, since uchardet is not yet
able to detect UTF-16 without a BOM.
2015-11-28 19:50:18 +01:00
Jehan
92efc0b0b0
Update README: Unicode is "International".
2015-11-28 19:44:13 +01:00
Jehan
573b303fe3
Add an ASCII test file for English...
...
... with escape characters because even with ESC, a file is ASCII
unless proven otherwise.
2015-11-28 17:49:13 +01:00
Jehan
0289c2a232
Differentiate ASCII and detection failure.
...
The lib used to return "" for both properly detected ASCII and
detection failure. And the tool would return "ascii/unknown".
Make a proper distinction between the 2 cases.
2015-11-28 17:04:52 +01:00
Jehan
4dbc6e7ab3
Update README with French support.
2015-11-28 02:20:57 +01:00
Jehan
50588ba375
Add an ISO-8859-15 test file for French.
2015-11-28 02:18:57 +01:00
Jehan
005fd98086
Add initial support for French with ISO-8859-1 and ISO-8859-15.
...
Mostly generated with a script from Wikipedia data (only the typical
positive ratio is slightly modified).
This is a first test before adding my generating script to the main tree.
2015-11-28 02:14:39 +01:00
Jehan
2106173546
Move all Single-Byte language models to a subdirectory.
2015-11-27 23:11:23 +01:00
Jehan
b67370230b
Update README and manual...
...
... to indicate several files can be specified on command line.
2015-11-27 18:27:11 +01:00
Jehan
984d8f7b09
Add language information in model names when they were missing.
...
Models are language specific (there could be several models for the same
charset but different languages). Let's have a clear naming scheme.
2015-11-27 18:21:13 +01:00
Jehan
c61e65aeb3
s/MACCYRILLIC/MAC-CYRILLIC/
...
Write encoding names in README same as what uchardet returns.
2015-11-27 18:19:02 +01:00
Jehan
942ac05ff5
Add some Russian test files.
...
Texts from:
IBM855: https://ru.wikipedia.org/wiki/CP855
IBM866: https://ru.wikipedia.org/wiki/Альтернативная_кодировка
MAC-CYRILLIC: https://ru.wikipedia.org/wiki/MacCyrillic
2015-11-27 18:17:20 +01:00
Jehan
42b91898da
Create 3-letter constants for special charmap characters.
...
Control characters, carriage returns, symbols and numbers.
Also add a constant for illegal characters (not used for now).
This will allow easier processing and charmap reading.
2015-11-27 17:41:54 +01:00
Jehan
7fa0fefef8
Add UTF-16 and UTF-32 test files in French, with BOM.
...
Unfortunately uchardet currently seems unable to detect UTF-16/32
text without a BOM.
2015-11-26 02:45:00 +01:00
Ophir LOJKINE
5ef60164fc
Stop detection early on control characters
2015-11-24 22:07:41 +03:00
Jehan
e8dd55995a
Add "LE/BE" suffix to "UTF-16" result for Little/Big Endian info...
...
... and add UTF-32 BOM detection.
2015-11-24 18:50:23 +01:00
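BOM-based detection with endianness can be sketched in Python using the standard `codecs` BOM constants. The function `detect_bom` and the exact result strings (in particular the UTF-32 names) are assumptions for illustration; this is not uchardet's C++ code.

```python
import codecs

# Longer BOMs must be checked first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(codecs.BOM_UTF16_BE + "é".encode("utf-16-be")))
print(detect_bom(b"plain text without a BOM"))
```

Without a BOM the function returns `None`, which matches the limitation noted elsewhere in this log: BOM-less UTF-16/32 needs a different detection strategy.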
Jehan
9a74d08b3c
Fix minor space issues.
2015-11-24 00:15:44 +01:00
Jehan
d082704fec
Add Mageia command and specify Mint compatibility.
2015-11-23 17:46:01 +01:00
Jehan
ff5fd5eff9
Release: version 0.0.3.
v0.0.3
2015-11-19 15:18:11 +01:00
Jehan
5dcff7b241
Hide away tests known to fail.
...
Some charsets are simply not supported (ex: fr:iso-8859-1), some are
temporarily deactivated (ex: hu:iso-8859-2) and some are wrongly
detected as closely related charsets.
These were broken (or not efficient) from the start, and there is no
need to pollute the `make test` output with them, which could make us
miss actual regressions when they occur. So let's hide them away for
now, until we can improve the situation.
2015-11-18 20:02:58 +01:00
Jehan
4b38e68aa2
CMake tests: separate the lang and charset with colon...
...
... rather than a hyphen. It makes it easier to read.
2015-11-18 19:42:35 +01:00
Jehan
35153b1e50
Fixes boolean operation precedence warnings...
...
... and some minor space issues.
Some explicit parentheses were needed to make precedence obvious.
Warning was:
"warning: suggest parentheses around ‘&&’ within ‘||’ [-Wparentheses]"
2015-11-18 19:38:12 +01:00
Jehan
0d70a36910
Adding some more test files for Russian and Chinese.
...
Taken from:
https://zh.wikipedia.org/wiki/EUC
https://ru.wikipedia.org/wiki/КОИ-8
And rename a file s/utf8.txt/utf-8.txt/ to fix a build test.
2015-11-18 19:27:38 +01:00
Jehan
eb727d3aca
Add automatic testing against every test file.
2015-11-18 18:18:27 +01:00
Jehan
f303a41735
Add Thai test file for UTF-8.
...
Text from Thai Wikipedia:
https://th.wikipedia.org/wiki/ยูนิโคด
2015-11-18 03:26:34 +01:00
Jehan
9d9257072a
s/windows-1255/WINDOWS-1255/ to follow iconv uppercase naming.
2015-11-18 03:21:34 +01:00
Jehan
e7c8114233
Add Hebrew test files.
...
Texts from Hebrew Wikipedia:
https://he.wikipedia.org/wiki/עברית
https://he.wikipedia.org/wiki/ISO_8859
https://he.wikipedia.org/wiki/UTF-8
uchardet fails to detect the ISO-8859-8 file and detects it as
Windows-1255, which is probably acceptable since it is apparently
an "almost compatible superset". It may be worth trying to make
more complete test files in the future to demonstrate the differences.
2015-11-18 03:16:18 +01:00
Jehan
601e59bd83
Add Greek test files.
...
Taken from Greek Wikipedia:
https://el.wikipedia.org/wiki/UTF-8
https://el.wikipedia.org/wiki/ISO_8859-7
https://el.wikipedia.org/wiki/ISO_8859-7#Windows-1253
Windows-1253 test fails and returns "ISO-8859-7". They are actually
fairly close for main letters, except for Ά, which makes them difficult
to differentiate.
2015-11-18 02:57:09 +01:00
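The closeness of the two Greek charsets can be checked with Python's built-in codecs. This is a small sketch; the helper `differing_letters` and the sample of letters are hypothetical and chosen for illustration.

```python
def differing_letters(letters, first, second):
    """Return the letters whose byte encoding differs between two codecs."""
    return [c for c in letters if c.encode(first) != c.encode(second)]


# A small sample of Greek letters, including the accented capital Ά.
greek = "ΑΒΓΔαβγδωΆ"
print(differing_letters(greek, "iso8859_7", "cp1253"))
```

Letters that encode to the same byte in both charsets give a statistical detector nothing to distinguish them by, so only the few differing code points can tip the result.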
Jehan
c8532f63a8
Adding UTF-8 file for Korean.
...
Text taken from Korean Wikipedia:
https://ko.wikipedia.org/wiki/UTF-8
2015-11-18 02:36:33 +01:00
Jehan
1a58fa6d99
Update AUTHORS.
2015-11-17 21:51:59 +01:00
Jehan
4db0d55692
URL of related project python-chardet has changed.
2015-11-17 21:40:44 +01:00
Jehan
a76c0786b3
Adding test files for main Japanese encodings...
...
... taken from the following Japanese Wikipedia pages:
https://ja.wikipedia.org/wiki/Extended_Unix_Code
https://ja.wikipedia.org/wiki/ISO/IEC_2022
https://ja.wikipedia.org/wiki/UTF-8
2015-11-17 21:24:47 +01:00
Jehan
0efcdfa546
Reorganize test files in language subdirectories.
...
I realize that the language a text is written in is very important,
since it completely changes the character distribution. Our test files
should take this into account, and we should create several test files
in different languages for encodings used in various languages.
2015-11-17 21:12:39 +01:00
Jehan
192b0e7d51
Add test files for ISO-8859-[12].
...
Taken from French page about ISO-8859-1:
https://fr.wikipedia.org/wiki/ISO_8859-1
... and Hungarian Wikipedia page about ISO-8859-2:
https://hu.wikipedia.org/wiki/ISO/IEC_8859-2
We don't have support for ISO-8859-1, and both these files are detected
as "WINDOWS-1252" (which is acceptable for iso-8859-1.txt since
Windows-1252 is a superset of ISO-8859-1). ISO-8859-2 support is
disabled because the ISO-8859-1 file would be detected as ISO-8859-2,
which would in turn be a clear error.
2015-11-17 19:39:58 +01:00
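The superset relationship mentioned above can be checked with Python's built-in codecs. This is a sketch for illustration only: it compares the printable upper range (0xA0 to 0xFF) and one byte from the 0x80 to 0x9F range, where ISO-8859-1 has control codes and Windows-1252 has printable characters.

```python
# Bytes 0xA0-0xFF decode to the same characters in both charsets.
upper_range = bytes(range(0xA0, 0x100))
same_upper = upper_range.decode("latin-1") == upper_range.decode("cp1252")

# Byte 0x80 is a control code in ISO-8859-1 but a printable
# character (the euro sign) in Windows-1252.
latin1_char = b"\x80".decode("latin-1")
cp1252_char = b"\x80".decode("cp1252")

print(same_upper, repr(latin1_char), repr(cp1252_char))
```

Since ordinary ISO-8859-1 text rarely contains bytes in the 0x80 to 0x9F range, it decodes identically under both charsets, which is why a "WINDOWS-1252" result is acceptable for such a file.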
Jehan
3f3f4b8011
Add an ISO-8859-5 test file.
...
Text taken from Russian Wikipedia page about ISO-8859-5:
https://ru.wikipedia.org/wiki/ISO_8859-5
2015-11-17 19:11:59 +01:00
Jehan
bafccfcea8
Add Windows-1251 test files.
...
Texts taken from Bulgarian Wikipedia page about Windows-1251:
https://bg.wikipedia.org/wiki/Windows-1251
... and Russian Wikipedia page about Windows-1251:
https://ru.wikipedia.org/wiki/Windows-1251
The Bulgarian file detection is right, but the Russian detection
returns "MAC-CYRILLIC", which is an error and should be fixed.
2015-11-17 19:09:37 +01:00
Jehan
41f3b757f1
Some more encoding names changed to be iconv-compatible.
...
I forgot to fix some names.
In particular "x-mac-cyrillic" is not valid in iconv, and has been
changed to "MAC-CYRILLIC".
2015-11-17 18:51:45 +01:00
Jehan
8216f7b395
Add an ISO-2022-KR test file.
...
Text taken from Korean Wikipedia page about ISO-2022-KR:
https://ko.wikipedia.org/wiki/ISO/IEC_2022
2015-11-17 18:23:46 +01:00
Jehan
ad4dfc4be4
Add a BUILD_STATIC CMake option to optionally build a static library.
...
It is still ON by default, which means both shared and static libs will
be built and installed (current behavior), but it makes it possible to
disable the build of a static lib.
Closes https://github.com/BYVoid/uchardet/issues/1 .
2015-11-17 18:14:51 +01:00
Jehan
9172b763d1
Add TIS-620 in README (Thai language) and a test file.
...
Test text based on Thai Wikipedia page about the TIS-620 encoding:
https://th.wikipedia.org/wiki/TIS-620
2015-11-17 17:39:45 +01:00