250 Commits

Author SHA1 Message Date
Jehan
e0b9269849 Fix various other occurrences of bug tracker URL in code/build. 2020-04-22 12:29:41 +02:00
Jehan
60bf53c81e README: update to Gitlab links.
Freedesktop moved its infrastructure to Gitlab a while ago.
2020-04-22 00:33:48 +02:00
Jehan
0cfb75724a README: some small updates. 2020-04-22 00:17:23 +02:00
Jehan
bdfd6116a9 Add a mention about fd.o code of conduct. 2018-09-26 15:12:25 +02:00
Ilya Tumaykin
f136d434f0 build: turn TARGET_ARCHITECTURE into option
Default value is autodetected if not specified by user.
2018-01-21 15:58:13 +01:00
Jehan
95872ef41c Adding some information about building for Windows. 2017-12-26 03:37:42 +01:00
Jehan
df67ae4fe0 CMake: get rid of some commented code.
It says that's for Win32 platform and uses the install prefix as library
prefix. But that's not at all the same kind of prefixes!
CMAKE_INSTALL_PREFIX expected value is the path to install the lib (what
is called the "installation prefix"), whereas CMAKE_*_LIBRARY_PREFIX are
the prefix on the file name (usually "lib" on UNIX-like systems).
Anyway I don't see a need to change this value. It will be called
"libuchardet.dll" on Win32. I don't see the problem.
Also this code was already commented out, and compilation and usage for
Win32 works just fine without it. :-)
2017-12-24 19:47:05 +01:00
Jehan
cd617d181d CMake: do not check/set SSE and float-store options on non-x86 targets.
Not sure if that's right. I guess we might also find non-x86 machines
where floating point computation won't follow IEEE standard as well. But
let's do this for now to prevent from useless performance hit.
2017-11-07 00:37:54 +01:00
Jehan
939482ab2b CMake: slightly improve the configuration option messages.
Also add full stops, similarly to CMake defaut options.
2017-11-06 02:11:20 +01:00
Jehan
77bf71ea36 CMake: rename s/ENABLE_SSE2/CHECK_SSE2/.
"ENABLE_SSE2" may be misleading since having it ON does not necessarily
mean that SSE2 flags will be actually set. It only means that the
support will be checked (then set only when supported).
Also adding the warning about possible performance decrease.
2017-11-06 02:07:40 +01:00
Jehan
5996bbd995 Bug 101033 - Testsuite fails on i386.
Floating point accuracy may be different depending on the architecture.
In particular some architectures may store floating values with
different precision, resulting in unreliable results across various
machines. It would seem in particular true on older x86 machines without
SSE support, which were reported cases.
The proposed solution is to test for SSE support and explicitly add the
proper flags (even though they are set by default anyway on modern x86).
When this is not available (on older machines or simply when not on x86
processors), I replace sse2 flags with -ffloat-store, which forces IEEE
floating point definition.
The reason why not to always force -ffloat-store is because it seems to
decrease performance on some machines. SSE is prefered if available.

I also add a ENABLE_SSE2 option on the CMake file to allow builders to
use -ffloat-store even though SSE2 may be available on the build
machine. This would allow to build portable binaries which can also be
installed on older machines.
2017-11-06 01:56:45 +01:00
Jehan
056a5a6e51 README: add some applications having uchardet as dependency.
There are likely more (and I know some are planning support) but these
are the ones I know of and with support already in.
2017-09-21 00:06:03 +02:00
Jehan
1898847eb6 src: cast value to its proper type.
Thanks to Marino Faggiana for reporting it.
See: https://github.com/BYVoid/uchardet/issues/37
2017-08-27 13:01:30 +02:00
Jehan
170ef349cf src: fix some doc comments. s/a instance/an instance/.
Unless mistaken, we should use "an" with next word starting with
vowel.
2017-08-19 10:46:25 +02:00
Jehan
c049332c41 src: s/detctor/detector/. 2017-08-18 12:03:54 +02:00
Jehan
d9d014742a README: Gentoo also has a uchardet package.
And it is up-to-date with upstream URL at Freedesktop! Good!
2017-05-28 21:13:59 +02:00
Jehan
53f7ad0e0b Bug 101032 - assignments to nsSMState in nsCodingStateMachine result...
... in unspecified behavior.
When compiling with UBSan (-fsanitize=undefined), execution complains:
> runtime error: load of value 5, which is not a valid value for type 'nsSMState'
Since the machine states depend on every different charset's state
machine, it is not possible to simply extend the enum with more generic
values. Instead let's just make the state as an unsigned int value and
define the 3 generic states as constants.
2017-05-28 20:01:06 +02:00
Jehan
50bc02c0ff Request C++11 standard project-wise and make it a strong requirement.
It is unneeded to do it by target, using the globale property
CMAKE_CXX_STANDARD instead. Also with CMAKE_CXX_STANDARD_REQUIRED, I
make this a strong requirement. The documentation indeed states that the
CXX_STANDARD "is treated as optional and may “decay” to a previous
standard if the requested is not available".
This means that uchardet will likely not be buildable with a compiler
with no C++11 support. But I assume this is not a common situation, and
probably we should not care about outdated compilers. I remain open to
suggestions and disagreement on the topic obviously.
2017-05-28 15:43:44 +02:00
Jehan
1bf198cb0f Make C++11 the standard used for uchardet.
As discussed in bug 101032, it seems like the most common usage
nowadays. Let's make a specific choice to avoid different behavior on
different builds later on.
2017-05-28 15:32:06 +02:00
Jehan
98bf4d73fd Bug 101204 - different results with different chunk sizes.
ASCII and ISO-8859-1 should not be detected in
nsUniversalDetector::HandleData() but in nsUniversalDetector::DataEnd()
instead. Otherwise it creates an unwanted shortcut from the first call
to uchardet_handle_data() if the input is broken into several pieces and
if the first chunk happens to be ASCII (or ASCII + NBSP).
2017-05-28 14:14:48 +02:00
Jehan
50743e16f8 src: minor indentation fix. 2017-05-14 21:35:11 +02:00
Jehan
6cf13f108b test: output the test file path which we failed to open.
Also properly free the string in such case.
2017-05-14 20:29:30 +02:00
Jehan
94b10b9b29 Bug 101030 - Buffer overflow related to ISO2022JP detection in...
... en:ascii and ja:iso-2022-jp tests.
I don't know much about this part of the code at this point. Yet I can
clearly deduct that the length of the charLenTable is supposed to be the
classFactor of the SMModel. Therefore 2 classes were missing in
ISO2022JPCharLenTable, hence a buffer overflow happens when trying to
reach these. I am not sure of the values I should add there. For now,
let's set 0 to both, but adding also a comment so that I can review this
code later on, when I will get to read and understand this piece of code
in more depth.
2017-05-14 19:49:01 +02:00
Jehan
64efb1b24c Bug 101031 - Memory leak of nsSBCSGroupProber.
This manual incrementation code is just horrible and so error-prone.
Some day, we should make a cleaner loop to register all these
single-byte charset probers.
2017-05-14 18:24:11 +02:00
Jehan
56b843522b INSTALL: update compilation instructions. 2017-03-25 00:08:57 +01:00
Jehan
d90d01bc9e README: adding a flatpak-builder manifest example.
Thanks to Sébastien Wilmet for the example.
2017-03-24 23:22:40 +01:00
Jehan
119fed7e8d LangModels: add Swedish support.
Encodings: ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-15 and
WINDOWS-1252.
Test text from https://sv.wikipedia.org/wiki/Mölle
2016-09-28 22:42:13 +02:00
Jehan
d62154bd6e LangModels: add Slovene support.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250, IBM852 and
MAC-CENTRALEUROPE.
Test text from https://sl.wikipedia.org/wiki/Naseljivi_planet
2016-09-28 22:13:17 +02:00
Jehan
fbd2efdbe9 LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
2016-09-28 19:57:50 +02:00
Jehan
0a04177787 script: forgot to commit the Estonian description. 2016-09-27 00:51:19 +02:00
Jehan
a7525b404d LangModels: added support for Irish Gaelic.
Encodings: ISO-8859-1, ISO-8859-9, ISO-8859-15 and WINDOWS-1252.
Test text from:
https://ga.wikipedia.org/wiki/Gluais_théarmaí_seoltóireachta
2016-09-27 00:49:05 +02:00
Jehan
a3a271dfd5 LangModels: Estonian models created.
Encodings: ISO-8859-4, ISO-8859-13, ISO-8859-13, Windows-1252 and
Windows-1257.
Test text from https://et.wikipedia.org/wiki/Anton_Tšehhov
Windows-1257 and ISO-8859-13 are very close so I added quotation marks
(Jutumärgid) which are on codepoints only present in ISO-8859-13,
making both encoding apart.
2016-09-27 00:14:29 +02:00
Jehan
3c6d31f5c2 LangModels: new Croatian models.
Supports: ISO-8859-2, ISO-8859-13, ISO-8859-16, IBM852, Windows-1250
and MAC-CENTRALEUROPE.
Test text from https://hr.wikipedia.org/wiki/Brekinja
2016-09-26 01:32:49 +02:00
Jehan
d76d33b88b script: character orders in single-byte language models should be maxed.
This happened when building a Croatian model which can be written with
many different encodings. There were also many irrelevant glyphs (i.e.
used in other languages) in these encodings so we ended with orders over
255, which breaks when converting to unsigned char.
Just let's make sure that we don't cross the 250 limit (over is used for
controls, illegal characters, symbols, numbers…). This means we may have
several characters with order 249, but since orders over the frequent
character list don't matter, this is not a problem.
2016-09-26 01:31:20 +02:00
Jehan
05ba8555cd src: fix number of Single-Byte charset probers. 2016-09-25 14:02:39 +02:00
Jehan
4e535503c6 script: language script for Slovak forgotten. 2016-09-21 18:58:12 +02:00
Jehan
f262b1d65b LangModels: add Italian support.
Officially supported: ISO-8859-1, ISO-8859-3, ISO-8859-9, ISO-8859-15
and WINDOWS-1252. Same as Finnish only ISO-8859-1 and UTF-8 test added
since other encoding end up similar as ISO-8859-1 for most common texts
(i.e. glyphs used in Italian are on the same codepoints on these other
encodings).
Test text from https://it.wikipedia.org/wiki/Architettura_longobarda
2016-09-21 18:52:09 +02:00
Jehan
87d0c16e0e README: add Finnish support. 2016-09-21 18:35:26 +02:00
Jehan
6bbe7da1ac LangModels: add Finnish support.
I built models for ISO-8859-1, ISO-8859-4, ISO-8859-9, ISO-8859-13,
ISO-8859-15 and WINDOWS-1252, which all contain Finnish letters.
Nevertheless most texts in these encoding end up the same (same
codepoints for the Finnish glyphs) so I keep only tests for ISO-8859-1
and UTF-8. Models for other encoding may still be useful when processing
texts with some symbols, etc.
2016-09-21 18:27:39 +02:00
Jehan
ac4aa94b73 README: add Polish support…
… and update "Mac-CentralEurope" into "MAC-CENTRALEUROPE" (as in iconv).
2016-09-21 17:38:22 +02:00
Jehan
a59b1c9571 src: update documentation comments on the public API. 2016-09-21 17:36:17 +02:00
Jehan
3401ac70d0 LangModels: add Polish support.
With the following encodings: ISO-8859-2, ISO-8859-13, ISO-8859-16,
Windows-1250, IBM852, MAC-CENTRALEUROPE.
Test text from https://pl.wikipedia.org/wiki/Zofia_Holszańska
2016-09-21 17:30:15 +02:00
Jehan
f314b76c0a README: add Slovak support. 2016-09-21 13:42:31 +02:00
Jehan
5f9ec3aef0 LangModels: add support for Slovak.
Encodings are the same as Czech (Windows-1250, ISO-8859-2 and
Mac-CentralEurope) since the resource I found indicate they used the
same encodings historically.
Also it is to be noted that the test examples' encoding were already
properly detected through Czech's models so the languages are definitely
very close, even statistically. Nevertheless adding the right models
will work better and these get better scores. This will take all its
meaning when uchardet will also be used as a language detector (in some
not-too-far future, hopefully!).
Test text taken from: https://sk.wikipedia.org/wiki/Jupiter
2016-09-21 13:42:20 +02:00
Jehan
5680cba0b8 README: adding Czech and Maltese support information. 2016-09-21 03:45:40 +02:00
Jehan
2c752dbbe5 test: adding test files for Czech.
Text taken from:
https://cs.wikipedia.org/wiki/Ledňáček_říční
2016-09-21 03:44:22 +02:00
Jehan
26e1cebad1 LangModels: add support for Czech.
Encodings: Windows-1250, ISO-8859-2, IBM852 and Mac-CentralEurope.
Other encodings are known to have been used for Czech: Kamenicky,
KOI-8 CS2 and Cork. But these are uncommon enough that I decided not
to support them (especially since I can't find them supported in iconv
either, or at least not under an alias which I could recognize).
This web page, which contents was made under the Public Domain, is a
good reference for encodings which were used historically for Czech and
Slovak: http://luki.sdf-eu.org/txt/cs-encodings-faq.html
2016-09-21 03:33:50 +02:00
Jehan
183092d048 src: fix non-guarded 'if' warning.
Not sure if this is useful to have the 'if (mDetectedCharset)' outside
the if block, but it won't hurt for sure in this specific case, so I
leave the current code logics as is.
The exact warning was:
nsUniversalDetector.cpp: In member function ‘virtual nsresult nsUniversalDetector::HandleData(const char*, PRUint32)’:
nsUniversalDetector.cpp:115:5: warning: this ‘if’ clause does not guard... [-Wmisleading-indentation]
     if (aLen > 2)
     ^~
nsUniversalDetector.cpp:157:7: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the ‘if’
       if (mDetectedCharset)
       ^~
2016-09-21 02:37:31 +02:00
Jehan
26024e5c82 script: work around a KeyError exception in Python Wikipedia lib.
Even the test `if hasattr(page, 'links')` would trigger this exception.
So I try the approach "Easier to Ask Forgiveness than Permission".
Weird stuff but well…
Note: I had this exception when running it on the Maltese data.
2016-09-21 02:19:39 +02:00
Jehan
2700cf3a83 LangModels: support for Maltese / ISO-8859-3.
Test text from https://mt.wikipedia.org/wiki/Franza.
2016-09-21 02:11:31 +02:00