From fbd2efdbe918ec18ec79a3b2e0064b2247393cd0 Mon Sep 17 00:00:00 2001
From: Jehan
Date: Wed, 28 Sep 2016 19:54:17 +0200
Subject: [PATCH] LangModels: Romanian support added.

Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
---
 README.md                                     |   5 +
 .../BuildLangModelLogs/LangRomanianModel.log  | 153 ++++++++++++
 script/langs/ro.py                            |  65 +++++
 src/CMakeLists.txt                            |   1 +
 src/LangModels/LangRomanianModel.cpp          | 232 ++++++++++++++++++
 src/nsSBCSGroupProber.cpp                     |   5 +
 src/nsSBCSGroupProber.h                       |   2 +-
 src/nsSBCharSetProber.h                       |   5 +
 test/ro/ibm852.txt                            |   9 +
 test/ro/iso-8859-16.txt                       |   9 +
 test/ro/utf-8.txt                             |   9 +
 test/ro/windows-1250.txt                      |   9 +
 12 files changed, 503 insertions(+), 1 deletion(-)
 create mode 100644 script/BuildLangModelLogs/LangRomanianModel.log
 create mode 100644 script/langs/ro.py
 create mode 100644 src/LangModels/LangRomanianModel.cpp
 create mode 100644 test/ro/ibm852.txt
 create mode 100644 test/ro/iso-8859-16.txt
 create mode 100644 test/ro/utf-8.txt
 create mode 100644 test/ro/windows-1250.txt

diff --git a/README.md b/README.md
index ba4d494..1b54e4a 100644
--- a/README.md
+++ b/README.md
@@ -115,6 +115,11 @@ Techniques used by universalchardet are described at http://www.mozilla.org/proj
     * ISO-8859-9
     * ISO-8859-15
     * WINDOWS-1252
+  * Romanian:
+    * ISO-8859-2
+    * ISO-8859-16
+    * Windows-1250
+    * IBM852
   * Russian
     * ISO-8859-5
     * KOI8-R
diff --git a/script/BuildLangModelLogs/LangRomanianModel.log b/script/BuildLangModelLogs/LangRomanianModel.log
new file mode 100644
index 0000000..5d30cbc
--- /dev/null
+++ b/script/BuildLangModelLogs/LangRomanianModel.log
@@ -0,0 +1,153 @@
+= Logs of language model for Romanian (ro) =
+
+- Generated by BuildLangModel.py
+- Started: 2016-09-28 18:53:56.086095
+- Maximum depth: 5
+- Max number of pages: 100
+
+== Parsed pages ==
+
+The Loving Kind (revision 10166481)
+12 ianuarie (revision 10711676)
+13 decembrie (revision 9938353)
+2007 (revision 10716321)
+2008 (revision
10752084)
+2009 (revision 10654003)
+21 noiembrie (revision 10447643)
+25 ianuarie (revision 10228199)
+31 ianuarie (revision 10718063)
+4 Music (revision 9701591)
+Billboard (revision 10505294)
+Biology (revision 10112430)
+Bulgaria (revision 10481051)
+CD (revision 10477531)
+Call The Shots (revision 10101027)
+Call the Shots (revision 10101027)
+Can't Speak French (revision 9721506)
+Casă de discuri (revision 10611348)
+Channel 4 (revision 7953101)
+Chemistry (revision 10112479)
+Cheryl Cole (revision 10475016)
+Chitară (revision 10468266)
+Croația (revision 10737746)
+Dance (revision 10231736)
+Descărcare digitală (revision 10100743)
+Digital Spy (revision 9044016)
+Discografia Girls Aloud (revision 10172788)
+Estonia (revision 10749810)
+Europa (revision 10752724)
+Fascination Records (revision 9655292)
+Fiona Phillips (revision 5384082)
+Gen muzical (revision 10534645)
+Girls A Live (revision 10112444)
+Girls Aloud (revision 10112446)
+Good Morning Television (revision 10166481)
+Heat World (revision 10166481)
+I'll Stand By You (cântec de Girls Aloud) (revision 10112432)
+ITunes (revision 10744174)
+I Think We're Alone Now (revision 10112427)
+Irlanda (revision 10573806)
+Jump (cântec de Girls Aloud) (revision 10112438)
+Lady GaGa (revision 10753010)
+Life Got Cold (revision 10112437)
+Limba engleză (revision 10756676)
+Long Hot Summer (revision 10112429)
+Love Machine (revision 10112433)
+MSN Search (revision 10653298)
+MTV (revision 10170766)
+Mixed Up (revision 10112443)
+Muzică electronică (revision 10608432)
+Muzică pop (revision 10740529)
+Nadine Coyle (revision 10316187)
+Neil Tennant (revision 10499980)
+No Good Advice (revision 10112436)
+Out Of Control (revision 10112484)
+Out of Control (revision 10112484)
+Pet Shop Boys (revision 10612741)
+Poker Face (revision 10496402)
+PopJustice (revision 10625677)
+Regatul Unit (revision 10752338)
+Regatul Unit al Marii Britanii și Irlandei de Nord (revision 10752338)
+Regatul Unit al Marii Britanii și al
Irlandei de Nord (revision 10752338)
+Republica Irlanda (revision 10573806)
+Romanian Top 100 (revision 10736281)
+România (revision 10732435)
+Sarah Harding (revision 10633651)
+Sarah Hearding (revision 10112425)
+See the Day (revision 10112431)
+Sexy! No No No... (revision 10112425)
+Slant Magazine (revision 7697473)
+Slovenia (revision 10521499)
+Something Kinda Ooooh (revision 10112426)
+Sound of the Underground (album) (revision 10112476)
+Sound of the Underground (cântec) (revision 10112434)
+Tangled Up (revision 10112482)
+The Guardian (revision 9752334)
+The Paul O'Grady Show (revision 10101027)
+The Promise (revision 10166482)
+The Show (revision 10112441)
+The Sound of Girls Aloud (revision 10112480)
+Tonalitate (revision 9966362)
+Turneul Out of Control (revision 10112446)
+UK Mix (revision 9721468)
+UK Singles Chart (revision 10226705)
+Ungaria (revision 10737745)
+Uniunea Europeană (revision 10751590)
+Untouchable (revision 10112410)
+Wake Me Up (revision 10112439)
+What Will The Neighbours Say? (revision 10112478)
+Whole Lotta History (revision 10475020)
+Wideboys (revision 10166481)
+Wikimedia Commons (revision 9703907)
+Xenomania (revision 10112484)
+
+== End of Parsed pages ==
+
+- Wikipedia parsing ended at: 2016-09-28 18:58:13.756622
+
+60 characters appeared 883554 times.
+
+First 33 characters:
+[ 0] Char e: 11.67014127036944 %
+[ 1] Char i: 10.97567324690964 %
+[ 2] Char a: 10.080198833348046 %
+[ 3] Char r: 7.490657050955572 %
+[ 4] Char n: 7.18246988865423 %
+[ 5] Char t: 6.516296683620921 %
+[ 6] Char l: 5.595130574928075 %
+[ 7] Char u: 5.551217016730161 %
+[ 8] Char o: 4.922732509840938 %
+[ 9] Char c: 4.495707110148333 %
+[10] Char s: 3.8308920563994957 %
+[11] Char d: 3.590499279048027 %
+[12] Char m: 2.971408651876399 %
+[13] Char p: 2.902369294915761 %
+[14] Char ă: 2.1349006399156134 %
+[15] Char g: 1.2248261000459508 %
+[16] Char f: 1.1199089133205216 %
+[17] Char b: 1.0781457613230203 %
+[18] Char ț: 1.0323081554721047 %
+[19] Char ș: 0.9732285745975912 %
+[20] Char î: 0.97017273420753 %
+[21] Char v: 0.9693804792915882 %
+[22] Char z: 0.7369102510995367 %
+[23] Char h: 0.533413916976212 %
+[24] Char â: 0.4986678799484808 %
+[25] Char x: 0.22081276300033725 %
+[26] Char j: 0.20055367300696958 %
+[27] Char k: 0.1901411798260208 %
+[28] Char y: 0.15471606715605385 %
+[29] Char w: 0.11827234102273318 %
+[30] Char á: 0.016297815413658927 %
+[31] Char é: 0.013355154297303842 %
+[32] Char q: 0.00520624659047438 %
+
+The first 33 characters have an accumulated ratio of 0.9996661211425673.
+
+981 sequences found.
+
+First 512 (typical positive ratio): 0.997762564143313
+Next 512 (512-1024): 1.1317927370596478e-06
+Rest: 3.0357660829594124e-18
+
+- Processing end: 2016-09-28 18:58:13.862425
diff --git a/script/langs/ro.py b/script/langs/ro.py
new file mode 100644
index 0000000..0e8169e
--- /dev/null
+++ b/script/langs/ro.py
@@ -0,0 +1,65 @@
+#!/bin/python3
+# -*- coding: utf-8 -*-
+
+# ##### BEGIN LICENSE BLOCK #####
+# Version: MPL 1.1/GPL 2.0/LGPL 2.1
+#
+# The contents of this file are subject to the Mozilla Public License Version
+# 1.1 (the "License"); you may not use this file except in compliance with
+# the License.
You may obtain a copy of the License at
+# http://www.mozilla.org/MPL/
+#
+# Software distributed under the License is distributed on an "AS IS" basis,
+# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+# for the specific language governing rights and limitations under the
+# License.
+#
+# The Original Code is Mozilla Universal charset detector code.
+#
+# The Initial Developer of the Original Code is
+# Netscape Communications Corporation.
+# Portions created by the Initial Developer are Copyright (C) 2001
+# the Initial Developer. All Rights Reserved.
+#
+# Contributor(s):
+#          Jehan
+#
+# Alternatively, the contents of this file may be used under the terms of
+# either the GNU General Public License Version 2 or later (the "GPL"), or
+# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+# in which case the provisions of the GPL or the LGPL are applicable instead
+# of those above. If you wish to allow use of your version of this file only
+# under the terms of either the GPL or the LGPL, and not to allow others to
+# use your version of this file under the terms of the MPL, indicate your
+# decision by deleting the provisions above and replace them with the notice
+# and other provisions required by the GPL or the LGPL. If you do not delete
+# the provisions above, a recipient may use your version of this file under
+# the terms of any one of the MPL, the GPL or the LGPL.
+#
+# ##### END LICENSE BLOCK #####
+
+import re
+
+## Mandatory Properties ##
+
+name = 'Romanian'
+code = 'ro'
+use_ascii = True
+charsets = ['ISO-8859-2', 'ISO-8859-16',
+            'Windows-1250', 'IBM852']
+
+## Optional Properties ##
+
+# Alphabet characters.
+# Note: Wikipedia explains that s and t with cedilla (şţ), or even
+# bare s and t, were often used in place of s and t with comma (șț)
+# because these characters were missing from the most common encodings
+# at the time.
+# It may be worth adding some common_replacement_letters logic to
+# the training and models.
+# https://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859
+alphabet = 'ăâîșț'
+# The featured page shown on the main page when I created
+# the data.
+start_pages = ['The Loving Kind']
+wikipedia_code = code
+case_mapping = True
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 74b3939..2525ec6 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -27,6 +27,7 @@ set(
 	LangModels/LangMalteseModel.cpp
 	LangModels/LangPolishModel.cpp
 	LangModels/LangPortugueseModel.cpp
+	LangModels/LangRomanianModel.cpp
 	LangModels/LangRussianModel.cpp
 	LangModels/LangSlovakModel.cpp
 	LangModels/LangSpanishModel.cpp
diff --git a/src/LangModels/LangRomanianModel.cpp b/src/LangModels/LangRomanianModel.cpp
new file mode 100644
index 0000000..abef794
--- /dev/null
+++ b/src/LangModels/LangRomanianModel.cpp
@@ -0,0 +1,232 @@
+/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
+/* ***** BEGIN LICENSE BLOCK *****
+ * Version: MPL 1.1/GPL 2.0/LGPL 2.1
+ *
+ * The contents of this file are subject to the Mozilla Public License Version
+ * 1.1 (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ * http://www.mozilla.org/MPL/
+ *
+ * Software distributed under the License is distributed on an "AS IS" basis,
+ * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
+ * for the specific language governing rights and limitations under the
+ * License.
+ *
+ * The Original Code is Mozilla Communicator client code.
+ *
+ * The Initial Developer of the Original Code is
+ * Netscape Communications Corporation.
+ * Portions created by the Initial Developer are Copyright (C) 1998
+ * the Initial Developer. All Rights Reserved.
+ *
+ * Contributor(s):
+ *
+ * Alternatively, the contents of this file may be used under the terms of
+ * either the GNU General Public License Version 2 or later (the "GPL"), or
+ * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
+ * in which case the provisions of the GPL or the LGPL are applicable instead
+ * of those above. If you wish to allow use of your version of this file only
+ * under the terms of either the GPL or the LGPL, and not to allow others to
+ * use your version of this file under the terms of the MPL, indicate your
+ * decision by deleting the provisions above and replace them with the notice
+ * and other provisions required by the GPL or the LGPL. If you do not delete
+ * the provisions above, a recipient may use your version of this file under
+ * the terms of any one of the MPL, the GPL or the LGPL.
+ *
+ * ***** END LICENSE BLOCK ***** */
+
+#include "../nsSBCharSetProber.h"
+
+/********* Language model for: Romanian *********/
+
+/**
+ * Generated by BuildLangModel.py
+ * On: 2016-09-28 18:58:13.757152
+ **/
+
+/* Character Mapping Table:
+ * ILL: illegal character.
+ * CTR: control character specific to the charset.
+ * RET: carriage return or line feed.
+ * SYM: symbol (punctuation) that does not belong to word.
+ * NUM: 0 - 9.
+ *
+ * Other characters are ordered by probabilities
+ * (0 is the most common character in the language).
+ *
+ * Orders are generic to a language. So the codepoint with order X in
+ * CHARSET1 maps to the same character as the codepoint with the same
+ * order X in CHARSET2 for the same language.
+ * As such, some orders may be missing from a given charset. For instance the
+ * ligature of 'o' and 'e' exists in ISO-8859-15 but not in ISO-8859-1
+ * even though they are both used for French. Same for the euro sign.
+ */ +static const unsigned char Iso_8859_16_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */ + SYM, 60, 61, 46,SYM,SYM, 38,SYM, 38,SYM, 19,SYM, 62,SYM, 63, 64, /* AX */ + SYM,SYM, 41, 46, 40,SYM,SYM,SYM, 40, 41, 19,SYM, 65, 66, 67, 68, /* BX */ + 69, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 70, 37, 20, 48, /* CX */ + 43, 52, 59, 34, 71, 44, 36, 56, 50, 72, 47, 73, 39, 74, 18, 57, /* DX */ + 75, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 76, 37, 20, 48, /* EX */ + 43, 52, 59, 34, 77, 44, 36, 56, 50, 78, 47, 79, 39, 80, 18, 81, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const unsigned char Iso_8859_2_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */ + 
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */ + SYM, 82,SYM, 46,SYM, 83, 56,SYM,SYM, 38, 84, 85, 86,SYM, 40, 87, /* AX */ + SYM, 88,SYM, 46,SYM, 89, 56,SYM,SYM, 38, 90, 91, 92,SYM, 40, 93, /* BX */ + 94, 30, 24, 14, 33, 95, 35, 42, 41, 31, 96, 49, 51, 37, 20, 97, /* CX */ + 43, 52, 98, 34, 99, 44, 36,SYM, 55,100, 47, 50, 39, 54,101, 57, /* DX */ + 102, 30, 24, 14, 33,103, 35, 42, 41, 31,104, 49, 51, 37, 20,105, /* EX */ + 43, 52,106, 34,107, 44, 36,SYM, 55,108, 47, 50, 39, 54,109,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const unsigned char Windows_1250_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */ + SYM,ILL,SYM,ILL,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,110, 40,111, /* 8X */ + ILL,SYM,SYM,SYM,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,112, 40,113, /* 9X */ + SYM,SYM,SYM, 46,SYM,114,SYM,SYM,SYM,SYM,115,SYM,SYM,SYM,SYM,116, /* AX */ + SYM,SYM,SYM, 46,SYM,SYM,SYM,SYM,SYM,117,118,SYM,119,SYM,120,121, /* BX */ + 122, 30, 24, 14, 33,123, 35, 42, 41, 31,124, 49, 51, 37, 20,125, /* CX */ + 43, 52,126, 34,127, 44, 36,SYM, 55,128, 47, 50, 39, 54,129, 57, /* DX */ + 130, 30, 24, 14, 33,131, 35, 42, 41, 31,132, 49, 51, 37, 20,133, /* EX */ + 43, 52,134, 34,135, 44, 36,SYM, 55,136, 47, 50, 39, 54,137,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + +static const 
unsigned char Ibm852_CharToOrderMap[] = +{ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */ + CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */ + NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */ + SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */ + 13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */ + 42, 39, 31, 24, 33,138, 35, 42, 46, 49, 44, 44, 20,139, 33, 35, /* 8X */ + 31,140,141,142, 36,143,144, 56, 56, 36, 39,145,146, 46,SYM, 41, /* 9X */ + 30, 37, 34, 47,147,148, 40, 40,149,150,SYM,151, 41,152,SYM,SYM, /* AX */ + SYM,SYM,SYM,SYM,SYM, 30, 24, 51,153,SYM,SYM,SYM,SYM,154,155,SYM, /* BX */ + SYM,SYM,SYM,SYM,SYM,SYM, 14, 14,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* CX */ + 43, 43,156, 49,157,158, 37, 20, 51,SYM,SYM,SYM,SYM,159,160,SYM, /* DX */ + 34, 57,161, 52, 52,162, 38, 38,163, 47,164, 50, 54, 54,165,SYM, /* EX */ + SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, 50, 55, 55,SYM,SYM, /* FX */ +}; +/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */ + + +/* Model Table: + * Total sequences: 981 + * First 512 sequences: 0.997762564143313 + * Next 512 sequences (512-1024): 0.002237435856687006 + * Rest: 3.0357660829594124e-18 + * Negative sequences: TODO + */ +static const PRUint8 RomanianLangModel[] = +{ + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,2,0,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,2,3,3,3,2,2,0,0,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,3,3,3,3,0,2,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,2,2,3,3,2,2,2,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,0,3,3,3,3,2,3,3,3,3,2,2,2, + 3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,2,3,3,0,2,2,3,3,3,3,0,2,2,3,3,2,3,0, + 
3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,2,2,3,0,3,3,3,2,2,2,0, + 3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,3,3,2,2,0,2,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,2,3,3,3,2,0,2, + 3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,2,2,3,2,0,3,2,3,3,0,3,3,2,2,0,2,2, + 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,0,2,2,3,3,3,0,2,3,3,3,2,2,2, + 3,3,3,3,3,2,3,3,3,2,3,3,3,2,3,3,2,3,0,0,0,3,2,3,3,0,2,2,3,3,3,2,0, + 3,3,3,2,3,3,3,3,3,3,3,2,3,3,3,2,2,3,3,2,2,2,2,3,3,2,0,0,3,2,2,2,0, + 3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,0,0,2,3,0,2,0,2,3,3,0,2,2,3,0,2,2,0, + 2,3,0,3,3,3,3,3,0,3,3,3,3,3,0,3,0,3,3,3,0,3,3,0,0,0,2,2,0,0,0,0,0, + 3,3,3,3,3,2,3,3,3,0,2,3,3,2,3,3,2,3,0,0,2,3,2,3,3,0,2,0,3,2,2,2,0, + 3,3,3,3,0,3,3,3,3,2,2,2,3,2,3,2,3,0,0,0,0,0,0,2,3,0,0,0,2,0,2,2,0, + 3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,2,2,3,3,2,0,2,2,2,3,0,2,2,3,2,2,2,0, + 3,3,3,0,0,0,0,3,2,2,2,0,0,0,3,0,0,0,0,0,2,2,0,0,2,0,0,2,0,0,0,0,0, + 3,3,3,0,3,3,3,3,3,3,0,2,2,0,3,0,0,0,0,0,0,2,0,0,2,0,0,2,0,0,0,0,0, + 0,3,0,2,3,0,3,0,0,0,0,0,3,0,0,0,0,0,2,3,0,0,2,2,0,0,0,2,0,0,0,0,0, + 3,3,3,3,3,2,3,3,3,2,2,3,2,0,3,2,2,2,0,0,0,0,0,0,3,0,2,2,2,0,2,0,0, + 3,3,3,2,2,2,2,3,3,0,2,3,2,2,3,2,0,3,0,0,0,3,3,2,3,0,0,2,2,0,2,2,0, + 3,3,3,3,3,3,3,3,3,2,3,2,2,2,3,0,2,3,0,0,0,2,2,0,2,0,2,2,3,2,2,2,0, + 0,3,0,3,3,3,3,3,0,2,2,2,3,0,0,0,0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0, + 3,3,3,2,0,3,0,3,3,3,2,0,0,3,3,0,3,0,0,0,0,3,0,2,2,3,0,0,3,0,0,0,0, + 3,3,3,2,2,2,3,3,3,0,2,2,2,0,2,0,0,2,0,0,0,2,0,0,2,0,0,2,0,0,2,0,0, + 3,3,3,3,2,3,3,3,3,2,3,2,3,2,2,2,2,2,2,2,2,2,0,3,0,0,0,2,3,2,2,2,0, + 3,2,3,3,3,2,3,2,3,3,3,3,3,2,0,2,0,2,0,0,0,2,2,2,0,0,2,2,0,2,2,0,0, + 3,3,3,2,3,2,2,2,3,2,3,2,2,2,0,0,2,2,0,0,0,0,0,3,0,0,0,0,2,3,0,0,0, + 2,3,0,3,3,2,2,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0, + 0,3,2,2,2,2,2,0,0,2,2,2,2,2,0,2,0,2,0,0,0,2,2,0,0,0,2,2,0,0,0,0,0, + 0,0,2,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2, +}; + + +const SequenceModel Iso_8859_16RomanianModel = +{ + Iso_8859_16_CharToOrderMap, + RomanianLangModel, + 33, + 
(float)0.997762564143313, + PR_TRUE, + "ISO-8859-16" +}; + +const SequenceModel Iso_8859_2RomanianModel = +{ + Iso_8859_2_CharToOrderMap, + RomanianLangModel, + 33, + (float)0.997762564143313, + PR_TRUE, + "ISO-8859-2" +}; + +const SequenceModel Windows_1250RomanianModel = +{ + Windows_1250_CharToOrderMap, + RomanianLangModel, + 33, + (float)0.997762564143313, + PR_TRUE, + "WINDOWS-1250" +}; + +const SequenceModel Ibm852RomanianModel = +{ + Ibm852_CharToOrderMap, + RomanianLangModel, + 33, + (float)0.997762564143313, + PR_TRUE, + "IBM852" +}; \ No newline at end of file diff --git a/src/nsSBCSGroupProber.cpp b/src/nsSBCSGroupProber.cpp index 037153b..96c93e0 100644 --- a/src/nsSBCSGroupProber.cpp +++ b/src/nsSBCSGroupProber.cpp @@ -174,6 +174,11 @@ nsSBCSGroupProber::nsSBCSGroupProber() mProbers[83] = new nsSingleByteCharSetProber(&Iso_8859_15IrishModel); mProbers[84] = new nsSingleByteCharSetProber(&Windows_1252IrishModel); + mProbers[85] = new nsSingleByteCharSetProber(&Windows_1250RomanianModel); + mProbers[86] = new nsSingleByteCharSetProber(&Iso_8859_2RomanianModel); + mProbers[87] = new nsSingleByteCharSetProber(&Iso_8859_16RomanianModel); + mProbers[88] = new nsSingleByteCharSetProber(&Ibm852RomanianModel); + Reset(); } diff --git a/src/nsSBCSGroupProber.h b/src/nsSBCSGroupProber.h index 405e43c..7f7425c 100644 --- a/src/nsSBCSGroupProber.h +++ b/src/nsSBCSGroupProber.h @@ -40,7 +40,7 @@ #define nsSBCSGroupProber_h__ -#define NUM_OF_SBCS_PROBERS 85 +#define NUM_OF_SBCS_PROBERS 89 class nsCharSetProber; class nsSBCSGroupProber: public nsCharSetProber { diff --git a/src/nsSBCharSetProber.h b/src/nsSBCharSetProber.h index dc9ddd7..e6dd2ae 100644 --- a/src/nsSBCharSetProber.h +++ b/src/nsSBCharSetProber.h @@ -235,5 +235,10 @@ extern const SequenceModel Iso_8859_9IrishModel; extern const SequenceModel Iso_8859_1IrishModel; extern const SequenceModel Windows_1252IrishModel; +extern const SequenceModel Windows_1250RomanianModel; +extern const SequenceModel 
Iso_8859_2RomanianModel;
+extern const SequenceModel Iso_8859_16RomanianModel;
+extern const SequenceModel Ibm852RomanianModel;
+
 #endif /* nsSingleByteCharSetProber_h__ */

diff --git a/test/ro/ibm852.txt b/test/ro/ibm852.txt
new file mode 100644
index 0000000..634dda2
--- /dev/null
+++ b/test/ro/ibm852.txt
@@ -0,0 +1,9 @@
+Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
+daneză Sunet Kongeriget Danmark), este un stat suveran din
+Europa de Nord, având si două tări constituente de peste mări, care fac parte
+integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
+America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
+nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
+sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
+insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
+Bornholm, precum si sute de insulite denumite în general ,,Arhipelagul Danez".
diff --git a/test/ro/iso-8859-16.txt b/test/ro/iso-8859-16.txt
new file mode 100644
index 0000000..29ae299
--- /dev/null
+++ b/test/ro/iso-8859-16.txt
@@ -0,0 +1,9 @@
+Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
+daneză Sunet Kongeriget Danmark), este un stat suveran din
+Europa de Nord, având și două țări constituente de peste mări, care fac parte
+integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
+America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
+nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
+sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
+insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
+Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.
diff --git a/test/ro/utf-8.txt b/test/ro/utf-8.txt
new file mode 100644
index 0000000..dea759e
--- /dev/null
+++ b/test/ro/utf-8.txt
@@ -0,0 +1,9 @@
+Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
+daneză Sunet Kongeriget Danmark), este un stat suveran din
+Europa de Nord, având și două țări constituente de peste mări, care fac parte
+integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
+America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
+nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
+sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
+insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
+Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.
diff --git a/test/ro/windows-1250.txt b/test/ro/windows-1250.txt
new file mode 100644
index 0000000..f43cb89
--- /dev/null
+++ b/test/ro/windows-1250.txt
@@ -0,0 +1,9 @@
+Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
+daneză Sunet Kongeriget Danmark), este un stat suveran din
+Europa de Nord, având si două tări constituente de peste mări, care fac parte
+integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
+America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
+nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
+sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
+insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
+Bornholm, precum si sute de insulite denumite în general „Arhipelagul Danez”.
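
Note on the character mapping tables: the comment in LangRomanianModel.cpp states that orders are generic to a language, so a given letter carries the same order in every charset. The Python sketch below illustrates that idea only; it is not BuildLangModel.py, and `RANKING`, `ORDER` and `char_to_order_map` are hypothetical names. It assumes the 33-letter ranking printed in LangRomanianModel.log and uses Python's built-in `iso8859_16` and `cp852` codecs as stand-ins for the charset definitions.

```python
# Illustrative sketch only (hypothetical names, not BuildLangModel.py).
# The 33 ranked letters from LangRomanianModel.log, most frequent first.
# Non-ASCII letters are written as escapes so the comma-below forms
# (U+0219, U+021B) cannot be confused with the cedilla forms.
RANKING = (
    "eiarntluocsdmp"
    "\u0103"          # a with breve, order 14
    "gfb"
    "\u021b\u0219"    # t and s with comma below, orders 18 and 19
    "\u00ee"          # i with circumflex, order 20
    "vzh"
    "\u00e2"          # a with circumflex, order 24
    "xjkyw"
    "\u00e1\u00e9"    # a and e with acute, orders 30 and 31
    "q"
)

# One language-wide order per letter, shared by every charset.
ORDER = {ch: rank for rank, ch in enumerate(RANKING)}


def char_to_order_map(codec):
    """Return {byte value: order} for the ranked letters that `codec` can encode."""
    table = {}
    for ch, order in ORDER.items():
        # Upper and lower case share one order, as in the generated tables.
        for variant in (ch, ch.upper()):
            try:
                encoded = variant.encode(codec)
            except UnicodeEncodeError:
                continue  # this letter has no codepoint in the charset
            if len(encoded) == 1:
                table[encoded[0]] = order
    return table


iso16 = char_to_order_map("iso8859_16")
cp852 = char_to_order_map("cp852")

print(hex(0xE3), iso16[0xE3])   # byte for a-breve in ISO-8859-16
print(hex(0xC7), cp852[0xC7])   # byte for a-breve in IBM852
print(sorted(set(iso16.values()) - set(cp852.values())))  # orders IBM852 lacks
```

Under these assumptions the byte for ă differs between the two charsets (0xE3 and 0xC7) while its order is 14 in both, which agrees with rows EX of `Iso_8859_16_CharToOrderMap` and CX of `Ibm852_CharToOrderMap`. The orders for ș and ț with comma below have no byte in IBM852, which is the gap the note in script/langs/ro.py describes.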