LangModels: Romanian support added.

Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852.
Test texts from https://ro.wikipedia.org/wiki/Danemarca
Jehan 2016-09-28 19:54:17 +02:00
parent 0a04177787
commit fbd2efdbe9
12 changed files with 503 additions and 1 deletions

@ -115,6 +115,11 @@ Techniques used by universalchardet are described at http://www.mozilla.org/proj
  * ISO-8859-9
  * ISO-8859-15
  * WINDOWS-1252
* Romanian:
  * ISO-8859-2
  * ISO-8859-16
  * Windows-1250
  * IBM852
* Russian
  * ISO-8859-5
  * KOI8-R

@ -0,0 +1,153 @@
= Logs of language model for Romanian (ro) =
- Generated by BuildLangModel.py
- Started: 2016-09-28 18:53:56.086095
- Maximum depth: 5
- Max number of pages: 100
== Parsed pages ==
The Loving Kind (revision 10166481)
12 ianuarie (revision 10711676)
13 decembrie (revision 9938353)
2007 (revision 10716321)
2008 (revision 10752084)
2009 (revision 10654003)
21 noiembrie (revision 10447643)
25 ianuarie (revision 10228199)
31 ianuarie (revision 10718063)
4 Music (revision 9701591)
Billboard (revision 10505294)
Biology (revision 10112430)
Bulgaria (revision 10481051)
CD (revision 10477531)
Call The Shots (revision 10101027)
Call the Shots (revision 10101027)
Can't Speak French (revision 9721506)
Casă de discuri (revision 10611348)
Channel 4 (revision 7953101)
Chemistry (revision 10112479)
Cheryl Cole (revision 10475016)
Chitară (revision 10468266)
Croația (revision 10737746)
Dance (revision 10231736)
Descărcare digitală (revision 10100743)
Digital Spy (revision 9044016)
Discografia Girls Aloud (revision 10172788)
Estonia (revision 10749810)
Europa (revision 10752724)
Fascination Records (revision 9655292)
Fiona Phillips (revision 5384082)
Gen muzical (revision 10534645)
Girls A Live (revision 10112444)
Girls Aloud (revision 10112446)
Good Morning Television (revision 10166481)
Heat World (revision 10166481)
I'll Stand By You (cântec de Girls Aloud) (revision 10112432)
ITunes (revision 10744174)
I Think We're Alone Now (revision 10112427)
Irlanda (revision 10573806)
Jump (cântec de Girls Aloud) (revision 10112438)
Lady GaGa (revision 10753010)
Life Got Cold (revision 10112437)
Limba engleză (revision 10756676)
Long Hot Summer (revision 10112429)
Love Machine (revision 10112433)
MSN Search (revision 10653298)
MTV (revision 10170766)
Mixed Up (revision 10112443)
Muzică electronică (revision 10608432)
Muzică pop (revision 10740529)
Nadine Coyle (revision 10316187)
Neil Tennant (revision 10499980)
No Good Advice (revision 10112436)
Out Of Control (revision 10112484)
Out of Control (revision 10112484)
Pet Shop Boys (revision 10612741)
Poker Face (revision 10496402)
PopJustice (revision 10625677)
Regatul Unit (revision 10752338)
Regatul Unit al Marii Britanii și Irlandei de Nord (revision 10752338)
Regatul Unit al Marii Britanii și al Irlandei de Nord (revision 10752338)
Republica Irlanda (revision 10573806)
Romanian Top 100 (revision 10736281)
România (revision 10732435)
Sarah Harding (revision 10633651)
Sarah Hearding (revision 10112425)
See the Day (revision 10112431)
Sexy! No No No... (revision 10112425)
Slant Magazine (revision 7697473)
Slovenia (revision 10521499)
Something Kinda Ooooh (revision 10112426)
Sound of the Underground (album) (revision 10112476)
Sound of the Underground (cântec) (revision 10112434)
Tangled Up (revision 10112482)
The Guardian (revision 9752334)
The Paul O'Grady Show (revision 10101027)
The Promise (revision 10166482)
The Show (revision 10112441)
The Sound of Girls Aloud (revision 10112480)
Tonalitate (revision 9966362)
Turneul Out of Control (revision 10112446)
UK Mix (revision 9721468)
UK Singles Chart (revision 10226705)
Ungaria (revision 10737745)
Uniunea Europeană (revision 10751590)
Untouchable (revision 10112410)
Wake Me Up (revision 10112439)
What Will The Neighbours Say? (revision 10112478)
Whole Lotta History (revision 10475020)
Wideboys (revision 10166481)
Wikimedia Commons (revision 9703907)
Xenomania (revision 10112484)
== End of Parsed pages ==
- Wikipedia parsing ended at: 2016-09-28 18:58:13.756622
60 characters appeared 883554 times.
First 33 characters:
[ 0] Char e: 11.67014127036944 %
[ 1] Char i: 10.97567324690964 %
[ 2] Char a: 10.080198833348046 %
[ 3] Char r: 7.490657050955572 %
[ 4] Char n: 7.18246988865423 %
[ 5] Char t: 6.516296683620921 %
[ 6] Char l: 5.595130574928075 %
[ 7] Char u: 5.551217016730161 %
[ 8] Char o: 4.922732509840938 %
[ 9] Char c: 4.495707110148333 %
[10] Char s: 3.8308920563994957 %
[11] Char d: 3.590499279048027 %
[12] Char m: 2.971408651876399 %
[13] Char p: 2.902369294915761 %
[14] Char ă: 2.1349006399156134 %
[15] Char g: 1.2248261000459508 %
[16] Char f: 1.1199089133205216 %
[17] Char b: 1.0781457613230203 %
[18] Char ț: 1.0323081554721047 %
[19] Char ș: 0.9732285745975912 %
[20] Char î: 0.97017273420753 %
[21] Char v: 0.9693804792915882 %
[22] Char z: 0.7369102510995367 %
[23] Char h: 0.533413916976212 %
[24] Char â: 0.4986678799484808 %
[25] Char x: 0.22081276300033725 %
[26] Char j: 0.20055367300696958 %
[27] Char k: 0.1901411798260208 %
[28] Char y: 0.15471606715605385 %
[29] Char w: 0.11827234102273318 %
[30] Char á: 0.016297815413658927 %
[31] Char é: 0.013355154297303842 %
[32] Char q: 0.00520624659047438 %
The first 33 characters have an accumulated ratio of 0.9996661211425673.
981 sequences found.
First 512 (typical positive ratio): 0.997762564143313
Next 512 (512-1024): 1.1317927370596478e-06
Rest: 3.0357660829594124e-18
- Processing end: 2016-09-28 18:58:13.862425
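
The per-character percentages and the accumulated ratio in the log above can be reproduced in spirit with a few lines of Python. This is a minimal sketch, not the code of BuildLangModel.py: the function names `char_frequencies` and `accumulated_ratio` are hypothetical, and the assumption is that only characters from a given alphabet are counted, case-insensitively.

```python
from collections import Counter


def char_frequencies(text, alphabet):
    """Rank alphabet characters by relative frequency (hypothetical helper).

    Returns (ranked, total) where ranked is a list of (char, ratio) pairs
    sorted from most to least frequent, and total is the number of counted
    characters.
    """
    counts = Counter(ch for ch in text.lower() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [], 0
    ranked = [(ch, n / total) for ch, n in counts.most_common()]
    return ranked, total


def accumulated_ratio(ranked, first_n):
    """Sum the ratios of the first_n most frequent characters."""
    return sum(ratio for _, ratio in ranked[:first_n])


if __name__ == "__main__":
    sample = "Danemarca este un stat suveran din Europa de Nord"
    ranked, total = char_frequencies(sample, "abcdefghijklmnopqrstuvwxyzăâîșț")
    for order, (ch, ratio) in enumerate(ranked[:5]):
        print(f"[{order:2}] Char {ch}: {ratio * 100} %")
    print("accumulated:", accumulated_ratio(ranked, 5))
```

In the log, the 33 most frequent characters cover about 99.97% of all counted characters, which is presumably why 33 is the frequent-character count passed to each `SequenceModel`.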

script/langs/ro.py

@ -0,0 +1,65 @@
#!/bin/python3
# -*- coding: utf-8 -*-
# ##### BEGIN LICENSE BLOCK #####
# Version: MPL 1.1/GPL 2.0/LGPL 2.1
#
# The contents of this file are subject to the Mozilla Public License Version
# 1.1 (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# http://www.mozilla.org/MPL/
#
# Software distributed under the License is distributed on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
# for the specific language governing rights and limitations under the
# License.
#
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Jehan <jehan@girinstud.io>
#
# Alternatively, the contents of this file may be used under the terms of
# either the GNU General Public License Version 2 or later (the "GPL"), or
# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
# in which case the provisions of the GPL or the LGPL are applicable instead
# of those above. If you wish to allow use of your version of this file only
# under the terms of either the GPL or the LGPL, and not to allow others to
# use your version of this file under the terms of the MPL, indicate your
# decision by deleting the provisions above and replace them with the notice
# and other provisions required by the GPL or the LGPL. If you do not delete
# the provisions above, a recipient may use your version of this file under
# the terms of any one of the MPL, the GPL or the LGPL.
#
# ##### END LICENSE BLOCK #####
import re
## Mandatory Properties ##
name = 'Romanian'
code = 'ro'
use_ascii = True
charsets = ['ISO-8859-2', 'ISO-8859-16',
'Windows-1250', 'IBM852']
## Optional Properties ##
# Alphabet characters.
# Note: Wikipedia explains that s and t with cedilla (şţ), or even
# bare s and t, were often used in place of s and t with comma (șț)
# because the comma variants were missing from the most common
# encodings at the time.
# It may be worth adding some common_replacement_letters logic to
# the training and models.
# https://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859
alphabet = 'ăâîșț'
# The featured article shown on the main page when I created
# the data.
start_pages = ['The Loving Kind']
wikipedia_code = code
case_mapping = True
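
The note in `ro.py` suggests some `common_replacement_letters` logic without defining it. One way such a step might look, purely as an illustration: fold the cedilla variants into the comma-below letters before counting, so that text written with either convention contributes to the same statistics. The names `CEDILLA_TO_COMMA` and `normalize_romanian` are hypothetical and are not part of BuildLangModel.py.

```python
# Hypothetical normalization step; not part of the commit.
# Maps s/t with cedilla to s/t with comma below, in both cases.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})


def normalize_romanian(text):
    """Replace cedilla variants with the comma-below letters."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    print(normalize_romanian("\u015fi dou\u0103 \u0163\u0103ri"))
```

Text that uses bare `s` and `t` cannot be recovered this way, since the substitution is ambiguous without a dictionary.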

@ -27,6 +27,7 @@ set(
LangModels/LangMalteseModel.cpp
LangModels/LangPolishModel.cpp
LangModels/LangPortugueseModel.cpp
LangModels/LangRomanianModel.cpp
LangModels/LangRussianModel.cpp
LangModels/LangSlovakModel.cpp
LangModels/LangSpanishModel.cpp

@ -0,0 +1,232 @@
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
*
* The contents of this file are subject to the Mozilla Public License Version
* 1.1 (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
* http://www.mozilla.org/MPL/
*
* Software distributed under the License is distributed on an "AS IS" basis,
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
* for the specific language governing rights and limitations under the
* License.
*
* The Original Code is Mozilla Communicator client code.
*
* The Initial Developer of the Original Code is
* Netscape Communications Corporation.
* Portions created by the Initial Developer are Copyright (C) 1998
* the Initial Developer. All Rights Reserved.
*
* Contributor(s):
*
* Alternatively, the contents of this file may be used under the terms of
* either the GNU General Public License Version 2 or later (the "GPL"), or
* the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
* in which case the provisions of the GPL or the LGPL are applicable instead
* of those above. If you wish to allow use of your version of this file only
* under the terms of either the GPL or the LGPL, and not to allow others to
* use your version of this file under the terms of the MPL, indicate your
* decision by deleting the provisions above and replace them with the notice
* and other provisions required by the GPL or the LGPL. If you do not delete
* the provisions above, a recipient may use your version of this file under
* the terms of any one of the MPL, the GPL or the LGPL.
*
* ***** END LICENSE BLOCK ***** */
#include "../nsSBCharSetProber.h"
/********* Language model for: Romanian *********/
/**
* Generated by BuildLangModel.py
* On: 2016-09-28 18:58:13.757152
**/
/* Character Mapping Table:
* ILL: illegal character.
* CTR: control character specific to the charset.
* RET: carriage/return.
* SYM: symbol (punctuation) that does not belong to word.
* NUM: 0 - 9.
*
* Other characters are ordered by probabilities
* (0 is the most common character in the language).
*
* Orders are generic to a language. So the codepoint with order X in
* CHARSET1 maps to the same character as the codepoint with the same
* order X in CHARSET2 for the same language.
* As such, it is possible to get missing order. For instance the
* ligature of 'o' and 'e' exists in ISO-8859-15 but not in ISO-8859-1
* even though they are both used for French. Same for the euro sign.
*/
static const unsigned char Iso_8859_16_CharToOrderMap[] =
{
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */
SYM, 60, 61, 46,SYM,SYM, 38,SYM, 38,SYM, 19,SYM, 62,SYM, 63, 64, /* AX */
SYM,SYM, 41, 46, 40,SYM,SYM,SYM, 40, 41, 19,SYM, 65, 66, 67, 68, /* BX */
69, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 70, 37, 20, 48, /* CX */
43, 52, 59, 34, 71, 44, 36, 56, 50, 72, 47, 73, 39, 74, 18, 57, /* DX */
75, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 76, 37, 20, 48, /* EX */
43, 52, 59, 34, 77, 44, 36, 56, 50, 78, 47, 79, 39, 80, 18, 81, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
static const unsigned char Iso_8859_2_CharToOrderMap[] =
{
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */
SYM, 82,SYM, 46,SYM, 83, 56,SYM,SYM, 38, 84, 85, 86,SYM, 40, 87, /* AX */
SYM, 88,SYM, 46,SYM, 89, 56,SYM,SYM, 38, 90, 91, 92,SYM, 40, 93, /* BX */
94, 30, 24, 14, 33, 95, 35, 42, 41, 31, 96, 49, 51, 37, 20, 97, /* CX */
43, 52, 98, 34, 99, 44, 36,SYM, 55,100, 47, 50, 39, 54,101, 57, /* DX */
102, 30, 24, 14, 33,103, 35, 42, 41, 31,104, 49, 51, 37, 20,105, /* EX */
43, 52,106, 34,107, 44, 36,SYM, 55,108, 47, 50, 39, 54,109,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
static const unsigned char Windows_1250_CharToOrderMap[] =
{
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
SYM,ILL,SYM,ILL,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,110, 40,111, /* 8X */
ILL,SYM,SYM,SYM,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,112, 40,113, /* 9X */
SYM,SYM,SYM, 46,SYM,114,SYM,SYM,SYM,SYM,115,SYM,SYM,SYM,SYM,116, /* AX */
SYM,SYM,SYM, 46,SYM,SYM,SYM,SYM,SYM,117,118,SYM,119,SYM,120,121, /* BX */
122, 30, 24, 14, 33,123, 35, 42, 41, 31,124, 49, 51, 37, 20,125, /* CX */
43, 52,126, 34,127, 44, 36,SYM, 55,128, 47, 50, 39, 54,129, 57, /* DX */
130, 30, 24, 14, 33,131, 35, 42, 41, 31,132, 49, 51, 37, 20,133, /* EX */
43, 52,134, 34,135, 44, 36,SYM, 55,136, 47, 50, 39, 54,137,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
static const unsigned char Ibm852_CharToOrderMap[] =
{
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
42, 39, 31, 24, 33,138, 35, 42, 46, 49, 44, 44, 20,139, 33, 35, /* 8X */
31,140,141,142, 36,143,144, 56, 56, 36, 39,145,146, 46,SYM, 41, /* 9X */
30, 37, 34, 47,147,148, 40, 40,149,150,SYM,151, 41,152,SYM,SYM, /* AX */
SYM,SYM,SYM,SYM,SYM, 30, 24, 51,153,SYM,SYM,SYM,SYM,154,155,SYM, /* BX */
SYM,SYM,SYM,SYM,SYM,SYM, 14, 14,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* CX */
43, 43,156, 49,157,158, 37, 20, 51,SYM,SYM,SYM,SYM,159,160,SYM, /* DX */
34, 57,161, 52, 52,162, 38, 38,163, 47,164, 50, 54, 54,165,SYM, /* EX */
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, 50, 55, 55,SYM,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
/* Model Table:
* Total sequences: 981
* First 512 sequences: 0.997762564143313
* Next 512 sequences (512-1024): 0.002237435856687006
* Rest: 3.0357660829594124e-18
* Negative sequences: TODO
*/
static const PRUint8 RomanianLangModel[] =
{
3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,2,0,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,2,3,3,3,2,2,0,0,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,3,3,3,3,0,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,2,2,3,3,2,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,0,3,3,3,3,2,3,3,3,3,2,2,2,
3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,2,3,3,0,2,2,3,3,3,3,0,2,2,3,3,2,3,0,
3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,2,2,3,0,3,3,3,2,2,2,0,
3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,3,3,2,2,0,2,0,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,2,3,3,3,2,0,2,
3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,2,2,3,2,0,3,2,3,3,0,3,3,2,2,0,2,2,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,0,2,2,3,3,3,0,2,3,3,3,2,2,2,
3,3,3,3,3,2,3,3,3,2,3,3,3,2,3,3,2,3,0,0,0,3,2,3,3,0,2,2,3,3,3,2,0,
3,3,3,2,3,3,3,3,3,3,3,2,3,3,3,2,2,3,3,2,2,2,2,3,3,2,0,0,3,2,2,2,0,
3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,0,0,2,3,0,2,0,2,3,3,0,2,2,3,0,2,2,0,
2,3,0,3,3,3,3,3,0,3,3,3,3,3,0,3,0,3,3,3,0,3,3,0,0,0,2,2,0,0,0,0,0,
3,3,3,3,3,2,3,3,3,0,2,3,3,2,3,3,2,3,0,0,2,3,2,3,3,0,2,0,3,2,2,2,0,
3,3,3,3,0,3,3,3,3,2,2,2,3,2,3,2,3,0,0,0,0,0,0,2,3,0,0,0,2,0,2,2,0,
3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,2,2,3,3,2,0,2,2,2,3,0,2,2,3,2,2,2,0,
3,3,3,0,0,0,0,3,2,2,2,0,0,0,3,0,0,0,0,0,2,2,0,0,2,0,0,2,0,0,0,0,0,
3,3,3,0,3,3,3,3,3,3,0,2,2,0,3,0,0,0,0,0,0,2,0,0,2,0,0,2,0,0,0,0,0,
0,3,0,2,3,0,3,0,0,0,0,0,3,0,0,0,0,0,2,3,0,0,2,2,0,0,0,2,0,0,0,0,0,
3,3,3,3,3,2,3,3,3,2,2,3,2,0,3,2,2,2,0,0,0,0,0,0,3,0,2,2,2,0,2,0,0,
3,3,3,2,2,2,2,3,3,0,2,3,2,2,3,2,0,3,0,0,0,3,3,2,3,0,0,2,2,0,2,2,0,
3,3,3,3,3,3,3,3,3,2,3,2,2,2,3,0,2,3,0,0,0,2,2,0,2,0,2,2,3,2,2,2,0,
0,3,0,3,3,3,3,3,0,2,2,2,3,0,0,0,0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,
3,3,3,2,0,3,0,3,3,3,2,0,0,3,3,0,3,0,0,0,0,3,0,2,2,3,0,0,3,0,0,0,0,
3,3,3,2,2,2,3,3,3,0,2,2,2,0,2,0,0,2,0,0,0,2,0,0,2,0,0,2,0,0,2,0,0,
3,3,3,3,2,3,3,3,3,2,3,2,3,2,2,2,2,2,2,2,2,2,0,3,0,0,0,2,3,2,2,2,0,
3,2,3,3,3,2,3,2,3,3,3,3,3,2,0,2,0,2,0,0,0,2,2,2,0,0,2,2,0,2,2,0,0,
3,3,3,2,3,2,2,2,3,2,3,2,2,2,0,0,2,2,0,0,0,0,0,3,0,0,0,0,2,3,0,0,0,
2,3,0,3,3,2,2,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,
0,3,2,2,2,2,2,0,0,2,2,2,2,2,0,2,0,2,0,0,0,2,2,0,0,0,2,2,0,0,0,0,0,
0,0,2,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,
};
const SequenceModel Iso_8859_16RomanianModel =
{
Iso_8859_16_CharToOrderMap,
RomanianLangModel,
33,
(float)0.997762564143313,
PR_TRUE,
"ISO-8859-16"
};
const SequenceModel Iso_8859_2RomanianModel =
{
Iso_8859_2_CharToOrderMap,
RomanianLangModel,
33,
(float)0.997762564143313,
PR_TRUE,
"ISO-8859-2"
};
const SequenceModel Windows_1250RomanianModel =
{
Windows_1250_CharToOrderMap,
RomanianLangModel,
33,
(float)0.997762564143313,
PR_TRUE,
"WINDOWS-1250"
};
const SequenceModel Ibm852RomanianModel =
{
Ibm852_CharToOrderMap,
RomanianLangModel,
33,
(float)0.997762564143313,
PR_TRUE,
"IBM852"
};
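
The character mapping comment in this file says that orders are generic to a language: a given order denotes the same character in every charset table. A small check of that claim for order 14 (`ă`), using Python's standard codecs rather than anything from uchardet. The byte positions below were read off the four `CharToOrderMap` tables above; the names `BYTE_WITH_ORDER_14` and `characters_by_charset` are made up for this sketch.

```python
# Position of order 14 (lowercase) in each CharToOrderMap above,
# paired with the Python codec name for that charset.
BYTE_WITH_ORDER_14 = {
    "ISO-8859-16": ("iso8859_16", 0xE3),
    "ISO-8859-2": ("iso8859_2", 0xE3),
    "WINDOWS-1250": ("cp1250", 0xE3),
    "IBM852": ("cp852", 0xC7),
}


def characters_by_charset(table=BYTE_WITH_ORDER_14):
    """Decode the byte carrying the same order in each charset."""
    return {
        name: bytes([byte]).decode(codec)
        for name, (codec, byte) in table.items()
    }


if __name__ == "__main__":
    for name, char in characters_by_charset().items():
        print(f"{name}: {char!r}")
```

All four bytes decode to the same letter, which is what lets the four `SequenceModel` definitions share a single `RomanianLangModel` table.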

@ -174,6 +174,11 @@ nsSBCSGroupProber::nsSBCSGroupProber()
mProbers[83] = new nsSingleByteCharSetProber(&Iso_8859_15IrishModel);
mProbers[84] = new nsSingleByteCharSetProber(&Windows_1252IrishModel);
mProbers[85] = new nsSingleByteCharSetProber(&Windows_1250RomanianModel);
mProbers[86] = new nsSingleByteCharSetProber(&Iso_8859_2RomanianModel);
mProbers[87] = new nsSingleByteCharSetProber(&Iso_8859_16RomanianModel);
mProbers[88] = new nsSingleByteCharSetProber(&Ibm852RomanianModel);
Reset();
}

@ -40,7 +40,7 @@
#define nsSBCSGroupProber_h__
-#define NUM_OF_SBCS_PROBERS 85
+#define NUM_OF_SBCS_PROBERS 89
class nsCharSetProber;
class nsSBCSGroupProber: public nsCharSetProber {

@ -235,5 +235,10 @@ extern const SequenceModel Iso_8859_9IrishModel;
extern const SequenceModel Iso_8859_1IrishModel;
extern const SequenceModel Windows_1252IrishModel;
extern const SequenceModel Windows_1250RomanianModel;
extern const SequenceModel Iso_8859_2RomanianModel;
extern const SequenceModel Iso_8859_16RomanianModel;
extern const SequenceModel Ibm852RomanianModel;
#endif /* nsSingleByteCharSetProber_h__ */

test/ro/ibm852.txt

@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având si două tări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
Bornholm, precum si sute de insulite denumite în general ,,Arhipelagul Danez".

test/ro/iso-8859-16.txt

@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având și două țări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.

test/ro/utf-8.txt

@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având și două țări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.

test/ro/windows-1250.txt

@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având si două tări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
Bornholm, precum si sute de insulite denumite în general „Arhipelagul Danez”.
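
The Windows-1250 fixture above uses bare `s` and `t` where the UTF-8 text has `ș` and `ț`, because that charset has no comma-below letters. A sketch of how such fixtures could be derived from the UTF-8 text follows. It is an assumption about a workable procedure, not a record of how the author produced the files, and `encode_with_fallback` and both translation tables are hypothetical names.

```python
# Hypothetical fallbacks, applied only when the target charset
# cannot represent the original characters.
LETTER_FALLBACK = str.maketrans({
    "\u0219": "s", "\u021b": "t",   # ș, ț -> bare letters
    "\u0218": "S", "\u021a": "T",   # Ș, Ț -> bare letters
})
QUOTE_FALLBACK = str.maketrans({
    "\u201e": ",,",                 # „ -> two commas
    "\u201d": '"',                  # ” -> ASCII double quote
})


def encode_with_fallback(text, codec):
    """Encode text, degrading letters first and quotes second if needed."""
    for table in (None, LETTER_FALLBACK, QUOTE_FALLBACK):
        if table is not None:
            text = text.translate(table)
        try:
            return text.encode(codec)
        except UnicodeEncodeError:
            continue
    raise ValueError(f"text cannot be encoded as {codec}")


if __name__ == "__main__":
    line = "precum \u0219i sute de insuli\u021be \u201eDanez\u201d."
    for codec in ("iso8859_16", "cp1250", "cp852"):
        print(codec, encode_with_fallback(line, codec))
```

ISO-8859-16 keeps every character, Windows-1250 loses only the comma-below letters, and IBM852 also loses the typographic quotes.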