mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git, synced 2025-12-06 08:46:40 +08:00
LangModels: Romanian support added.
Encodings: ISO-8859-2, ISO-8859-16, Windows-1250 and IBM852. Test texts from https://ro.wikipedia.org/wiki/Danemarca
parent 0a04177787
commit fbd2efdbe9
@@ -115,6 +115,11 @@ Techniques used by universalchardet are described at http://www.mozilla.org/proj
     * ISO-8859-9
     * ISO-8859-15
     * WINDOWS-1252
+  * Romanian:
+    * ISO-8859-2
+    * ISO-8859-16
+    * Windows-1250
+    * IBM852
   * Russian
     * ISO-8859-5
     * KOI8-R
153
script/BuildLangModelLogs/LangRomanianModel.log
Normal file
@@ -0,0 +1,153 @@
= Logs of language model for Romanian (ro) =

- Generated by BuildLangModel.py
- Started: 2016-09-28 18:53:56.086095
- Maximum depth: 5
- Max number of pages: 100

== Parsed pages ==

The Loving Kind (revision 10166481)
12 ianuarie (revision 10711676)
13 decembrie (revision 9938353)
2007 (revision 10716321)
2008 (revision 10752084)
2009 (revision 10654003)
21 noiembrie (revision 10447643)
25 ianuarie (revision 10228199)
31 ianuarie (revision 10718063)
4 Music (revision 9701591)
Billboard (revision 10505294)
Biology (revision 10112430)
Bulgaria (revision 10481051)
CD (revision 10477531)
Call The Shots (revision 10101027)
Call the Shots (revision 10101027)
Can't Speak French (revision 9721506)
Casă de discuri (revision 10611348)
Channel 4 (revision 7953101)
Chemistry (revision 10112479)
Cheryl Cole (revision 10475016)
Chitară (revision 10468266)
Croația (revision 10737746)
Dance (revision 10231736)
Descărcare digitală (revision 10100743)
Digital Spy (revision 9044016)
Discografia Girls Aloud (revision 10172788)
Estonia (revision 10749810)
Europa (revision 10752724)
Fascination Records (revision 9655292)
Fiona Phillips (revision 5384082)
Gen muzical (revision 10534645)
Girls A Live (revision 10112444)
Girls Aloud (revision 10112446)
Good Morning Television (revision 10166481)
Heat World (revision 10166481)
I'll Stand By You (cântec de Girls Aloud) (revision 10112432)
ITunes (revision 10744174)
I Think We're Alone Now (revision 10112427)
Irlanda (revision 10573806)
Jump (cântec de Girls Aloud) (revision 10112438)
Lady GaGa (revision 10753010)
Life Got Cold (revision 10112437)
Limba engleză (revision 10756676)
Long Hot Summer (revision 10112429)
Love Machine (revision 10112433)
MSN Search (revision 10653298)
MTV (revision 10170766)
Mixed Up (revision 10112443)
Muzică electronică (revision 10608432)
Muzică pop (revision 10740529)
Nadine Coyle (revision 10316187)
Neil Tennant (revision 10499980)
No Good Advice (revision 10112436)
Out Of Control (revision 10112484)
Out of Control (revision 10112484)
Pet Shop Boys (revision 10612741)
Poker Face (revision 10496402)
PopJustice (revision 10625677)
Regatul Unit (revision 10752338)
Regatul Unit al Marii Britanii și Irlandei de Nord (revision 10752338)
Regatul Unit al Marii Britanii și al Irlandei de Nord (revision 10752338)
Republica Irlanda (revision 10573806)
Romanian Top 100 (revision 10736281)
România (revision 10732435)
Sarah Harding (revision 10633651)
Sarah Hearding (revision 10112425)
See the Day (revision 10112431)
Sexy! No No No... (revision 10112425)
Slant Magazine (revision 7697473)
Slovenia (revision 10521499)
Something Kinda Ooooh (revision 10112426)
Sound of the Underground (album) (revision 10112476)
Sound of the Underground (cântec) (revision 10112434)
Tangled Up (revision 10112482)
The Guardian (revision 9752334)
The Paul O'Grady Show (revision 10101027)
The Promise (revision 10166482)
The Show (revision 10112441)
The Sound of Girls Aloud (revision 10112480)
Tonalitate (revision 9966362)
Turneul Out of Control (revision 10112446)
UK Mix (revision 9721468)
UK Singles Chart (revision 10226705)
Ungaria (revision 10737745)
Uniunea Europeană (revision 10751590)
Untouchable (revision 10112410)
Wake Me Up (revision 10112439)
What Will The Neighbours Say? (revision 10112478)
Whole Lotta History (revision 10475020)
Wideboys (revision 10166481)
Wikimedia Commons (revision 9703907)
Xenomania (revision 10112484)

== End of Parsed pages ==

- Wikipedia parsing ended at: 2016-09-28 18:58:13.756622

60 characters appeared 883554 times.

First 33 characters:
[ 0] Char e: 11.67014127036944 %
[ 1] Char i: 10.97567324690964 %
[ 2] Char a: 10.080198833348046 %
[ 3] Char r: 7.490657050955572 %
[ 4] Char n: 7.18246988865423 %
[ 5] Char t: 6.516296683620921 %
[ 6] Char l: 5.595130574928075 %
[ 7] Char u: 5.551217016730161 %
[ 8] Char o: 4.922732509840938 %
[ 9] Char c: 4.495707110148333 %
[10] Char s: 3.8308920563994957 %
[11] Char d: 3.590499279048027 %
[12] Char m: 2.971408651876399 %
[13] Char p: 2.902369294915761 %
[14] Char ă: 2.1349006399156134 %
[15] Char g: 1.2248261000459508 %
[16] Char f: 1.1199089133205216 %
[17] Char b: 1.0781457613230203 %
[18] Char ț: 1.0323081554721047 %
[19] Char ș: 0.9732285745975912 %
[20] Char î: 0.97017273420753 %
[21] Char v: 0.9693804792915882 %
[22] Char z: 0.7369102510995367 %
[23] Char h: 0.533413916976212 %
[24] Char â: 0.4986678799484808 %
[25] Char x: 0.22081276300033725 %
[26] Char j: 0.20055367300696958 %
[27] Char k: 0.1901411798260208 %
[28] Char y: 0.15471606715605385 %
[29] Char w: 0.11827234102273318 %
[30] Char á: 0.016297815413658927 %
[31] Char é: 0.013355154297303842 %
[32] Char q: 0.00520624659047438 %

The first 33 characters have an accumulated ratio of 0.9996661211425673.

981 sequences found.

First 512 (typical positive ratio): 0.997762564143313
Next 512 (512-1024): 1.1317927370596478e-06
Rest: 3.0357660829594124e-18

- Processing end: 2016-09-28 18:58:13.862425
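
The log above reports a per-character frequency ranking and an accumulated ratio for the most frequent characters. A minimal sketch of how such figures could be computed from plain text follows. It is not the code of BuildLangModel.py: the function name `char_frequencies` and the choice to count only alphabetic characters after lower-casing are assumptions made for illustration.

```python
from collections import Counter


def char_frequencies(text, top_n):
    """Return (ranked, accumulated) for the alphabetic characters of `text`.

    ranked is a list of (char, ratio) pairs sorted by decreasing frequency;
    accumulated is the summed ratio of the first `top_n` entries.
    """
    # Fold case so that 'E' and 'e' are counted as one character
    # (an assumption that mirrors the lower-case listing in the log).
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    # Sort by ratio (descending), then by character for a stable order.
    ranked = sorted(((ch, n / total) for ch, n in counts.items()),
                    key=lambda item: (-item[1], item[0]))
    accumulated = sum(ratio for _, ratio in ranked[:top_n])
    return ranked, accumulated


sample = "Danemarca este un stat suveran din Europa de Nord"
ranked, accumulated = char_frequencies(sample, top_n=5)
for order, (ch, ratio) in enumerate(ranked[:5]):
    print(f"[{order:2}] Char {ch}: {ratio * 100} %")
print(f"The first 5 characters have an accumulated ratio of {accumulated}.")
```

With a corpus as large as the one in the log (883554 characters), the ranking index printed in brackets is what the C++ tables later call the "order" of a character.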
65
script/langs/ro.py
Normal file
@@ -0,0 +1,65 @@
#!/bin/python3
# -*- coding: utf-8 -*-

# ##### BEGIN LICENSE BLOCK #####
# Version: MPL 1.1/GPL 2.0/LGPL 2.1
#
# The contents of this file are subject to the Mozilla Public License Version
# 1.1 (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
# http://www.mozilla.org/MPL/
#
# Software distributed under the License is distributed on an "AS IS" basis,
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
# for the specific language governing rights and limitations under the
# License.
#
# The Original Code is Mozilla Universal charset detector code.
#
# The Initial Developer of the Original Code is
# Netscape Communications Corporation.
# Portions created by the Initial Developer are Copyright (C) 2001
# the Initial Developer. All Rights Reserved.
#
# Contributor(s):
# Jehan <jehan@girinstud.io>
#
# Alternatively, the contents of this file may be used under the terms of
# either the GNU General Public License Version 2 or later (the "GPL"), or
# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
# in which case the provisions of the GPL or the LGPL are applicable instead
# of those above. If you wish to allow use of your version of this file only
# under the terms of either the GPL or the LGPL, and not to allow others to
# use your version of this file under the terms of the MPL, indicate your
# decision by deleting the provisions above and replace them with the notice
# and other provisions required by the GPL or the LGPL. If you do not delete
# the provisions above, a recipient may use your version of this file under
# the terms of any one of the MPL, the GPL or the LGPL.
#
# ##### END LICENSE BLOCK #####

import re

## Mandatory Properties ##

name = 'Romanian'
code = 'ro'
use_ascii = True
charsets = ['ISO-8859-2', 'ISO-8859-16',
            'Windows-1250', 'IBM852']

## Optional Properties ##

# Alphabet characters.
# Note: Wikipedia explains that s and t with cedilla (şţ), or even
# bare s and t, were often used in place of s and t with comma (șț)
# because of missing characters in most common encoding at the time.
# It may be worth adding some common_replacement_letters logics in
# the training and models.
# https://en.wikipedia.org/wiki/Romanian_alphabet#ISO_8859
alphabet = 'ăâîșț'
# The starred page which was rewarded on the main page when I created
# the data.
start_pages = ['The Loving Kind']
wikipedia_code = code
case_mapping = True
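
The comment in `ro.py` notes that s and t with cedilla were often used in place of s and t with comma below, and suggests that some replacement logic might be worth adding. The sketch below shows one way such a replacement could look on the text side. It is purely illustrative: `CEDILLA_TO_COMMA` and `normalize_romanian` are hypothetical names, and nothing in this commit implements this step.

```python
# Hypothetical mapping from the cedilla forms to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalize_romanian(text):
    """Replace cedilla variants with the comma-below letters of Romanian."""
    return text.translate(CEDILLA_TO_COMMA)


# The input uses the cedilla forms; the output uses the comma-below forms.
print(normalize_romanian("\u0163\u0103ri \u015fi insuli\u0163e"))
```

Bare `s` and `t` substitutions cannot be undone this way, since they are indistinguishable from genuine `s` and `t`.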
@@ -27,6 +27,7 @@ set(
     LangModels/LangMalteseModel.cpp
     LangModels/LangPolishModel.cpp
     LangModels/LangPortugueseModel.cpp
+    LangModels/LangRomanianModel.cpp
     LangModels/LangRussianModel.cpp
     LangModels/LangSlovakModel.cpp
     LangModels/LangSpanishModel.cpp
232
src/LangModels/LangRomanianModel.cpp
Normal file
@@ -0,0 +1,232 @@
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* ***** BEGIN LICENSE BLOCK *****
 * Version: MPL 1.1/GPL 2.0/LGPL 2.1
 *
 * The contents of this file are subject to the Mozilla Public License Version
 * 1.1 (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 * http://www.mozilla.org/MPL/
 *
 * Software distributed under the License is distributed on an "AS IS" basis,
 * WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
 * for the specific language governing rights and limitations under the
 * License.
 *
 * The Original Code is Mozilla Communicator client code.
 *
 * The Initial Developer of the Original Code is
 * Netscape Communications Corporation.
 * Portions created by the Initial Developer are Copyright (C) 1998
 * the Initial Developer. All Rights Reserved.
 *
 * Contributor(s):
 *
 * Alternatively, the contents of this file may be used under the terms of
 * either the GNU General Public License Version 2 or later (the "GPL"), or
 * the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
 * in which case the provisions of the GPL or the LGPL are applicable instead
 * of those above. If you wish to allow use of your version of this file only
 * under the terms of either the GPL or the LGPL, and not to allow others to
 * use your version of this file under the terms of the MPL, indicate your
 * decision by deleting the provisions above and replace them with the notice
 * and other provisions required by the GPL or the LGPL. If you do not delete
 * the provisions above, a recipient may use your version of this file under
 * the terms of any one of the MPL, the GPL or the LGPL.
 *
 * ***** END LICENSE BLOCK ***** */

#include "../nsSBCharSetProber.h"

/********* Language model for: Romanian *********/

/**
 * Generated by BuildLangModel.py
 * On: 2016-09-28 18:58:13.757152
 **/

/* Character Mapping Table:
 * ILL: illegal character.
 * CTR: control character specific to the charset.
 * RET: carriage/return.
 * SYM: symbol (punctuation) that does not belong to word.
 * NUM: 0 - 9.
 *
 * Other characters are ordered by probabilities
 * (0 is the most common character in the language).
 *
 * Orders are generic to a language. So the codepoint with order X in
 * CHARSET1 maps to the same character as the codepoint with the same
 * order X in CHARSET2 for the same language.
 * As such, it is possible to get missing order. For instance the
 * ligature of 'o' and 'e' exists in ISO-8859-15 but not in ISO-8859-1
 * even though they are both used for French. Same for the euro sign.
 */
static const unsigned char Iso_8859_16_CharToOrderMap[] =
{
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */
  SYM, 60, 61, 46,SYM,SYM, 38,SYM, 38,SYM, 19,SYM, 62,SYM, 63, 64, /* AX */
  SYM,SYM, 41, 46, 40,SYM,SYM,SYM, 40, 41, 19,SYM, 65, 66, 67, 68, /* BX */
  69, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 70, 37, 20, 48, /* CX */
  43, 52, 59, 34, 71, 44, 36, 56, 50, 72, 47, 73, 39, 74, 18, 57, /* DX */
  75, 30, 24, 14, 33, 35, 53, 42, 45, 31, 58, 49, 76, 37, 20, 48, /* EX */
  43, 52, 59, 34, 77, 44, 36, 56, 50, 78, 47, 79, 39, 80, 18, 81, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */

static const unsigned char Iso_8859_2_CharToOrderMap[] =
{
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */
  SYM, 82,SYM, 46,SYM, 83, 56,SYM,SYM, 38, 84, 85, 86,SYM, 40, 87, /* AX */
  SYM, 88,SYM, 46,SYM, 89, 56,SYM,SYM, 38, 90, 91, 92,SYM, 40, 93, /* BX */
  94, 30, 24, 14, 33, 95, 35, 42, 41, 31, 96, 49, 51, 37, 20, 97, /* CX */
  43, 52, 98, 34, 99, 44, 36,SYM, 55,100, 47, 50, 39, 54,101, 57, /* DX */
  102, 30, 24, 14, 33,103, 35, 42, 41, 31,104, 49, 51, 37, 20,105, /* EX */
  43, 52,106, 34,107, 44, 36,SYM, 55,108, 47, 50, 39, 54,109,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */

static const unsigned char Windows_1250_CharToOrderMap[] =
{
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
  SYM,ILL,SYM,ILL,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,110, 40,111, /* 8X */
  ILL,SYM,SYM,SYM,SYM,SYM,SYM,SYM,ILL,SYM, 38,SYM, 56,112, 40,113, /* 9X */
  SYM,SYM,SYM, 46,SYM,114,SYM,SYM,SYM,SYM,115,SYM,SYM,SYM,SYM,116, /* AX */
  SYM,SYM,SYM, 46,SYM,SYM,SYM,SYM,SYM,117,118,SYM,119,SYM,120,121, /* BX */
  122, 30, 24, 14, 33,123, 35, 42, 41, 31,124, 49, 51, 37, 20,125, /* CX */
  43, 52,126, 34,127, 44, 36,SYM, 55,128, 47, 50, 39, 54,129, 57, /* DX */
  130, 30, 24, 14, 33,131, 35, 42, 41, 31,132, 49, 51, 37, 20,133, /* EX */
  43, 52,134, 34,135, 44, 36,SYM, 55,136, 47, 50, 39, 54,137,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */

static const unsigned char Ibm852_CharToOrderMap[] =
{
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
  CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
  NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 4X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,SYM, /* 5X */
  SYM, 2, 17, 9, 11, 0, 16, 15, 23, 1, 26, 27, 6, 12, 4, 8, /* 6X */
  13, 32, 3, 10, 5, 7, 21, 29, 25, 28, 22,SYM,SYM,SYM,SYM,CTR, /* 7X */
  42, 39, 31, 24, 33,138, 35, 42, 46, 49, 44, 44, 20,139, 33, 35, /* 8X */
  31,140,141,142, 36,143,144, 56, 56, 36, 39,145,146, 46,SYM, 41, /* 9X */
  30, 37, 34, 47,147,148, 40, 40,149,150,SYM,151, 41,152,SYM,SYM, /* AX */
  SYM,SYM,SYM,SYM,SYM, 30, 24, 51,153,SYM,SYM,SYM,SYM,154,155,SYM, /* BX */
  SYM,SYM,SYM,SYM,SYM,SYM, 14, 14,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* CX */
  43, 43,156, 49,157,158, 37, 20, 51,SYM,SYM,SYM,SYM,159,160,SYM, /* DX */
  34, 57,161, 52, 52,162, 38, 38,163, 47,164, 50, 54, 54,165,SYM, /* EX */
  SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, 50, 55, 55,SYM,SYM, /* FX */
};
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */


/* Model Table:
 * Total sequences: 981
 * First 512 sequences: 0.997762564143313
 * Next 512 sequences (512-1024): 0.002237435856687006
 * Rest: 3.0357660829594124e-18
 * Negative sequences: TODO
 */
static const PRUint8 RomanianLangModel[] =
{
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,2,0,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,2,3,3,3,2,2,0,0,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,3,3,3,3,0,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,2,2,3,3,2,2,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,0,3,3,3,3,2,3,3,3,3,2,2,2,
  3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,2,3,3,0,2,2,3,3,3,3,0,2,2,3,3,2,3,0,
  3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,2,2,3,0,3,3,3,2,2,2,0,
  3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,3,3,2,2,0,2,0,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,3,3,3,3,0,3,3,3,0,3,2,3,3,3,2,0,2,
  3,3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,2,2,3,2,0,3,2,3,3,0,3,3,2,2,0,2,2,
  3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,0,2,2,3,3,3,0,2,3,3,3,2,2,2,
  3,3,3,3,3,2,3,3,3,2,3,3,3,2,3,3,2,3,0,0,0,3,2,3,3,0,2,2,3,3,3,2,0,
  3,3,3,2,3,3,3,3,3,3,3,2,3,3,3,2,2,3,3,2,2,2,2,3,3,2,0,0,3,2,2,2,0,
  3,3,3,3,3,3,3,3,3,3,3,2,2,3,3,0,0,2,3,0,2,0,2,3,3,0,2,2,3,0,2,2,0,
  2,3,0,3,3,3,3,3,0,3,3,3,3,3,0,3,0,3,3,3,0,3,3,0,0,0,2,2,0,0,0,0,0,
  3,3,3,3,3,2,3,3,3,0,2,3,3,2,3,3,2,3,0,0,2,3,2,3,3,0,2,0,3,2,2,2,0,
  3,3,3,3,0,3,3,3,3,2,2,2,3,2,3,2,3,0,0,0,0,0,0,2,3,0,0,0,2,0,2,2,0,
  3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,2,2,3,3,2,0,2,2,2,3,0,2,2,3,2,2,2,0,
  3,3,3,0,0,0,0,3,2,2,2,0,0,0,3,0,0,0,0,0,2,2,0,0,2,0,0,2,0,0,0,0,0,
  3,3,3,0,3,3,3,3,3,3,0,2,2,0,3,0,0,0,0,0,0,2,0,0,2,0,0,2,0,0,0,0,0,
  0,3,0,2,3,0,3,0,0,0,0,0,3,0,0,0,0,0,2,3,0,0,2,2,0,0,0,2,0,0,0,0,0,
  3,3,3,3,3,2,3,3,3,2,2,3,2,0,3,2,2,2,0,0,0,0,0,0,3,0,2,2,2,0,2,0,0,
  3,3,3,2,2,2,2,3,3,0,2,3,2,2,3,2,0,3,0,0,0,3,3,2,3,0,0,2,2,0,2,2,0,
  3,3,3,3,3,3,3,3,3,2,3,2,2,2,3,0,2,3,0,0,0,2,2,0,2,0,2,2,3,2,2,2,0,
  0,3,0,3,3,3,3,3,0,2,2,2,3,0,0,0,0,0,2,3,0,0,0,0,0,0,0,0,0,0,0,0,0,
  3,3,3,2,0,3,0,3,3,3,2,0,0,3,3,0,3,0,0,0,0,3,0,2,2,3,0,0,3,0,0,0,0,
  3,3,3,2,2,2,3,3,3,0,2,2,2,0,2,0,0,2,0,0,0,2,0,0,2,0,0,2,0,0,2,0,0,
  3,3,3,3,2,3,3,3,3,2,3,2,3,2,2,2,2,2,2,2,2,2,0,3,0,0,0,2,3,2,2,2,0,
  3,2,3,3,3,2,3,2,3,3,3,3,3,2,0,2,0,2,0,0,0,2,2,2,0,0,2,2,0,2,2,0,0,
  3,3,3,2,3,2,2,2,3,2,3,2,2,2,0,0,2,2,0,0,0,0,0,3,0,0,0,0,2,3,0,0,0,
  2,3,0,3,3,2,2,0,0,2,2,2,2,0,0,2,0,0,0,0,0,0,2,0,0,0,0,2,0,0,0,0,0,
  0,3,2,2,2,2,2,0,0,2,2,2,2,2,0,2,0,2,0,0,0,2,2,0,0,0,2,2,0,0,0,0,0,
  0,0,2,0,0,2,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,
};


const SequenceModel Iso_8859_16RomanianModel =
{
  Iso_8859_16_CharToOrderMap,
  RomanianLangModel,
  33,
  (float)0.997762564143313,
  PR_TRUE,
  "ISO-8859-16"
};

const SequenceModel Iso_8859_2RomanianModel =
{
  Iso_8859_2_CharToOrderMap,
  RomanianLangModel,
  33,
  (float)0.997762564143313,
  PR_TRUE,
  "ISO-8859-2"
};

const SequenceModel Windows_1250RomanianModel =
{
  Windows_1250_CharToOrderMap,
  RomanianLangModel,
  33,
  (float)0.997762564143313,
  PR_TRUE,
  "WINDOWS-1250"
};

const SequenceModel Ibm852RomanianModel =
{
  Ibm852_CharToOrderMap,
  RomanianLangModel,
  33,
  (float)0.997762564143313,
  PR_TRUE,
  "IBM852"
};
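
The "Character Mapping Table" comment in this file states that orders are generic to a language: a byte in one charset and a byte in another charset share an order when they encode the same character. The sketch below illustrates that idea with Python's built-in codecs. It is not how BuildLangModel.py generates the tables; `ORDERED_CHARS` is copied from the "First 33 characters" ranking in the log, and the helper name `char_to_order` is hypothetical. It only covers the 33 frequent characters, so every other byte yields `None` here, whereas the real tables assign SYM, NUM, CTR, ILL or a higher order.

```python
# The 33 most frequent characters, in the order listed by the log
# (index 0 is 'e', index 14 is a-breve, 18 and 19 are t/s with comma below).
ORDERED_CHARS = (
    "eiarntluocsdmp\u0103gfb\u021b\u0219\u00eevzh\u00e2xjkyw\u00e1\u00e9q"
)


def char_to_order(byte_value, codec):
    """Return the frequency order of one byte in `codec`, or None."""
    try:
        char = bytes([byte_value]).decode(codec)
    except UnicodeDecodeError:
        return None  # byte is unassigned in this charset
    # Upper and lower case share one order (case_mapping = True in ro.py).
    index = ORDERED_CHARS.find(char.lower())
    return index if index >= 0 else None


# The same letter (a-breve) sits at different bytes but has the same order.
for codec, byte_value in (("iso8859_2", 0xE3), ("cp1250", 0xE3),
                          ("iso8859_16", 0xE3), ("cp852", 0xC7)):
    print(codec, hex(byte_value), char_to_order(byte_value, codec))
```

In this sketch, byte 0xBA gets an order under ISO-8859-16 (s with comma below) but not under ISO-8859-2, where the same byte is s with cedilla. That is consistent with the tables above, which give that byte order 19 in `Iso_8859_16_CharToOrderMap` and order 90 in `Iso_8859_2_CharToOrderMap`.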
@@ -174,6 +174,11 @@ nsSBCSGroupProber::nsSBCSGroupProber()
   mProbers[83] = new nsSingleByteCharSetProber(&Iso_8859_15IrishModel);
   mProbers[84] = new nsSingleByteCharSetProber(&Windows_1252IrishModel);

+  mProbers[85] = new nsSingleByteCharSetProber(&Windows_1250RomanianModel);
+  mProbers[86] = new nsSingleByteCharSetProber(&Iso_8859_2RomanianModel);
+  mProbers[87] = new nsSingleByteCharSetProber(&Iso_8859_16RomanianModel);
+  mProbers[88] = new nsSingleByteCharSetProber(&Ibm852RomanianModel);
+
   Reset();
 }

@@ -40,7 +40,7 @@
 #define nsSBCSGroupProber_h__


-#define NUM_OF_SBCS_PROBERS 85
+#define NUM_OF_SBCS_PROBERS 89

 class nsCharSetProber;
 class nsSBCSGroupProber: public nsCharSetProber {
@@ -235,5 +235,10 @@ extern const SequenceModel Iso_8859_9IrishModel;
 extern const SequenceModel Iso_8859_1IrishModel;
 extern const SequenceModel Windows_1252IrishModel;

+extern const SequenceModel Windows_1250RomanianModel;
+extern const SequenceModel Iso_8859_2RomanianModel;
+extern const SequenceModel Iso_8859_16RomanianModel;
+extern const SequenceModel Ibm852RomanianModel;
+
 #endif /* nsSingleByteCharSetProber_h__ */

9
test/ro/ibm852.txt
Normal file
@@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având si două tări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
Bornholm, precum si sute de insulite denumite în general ,,Arhipelagul Danez".
9
test/ro/iso-8859-16.txt
Normal file
@@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având și două țări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.
9
test/ro/utf-8.txt
Normal file
@@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având și două țări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord și Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre țările
nordice, aflată la sud-vest de Suedia și la sud de Norvegia, învecinându-se la
sud cu Germania. Țara constă dintr-o peninsulă mare, Iutlanda, și mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster și
Bornholm, precum și sute de insulițe denumite în general „Arhipelagul Danez”.
9
test/ro/windows-1250.txt
Normal file
@@ -0,0 +1,9 @@
Danemarca (în daneză Sunet Danmark), oficial Regatul Danemarcei (în
daneză Sunet Kongeriget Danmark), este un stat suveran din
Europa de Nord, având si două tări constituente de peste mări, care fac parte
integrantă din regat: Insulele Feroe în Atlanticul de Nord si Groenlanda în
America de Nord. Danemarca propriu-zisă[a] este cea mai de sud dintre tările
nordice, aflată la sud-vest de Suedia si la sud de Norvegia, învecinându-se la
sud cu Germania. Tara constă dintr-o peninsulă mare, Iutlanda, si mai multe
insule, dintre care cele mai mari sunt Zealand, Funen, Lolland, Falster si
Bornholm, precum si sute de insulite denumite în general „Arhipelagul Danez”.
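
The Windows-1250 and IBM852 test texts use bare `s` and `t` where the UTF-8 text has the comma-below letters. A small check with Python's built-in codecs shows why a direct conversion is not possible. This is only an illustration: the helper name `unencodable` is hypothetical, and how the test files in this commit were actually produced is not stated in the source.

```python
def unencodable(text, codec):
    """Return the sorted list of characters in `text` that `codec` lacks."""
    missing = set()
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


# A fragment of the UTF-8 test text, written with escapes to be explicit
# about the comma-below letters and the typographic quotes.
sample = ("av\u00e2nd \u0219i dou\u0103 \u021b\u0103ri, "
          "\u201eArhipelagul Danez\u201d")
for codec in ("iso8859_16", "iso8859_2", "cp1250", "cp852"):
    print(codec, unencodable(sample, codec))
```

Among the four charsets of this commit, only ISO-8859-16 can represent every character of the sample, which matches the note in `ro.py` about characters missing from the older encodings.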