mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-06 16:56:40 +08:00
LangModels: adding German models for ISO-8859-1 and Windows-1252.
This commit is contained in:
parent
90728e4068
commit
aa587a64bd
159
script/BuildLangModelLogs/LangGermanModel.log
Normal file
159
script/BuildLangModelLogs/LangGermanModel.log
Normal file
@ -0,0 +1,159 @@
|
|||||||
|
= Logs of language model for German (de) =
|
||||||
|
|
||||||
|
- Generated by BuildLangModel.py
|
||||||
|
- Started: 2015-12-03 22:42:29.154759
|
||||||
|
- Maximum depth: 3
|
||||||
|
- Max number of pages: 100
|
||||||
|
|
||||||
|
== Parsed pages ==
|
||||||
|
|
||||||
|
Wikipedia:Hauptseite (revision 140459035)
|
||||||
|
1740 (revision 145584733)
|
||||||
|
1890 (revision 148575121)
|
||||||
|
1925 (revision 148682812)
|
||||||
|
1965 (revision 148411693)
|
||||||
|
3. Dezember (revision 148684818)
|
||||||
|
Bundeswehreinsatz in Syrien (revision 148714599)
|
||||||
|
Clara Klabunde (revision 148697193)
|
||||||
|
Day Tripper (revision 145956669)
|
||||||
|
Dezember 2015 (revision 148713161)
|
||||||
|
Edwar al-Charrat (revision 148656295)
|
||||||
|
Enzyklika (revision 148704406)
|
||||||
|
Enzyklopädie (revision 148364925)
|
||||||
|
Facebook Inc. (revision 148280344)
|
||||||
|
Franz Neubauer (CSU) (revision 148710968)
|
||||||
|
Freie Inhalte (revision 148123311)
|
||||||
|
Gabriele Ferzetti (revision 148715582)
|
||||||
|
Georg von Waldburg zu Zeil und Trauchburg (revision 148710609)
|
||||||
|
Jim Loscutoff (revision 148690370)
|
||||||
|
Katarina Witt (revision 148713884)
|
||||||
|
Klavierkonzert (Gershwin) (revision 143900338)
|
||||||
|
Ludolf Camphausen (revision 145088962)
|
||||||
|
Mark Zuckerberg (revision 148714452)
|
||||||
|
Montenegro (revision 148692773)
|
||||||
|
NATO (revision 148697872)
|
||||||
|
NATO-Osterweiterung (revision 148697354)
|
||||||
|
Nekrolog 2015 (revision 148711617)
|
||||||
|
Peter-Ulrich-Haus (revision 148654149)
|
||||||
|
Philanthropie (revision 145561255)
|
||||||
|
Präsidentschaftswahl in Burkina Faso 2015 (revision 148677453)
|
||||||
|
Québec (Stadt) (revision 148716893)
|
||||||
|
Rivka Zohar (revision 148708850)
|
||||||
|
Roch Marc Kaboré (revision 148673951)
|
||||||
|
Rubber Soul (revision 148665720)
|
||||||
|
Salve Regina (Latry) (revision 148713279)
|
||||||
|
Schießerei in San Bernardino (revision 148711974)
|
||||||
|
Single (Musik) (revision 146450210)
|
||||||
|
The Giving Pledge (revision 148711856)
|
||||||
|
Ubi primum (Benedikt XIV.) (revision 136691297)
|
||||||
|
VTech (revision 148704025)
|
||||||
|
Walter Damrosch (revision 148716127)
|
||||||
|
We Can Work It Out (revision 148706519)
|
||||||
|
1. August (revision 148089156)
|
||||||
|
1. Januar (revision 148659041)
|
||||||
|
1. Juni (revision 148375663)
|
||||||
|
1. November (revision 147888516)
|
||||||
|
10. August (revision 148079904)
|
||||||
|
10. November (revision 148658709)
|
||||||
|
10. September (revision 148201788)
|
||||||
|
11. August (revision 148315737)
|
||||||
|
11. Oktober (revision 148087353)
|
||||||
|
12. Januar (revision 147377586)
|
||||||
|
12. September (revision 148359994)
|
||||||
|
13. Dezember (revision 148614781)
|
||||||
|
13. September (revision 148320520)
|
||||||
|
14. August (revision 148513270)
|
||||||
|
14. Dezember (revision 147968142)
|
||||||
|
15. April (revision 146544147)
|
||||||
|
15. August (revision 147827975)
|
||||||
|
16. April (revision 148712866)
|
||||||
|
16. Dezember (revision 148392316)
|
||||||
|
16. Februar (revision 148221712)
|
||||||
|
16. Jahrhundert (revision 147390194)
|
||||||
|
16. Juli (revision 147928181)
|
||||||
|
1652 (revision 142931287)
|
||||||
|
1654 (revision 145531451)
|
||||||
|
1656 (revision 144194148)
|
||||||
|
1657 (revision 147492859)
|
||||||
|
1662 (revision 147548355)
|
||||||
|
1665 (revision 147757128)
|
||||||
|
1666 (revision 147843417)
|
||||||
|
1667 (revision 148566099)
|
||||||
|
1668 (revision 145304760)
|
||||||
|
1670 (revision 147643990)
|
||||||
|
1672 (revision 145296252)
|
||||||
|
1673 (revision 147879655)
|
||||||
|
1674 (revision 146784434)
|
||||||
|
1679 (revision 146069377)
|
||||||
|
1685 (revision 148596629)
|
||||||
|
1688 (revision 140370621)
|
||||||
|
1692 (revision 146892539)
|
||||||
|
1693 (revision 147464373)
|
||||||
|
17. August (revision 148288443)
|
||||||
|
17. Februar (revision 145814425)
|
||||||
|
17. Jahrhundert (revision 147869798)
|
||||||
|
17. Oktober (revision 148327370)
|
||||||
|
1700er (revision 127393249)
|
||||||
|
1707 (revision 148288721)
|
||||||
|
1710er (revision 134739897)
|
||||||
|
1720er (revision 127302296)
|
||||||
|
1730 (revision 148694277)
|
||||||
|
1730er (revision 127393280)
|
||||||
|
1731 (revision 147730204)
|
||||||
|
1735 (revision 145436596)
|
||||||
|
1736 (revision 145680122)
|
||||||
|
1737 (revision 146645905)
|
||||||
|
1738 (revision 145094942)
|
||||||
|
1739 (revision 147843445)
|
||||||
|
1740er (revision 127393296)
|
||||||
|
1741 (revision 146530178)
|
||||||
|
1742 (revision 147010984)
|
||||||
|
|
||||||
|
== End of Parsed pages ==
|
||||||
|
|
||||||
|
- Wikipedia parsing ended at: 2015-12-03 22:50:46.517106
|
||||||
|
|
||||||
|
59 characters appeared 1746165 times.
|
||||||
|
|
||||||
|
First 31 characters:
|
||||||
|
[ 0] Char e: 14.27997926885489 %
|
||||||
|
[ 1] Char r: 8.696257226550754 %
|
||||||
|
[ 2] Char n: 8.464091308667852 %
|
||||||
|
[ 3] Char i: 8.258784250056554 %
|
||||||
|
[ 4] Char s: 6.690833913175444 %
|
||||||
|
[ 5] Char a: 6.370703799469123 %
|
||||||
|
[ 6] Char t: 5.925728668253001 %
|
||||||
|
[ 7] Char h: 4.540979804314025 %
|
||||||
|
[ 8] Char d: 4.367284878576767 %
|
||||||
|
[ 9] Char l: 4.083634708060234 %
|
||||||
|
[10] Char u: 3.899917819908199 %
|
||||||
|
[11] Char o: 3.6450163644329145 %
|
||||||
|
[12] Char c: 3.392405643223865 %
|
||||||
|
[13] Char m: 2.578565026787274 %
|
||||||
|
[14] Char g: 2.543631329227192 %
|
||||||
|
[15] Char b: 1.9455206123132693 %
|
||||||
|
[16] Char k: 1.7604292836014925 %
|
||||||
|
[17] Char f: 1.6422273954637734 %
|
||||||
|
[18] Char p: 1.519329502080273 %
|
||||||
|
[19] Char w: 1.0273370500496803 %
|
||||||
|
[20] Char z: 1.0037997554641171 %
|
||||||
|
[21] Char v: 0.9010603236234834 %
|
||||||
|
[22] Char ä: 0.4926224039538073 %
|
||||||
|
[23] Char j: 0.4661644231787947 %
|
||||||
|
[24] Char ü: 0.4094687500894818 %
|
||||||
|
[25] Char y: 0.34229296773214446 %
|
||||||
|
[26] Char ö: 0.3044958523392692 %
|
||||||
|
[27] Char ß: 0.14477440562604335 %
|
||||||
|
[28] Char x: 0.09918879372796958 %
|
||||||
|
[29] Char é: 0.07633871942227682 %
|
||||||
|
[30] Char q: 0.06099079983850323 %
|
||||||
|
|
||||||
|
The first 31 characters have an accumulated ratio of 0.9993385504806246.
|
||||||
|
|
||||||
|
1188 sequences found.
|
||||||
|
|
||||||
|
First 512 (typical positive ratio): 0.9934041448127945
|
||||||
|
Next 512 (512-1024): 1.1453671331174316e-06
|
||||||
|
Rest: 0.0001130256702826099
|
||||||
|
|
||||||
|
- Processing end: 2015-12-03 22:50:46.681265
|
||||||
78
script/langs/de.py
Normal file
78
script/langs/de.py
Normal file
@ -0,0 +1,78 @@
|
|||||||
|
#!/bin/python3
|
||||||
|
# -*- coding: utf-8 -*-
|
||||||
|
|
||||||
|
# ##### BEGIN LICENSE BLOCK #####
|
||||||
|
# Version: MPL 1.1/GPL 2.0/LGPL 2.1
|
||||||
|
#
|
||||||
|
# The contents of this file are subject to the Mozilla Public License Version
|
||||||
|
# 1.1 (the "License"); you may not use this file except in compliance with
|
||||||
|
# the License. You may obtain a copy of the License at
|
||||||
|
# http://www.mozilla.org/MPL/
|
||||||
|
#
|
||||||
|
# Software distributed under the License is distributed on an "AS IS" basis,
|
||||||
|
# WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
|
||||||
|
# for the specific language governing rights and limitations under the
|
||||||
|
# License.
|
||||||
|
#
|
||||||
|
# The Original Code is Mozilla Universal charset detector code.
|
||||||
|
#
|
||||||
|
# The Initial Developer of the Original Code is
|
||||||
|
# Netscape Communications Corporation.
|
||||||
|
# Portions created by the Initial Developer are Copyright (C) 2001
|
||||||
|
# the Initial Developer. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Contributor(s):
|
||||||
|
# Jehan <jehan@girinstud.io>
|
||||||
|
#
|
||||||
|
# Alternatively, the contents of this file may be used under the terms of
|
||||||
|
# either the GNU General Public License Version 2 or later (the "GPL"), or
|
||||||
|
# the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
|
||||||
|
# in which case the provisions of the GPL or the LGPL are applicable instead
|
||||||
|
# of those above. If you wish to allow use of your version of this file only
|
||||||
|
# under the terms of either the GPL or the LGPL, and not to allow others to
|
||||||
|
# use your version of this file under the terms of the MPL, indicate your
|
||||||
|
# decision by deleting the provisions above and replace them with the notice
|
||||||
|
# and other provisions required by the GPL or the LGPL. If you do not delete
|
||||||
|
# the provisions above, a recipient may use your version of this file under
|
||||||
|
# the terms of any one of the MPL, the GPL or the LGPL.
|
||||||
|
#
|
||||||
|
# ##### END LICENSE BLOCK #####
|
||||||
|
|
||||||
|
import re
|
||||||
|
|
||||||
|
## Mandatory Properties ##
|
||||||
|
|
||||||
|
# The human name for the language, in English.
|
||||||
|
name = 'German'
|
||||||
|
# Use 2-letter ISO 639-1 if possible, 3-letter ISO code otherwise,
|
||||||
|
# or use another catalog as a last resort.
|
||||||
|
code = 'de'
|
||||||
|
# ASCII characters are also used in French.
|
||||||
|
use_ascii = True
|
||||||
|
# The charsets we want to support and create data for.
|
||||||
|
charsets = ['ISO-8859-1', 'WINDOWS-1252']
|
||||||
|
|
||||||
|
## Optional Properties ##
|
||||||
|
|
||||||
|
# Alphabet characters.
|
||||||
|
# If use_ascii=True, there is no need to add any ASCII characters.
|
||||||
|
# If case_mapping=True, there is no need to add several cases of a same
|
||||||
|
# character (provided Python algorithms know the right cases).
|
||||||
|
alphabet = ['ä', 'ö', 'ü', 'ß']
|
||||||
|
# The start page. Though optional, it is advised to choose one yourself.
|
||||||
|
start_pages = ['Wikipedia:Hauptseite']
|
||||||
|
# give possibility to select another code for the Wikipedia URL.
|
||||||
|
wikipedia_code = code
|
||||||
|
# 'a' and 'A' will be considered the same character, and so on.
|
||||||
|
# This uses Python algorithm to determine upper/lower-case of a given
|
||||||
|
# character.
|
||||||
|
case_mapping = True
|
||||||
|
|
||||||
|
# A function to clean content returned by the `wikipedia` python lib,
|
||||||
|
# in case some unwanted data has been overlooked.
|
||||||
|
def clean_wikipedia_content(content):
|
||||||
|
# Get rid of title syntax: "=== Articles connexes ==="
|
||||||
|
cleaned = re.sub(r'(=+) *([^=]+) *\1',
|
||||||
|
r'\2',
|
||||||
|
content)
|
||||||
|
return cleaned
|
||||||
@ -11,6 +11,7 @@ set(
|
|||||||
LangModels/LangBulgarianModel.cpp
|
LangModels/LangBulgarianModel.cpp
|
||||||
LangModels/LangCyrillicModel.cpp
|
LangModels/LangCyrillicModel.cpp
|
||||||
LangModels/LangFrenchModel.cpp
|
LangModels/LangFrenchModel.cpp
|
||||||
|
LangModels/LangGermanModel.cpp
|
||||||
LangModels/LangGreekModel.cpp
|
LangModels/LangGreekModel.cpp
|
||||||
LangModels/LangHungarianModel.cpp
|
LangModels/LangHungarianModel.cpp
|
||||||
LangModels/LangHebrewModel.cpp
|
LangModels/LangHebrewModel.cpp
|
||||||
|
|||||||
168
src/LangModels/LangGermanModel.cpp
Normal file
168
src/LangModels/LangGermanModel.cpp
Normal file
@ -0,0 +1,168 @@
|
|||||||
|
/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
|
||||||
|
/* ***** BEGIN LICENSE BLOCK *****
|
||||||
|
* Version: MPL 1.1/GPL 2.0/LGPL 2.1
|
||||||
|
*
|
||||||
|
* The contents of this file are subject to the Mozilla Public License Version
|
||||||
|
* 1.1 (the "License"); you may not use this file except in compliance with
|
||||||
|
* the License. You may obtain a copy of the License at
|
||||||
|
* http://www.mozilla.org/MPL/
|
||||||
|
*
|
||||||
|
* Software distributed under the License is distributed on an "AS IS" basis,
|
||||||
|
* WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
|
||||||
|
* for the specific language governing rights and limitations under the
|
||||||
|
* License.
|
||||||
|
*
|
||||||
|
* The Original Code is Mozilla Communicator client code.
|
||||||
|
*
|
||||||
|
* The Initial Developer of the Original Code is
|
||||||
|
* Netscape Communications Corporation.
|
||||||
|
* Portions created by the Initial Developer are Copyright (C) 1998
|
||||||
|
* the Initial Developer. All Rights Reserved.
|
||||||
|
*
|
||||||
|
* Contributor(s):
|
||||||
|
*
|
||||||
|
* Alternatively, the contents of this file may be used under the terms of
|
||||||
|
* either the GNU General Public License Version 2 or later (the "GPL"), or
|
||||||
|
* the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
|
||||||
|
* in which case the provisions of the GPL or the LGPL are applicable instead
|
||||||
|
* of those above. If you wish to allow use of your version of this file only
|
||||||
|
* under the terms of either the GPL or the LGPL, and not to allow others to
|
||||||
|
* use your version of this file under the terms of the MPL, indicate your
|
||||||
|
* decision by deleting the provisions above and replace them with the notice
|
||||||
|
* and other provisions required by the GPL or the LGPL. If you do not delete
|
||||||
|
* the provisions above, a recipient may use your version of this file under
|
||||||
|
* the terms of any one of the MPL, the GPL or the LGPL.
|
||||||
|
*
|
||||||
|
* ***** END LICENSE BLOCK ***** */
|
||||||
|
|
||||||
|
#include "../nsSBCharSetProber.h"
|
||||||
|
|
||||||
|
/********* Language model for: German *********/
|
||||||
|
|
||||||
|
/**
|
||||||
|
* Generated by BuildLangModel.py
|
||||||
|
* On: 2015-12-03 22:50:46.518374
|
||||||
|
**/
|
||||||
|
|
||||||
|
/* Character Mapping Table:
|
||||||
|
* ILL: illegal character.
|
||||||
|
* CTR: control character specific to the charset.
|
||||||
|
* RET: carriage/return.
|
||||||
|
* SYM: symbol (punctuation) that does not belong to word.
|
||||||
|
* NUM: 0 - 9.
|
||||||
|
*
|
||||||
|
* Other characters are ordered by probabilities
|
||||||
|
* (0 is the most common character in the language).
|
||||||
|
*
|
||||||
|
* Orders are generic to a language. So the codepoint with order X in
|
||||||
|
* CHARSET1 maps to the same character as the codepoint with the same
|
||||||
|
* order X in CHARSET2 for the same language.
|
||||||
|
* As such, it is possible to get missing order. For instance the
|
||||||
|
* ligature of 'o' and 'e' exists in ISO-8859-15 but not in ISO-8859-1
|
||||||
|
* even though they are both used for French. Same for the euro sign.
|
||||||
|
*/
|
||||||
|
static const unsigned char Windows_1252_CharToOrderMap[] =
|
||||||
|
{
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
|
||||||
|
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
|
||||||
|
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
|
||||||
|
SYM, 5, 15, 12, 8, 0, 17, 14, 7, 3, 23, 16, 9, 13, 2, 11, /* 4X */
|
||||||
|
18, 30, 1, 4, 6, 10, 21, 19, 28, 25, 20,SYM,SYM,SYM,SYM,SYM, /* 5X */
|
||||||
|
SYM, 5, 15, 12, 8, 0, 17, 14, 7, 3, 23, 16, 9, 13, 2, 11, /* 6X */
|
||||||
|
18, 30, 1, 4, 6, 10, 21, 19, 28, 25, 20,SYM,SYM,SYM,SYM,CTR, /* 7X */
|
||||||
|
SYM,ILL,SYM, 59,SYM,SYM,SYM,SYM,SYM,SYM, 36,SYM, 54,ILL, 42,ILL, /* 8X */
|
||||||
|
ILL,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, 36,SYM, 54,ILL, 42, 56, /* 9X */
|
||||||
|
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* AX */
|
||||||
|
SYM,SYM,SYM,SYM,SYM, 60,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* BX */
|
||||||
|
41, 31, 37, 44, 22, 49, 50, 35, 32, 29, 48, 43, 57, 33, 47, 52, /* CX */
|
||||||
|
53, 39, 51, 34, 40, 55, 26,SYM, 38, 58, 46, 61, 24, 45, 62, 27, /* DX */
|
||||||
|
41, 31, 37, 44, 22, 49, 50, 35, 32, 29, 48, 43, 57, 33, 47, 52, /* EX */
|
||||||
|
53, 39, 51, 34, 40, 55, 26,SYM, 38, 58, 46, 63, 24, 45, 64, 56, /* FX */
|
||||||
|
};
|
||||||
|
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
|
||||||
|
|
||||||
|
static const unsigned char Iso_8859_1_CharToOrderMap[] =
|
||||||
|
{
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,RET,CTR,CTR,RET,CTR,CTR, /* 0X */
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 1X */
|
||||||
|
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* 2X */
|
||||||
|
NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,NUM,SYM,SYM,SYM,SYM,SYM,SYM, /* 3X */
|
||||||
|
SYM, 5, 15, 12, 8, 0, 17, 14, 7, 3, 23, 16, 9, 13, 2, 11, /* 4X */
|
||||||
|
18, 30, 1, 4, 6, 10, 21, 19, 28, 25, 20,SYM,SYM,SYM,SYM,SYM, /* 5X */
|
||||||
|
SYM, 5, 15, 12, 8, 0, 17, 14, 7, 3, 23, 16, 9, 13, 2, 11, /* 6X */
|
||||||
|
18, 30, 1, 4, 6, 10, 21, 19, 28, 25, 20,SYM,SYM,SYM,SYM,CTR, /* 7X */
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 8X */
|
||||||
|
CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR,CTR, /* 9X */
|
||||||
|
SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* AX */
|
||||||
|
SYM,SYM,SYM,SYM,SYM, 65,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM,SYM, /* BX */
|
||||||
|
41, 31, 37, 44, 22, 49, 50, 35, 32, 29, 48, 43, 57, 33, 47, 52, /* CX */
|
||||||
|
53, 39, 51, 34, 40, 55, 26,SYM, 38, 58, 46, 66, 24, 45, 67, 27, /* DX */
|
||||||
|
41, 31, 37, 44, 22, 49, 50, 35, 32, 29, 48, 43, 57, 33, 47, 52, /* EX */
|
||||||
|
53, 39, 51, 34, 40, 55, 26,SYM, 38, 58, 46, 68, 24, 45, 69, 56, /* FX */
|
||||||
|
};
|
||||||
|
/*X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 XA XB XC XD XE XF */
|
||||||
|
|
||||||
|
|
||||||
|
/* Model Table:
|
||||||
|
* Total sequences: 1188
|
||||||
|
* First 512 sequences: 0.9934041448127945
|
||||||
|
* Next 512 sequences (512-1024): 0.006482829516922903
|
||||||
|
* Rest: 0.0001130256702826099
|
||||||
|
* Negative sequences: TODO
|
||||||
|
*/
|
||||||
|
static const PRUint8 GermanLangModel[] =
|
||||||
|
{
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,2,3,2,3,3,0,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,2,2,3,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,1,2,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,0,2,2,3,3,2,3,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,2,0,0,3,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,0,3,0,3,3,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,3,3,3,3,3,3,3,2,3,3,3,0,0,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,2,3,3,3,2,3,3,3,3,3,2,3,2,2,3,2,3,3,3,0,0,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,2,3,3,3,2,3,2,3,2,2,3,2,3,3,2,0,0,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,2,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,2,2,3,3,3,3,3,3,3,3,3,3,2,2,2,2,0,3,3,3,2,
|
||||||
|
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0,3,0,3,0,3,3,1,2,
|
||||||
|
3,3,2,3,2,3,3,3,2,3,3,3,3,2,2,2,3,2,2,2,2,2,2,2,1,3,2,0,1,2,3,
|
||||||
|
3,3,2,3,3,3,3,2,3,3,3,3,3,3,2,3,2,3,3,2,2,2,3,2,3,3,3,0,0,2,2,
|
||||||
|
3,3,3,3,3,3,3,3,2,3,3,3,2,3,3,2,3,2,2,2,3,2,3,2,3,3,2,0,2,2,1,
|
||||||
|
3,3,3,3,3,3,3,3,2,3,3,3,2,2,3,3,2,2,2,2,2,2,3,2,3,3,3,0,0,2,0,
|
||||||
|
3,3,3,3,3,3,3,3,2,3,3,3,1,3,2,2,3,3,3,2,2,2,3,2,3,3,3,0,1,2,1,
|
||||||
|
3,3,3,3,3,3,3,2,2,3,3,3,2,3,3,2,3,3,2,2,2,2,3,2,3,2,3,0,0,2,0,
|
||||||
|
3,3,2,3,3,3,3,3,3,3,3,3,2,2,2,2,2,3,3,2,2,2,3,2,2,2,2,0,0,2,0,
|
||||||
|
3,3,3,3,3,3,2,2,2,2,3,3,1,2,2,2,2,2,2,2,2,2,3,3,3,2,3,0,0,0,0,
|
||||||
|
3,2,2,3,3,3,3,2,2,3,3,3,2,3,2,3,2,2,2,3,3,2,2,2,3,3,3,0,0,2,2,
|
||||||
|
3,2,2,3,2,3,2,0,2,2,2,3,1,2,2,2,2,2,2,2,2,2,2,1,0,2,3,0,0,2,1,
|
||||||
|
2,3,3,3,3,2,3,3,3,3,3,2,3,3,3,2,2,3,2,0,2,2,0,0,0,0,0,2,0,0,2,
|
||||||
|
3,2,2,3,2,3,2,2,2,2,3,3,2,2,2,1,2,1,2,0,2,0,3,2,3,2,2,0,0,2,0,
|
||||||
|
2,3,3,0,3,1,3,3,3,3,0,0,3,2,3,3,2,2,2,1,1,0,0,0,0,0,0,2,0,0,0,
|
||||||
|
3,3,3,2,3,3,2,2,2,3,2,3,3,3,2,2,3,2,3,2,2,2,0,2,2,2,1,0,0,1,0,
|
||||||
|
2,3,3,2,3,0,3,3,2,3,0,1,3,3,3,2,2,3,2,2,2,2,0,0,0,0,1,3,1,0,0,
|
||||||
|
3,2,2,3,2,2,3,2,1,2,2,2,0,2,2,3,2,2,2,2,2,2,0,0,0,0,0,0,0,0,0,
|
||||||
|
3,1,2,3,1,3,3,2,1,2,2,2,2,0,0,2,2,2,3,2,0,2,0,0,0,2,0,0,2,2,0,
|
||||||
|
2,3,2,0,2,2,2,2,2,2,2,2,2,2,2,3,2,2,2,1,2,2,0,2,0,0,0,0,0,0,2,
|
||||||
|
0,1,0,2,0,2,0,0,0,0,3,2,0,0,0,0,0,1,0,2,0,0,0,0,0,0,0,0,0,0,0,
|
||||||
|
};
|
||||||
|
|
||||||
|
|
||||||
|
const SequenceModel Windows_1252GermanModel =
|
||||||
|
{
|
||||||
|
Windows_1252_CharToOrderMap,
|
||||||
|
GermanLangModel,
|
||||||
|
31,
|
||||||
|
(float)0.9934041448127945,
|
||||||
|
PR_TRUE,
|
||||||
|
"WINDOWS-1252"
|
||||||
|
};
|
||||||
|
|
||||||
|
const SequenceModel Iso_8859_1GermanModel =
|
||||||
|
{
|
||||||
|
Iso_8859_1_CharToOrderMap,
|
||||||
|
GermanLangModel,
|
||||||
|
31,
|
||||||
|
(float)0.9934041448127945,
|
||||||
|
PR_TRUE,
|
||||||
|
"ISO-8859-1"
|
||||||
|
};
|
||||||
@ -85,6 +85,9 @@ nsSBCSGroupProber::nsSBCSGroupProber()
|
|||||||
mProbers[17] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
|
mProbers[17] = new nsSingleByteCharSetProber(&Latin2HungarianModel);
|
||||||
mProbers[18] = new nsSingleByteCharSetProber(&Win1250HungarianModel);
|
mProbers[18] = new nsSingleByteCharSetProber(&Win1250HungarianModel);
|
||||||
|
|
||||||
|
mProbers[19] = new nsSingleByteCharSetProber(&Iso_8859_1GermanModel);
|
||||||
|
mProbers[20] = new nsSingleByteCharSetProber(&Windows_1252GermanModel);
|
||||||
|
|
||||||
Reset();
|
Reset();
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@ -40,7 +40,7 @@
|
|||||||
#define nsSBCSGroupProber_h__
|
#define nsSBCSGroupProber_h__
|
||||||
|
|
||||||
|
|
||||||
#define NUM_OF_SBCS_PROBERS 19
|
#define NUM_OF_SBCS_PROBERS 21
|
||||||
|
|
||||||
class nsCharSetProber;
|
class nsCharSetProber;
|
||||||
class nsSBCSGroupProber: public nsCharSetProber {
|
class nsSBCSGroupProber: public nsCharSetProber {
|
||||||
|
|||||||
@ -139,6 +139,8 @@ extern const SequenceModel TIS620ThaiModel;
|
|||||||
extern const SequenceModel Iso_8859_15FrenchModel;
|
extern const SequenceModel Iso_8859_15FrenchModel;
|
||||||
extern const SequenceModel Iso_8859_1FrenchModel;
|
extern const SequenceModel Iso_8859_1FrenchModel;
|
||||||
extern const SequenceModel Windows_1252FrenchModel;
|
extern const SequenceModel Windows_1252FrenchModel;
|
||||||
|
extern const SequenceModel Iso_8859_1GermanModel;
|
||||||
|
extern const SequenceModel Windows_1252GermanModel;
|
||||||
|
|
||||||
#endif /* nsSingleByteCharSetProber_h__ */
|
#endif /* nsSingleByteCharSetProber_h__ */
|
||||||
|
|
||||||
|
|||||||
11
test/de/iso-8859-1.txt
Normal file
11
test/de/iso-8859-1.txt
Normal file
@ -0,0 +1,11 @@
|
|||||||
|
ISO 8859-1, genauer ISO/IEC 8859-1, auch bekannt als Latin-1, ist ein von der
|
||||||
|
ISO zuletzt 1998 aktualisierter Standard für die Informationstechnik zur
|
||||||
|
Zeichenkodierung mit acht Bit und der erste Teil der Normenfamilie ISO/IEC 8859.
|
||||||
|
|
||||||
|
Die mit sieben Bit kodierbaren Zeichen entsprechen US-ASCII mit führendem
|
||||||
|
Nullbit. Zusätzlich zu den 95 darstellbaren ASCII-Zeichen (2016-7E16) kodiert
|
||||||
|
ISO 8859-1 96 weitere (A016-FF16), also insgesamt 191 von theoretisch möglichen
|
||||||
|
256 (= 28). Den Positionen 0016-1F16 und 7F16-9F16 sind in ISO/IEC 8859 und
|
||||||
|
damit ISO/IEC 8859-1 keine Zeichen zugewiesen. Die von der IANA definierte
|
||||||
|
Bezeichnung ISO-8859-1 (mit Bindestrich) steht für die Kombination der Zeichen
|
||||||
|
dieser Norm mit nicht darstellbaren Steuerzeichen gemäß ISO/IEC 6429.
|
||||||
11
test/de/windows-1252.txt
Normal file
11
test/de/windows-1252.txt
Normal file
@ -0,0 +1,11 @@
|
|||||||
|
ISO 8859-1, genauer ISO/IEC 8859-1, auch bekannt als Latin-1, ist ein von der
|
||||||
|
ISO zuletzt 1998 aktualisierter Standard für die Informationstechnik zur
|
||||||
|
Zeichenkodierung mit acht Bit und der erste Teil der Normenfamilie ISO/IEC 8859.
|
||||||
|
|
||||||
|
Die mit sieben Bit kodierbaren Zeichen entsprechen US-ASCII mit führendem
|
||||||
|
Nullbit. Zusätzlich zu den 95 darstellbaren ASCII-Zeichen (2016–7E16) kodiert
|
||||||
|
ISO 8859-1 96 weitere (A016–FF16), also insgesamt 191 von theoretisch möglichen
|
||||||
|
256 (= 28). Den Positionen 0016–1F16 und 7F16–9F16 sind in ISO/IEC 8859 und
|
||||||
|
damit ISO/IEC 8859-1 keine Zeichen zugewiesen. Die von der IANA definierte
|
||||||
|
Bezeichnung ISO-8859-1 (mit Bindestrich) steht für die Kombination der Zeichen
|
||||||
|
dieser Norm mit nicht darstellbaren Steuerzeichen gemäß ISO/IEC 6429.
|
||||||
Loading…
x
Reference in New Issue
Block a user