uchardet/script/BuildLangModelLogs/LangFrenchModel.log
Jehan 9518f4d7a2 Rebuild a bunch of language models.
Add a generic language model (see the upcoming commit), which uses the same
data as the specific single-byte encoding statistics models, except that it
applies them to Unicode code points.
Instead of the CharToOrderMap, which mapped directly from an encoded byte
(always 256 values) to an order, we now add an array of frequent characters,
sorted by Unicode code point, mapping each code point to its order of
frequency (which can be used with the same sequence mapping array).

This of course means that each prober that wants to use these generic models
will have to implement its own byte-to-code-point decoder, as this logic is
per-encoding anyway. This will come in a subsequent commit.
2021-03-16 12:35:18 +01:00
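
The lookup described in the commit message above can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual uchardet data structure: the names UNICODE_CHAR_TO_ORDER, order_of and decode_latin1 are hypothetical, and the table holds only five entries whose orders are copied from the French statistics listed in the log further down.

```python
from bisect import bisect_left

# Hypothetical excerpt of a generic model table: (code point, frequency order)
# pairs sorted by code point. Orders come from the French log (e=0, a=1, s=2,
# é=14, à=23); a real table would hold every frequent character.
UNICODE_CHAR_TO_ORDER = sorted(
    (ord(ch), order)
    for ch, order in {"e": 0, "a": 1, "s": 2, "é": 14, "à": 23}.items()
)
CODE_POINTS = [cp for cp, _ in UNICODE_CHAR_TO_ORDER]


def order_of(code_point):
    """Binary-search the sorted table; return None for characters not in it."""
    i = bisect_left(CODE_POINTS, code_point)
    if i < len(CODE_POINTS) and CODE_POINTS[i] == code_point:
        return UNICODE_CHAR_TO_ORDER[i][1]
    return None


def decode_latin1(data):
    """Per-encoding decoder (assumed example): ISO-8859-1 bytes equal code points."""
    return list(data)


# The same table can serve any encoding once its bytes are decoded.
orders = [order_of(cp) for cp in decode_latin1("été".encode("latin-1"))]
```

Because the table is keyed by code point instead of by encoded byte, only the decoder changes from one encoding to the next; the orders feed the same sequence mapping array in each case.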


= Logs of language model for French (fr) =
- Generated by BuildLangModel.py
- Started: 2021-03-16 01:17:58.545030
- Maximum depth: 4
- Max number of pages: 100
== Parsed pages ==
Wikipédia:Accueil_principal (revision 164303621)
Bœuf (animal) (revision 178255345)
10 mars (revision 180841287)
12 mars (revision 180798998)
13 mars (revision 180904703)
1493 (revision 163870551)
14 mars (revision 180901488)
15 mars (revision 180904428)
1891 (revision 180890066)
1917 (revision 178369116)
1939 (revision 178458019)
2011 (revision 176114496)
45e parallèle nord (revision 180910832)
6 mars (revision 180750121)
7 mars (revision 180750121)
Absolutisme (revision 179767600)
Alassane Ouattara (revision 180842696)
Ambassadeur (revision 180674153)
Amiral de France (revision 177268292)
Amirautés de Bretagne (revision 175194082)
Aurora Cornu (revision 180901231)
Bata (Guinée équatoriale) (revision 180763894)
Bob Walkup (revision 180908319)
Bourgogne-Franche-Comté (revision 180662628)
Centre de données (revision 180741567)
Championnats du monde de ski acrobatique 2021 (revision 180882257)
Christophe Colomb (revision 180494940)
Claude Debussy (revision 179962158)
Couronne solaire (revision 180875717)
Crise présidentielle depuis 2019 au Venezuela (revision 180336636)
Critique musical (revision 174352172)
Côte d'Ivoire (revision 180838790)
Daniel Vachez (revision 180915214)
Degré Celsius (revision 179948881)
Deuxième République (Tchécoslovaquie) (revision 180896689)
Deuxième guerre civile libyenne (revision 180269091)
Empire romain (revision 180843240)
Empire russe (revision 179593986)
Excommunication (revision 178073962)
Explosions de Bata (revision 180862772)
Fatima Aziz (revision 180862495)
Fort du Lomont (revision 180886100)
Frankie de la Cruz (revision 180903250)
GINK (revision 179590111)
Giovanni Gastel (revision 180881061)
Goodwill Zwelithini kaBhekuzulu (revision 180806403)
Gouvernement de l'Église catholique (revision 176961659)
Guerre civile syrienne (revision 180897321)
Guerre civile yéménite (revision 180691885)
Guerre du Tigré (revision 180793174)
Guinée équatoriale (revision 180759310)
Hamed Bakayoko (revision 180904779)
Helena Fuchsová (revision 180909783)
Henri-Charles de Beaumanoir de Lavardin (revision 180903071)
Henry Darrow (revision 180905848)
Heure en France (revision 180854115)
Incendie du centre de données d'OVHcloud à Strasbourg (revision 180901025)
Innocent XI (revision 180108629)
Ivo Trumbić (revision 180827381)
Jean-Claude Fasquelle (revision 180871354)
Jean-Jacques Viton (revision 180889491)
Jean Frydman (revision 180909934)
Le Mans (revision 180520548)
Lieutenant général (revision 180899945)
Liste des ambassadeurs de France près le Saint-Siège (revision 180150184)
Manifestation des agriculteurs indiens de 2020-2021 (revision 180901643)
Manifestations de 2020-2021 en Arménie (revision 180901656)
Manifestations de 2020-2021 en Biélorussie (revision 180901634)
Manifestations de 2021 au Sénégal (revision 180900196)
Manifestations de 2021 en Birmanie (revision 180901671)
Manifestations de 2021 en Russie (revision 180897927)
Manifestations de Deraa (revision 180914771)
Mars 1891 (revision 155220626)
Mars 2021 (revision 180914744)
Marvin Hagler (revision 180908678)
Militaire (revision 178062901)
Murray Walker (revision 180862148)
OVHcloud (revision 180900746)
Obren Joksimović (revision 180901629)
Palais Farnèse (revision 180885444)
Pandémie de Covid-19 (revision 180845115)
Pays-Bas (revision 180853920)
Photosphère (revision 179722426)
Premier ministre ivoirien (revision 180838804)
Province de Bretagne (revision 176523092)
Président de la république de Côte d'Ivoire (revision 180747416)
Pôle Nord (revision 178839482)
Querelle des Franchises (revision 180092394)
Raoul Casadei (revision 180910155)
Rassemblement des houphouëtistes pour la démocratie et la paix (revision 180912125)
Roi des Français (revision 180882393)
Ronald DeFeo Jr. (revision 180915749)
Royaume de France (revision 180809662)
Révolte du Papier timbré (revision 180903105)
== End of Parsed pages ==
- Wikipedia parsing ended at: 2021-03-16 01:24:27.092152
57 characters appeared 1900431 times.
First 38 characters:
[ 0] Char e: 14.210092342210793 %
[ 1] Char a: 8.0327567799094 %
[ 2] Char s: 7.818647454182762 %
[ 3] Char i: 7.531554684174274 %
[ 4] Char n: 7.491616375443256 %
[ 5] Char r: 7.05650455080979 %
[ 6] Char t: 6.771779664718161 %
[ 7] Char l: 5.854461435327039 %
[ 8] Char o: 5.412772155368966 %
[ 9] Char u: 5.014546700195903 %
[10] Char d: 4.239248886173716 %
[11] Char c: 3.238896860764742 %
[12] Char m: 2.8875028875028876 %
[13] Char p: 2.787104609428072 %
[14] Char é: 2.546790701688196 %
[15] Char v: 1.3356443880361877 %
[16] Char g: 1.1728392138414918 %
[17] Char f: 1.1096956427252553 %
[18] Char b: 1.084859171419536 %
[19] Char h: 0.9054261901642312 %
[20] Char q: 0.7540920980556516 %
[21] Char y: 0.42858698895145364 %
[22] Char x: 0.4087493836924361 %
[23] Char à: 0.39127966235027745 %
[24] Char è: 0.3704422838819194 %
[25] Char j: 0.35176231076003284 %
[26] Char k: 0.17332910271406854 %
[27] Char z: 0.11539487621492178 %
[28] Char ê: 0.10397641377140239 %
[29] Char ç: 0.09292628882606103 %
[30] Char ô: 0.07540394784130547 %
[31] Char w: 0.06340666932922058 %
[32] Char î: 0.031729644485908724 %
[33] Char û: 0.029309140926453 %
[34] Char â: 0.02504694987610705 %
[35] Char ï: 0.019942844544211285 %
[36] Char ù: 0.016259469562430837 %
[37] Char œ: 0.010839646374953892 %
The first 38 characters have an accumulated ratio of 0.9996521841624343.
1049 sequences found.
First 512 (typical positive ratio): 0.997006678170155
Next 512 (512-1024): 0.00010839646374953892
Rest: 1.646491655585584e-05
- Processing end: 2021-03-16 01:24:27.266283
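
The per-character percentages and the accumulated ratio reported above can be reproduced on a small sample with the sketch below. It is not the code of BuildLangModel.py: the function name char_stats, the lowercasing, and the "letters only" filter are assumptions made for illustration.

```python
from collections import Counter


def char_stats(text, keep):
    """Return ([(char, percentage), ...] by descending frequency,
    accumulated ratio of the `keep` most frequent characters)."""
    # Assumption: count lowercase letters only, ignoring digits and punctuation.
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    # Sort by descending count, breaking ties by code point for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    percentages = [(ch, 100.0 * n / total) for ch, n in ranked]
    accumulated = sum(n for _, n in ranked[:keep]) / total
    return percentages, accumulated


percentages, accumulated = char_stats("été en mer", keep=2)
for order, (ch, pct) in enumerate(percentages):
    print(f"[{order:2}] Char {ch}: {pct} %")
print(f"The first 2 characters have an accumulated ratio of {accumulated}.")
```

In the log, the same kind of computation over 1900431 counted characters gives the 38 most frequent characters an accumulated ratio of about 0.99965.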