mirror of
https://gitlab.freedesktop.org/uchardet/uchardet.git
synced 2025-12-07 01:06:40 +08:00
Adding generic language model (see coming commit), which uses the same data as the specific single-byte encoding statistics model, except that it applies it to Unicode code points. For this to work, instead of the CharToOrderMap, which mapped directly from an encoded byte (always 256 values) to an order, we now add an array of frequent characters, sorted by Unicode code point, that maps each code point to its order of frequency (which can be used with the same sequence mapping array). This of course means that each prober where we want to use these generic models will have to implement its own byte-to-code-point decoder, as this is per-encoding logic anyway. This will come in a subsequent commit.
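The lookup described in the commit message can be sketched roughly as follows. This is only an illustration in Python, not the uchardet implementation: the names `chars_by_frequency`, `order_of` and `decode_latin1` are hypothetical, the sample character list is a shortened made-up subset, and the binary search is an assumed lookup strategy over a table sorted by code point.

```python
from bisect import bisect_left

# Hypothetical sample: characters listed from most to least frequent.
# A real model would list every frequent character of the language.
chars_by_frequency = ["i", "e", "a", "o", "n", "à", "è"]

# Table of (code point, frequency order) pairs, sorted by code point so
# that it can be searched with a binary search.
code_point_to_order = sorted(
    (ord(char), order) for order, char in enumerate(chars_by_frequency)
)
code_points = [code_point for code_point, _ in code_point_to_order]


def order_of(code_point):
    """Return the frequency order of a code point, or None if it is not listed."""
    index = bisect_left(code_points, code_point)
    if index < len(code_points) and code_points[index] == code_point:
        return code_point_to_order[index][1]
    return None


def decode_latin1(data):
    """Per-encoding decoder: ISO-8859-1 bytes map one-to-one to code points."""
    return list(data)


# Each prober decodes bytes with its own decoder, then shares the same table.
orders = [order_of(cp) for cp in decode_latin1(b"ia\xe0z")]
```

Under these assumptions, only the decoder differs from one encoding to another; the table and the sequence statistics indexed by frequency order stay the same.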
163 lines
5.5 KiB
Plaintext
= Logs of language model for Italian (it) =

- Generated by BuildLangModel.py
- Started: 2021-03-16 01:25:53.681909
- Maximum depth: 4
- Max number of pages: 100

== Parsed pages ==

Pieve Ligure (revision 118508492)
010 (prefisso) (revision 94383168)
AMT (Genova) (revision 118888771)
Abbazia di San Colombano (revision 119100076)
Abbazia di San Fruttuoso (revision 119098176)
Acacia dealbata (revision 118537500)
Affresco (revision 119234348)
Agenzia nazionale per le nuove tecnologie, l'energia e lo sviluppo economico sostenibile (revision 119261985)
Agricoltura (revision 119211593)
Altitudine (revision 118983270)
Antica Roma (revision 118468482)
Anton Maria Maragliano (revision 116868790)
Appennino Ligure (revision 117194376)
Arcidiocesi di Genova (revision 119158953)
Area (revision 118021697)
Area naturale marina protetta Portofino (revision 117836953)
Arenzano (revision 118507675)
Austria (revision 119220244)
Avegno (revision 118656626)
Bargagli (revision 118656627)
Batteria di Punta Chiappa (revision 118356835)
Battesimo (revision 118993799)
Bogliasco (revision 118656629)
Bogliasco Pieve (revision 118656629)
Borzonasca (revision 118854360)
Busalla (revision 118656635)
Calcio (sport) (revision 118995232)
Calcio a 5 (revision 118431165)
Camogli (revision 118850151)
Campo Ligure (revision 119083085)
Campomorone (revision 119226877)
Cantiere navale (revision 115540115)
Carabinieri (revision 119285803)
Carasco (revision 118801735)
Caravella (revision 118751709)
Casarza Ligure (revision 118656643)
Casella (Italia) (revision 118797269)
Castello della Dragonara (revision 108868054)
Castiglione Chiavarese (revision 118656646)
Centrismo (revision 117397211)
Centro-destra (revision 117992364)
Centrolabrus melanocercus (revision 116914326)
Ceranesi (revision 118656648)
Cesare Lanza (revision 115376996)
Chiavari (revision 119146951)
Chiesa di San Michele Arcangelo (Pieve Ligure) (revision 119097578)
Chiesa di Santa Croce (Pieve Ligure) (revision 119097599)
Chilometro quadrato (revision 116585233)
Cicagna (revision 118656655)
Circondario di Genova (revision 113691033)
Città dell'olio (revision 118165836)
Città metropolitana di Genova (revision 119014943)
Città metropolitane d'Italia (revision 119240923)
Classificazione climatica dei comuni italiani (revision 118213893)
Classificazione sismica dell'Italia (revision 118461862)
Claudio Burlando (revision 119123207)
Codice catastale (revision 116588085)
Codice postale (revision 105346722)
Cogoleto (revision 118508042)
Cogorno (revision 118962627)
Compagnia di Gesù (revision 119271066)
Comune (Italia) (revision 118913656)
Comune medievale (revision 113420512)
Comuni d'Italia (revision 119120484)
Comuni della Liguria (revision 113527316)
Comunità montana Fontanabuona (revision 105560751)
Concilio di Trento (revision 118571991)
Congresso di Vienna (revision 118881415)
Coordinate geografiche (revision 118353691)
Corallo (revision 117035534)
Coreglia Ligure (revision 118656657)
Corona (copricapo) (revision 117780990)
Cristo degli abissi (revision 117435230)
Cristoforo Colombo (revision 119014639)
Croce (revision 117653124)
Crocefieschi (revision 118656658)
Crêuza (revision 119275449)
Davagna (revision 118656659)
Decreto del presidente della Repubblica (revision 119120849)
Democrazia Cristiana (revision 119162011)
Densità di popolazione (revision 119143170)
Dipartimento di Genova (revision 118450361)
Ebano (revision 116535223)
Erba sintetica (revision 114157150)
Etnico (onomastica) (revision 117289144)
Fascia (Italia) (revision 118955929)
Favale di Malvaro (revision 118656662)
Federico Barbarossa (revision 118793984)
Fermata ferroviaria (revision 119085486)
Ferrovia Genova-Pisa (revision 119025272)
Flora (revision 110652725)
Floricoltura (revision 113487805)
Fontanigorda (revision 118803588)
Francesco Bossi (vescovo) (revision 117422608)
Frazione (geografia) (revision 119001222)
Fuso orario (revision 119022172)
Galleria (ingegneria) (revision 115407813)
Gas (revision 117414169)
Genova (revision 119208791)
Germania nazista (revision 119177156)
Giacomo il Maggiore (revision 118986303)

== End of Parsed pages ==

- Wikipedia parsing ended at: 2021-03-16 01:31:12.602302

54 characters appeared 1487235 times.

First 34 characters:
[ 0] Char i: 11.700840822062418 %
[ 1] Char e: 11.23655642854021 %
[ 2] Char a: 11.108197426768466 %
[ 3] Char o: 9.061513479712351 %
[ 4] Char n: 7.150383093458666 %
[ 5] Char l: 7.047440384337378 %
[ 6] Char t: 6.5587482812064 %
[ 7] Char r: 6.521363469794619 %
[ 8] Char s: 4.669067094305877 %
[ 9] Char c: 4.495120139049982 %
[10] Char d: 3.939861555167811 %
[11] Char u: 2.7531627483215497 %
[12] Char p: 2.6924460492121285 %
[13] Char m: 2.5125820734450173 %
[14] Char g: 1.9460273594959776 %
[15] Char v: 1.64123356429885 %
[16] Char f: 1.1068862688142762 %
[17] Char b: 1.0097933413347588 %
[18] Char z: 0.9880079476343685 %
[19] Char h: 0.7280624783574889 %
[20] Char q: 0.27574660359660713 %
[21] Char à: 0.2058854182425777 %
[22] Char è: 0.14859790147488458 %
[23] Char ò: 0.10186688721015845 %
[24] Char ù: 0.07302141221797497 %
[25] Char x: 0.06501998675394272 %
[26] Char k: 0.05291699025372587 %
[27] Char y: 0.04471384818135668 %
[28] Char w: 0.04115018810073727 %
[29] Char ì: 0.041015710361845974 %
[30] Char é: 0.024474948478216286 %
[31] Char j: 0.019028600053118707 %
[32] Char ö: 0.006791125814010562 %
[33] Char ó: 0.004505004252858493 %

The first 34 characters have an accumulated ratio of 0.9997202863031062.

921 sequences found.

First 512 (typical positive ratio): 0.9992462827093448
Next 512 (512-1024): 0.0007302141221797497
Rest: -2.0166160408230382e-17

- Processing end: 2021-03-16 01:31:12.679004