Mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git (synced 2025-12-06 16:56:40 +08:00)
Add a generic language model (see the upcoming commit) that uses the same data as the language-specific single-byte encoding statistics models, but applies it to Unicode code points. For this to work, instead of the CharToOrderMap, which mapped each encoded byte (always 256 values) directly to an order, we now add an array of frequent characters, sorted by Unicode code point, that maps each code point to its order of frequency (and can be used with the same sequence-mapping array). This means that each prober that wants to use these generic models has to implement its own byte-to-code-point decoder, since that logic is per-encoding anyway. This will come in a subsequent commit.
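The idea in the commit message can be sketched in Python as follows. This is only an illustration under stated assumptions, not uchardet's actual C++ code: the names `UNICODE_CHAR_TO_ORDER`, `order_of` and `orders_from_bytes` are hypothetical, and the table holds just a handful of entries whose orders are copied from the Danish log below (e=0, r=1, n=2, t=3, a=4, ø=19).

```python
from bisect import bisect_left

# Hypothetical, tiny model: (code point, frequency order) pairs sorted by code point.
# A real model would list every frequent character of the language.
UNICODE_CHAR_TO_ORDER = sorted(
    (ord(ch), order)
    for ch, order in [("e", 0), ("r", 1), ("n", 2), ("t", 3), ("a", 4), ("ø", 19)]
)
CODE_POINTS = [cp for cp, _ in UNICODE_CHAR_TO_ORDER]


def order_of(code_point):
    """Return the frequency order of a code point, or None if it is not in the table."""
    i = bisect_left(CODE_POINTS, code_point)  # binary search in the sorted array
    if i < len(CODE_POINTS) and CODE_POINTS[i] == code_point:
        return UNICODE_CHAR_TO_ORDER[i][1]
    return None


def orders_from_bytes(data, encoding):
    """Decode bytes to code points (the per-encoding step), then map each to its order."""
    return [order_of(ord(ch)) for ch in data.decode(encoding)]
```

Because the lookup is keyed on code points, the same table serves any encoding once the bytes are decoded: `orders_from_bytes("tørre".encode("utf-8"), "utf-8")` and the ISO-8859-1 equivalent produce the same list of orders.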
157 lines
5.1 KiB
Plaintext
= Logs of language model for Danish (da) =

- Generated by BuildLangModel.py
- Started: 2021-03-16 01:32:17.684746
- Maximum depth: 4
- Max number of pages: 100

== Parsed pages ==

Forside (revision 10000691)
1. symfoni (Beethoven) (revision 10648993)
15. marts (revision 8172123)
1917 (revision 10645384)
1930 (revision 10645389)
1940 (revision 10648721)
1951 (revision 10640371)
1972 (revision 10641861)
2. marts (revision 9423344)
2003 (revision 10654209)
44 f.Kr. (revision 7242128)
7. marts (revision 9423388)
9. marts (revision 10601197)
Abdikation (revision 10197388)
Afsnit af Badehotellet (revision 10654331)
Agnes Slott-Møller (revision 10648962)
Australian Open-mesterskabet i damesingle 2021 (revision 10630904)
Australian Open-mesterskabet i herresingle 2021 (revision 10630887)
Australian Open 2021 (revision 10630544)
Casper & Mandrilaftalen (revision 10444147)
Coronaviruspandemien (revision 10652415)
Cykling under sommer-OL 2012 – Linjeløb (kvinder) (revision 10651872)
Dansk (sprog) (revision 10633727)
Den danske Treårsekspedition til Østgrønland 1931-34 (revision 10654093)
Dnepr (revision 10635465)
Donald Trump (revision 10653185)
Døde i 2021 (revision 10653976)
Encyklopædi (revision 10590147)
Eurovision Song Contest 2014 (revision 10592331)
Folkerepublikken Kina (revision 10634829)
Folketinget (revision 10643927)
Fram-ekspeditionen 1910-1912 (revision 10630146)
Frankrig (revision 10648749)
Frankrigs præsidenter (revision 10477099)
Geologi (revision 10631000)
Geoteknik (revision 10603548)
Greater London (revision 10380043)
Hortus Botanicus Amsterdam (revision 8854568)
Hu Jintao (revision 10610855)
IC4 (revision 10577458)
Idus martius (revision 10652897)
Inger Støjberg (revision 10643259)
Italiens premierministre (revision 10625575)
John Polkinghorne (revision 10654447)
Julius Cæsar (revision 10653812)
Korruption (revision 10401686)
Lars Göran Petrov (revision 10650013)
London Underground (revision 10635531)
Marge Simpson (revision 10640942)
Mario Draghi (revision 10652699)
Matilde af Skotland (revision 10648200)
Metrosystemer i verden (revision 10510595)
Middelaldercentret (revision 10574228)
Naomi Osaka (revision 10478959)
Nederlandene (revision 10642742)
Nicolas Sarkozy (revision 10639376)
Nikolaj 2. af Rusland (revision 10639924)
Novak Djokovic (revision 10479710)
Outlaw Gentlemen & Shady Ladies (revision 10492201)
Paris-Nice 2021 (revision 10653019)
Rigsretssagen mod Donald Trump 2021 (revision 10653875)
Rigsretssagen mod Inger Støjberg (revision 10643260)
Rusland (revision 10631140)
Sanja Ilić (revision 10645645)
Senat (revision 10429780)
Senatet (USA) (revision 10624834)
Shu-bi-dua (revision 10630614)
Svend Johansen (skuespiller) (revision 10643631)
Tennis (revision 10651841)
Tommy Troelsen (revision 10648382)
Træsko (revision 10626215)
USA's præsidenter (revision 10639768)
Undergrundsbane (revision 10541653)
Vilhelm Erobreren (revision 10631208)
Wikimedia (revision 10260889)
Wikipedia (revision 10627445)
Zar (revision 10557166)
1800 (revision 10645359)
2. april (revision 9568657)
Burgtheater (revision 9296862)
C-dur (revision 10513719)
Cello (revision 10641506)
Coda (revision 9298442)
Dominant (revision 9513277)
Dynamik (musik) (revision 9504157)
F-dur (revision 8135200)
Fagot (revision 10578018)
Fløjte (revision 10329382)
Harmonik (revision 10577145)
International Music Score Library Project (revision 10115839)
Italienske og franske musikudtryk (revision 10352094)
Johann Georg Albrechtsberger (revision 10289540)
Joseph Haydn (revision 10289602)
Klarinet (revision 10490230)
Klassicisme (musik) (revision 10436811)
Kontrabas (revision 10147393)
Kontrapunkt (musikteori) (revision 10184029)
Leipzig (revision 10611798)
Ludwig van Beethoven (revision 10642134)

== End of Parsed pages ==

- Wikipedia parsing ended at: 2021-03-16 01:36:49.098009

57 characters appeared 1058523 times.

First 30 characters:
[ 0] Char e: 15.118707859914238 %
[ 1] Char r: 8.552388564065213 %
[ 2] Char n: 7.6833474567864855 %
[ 3] Char t: 7.125305732610439 %
[ 4] Char a: 6.351302711419591 %
[ 5] Char i: 6.265806222443915 %
[ 6] Char s: 6.152629654716997 %
[ 7] Char d: 5.90341447469729 %
[ 8] Char o: 5.144999211164992 %
[ 9] Char l: 5.1253491893893655 %
[10] Char g: 3.907992551885977 %
[11] Char m: 3.3046990948708723 %
[12] Char k: 3.0474538578755492 %
[13] Char f: 2.586434116216653 %
[14] Char v: 2.2680659749481116 %
[15] Char u: 1.9654745338551927 %
[16] Char b: 1.7524418458550264 %
[17] Char p: 1.6338804163915193 %
[18] Char h: 1.5844719481768466 %
[19] Char ø: 0.7598323324103491 %
[20] Char æ: 0.7542585281566863 %
[21] Char å: 0.728278932059105 %
[22] Char y: 0.6751860847615027 %
[23] Char c: 0.6527963964883143 %
[24] Char j: 0.5847770903419198 %
[25] Char w: 0.17241004682940286 %
[26] Char z: 0.0783166733268904 %
[27] Char x: 0.05602145631223884 %
[28] Char é: 0.019177665482941794 %
[29] Char q: 0.016626941502452003 %

The first 30 characters have an accumulated ratio of 0.9997184756495605.
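As a rough illustration of how figures like the ones above can be derived, the following Python sketch counts the letters of a text, ranks them by frequency, and sums the ratios of the top entries. The function name `char_stats`, the lowercasing, and the `isalpha` filter are assumptions made for this example; they are not taken from BuildLangModel.py.

```python
from collections import Counter


def char_stats(text, top_n=30):
    """Rank letters by frequency and return (ranking, accumulated ratio of the top_n)."""
    # Assumption for this sketch: count lowercased alphabetic characters only.
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    # most_common() sorts by descending count; each entry becomes (char, ratio).
    ranked = [(ch, n / total) for ch, n in counts.most_common()]
    accumulated = sum(ratio for _, ratio in ranked[:top_n])
    return ranked, accumulated


if __name__ == "__main__":
    ranked, accumulated = char_stats("rødgrød med fløde", top_n=2)
    for order, (ch, ratio) in enumerate(ranked):
        print("[{:2}] Char {}: {} %".format(order, ch, ratio * 100))
    print("Accumulated ratio of the first 2 characters:", accumulated)
```

An accumulated ratio close to 1 for the first 30 characters, as in the log, means that the remaining characters of the 57 observed are rare in the sampled text.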

936 sequences found.

First 512 (typical positive ratio): 0.9962304038307248
Next 512 (512-1024): 0.007598323324103491
Rest: -5.2909066017292616e-17
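The bucketing of sequences reported above can be sketched as follows. This is a minimal Python illustration, assuming that a "sequence" is a pair of adjacent frequent characters (given here as frequency orders) and that the remainder is obtained by subtraction; the function name `sequence_ratios` and its parameters are hypothetical and not taken from BuildLangModel.py.

```python
from collections import Counter


def sequence_ratios(orders, first=512, second=1024):
    """Return (distinct sequences, ratio of top `first`, ratio of next bucket, rest)."""
    # Count adjacent pairs, skipping pairs that contain a non-frequent character (None).
    pairs = Counter(
        (a, b)
        for a, b in zip(orders, orders[1:])
        if a is not None and b is not None
    )
    total = sum(pairs.values())
    if total == 0:
        return 0, 0.0, 0.0, 0.0
    # Ratios of all sequences, most frequent first.
    ranked = [n / total for _, n in pairs.most_common()]
    first_ratio = sum(ranked[:first])
    next_ratio = sum(ranked[first:second])
    # Computed by subtraction, so it can differ from zero by a rounding error.
    rest = 1 - first_ratio - next_ratio
    return len(pairs), first_ratio, next_ratio, rest
```

When fewer than 1024 distinct sequences exist, as with the 936 found here, every sequence falls into the first two buckets, so a remainder computed this way is zero up to floating-point rounding; a value on the order of 1e-17 is consistent with that.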

- Processing end: 2021-03-16 01:36:49.182013