Mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git, synced 2025-12-07 01:06:40 +08:00
Add a generic language model (see the upcoming commit). It uses the same data as the language-specific single-byte encoding statistics model, but applies it to Unicode code points. To make this work, rather than relying only on the CharToOrderMap, which mapped each encoded byte (always 256 values) directly to an order, we now add an array of frequent characters, sorted by Unicode code point, that maps each code point to its frequency order (so it can be used with the same sequence mapping array). This of course means that each prober that wants to use these generic models has to implement its own byte-to-code-point decoder, since that logic is per-encoding anyway. This will come in a subsequent commit.
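The code-point lookup described above can be sketched roughly as follows. This is a minimal illustration in Python, not uchardet's actual implementation: the names `FREQUENT_CHARS`, `UNICODE_CHAR_TO_ORDER`, `order_of` and `orders_for_bytes` are hypothetical, the character order is taken from the German log in this file, and Python's built-in codecs stand in for the per-encoding byte-to-code-point decoder that each prober would need.

```python
from bisect import bisect_left

# The 31 frequent characters of the German log, in frequency order
# (index in this string == frequency order).
FREQUENT_CHARS = "erinsathuldocmgbkpfzwvjäyüößxéq"

# Hypothetical table of (code point, frequency order), sorted by code point
# so that it can be searched with a binary search.
UNICODE_CHAR_TO_ORDER = sorted(
    (ord(char), order) for order, char in enumerate(FREQUENT_CHARS)
)
CODE_POINTS = [code_point for code_point, _ in UNICODE_CHAR_TO_ORDER]


def order_of(code_point):
    """Return the frequency order of a code point, or None if it is not frequent."""
    index = bisect_left(CODE_POINTS, code_point)
    if index < len(CODE_POINTS) and CODE_POINTS[index] == code_point:
        return UNICODE_CHAR_TO_ORDER[index][1]
    return None


def orders_for_bytes(data, encoding):
    """Decode bytes with a per-encoding decoder, then map code points to orders.

    Assumption: Python's codec machinery plays the role of the decoder
    that each prober would implement itself.
    """
    text = data.decode(encoding).lower()
    return [order_of(ord(char)) for char in text]


if __name__ == "__main__":
    # The same text yields the same orders whatever the byte encoding was.
    print(orders_for_bytes("für".encode("iso-8859-1"), "iso-8859-1"))
    print(orders_for_bytes("für".encode("utf-8"), "utf-8"))
```

Because the orders depend only on code points, one table of this kind could serve every encoding of the language, which is the point of the generic model; only the decoding step differs per encoding.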
151 lines
4.6 KiB
Plaintext
= Logs of language model for German (de) =

- Generated by BuildLangModel.py
- Started: 2021-03-16 01:05:29.301622
- Maximum depth: 4
- Max number of pages: 100

== Parsed pages ==

Wikipedia:Hauptseite (revision 201839754)
1021 (revision 209824844)
1521 (revision 209838003)
16. März (revision 209315535)
1861 (revision 209842356)
1946 (revision 209524711)
1951 (revision 209835290)
Beyoncé (revision 209832932)
Bolivien (revision 209448707)
Bund der Schweizerinnen gegen das Frauenstimmrecht (revision 209693790)
Bundesgrenzschutz (revision 208691250)
Clara Weaver Parrish (revision 209287165)
Dornmühle (Fränkisch-Crumbach) (revision 209842366)
Edmund Weiskopf (revision 209843848)
Enrico Letta (revision 209811620)
Enzyklopädie (revision 209393223)
Ferdinand Magellan (revision 209566955)
Freie Inhalte (revision 207460431)
Geschichte der Bundesrepublik Deutschland (bis 1990) (revision 209662112)
Giovanni Gastel (revision 209840651)
Henry Darrow (revision 209836134)
Heribert von Köln (revision 208577962)
Homonhon (revision 207392862)
Internationales Olympisches Komitee (revision 209815926)
Jeanine Áñez (revision 209843969)
Jeanne d’Arc Mujawamariya (revision 209842628)
Kommunalwahlen in Hessen 2021 (revision 209834340)
Landtagswahl in Baden-Württemberg 2021 (revision 209842530)
Mark Lubotsky (revision 209830272)
Marvelous Marvin Hagler (revision 209843820)
Max Blokzijl (revision 209843982)
Molly Pitcher (revision 209843994)
Murray Walker (revision 209841073)
März 2021 (revision 209804897)
Nekrolog 2021 (revision 207237920)
Oscarverleihung 2021 (revision 209715006)
Thomas Bach (revision 209739384)
1. Dezember (revision 209839074)
1. Januar (revision 209777781)
1. November (revision 209796293)
10. Februar (revision 209675106)
10. Mai (revision 208810425)
10. März (revision 209821650)
11. Juli (revision 209510718)
11. März (revision 209819434)
11. November (revision 209630921)
12. Dezember (revision 209724301)
12. Mai (revision 208883973)
12. März (revision 209795040)
12. September (revision 209262794)
13. Dezember (revision 209710424)
13. Januar (revision 209629276)
13. März (revision 209795132)
13. Oktober (revision 209183744)
14. Februar (revision 209414444)
14. September (revision 209562392)
16. April (revision 209621904)
19. August (revision 208018991)
1920 (revision 209819215)
1921 (revision 209733600)
1923 (revision 209799201)
1924 (revision 209534204)
1925 (revision 209632533)
1926 (revision 209684778)
1927 (revision 209374750)
1929 (revision 209747684)
1930 (revision 209715589)
1931 (revision 209767120)
1933 (revision 209704894)
1934 (revision 209767120)
1936 (revision 209834629)
1939 (revision 209524711)
1940 (revision 209524711)
1941 (revision 209524711)
1942 (revision 209524711)
1944 (revision 209505481)
1945 (revision 209524711)
1947 (revision 209505481)
1948 (revision 209767120)
1950 (revision 209655464)
1952 (revision 209572541)
1954 (revision 209187815)
1955 (revision 209259419)
1957 (revision 209842142)
1965 (revision 209593366)
1980er (revision 209258403)
1990er (revision 209258403)
2. März (revision 209835819)
2. September (revision 209803579)
20. April (revision 209655478)
20. Jahrhundert (revision 207914301)
20. Januar (revision 209517100)

== End of Parsed pages ==

- Wikipedia parsing ended at: 2021-03-16 01:10:34.749053

59 characters appeared 3848604 times.

First 31 characters:
[ 0] Char e: 13.62925362027374 %
[ 1] Char r: 9.404189155340482 %
[ 2] Char i: 8.18457809636949 %
[ 3] Char n: 7.829540269666611 %
[ 4] Char s: 6.804155480792516 %
[ 5] Char a: 6.737923673103287 %
[ 6] Char t: 5.6408765360115 %
[ 7] Char h: 4.424695292111114 %
[ 8] Char u: 4.194118178955279 %
[ 9] Char l: 4.1823216937881895 %
[10] Char d: 4.112010484840737 %
[11] Char o: 3.6970808116397533 %
[12] Char c: 3.4451453046351355 %
[13] Char m: 2.8236732072200725 %
[14] Char g: 2.3015618130626065 %
[15] Char b: 2.0475736137051253 %
[16] Char k: 1.9373258459431004 %
[17] Char p: 1.6796479970399656 %
[18] Char f: 1.6060368902594293 %
[19] Char z: 1.0385064298639195 %
[20] Char w: 0.9370410673584499 %
[21] Char v: 0.7894031186373033 %
[22] Char j: 0.6687879553209424 %
[23] Char ä: 0.5280616036360197 %
[24] Char y: 0.35885739348605367 %
[25] Char ü: 0.33731711550473886 %
[26] Char ö: 0.27194276158316105 %
[27] Char ß: 0.13979094757475696 %
[28] Char x: 0.09044838076351841 %
[29] Char é: 0.04185933392991329 %
[30] Char q: 0.02814007364748361 %

The first 31 characters have an accumulated ratio of 0.9991186414606439.

1337 sequences found.

First 512 (typical positive ratio): 0.9936565191798025
Next 512 (512-1024): 0.0033731711550473885
Rest: 0.00017862552962171364

- Processing end: 2021-03-16 01:10:34.853392
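
For readers unfamiliar with the figures above, the following Python sketch shows one plausible way such ratios could be derived from raw counts. It is an assumption-laden illustration, not the code of BuildLangModel.py: the function names `accumulated_ratio` and `sequence_ratios` are hypothetical, and the sample text is made up for the demonstration.

```python
from collections import Counter


def accumulated_ratio(char_counts, n):
    """Share of all counted characters covered by the n most frequent ones."""
    total = sum(char_counts.values())
    top = char_counts.most_common(n)
    return sum(count for _, count in top) / total


def sequence_ratios(seq_counts, bucket=512):
    """Split two-character sequences, most frequent first, into three groups.

    Returns the share of occurrences covered by the first `bucket` sequences,
    by the next `bucket` sequences, and by everything after those.
    """
    counts = sorted(seq_counts.values(), reverse=True)
    total = sum(counts)
    first = sum(counts[:bucket]) / total
    following = sum(counts[bucket:2 * bucket]) / total
    rest = sum(counts[2 * bucket:]) / total
    return first, following, rest


if __name__ == "__main__":
    # Made-up sample text, only to exercise the two functions.
    text = "der die das und der"
    chars = Counter(c for c in text if c.isalpha())
    pairs = Counter(
        a + b for a, b in zip(text, text[1:]) if a.isalpha() and b.isalpha()
    )
    print(accumulated_ratio(chars, 3))
    print(sequence_ratios(pairs, bucket=4))
```

In this sketch the three sequence shares always add up to 1 by construction; the figures logged above come from the real script and need not follow exactly the same definition.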