Mirror of https://gitlab.freedesktop.org/uchardet/uchardet.git (synced 2025-12-06 16:56:40 +08:00)
This makes it possible to handle cases where some characters are alternatives or variants of another character. For instance, the same word can be written with either variant, and both spellings are considered correct and equivalent. I am using this for the first time on characters with diacritics in Slovene. From browsing the Slovenian Wikipedia a bit, it looks like these characters are only used in titles there. Indeed, they are so rarely used that they would hardly show up in the statistics. Worse, any sequence using them in tested text would likely count as a negative sequence and therefore lower the confidence in Slovenian. As a consequence, various Slovene texts would be detected as Slovak, since Slovak is close enough and commonly uses the same characters with diacritics.
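One way to treat variants as equivalent is to fold each variant onto its base character before counting characters and sequences. The sketch below illustrates that idea only; it is not the code from BuildLangModel.py. The `ALTERNATIVES` table and the helper names `fold_alternatives`, `char_frequencies` and `pair_counts` are hypothetical, and the accented vowels listed are example variants chosen for illustration.

```python
from collections import Counter

# Hypothetical table (an assumption, not taken from the repository):
# each rarely used variant maps to the base character it stands for.
ALTERNATIVES = {
    "á": "a", "à": "a",
    "é": "e", "è": "e", "ê": "e",
    "í": "i", "ì": "i",
    "ó": "o", "ò": "o", "ô": "o",
    "ú": "u", "ù": "u",
}

def fold_alternatives(text, alternatives=ALTERNATIVES):
    """Lowercase the text and replace every variant with its base character."""
    return "".join(alternatives.get(ch, ch) for ch in text.lower())

def char_frequencies(text, alternatives=ALTERNATIVES):
    """Count letters after folding, so variants add to the base count."""
    folded = fold_alternatives(text, alternatives)
    return Counter(ch for ch in folded if ch.isalpha())

def pair_counts(text, alternatives=ALTERNATIVES):
    """Count adjacent letter pairs after folding variants."""
    folded = fold_alternatives(text, alternatives)
    return Counter(
        a + b
        for a, b in zip(folded, folded[1:])
        if a.isalpha() and b.isalpha()
    )
```

With folding in place, a sequence containing a variant is counted as the same sequence as its unaccented spelling, so it no longer looks like an unseen (negative) sequence for the language.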
53 lines
1.5 KiB
Plaintext
= Logs of language model for Slovene (sl) =

- Generated by BuildLangModel.py
- Started: 2021-03-21 14:46:51.759879
- Maximum depth: 4
- Max number of pages: 1

== Parsed pages ==

Ljubljana (revision 5468628)
1689 (revision 4230028)

== End of Parsed pages ==

- Wikipedia parsing ended at: 2021-03-21 14:47:12.578759

34 characters appeared 32235 times.

Most Frequent characters:
[ 0] Char e: 10.097719869706841 %
[ 1] Char a: 9.846440204746393 %
[ 2] Char i: 8.760663874670389 %
[ 3] Char o: 8.515588645881806 %
[ 4] Char n: 7.299519156196681 %
[ 5] Char l: 5.546765937645416 %
[ 6] Char j: 5.264464091825656 %
[ 7] Char r: 5.053513261982317 %
[ 8] Char s: 5.000775554521483 %
[ 9] Char t: 4.814642469365596 %
[10] Char v: 4.374127501163332 %
[11] Char k: 3.4993020009306655 %
[12] Char m: 2.9253916550333487 %
[13] Char d: 2.888165038002172 %
[14] Char p: 2.869551729486583 %
[15] Char u: 2.574841011323096 %
[16] Char b: 2.233597021870638 %
[17] Char z: 1.8458197611292075 %
[18] Char g: 1.48596246316116 %
[19] Char č: 1.181945090739879 %
[20] Char š: 1.0671630215604158 %
[21] Char h: 1.0361408407011012 %
[22] Char c: 0.9492787342950209 %
[23] Char ž: 0.5739103458973166 %
[24] Char f: 0.210950829843338 %
[25] Char x: 0.018613308515588647 %
[26] Char w: 0.018613308515588647 %
[27] Char y: 0.015511090429657206 %
[28] Char ü: 0.009306654257794323 %
[29] Char ö: 0.006204436171862882 %
[30] Char q: 0.006204436171862882 %
[31] Char ř: 0.003102218085931441 %
[32] Char á: 0.003102218085931441 %
[33] Char ý: 0.003102218085931441 %
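
The percentages in this log are fractions of the total character count, so they can be converted back to absolute occurrences. The following is a minimal sketch for reading a log in this format. The regular expressions and the function names `parse_log` and `occurrences` are assumptions made for illustration and are not part of BuildLangModel.py.

```python
import re

# Matches lines such as "[ 0] Char e: 10.097719869706841 %".
CHAR_LINE = re.compile(r"^\[\s*(\d+)\]\s+Char\s+(\S):\s+([0-9.]+)\s*%")
# Matches the summary line "34 characters appeared 32235 times."
TOTAL_LINE = re.compile(r"^(\d+) characters appeared (\d+) times\.")

def parse_log(lines):
    """Return (total occurrences, [(rank, char, percent), ...])."""
    total = None
    entries = []
    for line in lines:
        line = line.strip()
        m = TOTAL_LINE.match(line)
        if m:
            total = int(m.group(2))
            continue
        m = CHAR_LINE.match(line)
        if m:
            entries.append((int(m.group(1)), m.group(2), float(m.group(3))))
    return total, entries

def occurrences(total, percent):
    """Convert a percentage of the total back to a whole-number count."""
    return round(total * percent / 100)

sample = [
    "34 characters appeared 32235 times.",
    "[ 0] Char e: 10.097719869706841 %",
    "[33] Char ý: 0.003102218085931441 %",
]
total, entries = parse_log(sample)
for rank, char, percent in entries:
    print(rank, char, occurrences(total, percent))
```

Applied to the two sample lines, this gives 3255 occurrences for "e" and a single occurrence for "ý", which shows how sparse the statistics are for the characters at the bottom of the list.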