uchardet/script/BuildLangModelLogs/LangVietnameseModel.log
Jehan 178c6119b8 LangModels: add Windows-1258 support for Vietnamese.
I was planning to add VISCII support as well, but Python's encode()
method apparently does not support it, so I cannot generate the proper
statistics data with the current version of the script.
2016-02-13 02:32:57 +01:00
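
The codec limitation mentioned in the commit message can be checked directly. The following is a minimal sketch, not part of BuildLangModel.py: the helper name `codec_available` is hypothetical, and the result depends on the Python build in use. It only relies on `codecs.lookup` from the standard library, which raises `LookupError` for an unknown codec name.

```python
import codecs


def codec_available(name: str) -> bool:
    """Return True if this Python build can resolve the named codec.

    Hypothetical helper for illustration; it is not taken from the
    uchardet scripts.
    """
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# Report which of the two Vietnamese legacy charsets can be used
# with str.encode() on the running interpreter.
for charset in ("cp1258", "viscii"):
    print(charset, codec_available(charset))
```

A charset for which this prints `False` cannot be passed to `str.encode()`, so no byte-level statistics can be generated for it without an external codec.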

= Logs of language model for Vietnamese (vi) =
- Generated by BuildLangModel.py
- Started: 2016-02-13 02:13:44.503931
- Maximum depth: 3
- Max number of pages: 40
== Parsed pages ==
Chữ_Quốc_ngữ (revision 22887853)
1651 (revision 21455247)
1773 (revision 21354755)
1815 (revision 21361292)
1838 (revision 21361314)
1865 (revision 21361338)
1869 (revision 21361342)
1888 (revision 21389506)
1902 (revision 21354811)
1918 (revision 21354828)
1919 (revision 21354829)
1938 (revision 21354849)
1945 (revision 21354857)
22 tháng 2 (revision 21376086)
26 tháng 11 (revision 22579845)
28 tháng 12 (revision 22475308)
A (revision 22549334)
ASCII (revision 22528409)
Alexandre de Rhodes (revision 22859954)
Antonio Barbosa (revision 22145269)
B (revision 22836557)
BBC (revision 22863903)
Biên khảo (revision 22531516)
Bán nguyên âm (revision 22655600)
Bình luận (revision 22117664)
Bảng chữ cái Bồ Đào Nha (revision 22887853)
Bảng chữ cái Hy Lạp (revision 21362081)
Bảng chữ cái Latinh (revision 22442448)
Bắc Kỳ (revision 22393289)
Bồ Đào Nha (revision 22620858)
C (revision 21341881)
Cao Xuân Dục (revision 22620201)
Chính tả (revision 22187359)
Chính tả tiếng Việt (revision 20897580)
Chữ Hán (revision 22889609)
Chữ Nôm (revision 22781506)
Chữ cái (revision 22169220)
Công giáo (revision 22173119)
D (revision 21447691)
== End of Parsed pages ==
- Wikipedia parsing ended at: 2016-02-13 02:16:03.731928
49 characters appeared 190798 times.
First 33 characters:
[ 0] Char n: 13.15212947724819 %
[ 1] Char h: 10.371702009455026 %
[ 2] Char t: 8.20134382959989 %
[ 3] Char c: 7.433516074591977 %
[ 4] Char i: 7.238545477415906 %
[ 5] Char g: 6.529418547364228 %
[ 6] Char a: 4.203922472981897 %
[ 7] Char u: 3.328127129215191 %
[ 8] Char m: 3.0540152412499086 %
[ 9] Char o: 3.037767691485236 %
[10] Char đ: 2.5948909317707733 %
[11] Char r: 2.4643864191448546 %
[12] Char à: 2.3878657008983324 %
[13] Char v: 2.269939936477322 %
[14] Char l: 2.2327278063711358 %
[15] Char á: 2.0482394993658217 %
[16] Char p: 1.9214037882996675 %
[17] Char b: 1.7998092223188922 %
[18] Char ư: 1.6813593433893437 %
[19] Char s: 1.6069350831769726 %
[20] Char y: 1.4952986928584158 %
[21] Char e: 1.4544177611924654 %
[22] Char d: 1.3139550729043281 %
[23] Char k: 1.2489648738456378 %
[24] Char â: 1.1278944223734 %
[25] Char ê: 0.977997672931582 %
[26] Char ô: 0.8260044654556128 %
[27] Char ó: 0.7091269300516777 %
[28] Char q: 0.60011111227581 %
[29] Char ơ: 0.4192916068302603 %
[30] Char í: 0.4166710342875712 %
[31] Char ă: 0.37998301868992335 %
[32] Char x: 0.34329500309227556 %
The first 33 characters have an accumulated ratio of 0.9887105734860954.
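
The accumulated ratio above can be reproduced from the listed percentages. This is a small check written for this log, not the script's own code; the variable names are illustrative. It assumes the ratio is simply the sum of the 33 per-character percentages divided by 100.

```python
# Percentages copied from the "First 33 characters" list above,
# in the same order ([ 0] Char n ... [32] Char x).
percentages = [
    13.15212947724819, 10.371702009455026, 8.20134382959989,
    7.433516074591977, 7.238545477415906, 6.529418547364228,
    4.203922472981897, 3.328127129215191, 3.0540152412499086,
    3.037767691485236, 2.5948909317707733, 2.4643864191448546,
    2.3878657008983324, 2.269939936477322, 2.2327278063711358,
    2.0482394993658217, 1.9214037882996675, 1.7998092223188922,
    1.6813593433893437, 1.6069350831769726, 1.4952986928584158,
    1.4544177611924654, 1.3139550729043281, 1.2489648738456378,
    1.1278944223734, 0.977997672931582, 0.8260044654556128,
    0.7091269300516777, 0.60011111227581, 0.4192916068302603,
    0.4166710342875712, 0.37998301868992335, 0.34329500309227556,
]

# Value reported by the log for the same 33 characters.
logged_ratio = 0.9887105734860954

# Sum the percentages and convert to a ratio in [0, 1].
accumulated = sum(percentages) / 100.0

print(len(percentages), accumulated)
print(abs(accumulated - logged_ratio) < 1e-9)
```

The remaining 16 of the 49 characters therefore account for roughly 1.1 % of the 190798 counted occurrences.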
852 sequences found.
First 512 (typical positive ratio): 0.990048941203513
Next 512 (512-1024): 1.0482290170756506e-05
Rest: -1.5612511283791264e-17
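
The negative "Rest" value is far smaller in magnitude than double-precision machine epsilon, so it is most plausibly floating-point rounding residue rather than a real negative share. The sketch below only checks that magnitude. It assumes the figure results from arithmetic on Python floats, which the log itself does not state.

```python
import sys

# Values copied from the log above.
first_512 = 0.990048941203513
next_512 = 1.0482290170756506e-05
rest = -1.5612511283791264e-17

# Machine epsilon for Python floats (IEEE 754 doubles), about 2.22e-16.
eps = sys.float_info.epsilon

# A magnitude below eps is indistinguishable from zero for values
# on the scale of a ratio in [0, 1].
rest_is_noise = abs(rest) < eps
print(rest_is_noise)

# The same kind of residue appears in ordinary float arithmetic:
# the exact result would be 0, but rounding leaves a tiny remainder.
residue = 0.1 + 0.2 - 0.3
print(residue)
```

Read this way, the first 512 sequences carry essentially all of the reported weight and the remainder can be treated as zero.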
- Processing end: 2016-02-13 02:16:03.877897