Commits (1)

Author:  Jehan
SHA1:    310e750abd
Date:    2021-03-20 22:43:36 +01:00
Message: src: new nsCJKDetector, specifically for Chinese/Japanese/Korean recognition.

I was pondering improving the logic of the LanguageModel contents, in
order to better handle languages with a huge number of characters (far
too many to keep a full frequency list while maintaining reasonable
memory consumption and speed).
But then I realized that this happens precisely for languages which
have their own set of characters anyway.

For instance, modern Korean is nearly all hangul. Of course, we can find
some Chinese characters here and there, but nothing which should really
break confidence if we base it on the hangul ratio. If some day we want
to go further and detect older Korean, we will have to improve the logic
a bit with some statistics, though I wonder whether character frequency
alone might be enough here (sequence frequency is maybe overboard). To
be tested.
In any case, this new class gives a much more relevant confidence on
Korean texts than the statistical data we previously generated.
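
As a rough illustration of the idea (a minimal sketch, not the actual
nsCJKDetector code; the function name and structure are hypothetical),
a hangul-ratio confidence could look like this:

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Confidence in [0, 1] that a text is Korean, based purely on the
    // ratio of hangul among CJK-relevant code points. The input is
    // assumed to be already decoded into Unicode code points.
    static float HangulConfidence(const std::u32string &codepoints)
    {
        std::size_t hangul = 0;
        std::size_t cjk = 0;
        for (char32_t c : codepoints) {
            if (c >= 0xAC00 && c <= 0xD7A3) {        // Hangul syllables block
                ++hangul;
                ++cjk;
            } else if (c >= 0x4E00 && c <= 0x9FFF) { // CJK unified ideographs
                ++cjk;                               // (hanja mixed in is fine)
            }
        }
        if (cjk == 0)
            return 0.0f;
        // A few Chinese characters here and there barely lower the
        // ratio, so they do not break confidence.
        return static_cast<float>(hangul) / static_cast<float>(cjk);
    }

    int main()
    {
        // Mostly hangul with a single hanja: confidence stays high (6/7).
        std::u32string text = U"한국어 텍스트 漢";
        std::printf("Korean confidence: %f\n", HangulConfidence(text));
        return 0;
    }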

For Japanese, it is a mix of kana and Chinese characters. A modern full
text cannot exist without a lot of kana (probably only old texts, or
very short ones such as titles, could contain only Chinese characters).
We would still want to add a bit of statistics to correctly
differentiate a Japanese text with a lot of Chinese characters in it
from a Chinese text which quotes a few Japanese phrases. This will have
to be improved, but for now it works fairly well.
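
Along the same hypothetical lines (again a sketch, not the actual
code), a kana ratio can separate a kanji-heavy Japanese text from a
Chinese text quoting a few Japanese phrases:

    #include <cstddef>
    #include <string>

    // Confidence that a text is Japanese: a modern Japanese text always
    // contains plenty of kana, while a Chinese text quoting some
    // Japanese contains very few, so the kana ratio among CJK-relevant
    // code points is already a useful discriminator.
    static float KanaConfidence(const std::u32string &codepoints)
    {
        std::size_t kana = 0;
        std::size_t cjk = 0;
        for (char32_t c : codepoints) {
            if ((c >= 0x3040 && c <= 0x309F) ||      // Hiragana
                (c >= 0x30A0 && c <= 0x30FF)) {      // Katakana
                ++kana;
                ++cjk;
            } else if (c >= 0x4E00 && c <= 0x9FFF) { // Kanji (CJK ideographs)
                ++cjk;
            }
        }
        if (cjk == 0)
            return 0.0f;
        // Statistics (e.g. a threshold tuned on real corpora) would be
        // needed to refine borderline cases; the raw ratio is a start.
        return static_cast<float>(kana) / static_cast<float>(cjk);
    }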

A last case where we would want to play with statistics might be
differentiating between regional variants: for instance, Simplified
Chinese, Taiwan or Hong Kong Chinese… More to experiment with later on.
It's already a good first step for UTF-8 support with language
detection!