Unicode Script Ranges
The following table lists the Unicode script ranges that Knowledge Discovery identifies.
Script | Begin | End |
---|---|---|
Arabic | U+0600 | U+06FF |
BasicLatin | U+0000 | U+007F |
Bengali | U+0981 | U+09FB |
Burmese | U+1000 | U+109F |
CJKComp | U+3300 | U+33FF |
CJKComp | U+2F00 | U+2FDF |
CJKComp | U+FE30 | U+FE4F |
CJKCompIdeo | U+F900 | U+FAFF |
CJKCompIdeo | U+2F800 | U+2FA1F |
CJKRadicalsSup | U+2E80 | U+2EFF |
CJKRadicalsSup | U+3000 | U+303F |
CJKRadicalsSup | U+31C0 | U+31EF |
CJKUnifIdeo | U+4E00 | U+9FFF |
CJKUnifIdeo | U+20000 | U+2A6D6 |
CJKUnifIdeo | U+2A700 | U+2B73F |
CJKUnifIdeo | U+2B740 | U+2B81F |
CJKUnifIdeo | U+3200 | U+32FF |
CJKUnifIdeo | U+2FF0 | U+2FFF |
CJKUnifIdeoExtA | U+3400 | U+4DBF |
Cyrillic | U+0400 | U+04FF |
Cyrillic | U+0500 | U+052F |
Cyrillic | U+2DE0 | U+2DFF |
Cyrillic | U+A640 | U+A69F |
Devanagari | U+0901 | U+097F |
Ethiopic | U+1200 | U+1399 |
Georgian | U+10A0 | U+10FF |
GreekAndCoptic | U+0370 | U+03FF |
GreekAndCoptic | U+1F00 | U+1FFF |
Gujarati | U+0A81 | U+0AF1 |
Hangul | U+AC00 | U+D7A3 |
Hangul | U+1100 | U+11FF |
Hangul | U+3130 | U+318F |
Hangul | U+A960 | U+A97F |
Hangul | U+D7B0 | U+D7FF |
Hebrew | U+0590 | U+05FF |
Hiragana | U+3040 | U+309F |
Kannada | U+0C82 | U+0CF2 |
Katakana | U+30A0 | U+30FF |
Katakana | U+31F0 | U+31FF |
Lao | U+0E81 | U+0EDF |
Latin1Sup | U+0080 | U+00FF |
LatinExtA | U+0100 | U+017F |
LatinExtB | U+0180 | U+024F |
Malayalam | U+0D02 | U+0D7F |
Mongolian | U+1800 | U+18AA |
OrientalMisc | U+3105 | U+312C |
OrientalMisc | U+31A0 | U+31BF |
OrientalMisc | U+3190 | U+319F |
OrientalMisc | U+4DC0 | U+4DFF |
Oriya | U+0B01 | U+0B77 |
Sinhala | U+0D82 | U+0DF4 |
Tamil | U+0B82 | U+0BFA |
Telugu | U+0C01 | U+0C7F |
Thai | U+0E01 | U+0E5B |
Tibetan | U+0F00 | U+0FDA |
Vietnamese | U+1EA0 | U+1EF9 |
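
To illustrate how a table like this can be used, the following minimal Python sketch maps a character to a script name by checking its code point against each range. It is not part of Knowledge Discovery: `SCRIPT_RANGES` and `script_of` are hypothetical names, and only a few rows of the table are copied in. The remaining rows would be added the same way.

```python
# Hypothetical helper, not a Knowledge Discovery API.
# Each entry is (script name, first code point, last code point), copied
# from a subset of the rows in the table above.
SCRIPT_RANGES = [
    ("BasicLatin", 0x0000, 0x007F),
    ("Latin1Sup", 0x0080, 0x00FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Arabic", 0x0600, 0x06FF),
    ("Hiragana", 0x3040, 0x309F),
    ("Katakana", 0x30A0, 0x30FF),
    ("CJKUnifIdeo", 0x4E00, 0x9FFF),
    ("Hangul", 0xAC00, 0xD7A3),
    ("CJKUnifIdeo", 0x20000, 0x2A6D6),
]


def script_of(char):
    """Return the script name for a single character, or None if the
    character falls outside every range listed in SCRIPT_RANGES."""
    code_point = ord(char)
    for name, begin, end in SCRIPT_RANGES:
        # Both ends of each range are inclusive, as in the table.
        if begin <= code_point <= end:
            return name
    return None
```

With this subset, `script_of("A")` returns `"BasicLatin"` and `script_of("あ")` returns `"Hiragana"`. A character outside the listed ranges returns `None`.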
Chinese, Japanese, and Korean Scripts
When processing text, Knowledge Discovery identifies the script range that a character belongs to. In some cases, the script range can determine how that part of the text is processed. For example, when a language has NGramSentenceBrokenScriptOnly set to True in the configuration, the Content component only produces NGrams from words that consist entirely of characters that belong to one of the following Chinese, Japanese, and Korean script ranges:
- CJKUnifIdeo
- CJKUnifIdeoExtA
- CJKCompIdeo
- CJKComp
- CJKRadicalsSup
- Hiragana
- Katakana
- Hangul
- OrientalMisc
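
The "consists entirely of" condition can be sketched in Python as follows. This is an illustration only, not the Content component's implementation: `CJK_RANGES`, `is_cjk_char`, and `is_cjk_word` are hypothetical names, the ranges are copied from the table for the nine scripts listed above, and the sketch assumes the text has already been split into words.

```python
# Hypothetical sketch, not the Content component's actual logic.
# Inclusive (begin, end) code point ranges for the nine scripts listed above.
CJK_RANGES = [
    # CJKUnifIdeo
    (0x4E00, 0x9FFF), (0x20000, 0x2A6D6), (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F), (0x3200, 0x32FF), (0x2FF0, 0x2FFF),
    # CJKUnifIdeoExtA
    (0x3400, 0x4DBF),
    # CJKCompIdeo
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
    # CJKComp
    (0x3300, 0x33FF), (0x2F00, 0x2FDF), (0xFE30, 0xFE4F),
    # CJKRadicalsSup
    (0x2E80, 0x2EFF), (0x3000, 0x303F), (0x31C0, 0x31EF),
    # Hiragana
    (0x3040, 0x309F),
    # Katakana
    (0x30A0, 0x30FF), (0x31F0, 0x31FF),
    # Hangul
    (0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F),
    (0xA960, 0xA97F), (0xD7B0, 0xD7FF),
    # OrientalMisc
    (0x3105, 0x312C), (0x31A0, 0x31BF), (0x3190, 0x319F),
    (0x4DC0, 0x4DFF),
]


def is_cjk_char(char):
    """True if the character falls in any of the listed script ranges."""
    code_point = ord(char)
    return any(begin <= code_point <= end for begin, end in CJK_RANGES)


def is_cjk_word(word):
    """True only if the word is non-empty and every character is in one
    of the listed script ranges."""
    return bool(word) and all(is_cjk_char(char) for char in word)
```

Under these assumptions, `is_cjk_word("東京")` is true, while `is_cjk_word("東京2020")` is false because the digits belong to BasicLatin.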