Unicode Script Ranges

The following table lists the Unicode script ranges that Knowledge Discovery identifies. Some scripts span several ranges and therefore appear in more than one row.

Script	Begin	End
Arabic	U+0600	U+06FF
BasicLatin	U+0000	U+007F
Bengali	U+0981	U+09FB
Burmese	U+1000	U+109F
CJKComp	U+3300	U+33FF
CJKComp	U+2F00	U+2FDF
CJKComp	U+FE30	U+FE4F
CJKCompIdeo	U+F900	U+FAFF
CJKCompIdeo	U+2F800	U+2FA1F
CJKRadicalsSup	U+2E80	U+2EFF
CJKRadicalsSup	U+3000	U+303F
CJKRadicalsSup	U+31C0	U+31EF
CJKUnifIdeo	U+4E00	U+9FFF
CJKUnifIdeo	U+20000	U+2A6D6
CJKUnifIdeo	U+2A700	U+2B73F
CJKUnifIdeo	U+2B740	U+2B81F
CJKUnifIdeo	U+3200	U+32FF
CJKUnifIdeo	U+2FF0	U+2FFF
CJKUnifIdeoExtA	U+3400	U+4DBF
Cyrillic	U+0400	U+04FF
Cyrillic	U+0500	U+052F
Cyrillic	U+2DE0	U+2DFF
Cyrillic	U+A640	U+A69F
Devanagari	U+0901	U+097F
Ethiopic	U+1200	U+1399
Georgian	U+10A0	U+10FF
GreekAndCoptic	U+0370	U+03FF
GreekAndCoptic	U+1F00	U+1FFF
Gujarati	U+0A81	U+0AF1
Hangul	U+AC00	U+D7A3
Hangul	U+1100	U+11FF
Hangul	U+3130	U+318F
Hangul	U+A960	U+A97F
Hangul	U+D7B0	U+D7FF
Hebrew	U+0590	U+05FF
Hiragana	U+3040	U+309F
Kannada	U+0C82	U+0CF2
Katakana	U+30A0	U+30FF
Katakana	U+31F0	U+31FF
Lao	U+0E81	U+0EDF
Latin1Sup	U+0080	U+00FF
LatinExtA	U+0100	U+017F
LatinExtB	U+0180	U+024F
Malayalam	U+0D02	U+0D7F
Mongolian	U+1800	U+18AA
OrientalMisc	U+3105	U+312C
OrientalMisc	U+31A0	U+31BF
OrientalMisc	U+3190	U+319F
OrientalMisc	U+4DC0	U+4DFF
Oriya	U+0B01	U+0B77
Sinhala	U+0D82	U+0DF4
Tamil	U+0B82	U+0BFA
Telugu	U+0C01	U+0C7F
Thai	U+0E01	U+0E5B
Tibetan	U+0F00	U+0FDA
Vietnamese	U+1EA0	U+1EF9
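
To illustrate how a table like this can be used, the following Python sketch maps a character to a script name by looking up its code point. It is only an illustration, not how Knowledge Discovery is implemented: the names SCRIPT_RANGES and script_of are hypothetical, and the list holds only a subset of the rows above to keep the example short.

```python
from bisect import bisect_right

# Subset of the table above as (begin, end, script) tuples, sorted by begin.
# Illustrative only: add the remaining rows to cover the full table.
SCRIPT_RANGES = sorted([
    (0x0000, 0x007F, "BasicLatin"),
    (0x0080, 0x00FF, "Latin1Sup"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJKUnifIdeo"),
    (0xAC00, 0xD7A3, "Hangul"),
    (0x20000, 0x2A6D6, "CJKUnifIdeo"),
])
_BEGINS = [begin for begin, _, _ in SCRIPT_RANGES]


def script_of(char):
    """Return the script name for a single character, or None if no listed range contains it."""
    code_point = ord(char)
    # Find the last range whose begin is <= code_point, then check its end.
    index = bisect_right(_BEGINS, code_point) - 1
    if index < 0:
        return None
    begin, end, name = SCRIPT_RANGES[index]
    return name if begin <= code_point <= end else None
```

Because the ranges in the table do not overlap, a binary search on the begin values followed by a single end check is sufficient. A character outside every listed range returns None.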

Chinese, Japanese, and Korean Scripts

When it processes text, Knowledge Discovery identifies the script range that each character belongs to. In some cases, the script range determines how that part of the text is processed. For example, when a language has NGramSentenceBrokenScriptOnly set to True in the configuration, the Content component produces NGrams only from words that consist entirely of characters in the following Chinese, Japanese, and Korean script ranges:

CJKUnifIdeo
CJKUnifIdeoExtA
CJKCompIdeo
CJKComp
CJKRadicalsSup
Hiragana
Katakana
Hangul
OrientalMisc
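
The condition described above (a word consists entirely of characters from these script ranges) can be sketched in Python as follows. This is a minimal illustration that assumes the ranges from the table in this section. The names CJK_RANGES, is_cjk_char, and is_entirely_cjk are hypothetical, and the sketch does not reproduce how the Content component tokenizes text or generates NGrams.

```python
# Ranges from the table above for the nine listed CJK script names.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x20000, 0x2A6D6), (0x2A700, 0x2B73F),  # CJKUnifIdeo
    (0x2B740, 0x2B81F), (0x3200, 0x32FF), (0x2FF0, 0x2FFF),    # CJKUnifIdeo
    (0x3400, 0x4DBF),                                          # CJKUnifIdeoExtA
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),                      # CJKCompIdeo
    (0x3300, 0x33FF), (0x2F00, 0x2FDF), (0xFE30, 0xFE4F),      # CJKComp
    (0x2E80, 0x2EFF), (0x3000, 0x303F), (0x31C0, 0x31EF),      # CJKRadicalsSup
    (0x3040, 0x309F),                                          # Hiragana
    (0x30A0, 0x30FF), (0x31F0, 0x31FF),                        # Katakana
    (0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F),      # Hangul
    (0xA960, 0xA97F), (0xD7B0, 0xD7FF),                        # Hangul
    (0x3105, 0x312C), (0x31A0, 0x31BF), (0x3190, 0x319F),      # OrientalMisc
    (0x4DC0, 0x4DFF),                                          # OrientalMisc
]


def is_cjk_char(char):
    """Return True if the character falls in one of the listed CJK ranges."""
    code_point = ord(char)
    return any(begin <= code_point <= end for begin, end in CJK_RANGES)


def is_entirely_cjk(word):
    """Return True if the word is non-empty and every character is in a CJK range."""
    return bool(word) and all(is_cjk_char(char) for char in word)
```

Under this check, a word such as 東京 qualifies, whereas a mixed word such as 東京2020 does not, because the digits belong to the BasicLatin range.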