Unicode

From TORI
Revision as of 13:12, 7 June 2021 by T (talk | contribs) (→‎Keywords)

Unicode is a unified system for encoding the characters used in human and algorithmic languages.

Encoding

There are several ways of encoding the Unicode characters. On the Internet, the most popular is the Utf8 encoding. This encoding is used in TORI.

The first 128 characters of Unicode (numbers 0 to 127) coincide with the usual ASCII; in the Utf8 encoding, each of them occupies a single byte. The ASCII characters are recommended in all doubtful cases, as they are well represented in all encoding systems, and their graphical representation is almost the same in all software. The English abc is abc everywhere, even in Africa.
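
The single-byte property of the ASCII range can be checked directly. The following is a minimal sketch in Python (chosen here only for illustration; it is not part of du.t), and the helper name utf8_hex is hypothetical.

```python
def utf8_hex(text):
    """Return the UTF-8 bytes of text as space-separated hexadecimal pairs."""
    return text.encode("utf-8").hex(" ")

# Each of the 128 ASCII characters is one byte, equal to its number.
for number in range(128):
    assert chr(number).encode("utf-8") == bytes([number])

print(utf8_hex("abc"))      # 61 62 63
print(utf8_hex(chr(128)))   # c2 80 (the first character outside ASCII needs two bytes)
```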

The full table of the Unicode characters covers the tables of Katakana, Hiragana and Kanji, used in Japanese. For this reason, the Unicode characters are important for servers located in Japan.

The Utf8 encoding of a character (or of a string of characters) can be revealed with the PHP program du.t.

Example:

php du.t ±⼟⼠土士

The file du.t should be loaded in order to execute the command above. The output is:

±⼟⼠土士
The array has 14 bytes; here is its splitting:
c2 b1 e2 bc 9f e2 bc a0 e5 9c 9f e5 a3 ab
array(5) {
  [0]=>
  string(2) "±"
  [1]=>
  string(3) "⼟"
  [2]=>
  string(3) "⼠"
  [3]=>
  string(3) "土"
  [4]=>
  string(3) "士"
}

Unicode character number 00177 id est, X00B1
Picture: ± ; uses 2 bytes. These bytes are:
XC2 XB1 in the hexadecimal representation and
194 177 in the decimal representation

Unicode character number 12063 id est, X2F1F
Picture: ⼟ ; uses 3 bytes. These bytes are:
XE2 XBC X9F in the hexadecimal representation and
226 188 159 in the decimal representation

Unicode character number 12064 id est, X2F20
Picture: ⼠ ; uses 3 bytes. These bytes are:
XE2 XBC XA0 in the hexadecimal representation and
226 188 160 in the decimal representation

Unicode character number 22303 id est, X571F
Picture: 土 ; uses 3 bytes. These bytes are:
XE5 X9C X9F in the hexadecimal representation and
229 156 159 in the decimal representation

Unicode character number 22763 id est, X58EB
Picture: 士 ; uses 3 bytes. These bytes are:
XE5 XA3 XAB in the hexadecimal representation and
229 163 171 in the decimal representation
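
A similar breakdown can be obtained without du.t. The sketch below is written in Python and only imitates the layout of the output above; it is not the source of du.t, and the names describe and sample are hypothetical. The characters are written as escape sequences so that look-alike glyphs cannot be mixed up.

```python
def describe(text):
    """Return (character, number, UTF-8 bytes) for each character of text."""
    return [(ch, ord(ch), ch.encode("utf-8")) for ch in text]

# The five characters of the example: X00B1, X2F1F, X2F20, X571F, X58EB.
sample = "\u00b1\u2f1f\u2f20\u571f\u58eb"

raw_all = sample.encode("utf-8")
print(f"The array has {len(raw_all)} bytes; here is its splitting:")
print(raw_all.hex(" "))

for ch, number, raw in describe(sample):
    hex_bytes = " ".join(f"X{b:02X}" for b in raw)
    dec_bytes = " ".join(str(b) for b in raw)
    print(f"Unicode character number {number:05d} id est, X{number:04X}")
    print(f"Picture: {ch} ; uses {len(raw)} bytes: {hex_bytes} ({dec_bytes})")
```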


In cases of confusion, the Unicode characters should be specified with their ASCII representations, for example, in the hexadecimal form. In such a way,
X2F1F should be written instead of ⼟;
X2F25 should be written instead of ⼥, and so on.
Even native speakers of Japanese, looking at the characters ⼥ and 女, cannot guess which of them is X2F25 and which is X5973.
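
The conversion between a character and its hexadecimal ASCII form can be sketched as follows. This is an illustration in Python under the assumption that the notation is the letter X followed by 4 to 6 hexadecimal digits, as in the examples above; the function names to_x_notation and from_x_notation are hypothetical.

```python
import re

def to_x_notation(text):
    """Write each character of text as X followed by its hexadecimal number."""
    return " ".join(f"X{ord(ch):04X}" for ch in text)

def from_x_notation(code):
    """Convert a single code such as 'X2F25' back to the character."""
    match = re.fullmatch(r"X([0-9A-Fa-f]{4,6})", code)
    if match is None:
        raise ValueError(f"not an X-notation code: {code!r}")
    return chr(int(match.group(1), 16))

# The two look-alike characters get clearly different ASCII names.
print(to_x_notation("\u2f25\u5973"))              # X2F25 X5973
print(from_x_notation("X5973") == "\u5973")       # True
```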

Synonyms and confusions

Many Unicode symbols have established pictures. Often, the same or similar pictures correspond to different Unicode characters. For example, the Unicode characters X2F25, X5973, XF981 have pictures ⼥, 女, 女 that look very similar; they have similar meanings and may be considered synonyms.
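
One standard tool for detecting such synonyms is Unicode normalization; it is not mentioned in this article and is shown here only as a hedged illustration. In the Python sketch below, the compatibility form NFKC maps all three characters to the same one, X5973.

```python
import unicodedata

# X2F25, X5973, XF981, written as escapes to avoid mixing them up.
look_alikes = ["\u2f25", "\u5973", "\uf981"]

# NFKC replaces compatibility characters by their preferred equivalents.
folded = [unicodedata.normalize("NFKC", ch) for ch in look_alikes]

for ch, result in zip(look_alikes, folded):
    print(f"X{ord(ch):04X} -> X{ord(result):04X}")
```

If all the results coincide, the characters can be treated as synonyms for purposes such as searching.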

Also, the characters
± X00B1
⼟ X2F1F
⼠ X2F20
土 X571F
士 X58EB
(used in the example with du.t above) may look similar, although some of them have different meanings. This may cause confusion [1].
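
When the pictures cannot be distinguished by eye, the official character names can be. The following Python sketch uses the standard unicodedata module to list the names of the five characters; the variable names are hypothetical.

```python
import unicodedata

# X00B1, X2F1F, X2F20, X571F, X58EB
codes = [0x00B1, 0x2F1F, 0x2F20, 0x571F, 0x58EB]

names = {code: unicodedata.name(chr(code)) for code in codes}
for code, name in names.items():
    print(f"X{code:04X} {name}")
```

The names separate the mathematical sign, the two Kangxi radicals and the two unified ideographs, even where the glyphs look almost identical.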

Tables

Several tables of Unicode characters are available. At TORI, the following tables are loaded:
KanjiLiberal
KanjiRadical
SomeU
Utf8table

References

  1. https://util.unicode.org/UnicodeJsps/character.jsp?a=58EB Unicode Utilities: Character Properties. 58EB CJK UNIFIED IDEOGRAPH-58EB Han Script id: restricted confuse: ⼠ , 土 , ⼟

https://en.wikipedia.org/wiki/List_of_Unicode_characters

https://unicode-table.com/en

https://unicode-table.com/en/blocks/cjk-unified-ideographs-extension-a/ CJK Unified Ideographs Extension A Range: 3400—4DBF Quantity of characters: 6592

https://unicode-table.com/en/blocks/cjk-unified-ideographs/ CJK Unified Ideographs Range: 4E00—9FFF Quantity of characters: 20992

Keywords

du.t, Japanese, Kanji, KanjiLiberal, KanjiRadical, SomeU, Unicode, Utf8, Utf8table