大
大 is Unicode character number 22823 (hexadecimal 5927).
HTML input:
大 (&#x5927;)
大 (&#22823;)
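The two numeric character references above differ only in the base used to write the character number. A minimal sketch of how they could be generated and decoded in PHP follows; the function name numericEntities is a hypothetical helper introduced for illustration, not part of the original article.

```php
<?php
// Hypothetical helper: build the hexadecimal and decimal HTML
// numeric character references for a given Unicode character number.
function numericEntities(int $codePoint): array {
    return [
        'hex' => sprintf('&#x%X;', $codePoint),
        'dec' => sprintf('&#%d;', $codePoint),
    ];
}

$e = numericEntities(22823);
echo $e['hex'], "\n"; // &#x5927;
echo $e['dec'], "\n"; // &#22823;

// Both references decode to the same UTF-8 character.
$fromHex = html_entity_decode($e['hex'], ENT_QUOTES | ENT_HTML5, 'UTF-8');
$fromDec = html_entity_decode($e['dec'], ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($fromHex === $fromDec); // bool(true)
```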
Wikipedia
In Wikipedia, the character 大 appears on line 1312 of the table of recommended Japanese kanji [1] and denotes "large". The following readings are listed:
ダイ、タイ、おお、おお-きい、おお-いに
dai, tai, oo, oo-kii, oo-ini
Encoding
In UTF-8, the character 大 is encoded with 3 bytes: 229 164 167 (hexadecimal e5 a4 a7).
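The three bytes follow from the standard UTF-8 bit layout for character numbers between 2048 and 65535: the number is cut into groups of 4, 6 and 6 bits, and each group receives a fixed prefix. The sketch below shows this arithmetic; the function name utf8ThreeBytes is a hypothetical helper, and surrogate values are not checked.

```php
<?php
// Hypothetical helper: encode a character number in the range
// 0x0800..0xFFFF as three UTF-8 bytes (surrogates are not checked).
function utf8ThreeBytes(int $cp): array {
    if ($cp < 0x0800 || $cp > 0xFFFF) {
        throw new InvalidArgumentException('outside the three-byte range');
    }
    return [
        0xE0 | ($cp >> 12),          // 1110xxxx: top 4 bits
        0x80 | (($cp >> 6) & 0x3F),  // 10xxxxxx: middle 6 bits
        0x80 | ($cp & 0x3F),         // 10xxxxxx: low 6 bits
    ];
}

$bytes = utf8ThreeBytes(22823);
echo implode(' ', $bytes), "\n"; // 229 164 167

// Joining the bytes gives back the character itself.
$char = implode('', array_map('chr', $bytes));
var_dump($char === "\u{5927}"); // bool(true)
```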
Confusion
The glyph of the character 大 may look similar to that of the character ⼤ (the Kangxi radical "big"):
⼤ (&#x2F24;)
⼤ (&#12068;)
which may have a similar pronunciation and meaning, "big".
Such similarity may lead to confusion.
For databases,
大 (&#x5927;) and
⼤ (&#x2F24;)
are two different Unicode characters.
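A short sketch of what this means for string comparison: the two characters never compare equal as byte strings, so a lookup keyed on one will miss the other unless the input is folded first. The one-entry folding table $fold is a hypothetical illustration, not a complete mapping of Kangxi radicals.

```php
<?php
$ideograph = "\u{5927}"; // 大, character number 22823
$radical   = "\u{2F24}"; // ⼤, character number 12068

// Byte-wise comparison sees two unrelated strings.
var_dump($ideograph === $radical); // bool(false)
echo bin2hex($ideograph), ' ', bin2hex($radical), "\n"; // e5a4a7 e2bca4

// Hypothetical folding table: map the radical to the ideograph
// before storing or comparing.
$fold = ["\u{2F24}" => "\u{5927}"];
$folded = strtr($radical, $fold);
var_dump($folded === $ideograph); // bool(true)
```

If the intl extension is installed, Normalizer::normalize($s, Normalizer::FORM_KC) should perform the same folding, since the radical has a compatibility decomposition to the ideograph; this is worth verifying on the target system before relying on it.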
This confusion can be illustrated with the code below:
<?php
// Split a multibyte (UTF-8) string into characters.
// PHP 7.4 and later provide mb_str_split() in the mbstring extension,
// so this fallback is defined only when the built-in is missing.
if (!function_exists('mb_str_split')) {
    function mb_str_split($str) {
        // Split at all positions, not after the start: ^
        // and not before the end: $
        $pattern = '/(?<!^)(?!$)/u';
        return preg_split($pattern, $str);
    }
}

// Return the Unicode number of a UTF-8 character of 1 to 3 bytes.
function uniord($a) {
    $M = strlen($a);
    $p = ord($a[0]);
    if ($M == 1) return $p;
    $p -= 194; $p *= 64; $p += ord($a[1]);
    if ($M == 2) return $p;
    $p -= 2050; $p *= 64; $p += ord($a[2]);
    return $p;
}

$a = '⼤ 大'; # two different unicode characters separated with spacebar
$N = strlen($a);
echo "The array has $N bytes; here is its splitting:\n";
for ($n = 0; $n < $N; $n++) { printf("%02x ", ord($a[$n])); }
echo "\n";
$b = mb_str_split($a);
var_dump($b);
$M = count($b);
#mb_internal_encoding("UTF-8");
for ($m = 0; $m < $M; $m++) {
    printf("\n");
    $c = $b[$m];
    $u = uniord($c);
    printf("Unicode character number %05d id est, x%04x\n", $u, $u);
    $d = strlen($c);
    echo "Picture: $c uses $d bytes. These bytes are:\n";
    for ($n = 0; $n < $d; $n++) printf("x%2x ", ord($c[$n]));
    printf("in the hexadecimal representation and\n");
    for ($n = 0; $n < $d; $n++) printf("%3d ", ord($c[$n]));
    printf("in the decimal representation\n");
}
?>
Functions mb_str_split.t and uniord.t are used in the code above.
The output:
The array has 7 bytes; here is its splitting:
e2 bc a4 20 e5 a4 a7
array(3) {
  [0]=>
  string(3) "⼤"
  [1]=>
  string(1) " "
  [2]=>
  string(3) "大"
}

Unicode character number 12068 id est, x2f24
Picture: ⼤ uses 3 bytes. These bytes are:
xe2 xbc xa4 in the hexadecimal representation and
226 188 164 in the decimal representation

Unicode character number 00032 id est, x0020
Picture:   uses 1 bytes. These bytes are:
x20 in the hexadecimal representation and
 32 in the decimal representation

Unicode character number 22823 id est, x5927
Picture: 大 uses 3 bytes. These bytes are:
xe5 xa4 xa7 in the hexadecimal representation and
229 164 167 in the decimal representation
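The function uniord in the listing handles characters of up to 3 bytes. As a possible alternative, the sketch below decodes with bit masks and also covers 4-byte sequences. The name utf8CodePoint is a hypothetical helper; it assumes a non-empty, well-formed UTF-8 string and does not validate continuation bytes.

```php
<?php
// Hypothetical helper: return the character number of the first
// UTF-8 sequence in $s (1 to 4 bytes). Input is assumed to be
// non-empty and well-formed; continuation bytes are not validated.
function utf8CodePoint(string $s): int {
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0;               // single byte: 0xxxxxxx
    }
    if ($b0 < 0xE0) {
        $len = 2; $cp = $b0 & 0x1F; // 110xxxxx
    } elseif ($b0 < 0xF0) {
        $len = 3; $cp = $b0 & 0x0F; // 1110xxxx
    } else {
        $len = 4; $cp = $b0 & 0x07; // 11110xxx
    }
    // Each continuation byte 10xxxxxx contributes 6 more bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($s[$i]) & 0x3F);
    }
    return $cp;
}

printf("%d %d %d\n",
    utf8CodePoint("\u{2F24}"),
    utf8CodePoint(' '),
    utf8CodePoint("\u{5927}")); // 12068 32 22823
```

For the three characters of the example string it gives the same numbers as the output above.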
References
https://en.wiktionary.org/wiki/%E5%A4%A7
Keywords
Japanese, Kanji, K1312, mb_str_split.t, PHP, SomeU, Unicode, uniord.t, Utf8, UtfH, Utf8table