From TORI
Revision as of 18:08, 21 May 2021 by T
Gulliver2.jpg
A figure among the rocks (岩石之间的个子)

大 is the Unicode character number 22823 (hexadecimal 5927).

HTML input:
&#x5927; (hexadecimal)
&#22823; (decimal)
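
The equivalence of the hexadecimal and decimal entity forms can be checked with a short PHP sketch. It is not part of the original page; it assumes PHP 7 or later (for the "\u{...}" string escape) and uses only core functions.

```php
<?php
// Decode both numeric entities to UTF-8 strings (core PHP, no extensions).
$flags   = ENT_QUOTES | ENT_HTML5;
$fromHex = html_entity_decode('&#x5927;', $flags, 'UTF-8');
$fromDec = html_entity_decode('&#22823;', $flags, 'UTF-8');

// The hexadecimal and decimal forms name the same code point.
var_dump(hexdec('5927') === 22823);   // bool(true)
var_dump($fromHex === $fromDec);      // bool(true)
var_dump($fromHex === "\u{5927}");    // bool(true)

// The decoded character occupies three bytes in UTF-8.
echo bin2hex($fromHex), "\n";         // e5a4a7
```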

Wikipedia

At Wikipedia, the character 大 appears in line number 1312 of the table of recommended Japanese kanji [1] and denotes "large". The following readings are suggested:

ダイ、タイ、おお、おお-きい、おお-いに

dai, tai, oo, oo-kii, oo-ini

Encoding

In UTF-8, the character is encoded with 3 bytes: 229 164 167 (hexadecimal E5 A4 A7).
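
The three byte values follow from the standard UTF-8 bit layout. The sketch below is an illustration under assumptions, not code from the original page: the helper name utf8_bytes is hypothetical, and it performs no validation (for example, surrogate code points are not rejected).

```php
<?php
// Hypothetical helper: UTF-8 byte values of a Unicode code point.
// No validation is done; invalid code points are encoded blindly.
function utf8_bytes(int $cp): array
{
    if ($cp < 0x80) {            // 1 byte: 0xxxxxxx
        return [$cp];
    }
    if ($cp < 0x800) {           // 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)];
    }
    if ($cp < 0x10000) {         // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [
            0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F),
        ];
    }
    return [                     // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F),
    ];
}

echo implode(' ', utf8_bytes(22823)), "\n";   // 229 164 167
echo implode(' ', utf8_bytes(12068)), "\n";   // 226 188 164
echo rawurlencode("\u{5927}"), "\n";          // %E5%A4%A7
```

The last line shows that the percent-encoded form of the character in a URL consists of the same three bytes.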

Confusion

The glyph of the character 大 (&#x5927;) may look similar to that of the character ⼤ (&#x2F24;, &#12068;), the Kangxi radical, which may have a similar pronunciation and meaning, "big".

Such a similarity may lead to confusion. For databases, 大 (&#x5927;) and ⼤ (&#x2F24;) are two different Unicode characters.

This confusion can be illustrated with the code below:

<?php
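// PHP 7.4+ with mbstring already provides mb_str_split();
// define this fallback only when the built-in is missing.
if (!function_exists('mb_str_split')) {
   function mb_str_split($str) {
      // split multibyte string in characters
      // Split at all positions, not after the start: ^
      // and not before the end: $
      $pattern = '/(?<!^)(?!$)/u';
      return preg_split($pattern,$str);
   }
}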


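function uniord($a)
 {
   // Unicode code point of one UTF-8 character of 1, 2 or 3 bytes
   // (4-byte sequences are not handled).
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;
 }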


$a='⼤ 大'; # two different Unicode characters separated by a space

$N=strlen($a);
echo "The array has $N bytes; here is its splitting:\n";

for($n=0;$n<$N;$n++)
{
   printf("%02x ",ord($a[$n]) );
}
echo "\n";

$b = mb_str_split($a);

var_dump($b);
$M=count($b);

#mb_internal_encoding("UTF-8");

for($m=0;$m<$M;$m++)
{
   printf("\n");
   $c=$b[$m];
   $u=uniord($c);
   printf("Unicode character number %05d id est, x%04x\n",$u,$u);
   $d=strlen($c);
   echo "Picture: $c uses $d bytes. These bytes are:\n";
   for($n=0;$n<$d;$n++) printf("x%2x ",ord($c[$n]));
   printf("in the hexadecimal representation and\n");
   for($n=0;$n<$d;$n++) printf("%3d ",ord($c[$n]));
   printf("in the decimal representation\n");
}
?>

Functions mb_str_split.t and uniord.t are used in the code above.

The output:

The array has 7 bytes; here is its splitting:
e2 bc a4 20 e5 a4 a7 
array(3) {
  [0]=>
  string(3) "⼤"
  [1]=>
  string(1) " "
  [2]=>
  string(3) "大"
}

Unicode character number 12068 id est, x2f24
Picture: ⼤ uses 3 bytes. These bytes are:
xe2 xbc xa4 in the hexadecimal representation and
226 188 164 in the decimal representation

Unicode character number 00032 id est, x0020
Picture:   uses 1 bytes. These bytes are:
x20 in the hexadecimal representation and
 32 in the decimal representation

Unicode character number 22823 id est, x5927
Picture: 大 uses 3 bytes. These bytes are:
xe5 xa4 xa7 in the hexadecimal representation and
229 164 167 in the decimal representation
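
Since the two characters are stored as different byte sequences, a plain string comparison treats them as unequal. One possible way to avoid this in a lookup key is to fold the look-alike into the ordinary ideograph before comparing. The sketch below is a minimal illustration, not part of the original page: the constant LOOKALIKE_FOLD and the function fold_lookalikes are hypothetical names, and the table lists only the single pair discussed here.

```php
<?php
// Hypothetical folding table: Kangxi radical U+2F24 -> ideograph U+5927.
// Only this one pair is listed; a real table would need more entries.
const LOOKALIKE_FOLD = ["\u{2F24}" => "\u{5927}"];

// Hypothetical helper: replace known look-alikes before comparing strings.
function fold_lookalikes(string $s): string
{
    return strtr($s, LOOKALIKE_FOLD);
}

$radical   = "\u{2F24}";   // ⼤, bytes e2 bc a4
$ideograph = "\u{5927}";   // 大, bytes e5 a4 a7

var_dump($radical === $ideograph);                    // bool(false)
var_dump(fold_lookalikes($radical) === $ideograph);   // bool(true)
echo bin2hex(fold_lookalikes("$radical $ideograph")), "\n";   // e5a4a720e5a4a7
```

Where the intl extension is available, Normalizer::normalize($s, Normalizer::FORM_KC) should give the same result for this pair, because U+2F24 has a compatibility mapping to U+5927; that route is an alternative to the hand-written table and is not tested here.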

References

https://en.wiktionary.org/wiki/%E5%A4%A7 https://en.wiktionary.org/wiki/大

Keywords

Japanese, Kanji, K1312, mb_str_split.t, PHP, SomeU, Unicode, uniord.t, Utf8, UtfH, Utf8table

⼤ (&#x2f24;), 大 (&#x5927;)