From TORI
Figure (Gulliver2.jpg): a big figure among the rocks

大 is the Unicode character number 22823 (hexadecimal 5927).

HTML input:
&#x5927;
&#22823;
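
As a minimal sketch, the hexadecimal and the decimal form of the numeric character reference can be checked against each other with PHP's standard html_entity_decode. The variable names are chosen only for illustration.

```php
<?php
// Decode the hexadecimal and the decimal numeric character references.
$hex = html_entity_decode('&#x5927;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
$dec = html_entity_decode('&#22823;', ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($hex === $dec);   // bool(true): both references name the same character
echo bin2hex($hex), "\n";  // e5a4a7: the UTF-8 bytes of the decoded character
```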

Wikipedia

In Wikipedia, the character 大 appears at line number 1312 of the table of recommended Japanese kanji [1] and denotes "large". The following readings are suggested:

ダイ、タイ、おお、おお-きい、おお-いに

dai, tai, oo, oo-kii, oo-ini

Encoding

In UTF-8, the character 大 is encoded with 3 bytes: 229 164 167 (hexadecimal E5 A4 A7).
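
The three bytes follow from the standard UTF-8 bit layout for code points from x0800 to xFFFF. The sketch below shows the arithmetic; the helper name utf8_three_bytes is hypothetical and is not one of the TORI functions.

```php
<?php
// Hypothetical helper: encode a code point from the range 0x0800..0xFFFF
// as the three bytes of its UTF-8 representation.
function utf8_three_bytes(int $cp): array
{
    return [
        0xE0 | ($cp >> 12),          // lead byte 1110xxxx: top 4 bits
        0x80 | (($cp >> 6) & 0x3F),  // continuation 10xxxxxx: middle 6 bits
        0x80 | ($cp & 0x3F),         // continuation 10xxxxxx: low 6 bits
    ];
}

$bytes = utf8_three_bytes(22823);
echo implode(' ', $bytes), "\n";     // 229 164 167
```

The same helper applied to 12068 (x2F24) gives 226 188 164.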

Confusion

The picture of character 大 may look similar to that of character ⼤
(&#x2F24;, &#12068;),
which may have similar pronunciation and meaning, "big".

Such similarity may lead to confusion. For databases,
大 (&#x5927;) and
⼤ (&#x2F24;)
are two different Unicode characters.
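
A minimal sketch of why a byte-wise comparison, as a database would typically perform, treats the two characters as different. The variable names are illustrative. The last step is an assumption: it runs only if the intl extension is installed, and it relies on Unicode compatibility normalization (form KC), which is expected to fold the radical into the ideograph.

```php
<?php
$kanji   = "\u{5927}";  // CJK ideograph, the subject of this page
$radical = "\u{2F24}";  // Kangxi radical with a similar picture

var_dump($kanji === $radical);                       // bool(false)
echo bin2hex($kanji), ' ', bin2hex($radical), "\n";  // e5a4a7 e2bca4

// Assumption: the intl extension is available; otherwise this step is skipped.
if (class_exists('Normalizer')) {
    $folded = Normalizer::normalize($radical, Normalizer::FORM_KC);
    var_dump($folded === $kanji);
}
```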

This confusion can be illustrated with the code below:

<?php
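// PHP 7.4 and later provide mb_str_split in the mbstring extension;
// define the fallback only if the built-in is absent,
// otherwise PHP stops with a "Cannot redeclare" error.
if (!function_exists('mb_str_split')) {
    function mb_str_split($str) {
       // split a multibyte (UTF-8) string into characters:
       // split at all positions, but not after the start: ^
       // and not before the end: $
       $pattern = '/(?<!^)(?!$)/u';
       return preg_split($pattern,$str);
    }
}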


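function uniord($a)
 {
   // Unicode number of a single UTF-8 character of 1, 2 or 3 bytes;
   // characters of 4 bytes (numbers above xFFFF) are not handled.
   // The constants 194 and 2050 fold in the removal of the UTF-8
   // marker bits of the first byte and of the continuation bytes.
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;
 }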


$a='⼤ 大'; # two different unicode characters separated with spacebar

$N=strlen($a);
echo "The array has $N bytes; here is its splitting:\n";

// print every byte of the string in the hexadecimal representation
for($n=0;$n<$N;$n++)
{
    printf("%02x ",ord($a[$n]) );
}
echo "\n";

$b = mb_str_split($a);

var_dump($b);
$M=count($b);

#mb_internal_encoding("UTF-8");

// for each character: its Unicode number, then its bytes
for($m=0;$m<$M;$m++)
{
    printf("\n");
    $c=$b[$m];
    $u=uniord($c);
    printf("Unicode character number %05d id est, x%04x\n",$u,$u);
    $d=strlen($c);
    echo "Picture: $c uses $d bytes. These bytes are:\n";
    for($n=0;$n<$d;$n++) printf("x%2x ",ord($c[$n]));
    printf("in the hexadecimal representation and\n");
    for($n=0;$n<$d;$n++) printf("%3d ",ord($c[$n]));
    printf("in the decimal representation\n");
}
?>

Functions mb_str_split.t and uniord.t are used in the code above.
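
Function uniord as listed returns after the third byte, so it covers the characters of this page but not characters encoded with 4 bytes. A possible generalization is sketched below; the name uniord_any is hypothetical, and the input is assumed to be one well-formed UTF-8 character (no validation is attempted).

```php
<?php
// Hypothetical variant of uniord: Unicode number of one well-formed
// UTF-8 character of 1 to 4 bytes.
function uniord_any(string $ch): int
{
    $len = strlen($ch);
    $b0  = ord($ch[0]);
    if ($len === 1) {
        return $b0;                               // 0xxxxxxx: plain ASCII
    }
    // The lead byte carries 5, 4 or 3 payload bits for 2, 3 or 4 bytes.
    $cp = $b0 & (0xFF >> ($len + 1));
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($ch[$i]) & 0x3F); // 6 bits per continuation byte
    }
    return $cp;
}

printf("%d %d\n", uniord_any("\u{2F24}"), uniord_any("\u{5927}")); // 12068 22823
```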

The output:

The array has 7 bytes; here is its splitting:
e2 bc a4 20 e5 a4 a7 
array(3) {
  [0]=>
  string(3) "⼤"
  [1]=>
  string(1) " "
  [2]=>
  string(3) "大"
}

Unicode character number 12068 id est, x2f24
Picture: ⼤ uses 3 bytes. These bytes are:
xe2 xbc xa4 in the hexadecimal representation and
226 188 164 in the decimal representation

Unicode character number 00032 id est, x0020
Picture:   uses 1 bytes. These bytes are:
x20 in the hexadecimal representation and
 32 in the decimal representation

Unicode character number 22823 id est, x5927
Picture: 大 uses 3 bytes. These bytes are:
xe5 xa4 xa7 in the hexadecimal representation and
229 164 167 in the decimal representation

References

https://en.wiktionary.org/wiki/%E5%A4%A7 https://en.wiktionary.org/wiki/大

Keywords

Japanese, Kanji, K1312, mb_str_split.t, PHP, SomeU, Unicode, uniord.t, Utf8, UtfH, Utf8table

⼤ (&#x2f24;), 大 (&#x5927;)