Uni.t

From TORI

uni.t is a set of three routines:
unichr.t
uniord.t
mb_str_split.t

As of 2021, most software confuses some Unicode characters. All three routines above are needed to recognize and test such characters. To simplify handling, the routines are combined in the code below.

Code

<?php
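function unichr($dec) {
  // Encode the code point $dec as a UTF-8 string.
  // Covers sequences of 1 to 3 bytes only, that is, code points below 65536 (U+10000).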
  if ($dec < 128) {
    $utf = chr($dec);
  } else if ($dec < 2048) {
    $utf = chr(192 + (($dec - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  } else {
    $utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
    $utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  }
  return $utf;
}

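// PHP 7.4 and later already provide mb_str_split() in the mbstring extension.
// Define this fallback only when it is missing, to avoid a redeclaration error.
if (!function_exists('mb_str_split')) {
  function mb_str_split($str) {
     // split a multibyte string into characters:
     // match at all positions except the start: ^
     // and the end: $
     $pattern = '/(?<!^)(?!$)/u';
     return preg_split($pattern,$str);
  }
}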

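function uniord($a)
{
  // Return the code point of a single UTF-8 character of 1 to 3 bytes (below U+10000).
  $M=strlen($a);
  $p=ord($a[0]); if($M==1) return $p;
  // 194 = 192 + 2: removes the lead-byte offset 192; the extra 2*64 = 128
  // cancels the offset 128 of the second byte.
  $p-=194; $p*=64; $p+=ord($a[1]); if($M==2) return $p;
  // 2050 = 32*64 + 2: moves the lead-byte offset from 192 to 224
  // and cancels the offset 128 of the third byte.
  $p-=2050; $p*=64; $p+=ord($a[2]); return $p;
}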
?>
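
Example

The routines above cover UTF-8 sequences of up to 3 bytes. The following is a minimal sketch, not part of uni.t, of how the same idea might be extended to 4-byte sequences and used to tell look-alike characters apart. The names unichr_full, uniord_full and codepoints are hypothetical and chosen for illustration only; the sketch assumes PHP 7 or later and well-formed, non-empty UTF-8 input.

```php
<?php
// Hypothetical helper (not part of uni.t): encode a code point as UTF-8,
// including the 4-byte range U+10000..U+10FFFF.
function unichr_full(int $cp): string {
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: decode the first UTF-8 character of $ch.
// Assumption: $ch is non-empty and well-formed UTF-8; no validation is done.
function uniord_full(string $ch): int {
    $b0 = ord($ch[0]);
    if ($b0 < 0x80) {
        return $b0;                      // single byte (ASCII)
    }
    if ($b0 < 0xE0) {
        $n = 1; $cp = $b0 & 0x1F;        // 2-byte sequence
    } elseif ($b0 < 0xF0) {
        $n = 2; $cp = $b0 & 0x0F;        // 3-byte sequence
    } else {
        $n = 3; $cp = $b0 & 0x07;        // 4-byte sequence
    }
    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i <= $n; $i++) {
        $cp = ($cp << 6) | (ord($ch[$i]) & 0x3F);
    }
    return $cp;
}

// Hypothetical helper: list the code points of a UTF-8 string.
function codepoints(string $s): array {
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    return array_map('uniord_full', $chars);
}

// U+4E00 (CJK ideograph "one") and U+2F00 (Kangxi radical "one") look alike.
foreach (codepoints("\u{4E00}\u{2F00}") as $cp) {
    printf("U+%04X\n", $cp);
}
```

The loop prints U+4E00 and then U+2F00, which shows that two visually similar characters are in fact different code points. Bit shifts are used here instead of the division and remainder arithmetic of unichr; both compute the same bytes for code points below U+10000.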


Keywords

PHP, du.t, mb_str_split.t, unichr.t, uniord.t

Japanese, Kanji, KanjiLiberal, KanjiRadical, Unicode, Utf8