Difference between revisions of "Du.t"

Latest revision as of 11:38, 1 June 2021

Du.t is PHP routine that reveals encoding of the Utf8 characters. It is equivalent of "dump.t", but all functions necessary are defined inside and have no need to be included from the external files.

Code

<?php
 function unichr($dec) {
  if ($dec < 128) {
    $utf = chr($dec);
  } else if ($dec < 2048) {
    $utf = chr(192 + (($dec - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  } else {
    $utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
    $utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  }
  return $utf;
} 
// include "unichr.t";


 function uniord($a) 
 {
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;

#   if($M==1) return ord($a[0]);
#   if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
#   if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
 }
/*
Recovery of number of the Utf8 character encoded with 1,2 or 3 bytes.
Input: string, that consists of single utf8 character.
output: number of this character in the utf8 encoding table,
see [[Utf8table]] 
*/
//include "uniord.t";

function mb_str_split($str) {
  // split multibyte string in characters
  // Split at all positions, not after the start: ^
  // and not before the end: $
  $pattern = '/(?<!^)(?!$)/u';
  return preg_split($pattern,$str);
}
//include "mb_str_split.t";

//dump.t analyses the content of a sttring.
//The string is interpreted as sequense of Utf8 characters
// files unichr.t, uniord.t, mb_str_split.t
// should be loaded in the working directory.
// Usage:
// php dump.t "any абракадабра and だからも in any language(s)"

$a=$argv[1];
echo "$a\n";
$N=strlen($a);
echo "The array has $N bytes; here is its splitting:\n";

for($n=0;$n<$N;$n++){printf("%02x ",ord($a[$n]) );}
echo "\n";
$b = mb_str_split($a);
var_dump($b);
$M=count($b);
for($m=0;$m<$M;$m++)
{
printf("\n");
$c=$b[$m];
$u=uniord($c);
printf("Unicode character number %05d id est, [[X%04X]]\n",$u,$u);
$d=strlen($c);
echo "Picture: $c ; uses $d bytes. These bytes are:\n";
for($n=0;$n<$d;$n++) printf("x%2X ",ord($c[$n]));
printf("in the hexadecimal representation and\n");
for($n=0;$n<$d;$n++) printf("%3d ",ord($c[$n]));
printf("in the decimal representation\n");
}
?>

References

Keywords

dump.t, KanjiLiberal, KanjiRadical, mb_str_split.t, PHP, Utf8, UtfH, unichr.t, Unicode, uniord.t

@@ Line 90: / Line 90: @@
 [[KanjiLiberal]],
 [[KanjiRadical]],
-[[mb_split.t]],
+[[mb_str_split.t]],
 [[PHP]],
 [[Utf8]],
@@ Line 99: / Line 99: @@
 [[Category:dump.t]]
-[[Category:mb_split.t]]
+[[Category:mb_str_split.t]]
 [[Category:PHP]]
 [[Category:Utf8]]

Difference between revisions of "Du.t"

Latest revision as of 11:38, 1 June 2021

Code

References

Keywords

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

.

Tools