Du.t

From TORI
Revision as of 13:00, 22 February 2026 by T (talk | contribs) (add example and ref)

Du.t is a PHP routine that reveals the encoding of Utf8 character(s).
It is equivalent to "dump.t", but all the necessary functions are defined inside it and need not be included from external files.

Du.t happens to be useful for analyzing texts in Japanese.
One example is suggested in the next section.

The Unicode number of a single doubtful character can also be revealed with the Unicode Utilities [1].

Nichi

Here is an example of the confusion.

Nichi is a set of the following Unicode characters:

X2F47 [2], KanjiRadical

X2F48 [3], KanjiRadical

X65E5 [4], KanjiLiberal, CJK

X66F0 [5], KanjiLiberal, CJK
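The four code points listed above can be compared byte by byte. The following is only an illustrative sketch, not part of Du.t: the helper name utf8Hex is an assumption introduced here, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Illustrative sketch (not part of Du.t): show the Utf8 bytes
// that distinguish the four Nichi characters.

// Hypothetical helper: hexadecimal bytes of a string, separated by spaces.
function utf8Hex(string $ch): string {
  return implode(' ', str_split(bin2hex($ch), 2));
}

$nichi = [
  'X2F47' => "\u{2F47}",
  'X2F48' => "\u{2F48}",
  'X65E5' => "\u{65E5}",
  'X66F0' => "\u{66F0}",
];

foreach ($nichi as $name => $ch) {
  printf("%s  %s  %s\n", $name, $ch, utf8Hex($ch));
}
// The byte columns are:
// X2F47  e2 bd 87
// X2F48  e2 bd 88
// X65E5  e6 97 a5
// X66F0  e6 9b b0
```

The glyphs may look alike on the screen, but the byte sequences differ, and this is the information that Du.t reports.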


Even a native Japanese speaker looking at the characters ⽇, ⽈, 日, 曰 is unlikely to guess:
Which of them is X2F47?
Which of them is X2F48?
Which of them is X65E5?
Which of them is X66F0?

As of 2026, many Japanese glyphs have not yet been assigned a unique Unicode number.

So confusion arises.

Du.t helps to identify confusing characters from an input string.

Just load the code below and type:

php Du.t "any_string ⽇⽈日曰⿂魚"

The information about each character of the string is supposed to be revealed in the output.
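Besides the code point, it may help to know which Unicode block a character belongs to. The sketch below is an assumption-laden illustration, not part of Du.t: the function name blockOf is hypothetical, and only two blocks are checked, Kangxi Radicals (U+2F00 to U+2FDF) and CJK Unified Ideographs (U+4E00 to U+9FFF).

```php
<?php
// Illustrative sketch (not part of Du.t): name the Unicode block of one character.
// blockOf is a hypothetical helper; it knows only two blocks.
function blockOf(string $ch): string {
  if (preg_match('/^[\x{2F00}-\x{2FDF}]$/u', $ch)) return 'Kangxi Radicals';
  if (preg_match('/^[\x{4E00}-\x{9FFF}]$/u', $ch)) return 'CJK Unified Ideographs';
  return 'other';
}

// The four Nichi characters, written with escapes to avoid any ambiguity in the source.
$s = "\u{2F47}\u{2F48}\u{65E5}\u{66F0}";

// Split the string into Utf8 characters and report each of them.
foreach (preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
  printf("%s  %s  %s\n", $ch, bin2hex($ch), blockOf($ch));
}
```

With this input, the first two characters should be reported as Kangxi Radicals and the last two as CJK Unified Ideographs.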

Code

<?php
 function unichr($dec) {
  if ($dec < 128) {
    $utf = chr($dec);
  } else if ($dec < 2048) {
    $utf = chr(192 + (($dec - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  } else {
    $utf = chr(224 + (($dec - ($dec % 4096)) / 4096));
    $utf .= chr(128 + ((($dec % 4096) - ($dec % 64)) / 64));
    $utf .= chr(128 + ($dec % 64));
  }
  return $utf;
} 
// include "unichr.t";


 function uniord($a) 
 {
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;

#   if($M==1) return ord($a[0]);
#   if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
#   if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
 }
/*
Recovery of number of the Utf8 character encoded with 1,2 or 3 bytes.
Input: string, that consists of single utf8 character.
output: number of this character in the utf8 encoding table,
see [[Utf8table]] 
*/
//include "uniord.t";

function mb_str_split($str) {
  // split multibyte string in characters
  // Split at all positions, not after the start: ^
  // and not before the end: $
  $pattern = '/(?<!^)(?!$)/u';
  return preg_split($pattern,$str);
}
//include "mb_str_split.t";

// Du.t analyses the content of a string.
// The string is interpreted as a sequence of Utf8 characters.
// The functions unichr, uniord and mb_str_split are defined above,
// so the files unichr.t, uniord.t, mb_str_split.t need not be loaded.
// Usage:
// php Du.t "any абракадабра and だからも in any language(s)"

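// Stop with a usage hint if no string is given on the command line
if($argc<2){fwrite(STDERR,"Usage: php Du.t \"any_string\"\n"); exit(1);}
$a=$argv[1];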
echo "$a\n";
$N=strlen($a);
echo "The array has $N bytes; here is its splitting:\n";

for($n=0;$n<$N;$n++){printf("%02x ",ord($a[$n]) );}
echo "\n";
$b = mb_str_split($a);
var_dump($b);
$M=count($b);
for($m=0;$m<$M;$m++)
{
printf("\n");
$c=$b[$m];
$u=uniord($c);
printf("Unicode character number %05d id est, [[X%04X]]\n",$u,$u);
$d=strlen($c);
echo "Picture: $c ; uses $d bytes. These bytes are:\n";
for($n=0;$n<$d;$n++) printf("x%02X ",ord($c[$n]));
printf("in the hexadecimal representation and\n");
for($n=0;$n<$d;$n++) printf("%3d ",ord($c[$n]));
printf("in the decimal representation\n");
}
?> 
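As its comment states, uniord in the code above handles characters encoded with 1, 2 or 3 bytes. A possible extension to 4 bytes, written with bit masks instead of subtractions, is sketched below. The name uniord4 is an assumption introduced here and is not part of Du.t; the sketch does not validate that the bytes form a well-formed Utf8 sequence.

```php
<?php
// Hypothetical variant of uniord (not part of Du.t): recovers the Unicode
// number of a single Utf8 character encoded with 1, 2, 3 or 4 bytes.
// The lead and continuation bytes are not validated.
function uniord4(string $c): int {
  $n = strlen($c);
  $b = [];
  for ($i = 0; $i < $n; $i++) { $b[] = ord($c[$i]); }
  switch ($n) {
    case 1: return $b[0];
    case 2: return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
    case 3: return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
    case 4: return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                 | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
  }
  throw new InvalidArgumentException("expected one Utf8 character of 1 to 4 bytes");
}

printf("X%04X\n", uniord4("\u{65E5}"));   // X65E5
printf("X%04X\n", uniord4("\u{1F600}"));  // X1F600, a 4-byte character
```

For 1, 2 and 3 bytes this should give the same numbers as uniord; the masks simply strip the Utf8 marker bits before the remaining bits are joined.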

Sci-fi

In the sci-fi utopia Tartaria, the special default font Uniglif is mentioned.

In that font, each glyph is assigned a unique Unicode number.

The reality is not so advanced.

As of 2026, no analogue of Uniglif is available.

As of 2026, many Japanese Kanji have not yet been assigned a unique, bijective computer representation. This causes confusion.

In order to avoid the confusion, the special technical language Tarja is suggested as a modification of Japanese.

In Tarja, only those Kanji are allowed that already have a unique Unicode number.

Words with ambiguous glyphs are replaced with their Hiragana or Romaji transliterations, or with words borrowed from other languages, written in ASCII.

Text in Tarja does not need to be analyzed with the routine Du.t.

Warning

1. As of 2026, some PHP interpreters already have the routine mb_str_split defined as a built-in function. In such a case, the definition of mb_str_split in the code above should be suppressed with /* .. */.

2. The description above may require some correction(s) by a native Japanese speaker.
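Regarding the first warning: as an alternative to commenting the routine out by hand, the definition can be made conditional. This is only a sketch of one possible approach, assuming that the built-in mb_str_split (available since PHP 7.4 with the mbstring extension) splits a Utf8 string in the same way when called with a single argument.

```php
<?php
// Sketch: define the fallback only when the interpreter lacks the built-in.
if (!function_exists('mb_str_split')) {
  function mb_str_split($str) {
    // Same splitting pattern as in the code above:
    // split at all positions except the start and the end.
    $pattern = '/(?<!^)(?!$)/u';
    return preg_split($pattern, $str);
  }
}

$b = mb_str_split("\u{2F47}\u{2F48}\u{65E5}\u{66F0}");
echo count($b), " characters\n";  // 4 characters
```

With this guard, the same file should load both on interpreters that have the built-in function and on those that do not.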

References

  1. https://util.unicode.org/UnicodeJsps/character.jsp Unicode Utilities: Character Properties
  2. https://util.unicode.org/UnicodeJsps/character.jsp?a=2F47 2F47 KANGXI RADICAL SUN Han Script id: allowed confuse:
  3. https://util.unicode.org/UnicodeJsps/character.jsp?a=2F48 2F48 KANGXI RADICAL SAY Han Script id: allowed confuse:
  4. https://util.unicode.org/UnicodeJsps/character.jsp?a=65E5 65E5 CJK UNIFIED IDEOGRAPH-65E5 Han Script id: restricted confuse:
  5. https://util.unicode.org/UnicodeJsps/character.jsp?a=66F0 66F0 CJK UNIFIED IDEOGRAPH-66F0 Han Script id: restricted confuse:

Keywords

«Du.t», «dump.t», «KanjiLiberal», «KanjiRadical», «mb_str_split.t», «Nichi», «PHP», «Utf8», «UtfH», «unichr.t», «Unicode», «uniord.t»,