Uniord.t

From TORI

uniord.t is a PHP routine that returns the number (Unicode code point) of a single UTF-8 character, as listed in the Utf8table. It handles characters encoded with 1, 2 or 3 bytes.

Code

<?php 
 function uniord($a) 
 {
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;

#   if($M==1) return ord($a[0]);
#   if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
#   if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
 }
/*
Recovers the number of a UTF-8 character encoded with 1, 2 or 3 bytes.
Input: a string that consists of a single UTF-8 character.
Output: the number (code point) of this character in the UTF-8 encoding table,
see [[Utf8table]]
*/
?>
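
The constants 194 and 2050 fold the UTF-8 marker bits into the arithmetic. With bytes b0, b1, b2, a 2-byte character encodes 64*(b0-192)+(b1-128), which equals 64*(b0-194)+b1; a 3-byte character encodes 4096*(b0-224)+64*(b1-128)+(b2-128), which equals 64*(64*(b0-194)+b1-2050)+b2. The following is a minimal sketch, assuming the input is a valid UTF-8 character of 1 to 3 bytes and that the source file is saved as UTF-8; it writes the same computation with explicit bit masks. The name uniord_masks is hypothetical and is not part of uniord.t.

```php
<?php
// Hypothetical helper, not part of uniord.t: same result as uniord(),
// but with the UTF-8 marker bits masked off explicitly.
function uniord_masks(string $a): int
{
    switch (strlen($a)) {
        case 1:  // 0xxxxxxx
            return ord($a[0]);
        case 2:  // 110xxxxx 10xxxxxx
            return ((ord($a[0]) & 0x1F) << 6) | (ord($a[1]) & 0x3F);
        case 3:  // 1110xxxx 10xxxxxx 10xxxxxx
            return ((ord($a[0]) & 0x0F) << 12)
                 | ((ord($a[1]) & 0x3F) << 6)
                 |  (ord($a[2]) & 0x3F);
    }
    // Empty strings and 4-byte characters are outside the scope of this sketch.
    throw new InvalidArgumentException('expected a UTF-8 character of 1 to 3 bytes');
}

echo uniord_masks('<'), " ", uniord_masks('く'), " ", uniord_masks('〈'), "\n";
// prints: 60 12367 12296
```

The subtract-and-multiply form used in uniord.t avoids bit operations; the masked form makes the byte layout easier to read. Both give the same numbers for valid input.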

Example

<?php 
 function uniord($a) 
 {
   $M=strlen($a);
   $p=ord($a[0]);                    if($M==1) return $p;
   $p-=194;  $p*=64; $p+=ord($a[1]); if($M==2) return $p;
   $p-=2050; $p*=64; $p+=ord($a[2]);           return $p;

#   if($M==1) return ord($a[0]);
#   if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
#   if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
 }
/*
Recovers the number of a UTF-8 character encoded with 1, 2 or 3 bytes.
Input: a string that consists of a single UTF-8 character.
Output: the number (code point) of this character in the UTF-8 encoding table.
*/

echo uniord('<')," ", uniord('く'), " ", uniord('〈'),"\n";
echo uniord('〈')," ", uniord('ㄍ'), " ", uniord('巛'),"\n";
echo uniord('⽊')," ", uniord('林'), " ", uniord('森'),"\n";
echo uniord('女')," ", uniord('奻')," ", uniord('姦'),"\n";
echo uniord('ロ')," ", uniord('日')," ", uniord('目'),"\n";
?>

Output:

60 12367 12296
12296 12557 24027
12106 26519 26862
22899 22907 23014
12525 26085 30446

These numbers can be checked against the table Utf8table, or by converting a number back to its character with unichr.t:

<?php include "unichr.t";
echo unichr(98), unichr(12450);
?>

Output:

bア
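
The check can also be automated. The sketch below assumes the mbstring extension (PHP 7.2 or later) is installed, so that mb_ord is available as an independent reference; where it is missing, the values are printed unchecked. The function uniord is copied from the listing above so that the script runs on its own; the sample list and variable names are illustrative only.

```php
<?php
// uniord() as defined on this page (valid for characters of 1 to 3 bytes).
function uniord($a)
{
    $M = strlen($a);
    $p = ord($a[0]);                        if ($M == 1) return $p;
    $p -= 194;  $p *= 64; $p += ord($a[1]); if ($M == 2) return $p;
    $p -= 2050; $p *= 64; $p += ord($a[2]);              return $p;
}

$samples    = ['<', 'く', '〈', 'ㄍ', '巛', '林', '森', '女', '日', '目'];
$haveMb     = function_exists('mb_ord');   // assumption: mbstring may be absent
$mismatches = 0;

foreach ($samples as $ch) {
    $mine = uniord($ch);
    if (!$haveMb) {
        echo $ch, " ", $mine, " (unchecked)\n";
        continue;
    }
    // Compare against the code point reported by mbstring.
    $ok = ($mine === mb_ord($ch, 'UTF-8'));
    if (!$ok) $mismatches++;
    echo $ch, " ", $mine, $ok ? " ok" : " MISMATCH", "\n";
}
echo "mismatches: ", $mismatches, "\n";
```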

Analogies

A similar function appears in a user-contributed note (2013) to the PHP manual page for ord(), https://www.php.net/manual/en/function.ord.php#42778:

As ord() doesn't work with UTF-8, and if you do not have access to mb_* functions, the following function will work well:

<?php
function ordutf8($string, &$offset) {
    $code = ord(substr($string, $offset,1));
    if ($code >= 128) {        //otherwise 0xxxxxxx
        if ($code < 224) $bytesnumber = 2;                //110xxxxx
        else if ($code < 240) $bytesnumber = 3;        //1110xxxx
        else if ($code < 248) $bytesnumber = 4;    //11110xxx
        $codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
        for ($i = 2; $i <= $bytesnumber; $i++) {
            $offset ++;
            $code2 = ord(substr($string, $offset, 1)) - 128;        //10xxxxxx
            $codetemp = $codetemp*64 + $code2;
        }
        $code = $codetemp;
    }
    $offset += 1;
    if ($offset >= strlen($string)) $offset = -1;
    return $code;
}
?>
$offset is passed by reference, because it is not easy to split a UTF-8 string character by character. This is useful for iterating over a string:
<?php
$text = "abcàê߀abc";
$offset = 0;
while ($offset >= 0) {
    echo $offset.": ".ordutf8($text, $offset)."\n";
}
/* returns:
0: 97
1: 98
2: 99
3: 224
5: 234
7: 223
9: 8364
12: 97
13: 98
14: 99
*/
?>

Feel free to adapt my code to fit your needs.
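
Where PCRE with UTF-8 support is available, the splitting step can be delegated to preg_split, and uniord applied to each piece. This is a minimal sketch, assuming valid UTF-8 input whose characters take at most 3 bytes (uniord does not cover 4-byte characters); the helper name uniord_all is hypothetical.

```php
<?php
// uniord() as defined on this page (valid for characters of 1 to 3 bytes).
function uniord($a)
{
    $M = strlen($a);
    $p = ord($a[0]);                        if ($M == 1) return $p;
    $p -= 194;  $p *= 64; $p += ord($a[1]); if ($M == 2) return $p;
    $p -= 2050; $p *= 64; $p += ord($a[2]);              return $p;
}

// Hypothetical helper: numbers of all characters in $text, in order.
function uniord_all(string $text): array
{
    // With the /u modifier PCRE treats $text as UTF-8, so the empty pattern
    // splits between characters rather than between bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    return array_map('uniord', $chars);
}

// Same test string as above, "abcàê߀abc", written with escapes
// so that the result does not depend on the editor's normalization.
$text = "abc\u{E0}\u{EA}\u{DF}\u{20AC}abc";
echo implode(' ', uniord_all($text)), "\n";
// prints: 97 98 99 224 234 223 8364 97 98 99
```

Unlike ordutf8, this version keeps no byte offset; it trades that for a shorter loop.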

References


https://www.php.net/manual/en/intlchar.ord.php IntlChar::ord: Return Unicode code point value of character

Keywords

Japanese, Kanji, mb_str_split.t, PHP, SomeUtf8, SomeUtfH, uniord.t, Unicode, unichr.t, Utf8, UtfH, Utf8table