Uniord.t
uniord.t is a PHP routine that returns the number (the Unicode code point) of a single UTF-8 character, that is, its position in the Utf8table.
Code
<?php
function uniord($a) {
    $M = strlen($a);
    $p = ord($a[0]);
    if ($M == 1) return $p;
    $p -= 194; $p *= 64; $p += ord($a[1]);
    if ($M == 2) return $p;
    $p -= 2050; $p *= 64; $p += ord($a[2]);
    return $p;
    # Equivalent closed forms:
    # if($M==1) return ord($a[0]);
    # if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
    # if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
}
/* Recovers the number of a UTF-8 character encoded with 1, 2 or 3 bytes.
   Input: a string that consists of a single UTF-8 character.
   Output: the number of this character in the UTF-8 encoding table, see [[Utf8table]] */
?>
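The constants 194 and 2050 in the code above fold the UTF-8 marker bits into the arithmetic. For two bytes, the number is 64*(b0-192) + (b1-128), which equals 64*(b0-194) + b1 because 2*64 = 128. For three bytes, the lead byte marker is 224 instead of 192, a difference of 32*64 = 2048 at that stage, plus another 2 to cancel the 128 of the last continuation byte, giving 2050. As a minimal sketch, the same decoding can be written with explicit bit masks; the name uniord_masks is hypothetical and is not part of uniord.t.

```php
<?php
// Hypothetical helper (not part of uniord.t): decode one UTF-8 character
// of 1 to 3 bytes by masking off the marker bits of each byte.
function uniord_masks(string $a): int
{
    $n  = strlen($a);
    $b0 = ord($a[0]);
    if ($n == 1) {
        return $b0;                                  // 0xxxxxxx
    }
    $b1 = ord($a[1]) & 0x3F;                         // 10xxxxxx: keep low 6 bits
    if ($n == 2) {
        return (($b0 & 0x1F) << 6) | $b1;            // 110xxxxx: keep low 5 bits
    }
    $b2 = ord($a[2]) & 0x3F;                         // 10xxxxxx: keep low 6 bits
    return (($b0 & 0x0F) << 12) | ($b1 << 6) | $b2;  // 1110xxxx: keep low 4 bits
}

echo uniord_masks('<'), " ", uniord_masks('く'), " ", uniord_masks('〈'), "\n";
// prints: 60 12367 12296
```

Like uniord, this sketch assumes well-formed input of at most three bytes and does no validation.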
Example
<?php
function uniord($a) {
    $M = strlen($a);
    $p = ord($a[0]);
    if ($M == 1) return $p;
    $p -= 194; $p *= 64; $p += ord($a[1]);
    if ($M == 2) return $p;
    $p -= 2050; $p *= 64; $p += ord($a[2]);
    return $p;
    # Equivalent closed forms:
    # if($M==1) return ord($a[0]);
    # if($M==2) return 64*(ord($a[0])-194)+ord($a[1]);
    # if($M==3) return 64*( 64*(ord($a[0])-194)+ord($a[1]))-131200+ord($a[2]);
}
/* Recovers the number of a UTF-8 character encoded with 1, 2 or 3 bytes.
   Input: a string that consists of a single UTF-8 character.
   Output: the number of this character in the UTF-8 encoding table. */
echo uniord('<')," ", uniord('く'), " ", uniord('〈'),"\n";
echo uniord('〈')," ", uniord('ㄍ'), " ", uniord('巛'),"\n";
echo uniord('⽊')," ", uniord('林'), " ", uniord('森'),"\n";
echo uniord('女')," ", uniord('奻')," ", uniord('姦'),"\n";
echo uniord('ロ')," ", uniord('日')," ", uniord('目'),"\n";
?>
Output:
60 12367 12296
12296 12557 24027
12106 26519 26862
22899 22907 23014
12525 26085 30446
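These values can be checked against the table Utf8table. The following code uses unichr.t to convert numbers back to characters: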
<?php include "unichr.t"; echo unichr(98), unichr(12450); ?>
Output:
bア
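The source of unichr.t is not reproduced on this page. As a hedged sketch of what such an inverse could look like, the function below encodes a number below 65536 as 1 to 3 UTF-8 bytes; the name unichr_sketch is hypothetical, and the actual unichr.t may be written differently.

```php
<?php
// Hypothetical stand-in for unichr.t (whose source is not shown here):
// encode a number below 65536 as a UTF-8 string of 1 to 3 bytes.
// No check is made for surrogates or out-of-range input.
function unichr_sketch(int $n): string
{
    if ($n < 0x80) {
        return chr($n);                          // 0xxxxxxx
    }
    if ($n < 0x800) {
        return chr(0xC0 | ($n >> 6))             // 110xxxxx
             . chr(0x80 | ($n & 0x3F));          // 10xxxxxx
    }
    return chr(0xE0 | ($n >> 12))                // 1110xxxx
         . chr(0x80 | (($n >> 6) & 0x3F))        // 10xxxxxx
         . chr(0x80 | ($n & 0x3F));              // 10xxxxxx
}

echo unichr_sketch(98), unichr_sketch(12450), "\n";
// prints: bア
```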
Analogies
https://www.php.net/manual/en/function.ord.php#42778
A user-contributed note (2013) on the PHP manual page for ord():
As ord() doesn't work with utf-8, and if you do not have access to mb_* functions, the following function will work well:
<?php
function ordutf8($string, &$offset) {
    $code = ord(substr($string, $offset,1));
    if ($code >= 128) {        //otherwise 0xxxxxxx
        if ($code < 224) $bytesnumber = 2;                //110xxxxx
        else if ($code < 240) $bytesnumber = 3;        //1110xxxx
        else if ($code < 248) $bytesnumber = 4;    //11110xxx
        $codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
        for ($i = 2; $i <= $bytesnumber; $i++) {
            $offset ++;
            $code2 = ord(substr($string, $offset, 1)) - 128;        //10xxxxxx
            $codetemp = $codetemp*64 + $code2;
        }
        $code = $codetemp;
    }
    $offset += 1;
    if ($offset >= strlen($string)) $offset = -1;
    return $code;
}
?>
$offset is a reference, as it is not easy to split a utf-8 char-by-char. Useful to iterate on a string:
<?php
$text = "abcàê߀abc";
$offset = 0;
while ($offset >= 0) {
    echo $offset.": ".ordutf8($text, $offset)."\n";
}
/* returns:
0: 97
1: 98
2: 99
3: 224
5: 234
7: 223
9: 8364
12: 97
13: 98
14: 99
*/
?>
Feel free to adapt my code to fit your needs.
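Where the mb_* functions or the intl extension are available, PHP's built-ins can serve as a cross-check for uniord. The sketch below is an illustration under assumptions: the wrapper name builtin_ord is hypothetical, mb_ord() requires the mbstring extension (PHP 7.2 or later), and IntlChar::ord() requires the intl extension. Unlike uniord, both built-ins also handle characters encoded with 4 bytes.

```php
<?php
// Sketch only: return the code point of $ch using whichever built-in
// is available, or null if neither extension is loaded.
// The wrapper name builtin_ord is hypothetical.
function builtin_ord(string $ch): ?int
{
    if (function_exists('mb_ord')) {
        $n = mb_ord($ch, 'UTF-8');       // int on success, false on failure
        return $n === false ? null : $n;
    }
    if (class_exists('IntlChar')) {
        return IntlChar::ord($ch);       // int, or null on failure
    }
    return null;
}

// Characters taken from the examples on this page.
foreach (['<', 'ア', '森'] as $ch) {
    var_dump(builtin_ord($ch));
}
```

With either extension loaded, the results are expected to agree with the numbers listed in the Example section (60 for '<', 12450 for 'ア', 26862 for '森').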
References
https://www.php.net/manual/en/intlchar.ord.php
IntlChar::ord: Return Unicode code point value of character
Keywords
Japanese, Kanji, mb_str_split.t, PHP, SomeUtf8, SomeUtfH, uniord.t, Unicode, unichr.t, Utf8, UtfH, Utf8table