Mb str split.t

From TORI
Revision as of 14:13, 21 May 2021 by T

mb_str_split.t is a PHP file with the function mb_str_split, which splits the input string into Utf8 characters. The output is an array of strings; each of them holds one character and is one to four bytes long (one or three bytes in the example below).

Code

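<?php
// Define the function only if PHP does not already provide it
// (the mbstring extension has a built-in mb_str_split since PHP 7.4).
if (!function_exists('mb_str_split')) {
  function mb_str_split($str) {
    // Split a multibyte (Utf8) string into characters.
    // Split at all positions, except at the start: ^
    // and at the very end: \z
    // (\z is used instead of $, because $ also matches before a trailing newline).
    $pattern = '/(?<!^)(?!\z)/u';
    return preg_split($pattern, $str);
  }
}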

Example of calling

<?php
include "mb_str_split.t";

$a="私の名前はマー\nTori です。";
$N=strlen($a); // length of the string in bytes
echo $N, "\n";

// hexadecimal dump of all bytes of the input string
for($n=0;$n<$N;$n++)
{
printf("%02x ",ord($a[$n]) );
}
echo "\n";

$b = mb_str_split($a);
var_dump($b);
$M=count($b);

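// for each character: the character, its length in bytes,
// its bytes in hexadecimal, and its first byte in decimal
for($m=0;$m<$M;$m++)
{
$c=$b[$m];
$d=strlen($c);
echo "$c  $d ";
for($n=0;$n<$d;$n++) printf("%2x ",ord($c[$n]));
echo ord($c), "\n"; // ord() returns the value of the first byte
}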
}
?>

Output

36
e7 a7 81 e3 81 ae e5 90 8d e5 89 8d e3 81 af e3 83 9e e3 83 bc 0a 54 6f 72 69 20 e3 81 a7 e3 81 99 e3 80 82 
array(16) {
  [0]=>
  string(3) "私"
  [1]=>
  string(3) "の"
  [2]=>
  string(3) "名"
  [3]=>
  string(3) "前"
  [4]=>
  string(3) "は"
  [5]=>
  string(3) "マ"
  [6]=>
  string(3) "ー"
  [7]=>
  string(1) "
"
  [8]=>
  string(1) "T"
  [9]=>
  string(1) "o"
  [10]=>
  string(1) "r"
  [11]=>
  string(1) "i"
  [12]=>
  string(1) " "
  [13]=>
  string(3) "で"
  [14]=>
  string(3) "す"
  [15]=>
  string(3) "。"
}
私  3 e7 a7 81 231
の  3 e3 81 ae 227
名  3 e5 90 8d 229
前  3 e5 89 8d 229
は  3 e3 81 af 227
マ  3 e3 83 9e 227
ー  3 e3 83 bc 227

  1  a 10
T  1 54 84
o  1 6f 111
r  1 72 114
i  1 69 105
   1 20 32
で  3 e3 81 a7 227
す  3 e3 81 99 227
。  3 e3 80 82 227
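
The byte counts in the listing follow from the first byte of each character: in Utf8, the leading byte indicates how many bytes the character occupies. This can be sketched as below, assuming valid Utf8 input. The function name utf8_char_length is hypothetical and is not part of mb_str_split.t.

```php
<?php
// Hypothetical helper: number of bytes of a Utf8 character,
// determined from the value of its first (leading) byte.
// Assumes valid Utf8 input.
function utf8_char_length($firstByte) {
  if ($firstByte < 0x80) return 1;           // 0xxxxxxx: ASCII
  if (($firstByte & 0xE0) == 0xC0) return 2; // 110xxxxx
  if (($firstByte & 0xF0) == 0xE0) return 3; // 1110xxxx
  if (($firstByte & 0xF8) == 0xF0) return 4; // 11110xxx
  return 0;                                  // continuation byte or invalid value
}

echo utf8_char_length(0xe7), "\n"; // first byte of 私 (231): prints 3
echo utf8_char_length(0x54), "\n"; // T (84): prints 1
```

The first bytes e3, e5 and e7 in the listing all match the pattern 1110xxxx, which is why every Japanese character there counts three bytes.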

Analogies

PHP 7.4 and later include the built-in function mb_str_split() in the mbstring extension.

Enabling this function requires an appropriate setting of PHP, which takes certain skills.

Many users cannot reconfigure the PHP software without breaking their servers.

For them, it is easier to write their own version of the function than to find the configuration file that should be modified and to edit it in an appropriate way.
So, several self-made versions of mb_str_split have appeared on the internet.

Only one of them, a very short one, is presented above.
It requires the function preg_split, which seems to be supported in the default PHP setting.
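
One way to avoid both the reconfiguration and a clash with the built-in name is a wrapper that calls the built-in mb_str_split() when it is available and falls back to preg_split otherwise. The following is only a minimal sketch. The name utf8_split is hypothetical, and the fallback uses the empty pattern with the flag PREG_SPLIT_NO_EMPTY, which is a variant of the pattern shown above and not the code of mb_str_split.t.

```php
<?php
// Hypothetical wrapper: split a Utf8 string into characters.
function utf8_split($str) {
  if (function_exists('mb_str_split')) {
    // built-in version (mbstring extension, PHP 7.4 and later)
    return mb_str_split($str, 1, 'UTF-8');
  }
  // fallback: split at every position between characters
  // and drop the empty pieces at the start and at the end
  return preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
}

$s = "マー\nTori";
$b = utf8_split($s);
echo count($b), "\n";                                // prints 7
echo implode('', $b) === $s ? "ok" : "differs", "\n"; // joining the pieces restores the input
```

With such a wrapper, the calling code does not depend on whether the mbstring extension is enabled on the server.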

References


https://en.wikipedia.org/wiki/List_of_jōyō_kanji

https://www.php.net/manual/de/function.mb-str-split.php User Contributed Notes: Polyfill PHP < 7.4 based on package "symfony/polyfill-mbstring" (2020). A much longer (and perhaps more universal) code is suggested there; it seems to do the same.

Keywords

Japanese, Kanji, PHP, SomeH, SomeUtf8, Utf8, Utf8table, UtfH