Handling Chinese strings in PHP has long troubled programmers who are new to PHP development. The following briefly analyzes how PHP computes the length of a Chinese string:
PHP's built-in strlen() function measures a string by the number of bytes it occupies, not by the number of characters. mb_strlen() does count characters, but only when the mbstring extension is available and it is told the correct encoding. An English character occupies 1 byte. Example:
$enStr = 'Hello,China!';
echo strlen($enStr); // output: 12
Chinese is different. Chinese websites generally use one of two encodings: GBK/GB2312 or UTF-8. UTF-8 covers far more characters, so many webmasters prefer it. GBK and UTF-8 encode Chinese characters differently, so the same text occupies a different number of bytes in each.
Each Chinese character occupies 2 bytes in GBK encoding, for example:
$zhStr = '您好，中国！'; // source file saved as GBK
echo strlen($zhStr); // output: 12
Each Chinese character occupies 3 bytes in UTF-8 encoding, for example:
$zhStr = '您好，中国！'; // source file saved as UTF-8
echo strlen($zhStr); // output: 18
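To see both byte counts from a single script, regardless of how the source file itself is saved, the same text can be converted between encodings. The sketch below assumes PHP 7 or later (for the `\u{...}` escapes) and that the mbstring extension is loaded. It uses `CP936`, the Microsoft code page for GBK, as the target encoding name, and the variable names are illustrative.

```php
<?php
// "您好，中国！" written with Unicode escapes, so the bytes are UTF-8
// no matter how this file is saved (requires PHP 7+).
$utf8 = "\u{60A8}\u{597D}\u{FF0C}\u{4E2D}\u{56FD}\u{FF01}";

// Re-encode the same six characters as GBK (assumes mbstring is loaded).
$gbk = mb_convert_encoding($utf8, 'CP936', 'UTF-8');

echo strlen($utf8), "\n"; // 18: six characters, 3 bytes each
echo strlen($gbk), "\n";  // 12: six characters, 2 bytes each
```

The full-width comma and exclamation mark are counted like Chinese characters here, which is why all six characters contribute to the totals.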
So how do we calculate the length of a Chinese string? Some might suggest dividing the byte length by 2 under GBK and by 3 under UTF-8. But real strings are rarely that well behaved: in 99% of cases Chinese and English appear mixed together, and the single-byte English characters make the division come out wrong.
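The failure of the divide-by-3 idea is easy to reproduce. The following is a minimal sketch, assuming PHP 7+ and the mbstring extension; the variable names are made up for illustration.

```php
<?php
// "Hello,中国！": six ASCII characters followed by three
// Chinese/full-width characters (UTF-8 via PHP 7+ escapes).
$mixed = "Hello,\u{4E2D}\u{56FD}\u{FF01}";

$bytes = strlen($mixed);             // 6 * 1 + 3 * 3 = 15 bytes
$naive = intdiv($bytes, 3);          // 5: wrong, the ASCII bytes were divided too
$chars = mb_strlen($mixed, 'UTF-8'); // 9: the actual number of characters

echo $bytes, ' ', $naive, ' ', $chars, "\n"; // 15 5 9
```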
The following piece of code comes from WordPress. The main idea is to use a regular expression to break the string into individual units (characters) and then count them; that count is the length of the string. The code is as follows (it can only process UTF-8 encoded strings):
$zhStr = '您好，中国！';
$str = 'Hello,中国！';
// Calculate the length of a Chinese string
function utf8_strlen($string = null) {
    // Split the string into units; the "u" modifier makes "." match one UTF-8 character
    preg_match_all("/./us", $string, $match);
    // Return the number of units
    return count($match[0]);
}
echo utf8_strlen($zhStr); // output: 6
echo utf8_strlen($str); // output: 9
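One caveat with the regular-expression approach: with the `u` modifier, PCRE rejects input that is not valid UTF-8, and `preg_match_all()` then returns `false` instead of a count. A minimal sketch of a guarded variant follows. The name `utf8_strlen_safe` and the choice to return `null` on failure are assumptions made for illustration, not part of the WordPress code, and the type declarations assume PHP 7.1+.

```php
<?php
// Hypothetical guarded variant: returns the character count,
// or null when the input is not valid UTF-8.
function utf8_strlen_safe(string $string): ?int
{
    // preg_match_all() returns the number of full matches, which with
    // the pattern /./us is the number of UTF-8 characters.
    $count = preg_match_all('/./us', $string);

    // false signals a PCRE error such as PREG_BAD_UTF8_ERROR.
    return $count === false ? null : $count;
}

var_dump(utf8_strlen_safe("Hello,\u{4E2D}\u{56FD}\u{FF01}")); // int(9)
var_dump(utf8_strlen_safe("\xE4\xB8"));                        // NULL (truncated sequence)
```

Returning `null` rather than 0 keeps an invalid string distinguishable from an empty one.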
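A second utf8_strlen, which gets the length of a UTF-8 encoded string by walking its bytes (it shares its name with the function above, so only one of the two can be declared in a script). The code is as follows:
/*
 * For UTF-8 encoded programs.
 * Gets the length of a string. As the loop is written, each multi-byte
 * character (such as a Chinese character) counts as two units and each
 * ASCII character as one.
 * Comments by itlearner
 */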
function utf8_strlen($str) {
    $count = 0;
    $len = strlen($str); // byte length, computed once
    for ($i = 0; $i < $len; $i++) {
        $value = ord($str[$i]);
        if ($value > 127) {
            // Lead byte of a multi-byte character: count it once here
            // (and once more below), then skip its continuation bytes.
            $count++;
            if ($value >= 192 && $value <= 223) $i++;
            elseif ($value >= 224 && $value <= 239) $i = $i + 2;
            elseif ($value >= 240 && $value <= 247) $i = $i + 3;
            else die('Not a UTF-8 compatible string');
        }
        $count++;
    }
    return $count;
}
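Because the loop above adds two to the count for every multi-byte character, its result behaves more like a display width than a character count. The mbstring extension offers `mb_strwidth()` for a similar purpose. It is shown here only as a related built-in, not as an exact equivalent, since it decides width by Unicode range (accented Latin letters, for example, count as 1) rather than by byte length. The sketch assumes PHP 7+ and mbstring; the variable names are illustrative.

```php
<?php
$zh    = "\u{60A8}\u{597D}\u{FF0C}\u{4E2D}\u{56FD}\u{FF01}"; // "您好，中国！"
$mixed = "Hello,\u{4E2D}\u{56FD}\u{FF01}";                    // "Hello,中国！"

echo mb_strwidth($zh, 'UTF-8'), "\n";    // 12: six full-width characters, 2 each
echo mb_strwidth($mixed, 'UTF-8'), "\n"; // 12: six ASCII (1 each) + three full-width (2 each)
echo mb_strlen($mixed, 'UTF-8'), "\n";   // 9: plain character count, for comparison
```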