Fixing garbled characters when splitting GBK Chinese strings with explode
Recently I ran into a "magic" character: "弢" (tao).
Here is what happened:
1 $ list = explode ('|', 'abc scheme | bc'); 2 var_dump ($ list );
I expected a clean split into two elements. Instead, the result was:
array(3) {
  [0]=> string(4) "abc?"
  [1]=> string(0) ""
  [2]=> string(2) "bc"
}
A garbled character shows up, and so does an unexplained empty element.
The reason is that the GBK encoding of "弢" is 8f7c, and the ASCII code of | is 7c. explode works on bytes, so it treats the second byte of "弢" as a | and cuts there.
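The byte-level view can be checked with a small sketch. It writes the string with hex escapes (the bytes 8f 7c from above, followed by a real |), so it does not depend on the encoding the script file is saved in; the variable names are only illustrative.

```php
<?php
// "abc" + the two GBK bytes 0x8F 0x7C + a real "|" + "bc".
$s = "abc\x8f\x7c|bc";

echo bin2hex($s), "\n";        // 6162638f7c7c6263
echo strpos($s, '|'), "\n";    // 4: the trail byte of the GBK character, not the real delimiter at offset 5

$parts = explode('|', $s);
echo count($parts), "\n";      // 3
echo bin2hex($parts[0]), "\n"; // 6162638f: "abc" plus a dangling lead byte
```

The first 7c in the hex dump belongs to the two-byte character, which is why the first element ends in a broken byte and the second element is empty.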
Since this is a double-byte problem, mbstring should be able to solve it.
Unfortunately, PHP has no mb_explode function. The closest thing I found is mb_split:
array mb_split ( string $pattern , string $string [, int $limit = -1 ] )
It has no encoding parameter. The encoding is set through mb_regex_encoding.
Write the following code:
1 mb_regex_encoding ('gbk'); 2 $ list = mb_split ('\ |', 'abc scheme | bc'); 3 var_dump ($ list );
PHP raises an error: mb_regex_encoding does not recognize gbk as an encoding.
So I had to use one it does recognize:
1 mb_regex_encoding ('gb2312'); 2 $ list = mb_split ('\ |', 'abc scheme | bc'); 3 var_dump ($ list );
Result:
array(3) {
  [0]=> string(4) "abc?"
  [1]=> string(0) ""
  [2]=> string(2) "bc"
}
So this approach does not work either.
Why? Because "弢" is not in the GB2312 character set, and this function does not support its supersets (GBK, GB18030).
Since mb_split is no help, perhaps the all-powerful regular expression will do. I came up with the following code:
1 var_dump (preg_match_all ('/([^ \ |]) */', 'abc tables | bc', $ matches); 2 var_dump ($ matches );
Result:
int(2)
array(2) {
  [0]=> array(2) {
    [0]=> string(4) "abc?"
    [1]=> string(2) "bc"
  }
  [1]=> array(2) {
    [0]=> string(1) "?"
    [1]=> string(1) "c"
  }
}
Well, that was wishful thinking.
The next question is how to describe this scenario properly with a regular expression.
I consulted laruence's blog post on garbled characters when splitting GBK Chinese, but my regex skills are limited and I still could not come up with a suitable expression. (If you work one out, please let me know.)
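One way to picture what such a regular expression has to express: in GBK, a byte from 0x81 to 0xFE starts a two-byte character, and the byte after it must be skipped, not compared against the delimiter. The sketch below does this with a plain loop instead of a regex. gbk_split_bytes is a hypothetical name, and the code assumes a single-byte ASCII delimiter and well-formed GBK input.

```php
<?php
// Hypothetical helper: split a GBK byte string on a single-byte ASCII delimiter.
function gbk_split_bytes($delimiter, $string)
{
    $list = array();
    $field = '';
    $n = strlen($string);
    for ($i = 0; $i < $n; $i++) {
        $byte = ord($string[$i]);
        if ($byte >= 0x81 && $byte <= 0xFE && $i + 1 < $n) {
            // Lead byte: copy it together with its trail byte, without
            // comparing the trail byte against the delimiter.
            $field .= $string[$i] . $string[$i + 1];
            $i++;
        } elseif ($string[$i] === $delimiter) {
            // A real delimiter: close the current field.
            $list[] = $field;
            $field = '';
        } else {
            $field .= $string[$i];
        }
    }
    $list[] = $field; // whatever remains after the last delimiter
    return $list;
}

// Prints the two fields as hex: 6162638f7c and 6263.
print_r(array_map('bin2hex', gbk_split_bytes('|', "abc\x8f\x7c|bc")));
```

Unlike explode, the loop never looks at the 7c that follows 8f, so the two-byte character stays intact.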
With no other option left, I thought it over and fell back on substr:
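function mb_explode($delimiter, $string, $encoding = null)
{
    // Default to mbstring's internal encoding when none is given.
    if ($encoding === null) {
        $encoding = mb_internal_encoding();
    }
    // Guard: an empty delimiter would never shorten $string in the loop below.
    if ($delimiter === '') {
        return false;
    }
    $list = array();
    $len = mb_strlen($delimiter, $encoding); // delimiter length in characters
    // Repeatedly cut off everything before the next delimiter.
    while (false !== ($idx = mb_strpos($string, $delimiter, 0, $encoding))) {
        $list[] = mb_substr($string, 0, $idx, $encoding);
        $string = mb_substr($string, $idx + $len, null, $encoding);
    }
    $list[] = $string; // whatever remains after the last delimiter
    return $list;
}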
Test code:
1 $ a = 'abc scheme | bc'; 2 3 var_dump (mb_explode ('|', $ a, 'gbk'); 4 var_dump (mb_explode ('bc ', $ a, 'gbk'); 5 var_dump (mb_explode ('hour', $ a, 'gbk '));
Result:
Array (2) {[0] => string (5) "abc regular" [1] => string (2) "bc"} array (3) {[0] => string (1) "a" [1] => string (3) "inline |" [2] => string (0) ""} array (2) {[0] => string (3) "abc" [1] => string (3) "| bc "}
This gives the correct result in all three cases.
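A different route, not the one taken above, is to convert to UTF-8, split there, and convert each piece back. In UTF-8 every byte of a multi-byte character is 0x80 or higher, so an ASCII delimiter such as | can never match inside a character. The sketch below assumes that mb_convert_encoding accepts 'GBK' as an encoding name (the mb_* calls above already rely on it); gbk_explode_via_utf8 is a hypothetical name.

```php
<?php
// Hypothetical helper: explode a GBK string by round-tripping through UTF-8.
function gbk_explode_via_utf8($delimiter, $string)
{
    // Convert both arguments so the comparison happens entirely in UTF-8.
    $delimiter = mb_convert_encoding($delimiter, 'UTF-8', 'GBK');
    $string    = mb_convert_encoding($string, 'UTF-8', 'GBK');

    $list = array();
    foreach (explode($delimiter, $string) as $part) {
        // Convert each piece back to GBK for the caller.
        $list[] = mb_convert_encoding($part, 'GBK', 'UTF-8');
    }
    return $list;
}

// Prints the two fields as hex: 6162638f7c and 6263.
print_r(array_map('bin2hex', gbk_explode_via_utf8('|', "abc\x8f\x7c|bc")));
```

This costs two conversions per call, but it reuses the built-in explode and avoids the repeated mb_strpos and mb_substr calls in the loop.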