Fixing garbled GBK Chinese characters when splitting strings
I recently ran into a peculiar character: 弢 (tāo).
Here is what happened:
1 $list Explode (' | ', ' ABC Mingtao |BC '); 2 Var_dump ($list);
I expected a clean split into two parts.
Instead, the result was this:
array(3) { [0]=> string(4) "abc?" [1]=> string(0) "" [2]=> string(2) "bc" }
The output is garbled, and an empty element has appeared out of nowhere.
The cause: the GBK encoding of 弢 is 0x8F7C, and the ASCII code of | is 0x7C, so explode treats the second byte of 弢 as a | and splits there.
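The byte-level cause can be made visible with a short sketch. It writes the character as the escape sequence "\x8f\x7c" (the byte pair quoted above), so the example does not depend on how the script file itself is encoded; the variable names are only illustrative.

```php
<?php
// "\x8f\x7c" is the two-byte GBK sequence quoted above; "|" is the single byte 0x7c.
$s = "abc\x8f\x7c|bc";

// Hex dump of the raw bytes: note the two adjacent 7c bytes.
echo bin2hex($s), "\n";            // 6162638f7c7c6263

// explode() works on bytes, so the first "|" it sees is the trail byte at offset 4.
var_dump(strpos($s, '|'));         // int(4)

$parts = explode('|', $s);
var_dump(count($parts));           // int(3)
var_dump(bin2hex($parts[0]));      // string(8) "6162638f" (a dangling lead byte)
```

The first element ends in a lone lead byte 0x8F, which is what shows up as "?" in the output, and the empty element is the gap between the two 0x7C bytes.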
Since this is a double-byte problem, mbstring seemed like the right tool.
Unfortunately, PHP has no mb_explode function. After some searching, I found mb_split:
array mb_split ( string $pattern , string $string [, int $limit = -1 ] )
There is no parameter for declaring the encoding. On closer inspection, the encoding is set separately through mb_regex_encoding.
So I wrote the following code:
1 mb_regex_encoding (' GBK '); 2 $list = mb_split (' \| ', ' abc Mingtao |BC '); 3 Var_dump ($list);
The result: PHP raised an error, because mb_regex_encoding does not recognize GBK. Awkward.
So I switched to an encoding it does recognize:
1 mb_regex_encoding (' gb2312 '); 2 $list = mb_split (' \| ', ' abc Mingtao |BC '); 3 Var_dump ($list);
Results:
array(3) { [0]=> string(4) "abc?" [1]=> string(0) "" [2]=> string(2) "bc" }
So this approach does not help either.
Why? The character 弢 is not actually part of the GB2312 character set! And the character sets that do contain it (GBK, GB18030) are not supported by this function!
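One way to check this kind of claim is to ask mbstring whether the bytes are valid in each encoding. The sketch below assumes the mbstring extension is loaded, that GB2312 is available under mbstring's name 'EUC-CN', and that 'GBK' is accepted as an alias of CP936; the bytes are again written as "\x8f\x7c".

```php
<?php
// The byte pair given earlier as the GBK encoding of the character.
$char = "\x8f\x7c";

// 0x8F is outside the lead-byte range of EUC-CN (GB2312), so this is not valid there.
var_dump(mb_check_encoding($char, 'EUC-CN'));   // bool(false)

// The same bytes form one valid character in GBK.
var_dump(mb_check_encoding($char, 'GBK'));      // bool(true)

// Counted as GBK, the test string has 7 characters in 8 bytes.
var_dump(mb_strlen("abc{$char}|bc", 'GBK'));    // int(7)
```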
Since that did not work, perhaps the all-purpose regular expression would. That led to the following code:
var_dump(preg_match_all(…, $matches));
var_dump($matches);
Results:
int(2)
array(2) { [0]=> array(2) { [0]=> string(4) "abc?" [1]=> string(2) "bc" } [1]=> array(2) { [0]=> string(1) "?" [1]=> string(1) "c" } }
Well, I was too optimistic.
The next question is how to describe this situation with a regular expression.
For reference, see the post on Brother Bird's blog: "A solution to garbled text when splitting GBK Chinese". Unfortunately, my regex skills are limited and I still could not come up with a suitable pattern (if an expert works one out, please let me know).
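For what it is worth, one possible pattern is sketched here. It is not taken from the referenced blog post; it relies on the PCRE backtracking verbs (*SKIP)(*F) to consume every two-byte GBK character as a unit and refuse to match inside it, so that only a standalone delimiter byte can split the string. The helper name gbk_explode_regex is hypothetical, and the sketch assumes the delimiter is plain ASCII.

```php
<?php
// Hypothetical helper: split a GBK byte string on an ASCII delimiter with PCRE.
function gbk_explode_regex(string $delimiter, string $string): array
{
    // First alternative: a GBK lead byte (0x81-0xFE) plus its trail byte.
    // (*SKIP)(*F) discards that match and resumes scanning after the character,
    // so a trail byte such as 0x7C is never examined on its own.
    // Second alternative: the real delimiter.
    $pattern = '/[\x81-\xfe][\x40-\x7e\x80-\xfe](*SKIP)(*F)|'
             . preg_quote($delimiter, '/') . '/';

    // No /u modifier: the subject is treated as raw bytes, not UTF-8.
    return preg_split($pattern, $string);
}

$list = gbk_explode_regex('|', "abc\x8f\x7c|bc");
var_dump(count($list));         // int(2)
var_dump(bin2hex($list[0]));    // string(10) "6162638f7c"
```

Because every double-byte character is skipped as a whole, this sketch cannot split on a delimiter that is itself a GBK character, which is one reason a character-aware approach is still worth having.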
Without a suitable regular expression, I had to fall back on substring operations:
function mb_explode($delimiter, $string, $encoding = null) {
    $list = array();
    // Default to the internal encoding when none is given.
    is_null($encoding) && $encoding = mb_internal_encoding();
    // Delimiter length in characters, not bytes.
    $len = mb_strlen($delimiter, $encoding);
    while (false !== ($idx = mb_strpos($string, $delimiter, 0, $encoding))) {
        // Keep everything before the delimiter, then drop it and the delimiter.
        $list[] = mb_substr($string, 0, $idx, $encoding);
        $string = mb_substr($string, $idx + $len, null, $encoding);
    }
    $list[] = $string;
    return $list;
}
Test code:
1 $a = ' abc Mingtao |BC '; 2 3 Var_dump $a, ' GBK '); 4 Var_dump $a, ' GBK '); 5 Var_dump $a, ' GBK ');
Results:
array (2 0]=> string (5) "ABC Mingtao" [ 1]=> string (2) "BC" array (3 0]=> string (1)" A " [ 1]=> string (3)" Mingtao | " [ 2]=> string (0) "} array (2 0]=> string (3) "ABC" [ 1]=> string (3) "|BC"
This will give you the right results.
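One edge case worth guarding against is an empty delimiter: on PHP 8, mb_strpos accepts an empty needle and returns the current offset, so a loop of this shape would never terminate. The variant below is only a sketch under that assumption; the name mb_explode_safe and the choice of InvalidArgumentException are illustrative, not part of PHP. It also advances an offset instead of re-slicing the string on every pass.

```php
<?php
// Hypothetical guarded variant of a multibyte-aware explode (requires mbstring).
function mb_explode_safe(string $delimiter, string $string, ?string $encoding = null): array
{
    if ($delimiter === '') {
        // An empty needle would otherwise match at every offset and loop forever.
        throw new InvalidArgumentException('Delimiter must not be empty');
    }
    $encoding = $encoding ?? mb_internal_encoding();
    $len = mb_strlen($delimiter, $encoding);   // length in characters
    $list = [];
    $offset = 0;                               // character offset of the current piece
    while (false !== ($idx = mb_strpos($string, $delimiter, $offset, $encoding))) {
        $list[] = mb_substr($string, $offset, $idx - $offset, $encoding);
        $offset = $idx + $len;
    }
    // Whatever follows the last delimiter (possibly an empty string).
    $list[] = mb_substr($string, $offset, null, $encoding);
    return $list;
}

// Same test string as above, written with byte escapes for the GBK character.
$pieces = mb_explode_safe('|', "abc\x8f\x7c|bc", 'GBK');
var_dump(count($pieces));   // int(2)
```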