Many people run into garbled characters when splitting Chinese strings in PHP, especially with UTF-8 encoding. This article presents a better solution, which is also the one used here.
PHP's str_split function splits by byte, so it cannot split Chinese (multi-byte) strings correctly. We can use the mb_* functions instead.
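// PHP 7.4+ ships a built-in mb_str_split(), so only define this on older versions.
if (!function_exists('mb_str_split')) {
    /**
     * Convert a string to an array
     * @param string $str          the string to split
     * @param int    $split_length number of characters per element
     * @param string $charset      character encoding of $str
     * @return array|false list of strings, or false if $split_length < 1
     */
    function mb_str_split($str, $split_length = 1, $charset = "UTF-8") {
        if (func_num_args() == 1) {
            // One character per element; the /u modifier makes the regex UTF-8 aware.
            return preg_split('/(?<!^)(?!$)/u', $str);
        }
        if ($split_length < 1) return false;
        $len = mb_strlen($str, $charset);
        $arr = array();
        for ($i = 0; $i < $len; $i += $split_length) {
            $s = mb_substr($str, $i, $split_length, $charset);
            $arr[] = $s;
        }
        return $arr;
    }
}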
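To see why the byte-oriented str_split garbles Chinese text while the mb_* approach does not, here is a minimal sketch. It assumes the script file is saved as UTF-8 and the mbstring extension is loaded; the sample string is an assumption for illustration and is not from the original article.

```php
<?php
// Assumed sample text (not from the original article): four Chinese characters.
$text = "你好世界";

// In UTF-8 each of these characters occupies 3 bytes.
$bytes = strlen($text);              // counts bytes
$chars = mb_strlen($text, "UTF-8");  // counts characters

// str_split() cuts every 2 bytes, so each 3-byte character is torn apart.
$byteChunks = str_split($text, 2);

// mb_substr() cuts on character boundaries instead.
$charChunks = array();
for ($i = 0; $i < $chars; $i += 2) {
    $charChunks[] = mb_substr($text, $i, 2, "UTF-8");
}

echo $bytes, " bytes, ", $chars, " characters\n";
print_r($charChunks);
```

The pieces in $byteChunks are no longer valid UTF-8, which is what shows up on screen as garbled characters.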
Method 2:
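function mbStrSplit($string, $len = 1) {
    if ($len < 1) return false; // a non-positive length would loop forever
    $array = array();
    $start = 0;
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen) {
        // Take the first $len characters, then drop them from the string.
        $array[] = mb_substr($string, $start, $len, "UTF-8");
        $string = mb_substr($string, $len, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}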
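Both methods above call mb_substr repeatedly. As a possible alternative, the sketch below splits the string into single characters once with a UTF-8 aware regex and then regroups them. The function name split_by_regex is hypothetical, and the sketch assumes UTF-8 input; it is not part of the original article.

```php
<?php
// Hypothetical helper, for illustration only; assumes $string is UTF-8.
function split_by_regex($string, $len = 1)
{
    // With the /u modifier the empty pattern matches between characters, not bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $len < 1) {
        return false; // invalid UTF-8 input or invalid chunk length
    }
    $parts = array();
    foreach (array_chunk($chars, $len) as $chunk) {
        $parts[] = implode('', $chunk);
    }
    return $parts;
}

// Assumed sample text: four Chinese characters split into chunks of three.
$parts = split_by_regex("你好世界", 3);
print_r($parts);
```

The last chunk is simply shorter when the character count is not a multiple of $len.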
Garbled output when splitting a Chinese string with PHP's str_split function
Q:
// Test splitting a Chinese string
// ($str2 is a Chinese sentence in the original post; its English translation is shown here)
$str2 = "Gently I walked away, just as I gently came.";
echo "Original string: $str2<br/>";
echo "1. Split the string with a specified length of 5:<br/>";
$arr3 = str_split($str2, 5);
echo "-- \$arr3[0] value: " . "$arr3[0]" . "<br/>";
echo "-- \$arr3[1] value: " . "$arr3[1]";
The result is:
Original string: Gently I walked away, just as I gently came.
1. Split the string with the specified length of 5:
-- $arr3[0] value: (garbled characters)
-- $arr3[1] value: (garbled characters)
The output is garbled! Can anyone explain why?
A:
Here is one solution.
My tests suggest that preg_split may have trouble with Chinese (multi-byte) separators.
The reason may be that a multi-byte character cannot be matched correctly by the regular expression (this is speculation).
With a half-width (English) separator, however, it worked very well in my experiments.
So, before splitting the text, I replace the Chinese periods and commas with their English half-width equivalents and then call preg_split. So far this has worked well.
The following is my test code.
<?php
$test = <<<EOF
The reporter learned from relevant sources that all preparations for the launch of Chang'e 2 are complete. After review by the expert group yesterday, the satellite, rocket, launch site, and measurement and control systems were all found to be normal, and the launch conditions were met. Starting today, the fueling crew at the Xichang Satellite Launch Center will fuel the rocket.
According to an aerospace expert, because the Earth and the Moon are both rotating, the optimal alignment between the Moon and the Earth appears only three times a year, and those three times are the best points for launching lunar exploration satellites. Observations show that this year's three occurrences fall on October 1, October 2, and October 3, and the best launch windows are 7 o'clock on the first day, 8 o'clock on the second day, and 10 o'clock on the third day. Of these, 7 o'clock is the best.
According to the media, the launch window on April 9 was scheduled at 06:59:57. The expert told our reporter that the time was set three seconds early rather than exactly on the hour so that those three seconds could serve as buffer time for field commanders to issue the countdown commands. (Reporter wanqiang)
EOF;
// $input = $_POST[$content]; // obtain the string to be split
$test = str_replace("，", ',', $test); // full-width comma to half-width
$test = str_replace("。", '.', $test); // full-width period to half-width
$mode = "/[,.]/"; // split on commas (,) and periods (.)
$output = preg_split($mode, $test, -1);
print_r($output);
?>
================
Feel free to give it a try. My test string was encoded in GB2312.
It also works normally when the string is UTF-8.
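For UTF-8 text there may be no need to replace the punctuation first. Without the /u modifier, PCRE matches a character class byte by byte, which would explain the trouble with multi-byte separators described above. The sketch below assumes the script file and the input are both UTF-8; the sample sentence is an assumption and is not from the original post. A GB2312 string would presumably have to be converted to UTF-8 before this applies.

```php
<?php
// Assumed sample sentence with full-width punctuation (not from the original post).
$text = "你好，世界。再见，朋友。";

// Full-width comma (U+FF0C) and full-width period (U+3002) in a character class.
// The /u modifier tells PCRE to read both the pattern and the subject as UTF-8.
$parts = preg_split('/[，。]/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

PREG_SPLIT_NO_EMPTY drops the empty element that would otherwise follow the final period.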