PHP solution for splitting Chinese character strings - PHP source code

Source: Internet
Author: User
Many PHP users run into garbled characters when splitting Chinese strings, especially with UTF-8 encoding. This article presents a more reliable solution, which is also the approach used here.


PHP's str_split function splits a string by bytes, so it does not handle Chinese (multi-byte) characters correctly. The mb_* functions can be used instead.
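To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes UTF-8 and the mbstring extension; the sample string is an illustration, not taken from the original article.

```php
<?php
// Two Chinese characters written as escapes (PHP 7.0+), so the sample does not
// depend on how this file is saved. In UTF-8 each one occupies 3 bytes.
$text = "\u{4E2D}\u{6587}";

echo strlen($text), "\n";              // strlen() counts bytes
echo mb_strlen($text, 'UTF-8'), "\n";  // mb_strlen() counts characters

// str_split() cuts every 2 bytes, which lands in the middle of a character,
// so the first chunk is not valid UTF-8 on its own.
$byteChunks = str_split($text, 2);
var_dump(mb_check_encoding($byteChunks[0], 'UTF-8'));

// mb_substr() cuts on character boundaries instead.
$first = mb_substr($text, 0, 1, 'UTF-8');
var_dump($first === "\u{4E2D}");
```

A chunk that ends in the middle of a multi-byte sequence is what shows up as garbled output in the browser.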

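/**
 * Convert a string to an array of characters (or fixed-length chunks).
 * @param string $str          the string to split
 * @param int    $split_length number of characters per chunk
 * @param string $charset      character encoding of $str
 * @return array|false         the chunks, or false if $split_length < 1
 */
// PHP 7.4+ ships a built-in mb_str_split, so define this one only when it is missing.
if (!function_exists('mb_str_split')) {
    function mb_str_split($str, $split_length = 1, $charset = "UTF-8") {
        if (func_num_args() == 1) {
            // The pattern is truncated in the source; the common zero-width
            // "between every character" split is assumed here.
            return preg_split('/(?<!^)(?!$)/u', $str);
        }
        if ($split_length < 1) return false;
        $len = mb_strlen($str, $charset);
        $arr = array();
        for ($i = 0; $i < $len; $i += $split_length) {
            $s = mb_substr($str, $i, $split_length, $charset);
            $arr[] = $s;
        }
        return $arr;
    }
}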

Method 2:

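function mbStrSplit($string, $len = 1) {
    if ($len < 1) return false; // a length below 1 would loop forever
    $start = 0;
    $array = array();
    $strlen = mb_strlen($string, "UTF-8");
    while ($strlen) {
        // Take the first $len characters, then drop them from the front.
        $array[] = mb_substr($string, $start, $len, "UTF-8");
        $string = mb_substr($string, $len, $strlen, "UTF-8");
        $strlen = mb_strlen($string, "UTF-8");
    }
    return $array;
}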
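
On PHP 7.4 and later, mb_str_split is a built-in function, so a hand-written version is only needed on older installations. The following is a minimal sketch of a wrapper that uses the built-in when it exists and otherwise falls back to an mb_substr loop like the one in the first method. The name split_chars and the sample string are hypothetical, and UTF-8 input is assumed.

```php
<?php
// Hypothetical wrapper (not from the original article): prefer the built-in
// mb_str_split() on PHP >= 7.4, otherwise fall back to an mb_substr() loop.
function split_chars(string $str, int $length = 1, string $charset = 'UTF-8'): array
{
    if ($length < 1) {
        throw new InvalidArgumentException('Chunk length must be at least 1.');
    }
    if (function_exists('mb_str_split')) {
        return mb_str_split($str, $length, $charset);
    }
    $chunks = [];
    $total = mb_strlen($str, $charset);
    for ($i = 0; $i < $total; $i += $length) {
        $chunks[] = mb_substr($str, $i, $length, $charset);
    }
    return $chunks;
}

// "\u{...}" escapes (PHP 7.0+) keep the sample independent of the file's encoding.
$greeting = "\u{4F60}\u{597D}\u{4E16}\u{754C}"; // four Chinese characters
print_r(split_chars($greeting));    // one character per element
print_r(split_chars($greeting, 3)); // a chunk of 3 characters, then the remainder
```

Throwing an exception for an invalid length, rather than returning false as the functions above do, is a design choice of this sketch only.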





Garbled output when splitting a Chinese string with PHP's "str_split" function

Q:

// Test Chinese segmentation
$str2 = "轻轻的我走了，正如我轻轻的来。"; // "Gently I walked away, just as I gently came."
echo "Original string: $str2.<br/>";
echo "1. Split string with a specified length of 5:<br/>";
$arr3 = str_split($str2, 5);
echo "-- \$arr3[0] value: " . "$arr3[0]" . "<br/>";
echo "-- \$arr3[1] value: " . "$arr3[1]";

The result is:
Original string: 轻轻的我走了，正如我轻轻的来。.
1. Split string with a specified length of 5:
-- $arr3[0] value: (garbled)
-- $arr3[1] value: (garbled)
The output is garbled! Can anyone explain why?

A:

Here is one solution.
My tests suggest that preg_split has trouble with Chinese (multi-byte) separators.
The reason may be that a multi-byte character is not matched correctly as a separator by the regular expression (this is speculation).
With half-width (ASCII) separators, however, it worked very well in my experiments.
So before splitting, I replace the full-width Chinese commas and periods in the text with their half-width ASCII equivalents, and then call preg_split. So far this has worked well.
The following is my test code.
<?php
$test = <<<EOF
The reporter learned from relevant sources that all preparations for the launch of Chang'e 2 are complete. After yesterday's review by the expert group, the satellite, rocket, launch site, and measurement and control systems were all found to be normal, and the launch conditions have been met. Starting today, the fueling crew at the Xichang Satellite Launch Center will fuel the rocket.

According to an aerospace expert, because the Earth and the Moon are both rotating, the optimal alignment between them occurs only three times a year, and these are the best times to launch a lunar probe. Observations show that this year's three occurrences fall on October 1, October 2, and October 3, with the best launch windows at 7 o'clock on the first day, 8 o'clock on the second day, and 10 o'clock on the third day. Of these, the 7 o'clock window is the best.

According to the media, the launch window on April 9 was scheduled at 06:59:57. The expert told our reporter that the time is three seconds before the hour, rather than on the hour, because those three seconds are reserved as buffer time for the field commander to issue the countdown command. (Reporter wanqiang)

EOF;

// $input = $_POST[$content]; // obtain the string to be split
$test = str_replace("，", ',', $test); // full-width comma -> half-width comma
$test = str_replace("。", '.', $test); // full-width period -> half-width period
$mode = "/[,.]/s"; // split on commas (,) and periods (.); a | inside [] would match a literal pipe

$output = preg_split($mode, $test, -1);

print_r($output);
?>
================
Feel free to try it. My code was tested with a GB2312 string.
It also works when the string is UTF-8.
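
For UTF-8 input, an alternative that skips the replacement step is to add the u pattern modifier, which makes preg_split treat both the pattern and the subject as UTF-8. Without it, each byte of a full-width punctuation mark inside a character class is matched separately, which is a plausible explanation for the problem described above. The following is a minimal sketch under that assumption; the sample text is made up for illustration.

```php
<?php
// Sample text built from escapes (PHP 7.0+): two Chinese characters, a
// full-width comma (U+FF0C), two more characters, a full-width period
// (U+3002), then ASCII text.
$text = "\u{4F60}\u{597D}\u{FF0C}\u{4E16}\u{754C}\u{3002}ok";

// With the "u" modifier the character class holds whole code points, so
// full-width and half-width punctuation can be listed side by side.
$pattern = '/[\x{FF0C}\x{3002},.]/u';

// PREG_SPLIT_NO_EMPTY drops empty pieces, e.g. after a trailing period.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

This variant only applies to UTF-8 strings; a GB2312 string would first need converting, for example with mb_convert_encoding.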
