Question about truncating Chinese strings in PHP

Source: Internet
Author: User
Using the substr() function to truncate Chinese strings causes problems, so I found a function online, shown below:
PHP code
  // Chinese string truncation function
  function cut_str($string, $start, $length) {
      if (strlen($string) > $length) {
          $str = null;
          $len = $start + $length;
          for ($i = $start; $i < $len; $i++) {
              // A byte above 0xa0 is treated as the first byte of a 2-byte character
              if (ord(substr($string, $i, 1)) > 0xa0) {
                  $str .= substr($string, $i, 2);
                  $i++;
              } else {
                  $str .= substr($string, $i, 1);
              }
          }
          return $str . '...';
      } else {
          return $string;
      }
  }

However, I still get problems when I use it. For example, I truncate the string "use filters and layer styles to create realistic stone words":
PHP code
  $str = "use filters and layer styles to create realistic stone words"; cut_str($str);


However, the output is: "How to use filters and layer styles to create realistic stones ?...", with a garbage character that looks like a question mark at the end, which is frustrating. I looked it up online: a Chinese character generally occupies 3 bytes in UTF-8, yet this function uses "$str .= substr($string, $i, 2);", which takes 2 bytes. What does that mean? I never figured it out. If I change the 2 to a 3, the sentence turns into "profit? Why? Why ?? Mirror? And? Figure? Mountains ?? ? Why ?? ? Why ?? Force? Why? ? Shi? Dam? The word...", so that does not work either. Can anyone help?

------ Solution --------------------
Why not use the mb_substr() function?
------ Solution --------------------
You have to confirm your encoding, and specify that encoding when you truncate.
------ Solution --------------------
That function counts in bytes. In GBK encoding, a Chinese character is 2 bytes.
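
To make the byte counts concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and it writes the two-character string "中文" as explicit bytes so the result does not depend on how the script file itself is saved. The variable names are only illustrative.

```php
<?php
// "中文" as raw bytes in each encoding.
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87"; // UTF-8: 3 bytes per character
$gbk  = "\xD6\xD0\xCE\xC4";         // GBK:   2 bytes per character

// strlen() counts bytes, mb_strlen() counts characters in the given encoding.
$utf8Bytes = strlen($utf8);
$gbkBytes  = strlen($gbk);
$utf8Chars = mb_strlen($utf8, 'UTF-8');

// Taking 2 bytes gives one whole character in GBK,
// but only part of a character in UTF-8.
$gbkPiece    = substr($gbk, 0, 2);
$utf8Piece   = substr($utf8, 0, 2);
$utf8PieceOk = mb_check_encoding($utf8Piece, 'UTF-8');

printf("UTF-8: %d bytes, %d characters\n", $utf8Bytes, $utf8Chars);
printf("GBK:   %d bytes\n", $gbkBytes);
printf("2-byte slice of the UTF-8 string is valid: %s\n", $utf8PieceOk ? 'yes' : 'no');
```

The broken 2-byte slice is what shows up as a question mark when the page is rendered as UTF-8.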
------ Solution --------------------
Of course it's mb_substr. I don't know much about encodings...

In UTF-8, Chinese text commonly uses 2- to 3-byte sequences, but every byte of a multi-byte character has its highest bit (bit 7) set to 1, so it never collides with single-byte ASCII. GBK lead bytes follow a similar rule.

Use mb_substr, which recognizes multi-byte characters according to the encoding you pass it (for example UTF-8).
------ Solution --------------------
That function only works for GBK encoding.

Discussion

Haha, I studied the manual and just got it working. As you said, I only needed to confirm the encoding, but I still want to know why that function does not work. It looks like a PHP interview question. Could someone explain it, especially the 2 in that substr() call, given that a UTF-8 character can take up to 3 or 4 bytes and Chinese characters commonly take 3? Thanks.

------ Solution --------------------
PHP code
/*****************************
 * subCNchar() truncates a string containing Chinese characters
 * [$str]     string to be truncated
 * [$start]   starting position, counted in characters
 * [$length]  number of characters to keep
 * [$charset] string encoding (must match a key of $re below)
 *****************************/
function subCNchar($str, $start, $length, $charset = "utf-8") {
    if (strlen($str) <= $length) return $str;
    // Each pattern matches exactly one character in the given encoding
    $re['utf-8']  = "/[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3}/";
    $re['gb2312'] = "/[\x01-\x7f]|[\xb0-\xf7][\xa0-\xfe]/";
    $re['gbk']    = "/[\x01-\x7f]|[\x81-\xfe][\x40-\xfe]/";
    $re['big5']   = "/[\x01-\x7f]|[\x81-\xfe]([\x40-\x7e]|[\xa1-\xfe])/";
    preg_match_all($re[$charset], $str, $match);
    // Rejoin the selected characters without a separator
    $slice = join("", array_slice($match[0], $start, $length));
    return $slice;
}
------ Solution --------------------
Why wouldn't you be able to add "..."?
echo mb_strlen($str, 'utf-8') > 10 ? mb_substr($str, 0, 10, 'utf-8') . '...' : $str;
------ Solution --------------------
For adding "...", see the reply on the 12th floor.

If you really have to change this function: UTF-8 encoding is quite regular. Apart from ASCII,
the first byte of a character starts with 11, and the number of consecutive 1 bits gives the total number of bytes; every following byte starts with 10.
Chinese characters are basically all in the three-byte range.
Once you know this rule, writing the function is easy.
Up to U+007F      0xxxxxxx
Up to U+07FF      110xxxxx 10xxxxxx
Up to U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
Up to U+1FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Up to U+3FFFFFF   111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Up to U+7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
(The 5- and 6-byte forms come from the original UTF-8 design; current UTF-8 stops at 4 bytes and U+10FFFF.)



Discussion

Can this function be changed to work with UTF-8? mb_substr() does not seem to be able to append "..." when the string has been cut short, and that affects how the result looks. Please help me solve this.

------ Solution --------------------
You can use the mb_strimwidth() function.
