Problems with Chinese character replacement and pattern matching in PHP

For the past two days I have been writing a keyword-highlighting program. It ran fine in local tests, but on a real page it produced a pile of garbled text: never mind highlighting, the page was simply unreadable. I went hunting for the bug and found that English text caused no trouble, while Chinese characters sometimes did.

To sum up:

The issue shows up with the pattern-matching functions, such as preg_match_all($pat, ...) and preg_replace($pat, ...).

A case that is likely to cause problems:
preg_match_all("/(汉字)+/ism", "我是汉字，看你把我怎么样！", $m_a);
This simple pattern matches 汉字 ("Chinese characters"); the subject string reads "I am a Chinese character, see what you can do to me!". A pattern that contains Chinese characters like this can match successfully, but don't celebrate yet: the result is not reliable. Read on to see why.

A case that is certain to cause problems:
preg_match_all("/[汉字]+/ism", "我是汉字，看你把我怎么样！", $m_a);
The intent is to match any run of 汉 ("Han"), 字 ("character"), or 汉字 ("Chinese characters"). This always goes wrong: the matches come back as a mass of garbled text, and the program may even fall into an infinite loop. Why? PHP does not work in Unicode internally and these functions are not multibyte-aware, so 汉字 is treated as four separate single-byte characters (two bytes per Chinese character in an encoding such as GB2312) rather than as two characters. The character class therefore matches any one of those bytes on its own. You can hardly blame it for getting things wrong!
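The byte-level behavior described above can be reproduced in a small sketch. It assumes the text is GB2312-encoded and writes 汉 and 字 as raw byte escapes so that the result does not depend on the encoding of the source file. The variable names and the test input are hypothetical and chosen only for illustration.

```php
<?php
// 汉 and 字 as raw GB2312 bytes (assumption: the page is GB2312/GBK encoded).
$han = "\xBA\xBA"; // 汉
$zi  = "\xD7\xD6"; // 字

// Without the /u modifier, PCRE works byte by byte, so [汉字] is really
// a class containing the single bytes 0xBA, 0xD7 and 0xD6.
$pattern = "/[" . $han . $zi . "]+/";

// Hypothetical input: one byte from 字 followed by one byte from 汉.
// It contains neither 汉 nor 字 as a whole character.
$subject = "\xD6\xBA";

$count = preg_match($pattern, $subject, $m);

echo strlen($han . $zi), "\n";                // 4 (bytes, not characters)
echo $count, " ", bin2hex($m[0] ?? ""), "\n"; // 1 d6ba
```

The class matches a byte sequence that contains neither character, which is how unrelated text ends up in the results.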

Later I tried writing the pattern differently and found a form that seemed to solve the problem (why only "seemed" will become clear further down):
preg_match_all("/(汉|字)+/ism", "我是汉字，看你把我怎么样！", $m_a);

Written this way, the pattern matches 汉 ("Han"), 字 ("character"), or 汉字 ("Chinese characters"), and the result in $m_a is:

Array
(
    [0] => Array
        (
            [0] => 汉字
        )

    [1] => Array
        (
            [0] => 字
        )

)
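As a rough check of the alternation form, the sketch below repeats the match with 汉 and 字 written as raw GB2312 byte escapes (an assumption about the encoding) and a hypothetical subject string. It also shows that alternation is still compared byte by byte: a byte pair that straddles two other characters is reported as 汉.

```php
<?php
$han = "\xBA\xBA"; // 汉 in GB2312
$zi  = "\xD7\xD6"; // 字 in GB2312

// Alternation keeps each two-byte character together inside the pattern.
$pattern = "/(" . $han . "|" . $zi . ")+/";

// Hypothetical subject: ASCII text around the word 汉字.
$subject = "abc " . $han . $zi . " xyz";
preg_match_all($pattern, $subject, $m_a);

echo bin2hex($m_a[0][0]), "\n"; // babad7d6 (汉字, the full match)
echo bin2hex($m_a[1][0]), "\n"; // d7d6     (字, the last repetition of the group)

// Two other two-byte sequences, 0xD6BA and 0xBAD7, placed side by side.
// Neither is 汉, yet the middle bytes 0xBA 0xBA match at byte offset 1.
$other = "\xD6\xBA\xBA\xD7";
preg_match($pattern, $other, $m, PREG_OFFSET_CAPTURE);

echo $m[0][1], "\n"; // 1 (an odd offset, i.e. in the middle of a character)
```

The second match is the kind of false positive that makes this pattern unreliable on real text.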

Look at that, every expected match shows up! But I was happy too soon: in real use it still went wrong quite often. I kept digging and finally found the root cause. PHP's ordinary string and pattern functions do not support multibyte text, so matching and string manipulation operate on the encoded bytes of the text rather than on whole characters (I am not sure this explanation is exactly right). For example:

eregi_replace("性", "无", "有责任感"); This call is meant to replace every 性 in the string 有责任感 ("a sense of responsibility") with 无 ("no"). What is the final result? Since the string 有责任感 contains no 性, nothing should be replaced and the call should return 有责任感 unchanged. Instead, the result is a garbled string!

Surprised? Look at the byte values. In GB2312 each Chinese character takes two bytes, and 有责任感 is encoded as: 211,208 (有), 212,240 (责), 200,206 (任), 184,208 (感)

The code for 性 is 208,212, which is exactly the second byte of 有 followed by the first byte of 责. PHP finds that byte sequence and treats it as a match, so it splits the two Chinese characters in half and splices the replacement between the leftover halves. Hence the garbage!
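The collision can be reproduced with the byte values listed above. This is a minimal sketch that uses strpos() and str_replace() instead of eregi_replace(), because the ereg functions were removed in PHP 7.0. The replacement bytes 0xCE 0xDE (无 in GB2312) are an assumption made for illustration, and the variable names are hypothetical.

```php
<?php
// 有责任感 as GB2312 bytes: 211,208  212,240  200,206  184,208
$subject = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0";
// 性 as GB2312 bytes: 208,212
$search  = "\xD0\xD4";
// Assumed replacement: 无 as GB2312 bytes 206,222
$replace = "\xCE\xDE";

// The byte search finds 性 at offset 1: second byte of 有 + first byte of 责.
$offset = strpos($subject, $search);

// The byte-level replacement cuts both characters in half.
$result = str_replace($search, $replace, $subject);

echo $offset, "\n";          // 1
echo bin2hex($result), "\n"; // d3cedef0c8ceb8d0
```

Only the last four bytes (任感) survive intact; the first four no longer form the original characters.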

At the time I assumed that str_replace(), the function I use most, would be safe, but in fact str_replace() makes exactly the same mistake on the same operation! Looking back, I was simply lucky with my earlier Chinese replacements, probably because the strings being replaced were fairly long, which makes this kind of collision less likely. Even if you have not hit the problem yet, you should know that this is not safe!

Problems or not, the work has to go on, and difficulties are there to be overcome.

Fortunately I remembered a PHP extension, the Multibyte String functions (mbstring), which adds a number of functions that support multibyte text, such as mb_ereg_replace() as the counterpart of ereg_replace(). For details on specific functions, please consult the relevant documentation.

Conclusion: to operate on Chinese characters safely, it is best to use the Multibyte String functions.
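As an illustration of the conclusion, the sketch below compares a byte search with mb_strpos() on the same GB2312 bytes. It assumes the mbstring extension is loaded and that the text is GB2312 (named 'EUC-CN' in mbstring). The helper replace_in_gb2312() is hypothetical: instead of mb_ereg_replace() it converts to UTF-8, where a byte-level str_replace() cannot match in the middle of a character, and then converts back. The replacement bytes for 无 are likewise an assumption.

```php
<?php
$subject = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0"; // 有责任感 in GB2312
$search  = "\xD0\xD4";                         // 性 in GB2312
$replace = "\xCE\xDE";                         // 无 in GB2312 (assumed)

// Byte-oriented search: false positive in the middle of two characters.
var_dump(strpos($subject, $search));                 // int(1)

// Character-aware search: 性 does not occur in 有责任感.
var_dump(mb_strpos($subject, $search, 0, 'EUC-CN')); // bool(false)

// Hypothetical helper: do the replacement on UTF-8 copies, then convert back.
function replace_in_gb2312(string $search, string $replace, string $subject): string
{
    $enc = 'EUC-CN';
    $out = str_replace(
        mb_convert_encoding($search, 'UTF-8', $enc),
        mb_convert_encoding($replace, 'UTF-8', $enc),
        mb_convert_encoding($subject, 'UTF-8', $enc)
    );
    return mb_convert_encoding($out, $enc, 'UTF-8');
}

$safe = replace_in_gb2312($search, $replace, $subject);
echo bin2hex($safe), "\n"; // d3d0d4f0c8ceb8d0 (unchanged, as expected)
```

Passing the encoding explicitly avoids relying on the configured internal encoding, which may differ between servers.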
