Chinese character replacement and pattern matching in PHP

Source: Internet
Author: User

Original post address: http://www.anbbs.com/anbbs/index.php?F_id=3&page=1
Over the past two days I have been working on a keyword-highlighting program. It tested fine locally, but as soon as it went live it produced a pile of garbled text, never mind any highlighting. It was unbearable to look at!

Hunting for the error, I found that English text caused no problems. The problems came from Chinese characters, and only some of the time.

Summary:

Take functions such as preg_match_all($pat, ...) and preg_replace($pat, ...) as examples.

A pattern that may cause problems:
preg_match_all("/(汉字)+/ism", $text, $m_a);   // $text is a Chinese sentence containing 汉字 ("Chinese characters")
The intent is to match 汉字. This pattern can match 汉字 successfully, but do not celebrate too early: the result is not reliable. The reason is explained further down.

A pattern that is certain to cause problems:
preg_match_all("/[汉字]+/ism", $text, $m_a);
The intent was to match 汉, 字, or 汉字. This goes wrong: it matches a pile of garbled fragments and may even end up in an endless loop. Why? PHP strings are byte sequences rather than Unicode text, and these functions are not multi-byte aware. The two-character string 汉字 is therefore treated as four separate single-byte characters, and the character class matches any one of those bytes. The strange thing would be if nothing went wrong!
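The byte-level behavior of the character class can be reproduced with a small sketch. It assumes the text is GBK/GB2312 encoded and writes the bytes as hex escapes so that the result does not depend on how the script file itself is saved. The character 好 is only an illustrative choice: it happens to share its first byte with 汉.

```php
<?php
// Assumption: GBK/GB2312 bytes, written as hex escapes.
$hanzi = "\xBA\xBA\xD7\xD6"; // 汉字 (汉 = BA BA, 字 = D7 D6)
$hao   = "\xBA\xC3";         // 好, a different character whose first byte is BA

// Without the /u modifier PCRE works byte by byte, so this class is
// really the set of single bytes {BA, D7, D6}.
$pattern = "/[" . $hanzi . "]+/";

preg_match_all($pattern, $hao, $m);

// Only the first byte of 好 is matched: half a character.
echo bin2hex($m[0][0]), "\n"; // prints "ba"
echo strlen($m[0][0]), "\n";  // prints "1"
```

The subject contains neither 汉 nor 字, yet the pattern reports a match, and the matched text is not a valid character on its own.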

Later I tried rewriting the pattern and found a form that appeared to work (why only "appeared" is explained later):
preg_match_all("/(汉|字)+/ism", $text, $m_a);

This matches 汉, 字, or 汉字. The result in $m_a:

Array
(
    [0] => Array
        (
            [0] => 汉字
        )

    [1] => Array
        (
            [0] => 字
        )

)
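The shape of that result can be checked with a minimal sketch. It assumes GBK/GB2312 bytes written as hex escapes, and the subject 我是汉字 is a shortened, hypothetical stand-in for the sentence used in the post. Index 0 holds the whole match, while index 1 holds only the last repetition of the group, which is why it contains 字 alone.

```php
<?php
// Assumption: GBK/GB2312 bytes, written as hex escapes.
$han     = "\xBA\xBA";                          // 汉
$zi      = "\xD7\xD6";                          // 字
$subject = "\xCE\xD2\xCA\xC7\xBA\xBA\xD7\xD6";  // 我是汉字 (hypothetical sample)

// Alternation keeps each character's two bytes together as one unit.
preg_match_all("/(" . $han . "|" . $zi . ")+/", $subject, $m_a);

echo bin2hex($m_a[0][0]), "\n"; // prints "babad7d6" (汉字, the full match)
echo bin2hex($m_a[1][0]), "\n"; // prints "d7d6" (字, last repetition of the group)
```

The two bytes of each character now stay together, but the search still scans byte by byte, so a match can still begin in the middle of a character in other subjects.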

As you can see, the whole string is matched! But I was happy too soon: problems still appeared in actual use. I went looking again and finally found the root of the problem. PHP's ordinary string and regex functions do not support multi-byte text, so pattern matching and string operations are carried out on the encoded bytes (I am not sure this explanation is entirely correct). For example:

eregi_replace("性", "无", "有责任感"); This call is meant to replace the character 性 in the string 有责任感 with 无. (The ereg family was deprecated in PHP 5.3 and removed in PHP 7.0.) What should the result be? Since 有责任感 does not contain 性, no replacement should take place and the result should still be 有责任感. The actual result, however, is a different, corrupted string!

I did not expect that. Why does it happen? Look at the byte values and it becomes clear. In a double-byte encoding such as GB2312/GBK each Chinese character occupies two bytes. The bytes of 有责任感 are, in order: 211,208 (有), 212,240 (责), 200,206 (任), 184,208 (感).

The encoding of 性 is 208,212, which is exactly the 2nd byte of 有 followed by the 1st byte of 责! PHP finds that byte sequence, treats it as a match, splits two Chinese characters in half, and splices the replacement string in between. That is where the error comes from.

At the time I assumed that the most commonly used function, str_replace(), would be safe, but in fact str_replace() produces the same error for the same operation. Looking back, I was simply lucky with my earlier Chinese replacements. They probably involved long runs of Chinese characters, where this kind of accidental match is unlikely. Even when nothing goes wrong, you should know that this approach is not safe!
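The byte arithmetic above can be verified directly. This is a minimal sketch using the byte values quoted in the post, written as hex escapes (211,208 = D3 D0 and so on). The replacement "--" is a hypothetical ASCII marker chosen so that the damage is easy to see in the hex dump.

```php
<?php
// Byte values taken from the post, assuming GBK/GB2312 encoding.
$subject = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0"; // 有责任感
$search  = "\xD0\xD4";                         // 性
$replace = "--";                               // hypothetical marker (2D 2D)

// strpos() compares bytes, so it "finds" 性 at byte offset 1:
// the 2nd byte of 有 followed by the 1st byte of 责.
var_dump(strpos($subject, $search)); // int(1)

// str_replace() then cuts both characters in half.
$result = str_replace($search, $replace, $subject);
echo bin2hex($result), "\n"; // prints "d32d2df0c8ceb8d0"
```

After the replacement, the leading D3 and the F0 that follows the marker are orphaned halves of 有 and 责, which is what shows up as garbled text on the page.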

The problem is identified, but the work still has to go on, so the difficulty has to be overcome.

I remembered one of PHP's extension modules, Multibyte String Functions (mbstring). It adds many functions that support multi-byte text operations, for example mb_ereg_replace() as the counterpart of ereg_replace(). For descriptions of the individual functions, see the PHP manual.

Conclusion: for Chinese characters, it is best to use the Multibyte String Functions.
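As a rough illustration of that conclusion, the sketch below repeats the 性 / 有责任感 search with mbstring. It assumes the mbstring extension (including its regex functions) is enabled, and it uses the encoding name CP936, which is mbstring's name for GBK. Converting to UTF-8 first is one possible approach, not something the original post prescribes.

```php
<?php
// Assumption: mbstring is enabled; input bytes are GBK (CP936 in mbstring).
$subjectGbk = "\xD3\xD0\xD4\xF0\xC8\xCE\xB8\xD0"; // 有责任感
$searchGbk  = "\xD0\xD4";                         // 性

// Byte-based search: false positive at byte offset 1.
var_dump(strpos($subjectGbk, $searchGbk)); // int(1)

// Character-aware search: told the encoding, mbstring compares whole characters.
var_dump(mb_strpos($subjectGbk, $searchGbk, 0, 'CP936')); // bool(false)

// Alternative: convert to UTF-8 once, then use the mb_* regex functions.
$subject = mb_convert_encoding($subjectGbk, 'UTF-8', 'CP936');
$search  = mb_convert_encoding($searchGbk, 'UTF-8', 'CP936');

mb_regex_encoding('UTF-8');
$result = mb_ereg_replace($search, '--', $subject);

var_dump($result === $subject);          // bool(true): nothing was replaced
echo mb_strlen($subject, 'UTF-8'), "\n"; // prints "4" (four characters)
```

The key design point is that the encoding is stated explicitly on every call, so the result does not depend on server defaults that may differ between a local machine and the live site.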
