Matching UNICODE character codes using regular expressions in PHP

Source: Internet
Author: User
Tags php regular expression

A user named ainfa asked the following question:

The PHP code is as follows:
// Note: "Hello, we" is the translated rendering of the non-ASCII text at the end of the original string.
$words = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSRUVWXYZ!@#$%^&*()_+-=[]\,./{}|<>?'\"Hello, we";
$otherStr = preg_replace("/[chr(128)-chr(256)]+/is", "", $words);
echo 'otherStr:', $otherStr;

Why is the printed result:
otherStr:!#$%&{}|'"Hello, we

What does the regular expression /[chr(128)-chr(256)]+/is mean?
If /[chr(128)-chr(256)]+/is refers to characters with ASCII codes from 128 to 256, why are characters such as a-z and A-Z replaced? Their ASCII codes are below 127.
Even more puzzling, the characters "#", "$", "%", "&", "!", "{", "}", "|", "'" and the double quote also have ASCII codes in the range 0-127, yet they are not replaced.
Stranger still, if the regular expression is changed to "/[chr(128)-chr(256)]+/s", the output becomes: otherStr:defgijklmnopqstuvwxyz!#$%&{}|'"Hello, we
Merely removing the i modifier from the regular expression changes the result. I cannot understand it at all.
Does anyone have an explanation?
Appendix: ASCII code table
(The image of the ASCII code table is not reproduced here.)

In the replies, one user pointed out that chr(128) is not evaluated inside the pattern and offered an alternative solution. That answer is correct, but, leaving aside whether the user understood the underlying reason, it did not explain the cause of the error.

CFC4N:

PHP's preg functions (preg_match, preg_replace and so on) use the PCRE regular expression engine. In this code, the pattern that the PCRE engine receives is /[chr(128)-chr(256)]+/is. What does the engine do with it?
In a PHP regular expression, the letters after the closing delimiter are called pattern modifiers. They tell the engine how to parse and process the pattern. The i modifier makes matching case-insensitive. The s modifier enables "dot-all" mode, in which the metacharacter . also matches line breaks. It affects only the dot, so in this user's pattern, which contains no dot, the s modifier has no effect.
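The effect of the two modifiers can be checked in isolation. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the original question.

```php
<?php
// The i modifier: letters match regardless of case.
$withI    = preg_replace('/[a-c]+/i', '', 'abcABCxyz');
$withoutI = preg_replace('/[a-c]+/',  '', 'abcABCxyz');

// The s modifier only changes what the dot matches.
$dotPlain = preg_match('/a.b/',  "a\nb");  // dot does not match a newline
$dotAll   = preg_match('/a.b/s', "a\nb");  // dot matches the newline

// A pattern without a dot behaves the same with or without s.
$sameWithS = preg_replace('/[a-c]+/s', '', "abc\nxyz")
         === preg_replace('/[a-c]+/',  '', "abc\nxyz");

var_dump($withI);     // string(3) "xyz"
var_dump($withoutI);  // string(6) "ABCxyz"
var_dump($dotPlain);  // int(0)
var_dump($dotAll);    // int(1)
var_dump($sameWithS); // bool(true)
```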

Find the cause:
Let's analyze the user's pattern, [chr(128)-chr(256)]+, and see how the PCRE engine interprets it. First, in a regular expression [] denotes a character class. Inside a character class almost nothing is a metacharacter: apart from the hyphen -, which denotes a range (and a few others such as ^, \ and ]), every character is an ordinary literal. Even a hyphen is an ordinary character if it comes first or cannot form a range between two characters. The user meant chr(128) to stand for the character with code 128 (strictly speaking, ASCII covers only 0-127, so codes above that should not be called ASCII). But PHP does not call functions inside a string literal, so within the pattern chr(128) is simply the eight characters c, h, r, (, 1, 2, 8, ) (the commas are there only for readability). So what range does the hyphen in this pattern describe? It joins the characters on either side of it, giving the range [)-c]. The ASCII code of ")" is 0x29, that is, 41 in decimal; the ASCII code of "c" is 0x63, that is, 99 in decimal. The range therefore runs from ASCII 41 (chr(41)) to ASCII 99 (chr(99)). In other words, the user's character class is equivalent to [hr)-c(]: chr(41) through chr(99), plus the three characters "h", "r" and "(", which lie outside that range.
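The equivalence can be checked mechanically by testing every ASCII character against both classes. This is only a sketch: the variable names are illustrative, and the loop covers codes 0-127 only.

```php
<?php
$userClass  = '/[chr(128)-chr(256)]/';  // the pattern exactly as PCRE receives it
$equivalent = '/[hr)-c(]/';             // "h", "r", "(" plus the range ")" to "c"

$matched = '';
$sameEverywhere = true;
for ($code = 0; $code < 128; $code++) {
    $ch = chr($code);  // here chr() really is called by PHP
    $a = preg_match($userClass, $ch);
    $b = preg_match($equivalent, $ch);
    if ($a !== $b) {
        $sameEverywhere = false;
    }
    // Collect the printable characters that the user's class matches.
    if ($a === 1 && $code >= 32 && $code < 127) {
        $matched .= $ch;
    }
}

echo $matched, "\n";       // "(" , then chr(41) through chr(99), then "h" and "r"
var_dump($sameEverywhere); // bool(true)
```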
In the first test the i modifier is present, so the characters between chr(41) and chr(99) are matched case-insensitively. That range contains every uppercase letter, so both cases of every letter match and are replaced with the empty string. In the second test the i modifier is removed and matching is case-sensitive. The range covers lowercase letters only up to "c", and apart from that only the lowercase letters "h" and "r" are in the class, so the result still contains "defgijklmnopqstuvwxyz". That is why the two results differ.
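The difference between the two tests can be reproduced on the alphabet alone. A small sketch, with an illustrative input string:

```php
<?php
$class    = '[chr(128)-chr(256)]+';
$alphabet = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// With i: A-Z lies inside the range ")" to "c", so a-z matches too.
$caseInsensitive = preg_replace('/' . $class . '/is', '', $alphabet);

// Without i: only a, b, c, h, r and the uppercase letters match.
$caseSensitive = preg_replace('/' . $class . '/s', '', $alphabet);

var_dump($caseInsensitive); // string(0) ""
var_dump($caseSensitive);   // string(21) "defgijklmnopqstuvwxyz"
```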

[Figure omitted: the expression equivalent to the user's pattern]

Solution:
The cause of the error has been found. How can the problem be solved?
Let's look at what the user wants: to match the characters chr(128) to chr(255) (ASCII covers only 0-127; codes beyond that are better described as Unicode code points). Regular expressions have two notations for matching a character by its hexadecimal code: \u and \x{}. \u must be followed by exactly four hexadecimal digits, while \x{} accepts any number of hexadecimal digits inside the braces. Note that PCRE, and therefore PHP, supports only the \x{} form; \u belongs to other flavors such as JavaScript and Java.
So how should this regular expression be written?

The user's goal is chr(128) to chr(255), which is written [\u0080-\u00FF] in flavors that support \u, or [\x{0080}-\x{00FF}] in PCRE.
The aim is to match the characters marked with a red box in the ASCII table image (not reproduced here).
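As a rough illustration of the \x{} form, the sketch below strips code points U+0080 to U+00FF from a sample string. The sample text is an assumption made for this example, and it is built with PHP 7+ "\u{...}" string escapes, which are a PHP string feature and not regex syntax. The last call shows, as far as I know, that PHP's PCRE refuses to compile \u inside a pattern.

```php
<?php
// "café naïve ÿ!" written with PHP 7+ string escapes to keep the source ASCII.
$text = "caf\u{00E9} na\u{00EF}ve \u{00FF}!";

// \x{...} with the u modifier compares by Unicode code point.
$stripped = preg_replace('/[\x{0080}-\x{00FF}]+/u', '', $text);
var_dump($stripped); // string(10) "caf nave !"

// The \u form fails to compile in PCRE, so preg_replace() returns null
// (the warning is suppressed with @ for this demonstration only).
$withBackslashU = @preg_replace('/[\u0080-\u00FF]+/u', '', $text);
var_dump($withBackslashU); // NULL
```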



Note: in PHP, the u modifier is required for a regular expression to match Unicode characters; it makes PCRE treat both the pattern and the subject string as UTF-8.
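What the u modifier changes can be seen by running the same class with and without it. A minimal sketch, assuming a UTF-8 subject; the sample string is illustrative and uses a PHP 7+ "\u{...}" escape.

```php
<?php
// "A", then U+4F60 (UTF-8 bytes E4 BD A0), then "B".
$text = "A\u{4F60}B";

// Without u the subject is treated as raw bytes, so each of the
// three bytes 0xE4, 0xBD, 0xA0 falls inside 0x80..0xFF and is removed.
$byteMode = preg_replace('/[\x{80}-\x{FF}]+/', '', $text);

// With u the subject is decoded as UTF-8 and compared by code point;
// U+4F60 is far above U+00FF, so nothing is removed.
$utf8Mode = preg_replace('/[\x{80}-\x{FF}]+/u', '', $text);

var_dump($byteMode);           // string(2) "AB"
var_dump($utf8Mode === $text); // bool(true)
```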
The PHP code with the corrected regular expression is as follows:
// Note: "Hello, we" is the translated rendering of the non-ASCII text at the end of the original string.
$words = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSRUVWXYZ!@#$%^&*()_+-=[]\,./{}|<>?'\"Hello, we";
$otherStr = preg_replace('/[\x{0080}-\x{00FF}]+/iu', '', $words);
echo 'otherStr:', $otherStr;

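Running it still prints the whole string unchanged. Why? Because none of the characters in the string fall in the range chr(128) to chr(255): the ASCII characters are all below 128, and the non-ASCII text at the end lies above 255.
(When testing, make sure the file is saved as UTF-8.)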
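The unchanged result can be reproduced with a shorter string. In this sketch the two Chinese characters U+4F60 and U+597D are an assumed stand-in for the non-ASCII tail of the original string, and the second pattern is a hypothetical alternative (not from the original answer) for stripping every non-ASCII character.

```php
<?php
// ASCII text followed by two Chinese characters, written with PHP 7+ escapes.
$words = "abcXYZ!#\u{4F60}\u{597D}";

// The corrected pattern from the article: nothing in $words is in U+0080..U+00FF.
$narrow = preg_replace('/[\x{0080}-\x{00FF}]+/iu', '', $words);
var_dump($narrow === $words); // bool(true)

// Hypothetical alternative: remove every code point outside 0x00..0x7F.
$wide = preg_replace('/[^\x00-\x7F]+/u', '', $words);
var_dump($wide); // string(8) "abcXYZ!#"
```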
The above is only my humble opinion; criticism and corrections are welcome.
