PHP-PCRE regular expression escape sequence (backslash)

The backslash character has several uses. First, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of the backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a "*" character, write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric character with a backslash when it should stand for itself. In particular, to match a backslash, write "\\" in the pattern.
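As a minimal sketch of the rule above (the subject strings and variable names are made up for illustration), the following compares an unescaped and an escaped asterisk. Single-quoted PHP strings are used so that "\*" reaches the regular expression engine unchanged.

```php
<?php
// Unescaped, "*" is a quantifier: "2*3" means "zero or more 2s, then a 3".
preg_match('/2*3/', '2*3', $unescaped);   // $unescaped[0] is "3"

// Escaped, "\*" matches a literal asterisk.
preg_match('/2\*3/', '2*3', $escaped);    // $escaped[0] is "2*3"

// Escaping a non-alphanumeric character that has no special meaning is harmless.
$harmless = preg_match('/a\@b/', 'a@b');  // 1
```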

Note:

The backslash also has a special meaning in single-quoted and double-quoted PHP strings. Therefore, to match a single backslash, the regular expression \\ must be written in PHP code as "\\\\" or '\\\\'. Consider "/\\/": PHP first processes it as a string literal and unescapes the backslash, so the pattern handed to the regular expression engine is /\/. The engine also treats "\" as an escape character, so it escapes the closing delimiter / and the pattern fails with an error. This is why four backslashes are needed to match one backslash.
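The two levels of escaping can be checked with a short sketch. The subject string is a made-up example, and the "@" operator is used only to silence the warning that the broken pattern is expected to raise.

```php
<?php
// In single quotes "\t" is not an escape, so this is C:\temp with one backslash.
$subject = 'C:\temp';

// Four backslashes in PHP source become two in the string (/\\/),
// which the regex engine reads as one literal backslash.
$pattern = '/\\\\/';
$found = preg_match($pattern, $subject, $m);   // 1, and $m[0] is a single backslash

// Two backslashes in PHP source become one in the string (/\/),
// which escapes the closing delimiter, so compilation fails.
$broken = @preg_match('/\\/', $subject);       // false
```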

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) is ignored, as are all characters from an unescaped # to the end of the line. To include a whitespace character or # in such a pattern, escape it with a backslash.
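In PHP, PCRE_EXTENDED corresponds to the x pattern modifier. The sketch below, with made-up subject strings, shows whitespace and comments being ignored while an escaped space and an escaped # are matched literally. Note that the comments inside the pattern must not contain the delimiter character.

```php
<?php
// Without x the space is significant; with x it is ignored.
$plain    = preg_match('/a b/', 'ab');    // 0
$extended = preg_match('/a b/x', 'ab');   // 1

// A commented pattern: escaped space and escaped hash are matched literally.
$pattern = '/
    \d+    # one or more digits
    \ \#   # a literal space, then a literal hash sign
    \d+    # more digits
/x';
preg_match($pattern, 'item 12 #34', $m);  // $m[0] is "12 #34"
```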

The second use of the backslash provides a way of encoding non-printing characters in patterns in a visible manner. Apart from the binary zero that terminates a pattern, there is no restriction on non-printing characters appearing literally. However, when a pattern is prepared in a text editor, it is usually easier to use one of the following escape sequences than the binary character it represents.

\a

Bell character (hexadecimal 07)

\cx

"Control-x", where x is any character

\e

Escape (hexadecimal 1B)

\f

Form feed (hexadecimal 0C)

\n

Newline (hexadecimal 0A)

\p{xx}

A character with the xx property

\P{xx}

A character without the xx property

\r

Carriage return (hexadecimal 0D)

\t

Horizontal tab (hexadecimal 09)

\xhh

Character with hexadecimal code hh

\ddd

Character with octal code ddd, or a back reference

The exact effect of \cx is as follows: if x is a lowercase letter, it is converted to uppercase. Then bit 6 of the character (hexadecimal 40) is inverted. Thus \cz becomes hexadecimal 1A, but \c{ becomes hexadecimal 3B, while \c; becomes hexadecimal 7B.
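The rule can be reproduced with plain byte arithmetic. In this sketch, controlChar() is a hypothetical helper written for illustration and is not part of PHP or PCRE; it uppercases the character and then flips the bit with value hexadecimal 40.

```php
<?php
// Hypothetical helper mirroring the \cx rule: uppercase, then invert the hex 40 bit.
function controlChar(string $x): string
{
    return chr(ord(strtoupper($x)) ^ 0x40);
}

$fromZ     = controlChar('z');   // "\x1A"
$fromBrace = controlChar('{');   // "\x3B"
$fromSemi  = controlChar(';');   // "\x7B"

// The engine agrees for the common case: \cz matches the byte 1A.
$engine = preg_match('/\cz/', "\x1A");   // 1
```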

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.
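A short sketch of the hexadecimal escapes follows. UTF-8 mode is enabled in PHP with the u modifier, and the subjects are written as raw bytes so the example does not depend on the encoding of the source file.

```php
<?php
// Two hex digits, in either case, name a single character.
$upper = preg_match('/\x41/', 'A');   // 1
$lower = preg_match('/\x4a/', 'J');   // 1

// In UTF-8 mode, \x{...} names a code point: U+20AC is the euro sign (bytes E2 82 AC).
$euro = preg_match('/\A\x{20AC}\z/u', "\xE2\x82\xAC");   // 1

// In UTF-8 mode, \xE9 means U+00E9, which is the two-byte sequence C3 A9.
$eAcute = preg_match('/\A\xE9\z/u', "\xC3\xA9");         // 1
```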

After "\0", up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses (subgroups) in the expression, the entire sequence is taken as a back reference. How back references work is described later, after the discussion of parenthesized subgroups.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subgroups, PCRE re-reads up to three octal digits following the backslash and generates a single byte from the least significant eight bits of the value. Any subsequent digits stand for themselves. For example:

\040

Another way of writing a space

\40

Also a space, provided there are fewer than 40 previous capturing subgroups

\7

Always a back reference

\11

Either a back reference or another way of writing a tab

\011

Always a tab

\0113

A tab followed by the character "3" (because at most three octal digits are read at a time)

\113

The character with octal code 113

\377

Octal 377 is decimal 255, so this is a byte consisting entirely of 1 bits

\81

Either a back reference, or a binary zero followed by the two characters "8" and "1" (because 8 is not a valid octal digit)

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
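Several of the cases listed above can be checked directly. This is a small sketch with made-up subject strings; the patterns are single-quoted so the backslash sequences reach the regular expression engine rather than being interpreted by PHP.

```php
<?php
$space    = preg_match('/\040/', ' ');            // 1: octal 40 is a space
$space40  = preg_match('/\40/', ' ');             // 1: no capturing subgroups, so octal
$tab      = preg_match('/\011/', "\t");           // 1: always a tab
$tabThen3 = preg_match('/\A\0113\z/', "\t3");     // 1: tab, then a literal "3"
$octalK   = preg_match('/\113/', 'K');            // 1: octal 113 is "K"

// With a capturing subgroup present, \1 is a back reference to it.
$backref     = preg_match('/(a)\1/', 'aa');       // 1
$backrefFail = preg_match('/(a)\1/', 'ab');       // 0
```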

All the sequences that define a single-byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hexadecimal 08). Outside a character class it has a different meaning (described below).

The third use of the backslash is to specify generic character types:

\d

Any decimal digit

\D

Any character that is not a decimal digit

\h

Any horizontal whitespace character (since PHP 5.2.4)

\H

Any character that is not a horizontal whitespace character (since PHP 5.2.4)

\s

Any whitespace character

\S

Any character that is not a whitespace character

\v

Any vertical whitespace character (since PHP 5.2.4)

\V

Any character that is not a vertical whitespace character (since PHP 5.2.4)

\w

Any word character

\W

Any non-word character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.

A word character is any letter, digit, or the underscore character, that is, any character that can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. Each one matches a single character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, because there is no character left to match.
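The following sketch exercises a few of the character types, assuming the default character tables and no UTF-8 mode. The loop checks the partition property for \d and \D over all 256 byte values; the subject strings are made up for illustration.

```php
<?php
preg_match_all('/\d/', 'a1 b22', $digits);             // $digits[0] is ["1", "2", "2"]
$collapsed = preg_replace('/\s+/', ' ', "a \t\n b");   // "a b"
$words = preg_split('/\W+/', 'foo, bar_baz!', -1, PREG_SPLIT_NO_EMPTY); // ["foo", "bar_baz"]

$tabIsHorizontal   = preg_match('/\h/', "\t");         // 1
$newlineIsVertical = preg_match('/\v/', "\n");         // 1

// At the end of the subject there is no character left, so \w fails.
$atEnd = preg_match('/x\w/', 'x');                     // 0

// Every byte matches exactly one of \d and \D.
$partitioned = true;
for ($byte = 0; $byte < 256; $byte++) {
    $ch = chr($byte);
    $isDigit    = preg_match('/\A\d\z/', $ch) === 1;
    $isNonDigit = preg_match('/\A\D\z/', $ch) === 1;
    if ($isDigit === $isNonDigit) {
        $partitioned = false;
    }
}
```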

The fourth use of the backslash is for certain simple assertions. An assertion specifies a condition that must be met at a particular point in a match, without consuming any characters from the subject string. More complicated assertions that use subgroups are described later. The backslashed assertions are:

\b

Word boundary

\B

Not a word boundary

\A

Start of the subject (independent of multiline mode)

\Z

End of the subject, or the newline at the end of the subject (independent of multiline mode)

\z

End of the subject (independent of multiline mode)

\G

First matching position in the subject

These assertions may not appear in character classes (but note that "\b" has a different meaning inside a character class, namely the backspace character).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or both match \W (that is, one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
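As a brief illustration with made-up subjects, the sketch below contrasts \b and \B outside a character class with \b inside one, where it stands for the backspace character.

```php
<?php
// "cat" as a whole word occurs twice; the "cat" inside "concat" has no boundary before it.
$wholeWords = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);   // 2

// \B requires that the position is NOT a word boundary.
$inside = preg_match('/\Bcat/', 'concat');                          // 1

// Inside a character class, \b is the backspace character (hex 08).
$backspace   = preg_match('/[\b]/', "\x08");                        // 1
$notBoundary = preg_match('/[\b]/', 'a b');                         // 0
```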

The \A, \Z, and \z assertions differ from the traditional ^ and $ (described below) in that they only ever match at the very start and end of the subject string, regardless of pattern modifiers. They are not affected by the PCRE_MULTILINE or PCRE_DOLLAR_ENDONLY options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the very end of the string, whereas \z matches only at the very end.
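A small sketch of these differences, using a made-up two-line subject that ends in a newline (PCRE_MULTILINE is the m modifier in PHP):

```php
<?php
$subject = "one\ntwo\n";

// With the m modifier, ^ matches after any newline, but \A is still only the very start.
$caretMultiline = preg_match('/^two/m', $subject);     // 1
$startMultiline = preg_match('/\Atwo/m', $subject);    // 0

// \Z also matches just before a final newline; \z only at the very end.
$upperZ     = preg_match('/two\Z/', $subject);         // 1
$lowerZ     = preg_match('/two\z/', $subject);         // 0
$lowerZFull = preg_match('/two\n\z/', $subject);       // 1
```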

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the $offset argument of preg_match(). It differs from \A when the value of $offset is non-zero. Note: with preg_match_all(), \G asserts at each match attempt that the current position is where that attempt started, whereas \A asserts that the position is the very start of the subject string.
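The sketch below, with made-up subjects, shows \G tracking the offset passed to preg_match() and, with preg_match_all(), restricting matches to a contiguous run from the start of the subject.

```php
<?php
$subject = 'abcdef';

// With offset 2, \G is true at position 2 but \A is not.
$gAtOffset = preg_match('/\Gcd/', $subject, $m1, 0, 2);   // 1
$aAtOffset = preg_match('/\Acd/', $subject, $m2, 0, 2);   // 0

// With the default offset 0, \G behaves like \A.
$gAtStart = preg_match('/\Gcd/', $subject);               // 0

// Each match must begin where the previous one ended, so matching stops at "a".
preg_match_all('/\G\d/', '123a45', $runs);                // $runs[0] is ["1", "2", "3"]
```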

Since PHP 4.3.3, \Q and \E can be used to ignore regular expression metacharacters in the pattern. For example, \w+\Q.$.\E$ matches one or more word characters, followed by the literal characters ".$.", anchored at the end of the string.
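The example pattern can be tried as follows. The subject strings are made up, and the pattern is single-quoted so that PHP neither interpolates "$" nor alters the backslashes.

```php
<?php
$pattern = '/\w+\Q.$.\E$/';

// Between \Q and \E the dot and dollar sign are literal characters.
$literal = preg_match($pattern, 'price.$.');    // 1

// If "." were a metacharacter here, "X" and "Y" would be accepted in its place.
$notMeta = preg_match($pattern, 'priceX$Y');    // 0
```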

Since PHP 5.2.4, \K can be used to reset the start of the match. For example, the pattern foo\Kbar matches "foobar", but the reported match is "bar". However, the use of \K does not interfere with the contents of captured subgroups. For example, when (foo)\Kbar matches "foobar", the first subgroup is still set to "foo". Note: \K has the same effect whether it appears inside or outside a subgroup.
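A minimal sketch of \K using the strings from the text; the replacement text "BAZ" is made up to show that only the part after \K is replaced.

```php
<?php
// The reported match starts after \K.
preg_match('/foo\Kbar/', 'foobar', $plain);       // $plain[0] is "bar"

// Captured subgroups are unaffected by \K.
preg_match('/(foo)\Kbar/', 'foobar', $grouped);   // $grouped[0] is "bar", $grouped[1] is "foo"

// In a replacement, the text before \K is required but left untouched.
$replaced = preg_replace('/foo\Kbar/', 'BAZ', 'foobar');   // "fooBAZ"
```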
