PHP PCRE regular expression escape sequences (backslash)

Source: Internet
Author: User
The backslash has several uses. First, if it is followed by a non-alphanumeric character, it removes any special meaning that character may have. This use of the backslash as an escape character applies both inside and outside character classes.

For example, to match a "*" character, write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric character with a backslash to specify that it stands for itself. In particular, to match a backslash, write "\\" in the pattern.

Note:

Backslashes have a special meaning in both single-quoted and double-quoted PHP strings, so to match a single backslash the pattern must be written as "\\\\" in PHP code. Consider "/\\/": PHP first processes it as a string literal and collapses the escaped backslash, so the regular expression engine receives /\/. The engine also treats \ as an escape character, so it escapes the closing delimiter /, the pattern is left unterminated, and an error results. That is why four backslashes are needed in the string literal to match one backslash.
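The two layers of escaping described in the note can be checked with a minimal sketch. The subject string and variable names are illustrative assumptions, not part of the original text.

```php
<?php
// The PHP literal '/\\\\/' holds the 4-character pattern /\\/,
// which the regex engine reads as "one literal backslash".
$pattern = '/\\\\/';
$subject = 'C:\\temp';                    // the string C:\temp (one backslash)

$patternLength = strlen($pattern);        // 4
$found = preg_match($pattern, $subject);  // 1: the backslash is found

// With only two backslashes in the literal, the engine receives /\/ and the
// closing delimiter is escaped, so compilation fails. The @ operator hides
// the resulting warning; preg_match() returns false.
$broken = @preg_match('/\\/', $subject);
```

Checking the return value against false with === distinguishes a compile failure from an ordinary non-match, which returns 0.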

If a pattern is compiled with the PCRE_EXTENDED option (the x modifier), whitespace in the pattern (other than in a character class) and all characters from an unescaped "#" outside a character class to the end of the line are ignored. To use a whitespace character or "#" as part of the pattern in this mode, escape it with a backslash.
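As a rough illustration of extended mode, the sketch below spreads a pattern over several commented lines and escapes the one space and the one "#" that must be matched literally. The sample subjects are assumptions chosen for the example.

```php
<?php
// With the x modifier, unescaped whitespace and "# ..." comments are ignored.
$pattern = '/
    \d+     # one or more digits
    \       # an escaped space, matched literally
    \#      # an escaped hash, matched literally
    \d+     # one or more digits
/x';

$withSpace = preg_match($pattern, 'item 12 #34');   // 1
$noSpace   = preg_match($pattern, 'item 12#34');    // 0: the literal space is required
```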

The second use of the backslash provides a way of encoding non-printing characters in patterns in a visible manner. Apart from the binary zero, which terminates a pattern, there is no restriction on non-printing characters appearing literally, but when a pattern is prepared in a text editor it is usually easier to use one of the following escape sequences than the binary character it represents.

\a

Bell (BEL) character (hex 07)

\cx

"Control-x", where x is any character

\e

Escape (hex 1B)

\f

Form feed (hex 0C)

\n

Newline (hex 0A)

\p{xx}

A character with the xx property

\P{xx}

A character without the xx property

\r

Carriage return (hex 0D)

\t

Horizontal tab (hex 09)

\xhh

Character with hexadecimal code hh

\ddd

Character with octal code ddd, or a back reference

The precise effect of \cx is as follows: if x is a lowercase letter, it is converted to uppercase. Then bit 6 of the character (hex 40, counting the rightmost bit as bit 0) is inverted. Thus \cz becomes hex 1A, \c{ becomes hex 3B, and \c; becomes hex 7B.
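The bit arithmetic can be reproduced directly. In this sketch, controlCode() is a hypothetical helper written for illustration; it is not a PHP or PCRE function.

```php
<?php
// Hypothetical helper mirroring the rule: uppercase, then flip bit 6 (hex 40).
function controlCode(string $x): int
{
    return ord(strtoupper($x)) ^ 0x40;
}

$cz     = controlCode('z');   // 0x1A
$cBrace = controlCode('{');   // 0x3B
$cSemi  = controlCode(';');   // 0x7B

// Check one case against the regex engine itself.
$engineAgrees = preg_match('/^\cz$/', chr($cz));   // 1
```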

After "\x", up to two hexadecimal digits are read (letters can be uppercase or lowercase). In UTF-8 mode, "\x{...}" is allowed, where the content inside the braces is a string of hexadecimal digits. It is interpreted as the UTF-8 character whose code point is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.
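A short sketch of the hexadecimal forms follows. It assumes PHP 7 or later for the "\u{...}" string syntax used to build the subjects, and uses the u modifier to enable UTF-8 mode.

```php
<?php
// \x41 is "A" and \x6a is "j"; hex letters may be upper or lower case.
$ascii = preg_match('/^\x41\x6a$/', 'Aj');           // 1

// In UTF-8 mode (u modifier), \x{...} names a code point.
$euro = preg_match('/^\x{20AC}$/u', "\u{20AC}");     // 1

// In UTF-8 mode, \xe9 (a value above 127) matches the character U+00E9,
// which occupies two bytes in UTF-8.
$eAcute     = "\u{00E9}";
$twoByte    = preg_match('/^\xe9$/u', $eAcute);      // 1
$byteLength = strlen($eAcute);                       // 2
```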

After "\0", up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.
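The following sketch tries the "\0" forms against subjects built from raw bytes. The subjects are assumptions for illustration; the last case shows why the full two digits matter when a literal octal digit follows.

```php
<?php
// "\0\x\07" as a pattern: NUL, NUL, BEL.
$threeBytes = preg_match('/^\0\x\07$/', "\x00\x00\x07");

// \07 alone is the BEL character.
$bel = preg_match('/^\07$/', "\x07");

// To match NUL followed by a literal "7", supply both octal digits first:
// \000 is NUL, and the trailing 7 then stands for itself.
$nulThenSeven = preg_match('/^\0007$/', "\x00" . '7');
```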

Handling a backslash followed by a digit other than 0 is more complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses (subgroups) in the expression, the entire sequence is taken as a back reference. How back references work is described later, following the discussion of parenthesized subgroups.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subgroups, PCRE re-reads up to three octal digits following the backslash and generates a single-byte value from the least significant 8 bits. Any subsequent digits stand for themselves. For example:

\040

Another way of writing a space

\40

Also a space, provided there are fewer than 40 previous capturing subgroups

\7

Always a back reference

\11

May be a back reference, or another way of writing a tab

\011

Always a tab

\0113

A tab followed by the character "3" (because at most three octal digits are read at a time)

\113

The character with octal code 113

\377

Octal 377 is decimal 255, so this is a byte consisting entirely of 1 bits

\81

Either a back reference, or a binary zero followed by the two characters "8" and "1" (because 8 is not a valid octal digit)

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
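A few of the cases above can be tried out as follows. This is only a sketch with made-up subjects; it covers the unambiguous cases plus one back reference.

```php
<?php
// \040 is always octal 40, that is, a space.
$space = preg_match('/^a\040b$/', 'a b');        // 1

// \011 is always a tab; in \0113 the trailing 3 is a literal "3".
$tab      = preg_match('/^\011$/', "\t");        // 1
$tabThree = preg_match('/^\0113$/', "\t3");      // 1

// \113 is octal 113 = decimal 75 = "K".
$octalK = preg_match('/^\113$/', 'K');           // 1

// \1 after one capturing subgroup is a back reference to that subgroup.
$backref = preg_match('/^(ab)\1$/', 'abab');     // 1
```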

All the sequences that define a single-byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (described below).
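As a small sketch of single-byte escapes inside a character class (the subjects are assumed for illustration):

```php
<?php
// Inside a character class, \b is the backspace character (hex 08).
$backspace = preg_match('/^[\b]$/', "\x08");       // 1

// Single-byte escapes such as \t and \040 also work inside a class.
$blanks = preg_match('/^[\t\040]+$/', "\t \t");    // 1
```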

The third use of the backslash is to specify generic character types:

\d

Any decimal digit

\D

Any character that is not a decimal digit

\h

Any horizontal whitespace character (since PHP 5.2.4)

\H

Any character that is not a horizontal whitespace character (since PHP 5.2.4)

\s

Any whitespace character

\S

Any character that is not a whitespace character

\v

Any vertical whitespace character (since PHP 5.2.4)

\V

Any character that is not a vertical whitespace character (since PHP 5.2.4)

\w

Any word character

\W

Any non-word character

Each pair of escape sequences partitions the complete character set into two disjoint sets. Any given character matches one, and only one, of each pair.

A word character is any letter, digit, or the underscore character, that is, any character that can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. Each of them matches one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, because there is no character to match.
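The sketch below exercises a few of the generic character types, including the uppercase negated forms \D and \W. The sample strings are assumptions, and the loop checks the partition property byte by byte in the default (non-UTF-8) mode.

```php
<?php
// \D removes everything that is not a decimal digit.
$digits = preg_replace('/\D/', '', 'Tel: +1 (555) 010-9999');

// \W+ splits on runs of non-word characters; the underscore is a word character.
$words = preg_split('/\W+/', 'foo_bar, baz-42', -1, PREG_SPLIT_NO_EMPTY);

// At the end of the subject both members of a pair fail: no character is left.
$digitAtEnd    = preg_match('/abc\d/', 'abc');   // 0
$nonDigitAtEnd = preg_match('/abc\D/', 'abc');   // 0

// Every single byte matches exactly one of \w and \W.
$partitioned = true;
for ($byte = 0; $byte < 256; $byte++) {
    $char = chr($byte);
    if (preg_match('/^\w/', $char) + preg_match('/^\W/', $char) !== 1) {
        $partitioned = false;
    }
}
```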

The fourth use of the backslash is for certain simple assertions. An assertion specifies a condition that must be met at a particular point in a match, without consuming any characters from the subject string. More complicated assertions that use subgroups are discussed later. The backslash assertions are:

\b

Word boundary

\B

Not a word boundary

\A

Start of the subject (independent of multiline mode)

\Z

End of the subject, or a newline at the end of the subject (independent of multiline mode)

\z

End of the subject (independent of multiline mode)

\G

First matching position in the subject

These assertions may not appear in character classes (but note that "\b" has a different meaning inside a character class, where it represents the backspace character).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or both match \W (that is, one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
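A minimal sketch of the word-boundary assertion and its negation, using assumed sample strings:

```php
<?php
// \b requires a transition between a word and a non-word character.
$wholeWord  = preg_match('/\bcat\b/', 'a cat sat');      // 1
$insideWord = preg_match('/\bcat\b/', 'concatenate');    // 0

// \B is the opposite: "cat" surrounded by word characters on both sides.
$notBoundary = preg_match('/\Bcat\B/', 'concatenate');   // 1

// The start and end of the string count as boundaries next to a word character.
$atEdges = preg_match('/\bcat\b/', 'cat');               // 1
```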

The \A, \Z, and \z assertions differ from the traditional ^ and $ (see below) in that they only ever match at the very start and end of the subject string, whatever pattern modifiers are set. They are not affected by the PCRE_MULTILINE or PCRE_DOLLAR_ENDONLY options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the very end.
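The difference from ^ and $, and between \Z and \z, can be seen with a two-line subject. This is a sketch with an assumed subject string; the m modifier turns on multiline mode.

```php
<?php
$subject = "first\nsecond\n";

// In multiline mode ^ matches at each line start, but \A only at the very start.
$caretCount = preg_match_all('/^\w+/m', $subject);     // 2
$startCount = preg_match_all('/\A\w+/m', $subject);    // 1

// \Z tolerates a final newline after the match position; \z does not.
$bigZ       = preg_match('/second\Z/', $subject);      // 1
$smallZ     = preg_match('/second\z/', $subject);      // 0
$smallZFull = preg_match('/second\n\z/', $subject);    // 1
```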

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the $offset parameter of preg_match(). It differs from \A when the value of $offset is not 0. Another difference from \A appears with preg_match_all(): for each successive match, \G asserts that the match begins exactly where that match attempt started, whereas \A asserts that the match begins at the start of the subject string.
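Both differences between \G and \A can be sketched as follows; the subjects and the offset value are illustrative assumptions.

```php
<?php
$subject = 'abc123def';

// With $offset = 3, \G holds at offset 3, while \A is still tied to offset 0.
$gResult = preg_match('/\G\d+/', $subject, $gMatch, 0, 3);   // 1, $gMatch[0] is "123"
$aResult = preg_match('/\A\d+/', $subject, $aMatch, 0, 3);   // 0

// With preg_match_all(), \G chains matches: each one must begin where the
// previous one ended, so the isolated "4" is never reached.
$chained = preg_match_all('/\G\d/', '123abc4', $all);        // 3
```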

Starting with PHP 4.3.3, \Q and \E can be used to ignore regular expression metacharacters in the pattern. For example, \w+\Q.$.\E$ matches one or more word characters, followed by the literal characters .$. (a dot, a $, and a dot), anchored at the end of the string.
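The example pattern can be tried directly. The subjects here are assumptions; the second one shows that the quoted dots no longer act as wildcards.

```php
<?php
$pattern = '/\w+\Q.$.\E$/';

$literal = preg_match($pattern, 'price.$.');   // 1
$notWild = preg_match($pattern, 'priceX$Y');   // 0: the dots are literal
```

When the literal text comes from a variable rather than being typed into the pattern, PHP's preg_quote() function serves a similar purpose by escaping the metacharacters in a string.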

Starting with PHP 5.2.4, \K can be used to reset the start of the match. For example, the pattern foo\Kbar matches "foobar" but reports that it has matched "bar". The use of \K does not interfere with the contents of subgroups: when (foo)\Kbar matches "foobar", the first subgroup is still set to "foo". \K has the same effect whether it is placed inside or outside a subgroup.
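A brief sketch of \K with preg_match() and preg_replace(); the replacement text "BAZ" is an assumption for illustration.

```php
<?php
// The reported match starts after \K, so only "bar" is returned.
preg_match('/foo\Kbar/', 'foobar', $plain);

// Subgroups captured before \K are unaffected.
preg_match('/(foo)\Kbar/', 'foobar', $grouped);

// In a replacement, only the part after \K is replaced.
$replaced = preg_replace('/foo\Kbar/', 'BAZ', 'foobar');   // "fooBAZ"
```

This makes \K a convenient way to require a prefix without including it in the match or the replacement.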
