Pattern syntax: an explanation of Perl-compatible regular expression syntax



The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with a few differences (see below). The current implementation of PCRE corresponds to Perl 5.005.

Differences from Perl

The differences described here are relative to Perl 5.005.

1. By default, a whitespace character is any character that the C library function isspace() recognizes, although PCRE can be compiled with alternative character type tables. Normally isspace() matches space, form feed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that appeared in the Perl documentation for a long time was never actually recognized, but the character itself was treated as whitespace until at least 5.002. In 5.004 and 5.005 it does not match \s.

2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they may not mean what you expect. For example, (?!a){3} does not assert that the next three characters are not "a". It asserts three times that the next character is not "a".

3. Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offset vector are never set. Perl sets its numeric variables from any such subpatterns that are matched before the assertion fails to match, but only if the negative lookahead assertion contains just one branch.

4. Although binary zero characters are supported in the subject string, they cannot appear in a pattern string, because the pattern is passed as a normal C string terminated by a binary zero. The escape sequence "\x00" can be used in the pattern to represent a binary zero.

5. The following Perl escape sequences are not supported: \l, \u, \L, \U. These are implemented by Perl's general string handling and are not part of its pattern matching engine.

6. The Perl \G assertion is not supported, because it is not relevant to single pattern matches.

7. Fairly obviously, PCRE does not support the (?{code}) construct.

8. Perl 5.005_02 has some oddities in how captured strings are set when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/, then $2 (and $3) are set. In Perl 5.004, $2 is set in both cases, and that is also true of PCRE. If Perl later changes to a consistent behavior, PCRE may change to follow it.

9. Another unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE, matching /^(a)?a/ against "a" leaves $1 unset.

PCRE provides some extensions to the Perl regular expression facilities:

1. Although lookbehind assertions must match fixed-length strings, each alternative branch of a lookbehind assertion can match a string of a different length. Perl 5.005 requires all branches to have the same length.

2. If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ metacharacter matches only at the very end of the string.

3. If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is treated as an error.

4. If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted: they are not greedy by default, but become greedy when followed by a question mark.

Description of regular expressions

The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of books, some of which contain many examples. "Mastering Regular Expressions" by Jeffrey Friedl, published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself.

Metacharacters

The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by metacharacters, which do not stand for themselves but are interpreted in a special way.

There are two different sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized within square brackets. Outside square brackets, the metacharacters are:

  \  general escape character with several uses
  ^  assert start of subject (or, in multiline mode, start of line, that is, immediately after a newline)
  $  assert end of subject (or, in multiline mode, end of line, that is, immediately before a newline)
  .  match any character except newline (by default)
  [  start character class definition
  ]  end character class definition
  |  start of alternative branch
  (  start subpattern
  )  end subpattern
  ?  extends the meaning of (, also the 0 or 1 quantifier, also the quantifier minimizer
  *  0 or more quantifier
  +  1 or more quantifier
  {  start min/max quantifier
  }  end min/max quantifier

The part of a pattern that is in square brackets is called a "character class". The only metacharacters in a character class are:

  \  general escape character
  ^  negate the class, but only if it is the first character
  -  indicates a character range
  ]  terminates the character class

The following sections describe the use of each metacharacter.

Backslash (\)

The backslash character has several uses. First, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, to match a "*" character, write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, to match a backslash, write "\\".

Note: The backslash has a special meaning in single-quoted and double-quoted PHP strings. Therefore, to match \ the regular expression must be \\, and the PHP code must use "\\\\" or '\\\\'.
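The two levels of escaping can be tried out in a short sketch. It uses Python's built-in re module as a stand-in for PHP's preg functions (an assumption made for illustration; both engines treat these escapes the same way), and the sample strings are made up.

```python
import re

# "\*" matches a literal asterisk; an unescaped "*" would be a quantifier.
star = re.search(r"\*", "2*3")

# To match one literal backslash, the regex itself must be "\\".
# In a normal (non-raw) Python string each backslash is doubled again,
# which mirrors the "\\\\" needed in PHP source code.
backslash_raw = re.search(r"\\", "C:\\temp")
backslash_plain = re.search("\\\\", "C:\\temp")

print(star.group())
print(backslash_raw.group())
print(backslash_plain.start())
```

The raw-string form keeps the regex readable because only the regex-level escaping remains visible.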

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.
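As a rough illustration of extended mode, the sketch below uses Python's re.VERBOSE flag, which plays a comparable role to PCRE_EXTENDED (Python is an assumption here, not the document's API; the pattern and sample string are invented).

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED in this sketch.
entry = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    \ at\     # escaped spaces are kept as literal spaces
    \#(\d+)   # an escaped hash is a literal character, not a comment
""", re.VERBOSE)

m = entry.fullmatch("2024-05 at #7")
print(m.groups())
```

Unescaped spaces, line breaks, and the trailing comments are all dropped before matching, so only the escaped space and hash survive as literals.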

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern. However, when a pattern is prepared in a text editor, it is usually easier to use one of the following escape sequences than the binary character it represents:

  \a    alarm, that is, the BEL character (hex 07)
  \cx   "control-x", where x is any character
  \e    escape (hex 1B)
  \f    form feed (hex 0C)
  \n    newline (hex 0A)
  \r    carriage return (hex 0D)
  \t    tab (hex 09)
  \xhh  character with hex code hh
  \ddd  character with octal code ddd, or a back reference

The precise effect of "\cx" is as follows: if "x" is a lowercase letter, it is converted to uppercase. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
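The rule for "\cx" is plain arithmetic, so it can be checked with a small sketch. The function name control_char is hypothetical and Python is used only for illustration.

```python
def control_char(x):
    """Return the character code produced by \\cx under the rule above."""
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())   # lowercase letters are converted to uppercase
    return code ^ 0x40          # then bit 6 (hex 40) is inverted

for ch in ("z", "{", ";"):
    print(ch, hex(control_char(ch)))
```

The three inputs reproduce the values quoted in the text: hex 1A, hex 3B, and hex 7B.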

After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the content of the braces is a string of hexadecimal digits. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if its value is greater than 127.

After "\0", up to two further octal digits are read. In both cases, if there are fewer than two digits, only those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. If the character that follows is itself an octal digit, make sure to supply two digits after the initial zero.

The handling of a backslash followed by a digit other than 0 is more complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. How this works is described later, after the discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example:

  \040   another way of writing a space
  \40    the same, provided there are fewer than 40 previous capturing subpatterns
  \7     always a back reference
  \11    might be a back reference, or another way of writing a tab
  \011   always a tab
  \0113  a tab followed by the character "3"
  \113   the character with octal code 113 (since there can be no more than 99 back references)
  \377   a byte consisting entirely of 1 bits
  \81    either a back reference, or a binary zero followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.
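Some of the unambiguous rows of the table can be checked with a short sketch. It relies on Python's re module, which happens to read these particular escapes the same way; that equivalence is an assumption of the sketch, and rows that depend on the number of capturing subpatterns (such as \40 or \11) are left out because Python resolves them differently.

```python
import re

space = re.fullmatch(r"\040", " ")           # octal 40 is a space
tab = re.fullmatch(r"\011", "\t")            # octal 11 is a tab
tab_then_3 = re.fullmatch(r"\0113", "\t3")   # only "\011" is read; "3" is literal
letter_k = re.fullmatch(r"\113", "K")        # octal 113 is the letter K

# Only the least significant 8 bits of an octal value form the byte.
all_ones = 0o377 & 0xFF

print(all(m is not None for m in (space, tab, tab_then_3, letter_k)), all_ones)
```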

All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).

The third use of backslash is for specifying generic character types:

  \d  any decimal digit
  \D  any character that is not a decimal digit
  \s  any whitespace character
  \S  any character that is not a whitespace character
  \w  any "word" character
  \W  any "non-word" character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.
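The partition property can be checked mechanically. The sketch below is an illustration in Python, not PCRE itself: re.ASCII restricts \d, \s and \w to ASCII, which is closer to PCRE's default tables than Python's Unicode-aware default, and the helper name classify is hypothetical.

```python
import re

PAIRS = [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]

def classify(ch):
    """Return, for each pair, the one escape that matches the character."""
    result = []
    for lower, upper in PAIRS:
        hits = [esc for esc in (lower, upper) if re.fullmatch(esc, ch, re.ASCII)]
        assert len(hits) == 1   # exactly one member of each pair matches
        result.append(hits[0])
    return result

for ch in ("7", "x", "_", " ", "-"):
    print(repr(ch), classify(ch))
```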

A "word" character is any letter or digit or the underscore character, that is, any character that can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. Each matches one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, because there is no character to match.

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are:

  \b  word boundary
  \B  not a word boundary
  \A  start of subject (independent of multiline mode)
  \Z  end of subject, or newline at the end (independent of multiline mode)
  \z  end of subject (independent of multiline mode)
  \G  first matching position in subject

These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (that is, one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.
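A minimal sketch of the word-boundary assertions, using Python's re module for illustration (an assumption; \b and \B behave the same way for this ASCII sample) and a made-up subject string:

```python
import re

text = "cat concat cat."

# \bcat\b matches "cat" only where it stands as a whole word.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat matches "cat" only when it is preceded by another word character.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_words, inside_word)
```

Neither assertion consumes a character: every match is exactly the three letters "cat", and only the permitted positions differ.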

The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero. It is available since PHP 4.3.3.

Since PHP 4.3.3, \Q and \E can be used to ignore regular expression metacharacters in the pattern. For example, \w+\Q.$.\E$ matches one or more word characters, followed by the literal .$. and anchored at the end of the string.

Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences for matching generic character types are available when UTF-8 mode is selected. They are:

  \p{xx}  a character with the xx property
  \P{xx}  a character without the xx property
  \X      an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly braces in the escape sequence are optional. These two examples have the same effect:

  \p{L}
  \pL

Supported property codes
  C   Other
  Cc  Control
  Cf  Format
  Cn  Unassigned
  Co  Private use
  Cs  Surrogate
  L   Letter
  Ll  Lower case letter
  Lm  Modifier letter
  Lo  Other letter
  Lt  Title case letter
  Lu  Upper case letter
  M   Mark
  Mc  Spacing mark
  Me  Enclosing mark
  Mn  Non-spacing mark
  N   Number
  Nd  Decimal number
  Nl  Letter number
  No  Other number
  P   Punctuation
  Pc  Connector punctuation
  Pd  Dash punctuation
  Pe  Close punctuation
  Pf  Final punctuation
  Pi  Initial punctuation
  Po  Other punctuation
  Ps  Open punctuation
  S   Symbol
  Sc  Currency symbol
  Sk  Modifier symbol
  Sm  Mathematical symbol
  So  Other symbol
  Z   Separator
  Zl  Line separator
  Zp  Paragraph separator
  Zs  Space separator

Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.

Specifying case-insensitive (caseless) matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.
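The general category codes can be inspected directly. The sketch below uses Python's standard unicodedata module, because Python's built-in re module has no \p escape; the helper matches_property is hypothetical and only approximates what \p{xx} tests.

```python
import unicodedata

def matches_property(ch, prop):
    """Hypothetical helper: would \\p{prop} accept ch, judged by general category?"""
    return unicodedata.category(ch).startswith(prop)

for ch in ("A", "a", "5", " ", "$"):
    print(repr(ch), unicodedata.category(ch))
```

Using startswith covers both forms described above: a two-letter code such as Lu must match exactly, while a single letter such as L accepts every category that begins with it.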

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*).

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.
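The grouping that \X performs can be imitated without a regex engine. This is a minimal sketch in Python using the standard unicodedata module; the function name extended_sequences is hypothetical, and unlike the pattern above it lets a mark at the very start of the text form a sequence of its own.

```python
import unicodedata

def extended_sequences(text):
    """Split text into a base character plus any following mark characters."""
    sequences = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and sequences:
            sequences[-1] += ch      # a mark attaches to the preceding character
        else:
            sequences.append(ch)     # anything else starts a new sequence
    return sequences

# "e" followed by U+0301 (combining acute accent), then "a"
parts = extended_sequences("e\u0301a")
print(len(parts))
```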

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for more than fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

Circumflex (^) and dollar ($)

Outside a character class, in the default matching mode, the circumflex character is an assertion that is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if several alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is TRUE only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if several alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. In that case, they match immediately after and immediately before an internal "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single-line mode because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
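The /^abc$/ example can be reproduced in a short sketch. It uses Python's re module as a stand-in for PCRE (an assumption for illustration): re.MULTILINE corresponds to PCRE_MULTILINE, and Python's \Z, which matches only at the very end, is comparable to PCRE's \z rather than to PCRE's \Z.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)                # no match
multi_line = re.search(r"^abc$", subject, re.MULTILINE)   # matches after "\n"

# By default "$" also matches just before a newline that ends the subject.
dollar = re.search(r"abc$", "abc\n")
very_end = re.search(r"abc\Z", "abc\n")

print(single_line, multi_line.start(), dollar is not None, very_end)
```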

Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A, it is always anchored, whether or not PCRE_MULTILINE is set.

Period (.)

Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) a newline. If the PCRE_DOTALL option is set, dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar; the only relationship is that they all involve newline characters. Dot has no special meaning in a character class.
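A brief sketch of the dot behaviour, again using Python's re module as an illustrative stand-in, with re.DOTALL in the role of PCRE_DOTALL:

```python
import re

default_dot = re.search(r"a.c", "a\nc")              # dot refuses the newline
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)    # dot accepts the newline
nonprinting = re.search(r"a.c", "a\x07c")            # BEL is matched by default

print(default_dot, dotall_dot is not None, nonprinting is not None)
```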

\C can be used to match a single byte. This makes sense in UTF-8 mode, where a period matches a whole character that may consist of several bytes.

Square brackets ([])

An opening square bracket introduces a character class, which is terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or be escaped with a backslash.

A character class matches a single character in the subject. The character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is required as a member of the class, make sure it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lowercase vowel, while [^aeiou] matches any character that is not a lowercase vowel. Note that a circumflex is just a convenient notation for specifying the characters that are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and it fails if the current position is at the end of the string.

When caseless matching is set, any letters in a class represent both their uppercase and lowercase versions. For example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a case-sensitive version would.

The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options. A class such as [^a] always matches a newline.
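The character class rules above can be sketched as follows. Python's re module is used for illustration only (an assumption; these rules are the same in both engines), and the sample strings are invented.

```python
import re

vowels = re.findall(r"[aeiou]", "Regex")
non_vowels = re.findall(r"[^aeiou]", "Regex")

# A negated class still has to consume a character, so it fails at the end.
at_end = re.search(r"x[^aeiou]", "Regex")

# Caseless matching applies to the class members as well.
caseless_hit = re.search(r"[aeiou]", "A", re.IGNORECASE)
caseless_miss = re.search(r"[^aeiou]", "A", re.IGNORECASE)

# A negated class matches a newline regardless of dot-all or multiline settings.
newline_hit = re.search(r"[^a]", "\n")

print(vowels, non_vowels)
```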

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.

The literal character "]" cannot be used as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by the literal string "46]", so it matches "W46]" or "-46]". However, if the "]" is escaped with a backslash, it is interpreted as the end of the range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.
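The two readings of the bracket can be compared side by side. This sketch uses Python's re module, which parses these two patterns the same way (an assumption made for illustration, not a statement about every engine).

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
class_then_literal = re.compile(r"[W-]46]")

# One class: the range W..], plus the separate characters "4" and "6".
range_plus_two = re.compile(r"[W-\]46]")

for s in ("W46]", "-46]", "X46]"):
    print(s, bool(class_then_literal.fullmatch(s)))
for s in ("X", "]", "4", "-"):
    print(s, bool(range_plus_two.fullmatch(s)))
```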

Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.

The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, where they add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the uppercase character types to specify a more restricted set of characters than the matching lowercase type. For example, the class [^\W_] matches any letter or digit, but not underscore.
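Both example classes can be tried in a short sketch. Python's re module is used for illustration, with re.ASCII so that \d and \W keep to ASCII, which is closer to PCRE's default tables (both choices are assumptions of the sketch).

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

for ch in ("7", "C", "G", "c"):
    print(ch, bool(hex_digit.fullmatch(ch)))
for ch in ("a", "9", "_", "-"):
    print(ch, bool(letter_or_digit.fullmatch(ch)))
```

Note that [\dABCDEF] as written accepts only uppercase hexadecimal letters unless caseless matching is also set.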

All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm to escape them.

Vertical bar (|)

Vertical bar characters are used to separate alternative patterns. For example, the pattern:

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.
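The left-to-right rule for alternatives is easy to observe. The following sketch uses Python's re module for illustration (an assumption; alternation works the same way there), with small made-up patterns.

```python
import re

names = re.compile(r"gilbert|sullivan")

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab")

# An empty alternative matches the empty string.
empty_branch = re.fullmatch(r"colou?r|", "")

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the engine falls back to the second branch here.
rest_matters = re.fullmatch(r"(a|ab)c", "abc")

print(first_wins.group(), rest_matters.group(1))
```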

Internal option setting

The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, PCRE_UNGREEDY, PCRE_EXTRA and PCRE_EXTENDED can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are:

Internal option letters
  i  PCRE_CASELESS
  m  PCRE_MULTILINE
  s  PCRE_DOTALL
  x  PCRE_EXTENDED
  U  PCRE_UNGREEDY
  X  PCRE_EXTRA

For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen. A combined setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset.
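Inline option letters can be tried in a short sketch. It uses Python's re module, which accepts the same i, m, s and x letters but only allows the unscoped form at the very start of a pattern; unsetting is shown with the scoped form (?-i:...). Both points are properties of Python, assumed here for illustration, and not of PCRE.

```python
import re

# (?im) at the start sets caseless and multiline matching for the whole pattern.
both = re.search(r"(?im)^abc$", "x\nABC")

# A scoped group can unset an option again: (?-i:...) is case-sensitive
# even though (?i) is in force for the rest of the pattern.
mixed_ok = re.fullmatch(r"(?i)a(?-i:b)", "Ab")
mixed_fail = re.fullmatch(r"(?i)a(?-i:b)", "AB")

print(both.group(), mixed_ok is not None, mixed_fail)
```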

When an option change occurs at the top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows. So /ab(?i)c/ matches only "abc" and "abC". This behavior was changed in PCRE 4.0, which is bundled with PHP since 4.3.3. Before that, /ab(?i)c/ behaved the same as /abc/i (for example, matching "ABC" and "aBc").
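The difference between the two behaviors can be approximated in a sketch. Recent versions of Python's re module reject an unscoped (?i) in the middle of a pattern, so the scoped group (?i:c) is used as a stand-in for PCRE's /ab(?i)c/; this substitution is an assumption of the example, not PCRE syntax advice.

```python
import re

# Stand-in for /ab(?i)c/: only the final "c" is matched caselessly.
scoped = re.compile(r"ab(?i:c)")

# The older behaviour, equivalent to /abc/i: the whole pattern is caseless.
whole = re.compile(r"abc", re.IGNORECASE)

samples = ("abc", "abC", "ABC", "aBc")
results = {s: bool(scoped.fullmatch(s)) for s in samples}
print(results)
```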

If an option change occurs inside a subpattern, the effect is different. This is a change of behavior in Perl 5.005. An option change inside a subpattern affects only the part of the subpattern that follows it, so (a(?i)b)c matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is not used). This means that options can have different settings in different parts of the pattern. Changes made in one alternative carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) matches "ab", "aB", "c" and "C", even though the first branch is abandoned before the option setting when "C" is matched. This is because the effects of option settings happen at compile time.
