In-Depth Analysis of Linux Regular Expressions

Last Update:2013-12-30 Source: Internet

Author: User

Tags: character classes, printable characters


Introduction

Generally speaking, regular expression syntax falls into three standards: BRE, ERE, and ARE. BRE and ERE belong to the POSIX standard, while ARE is an extension defined by individual implementations.

POSIX Regular Expression

Traditionally, POSIX defines two regular expression syntaxes: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).

The syntax symbols defined by BRE include:

. - Matches any single character.
[] - Character set: matches any one of the characters listed in the square brackets.
[^] - Negated character set: matches any one character that is not listed in the square brackets.
^ - Matches the start position.
$ - Matches the end position.
\(\) - Defines a subexpression.
\n - Backreference to a subexpression, where n is a digit from 1 to 9. Because this feature goes beyond regular-language semantics and
requires backtracking over the string, matching must use an NFA algorithm.
* - Matches any number of times (zero or more).
\{m,n\} - Matches at least m and at most n times; \{m\} matches exactly m times; \{m,\} matches at least m times.

ERE modifies part of the BRE syntax and adds the following syntax symbols:

? - Matches at most once (zero or one time).
+ - Matches at least once (one or more times).
| - Alternation (OR). Both the left and right operands are treated as subexpressions.

ERE also drops the backslashes from the subexpression "()" and repetition-count "{m,n}" syntax: neither symbol needs an escape
character in ERE. In addition, ERE removes backreferences to subexpressions, which go beyond regular-language semantics.
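
The escaping difference between the two syntaxes can be sketched as a small converter. This is a minimal illustration in Python, not part of any tool mentioned here: the function names bre_to_ere and _bracket_end are hypothetical, BRE corner cases (such as a leading '*' being literal) are ignored, and backreferences are copied through unchanged even though ERE does not define them.

```python
import re

def _bracket_end(pattern: str, start: int) -> int:
    """Return the index just past the bracket expression opening at `start`."""
    j = start + 1
    if j < len(pattern) and pattern[j] == "^":
        j += 1
    if j < len(pattern) and pattern[j] == "]":  # a leading ']' is a literal
        j += 1
    while j < len(pattern):
        if pattern[j] == "[" and pattern[j + 1:j + 2] in (":", ".", "="):
            # Skip a nested [:class:], [.symbol.] or [=equiv=] element.
            close = pattern.find(pattern[j + 1] + "]", j + 2)
            if close == -1:
                return len(pattern)
            j = close + 2
        elif pattern[j] == "]":
            return j + 1
        else:
            j += 1
    return len(pattern)

def bre_to_ere(pattern: str) -> str:
    """Rewrite a simple BRE pattern using ERE escaping rules (simplified)."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "[":
            # Bracket expressions mean the same in both syntaxes: copy verbatim.
            j = _bracket_end(pattern, i)
            out.append(pattern[i:j])
            i = j
        elif ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            # \( \) \{ \} lose their backslash; other escapes are kept as-is.
            out.append(nxt if nxt in "(){}" else ch + nxt)
            i += 2
        elif ch in "(){+?|":
            # Literal in BRE, but an operator in ERE, so it must be escaped.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(bre_to_ere(r"\(ab\)\{2,3\}"))  # (ab){2,3}
print(bre_to_ere("a+b"))             # a\+b
```

For patterns that use only these basic operators, the converted string also happens to be valid for Python's re module, for example re.fullmatch(bre_to_ere(r"\(ab\)\{2\}"), "abab") succeeds.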

BRE and ERE share the same POSIX character class definitions. Both also support collating symbols "[. .]" and
equivalence classes "[= =]" inside bracket expressions, but these are rarely used.

Tools such as f/fr/wfr/bwfr use ERE mode by default and support the following Perl-style character classes:

POSIX class    Perl-style    Description
----------------------------------------------------------------------------
[:alnum:]                    letters and digits
[:alpha:]      \a            letters
[:lower:]      \l            lowercase letters
[:upper:]      \u            uppercase letters
[:blank:]                    blank characters (spaces and tabs)
[:space:]      \s            all whitespace characters (a wider set than [:blank:])
[:cntrl:]                    non-printable control characters (null, delete, alert bell, ...)
[:digit:]      \d            decimal digits
[:xdigit:]     \x            hexadecimal digits
[:graph:]                    printable non-blank characters
[:print:]      \p            printable characters
[:punct:]                    punctuation marks
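
Engines without POSIX class support can approximate these classes with explicit ranges. The following is a rough sketch in Python, whose re module does not understand "[:name:]". The ASCII ranges are an assumption (real POSIX classes depend on the locale), and the names POSIX_CLASSES and expand_posix_classes are hypothetical. The expansion is only meaningful inside a bracket expression.

```python
import re

# ASCII-only stand-ins for the POSIX classes (assumption: the "C" locale).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "lower": "a-z",
    "upper": "A-Z",
    "blank": r" \t",
    "space": r" \t\n\r\f\v",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "xdigit": "0-9A-Fa-f",
    "graph": r"\x21-\x7e",
    "print": r"\x20-\x7e",
    "punct": r"!-/:-@\[-`{-~",
}

def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] element with an explicit ASCII range."""
    def repl(match):
        # Unknown class names are left untouched.
        return POSIX_CLASSES.get(match.group(1), match.group(0))
    return re.sub(r"\[:([a-z]+):\]", repl, pattern)

identifier = expand_posix_classes("[[:alpha:]_][[:alnum:]_]*")
print(identifier)                                   # [a-zA-Z_][a-zA-Z0-9_]*
print(bool(re.fullmatch(identifier, "stRegEx_1")))  # True
```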

In addition, the following special character classes are available:

Perl-style    Equivalent POSIX expression    Description
----------------------------------------------------------------------------
\o            [0-7]                          octal digit
\O            [^0-7]                         non-octal digit
\w            [[:alnum:]_]                   word character
\W            [^[:alnum:]_]                  non-word character
\A            [^[:alpha:]]                   non-letter
\L            [^[:lower:]]                   non-lowercase letter
\U            [^[:upper:]]                   non-uppercase letter
\S            [^[:space:]]                   non-whitespace character
\D            [^[:digit:]]                   non-digit
\X            [^[:xdigit:]]                  non-hexadecimal digit
\P            [^[:print:]]                   non-printable character

You can also use the following escape sequences for special characters:

\r - carriage return
\n - line feed
\b - backspace
\t - tab
\v - vertical tab
\" - double quotation mark
\' - single quotation mark

Advanced Regular Expression

In addition to POSIX BRE and ERE, libutilitis also supports the Advanced Regular Expression (ARE) syntax, which is compatible
with Tcl 8.2. You can enable ARE mode by adding the prefix "***:" to the stRegEx parameter; the prefix overrides
the bExtended option. ARE is essentially a superset of ERE and extends it in the following ways:

1. Lazy matching (also called "non-greedy" or "shortest" matching): appending '?' to '?', '*', '+', or '{m,n}'
enables the shortest match, so that the quantified clause matches as few characters as possible
(by default it matches as many as possible). For example, applying "a.*b" to "abab"
matches the entire string ("abab"), whereas "a.*?b" matches only the first two characters ("ab").
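
The greedy and lazy behaviour can be tried out with Python's re module, used here only as a stand-in: it is not the ARE engine, and it is an assumption that both engines agree on simple cases like this one.

```python
import re

text = "abab"

greedy = re.search(r"a.*b", text).group()   # '.*' takes as much as it can
lazy = re.search(r"a.*?b", text).group()    # '.*?' takes as little as it can

assert greedy == "abab"
assert lazy == "ab"

# The same suffix works on a counted repetition: {1,3}? stops at the minimum.
shortest_run = re.search(r"a{1,3}?", "aaa").group()
assert shortest_run == "a"
```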

2. Backreferences to subexpressions: in stRegEx, '\n' refers to a previously defined
subexpression. For example, "(a.*)\1" can match "abcabc".

3. Anonymous subexpressions: use "(?:expression)" to create an anonymous (non-capturing) subexpression. An anonymous
subexpression is not captured, so it cannot be referenced by '\n'.
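
A short sketch of backreferences and non-capturing groups, again using Python's re module as a stand-in for the ARE engine (an assumption; the syntax for these two features is the same in both). The sample strings are illustrative only.

```python
import re

# "(a.*)\1" requires the text captured by group 1 to appear twice in a row.
repeated = re.fullmatch(r"(a.*)\1", "abcabc")
assert repeated is not None
assert repeated.group(1) == "abc"

# "(?:ab)" groups without capturing, so "(c)" is still group 1
# and "\1" refers to it.
mixed = re.fullmatch(r"(?:ab)+(c)\1", "ababcc")
assert mixed is not None
assert mixed.groups() == ("c",)
```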

4. Lookahead: the match succeeds only if a specified condition holds for the text that follows. Lookahead is either positive
or negative. The positive lookahead syntax is "(?=expression)"; for example, "bai.*(?=yang)" matches
the first four characters ("bai ") of "bai yang", while ensuring that "yang" follows them.
The negative lookahead syntax is "(?!expression)"; for example, "bai (?!yang)" matches the first four characters of
"bai shan", while ensuring that "yang" does not follow "bai "; the same pattern does not match "bai yang".
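
Lookahead can be explored with Python's re module, used here as a stand-in for the ARE engine (an assumption; both use the same lookahead syntax). One point the sketch makes explicit: a greedy ".*" placed directly before a negative lookahead can run to the end of the string, where the lookahead trivially succeeds, so the negative example anchors the lookahead right after "bai ".

```python
import re

# Positive lookahead: "yang" must follow, but is not part of the match.
positive = re.search(r"bai.*(?=yang)", "bai yang")
assert positive.group() == "bai "

# Negative lookahead: "yang" must not follow "bai ".
negative_hit = re.search(r"bai (?!yang)", "bai shan")
negative_miss = re.search(r"bai (?!yang)", "bai yang")
assert negative_hit.group() == "bai "
assert negative_miss is None

# Pitfall: the greedy ".*" consumes everything, then (?!yang) succeeds at the end.
pitfall = re.search(r"bai.*(?!yang)", "bai yang")
assert pitfall.group() == "bai yang"
```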

5. Mode switching by prefix: after "***:", the expression may begin with a mode string of the form "(?modes)". The mode
string changes the semantics and behavior of the rest of the expression, and can be any combination of the following characters:

b - Switch to POSIX BRE mode. Overrides the bExtended option.
e - Switch to POSIX ERE mode. Overrides the bExtended option.
q - Switch to literal text mode. Every character in the expression is searched for as plain text and all regular expression
semantics are disabled, which reduces matching to a simple string search. The "***=" prefix is shorthand
for this mode: "***=" is equivalent to "***:(?q)".

c - Perform case-sensitive matching. Overrides the bNoCase option.
i - Perform case-insensitive matching. Overrides the bNoCase option.

n - Enable newline-sensitive matching: '^' and '$' match at the beginning and end of each line, and '.' and negated sets ('[^...]') do not
match the newline character. Equivalent to the mode string 'pw'. Overrides the bNewLine option.
m - Equivalent to 'n'.
p - '^' and '$' match only at the beginning and end of the entire string, not at line boundaries; '.' and negated sets do not match the newline character.
Overrides the bNewLine option.
w - '^' and '$' match at the beginning and end of each line; '.' and negated sets match the newline character. Overrides the bNewLine option.
s - '^' and '$' match only at the beginning and end of the entire string, not at line boundaries; '.' and negated sets match the newline character.
Overrides the bNewLine option. This is the default in ARE mode.

x - Enable expanded mode, in which whitespace and '#' comments in the expression are ignored.
For example:
    (?x)
    \s+([[:graph:]]+)  # first number
    \s+([[:graph:]]+)  # second number
is equivalent to "\s+([[:graph:]]+)\s+([[:graph:]]+)".
t - Disable expanded mode: whitespace and comments are not ignored. This is the default in ARE mode.
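
Expanded mode has a close analogue in Python's re module, where "(?x)" (re.VERBOSE) likewise ignores whitespace and '#' comments. The sketch below is only an approximation of the example above: Python does not support [[:graph:]], so \S (any non-whitespace character) is substituted for it, and the sample line is made up for illustration.

```python
import re

# Commented, multi-line form: whitespace and '#' comments are ignored.
expanded = re.compile(r"""(?x)
    \s+(\S+)   # first number
    \s+(\S+)   # second number
""")

# The same pattern written compactly.
compact = re.compile(r"\s+(\S+)\s+(\S+)")

line = "total  12  34"
expanded_groups = expanded.search(line).groups()
compact_groups = compact.search(line).groups()

assert expanded_groups == compact_groups == ("12", "34")
```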

6. Perl-style character classes and escape sequences differ from those in BRE/ERE mode:

Escape    Equivalent POSIX expression    Description
----------------------------------------------------------------------------
\a        -                              bell (alert) character
\A        -                              matches only at the start of the entire string, regardless of the current mode
\b        -                              backspace ('\x08')
\B        -                              the backslash character itself ('\\')
\cX       -                              control-X (= X & 037)
\d        [[:digit:]]                    decimal digit ('0'-'9')
\D        [^[:digit:]]                   non-digit
\e        -                              escape character ('\x1B')
\f        -                              form feed ('\x0c')
\m        [[:<:]]                        start of a word
\M        [[:>:]]                        end of a word
\n        -                              line feed ('\x0a')
\r        -                              carriage return ('\x0d')
\s        [[:space:]]                    whitespace character
\S        [^[:space:]]                   non-whitespace character
\t        -                              tab ('\x09')
\uX       -                              16-bit Unicode character (X ∈ [0000..FFFF])
\UX       -                              32-bit Unicode character (X ∈ [00000000..FFFFFFFF])
\v        -                              vertical tab ('\x0b')
\w        [[:alnum:]_]                   word character
\W        [^[:alnum:]_]                  non-word character
\xX       -                              8-bit character (X ∈ [00..FF])
\y        -                              word boundary (\m or \M)
\Y        -                              non-word boundary
\Z        -                              matches only at the end of the entire string, regardless of the current mode
\0        -                              NUL, the null character
\X        -                              subexpression backreference (X ∈ [1..9])
\XX       -                              subexpression backreference, or an 8-bit character in octal
\XXX      -                              subexpression backreference, or an 8-bit character in octal
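
The word-position escapes in this table differ from Perl-style engines, where \b is the word boundary rather than backspace. As a rough illustration, the sketch below emulates \m, \M, \y and \Y in Python's re module. The constants WORD_START and WORD_END and the helper starts are hypothetical, and the emulation assumes Python's definition of a word character.

```python
import re

WORD_START = r"\b(?=\w)"    # stands in for ARE \m: boundary followed by a word character
WORD_END = r"\b(?<=\w)"     # stands in for ARE \M: boundary preceded by a word character

def starts(pattern: str, text: str) -> list:
    """Return the start offset of every match of `pattern` in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "cat concat cats"

at_word_start = starts(WORD_START + "cat", text)   # like \mcat
at_word_end = starts("cat" + WORD_END, text)       # like cat\M
whole_word = starts(r"\bcat\b", text)              # like \ycat\y
inside_word = starts(r"\Bcat", text)               # like \Ycat

assert at_word_start == [0, 11]
assert at_word_end == [0, 7]
assert whole_word == [0]
assert inside_word == [7]
```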

