Linux Regular Expressions in Depth

Source: Internet
Author: User

Brief introduction

Generally speaking, regular expression syntax falls into three standards: BRE, ERE and ARE. BRE and ERE are POSIX standards, while ARE is an extension defined by individual implementations.

POSIX Regular Expressions

Traditionally, POSIX defines two regular expression syntaxes: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).

Among them, the syntax symbols defined by BRE include:

. - Matches any single character.
[] - Character set: matches any one of the characters listed inside the square brackets.
[^] - Negated character set: matches any one character that is not listed inside the square brackets.
^ - Matches the start position.
$ - Matches the end position.
\(\) - Defines a subexpression.
\n - Back-reference to a subexpression, where n is a digit from 1 to 9. Because this feature goes beyond regular-language semantics and requires backtracking in the string, it has to be matched with a backtracking NFA algorithm.
* - Matches any number of times (0 or more).
\{m,n\} - Matches at least m and at most n times; \{m\} matches exactly m times; \{m,\} matches at least m times.

ERE modifies some of the BRE syntax and adds the following notation:

? - Matches at most once (0 or 1 times).
+ - Matches at least once (1 or more times).
| - Alternation ("or"): the operands on its left and right are each treated as a subexpression.

In addition, ERE drops the escape requirement for the subexpression "()" and repetition-count "{m,n}" symbols: when you use either of them, no escape character is needed. ERE also removes the non-regular ability to back-reference a subexpression.
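
The escaping difference between BRE and ERE described above can be sketched as a small translation routine. This is only an illustration under stated assumptions: `bre_to_ere` is a hypothetical helper (not part of any tool mentioned here), it assumes a well-formed BRE, and it ignores corner cases such as a leading `*`. Python's `re` module is not a POSIX engine, but it writes groups and intervals in the ERE style, so it is used here to try the result.

```python
import re

def bre_to_ere(bre: str) -> str:
    """Translate BRE notation into ERE-style notation (hypothetical helper)."""
    out = []
    i = 0
    while i < len(bre):
        ch = bre[i]
        if ch == "\\" and i + 1 < len(bre):
            nxt = bre[i + 1]
            # \( \) \{ \} are operators in BRE; ERE writes them without the backslash.
            out.append(nxt if nxt in "(){}" else ch + nxt)
            i += 2
        elif ch == "[":
            # Copy a bracket expression verbatim: its contents are literal.
            j = i + 1
            if j < len(bre) and bre[j] == "^":
                j += 1
            if j < len(bre) and bre[j] == "]":
                j += 1  # a leading ']' is a literal member of the set
            while j < len(bre) and bre[j] != "]":
                if bre[j] == "[" and j + 1 < len(bre) and bre[j + 1] in ":.=":
                    # Skip over [:class:], [.symbol.] and [=equivalence=].
                    j = bre.index(bre[j + 1] + "]", j + 2) + 2
                else:
                    j += 1
            out.append(bre[i:j + 1])
            i = j + 1
        elif ch in "(){}+?|":
            # Literal in BRE but an operator in ERE, so escape it.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

converted = bre_to_ere(r"\(ab\)\{2,3\}")   # "(ab){2,3}"
literal_plus = bre_to_ere("a+b")            # "a\+b"
```

For instance, `re.fullmatch(converted, "abab")` succeeds, and `re.fullmatch(literal_plus, "a+b")` matches the literal plus sign, mirroring how `+` is an ordinary character in BRE.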

BRE and ERE share the same POSIX character class definitions. Both also support collating symbols "[. .]" and equivalence classes "[= =]" inside bracket expressions, but these are rarely used.

Tools such as F/FR/WFR/BWFR use ERE mode by default and also support the following Perl-style character classes:

POSIX class   Perl class   Description
----------------------------------------------------------------------------
[:alnum:]                  Letters and digits
[:alpha:]     \a           Letters
[:lower:]     \l           Lowercase letters
[:upper:]     \u           Uppercase letters
[:blank:]                  Blank characters (space and tab)
[:space:]     \s           All whitespace (covers more than [:blank:])
[:cntrl:]                  Non-printable control characters (backspace, delete, bell, ...)
[:digit:]     \d           Decimal digits
[:xdigit:]    \x           Hexadecimal digits
[:graph:]                  Printable non-whitespace characters
[:print:]     \p           Printable characters
[:punct:]                  Punctuation

In addition, the following special character classes are available:

Perl class   Equivalent POSIX expression   Description
----------------------------------------------------------------------------
\o           [0-7]                         Octal digit
\O           [^0-7]                        Non-octal digit
\w           [[:alnum:]_]                  Word character
\W           [^[:alnum:]_]                 Non-word character
\A           [^[:alpha:]]                  Non-letter
\L           [^[:lower:]]                  Not a lowercase letter
\U           [^[:upper:]]                  Not an uppercase letter
\S           [^[:space:]]                  Non-whitespace
\D           [^[:digit:]]                  Non-digit
\X           [^[:xdigit:]]                 Non-hexadecimal digit
\P           [^[:print:]]                  Non-printable character
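
To make the class tables above concrete, here is a minimal sketch that expands POSIX class names into explicit ASCII ranges so they can be tried in an engine that lacks them. Python's `re` module does not understand `[:alpha:]` and friends, so `expand_posix_classes` and the `POSIX_CLASSES` table are hypothetical names introduced for illustration, and the ASCII-only ranges are an assumption (real POSIX classes depend on the locale).

```python
import re

# ASCII-only expansions (assumption: the "C" locale; other locales differ).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "lower": "a-z",
    "upper": "A-Z",
    "blank": r" \t",
    "space": r" \t\n\r\f\v",
    "cntrl": r"\x00-\x1f\x7f",
    "digit": "0-9",
    "xdigit": "0-9A-Fa-f",
    "graph": r"\x21-\x7e",
    "print": r"\x20-\x7e",
    "punct": r"!-/:-@\[-`{-~",
}

def expand_posix_classes(pattern: str) -> str:
    """Replace each [:name:] inside a bracket expression with ASCII ranges."""
    return re.sub(r"\[:([a-z]+):\]",
                  lambda m: POSIX_CLASSES[m.group(1)],
                  pattern)

# [[:alpha:]_][[:alnum:]_]* becomes [a-zA-Z_][a-zA-Z0-9_]*
identifier = re.compile(expand_posix_classes(r"[[:alpha:]_][[:alnum:]_]*"))

# [^[:space:]]+ is the POSIX spelling of the negated class \S
non_space = re.compile(expand_posix_classes(r"[^[:space:]]+"))
```

With these definitions, `identifier.fullmatch("foo_1")` succeeds while `identifier.fullmatch("1foo")` does not, and `non_space.findall("ab  cd\te")` splits the text into its non-whitespace runs.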

You can also use the following escape sequences for special characters:

\r - Carriage return
\n - Line feed (newline)
\b - Backspace
\t - Tab
\v - Vertical tab
\" - Double quote
\' - Single quote

Advanced Regular Expressions

In addition to POSIX BRE and ERE, Libutilitis also supports Advanced Regular Expressions (ARE) compatible with Tcl 8.2. Adding the prefix "***:" to the stregex parameter turns on ARE mode; this prefix overrides the bextended option. Broadly speaking, ARE is a superset of ERE. It adds the following extensions on top of ERE:

1. Lazy matching (also known as "non-greedy" or "shortest" matching): appending '?' to '?', '*', '+' or '{m,n}' enables shortest matching, so that the clause matches as few characters as possible while still satisfying the pattern (the default is to match as many characters as possible). For example, "a.*b" applied to "abab" matches the entire string ("abab"), whereas "a.*?b" matches only the first two characters ("ab").

2. Back-references to subexpressions: in stregex, '\n' refers back to a previously defined subexpression. For example, "(a.*)\1" matches "abcabc".

3. Non-capturing subexpressions: "(?:expression)" creates an unnamed (non-capturing) subexpression, which is not assigned a '\n' back-reference.

4. Lookahead: the match succeeds only if a specified condition holds at the current position. Lookahead comes in two kinds, positive and negative. The syntax for positive lookahead is "(?=expression)"; for example, "bai.*(?=yang)" matches the first four characters ("bai ") of "bai yang", while ensuring that "yang" follows the text matched by "bai.*". The syntax for negative lookahead is "(?!expression)"; for example, "bai.*(?!yang)" matches "bai shan", while ensuring that "yang" does not immediately follow the text matched by "bai.*". Note that because ".*" is greedy, this pattern matches all of "bai shan" rather than only its first four characters.

5. Mode-switch prefix: "***:" can be followed by a pattern string of the form "(?pattern string)", which affects the semantics and behavior of the expression that follows. A pattern string can be a combination of the following characters:

b - Switch to POSIX BRE mode, overriding the bextended option.
e - Switch to POSIX ERE mode, overriding the bextended option.
q - Switch to literal matching mode: the characters of the expression are searched for as plain text and all regular expression semantics are cancelled. This mode reduces regular expression matching to a simple string search. The "***=" prefix is a shortcut for it, that is, "***=" is equivalent to "***:(?q)".

c - Perform case-sensitive matching, overriding the bnocase option.
i - Perform case-insensitive matching, overriding the bnocase option.

n - Enable line-sensitive matching: '^' and '$' match at the beginning and end of each line; '.' and negated sets ('[^...]') do not match line breaks. This is equivalent to the 'pw' pattern string. Overrides the bnewline option.
m - Same as 'n'.
p - '^' and '$' match only at the beginning and end of the entire string, not at line boundaries; '.' and negated sets do not match line breaks. Overrides the bnewline option.
w - '^' and '$' match at the beginning and end of each line; '.' and negated sets match line breaks. Overrides the bnewline option.
s - '^' and '$' match only at the beginning and end of the entire string, not at line boundaries; '.' and negated sets match line breaks. Overrides the bnewline option. This mode is the default in ARE.

x - Enable extended mode: whitespace, and everything from the comment character '#' to the end of the line, is ignored in the expression. For example:
@code
(?x)
\s+([[:graph:]]+)  # First number
\s+([[:graph:]]+)  # Second number
@endcode
is equivalent to "\s+([[:graph:]]+)\s+([[:graph:]]+)".
t - Turn off extended mode: whitespace and comment characters are not ignored. This mode is the default in ARE.

6. Perl-style escape sequences that differ from BRE/ERE mode (X is a placeholder for the character or digits described in each row):

Escape   Equivalent POSIX expression   Description
----------------------------------------------------------------------------
\a       -                 Alert (bell) character
\A       -                 Matches only at the beginning of the entire string, regardless of the current mode
\b       -                 Backspace ('\x08')
\B       -                 The backslash itself ('\\')
\cX      -                 Control character X (= X & 037)
\d       [[:digit:]]       Decimal digit ('0'-'9')
\D       [^[:digit:]]      Non-digit
\e       -                 Escape character ('\x1b')
\f       -                 Form feed ('\x0c')
\m       [[:<:]]           Start of a word
\M       [[:>:]]           End of a word
\n       -                 Line feed ('\x0a')
\r       -                 Carriage return ('\x0d')
\s       [[:space:]]       Whitespace character
\S       [^[:space:]]      Non-whitespace character
\t       -                 Tab ('\x09')
\uX      -                 16-bit Unicode character (X ∈ [0000..FFFF])
\UX      -                 32-bit Unicode character (X ∈ [00000000..FFFFFFFF])
\v       -                 Vertical tab ('\x0b')
\w       [[:alnum:]_]      Word character
\W       [^[:alnum:]_]     Non-word character
\xX      -                 8-bit character (X ∈ [00..FF])
\y       -                 Word boundary (\m or \M)
\Y       -                 Non-word boundary
\Z       -                 Matches only at the end of the entire string
\0       -                 NUL, the null character
\X       -                 Subexpression back-reference (X ∈ [1..9])
\XX      -                 Subexpression back-reference, or an 8-bit character written in octal
\XXX     -                 Subexpression back-reference, or an 8-bit character written in octal
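
Several of the ARE extensions in items 1 to 5 above (lazy quantifiers, back-references, non-capturing subexpressions, lookahead, and mode switches) have close counterparts in Python's `re` module, so they can be tried out there. The sketch below is only an approximation: Python is not the Tcl ARE engine, it spells mode switches as inline flags such as `(?i)` and `(?x)` without the "***:" prefix, and `\S` stands in for `[[:graph:]]`, which Python does not support.

```python
import re

# 1. Greedy versus lazy: ".*" takes as much as it can, ".*?" as little.
greedy = re.match(r"a.*b", "abab").group()    # whole string
lazy = re.match(r"a.*?b", "abab").group()     # first two characters

# 2. Back-reference: \1 must repeat exactly what group 1 captured.
backref = re.fullmatch(r"(a.*)\1", "abcabc")
captured = backref.group(1)                   # the repeated half, "abc"

# 3. Non-capturing group: (?:...) groups without taking a \n slot,
#    so "c" is group 1 here.
noncap_groups = re.match(r"(?:ab)+(c)", "ababc").groups()

# 4. Lookahead: the condition is checked but not consumed.
positive = re.match(r"bai.*(?=yang)", "bai yang").group()   # "bai "
negative = re.match(r"bai (?!yang)", "bai yang")             # no match
# Caveat: a greedy ".*" can run to the end of the string, where
# "(?!yang)" trivially holds, so this still matches "bai yang" in full.
greedy_negative = re.match(r"bai.*(?!yang)", "bai yang").group()

# 5. Mode switches as inline flags: (?i) ignores case, (?x) ignores
#    whitespace and '#' comments inside the pattern.
ignore_case = re.fullmatch(r"(?i)abc", "ABC")
two_fields = re.compile(r"""(?x)
    \s+ (\S+)   # first number
    \s+ (\S+)   # second number
""")
fields = two_fields.match("  12  34").groups()
```

The `greedy_negative` line shows why a negative lookahead is usually placed right after a fixed prefix, as in `bai (?!yang)`, rather than after a greedy `.*`.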
