.NET Regular Expressions: Advanced Techniques and How the Engine Works

Source: Internet
Author: User

Syntax: ??, *?, +?, {n}?, {n,m}?

Meaning: Put simply, a trailing ? (the lazy modifier) tells the regex engine that the quantifier in front of it should take the shortest match that still lets the overall match succeed. For example, ? by itself matches 0 or 1 times, so ?? prefers the shorter choice and matches 0 times. In the same way, *? prefers 0 repetitions, +? prefers 1, {n}? matches exactly n, and {n,m}? prefers n. When you match @"\w*?" against "ABCD", there are five successful matches, and each result is an empty string. Why five? Because the engine tests the pattern at one character position after another, and after each successful (empty) match it moves forward by one character. That gives one match in front of each of the four characters and one more at the end of the string.
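The behavior described above can be checked with a short program. This is only a minimal sketch: the class name LazyQuantifierDemo and the helper MatchValues are made up for illustration and are not part of the original article.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class LazyQuantifierDemo
{
    // Returns the value of every match, in the order the engine finds them.
    public static string[] MatchValues(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Lazy \w*? is satisfied with zero characters at every position,
        // including the position after the last character.
        string[] lazy = MatchValues("ABCD", @"\w*?");
        Console.WriteLine(lazy.Length);                        // 5

        // Lazy \w+? must take at least one character, so it stops after one.
        Console.WriteLine(Regex.Match("ABCD", @"\w+?").Value); // A

        // Greedy \w+ takes the whole run of word characters.
        Console.WriteLine(Regex.Match("ABCD", @"\w+").Value);  // ABCD
    }
}
```

A lazy quantifier is rarely useful on its own like this; it becomes meaningful when something after it in the pattern forces it to expand.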

Conditional expressions

Syntax:

1. A|B: this is the most basic form, A or B. Strictly speaking it is alternation, not a conditional.

2. (?(expression)yes-expression|no-expression), where no-expression is optional. If expression matches at the current position, yes-expression must match; otherwise no-expression must match.

3. (?(group-name)yes-expression|no-expression), where no-expression is optional. If the group named group-name has matched successfully, yes-expression must match; otherwise no-expression must match.

The syntax is easy to understand, with one thing to watch out for: @"(?(A)A|B)" cannot match "AA" as a whole. Why not, and how should the pattern be written so that it does? Think about it for a moment...

The pattern should be written like this: @"(?(A)AA|B)". Note that the condition is only a test. Its content is not part of the yes-expression or the no-expression, so it does not consume any characters.
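The point about the condition not consuming characters can be seen in a small sketch. The class name ConditionalDemo, the helper FirstMatch, and the quoted-word pattern with a group named q are illustrative assumptions, not taken from the original article.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class ConditionalDemo
{
    // Returns the text of the first match, or an empty string when nothing matches.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // The condition (A) only looks ahead; the yes-branch then consumes a single A.
        Console.WriteLine(FirstMatch("AA", @"(?(A)A|B)"));   // A

        // Putting AA in the yes-branch lets one match cover both characters.
        Console.WriteLine(FirstMatch("AA", @"(?(A)AA|B)"));  // AA

        // When the test for A fails, the no-branch B is used.
        Console.WriteLine(FirstMatch("B", @"(?(A)AA|B)"));   // B

        // Group-name form: require a closing quote only if the group q
        // captured an opening quote.
        Console.WriteLine(FirstMatch("\"abc\"", @"(?<q>"")?\w+(?(q)"")")); // "abc"
    }
}
```

If a pattern also defines a group whose name equals the condition text, .NET treats the condition as a group test instead, so writing the look-ahead explicitly as (?(?=A)AA|B) avoids the ambiguity.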

How the .NET regular expression engine works

Most of the time the .NET regex engine behaves the way we would take for granted, but a few points deserve attention:

1. The .NET Framework regular expression engine matches as many characters as possible (it is greedy). Because of this, do not use a pattern like @"<.*>(.*)</.*>" to try to find all the inner text in an HTML document. (I decided to write this article on advanced regex techniques after seeing someone on the Internet post exactly this kind of pattern, hehe.)
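To make the problem concrete, here is a minimal sketch that compares the greedy pattern with a lazy variant on a tiny sample string. The class GreedyHtmlDemo, the helper InnerTexts, and the sample HTML are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class GreedyHtmlDemo
{
    // Collects capture group 1 from every match of the pattern.
    public static string[] InnerTexts(string html, string pattern)
    {
        return Regex.Matches(html, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string html = "<b>one</b><i>two</i>";

        // Greedy: the first .* swallows "b>one</b><i", so the whole string
        // becomes a single match and only "two" is captured.
        Console.WriteLine(string.Join(",", InnerTexts(html, @"<.*>(.*)</.*>")));

        // Lazy: each .*? stops at the nearest delimiter, so each element
        // is matched separately and both texts are captured.
        Console.WriteLine(string.Join(",", InnerTexts(html, @"<.*?>(.*?)</.*?>")));
    }
}
```

The lazy variant behaves better on this sample, but it is still only a sketch: nested or malformed markup will defeat it, and a real HTML parser is the safer choice.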

2. The .NET Framework regular expression engine is a backtracking matcher that incorporates a traditional nondeterministic finite automaton (NFA) engine, like the engines used by Perl and Python. This distinguishes it from the faster but more limited pure deterministic finite automaton (DFA) engines. The engine does whatever it can to make the overall match succeed. For example, when @"\w+\.(.*)\.\w+" is applied to "www.csdn.net", the .* first matches all of "csdn.net", which leaves no characters for the following \.\w+. The engine then backtracks, giving characters back from .* until \.\w+ can match, and so obtains a successful match.
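The backtracking step can be observed by looking at what the capture group ends up holding. This is a minimal sketch; the class BacktrackingDemo, the helper MiddlePart, and the second input string are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class BacktrackingDemo
{
    // Returns what (.*) holds after the engine has finished backtracking.
    public static string MiddlePart(string host)
    {
        Match m = Regex.Match(host, @"\w+\.(.*)\.\w+");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // .* first grabs "csdn.net", then gives back ".net" so that \.\w+ can match.
        Console.WriteLine(MiddlePart("www.csdn.net"));      // csdn

        // Being greedy, .* gives back as little as possible: only the last ".net".
        Console.WriteLine(MiddlePart("www.blog.csdn.net")); // blog.csdn
    }
}
```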
 
The .NET Framework regular expression engine also includes a complete set of syntax that lets programmers steer the backtracking engine, including:

"Lazy" quantifiers: ??, *?, +?, {n,m}?. These tell the backtracking engine to try the minimum number of repetitions first. In contrast, ordinary "greedy" quantifiers try the maximum number of repetitions first.

Right-to-left matching. This is useful when searching from right to left instead of from left to right, or when it is more efficient to begin a match at the right part of the pattern instead of the left.
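Right-to-left matching is enabled with RegexOptions.RightToLeft. The following is a minimal sketch; the class RightToLeftDemo, the helper FirstNumber, and the sample input are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class RightToLeftDemo
{
    // Returns the first run of digits found, scanning in the requested direction.
    public static string FirstNumber(string input, bool rightToLeft)
    {
        RegexOptions options = rightToLeft ? RegexOptions.RightToLeft : RegexOptions.None;
        return Regex.Match(input, @"\d+", options).Value;
    }

    public static void Main()
    {
        string input = "abc123def456";
        Console.WriteLine(FirstNumber(input, false)); // 123
        Console.WriteLine(FirstNumber(input, true));  // 456
    }
}
```

This is a convenient way to get the last occurrence of something without enumerating every match first.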

3. Given (expression1|expression2|expression3), the .NET Framework regular expression engine always tries expression1 first, then expression2, then expression3.

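using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        string s = "Thin is a asp.net developer.";
        // \w{2} is listed first, so it is tried first at every position.
        Regex reg = new Regex(@"(\w{2}|\w{3}|\w{4})", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        MatchCollection mc = reg.Matches(s);
        foreach (Match m in mc)
            Console.WriteLine(m.Value);
        Console.ReadLine(); // keep the console window open
    }
}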

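Output (for the input "Thin is a asp.net developer."): 'Th' 'in' 'is' 'as' 'ne' 'de' 've' 'lo' 'pe'. Every match is two characters long, because \w{2} is tried first and succeeds wherever \w{3} or \w{4} could.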
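A practical consequence is that the order of the alternatives matters. The sketch below compares the shortest-first order used in the example with a longest-first order, assuming the input string reads "Thin is a asp.net developer." with spaces. The class AlternationOrderDemo and the helper MatchValues are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class AlternationOrderDemo
{
    // Returns the value of every match, in the order the engine finds them.
    public static string[] MatchValues(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string s = "Thin is a asp.net developer."; // assumed input

        // Shortest alternative first: \w{3} and \w{4} are never reached.
        Console.WriteLine(string.Join(" ", MatchValues(s, @"\w{2}|\w{3}|\w{4}")));

        // Longest alternative first: each position gets the longest option that fits.
        Console.WriteLine(string.Join(" ", MatchValues(s, @"\w{4}|\w{3}|\w{2}")));
    }
}
```

With the longest alternative first, the same input should yield "Thin is asp net deve lope", since a shorter alternative is only tried when the longer ones fail at that position.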

Appendix

Escape character     Description
Ordinary characters  Characters other than . $ ^ { [ ( | ) * + ? \ match themselves.
\a                   Matches the bell (alarm) character \u0007.
\b                   In a regular expression, \b denotes a word boundary (between \w and \W), but inside a [] character class \b denotes a backspace. In a replacement pattern, \b always denotes a backspace.
\t                   Matches the tab character \u0009.
\r                   Matches the carriage return character \u000D.
\v                   Matches the vertical tab character \u000B.
\f                   Matches the form feed character \u000C.
\n                   Matches the line feed character \u000A.
\e                   Matches the escape character \u001B.
\040                 Matches an ASCII character given as an octal number (up to three digits). A number with no leading zero is a backreference if it has only one digit or if it corresponds to a capturing group number. For example, \040 represents a space.
\x20                 Matches an ASCII character given in hexadecimal (exactly two digits).
\cC                  Matches an ASCII control character; for example, \cC is Ctrl-C.
\u0020               Matches a Unicode character given in hexadecimal (exactly four digits).
\                    When followed by a character that is not recognized as an escape, matches that character. For example, \* is the same as \x2A.
Character class      Description
.                    Matches any character except \n. With the Singleline option, the period matches any character.
[aeiou]              Matches any single character contained in the specified set.
[^aeiou]             Matches any single character that is not in the specified set.
[0-9a-fA-F]          A hyphen (-) lets you specify a contiguous range of characters.
\p{name}             Matches any character in the named character class specified by {name}. Supported names are Unicode groups and block ranges, for example Ll, Nd, Z, IsGreek, IsBoxDrawing. You can use the GetUnicodeCategory method to find the Unicode category a character belongs to.
\P{name}             Matches text that is not included in the groups and block ranges specified by {name}.
\w                   Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]. With the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].
\W                   Matches any non-word character. Equivalent to the Unicode character categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}\p{Lm}]. With the ECMAScript option, \W is equivalent to [^a-zA-Z_0-9].
\s                   Matches any white-space character. Equivalent to [\f\n\r\t\v\x85\p{Z}]. With the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].
\S                   Matches any non-white-space character. Equivalent to [^\f\n\r\t\v\x85\p{Z}]. With the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v].
\d                   Matches any decimal digit. Equivalent to \p{Nd} for Unicode behavior and to [0-9] for non-Unicode, ECMAScript behavior.
\D                   Matches any non-digit character. Equivalent to \P{Nd} for Unicode behavior and to [^0-9] for non-Unicode, ECMAScript behavior.
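The difference between the default Unicode behavior and the ECMAScript option in the table above can be checked with a short sketch. The class CharacterClassDemo, its helper methods, and the sample characters are assumptions chosen for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class CharacterClassDemo
{
    // True when the whole input is a single character matched by the given class.
    public static bool IsSingle(string input, string charClass, bool ecmaScript)
    {
        RegexOptions options = ecmaScript ? RegexOptions.ECMAScript : RegexOptions.None;
        return Regex.IsMatch(input, "^" + charClass + "$", options);
    }

    // Returns all uppercase letters (Unicode category Lu) in the input.
    public static string UppercaseLetters(string input)
    {
        return string.Concat(Regex.Matches(input, @"\p{Lu}").Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // U+00E9 (e with acute accent) is a lowercase letter (Ll): a word
        // character by default, but not under ECMAScript rules.
        Console.WriteLine(IsSingle("\u00E9", @"\w", false)); // True
        Console.WriteLine(IsSingle("\u00E9", @"\w", true));  // False

        // U+0663 (Arabic-Indic digit three) is in category Nd.
        Console.WriteLine(IsSingle("\u0663", @"\d", false)); // True
        Console.WriteLine(IsSingle("\u0663", @"\d", true));  // False

        Console.WriteLine(UppercaseLetters("Hello World"));  // HW
    }
}
```

When input must be restricted to ASCII digits or letters, either use the ECMAScript option or spell the class out, for example [0-9].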
Assertion            Description
^                    The match must occur at the beginning of the string or at the beginning of a line.
$                    The match must occur at the end of the string, before \n at the end of the string, or at the end of a line.
\A                   The match must occur at the beginning of the string (ignores the Multiline option).
\Z                   The match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).
\z                   The match must occur at the end of the string (ignores the Multiline option).
\G                   The match must occur at the point where the previous match ended. When used with Match.NextMatch(), this assertion ensures that all matches are contiguous.
\b                   The match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters, that is, on a word boundary: at the first or last character of a word delimited by non-alphanumeric characters.
\B                   The match must not occur on a \b boundary.
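Two of the less familiar assertions, \Z versus \z and \G, are easier to understand with a small sketch. The class AnchorDemo, its helpers, and the sample strings are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not from the original article.
public static class AnchorDemo
{
    // True when the pattern matches somewhere in the input.
    public static bool Matches(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    // Counts how many matches the pattern produces on the input.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // \Z tolerates one trailing \n at the end of the string; \z does not.
        Console.WriteLine(Matches("line\n", @"line\Z")); // True
        Console.WriteLine(Matches("line\n", @"line\z")); // False

        // \G forces every match to start where the previous one ended,
        // so matching stops at the first non-digit.
        Console.WriteLine(Count("123abc456", @"\G\d"));  // 3
        Console.WriteLine(Count("123abc456", @"\d"));    // 6
    }
}
```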
Quantifier           Description
*                    Specifies zero or more matches, for example \w* or (abc)*. Equivalent to {0,}.
+                    Specifies one or more matches, for example \w+ or (abc)+. Equivalent to {1,}.
?                    Specifies zero or one match, for example \w? or (abc)?. Equivalent to {0,1}.
{n}                  Specifies exactly n matches, for example (pizza){2}.
{n,}                 Specifies at least n matches, for example (abc){2,}.
{n,m}                Specifies at least n but no more than m matches.
*?                   Specifies the first match that uses as few repetitions as possible (lazy *).
+?                   Specifies as few repetitions as possible, but at least one (lazy +).
??                   Specifies zero repetitions if possible, otherwise one (lazy ?).
{n}?                 Equivalent to {n} (lazy {n}).
{n,}?                Specifies as few repetitions as possible, but at least n (lazy {n,}).
{n,m}?               Specifies as few repetitions as possible between n and m (lazy {n,m}).


