Regular Expression Learning reference

Source: Internet
Author: User
Tags: character classes

1 Overview

A regular expression is a pattern that describes the characteristics of a piece of text.

Just as words such as "tall" and "strong" in natural language are abstractions that describe the characteristics of things, a regular expression is a highly abstract string of characters that describes the characteristics of other strings.

Regular expressions (hereinafter "regex") usually do not exist on their own. Programming languages and tools each provide regex support, and each host language may trim or extend the syntax to suit its own characteristics.

Getting started with regex is easy, because the limited set of grammar rules is quick to grasp. Even so, regex is not as widely used as it could be, mainly because there are many regex flavors and the documentation of each host language focuses heavily on its own details, which beginners usually do not need.

Of course, anyone who wants to understand regular expressions in depth must eventually pay attention to those details, but that comes later. Let's start with the basics and enter the world of regular expressions.

2 Regular Expression Basics

2.1 Basic Concepts

2.1.1 String composition

The string "a5" is composed of two characters, "a" and "5", and three positions: one before "a", one between "a" and "5", and one after "5". This is important for understanding how regular expressions match.
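
The positions can be made visible with a short sketch. Python's standard re module is used here purely for illustration; the article itself does not prescribe a host language. An empty pattern consumes no characters, so it matches once at every position:

```python
import re

text = "a5"

# An empty pattern matches at every position without consuming a character.
positions = [m.start() for m in re.finditer(r"", text)]

print(list(text))   # the two characters of the string
print(positions)    # the three positions: before "a", between "a" and "5", after "5"
```

The string has two characters but three places where a match can begin or end.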

2.1.2 Character-consuming and zero-width subexpressions

During matching, if a subexpression matches character content rather than a position, and that content is saved to the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved to the final match result, the subexpression is said to be zero-width.

In other words, whether a subexpression is character-consuming or zero-width depends on whether what it matches is saved to the final match result.

Character consumption is mutually exclusive, while zero-width matching is not. That is, a character can be matched by only one subexpression at a time, whereas a position can be matched by several zero-width subexpressions at once.
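
As a rough illustration (a sketch in Python's re module, which is an assumed host language, not one named by the article), several zero-width subexpressions can all succeed at the same position, while only one subexpression gets to take the character that follows:

```python
import re

text = "a5"

# "^", "\b" and "(?=a)" are all zero-width and all match at position 0;
# only the final "a" takes a character into the result.
m = re.search(r"^\b(?=a)a", text)
consumed = m.group()
span = m.span()

# A zero-width subexpression on its own yields an empty result:
# the match starts and ends at the same position.
z = re.search(r"(?=5)", text)
zero_width_text = z.group()
zero_width_span = z.span()

print(consumed, span)
print(repr(zero_width_text), zero_width_span)
```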

2.1.3 Regular Expression composition

Regular expressions are composed of two kinds of characters. One kind is the "metacharacter", which has a special meaning in a regular expression; the other is the ordinary "literal character".

A metacharacter can be a single character, such as "^", or a sequence of characters, such as "\w".

2.2 Metacharacters (Meta Character)

2.2.1 [...] Character class (character group)

A character class matches any one of the characters listed inside []. It can be any of them, but only one.

A character class can use a hyphen "-" to express a range. When "-" forms a range, the code point of the character before it must be lower than the code point of the character after it.

[^...] is a negated character class. It matches any one character that is not listed, and again only one. A negated character class can also use a hyphen "-" to express a range.

Expression         Description
[abc]              Matches "a", "b", or "c"
[0-9]              Matches any one digit from 0 to 9; equivalent to [0123456789]
[\u4e00-\u9fa5]    Matches any one Chinese character (a CJK ideograph in this code point range)
[^a1<]             Matches any one character except "a", "1", and "<"
[^a-z]             Matches any one character except a lowercase letter

Example:

When "[0-9][0-9]" is matched against "Windows 2003", the match succeeds and the result is "20".

When "[^inW]" is matched against "Windows 2003", the match succeeds and the result is "d".

2.2.2 Common character range abbreviations

Writing out a character class such as [0-9] for very common ranges, such as digits, is cumbersome, so some metacharacters are defined as shorthand for common character ranges.

Expression    Description
\d            Any one digit; equivalent to [0-9]
\w            Any one letter, digit, or underscore; equivalent to [a-zA-Z0-9_]
\s            Any one whitespace character; equivalent to [ \r\n\f\t\v]
\D            Any one non-digit character, the complement of \d; equivalent to [^0-9]
\W            The complement of \w; equivalent to [^a-zA-Z0-9_]
\S            Any one non-whitespace character, the complement of \s; equivalent to [^ \r\n\f\t\v]

Example:

When "\w\s\d" is matched against "Windows 2003", the match succeeds and the result is "s 2".
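
A sketch of the shorthand classes, assuming Python's re module. One host-language detail worth knowing: for text (str) patterns Python treats \d, \w, and \s as Unicode-aware, so the ASCII equivalences in the table hold exactly only when the re.ASCII flag is set. The Arabic-Indic digit below is an illustrative input, not taken from the article:

```python
import re

text = "Windows 2003"

# One word character, one whitespace character, one digit.
result = re.search(r"\w\s\d", text).group()

# The uppercase forms are the complements of the lowercase forms.
non_digits = re.findall(r"\D", "a1b2")

# U+0663 is a decimal digit outside the ASCII range 0-9.
arabic_three = "\u0663"
unicode_digit = re.fullmatch(r"\d", arabic_three)           # matches by default
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII)   # None: \d is [0-9] here

print(result, non_digits, unicode_digit is not None, ascii_digit is None)
```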

2.2.3 The dot (.)

The dot matches any one character except "\n". To match every character, including "\n", the usual approaches are to use [\s\S], or to use "." together with the (?s) matching mode.

Expression    Description
.             Matches any one character except the line break \n
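
The two workarounds for matching a line break can be compared side by side. This is a sketch under the assumption that Python's re module is the host; there the (?s) mode is also available as the re.DOTALL flag:

```python
import re

text = "a\nb"

# By default the dot refuses to match the line break.
default = re.search(r"a.b", text)

# Inline (?s) mode, or the equivalent re.DOTALL flag, lets "." match "\n".
inline_mode = re.search(r"(?s)a.b", text)
flag_mode = re.search(r"a.b", text, re.DOTALL)

# [\s\S] matches any character regardless of mode.
any_char_class = re.search(r"a[\s\S]b", text)

print(default, inline_mode.group() == text, any_char_class.group() == text)
```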

2.2.4 Other meta-characters

Expression    Description
^             Matches the position at the start of the string; matches no characters
$             Matches the position at the end of the string; matches no characters
\b            Matches a word boundary; matches no characters

Example:

When "^a" is matched against "cba", the match fails, because the expression requires the character "a" immediately after the start position, and "cba" does not satisfy this.

When "\d$" is matched against "123", the match succeeds and the result is "3". The expression requires a digit at the end of the string; if the string does not end with a digit, as in "123abc", the match fails.

2.2.5 Escape character

Some invisible characters, and metacharacters that have a special meaning in regex, must be escaped with "\" in order to match the character itself.

Expression      Description
\r, \n          Match a carriage return and a line break
\\              Matches "\" itself
\^, \$, \.      Match "^", "$", and ".", respectively

The following characters usually need to be escaped to match themselves. In practice, depending on the context, more characters than those listed below may need escaping:

.  $  ^  {  [  (  |  )  *  +  ? \
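
A sketch of escaping in practice, assuming Python's re module. The helper re.escape is specific to Python (other hosts have their own equivalents) and is handy when a literal string comes from outside the program; the sample strings are illustrative:

```python
import re

# Unescaped, the dot is a metacharacter; escaped, it is a literal dot.
any_char = re.findall(r".", "a.b")
literal_dot = re.findall(r"\.", "a.b.c")

# Matching a single backslash requires "\\" in the pattern.
backslash = re.search(r"\\", "C:\\temp").group()

# re.escape adds the backslashes needed to match a string literally.
escaped = re.escape("1+1=2?")
found = re.search(escaped, "is 1+1=2? yes").group()

print(any_char, literal_dot, backslash, found)
```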

2.2.6 Quantifiers (Quantifier)

A quantifier specifies how many times a subexpression may be matched. A quantifier can modify a single character, a character class, or a subexpression enclosed in (). Some commonly used quantifiers are defined as metacharacters of their own.

Expression    Description                                            Example
{m}           Matches the expression exactly m times                 "\d{3}" is equivalent to "\d\d\d"; "(abc){2}" is equivalent to "abcabc"
{m,n}         Matches the expression at least m and at most n times  "\d{2,3}" matches 2 to 3 digits, such as "12" or "321"
{m,}          Matches the expression at least m times                "[a-z]{8,}" matches 8 or more lowercase letters
?             Matches 0 or 1 times; equivalent to {0,1}              "ab?" matches "a" or "ab"
*             Matches 0 or more times; equivalent to {0,}            In "<[^>]*>", "[^>]*" matches 0 or more characters that are not ">"
+             Matches 1 or more times; equivalent to {1,}            "\d\s+\d" matches two digits with one or more whitespace characters between them

Note: In a regular expression that is not generated dynamically, a quantifier such as "{1}" should not appear. For example, "\w{1}" produces the same result as "\w", but it reduces matching efficiency and readability and is superfluous.
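
The quantifier examples can be checked with a sketch like the following. It assumes Python's re module; the helper name full and the sample inputs are hypothetical and exist only for this illustration:

```python
import re

def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

samples = ["1", "12", "321", "4321"]

three_digits = full(r"\d{3}", "123")
abc_twice = full(r"(abc){2}", "abcabc")
two_to_three = [s for s in samples if full(r"\d{2,3}", s)]
optional_b = [s for s in ["a", "ab", "abb"] if full(r"ab?", s)]

# "[^>]*" takes every character up to, but not including, the next ">".
tag = re.search(r"<[^>]*>", "x <div class='c'> y").group()

# "\s+" needs at least one whitespace character between the digits.
spaced = full(r"\d\s+\d", "1 \t 2")
unspaced = full(r"\d\s+\d", "12")

print(two_to_three, optional_b, tag, spaced, unspaced)
```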

2.2.7 Branch structure (alternation)

When part of a string can take several possible forms, alternation is used to match it. "|" expresses an "or" relationship between multiple subexpressions. The scope of "|" is bounded by the enclosing (). If there are no () around "|" to limit its scope, its scope is everything to its left and everything to its right.

Expression    Description
|             An "or" relationship between multiple subexpressions

Example:

When "^aa|b$" is matched against "cccb", the match succeeds and the result is "b", because the expression means "^aa" or "b$", and "b$" matches at the end of "cccb".

When "^(aa|b)$" is matched against "cccb", the match fails, because the expression requires that only "aa" or "b" appear between the start and end positions, and "cccb" does not satisfy this.

3 Regular Expression Advanced

3.1 Capturing groups (Capture Group)

A capturing group stores the content matched by a subexpression of the regular expression in memory, in a group that is numbered or explicitly named, so that it can be referenced later.

Expression             Description
(Expression)           Ordinary capturing group; saves the content matched by the subexpression Expression to a numbered group
(?<name>Expression)    Named capturing group; saves the content matched by the subexpression Expression to a group called name

Ordinary capturing groups (simply "capturing groups" where there is no ambiguity) are numbered from left to right in the order of their opening parentheses, starting with 1. Typically, the group numbered 0 holds the content matched by the entire expression.

A named capturing group lets captured content be referenced by group name instead of by number. This is more convenient: there is no need to keep track of group numbers or to worry that a partial change to the expression will cause the wrong group to be referenced.
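
A sketch of numbered and named groups, assuming Python's re module. Note the host-language difference: Python writes a named group as (?P<name>Expression), whereas the table above uses the (?<name>Expression) spelling found in other flavors. The date string is invented for illustration:

```python
import re

text = "released 2003-04-24"

# Numbered groups: 0 is the whole match, 1..n follow the opening parentheses.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
whole = m.group(0)
year = m.group(1)
parts = m.groups()

# Nested groups are still numbered by the position of their "(".
nested = re.search(r"((a)(b))", "ab").groups()

# Named groups, in Python's (?P<name>...) spelling.
d = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", text)
by_name = d.group("year")
as_dict = d.groupdict()

print(whole, year, parts, nested, by_name, as_dict)
```

Referring to d.group("year") keeps working even if another group is later inserted in front of it, which is the maintenance benefit described above.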

3.2 Non-capturing group

In some expressions () is required for grouping, but the content matched by the subexpression inside () does not need to be saved. In that case a non-capturing group can be used to avoid the side effect of ().

Expression        Description
(?:Expression)    Matches the subexpression Expression and includes the matched content in the overall match result, but does not save it separately into a group
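
The difference shows up in which groups are available after a match. A minimal sketch, assuming Python's re module and an invented input string:

```python
import re

text = "abcabc7"

# With a capturing group, "(abc)" occupies group 1 and the digit is group 2.
capturing = re.search(r"(abc)+(\d)", text)

# With a non-capturing group, the digit becomes group 1.
non_capturing = re.search(r"(?:abc)+(\d)", text)

print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

Both patterns match the same text overall; only the bookkeeping of saved groups changes.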

3.3 Backreferences

The content matched by a capturing group can be referenced from the program outside the regular expression, or from inside the expression itself. A reference from inside the expression is called a backreference.

Backreferences are usually used to find repeated substrings, or to require that parts of a string appear as a matching pair.

Expression    Description
\1, \2        Backreferences to the capturing groups numbered 1 and 2
\k<name>      Backreference to the capturing group called name

Example:

When "(a|b)\1" is matched against "abaa", the match succeeds and the result is "aa". While matching, "(a|b)" can match either "a" or "b", but by the time the backreference is evaluated, the content matched by the corresponding () is already fixed.

3.4 Lookaround (Look Around)

A lookaround only tests whether its subexpression matches; the matched content does not count toward the final match result, so a lookaround is zero-width.

By direction, lookarounds are divided into lookahead (looking to the right) and lookbehind (looking to the left); by whether a match is required, into positive and negative. Combined, there are four kinds of lookaround. A lookaround is equivalent to attaching an additional condition to the position where it appears.

Expression         Description
(?<=Expression)    Positive lookbehind; the text to the left of the position must match Expression
(?<!Expression)    Negative lookbehind; the text to the left of the position must not match Expression
(?=Expression)     Positive lookahead; the text to the right of the position must match Expression
(?!Expression)     Negative lookahead; the text to the right of the position must not match Expression

Example:

When "(?<=Windows )\d+" is matched against "Windows 2003", the match succeeds and the result is "2003". "\d+" matches one or more digits, while "(?<=Windows )" is an additional condition: the text immediately to the left of the position must be "Windows " (including the trailing space), and that text does not count toward the match result. The same regex fails against "Office 2003", because no run of digits there has "Windows " on its left.

When "(?!1)\d+" is matched against "123", the match succeeds and the result is "23". "\d+" matches one or more digits, but the additional condition "(?!1)" requires that the character to the right of the position is not "1", so the match succeeds starting at the position in front of "2".

3.5 Match-first and ignore-first quantifiers

These are also known as the greedy and non-greedy modes of regular expression matching.

A subexpression modified by a standard quantifier, when it could either match or not match, always tries to match first. This is called match-first, or greedy mode. The quantifiers described earlier, "{m}", "{m,n}", "{m,}", "?", "*", and "+", are all match-first.

Some NFA regex engines support ignore-first quantifiers, which are written by appending "?" to a standard quantifier. In that case, when the subexpression could either match or not match, the engine always skips the match first, and matches the subexpression only when that is required for the whole expression to succeed. This is called ignore-first, or non-greedy mode. The ignore-first quantifiers are "{m}?", "{m,n}?", "{m,}?", "??", "*?", and "+?".

Example:

Source string: <div>aaa</div><div>bbb</div>

Regular expression 1: <div>.*</div>
Match result: <div>aaa</div><div>bbb</div>

Regular expression 2: <div>.*?</div>
Match result: <div>aaa</div>
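
The same comparison as a runnable sketch, assuming Python's re module (one of the engines that supports the "?" suffix on quantifiers):

```python
import re

source = "<div>aaa</div><div>bbb</div>"

# Greedy: ".*" takes as much as it can, then gives back only what is needed.
greedy = re.search(r"<div>.*</div>", source).group()

# Non-greedy: ".*?" takes as little as it can, extending only when forced.
lazy = re.search(r"<div>.*?</div>", source).group()

# With the non-greedy form, each element is found separately.
all_lazy = re.findall(r"<div>.*?</div>", source)

print(greedy)
print(lazy)
print(all_lazy)
```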
