Regular Expression Learning Reference: Recommended Reading for Beginners

Source: Internet
Author: User
Tags: character classes, character set, numeric

1 Overview

A regular expression is a matching pattern that describes the characteristics of a piece of text.

Just as words such as "tall" and "strong" in natural language are abstractions used to describe things, a regular expression is a highly abstract sequence of characters used to describe the characteristics of strings.

Regular expressions (hereinafter "regex") usually do not exist on their own. Programming languages and tools act as host languages that provide regex support, and each trims or extends the syntax to suit its own characteristics.

Regex is easy to get started with, and its limited set of grammar rules is easy to grasp. It is still not widely used, though, mainly because there are many regex flavors and the documentation of each host language focuses heavily on its own details, which beginners usually do not need to care about.

Of course, those details matter if you want to understand regular expressions in depth, but that comes later. Let us start from the basics and step into the world of regular expressions.

2 Regular Expression Basics

2.1 Basic Concepts

2.1.1 String composition


For the string "a5", there are two characters, "a" and "5", and three positions: before "a", between "a" and "5", and after "5". This is important for understanding how regular expression matching works.
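
The three positions can be made visible with a small sketch. Python's `re` module is used here only as an assumed host language (the article itself names none). An empty pattern consumes no characters, so it matches once at every position.

```python
import re

text = "a5"

# An empty pattern is zero-width, so it succeeds at each position in the string.
positions = [m.start() for m in re.finditer(r"", text)]

print(positions)  # [0, 1, 2]: before "a", between "a" and "5", after "5"
```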

2.1.2 Consuming characters and zero width

During matching, if a subexpression matches character content (rather than a position) and that content is saved to the final match result, the subexpression is said to consume characters. If a subexpression matches only a position, or the content it matches is not saved to the final match result, the subexpression is said to be zero-width.

Whether a subexpression consumes characters or is zero-width depends on whether what it matches is saved to the final match result.

Consuming characters is mutually exclusive; zero-width matching is not. A character can be matched by only one subexpression at a time, while a position can be matched by several zero-width subexpressions at once.
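
As a rough illustration, again assuming Python's `re` module as the host, the sketch below stacks three zero-width subexpressions on the same position and lets a single subexpression consume the character that follows. The variable names are illustrative only.

```python
import re

text = "a5"

# ^, \b and (?=a) are all zero-width and all succeed at position 0;
# only the final "a" consumes a character.
m = re.search(r"^\b(?=a)a", text)
consumed = m.group()
span = m.span()

# A pattern built only from zero-width pieces matches a position, not text.
zero = re.search(r"^(?=a)", text)
zero_text = zero.group()
zero_span = zero.span()

print(consumed, span)        # a (0, 1)
print(repr(zero_text), zero_span)  # '' (0, 0)
```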

2.1.3 Composition of a regular expression

A regular expression is made up of two kinds of characters: "metacharacters", which have special meaning in regex, and ordinary "text characters".

A metacharacter can be a single character, such as "^", or a sequence of characters, such as "\w".

2.2 Metacharacters (Meta Character)

2.2.1 [...] Character classes (Character Classes)

A character class matches any one of the characters listed inside []. It can be any of them, but only one.

Character classes support the hyphen "-" to express a range. When "-" forms a range, the code point of the character before it must not be greater than the code point of the character after it.

[^...] is a negated character class. It matches any one character that is not listed, and again only one. Negated character classes also support the hyphen "-" for ranges.

Expression          Description
[abc]               Matches "a", "b", or "c"
[0-9]               Matches any one digit from 0 to 9, equivalent to [0123456789]
[\u4e00-\u9fa5]     Matches any one Chinese character
[^a1<]              Matches any one character other than "a", "1", and "<"
[^a-z]              Matches any one character other than a lowercase letter
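
The character classes in the table can be tried out with a short sketch. It assumes Python's `re` module as the host language, and the sample strings are chosen for illustration.

```python
import re

text = "Windows 2003"

# Each class matches exactly one character, so two classes match two characters.
two_digits = re.search(r"[0-9][0-9]", text).group()

# Negated classes: the first character that is not "i", "n" or "W",
# and the first character that is not a lowercase letter.
first_other = re.search(r"[^inW]", text).group()
non_lower = re.search(r"[^a-z]", text).group()

# The range U+4E00 to U+9FA5 covers commonly used Chinese characters.
han = re.findall(r"[\u4e00-\u9fa5]", "regex 正则")

print(two_digits, first_other, non_lower, han)
```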

Example:

When "[0-9][0-9]" is matched against "Windows 2003", the match succeeds and the result is "20".

When "[^inW]" is matched against "Windows 2003", the match succeeds and the result is "d".

2.2.2 Shorthand for common character ranges

Some character ranges, such as digits, are used so often that even a character class like [0-9] is tedious to write, so some metacharacters are defined as shorthand for common ranges.

Expression  Description
\d          Any one digit, equivalent to [0-9]
\w          Any one letter, digit, or underscore, equivalent to [a-zA-Z0-9_]
\s          Any one whitespace character, equivalent to [ \r\n\f\t\v]
\D          Any one non-digit character, the inverse of \d, equivalent to [^0-9]
\W          The inverse of \w, equivalent to [^a-zA-Z0-9_]
\S          Any one non-whitespace character, the inverse of \s, equivalent to [^ \r\n\f\t\v]
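
A minimal sketch of the shorthand classes, assuming Python's `re` module as the host. Note that in Python the shorthands are Unicode-aware by default for text strings, so the `re.ASCII` flag is passed here to get the ASCII-only sets listed in the table; other hosts may behave differently.

```python
import re

text = "Windows 2003"

# re.ASCII restricts \d, \w and \s (and their inverses) to the ASCII sets above.
word_space_digit = re.search(r"\w\s\d", text, flags=re.ASCII).group()
digits = re.findall(r"\d", text, flags=re.ASCII)
non_word = re.findall(r"\W", text, flags=re.ASCII)
non_space_run = re.search(r"\S+", text, flags=re.ASCII).group()

print(repr(word_space_digit))  # 's 2'
print(digits)                  # ['2', '0', '0', '3']
print(non_word)                # [' ']
print(non_space_run)           # Windows
```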

Example:

When "\w\s\d" is matched against "Windows 2003", the match succeeds and the result is "s 2".

2.2.3 . The dot

The dot matches any one character other than "\n". To match every character including "\n", use [\s\S], or use "." together with the (?s) matching mode.

Expression  Description
.           Matches any one character except the newline character \n
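
The difference between the plain dot, [\s\S], and the (?s) mode can be sketched as follows, assuming Python's `re` module, where the (?s) mode is also available as the `re.DOTALL` flag.

```python
import re

text = "a\nb"

# By default the dot refuses to match the newline between "a" and "b".
plain = re.search(r"a.b", text)

# Three ways to let the middle position accept a newline as well.
any_char = re.search(r"a[\s\S]b", text)
dotall_inline = re.search(r"(?s)a.b", text)
dotall_flag = re.search(r"a.b", text, flags=re.DOTALL)

print(plain)                         # None
print(repr(any_char.group()))        # 'a\nb'
print(repr(dotall_inline.group()))   # 'a\nb'
print(repr(dotall_flag.group()))     # 'a\nb'
```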

2.2.4 Other metacharacters

Expression  Description
^           Matches the position at the start of the string; does not match any character
$           Matches the position at the end of the string; does not match any character
\b          Matches a word boundary; does not match any character

Example:

When "^a" is matched against "cba", the match fails, because the expression requires the character "a" right after the start position, and "cba" clearly does not satisfy that.

When "\d$" is matched against "123", the match succeeds and the result is "3". The expression requires a digit at the end; if the string does not end with a digit, as in "123abc", the match fails.
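
The anchor examples can be reproduced with a short sketch, assuming Python's `re` module as the host. The `\b` line uses a sample sentence that is not from the article.

```python
import re

# "^a" needs an "a" immediately after the start position.
fail_start = re.search(r"^a", "cba")

# "\d$" needs a digit immediately before the end position.
end_digit = re.search(r"\d$", "123").group()
fail_end = re.search(r"\d$", "123abc")

# \b matches only where a word character meets a non-word character or an edge,
# so the "cat" inside "concat" is skipped.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", "cat concat cat.")]

print(fail_start, end_digit, fail_end, whole_words)
```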

2.2.5 Escape characters

Some invisible characters, and metacharacters that have special meaning in regex, must be escaped with "\" when you want to match the characters themselves.

Expression      Description
\r, \n          Match a carriage return and a line feed
\\              Matches "\" itself
\^, \$, \.      Match "^", "$", and "." respectively

The following characters usually need to be escaped when matching them literally. In practice, depending on the context, more characters than those listed below may need escaping.

.  $  ^  {  [  (  |  )  *  +  ?  \
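
A brief sketch of escaping, assuming Python's `re` module. Raw string literals (`r"..."`) are used so that the backslashes reach the regex engine unchanged, and `re.escape` is Python's helper for escaping a whole literal string; other hosts offer similar helpers under different names.

```python
import re

# "$" and "." are escaped so that they match themselves.
price = re.search(r"\$\d+\.\d+", "total: $3.50").group()

# "\\" in the pattern matches one literal backslash in the text.
backslash = re.search(r"\\", r"C:\temp").group()

# Unescaped, "+" is a quantifier, so the pattern does not match its own text.
literal = "1+1=2"
unescaped_matches = re.fullmatch(literal, literal) is not None
escaped_matches = re.fullmatch(re.escape(literal), literal) is not None

print(price, backslash, unescaped_matches, escaped_matches)
```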

2.2.6 Quantifiers (Quantifier)

A quantifier specifies how many times a subexpression may match. It can modify a single character, a character class, or a subexpression enclosed in (). Some commonly used quantifiers are defined as metacharacters of their own.

Expression  Description                                   Example
{m}         Matches exactly m times                       "\d{3}" is equivalent to "\d\d\d"; "(abc){2}" is equivalent to "abcabc"
{m,n}       Matches at least m times and at most n times  "\d{2,3}" matches 2 to 3 digits, such as "12" or "321"
{m,}        Matches at least m times                      "[a-z]{8,}" means at least 8 lowercase letters
?           Matches 0 or 1 times, equivalent to {0,1}     "ab?" matches "a" or "ab"
*           Matches 0 or more times, equivalent to {0,}   "[^>]*" in "<[^>]*>" means 0 or more characters that are not ">"
+           Matches 1 or more times, equivalent to {1,}   "\d\s+\d" means two digits with at least one whitespace character between them

Note: In a regular expression that is not generated dynamically, do not write the quantifier "{1}". For example, "\w{1}" gives the same result as "\w", but it reduces matching efficiency and readability, so it is superfluous.
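
The quantifiers described in this section can be exercised with a small sketch, assuming Python's `re` module as the host. The sample strings are illustrative.

```python
import re

# {m}: exact repetition of a character class or of a group.
exact = re.fullmatch(r"\d{3}", "123") is not None
repeated_group = re.fullmatch(r"(abc){2}", "abcabc") is not None

# {m,n}: between 2 and 3 digits; a lone digit is skipped, "4567" yields "456".
ranged = re.findall(r"\d{2,3}", "1 12 321 4567")

# ?: the "b" is optional.
optional = re.findall(r"ab?", "a ab abb")

# *: zero or more characters that are not ">".
tag = re.search(r"<[^>]*>", "x <b class='k'> y").group()

# +: at least one whitespace character between two digits.
spaced = re.search(r"\d\s+\d", "7   8").group()

print(exact, repeated_group, ranged, optional, tag, repr(spaced))
```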

2.2.7 Alternation (Alternation)

When a substring has several possible forms, alternation is used to match it. "|" expresses an "or" relationship between subexpressions. The scope of "|" is bounded by the enclosing (); if no () limits it, its scope is the whole expression on each side of "|".

Expression  Description
|           Expresses an "or" relationship between subexpressions
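
How parentheses change the scope of "|" can be sketched as follows, assuming Python's `re` module as the host.

```python
import re

# Without (), the alternatives are "^aa" and "b$"; the second one matches.
loose = re.search(r"^aa|b$", "cccb").group()

# With (), both anchors apply to either alternative, so the whole string
# must be exactly "aa" or exactly "b".
strict_fail = re.search(r"^(aa|b)$", "cccb")
strict_ok = re.search(r"^(aa|b)$", "aa").group()

print(loose, strict_fail, strict_ok)
```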

Example:

When "^aa|b$" is matched against "cccb", the match succeeds and the result is "b", because the expression means "^aa" or "b$", and "b$" can match "cccb".

When "^(aa|b)$" is matched against "cccb", the match fails, because the expression means that only "aa" or "b" may appear between the start and end positions, and "cccb" clearly does not satisfy that.

3 Advanced Regular Expressions

3.1 Capturing groups (Capture Group)

A capturing group saves the content matched by the subexpression inside () to a numbered or explicitly named group in memory, so that it can be referenced later.

Expression            Description
(Expression)          Ordinary capturing group; saves the content matched by the subexpression Expression to a numbered group
(?<name>Expression)   Named capturing group; saves the content matched by the subexpression Expression to the group named name

Ordinary capturing groups (called simply "capturing groups" when there is no ambiguity) are numbered in the order in which their opening "(" appears, from left to right, starting at 1. The group numbered 0 usually represents the content matched by the entire expression.

Named capturing groups let you reference captured content by group name instead of by number. This is more convenient: you do not need to keep track of group numbers or worry that a partial change to the expression will make a reference point to the wrong group.
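
Group numbering and naming can be sketched as follows. This assumes Python's `re` module as the host, which is a case of the flavor differences mentioned in the overview: Python writes a named group as `(?P<name>...)`. The date string is a made-up sample.

```python
import re

text = "released 2003-04-24"

# Numbered groups: 0 is the whole match, 1..n follow the opening parentheses.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
whole = m.group(0)
year, month, day = m.group(1), m.group(2), m.group(3)

# Nested groups are still numbered by the position of their "(".
nested = re.search(r"((a)(b))(c)", "abc").groups()

# Named groups (Python spelling) can be read by name or by number.
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", text)
named_year = named.group("year")
named_month = named.group(2)

print(whole, year, month, day, nested, named_year, named_month)
```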

3.2 Non-capturing groups

In some expressions you have to use () but do not need to save the content matched by the subexpression inside them. In that case, a non-capturing group cancels the side effect of using ().

Expression          Description
(?:Expression)      Matches the subexpression Expression and includes what it matches in the overall match result, but does not save it separately to a group
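
The effect of a non-capturing group can be compared with an ordinary group in a short sketch, assuming Python's `re` module as the host.

```python
import re

text = "ababab7"

# (ab)+ needs parentheses for the quantifier, and also creates group 1.
capturing = re.search(r"(ab)+(\d)", text)
capturing_groups = capturing.groups()

# (?:ab)+ groups for the quantifier only; the digit becomes group 1.
non_capturing = re.search(r"(?:ab)+(\d)", text)
non_capturing_groups = non_capturing.groups()

# The overall match is the same in both cases.
same_match = capturing.group(0) == non_capturing.group(0)

print(capturing_groups, non_capturing_groups, same_match)
```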

3.3 Backreferences

The content matched by a capturing group can be referenced from the program outside the regular expression, or inside the expression itself. The latter is called a backreference.

A backreference is typically used to find repeated substrings, or to constrain how a substring may occur.

Expression   Description
\1, \2       Backreferences to the capturing groups numbered 1 and 2
\k<name>     Backreference to the capturing group named name
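
A minimal sketch of backreferences, assuming Python's `re` module as the host. Python spells a named backreference as `(?P=name)`, another flavor difference; the sample sentence is illustrative.

```python
import re

# \1 must repeat whatever group 1 actually matched, so "ab" and "ba" are
# rejected and the first match is "aa".
pair = re.search(r"(a|b)\1", "abaa").group()

# Named form in Python: (?P<name>...) defines, (?P=name) refers back.
named_pair = re.search(r"(?P<ch>a|b)(?P=ch)", "abaa").group()

# A typical use: finding a word that is accidentally repeated.
doubled_word = re.search(r"\b(\w+) \1\b", "it is is fine").group()

# Referencing the group from the program side, here in a replacement string.
marked = re.sub(r"(a|b)\1", r"<\1>", "abaa")

print(pair, named_pair, doubled_word, marked)
```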

Example:

When "(a|b)\1" is matched against "abaa", the match succeeds and the result is "aa". Although "(a|b)" may match either "a" or "b" when it is tried, by the time the backreference is evaluated the content matched by the corresponding () is already fixed.

3.4 Lookaround (Look Around)

A lookaround only tests whether its subexpression matches; what it matches is not included in the final match result, so it is zero-width.

By direction, lookarounds are either lookahead (to the right) or lookbehind (to the left); by whether a match is required or forbidden, they are either positive or negative. Combining the two gives four kinds. A lookaround in effect attaches an extra condition to the position where it sits.

Expression          Description
(?<=Expression)     Positive lookbehind: the text to the left of the position must match Expression
(?<!Expression)     Negative lookbehind: the text to the left of the position must not match Expression
(?=Expression)      Positive lookahead: the text to the right of the position must match Expression
(?!Expression)      Negative lookahead: the text to the right of the position must not match Expression
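
The four kinds of lookaround can be sketched as follows, assuming Python's `re` module as the host. Python requires a lookbehind to have a fixed width, which the literal text used here satisfies. The "dollars" and "$5" samples are not from the article.

```python
import re

# Positive lookbehind: digits that have "Windows " directly to their left.
after_windows = re.search(r"(?<=Windows )\d+", "Windows 2003").group()
no_windows = re.search(r"(?<=Windows )\d+", "Office 2003")

# Negative lookahead: the match may not start where the next character is "1".
not_one = re.search(r"(?!1)\d+", "123").group()

# Positive lookahead: digits followed by " dollars", which is not consumed.
amount = re.search(r"\d+(?= dollars)", "pay 40 dollars").group()

# Negative lookbehind: digits that do not have "$" directly to their left.
no_dollar = re.findall(r"(?<!\$)\d+", "$5 and 7")

print(after_windows, no_windows, not_one, amount, no_dollar)
```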

Example:

When "(?<=Windows )\d+" is matched against "Windows 2003", the match succeeds and the result is "2003". "\d+" matches one or more digits, and "(?<=Windows )" is an additional condition requiring that the text to the left of the position be "Windows " (including the trailing space); what it matches is not included in the result. When the same expression is matched against "Office 2003", the match fails, because no run of digits has "Windows " to its left.

When "(?!1)\d+" is matched against "123", the match succeeds and the result is "23". "\d+" matches one or more digits, but the additional condition "(?!1)" requires that the character to the right of the position not be "1", so the match starts at the position before "2".

3.5 Match-first and ignore-first quantifiers

These are also known as the greedy and non-greedy modes of regular expression matching.

A subexpression modified by a standard quantifier will, whenever it could either match or not match, always try to match first. This is called match-first, or greedy, mode. The quantifiers introduced earlier, "{m}", "{m,n}", "{m,}", "?", "*", and "+", are all match-first.

Some NFA regex engines support ignore-first quantifiers, formed by appending "?" to a standard quantifier. With these, whenever the subexpression could either match or not match, the engine skips the match first, and only lets the subexpression match when that is necessary for the whole expression to succeed. This is called ignore-first, or non-greedy, mode. The ignore-first quantifiers are "{m}?", "{m,n}?", "{m,}?", "??", "*?", and "+?".

Example:

Source string: <div>aaa</div><div>bbb</div>

Regular expression 1: <div>.*</div>
Match result: <div>aaa</div><div>bbb</div>

Regular expression 2: <div>.*?</div>
Match result: <div>aaa</div>
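
The two results can be reproduced with a short sketch, assuming Python's `re` module as the host (its engine supports the non-greedy quantifiers).

```python
import re

source = "<div>aaa</div><div>bbb</div>"

# Greedy: ".*" takes as much as possible, then gives back just enough
# for the final "</div>" to match, so the result spans both elements.
greedy = re.search(r"<div>.*</div>", source).group()

# Non-greedy: ".*?" takes as little as possible, stopping at the first "</div>".
lazy = re.search(r"<div>.*?</div>", source).group()

# With the non-greedy form, each element can be extracted on its own.
contents = re.findall(r"<div>(.*?)</div>", source)

print(greedy)
print(lazy)
print(contents)
```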
