Regular Expression learning notes

Last Update:2018-12-08 Source: Internet

Author: User


A regular expression describes a string-matching pattern. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract substrings that meet certain conditions from a string.
The * in dir *.txt or ls *.txt is not a regular expression; its meaning there is different from the meaning of * in a regular expression.
For ease of understanding and memorization, these notes start with some concepts. All special characters and character combinations are collected in a summary table near the end, followed by some examples that illustrate the concepts.
Regular Expression
A regular expression is a text pattern consisting of ordinary characters (such as the letters a to z) and special characters (called metacharacters). The regular expression acts as a template that matches a character pattern against the searched string.
You construct a regular expression by placing the various components of the pattern between a pair of delimiters, that is, /expression/.
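The three uses named in the introduction (checking, replacing, and retrieving) can be sketched in JavaScript, whose regular expression literals use the same /expression/ form. This is only an illustrative sketch; the sample text and variable names are assumptions, not part of the original notes.

```javascript
// Hypothetical sample text, chosen only for illustration.
const text = "report.txt, notes.md, todo.txt";

// 1. Check whether the string contains a match.
const hasTxt = /txt/.test(text);

// 2. Replace every matched substring (the g flag means "all matches").
const replaced = text.replace(/txt/g, "text");

// 3. Retrieve the substrings that satisfy the pattern.
const names = text.match(/[a-z]+\.txt/g);

console.log(hasTxt);   // true
console.log(replaced); // report.text, notes.md, todo.text
console.log(names);    // [ 'report.txt', 'todo.txt' ]
```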
Ordinary characters
Ordinary characters are all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
Non-printable characters
Character  Meaning
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
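The equivalences in the table above can be checked with a short JavaScript sketch. The sample string is an assumption made up for this illustration. Note that in modern JavaScript engines \s also covers some additional Unicode whitespace beyond the characters listed here.

```javascript
// \cM, \x0d and \r all denote the same carriage-return character.
const cr = "\r";
const crChecks = [/\cM/.test(cr), /\x0d/.test(cr), /\r/.test(cr)];

// \t is the tab character, also written \x09 or \cI.
const tabChecks = [/\t/.test("\t"), /\x09/.test("\t"), /\cI/.test("\t")];

// \s matches whitespace; \S matches everything else.
const sample = "a \tb\nc"; // hypothetical sample: a, space, tab, b, newline, c
const whitespaceCount = sample.match(/\s/g).length; // space, tab, newline
const visible = sample.match(/\S/g).join("");

console.log(crChecks, tabChecks); // [ true, true, true ] [ true, true, true ]
console.log(whitespaceCount);     // 3
console.log(visible);             // abc
```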
Special characters
A special character is one that carries a special meaning, such as the * in "*.txt" mentioned above, which simply stands for any string. To find a file whose name contains a literal *, you must escape the *, that is, put a \ in front of it: ls \*.txt. Regular expressions have the following special characters.
Special character  Description
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline \n. To match ., use \.
[  Marks the start of a bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or marks a non-greedy quantifier. To match the ? character, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline. The sequence '\\' matches "\", while '\(' matches "(".
^  Matches the start of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{  Marks the start of a quantifier expression. To match {, use \{.
|  Indicates a choice between two items. To match |, use \|.
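As a rough illustration of escaping, the JavaScript sketch below matches a literal "*.txt" and a literal "$". The helper name escapeRegExp is hypothetical (it is not a built-in function); it simply puts a backslash in front of each special character.

```javascript
// Hypothetical helper: prefix every special character with a backslash so
// that the text is matched literally instead of being read as a pattern.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const escaped = escapeRegExp("*.txt");            // the characters \*\.txt
const literalStar = new RegExp("^" + escaped + "$");

const matchesLiteral = literalStar.test("*.txt"); // true: same literal text
const matchesOther = literalStar.test("a.txt");   // false: * is no wildcard here

// \$ matches a dollar sign instead of the end of the input.
const dollar = /\$\d+/.exec("price: $42")[0];     // "$42"

console.log(escaped, matchesLiteral, matchesOther, dollar);
```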
Regular expressions are constructed the same way arithmetic expressions are: small expressions are combined with a variety of metacharacters and operators to create larger ones. The components of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these components.
Quantifiers
A quantifier specifies how many times a given component of a regular expression must occur for a match to succeed. The quantifiers are *, +, ?, {n}, {n,}, and {n,m}.
The *, +, and ? quantifiers are greedy: they match as much text as possible. Appending a ? to any of them makes the match non-greedy, or minimal.
Regular expressions have the following quantifiers:
Character  Description
*  Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
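The quantifier examples from the table, together with the greedy and non-greedy behaviour described above, can be tried in JavaScript. This is a small sketch; the variable names are made up for this illustration.

```javascript
const star = "zoo".match(/zo*/)[0];      // "zoo": z followed by any number of o
const starZ = "z".match(/zo*/)[0];       // "z": zero o's are also allowed
const plusZ = /zo+/.test("z");           // false: + needs at least one o

// ? makes the preceding subexpression optional.
const optional = ["do", "does"].map((s) => /^do(es)?$/.test(s)); // [true, true]

const exactlyTwoBob = /o{2}/.test("Bob");            // false: only one o
const exactlyTwoFood = "food".match(/o{2}/)[0];      // "oo"
const atLeastTwo = "foooood".match(/o{2,}/)[0];      // "ooooo"
const oneToThree = "fooooood".match(/o{1,3}/)[0];    // "ooo"

// Greedy versus non-greedy: a trailing ? switches to minimal matching.
const greedy = "oooo".match(/o+/)[0];    // "oooo"
const lazy = "oooo".match(/o+?/)[0];     // "o"

console.log(star, starZ, plusZ, optional, atLeastTwo, oneToThree, greedy, lazy);
```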
Anchors
Anchors describe the boundary of a string or a word. ^ and $ indicate the start and end of the string respectively, \b describes the boundary before or after a word, and \B indicates a non-word boundary. Quantifiers cannot be applied to anchors.
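A short JavaScript sketch of the anchors follows. The sample strings are assumptions chosen for illustration; the m flag used in the last line is JavaScript's multiline switch.

```javascript
// ^ and $ tie the pattern to the start and end of the whole string.
const wholeNumber = /^\d+$/;
const allDigits = wholeNumber.test("2018");      // true
const notAllDigits = wholeNumber.test("2018-12"); // false: "-" is not a digit

// \b requires a word boundary; \B requires that there is none.
const erAtWordEnd = [/er\b/.test("never"), /er\b/.test("verb")]; // [true, false]
const erInsideWord = [/er\B/.test("verb"), /er\B/.test("never")]; // [true, false]

// With the multiline flag, ^ and $ also match at line breaks.
const lines = "a\nb".match(/^\w$/gm); // ["a", "b"]

console.log(allDigits, notAllDigits, erAtWordEnd, erInsideWord, lines);
```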
Alternation
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the related match is captured (cached). Placing ?: before the first alternative eliminates this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern inside the parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern inside the parentheses does not match.
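The difference between capturing parentheses, (?:...), and the two lookaheads can be seen in a small JavaScript sketch. The test strings are taken from the examples used elsewhere in these notes; the variable names are assumptions.

```javascript
// Plain parentheses capture the chosen alternative; (?:...) does not.
const capturing = /^(z|f)ood$/.exec("food");      // ["food", "f"]
const nonCapturing = /^(?:z|f)ood$/.exec("food"); // ["food"]

// Positive lookahead: "Windows " only when one of the versions follows.
const positive = /Windows (?=95|98|NT|2000)/;
const pos2000 = positive.test("Windows 2000"); // true
const pos31 = positive.test("Windows 3.1");    // false

// The lookahead text is checked but not consumed, so it is not part of the match.
const matchedText = positive.exec("Windows 2000")[0]; // "Windows "

// Negative lookahead: "Windows " only when none of the versions follows.
const negative = /Windows (?!95|98|NT|2000)/;
const neg31 = negative.test("Windows 3.1");    // true
const neg2000 = negative.test("Windows 2000"); // false

console.log(capturing.length, nonCapturing.length, matchedText);
```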
Backreferences
Placing parentheses around a regular expression or part of one causes the matching text to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered, from left to right, in the regular expression pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies a specific buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to skip saving the related match.
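As an illustration of backreferences, the following JavaScript sketch finds words that are repeated twice in a row and collapses them. The sample sentence is an assumption made up for this example.

```javascript
// ([a-z]+) stores a word in buffer 1; \1 demands the same text again.
const doubled = /\b([a-z]+) \1\b/gi;
const sentence = "Is is the cost of of gasoline going up up?"; // hypothetical text

const repeats = sentence.match(doubled);        // ["Is is", "of of", "up up"]

// In the replacement string, $1 refers to the same stored submatch.
const cleaned = sentence.replace(doubled, "$1"); // "Is the cost of gasoline going up?"

// A non-capturing group takes no buffer number, so \1 refers to (b) here.
const skipsNonCapturing = /^(?:a)(b)\1$/.test("abb"); // true

console.log(repeats, cleaned, skipsNonCapturing);
```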
Operator precedence
Operations with the same precedence are evaluated from left to right. Operations with different precedence are evaluated from highest to lowest. The precedence of the operators, from highest to lowest, is as follows:
Operator  Description
\  Escape
(), (?:), (?=), []  Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}  Quantifiers
^, $, \anymetacharacter  Anchors and sequences
|  Alternation ("or")
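The effect of this precedence order can be sketched in JavaScript. The patterns and test words below are assumptions picked to show how grouping changes the meaning.

```javascript
// | has the lowest precedence, so /^a|b$/ reads as (^a) or (b$).
const loose = /^a|b$/;
const looseResults = [loose.test("apple"), loose.test("crab")]; // [true, true]

// Grouping the alternatives first applies the anchors to both of them.
const grouped = /^(?:a|b)$/;
const groupedResults = [grouped.test("apple"), grouped.test("a"), grouped.test("b")];
// [false, true, true]

// Quantifiers bind tighter than sequences: in /ab+/ only b is repeated.
const onlyB = [/^ab+$/.test("abbb"), /^ab+$/.test("abab")];         // [true, false]
const wholeGroup = [/^(?:ab)+$/.test("abab"), /^(?:ab)+$/.test("abbb")]; // [true, false]

console.log(looseResults, groupedResults, onlyB, wholeGroup);
```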
Summary of all symbols
Character  Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\" and '\(' matches "(".
^  Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
?  When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. The non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much of the searched string as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)  Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not captured for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  A negated character set. Matches any character that is not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  A character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]  A negated character range. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx  Matches the control character specified by x. For example, \cM matches Control-M, the carriage return character. x must be one of A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed. Equivalent to \x0c and \cL.
\n  Matches a newline. Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A backreference to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml  Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
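A few rows of the summary table (character sets, the \d and \w classes, and the hexadecimal and Unicode escapes) can be checked with a JavaScript sketch such as the one below. The sample string "a1_b-2" is an assumption chosen for illustration.

```javascript
// Character sets and negated sets, using the "plain" example from the table.
const firstInSet = "plain".match(/[abc]/)[0];     // "a"
const firstNotInSet = "plain".match(/[^abc]/)[0]; // "p"

// \w, \d and \W on a hypothetical sample string.
const sampleText = "a1_b-2";
const wordChars = sampleText.match(/\w/g).join("");  // "a1_b2"
const digits = sampleText.match(/\d/g).join("");     // "12"
const nonWord = sampleText.match(/\W/g).join("");    // "-"

// \x41 is "A"; \x041 is the character \x04 followed by the digit 1.
const hexA = /\x41/.test("A");                    // true
const hexThenDigit = /^\x041$/.test("\x04" + "1"); // true

// \u00A9 is the copyright symbol.
const copyright = /\u00A9/.test("\u00A9 2018");   // true

// (.)\1 matches two consecutive identical characters.
const twins = [/^(.)\1$/.test("aa"), /^(.)\1$/.test("ab")]; // [true, false]

console.log(firstInSet, firstNotInSet, wordChars, digits, nonWord);
```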
Some examples
Regular expression  Description
/\b([a-z]+) \1\b/gi  Locates a word that occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/  Breaks a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Locates chapter headings
/[-a-z]/  The 26 letters a to z plus the hyphen (-)
/ter\b/  Matches "chapter", but not "terminal"
/\Bapt/  Matches "chapter", but not "aptitude"
/Windows(?=95|98|NT)/  Matches Windows95, Windows98, or WindowsNT. After a match is found, the search for the next match starts right after "Windows".
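Some of the examples above can be run as a JavaScript sketch. The URL and the test strings are hypothetical, and the patterns are written out in full here so the block is self-contained.

```javascript
// Break a URL into protocol, domain, port, and relative path.
const urlPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;
const url = "http://www.example.com:80/docs/page1.htm"; // hypothetical URL
const [, protocol, domain, port, path] = urlPattern.exec(url);

// Chapter or section headings numbered 1 to 99.
const heading = /^(?:Chapter|Section) [1-9][0-9]{0,1}$/;
const headingResults = [
  heading.test("Chapter 12"),  // true
  heading.test("Chapter 0"),   // false: the first digit must be 1-9
  heading.test("Section 100"), // false: at most two digits
];

// Word-boundary examples.
const terResults = [/ter\b/.test("chapter"), /ter\b/.test("terminal")]; // [true, false]
const aptResults = [/\Bapt/.test("chapter"), /\Bapt/.test("aptitude")]; // [true, false]

console.log(protocol, domain, port, path);
// http www.example.com :80 /docs/page1.htm
```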

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
