Getting started with regular expressions

Last Update:2014-03-24 Source: Internet

Author: User

Tags character classes expression engine


First, let's look at a concept:
Word: in a regular expression, a "word" means one or more consecutive \w characters.

Common metacharacters:

\b	Matches the start or end of a word, that is, a word boundary. Although English words are generally separated by spaces, punctuation marks, or line breaks, \b does not match any of these delimiter characters; it matches only a position. More precisely, \b matches a position where the character before and the character after are not both \w (one is, and the other is not or does not exist).
.	Matches any character except a line break.
*	A quantifier rather than a character. It specifies that whatever precedes * may be repeated any number of times for the whole expression to match. Combined as .*, it therefore means any number of characters that are not line breaks.
\n	The line break character '\n', ASCII code 10 (hexadecimal 0x0A).
\d	Matches a digit (0, or 1, or 2, ...). To avoid repeating \d, write \d{2}: the {2} (or {8}) after \d means that the preceding \d must be repeated exactly twice (or eight times).
\s	Matches any whitespace character, including spaces, tabs, line breaks, and Chinese full-width spaces.
\w	Matches letters, digits, underscores, and, in Unicode-aware engines, Chinese characters.
^	Matches the start of the string.
$	Matches the end of the string.
()	Group, described later.

If you want to find the metacharacters themselves, such as . or *, a problem arises: you cannot write them directly, because they would be interpreted with their special meaning. In this case, use \ to cancel the special meaning of the character: write \. and \*. To find \ itself, likewise write \\.

Repetition counts:

*	Repeat zero or more times
?	Repeat zero or one time
{n}	Repeat exactly n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times
+	Repeat one or more times

Ex: Windows\d+ matches Windows followed by one or more digits.
Ex: ^\w+ matches the first word of a line (or the first word of the entire string, depending on the options in effect).
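
The two examples above can be tried out with a small Java sketch. The class name QuantifierDemo, the helper findAll, and the sample strings are made up for illustration; only java.util.regex is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "Windows" followed by one or more digits; the bare "Windows" is skipped.
        System.out.println(findAll("Windows\\d+", "Windows98, Windows2000, Windows"));
        // ^ anchors the match at the start, so only the first word is found.
        System.out.println(findAll("^\\w+", "hello regex world"));
    }
}
```

Note that in Java source code every backslash of the expression has to be doubled inside the string literal.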

Character classes:

[aeiou] matches any English vowel; [.?!] matches a punctuation mark (. or ? or !).
[0-9] means the same as \d: a single digit. Likewise, [a-z0-9A-Z_] is equivalent to \w (if only English is considered).

Ex: \(?0\d{2}[) -]?\d{8}. "(" and ")" are also metacharacters, so they must be escaped here.
This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. First comes an escaped \(, which may appear 0 or 1 times (?); then a 0 followed by two digits (\d{2}); then one of ), -, or a space, which may appear once or not at all (?); and finally eight digits (\d{8}).
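
A minimal Java sketch of the phone number expression follows. PhoneDemo and isPhone are hypothetical names, and matches() is used so that the whole input has to fit the pattern.

```java
import java.util.regex.Pattern;

class PhoneDemo {
    // \(?0\d{2}[) -]?\d{8} written as a Java string literal.
    static final Pattern PHONE = Pattern.compile("\\(?0\\d{2}[) -]?\\d{8}");

    // True when the entire string fits the pattern.
    static boolean isPhone(String s) {
        return PHONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"(010)88886666", "022-22334455", "02912345678", "12345678"};
        for (String s : samples) {
            System.out.println(s + " -> " + isPhone(s));
        }
    }
}
```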

Branch conditions:
In fact, if you look carefully, the previous expression also matches incorrectly formatted numbers such as 010)12345678 or (022-87654321. Branch conditions can solve this problem.
A branch condition specifies several rules; if any one of them is satisfied, it counts as a match. The rules are separated with |.

Ex: 0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a three-digit area code with an eight-digit local number (for example, 010-12345678), or a four-digit area code with a seven-digit local number (for example, 0376-2233445).
Ex: \(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a three-digit area code. The area code may or may not be enclosed in parentheses, and it may be separated from the local number by a hyphen, a space, or nothing. Note: the parts before and after | are each one branch.
Ex: \d{5}-\d{4}|\d{5} matches United States ZIP codes, which are either five digits or nine digits with a hyphen after the fifth. This example illustrates a pitfall: when using branch conditions, pay attention to the order of the branches. If you change it to \d{5}|\d{5}-\d{4}, it will only match 5-digit ZIP codes (and the first five digits of 9-digit ZIP codes). The reason is that branches are tested from left to right, and once one branch is satisfied, the others are not tried.
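
The effect of branch order can be seen in a short Java sketch. AlternationDemo, firstMatch, and fullMatch are illustrative names, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Returns the first match of regex in input, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // True when the entire input fits regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String zip = "12345-6789";
        // Longer branch first: the 9-digit form is found.
        System.out.println(firstMatch("\\d{5}-\\d{4}|\\d{5}", zip));
        // Shorter branch first: it wins and the rest is never tried.
        System.out.println(firstMatch("\\d{5}|\\d{5}-\\d{4}", zip));

        // The two-branch phone expression rejects the unbalanced parenthesis.
        String phone = "\\(0\\d{2}\\)[- ]?\\d{8}|0\\d{2}[- ]?\\d{8}";
        System.out.println(fullMatch(phone, "(010)12345678"));
        System.out.println(fullMatch(phone, "010)12345678"));
    }
}
```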

Grouping:
We have already seen how to repeat a single character (simply add a quantifier after it), but what if you want to repeat several characters? In a regular expression, parentheses mark a subexpression (also called a group). You can then specify a repetition count for the subexpression or perform other operations on it.

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order:
\d{1,3} matches one to three digits; (\d{1,3}\.){3} matches one to three digits followed by a period (the group is treated as a whole), repeated three times; finally come one to three more digits (\d{1,3}). Unfortunately, no number in an IP address can exceed 255, yet this expression also matches 256.300.888.999, which is an impossible IP address. If arithmetic comparison were available, the problem would be easy to solve, but regular expressions provide no arithmetic, so the only option is a lengthier combination of groups, alternatives, and character classes to describe a correct IP address:
^((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)
The key to understanding this expression is 2[0-4]\d|25[0-5]|[01]?\d\d?. Simply put, it splits the value of each number into three ranges: 200-249 | 250-255 | 0-199.
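
The difference between the simple and the range-checked expression can be sketched in Java as follows. IpDemo and its two methods are hypothetical names; matches() is used so the entire string must be an address.

```java
import java.util.regex.Pattern;

class IpDemo {
    // One number of the address: 200-249, 250-255, or 0-199.
    static final String OCTET = "(2[0-4]\\d|25[0-5]|[01]?\\d\\d?)";

    static final Pattern LOOSE = Pattern.compile("(\\d{1,3}\\.){3}\\d{1,3}");
    static final Pattern STRICT = Pattern.compile("(" + OCTET + "\\.){3}" + OCTET);

    static boolean looksLikeIp(String s) {
        return LOOSE.matcher(s).matches();
    }

    static boolean isValidIp(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"192.168.1.1", "255.255.255.255", "256.300.888.999"}) {
            System.out.println(s + " loose=" + looksLikeIp(s) + " strict=" + isValidIp(s));
        }
    }
}
```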

Negation:
Sometimes you need to find characters that do not belong to an easily defined character class. For example, to find any character other than a digit, you need negation.

\W	Matches any character that is not a letter, digit, underscore, or Chinese character
\S	Matches any character that is not whitespace
\D	Matches any character that is not a digit
\B	Matches a position that is not the start or end of a word
[^x]	Matches any character except x
[^aeiou]	Matches any character except a, e, i, o, u

Ex: \S+ matches a string that contains no whitespace characters.
Ex: <a[^>]+> matches a string that starts with a and is enclosed in angle brackets.
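
A small Java sketch of the negated forms; NegationDemo, findAll, and the sample inputs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegationDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Runs of non-whitespace characters.
        System.out.println(findAll("\\S+", "no blanks here"));
        // An opening <a ...> tag: everything up to the first '>'.
        System.out.println(findAll("<a[^>]+>", "<a href=\"x.html\">link</a>"));
        // Single characters that are not vowels.
        System.out.println(findAll("[^aeiou]", "regex"));
        // Runs of non-digits.
        System.out.println(findAll("\\D+", "a1b22c"));
    }
}
```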

Backreferences:
Once a subexpression is specified with parentheses, the text matching it (that is, the content captured by the group) can be processed further in the expression or by other programs. By default, each group automatically gets a group number. The rule: going from left to right, groups are numbered by the position of their left parenthesis, so the first group is 1, the second is 2, and so on.

Group 0 corresponds to the entire regular expression.
In .NET, group numbers are in fact assigned in two left-to-right scans: the first pass numbers only the unnamed groups, and the second pass numbers only the named groups. Therefore, all named groups have higher numbers than unnamed groups.
You can use (?:exp) to exclude a group from group numbering.

A backreference searches again for the text matched by an earlier group. For example, \1 stands for the text matched by group 1.
Ex: \b(\w+)\b\s+\1\b matches repeated words, such as go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between a word start and a word end (\b(\w+)\b); this word is captured into group 1. Next come one or more whitespace characters (\s+), and finally the content captured in group 1, that is, the word just matched (\1).
You can also name a subexpression yourself. The syntax is (?<Word>\w+) (or with single quotes instead of angle brackets: (?'Word'\w+)), which names the group for \w+ as Word. To backreference the content captured by this group, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
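
The repeated-word example can be sketched in Java with both the numbered and the named backreference. BackrefDemo and findAll are illustrative names; the sketch uses only the angle-bracket form of named groups.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "go go kitty kitty now";
        // Numbered backreference: \1 repeats whatever group 1 captured.
        System.out.println(findAll("\\b(\\w+)\\b\\s+\\1\\b", text));
        // Named backreference: \k<Word> repeats the group named Word.
        System.out.println(findAll("\\b(?<Word>\\w+)\\b\\s+\\k<Word>\\b", text));
    }
}
```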

Common grouping syntax:

	(exp)	Matches exp and captures the text into an automatically numbered group
	(?<name>exp)	Matches exp and captures the text into a group named name; can also be written (?'name'exp)
	(?:exp)	Matches exp without capturing the matched text or assigning a group number
	(?=exp)	Matches a position that is followed by exp
	(?<=exp)	Matches a position that is preceded by exp
	(?!exp)	Matches a position that is not followed by exp
	(?<!exp)	Matches a position that is not preceded by exp
	(?#comment)	Has no effect on how the expression is processed; it provides a comment for human readers

Zero-width assertions:
The four entries after (?:exp) in the table in the previous section are used to find things before or after some content (but not including that content). Like \b, ^, and $, they specify a position that must satisfy a condition (the assertion), so they are called zero-width assertions. An assertion declares a fact that must be true; in a regular expression, matching continues only when the assertion holds. This section describes the first two of them.
Ex:

(?=exp), also called a zero-width positive lookahead assertion, asserts that the text after its position matches exp. For example, \b\w+(?=ing\b) matches the front part of words ending in ing (everything except the ing); searching I'm singing while you're dancing. finds sing and danc.
(?<=exp), also called a zero-width positive lookbehind assertion, asserts that the text before its position matches exp. For example, (?<=\bre)\w+\b matches the second half of words starting with re (everything except the re); searching reading a book finds ading.

To add a comma after every three digits of a long number (counting from the right, of course), you can search for the parts that need a comma in front of them with ((?<=\d)\d{3})+\b. Searching 1234567890 with it finds 234567890.
The following example uses both assertions: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace (again, not including the whitespace itself).
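
The lookahead and lookbehind examples can be checked with a short Java sketch. LookaroundDemo and findAll are hypothetical names, and the last sample string is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Lookahead: word stems that are followed by "ing".
        System.out.println(findAll("\\b\\w+(?=ing\\b)", "I'm singing while you're dancing."));
        // Lookbehind: the rest of a word that starts with "re".
        System.out.println(findAll("(?<=\\bre)\\w+\\b", "reading a book"));
        // Digit groups of three that have at least one digit in front of them.
        System.out.println(findAll("((?<=\\d)\\d{3})+\\b", "1234567890"));
        // Both assertions: digits with whitespace on either side.
        System.out.println(findAll("(?<=\\s)\\d+(?=\\s)", "a 12 34 b"));
    }
}
```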

Negative zero-width assertions:
We saw earlier how to find a character that is not a given character or not in a given character class (negation). But what if we only want to make sure a character does not appear, without matching anything in its place? For example, to find a word that contains the letter q where the q is not followed by the letter u, we might try this:
\b\w*q[^u]\w*\b matches a word containing a q followed by something other than u. But with more testing (or a sharp eye) you will find that when q is at the end of a word, as in Iraq or Benq, this expression goes wrong. The reason is that [^u] always matches one character, so if q is the last character of a word, [^u] matches the word separator after it (a space, a period, or something else), and the following \w*\b then matches the next word. As a result, \b\w*q[^u]\w*\b matches the whole of Iraq fighting. A negative zero-width assertion solves this problem because it only matches a position and consumes no characters.
Now we can solve the problem with this expression: \b\w*q(?!u)\w*\b.
The zero-width negative lookahead assertion (?!exp) asserts that the position is not followed by exp. Examples: \d{3}(?!\d) matches three digits that are not followed by another digit; \b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.
Similarly, the zero-width negative lookbehind assertion (?<!exp) asserts that the text before the position does not match exp: (?<![a-z])\d{7} matches seven digits that are not preceded by a lowercase letter.
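
A Java sketch contrasting the character class [^u] with the negative assertions; NegativeLookaroundDemo, findAll, and the numeric sample strings are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegativeLookaroundDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // [^u] consumes the space, so the match runs into the next word.
        System.out.println(findAll("\\b\\w*q[^u]\\w*\\b", "Iraq fighting"));
        // (?!u) only checks the position, so the match stops at the word end.
        System.out.println(findAll("\\b\\w*q(?!u)\\w*\\b", "Iraq fighting"));
        // Three digits not followed by another digit.
        System.out.println(findAll("\\d{3}(?!\\d)", "1234 567"));
        // Seven digits not preceded by a lowercase letter.
        System.out.println(findAll("(?<![a-z])\\d{7}", "a1234567 7654321"));
    }
}
```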
A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag without attributes. (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets (for example, <b>); then comes .* (any string), and finally the suffix (?=<\/\1>). Note the \/ in the suffix, which uses the character escaping mentioned earlier; \1 is a backreference to the first captured group, the (\w+) above, so if the prefix is <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (once again, excluding the prefix and suffix themselves).

Note: when I used the expression above in Java, it raised an error (java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 10). I have not tried whether a .NET environment reports an error. The expression that worked for me in Java is (?<=<(\w{1,15})>).*(?=<\/\1>). The difference is that the quantifier after \w is changed from + to {1,15}. I do not know exactly why this change is required; going by the error message (the group has no obvious maximum length), I tried capping the tag name inside <> at 15 characters. Suggestions are welcome.
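
The bounded form quoted in the note can be tried with the following Java sketch. TagContentDemo and content are hypothetical names, and the 15-character cap on the tag name is the author's own choice rather than a fixed rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContentDemo {
    // The lookbehind has a bounded length: '<', 1 to 15 word characters, '>'.
    static final Pattern TAG_CONTENT =
            Pattern.compile("(?<=<(\\w{1,15})>).*(?=<\\/\\1>)");

    // Returns the text between <tag> and the matching </tag>, or null if none is found.
    static String content(String html) {
        Matcher m = TAG_CONTENT.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(content("<b>bold text</b>"));
        System.out.println(content("<em>hi</em>"));
        // The closing tag does not repeat the captured name, so nothing matches.
        System.out.println(content("<b>oops</i>"));
    }
}
```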

Comments:
Another use of parentheses is to include comments with the syntax (?#comment). Example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199).
If you include comments, it is best to enable the "ignore pattern whitespace" option. With it, you can freely add spaces, tabs, and line breaks when writing an expression, and they are ignored when the expression is used. Once this option is enabled, everything from a # to the end of the line is also ignored as a comment.
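
In Java, the whitespace-ignoring option is the Pattern.COMMENTS flag. The sketch below writes the same expression across several lines using the # end-of-line form of comments; CommentsDemo and isOctet are illustrative names.

```java
import java.util.regex.Pattern;

class CommentsDemo {
    // With Pattern.COMMENTS, whitespace in the pattern is ignored and
    // everything from '#' to the end of the line is treated as a comment.
    static final Pattern OCTET = Pattern.compile(
            "2[0-4]\\d        # 200-249\n"
          + "| 25[0-5]        # 250-255\n"
          + "| [01]?\\d\\d?   # 0-199\n",
            Pattern.COMMENTS);

    // True when the entire string is a number from 0 to 255.
    static boolean isOctet(String s) {
        return OCTET.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "25", "199", "249", "255", "256"}) {
            System.out.println(s + " -> " + isOctet(s));
        }
    }
}
```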

Greedy and lazy matching:
When a regular expression contains a quantifier that allows repetition, the usual behavior is to match as many characters as possible (provided the whole expression can still match). Take a.*b as an example: it matches the longest string that starts with a and ends with b. Searching aabab with it matches the entire string aabab. This is called greedy matching.
Sometimes we need lazy matching instead, that is, matching as few characters as possible. Any of the quantifiers given above can be turned into a lazy one by appending a question mark (?). Thus .*? means: match any number of repetitions, but use the fewest repetitions that still let the whole match succeed.
Ex: a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (characters 1 to 3) and ab (characters 4 to 5).
Why is the first match aab (characters 1 to 3) rather than ab (characters 2 to 3)? Simply put, regular expressions have another rule that takes priority over greedy/lazy: the match that begins earliest wins.

*?	Repeat any number of times, but as few as possible
+?	Repeat one or more times, but as few as possible
??	Repeat zero or one time, but as few as possible
{n,m}?	Repeat n to m times, but as few as possible
{n,}?	Repeat n or more times, but as few as possible
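
The aabab example can be reproduced with a few lines of Java. GreedyLazyDemo and findAll are made-up names for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyLazyDemo {
    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Greedy: .* takes as much as it can, so the whole string matches.
        System.out.println(findAll("a.*b", "aabab"));
        // Lazy: .*? takes as little as it can, but the earliest start still wins.
        System.out.println(findAll("a.*?b", "aabab"));
    }
}
```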

Processing options:
In Java, you can set the processing options of a regular expression by passing flags to the static method Pattern.compile(String regex, int flags).

IgnoreCase (case insensitive)	Matching is not case sensitive.
Multiline (multi-line mode)	Changes the meaning of ^ and $ so that they match at the start and end of every line, not just the start and end of the whole string. (In this mode, the exact meaning of $ is: match the position before \n as well as the position at the end of the string.)
Singleline (single-line mode)	Changes the meaning of . so that it matches every character (including the line break \n).
IgnorePatternWhitespace (ignore whitespace)	Ignores unescaped whitespace in the expression and enables comments marked with #.
ExplicitCapture (explicit capture)	Captures only groups that are explicitly named.

A frequently asked question: can multi-line mode and single-line mode only be used one at a time? The answer is no. The two options are unrelated.
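
The option names in the table are not the names of the Java constants. The sketch below assumes the usual Java counterparts Pattern.CASE_INSENSITIVE, Pattern.MULTILINE, and Pattern.DOTALL (single-line mode), and shows that the last two can be combined. FlagsDemo, findAll, and the sample text are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagsDemo {
    // Collects every match of regex in input, compiled with the given flags.
    static List<String> findAll(String regex, int flags, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "first line\nsecond line";
        // Without MULTILINE, ^ matches only at the start of the string.
        System.out.println(findAll("^\\w+", 0, text));
        // With MULTILINE, ^ also matches after each line break.
        System.out.println(findAll("^\\w+", Pattern.MULTILINE, text));
        // With DOTALL, '.' also matches the line break.
        System.out.println(findAll("line.second", Pattern.DOTALL, text));
        // Both options together: $ matches before \n and '.' consumes it.
        System.out.println(findAll("line$.", Pattern.MULTILINE | Pattern.DOTALL, text).size());
        // Case-insensitive matching.
        System.out.println(findAll("windows\\d+", Pattern.CASE_INSENSITIVE, "Windows98"));
    }
}
```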

Others:

Balancing group/recursive match:

The balancing group syntax described here is supported by the .NET Framework; other engines may not support it (in Java, for example, an error is reported).
Sometimes we need to match a nestable hierarchical structure such as (100*(50+15)). Simply using \(.+\) only matches the content between the leftmost left parenthesis and the rightmost right parenthesis (we are discussing greedy mode here; lazy mode has the same problem). If the left and right parentheses in the original string do not occur the same number of times, as in (5/(3+2))), the parentheses in our match will not be balanced either. Is there a way to match the longest properly paired parenthesized content in such a string?
To avoid confusing ( with \(, we use angle brackets instead of parentheses for now. The question becomes: how do we capture the content inside the longest properly paired angle brackets in a string like xx <aa <bbb> aa> yy?
The following syntax constructs are needed:

(?'group') names the captured content group and pushes it onto the stack.
(?'-group') pops the most recently pushed capture named group from the stack; if the stack is already empty, this group fails to match.
(?(group)yes|no) continues matching with the yes part if a capture named group exists on the stack; otherwise it continues with the no part.
(?!) is a zero-width negative lookahead assertion; since it has no subexpression, any attempt to match it always fails.

If you are not familiar with stacks, you can think of the first three constructs this way: the first writes a "group" on the blackboard, the second erases a "group" from the blackboard, and the third checks whether a "group" is still written on the blackboard, continuing with the yes part if so and with the no part otherwise.
What we need to do is push an "Open" each time we meet a left bracket and pop one each time we meet a right bracket, and at the end check whether the stack is empty. If it is not empty, there are more left brackets than right brackets, so the match should fail. The regular expression engine then backtracks (giving up some characters at the start or end) and tries to match the whole expression again.

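<                     # Left bracket of the outermost layer
  [^<>]*              # Content after the outermost left bracket that is not a bracket
  (
    (
      (?'Open'<)      # Met a left bracket: write an "Open" on the blackboard
      [^<>]*          # Content after the left bracket that is not a bracket
    )+
    (
      (?'-Open'>)     # Met a right bracket: erase an "Open"
      [^<>]*          # Content after the right bracket that is not a bracket
    )+
  )*
  (?(Open)(?!))       # Before the outermost right bracket: if an "Open" is still left, fail
>                     # Right bracket of the outermost layer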
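
Since java.util.regex has no balancing groups, the same idea has to be written differently in Java. The sketch below is not the regular expression above; it swaps in a plain loop with a depth counter that plays the role of the "Open" stack. NestedBracketsDemo and firstBalanced are hypothetical names, and the behavior on unbalanced input (retry from the next left bracket) is a simplification of the engine's backtracking.

```java
class NestedBracketsDemo {
    // Returns the first balanced <...> span in s, or null if there is none.
    static String firstBalanced(String s) {
        // Try each '<' in turn as the outermost left bracket.
        for (int start = s.indexOf('<'); start >= 0; start = s.indexOf('<', start + 1)) {
            int depth = 0; // number of "Open" marks currently on the blackboard
            for (int i = start; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '<') {
                    depth++;            // write an "Open"
                } else if (c == '>') {
                    depth--;            // erase an "Open"
                    if (depth == 0) {
                        return s.substring(start, i + 1);
                    }
                }
            }
            // Ran out of input with brackets still open: retry from the next '<'.
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstBalanced("xx <aa <bbb> aa> yy"));
        System.out.println(firstBalanced("<<a>"));
        System.out.println(firstBalanced("no brackets"));
    }
}
```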

Note: this article comes from the internet; I have only condensed it.

Some commonly used regular expressions (to be updated over time):

Match target	Regular expression
URL	(http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?
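
A minimal Java sketch of the URL expression; UrlDemo, isUrl, and the sample URLs are made up for illustration, and the expression is used exactly as listed in the table rather than as a complete URL validator.

```java
import java.util.regex.Pattern;

class UrlDemo {
    // Scheme, host with at least one dot, then an optional path/query part.
    static final Pattern URL = Pattern.compile(
            "(http|ftp|https):\\/\\/[\\w\\-_]+(\\.[\\w\\-_]+)+"
          + "([\\w\\-\\.,@?^=%&:/~\\+#]*[\\w\\-\\@?^=%&/~\\+#])?");

    // True when the entire string fits the pattern.
    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com",
            "https://www.example.com/help?id=1",
            "ftp://files.example.org/pub/file.txt",
            "example.com",
            "http://localhost"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + isUrl(s));
        }
    }
}
```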

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
