Regular-expression Learning

Last Update: 2015-07-20 Source: Internet

Author: User

Recently I have been working with web servers such as Nginx and Node and writing all kinds of route-mapping rules, so it seemed necessary to study regular expressions systematically. I apparently covered them in a theory of computation class, but it did me little good and I have long since handed it all back to the teacher. Here is an excellent tutorial, reproduced below.

Featured: Regular Expressions 30-Minute Introductory Tutorial (http://www.oschina.net/question/12_9507)

What exactly is a regular expression?

Characters are the most basic unit a computer program handles when processing text; a character may be a letter, a digit, a punctuation mark, a space, a line break, a Chinese character, and so on. A string is a sequence of zero or more characters. Text simply means words, that is, strings. Saying that a string matches a regular expression usually means that some part (or several parts) of the string satisfies the conditions the expression specifies.

When writing a program or web page that handles strings, you often need to find strings that follow certain complex rules. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule.

You have probably used the wildcards for finding files under Windows/DOS, namely * and ?. If you want to find all the Word documents in a directory, you search for *.doc. Here, * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for matching text, but they can describe what you need far more precisely than wildcards can; the price, of course, is greater complexity.

Getting started

Suppose you are looking for hi in an English novel; you can use the regular expression hi.

This is almost the simplest regular expression there is. It matches exactly this string: two characters, the first being h and the second i. Tools that process regular expressions usually offer an option to ignore case; if it is selected, the expression matches any of the four variants hi, HI, Hi, and hI.

Unfortunately, many words contain the two consecutive characters hi, such as him, history, and high. A search for hi will find the hi inside those words too. To find exactly the word hi, use \bhi\b. \b is a special code defined by regular expressions (well, some people call it a metacharacter) that stands for the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation marks, or line breaks, \b does not match any of these separator characters; it matches only a position. More precisely, \b matches a position where the preceding character and the following character are not both \w (one is and the other is not, or does not exist).

Suppose what you are looking for is hi followed, somewhere later, by Lucy; you can use \bhi\b.*\bLucy\b. Here, . is another metacharacter that matches any character except a line break. * is also a metacharacter, but it stands for neither a character nor a position; it stands for a quantity, specifying that whatever precedes it may be repeated any number of times for the whole expression to match. Therefore .* together means any number of characters that do not include a newline. Now the meaning of \bhi\b.*\bLucy\b is obvious: first the word hi, then any number of arbitrary characters (but no newline), and finally the word Lucy.
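As a quick illustration of the difference between hi and \bhi\b, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for the demonstration, and other regex tools may expose these features through different function names.

```python
import re

# Hypothetical sample text, chosen so that "hi" also hides inside other words.
text = "hi, this is history. Oh hi there, Lucy!"

# Plain "hi" also matches inside "this" and "history".
plain = re.findall(r"hi", text)

# \bhi\b matches only the standalone word.
whole_words = re.findall(r"\bhi\b", text)

# The word hi, then any characters except a newline, then the word Lucy.
greeting = re.search(r"\bhi\b.*\bLucy\b", text)

print(len(plain), len(whole_words))
print(greeting.group(0))
```

The raw-string prefix r"..." keeps Python from interpreting \b as a backspace character before the regex engine sees it.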

The newline character is '\n', the character with ASCII code 10 (hexadecimal 0x0A).

Combining other metacharacters lets us build more powerful regular expressions. For example:

0\d\d-\d\d\d\d\d\d\d\d matches a string like this: it starts with 0, followed by two digits, then a hyphen "-", and finally 8 digits (that is, a Chinese telephone number; of course, this example only matches numbers whose area code has 3 digits).

The \d here is a new metacharacter that matches one digit (0, or 1, or 2, or ...). The - is not a metacharacter; it matches only itself, the hyphen (or minus sign, or dash, or whatever you call it). To avoid so much annoying repetition, we can also write this expression as 0\d{2}-\d{8}. The {2} ({8}) after \d means that the preceding \d must match exactly 2 times (8 times) in a row.
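The two spellings of the phone-number expression should behave identically. The following sketch, again assuming Python's re module and a few made-up sample numbers, checks that.

```python
import re

# The long and the abbreviated form of the same expression.
long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

# Hypothetical inputs: one valid, one too short, one not starting with 0.
samples = ["010-12345678", "010-1234567", "110-12345678"]

# fullmatch requires the whole string to match the pattern.
results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
```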

Metacharacters

Table 1. Commonly used meta-characters
Code	Description
.	Match any character except a line break
\w	Match a letter, digit, underscore, or Chinese character
\s	Match any whitespace character
\d	Match a digit
\b	Match the beginning or end of a word
^	Match the start of a string
$	Match the end of a string

The metacharacters ^ (the symbol on the same key as the digit 6) and $ both match a position, which is somewhat similar to \b. ^ matches the beginning of the string you are searching, and $ matches the end. These two codes are very useful when validating input. For example, if a website requires you to enter a QQ number of 5 to 12 digits, you can use ^\d{5,12}$. The {5,12} here is similar to the {2} introduced earlier, except that {2} means repeat exactly 2 times, no more and no less, while {5,12} means repeat at least 5 times and at most 12 times; otherwise there is no match.
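To see why the anchors matter for validation, here is a small sketch in Python. The helper name is_valid_qq and the sample inputs are invented for this illustration.

```python
import re

QQ_PATTERN = re.compile(r"^\d{5,12}$")


def is_valid_qq(value: str) -> bool:
    """Return True if value is made up of 5 to 12 digits and nothing else."""
    return QQ_PATTERN.search(value) is not None


# Without ^ and $, the pattern happily matches digits in the middle of a string.
unanchored = re.search(r"\d{5,12}", "abc1234567xyz")

for candidate in ["12345", "1234", "1234567890123", "12345abc"]:
    print(candidate, is_valid_qq(candidate))
print(unanchored.group(0))
```

One Python-specific caveat: $ also matches just before a trailing newline, so "12345\n" would pass this check. When that matters, re.fullmatch(r"\d{5,12}", value) is the stricter choice.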

Character escapes

If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot specify them directly, because they would be interpreted as something else. In that case you have to use \ to cancel the special meaning of these characters. So you should use \. and \*. Of course, to find \ itself, you have to use \\.

For example: deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
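The effect of escaping can be checked with a short Python sketch. The string "deerchaoXnet" is a made-up counterexample, and re.escape is a helper from Python's standard library that adds the backslashes for you; other regex tools may or may not offer an equivalent.

```python
import re

# An unescaped dot matches any character, so this also accepts "deerchaoXnet".
loose = re.compile(r"deerchao.net")
# An escaped dot matches only a literal ".".
strict = re.compile(r"deerchao\.net")

loose_hit = bool(loose.fullmatch("deerchaoXnet"))
strict_hit = bool(strict.fullmatch("deerchaoXnet"))
strict_dot = bool(strict.fullmatch("deerchao.net"))

# A literal backslash is written \\ in the pattern.
# The Python literal "C:\\Windows" is the 10-character string C:\Windows.
path_hit = bool(re.fullmatch(r"C:\\Windows", "C:\\Windows"))

# re.escape builds a pattern that matches its argument literally.
escaped = re.escape("a.b*c")
literal_hit = bool(re.fullmatch(escaped, "a.b*c"))
literal_miss = bool(re.fullmatch(escaped, "aXbc"))

print(loose_hit, strict_hit, strict_dot, path_hit, literal_hit, literal_miss)
```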

Repetition

You have already seen *, +, {2}, and {5,12}, which all match repetitions. The following are all the quantifiers (codes that specify a count) in regular expressions:

Table 2. Commonly used quantifiers
Code	Description
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 time
{n}	Repeat n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times

Here are some examples of repetition:

Windows\d+ matches Windows followed by 1 or more digits

^\w+ matches the first word of a line (or the first word of the entire string; which one depends on the option settings)
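The examples above can be tried out with a few lines of Python. In this sketch, the "option setting" that decides whether ^ means start of line or start of string is Python's re.MULTILINE flag; the sample strings are made up.

```python
import re

text = "Windows98 and Windows2000, but not Windows itself"
versions = re.findall(r"Windows\d+", text)

lines = "first line here\nsecond line"
# By default ^ matches only at the start of the whole string.
first_of_string = re.findall(r"^\w+", lines)
# With re.MULTILINE, ^ also matches at the start of every line.
first_of_lines = re.findall(r"^\w+", lines, re.MULTILINE)

# ? makes the preceding item optional (0 or 1 time).
spellings = re.findall(r"colou?r", "color colour")

print(versions)
print(first_of_string, first_of_lines)
print(spellings)
```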

Character class

Finding digits, letters or digits, or whitespace is easy because there are already metacharacters for those character sets; but what if you want to match a character set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?

It's simple: just list them in square brackets. For example, [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).

We can also easily specify a character range: [0-9] means exactly the same as \d, one digit; likewise, [a-z0-9A-Z_] is exactly equivalent to \w (if only English is considered).

The following is a more complex expression: \(?0\d{2}[) -]?\d{8}.

"(" and ")" are also metacharacters, as the Grouping section explains later, so they need to be escaped here.

This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first comes an escaped character \(, which may appear 0 or 1 times (?), then a 0, followed by 2 digits (\d{2}), then one of ), -, or a space, which appears once or not at all (?), and finally 8 digits (\d{8}).
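A short Python sketch of character classes, using the vowel class and the phone-number expression discussed above. The last sample input is a made-up number that is one digit too short.

```python
import re

# A character class matches exactly one character from the listed set.
vowels = re.findall(r"[aeiou]", "regular")

# Optional "(", then 0 and two digits, an optional ")", "-" or space, then 8 digits.
phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")

samples = ["(010)88886666", "022-22334455", "02912345678", "0101234567"]
matches = {s: bool(phone.fullmatch(s)) for s in samples}

print(vowels)
for sample, ok in matches.items():
    print(sample, ok)
```

Inside the brackets, the hyphen is placed last so that it is read as a literal character rather than as a range operator.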

Branching conditions

Unfortunately, the expression above also matches "incorrectly" formatted numbers such as 010)12345678 or (022-87654321. To solve this, we need branching conditions (alternation). A branching condition in regular expressions means there are several rules, and a match on any one of them counts as a match; the rules are separated by |. Confused? No problem, look at the examples:

0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and local number may be separated by a hyphen or a space, or not separated at all. You can try using branching conditions to extend this expression to support 4-digit area codes as well.

\d{5}-\d{4}|\d{5} is used to match U.S. ZIP codes. A U.S. ZIP code is either 5 digits, or 9 digits with a hyphen after the fifth. This example is here because it illustrates a point: when using branching conditions, pay attention to the order of the conditions. If you change it to \d{5}|\d{5}-\d{4}, it will match only 5-digit ZIP codes (and the first 5 digits of 9-digit ones). The reason is that the branches are tested from left to right, and once one branch is satisfied, the remaining branches are not tried.
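The ordering effect is easy to reproduce. Here is a minimal Python sketch with a made-up ZIP code; Python's re module tries alternatives from left to right in the way the paragraph describes, though some other tools (POSIX-style engines, for instance) pick the longest alternative instead.

```python
import re

text = "ZIP 12345-6789"

# Longer alternative first: the full 9-digit form is found.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", text).group(0)

# Shorter alternative first: it succeeds immediately, so the second is never tried.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", text).group(0)

# Both phone formats from the first example are accepted by one expression.
phone = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")
both_ok = all(phone.fullmatch(s) for s in ["010-12345678", "0376-2233445"])

print(long_first)
print(short_first)
print(both_ok)
```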

Grouping

We have already seen how to repeat a single character (just put a quantifier after the character); but what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group), then specify how many times the subexpression repeats, and you can also do other things with the subexpression (described later).

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a 1-to-3-digit number followed by a period (this whole unit is the group), repeated 3 times; finally comes one more 1-to-3-digit number (\d{1,3}).

No number in an IP address can be greater than 255; don't let the writers of season three of "24" fool you...

Unfortunately, it also matches 256.300.888.999, an IP address that cannot exist. If you could use arithmetic comparison, this would be easy to solve, but regular expressions provide no mathematical functions, so you have to use lengthy grouping, alternation, and character classes to describe a correct IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
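The two IP-address expressions can be compared side by side. This Python sketch builds the strict pattern from a helper string named octet, a name chosen here only to keep the pattern readable.

```python
import re

# The simple version: three "digits plus dot" groups, then a final number.
simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number from 0 to 255: 200-249, 250-255, or 0-199 with optional leading digits.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)

samples = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]
simple_results = [bool(simple.fullmatch(s)) for s in samples]
strict_results = [bool(strict.fullmatch(s)) for s in samples]

print(simple_results)
print(strict_results)
```

In application code that only needs validation, Python's standard ipaddress module is usually a simpler choice than a hand-written pattern.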

Negation

Sometimes you need to find characters that do not belong to some easily defined character class. For example, to find any character other than a digit, you need negation:

Table 3. Commonly used negation codes
Code	Description
\W	Match any character that is not a letter, digit, underscore, or Chinese character
\S	Match any character that is not whitespace
\D	Match any non-digit character
\B	Match a position that is not the beginning or end of a word
[^x]	Match any character except x
[^aeiou]	Match any character except the letters a, e, i, o, u

Example: <a[^>]+> matches a string that starts with a and is enclosed in angle brackets.
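A brief Python sketch of the negated codes. The HTML fragment and its example.com URL are placeholders invented for this demonstration.

```python
import re

# \D+ collects runs of non-digit characters.
non_digits = re.findall(r"\D+", "tel: 010-12345678")

# \S+ collects runs of non-whitespace characters.
tokens = re.findall(r"\S+", "a b\tc")

# [^>]+ means "one or more characters that are not >".
html = '<p>See <a href="http://example.com">this link</a>.</p>'
anchors = re.findall(r"<a[^>]+>", html)

print(non_digits)
print(tokens)
print(anchors)
```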

Backreferences

When you use parentheses to specify a subexpression, the text matching that subexpression (that is, what the group captures) can be processed further in the expression or in other programs. By default, each group automatically gets a group number. The rule: going left to right, using each group's left parenthesis as the marker, the first group to appear is number 1, the second is number 2, and so on.

Well... in fact, group number assignment is not as simple as I just said:

Group 0 corresponds to the entire regular expression.
Group numbers are actually assigned in two left-to-right passes: the first pass numbers only the unnamed groups, and the second pass numbers only the named groups, so every named group has a larger number than every unnamed group.
You can use the (?:exp) syntax to deprive a group of its right to take part in group numbering.

A backreference is used to search again for the text matched by an earlier group. For example, \1 stands for the text matched by group 1. Hard to understand? Look at an example:

\b(\w+)\b\s+\1\b can be used to match repeated words, like go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between a word beginning and a word end (\b(\w+)\b); this word is captured into the group numbered 1. Then come 1 or more whitespace characters (\s+), and finally the content captured in group 1, that is, the word matched earlier (\1).

You can also name a subexpression's group yourself. To specify a group name, use this syntax: (?<word>\w+) (or replace the angle brackets with ': (?'word'\w+)), which names the \w+ group word. To backreference what this group captured, use \k<word>, so the previous example can also be written as \b(?<word>\w+)\b\s+\k<word>\b.
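The repeated-word expression can be tried in Python as sketched below. Note that the named-group spelling differs between tools: Python's re module writes named groups as (?P<word>...) and named backreferences as (?P=word), and does not accept the (?<word>...) and \k<word> forms shown above. The sample sentence is made up.

```python
import re

text = "Let's go go home, kitty kitty."

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = re.compile(r"\b(\w+)\b\s+\1\b")
repeats = [m.group(0) for m in numbered.finditer(text)]
words = [m.group(1) for m in numbered.finditer(text)]

# Named group and named backreference, in Python's own syntax.
named = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")
first = named.search(text)

print(repeats)
print(words)
print(first.group("word"))
```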

Parentheses have many special-purpose syntaxes. Some of the most common are listed below:

Table 4. Common grouping syntax
Classification	Code	Description
Capture	(exp)	Match exp and capture the text into an automatically numbered group
	(?<name>exp)	Match exp and capture the text into a group named name; can also be written (?'name'exp)
	(?:exp)	Match exp without capturing the matched text or assigning a group number
Zero-width assertions	(?=exp)	Match the position before exp
	(?<=exp)	Match the position after exp
	(?!exp)	Match a position not followed by exp
	(?<!exp)	Match a position not preceded by exp
Comments	(?#comment)	This kind of group has no effect on how the regular expression is processed; it provides a comment for people to read

Zero-width assertions

The next four constructs in the table are used to find things before or after some content (but not including it); that is, like \b, ^, and $, they specify a position that must satisfy certain conditions (assertions), which is why they are called zero-width assertions. An example explains it best:

An assertion declares a fact that should be true. In a regular expression, matching continues only when the assertion holds.

(?=exp), also called the zero-width positive lookahead assertion, asserts that what follows its own position can match the expression exp. For example, \b\w+(?=ing\b) matches the part of a word ending in ing that comes before the ing: when searching I'm singing while you're dancing. it matches sing and danc.
(?<=exp), also called the zero-width positive lookbehind assertion, asserts that what precedes its own position can match the expression exp. For example, (?<=\bre)\w+\b matches the second half of a word beginning with re (everything except the re): when searching reading a book, it matches ading.

The following example uses both assertions: (?<=\s)\d+(?=\s) matches numbers separated by whitespace (again, the whitespace characters themselves are not included).
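The lookahead and lookbehind examples can be checked with Python's re module, as in the sketch below. The third sample string is made up; note that its final number is not returned because it is followed by the end of the string rather than by whitespace.

```python
import re

# Lookahead: the word part that is followed by "ing" at a word end.
stems = re.findall(r"\b\w+(?=ing\b)", "I'm singing while you're dancing.")

# Lookbehind: the rest of a word that starts with "re".
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# Both at once: digits with whitespace on each side, whitespace not included.
numbers = re.findall(r"(?<=\s)\d+(?=\s)", "room 101 has 3 beds and 42")

print(stems)
print(tails)
print(numbers)
```

Python requires the expression inside a lookbehind to have a fixed width, which holds for both lookbehinds used here.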

Negative zero-width assertions

Earlier we covered how to find a character that is not a given character, or not in a given character class (negation). But what if we only want to make sure a certain character does not appear, without matching anything in its place? For example, to find a word that contains the letter q where the q is not followed by the letter u, we might try this:

\b\w*q[^u]\w*\b matches a word containing the letter q that is not followed by the letter u. But if you test more (or if you are sharp enough to see it directly), you will find that when q appears at the end of a word, as in Iraq or Benq, the expression goes wrong. This is because [^u] always matches one character, so if q is the last character of the word, the [^u] that follows matches the word separator after q (a space, a period, or something else), and the subsequent \w*\b matches the next word; as a result, \b\w*q[^u]\w*\b can match the whole of Iraq fighting. A negative zero-width assertion solves this problem because it matches only a position and consumes no characters. Now we can solve the problem like this: \b\w*q(?!u)\w*\b.

The zero-width negative lookahead assertion (?!exp) asserts that what follows this position cannot match the expression exp. For example, \d{3}(?!\d) matches three digits that are not followed by another digit; \b((?!abc)\w)+\b matches words that do not contain the consecutive string abc.

Similarly, we can use (?<!exp), the zero-width negative lookbehind assertion, to assert that what precedes this position cannot match the expression exp: (?<![a-z])\d{7} matches a seven-digit number not preceded by a lowercase letter.
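The difference between [^u] and (?!u), and the other negative assertions above, can be observed directly. This is a minimal Python sketch; apart from Iraq fighting, the sample strings are made up.

```python
import re

text = "Iraq fighting"

# [^u] consumes the space after q, so the match runs on into the next word.
with_class = re.search(r"\b\w*q[^u]\w*\b", text).group(0)

# (?!u) only checks the position and consumes nothing.
with_lookahead = re.search(r"\b\w*q(?!u)\w*\b", text).group(0)

# Three digits not followed by another digit.
triples = re.findall(r"\d{3}(?!\d)", "12345 678")

# Words that never contain "abc": the lookahead is checked before every character.
no_abc = [m.group(0) for m in re.finditer(r"\b((?!abc)\w)+\b", "xabcx plain")]

# Seven digits not preceded by a lowercase letter.
seven = re.findall(r"(?<![a-z])\d{7}", "a1234567 7654321")

print(with_class)
print(with_lookahead)
print(triples, no_abc, seven)
```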

Please analyze the expression (?<=<(\w+)>).*(?=<\/\1>) in detail; it best shows what zero-width assertions are really for.

A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content of a simple HTML tag that has no attributes. (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets (for example, <b>), followed by .* (any string), and finally the suffix (?=<\/\1>). Note the \/ in the suffix, which uses the character escape mentioned earlier; \1 is a backreference to the first captured group, the text matched by the preceding (\w+), so if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (again, not including the prefix and suffix themselves).

Greed and laziness

When a regular expression contains a quantifier that accepts repetition, the usual behavior is to match as many characters as possible (while still letting the whole expression match). Take a.*b as an example: it matches the longest string that starts with a and ends with b. Used to search aabab, it matches the entire string aabab. This is called greedy matching.

Sometimes we need lazy matching instead, that is, matching as few characters as possible. Any of the quantifiers given earlier can be turned into its lazy form by appending a question mark ?. So .*? means: match any number of repetitions, but use the fewest repetitions that still let the whole match succeed. Now look at the lazy version of the example: a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (the first to third characters) and ab (the fourth to fifth characters).

Why is the first match aab (first to third characters) rather than ab (second to third characters)? Simply put, regular expressions have another rule that takes priority over the lazy/greedy rule: the match that begins earliest wins.
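Greedy and lazy matching on aabab can be reproduced with two calls in Python. The HTML-like string in the second half is a made-up illustration of a common place where the difference matters.

```python
import re

# Greedy: .* grabs as much as it can, so one match covers the whole string.
greedy = re.findall(r"a.*b", "aabab")

# Lazy: .*? grabs as little as it can; the earliest start still wins.
lazy = re.findall(r"a.*?b", "aabab")

# The same contrast on markup-like text.
markup = "<i>one</i> and <i>two</i>"
greedy_tags = re.findall(r"<i>.*</i>", markup)
lazy_tags = re.findall(r"<i>.*?</i>", markup)

print(greedy, lazy)
print(greedy_tags)
print(lazy_tags)
```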

Table 5. Lazy quantifiers
Code	Description
*?	Repeat any number of times, but as few times as possible
+?	Repeat 1 or more times, but as few times as possible
??	Repeat 0 or 1 time, but as few times as possible
{n,m}?	Repeat n to m times, but as few times as possible
{n,}?	Repeat n or more times, but as few times as possible

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.