Learning Regular Expressions: Repeated Matching

Source: Internet
Author: User

PS: In all examples, the text matched by the regular expression is enclosed between [ and ] in the source text. Some examples use Java; where a regular expression is written specifically for Java, this is noted in the relevant place. All Java examples have been tested under JDK 1.6.0_13.

 

I. How many times to match

In the previous articles, we discussed matching a single character. But what if a character or character set must be matched several times in a row? For example, to match an email address, someone might use the methods covered earlier to write a regular expression like \w@\w\.\w, but this can only match an address like a@b.c, which is obviously not good enough. Let's look at how to match an email address.

First, you need to know how an email address is composed: a group of characters made up of letters, digits, or underscores, followed by the @ symbol, followed by the domain name; that is, username@domain. The details also depend on the email service provider, and some providers allow the . character in the user name.

 

1. Match one or more characters

To match one or more repetitions of the same character (or character set), simply add + after it as a suffix. + matches one or more occurrences (at least one). For example, a matches a itself, while a+ matches one or more consecutive a characters, and [0-9]+ matches one or more consecutive digits.

Note: When adding the + suffix to a character set, you must put + outside the set; otherwise it is not a repetition. For example, [0-9+] matches a single character that is either a digit or a plus sign. The syntax is valid, but it is not what we want.
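The difference between putting + outside and inside the character set can be checked with a small Java sketch. The class and method names here are hypothetical and only serve the illustration; String.matches requires the whole input to match.

```java
// Hypothetical demo class; the name PlusPlacementDemo is not from the article.
class PlusPlacementDemo {
    // True if the whole input is matched by the regular expression.
    static boolean matchesAll(String regex, String input) {
        return input.matches(regex);
    }

    public static void main(String[] args) {
        // + outside the set: a run of one or more digits
        System.out.println(matchesAll("[0-9]+", "2024"));
        // + inside the set: exactly one character, so a four-digit input fails
        System.out.println(matchesAll("[0-9+]", "2024"));
        // that single character may be a literal plus sign
        System.out.println(matchesAll("[0-9+]", "+"));
    }
}
```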

Text: Hello, mhmyqn@qq.com or mhmyqn@126.com is my email.

Regular Expression: \w+@(\w+\.)+\w+

Results: Hello, [mhmyqn@qq.com] or [mhmyqn@126.com] is my email.

Analysis: \w+ matches one or more characters, and the subexpression (\w+\.)+ matches a string like xxxx.edu. (including the trailing dot). An email address never ends with a . character, so a final \w+ follows. Email addresses like mhmyqn@xxxx.edu.cn are also matched.
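As a rough Java sketch of this example, the pattern can be applied with java.util.regex. The class name EmailMatchDemo and the helper findEmails are made up for illustration; note that every backslash is doubled inside a Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the article.
class EmailMatchDemo {
    // The article's pattern \w+@(\w+\.)+\w+ written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile("\\w+@(\\w+\\.)+\\w+");

    // Collects every substring of text that the pattern matches.
    static List<String> findEmails(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Hello, mhmyqn@qq.com or mhmyqn@126.com is my email.";
        System.out.println(findEmails(text));
    }
}
```

This simple pattern is only meant to show repetition; it is not a complete validator for real-world email addresses.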

 

2. Match zero or more characters

To match zero or more characters, use the metacharacter *. It is used in exactly the same way as +: put it after a character or character set, and that character (or set) can then be matched zero or more times in a row. For example, the regular expression ab*c matches ac, abc, and abbbbbc.
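A minimal Java sketch of the ab*c example follows; the class name StarDemo and its helper are hypothetical.

```java
// Hypothetical demo class; the name StarDemo is not from the article.
class StarDemo {
    // True if the whole input matches ab*c (an a, zero or more b, then a c).
    static boolean matchesAbStarC(String input) {
        return input.matches("ab*c");
    }

    public static void main(String[] args) {
        String[] samples = {"ac", "abc", "abbbbbc", "adc"};
        for (String s : samples) {
            System.out.println(s + " -> " + matchesAbStarC(s));
        }
    }
}
```

The last sample, adc, is included to show an input that does not match, because d is not a repetition of b.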

 

3. Match zero or one character

To match zero or one character, use the metacharacter ?. As mentioned in a previous article, the regular expression \r\n\r\n matches a blank line, but on Unix and Linux the \r is not present. With the ? metacharacter, \r?\n\r?\n matches blank lines on Windows as well as on Unix and Linux. Here is an example that matches URLs using either the http or the https protocol:

Text: The URL is http://www.mikan.com, to connect securely use https://www.mikan.com instead.

Regular Expression: https?://(\w+\.)+\w+

Results: The URL is [http://www.mikan.com], to connect securely use [https://www.mikan.com] instead.

Analysis: This pattern starts with https?. The ? means that the preceding character (the s) may or may not be present, so the pattern matches both http and https. The rest is the same as in the previous example.
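The two uses of ? in this section can be sketched in Java as below. The class UrlMatchDemo and its helpers findUrls and hasBlankLine are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the article.
class UrlMatchDemo {
    // https?://(\w+\.)+\w+ : the s is optional because of the ? suffix.
    static final Pattern URL = Pattern.compile("https?://(\\w+\\.)+\\w+");
    // \r?\n\r?\n : two line endings in a row, each with an optional \r.
    static final Pattern BLANK_LINE = Pattern.compile("\\r?\\n\\r?\\n");

    // Collects every URL that the pattern finds in text.
    static List<String> findUrls(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // True if text contains a blank line in either Windows or Unix style.
    static boolean hasBlankLine(String text) {
        return BLANK_LINE.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "The URL is http://www.mikan.com, to connect securely "
                + "use https://www.mikan.com instead.";
        System.out.println(findUrls(text));
        System.out.println(hasBlankLine("one\r\n\r\ntwo"));
        System.out.println(hasBlankLine("one\n\ntwo"));
    }
}
```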

 

II. Matching a specific number of repetitions

+, *, and ? solve many problems, but some limitations remain:

1) + and * have no upper limit on the number of characters they match. We cannot set a maximum number of matches for them.

2) The minimum for +, *, and ? is fixed at one or zero. We cannot set any other minimum number of matches.

3) With only * and +, we cannot set the number of matches to an exact value.

Regular expressions provide a syntax for setting the number of repetitions. The number is written between the { and } characters.

1. Set an exact value for the number of repeated matches

To set an exact value for the number of repetitions, write the number between { and }. For example, {4} means that the character (or character set) before it must be repeated four times in the original text for a match. If it appears only three times, it is not a match.

As mentioned in previous articles, a repetition count can be used to match colors in a web page: #[[:xdigit:]]{6} or #[0-9a-fA-F]{6}. In Java, the POSIX character class form is written #\p{XDigit}{6}.
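A small Java sketch of the color example, using both the explicit character set and the \p{XDigit} form. The class name HexColorDemo and its helpers are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the article.
class HexColorDemo {
    // Explicit character set repeated exactly six times.
    static final Pattern BY_SET = Pattern.compile("#[0-9a-fA-F]{6}");
    // Java's POSIX-style class for hexadecimal digits, repeated six times.
    static final Pattern BY_POSIX = Pattern.compile("#\\p{XDigit}{6}");

    // True if the whole input is a color according to the explicit set.
    static boolean isColorBySet(String input) {
        return BY_SET.matcher(input).matches();
    }

    // True if the whole input is a color according to \p{XDigit}.
    static boolean isColorByPosix(String input) {
        return BY_POSIX.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"#FFCC00", "#ffcc0", "#GGCC00"};
        for (String s : samples) {
            System.out.println(s + " -> " + isColorBySet(s) + ", " + isColorByPosix(s));
        }
    }
}
```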

2. Set an interval for the number of repeated matches

The {} syntax can also be used to set an interval for the number of repetitions, that is, a minimum and a maximum. The interval must be given in the form {n,m}, where m >= n >= 0. For example, to check whether a date written as year-month-day is in the correct format (without checking that the date itself is valid), the regular expression is: \d{4}-\d{1,2}-\d{1,2}.
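The date format check can be sketched in Java as follows. The class name DateFormatDemo and the sample inputs are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample dates are not from the article.
class DateFormatDemo {
    // Four digits, then one or two digits, then one or two digits, separated by hyphens.
    static final Pattern DATE = Pattern.compile("\\d{4}-\\d{1,2}-\\d{1,2}");

    // True if the whole input has the expected shape; the date itself is not validated.
    static boolean looksLikeDate(String input) {
        return DATE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"2009-01-01", "2009-1-1", "09-1-1", "2009-001-1", "2009-99-99"};
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeDate(s));
        }
    }
}
```

The last sample, 2009-99-99, has the right shape and is accepted, which shows that the pattern checks the format only and not whether the date exists.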

3. Match at least a given number of repetitions

The last form of the {} syntax gives a minimum number of repetitions without a maximum. For example, {3,} means at least three repetitions. Note: {3,} must contain the comma, and there must be no space after the comma. Otherwise, an error occurs.

Let's look at an example. Use a regular expression to find all amounts of $100 or more:

Text:

$25.36

$125.36

$205.0

$2500.44

$44.30

Regular Expression: \$\d{3,}\.\d{2}

Result:

$25.36

[$125.36]

$205.0 (not matched: only one digit follows the decimal point, but \d{2} requires two)

[$2500.44]

$44.30
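The amounts example can be tried with the following Java sketch. The class name AmountDemo and the helper findLargeAmounts are hypothetical. The $ sign is escaped as \$ because an unescaped $ is an end-of-line anchor, and under this pattern $205.0 is not reported because \d{2} requires two digits after the decimal point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the article.
class AmountDemo {
    // \$\d{3,}\.\d{2} : a dollar sign, at least three digits, a dot, two digits.
    static final Pattern LARGE_AMOUNT = Pattern.compile("\\$\\d{3,}\\.\\d{2}");

    // Collects every amount in text that the pattern matches.
    static List<String> findLargeAmounts(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = LARGE_AMOUNT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "$25.36\n$125.36\n$205.0\n$2500.44\n$44.30";
        System.out.println(findLargeAmounts(text));
    }
}
```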

 

+, *, and ? can also be expressed as repetition counts:

+ is equivalent to {1,}

* is equivalent to {0,}

? is equivalent to {0,1}
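One way to convince yourself of these equivalences is to run both forms over the same input and compare the matches. The sketch below does that; the class EquivalenceDemo, its helpers, and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample input are not from the article.
class EquivalenceDemo {
    // Returns all matches of regex in input, including empty matches.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // True if both expressions produce the same sequence of matches on input.
    static boolean sameMatches(String first, String second, String input) {
        return findAll(first, input).equals(findAll(second, input));
    }

    public static void main(String[] args) {
        String input = "b ab aab";
        System.out.println(sameMatches("a+", "a{1,}", input));
        System.out.println(sameMatches("a*", "a{0,}", input));
        System.out.println(sameMatches("a?", "a{0,1}", input));
    }
}
```

Agreement on one sample input is only a spot check, not a proof, but it is a quick way to see the shorthand and interval forms behave alike.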

 

III. Preventing over-matching

? can only match zero or one character, and {n} and {n,m} also have an upper limit on the number of matches. But *, +, and {n,} have no upper limit, and this sometimes leads to over-matching.

Let's look at an example of matching an HTML tag.

Text:

Yesterday is <B> history </B>, tomorrow is a <B> mystery </B>, but today is a <B> gift </B>.

Regular Expression: <[Bb]>.*</[Bb]>

Result:

Yesterday is [<B> history </B>, tomorrow is a <B> mystery </B>, but today is a <B> gift </B>].

Analysis: <[Bb]> matches the <B> tag (case-insensitively) and </[Bb]> matches the </B> tag (case-insensitively). However, the result is not the three matches we expected but a single one: everything from the first <B> tag to the last </B> tag is matched.

Why? Because * and + are greedy metacharacters. Their behavior during matching is "the more, the better": they try to extend the match from its start to the last possible end in the text, instead of stopping at the first possible end.

When this greedy behavior is not wanted, use the lazy versions of these metacharacters. Lazy means matching as few characters as possible, the opposite of greedy. To make a greedy metacharacter lazy, just add a ? suffix to it. The lazy versions corresponding to the greedy metacharacters are as follows:

Greedy    Lazy

*         *?

+         +?

{n,}      {n,}?

In the example above, the regular expression only needs to be changed to <[Bb]>.*?</[Bb]>. The result is then:

<B> history </B>

<B> mystery </B>

<B> gift </B>
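The greedy and lazy versions can be compared side by side with a Java sketch like the following. The class name LazyDemo and the helper findAll are hypothetical; the sample text is the one used in this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the article.
class LazyDemo {
    static final String TEXT = "Yesterday is <B> history </B>, tomorrow is a "
            + "<B> mystery </B>, but today is a <B> gift </B>.";

    // Returns all matches of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: .* runs to the last </B>, so there is a single long match.
        System.out.println(findAll("<[Bb]>.*</[Bb]>", TEXT));
        // Lazy: .*? stops at the first </B>, so each tag pair is its own match.
        System.out.println(findAll("<[Bb]>.*?</[Bb]>", TEXT));
    }
}
```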

 

IV. Summary

The real power of regular expressions shows in repeated matching. This article covered the usage of the metacharacters +, *, and ?. To specify the number of matches precisely, use {}. These metacharacters come in greedy and lazy forms; to prevent over-matching, use the lazy forms when constructing regular expressions.
Author: zhanghu198901
