Regular Expression Analysis

Source: Internet
Author: User
Tags: repetition, expression, engine

Some time ago I used my spare time to develop a tag-based CMS, in which I used regular expressions for tag-based data extraction and data filling. Here I will briefly describe the syntax and usage of regular expressions. Afterwards I will introduce methods and code examples for using regular expressions in C#.

What is a regular expression?

Basically, a regular expression is a pattern that describes a set of strings. "Regex" is short for "regular expression". We write a custom pattern to match some text and extract the required data from it. It is similar in spirit to the Windows wildcard: for example, *.txt finds all txt files, where * stands for any sequence of characters.

A simple Regular Expression

By convention, we start with hello world. To match the 11 characters "hello world", the regular expression is ^hello\sworld$. Here ^ marks the start of the string, $ marks the end, and \s matches one whitespace character (such as a space). An important note: regular expressions are positional, that is, each character in the pattern must correspond to a character in the actual text. The regular expression above cannot match " hello world", because there is a space in front of "hello".
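The anchoring behavior described above can be tried out with a minimal sketch. It uses Python's re module purely for illustration (an assumption; the article itself targets C#), and the helper name is_hello is hypothetical.

```python
import re

# Pattern from the text: ^ and $ anchor the whole string,
# \s matches exactly one whitespace character.
HELLO = re.compile(r"^hello\sworld$")


def is_hello(text: str) -> bool:
    """Return True when the string is 'hello', one whitespace character, 'world'."""
    return HELLO.search(text) is not None


print(is_hello("hello world"))   # True
print(is_hello(" hello world"))  # False: a space comes before 'hello'
print(is_hello("hello  world"))  # False: two spaces, but \s matches only one
print(is_hello("hello\tworld"))  # True: \s also matches a tab
```

The last call shows that \s is broader than a literal space, so a pattern that must accept only a space should use a space character instead.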

Another Regular Expression

Suppose we need to match an IPv4 address. The address is divided into four segments separated by '.'. Each segment ranges from 0 to 255. The first three segments each end with '.' and the last one does not. Analysis gives the following conclusions.

1) The matching pattern is the same for the first three segments. The last segment uses the same pattern, except that it does not end with '.'.

2) Each segment of an IPv4 address ranges from 0 to 255, so it can be 0-9, 10-99, or 100-255. The cases that need care are 100-199 and 200-255. If the first digit is 1, the last two digits can be any digit. If the first digit is 2, the second digit can only be 0-5. In addition, when the second digit is 0-4, the third digit can be any digit, and when the second digit is 5, the third digit can only be 0-5.

Below we will write a matching pattern that matches the second point above.

① When the first digit is 2 and the second digit is 0-4: 2[0-4]\d

② When the first digit is 2 and the second digit is 5: 25[0-5]

③ When the first digit is 0 or 1, the second and third digits can be any digit: [01]\d\d

④ When the first digit is not 0-2, the value cannot be in the range 100-255; it can only be in the range 0-9 or 10-99. A value in 0-9 is a single digit: \d. A value in 10-99 is two digits: \d{2}

⑤ Comparing points ③ and ④, we see that a leading 0 or 1 is needed only when the segment has three digits, and that otherwise the segment is one or two digits. Points ③ and ④ can therefore be merged into [01]?\d\d?

Now the matching pattern for a single segment in the range 0 to 255 is complete. Taking into account that each of the first three segments is followed by '.', the final pattern is: ^((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)$

Note that the three alternatives are wrapped in their own parentheses, so that \. applies to all of them and not only to the last alternative.
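As a rough check of the analysis above, the following sketch assembles the segment pattern and the full address pattern in Python (chosen only for illustration; the article targets C#). The names OCTET, IPV4, and is_ipv4 are hypothetical, and the pattern is the reconstructed form [01]?\d\d? for the merged one-to-three-digit case.

```python
import re

# One segment in the range 0-255, built from points 1-5 of the analysis.
OCTET = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"

# Three segments each followed by '.', then a final segment without '.'.
IPV4 = re.compile(r"^(" + OCTET + r"\.){3}" + OCTET + r"$")


def is_ipv4(text: str) -> bool:
    """Return True when the whole string is a dotted IPv4 address."""
    # fullmatch is used so that a trailing newline is not silently accepted.
    return IPV4.fullmatch(text) is not None


print(is_ipv4("192.168.0.1"))      # True
print(is_ipv4("255.255.255.255"))  # True
print(is_ipv4("256.1.1.1"))        # False: 256 is out of range
print(is_ipv4("1.2.3"))            # False: only three segments
print(is_ipv4("01.02.03.004"))     # True: leading zeros are accepted by this pattern
```

The last call shows a property worth knowing: because of the optional [01] prefix, segments with leading zeros pass. Whether that is acceptable depends on the application.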

Metacharacters

From the two examples above, I believe you have a preliminary understanding of regular expressions. Here we list the metacharacters of regular expressions. Metacharacters are similar to keywords in a programming language: they are characters with a special meaning to the engine.

1) .: matches any character except a newline

2) \w: matches a letter, digit, underscore, or Chinese character

3) \s: matches any whitespace character

4) \d: matches any digit

5) \b: matches the start or end of a word

6) ^: matches the start of a string

7) $: matches the end of a string
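The metacharacters listed above can be seen in action with a small sketch. It uses Python's re module as a stand-in (an assumption; the article targets C#), and the sample string is made up for illustration.

```python
import re

# Hypothetical sample text with letters, digits, an underscore,
# Chinese characters, and spaces.
text = "id_42 正则 9:30"

digits = re.findall(r"\d", text)           # every single digit
words = re.findall(r"\w+", text)           # runs of word characters
spaces = re.findall(r"\s", text)           # every whitespace character
word_starts = re.findall(r"\b\w", text)    # first character after each word boundary
first_char = re.findall(r"^.", text)       # any character at the start of the string
last_char = re.findall(r".$", text)        # any character at the end of the string

print(digits)       # ['4', '2', '9', '3', '0']
print(words)        # ['id_42', '正则', '9', '30']
print(spaces)       # [' ', ' ']
print(word_starts)  # ['i', '正', '9', '3']
print(first_char)   # ['i']
print(last_char)    # ['0']
```

In Python 3, \w on a str pattern is Unicode-aware by default, which is why the Chinese characters count as word characters here.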

Repetition, greedy repetition, and lazy repetition

Repetition means that a pattern is matched several times in a row. The '?' sign is one example of a repetition mark: it means the pattern appears at most once. The repetition marks are listed below:

1) *: asterisk: repeats any number of times, including zero. For example, \w* means zero or more letters, digits, underscores, or Chinese characters.

2) +: plus sign: appears at least once. For example, \d+ means one or more digits.

3) ?: question mark: appears at most once (or not at all). For example, \s? means 0 or 1 whitespace character.

4) {n}: n is a non-negative integer; repeats exactly n times. For example, .{5} means any character appears 5 times.

5) {n,}: repeats at least n times. For example, \d{5,} means 5 or more digits.

6) {n,m}: repeats n to m times. For example, \d{5,10} means 5 to 10 digits.
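The repetition marks above can be checked one by one with a short sketch. Python's re module is used only for illustration (the article targets C#), and the helper name matches is hypothetical. It tests whether a pattern covers an entire string, which makes the counts easy to see.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


print(matches(r"\w*", ""))                # True: * allows zero repetitions
print(matches(r"\d+", ""))                # False: + needs at least one digit
print(matches(r"\d+", "2024"))            # True
print(matches(r"a\s?b", "ab"))            # True: ? allows the space to be absent
print(matches(r"a\s?b", "a  b"))          # False: ? allows at most one space
print(matches(r".{5}", "hello"))          # True: exactly five characters
print(matches(r"\d{5,}", "1234"))         # False: fewer than five digits
print(matches(r"\d{5,10}", "12345678901"))  # False: eleven digits is too many
```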

Greedy and lazy matching

First, let's look at a regular expression example: "<.*>"

This regular expression matches any text that starts with '<' and ends with '>', such as <br/>, <div/>, and </script>. If you use it to match <div>hello world</div>, the matched text is the whole of <div>hello world</div>, because regular expressions are greedy by default: the engine extends the match to the last position where the closing '>' can be found. In this example that is the '>' at the end of </div>, not the one at the end of <div>. This is greedy matching.

Now that we understand greedy matching, we also need to know lazy matching.

Change the regular expression above slightly: <.*?>. Adding a question mark after * tells the engine to match lazily, that is, to stop at the first position where the rest of the pattern can match. Using this regular expression to match <div>hello world</div> produces two results: one is <div> and the other is </div>. Both are found because the pattern is not anchored to the beginning and end of the text.
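The difference between the two modes can be reproduced with a minimal sketch, again using Python's re module as an illustrative stand-in for the C# engine the article targets. The variable names are hypothetical.

```python
import re

html = "<div>hello world</div>"

# Greedy: .* runs as far as possible, so the match ends at the last '>'.
greedy = re.findall(r"<.*>", html)

# Lazy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<div>hello world</div>']
print(lazy)    # ['<div>', '</div>']
```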

Character sets, alternation, and negation

Character set: [A-Za-z0-9] matches one character, which is any one of A-Z, a-z, or 0-9

Alternation: (a|b)* means a or b appears any number of times

Negation: [^ab]+ means a character that is neither a nor b appears at least once
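The three constructs above can be sketched as follows. Python's re module is an assumption made for illustration (the article targets C#), and the sample strings and variable names are made up.

```python
import re

# Character set: each match is one character from A-Z, a-z, or 0-9.
alnum = re.findall(r"[A-Za-z0-9]", "a-Z 9!")

# Alternation: the whole string must consist only of a's and b's.
only_ab = re.fullmatch(r"(a|b)*", "abba") is not None
not_only_ab = re.fullmatch(r"(a|b)*", "abc") is not None

# Negation: runs of one or more characters that are neither a nor b.
others = re.findall(r"[^ab]+", "abcdabxyz")

print(alnum)        # ['a', 'Z', '9']
print(only_ab)      # True
print(not_only_ab)  # False
print(others)       # ['cd', 'xyz']
```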

Group

Groups in a regular expression are marked with parentheses (). For example, (\w*) turns \w* into a single matching unit, a group. The regular expression <(\w+?)[A-Za-z0-9"=\s]*>.*?</\1> can match the text <div class="box">hello world</div>. Note that \w+? is grouped, and \1 is then used to reference this group. By default, the engine numbers groups from left to right: the first group is group 1, the second is group 2, and so on. We can also name a group explicitly, so the regular expression in this example can also be written as <(?<g1>\w+?)[A-Za-z0-9"=\s]*>.*?</\k<g1>>. Here we name the \w+? group g1 and reference it later through \k<name>.

OK. Next we will talk about how to use regular expressions in C# to extract data, encode user input, and validate user input.
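Numbered and named backreferences can be tried with the sketch below. It is written in Python for illustration only, and one difference from the syntax in the text must be stated plainly: Python's re module spells a named group as (?P<g1>...) and its backreference as (?P=g1), whereas the (?<g1>...) and \k<g1> forms shown in the text are the .NET-style spelling. The variable names are hypothetical.

```python
import re

html = '<div class="box">hello world</div>'

# Numbered group: \1 must repeat whatever group 1 captured (the tag name).
numbered = re.compile(r'<(\w+?)[A-Za-z0-9"=\s]*>.*?</\1>')

# Named group, in Python's spelling of the same idea.
named = re.compile(r'<(?P<g1>\w+?)[A-Za-z0-9"=\s]*>.*?</(?P=g1)>')

m1 = numbered.search(html)
m2 = named.search(html)

print(m1.group(0))     # <div class="box">hello world</div>
print(m1.group(1))     # div
print(m2.group("g1"))  # div

# A mismatched closing tag does not satisfy the backreference.
print(numbered.search('<div class="box">hello</span>'))  # None
```

The lazy \w+? first tries a one-letter tag name and only grows it when the closing tag fails to match, which is why the captured name ends up as the full "div".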
