Regular Expressions: Getting Started

Source: Internet
Author: User
Tags: character classes, repetition
I. Introduction

Many readers will have heard of regular expressions. The term dates back to 1956, when the American mathematician Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept. A regular expression is an expression that describes what Kleene called "the algebra of regular sets", hence the name.

Later, this work was applied to early research on computational search algorithms by Ken Thompson, the principal inventor of UNIX. The first practical use of regular expressions was in the QED editor.

Q: What can a regular expression do for us?

A: Regular expressions are an important part of text editors and search tools. They let users build a matching pattern from a series of special characters, compare that pattern against target text such as data files, program input, or form input on a web page, and then run the appropriate code depending on whether the target contains the pattern.

Next, we introduce regular expressions step by step through their syntax.

II. First contact with regular expressions

Let's start with some basic concepts. As a notation for describing text, regular expressions define a set of symbols for various character classes. The following definitions come from MSDN (ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconcharacterclasses.htm).

Character classes
Character class
Description

.
Matches any character except \n. If the Singleline option is used (see Regular Expression Options), the period matches any character.

[aeiou]
Matches any single character in the specified character set.

[^aeiou]
Matches any single character that is not in the specified character set.

[0-9a-fA-F]
The hyphen (-) lets you specify a range of consecutive characters.

\p{name}
Matches any character in the named character class specified by {name}. The supported names are Unicode groups and block ranges, for example Ll, Nd, Z, IsGreek, IsBoxDrawing.

\P{name}
Matches text that is not included in the groups and block ranges specified in {name}.

\w
Matches any word character. Equivalent to the Unicode character categories
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9].

\W
Matches any non-word character. Equivalent to the Unicode categories [^\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \W is equivalent to [^a-zA-Z_0-9].

\s
Matches any white-space character. Equivalent to the Unicode character categories [\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \s is equivalent to [ \f\n\r\t\v].

\S
Matches any non-white-space character. Equivalent to the Unicode character categories [^\f\n\r\t\v\x85\p{Z}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \S is equivalent to [^ \f\n\r\t\v].

\d
Matches any decimal digit. Equivalent to \p{Nd} for Unicode and to [0-9] for non-Unicode, ECMAScript behavior.

\D
Matches any non-digit. Equivalent to \P{Nd} for Unicode and to [^0-9] for non-Unicode, ECMAScript behavior.

The table above lists the most basic syntax of regular expressions. With it we can already write some simple rules, for example:

1. Match any character

. (a single period; there is nothing more to write (@_@))

2. Match an English character (strictly, a word character: a letter, a digit, or the underscore)

A) \w

B) [a-zA-Z_0-9]

3. Match a decimal digit

A) \d

B) [0-9]
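The simple rules above can be tried out directly. The following is a minimal sketch in JavaScript, where \w follows the ECMAScript behavior from the table, [a-zA-Z_0-9]. The sample string is made up for illustration.

```javascript
// Character classes from the table, tried on a made-up sample string.
const sample = "Room 42_b!";

// \w matches one word character: a letter, a digit, or the underscore.
const wordChars = sample.match(/\w/g);

// [a-zA-Z_0-9] is the explicit spelling of the same class.
const explicitWordChars = sample.match(/[a-zA-Z_0-9]/g);

// \d and [0-9] both match a single decimal digit.
const digits = sample.match(/\d/g);
const explicitDigits = sample.match(/[0-9]/g);

console.log(wordChars.join(""));         // Room42_b
console.log(explicitWordChars.join("")); // Room42_b
console.log(digits.join(""));            // 42
console.log(explicitDigits.join(""));    // 42
```

Each pattern here matches exactly one character at a time; the space and the exclamation mark are skipped because they are not word characters.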

Do the examples above look simple? So far, however, these rules have a major gap: nothing states how many characters should match.

Q: I want to match five English letters.

A: ???

The knowledge covered so far cannot solve this problem. So how do regular expressions handle it? Let's look at the following table:

(ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconquantifiers.htm)

Quantifiers
Quantifier
Description

*
Specifies zero or more matches; for example, \w* or (abc)*. Same as {0,}.

+
Specifies one or more matches; for example, \w+ or (abc)+. Same as {1,}.

?
Specifies zero or one match; for example, \w? or (abc)?. Same as {0,1}.

{n}
Specifies exactly n matches; for example, (pizza){2}.

{n,}
Specifies at least n matches; for example, (abc){2,}.

{n,m}
Specifies at least n matches but no more than m matches.

*?
Specifies the first match that uses as few repeats as possible (lazy *).

+?
Specifies as few repeats as possible, but at least one (lazy +).

??
Specifies zero repeats if possible, or one (lazy ?).

{n}?
Equivalent to {n} (lazy {n}).

{n,}?
Specifies as few repeats as possible, but at least n (lazy {n,}).

{n,m}?
Specifies as few repeats as possible between n and m (lazy {n,m}).

Combining the character classes with the quantifiers in this table lets us write more powerful regular expressions.

For example:

1. Match zero or more of any character

.*

2. Match one or more of any character

.+

3. Match zero or more English characters

\w*

4. Match one or more letters or digits

[a-zA-Z0-9]+

5. Match 3 decimal digits

\d{3}

6. Match at least 3 decimal digits

\d{3,}

7. Match 3 to 6 decimal digits

\d{3,6}

Now we can answer the earlier question:

Q: I want to match five English letters.

A: \w{5} (strictly speaking, \w also accepts digits and the underscore; [a-zA-Z]{5} accepts letters only)
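To see the quantifiers at work, here is a small JavaScript sketch. The sample strings are invented for illustration, and the last part contrasts a greedy quantifier with its lazy form from the table.

```javascript
// Quantifiers from the table, applied to a made-up sample string.
const text = "id 7, code 12345, serial 1234567";

const exactlyThree = text.match(/\d{3}/g);   // three digits per match
const atLeastThree = text.match(/\d{3,}/g);  // three or more digits per match
const threeToSix = text.match(/\d{3,6}/g);   // three to six digits per match

console.log(exactlyThree); // [ '123', '123', '456' ]
console.log(atLeastThree); // [ '12345', '1234567' ]
console.log(threeToSix);   // [ '12345', '123456' ]

// \w{5} accepts any five word characters; [a-zA-Z]{5} insists on letters.
const wordFive = /\w{5}/.test("ab_12");          // true
const letterFive = /[a-zA-Z]{5}/.test("ab_12");  // false
const helloFive = /[a-zA-Z]{5}/.test("hello");   // true

// Greedy + takes as much as it can; lazy +? stops at the first chance.
const tagged = "<b>bold</b>";
const greedy = tagged.match(/<.+>/)[0];
const lazy = tagged.match(/<.+?>/)[0];
console.log(greedy); // <b>bold</b>
console.log(lazy);   // <b>
```

Note that \d{3} splits the seven-digit run into 123 and 456 and leaves the final 7 unmatched, because each match consumes exactly three digits.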

That problem is solved, but new ones keep coming. How can we restrict where the matched characters appear?

Q: I want to match a string starting with Doc.

A: ???

To solve this problem, let's look at this table:

(ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconatomiczero-widthassertions.htm)

Atomic zero-width assertions
Assertion
Description

^
Specifies that the match must occur at the beginning of the string or the beginning of the line. For more information, see the Multiline option in Regular Expression Options.

$
Specifies that the match must occur at the end of the string, before \n at the end of the string, or at the end of the line. For more information, see the Multiline option in Regular Expression Options.

\A
Specifies that the match must occur at the beginning of the string (ignores the Multiline option).

\Z
Specifies that the match must occur at the end of the string or before \n at the end of the string (ignores the Multiline option).

\z
Specifies that the match must occur at the end of the string (ignores the Multiline option).

\G
Specifies that the match must occur at the point where the current search started (usually one character past the end of the previous match). For example, consider a string made up of consecutive groups of characters, each n characters long. A match anchored with \G succeeds only when it starts at offset 0, n, 2n, 3n, and so on, that is, on a group boundary.

\b
Specifies that the match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters. The match must occur on a word boundary, that is, at the first or last character of a word separated by spaces.

\B
Specifies that the match must not occur on a \b boundary.

As you have probably noticed, the first assertion in this table is exactly what we need (@_@).

The ^ assertion specifies that the current position is the beginning of a line or string. The regular expression ^FTP therefore matches "FTP" only where it appears at the beginning of a line.

So the problem above can be solved after all. Let's do it:

Q: I want to match a string starting with Doc.

A: ^Doc
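The anchors can be checked with a short JavaScript sketch. JavaScript offers ^, $, \b, and \B but not \A, \Z, \z, or \G, so only the former are shown; the file names and sample text are made up for illustration.

```javascript
// ^ pins the match to the start of the string.
const startsWithDoc = /^Doc/;
const docFirst = startsWithDoc.test("Document.txt"); // true
const docLater = startsWithDoc.test("MyDoc.txt");    // false

// Without the m flag, ^ matches only at the start of the whole string.
const lines = "readme\nDoc one\nDoc two";
const singleLine = /^Doc/.test(lines);               // false
// With the m (multiline) flag, ^ also matches right after each line break.
const lineStarts = lines.match(/^Doc/gm);            // two matches

// $ pins the end: only strings that end in .txt pass.
const endsTxt = /\.txt$/.test("notes.txt");          // true
const endsBak = /\.txt$/.test("notes.txt.bak");      // false

// \b matches between a word character and a non-word character,
// so the "cat" inside "concat" is not counted.
const wholeWords = "cat concat cat".match(/\bcat\b/g);

console.log(docFirst, docLater, singleLine, lineStarts.length);
console.log(endsTxt, endsBak, wholeWords.length);
```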

That completes a first look at what regular expressions are and at their most basic syntax, as a warm-up (@_@). In the second article we get to the main topic and explore how regular expressions are used.

In the previous article, I introduced some basic concepts of regular expressions, so by now you should have a basic grasp of them. Next, we use some practical programming examples to show what regular expressions can do.

First, let's look at several practical examples:

1. Verify that the input consists only of English characters

JavaScript:

// Runs inside a function that receives str.
var ex = "^\\w+$"; // the backslash must be doubled inside a string literal
var re = new RegExp(ex, "i");
return re.test(str);

VBScript:

Dim regEx, flag, ex
ex = "^\w+$"
Set regEx = New RegExp
regEx.IgnoreCase = True
regEx.Global = True
regEx.Pattern = ex
flag = regEx.Test(str)

C#:

System.String ex = @"^\w+$";
System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(ex);
bool flag = reg.IsMatch(str);

2. Verify the e-mail format

C#:

System.String ex = @"^\w+@\w+\.\w+$";
System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(ex);
bool flag = reg.IsMatch(str);

3. Change the date format (replace dates of the form mm/dd/yy with dates of the form dd-mm-yy)

C#:

// Requires: using System.Text.RegularExpressions;
string MDYToDMY(string input)
{
    return Regex.Replace(input,
        "\\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b",
        "${day}-${month}-${year}");
}

4. Extract the protocol and port number from a URL

C#:

// Requires: using System.Text.RegularExpressions;
string Extension(string url)
{
    Regex r = new Regex(@"^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/",
        RegexOptions.Compiled);
    return r.Match(url).Result("${proto}${port}");
}

These examples are typical of the regular expressions we meet in everyday Web development. The first example in particular shows implementations in JavaScript, VBScript, and C#. It is easy to see that the regular expression itself is the same across languages; only the class that implements it differs. How much you can get out of regular expressions therefore also depends on what the implementing class supports.

(From MSDN: The Microsoft .NET Framework SDK provides an extensive set of regular expression tools that let you efficiently create, compare, and modify strings, and rapidly parse large amounts of text and data to search for, remove, and replace text patterns. ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconregularexpressionslanguageelements.htm)

Let's analyze these examples one by one:

Examples 1 and 2 are very simple: they only verify that a string conforms to the format specified by the regular expression. The syntax they use was covered in the first article, so a brief description is enough here.

Expression in the 1st example: ^\w+$

^ -- the match must start at the beginning of the string.

\w -- matches an English character (a word character)

+ -- the preceding character occurs one or more times.

$ -- the match must end at the end of the string.

This validates strings such as asgasdfs.

Expression in the 2nd example: ^\w+@\w+\.\w+$

^ -- the match must start at the beginning of the string.

\w -- matches an English character (a word character)

+ -- the preceding character occurs one or more times.

@ -- matches a literal @ character

\. -- matches a literal . character (the period is a special character, so it must be escaped with \)

$ -- the match must end at the end of the string.

This validates e-mail addresses such as dragontt@sina.com.
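The e-mail check is shown above only in C#. A rough JavaScript equivalent might look like the sketch below; the function names isWordOnly and looksLikeEmail are hypothetical, and the pattern is the article's simplified one, not a complete e-mail validator.

```javascript
// Hypothetical helper: true when str consists only of word characters.
function isWordOnly(str) {
  return /^\w+$/.test(str);
}

// Hypothetical helper using the article's simplified e-mail pattern.
function looksLikeEmail(str) {
  return /^\w+@\w+\.\w+$/.test(str);
}

console.log(isWordOnly("asgasdfs"));               // true
console.log(isWordOnly("as gas"));                 // false (contains a space)
console.log(looksLikeEmail("dragontt@sina.com"));  // true
console.log(looksLikeEmail("dragontt@sina"));      // false (no dot part)

// The simplified pattern rejects addresses with a dot before the @.
console.log(looksLikeEmail("first.last@sina.com")); // false

// Forgetting to escape the dot lets any character stand in for it.
const unescapedDot = /^\w+@\w+.\w+$/;
console.log(unescapedDot.test("dragontt@sinaXcom")); // true
```

The last two lines show why the \. escape matters: an unescaped period matches any character, so the check no longer requires a real dot.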

The 3rd example uses replacement, so let's first look at how substitutions are defined in regular expressions:

(ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconsubstitutions.htm)

Substitutions
Character
Description

$123
Substitutes the last substring matched by group number 123 (decimal).

${name}
Substitutes the last substring matched by a (?<name> ) group.

$$
Substitutes a single "$" literal.

$&
Substitutes a copy of the entire match itself.

$`
Substitutes all the text of the input string before the match.

$'
Substitutes all the text of the input string after the match.

$+
Substitutes the last group captured.

$_
Substitutes the entire input string.
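Several of these substitution tokens can be tried in JavaScript, which shares $&, $`, $', $$, and numbered groups with the table. Named groups are written $<name> there rather than ${name}, and the ECMAScript replacement syntax has no counterpart for $+ or $_. The sketch below uses a made-up input string.

```javascript
// Substitution tokens in a JavaScript replacement string.
const input = "abc-123-xyz";
const pattern = /(\d+)/; // group 1 captures the run of digits

const wholeMatch = input.replace(pattern, "[$&]"); // the entire match
const groupOne = input.replace(pattern, "[$1]");   // numbered group 1
const before = input.replace(pattern, "[$`]");     // text before the match
const after = input.replace(pattern, "[$']");      // text after the match
const dollar = input.replace(pattern, "$$");       // a literal $

console.log(wholeMatch); // abc-[123]-xyz
console.log(groupOne);   // abc-[123]-xyz
console.log(before);     // abc-[abc-]-xyz
console.log(after);      // abc-[-xyz]-xyz
console.log(dollar);     // abc-$-xyz
```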

Grouping constructs
(ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpcongroupingconstructs.htm)

Grouping construct
Definition

( )
Captures the matched substring (or is a non-capturing group; for more information, see the ExplicitCapture option in Regular Expression Options). Captures using () are numbered automatically, starting from 1, in the order of their opening parentheses. Capture number zero is the text matched by the entire regular expression pattern.

(?<name> )
Captures the matched substring into a group name or number name. The string used for name must not contain any punctuation and cannot begin with a number. You can use single quotes instead of angle brackets, for example (?'name').

(?<name1-name2> )
Balancing group definition. Deletes the definition of the previously defined group name2 and stores in group name1 the interval between the previously defined name2 group and the current group. If no group name2 is defined, the match backtracks. Because deleting the last definition of name2 reveals the previous definition of name2, this construct lets the stack of captures for group name2 be used as a counter to keep track of nested constructs such as parentheses. In this construct, name1 is optional. You can use single quotes instead of angle brackets, for example (?'name1-name2').

(?: )
Non-capturing group.

(?imnsx-imnsx: )
Applies or disables the specified options within the subexpression. For example, (?i-s: ) turns on case insensitivity and disables single-line mode. For more information, see Regular Expression Options.

(?= )
Zero-width positive lookahead assertion. Matching continues only if the subexpression matches to the right of this position. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.

(?! )
Zero-width negative lookahead assertion. Matching continues only if the subexpression does not match to the right of this position. For example, \b(?!un)\w+\b matches words that do not begin with un.

(?<= )
Zero-width positive lookbehind assertion. Matching continues only if the subexpression matches to the left of this position. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.

(?<! )
Zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match to the left of this position.

(?> )
Non-backtracking subexpression (also known as a "greedy" subexpression). The subexpression is fully matched once and then does not take part piecemeal in backtracking. (That is, the subexpression matches only strings that it would match on its own.)

Let's take a brief look at these two concepts:

Grouping constructs:

The most basic construct is (). The part enclosed in the parentheses is a group.

The next form is (?<name> ). It differs from the first form in that the group is named, so its content can be retrieved by group name.

(There are also constructs such as (?= ). They are not used in these examples, so we will introduce them next time.)

Substitutions:

We have just met the two basic grouping forms, () and (?<name> ). Their matched results can be referred to in a replacement pattern as $1 or ${name}.
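As a rough illustration of numbered, named, and non-capturing groups, here is a JavaScript sketch. It assumes an engine with named-group support (ES2018 or later); the group names low and high and the input string are invented for the example.

```javascript
const range = "10-20";

// Plain parentheses are numbered from 1; index 0 is the whole match.
const numbered = /(\d+)-(\d+)/.exec(range);
console.log(numbered[0]); // 10-20
console.log(numbered[1]); // 10
console.log(numbered[2]); // 20

// Named groups can be read by name (they still receive a number as well).
const named = /(?<low>\d+)-(?<high>\d+)/.exec(range);
console.log(named.groups.low);  // 10
console.log(named.groups.high); // 20

// (?: ) groups without capturing, so the second number becomes group 1.
const nonCapturing = /(?:\d+)-(\d+)/.exec(range);
console.log(nonCapturing[1]);     // 20
console.log(nonCapturing.length); // 2
```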

Put this way, the concepts may still seem vague, so let's go back to the example above:

In the 3rd example, the regular expression is \\b(?<month>\\d{1,2})/(?<day>\\d{1,2})/(?<year>\\d{2,4})\\b

(Why are all the backslashes doubled? This is a C# example, and in C# \ is the escape character. To keep a \ in a string from being treated as an escape, either write \\ or put @ in front of the whole string. The pattern above is therefore equivalent to

@"\b(?<month>\d{1,2})/(?<day>\d{1,2})/(?<year>\d{2,4})\b")

\b -- a special case. In a regular expression, \b denotes a word boundary (between \w and \W), except inside a [] character class, where it denotes a backspace. In a replacement pattern, \b always denotes a backspace.

(?<month>\d{1,2}) -- creates a group named month that matches a number 1 to 2 digits long.

/ -- matches a literal / character

(?<day>\d{1,2}) -- creates a group named day that matches a number 1 to 2 digits long.

/ -- matches a literal / character

(?<year>\d{2,4}) -- creates a group named year that matches a number 2 to 4 digits long.

The role of these groups is not yet visible here. Now look at the replacement pattern:

${day}-${month}-${year}

${day} -- inserts the text matched by the group named day defined above.

- -- a literal - character

${month} -- inserts the text matched by the group named month defined above.

- -- a literal - character

${year} -- inserts the text matched by the group named year defined above.

For example:

Take the date 04/02/2003 and apply the replacement:

(?<month>\d{1,2}) matches 04, and ${month} retrieves this value.

(?<day>\d{1,2}) matches 02, and ${day} retrieves this value.

(?<year>\d{2,4}) matches 2003, and ${year} retrieves this value.

The result is 02-04-2003.
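The same date conversion can be sketched in JavaScript. This is an adaptation, not the article's C# code: JavaScript writes named references as $<name> instead of ${name}, needs the g flag to replace every match, and requires ES2018 or later for named groups. The function name mdyToDmy and the second sample sentence are made up.

```javascript
// Hypothetical JavaScript counterpart of the C# MDYToDMY example.
function mdyToDmy(input) {
  return input.replace(
    // month/day/year, each captured into a named group; / must be escaped
    // inside a regex literal.
    /\b(?<month>\d{1,2})\/(?<day>\d{1,2})\/(?<year>\d{2,4})\b/g,
    // Reassemble the captured parts in day-month-year order.
    "$<day>-$<month>-$<year>"
  );
}

console.log(mdyToDmy("04/02/2003"));
// 02-04-2003
console.log(mdyToDmy("Due 04/02/2003 and 12/31/99."));
// Due 02-04-2003 and 31-12-99.
```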

Once this example is clear, the 4th example is easy to follow.

The regular expression in the 4th example:

^(?<proto>\w+)://[^/]+?(?<port>:\d+)?/

^ -- the match must start at the beginning of the string.

(?<proto>\w+) -- creates a group named proto that matches one or more word characters.

: -- a literal : character

// -- matches two / characters

[^/] -- matches any character except /

+? -- as few repeats as possible, but at least one.

(?<port>:\d+) -- creates a group named port that matches text such as :2134 (a colon followed by one or more digits)

? -- the preceding group occurs 0 or 1 time.

/ -- matches a / character

Finally, ${proto}${port} retrieves the content matched by the two groups.

(For usage of the Regex class, see

ms-help://MS.VSCC/MS.MSDNVS.2052/cpref/html/frlrfsystemtextregularexpressionsregexmemberstopic.htm)
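For comparison, the protocol and port extraction can be sketched in JavaScript as follows. The function name protocolAndPort and the sample URLs are assumptions made for illustration; unlike the C# version, this sketch returns null when the URL does not match at all.

```javascript
// Hypothetical JavaScript counterpart of the C# Extension example.
function protocolAndPort(url) {
  const m = /^(?<proto>\w+):\/\/[^/]+?(?<port>:\d+)?\//.exec(url);
  if (m === null) {
    return null; // the URL does not have the expected shape
  }
  // The port group is optional, so it is undefined when no port is given.
  return m.groups.proto + (m.groups.port || "");
}

console.log(protocolAndPort("http://www.contoso.com:8080/letters/readme.html"));
// http:8080
console.log(protocolAndPort("https://example.org/index.html"));
// https
console.log(protocolAndPort("not a url"));
// null
```

The lazy [^/]+? matters here: it grows the host name one character at a time and gives the optional port group a chance to match before the host swallows the colon and digits.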

That covers the examples for this article; I hope you found them useful. Next time, I will look at some special requirements to explore regular expressions further.

In the previous article, we introduced the basic syntax of regular expressions and some simple examples. But those do not cover every problem we will meet. Sometimes we have to write more complicated regular expressions to solve real problems.

Here I will first raise a few problems, and then we will use regular expressions to solve them one by one.

1. Either of two conditions holds, for example all digits or all letters.

123 (true), hello (true), 234.test23 (false)

2. Get the letter groups that do not start right after a digit

For example, from how2234do>you234do we want how and you, not do and do

3. Get the letter groups that start right after a digit

In the preceding example, we want do and do.

4. Get the letter groups that do not end with a digit after them

In the same example, we want ho, do, yo, do

5. Get the letter groups that end with a digit after them

In the same example, we want how and you

6. The string must not contain the consecutive characters ab.

For example: nihaoma (true), abve (false), agoodboy (true)

Let's start to solve these problems:

Problem 1: either of two conditions holds

This is a very common kind of requirement. Let's look at this table first.

Alternation constructs
Alternation construct
Definition

|
Matches any one of the terms separated by the | (vertical bar) character, for example cat|dog|tiger. The leftmost successful match wins.

(?(expression)yes|no)
Matches the "yes" part if the expression matches at this position; otherwise, matches the "no" part. The "no" part can be omitted. The expression can be any valid expression, but it is turned into a zero-width assertion, so this syntax is equivalent to (?(?=expression)yes|no). Note that if the expression is the name of a named group or a capturing group number, the alternation construct is interpreted as a capture test (described in the next row of this table). To avoid confusion in these cases, you can spell out the inner (?=expression) explicitly.

(?(name)yes|no)
Matches the "yes" part if the named capture string has a match; otherwise, matches the "no" part. The "no" part can be omitted. If the given name does not correspond to the name or number of a capturing group used in this expression, the alternation construct is interpreted as an expression test (described in the previous row of this table).

(ms-help://MS.VSCC/MS.MSDNVS.2052/cpgenref/html/cpconalternationconstructs.htm)

As the table shows, regular expressions use | to express an "or" relationship, just like the familiar or operator. Now let's see how to use | to solve our problem.

1. First, write an expression for each alternative:

A) digits only -- [0-9]*

B) letters only -- [A-Za-z]*

2. Join the alternatives with |

^[0-9]*$|^[A-Za-z]*$

(I added the ^ and $ anchors to both alternatives. They are needed to verify that the whole string meets the requirement. If you are curious about what happens without them, try it yourself.)
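The effect of the anchors is easy to check. The following JavaScript sketch tests the article's three sample strings and then drops the anchors to show what changes; the variable names are made up.

```javascript
// Either all digits or all letters, anchored at both ends.
const digitsOrLetters = /^[0-9]*$|^[A-Za-z]*$/;

const allDigits = digitsOrLetters.test("123");       // true
const allLetters = digitsOrLetters.test("hello");    // true
const mixed = digitsOrLetters.test("234.test23");    // false

// Without ^ and $, the pattern only has to match somewhere in the string,
// so the leading "234" is enough for the mixed string to pass.
const unanchored = /[0-9]*|[A-Za-z]*/;
const mixedUnanchored = unanchored.test("234.test23"); // true

// Because * allows zero repeats, the empty string is accepted as well;
// replacing * with + would require at least one character.
const emptyAccepted = digitsOrLetters.test("");      // true

console.log(allDigits, allLetters, mixed, mixedUnanchored, emptyAccepted);
```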

The remaining problems are really all of one type, so we handle them together. Let's solve problems 2 through 6:

First, let's review part of the grouping constructs table introduced last time:

(?= )
Zero-width positive lookahead assertion. Matching continues only if the subexpression matches to the right of this position. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.

(?! )
Zero-width negative lookahead assertion. Matching continues only if the subexpression does not match to the right of this position. For example, \b(?!un)\w+\b matches words that do not begin with un.

(?<= )
Zero-width positive lookbehind assertion. Matching continues only if the subexpression matches to the left of this position. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.

(?<! )
Zero-width negative lookbehind assertion. Matching continues only if the subexpression does not match to the left of this position.

These four constructs are exactly what we need.

Let's solve our problems one by one (@_@):

Example 2: letter groups that do not start right after a digit

(?<!\d)[A-Za-z]{2,}

(?<!\d) -- the match must not be preceded by a digit.

[A-Za-z]{2,} -- matches two or more letters

(Note: this is a bit of a trick. By our logic, the letter o in each of the two do groups in how2234do>you234do also qualifies, because it is preceded by d rather than by a digit, but that is not what we want, so the pattern requires at least two letters. There are of course other solutions, to be chosen according to the actual situation; here we only want to explain this method (@_@))

Example 3: letter groups that start right after a digit

(?<=\d)[A-Za-z]+

(?<=\d) -- the match must be preceded by a digit.

[A-Za-z]+ -- matches one or more letters.

Example 4: letter groups that do not end with a digit after them

[A-Za-z]+(?!\d)

[A-Za-z]+ -- matches one or more letters.

(?!\d) -- the match must not be followed by a digit.

Example 5: letter groups that end with a digit after them

[A-Za-z]+(?=\d)

[A-Za-z]+ -- matches one or more letters.

(?=\d) -- the match must be followed by a digit.
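The four lookaround patterns can be run against the sample string from the problem list. This JavaScript sketch assumes an engine with lookbehind support (ES2018 or later); the variable names are invented for illustration.

```javascript
const input = "how2234do>you234do";

// Example 2: at least two letters, not preceded by a digit.
const notAfterDigit = input.match(/(?<!\d)[A-Za-z]{2,}/g);

// Example 3: letters preceded by a digit.
const afterDigit = input.match(/(?<=\d)[A-Za-z]+/g);

// Example 4: letters not followed by a digit. "how" is followed by 2,
// so the engine backtracks to "ho", which is followed by the letter w.
const notBeforeDigit = input.match(/[A-Za-z]+(?!\d)/g);

// Example 5: letters followed by a digit.
const beforeDigit = input.match(/[A-Za-z]+(?=\d)/g);

console.log(notAfterDigit);  // [ 'how', 'you' ]
console.log(afterDigit);     // [ 'do', 'do' ]
console.log(notBeforeDigit); // [ 'ho', 'do', 'yo', 'do' ]
console.log(beforeDigit);    // [ 'how', 'you' ]
```

None of the results contain a digit, because lookahead and lookbehind only test the neighboring text without consuming it.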

Example 6: the string must not contain the consecutive characters ab

^(?!.*?ab).*$

(?!.*?ab) -- forbids ab from appearing anywhere after the start of the string

.* -- any characters
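A quick JavaScript check of this pattern against the three sample strings from the problem list might look like this; the variable names are made up.

```javascript
// Rejects any string that contains the consecutive characters "ab".
const noAb = /^(?!.*?ab).*$/;

const nihaoma = noAb.test("nihaoma");   // true: no "ab" anywhere
const abve = noAb.test("abve");         // false: starts with "ab"
const agoodboy = noAb.test("agoodboy"); // true: has a and b, but not adjacent

console.log(nihaoma, abve, agoodboy);
```

The negative lookahead runs once at the start of the string and scans ahead for ab; only if that scan fails does .* go on to accept the rest.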

The problems raised this time have now been solved too. The examples are simple, but complicated things are built on simple ones. The real key to writing regular expressions is being good at defining the rule: describe it in the most concise and correct words, then write it down in regular expression syntax (@_@). That comes with accumulated experience.
