On the regular expression

Source: Internet
Author: User
Tags: expression engine

1. What exactly is a regular expression?

When writing a program or web page that handles strings, you often need to find strings that match certain complex rules. Regular expressions are the tool for describing such rules. In other words, a regular expression is code that records a text rule.

You have probably used wildcards, namely * and ?, to look up files under Windows/DOS. To find all the Word documents in a directory, you would search for *.doc. Here, * is interpreted as an arbitrary string. Like wildcards, regular expressions are tools for text matching, but they can describe what you need far more precisely, at the cost of being more complex. For example, you can write a regular expression that finds every string that starts with 0, is followed by 2 or 3 digits, then a hyphen "-", and finally 7 or 8 digits (like 010-12345678 or 0376-7654321).

2. Getting Started

The best way to learn regular expressions is to start with examples, then modify them yourself and experiment. Here are a few simple examples, each described in detail.

Suppose you are looking for hi in an English novel. You can use the regular expression hi.

This is almost the simplest regular expression. It exactly matches a string of two characters: the first is h, the second is i. Tools that handle regular expressions typically provide an option to ignore case; if this option is selected, the expression matches any of the four forms hi, HI, Hi, and hI.

Unfortunately, many words contain the two consecutive characters hi, such as him, history, and high. Searching with hi also finds the hi inside these words. To find exactly the word hi, we should use \bhi\b.

\b is a special code defined by regular expressions (some people call it a metacharacter). It represents the beginning or end of a word, that is, a word boundary. Although English words are usually delimited by spaces, punctuation marks, or line breaks, \b does not match any of these delimiting characters; it matches only a position.

To be more precise, \b matches a position where the preceding character and the following character are not both \w (one is and the other is not, or does not exist).
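The difference between hi and \bhi\b can be checked with a minimal sketch using Python's re module. The sample sentence and variable names are made up for illustration.

```python
import re

text = "hi, this is his history; say hi to him"

# Plain "hi" also finds the "hi" inside "this", "his", "history", and "him".
plain = re.findall(r"hi", text)

# \bhi\b only matches "hi" when it stands alone as a word.
whole_word = re.findall(r"\bhi\b", text)

# \b matches a position, not a character: the match covers just the two letters.
first = re.search(r"\bhi\b", text)

# The "ignore case" option mentioned above is re.IGNORECASE in Python.
any_case = re.findall(r"\bhi\b", "hi HI Hi hI", flags=re.IGNORECASE)

print(len(plain), len(whole_word), first.span(), any_case)
```

The span of the first match is two characters wide, which shows that the boundaries themselves consume nothing.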

If you are looking for hi followed, not far behind, by Lucy, you should use \bhi\b.*\bLucy\b.

Here, . is another metacharacter; it matches any character except a line break. * is also a metacharacter, but it represents neither a character nor a position. It represents a quantity: it specifies that whatever precedes it may be repeated any number of times for the whole expression to match. Therefore, .* together means any number of characters that do not include a newline. Now the meaning of \bhi\b.*\bLucy\b is obvious: first the word hi, then any number of arbitrary characters (but not a newline), and finally the word Lucy.
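As a small sketch of how . and * behave together (Python's re module, with invented sample strings), the following shows that the match cannot cross a line break by default:

```python
import re

pattern = re.compile(r"\bhi\b.*\bLucy\b")

same_line = "hi there, have you seen Lucy today?"
split_lines = "hi there\nLucy is here"

# On one line, .* bridges everything between the two words.
m = pattern.search(same_line)
matched = m.group(0) if m else None

# . does not match a newline by default, so there is no match across two lines.
no_match = pattern.search(split_lines)

print(repr(matched), no_match)
```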

The newline character is '\n', whose ASCII code is 10 (hexadecimal 0x0A).

By combining other metacharacters, we can construct more powerful regular expressions. For example:

0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, then a hyphen "-", and finally 8 digits (that is, a Chinese phone number; of course, this example only matches numbers with a 3-digit area code).

The \d here is a new metacharacter that matches one digit (0, or 1, or 2, and so on). - is not a metacharacter; it matches only itself, a hyphen (or minus sign, or dash, or whatever you call it).

To avoid so much annoying repetition, we can also write this expression as 0\d{2}-\d{8}. Here the {2} ({8}) after \d means that the preceding \d must match 2 times (8 times) in a row.
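A quick way to convince yourself that the two spellings are equivalent is to run both against the same inputs. This is a sketch in Python; the sample numbers are illustrative, and fullmatch is used here (an assumption of the sketch) so that the whole string has to fit the pattern.

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

samples = ["010-12345678", "0376-7654321", "110-12345678", "010-1234567"]

# Each entry: (sample, result of long form, result of short form).
results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
```

Only the first sample has a 3-digit area code starting with 0 and an 8-digit local number, so it is the only one both forms accept.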

3. Meta-Characters

Now you know a few useful metacharacters, such as \b, ., *, and \d. Regular expressions have more metacharacters. For example, \s matches any whitespace character, including spaces, tabs, newlines, and Chinese full-width spaces, and \w matches a letter, digit, underscore, or Chinese character.

The special handling of Chinese characters is supported by the regular expression engine that .NET provides; for other environments, see the relevant documentation.

Let's take a look at more examples below:

\ba\w*\b matches a word that begins with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).

Now is a good time to say what a word means in regular expressions: one or more consecutive \w characters. Indeed, it has little to do with the thousands of things of the same name that you have to memorize when learning English :)

\d+ matches 1 or more consecutive digits. The + here is a metacharacter similar to *; the difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.

\b\w{6}\b matches a word of exactly 6 characters.
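The three examples above can be tried on one sentence. This is a minimal Python sketch; the sentence is invented for the purpose.

```python
import re

text = "an apple and 12 bananas cost 3050 yuan, almost a bargain"

# Words that begin with the letter a.
a_words = re.findall(r"\ba\w*\b", text)

# Runs of one or more digits.
numbers = re.findall(r"\d+", text)

# Words of exactly six characters.
six_chars = re.findall(r"\b\w{6}\b", text)

print(a_words, numbers, six_chars)
```

Note that "bananas" and "bargain" contain the letter a but do not start with it, so \ba\w*\b skips them.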

Character escapes

If you want to find a metacharacter itself, such as . or *, there is a problem: you cannot specify it directly, because it will be interpreted as something else. You then have to use \ to cancel the special meaning of the character. Therefore, you should use \. and \*. Of course, to find \ itself, you also have to use \\.

For example: deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
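The effect of escaping can be sketched in Python as follows. Raw strings (r"...") are used so that the backslashes reach the regex engine unchanged; re.escape is a standard-library helper, not something the text above mentions, that escapes literal text for you.

```python
import re

# Unescaped, "." matches any character, so "deerchaoXnet" also matches.
loose = re.compile(r"deerchao.net")
# Escaped, "\." matches only a literal dot.
strict = re.compile(r"deerchao\.net")

candidates = ("deerchao.net", "deerchaoXnet")
loose_hits = [bool(loose.fullmatch(s)) for s in candidates]
strict_hits = [bool(strict.fullmatch(s)) for s in candidates]

# "\\" in the pattern matches one literal backslash in the text.
path_match = re.fullmatch(r"C:\\Windows", "C:\\Windows")

# re.escape produces a pattern that matches the given text literally.
escaped = re.escape("1*2")

print(loose_hits, strict_hits, bool(path_match), escaped)
```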

Repetition

You have already seen several forms of repetition: *, +, {2}, and {5,12}. These are quantifiers (codes that specify a count). The quantifiers in regular expressions are * (repeat zero or more times), + (one or more times), ? (zero or one time), {n} (exactly n times), {n,} (n or more times), and {n,m} (n to m times).

Here are some examples of using repetition:

Windows\d+ matches Windows followed by 1 or more digits.

^\w+ matches the first word of a line (or the first word of the entire string, depending on the option settings). Here ^ matches the start of the line or string.
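The "option setting" that decides whether ^ means start of string or start of line is, in Python, the re.MULTILINE flag. The sketch below uses invented sample text.

```python
import re

# Windows followed by one or more digits; "Windows XP" has no digits after it.
versions = re.findall(r"Windows\d+", "Windows98, Windows2000 and Windows XP")

text = "first line\nsecond line"

# Default: ^ matches only at the start of the whole string.
string_start = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches at the start of every line.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

print(versions, string_start, line_starts)
```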

Character classes

Finding digits, letters or digits, or whitespace is simple because there are already metacharacters for these character sets. But what if you want to match a character set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?

It is simple: just list them in square brackets. For example, [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).

We can also easily specify a range of characters. For example, [0-9] means exactly the same as \d: one digit. Likewise, [a-z0-9A-Z_] is exactly equivalent to \w (if only English is considered).

The following is a more complex expression: \(?0\d{2}[) -]?\d{8}.

"(" and ")" are also metacharacters, which are covered later in the grouping section, so they need to be escaped here.

This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first an escaped \(, which can appear 0 or 1 times (?), then a 0, followed by 2 digits (\d{2}), then one of ), -, or a space, which appears once or not at all (?), and finally 8 digits (\d{8}).
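Here is a short Python sketch of this phone-number expression, written with a space inside the character class as the analysis above describes. The second list shows that the pattern is too permissive: it also accepts numbers whose parentheses do not pair up.

```python
import re

phone = re.compile(r"\(?0\d{2}[) -]?\d{8}")

intended = ["(010)88886666", "022-22334455", "02912345678"]
malformed = ["010)12345678", "(022-87654321"]

# The formats the expression was written for.
intended_ok = [bool(phone.fullmatch(s)) for s in intended]

# Unbalanced parentheses slip through, because "(" and ")" are each
# optional independently of the other.
malformed_ok = [bool(phone.fullmatch(s)) for s in malformed]

print(intended_ok, malformed_ok)
```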

Branch conditions

Unfortunately, the expression just shown also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branch conditions. A branch condition in a regular expression means that there are several rules, and a match succeeds if any one of them is satisfied; the rules are separated with |. Not clear yet? Look at the examples:

0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and local number may be separated by a hyphen, a space, or nothing at all. You can try using branch conditions to extend this expression to support 4-digit area codes as well.

\d{5}-\d{4}|\d{5} matches U.S. ZIP codes. A U.S. ZIP code is either 5 digits or 9 digits with a hyphen after the fifth. This example is given because it illustrates a point: when using branch conditions, pay attention to the order of the conditions. If you change it to \d{5}|\d{5}-\d{4}, it will only match 5-digit ZIP codes (and the first 5 digits of 9-digit ZIP codes). The reason is that the conditions are tested from left to right, and once one branch is satisfied, the remaining conditions are not tried.
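The effect of branch order is easy to observe. In this Python sketch (sample text invented), both patterns contain the same two branches and differ only in their order:

```python
import re

text = "ZIP 12345-6789"

# Longer branch first: the full 9-digit form is found.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", text).group(0)

# Shorter branch first: it succeeds immediately, so the second
# branch is never tried and only the first 5 digits are returned.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", text).group(0)

print(long_first, short_first)
```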

Grouping

We have already covered how to repeat a single character (just put a quantifier directly after it), but what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group); you can then specify the number of repetitions of that subexpression, and do other things with it as well (described later).

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a number of 1 to 3 digits followed by a period (this whole unit is the group), repeated 3 times; and finally comes one more number of 1 to 3 digits (\d{1,3}).

No number in an IP address can be greater than 255; don't let the writers of the third season of "24" fool you...

Unfortunately, it also matches 256.300.888.999, an IP address that cannot exist. If arithmetic comparison were available, the problem would be easy to solve, but regular expressions provide no mathematical functions, so you can only describe a correct IP address with a lengthy combination of grouping, alternatives, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into detail here; you should be able to work out its meaning yourself.
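The two IP-address expressions can be compared side by side. This is a sketch in Python; the variable names are illustrative, and fullmatch is used so that a partial match inside an invalid address does not count as success.

```python
import re

# Simple form: any four groups of 1 to 3 digits.
simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# Strict form: each number is limited to 0-255.
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
strict = re.compile(r"(" + octet + r"\.){3}" + octet)

addresses = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]

simple_ok = [bool(simple.fullmatch(a)) for a in addresses]
strict_ok = [bool(strict.fullmatch(a)) for a in addresses]

print(simple_ok, strict_ok)
```

Building the strict pattern from a named octet fragment keeps the repeated part in one place, which makes the long expression easier to read.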

Negation

Sometimes you need to find characters that do not belong to some easily defined character class. For example, if you want to find any character other than a digit, you need negation: \W, \S, \D, and \B are the negations of \w, \s, \d, and \b, and [^...] matches any single character other than those listed in the brackets.

Examples: \S+ matches a string that contains no whitespace characters.

<a[^>]+> matches a string enclosed in angle brackets that begins with a.
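A few negated matches, sketched in Python with made-up inputs. \D and [^aeiou] are included as further instances of the same idea (the uppercase form of a class code, and ^ at the start of a bracketed class).

```python
import re

# Runs of non-whitespace characters (the tab and space are separators).
chunks = re.findall(r"\S+", "one two\tthree")

# A tag that starts with "<a" and runs up to the first ">".
anchors = re.findall(r"<a[^>]+>", '<p>see <a href="x.html">this</a> and <b>that</b></p>')

# Runs of characters that are not digits.
non_digits = re.findall(r"\D+", "abc123def")

# Single characters that are not vowels.
consonants = re.findall(r"[^aeiou]", "regex")

print(chunks, anchors, non_digits, consonants)
```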

Backreferences

A backreference is used to search again for text that an earlier group has matched. For example, \1 represents the text matched by group 1. Hard to understand? Take a look at an example:

\b(\w+)\b\s+\1\b can be used to match repeated words, like go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between the beginning and end of a word (\b(\w+)\b); this word is captured in the group numbered 1. Then come 1 or more whitespace characters (\s+), and finally the content captured in group 1, that is, the word matched earlier (\1).

You can also specify the group name of a subexpression yourself. To do so, use the syntax (?<Word>\w+) (or replace the angle brackets with single quotes: (?'Word'\w+)), which names the \w+ group Word. To backreference what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
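The repeated-word expression can be sketched in Python as follows. One caveat: the named-group spelling shown above is the one .NET accepts, whereas Python's re module writes named groups as (?P<name>...) and named backreferences as (?P=name), so the second pattern below is a translation rather than a verbatim copy.

```python
import re

text = "go go and see the kitty kitty"

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = re.compile(r"\b(\w+)\b\s+\1\b")

# Named group and named backreference, in Python's spelling.
named = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")

numbered_hits = [m.group(0) for m in numbered.finditer(text)]
named_hits = [m.group("Word") for m in named.finditer(text)]

print(numbered_hits, named_hits)
```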

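Parentheses are used in many special-purpose syntaxes. The most common ones are: (exp), which matches exp and captures the text into an automatically numbered group; (?<name>exp), which captures into a group called name; (?:exp), which matches exp without capturing; the zero-width assertions (?=exp) and (?<=exp), which match a position followed by or preceded by exp; their negations (?!exp) and (?<!exp); and (?#comment), a comment that does not affect matching.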

Comments

Another use of parentheses is to include comments with the syntax (?#comment). For example: 2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199).

To include comments, it is best to enable the "ignore whitespace in pattern" option, so that you can freely add spaces, tabs, and line breaks when writing an expression; they are ignored when the expression is actually used. With this option enabled, all text from a # to the end of the line is also ignored as a comment. For example, the expression (?<=<(\w+)>).*(?=<\/\1>) can be written like this:

      (?<=    # assert the prefix of the text to match
      <(\w+)> # find letters or digits enclosed in angle brackets (that is, an HTML/XML tag)
      )       # end of prefix
      .*      # match any text
      (?=     # assert the suffix of the text to match
      <\/\1>  # find content in angle brackets: a "/" first, then the previously captured tag
      )       # end of suffix
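In Python, the "ignore whitespace in pattern" option corresponds to the re.VERBOSE flag. The sketch below applies it to the 0-255 expression from this section instead of the tag example, because Python's re module only accepts fixed-width lookbehind and would reject a prefix assertion containing \w+. Variable names and test values are illustrative.

```python
import re

# Verbose mode: whitespace is ignored and "#" starts a comment.
octet = re.compile(
    r"""
    2[0-4]\d        # 200-249
    | 25[0-5]       # 250-255
    | [01]?\d\d?    # 0-199
    """,
    re.VERBOSE,
)

# The same expression with inline (?#...) comments and no special flag.
inline = re.compile(r"2[0-4]\d(?#200-249)|25[0-5](?#250-255)|[01]?\d\d?(?#0-199)")

values = ["7", "199", "249", "255", "256", "300"]
verbose_ok = [bool(octet.fullmatch(v)) for v in values]
inline_ok = [bool(inline.fullmatch(v)) for v in values]

print(verbose_ok, inline_ok)
```

Both spellings compile to the same matcher; the verbose form is usually the easier one to maintain once an expression grows past a single line.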
