Regular Expressions (i)

Source: Internet
Author: User
Tags: character classes, processing text, expression engine

My own summary, drawn from the "Regular Expressions 30-Minute Introductory Tutorial" (excellent) and the Rookie Tutorial site.

What exactly is a regular expression?

A regular expression (often abbreviated in code as regex, regexp, or RE) is also known as a regular representation or formal representation.

A character is the most basic unit of text that software processes: a letter, digit, punctuation mark, space, newline, Chinese character (which occupies a different number of bytes depending on the encoding), and so on. A string is a sequence of zero or more characters, and text is simply a string. Saying that a string matches a regular expression usually means that some part (or several parts) of the string satisfies the condition the expression describes. When writing a program or web page that handles strings, you often need to find strings that follow certain complex rules. Regular expressions are the tool for describing those rules; in other words, a regular expression is code that records a text rule.

You have probably used the wildcards * and ? in Windows/DOS file searches (they are mainly for fuzzy searches: the asterisk stands for zero or more characters and the question mark for a single character). To find all Word documents in a directory, you would search for *.doc. Like wildcards, regular expressions are a tool for text matching, but they are more powerful and flexible and can describe what you need precisely, at the cost of being more complex. For example, you can write a regular expression that finds every string that starts with 0, followed by 2 or 3 digits, then a hyphen "-", and finally 7 or 8 digits (such as 010-12345678 or 0376-7654321).

Getting started

Suppose you are looking for hi in an English novel; you can use the regular expression hi.

This is almost the simplest possible regular expression. It matches exactly this string: two characters, the first h and the second i. Tools that handle regular expressions usually offer an ignore-case option; with it enabled, the expression matches any of hi, HI, Hi, and hI. Unfortunately, many words contain the two consecutive characters hi, such as him, history, and high, so searching with hi also finds the hi inside those words. To find the word hi precisely, use \bhi\b.

\b is a special code defined by regular expressions (some call such codes metacharacters). It stands for the beginning or end of a word, that is, a word boundary. Although English words are usually separated by spaces, punctuation marks, or line breaks, \b does not match any of these separator characters; it matches only a position.

If you are looking for hi followed, not far behind, by Lucy, use \bhi\b.*\bLucy\b.

Here, . is another metacharacter; it matches any character other than the newline character '\n' (ASCII code 10, hexadecimal 0x0A). * is also a metacharacter, but it stands for neither a character nor a position. It stands for a quantity: it says that whatever precedes it may be repeated any number of times for the whole expression to match. Together, .* therefore means any number of characters that are not newlines. The meaning of \bhi\b.*\bLucy\b is now clear: first the word hi, then any number of arbitrary characters (but no newline), and finally the word Lucy.
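As a minimal sketch of the patterns discussed so far, assuming Python's standard `re` module (the tutorial itself is not tied to Python, and the sample sentence is made up for illustration):

```python
import re

text = "hi, this is his history; hi Lucy!"

# Plain "hi" also hits the "hi" inside "this", "his" and "history".
plain_hits = re.findall(r"hi", text)

# \b restricts the match to the standalone word.
word_hits = re.findall(r"\bhi\b", text)

# The word "hi", then any characters except newline, then the word "Lucy".
pair = re.search(r"\bhi\b.*\bLucy\b", text)

print(len(plain_hits), len(word_hits))
print(pair.group(0))
```

Because * is greedy, the last pattern matches from the first standalone hi all the way to Lucy.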

By combining other metacharacters, we can build more powerful regular expressions. For example:

0\d\d-\d\d\d\d\d\d\d\d matches a string that starts with 0, followed by two digits, a hyphen "-", and finally 8 digits (that is, a Chinese phone number; of course, this example only covers 3-digit area codes).

Here \d is a new metacharacter that matches a single digit (0, or 1, or 2, ...). The - is not a metacharacter and matches only itself: the hyphen (or minus sign, or dash, or whatever you call it).

To avoid so much tedious repetition, we can also write this expression as 0\d{2}-\d{8}. The {2} ({8}) after \d means that the preceding \d must match 2 (8) times in a row.
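The long and short forms of the phone-number expression should behave identically. A small check, assuming Python's `re` module and a few invented sample strings:

```python
import re

long_form = re.compile(r"0\d\d-\d\d\d\d\d\d\d\d")
short_form = re.compile(r"0\d{2}-\d{8}")

samples = ["010-12345678", "0376-7654321", "110-12345678"]

# fullmatch requires the whole string to fit the pattern.
results = [
    (s, bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
]

for row in results:
    print(row)
```

Only the first sample is accepted: the second has a 4-digit area code and the third does not start with 0.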

Metacharacters

You now know a few useful metacharacters: \b, ., *, and \d. Regular expressions have more, such as \s, which matches any whitespace character, including spaces, tabs, line breaks, and Chinese full-width spaces. \w matches a letter, digit, underscore, or Chinese character (the special handling of Chinese characters is provided by the .NET regular expression engine; for other engines, check the relevant documentation).

Let's take a look at more examples below:

\ba\w*\b matches a word that begins with the letter a: first the start of a word (\b), then the letter a, then any number of letters, digits, or underscores (\w*), and finally the end of the word (\b).

A word is usually taken to be one or more consecutive \w characters.

\d+ matches 1 or more consecutive digits. The + here is a metacharacter similar to *; the difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.

\b\w{6}\b matches a word of exactly 6 characters.

er\b matches the er in never, but not the er in verb.
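The four examples above can be tried out on one sentence. This is only a sketch, assuming Python's `re` module; the sentence is invented for the purpose:

```python
import re

sentence = "an apple a day: never verb 42 answer in 2024"

a_words = re.findall(r"\ba\w*\b", sentence)      # words starting with "a"
numbers = re.findall(r"\d+", sentence)           # runs of one or more digits
six_chars = re.findall(r"\b\w{6}\b", sentence)   # words of exactly 6 characters

# Keep only the whitespace-separated tokens in which "er" ends a word.
er_endings = [w for w in sentence.split() if re.search(r"er\b", w)]

print(a_words, numbers, six_chars, er_endings)
```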

The metacharacters ^ and $ both match a position, somewhat like \b: ^ matches the beginning of the string being searched, and $ matches its end. These two codes are useful when validating input. For example, if a website requires a QQ number of 5 to 12 digits, you can use ^\d{5,12}$. (Without ^ and $, \d{5,12} only guarantees that the string contains 5 to 12 consecutive digits, not that the entire string is 5 to 12 digits.) The {5,12} here is similar to the {2} described earlier, except that {2} requires exactly 2 repetitions, while {5,12} requires at least 5 and at most 12; otherwise there is no match. Because ^ and $ are used, the entire input must match \d{5,12}, so any QQ number that matches this regular expression meets the requirement.
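The difference between the anchored and unanchored forms can be sketched as follows, assuming Python's `re` module; the function names `is_valid_qq` and `contains_digit_run` are made up for this illustration:

```python
import re

ANCHORED = re.compile(r"^\d{5,12}$")
UNANCHORED = re.compile(r"\d{5,12}")


def is_valid_qq(value: str) -> bool:
    """True if the whole input is 5 to 12 digits."""
    # Note: in Python, $ also matches just before a trailing newline;
    # re.fullmatch(r"\d{5,12}", value) is a stricter alternative.
    return ANCHORED.search(value) is not None


def contains_digit_run(value: str) -> bool:
    """True if the input merely contains 5 to 12 consecutive digits."""
    return UNANCHORED.search(value) is not None


for candidate in ["12345", "1234", "1234567890123", "abc123456"]:
    print(candidate, is_valid_qq(candidate), contains_digit_run(candidate))
```

The 13-digit input and abc123456 both pass the unanchored check but fail the anchored one, which is exactly the gap that ^ and $ close.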

Like the ignore-case option, some regular expression tools have a multiline option. With it enabled, ^ and $ match the beginning and end of each line.

Common metacharacters

Code    Description
.       Matches any character other than a line break
\w      Matches a letter, digit, underscore, or Chinese character
\s      Matches any whitespace character, equivalent to [ \f\n\r\t\v]
\d      Matches a digit
\b      Anchor; matches the beginning or end of a word
^       Anchor; matches the start of the string
$       Anchor; matches the end of the string

Character escapes

To match a metacharacter itself, such as . or *, use \ to cancel its special meaning: write \. and \*. To match \ itself, write \\.

For example, deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
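A short sketch of escaping, assuming Python's `re` module. Raw strings (r"...") are used so that the backslashes reach the regex engine unchanged; `re.escape` is the standard-library helper that adds the backslashes for you:

```python
import re

# Unescaped "." matches any character, so "deerchaoXnet" slips through.
loose = re.fullmatch(r"deerchao.net", "deerchaoXnet")
strict = re.fullmatch(r"deerchao\.net", "deerchaoXnet")
exact = re.fullmatch(r"deerchao\.net", "deerchao.net")

# A literal backslash is written \\ in the pattern.
path = re.search(r"C:\\Windows", r"open C:\Windows\System32")

# re.escape puts a backslash in front of the special characters.
escaped = re.escape("a.b*c")

print(loose is not None, strict is not None, exact is not None)
print(path.group(0), escaped)
```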

Quantifiers

You have already seen *, +, {2}, and {5,12}, which are several ways of matching repetition.

The following are all the quantifiers in regular expressions:

Common quantifiers

Code/Syntax   Description
*             Repeats 0 or more times; equivalent to {0,}
+             Repeats 1 or more times; equivalent to {1,}
?             Repeats 0 or 1 time; equivalent to {0,1}
{n}           Repeats exactly n times
{n,}          Repeats n or more times
{n,m}         Repeats n to m times, where n <= m

Example:

Windows\d+ matches Windows followed by 1 or more digits

^\w+ matches the first word of a line (in multiline mode) or the first word of the entire string
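The two examples, plus the ? quantifier, can be sketched like this, assuming Python's `re` module, where multiline mode is the `re.MULTILINE` flag (the sample text is invented):

```python
import re

text = "Windows95 and Windows2000\nsecond line here"

versions = re.findall(r"Windows\d+", text)

# Without the flag, ^ matches only at the start of the whole string.
first_word = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches at the start of every line.
first_word_per_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# ? makes the preceding "u" optional (0 or 1 times).
spellings = re.findall(r"colou?r", "color colour colouur")

print(versions, first_word, first_word_per_line, spellings)
```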

Character class

It is easy to find digits, letters, or whitespace because the metacharacters \d, \w, and \s already correspond to those character sets, but what if you want to match a set of characters that has no predefined metacharacter (such as the vowels a, e, i, o, u)? Simply list them in square brackets: [aeiou] matches any English vowel, and [.?!] matches one punctuation mark (. or ? or !).

We can also easily specify a range of characters: [0-9] means exactly the same as \d (a single digit), and likewise [a-z0-9A-Z_] is equivalent to \w (letters, digits, and underscore, if Chinese characters are left aside).

The following is a more complex expression: \(?0\d{2}[) -]?\d{8}.

"(" and ")" are also metacharacters (covered later in the section on grouping), so they need to be escaped here.

Let us analyze it: first comes an escaped \(, which may occur 0 or 1 times (?), then a 0, followed by 2 digits (\d{2}), then one of ), -, or a space, which may occur 0 or 1 times (?), and finally 8 digits (\d{8}). This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678.

Branching conditions

Note that the expression above also matches "incorrect" formats such as 010)12345678 or (022-87654321. Solving this requires branching conditions.

A branching condition (alternation) in a regular expression lists several rules separated by |; the text matches if it satisfies any one of them.

Example:

0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and the 8-digit local number may be separated by a hyphen, a space, or nothing at all.
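To compare the earlier character-class expression with the alternation above, here is a small sketch assuming Python's `re` module; the names `loose` and `strict` are just labels for this illustration:

```python
import re

loose = re.compile(r"\(?0\d{2}[) -]?\d{8}")
strict = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")

samples = [
    "(010)88886666",
    "022-22334455",
    "02912345678",
    "010)12345678",    # stray closing parenthesis
    "(022-87654321",   # stray opening parenthesis
]

# Map each sample to (accepted by loose, accepted by strict).
results = {
    s: (bool(loose.fullmatch(s)), bool(strict.fullmatch(s))) for s in samples
}

for s, verdict in results.items():
    print(s, verdict)
```

The loose pattern accepts all five inputs, while the alternation rejects the two with unbalanced parentheses.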

Note: When using branching conditions, be aware of the order of each condition. For example:

\d{5}-\d{4}|\d{5} matches US ZIP codes, which are either 5 digits or 9 digits with a hyphen after the fifth. If you change it to \d{5}|\d{5}-\d{4}, it matches only the 5-digit ZIP code or the first 5 digits of a 9-digit one (for 12345-1234 it matches only 12345). The reason is that the alternatives are tested from left to right, and once one of them succeeds, the others are not tried.
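The effect of the order of alternatives is easy to observe. A minimal sketch, assuming Python's `re` module:

```python
import re

text = "ZIP: 12345-1234"

# Longer alternative first: the full 9-digit form is found.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", text).group(0)

# Shorter alternative first: it succeeds immediately, so the
# second alternative is never tried at that position.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", text).group(0)

print(long_first)
print(short_first)
```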

Grouping

We have already seen how to repeat a single character (put a quantifier right after it), but what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group); you can then specify how many times the subexpression repeats and do other things with it (described later).

(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. Read it in this order: \d{1,3}\. matches 1 to 3 digits followed by a dot; (\d{1,3}\.){3} repeats the parenthesized group 3 times; finally comes another number of 1 to 3 digits (\d{1,3}).

No number in an IP address can be greater than 255. Is an address such as 01.02.03.04, whose numbers have leading zeroes, a correct IP address? The answer is yes: the numbers in an IP address may contain leading zeroes.

Unfortunately, the simple expression also matches 256.300.888.999, an IP address that cannot exist. If arithmetic comparisons were available, the problem would be easy to solve, but regular expressions offer no arithmetic, so the correct IP address has to be described with lengthy grouping, alternation, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

Analysis:

2[0-4]\d: a 3-digit number whose first digit is 2, second digit is 0-4, and third digit is any digit, covering the values 200-249;

25[0-5]: a 3-digit number whose first digit is 2, second digit is 5, and third digit is 0-5, covering the values 250-255;

[01]?\d\d?: a single digit (\d), two digits (0\d, 1\d, \d\d), or three digits (0\d\d, 1\d\d). These cases overlap, but the meaning is clear: the values 0-199;

(2[0-4]\d|25[0-5]|[01]?\d\d?)\.: one such value followed by a dot;

((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}: the (number + dot) in the parentheses is matched three times.
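The simple and the strict IP expressions can be compared side by side. This is a sketch assuming Python's `re` module; the helper variable `octet` is introduced here only to avoid typing the long alternation twice:

```python
import re

simple = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number in the range 0-255 (leading zeroes allowed).
octet = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"

# Three "number + dot" groups, then a final number.
strict = re.compile(rf"({octet}\.){{3}}{octet}")

samples = ["192.168.1.1", "01.02.03.04", "255.255.255.255", "256.300.888.999"]

results = {
    s: (bool(simple.fullmatch(s)), bool(strict.fullmatch(s))) for s in samples
}

for s, verdict in results.items():
    print(s, verdict)
```

Only the last sample differs: the simple pattern accepts 256.300.888.999, the strict one rejects it.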

Negation

Sometimes you need to find characters that do not belong to some easily defined character class. For example, to find any character other than a digit, you use negation:

Common negation codes

Code/Syntax   Description
\W            Matches any character that is not a letter, digit, underscore, or Chinese character
\S            Matches any character that is not whitespace
\D            Matches any character that is not a digit
\B            Matches a position that is not the start or end of a word
[^x]          Matches any character except x
[^aeiou]      Matches any character except the letters a, e, i, o, u

Example: \S+ matches a string that contains no whitespace characters;

<a[^>]+> matches a string enclosed in angle brackets that begins with a;

\Bapt matches the apt in Chapter, but not the apt in aptitude.
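A few of the negation codes in action, as a sketch assuming Python's `re` module (the input strings are invented):

```python
import re

# \S+ : runs of non-whitespace characters.
tokens = re.findall(r"\S+", "one  two\tthree")

# <a[^>]+> : "<a", then anything that is not ">", then ">".
tag = re.search(r"<a[^>]+>", '<p><a href="x.html">link</a></p>').group(0)

# [^aeiou] : any character that is not one of these vowels.
non_vowels = re.findall(r"[^aeiou]", "regex")

# \B : "apt" must NOT start at a word boundary.
in_chapter = re.search(r"\Bapt", "Chapter")
in_aptitude = re.search(r"\Bapt", "aptitude")

# \D : remove everything that is not a digit.
digits_only = re.sub(r"\D", "", "tel: 010-1234")

print(tokens, tag, non_vowels, digits_only)
print(in_chapter is not None, in_aptitude is not None)
```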

Backreferences

When you specify a subexpression with parentheses, the text matched by it (that is, what the group captures) can be processed further in the expression or by other programs. By default, each group automatically gets a group number: each captured submatch is stored in a temporary buffer in the order in which the groups appear from left to right in the pattern, so the first group is number 1, the second is number 2, and so on. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored.

A backreference searches again for the text that an earlier group matched. Each buffer can be accessed with \n, where n is a one- or two-digit decimal number that identifies the buffer. For example, \1 stands for the text matched by group 1.

Example:

\b(\w+)\b\s+\1\b

Purpose: matches repeated words, such as go go or kitty kitty.

Analysis: \b(\w+)\b is one or more letters or digits between the start and end of a word, that is, a word; it is captured into group 1. Then come one or more whitespace characters (\s+), and finally \1, which repeats whatever group 1 captured.

You can also give a subexpression a group name yourself, using the syntax (?<Word>\w+) (or with single quotes instead of angle brackets: (?'Word'\w+)), which names the \w+ group Word. To refer back to what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
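Here is a sketch of the repeated-word search, assuming Python's `re` module. Note that the named-group syntax differs between engines: the (?<Word>...) and \k<Word> forms above are the .NET spelling, while Python's `re` writes them as (?P<word>...) and (?P=word). The sample sentence is invented:

```python
import re

text = "go go to the the kitty kitty shop"

# Numbered backreference: \1 repeats whatever group 1 captured.
numbered = re.compile(r"\b(\w+)\b\s+\1\b")

# The same idea with a named group, in Python's spelling.
named = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

dup_numbered = [m.group(1) for m in numbered.finditer(text)]
dup_named = [m.group("word") for m in named.finditer(text)]

print(dup_numbered)
print(dup_named)
```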

Group numbering is not quite that simple, though:

In fact, group numbers are assigned in two left-to-right passes: the first pass numbers only the unnamed groups, and the second pass numbers only the named groups, so every named group has a higher number than any unnamed group (this is how the .NET engine assigns numbers; other engines may differ). Group 0 corresponds to the entire regular expression.

Parentheses normally cache the associated match. The (?:exp) syntax takes away a group's right to a group number: (?:exp) does not change how the regular expression is processed, but what such a group matches is not captured as in the two forms above, and the group gets no number.

In short, ?: is one of the non-capturing constructs; two others are ?= and ?!, described later.

Captured groups can also break a Uniform Resource Identifier (URI) into its components. Suppose you want to split the following URI into its protocol (ftp, http, and so on), domain address, and page/path:

http://www.w3cschool.cc:80/html/html-tutorial.html

The following regular expression provides this functionality:

(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)

Analysis:

The first parenthesized subexpression, (\w+), captures the protocol part of the web address, "http";

The second parenthesized subexpression, ([^/:]+), captures the domain address part, "www.w3cschool.cc";

The third parenthesized subexpression, (:\d*), captures the port number (if specified), ":80";

The fourth parenthesized subexpression, ([^# ]*), captures the path and/or page information of the web address; it matches any sequence of characters that does not include # or a space, here "/html/html-tutorial.html".
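The decomposition can be reproduced with a few lines, assuming Python's `re` module; the variable names are chosen for this sketch only:

```python
import re

uri = "http://www.w3cschool.cc:80/html/html-tutorial.html"
pattern = re.compile(r"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)")

match = pattern.match(uri)

# groups() returns the four captures in left-to-right order.
protocol, host, port, path = match.groups()

print(protocol)
print(host)
print(port)
print(path)
```

If the URI has no port, the optional third group does not participate in the match, and Python reports it as None.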

Category   Code/Syntax     Description
Capture    (exp)           Matches exp and captures the text into an automatically numbered group
           (?<name>exp)    Matches exp and captures the text into a group named name; can also be written (?'name'exp)
           (?:exp)         Matches exp without capturing the text or assigning a group number
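To illustrate the difference between (exp) and (?:exp), here is a small sketch assuming Python's `re` module, reusing the simple IP expression from earlier:

```python
import re

address = "192.168.1.1"

# Both sets of parentheses capture, so two groups are reported.
capturing = re.match(r"(\d{1,3}\.){3}(\d{1,3})", address)

# (?:...) still groups for the {3} repetition but captures nothing.
non_capturing = re.match(r"(?:\d{1,3}\.){3}(\d{1,3})", address)

print(capturing.groups())
print(non_capturing.groups())
print(non_capturing.group(0))   # group 0 is always the whole match
```

A repeated capturing group keeps only its last repetition, which is why the first group in the capturing version holds "1." rather than all three parts.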

Zero-width assertions

An assertion declares a fact that should be true. In a regular expression, matching continues only if the assertion holds.

The next two constructs find things before or after some content without capturing that content; like \b, ^, and $, they specify a position that must satisfy a condition (the assertion), which is why they are called zero-width assertions.

(?=exp), also called a zero-width positive lookahead assertion, asserts that what follows its position matches the expression exp. For example, \b\w+(?=ing\b) matches the front part of a word ending in ing (everything except the ing): when searching I'm singing while you're dancing. it matches sing and danc. Likewise, Windows(?=95|98|NT|2000) matches the Windows in Windows2000 but not the Windows in Windows3.1.

Note: a lookahead does not consume characters. After a match, the search for the next match starts right after the matched text, not after the characters the lookahead examined. For example, when the expression above matches Windows2000, the search continues after Windows rather than after 2000.

(?<=exp), also called a zero-width positive lookbehind assertion, asserts that what precedes its position matches the expression exp. For example, (?<=\bre)\w+\b matches the rear part of a word that begins with re (everything except the re): when searching reading a book it matches ading.

Example: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace (again, not including the whitespace itself).
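The lookahead and lookbehind examples above can be run as written in Python's `re` module, which is assumed for this sketch (the last sample string is invented):

```python
import re

sentence = "I'm singing while you're dancing."

# Lookahead: the word part that is followed by "ing" at a word end.
stems = re.findall(r"\b\w+(?=ing\b)", sentence)

# Lookahead with alternatives: "Windows" only before certain versions.
windows = re.findall(r"Windows(?=95|98|NT|2000)", "Windows2000 Windows3.1")

# Lookbehind: the rest of a word that starts with "re".
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# Both together: digits with whitespace on each side.
numbers = re.findall(r"(?<=\s)\d+(?=\s)", "a 12 b 345 c6 7")

print(stems, windows, tails, numbers)
```

In the last example, 6 is rejected because a letter precedes it, and 7 is rejected because the end of the string follows it rather than whitespace.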

Example: (?<=<(\w+)>).*(?=<\/\1>) matches the content of a simple HTML tag that has no attributes.

Analysis: (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets, such as <body>;

.* stands for 0 or more characters other than line breaks, that is, an arbitrary string such as hello;

(?=<\/\1>) is the suffix. The \/ inside it is an escaped /, and \1 is a backreference to the first captured group, that is, whatever the earlier (\w+) matched, so the suffix here is </body>.

For <body>hello</body>, the whole expression matches hello, the content between <body> and </body> (again, not including the prefix and suffix themselves).
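Not every engine accepts this expression as written. Python's `re` module, for instance, requires a lookbehind to have a fixed width, so the (\w+) inside (?<=...) is rejected there. As a substitute sketch (a different technique, not the lookaround expression itself), ordinary capturing groups give the same inner content:

```python
import re

html = "<body>hello</body>"

# Group 1 captures the tag name, group 2 the content;
# \1 requires the closing tag to repeat the same name.
pattern = re.compile(r"<(\w+)>(.*)</\1>")

match = pattern.search(html)
tag_name = match.group(1)
inner = match.group(2)

# A mismatched closing tag does not match at all.
mismatch = pattern.search("<body>hello</div>")

print(tag_name, inner, mismatch)
```

Here the prefix and suffix are consumed by the match, and the content is read from group 2 instead of being the whole match.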
