Getting Started with regular expressions

Last Update:2015-10-09 Source: Internet

Author: User


What is a regular expression?

When writing a program or web page that handles strings, you often need to find strings that match certain complex rules. Regular expressions are the tool for describing these rules. In other words, a regular expression is code that records a text rule.

Metacharacters

\b is a special code (a metacharacter) defined by regular expressions. It represents the beginning or end of a word, that is, a word boundary. Although English words are usually delimited by spaces, punctuation marks, or line breaks, \b does not match any of these delimiter characters; it matches only a position.

. is another metacharacter; it matches any character except a line break.

* is also a metacharacter, but it represents neither a character nor a position. It represents a quantity: it specifies that whatever precedes it can be repeated any number of times for the whole expression to match.
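
The three metacharacters above can be tried out with a short sketch using Python's `re` module. The sample sentences are made up for illustration and are not part of the original article.

```python
import re

text = "hi, this is a high hill; say hi to Lucy"

# \bhi\b matches "hi" only as a whole word, not inside "this", "high" or "hill".
whole_words = re.findall(r"\bhi\b", text)

# Without the word boundaries, every occurrence of the two letters is found.
any_hi = re.findall(r"hi", text)

# . matches any character except a line break and * repeats it,
# so .* stops at the end of the first line.
first_line = re.match(r".*", "line one\nline two").group()

print(whole_words)   # ['hi', 'hi']
print(len(any_hi))   # 5
print(first_line)    # line one
```

Note that \b consumed nothing: the two matches are exactly the two letters "hi", without the surrounding comma or spaces.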

Other commonly used metacharacters:

\d matches a digit.
\s matches any whitespace character, including spaces, tabs, line breaks, Chinese full-width spaces, and so on.
\w matches a letter, digit, underscore, or Chinese character.
^ matches the start of the string.
$ matches the end of the string.

So how do we combine these symbols into rules? Take a look at the following examples.

\ba\w*\b matches a word that begins with the letter a: first the beginning of a word (\b), then the letter a, then any number of letters or digits (\w*), and finally the end of the word (\b).

\d+ matches 1 or more consecutive digits. Here + is a metacharacter similar to *. The difference is that * matches any number of repetitions (possibly 0), while + matches 1 or more repetitions.

\b\w{6}\b matches a word of exactly 6 characters.
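
As a quick check of these three expressions, here is a minimal Python sketch. The sample sentence is an assumption chosen so that each pattern finds something.

```python
import re

text = "an apple a day costs 25 cents, almost 1000 per decade"

# Words that begin with the letter a.
words_with_a = re.findall(r"\ba\w*\b", text)

# Runs of one or more consecutive digits.
numbers = re.findall(r"\d+", text)

# Words that are exactly 6 characters long.
six_chars = re.findall(r"\b\w{6}\b", text)

print(words_with_a)  # ['an', 'apple', 'a', 'almost']
print(numbers)       # ['25', '1000']
print(six_chars)     # ['almost', 'decade']
```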

^\d{5,12}$ Suppose a website requires a QQ number to be 5 to 12 digits. {5,12} means the preceding item must repeat at least 5 times and at most 12 times; otherwise there is no match. ^ matches the start of the string and $ matches the end, so with both in place the entire input is matched against \d{5,12}: the whole string has to be 5 to 12 digits. If the QQ number entered matches this regular expression, it meets the requirement. These two codes are very useful when validating input.
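
Input validation with ^ and $ might look like the following sketch in Python. The function name `is_valid_qq` is hypothetical, and the sample inputs are made up.

```python
import re

# Anchored pattern: the whole input must be 5 to 12 digits.
QQ_PATTERN = re.compile(r"^\d{5,12}$")


def is_valid_qq(value: str) -> bool:
    """Return True if value matches ^\\d{5,12}$."""
    # Caveat: in Python, $ also matches just before a trailing newline;
    # re.fullmatch(r"\d{5,12}", value) is stricter if that matters.
    return QQ_PATTERN.search(value) is not None


# Without the anchors, a digit run inside a longer string is enough.
unanchored_hit = re.search(r"\d{5,12}", "abc123456xyz") is not None

print(is_valid_qq("12345"))          # True
print(is_valid_qq("1234"))           # False: too short
print(is_valid_qq("1234567890123"))  # False: 13 digits
print(is_valid_qq("abc123456xyz"))   # False: extra characters
print(unanchored_hit)                # True
```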

Character escapes

If you want to find a metacharacter itself, for example . or *, there is a problem: you cannot write it directly, because it will be interpreted as something else. You have to use \ to cancel the special meaning of the character, so you should write \. and \*. Of course, to find \ itself, you have to write \\.

For example: deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.
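
The effect of escaping can be seen in a short Python sketch. Raw strings (`r"..."`) are used so that Python itself does not interpret the backslashes; `re.escape` is a standard-library helper that adds the escapes automatically.

```python
import re

# Unescaped, . matches any character, so "deerchaoXnet" also matches.
loose = re.search(r"deerchao.net", "deerchaoXnet") is not None

# Escaped, \. matches only a literal period.
strict = re.search(r"deerchao\.net", "deerchaoXnet") is not None
literal = re.search(r"deerchao\.net", "visit deerchao.net today") is not None

# \\ in the pattern matches one literal backslash in the text.
path = re.search(r"C:\\Windows", r"C:\Windows\System32").group()

# re.escape inserts the backslashes for you.
escaped = re.escape("1*2.3")

print(loose, strict, literal)  # True False True
print(path)                    # C:\Windows
print(escaped)                 # 1\*2\.3
```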

Repetition

* repeats 0 or more times
+ repeats 1 or more times
? repeats 0 or 1 times
{n} repeats n times
{n,} repeats n or more times
{n,m} repeats n to m times

Here are some examples of using repetition:

Windows\d+ matches Windows followed by 1 or more digits

^\w+ matches the first word of a line (or the first word of the entire string, depending on the option settings)
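
The quantifiers can be compared side by side with a small Python sketch. The sample strings and the helper name `matching` are made up for illustration.

```python
import re

samples = ["ct", "cat", "caat", "caaat", "caaaat"]


def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]


star = matching(r"ca*t")              # a repeated 0 or more times
plus = matching(r"ca+t")              # a repeated 1 or more times
optional = matching(r"ca?t")          # a repeated 0 or 1 times
exactly_two = matching(r"ca{2}t")     # a repeated exactly 2 times
two_or_more = matching(r"ca{2,}t")    # a repeated 2 or more times
two_to_three = matching(r"ca{2,3}t")  # a repeated 2 to 3 times

# The two examples from the text.
windows = re.findall(r"Windows\d+", "Windows98, Windows2000, Windows")
first_word = re.search(r"^\w+", "hello world").group()

print(optional)      # ['ct', 'cat']
print(two_to_three)  # ['caat', 'caaat']
print(windows)       # ['Windows98', 'Windows2000']
print(first_word)    # hello
```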

Character class

Finding digits, letters or digits, or whitespace is simple because there are already metacharacters for these character sets. But what if you want to match a character set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?

Very simply, you just list the characters in square brackets: [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).

We can also easily specify a range of characters: [0-9] means exactly the same as \d, a digit, and likewise [a-z0-9A-Z_] is exactly the same as \w (if only English is considered).
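
A brief Python sketch of character classes follows; the sample strings are made up. The equivalence between [a-z0-9A-Z_] and \w is only checked here on ASCII input, in line with the "if only English is considered" caveat.

```python
import re

# Any single English vowel.
vowels = re.findall(r"[aeiou]", "regular expression")

# One of the three listed punctuation marks.
marks = re.findall(r"[.?!]", "Really? Yes! Done.")

# A range: [0-9] finds the same characters as \d on ASCII input.
digits_by_range = re.findall(r"[0-9]", "a1b22")
digits_by_code = re.findall(r"\d", "a1b22")

# [a-z0-9A-Z_] behaves like \w on ASCII input.
words_by_range = re.findall(r"[a-z0-9A-Z_]+", "foo_bar-42 baz")
words_by_code = re.findall(r"\w+", "foo_bar-42 baz")

print(vowels)          # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(marks)           # ['?', '!', '.']
print(words_by_range)  # ['foo_bar', '42', 'baz']
```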

The following is a more complex expression: \(?0\d{2}[) -]?\d{8}.

This expression matches phone numbers in several formats, such as (010)88886666, 022-22334455, or 02912345678. Let's analyze it: first comes an escaped \(, which may appear 0 or 1 times (?); then a 0, followed by 2 digits (\d{2}); then one of ), -, or a space, which appears once or not at all (?); and finally 8 digits (\d{8}).
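
The phone number expression, written without stray spaces as `\(?0\d{2}[) -]?\d{8}`, can be checked against the three formats with this Python sketch. `re.fullmatch` is used on the assumption that each sample should match as a whole.

```python
import re

PHONE = re.compile(r"\(?0\d{2}[) -]?\d{8}")

samples = ["(010)88886666", "022-22334455", "02912345678", "010 88886666"]
results = {s: PHONE.fullmatch(s) is not None for s in samples}

# Too few digits in the local number, so no match.
too_short = PHONE.fullmatch("022-2233445") is not None

for number, ok in results.items():
    print(number, ok)   # every sample prints True
print(too_short)        # False
```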

Branching conditions

Unfortunately, the expression above also matches "incorrect" formats such as 010)12345678 or (022-87654321. To solve this problem, we need branching conditions. A branching condition in a regular expression means there are several rules, and a match on any one of them counts as a match; the rules are separated with |. Not clear yet? Look at these examples:

0\d{2}-\d{8}|0\d{3}-\d{7} This expression matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).

\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} This expression matches phone numbers with a 3-digit area code, where the area code may or may not be enclosed in parentheses, and the area code and local number may be separated by a hyphen or a space, or not separated at all. As an exercise, try using branching conditions to extend this expression to support 4-digit area codes as well.
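
The following Python sketch tries both branching expressions. It assumes that the first branch of the second expression requires both parentheses, `\(0\d{2}\)`, because that is what lets the alternation reject the unbalanced formats mentioned above; the sample numbers are illustrative.

```python
import re

# 3-digit area code + 8 digits, or 4-digit area code + 7 digits.
hyphenated = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")

# Area code either fully parenthesised or bare (assumed form, see lead-in).
balanced = re.compile(r"\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8}")


def full(pattern, text):
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None


print(full(hyphenated, "010-12345678"))   # True
print(full(hyphenated, "0376-2233445"))   # True
print(full(hyphenated, "0376-22334455"))  # False

print(full(balanced, "(010)12345678"))    # True
print(full(balanced, "010-12345678"))     # True
print(full(balanced, "010)12345678"))     # False: stray closing parenthesis
print(full(balanced, "(022-87654321"))    # False: parenthesis never closed
```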

\d{5}-\d{4}|\d{5} This expression matches U.S. ZIP codes. A U.S. ZIP code is either 5 digits, or 9 digits with a hyphen after the fifth. This example is here because it illustrates a problem: when using branching conditions, pay attention to the order of the branches. If you change it to \d{5}|\d{5}-\d{4}, it will match only 5-digit ZIP codes (and the first 5 digits of 9-digit ZIP codes). The reason is that the branches are tested from left to right, and as soon as one branch is satisfied, the remaining branches are not tried.
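
The effect of branch order is easy to reproduce in Python; the ZIP code used here is a made-up sample.

```python
import re

zip_code = "12345-6789"

# Longer branch first: the full 9-digit form is matched.
long_first = re.search(r"\d{5}-\d{4}|\d{5}", zip_code).group()

# Shorter branch first: it succeeds immediately, so the
# second branch is never tried and only 5 digits are matched.
short_first = re.search(r"\d{5}|\d{5}-\d{4}", zip_code).group()

print(long_first)   # 12345-6789
print(short_first)  # 12345
```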

Grouping

We've already mentioned how to repeat a single character (just put a quantifier directly after it), but what if you want to repeat several characters? You can use parentheses to specify a subexpression (also called a group), and then you can specify the number of repetitions of the subexpression, as well as do some other things with it (described later).

(\d{1,3}\.){3}\d{1,3} is a simple IP address matching expression. To understand it, read it in this order: \d{1,3} matches a number of 1 to 3 digits; (\d{1,3}\.){3} matches a 1-to-3-digit number followed by a period (this unit is the group), repeated 3 times; and finally comes one more number of 1 to 3 digits (\d{1,3}).

Unfortunately, it also matches 256.300.888.999, an IP address that cannot exist. If you could use arithmetic comparisons, the problem would be easy to solve, but regular expressions provide no mathematical functions, so you have to describe a correct IP address with a lengthy combination of grouping, alternation, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

The key to understanding this expression is to understand 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into detail here; you should be able to work out its meaning yourself.
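
The difference between the simple and the strict IP address expressions can be seen in this Python sketch. Both patterns are written without stray spaces, and the addresses are illustrative samples.

```python
import re

SIMPLE_IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
STRICT_IP = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

addresses = ["192.168.0.1", "255.255.255.255", "256.300.888.999"]

# Check each address in full against both patterns.
simple_ok = [a for a in addresses if SIMPLE_IP.fullmatch(a)]
strict_ok = [a for a in addresses if STRICT_IP.fullmatch(a)]

print(simple_ok)  # all three addresses
print(strict_ok)  # ['192.168.0.1', '255.255.255.255']
```

Each octet alternative covers one range: 2[0-4]\d is 200 to 249, 25[0-5] is 250 to 255, and [01]?\d\d? is 0 to 199.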

Negation

Sometimes you need to find characters that do not belong to an easily defined character class. For example, if you want to find any character other than a digit, you need negation:

Table 3. Commonly used negation codes
Code/Syntax	Description
\W	Matches any character that is not a letter, digit, underscore, or Chinese character
\S	Matches any character that is not a whitespace character
\D	Matches any non-digit character
\B	Matches a position that is not the beginning or end of a word
[^x]	Matches any character except x
[^aeiou]	Matches any character except the letters a, e, i, o, u

Example: \S+ matches a string that contains no whitespace characters.

<a[^>]+> matches a string enclosed in angle brackets that starts with a.
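
A short Python sketch of the negated codes follows. The sample strings, including the HTML fragment and its URL, are made up for illustration.

```python
import re

# \S+ finds runs of non-whitespace characters.
chunks = re.findall(r"\S+", "tab\tseparated  words")

# \D+ finds runs of non-digit characters.
separators = re.findall(r"\D+", "2015-10-09")

# \B requires a position that is NOT a word boundary, so only the
# "hi" inside "this" qualifies, not the standalone word.
inner_positions = [m.start() for m in re.finditer(r"\Bhi\B", "this hi")]

# [^aeiou] matches any character except the listed vowels.
consonants = re.findall(r"[^aeiou]", "regex")

# <a[^>]+> matches an opening tag that starts with a, up to the first >.
html = '<p>See <a href="https://example.com">this link</a>.</p>'
anchors = re.findall(r"<a[^>]+>", html)

print(chunks)           # ['tab', 'separated', 'words']
print(separators)       # ['-', '-']
print(inner_positions)  # [1]
print(consonants)       # ['r', 'g', 'x']
print(anchors)          # ['<a href="https://example.com">']
```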

Backreferences

When you specify a subexpression with parentheses, the text that matches the subexpression (that is, what the group captures) can be processed further in the expression or in other code. By default, each group automatically gets a group number. The rule is: going from left to right, using each group's left parenthesis as the marker, the first group is number 1, the second is number 2, and so on.

A backreference repeats a search for the text that an earlier group matched. For example, \1 stands for the text matched by group 1. Hard to understand? Take a look at the example:

\b(\w+)\b\s+\1\b can be used to match repeated words, like go go or kitty kitty. The expression starts with a word, that is, one or more letters or digits between the beginning and end of a word (\b(\w+)\b). That word is captured in the group numbered 1. Next come 1 or more whitespace characters (\s+), and finally the content captured in group 1, that is, the word just matched (\1).
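
The repeated-word expression, written without stray spaces as `\b(\w+)\b\s+\1\b`, can be tried in Python as follows. The sample sentence is made up, and the `sub` call is an extra illustration of using the same backreference in a replacement.

```python
import re

REPEATED = re.compile(r"\b(\w+)\b\s+\1\b")

text = "go go to the the kitty kitty shop"

# Group 1 holds the word that was repeated.
repeated_words = [m.group(1) for m in REPEATED.finditer(text)]

# In a replacement string, \1 refers to the same captured word,
# so each doubled word collapses to a single copy.
deduplicated = REPEATED.sub(r"\1", text)

print(repeated_words)  # ['go', 'the', 'kitty']
print(deduplicated)    # go to the kitty shop
```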

You can also specify the group name of a subexpression yourself. To do so, use this syntax: (?<word>\w+) (or replace the angle brackets with single quotes: (?'word'\w+)), which names the \w+ group word. To backreference what this group captured, use \k<word>, so the previous example can also be written as \b(?<word>\w+)\b\s+\k<word>\b.
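
Named-group syntax varies between regex engines. Python's `re` module spells a named group `(?P<word>...)` and a named backreference `(?P=word)`, so the following sketch is the Python equivalent of the named expression above rather than a literal copy of it. The sample sentence is made up.

```python
import re

# Python spelling of the named-group version of the repeated-word pattern.
NAMED = re.compile(r"\b(?P<word>\w+)\b\s+(?P=word)\b")

match = NAMED.search("it was a kitty kitty day")

whole = match.group()           # the full matched text
by_name = match.group("word")   # access the capture by its name
by_number = match.group(1)      # named groups still get a number

print(whole)      # kitty kitty
print(by_name)    # kitty
print(by_number)  # kitty
```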

Parentheses are used in many special-purpose syntaxes. Some of the most common ones are listed below:

Table 4. Common grouping syntax
Category	Code/Syntax	Description
Capture	(exp)	Matches exp and captures the text into an automatically numbered group
	(?<name>exp)	Matches exp and captures the text into a group named name; can also be written (?'name'exp)
	(?:exp)	Matches exp, does not capture the matched text, and does not assign a group number to this group
Zero-width assertion	(?=exp)	Matches the position before exp
	(?<=exp)	Matches the position after exp
	(?!exp)	Matches a position that is not followed by exp
	(?<!exp)	Matches a position that is not preceded by exp
Comment	(?#comment)	This type of group has no effect on how the regular expression is processed; it provides a comment for people to read

We have discussed the first two syntaxes. The third, (?:exp), does not change the way the regular expression is processed, except that the content matched by such a group is not captured as it is with the first two forms, and the group gets no group number. "Why would I want to do that?" Good question. Why do you think?
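
One possible answer, sketched in Python: a non-capturing group keeps the group numbering clean when the parentheses are needed only for repetition. The same sketch also tries the four zero-width assertions from Table 4. All sample strings are made up for illustration.

```python
import re

# Capturing: group 1 is the repeated "digits + period" unit
# (it keeps only its last repetition), group 2 is the final number.
capturing = re.search(r"(\d{1,3}\.){3}(\d{1,3})", "10.0.0.7").groups()

# Non-capturing: the repeated unit gets no number, so the
# final number becomes group 1.
non_capturing = re.search(r"(?:\d{1,3}\.){3}(\d{1,3})", "10.0.0.7").groups()

# (?=exp): the part of a word that is followed by "ing".
before_ing = re.findall(r"\b\w+(?=ing\b)", "singing and dancing")

# (?<=exp): the rest of a word that is preceded by "re".
after_re = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# (?!exp): words starting with q where the q is not followed by u.
q_without_u = re.findall(r"\bq(?!u)\w*", "qatar quit iraq")

# (?<!exp): numbers that are not preceded by a dollar sign.
not_prices = re.findall(r"(?<!\$)\b\d+\b", "cost $30 for 45 items")

print(capturing)      # ('0.', '7')
print(non_capturing)  # ('7',)
print(before_ing)     # ['sing', 'danc']
print(after_re)       # ['ading']
print(q_without_u)    # ['qatar']
print(not_prices)     # ['45']
```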

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
