Popular understanding of regular expressions

Last Update:2015-04-12 Source: Internet

Author: User


This is not an entry-level article. If you already know or use regular expressions, it should help you recall them quickly. It assumes you have used regular expressions or at least have some understanding of them, because I have not written many examples. I am summarizing regular expressions here because, after several years of accumulated experience, I want to outline in plain, simple terms what a regular expression is and what it can do. There are plenty of articles about regular expressions on the web, but they tend to be either too full of jargon or too tedious and clumsy to read. The last two sections are taken, with some corrections, from the "30-minute introductory tutorial" on regular expressions.

A regular expression string consists of two basic kinds of characters: literal text characters and metacharacters. A metacharacter is a character that has a special purpose in a regular expression.

Metacharacters

A metacharacter may represent a character, a position, or a quantity.

Metacharacters that represent characters

Code	Description
.	Match any character other than a line break
\w	Match a letter, digit, or underscore (and, in Unicode-aware engines, Chinese characters)
\s	Match any whitespace character
\d	Match a digit
\S	The opposite of \s
\D	The opposite of \d
\W	The opposite of \w
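As a rough illustration of the table above, the following JavaScript sketch applies each shorthand to a made-up sample string (the string and variable names are assumptions, not part of the original article). Note that in JavaScript `\w` covers only ASCII letters, digits, and the underscore.

```javascript
// Illustrative sample string; the names here are invented for this sketch.
const sample = "Room 42_b\tis open";

const digits = sample.match(/\d/g);      // digit characters: ["4", "2"]
const wordChars = sample.match(/\w/g);   // ASCII letters, digits, underscore
const whitespace = sample.match(/\s/g);  // two spaces and one tab
const nonDigits = sample.match(/\D/g);   // everything that \d does not match

// The dot matches any character except a line break.
const dotMatches = "a\nb".match(/./g);   // ["a", "b"]

console.log(digits, wordChars.join(""), whitespace.length, nonDigits.length, dotMatches);
```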

Metacharacters that represent quantity

Code	Description
*	Repeat 0 or more times
+	Repeat 1 or more times
?	Repeat 0 or 1 times
{n}	Repeat n times
{n,}	Repeat n or more times
{n,m}	Repeat n to m times
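The quantifiers above can be checked with a small JavaScript sketch. The patterns and inputs are invented for illustration, and each row records the result one would expect from the table.

```javascript
// Each row: [pattern, input, expected result]. All values are illustrative.
const quantifierChecks = [
  [/^ab*c$/, "ac", true],          // * allows zero b's
  [/^ab+c$/, "ac", false],         // + needs at least one b
  [/^ab+c$/, "abbbc", true],
  [/^colou?r$/, "color", true],    // ? makes the u optional
  [/^colou?r$/, "colouur", false], // but allows it at most once
  [/^\d{5}$/, "12345", true],      // {n}: exactly n times
  [/^\d{2,}$/, "7", false],        // {n,}: n or more times
  [/^\d{2,4}$/, "12345", false],   // {n,m}: between n and m times
];

const quantifierResults = quantifierChecks.map(([re, input]) => re.test(input));
console.log(quantifierResults);
```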

Metacharacters that represent position

Code	Description
\b	Match the start or end of a word
\B	Match a position that is not the start or end of a word
^	Match the start of the string
$	Match the end of the string
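Position metacharacters consume no characters; they only constrain where a match may occur. A minimal JavaScript sketch, using an invented sample string and a hypothetical helper named `allDigits`:

```javascript
const words = "cat concat cat's";

// \b requires a word boundary on both sides, so "concat" is skipped.
const wholeWords = words.match(/\bcat\b/g);  // 2 matches
// \B requires that "cat" does NOT start at a word boundary.
const insideWord = words.match(/\Bcat/g);    // 1 match, inside "concat"

// ^ and $ pin the pattern to the whole string.
const allDigits = (s) => /^\d+$/.test(s);

console.log(wholeWords.length, insideWord.length, allDigits("12345"), allDigits("123a"));
```

Without `^` and `$`, `/\d+/` would also accept "123a", because it only needs to find digits somewhere in the string.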

Escape character

Since metacharacters have a special meaning in a regular expression, what if you want to treat one as an ordinary literal character? Put a backslash \ in front of the metacharacter: it then loses its special meaning and becomes a plain literal character.
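A short JavaScript sketch of escaping follows. The helper `escapeRegExp` is a hypothetical name for a common idiom, not a built-in function; it is useful when a pattern is assembled from arbitrary text.

```javascript
// An unescaped dot matches any character; an escaped dot matches only ".".
const dotAsMeta = /a.c/.test("abc");      // true
const dotAsLiteral = /a\.c/.test("abc");  // false
const literalDot = /a\.c/.test("a.c");    // true

// Hypothetical helper: put a backslash in front of every metacharacter
// so that arbitrary text can be embedded in a pattern literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const exact = new RegExp("^" + escapeRegExp("1+1=2?") + "$");
console.log(dotAsMeta, dotAsLiteral, literalDot, exact.test("1+1=2?"));
```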

Custom character set

The character metacharacters cover a wide range. What if I only want to match a character from a small set?

That is simple: list the characters in square brackets, such as [aeiou] or [.?!], which matches only one of the characters that appear inside the brackets.

Note that inside the square brackets, the metacharacters described above are no longer metacharacters. One character that does have a special role there is the hyphen:

If the hyphen is between two characters, it denotes a range and the hyphen itself is not included, as in [0-9], [a-z], etc.;
If no character follows the hyphen, the hyphen is itself one of the characters in the set, as in [*%-];

Escape characters can also be used inside a character set.
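The bracket rules above can be tried out with a few JavaScript one-liners. The sample strings are invented for illustration.

```javascript
// A set matches exactly one character out of those listed.
const vowels = "regular expression".match(/[aeiou]/g);  // 7 vowels

// Inside brackets the dot is a literal dot, not "any character".
const dotInSet = /[.]/.test("a");              // false
const punctuation = /[.?!]/.test("Really?");   // true

// A hyphen between two characters is a range; a trailing hyphen is literal.
const isHex = /^[0-9a-f]+$/.test("c0ffee");    // true
const trailingHyphen = /[*%-]/.test("-");      // true
const notInSet = /[*%-]/.test("&");            // false

// Escapes still work inside a set, here for "]" and "-".
const escapedInSet = /[\]\-]/.test("]");       // true

console.log(vowels.length, dotInSet, punctuation, isHex, trailingHyphen, notInSet, escapedInSet);
```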

Negation

Sometimes you need to find characters that do not belong to a simply defined character set. This is negation.

Code	Description
\W	Match any character that is not a letter, digit, underscore, or Chinese character
\S	Match any character that is not a whitespace character
\D	Match any character that is not a digit
\B	Match a position that is not the beginning or end of a word
[^x]	Match any character except x
[^aeiou]	Match any character except the letters a, e, i, o, u
Alternation (|)

This is equivalent to an "or" statement: use a vertical bar | to separate the alternative rules. For example, the expression 0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of phone numbers written with a hyphen: a three-digit area code followed by an 8-digit local number (such as 010-12345678), or a 4-digit area code followed by a 7-digit local number (such as 0376-2233445).
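A small JavaScript sketch of the phone-number expression follows. One detail worth showing is that `|` has the lowest precedence, so anchors apply to a single branch unless the alternatives are wrapped in a group. The names `phone` and `loose` are illustrative.

```javascript
// Alternatives wrapped in a non-capturing group so ^ and $ apply to both.
const phone = /^(?:0\d{2}-\d{8}|0\d{3}-\d{7})$/;

// Without the group, ^ belongs to the first branch and $ to the second.
const loose = /^0\d{2}-\d{8}|0\d{3}-\d{7}$/;

console.log(phone.test("010-12345678"));   // true
console.log(phone.test("0376-2233445"));   // true
console.log(phone.test("010-1234567"));    // false
console.log(loose.test("x0376-2233445"));  // true, the leading "x" slips through
console.log(phone.test("x0376-2233445"));  // false
```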

Grouping

A group is simply a part of the expression enclosed in parentheses () and treated as a small matching unit.

Grouping serves two purposes:

By default, the regular expression parser assigns a group number to each group, so that a later part of the expression can refer to the content the group matched by that number.
A group can be followed by a quantity metacharacter, which simplifies the expression. For example, (\d{1,3}\.){3}\d{1,3} is a simple IP-address matching expression.

There is one thing to note about a group followed by a quantity metacharacter. It may feel as though several groups should be created, but there is actually only one: group numbers are assigned when the parser reads the expression, and the text of the group appears only once in the regular expression string, so only one group number is assigned. As a result, after the regular expression runs, the group holds the content of its last repetition.

For example:

/(\d{1,3}\.){3}\d{1,3}/g.exec("201.202.203.204");

Result: ["201.202.203.204", "203."]
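Since a repeated group keeps only its last repetition, one way to get every part is to write one group per part. The following JavaScript sketch contrasts the two; the variable names are illustrative.

```javascript
const ip = "201.202.203.204";

// One group repeated three times: only the last repetition is kept.
const repeated = /(\d{1,3}\.){3}\d{1,3}/.exec(ip);
console.log(repeated[1]);  // "203."

// Four separate groups: each one gets its own group number.
const explicit = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(ip);
const octets = explicit.slice(1);
console.log(octets);       // ["201", "202", "203", "204"]
```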

When a later part of a regular expression refers to the content matched by an earlier group through its group number, this is called a back reference. Group numbers are assigned from left to right according to each group's opening parenthesis: the first group is number 1, the second is number 2, and so on. When referring to a group, remember to put a backslash before the group number.

For example:

/\b(\w+)\b\s+\1\b/.test("hello hello"); // true

/\b(\w+)\b\s+\1\b/.test("hello hell"); // false
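Back references are also available on the replacement side, where JavaScript writes group 1 as `$1`. A minimal sketch that collapses doubled words (the sample sentence is invented):

```javascript
// \1 must repeat exactly what group 1 captured.
const doubledWord = /\b(\w+)\s+\1\b/g;

// In the replacement string, $1 stands for the content of group 1.
const cleaned = "this is is a test test".replace(doubledWord, "$1");
console.log(cleaned);  // "this is a test"

console.log(/\b(\w+)\b\s+\1\b/.test("hello hello"));  // true
console.log(/\b(\w+)\b\s+\1\b/.test("hello hell"));   // false
```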

Syntax for grouping:

Code/syntax	Description
(exp)	Match exp, and capture the text into an automatically numbered group
(?<name>exp)	Match exp, and capture the text into a group named name; this can also be written (?'name'exp)
(?:exp)	Match exp, but do not capture the matched text and do not assign a group number to this group
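The grouping syntax above can be sketched in JavaScript as follows. JavaScript (ES2018 and later) supports the `(?<name>exp)` form but not the `(?'name'exp)` spelling; the sample strings and names are assumptions made for this example.

```javascript
// Named groups are exposed on the match object's "groups" property.
const dateMatch = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/.exec("Last update: 2015-04-12");
console.log(dateMatch.groups.year, dateMatch.groups.month, dateMatch.groups.day);

// A named group can be back-referenced with \k<name>.
const quoted = /(?<q>["']).*?\k<q>/.exec('say "hi" now');
console.log(quoted[0]);  // "hi" including the quotes

// (?:...) groups for repetition without taking a group number,
// so the only numbered group here is the last octet.
const lastOctet = /(?:\d{1,3}\.){3}(\d{1,3})/.exec("201.202.203.204");
console.log(lastOctet.length, lastOctet[1]);  // 2 "204"
```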

Zero-width assertions

A zero-width assertion, like \b, ^, and $, specifies a position, but the position must satisfy a certain condition, which is the assertion. Because such an assertion is not given a group number the way a group is, and does not consume any of the string being matched, it is called a zero-width assertion.

Code/syntax	Description
(?=exp)	Assert that text matching exp follows this position
(?<=exp)	Assert that text matching exp precedes this position
(?!exp)	Assert that text matching exp does not follow this position
(?<!exp)	Assert that text matching exp does not precede this position
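The four assertions can be sketched in JavaScript as below. Lookbehind needs an ES2018-capable engine, and the sample strings are invented for illustration. In each case the asserted text is checked but never becomes part of the match.

```javascript
// (?=exp): digits that are followed by "px".
const pixels = "100px 2em 30px".match(/\d+(?=px)/g);       // ["100", "30"]

// (?<=exp): digits that are preceded by a dollar sign.
const amounts = "cost $42, tax $7".match(/(?<=\$)\d+/g);   // ["42", "7"]

// (?!exp): a "q" that is not followed by "u".
const qWithoutU = [/q(?!u)/.test("Iraq"), /q(?!u)/.test("quit")];  // [true, false]

// (?<!exp): a digit that is not preceded by "b".
const notAfterB = "a1 b2 c3".match(/(?<!b)\d/g);           // ["1", "3"]

// Combined: insert thousands separators at zero-width positions.
const grouped = "1234567".replace(/\B(?=(\d{3})+(?!\d))/g, ",");  // "1,234,567"

console.log(pixels, amounts, qWithoutU, notAfterB, grouped);
```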

Comments

Comments are written as (?#comment).

Greedy and lazy matching

A given regular expression may be able to match either a whole string or only part of it. Which does it match when both are possible?

/h.*o/.exec("hello ho"); // ["hello ho"]

/h.*?o/.exec("hello ho"); // ["hello"]

In this example, the expression that matches only hello uses lazy matching (note the question mark); without the question mark, the match is greedy.

Greedy matching repeats as many times as possible so as to match as many characters as possible. Lazy matching, conversely, matches as few characters as possible. Matching is greedy by default; lazy matching is used only where a lazy qualifier (a question mark) follows a quantity metacharacter.

Lazy qualifiers
Syntax	Description
*?	Repeat any number of times, but as few times as possible
+?	Repeat 1 or more times, but as few times as possible
??	Repeat 0 or 1 times, but as few times as possible
{n,m}?	Repeat n to m times, but as few times as possible
{n,}?	Repeat n or more times, but as few times as possible
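A common place where the difference shows up is matching tags. The JavaScript sketch below uses an invented HTML fragment. It also shows that a lazy match still starts at the leftmost possible position, so it is not necessarily the shortest match in the string.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ runs to the last ">" in the string.
const greedyTag = html.match(/<.+>/)[0];   // the whole string
// Lazy: .+? stops at the first ">" it can.
const lazyTag = html.match(/<.+?>/)[0];    // "<b>"
const allTags = html.match(/<.+?>/g);      // ["<b>", "</b>", "<i>", "</i>"]

// Lazy does not mean shortest overall: the match starts at the first "a".
const leftmost = "aabab".match(/a.*?b/)[0];  // "aab", not "ab"

// {n,m}? takes the minimum count that still lets the match succeed.
const fewest = "aaaa".match(/a{2,4}?/)[0];   // "aa"

console.log(greedyTag, lazyTag, allTags, leftmost, fewest);
```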

Balancing groups / recursive matching

All of the matching described so far is linear, and none of those techniques helps with a nested, hierarchical structure such as (100 * (50 + 15)). You cannot know in advance how many times the parentheses appear, and what if the opening and closing parentheses do not occur the same number of times? How can you match the content between the longest pair of properly paired parentheses?

The following syntax constructs are required (they are supported by the .NET regular expression engine):

(?'group'exp)	Names the captured content group and pushes it onto the stack
(?'-group'exp)	Pops from the stack the most recently pushed capture named group; if the stack is empty, this group fails to match
(?(group)yes|no)	If a capture named group is on the stack, continue matching with the yes part; otherwise continue with the no part
(?!)	Zero-width negative lookahead; because it contains no exp, any attempt to match it always fails

To keep ( and \( from confusing your brain completely, we use angle brackets instead of parentheses. The question then becomes: how do we capture the content of the longest properly paired angle brackets in a string such as xx <aa <bbb> <bbb> aa> yy?

<                     # outermost opening bracket
  [^<>]*              # non-bracket content after the outermost opening bracket
  (
    (
      (?'Open'<)      # found an opening bracket: write an "Open" on the blackboard
      [^<>]*          # non-bracket content after the opening bracket
    )+
    (
      (?'-Open'>)     # found a closing bracket: erase one "Open"
      [^<>]*          # non-bracket content after the closing bracket
    )+
  )*
  (?(Open)(?!))       # before the outermost closing bracket, check whether any "Open" is left on the blackboard; if so, the match fails
>                     # outermost closing bracket
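JavaScript's regular expression engine has no balancing groups, so the expression above cannot be used there as written. As a substitute technique (not a translation of the regex), the sketch below plays the role of the "blackboard" with an explicit depth counter. The function name `firstBalanced` and its parameters are hypothetical.

```javascript
// Return the first span that starts at an opening bracket and ends at its
// matching closing bracket, or null if no balanced span exists.
function firstBalanced(text, open = "<", close = ">") {
  for (let start = text.indexOf(open); start !== -1; start = text.indexOf(open, start + 1)) {
    let depth = 0;  // number of "Open" marks currently on the blackboard
    for (let i = start; i < text.length; i++) {
      if (text[i] === open) depth++;        // write an "Open"
      else if (text[i] === close) depth--;  // erase an "Open"
      if (depth === 0) return text.slice(start, i + 1);
    }
    // Ran out of text with brackets still open: try the next opening bracket.
  }
  return null;
}

console.log(firstBalanced("xx <aa <bbb> <bbb> aa> yy"));  // "<aa <bbb> <bbb> aa>"
console.log(firstBalanced("(100 * (50 + 15))", "(", ")"));
```

The counter only tracks one kind of bracket at a time, which is the same restriction the single "Open" stack has in the expression above.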

Common escape characters and some other metacharacters
Code/syntax	Description
\a	Alarm character (printing it makes the computer beep)
\b	Usually a word boundary, but represents backspace when used inside a character class
\t	Tab
\r	Carriage return
\v	Vertical tab
\f	Form feed
\n	Line break
\e	Escape
\0nn	The character whose ASCII code in octal is nn
\xnn	The character whose ASCII code in hexadecimal is nn
\unnnn	The character whose Unicode code in hexadecimal is nnnn
\cN	ASCII control character; for example, \cC represents Ctrl+C
\A	The beginning of the string (similar to ^, but not affected by the multiline option)
\Z	The end of the string, or the position just before a line break at the very end of the string (not affected by the multiline option)
\z	The end of the string (similar to $, but not affected by the multiline option)
\G	The position where the current search begins
\p{name}	The Unicode character class called name, for example \p{IsGreek}
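Not every engine supports every entry in this table. The JavaScript sketch below exercises the escapes that JavaScript does support; \a, \e, \A, \Z, \z, and \G are not part of JavaScript's regular expression syntax. Unicode classes use the `u` flag and a `\p{Script=Greek}` spelling rather than `\p{IsGreek}`. The sample strings are invented.

```javascript
// \t and \n inside a pattern match the actual control characters.
const columns = "Code\tDescription".split(/\t/);  // ["Code", "Description"]
const lines = "a\r\nb\nc".split(/\r?\n/);         // ["a", "b", "c"]

const hexA = /\x41/.test("A");         // \xnn: character code in hexadecimal
const unicodeChar = /\u4e2d/.test("中"); // \unnnn: Unicode code unit in hexadecimal
const ctrlC = /\cC/.test("\u0003");    // \cC: Ctrl+C is character code 3

// Inside a character class, \b means backspace rather than word boundary.
const backspace = /[\b]/.test("\b");

// Unicode property escape; requires the "u" flag in JavaScript.
const greek = /\p{Script=Greek}/u.test("λ");

console.log(columns, lines, hexA, unicodeChar, ctrlC, backspace, greek);
```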
