This is not an entry-level article; if you already know or use regular expressions, it may help you recall them quickly. Reading it requires some prior experience with regular expressions, because I have not written many examples. I am summarizing regular expressions simply because, after several years of accumulated experience, I want to outline in plain terms what a regular expression is and what it can do. There are many regular expression articles on the web, but they tend to carry too much jargon, or read as too wordy or too clumsy. The last two sections come from the "30-Minute Introductory Tutorial to Regular Expressions", with some corrections.
A regular expression string consists of two basic kinds of characters: literal text characters and metacharacters. A metacharacter is a character that has a special purpose in a regular expression.
Metacharacters
A metacharacter may represent a character, a quantity, or a position.
Metacharacters that represent characters
Code | Description |
. | Match any character except a line break |
\w | Match a letter, digit, underscore, or Chinese character |
\s | Match any whitespace character |
\d | Match a digit |
\S | The opposite of \s |
\D | The opposite of \d |
\W | The opposite of \w |
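As a small sketch of the table above, the following JavaScript applies each code to a made-up sample string (the string and variable names are my own, not from any library). One assumption to keep in mind: in JavaScript, \w covers only ASCII letters, digits, and the underscore, so the Chinese-character part of the description does not apply there.

```javascript
// Character metacharacters applied to a hypothetical sample string.
const sample = "id_42 ok";

const wordChars = sample.match(/\w/g); // letters, digits, underscore
const digits = sample.match(/\d/g);    // digits only
const spaces = sample.match(/\s/g);    // whitespace only
const nonDigits = sample.match(/\D/g); // everything that is not a digit
const anyChar = sample.match(/./g);    // every character except line breaks

console.log(wordChars.join(""));  // "id_42ok"
console.log(digits.join(""));     // "42"
console.log(spaces.length);       // 1
console.log(nonDigits.join(""));  // "id_ ok"
console.log(anyChar.length);      // 8
```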
Metacharacters that represent quantity
Code | Description |
* | Repeat 0 or more times |
+ | Repeat 1 or more times |
? | Repeat 0 or 1 time |
{n} | Repeat exactly n times |
{n,} | Repeat n or more times |
{n,m} | Repeat n to m times |
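A rough illustration of each quantity metacharacter in JavaScript. The patterns and test strings are invented for the example, and the ^ and $ anchors are added only so that each test checks the whole string.

```javascript
// One pattern per quantifier; all names and inputs are illustrative.
const optional = /^colou?r$/;  // ?     : "u" appears 0 or 1 time
const star = /^ab*c$/;         // *     : "b" appears 0 or more times
const plus = /^ab+c$/;         // +     : "b" appears 1 or more times
const exact = /^\d{4}$/;       // {n}   : exactly 4 digits
const atLeast = /^\d{2,}$/;    // {n,}  : 2 or more digits
const between = /^\d{2,4}$/;   // {n,m} : 2 to 4 digits

console.log(optional.test("color"), optional.test("colour")); // true true
console.log(star.test("ac"), plus.test("ac"));                // true false
console.log(exact.test("2024"), exact.test("202"));           // true false
console.log(atLeast.test("7"), atLeast.test("1234567"));      // false true
console.log(between.test("123"), between.test("12345"));      // true false
```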
Metacharacters that represent position
Code | Description |
\b | Match the start or end of a word |
\B | Match a position that is not the start or end of a word |
^ | Match the start of the string |
$ | Match the end of the string |
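Position metacharacters match a place in the string rather than a character. A minimal sketch in JavaScript, using a sample sentence of my own:

```javascript
// Position metacharacters on a hypothetical sample string.
const text = "cat concat cats";

// \b on both sides: only the standalone word "cat" qualifies.
const wholeWord = text.match(/\bcat\b/g);
// \B before "cat": only the "cat" inside "concat" qualifies.
const insideWord = text.match(/\Bcat/g);
// ^ and $ anchor the pattern to the start and end of the string.
const startsWithCat = /^cat/.test(text);
const endsWithCats = /cats$/.test(text);

console.log(wholeWord.length);  // 1
console.log(insideWord.length); // 1
console.log(startsWithCat);     // true
console.log(endsWithCats);      // true
```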
Escape character
Since metacharacters have a special meaning in a regular expression, what if you want to treat them as ordinary literal characters? Put a backslash \ in front of the metacharacter: it then loses its special meaning and becomes a literal character. For example, \. matches a literal period and \\ matches a literal backslash.
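The difference an escape makes can be sketched as follows. The helper escapeRegExp is a hypothetical name of my own (JavaScript has had no such built-in for most of its history); it simply puts a backslash before every metacharacter so that arbitrary text can be used as a literal pattern.

```javascript
// Without the backslash, "." means "any character".
const unescaped = /^1.5$/;
// With the backslash, "." means only a literal period.
const escaped = /^1\.5$/;

console.log(unescaped.test("1x5")); // true
console.log(escaped.test("1x5"));   // false
console.log(escaped.test("1.5"));   // true

// Hypothetical helper: escape every metacharacter in a piece of text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literal = new RegExp("^" + escapeRegExp("a+b?") + "$");
console.log(literal.test("a+b?")); // true
console.log(literal.test("aab"));  // false
```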
Custom character sets
The range of characters covered by the character metacharacters is often too wide. What if I only want to match a character from a small range?
Simple: list the candidates in square brackets, such as [aeiou] or [.?!], which match only one of the characters that appear inside the brackets.
Note that inside square brackets symbols such as . ? * + lose their special meaning and are treated literally; the character to watch is the hyphen.
- If the hyphen is between two characters, it denotes a range and the hyphen itself is not included, as in [0-9], [a-z], etc.;
- If there is no character after the hyphen, the hyphen is itself one of the characters in the set, as in [*%-].
You can also use escape characters inside a character set.
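The rules for square brackets can be tried out like this; the input strings are made up for the illustration.

```javascript
// A listed set: only the characters inside the brackets match.
const vowels = "regular".match(/[aeiou]/g);
// Metacharacters such as . ? are literal inside the brackets.
const punctuation = "a.b".match(/[.?!]/g);
// A hyphen between two characters denotes a range.
const isDigit = /^[0-9]$/.test("7");
// A trailing hyphen is just one more member of the set.
const symbols = "a*b-c%d".match(/[*%-]/g);
// An escape inside the set: match a literal "]".
const brackets = "a]b".match(/[\]]/g);

console.log(vowels);      // ["e", "u", "a"]
console.log(punctuation); // ["."]
console.log(isDigit);     // true
console.log(symbols);     // ["*", "-", "%"]
console.log(brackets);    // ["]"]
```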
Negation
Sometimes you need to find characters that do not belong to some easily defined character set. This is negation.
Code | Description |
\W | Match any character that is not a letter, digit, underscore, or Chinese character |
\S | Match any character that is not a whitespace character |
\D | Match any character that is not a digit |
\B | Match a position that is not the start or end of a word |
[^x] | Match any character except x |
[^aeiou] | Match any character except the letters a, e, i, o, u |
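A brief JavaScript sketch of the negated forms, again with invented inputs:

```javascript
// [^...] matches any character that is NOT listed.
const onlyVowels = "regex 101".replace(/[^aeiou]/g, ""); // delete non-vowels
const notX = "xaxbx".match(/[^x]/g);

// Upper-case codes negate their lower-case counterparts.
const nonWord = "a-b c".match(/\W/g);
const nonSpace = "a b".match(/\S/g);
const nonDigit = "4a2".match(/\D/g);

console.log(onlyVowels); // "ee"
console.log(notX);       // ["a", "b"]
console.log(nonWord);    // ["-", " "]
console.log(nonSpace);   // ["a", "b"]
console.log(nonDigit);   // ["a"]
```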
Alternation (|)
This is the equivalent of an "or" statement: use a vertical bar | to separate the alternative rules. For example, 0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of hyphen-separated phone numbers: a 3-digit area code with an 8-digit local number (such as 010-12345678), or a 4-digit area code with a 7-digit local number (such as 0376-2233445).
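The phone number expression can be checked as follows. The surrounding ^(?: ... )$ is my own addition so that the whole input must be one of the two forms; without the wrapper, the anchors would bind to only one branch each.

```javascript
// Either 3-digit area code + 8 digits, or 4-digit area code + 7 digits.
const phone = /^(?:0\d{2}-\d{8}|0\d{3}-\d{7})$/;

console.log(phone.test("010-12345678")); // true  (first branch)
console.log(phone.test("0376-2233445")); // true  (second branch)
console.log(phone.test("010-1234567"));  // false (fits neither branch)
```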
Grouping
A group is simply a part of the expression enclosed in parentheses () and treated as a single small matching unit.
Grouping has two uses:
- By default, the regular expression parser assigns a group number to each group, so a later part of the regular expression can refer to the content the group matched by that number.
- A group can be followed by a quantity metacharacter, which simplifies writing the expression; (\d{1,3}\.){3}\d{1,3} is a simple IP address matching expression.
One thing to note about a group followed by a quantity metacharacter: it may feel as though several groups are created, but there is really only one. Group numbers are assigned when the parser reads the expression, and the text of the group occurs only once in the regular expression string, so only one group number is assigned. Consequently, after the regular expression runs, the group holds the content of its last repetition.
For example:
/(\d{1,3}\.){3}\d{1,3}/g.exec("201.202.203.204");
Result: ["201.202.203.204", "203."]
When a later part of a regular expression refers, by group number, to the content matched by an earlier group, this is called a back reference. Group numbers are assigned from left to right by the position of each group's opening parenthesis: the first group is 1, the second is 2, and so on. When referring to a group, remember to put a backslash before the group number, as in \1.
For example:
/\b(\w+)\b\s+\1\b/.test("hello hello"); // true
/\b(\w+)\b\s+\1\b/.test("hello hell"); // false
Grouping syntax:
Code/syntax | Description |
(exp) | Match exp and capture the text into an automatically numbered group |
(?<name>exp) | Match exp and capture the text into a group named name; can also be written (?'name'exp) |
(?:exp) | Match exp without capturing the matched text or assigning a group number |
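Here is a small JavaScript sketch of the three forms. It assumes an engine with named-group support (ES2018 or later); JavaScript accepts the (?<name>exp) spelling but not (?'name'exp), and it refers back to a named group with \k<name>. The date and IP strings are just sample data.

```javascript
// Named groups: read the captures by name or by number.
const dateRe = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const date = dateRe.exec("2024-03-15");
console.log(date.groups.year, date.groups.month, date.groups.day); // 2024 03 15
console.log(date[1]); // "2024" (a named group still gets a number)

// Non-capturing group: (?:...) takes no number, so group 1 is the last octet.
const ip = /(?:\d{1,3}\.){3}(\d{1,3})/.exec("201.202.203.204");
console.log(ip.length); // 2 (whole match plus one captured group)
console.log(ip[1]);     // "204"

// Back reference to a named group.
const repeatedWord = /\b(?<word>\w+)\s+\k<word>\b/;
console.log(repeatedWord.test("hello hello")); // true
console.log(repeatedWord.test("hello hell"));  // false
```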
Zero-width assertions
A zero-width assertion, like \b, ^, and $, specifies a position, but one that must satisfy a condition; that condition is the assertion. Because such an assertion is not given a group number the way a group is, and does not consume any of the string being matched, it is called zero-width.
Code/syntax | Description |
(?=exp) | Assert that exp matches immediately after this position |
(?<=exp) | Assert that exp matches immediately before this position |
(?!exp) | Assert that exp does not match immediately after this position |
(?<!exp) | Assert that exp does not match immediately before this position |
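The four assertions can be sketched in JavaScript as below, assuming an engine that supports lookbehind (ES2018 or later). The sample strings are illustrative. Note that the asserted text never appears in the result, which is what "does not consume" means.

```javascript
// (?=exp): word stems that are followed by "ing".
const stems = "singing dancing".match(/\b\w+(?=ing\b)/g);
// (?<=exp): the rest of a word that starts with "re".
const tails = "reading a book".match(/(?<=\bre)\w+\b/g);
// (?!exp): a "q" that is not followed by "u".
const qWithoutU = [/q(?!u)/.test("Iraq"), /q(?!u)/.test("quit")];
// (?<!exp): numbers that are not preceded by "$".
const unpriced = "$10 20".match(/(?<!\$)\b\d+/g);

console.log(stems);     // ["sing", "danc"]
console.log(tails);     // ["ading"]
console.log(qWithoutU); // [true, false]
console.log(unpriced);  // ["20"]
```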
Comments
A comment is written as (?#comment).
Greedy and lazy matching
The same regular expression may be able to match either a whole string or only part of it. When both are possible, which does it match?
/h.*o/.exec("hello ho"); // ["hello ho"]
/h.*?o/.exec("hello ho"); // ["hello"]
In this example, the expression that matches only hello uses lazy matching; without the question mark, greedy matching is used.
Greedy matching repeats as many times as possible so as to match as many characters as possible. Lazy matching, conversely, matches as few characters as possible. Greedy is the default; lazy matching applies only where a lazy qualifier (a question mark) is placed after a quantity metacharacter.
Lazy qualifiers
Syntax | Description |
*? | Repeat any number of times, but as few times as possible |
+? | Repeat 1 or more times, but as few times as possible |
?? | Repeat 0 or 1 time, but as few times as possible |
{n,m}? | Repeat n to m times, but as few times as possible |
{n,}? | Repeat n or more times, but as few times as possible |
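To see each lazy qualifier next to its greedy counterpart, here is a short sketch with made-up inputs:

```javascript
const html = "<b>one</b><b>two</b>";

// Greedy .* runs to the LAST "</b>"; lazy .*? stops at the FIRST one.
const greedyTag = html.match(/<b>.*<\/b>/)[0];
const lazyTag = html.match(/<b>.*?<\/b>/)[0];

// The same contrast for the other quantifiers on "aaaa".
const greedyPlus = "aaaa".match(/a+/)[0];
const lazyPlus = "aaaa".match(/a+?/)[0];
const greedyRange = "aaaa".match(/a{2,3}/)[0];
const lazyRange = "aaaa".match(/a{2,3}?/)[0];
const lazyOptional = "aaaa".match(/a??/)[0];

console.log(greedyTag);                // "<b>one</b><b>two</b>"
console.log(lazyTag);                  // "<b>one</b>"
console.log(greedyPlus, lazyPlus);     // "aaaa" "a"
console.log(greedyRange, lazyRange);   // "aaa" "aa"
console.log(lazyOptional.length);      // 0 (prefers zero repetitions)
```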
Balancing groups / recursive matching
All the matching described so far is linear, and none of it helps with nested, hierarchical structures such as (100 * (50 + 15)). You cannot know in advance how many times parentheses occur, and the numbers of opening and closing parentheses may not be equal. How do you match the content between the longest pair of matched parentheses?
The following syntax constructs (available in the .NET regular expression engine) are required:
- (?'group'exp) names the captured content group and pushes it onto the stack
- (?'-group'exp) pops the most recently pushed capture named group off the stack; if the stack is already empty, the match for this group fails
- (?(group)yes|no) if a capture named group is on the stack, continues matching with the yes expression, otherwise with the no expression
- (?!) a zero-width negative lookahead assertion; because it has no exp, an attempt to match it always fails
To keep ( and \( from completely confusing your brain, we use angle brackets instead of parentheses. The question now becomes: how do you capture the contents of the longest paired angle brackets in a string such as xx <aa <bbb> <bbb> aa> yy?
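<                         # outermost opening bracket
    [^<>]*                # non-bracket content after the outermost opening bracket
    (
        (
            (?'Open'<)    # an opening bracket: write an "Open" on the blackboard
            [^<>]*        # non-bracket content after the opening bracket
        )+
        (
            (?'-Open'>)   # a closing bracket: erase one "Open"
            [^<>]*        # non-bracket content after the closing bracket
        )+
    )*
    (?(Open)(?!))         # before the outermost closing bracket, check whether any "Open" is left on the blackboard; if so, the match fails
>                         # outermost closing bracket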
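JavaScript regular expressions have no balancing groups, so the expression above cannot be run there as written. As a rough stand-in, the same "blackboard" idea can be written by hand with an explicit stack. This is a swapped-in technique, not the regex engine's algorithm, and longestBalanced is a hypothetical function name of my own.

```javascript
// Returns the longest balanced <...> span in `input`, or null if none exists.
function longestBalanced(input) {
  const openPositions = []; // the "blackboard": one entry per unmatched "<"
  let best = null;

  for (let i = 0; i < input.length; i++) {
    const ch = input[i];
    if (ch === "<") {
      openPositions.push(i); // write an "Open"
    } else if (ch === ">" && openPositions.length > 0) {
      const start = openPositions.pop(); // erase the most recent "Open"
      const candidate = input.slice(start, i + 1);
      if (best === null || candidate.length > best.length) {
        best = candidate;
      }
    }
    // A ">" with an empty stack has no partner and is skipped.
  }
  return best;
}

console.log(longestBalanced("xx <aa <bbb> <bbb> aa> yy")); // "<aa <bbb> <bbb> aa>"
console.log(longestBalanced("<a <b>"));                    // "<b>"
console.log(longestBalanced("no brackets"));               // null
```

Every span this function considers starts at a "<" and ends at the ">" that closes it, so whatever it returns is balanced by construction.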
More
Common escape characters and some other metacharacters
Code/syntax | Description |
\a | Alert character (printing it makes the computer beep) |
\b | Usually a word boundary, but represents backspace when used inside a character class |
\t | Tab |
\r | Carriage return |
\v | Vertical tab |
\f | Form feed |
\n | Line feed |
\e | Escape |
\0nn | The character whose ASCII code is nn in octal |
\xnn | The character whose ASCII code is nn in hexadecimal |
\unnnn | The character whose Unicode code is nnnn in hexadecimal |
\cN | ASCII control character; for example, \cC represents Ctrl+C |
\A | Start of the string (like ^, but not affected by the multiline option) |
\Z | End of the string or end of the line (not affected by the multiline option) |
\z | End of the string (like $, but not affected by the multiline option) |
\G | Start of the current search |
\p{name} | The Unicode character class named name, such as \p{IsGreek} |
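Not every entry in this table exists in every engine. As a sketch of the subset that JavaScript does accept: \A, \Z, \z, and \G are not available there, and Unicode classes are spelled differently (\p{Script=Greek} with the u flag, rather than \p{IsGreek}). The test characters below are my own choices.

```javascript
// Each entry pairs an escape with a string it should match.
const checks = {
  tab: /\t/.test("a\tb"),                // \t     : tab
  lineFeed: /\n/.test("a\nb"),           // \n     : line feed
  hex: /\x41/.test("A"),                 // \xnn   : 0x41 is "A"
  unicode: /\u4e2d/.test("中"),          // \unnnn : U+4E2D
  control: /\cC/.test("\x03"),           // \cC    : Ctrl+C is code 3
  backspace: /[\b]/.test("\b"),          // \b inside a class: backspace
  greek: /\p{Script=Greek}/u.test("α"),  // Unicode property escape
};

console.log(checks);
// Every value above is true.
```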
Popular understanding of regular expressions