Regular expression syntax (from cainiao tutorial)

Last Update:2018-11-01 Source: Internet

Author: User

Tags printable characters

A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a given substring, to replace matched substrings, or to extract the substrings that meet a certain condition from a string.

For example:

runoo+b matches runoob, runooob, runoooooob, and so on. The + indicates that the preceding character must appear at least once (one or more times).
runoo*b matches runob, runoob, runoooooob, and so on. The * indicates that the preceding character may be absent, or may appear once or many times (zero or more times).
colou?r matches color or colour. The ? indicates that the preceding character may appear at most once (zero or one time).
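The three examples above can be checked with a short script. This is a minimal sketch that assumes Python's `re` module; the helper name `matches` is made up for illustration and is not part of the tutorial.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# + : the preceding "o" must appear at least once
print(matches(r"runoo+b", "runoob"))      # True
print(matches(r"runoo+b", "runob"))       # False

# * : the preceding "o" may appear zero or more times
print(matches(r"runoo*b", "runob"))       # True
print(matches(r"runoo*b", "runoooooob"))  # True

# ? : the preceding "u" may appear zero or one time
print(matches(r"colou?r", "color"))       # True
print(matches(r"colou?r", "colour"))      # True
print(matches(r"colou?r", "colouur"))     # False
```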

Regular expressions are constructed in the same way as mathematical expressions: metacharacters and operators combine small expressions into larger ones. A component of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these.

A regular expression is a text pattern made up of common characters (such as the letters a to z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching text. A regular expression acts as a template that matches a character pattern against the string being searched.

Common characters

Common characters include all printable and non-printable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all numbers, all punctuation marks, and some other symbols.

Non-printable characters

Non-printable characters can also be part of a regular expression. The following table lists escape sequences for non-printable characters:

Character	Description
\cx	Matches the control character specified by x. For example, \cM matches Control-M, or carriage return. The value of x must be in A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a line feed. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including spaces, tabs, and form feeds. Equivalent to [ \f\n\r\t\v]. Note that Unicode-aware regular expressions also match the full-width space character.
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.
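Some of the escapes in the table can be tried out as follows. This is a sketch that assumes Python's `re` module; it covers only \s, \S, and the hexadecimal equivalents, and the sample string is invented for illustration.

```python
import re

sample = "a\tb\nc d"

# \s matches each whitespace character; \S matches everything else.
whitespace = re.findall(r"\s", sample)
non_whitespace = re.findall(r"\S", sample)
print(whitespace)       # ['\t', '\n', ' ']
print(non_whitespace)   # ['a', 'b', 'c', 'd']

# Each control character is also matched by its hexadecimal escape.
equivalents = {"\t": r"\x09", "\n": r"\x0a", "\r": r"\x0d",
               "\f": r"\x0c", "\v": r"\x0b"}
all_equal = all(re.fullmatch(hex_pattern, char) is not None
                for char, hex_pattern in equivalents.items())
print(all_equal)        # True

# With Python's Unicode string patterns, \s also matches the full-width space.
full_width = re.fullmatch(r"\s", "\u3000") is not None
print(full_width)       # True
```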

Special characters

Special characters are characters with a special meaning, such as the * in runoo*b mentioned above, which stands for any number of repetitions of the preceding character. To find a literal * in a string, you need to escape it by putting a \ before it: runo\*ob matches runo*ob.

Many metacharacters need this special treatment when you want to match the characters themselves. To match one of these special characters, you must first escape it, that is, put the backslash character \ before it. The following table lists the special characters in regular expressions:

Special characters	Description
$	Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )	Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*	Matches the preceding subexpression zero or more times. To match the * character, use \*.
+	Matches the preceding subexpression one or more times. To match the + character, use \+.
.	Matches any single character except the line feed \n. To match ., use \..
[	Marks the start of a bracket expression. To match [, use \[.
?	Matches the preceding subexpression zero or one time, or marks a qualifier as non-greedy. To match the ? character, use \?.
\	Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', and '\n' matches a line feed. The sequence '\\' matches "\", while '\(' matches "(".
^	Matches the start position of the input string. Inside a bracket expression it instead negates the character set. To match the ^ character itself, use \^.
{	Marks the start of a qualifier expression. To match {, use \{.
|	Specifies a choice between two items. To match |, use \|.
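The effect of escaping can be seen by comparing an escaped and an unescaped pattern. This sketch assumes Python's `re` module; `re.escape` is a Python convenience that is not mentioned in the tutorial, and the sample strings are made up.

```python
import re

# Escaped: \* is a literal asterisk, so the pattern finds "runo*ob".
escaped_hit = re.search(r"runo\*ob", "runo*ob") is not None
print(escaped_hit)      # True

# Unescaped: * is a qualifier applied to "o", so the literal "*" never matches.
unescaped_hit = re.search(r"runo*ob", "runo*ob") is not None
print(unescaped_hit)    # False

# \$ matches a literal dollar sign instead of the end of the input.
prices = re.findall(r"\$\d+", "cost: $15 or $20")
print(prices)           # ['$15', '$20']

# re.escape() adds the backslashes for you when the text is not known in advance.
literal = "1+1=2?"
escape_hit = re.fullmatch(re.escape(literal), literal) is not None
print(escape_hit)       # True
```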

Qualifiers

A qualifier specifies how many times a given component of a regular expression must occur for a match. There are six qualifiers: *, +, ?, {n}, {n,}, and {n,m}.

Regular expressions have the following qualifiers:

Character	Description
*	Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do", the "does" in "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}	Both m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
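The table's examples for the brace qualifiers can be reproduced directly. This is a sketch that assumes Python's `re` module; the last check relies on Python's behaviour of treating a malformed brace expression as literal text, which other regex engines may not share.

```python
import re

def first_match(pattern: str, text: str):
    """Return the first matched substring, or None if there is no match."""
    found = re.search(pattern, text)
    return found.group() if found else None

print(first_match(r"o{2}", "Bob"))        # None
print(first_match(r"o{2}", "food"))       # oo
print(first_match(r"o{2,}", "foooood"))   # ooooo
print(first_match(r"o{1,3}", "fooooood")) # ooo
print(first_match(r"do(es)?", "does"))    # does
print(first_match(r"do(es)?", "doxy"))    # do

# A space after the comma breaks the qualifier: in Python the braces are
# then read as literal characters, so nothing in "fooo" matches.
print(first_match(r"o{1, 3}", "fooo"))    # None
```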

Because chapter numbers may well exceed nine in a large input document, you need a way to handle two- or three-digit chapter numbers. Qualifiers give you this ability. The following regular expression matches chapter headings numbered with any number of digits:

/Chapter [1-9][0-9]*/

Note that the qualifier appears after the range expression, so it applies to the whole range expression, which in this example specifies only the digits 0 through 9 inclusive.

The + qualifier is not used here, because there is not necessarily a digit in the second or later position. The ? character is not used either, because it would limit the chapter number to two digits. At least one digit must match after Chapter and the space character.

If you know that chapter numbers are limited to 99, you can use the following expression to require at least one but at most two digits.

/Chapter [0-9]{1,2}/

The disadvantage of the expression above is that a chapter number greater than 99 still matches, but only on its first two digits. Another drawback is that Chapter 0 also matches. Better expressions that match at most two digits, the first of which cannot be 0, are as follows:

/Chapter [1-9][0-9]?/

/Chapter [1-9][0-9]{0,1}/
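The differences between the chapter-number expressions show up when they are run against a few headings. This is a sketch that assumes Python's `re` module (the slashes of the /.../ notation are dropped); the variable names and sample headings are invented for illustration.

```python
import re

any_digits = re.compile(r"Chapter [1-9][0-9]*")   # any number of digits
up_to_99 = re.compile(r"Chapter [0-9]{1,2}")      # one or two digits, 0 allowed
better = re.compile(r"Chapter [1-9][0-9]?")       # one or two digits, no leading 0

def found(pattern: re.Pattern, text: str):
    """Return the matched part of text, or None."""
    hit = pattern.search(text)
    return hit.group() if hit else None

print(found(any_digits, "Chapter 123"))  # Chapter 123
print(found(any_digits, "Chapter 0"))    # None

# Drawbacks of the {1,2} form: 123 matches on its first two digits, and 0 matches.
print(found(up_to_99, "Chapter 123"))    # Chapter 12
print(found(up_to_99, "Chapter 0"))      # Chapter 0

# The [1-9][0-9]? form rejects Chapter 0 but still truncates 123.
print(found(better, "Chapter 0"))        # None
print(found(better, "Chapter 123"))      # Chapter 12
```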

The * and + qualifiers are greedy: they match as much text as possible. Adding a ? after them makes the match non-greedy, or minimal.

For example, you might search an HTML document for chapter headings enclosed in H1 tags. The text in your document looks like this:

<H1>Chapter 1 - Introduction to Regular Expressions</H1>

Greedy: The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that closes the H1 tag.

/<.*>/

Non-greedy: If you only need to match the opening and closing H1 tags, the following non-greedy expression matches only <H1>.

/<.*?>/

If you only want to match the start H1 tag, the expression is:

/<\w+?>/

Placing a ? after the *, +, or ? qualifier converts the expression from a "greedy" expression to a "non-greedy" one, that is, to a minimal match.
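The difference between the greedy and non-greedy forms can be seen on a single heading. This sketch assumes Python's `re` module and a sample heading that is closed with </H1>; the heading text itself is made up.

```python
import re

heading = "<H1>Chapter 1 - Introduction to Regular Expressions</H1>"

# Greedy: .* runs to the last ">" in the text, so the whole line matches.
greedy = re.search(r"<.*>", heading).group()
print(greedy == heading)   # True

# Non-greedy: .*? stops at the first ">" it can, giving only the opening tag.
lazy = re.search(r"<.*?>", heading).group()
print(lazy)                # <H1>

# \w+? cannot cross the "/" of the closing tag, so only the opening tag fits.
opening = re.search(r"<\w+?>", heading).group()
print(opening)             # <H1>

# findall with the non-greedy form returns both tags separately.
tags = re.findall(r"<.*?>", heading)
print(tags)                # ['<H1>', '</H1>']
```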

Locators

Locators let you fix a regular expression to the beginning or end of a line. They also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word.

A locator describes the boundary of a string or word. ^ and $ refer to the start and end of a string respectively, \b describes the boundary before or after a word, and \B indicates a non-word boundary.

Regular expressions have the following locators:

Character	Description
^	Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$	Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b	Matches a word boundary, that is, the position between a word and a space.
\B	Matches a non-word boundary.

Note: Qualifiers cannot be used with locators. Because there cannot be more than one position immediately before or after a line break or word boundary, expressions such as ^* are not allowed.

To match text at the beginning of a line, use the ^ character at the beginning of the regular expression. Do not confuse this use of ^ with its use inside a bracket expression.

To match the text at the end of a line of text, use the $ character at the end of the regular expression.

To use locators when searching for chapter headings, the following regular expression matches a chapter heading that has at most two trailing digits and appears at the beginning of a line:

/^Chapter [1-9][0-9]{0,1}/

A real chapter heading not only appears at the beginning of a line, it is also the only text on that line: it starts at the beginning of the line and ends at the end of the same line. The following expression ensures that the match is a chapter heading and not a cross reference, by matching both the beginning and the end of a line of text.

/^Chapter [1-9][0-9]{0,1}$/
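The two anchored expressions can be compared on a small multi-line text. This sketch assumes Python's `re` module, where the `re.MULTILINE` flag plays the role of the Multiline property; the sample document is invented.

```python
import re

document = (
    "Chapter 1\n"
    "See Chapter 2 for details\n"
    "Chapter 12\n"
    "Chapter 3 continues here\n"
)

# ^ alone: the heading must start a line, but other text may follow it.
line_start = re.findall(r"^Chapter [1-9][0-9]{0,1}", document, re.MULTILINE)
print(line_start)   # ['Chapter 1', 'Chapter 12', 'Chapter 3']

# ^ and $ together: the heading must be the only text on its line.
whole_line = re.findall(r"^Chapter [1-9][0-9]{0,1}$", document, re.MULTILINE)
print(whole_line)   # ['Chapter 1', 'Chapter 12']

# Without the multiline flag, ^ and $ refer to the whole input, so nothing matches.
no_flag = re.findall(r"^Chapter [1-9][0-9]{0,1}$", document)
print(no_flag)      # []
```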

Matching at a word boundary is slightly different, but it adds a very important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter, because they appear right after a word boundary:

/\bCha/

The position of the \b operator is very important. If it is at the beginning of the string to match, it looks for the match at the beginning of a word. If it is at the end of the string, it looks for the match at the end of a word. For example, the following expression matches the string ter in the word Chapter, because it appears before a word boundary:

/ter\b/

The following expression matches the string apt in Chapter, but does not match the string apt in aptitude:

/\Bapt/

The string apt occurs at a non-word boundary in the word Chapter, but at a word boundary in the word aptitude. For the \B non-word-boundary operator, position is not important, because the match does not care whether it is at the beginning or the end of a word.
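The three boundary expressions above behave as follows. This is a sketch that assumes Python's `re` module, whose \b and \B have the meaning described here; the helper name `hit` is made up.

```python
import re

def hit(pattern: str, text: str):
    """Return the matched substring, or None if the pattern does not match."""
    found = re.search(pattern, text)
    return found.group() if found else None

# \b at the start: "Cha" must begin a word.
print(hit(r"\bCha", "Chapter"))    # Cha

# \b at the end: "ter" must end a word.
print(hit(r"ter\b", "Chapter"))    # ter
print(hit(r"ter\b", "terminal"))   # None

# \B: "apt" must not start at a word boundary.
print(hit(r"\Bapt", "Chapter"))    # apt
print(hit(r"\Bapt", "aptitude"))   # None
```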

Selection

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the related match is cached. You can put ?: before the first alternative to eliminate this side effect.

?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern in the parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern in the parentheses does not begin to match.
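The difference between capturing, non-capturing, and lookahead groups can be sketched as follows, assuming Python's `re` module. The patterns and sample strings are illustrative choices, not taken from this text.

```python
import re

# Capturing alternation: the chosen alternative is stored as group 1.
capturing = re.search(r"industr(y|ies)", "industries")
print(capturing.group())    # industries
print(capturing.groups())   # ('ies',)

# Non-capturing alternation: same match, but nothing is stored.
non_capturing = re.search(r"industr(?:y|ies)", "industries")
print(non_capturing.group())    # industries
print(non_capturing.groups())   # ()

# Positive lookahead: "Windows" only when a listed version follows.
# The version is checked but is not part of the match.
positive = re.search(r"Windows(?=95|98|NT|2000)", "Windows2000")
print(positive.group())     # Windows

# Negative lookahead: "Windows" only when no listed version follows.
negative_hit = re.search(r"Windows(?!95|98|NT|2000)", "Windows3.1")
negative_miss = re.search(r"Windows(?!95|98|NT|2000)", "Windows2000")
print(negative_hit.group()) # Windows
print(negative_miss)        # None
```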

Backreferences

Putting parentheses around a regular expression, or around part of one, causes the corresponding match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears from left to right in the pattern. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed with \n, where n is a one- or two-digit decimal number that identifies a specific buffer.

You can use the non-capturing metacharacters ?:, ?=, or ?! to override capturing and skip saving the related match.

One of the simplest and most useful applications of backreferences is finding two identical adjacent words in text.

The captured expression, specified by [a-z]+, contains one or more letters. The second part of the regular expression is a reference to the previously captured submatch: the second occurrence of the word is matched by whatever the parenthesized expression matched. \1 specifies the first submatch.

The word boundary metacharacters ensure that only whole words are detected. Without them, phrases such as "is issued" or "this is" would be wrongly identified by this expression.

The global flag g after the regular expression specifies that the expression is applied to the input string to find as many matches as possible.

The i flag at the end of the expression specifies case-insensitive matching.

The multiline flag specifies that potential matches may occur on either side of a line break.
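The expression being described is not reproduced in this text. A sketch that is consistent with the description (a captured [a-z]+, a \1 backreference, and word boundaries on both ends) might look like the following. It assumes Python's `re` module, where `finditer` stands in for the g flag and `re.IGNORECASE` for the i flag; the sample sentence is made up.

```python
import re

# Assumed reconstruction: a word, one space, then the same word again.
repeated_word = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

sentence = "Is is the cost of of gasoline going up up?"
repeats = [hit.group() for hit in repeated_word.finditer(sentence)]
print(repeats)   # ['Is is', 'of of', 'up up']

# Without the word boundaries, parts of different words are reported as repeats.
no_boundaries = re.compile(r"([a-z]+) \1", re.IGNORECASE)
false_hit = no_boundaries.search("this is").group()
print(false_hit)                          # is is
print(repeated_word.search("this is"))    # None
print(repeated_word.search("is issued"))  # None
```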

Backreferences can also break a Uniform Resource Identifier (URI) down into its components.
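As a rough sketch of that idea, the pattern below captures four parts of a simple URI in separate groups. It assumes Python's `re` module; the pattern, the function name `split_uri`, and the example URLs are illustrative assumptions, and the pattern is far too loose to validate real URIs.

```python
import re

# Groups: 1 = scheme, 2 = host, 3 = optional ":port", 4 = path
URI_PATTERN = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")

def split_uri(uri: str):
    """Return the captured parts of uri as a dict, or None if it does not match."""
    found = URI_PATTERN.match(uri)
    if found is None:
        return None
    scheme, host, port, path = found.groups()
    return {"scheme": scheme, "host": host, "port": port, "path": path}

with_port = split_uri("http://www.example.com:80/docs/index.html")
print(with_port["scheme"])   # http
print(with_port["host"])     # www.example.com
print(with_port["port"])     # :80
print(with_port["path"])     # /docs/index.html

# The port group is optional; when absent it is captured as None.
without_port = split_uri("https://example.org/a")
print(without_port["port"])  # None
```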

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.