Regular Expression Syntax

Last Update:2017-01-18 Source: Internet

Author: User

A regular expression describes a string-matching pattern. It can be used to check whether a string contains a given substring, to replace a matching substring, or to extract from a string the substring that meets some condition.

When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there has a different meaning from the * in a regular expression.
Regular expressions are constructed in the same way as mathematical expressions: metacharacters and operators combine small expressions into larger ones. A component of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these.
A regular expression is a text pattern composed of ordinary characters, such as the letters a through z, and special characters called "metacharacters". The pattern describes one or more strings to match when searching text. The regular expression acts as a template that is matched against the string being searched.

Ordinary characters
Ordinary characters include all printable and nonprinting characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Non-printable characters
Nonprinting characters can also be part of a regular expression. The following table lists the escape sequences that represent nonprinting characters:

Character	Description
\cx	Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f	Matches a form feed character. Equivalent to \x0c and \cL.
\n	Matches a line feed (newline) character. Equivalent to \x0a and \cJ.
\r	Matches a carriage return character. Equivalent to \x0d and \cM.
\s	Matches any white space character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-white-space character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab character. Equivalent to \x09 and \cI.
\v	Matches a vertical tab character. Equivalent to \x0b and \cK.
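
As a minimal sketch, some of the escape sequences in the table can be tried out with Python's re module. Python is an assumption here (the article's own patterns use JavaScript-style /pattern/ literals), and the sample text is made up for the illustration. The example sticks to escapes that Python's re module accepts and leaves out the \cx form.

```python
import re

# Hypothetical sample text containing a tab, CR, LF, space,
# vertical tab (\x0b) and form feed (\x0c).
text = "col1\tcol2\r\nnext line\x0b\x0cend"

# \s matches any whitespace character; \S matches any non-whitespace character.
whitespace = re.findall(r"\s", text)
words = re.findall(r"\S+", text)

# \t and \x09 describe the same tab character.
tab_positions = [m.start() for m in re.finditer(r"\t", text)]
hex_tab_positions = [m.start() for m in re.finditer(r"\x09", text)]

print(whitespace)
print(words)
print(tab_positions == hex_tab_positions)
```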

Special characters

Special characters are characters with a special meaning, such as the * in "*.txt" above, which simply stands for any string. To find files whose names contain a literal *, you need to escape the * by preceding it with a backslash: ls \*.txt.

Many metacharacters require special treatment when you want to match them literally. To match one of these special characters, you must first "escape" it, that is, precede it with the backslash character (\). The following table lists the special characters in regular expressions:

Special character	Description
$	Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )	Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
*	Matches the preceding subexpression zero or more times. To match the * character, use \*.
+	Matches the preceding subexpression one or more times. To match the + character, use \+.
.	Matches any single character except the newline character \n. To match a literal ., use \.
[	Marks the beginning of a bracket expression. To match [, use \[.
?	Matches the preceding subexpression zero or one time, or indicates a non-greedy qualifier. To match the ? character, use \?.
\	Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches '\' and '\(' matches '('.
^	Matches the start position of the input string, unless it is used in a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{	Marks the beginning of a qualifier expression. To match {, use \{.
|	Indicates a choice between two items. To match |, use \|.
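
The effect of escaping can be sketched as follows, assuming Python's re module (the article itself does not name a language for running its patterns). The sample string is hypothetical.

```python
import re

text = "1+1=2"

# Unescaped, + is a qualifier: the pattern means one or more "1" followed by "1".
unescaped = re.search(r"1+1", text)

# Escaped, \+ matches a literal plus sign.
escaped = re.search(r"1\+1", text)

# re.escape escapes the metacharacters of a literal string for you.
auto = re.search(re.escape("1+1"), text)

print(unescaped)
print(escaped.group())
print(auto.group())
```

The unescaped pattern finds nothing in this text because there are no two adjacent "1" characters, while both escaped forms find the literal "1+1".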

Qualifiers

A qualifier specifies how many times a given component of a regular expression must appear for a match to succeed. There are six kinds: *, +, ?, {n}, {n,} and {n,m}.

The qualifiers for regular expressions are:

Character	Description
*	Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+	Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?	Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" as well as "does". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}	m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
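
The qualifiers in the table can be compared side by side with a small sketch. It assumes Python's re module, and the helper name matches is made up for this illustration; re.fullmatch is used so that a candidate counts only when the whole string fits the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["z", "zo", "zoo", "zooo"]

star = matches(r"zo*", candidates)          # zero or more "o"
plus = matches(r"zo+", candidates)          # one or more "o"
optional = matches(r"zo?", candidates)      # zero or one "o"
exactly_two = matches(r"zo{2}", candidates)   # exactly two "o"
at_least_two = matches(r"zo{2,}", candidates) # two or more "o"
one_to_two = matches(r"zo{1,2}", candidates)  # one or two "o"

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exactly_two), ("{2,}", at_least_two),
                     ("{1,2}", one_to_two)]:
    print(name, result)
```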

Because chapter numbers are likely to exceed nine in a large input document, you need a way to handle two-digit or three-digit chapter numbers. Qualifiers give you this ability. The following regular expression matches a chapter heading with a number of any length:

/Chapter [1-9][0-9]*/

Notice that the qualifier appears after the range expression. It therefore applies to the whole range expression, which in this case specifies only the digits 0 through 9, inclusive.

The + qualifier is not used here because there does not necessarily have to be a digit in the second or any later position. The ? character is not used either, because it would limit the chapter number to two digits. You need to match at least one digit following "Chapter" and a space character.

If you know that chapter numbers are limited to 99, you can use the following expression to require at least one but at most two digits.

/Chapter [0-9]{1,2}/

The disadvantage of this expression is that a chapter number greater than 99 still matches, but only its first two digits. Another disadvantage is that Chapter 0 also matches. Better expressions, which match at most two digits and reject a leading zero, are the following:

/Chapter [1-9][0-9]?/

/Chapter [1-9][0-9]{0,1}/
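
The difference between the three chapter-number patterns can be checked with a short sketch. It assumes Python's re module; the heading strings and the helper first_match are hypothetical.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None when nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

headings = ["Chapter 0", "Chapter 7", "Chapter 42", "Chapter 123"]

# Any number of digits, no leading zero.
any_digits = [first_match(r"Chapter [1-9][0-9]*", h) for h in headings]
# One or two digits, leading zero allowed.
one_or_two = [first_match(r"Chapter [0-9]{1,2}", h) for h in headings]
# One or two digits, no leading zero.
no_leading_zero = [first_match(r"Chapter [1-9][0-9]?", h) for h in headings]

print(any_digits)
print(one_or_two)
print(no_leading_zero)
```

Note how the two-digit patterns still match "Chapter 123", but only up to "Chapter 12", which is the drawback described above.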

The *, + and ? qualifiers are greedy: they match as much text as possible. Placing a ? after any of them makes the match non-greedy, or minimal.

For example, you might search an HTML document for chapter headings enclosed in H1 tags, that is, text that sits between an opening <H1> tag and a closing </H1> tag.

The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that ends the closing H1 tag.

/<.*>/

If you only need to match the opening H1 tag, the following "non-greedy" expression matches only <H1>.

/<.*?>/

Placing a ? after a *, + or ? qualifier converts the expression from a greedy match to a non-greedy, or minimal, match.
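
A minimal sketch of the greedy and non-greedy behaviour, assuming Python's re module. The heading string is a hypothetical sample, not text taken from the article.

```python
import re

# Hypothetical sample heading wrapped in H1 tags.
html = "<H1>Chapter 1</H1>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<.*>", html).group()

# Non-greedy: .*? stops at the first ">" it can.
lazy = re.search(r"<.*?>", html).group()

print(greedy)
print(lazy)
```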

Locators

A locator (anchor) lets you pin a regular expression to the beginning or end of a line. Locators also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word.

Locators describe the boundaries of a string or a word: ^ and $ refer to the beginning and end of a string, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary.

The locators for regular expressions are:

Character	Description
^	Matches the position where the input string starts. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$	Matches the position where the input string ends. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b	Matches a word boundary, that is, the position between a word and a space.
\B	Matches a non-word boundary.

Note: You cannot use qualifiers with locators. Because there cannot be more than one position immediately before or after a newline or word boundary, expressions such as ^* are not allowed.

To match text at the beginning of a line, use the ^ character at the start of the regular expression. Do not confuse this use of ^ with its use inside a bracket expression.

To match text at the end of a line, use the $ character at the end of the regular expression.

To use anchors when searching for chapter headings, the following regular expression matches a chapter heading with at most two digits that appears at the beginning of a line:

/^Chapter [1-9][0-9]{0,1}/

A true chapter heading not only appears at the beginning of a line, it is also the only text on that line: it sits at both the beginning and the end of the same line. The following expression ensures that the match finds only chapter headings and not cross-references, by anchoring the pattern to both the beginning and the end of a line of text.

/^Chapter [1-9][0-9]{0,1}$/
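
The two anchored patterns can be compared on a small, hypothetical document. This sketch assumes Python's re module, where the re.MULTILINE flag plays the role of the Multiline property mentioned above, letting ^ and $ match at line breaks.

```python
import re

# Hypothetical document: two real headings, one cross-reference,
# and one line that merely starts like a heading.
document = (
    "Chapter 1\n"
    "See Chapter 2 for details\n"
    "Chapter 12\n"
    "Chapter 3 continued"
)

# Anchored at the start of a line only.
start_only = re.findall(r"^Chapter [1-9][0-9]{0,1}", document, flags=re.MULTILINE)

# Anchored at both ends: the heading must be the only text on its line.
whole_line = re.findall(r"^Chapter [1-9][0-9]{0,1}$", document, flags=re.MULTILINE)

print(start_only)
print(whole_line)
```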

Matching word boundaries works a little differently, but it adds an important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter, because they appear right after a word boundary:

/\bCha/

The position of the \b operator is very important. If it is at the beginning of the pattern, it looks for a match at the beginning of a word. If it is at the end of the pattern, it looks for a match at the end of a word. For example, the following expression matches the string ter in the word Chapter, because it appears right before a word boundary:

/ter\b/

The following expression matches the string apt in Chapter, but not the string apt in aptitude:

/\Bapt/

The string apt occurs at a non-word boundary in the word Chapter, but at a word boundary in the word aptitude. For the \B operator, position is not important, because the match does not care whether it is at the beginning or the end of a word.
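
The boundary operators can be illustrated by looking at where each pattern matches. This is a sketch assuming Python's re module; the two-word sample string is made up so that "apt" occurs once inside a word and once at the start of a word.

```python
import re

text = "Chapter aptitude"

# \b at the front: "Cha" must start a word.
word_start = [m.start() for m in re.finditer(r"\bCha", text)]

# \b at the back: "ter" must end a word.
word_end = [m.start() for m in re.finditer(r"ter\b", text)]

# \bapt finds "apt" only at the start of a word (in "aptitude").
apt_at_boundary = [m.start() for m in re.finditer(r"\bapt", text)]

# \Bapt finds "apt" only inside a word (in "Chapter").
apt_inside_word = [m.start() for m in re.finditer(r"\Bapt", text)]

print(word_start, word_end, apt_at_boundary, apt_inside_word)
```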

Selection

Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the matched text is captured and stored. Putting ?: before the first alternative removes this side effect.

?: is one of the non-capturing elements; the other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where the pattern inside the parentheses begins to match. The latter is a negative lookahead: it matches the search string at any position where the pattern inside the parentheses does not begin to match.
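
The following sketch shows capturing versus non-capturing groups and the two lookahead forms. It assumes Python's re module, and the sample strings are hypothetical.

```python
import re

# Capturing group: the chosen alternative is stored as group 1.
capturing = re.search(r"(b|cd)ef", "xcdef")

# Non-capturing group: same overall match, but nothing is stored.
non_capturing = re.search(r"(?:b|cd)ef", "xcdef")

versions = "Windows95 Windows2000 Windows98"

# Positive lookahead: "Windows" only where "95" or "98" follows.
positive = [m.start() for m in re.finditer(r"Windows(?=95|98)", versions)]

# Negative lookahead: "Windows" only where neither "95" nor "98" follows.
negative = [m.start() for m in re.finditer(r"Windows(?!95|98)", versions)]

print(capturing.group(), capturing.groups())
print(non_capturing.group(), non_capturing.groups())
print(positive, negative)
```

In both lookahead cases the matched text is just "Windows"; the version number is checked but not consumed.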

Backreferences

Adding parentheses around a regular expression pattern, or part of one, causes the matched text to be stored in a temporary buffer. Each captured submatch is stored in the order in which it appears, from left to right, in the pattern. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed using '\n', where n is a one-digit or two-digit decimal number that identifies a particular buffer.

You can use the non-capturing metacharacters '?:', '?=' or '?!' to override capturing, so that the matched text is not saved.

One of the simplest and most useful applications of backreferences is finding two identical adjacent words in text. Take the following sentence as an example:

Is is the cost of of gasoline going up up?

The sentence above clearly contains several duplicated words. It would be nice to have a way to fix the sentence without looking up the duplicates of every word one by one. The following regular expression does this with a single subexpression:

/\b([a-z]+) \1\b/gi

The captured expression, as specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured submatch, that is, the second occurrence of the word just matched by the parenthesized expression. \1 refers to the first submatch. The word boundary metacharacters ensure that only whole words are detected. Without them, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.

The global flag (g) after the regular expression indicates that the expression is applied to as many matches as can be found in the input string. The case-insensitivity flag (i) at the end of the expression makes the match case-insensitive. The multiline flag (m) specifies that potential matches may occur on either side of a newline character.
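
A minimal sketch of the duplicated-word search, assuming Python's re module. Here re.IGNORECASE stands in for the i flag, and finditer/sub cover all matches the way the g flag does. The sample sentence is an assumption chosen to contain repeated adjacent words.

```python
import re

# Sample sentence with doubled words (assumed for this illustration).
sentence = "Is is the cost of of gasoline going up up?"

doubled = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

# Every doubled word pair found in the sentence.
pairs = [m.group() for m in doubled.finditer(sentence)]

# Replace each pair with the single captured word.
fixed = doubled.sub(r"\1", sentence)

# Without the word boundaries, "This is" is wrongly reported as "is is".
false_positive = re.search(r"([a-z]+) \1", "This is fine", re.IGNORECASE)
with_boundaries = doubled.search("This is fine")

print(pairs)
print(fixed)
print(false_positive.group(), with_boundaries)
```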

A backreference can also break a Uniform Resource Identifier (URI) down into its components. Suppose you want to decompose the following URI into protocol (FTP, HTTP, and so on), domain address, and page/path:

http://www.jb51.net:80/html/html-tutorial.html

The following regular expression provides this functionality:

/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/

The first parenthesized subexpression captures the protocol part of the web address. It matches any word that precedes a colon and two forward slashes. The second parenthesized subexpression captures the domain address. It matches one or more characters other than / and :. The third parenthesized subexpression captures the port number, if one is specified. It matches zero or more digits following a colon, and it can occur at most once. Finally, the fourth parenthesized subexpression captures the path and/or page information of the web address. It matches any sequence of characters that does not include # or a space.

Applying the regular expression to the URI above, the submatches contain the following:

The first parenthesized subexpression contains "http"
The second parenthesized subexpression contains "www.jb51.net"
The third parenthesized subexpression contains ":80"
The fourth parenthesized subexpression contains "/html/html-tutorial.html"
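
The decomposition can be reproduced with a short sketch, assuming Python's re module. In a Python raw string the forward slashes need no escaping, so \/\/ is written as //. The second URI, without a port, is a hypothetical variation.

```python
import re

uri_pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")

uri = "http://www.jb51.net:80/html/html-tutorial.html"
protocol, domain, port, path = uri_pattern.match(uri).groups()

# When the optional port group does not take part in the match,
# Python reports it as None.
no_port = uri_pattern.match("http://www.jb51.net/html/html-tutorial.html").groups()

print(protocol, domain, port, path)
print(no_port)
```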

Regular expressions are very powerful. The following is a recap of their basic syntax:

First, let's look at two special symbols, '^' and '$'. They mark the beginning and the end of a string, respectively. For example:

"^The": matches any string that starts with "The" ("There", "The cat", and so on);
"of despair$": matches any string that ends with "of despair";
"^abc$": matches a string that both starts and ends with "abc", which can only be "abc" itself;
"notice": matches any string that contains "notice".

As the last example shows, if you use neither of the two special characters, the text you are looking for can be anywhere in the string being searched; it is not pinned to either end.
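
The four examples can be checked against a few sample strings. This is a sketch assuming Python's re module; the sample strings are made up.

```python
import re

samples = ["There once was", "a cry of despair", "abc", "please take notice now"]
patterns = [r"^The", r"of despair$", r"^abc$", r"notice"]

# For each pattern, collect the samples in which it finds a match.
results = {p: [s for s in samples if re.search(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```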

Then there are the three symbols '*', '+' and '?', which specify how many times a character or a sequence of characters may occur. They mean "zero or more", "one or more" and "zero or one", respectively. Here are a few examples:

"ab*": matches a string that has an a followed by zero or more b's ("a", "ab", "abbb", ...);
"ab+": matches a string that has an a followed by one or more b's;
"ab?": matches a string that has an a followed by zero or one b;
"a?b+$": matches a string that ends with zero or one a followed by one or more b's.

You can also use bounds, enclosed in curly braces, to specify a range of repetitions.

"ab{2}": matches a string that has an a followed by exactly two b's ("abb");
"ab{2,}": matches a string that has an a followed by at least two b's;
"ab{3,5}": matches a string that has an a followed by three to five b's.

Note that you must always specify the lower bound of the range (for example, "{0,2}" rather than "{,2}"). Also, as you may have noticed, '*', '+' and '?' are equivalent to "{0,}", "{1,}" and "{0,1}". There is also the '|' symbol, which acts as an "or" operator:

"hi|hello": matches a string that contains "hi" or "hello";
"(b|cd)ef": matches a string that contains "bef" or "cdef";
"(a|b)*c": matches a string that contains a mixed sequence of a's and b's followed by a c;

'.' stands for any single character:

"a.[0-9]": matches a string that has an a followed by any character and a digit;
"^.{3}$": matches a string of exactly three characters;

Square brackets specify which characters are allowed at a particular position in a string:

"[ab]": matches a string that contains an a or a b (equivalent to "a|b");
"[a-d]": matches a string that contains one of the lowercase letters a through d (equivalent to "a|b|c|d" or "[abcd]");
"^[a-zA-Z]": matches a string that begins with a letter;
"[0-9]%": matches a string that has a digit before a percent sign;
",[a-zA-Z0-9]$": matches a string that ends with a comma followed by a letter or a digit.

You can also use '^' inside square brackets to list characters that you do not want; the '^' must be the first character inside the brackets. (For example, "%[^a-zA-Z]%" matches a string in which a single character that is not a letter appears between two percent signs.)
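
A few of the bracket expressions above, tried on hypothetical strings in a sketch that assumes Python's re module:

```python
import re

# "^[a-zA-Z]": the string must begin with a letter.
starts_with_letter = [bool(re.search(r"^[a-zA-Z]", s)) for s in ["x-ray", "9 lives"]]

# "[0-9]%": a single digit directly before a percent sign.
percents = re.findall(r"[0-9]%", "up 5%, down 12%")

# ",[a-zA-Z0-9]$": the string ends with a comma plus a letter or digit.
ends_with_item = re.search(r",[a-zA-Z0-9]$", "a,b").group()

# "%[^a-zA-Z]%": one non-letter between two percent signs.
negated = [bool(re.search(r"%[^a-zA-Z]%", s)) for s in ["%7%", "%a%"]]

print(starts_with_letter, percents, ends_with_item, negated)
```

Because "[0-9]%" describes only one digit, the second match in the sample is "2%", not "12%".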

To match the characters "^.[$()|*+?{\" literally, you must precede each of them with the escape character (\).

Note that inside square brackets, most of these characters lose their special meaning and do not need an escape character.
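
As a sketch of the two ways to match metacharacters literally, assuming Python's re module and a made-up sample string: outside brackets each character is escaped, while inside a bracket expression the same characters can be listed as they are. In Python, characters such as \, ] and a leading ^ still need care inside brackets, so they are left out of this example.

```python
import re

text = "cost: $5.00 (approx.) 2+2=4?"

# Outside brackets, each metacharacter is escaped with a backslash.
escaped = re.findall(r"\$|\.|\(|\)|\+|\?", text)

# Inside brackets, the same characters are listed without escaping.
bracketed = re.findall(r"[$.()+?]", text)

print(escaped)
print(bracketed)
```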

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
