Regular Expressions - Syntax
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a given substring, to replace a matched substring, or to extract from a string the substrings that meet a certain condition.
- When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the meaning of * there differs from its meaning in a regular expression.
- A regular expression is constructed the same way as a mathematical expression: metacharacters and operators combine small expressions into larger ones. A component of a regular expression can be a single character, a character set, a range of characters, a choice between characters, or any combination of these.
Regular expressions are text patterns made up of ordinary characters (for example, the letters a through z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching text. A regular expression acts as a template that is matched against the string being searched.
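The three uses mentioned above (checking, replacing, extracting) can be sketched in JavaScript as follows. This is only an illustration: the sample text and the variable names are made up and do not come from the original article.

```javascript
// Sample text and variable names are assumptions made for illustration.
const fileList = "report.txt, notes.md, draft.txt";

// 1. Check whether the string contains a match.
const hasTxt = /\.txt/.test(fileList);

// 2. Replace every matched substring (the g flag means "all matches").
const renamed = fileList.replace(/\.txt/g, ".bak");

// 3. Extract the substrings that satisfy a condition: names ending in .txt.
const txtFiles = fileList.match(/\w+\.txt/g);

console.log(hasTxt);   // true
console.log(renamed);  // report.bak, notes.md, draft.bak
console.log(txtFiles); // [ 'report.txt', 'draft.txt' ]
```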
Ordinary characters
Ordinary characters include all printable and non-printable characters that are not explicitly specified as metacharacters. This includes all uppercase and lowercase letters, all numbers, all punctuation marks, and some other symbols.
Non-printable characters
Nonprinting characters can also be part of a regular expression. The following table lists the escape sequences that represent nonprinting characters:
Character | Description
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f | Matches a form feed (page break). Equivalent to \x0c and \cL.
\n | Matches a line feed (newline). Equivalent to \x0a and \cJ.
\r | Matches a carriage return. Equivalent to \x0d and \cM.
\s | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t | Matches a tab. Equivalent to \x09 and \cI.
\v | Matches a vertical tab. Equivalent to \x0b and \cK.
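A few of the escape sequences in the table can be checked with a short JavaScript sketch. The sample string is made up for illustration, and the behavior shown is that of JavaScript's RegExp; other engines may not support \cx.

```javascript
// Sample string (an assumption): a, TAB, b, CR, LF, c, SPACE, d
const sample = "a\tb\r\nc d";

// \s matches each whitespace character: the tab, CR, LF and the space.
const whitespaceCount = sample.match(/\s/g).length;

// \S matches every character that is not whitespace.
const visible = sample.match(/\S/g).join("");

// \cM (Control-M) and \x0d both denote the carriage return.
const crViaControl = /\cM/.test("\r");
const crViaHex = /\x0d/.test("\r");

// \cI (Control-I) denotes the tab.
const tabViaControl = /\cI/.test("\t");

console.log(whitespaceCount, visible); // 4 abcd
console.log(crViaControl, crViaHex, tabViaControl); // true true true
```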
Special characters
Special characters are characters that carry a special meaning, such as the * in the "*.txt" mentioned above, which there simply stands for any string. To find files whose names contain a literal *, you must escape the * by placing a \ in front of it: ls \*.txt.
Many metacharacters need special treatment when you want to match the characters themselves. To match one of these special characters literally, you must first escape it, that is, place a backslash (\) in front of it. The following table lists the special characters in regular expressions:
Character | Description
$ | Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( ) | Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
* | Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ | Matches the preceding subexpression one or more times. To match the + character, use \+.
. | Matches any single character except the newline character \n. To match ., use \.
[ | Marks the beginning of a bracket expression. To match [, use \[.
? | Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches '\', and '\(' matches '('.
^ | Matches the starting position of the input string, unless it is used in a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{ | Marks the beginning of a quantifier expression. To match {, use \{.
| | Indicates a choice between two items. To match |, use \|.
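The effect of escaping can be sketched in JavaScript as below. The helper escapeRegExp is hypothetical (it is not part of the original article or of the language itself); it simply puts a backslash in front of every special character listed in the table, plus the closing ] and }.

```javascript
// Unescaped, . matches any character; escaped, it matches only a literal dot.
const looseDot = /a.c/.test("abc");     // true: . matched "b"
const strictDot = /a\.c/.test("abc");   // false: \. requires a real dot
const literalDot = /a\.c/.test("a.c");  // true

// A file name that really contains "*": escape both * and .
const starFile = /^\*\.txt$/.test("*.txt"); // true

// Hypothetical helper: escape the special characters so that arbitrary text
// can be matched literally. "$&" in the replacement stands for the match.
function escapeRegExp(text) {
  return text.replace(/[$()*+.?[\\\]^{|}]/g, "\\$&");
}

const pricePattern = new RegExp(escapeRegExp("$5.00 (approx)"));
console.log(pricePattern.test("cost: $5.00 (approx)")); // true
console.log(pricePattern.test("cost: $5x00 (approx)")); // false
```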
Quantifiers
Quantifiers (also called qualifiers) specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.
The quantifiers for a regular expression are:
Character | Description
* | Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? | Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in 'Bob', but matches the two o's in 'food'.
{n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in 'Bob', but matches all the o's in 'foooood'. 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} | Both m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no spaces between the comma and the two numbers.
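The examples from the quantifier table can be tried out with a small JavaScript sketch. The object name quantifierDemo is made up for illustration; match(...)[0] returns the text of the first match.

```javascript
// Each entry reproduces one example from the table above.
const quantifierDemo = {
  starOnZ: "z".match(/zo*/)[0],            // "z"   (zero o's is fine)
  starOnZoo: "zoo".match(/zo*/)[0],        // "zoo"
  plusOnZ: /zo+/.test("z"),                // false (needs at least one o)
  plusOnZoo: "zoo".match(/zo+/)[0],        // "zoo"
  optionalDo: "do".match(/do(es)?/)[0],    // "do"
  optionalDoes: "does".match(/do(es)?/)[0],// "does"
  exactBob: /o{2}/.test("Bob"),            // false (only one o)
  exactFood: "food".match(/o{2}/)[0],      // "oo"
  atLeast: "foooood".match(/o{2,}/)[0],    // "ooooo" (all five o's)
  range: "fooooood".match(/o{1,3}/)[0],    // "ooo"  (the first three o's)
};

console.log(quantifierDemo);
```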
Because chapter numbers are likely to exceed nine in large input documents, you need a way to handle two-digit or three-digit chapter numbers. Quantifiers give you this ability. The following regular expression matches a chapter heading numbered with any number of digits:
/Chapter [1-9][0-9]*/
Notice that the quantifier appears after the range expression. It therefore applies to the entire range expression, which in this case specifies only the digits 0 through 9, inclusive.
The + quantifier is not used here, because a digit is not necessarily required in the second or any subsequent position. The ? character is not used either, because it would restrict chapter numbers to two digits. You need to match at least one digit after "Chapter" and the space character.
If you know that chapter numbers are limited to 99, you can use the following expression to require at least one but at most two digits:
/Chapter [0-9]{1,2}/
The disadvantage of this expression is that a chapter number greater than 99 still matches, but only its first two digits. Another drawback is that Chapter 0 also matches. A better expression for matching only two digits is:
/Chapter [1-9][0-9]?/
or
/Chapter [1-9][0-9]{0,1}/
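The differences between the chapter expressions above can be seen in a short JavaScript sketch. The word is written as "Chapter" here so that it matches capitalized headings (an assumption about the input), and the sample strings are made up for illustration.

```javascript
// The three variants discussed above.
const anyDigits = /Chapter [1-9][0-9]*/;
const oneOrTwoDigits = /Chapter [0-9]{1,2}/;
const betterTwoDigits = /Chapter [1-9][0-9]?/;

// Any number of digits: the whole number is matched.
console.log("Chapter 123".match(anyDigits)[0]);       // Chapter 123

// {1,2}: a larger number still matches, but only its first two digits...
console.log("Chapter 123".match(oneOrTwoDigits)[0]);  // Chapter 12
// ...and Chapter 0 is accepted as well.
console.log(oneOrTwoDigits.test("Chapter 0"));        // true

// [1-9][0-9]?: Chapter 0 is rejected, one or two digits are accepted.
console.log(betterTwoDigits.test("Chapter 0"));       // false
console.log("Chapter 7".match(betterTwoDigits)[0]);   // Chapter 7
```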
The *, +, and ? quantifiers are greedy: they match as much text as possible. Placing a ? after one of them makes the match non-greedy, or minimal.
For example, you might search an HTML document for chapter headings enclosed in H1 tags. The text appears in your document as follows:
<H1>Chapter 1 - Introduction to Regular Expressions</H1>
The following expression matches everything from the opening less-than symbol (<) to the greater-than symbol (>) that ends the closing H1 tag.
/<.*>/
If you only need to match the opening H1 tag, the following non-greedy expression matches only <H1>.
/<.*?>/
Placing a ? after a *, +, or ? quantifier converts the expression from a greedy match to a non-greedy, or minimal, match.
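The greedy and non-greedy expressions above can be compared directly in JavaScript. The variable names in this sketch are made up for illustration.

```javascript
const heading = "<H1>Chapter 1 - Introduction to Regular Expressions</H1>";

// Greedy: .* grabs as much as possible, so the match runs from the first "<"
// to the very last ">" in the string.
const greedyMatch = heading.match(/<.*>/)[0];

// Non-greedy: .*? grabs as little as possible, so the match stops at the
// first ">" it can reach.
const lazyMatch = heading.match(/<.*?>/)[0];

console.log(greedyMatch === heading); // true
console.log(lazyMatch);               // <H1>
```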
Anchors
Anchors (also called locators) let you pin a regular expression to the beginning or end of a line. They also let you create regular expressions that match within a word, at the beginning of a word, or at the end of a word.
Anchors describe the boundaries of a string or word: ^ and $ refer to the beginning and end of a string, \b describes the beginning or end of a word, and \B represents a non-word boundary.
The anchors for a regular expression are:
Character | Description
^ | Matches the starting position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$ | Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b | Matches a word boundary, that is, the position between a word and a space.
\B | Matches a non-word boundary.
Note: You cannot use quantifiers with anchors. Because there cannot be more than one position immediately before or after a newline or word boundary, expressions such as ^* are not allowed.
To match text at the beginning of a line, use the ^ character at the beginning of the regular expression. Do not confuse this use of ^ with its use inside a bracket expression.
To match text at the end of a line, use the $ character at the end of the regular expression.
To use anchors when searching for chapter headings, the following regular expression matches a chapter heading that has at most two trailing digits and appears at the beginning of a line:
/^Chapter [1-9][0-9]{0,1}/
A true chapter heading not only appears at the beginning of a line, it is also the only text on that line: it appears at both the beginning and the end of the same line. The following expression ensures that the match is an actual chapter heading and not a cross-reference. It does so by matching only when the pattern spans the whole line, from beginning to end.
/^Chapter [1-9][0-9]{0,1}$/
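The effect of ^ and $ on the chapter expressions can be sketched in JavaScript as follows. The sample lines are made up for illustration, "Chapter" is capitalized to match them, and the m flag is JavaScript's way of setting the Multiline property mentioned in the table.

```javascript
// Sample lines (assumptions): a real heading, a sentence, a cross-reference.
const sampleLines = [
  "Chapter 12",
  "Chapter 12 covers anchors",
  "See Chapter 12",
];

const atLineStart = /^Chapter [1-9][0-9]{0,1}/;
const wholeLine = /^Chapter [1-9][0-9]{0,1}$/;

// ^ alone rejects the cross-reference but still accepts the sentence.
console.log(sampleLines.map((line) => atLineStart.test(line))); // [ true, true, false ]

// ^ and $ together accept only the line that is nothing but a heading.
console.log(sampleLines.map((line) => wholeLine.test(line)));   // [ true, false, false ]

// With the m (multiline) flag, ^ and $ also match at line breaks, so the
// heading can be found inside a longer, multi-line text.
const multiLineText = "Intro\nChapter 3\nSee Chapter 3";
const headingsFound = multiLineText.match(/^Chapter [1-9][0-9]{0,1}$/gm);
console.log(headingsFound); // [ 'Chapter 3' ]
```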
Matching word boundaries works slightly differently, but adds an important capability to regular expressions. A word boundary is the position between a word and a space. A non-word boundary is any other position. The following expression matches the first three characters of the word Chapter, because these three characters appear right after a word boundary:
/\bCha/
The position of the \b operator is critical. If it is at the beginning of the string to match, it looks for a match at the beginning of a word. If it is at the end of the string, it looks for a match at the end of a word. For example, the following expression matches the string ter in the word Chapter, because it appears right before a word boundary:
/ter\b/
The following expression matches the string apt in Chapter, but does not match the string apt in aptitude:
/\Bapt/
The string apt appears at a non-word boundary in the word Chapter, but at a word boundary in the word aptitude. For the \B non-word boundary operator, position is not important, because the match does not care whether it is at the beginning or the end of a word.
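The word-boundary cases described above can be verified with a few JavaScript tests. The word "terminal" is an extra sample added for contrast and is not part of the original text.

```javascript
// \b at the start of the pattern: "Cha" must begin a word.
const startOfWord = "Chapter".match(/\bCha/)[0];      // "Cha"

// \b at the end of the pattern: "ter" must end a word.
const endOfWord = /ter\b/.test("Chapter");            // true
const notEndOfWord = /ter\b/.test("terminal");        // false: "ter" is followed by "m"

// \B: "apt" must NOT start at a word boundary.
const insideChapter = /\Bapt/.test("Chapter");        // true: "apt" follows "h"
const insideAptitude = /\Bapt/.test("aptitude");      // false: "apt" starts the word

// For comparison, \b gives the opposite result for these two words.
const boundaryChapter = /\bapt/.test("Chapter");      // false
const boundaryAptitude = /\bapt/.test("aptitude");    // true

console.log(startOfWord, endOfWord, notEndOfWord);
console.log(insideChapter, insideAptitude, boundaryChapter, boundaryAptitude);
```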
Alternation
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the associated match is captured (cached). Placing ?: before the first alternative eliminates this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches the search string at any position where a match for the pattern inside the parentheses begins. The latter is a negative lookahead: it matches the search string at any position where the pattern inside the parentheses does not match.
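As a rough illustration of alternation and the three non-capturing elements, consider the JavaScript sketch below. The patterns and sample strings (industry/industries, Windows versions) are assumptions chosen for the example and do not appear in the text above.

```javascript
// Capturing alternation: the chosen alternative is stored as sub-match 1.
const capturing = "industries".match(/industr(y|ies)/);
console.log(capturing[0], capturing[1]); // industries ies

// (?:...) groups the alternatives without storing anything.
const nonCapturing = "industries".match(/industr(?:y|ies)/);
console.log(nonCapturing.length); // 1 (only the overall match)

// (?=...) positive lookahead: "Windows " matches only if one of the listed
// versions follows; the version itself is not part of the match.
const positive = "Windows 2000".match(/Windows (?=95|98|NT|2000)/);
console.log(JSON.stringify(positive[0])); // "Windows "

// (?!...) negative lookahead: matches only if none of the versions follows.
const negativeOn2000 = /Windows (?!95|98|NT|2000)/.test("Windows 2000"); // false
const negativeOn31 = /Windows (?!95|98|NT|2000)/.test("Windows 3.1");    // true
console.log(negativeOn2000, negativeOn31);
```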
Backreferences
Adding parentheses around a regular expression pattern, or part of one, causes the corresponding match to be stored in a temporary buffer. Captured sub-matches are stored in the order in which they appear from left to right in the pattern. Buffer numbers start at 1, and up to 99 captured subexpressions can be stored. Each buffer can be accessed using \n, where n is a one- or two-digit decimal number that identifies a particular buffer.
You can use the non-capturing metacharacters ?:, ?=, or ?! to override capturing, so that the corresponding match is not saved.
One of the simplest and most useful applications of backreferences is finding two identical adjacent words in text. Take the following sentence as an example:
Is is the cost of of gasoline going up up?
This sentence obviously contains several repeated words. It would be convenient to have a way to fix the sentence without looking for duplicates of each individual word. The following regular expression uses a single subexpression to accomplish this:
/\b([a-z]+) \1\b/gi
The captured expression, specified by [a-z]+, consists of one or more letters. The second part of the regular expression is a reference to the previously captured sub-match, that is, the second occurrence of the word, which must exactly match what the parenthesized expression matched. \1 refers to the first sub-match. The word-boundary metacharacters ensure that only whole words are detected. Otherwise, phrases such as "is issued" or "this is" would be incorrectly identified by this expression.
The global flag (g) following the regular expression indicates that the expression is applied to as many matches as can be found in the input string. The case-insensitive flag (i) at the end of the expression specifies case-insensitive matching. A multiline flag specifies that potential matches may occur on either side of a newline character.
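The duplicate-word search can be sketched in JavaScript as follows. The sample sentence with doubled words and the variable names are chosen for illustration; "$1" in the replacement string refers to the first captured sub-match.

```javascript
const sentence = "Is is the cost of of gasoline going up up?";
const doubledWord = /\b([a-z]+) \1\b/gi;

// Find every pair of identical adjacent words.
const pairs = sentence.match(doubledWord);
console.log(pairs); // [ 'Is is', 'of of', 'up up' ]

// Replace each pair with a single copy of the captured word.
const repaired = sentence.replace(doubledWord, "$1");
console.log(repaired); // Is the cost of gasoline going up?

// Without the word boundaries, "This is" is wrongly reported as a repeat,
// because the "is" at the end of "This" is followed by " is".
console.log(/([a-z]+) \1/i.test("This is"));     // true
console.log(/\b([a-z]+) \1\b/i.test("This is")); // false
```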
Backreferences can also break a Uniform Resource Identifier (URI) down into its components. Assume that you want to split the following URI into its protocol (ftp, http, and so on), domain address, and page/path:
http://www.w3cschool.cc:80/html/html-tutorial.html
The following regular expression provides this functionality:
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/
The first parenthesized subexpression captures the protocol part of the web address. It matches any word that precedes a colon and two forward slashes. The second parenthesized subexpression captures the domain address part. It matches one or more characters other than / and :. The third parenthesized subexpression captures the port number, if one is specified. It matches zero or more digits following a colon, and it can occur at most once. Finally, the fourth parenthesized subexpression captures the path and/or page information specified by the web address. It matches any sequence of characters that does not include a # or a space.
Applying a regular expression to the above URI, each sub-match contains the following content:
- The first parenthesis subexpression contains "http"
- The second parenthesis subexpression contains "www.w3cschool.cc"
- The third parenthesis subexpression contains ":80"
- The fourth parenthesis subexpression contains "/html/html-tutorial.html"
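The decomposition listed above can be reproduced with a short JavaScript sketch. The destructured variable names (protocol, host, port, path) are made up for illustration; index 0 of the match result holds the overall match and indexes 1 to 4 hold the four sub-matches.

```javascript
const uri = "http://www.w3cschool.cc:80/html/html-tutorial.html";
const uriPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

// match() returns [wholeMatch, subMatch1, subMatch2, subMatch3, subMatch4].
const [wholeMatch, protocol, host, port, path] = uri.match(uriPattern);

console.log(protocol); // http
console.log(host);     // www.w3cschool.cc
console.log(port);     // :80
console.log(path);     // /html/html-tutorial.html
```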
Transferred from: http://www.runoob.com/regexp/regexp-syntax.html