Preface
Regular expressions look cumbersome, but they are powerful. Once you have learned them, applying them not only improves your efficiency but also gives you a real sense of accomplishment. If you read this material carefully and refer back to it when you use regular expressions, mastering them is not a problem.
1. Introduction
Regular expressions are now widely used in many kinds of software, including operating systems such as *nix (Linux, Unix, etc.) and HP, and development environments such as PHP, C#, and Java. Traces of regular expressions can be found in a great many applications.
Regular expressions let you achieve powerful results in a simple way. The price of being simple, effective, and powerful is that regular expression code is hard to read and takes some effort to learn. Once you have got started and keep a reference at hand, using them is fairly simple and effective.
Example: ^.+@.+\..+$
Code like this has scared me away many times, and it probably scares off many other people too. Read on, and you will be able to use such code freely.
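To make the example less intimidating, here is a minimal Python sketch that applies it. It assumes the example pattern reads ^.+@.+\..+$ (something, an "@", something, a literal period, something), and the helper name looks_like_email is purely illustrative.

```python
import re

# Assumed reading of the example pattern: text, "@", text, ".", text.
EMAIL_LIKE = re.compile(r"^.+@.+\..+$")

def looks_like_email(text: str) -> bool:
    """Return True when the whole string has the loose name@host.tld shape."""
    return EMAIL_LIKE.match(text) is not None

print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user@example"))      # False: no period after the "@"
```

The pattern only checks the overall shape of the string; it is far too loose to validate real email addresses.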
Note: Part 7 may look somewhat repetitive. Its aim is to describe the content of the earlier tables again so that it is easier to understand.
2. Regular Expression history
The "ancestors" of regular expressions can be traced back to early research into how the human nervous system works. Warren McCulloch and Walter Pitts, two neuroscientists, developed a mathematical way of describing these neural networks.
In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
This work later found its way into some early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.
The rest, as they say, is history. Regular expressions have been an important part of text-based editors and search tools ever since.
3. Regular Expression Definition
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matching substrings, or to retrieve the substrings of a string that meet a given condition.
When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there does not mean the same thing as the * in a regular expression.
A regular expression is a text pattern made up of ordinary characters (such as the letters a to z) and special characters (called metacharacters). The regular expression acts as a template that is matched against the string being searched.
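The three uses named above (checking, replacing, and retrieving) can be sketched with Python's standard re module. The sample text and the pattern [a-z]at are illustrative choices, not part of the original article.

```python
import re

text = "The cat sat on the mat."
pattern = r"[a-z]at"  # one lowercase letter followed by "at"

# 1. Check whether the string contains a substring that fits the pattern.
contains = re.search(pattern, text) is not None

# 2. Replace every matching substring.
replaced = re.sub(pattern, "dog", text)

# 3. Retrieve all substrings that fit the pattern.
found = re.findall(pattern, text)

print(contains)  # True
print(replaced)  # The dog dog on the dog.
print(found)     # ['cat', 'sat', 'mat']
```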
3.1 Ordinary characters
Ordinary characters are all the printing and non-printing characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
3.2 Non-printable characters
Character    Meaning
\cx    Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a linefeed. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, and form feed. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
3.3 Special characters
Special characters are characters with a special meaning, such as the * in *.txt above, which roughly means "any string". To find files with a literal * in their names, you need to escape the *, that is, put a \ in front of it: ls \*.txt. Regular expressions have the following special characters.
Character    Description
$    Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )    Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*    Matches the preceding subexpression zero or more times. To match the * character, use \*.
+    Matches the preceding subexpression one or more times. To match the + character, use \+.
.    Matches any single character except the linefeed \n. To match ., use \.
[    Marks the start of a bracket expression. To match [, use \[.
?    Matches the preceding subexpression zero or one time, or marks a quantifier as non-greedy. To match the ? character, use \?.
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', while '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^    Matches the start position of the input string, unless it is used in a bracket expression, in which case it negates the character set. To match the ^ character itself, use \^.
{    Marks the start of a quantifier expression. To match {, use \{.
|    Indicates a choice between two items. To match |, use \|.
Regular expressions are constructed in the same way as arithmetic expressions: larger expressions are built by combining smaller expressions with various metacharacters and operators. A component of a regular expression can be a single character, a character set, a character range, a choice between characters, or any combination of these.
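The effect of escaping a special character can be checked directly. The following is a small sketch using Python's re module; the file names are made-up examples, and re.escape is the standard-library helper that adds the backslashes for arbitrary literal text.

```python
import re

# Unescaped: "*" is a quantifier on the "s", and "." matches any character.
unescaped = re.compile(r"notes*.txt")
# Escaped: "\*" and "\." stand for a literal asterisk and a literal period.
escaped = re.compile(r"notes\*\.txt")

print(bool(unescaped.fullmatch("notessss.txt")))  # True
print(bool(escaped.fullmatch("notessss.txt")))    # False
print(bool(escaped.fullmatch("notes*.txt")))      # True

# re.escape adds the backslashes for arbitrary literal text.
literal = re.escape("a+b=c?")
print(bool(re.fullmatch(literal, "a+b=c?")))      # True
```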
3.4 Quantifiers
A quantifier specifies how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.
The *, +, and ? quantifiers are greedy: they match as much text as possible. Adding a ? after a quantifier makes it non-greedy, so that it matches as little as possible.
Regular expressions have the following quantifiers:
Character    Description
*    Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
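The difference between greedy and non-greedy matching, and the counted forms, can be observed with a short Python sketch. The sample strings are the ones used in the table; the variable names are illustrative.

```python
import re

# Greedy: "o+" takes as many o's as it can.
greedy = re.search(r"o+", "oooo").group()

# Non-greedy: "o+?" stops after the minimum of one o.
lazy = re.search(r"o+?", "oooo").group()

# Counted repetition: {2} needs two o's in a row, {1,3} takes at most three.
two_in_bob = re.search(r"o{2}", "Bob")
two_in_food = re.search(r"o{2}", "food").group()
up_to_three = re.search(r"o{1,3}", "fooooood").group()

print(greedy)       # oooo
print(lazy)         # o
print(two_in_bob)   # None
print(two_in_food)  # oo
print(up_to_three)  # ooo
```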
3.5 Anchors
Anchors describe the boundary of a string or a word. ^ and $ refer to the start and end of a string respectively, \b describes the boundary before or after a word, and \B indicates a non-word boundary. Quantifiers cannot be applied to anchors.
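As a rough illustration of word boundaries, the following Python sketch tests 'er' at the end of a word and inside a word. The names ends_word and inside_word are illustrative.

```python
import re

ends_word = re.compile(r"er\b")    # "er" followed by a word boundary
inside_word = re.compile(r"er\B")  # "er" NOT followed by a word boundary

print(bool(ends_word.search("never")))    # True: "er" ends the word
print(bool(ends_word.search("verb")))     # False: "b" follows "er"
print(bool(inside_word.search("verb")))   # True
print(bool(inside_word.search("never")))  # False
```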
3.6 Alternation
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the match they enclose is captured (cached). Placing ?: at the start of the group, before the first alternative, eliminates this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!, which carry additional meaning. The former is a positive lookahead: it matches at any position where the text that follows matches the pattern in the parentheses. The latter is a negative lookahead: it matches at any position where the text that follows does not match that pattern.
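The side effect of capturing, and the way lookaheads test what follows without consuming it, can be sketched in Python as follows. The sample strings are illustrative; note that the lookahead patterns include the space after "Windows" so that they line up with strings such as "Windows 2000".

```python
import re

# Capturing group: the alternative that matched is stored as group 1.
capturing = re.search(r"industr(y|ies)", "industries")
# Non-capturing group: same overall match, nothing stored.
non_capturing = re.search(r"industr(?:y|ies)", "industries")

print(capturing.groups())      # ('ies',)
print(non_capturing.groups())  # ()

# Positive lookahead: match only if one of the versions follows.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")
# Negative lookahead: match only if none of the versions follows.
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

# The lookahead text itself is not part of the match.
print(repr(positive.group()))  # 'Windows '
print(repr(negative.group()))  # 'Windows '
```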
3.7 Backreferences
Placing parentheses around a regular expression or part of one causes that part of the match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered from left to right in the pattern. The buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is one or two decimal digits identifying a specific buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to prevent a match from being saved.
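A backreference can be tried out with a short Python sketch. The sample strings are illustrative; \1 refers to whatever the first pair of parentheses captured.

```python
import re

# (.)\1 : any character followed immediately by the same character.
doubled = re.findall(r"(.)\1", "bookkeeper")
print(doubled)  # ['o', 'k', 'e']

# \b([a-z]+) \1\b : a word, a space, and the same word again.
sentence = "the cost of of gasoline is going up up"
repeated = re.findall(r"\b([a-z]+) \1\b", sentence)
print(repeated)  # ['of', 'up']
```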
4. Operator precedence
Operators of equal precedence are evaluated from left to right. Operators of different precedence are evaluated from highest to lowest. The operators are listed below from highest to lowest precedence:
Operator    Description
\    Escape
(), (?:), (?=), []    Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}    Quantifiers
^, $, \anymetacharacter    Anchors and sequences
|    Alternation ("or")
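The practical consequences of this ordering can be seen in a small Python sketch: quantifiers bind more tightly than sequence, and alternation binds most loosely. The patterns and strings are illustrative.

```python
import re

# Quantifiers bind tighter than sequence: "ab*" is "a" followed by any number of "b".
print(bool(re.fullmatch(r"ab*", "abbb")))    # True
print(bool(re.fullmatch(r"ab*", "abab")))    # False
# Parentheses make the quantifier apply to the whole group.
print(bool(re.fullmatch(r"(ab)*", "abab")))  # True

# Alternation binds loosest: "^ab|cd$" means "^ab" or "cd$".
print(bool(re.search(r"^ab|cd$", "abXYZ")))      # True: starts with "ab"
# Grouping the alternatives makes both anchors apply to each of them.
print(bool(re.search(r"^(?:ab|cd)$", "abXYZ")))  # False
```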
5. Full list of symbols
Character    Description
\    Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n", while '\n' matches a newline. The sequence '\\' matches "\", and "\(" matches "(".
^    Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$    Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*    Matches the preceding subexpression zero or more times. For example, zo* matches "z" and "zoo". * is equivalent to {0,}.
+    Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?    Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n}    n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}    n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}    m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the numbers.
?    When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much of it as possible. For example, in the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the 'o's.
.    Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)    Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)    Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)    Positive lookahead: matches the search string at any point where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)    Negative lookahead: matches the search string at any point where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y    Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]    Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]    Negative character set. Matches any character not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]    Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]    Negative character range. Matches any character not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b    Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B    Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx    Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d    Matches a digit character. Equivalent to [0-9].
\D    Matches a non-digit character. Equivalent to [^0-9].
\f    Matches a form feed. Equivalent to \x0c and \cL.
\n    Matches a linefeed. Equivalent to \x0a and \cJ.
\r    Matches a carriage return. Equivalent to \x0d and \cM.
\s    Matches any whitespace character, including space, tab, and form feed. Equivalent to [ \f\n\r\t\v].
\S    Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t    Matches a tab. Equivalent to \x09 and \cI.
\v    Matches a vertical tab. Equivalent to \x0b and \cK.
\w    Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W    Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn    Matches n, where n is a hexadecimal escape value. The hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num    Matches num, where num is a positive integer. It is a reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n    (where n is a digit) Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm    Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml    Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un    Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
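A few rows of the table can be spot-checked with Python's re module. This is only a sketch: Python's dialect differs from the JScript/VBScript one described here in some details (for example, it has no \cx escape), so only entries common to both are used.

```python
import re

in_set = re.search(r"[abc]", "plain").group()       # first a, b, or c
not_in_set = re.search(r"[^abc]", "plain").group()  # first character that is not a, b, or c
digits = re.findall(r"\d", "room 101")              # every digit character
words = re.findall(r"\w+", "snake_case and 42")     # runs of word characters
hex_escape = bool(re.fullmatch(r"\x41", "A"))       # \x41 is the letter A
unicode_escape = bool(re.fullmatch(r"\u00A9", "\u00a9"))  # the copyright symbol

print(in_set, not_in_set)          # a p
print(digits)                      # ['1', '0', '1']
print(words)                       # ['snake_case', 'and', '42']
print(hex_escape, unicode_escape)  # True True
```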
6. Some examples
Regular expression    Description
/\b([a-z]+) \1\b/gi    Positions where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/    Breaks a URL down into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/    Locates the position of a chapter heading
/[-a-z]/    The 26 letters a to z plus a hyphen
/ter\b/    Matches chapter, but not terminal
/\Bapt/    Matches chapter, but not aptitude
/Windows(?=95|98|NT)/    Matches Windows95, Windows98, or WindowsNT. After a match is found, the search for the next match starts right after Windows.
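The URL example can be tried out in Python as a rough sketch. It assumes the pattern reads (\w+)://([^/:]+)(:\d*)?([^# ]*); the helper name split_url and the sample URLs are illustrative.

```python
import re

# protocol, "://", domain, optional ":port", then the path up to "#" or a space
URL = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")

def split_url(url: str):
    """Return (protocol, domain, port, path), or None if the URL does not fit."""
    match = URL.match(url)
    return match.groups() if match else None

print(split_url("http://example.com:80/docs/index.html"))
# ('http', 'example.com', ':80', '/docs/index.html')
print(split_url("http://example.com/docs"))
# ('http', 'example.com', None, '/docs')
```

The port group keeps its leading colon, and it is None when the URL has no port, because the whole group is optional.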
7. Regular Expression matching rules
7.1 Basic pattern matching
Everything starts with the basics. Patterns are the most basic element of regular expressions: they are sets of characters that describe the features of strings. A pattern can be very simple, made up of ordinary strings, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which means the pattern matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, $ matches strings that end with the given pattern.
bucket$
This pattern matches "Who kept all of this cash in a bucket" but not "buckets". Using ^ and $ together means an exact match (the string is identical to the pattern). For example:
^bucket$
This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
matches the string
There once was a man from NewYork
Who kept all of his cash in a bucket.
In this pattern, the letters (o-n-c-e) are literal characters: they stand for the letters themselves. Digits work the same way. Slightly more complicated characters, such as punctuation and whitespace (spaces, tabs, and so on), need escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to check whether a string begins with a tab, use this pattern:
^\t
Similarly, \n represents a newline and \r a carriage return. Other special symbols can be written with a preceding backslash: the backslash itself is written \\, a period is written \., and so on.
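The behaviour of ^ and $ can be checked with a short Python sketch. The sample list reuses the strings discussed in this section; re.search is used so that the anchors, rather than the function, decide where a match may start.

```python
import re

lines = [
    "once upon a time",
    "There once was a man from NewYork",
    "Who kept all of his cash in a bucket",
    "bucket",
]

starts_with_once = [s for s in lines if re.search(r"^once", s)]
ends_with_bucket = [s for s in lines if re.search(r"bucket$", s)]
exactly_bucket = [s for s in lines if re.search(r"^bucket$", s)]
contains_once = [s for s in lines if re.search(r"once", s)]
starts_with_tab = bool(re.search(r"^\t", "\tindented"))

print(starts_with_once)  # ['once upon a time']
print(ends_with_bucket)  # ['Who kept all of his cash in a bucket', 'bucket']
print(exactly_bucket)    # ['bucket']
print(contains_once)     # ['once upon a time', 'There once was a man from NewYork']
print(starts_with_tab)   # True
```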
7.2 Character clusters
In Internet programs, regular expressions are commonly used to validate user input. When a user submits a form, you need to determine whether the phone number, address, email address, credit card number, and so on are valid, and ordinary literal characters are not enough for that.
We therefore need a more flexible way to describe the pattern we want: the character cluster. To create a character cluster that represents all vowels, put all the vowel characters inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it stands for only one character. A hyphen can be used to express a range of characters, for example:
[a-z] // matches all lowercase letters
[A-Z] // matches all uppercase letters
[a-zA-Z] // matches all letters
[0-9] // matches all digits
[0-9\.\-] // matches all digits, periods, and minus signs
[ \f\r\t\n] // matches all whitespace characters
Again, each of these stands for only one character, which is an important point. To match a string made up of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] represents a range of 26 letters, here it matches only a string whose first character is a lowercase letter.
As mentioned earlier, ^ indicates the start of a string, but it also has another meaning. Inside square brackets, ^ means "not" or "exclude", and it is often used to rule out certain characters. Continuing the previous example, to require that the first character is not a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are a few more examples of excluding specific characters:
[^a-z] // all characters except lowercase letters
[^\\\/\^] // all characters except (\), (/), and (^)
[^\"\'] // all characters except double quotes (") and single quotes (')
The special character "." (dot, period) represents any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string consisting of a single newline.
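The character clusters above can be exercised with a minimal Python sketch. The variable names are illustrative; the test strings are the ones mentioned in the text.

```python
import re

lower_then_digit = re.compile(r"^[a-z][0-9]$")      # one lowercase letter, one digit
non_digit_then_digit = re.compile(r"^[^0-9][0-9]$")  # one non-digit, one digit
any_then_five = re.compile(r"^.5$")                  # any character, then "5"

for s in ["z2", "t6", "g7", "ab2", "r2d3", "b52"]:
    print(s, bool(lower_then_digit.search(s)))
# z2 True, t6 True, g7 True, ab2 False, r2d3 False, b52 False

for s in ["&5", "g7", "-2", "12", "66"]:
    print(s, bool(non_digit_then_digit.search(s)))
# &5 True, g7 True, -2 True, 12 False, 66 False

print(bool(any_then_five.search("a5")))  # True
print(bool(any_then_five.search("5")))   # False: only one character
```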
PHP regular expressions have some built-in general-purpose character clusters, listed below:
Character cluster    Meaning
[[:alpha:]]    Any letter
[[:digit:]]    Any digit
[[:alnum:]]    Any letter or digit
[[:space:]]    Any whitespace character
[[:upper:]]    Any uppercase letter
[[:lower:]]    Any lowercase letter
[[:punct:]]    Any punctuation mark
[[:xdigit:]]    Any hexadecimal digit, equivalent to [0-9a-fA-F]
7.3 Specifying repetition
By now you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several digits. Braces ({}) following a character or character cluster specify how many times the preceding content must occur.
Pattern    Meaning
^[a-zA-Z_]$    Any single letter or underscore
^[[:alpha:]]{3}$    Any 3-letter word
^a$    The letter a
^a{4}$    aaaa
^a{2,4}$    aa, aaa, or aaaa
^a{1,3}$    a, aa, or aaa
^a{2,}$    A string consisting of two or more a's
^a{2,}    For example, aardvark and aaab, but not apple
a{2,}    For example, baad and aaa, but not Nantucket
\t{2}    Two tabs
.{2}    Any two characters
These examples show three different uses of braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times". A number followed by a comma, {x,}, means "the preceding content occurs x or more times". Two numbers separated by a comma, {x,y}, means "the preceding content occurs at least x times but no more than y times". The patterns can be extended to cover more words or numbers:
^[a-zA-Z0-9_]{1,}$ // all strings containing one or more letters, digits, or underscores
^[0-9]{1,}$ // all positive numbers
^\-{0,1}[0-9]{1,}$ // all integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // all decimals
The last example is hard to follow, isn't it? Read it like this: it starts (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more digits again ([0-9]{0,}), and nothing else ($). A simpler way to write this is shown next.
The special character "?" is equal to {0,1}; both mean "zero or one of the preceding content", or "the preceding content is optional". So the example above can be simplified to:
^\-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equal to {0,}; both mean "zero or more of the preceding content". Finally, the character "+" is equal to {1,}, meaning "one or more of the preceding content". So the four examples above can be written as:
^[a-zA-Z0-9_]+$ // all strings containing one or more letters, digits, or underscores
^[0-9]+$ // all positive numbers
^\-?[0-9]+$ // all integers
^\-?[0-9]*\.?[0-9]*$ // all decimals
Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
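That the brace forms and the shorthand forms are interchangeable can be checked with a small Python sketch. The sample strings and the helper name verdicts are illustrative.

```python
import re

# (brace form, shorthand form) pairs from the examples above
pairs = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),
    (r"^[0-9]{1,}$", r"^[0-9]+$"),
    (r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$"),
    (r"^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$", r"^\-?[0-9]*\.?[0-9]*$"),
]
samples = ["abc_123", "42", "-42", "3.14", "-0.5", ".5", "", "1.2.3", "x-1"]

def verdicts(pattern: str) -> list:
    """Return, for each sample, whether the pattern matches it."""
    return [bool(re.search(pattern, s)) for s in samples]

# Each shorthand pattern accepts and rejects exactly the same samples.
print(all(verdicts(long) == verdicts(short) for long, short in pairs))  # True

# Which samples the "all decimals" pattern accepts:
print([s for s, ok in zip(samples, verdicts(pairs[3][1])) if ok])
# ['42', '-42', '3.14', '-0.5', '.5', '']
```

Note that the "all decimals" pattern also accepts the empty string, because every part of it is optional.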