Tutorials | I have been learning regular expressions recently, and this article turned out to be a good one.
Preface
Regular expressions are cumbersome but powerful. Once you learn to apply them, they will not only improve your efficiency but also give you a real sense of achievement. If you read this material carefully and refer back to it while you practice, mastering regular expressions should not be a problem.
Index
1. Intro
Regular expressions are now widely used. You can find them in operating systems such as *nix (Linux, UNIX, etc.) and HP, in development environments such as PHP, C#, and Java, and in many application programs.
Regular expressions offer a simple way to achieve powerful features. The price of being concise, effective, and powerful is that regular expression code is hard to learn, so getting started takes some effort. Once you are past the introduction and have a reference at hand, using them is fairly simple and effective.
Example: ^.+@.+\\..+$
Code like this has scared me off more than once, and it has probably scared off many other people too. Reading the rest of this article will give you the freedom to use such code yourself.
Note: Section 7 may seem to repeat some of the earlier sections. The aim is to restate parts of the earlier tables so that they are easier to understand.
2. History of regular expressions
The "ancestors" of regular expressions can be traced back to early studies of how the human nervous system works. Warren McCulloch and Walter Pitts, two neuroscientists, developed a mathematical way of describing these neural networks.
In 1956, a mathematician named Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper entitled "Representation of Events in Nerve Nets and Finite Automata", which introduced the concept of regular expressions. Regular expressions were expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
Subsequently, this work found its way into some early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.
As they say, the rest is history. Ever since, regular expressions have been an important part of text-based editors and search tools.
3. Definition of regular expression
A regular expression describes a string-matching pattern. It can be used to check whether a string contains a certain substring, to replace a matching substring, or to extract from a string the substrings that satisfy a condition.
When you list a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because the * there has a different meaning from the * in a regular expression.
A regular expression is a text pattern made up of ordinary characters (such as the characters a through z) and special characters, called metacharacters. The regular expression serves as a template that matches a character pattern against the string being searched.
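The three uses named above (checking, replacing, extracting) can be sketched as follows. Python's re module is used purely for illustration; the article does not prescribe a language, and the sample text and date pattern are made up for this example.

```python
import re

# Hypothetical sample text, not taken from the article.
text = "Order 66 shipped on 2024-05-17; order 67 ships on 2024-06-01."

# A pattern for ISO-style dates: four digits, two digits, two digits.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# 1. Check whether the string contains a matching substring.
contains_date = date_pattern.search(text) is not None

# 2. Replace every matching substring.
redacted = date_pattern.sub("<date>", text)

# 3. Extract all matching substrings.
dates = date_pattern.findall(text)

print(contains_date)
print(redacted)
print(dates)
```

The same compiled pattern serves all three purposes; only the method called on it changes.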
3.1 Ordinary characters
Ordinary characters consist of all printing and nonprinting characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
3.2 Nonprinting characters
Character | Meaning
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\f | Matches a form feed character. Equivalent to \x0c and \cL.
\n | Matches a line feed (newline) character. Equivalent to \x0a and \cJ.
\r | Matches a carriage return character. Equivalent to \x0d and \cM.
\s | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t | Matches a tab character. Equivalent to \x09 and \cI.
\v | Matches a vertical tab character. Equivalent to \x0b and \cK.
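A short sketch of the whitespace escapes in the table above, written with Python's re module as an assumed host language. Python's re does not accept the \cx notation, so the sketch uses the equivalent hexadecimal form instead; the sample string is invented for illustration.

```python
import re

# Hypothetical sample containing tabs, a carriage return, and line feeds.
sample = "name:\tAda\r\nrole:\tengineer\n"

# \t matches each tab character.
tab_count = len(re.findall(r"\t", sample))

# \r\n matches a carriage return followed by a line feed.
has_crlf = re.search(r"\r\n", sample) is not None

# \S+ matches runs of non-whitespace characters.
tokens = re.findall(r"\S+", sample)

# \s+ matches runs of whitespace; here each run is collapsed to one space.
collapsed = re.sub(r"\s+", " ", sample).strip()

# \x09 is the hexadecimal escape for the same tab character as \t.
same_tab = re.findall(r"\x09", sample) == re.findall(r"\t", sample)

print(tab_count, has_crlf, tokens, collapsed, same_tab)
```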
3.3 Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" above, which simply stands for any string. To find files whose names contain a literal *, you need to escape the * by putting a \ in front of it: ls \*.txt. Regular expressions have the following special characters.
Special character | Description
$ | Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( ) | Mark the start and end of a subexpression. The subexpression can be captured for later use. To match these characters, use \( and \).
* | Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ | Matches the preceding subexpression one or more times. To match the + character, use \+.
. | Matches any single character except the newline character \n. To match a literal ., use \.
[ | Marks the beginning of a bracket expression. To match [, use \[.
? | Matches the preceding subexpression zero or one time, or indicates a non-greedy qualifier. To match the ? character, use \?.
\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', '\n' matches a newline, the sequence '\\' matches '\', and '\(' matches '('.
^ | Matches the start position of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{ | Marks the beginning of a qualifier expression. To match {, use \{.
| | Indicates a choice between two items. To match |, use \|.
The method for constructing regular expressions is the same as for creating mathematical expressions. That is, using multiple metacharacters and operators to combine small expressions to create larger expressions. The component of a regular expression can be a single character, character set, character range, selection between characters, or any combination of any of these components.
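As a minimal sketch of combining small expressions into a larger one, and of escaping special characters so that they match literally, consider the following. It assumes Python's re module; the names name, extension, and filename are hypothetical helpers introduced only for this example.

```python
import re

# Small building blocks, combined into a larger expression.
name = r"[a-z]+"                       # one or more lowercase letters
extension = r"(?:txt|md)"              # a choice between two literal extensions
filename = name + r"\." + extension    # \. matches a literal period

matches_notes = re.fullmatch(filename, "notes.txt") is not None
matches_no_dot = re.fullmatch(filename, "notesXtxt") is not None

# re.escape puts a backslash in front of the metacharacters of a literal string.
literal = "1+1=2 (really?)"
haystack = "we know 1+1=2 (really?) holds"
found_escaped = re.search(re.escape(literal), haystack) is not None

# Unescaped, the characters + ( ) ? act as metacharacters,
# so the literal text is no longer found.
found_unescaped = re.search(literal, haystack) is not None

print(matches_notes, matches_no_dot, found_escaped, found_unescaped)
```

Escaping by hand with \ works the same way; re.escape is just a convenience when the literal text comes from elsewhere.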
3.4 Qualifiers
A qualifier specifies how many times a given component of a regular expression must occur for the match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.
The *, +, and ? qualifiers are greedy: they match as much text as possible. Adding a ? after one of them makes the match non-greedy, or minimal.
The qualifiers for regular expressions are:
Character | Description
* | Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} | m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
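The qualifier examples from the table can be checked with a short sketch. Python's re module is an assumed choice of engine here; the test strings follow the table's own examples.

```python
import re

# * : zero or more occurrences of the preceding item
star = [bool(re.fullmatch(r"zo*", s)) for s in ("z", "zo", "zoo")]

# + : one or more occurrences
plus = [bool(re.fullmatch(r"zo+", s)) for s in ("z", "zo", "zoo")]

# ? : zero or one occurrence
optional = [bool(re.fullmatch(r"do(es)?", s)) for s in ("do", "does", "doeses")]

# {n}: exactly n; {n,}: at least n; {n,m}: between n and m
exactly_two = re.findall(r"o{2}", "Bob food")
at_least_two = re.findall(r"o{2,}", "Bob f" + "o" * 5 + "d")
one_to_three = re.search(r"o{1,3}", "f" + "o" * 6 + "d").group()

# Greedy versus non-greedy: a trailing ? makes the qualifier minimal.
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

print(star, plus, optional)
print(exactly_two, at_least_two, one_to_three)
print(greedy, lazy)
```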
3.5 Anchors (locators)
Anchors describe the boundary of a string or a word: ^ and $ mark the beginning and end of a string respectively, \b describes the leading or trailing boundary of a word, and \B represents a non-word boundary. Qualifiers cannot be applied to anchors.
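A minimal sketch of the four anchors, assuming Python's re module: ^ and $ for the string boundaries, \b for a word boundary, and \B for a non-word boundary. The sample strings are invented for illustration.

```python
import re

phrase = "never say never"

# ^ anchors the match to the start of the string.
starts_with_never = re.search(r"^never", phrase) is not None

# $ anchors the match to the end, so only the last "never" qualifies.
end_positions = [m.start() for m in re.finditer(r"never$", phrase)]

# \b after 'er' requires a word boundary: words that END in "er".
er_at_word_end = re.findall(r"\w*er\b", "never verb water")

# \B after 'er' requires a non-boundary: words with "er" NOT at the end.
er_inside_word = re.findall(r"\w*er\B\w*", "never verb water")

print(starts_with_never, end_positions, er_at_word_end, er_inside_word)
```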
3.6 Selection
Enclose all the alternatives in parentheses, separating adjacent alternatives with |. Parentheses have a side effect, however: the related match is captured (cached). Putting ?: before the first alternative eliminates this side effect.
?: is one of the non-capturing elements; the other two are ?= and ?!. These two carry an additional meaning. The former is a positive lookahead, which matches the search string at any position where a match for the pattern inside the parentheses begins. The latter is a negative lookahead, which matches the search string at any position where no match for that pattern begins.
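The difference between a capturing group, a non-capturing group (?:), a positive lookahead (?=), and a negative lookahead (?!) can be sketched as below. Python's re module is assumed, and the test strings are illustrative.

```python
import re

# Capturing group: the chosen alternative is stored as group 1.
capturing = re.search(r"industr(y|ies)", "heavy industries")

# Non-capturing group: the same overall match, but nothing is stored.
non_capturing = re.search(r"industr(?:y|ies)", "heavy industries")

# Positive lookahead: match "Windows " only if a listed version follows.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")

# Negative lookahead: match "Windows " only if none of them follows.
negative_2000 = re.search(r"Windows (?!95|98|NT|2000)", "Windows 2000")
negative_31 = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

print(capturing.groups(), non_capturing.groups())
print(repr(positive.group()), negative_2000, repr(negative_31.group()))
```

Note that the lookahead text ("2000") is not part of the match itself: the match ends right after "Windows ".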
3.7 Backreferences
Adding parentheses around a regular expression pattern, or part of one, causes the associated match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered, from left to right, in the pattern. Buffer numbers start at 1 and run consecutively up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number that identifies a particular buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to suppress saving of the associated match.
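As a sketch of how a numbered buffer is read back with \1, the following finds doubled words and then removes the duplicates. It assumes Python's re module; the sample sentence is made up.

```python
import re

# ([a-z]+) stores a word in buffer 1; \1 requires the same text again.
doubled = re.compile(r"\b([a-z]+) \1\b")

text = "the cost of of gasoline is going up up"

# Collect the word stored in buffer 1 for each doubled occurrence.
repeated_words = [m.group(1) for m in doubled.finditer(text)]

# In a replacement string, \1 also refers to the captured text,
# so each doubled word is replaced by a single copy.
deduplicated = doubled.sub(r"\1", text)

print(repeated_words)
print(deduplicated)
```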
4. Operator precedence
Operators of equal precedence are evaluated from left to right; operators of different precedence are evaluated from highest to lowest. The precedence of the operators, from highest to lowest, is as follows:
Operator | Description
\ | Escape character
(), (?:), (?=), [] | Parentheses and square brackets
*, +, ?, {n}, {n,}, {n,m} | Qualifiers
^, $, \anymetacharacter | Anchors and sequences
| | "Or" operation
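Two consequences of this ordering are easy to get wrong, so here is a small sketch (assuming Python's re module): the "or" operator binds more loosely than anchors and sequences, and qualifiers bind more tightly than sequences.

```python
import re

candidates = ("ab", "abXYZ", "XYZcd", "cd")

# | has the lowest precedence, so this reads as (^ab) or (cd$).
loose = re.compile(r"^ab|cd$")
loose_hits = [s for s in candidates if loose.search(s)]

# Parentheses limit the alternation, so this reads as ^(ab or cd)$.
grouped = re.compile(r"^(?:ab|cd)$")
grouped_hits = [s for s in candidates if grouped.search(s)]

# A qualifier applies to the item right before it: ab+ is a(b+), not (ab)+.
b_repeated = re.fullmatch(r"ab+", "abbb") is not None
ab_not_repeated = re.fullmatch(r"ab+", "abab") is not None
ab_repeated = re.fullmatch(r"(?:ab)+", "abab") is not None

print(loose_hits, grouped_hits)
print(b_repeated, ab_not_repeated, ab_repeated)
```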
5. Complete list of symbols
Character | Description
\ | Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n', '\n' matches a newline character, the sequence '\\' matches '\', and '\(' matches '('.
^ | Matches the start position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$ | Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
* | Matches the preceding subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+ | Matches the preceding subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches "do" or the "do" in "does". ? is equivalent to {0,1}.
{n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob", but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} | m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there must be no space between the comma and the two numbers.
? | When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
. | Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern) | Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern) | Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a more concise expression than 'industry|industries'.
(?=pattern) | Positive lookahead: matches the search string at any position where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000", but not "Windows" in "Windows 3.1". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern) | Negative lookahead: matches the search string at any position where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1", but not "Windows" in "Windows 2000". A lookahead does not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y | Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz] | A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz] | A negative character set. Matches any character that is not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z] | A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' through 'z'.
[^a-z] | A negative character range. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' through 'z'.
\b | Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B | Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx | Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. The value of x must be in the range A-Z or a-z; otherwise, c is treated as a literal 'c' character.
\d | Matches a digit character. Equivalent to [0-9].
\D | Matches a non-digit character. Equivalent to [^0-9].
\f | Matches a form feed character. Equivalent to \x0c and \cL.
\n | Matches a line feed (newline) character. Equivalent to \x0a and \cJ.
\r | Matches a carriage return character. Equivalent to \x0d and \cM.
\s | Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t | Matches a tab character. Equivalent to \x09 and \cI.
\v | Matches a vertical tab character. Equivalent to \x0b and \cK.
\w | Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W | Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn | Matches n, where n is a hexadecimal escape value. A hexadecimal escape value must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num | Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n | Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm | Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml | Matches the octal escape value nml when n is an octal digit (0-3) and m and l are both octal digits (0-7).
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
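A few entries from this list, sketched with Python's re module as an assumed engine. One caveat: for text strings, Python's \d and \w are Unicode-aware by default, so they accept more than the ASCII ranges [0-9] and [A-Za-z0-9_]; the ASCII-only inputs below behave the same either way.

```python
import re

# "." does not match a newline, but the class [\s\S] matches any character.
dot_matches_newline = re.fullmatch(r"a.b", "a\nb") is not None
class_matches_newline = re.fullmatch(r"a[\s\S]b", "a\nb") is not None

# \d matches digits; \w matches word characters, including the underscore.
digits = re.findall(r"\d+", "room 101, floor 7")
words = re.findall(r"\w+", "snake_case, kebab-case")

# \W matches the characters that separate those words.
separators = re.findall(r"\W+", "snake_case, kebab-case")

# Hexadecimal (\x41) and Unicode (\u00a9) escapes inside a pattern.
hex_a = re.fullmatch(r"\x41", "A") is not None
copyright_sign = re.search(r"\u00a9", "\u00a9 2024") is not None

print(dot_matches_newline, class_matches_newline)
print(digits, words, separators)
print(hex_a, copyright_sign)
```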
6. Some examples
Regular expression | Description
/\b([a-z]+) \1\b/gi | Finds places where a word occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/ | Splits a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/ | Locates chapter and section headings
/[-a-z]/ | The 26 letters a through z, plus the hyphen (-)
/ter\b/ | Matches "chapter", but not "terminal"
/\Bapt/ | Matches "chapter", but not "aptitude"
/Windows(?=95 |98 |NT )/ | Matches the "Windows" in "Windows95 ", "Windows98 ", or "WindowsNT "; after a match is found, the search for the next match starts right after "Windows".
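Several of these examples can be tried out with a short sketch. Python's re module is assumed, so the JScript-style /.../ delimiters are dropped; the URL is a made-up example.com address.

```python
import re

# Split a URL into protocol, domain, port, and relative path.
url_pattern = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^# ]*)")
m = url_pattern.search("http://www.example.com:80/docs/page.html#top")
protocol, domain, port, path = m.groups()

# Locate chapter and section headings numbered 1 through 99.
heading = re.compile(r"^(?:Chapter|Section) [1-9][0-9]{0,1}$")
titles = ("Chapter 1", "Section 22", "Chapter 0", "Section 100")
heading_hits = [t for t in titles if heading.search(t)]

# ter\b needs "ter" at the end of a word; \Bapt needs "apt" inside a word.
ter_hits = [w for w in ("chapter", "terminal") if re.search(r"ter\b", w)]
apt_hits = [w for w in ("chapter", "aptitude") if re.search(r"\Bapt", w)]

print(protocol, domain, port, path)
print(heading_hits, ter_hits, apt_hits)
```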
7. Regular expression matching rules
7.1 Basic Pattern matching
Everything starts with the basics. A pattern, the most basic element of a regular expression, is a set of characters that describes the features of a string. A pattern can be simple, consisting of ordinary strings, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains a special character, ^, which means that the pattern matches only strings that begin with "once". For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol is used to match strings that end with the given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket" but not "buckets". Used together, ^ and $ denote an exact match (the string is identical to the pattern). For example:
^bucket$
Matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
and the string
There once was a man from NewYork
Who kept all of his cash in a bucket.
match.
The letters in this pattern (o-n-c-e) are literal characters; that is, each stands for the letter itself, and digits work the same way. Other, slightly more complex characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to test whether a string starts with a tab, use this pattern:
^\t
Similarly, \n represents a newline and \r a carriage return. Other special symbols can be escaped by putting a backslash in front of them: the backslash itself is written \\, a period is written \., and so on.
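The basic patterns of this section can be sketched as follows, assuming Python's re module. re.search looks for the pattern anywhere in the string, so the anchors ^ and $ are what restrict it to the start or the end.

```python
import re

line = "There once was a man from NewYork"

# ^once: only strings that begin with "once".
starts_with_once = [bool(re.search(r"^once", s)) for s in ("once upon a time", line)]

# bucket$: only strings that end with "bucket".
ends_with_bucket = [
    bool(re.search(r"bucket$", s))
    for s in ("Who kept all of his cash in a bucket", "buckets")
]

# ^bucket$: only the exact string "bucket".
exactly_bucket = [bool(re.search(r"^bucket$", s)) for s in ("bucket", "a bucket")]

# No anchors: any string that contains "once".
contains_once = bool(re.search(r"once", line))

# ^\t: strings that start with a tab character.
starts_with_tab = [bool(re.search(r"^\t", s)) for s in ("\tindented", "not indented")]

print(starts_with_once, ends_with_bucket, exactly_bucket)
print(contains_once, starts_with_tab)
```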
7.2 Character clusters
In programs on the Internet, regular expressions are typically used to validate user input. When a user submits a form, ordinary literal characters are not enough to determine whether the phone number, address, email address, credit card number, and so on are valid.
So we need a more flexible way to describe the pattern we want: the character cluster. To create a character cluster that represents all vowels, put all the vowel characters inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel character, but can only represent one character. A hyphen can represent a range of characters, such as:
[a-z] // matches all lowercase letters
[A-Z] // matches all uppercase letters
[a-zA-Z] // matches all letters
[0-9] // matches all digits
[0-9\.\-] // matches all digits, periods, and minus signs
[ \f\r\t\n] // matches all whitespace characters
Again, each of these represents only one character, which is very important. To match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] represents a range of 26 letters, here it matches only one character: the first character of the string, which must be a lowercase letter.
As mentioned earlier, ^ represents the beginning of a string, but it also has another meaning. Inside square brackets it means "not" or "exclude" and is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are a few examples that exclude specific characters:
[^a-z] // all characters except lowercase letters
[^\\\/\^] // all characters except (\), (/), and (^)
[^\"\'] // all characters except double quotes (") and single quotes (')
The special character "." (dot, period) is used in regular expressions to represent any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any character other than a newline. The pattern "." matches any string except the empty string and a string consisting only of a newline.
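The character-cluster patterns above can be exercised with a short sketch, assuming Python's re module. The test strings are the ones used in the text, plus a few invented ones.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")
samples = ("z2", "t6", "g7", "ab2", "r2d3", "b52")
letter_digit_hits = [s for s in samples if letter_digit.search(s)]

# Inside brackets, ^ negates: one non-digit followed by one digit.
non_digit_digit = re.compile(r"^[^0-9][0-9]$")
non_digit_hits = [s for s in ("&5", "g7", "-2", "12", "66") if non_digit_digit.search(s)]

# A cluster matches one character at a time: here, each vowel separately.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")

# "." stands for any single character except a newline.
dot_five = [bool(re.search(r"^.5$", s)) for s in ("a5", "55", "5", "\n5")]

print(letter_digit_hits, non_digit_hits)
print(vowels, dot_five)
```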
PHP's regular expressions have some built-in general-purpose character clusters, listed below:
Character cluster | Meaning
[[:alpha:]] | any letter
[[:digit:]] | any digit
[[:alnum:]] | any letter or digit
[[:space:]] | any whitespace character
[[:upper:]] | any uppercase letter
[[:lower:]] | any lowercase letter
[[:punct:]] | any punctuation mark
[[:xdigit:]] | any hexadecimal digit, equivalent to [0-9a-fA-F]
7.3 Specifying repetition
So far you know how to match a single letter or digit, but more often you will want to match a word or a number. A word consists of several letters, and a number consists of several digits. Curly braces ({}) following a character or character cluster specify how many times the preceding item is repeated.
Character cluster | Meaning
^[a-zA-Z_]$ | any single letter or underscore
^[[:alpha:]]{3}$ | any three-letter word
^a$ | the letter a
^a{4}$ | aaaa
^a{2,4}$ | aa, aaa, or aaaa
^a{1,3}$ | a, aa, or aaa
^a{2,}$ | a string of two or more a's
^a{2,} | for example aardvark and aaab, but not apple
a{2,} | for example baad and aaa, but not Nantucket
\t{2} | two tabs
.{2} | any two characters
These examples illustrate three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster appears exactly x times"; a number followed by a comma, {x,}, means "the preceding item appears x or more times"; two comma-separated numbers, {x,y}, mean "the preceding item appears at least x times but no more than y times". We can extend the patterns to cover more words or numbers:
^[a-zA-Z0-9_]{1,}$ // all strings consisting of one or more letters, digits, or underscores
^[0-9]{1,}$ // all non-negative integers
^\-{0,1}[0-9]{1,}$ // all integers
^\-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // all decimals
The last example is not easy to understand, is it? Read it this way: the string begins (^) with an optional minus sign (\-{0,1}), followed by zero or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), zero or more digits again ([0-9]{0,}), and nothing else ($). A simpler way to write this follows.
The special character "?" is equivalent to {0,1}; both mean "zero or one of the preceding item" or "the preceding item is optional". So the example above can be simplified to:
^\-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equivalent to {0,}; both mean "zero or more of the preceding item". Finally, the character "+" is equivalent to {1,}, meaning "one or more of the preceding item", so the four examples above can be written as:
^[a-zA-Z0-9_]+$ // all strings consisting of one or more letters, digits, or underscores
^[0-9]+$ // all non-negative integers
^\-?[0-9]+$ // all integers
^\-?[0-9]*\.?[0-9]*$ // all decimals
Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
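As a rough check of the integer and decimal patterns, the sketch below (assuming Python's re module, with an invented list of sample strings) compares the long and short spellings and shows what each accepts.

```python
import re

integer = re.compile(r"^-?[0-9]+$")
decimal_short = re.compile(r"^-?[0-9]*\.?[0-9]*$")
decimal_long = re.compile(r"^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$")

# Hypothetical inputs, including some edge cases.
samples = ["42", "-7", "3.14", "-.5", "1.2.3", "abc", "", "-", "."]

integers = [s for s in samples if integer.search(s)]
decimals = [s for s in samples if decimal_short.search(s)]

# The ?, *, + spelling accepts exactly the same samples as the {..} spelling.
same_result = all(
    bool(decimal_short.search(s)) == bool(decimal_long.search(s)) for s in samples
)

print(integers)
print(decimals)
print(same_result)
```

Because every part of the decimal pattern is optional, it also accepts the empty string, a lone "-", and a lone "."; a stricter validator would need to require at least one digit.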