Regular expressions can be cumbersome, but they are powerful. Once learned, applying them improves your efficiency and brings a real sense of accomplishment. Mastering them is not hard as long as you read this material carefully and keep a reference at hand when you apply them.
1. Intro
Regular expressions are now widely used. They appear in operating systems such as *nix (Linux, Unix, and so on) and HP, in development environments such as PHP, C#, and Java, and in many applications.
Regular expressions achieve powerful functionality through a simple notation. That notation is compact and effective, but its compactness makes regular expression code hard to read and not especially easy to learn, so some effort is required. Once you are past the introduction and have a reference at hand, using them is relatively simple and effective.
Example: ^.+@.+\..+$
This kind of code used to scare me a lot, and it probably scares many other people away too. Read on, and you will be able to apply such code freely.
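As a first taste, the example pattern can be tried out in a few lines of Python. This is only a sketch: the function name looks_like_email is made up for illustration, and the pattern checks nothing more than the rough shape "something@something.something", so it is not a real email validator.

```python
import re

# The introductory pattern: one or more characters, "@", one or more
# characters, a literal dot (escaped as \.), and one or more characters.
EMAIL_LIKE = re.compile(r"^.+@.+\..+$")


def looks_like_email(text: str) -> bool:
    """Return True if text has the rough shape a@b.c (hypothetical helper)."""
    return EMAIL_LIKE.search(text) is not None


print(looks_like_email("user@example.com"))  # True
print(looks_like_email("user@example"))      # False: no dot after the @
print(looks_like_email("example.com"))       # False: no @ at all
```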
Note: Section 7 may look somewhat repetitive. It deliberately re-describes the tables from the earlier sections to make the content easier to understand.
2. Regular expression history
The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way of describing these neural networks.
In 1956, the mathematician Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets and Finite Automata" that introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
Later, this work was found to apply to early research on computational search algorithms by Ken Thompson, the principal inventor of Unix. The first practical application of regular expressions was the qed editor in Unix.
The rest, as they say, is history. Regular expressions have been an important part of text editors and search tools ever since.
3. Regular expression definitions
A regular expression describes a pattern for matching strings. It can be used to check whether a string contains a certain substring, to replace matched substrings, or to extract substrings that satisfy a condition from a string.
When listing a directory, the *.txt in dir *.txt or ls *.txt is not a regular expression, because * there has a different meaning from the * of a regular expression.
A regular expression is a text pattern made up of ordinary characters (for example, the letters a through z) and special characters (called metacharacters). It acts as a template that is matched against the string being searched.
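The three uses named above (checking, replacing, and extracting) map onto three functions in Python's re module. The sample sentence and variable names below are invented for illustration; \d stands for a digit and is described in the symbol table in section 5.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 is pending."

# 1. Check whether the string contains something that looks like a date.
has_date = re.search(r"\d{4}-\d{2}-\d{2}", text) is not None

# 2. Replace every matched substring.
masked = re.sub(r"\d{4}-\d{2}-\d{2}", "<date>", text)

# 3. Extract the substrings that satisfy a condition:
#    the digits that follow "Order " or "order ".
order_numbers = re.findall(r"[Oo]rder (\d+)", text)

print(has_date)       # True
print(masked)         # Order 66 shipped on <date>; order 67 is pending.
print(order_numbers)  # ['66', '67']
```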
3.1 Ordinary characters
Ordinary characters consist of all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some symbols.
3.2 Non-printable characters
Character  Meaning
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f  Matches a form feed (page break). Equivalent to \x0c and \cL.
\n  Matches a newline (line break). Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
3.3 Special characters
Special characters are characters with a special meaning, such as the * in "*.txt" above, which in that context simply stands for any string. To find files with a literal * in their names, you need to escape the *, that is, put a backslash in front of it: ls \*.txt. Regular expressions have the following special characters.
Special character  Description
$  Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( )  Mark the start and end of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
*  Matches the preceding subexpression zero or more times. To match the * character, use \*.
+  Matches the preceding subexpression one or more times. To match the + character, use \+.
.  Matches any single character except the newline \n. To match ., use \.
[  Marks the beginning of a bracket expression. To match [, use \[.
?  Matches the preceding subexpression zero or one time, or indicates a non-greedy qualifier. To match the ? character, use \?.
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the starting position of the input string, unless it is used inside a bracket expression, where it negates the character set. To match the ^ character itself, use \^.
{  Marks the beginning of a qualifier expression. To match {, use \{.
|  Indicates a choice between two items. To match |, use \|.
The method of constructing a regular expression is the same as the method for creating a mathematical expression. That is, using a variety of meta-characters and operators to combine small expressions together to create larger expressions. A component of a regular expression can be a single character, a character set, a range of characters, a selection between characters, or any combination of all of these components.
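The effect of escaping a special character can be seen in a short Python sketch. The price string is invented for illustration; re.escape is a standard-library helper that adds the backslashes for you.

```python
import re

price = "Total: $4.50 (approx.)"

# Unescaped, "$" is an end-of-string anchor and "." matches any character,
# so this pattern cannot match the literal text "$4.50".
unescaped = re.search(r"$4.50", price)

# Escaping each special character with a backslash makes it literal.
escaped = re.search(r"\$4\.50", price)

# re.escape builds the escaped form of an arbitrary literal string.
auto = re.search(re.escape("(approx.)"), price)

print(unescaped)        # None
print(escaped.group())  # $4.50
print(auto.group())     # (approx.)
```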
3.4 Qualifiers
Qualifiers specify how many times a given component of a regular expression must occur for a match to succeed. There are six of them: *, +, ?, {n}, {n,}, and {n,m}.
The *, +, and ? qualifiers are greedy: they match as much text as possible. Adding a ? after any of them makes the match non-greedy, or minimal.
The qualifiers of a regular expression are:
Character  Description
*  Matches the preceding subexpression zero or more times. For example, 'zo*' can match "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' can match "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but can match the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but can match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there can be no spaces between the comma and the two numbers.
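The difference between greedy and non-greedy qualifiers is easiest to see side by side. The following Python sketch uses an invented HTML fragment. The patterns are illustrative only.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? grabs as little as possible, so each tag matches separately.
lazy = re.findall(r"<.+?>", html)

# {n,m} is greedy as well: it takes up to three o's when they are available.
bounded = re.search(r"o{1,3}", "fooooood").group()

# A ? after + makes it minimal: a single "o" is enough.
minimal = re.search(r"o+?", "oooo").group()

print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']
print(bounded)  # ooo
print(minimal)  # o
```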
3.5 Locator
Locators (anchors) describe the boundaries of a string or a word. ^ and $ refer to the beginning and end of the string, \b describes a word boundary (the front or back of a word), and \B represents a non-word boundary. You cannot use qualifiers on locators.
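A small Python sketch of the two word-boundary locators, using the words "never" and "verb" that also appear in the symbol table in section 5:

```python
import re

words = ["never", "verb"]

# \b: the "er" must be followed by a word boundary (here, the end of the word).
at_boundary = [w for w in words if re.search(r"er\b", w)]

# \B: the "er" must NOT be followed by a word boundary.
inside_word = [w for w in words if re.search(r"er\B", w)]

print(at_boundary)  # ['never']
print(inside_word)  # ['verb']
```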
3.6 Selection
Enclose all the alternatives in parentheses and separate adjacent alternatives with |. Parentheses have a side effect, however: the associated match is captured and stored. Placing ?: before the first alternative eliminates this side effect.
?: is one of the non-capturing elements. The other two are ?= and ?!, which also carry additional meaning. The former is a positive lookahead: it matches the search string at any position where text matching the pattern inside the parentheses begins. The latter is a negative lookahead: it matches the search string at any position where text not matching the pattern begins.
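The following Python sketch contrasts a capturing group with the three non-capturing forms. The sample strings are the ones used in the symbol table in section 5. Note that a lookahead checks what follows without consuming it, so only "Windows " is part of the match.

```python
import re

# Capturing group: the chosen alternative is stored as group 1.
capturing = re.search(r"industr(y|ies)", "heavy industries")

# Non-capturing group: same overall match, but nothing is stored.
non_capturing = re.search(r"industr(?:y|ies)", "heavy industries")

# Positive lookahead: "Windows " only when a listed version follows.
positive = re.search(r"Windows (?=95|98|NT|2000)", "Windows 2000")

# Negative lookahead: "Windows " only when no listed version follows.
negative = re.search(r"Windows (?!95|98|NT|2000)", "Windows 3.1")

print(capturing.group(0), capturing.groups())  # industries ('ies',)
print(non_capturing.groups())                  # ()
print(repr(positive.group()))                  # 'Windows '
print(repr(negative.group()))                  # 'Windows '
```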
3.7 Backreferences
Adding parentheses around a regular expression pattern, or part of one, causes the related match to be stored in a temporary buffer. Each captured submatch is stored in the order in which it is encountered in the pattern, from left to right. Buffer numbers start at 1 and run up to a maximum of 99 subexpressions. Each buffer can be accessed using '\n', where n is a one- or two-digit decimal number identifying a particular buffer.
You can use the non-capturing metacharacters '?:', '?=', or '?!' to skip saving the related match.
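A common use of backreferences is finding a word that is accidentally repeated. The Python sketch below uses the doubled-word pattern from the examples in section 6. The sample sentence is invented.

```python
import re

# \1 refers back to whatever group 1 captured, so the pattern matches
# a word, a space, and then the same word again.
doubled = re.compile(r"\b([a-z]+) \1\b", re.IGNORECASE)

text = "Is is the cost of of gasoline going up up?"

# The first word of each repeated pair.
repeats = [m.group(1) for m in doubled.finditer(text)]

# In the replacement string, \1 inserts the captured word once.
collapsed = doubled.sub(r"\1", text)

print(repeats)    # ['Is', 'of', 'up']
print(collapsed)  # Is the cost of gasoline going up?
```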
4. Operator precedence
Operators of equal precedence are evaluated from left to right, and operators of higher precedence are evaluated before those of lower precedence. From highest to lowest, the precedence is:
Operator  Description
\  Escape character
(), (?:), (?=), []  Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}  Qualifiers
^, $, \anymetacharacter  Anchors and sequences
|  Alternation ("or")
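Precedence matters most where | meets the anchors and where qualifiers meet sequences. The Python sketch below uses made-up words to show both cases.

```python
import re

# "|" has the lowest precedence, so this reads as (^cat) or (dog$).
loose = re.compile(r"^cat|dog$")

# Parentheses bind first, so both anchors apply to each alternative.
strict = re.compile(r"^(?:cat|dog)$")

candidates = ["cat", "dog", "catalog", "hotdog"]
loose_hits = [s for s in candidates if loose.search(s)]
strict_hits = [s for s in candidates if strict.search(s)]

# Qualifiers bind tighter than sequencing: "ab+" means a(b+), not (ab)+.
sequence_first = re.search(r"ab+", "ababb").group()
group_first = re.search(r"(?:ab)+", "ababb").group()

print(loose_hits)      # ['cat', 'dog', 'catalog', 'hotdog']
print(strict_hits)     # ['cat', 'dog']
print(sequence_first)  # ab
print(group_first)     # abab
```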
5. All symbols explained
Character  Description
\  Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character "n". '\n' matches a newline. The sequence '\\' matches "\", and '\(' matches "(".
^  Matches the starting position of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$  Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*  Matches the preceding subexpression zero or more times. For example, 'zo*' can match "z" and "zoo". * is equivalent to {0,}.
+  Matches the preceding subexpression one or more times. For example, 'zo+' can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?  Matches the preceding subexpression zero or one time. For example, 'do(es)?' can match "do" or "does". ? is equivalent to {0,1}.
{n}  n is a non-negative integer. Matches exactly n times. For example, 'o{2}' cannot match the 'o' in "Bob", but can match the two o's in "food".
{n,}  n is a non-negative integer. Matches at least n times. For example, 'o{2,}' cannot match the 'o' in "Bob", but can match all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}  m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note that there can be no spaces between the comma and the two numbers.
?  When this character immediately follows any other qualifier (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible, while the default greedy pattern matches as much as possible. For example, for the string "oooo", 'o+?' matches a single "o", while 'o+' matches all the o's.
.  Matches any single character except "\n". To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)  Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0...$9 properties in JScript. To match the parenthesis characters, use '\(' or '\)'.
(?:pattern)  Matches pattern but does not capture the match, that is, this is a non-capturing match and is not stored for later use. This is useful when using the "or" character (|) to combine parts of a pattern. For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)  Positive lookahead. Matches the search string at any position where a string matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches the "Windows" in "Windows 2000", but not the "Windows" in "Windows 3.1". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)  Negative lookahead. Matches the search string at any position where a string not matching pattern begins. This is a non-capturing match, that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches the "Windows" in "Windows 3.1", but not the "Windows" in "Windows 2000". Lookaheads do not consume characters: after a match occurs, the search for the next match begins immediately after the last match, not after the characters that made up the lookahead.
x|y  Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]  Character set. Matches any one of the enclosed characters. For example, '[abc]' matches the 'a' in "plain".
[^xyz]  Negative character set. Matches any character that is not enclosed. For example, '[^abc]' matches the 'p' in "plain".
[a-z]  Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range 'a' to 'z'.
[^a-z]  Negative character range. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b  Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never", but not the 'er' in "verb".
\B  Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx  Matches the control character indicated by x. For example, \cM matches Control-M, that is, a carriage return. The value of x must be in the range A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\d  Matches a digit character. Equivalent to [0-9].
\D  Matches a non-digit character. Equivalent to [^0-9].
\f  Matches a form feed (page break). Equivalent to \x0c and \cL.
\n  Matches a newline (line break). Equivalent to \x0a and \cJ.
\r  Matches a carriage return. Equivalent to \x0d and \cM.
\s  Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S  Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t  Matches a tab. Equivalent to \x09 and \cI.
\v  Matches a vertical tab. Equivalent to \x0b and \cK.
\w  Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W  Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn  Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches "A". '\x041' is equivalent to '\x04' followed by "1". This allows ASCII codes to be used in regular expressions.
\num  Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n  Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm  Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and both n and m are octal digits (0-7), \nm matches the octal escape value nm.
\nml  Matches the octal escape value nml when n is an octal digit (0-3) and both m and l are octal digits (0-7).
\un  Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
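A few of the symbols in this table can be checked quickly in Python. This is a sketch under one stated assumption: Python's re module is used here, while the table describes the JScript/VBScript RegExp object, so details differ. For instance, Python does not support \cx, and for Unicode strings its \d and \w match more than [0-9] and [A-Za-z0-9_] unless the re.ASCII flag is passed, as it is below.

```python
import re

# \xn and \un: hexadecimal and Unicode escapes.
hex_match = re.search(r"\x41", "ABC").group()
unicode_match = re.search(r"\u00A9", "\u00a9 2024").group()

# \d and \w, restricted to ASCII so they equal [0-9] and [A-Za-z0-9_].
digits = re.findall(r"\d+", "a1b22c333", re.ASCII)
words = re.findall(r"\w+", "snake_case, two words", re.ASCII)

# "." does not match a newline, but [\s\S] matches any character at all.
dot = re.search(r"a.b", "a\nb")
any_char = re.search(r"a[\s\S]b", "a\nb")

# \num: (.)\1 matches two consecutive identical characters.
twins = re.search(r"(.)\1", "balloon").group()

print(hex_match)             # A
print(digits)                # ['1', '22', '333']
print(words)                 # ['snake_case', 'two', 'words']
print(dot)                   # None
print(any_char is not None)  # True
print(twins)                 # ll
```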
6. Some examples
Regular expression  Description
/\b([a-z]+) \1\b/gi  Finds a word that occurs twice in a row
/(\w+):\/\/([^/:]+)(:\d*)?([^#]*)/  Splits a URL into protocol, domain, port, and relative path
/^(?:Chapter|Section) [1-9][0-9]{0,1}$/  Locates chapter and section headings
/[-a-z]/  The 26 letters a to z plus the hyphen (-)
/ter\b/  Matches chapter but not terminal
/\Bapt/  Matches chapter but not aptitude
/Windows(?=95 |98 |NT )/  Matches the Windows in Windows95, Windows98, or WindowsNT (each followed by a space). Once a match is found, the search for the next match starts right after Windows.
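The URL example can be tried out as follows. This is a sketch in Python rather than the /.../ literal syntax used in the table, so the slashes need no escaping. The sample URLs are invented for illustration.

```python
import re

# Group 1: protocol, group 2: domain, group 3: optional port (with its colon),
# group 4: the relative path up to an optional "#" fragment.
URL = re.compile(r"(\w+)://([^/:]+)(:\d*)?([^#]*)")

with_port = URL.search("http://www.example.com:80/scripting/default.htm")
protocol, domain, port, path = with_port.groups()

print(protocol)  # http
print(domain)    # www.example.com
print(port)      # :80
print(path)      # /scripting/default.htm

# When the optional port group does not take part in the match,
# Python reports it as None.
without_port = URL.search("https://example.org/a#top")
print(without_port.groups())  # ('https', 'example.org', None, '/a')
```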
7. Regular expression matching rules
7.1 Basic pattern matching
Everything starts from the basics. Patterns are the most basic element of regular expressions: a pattern is a set of characters that describes the features of a string. A pattern can be very simple, made up of ordinary characters, or very complex, often using special characters to represent a range of characters, repetition, or context. For example:
^once
This pattern contains the special character ^, which means that it matches only strings that begin with once. For example, it matches the string "once upon a time" but not "There once was a man from NewYork". Just as ^ marks the beginning, the $ symbol matches strings that end with a given pattern.
bucket$
This pattern matches "Who kept all of the cash in a bucket" but not "buckets". Used together, ^ and $ indicate an exact match (the string is identical to the pattern). For example:
^bucket$
This pattern matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern
once
matches the string
There once was a man from NewYork
Who kept all of the cash in a bucket.
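The anchored and unanchored patterns above can be compared directly. A minimal Python sketch, using the sample strings from this section:

```python
import re

lines = [
    "once upon a time",
    "There once was a man from NewYork",
    "Who kept all of the cash in a bucket",
    "bucket",
    "buckets",
]

# ^once: only strings that begin with "once".
starts = [s for s in lines if re.search(r"^once", s)]

# bucket$: only strings that end with "bucket".
ends = [s for s in lines if re.search(r"bucket$", s)]

# ^bucket$: only the exact string "bucket".
exact = [s for s in lines if re.search(r"^bucket$", s)]

# once: any string that contains "once" anywhere.
contains = [s for s in lines if re.search(r"once", s)]

print(starts)    # ['once upon a time']
print(ends)      # ['Who kept all of the cash in a bucket', 'bucket']
print(exact)     # ['bucket']
print(contains)  # ['once upon a time', 'There once was a man from NewYork']
```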
The letters in the pattern (o-n-c-e) are literal characters, that is, they stand for themselves, and digits work the same way. Some slightly more complex characters, such as punctuation and whitespace (spaces, tabs, and so on), require escape sequences. All escape sequences begin with a backslash (\). The escape sequence for a tab is \t. So to check whether a string starts with a tab, use this pattern:
^\t
Similarly, \n stands for "new line" and \r for a carriage return. Other special symbols can be matched by putting a backslash in front of them: a backslash itself is written \\, a period is written \., and so on.
7.2 Character clusters
In Internet applications, regular expressions are often used to validate user input. When a user submits a form, ordinary literal characters are not enough to determine whether the phone number, address, email address, credit card number, and so on are valid.
We need a more flexible way to describe the pattern we want, and that is the character cluster. To create a character cluster that represents all vowels, put all the vowels inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but it represents only a single character. A hyphen can be used to express a range of characters, for example:
[a-z] // matches all lowercase letters
[A-Z] // matches all uppercase letters
[a-zA-Z] // matches all letters
[0-9] // matches all digits
[0-9\.\-] // matches all digits, periods, and minus signs
[ \f\r\t\n] // matches all whitespace characters
Again, each of these represents only a single character, which is very important. To match a string consisting of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] represents a range of 26 letters, here it can only match the first character of the string, which must be a lowercase letter.
As mentioned earlier, ^ represents the beginning of a string, but it has another meaning. Inside square brackets, ^ means "not" or "exclude" and is often used to rule out certain characters. Continuing the previous example, suppose we require that the first character not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7", and "-2", but not "12" or "66". Here are a few examples of excluding specific characters:
[^a-z] // all characters except lowercase letters
[^\\\/\^] // all characters except (\), (/), and (^)
[^\"\'] // all characters except double quotation marks (") and single quotation marks (')
The special character "." (dot, period) denotes any character except a newline. So the pattern "^.5$" matches any two-character string that ends with the digit 5 and begins with any non-newline character. The pattern "." matches any string except the empty string and strings consisting only of newlines.
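The character clusters from this section behave as follows in Python. The test strings are the ones quoted above. The variable names are made up for illustration.

```python
import re

# Exactly one lowercase letter followed by exactly one digit.
letter_digit = re.compile(r"^[a-z][0-9]$")
samples = ["z2", "t6", "g7", "ab2", "r2d3", "b52"]
accepted = [s for s in samples if letter_digit.search(s)]

# Inside brackets, ^ negates: the first character must not be a digit.
non_digit_first = re.compile(r"^[^0-9][0-9]$")
samples2 = ["&5", "g7", "-2", "12", "66"]
accepted2 = [s for s in samples2 if non_digit_first.search(s)]

# A cluster stands for ONE character, so findall returns single vowels.
vowels = re.findall(r"[AaEeIiOoUu]", "Regular Expression")

print(accepted)   # ['z2', 't6', 'g7']
print(accepted2)  # ['&5', 'g7', '-2']
print(vowels)     # ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```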
PHP's regular expressions have some built-in general-purpose character clusters, listed below:
Character cluster  Meaning
[[:alpha:]]  Any letter
[[:digit:]]  Any digit
[[:alnum:]]  Any letter or digit
[[:space:]]  Any whitespace character
[[:upper:]]  Any uppercase letter
[[:lower:]]  Any lowercase letter
[[:punct:]]  Any punctuation mark
[[:xdigit:]]  Any hexadecimal digit, equivalent to [0-9a-fA-F]
7.3 Determining repetition
So far you know how to match a single letter or digit, but more often you will want to match a word or a group of digits. A word consists of several letters, and a group of digits consists of several single digits. Curly braces ({}) after a character or character cluster specify how many times the preceding content repeats.
Character cluster  Meaning
^[a-zA-Z_]$  Any single letter or underscore
^[[:alpha:]]{3}$  All 3-letter words
^a$  The letter a
^a{4}$  aaaa
^a{2,4}$  aa, aaa, or aaaa
^a{1,3}$  a, aa, or aaa
^a{2,}$  Strings made up of two or more a's
^a{2,}  For example aardvark and aaab, but not apple
a{2,}  For example baad and aaa, but not Nantucket
\t{2}  Two tab characters
.{2}  Any two characters
These examples show the three different uses of curly braces. A single number, {x}, means "the preceding character or character cluster occurs exactly x times". A number followed by a comma, {x,}, means "the preceding content occurs x or more times". Two numbers separated by a comma, {x,y}, means "the preceding content occurs at least x times but no more than y times". We can extend these patterns to cover more words or numbers:
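The three forms of curly braces can be verified against the words from the table above. A minimal Python sketch, where the helper name matches is hypothetical:

```python
import re


def matches(pattern: str, candidates: list[str]) -> list[str]:
    """Return the candidates in which the pattern is found (hypothetical helper)."""
    return [s for s in candidates if re.search(pattern, s)]


# {x}: exactly x times.
exactly_four = matches(r"^a{4}$", ["aaa", "aaaa", "aaaaa"])

# {x,y}: at least x times, at most y times.
two_to_four = matches(r"^a{2,4}$", ["a", "aa", "aaa", "aaaa", "aaaaa"])

# {x,}: x or more times, anchored at the start of the string.
leading = matches(r"^a{2,}", ["aardvark", "aaab", "apple"])

# {x,} without anchors: anywhere in the string.
anywhere = matches(r"a{2,}", ["baad", "aaa", "Nantucket"])

print(exactly_four)  # ['aaaa']
print(two_to_four)   # ['aa', 'aaa', 'aaaa']
print(leading)       # ['aardvark', 'aaab']
print(anywhere)      # ['baad', 'aaa']
```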
^[a-zA-Z0-9_]{1,}$ // all strings made up of one or more letters, digits, or underscores
^[0-9]{1,}$ // all non-negative integers
^-{0,1}[0-9]{1,}$ // all integers
^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$ // all decimals
The last example is not easy to follow, is it? Read it this way: the string starts (^) with an optional minus sign (-{0,1}), followed by 0 or more digits ([0-9]{0,}), an optional decimal point (\.{0,1}), 0 or more digits ([0-9]{0,}), and nothing else ($). A simpler way of writing this is described below.
The special character "?" is equivalent to {0,1}. Both mean "0 or 1 of the preceding content", or "the preceding content is optional". So the last example can be simplified to:
^-?[0-9]{0,}\.?[0-9]{0,}$
The special character "*" is equivalent to {0,}. Both mean "0 or more of the preceding content". Finally, the character "+" is equivalent to {1,}, meaning "1 or more of the preceding content". So the four examples above can be written as:
^[a-zA-Z0-9_]+$ // all strings made up of one or more letters, digits, or underscores
^[0-9]+$ // all non-negative integers
^-?[0-9]+$ // all integers
^-?[0-9]*\.?[0-9]*$ // all decimals
Of course, this does not technically reduce the complexity of the regular expressions, but it does make them easier to read.
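The long and short forms of the decimal pattern can be checked against each other in Python. The sketch also shows a caveat: because every part of the pattern is optional, it accepts the empty string and a lone minus sign too. The stricter pattern at the end is an assumption added here for illustration. It is not part of the original article.

```python
import re

long_form = re.compile(r"^-{0,1}[0-9]{0,}\.{0,1}[0-9]{0,}$")
short_form = re.compile(r"^-?[0-9]*\.?[0-9]*$")

samples = ["42", "-3.14", ".5", "7.", "", "-", "abc", "1.2.3"]

# Both spellings accept and reject exactly the same sample strings.
same = all(
    bool(long_form.search(s)) == bool(short_form.search(s)) for s in samples
)
accepted = [s for s in samples if short_form.search(s)]

# Hypothetical stricter variant: require at least one digit somewhere.
strict = re.compile(r"^-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)$")
strict_accepted = [s for s in samples if strict.search(s)]

print(same)             # True
print(accepted)         # ['42', '-3.14', '.5', '7.', '', '-']
print(strict_accepted)  # ['42', '-3.14', '.5', '7.']
```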