PHP regular expressions: alternation, grouping, and non-capturing groups
Source: http://www.iq662.com
Alternation
Now it is time to solve the problem of three- or four-digit area codes. Alternation in a regular expression means supplying several rules; if any one of them is satisfied, the text counts as a match. To write it, separate the rules with |. Not clear yet? Never mind, let's look at some examples:
0\d{2}-\d{8}|0\d{3}-\d{7} matches two kinds of phone number separated by a hyphen: a three-digit area code with an eight-digit local number (for example, 010-12345678), or a four-digit area code with a seven-digit local number (for example, 0376-2233445).
\(0\d{2}\)[- ]?\d{8}|0\d{2}[- ]?\d{8} matches phone numbers with a three-digit area code. The area code may or may not be enclosed in parentheses, and it may be separated from the local number by a hyphen, by a space, or by nothing at all. As an exercise, try using | to extend this expression so that it also supports four-digit area codes.
\d{5}-\d{4}|\d{5} matches United States ZIP codes, which are either five digits or nine digits with a hyphen after the fifth. This example is here because it illustrates a point: order matters when you use alternation. If you change the expression to \d{5}|\d{5}-\d{4}, it will only ever match five-digit ZIP codes (and the first five digits of nine-digit ones). The reason is that the alternatives are tried from left to right; as soon as one of them succeeds, the others are not considered.
Windows98|Windows2000|WindowsXP simply shows that alternation is not limited to two rules; it works with more.
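The effect of ordering can be checked with a short script. This is a minimal sketch in Python (a choice made for illustration; the patterns themselves are the ones from the text, and the helper name first_match is hypothetical):

```python
import re

# Patterns taken from the text above.
ZIP_LONG_FIRST = re.compile(r"\d{5}-\d{4}|\d{5}")
ZIP_SHORT_FIRST = re.compile(r"\d{5}|\d{5}-\d{4}")
PHONE = re.compile(r"0\d{2}-\d{8}|0\d{3}-\d{7}")


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group() if m else None


# The longer alternative must come first to match a nine-digit ZIP code.
print(first_match(ZIP_LONG_FIRST, "12345-6789"))
print(first_match(ZIP_SHORT_FIRST, "12345-6789"))

# Either alternative may satisfy the phone-number pattern.
print(bool(PHONE.fullmatch("010-12345678")))
print(bool(PHONE.fullmatch("0376-2233445")))
```

With the short alternative first, the search stops after the first five digits, which is the behavior described above.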
Grouping
We have already seen how to repeat a single character. But what if you want to repeat several characters? You can use parentheses to mark a subexpression (also called a group), and then specify how many times the subexpression repeats. You can also perform other operations on a subexpression, which are introduced later in this tutorial.
(\d{1,3}\.){3}\d{1,3} is a simple expression for matching IP addresses. To understand it, analyze it in this order: \d{1,3} matches a number of one to three digits; (\d{1,3}\.){3} matches a one- to three-digit number followed by a dot (the group is treated as a whole), repeated three times; finally, one more one- to three-digit number (\d{1,3}) is added.
Unfortunately, it also matches impossible IP addresses such as 256.300.888.999 (no number in an IP address may exceed 255). If arithmetic comparison were available, the problem would be easy to solve, but regular expressions provide no mathematical functions. You therefore have to describe a correct IP address with lengthy grouping, alternation, and character classes: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).
The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I will not go into detail here; you should be able to work out its meaning yourself.
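The difference between the two IP expressions can be tried out as follows. This is a small Python sketch; the names SIMPLE_IP, STRICT_IP, OCTET, and is_ip are hypothetical, and fullmatch is used so that the whole string has to be an address:

```python
import re

# The simple expression from the text: any one- to three-digit numbers.
SIMPLE_IP = re.compile(r"(\d{1,3}\.){3}\d{1,3}")

# One number in the range 0-255, built from three alternatives.
OCTET = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
STRICT_IP = re.compile("(" + OCTET + r"\.){3}" + OCTET)


def is_ip(pattern, text):
    """True if the entire text matches the given IP pattern."""
    return pattern.fullmatch(text) is not None


print(is_ip(SIMPLE_IP, "256.300.888.999"))  # accepted by the simple pattern
print(is_ip(STRICT_IP, "256.300.888.999"))  # rejected by the strict pattern
print(is_ip(STRICT_IP, "192.168.0.1"))
```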
Backreferences
After a subexpression is specified with parentheses, the text it matched can be processed further, either later in the same expression or by the surrounding program. By default each group automatically gets a group number. The rule: counting the opening parentheses from left to right, the first group is number 1, the second is number 2, and so on.
A backreference searches again for the text matched by an earlier group. For example, \1 stands for the text matched by group 1. Hard to follow? See the example:
\b(\w+)\b\s+\1\b matches repeated words such as go go or kitty kitty. First comes a word, that is, one or more letters or digits between a word start and a word end (\b(\w+)\b); then one or more whitespace characters (\s+); and finally the word matched earlier (\1).
You can also give a subexpression a group name of your own. To name a subexpression, use the syntax (?<Word>\w+), which gives the \w+ group the name Word. To refer back to what this group captured, use \k<Word>, so the previous example can also be written as \b(?<Word>\w+)\b\s+\k<Word>\b.
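The repeated-word expression can be exercised like this. The sketch uses Python, whose re module spells named groups (?P<Word>...) and named backreferences (?P=Word) rather than the (?<Word>...) and \k<Word> forms shown above; the sample sentence and variable names are made up for illustration:

```python
import re

# Numbered backreference, exactly as in the text.
NUMBERED = re.compile(r"\b(\w+)\b\s+\1\b")

# Named variant, written in Python's own named-group syntax.
NAMED = re.compile(r"\b(?P<Word>\w+)\b\s+(?P=Word)\b")

text = "go go see the kitty kitty run"

numbered_hits = [m.group() for m in NUMBERED.finditer(text)]
named_hits = [m.group() for m in NAMED.finditer(text)]
named_words = [m.group("Word") for m in NAMED.finditer(text)]

print(numbered_hits)
print(named_hits)
print(named_words)
```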
Parentheses have many special-purpose syntaxes. The most common ones are listed below:

Table 4. Group syntax

Capturing
    (exp)          match exp and capture the text into an automatically numbered group
    (?<name>exp)   match exp and capture the text into the group named name
    (?:exp)        match exp without capturing the matched text
Position assertions
    (?=exp)        match the position before exp
    (?<=exp)       match the position after exp
    (?!exp)        match a position that is not followed by exp
    (?<!exp)       match a position that is not preceded by exp
Comments
    (?#comment)    has no effect on how the expression is processed; it only provides a comment for human readers
We have already discussed the first two syntaxes. The third, (?:exp), does not change how the regular expression is processed, but the text matched by such a group is not captured into a group as it is with the first two.
Position assertions
The next four syntaxes search for what comes before or after certain content without including that content. In other words, they specify a position, just as \b, ^, and $ do, which is why they are called zero-width assertions. They are best illustrated with examples:
(?=exp), the zero-width positive lookahead assertion, matches positions in the text that are followed by the given suffix exp. For example, \b\w+(?=ing\b) matches the front part of words ending in ing (everything except the ing); searching I'm singing while you're dancing. matches sing and danc.
(?<=exp), the zero-width positive lookbehind assertion, matches positions in the text that are preceded by the given prefix exp. For example, (?<=\bre)\w+\b matches the latter part of words starting with re (everything except the re); searching reading a book matches ading.
If you want to add a comma after every three digits of a long number (counting from the right, of course), you can find the parts that need a comma in front of them with ((?<=\d)\d{3})*\b. Analyze this expression carefully; it may not be as simple as it first looks.
The following example uses both a prefix and a suffix: (?<=\s)\d+(?=\s) matches numbers delimited by whitespace (once again, the whitespace itself is not part of the match).
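The three assertion examples above can be run as written. This is a quick sketch in Python; the third input string is invented for illustration:

```python
import re

SENTENCE = "I'm singing while you're dancing."

# Lookahead: word stems that are followed by "ing" at a word end.
stems = re.findall(r"\b\w+(?=ing\b)", SENTENCE)

# Lookbehind: the rest of a word that starts with "re".
tails = re.findall(r"(?<=\bre)\w+\b", "reading a book")

# Both: digits with whitespace on each side; the whitespace is not consumed,
# so one space can serve as the suffix of one number and the prefix of the next.
numbers = re.findall(r"(?<=\s)\d+(?=\s)", "a 12 34 b5 6")

print(stems, tails, numbers)
```

Because the assertions consume nothing, none of the results contain the ing, the re, or the surrounding whitespace.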
Negative assertions
Earlier we saw how to find a character that is not a particular character or is not in a character class (negation). But what if we only want to make sure a character does not appear, without matching anything in its place? For example, to find a word that contains the letter q where the q is not followed by the letter u, we might try this:
\b\w*q[^u]\w*\b matches a word containing the letter q followed by something other than u. But with more testing (or sharp enough eyes) you will find that if q is the last letter of a word, as in Iraq or Benq, this expression goes wrong. That is because [^u] always consumes one character, so when q is the last letter, [^u] matches the word separator after q (a space, a full stop, or whatever it is), and the following \w*\b then matches the next word. As a result, \b\w*q[^u]\w*\b can match the whole of Iraq fighting. A negative assertion solves the problem because it only matches a position and consumes no characters. We can now write: \b\w*q(?!u)\w*\b.
The zero-width negative lookahead assertion (?!exp) matches only positions that are not followed by exp. For example, \d{3}(?!\d) matches three digits that are not followed by another digit.
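Similarly, we can use (?<!exp), the zero-width negative lookbehind assertion, to require that the text before this position does not match exp.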
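The Iraq fighting problem and its fix can be reproduced with a few lines of Python. The first three patterns come from the text; the last one, (?<![a-z])\d{7}, is an illustrative negative lookbehind pattern that is not taken from the text above:

```python
import re

CHAR_CLASS = re.compile(r"\b\w*q[^u]\w*\b")   # consumes the character after q
LOOKAHEAD = re.compile(r"\b\w*q(?!u)\w*\b")   # only inspects the character after q

greedy_mistake = CHAR_CLASS.search("Iraq fighting").group()
fixed = LOOKAHEAD.search("Iraq fighting").group()
q_words = LOOKAHEAD.findall("Iraq Benq quit")

# Three digits that are not followed by another digit.
three_digits = re.findall(r"\d{3}(?!\d)", "1234 567")

# Illustrative only: seven digits not preceded by a lowercase letter.
seven_digits = re.findall(r"(?<![a-z])\d{7}", "a1234567 7654321")

print(greedy_mistake, fixed, q_words, three_digits, seven_digits)
```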
A more complex example: (?<=<(\w+)>).*(?=<\/\1>) matches the content inside a simple HTML tag that has no attributes. (?<=<(\w+)>) specifies the prefix: a word enclosed in angle brackets (for example, it might be <b>). It is followed by .* (any string) and finally by the suffix (?=<\/\1>). Note the \/ in the suffix, which uses the character escaping mentioned earlier. \1 is a backreference to the first captured group, that is, to whatever (\w+) matched, so if the prefix is actually <b>, the suffix is </b>. The whole expression matches the content between <b> and </b> (once again, not including the prefix and suffix themselves).
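Not every engine accepts this expression. Python's re module, used here only as an example, requires lookbehind patterns to have a fixed width, so (?<=<(\w+)>) is rejected at compile time. A rough equivalent, sketched below with the hypothetical names TAG_CONTENT and inner_text, matches the tags normally and reads the content from a capturing group instead:

```python
import re

# Python rejects the variable-width lookbehind used in the text.
try:
    re.compile(r"(?<=<(\w+)>).*(?=<\/\1>)")
    lookbehind_supported = True
except re.error:
    lookbehind_supported = False

# Alternative: consume the tags and capture the content as group 2.
TAG_CONTENT = re.compile(r"<(\w+)>(.*)</\1>")


def inner_text(html):
    """Return the text between a simple tag pair, or None if none is found."""
    m = TAG_CONTENT.search(html)
    return m.group(2) if m else None


print(lookbehind_supported)
print(inner_text("<b>bold</b> text"))
print(inner_text("<b>mismatched</i>"))
```

The backreference \1 still guarantees that the closing tag uses the same name as the opening tag.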
Comments
Another use of parentheses is the (?#comment) syntax for embedding comments. If you want to include comments, it is best to turn on the "ignore whitespace in the pattern" option, so that you can freely add spaces, tabs, and line breaks when writing the expression; they are ignored when the expression is actually used. With this option on, everything from a # to the end of the line is also ignored as a comment. For example, the previous expression can be written as:
    (?<=      # look for the prefix, but do not include it
    <(\w+)>   # letters or digits (that is, a tag) enclosed in angle brackets
    )         # end of the prefix
    .*        # match any text
    (?=       # look for the suffix, but do not include it
    <\/\1>    # content in angle brackets: a "/" followed by the previously captured tag
    )         # end of the suffix
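In Python the "ignore whitespace" option is the re.VERBOSE flag. The sketch below applies it to a simplified tag pattern that uses capturing groups rather than the lookbehind form, since Python's re does not accept variable-width lookbehind; TAG_VERBOSE and TAG_COMPACT are hypothetical names:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and "#" starts a comment.
TAG_VERBOSE = re.compile(
    r"""
    <(\w+)>     # opening tag: letters or digits in angle brackets
    (.*)        # any text between the tags (captured as group 2)
    </\1>       # closing tag: "/" followed by the captured tag name
    """,
    re.VERBOSE,
)

# The same pattern written on one line, without comments.
TAG_COMPACT = re.compile(r"<(\w+)>(.*)</\1>")

sample = "<em>stress</em>"
verbose_result = TAG_VERBOSE.search(sample).group(2)
compact_result = TAG_COMPACT.search(sample).group(2)
print(verbose_result, compact_result)
```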
Greedy and lazy matching
When a regular expression contains a quantifier that accepts a variable number of repetitions (such as * or {5,12}), the default behavior is to match as many characters as possible. Consider a.*b: it matches the longest string that starts with a and ends with b. Searching aabab with it matches the entire string aabab. This is called greedy matching.
Sometimes we need lazy matching instead, that is, matching as few characters as possible. Every quantifier shown so far can be turned into its lazy form simply by adding a question mark after it. So .*? means: repeat any number of times, but use the fewest repetitions that still let the overall match succeed. Now let's look at the lazy version of the example:
a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab and ab.
Table 5. Lazy quantifiers
    *?        repeat any number of times, but as few as possible
    +?        repeat one or more times, but as few as possible
    ??        repeat zero or one time, but as few as possible
    {n,m}?    repeat n to m times, but as few as possible
    {n,}?     repeat n or more times, but as few as possible
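The aabab example can be verified directly. A minimal Python sketch:

```python
import re

text = "aabab"

greedy = re.findall(r"a.*b", text)    # as many characters as possible
lazy = re.findall(r"a.*?b", text)     # as few characters as possible

print(greedy)
print(lazy)
```

The lazy search still starts at the earliest possible position, which is why its first result is aab rather than ab.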
What has not been covered
I have described a large number of the elements used to build regular expressions, but there are some I have not mentioned. Below is a list of them with their syntax and a brief description. You can find more detailed references online and study them when you need them. If you have the MSDN Library installed, it also contains detailed documentation on regular expressions in .NET.
Table 6. Syntax that has not been discussed
\a                    alarm character (printing it makes the computer beep)
\b                    usually a word boundary, but inside a character class it means backspace
\t                    tab
\r                    carriage return
\v                    vertical tab
\f                    form feed
\n                    newline
\e                    Escape
\0nn                  the ASCII character whose octal code is nn
\xnn                  the ASCII character whose hexadecimal code is nn
\unnnn                the Unicode character whose hexadecimal code is nnnn
\cN                   ASCII control character; for example, \cC stands for Ctrl+C
\A                    start of the string (like ^, but not affected by the multiline option)
\Z                    end of the string or end of the line (not affected by the multiline option)
\z                    end of the string (like $, but not affected by the multiline option)
\G                    start of the current search
\p{name}              the Unicode character class named name, for example \p{IsGreek}
(?>exp)               greedy subexpression
(?<name1-name2>exp)   balancing group
(?<-name2>exp)        balancing group
(?im-nsx:exp)         change the processing options within the subexpression exp
(?im-nsx)             change the processing options for the rest of the expression
(?(exp)yes|no)        treat exp as a zero-width positive lookahead assertion; if it matches at this position, use yes as the expression for this group, otherwise use no
(?(exp)yes)           same as above, with an empty expression as no
(?(name)yes|no)       if the group named name has captured something, use yes as the expression, otherwise use no
(?(name)yes)          same as above, with an empty expression as no
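Support for the entries in Table 6 varies from engine to engine. As a rough illustration, the Python sketch below tries two of them that Python's re module does support: the \A anchor and a conditional that tests a group, written here with a group number instead of a name. The pattern AREA_CODE and the sample strings are made up for this example:

```python
import re

# \A matches only at the very start, even when the multiline flag is on.
lines = "one\ntwo"
caret_hits = re.findall(r"^\w+", lines, re.MULTILINE)
anchor_hits = re.findall(r"\A\w+", lines, re.MULTILINE)

# Conditional: require ")" only if group 1 (the optional "(") took part in the match.
AREA_CODE = re.compile(r"(\()?0\d{2}(?(1)\))")

balanced = [s for s in ["(010)", "010", "(010", "010)"] if AREA_CODE.fullmatch(s)]

print(caret_hits, anchor_hits, balanced)
```

The conditional removes the need to write the parenthesized and bare forms of the area code as two separate alternatives.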
Explanations of some terms that I assume you already know
Character
The most basic unit a program works with when processing text; it may be a letter, a digit, a punctuation mark, a space, a line break, a Chinese character, and so on.
String
A sequence of zero or more characters.
Text
Written words; a string.
Match
To conform to a rule; to check whether a rule is satisfied.