Regular Expression Advanced Skills

Source: Internet
Author: User

Regular expressions (abbreviated regex) are powerful: they describe a pattern of characters and can pull the information you need out of a large body of text. Simple expressions, however, fall short in some advanced applications. To filter a complex structure, you may need more advanced regular expression features.

This article covers eight advanced regular expression techniques, each illustrated with an example that solves a fairly complex problem in a simple way. If you are not familiar with the basics of regular expressions, read an introductory tutorial or the Wikipedia article first.

The regex syntax used here is the one PHP uses, which is Perl-compatible.



1. Greedy/lazy


All regex quantifiers are greedy by default: they consume as much of the target string as they can, so the match is as long as possible. That is not always what we want, which is why the "lazy" variant exists. Adding "?" after a greedy quantifier makes it match as little as possible. The "U" pattern modifier also inverts the greediness of quantifiers, making them lazy by default. Understanding the difference between greedy and lazy matching is the foundation for using advanced regular expressions.

Greedy operators
The * operator matches the preceding expression zero or more times. It is a greedy operator. See the following example:

preg_match('/<h1>.*<\/h1>/', '<h1>This is a heading.</h1> <h1>This is another one.</h1>', $matches);

A period (.) stands for any character except a line break. The regular expression above matches an h1 tag and everything inside it, using the period (.) and the asterisk (*) to cover the tag's content. The match is:

<h1>This is a heading.</h1> <h1>This is another one.</h1>

The entire string is returned. The * operator keeps consuming characters, including the closing h1 tag in the middle. Because it is greedy, matching the entire string is exactly what maximizing the match length means.

Lazy operators
Adding a question mark (?) after the quantifier makes it lazy:

/<h1>.*?<\/h1>/

Now the engine considers its job done as soon as it reaches the first closing h1 tag.
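
The difference between the two forms can be checked with a short script. This is a minimal sketch; the sample string in $html is a made-up input, not taken from the original example.

```php
<?php
// Hypothetical sample input: two headings on one line.
$html = '<h1>First heading</h1> <h1>Second heading</h1>';

// Greedy: .* runs to the end of the line, then backs up to the LAST </h1>.
preg_match('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? stops at the FIRST </h1> it can reach.
preg_match('/<h1>.*?<\/h1>/', $html, $lazy);

echo $greedy[0], "\n"; // <h1>First heading</h1> <h1>Second heading</h1>
echo $lazy[0], "\n";   // <h1>First heading</h1>
```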

Another greedy operator with similar behaviour is {n,}. It means the preceding pattern is repeated n or more times. Without a question mark it looks for as many repetitions as possible; with one, it settles for as few as possible (though still at least n).


The code is as follows:
# Create the string
$str = 'hihihi oops hi';
# Match with the greedy {n,} operator
preg_match('/(hi){2,}/', $str, $matches);   # $matches[0] will be 'hihihi'
# Match with the lazy {n,}? operator
preg_match('/(hi){2,}?/', $str, $matches);  # $matches[0] will be 'hihi'

2. Back references

What are they for?
A back reference is a way to refer, inside a regular expression, to content that an earlier group has already captured. For example, the following simple example tries to match the content inside quotation marks:

The code is as follows:
# Create the matches array
$matches = array();

# Create the string
$str = "\"this is a 'string'\"";

# Capture the content with a regular expression
preg_match("/(\"|').*?(\"|')/", $str, $matches);

# Output the entire match
echo $matches[0];

It will output:

"this is a '

Obviously, this is not what we want.

The expression starts matching at the opening double quotation mark and then wrongly ends the match at the first single quotation mark. This happens because the closing group ("|') accepts either a double quotation mark (") or a single quotation mark ('). To fix this, use a back reference. The expressions \1, \2, ... \9 refer to the groups captured so far, in order, and each acts as a "pointer" to the text its group matched. In this example, the first matched quotation mark is represented by \1.

How are they used?
Replace the closing quotation mark group in the example above with \1:

preg_match('/("|\').*?\1/', $str, $matches);

This correctly returns the string:

"this is a 'string'"
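
The two quotation-mark patterns can be compared side by side. This is a minimal sketch that reuses the sample string from the text; the variable names $loose and $strict are chosen here for illustration.

```php
<?php
$str = "\"this is a 'string'\"";   // the text: "this is a 'string'"

// Without a back reference: the closing group accepts either quote character.
preg_match('/("|\').*?("|\')/', $str, $loose);

// With \1: the closing quote must equal whatever group 1 captured.
preg_match('/("|\').*?\1/', $str, $strict);

echo $loose[0], "\n";  // "this is a '
echo $strict[0], "\n"; // "this is a 'string'"
```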
Comment:

What if the opening and closing quotation marks are different characters, as with Chinese quotation marks? A back reference does not help in that case, because \1 only matches the same text again.

Remember the PHP function preg_replace? It supports back references too, except that in the replacement string we write $1 ... $9 ... $n (n being any group number) rather than \1 ... \9. For example, to replace all paragraph tags with their escaped text form:

The code is as follows:
$text = preg_replace('/<p>(.*?)<\/p>/', '&lt;p&gt;$1&lt;/p&gt;', $html);

The $1 is a back reference to the text between the paragraph tags <p> and </p>, and that text is inserted into the replacement. This gives us a simple way to reuse matched text, even while replacing it.
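
A runnable version of the replacement might look like the sketch below. The input in $html is made up for illustration. The second call shows the ${1} form, which is useful when the reference is followed directly by a digit.

```php
<?php
$html = '<p>First</p><p>Second</p>';  // hypothetical input

// $1 in the replacement stands for whatever group 1 captured in each match.
$text = preg_replace('/<p>(.*?)<\/p>/', '&lt;p&gt;$1&lt;/p&gt;', $html);
echo $text, "\n"; // &lt;p&gt;First&lt;/p&gt;&lt;p&gt;Second&lt;/p&gt;

// ${1}0 means "group 1, then a literal 0"; a plain $10 would be read as group 10.
$padded = preg_replace('/(\d+)/', '${1}0', 'a5');
echo $padded, "\n"; // a50
```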

3. Named groups

When back references are used several times in one expression, things get confusing quickly, and working out which group each number (1 ... 9) refers to becomes tedious. An alternative is to use named capture groups (named groups for short). A named group is written (?P<name>pattern), where name is the group name and pattern is the expression the group matches. See the following example:

/(?P<quote>"|').*?(?P=quote)/

In this expression, quote is the group name and "|' is the pattern the group matches. The later (?P=quote) refers back to the named group quote. The expression does the same job as the back reference example above, only with a named group. Isn't it easier to read and understand?

Named groups also help when working with the array of matches. The name assigned to a group can be used as a key into that array.

The code is as follows:
preg_match('/(?P<quote>"|\')/', "'string'", $matches);

# The following statement outputs ' (a single quotation mark)
echo $matches[1];

# Accessing the group by name outputs ' as well
echo $matches['quote'];

So named groups make code not only easier to write but also better organized.
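
Named groups pay off most when a pattern has several groups. The sketch below uses a date pattern and an input string that are invented for illustration and do not come from the original article.

```php
<?php
// Hypothetical example: pull the parts out of an ISO-style date.
$pattern = '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/';

preg_match($pattern, 'Released on 2010-03-15.', $m);

echo $m['year'], "\n";  // 2010
echo $m['month'], "\n"; // 03
echo $m[3], "\n";       // 15 (a named group keeps its numeric index too)
```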

4. Word boundaries

A word boundary is the position in a string between a word character (a letter, digit, or underscore) and a non-word character. Whether non-ASCII letters such as Chinese characters count as word characters depends on the locale and on the u modifier. A word boundary does not match an actual character; its length is zero. \b matches any word boundary.

Unfortunately, word boundaries are often overlooked, and many people never look into what they are good for. Suppose, for example, that you want to match the word "import":

/import/

Careful! Regular expressions can be mischievous. The expression above also matches this string:

important

You might think that adding a space before and after import would restrict the match to the standalone word:

/ import /

But then consider this case:

The trader voted for the import

When the word import is at the start or end of the string, the modified expression no longer matches. So several cases have to be handled:

/(^import | import | import$)/i

And we are not done yet. What about punctuation? To match the word, the regular expression might have to grow into this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That is a lot of machinery for matching a single word, and this is where word boundaries earn their keep. To cover the cases above and many other variants, all we need to write with word boundaries is:

/\bimport\b/

This handles all of the cases above. The flexibility of \b comes from being a zero-length match: it matches only a position between two characters, checking that one of the two neighbours is a word character and the other is not. If so, it matches. At the start or end of the string, \b treats the missing neighbour as a non-word character. Since the i in import is a word character, import is still matched there.
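
The cases discussed so far can be run through a small script. This is a sketch only; the helper function matches() and the sample strings are introduced here for illustration.

```php
<?php
// Hypothetical helper: returns true when the pattern matches the subject.
function matches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(matches('/import/', 'important'));                    // bool(true)
var_dump(matches('/\bimport\b/', 'important'));                // bool(false)
var_dump(matches('/ import /', 'voted for the import'));       // bool(false)
var_dump(matches('/\bimport\b/', 'voted for the import'));     // bool(true)
var_dump(matches('/\bimport\b/i', 'Import: coffee and tea'));  // bool(true)
```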

Note that \b has a counterpart, \B. It matches any position between two word characters or between two non-word characters. So to match "hi" only inside a word, you can use:

\Bhi\B

"this" and "hight" will match, while "hi there" will not.

5. Atomic groups

An atomic group is a special non-capturing group. It is typically used to improve the performance of a regular expression or to rule out certain matches. An atomic group is written (?>pattern), where pattern is the expression to match.

/(?>his|this)/

Once the engine has matched an atomic group and moved past it, it discards every backtracking position it remembered inside the group. The alternatives are still tried from left to right until one of them matches. If a later part of the expression then fails, however, the engine cannot return to the group to try another alternative or a shorter repetition. Take the word "smashing" as an example: at each position the engine tries "his" and then "this", neither matches anywhere, and the overall match fails exactly as it would with an ordinary group.

This example is not very practical; /t?his/ gives the same result. Let's look at a more useful one:

/\b(engineer|engrave|end)\b/

Matched against "engineering", the engine first matches "engineer", but the word boundary \b that follows fails, because the next character is "i". The engine then backtracks and tries the next alternative, engrave; it matches "eng" and then fails. Finally it tries "end", which fails too. Look closely and you will see that once engineer has matched and the word boundary has failed, engrave and end cannot possibly succeed: the text at this position starts with "engineer", so it cannot start with either of the other two words. The engine should not waste effort on those attempts.

/\b(?>engineer|engrave|end)\b/

Written this way, the expression saves the engine matching time and makes the code more efficient.
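
The sketch below runs both forms of the pattern. For the engineer pattern the results are identical and only the amount of work differs. The second pattern, with "in" and "integer", is a made-up example showing that an atomic group can also change the result when a shorter alternative comes first.

```php
<?php
// The pattern from the text: both forms agree on the result, but the atomic
// form gives up on "engineering" without retrying "engrave" and "end".
$plain  = '/\b(engineer|engrave|end)\b/';
$atomic = '/\b(?>engineer|engrave|end)\b/';

var_dump(preg_match($plain,  'engineering'));  // int(0)
var_dump(preg_match($atomic, 'engineering'));  // int(0)
var_dump(preg_match($atomic, 'an engineer'));  // int(1)

// Hypothetical pattern showing the trade-off: "in" matches first, the group
// is left, \b fails, and the engine may not go back to try "integer".
var_dump(preg_match('/\b(in|integer)\b/',   'integer')); // int(1)
var_dump(preg_match('/\b(?>in|integer)\b/', 'integer')); // int(0)
```

With atomic groups, the order of the alternatives matters: put the longer alternative first if a shorter one is a prefix of it.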

6. Recursion

Recursion is used to match nested structures, such as nested parentheses, (this (that)), or nested HTML tags, <div><div></div></div>. (?R) stands for the whole pattern being applied again at the point of recursion. The following example matches nested parentheses:

/\(((?>[^()]+)|(?R))*\)/

The outermost escaped parenthesis \( matches the start of the nested structure. Then comes an alternation (|) that matches either a run of characters other than parentheses, (?>[^()]+), or, through (?R), the entire expression again. The * after the group repeats this alternation, so the expression matches as many levels of nesting as it finds, and the final \) matches the closing parenthesis.
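
The parentheses pattern can be tried out as follows. This is a minimal sketch, and both input strings are invented for illustration.

```php
<?php
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// Balanced nesting: the whole outer pair is matched.
preg_match($pattern, 'before (this (that)) after', $balanced);
echo $balanced[0], "\n"; // (this (that))

// An opening parenthesis without a partner cannot be part of a match,
// so only the balanced inner pair is found here.
preg_match($pattern, 'broken (outer (inner) text', $partial);
echo $partial[0], "\n"; // (inner)
```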

Another example of recursion:

/<([\w]+).*?>((?>[^<>]+)|(?R))*<\/\1>/

This expression combines character classes, greedy quantifiers, back references, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), captures the tag name for later use. Once an opening tag in angle brackets has been found, the expression tries to match the rest of the tag's content. The next parenthesized subexpression closely resembles the previous example: it matches either a run of characters other than angle brackets, (?>[^<>]+), or the entire expression recursively, (?R). The last part of the expression is the closing tag, <\/\1>.

7. Callbacks

Sometimes specific parts of a match need special processing. When the changes are numerous or complex, regular expression callbacks come in handy. Callbacks are used with the preg_replace_callback function to modify strings dynamically. You pass preg_replace_callback a function as an argument; that function receives the array of matches and returns the replacement string.

For example, suppose we want to capitalize the first letter of every word in a string. PHP has no regex operator that changes letter case, so a callback does the job. First, the expression must match every letter that needs to be capitalized:

/\b\w/

This uses both a word boundary and a character class. The expression alone is not enough; we also need a callback function:
The code is as follows:
function upper_case($matches) {
    return strtoupper($matches[0]);
}

The upper_case function receives the array of matches and converts the entire match to uppercase. Here, $matches[0] is the letter to be capitalized. Then we use preg_replace_callback to apply the callback:

preg_replace_callback('/\b\w/', 'upper_case', $str);

That is how much a simple callback can do.
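
Put together, the callback example might run like this. It is a sketch under the assumption that the goal is to capitalize the first letter of each word; the input string is made up, and the anonymous-function variant is an alternative not shown in the original.

```php
<?php
// Callback: receives the array of matches, returns the replacement string.
function upper_case(array $matches): string
{
    return strtoupper($matches[0]);
}

$str = 'regular expressions are fun';  // hypothetical input

// Named function, passed by name.
$named = preg_replace_callback('/\b\w/', 'upper_case', $str);
echo $named, "\n"; // Regular Expressions Are Fun

// The same thing with an anonymous function.
$anon = preg_replace_callback('/\b\w/', function (array $m): string {
    return strtoupper($m[0]);
}, $str);
echo $anon, "\n"; // Regular Expressions Are Fun
```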

8. Comments

Comments take no part in matching, yet they are among the most important parts of a regular expression. The more elaborate an expression becomes, the harder it is to work out exactly what it matches. Commenting the expression is the best way to limit future confusion.

To add a comment inside a regular expression, use the (?#comment) format, replacing "comment" with your own text:

/(?#digit)\d/

If you plan to publish your code, commenting your regular expressions matters even more: it makes the code easier for others to understand and modify. As with comments anywhere else, it also helps you when you return to something you wrote a while ago.

Consider using the "x" modifier, or "(?x)" inside the pattern, to lay out comments. This modifier makes the engine ignore whitespace in the pattern, and lets a # start a comment that runs to the end of the line. A significant space can still be matched with [ ] or with \ followed by a space.

The code is as follows:
/
\d    # digit
[ ]   # space
\w+   # word
/x

The code above does the same thing as this expression:

/\d(?#digit)[ ](?#space)\w+(?#word)/

Always keep the readability of your code in mind.
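
In PHP the multi-line form goes into an ordinary string. The sketch below checks that both spellings match the same text; the subject string is invented for illustration. Avoid the delimiter character (here /) inside the comments, since it would end the pattern early.

```php
<?php
$commented = '/
    \d    # digit
    [ ]   # literal space (plain spaces are ignored under x)
    \w+   # word
/x';
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

$subject = 'Room 7 east';  // hypothetical input

preg_match($commented, $subject, $a);
preg_match($inline, $subject, $b);

echo $a[0], "\n";          // 7 east
var_dump($a[0] === $b[0]); // bool(true)
```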

Pattern modifiers
Pattern modifiers extend and adjust the behaviour of a regular expression.
Example: in /pattern/U, the U is a pattern modifier.
The following are commonly used in PHP (note: they are case-sensitive):
i  Matching is case-insensitive (the default is case-sensitive)
m  Multi-line mode: ^ and $ also match at the start and end of each line
s  The dot (.) also matches line breaks
x  Whitespace in the pattern is ignored
A  The match is anchored to the start of the subject
D  $ matches only at the very end of the subject, not before a final \n
U  Ungreedy: quantifiers match as little as possible and stop at the nearest match. Often used in the expressions of scraping programs.
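
The effect of each modifier can be seen by running the same pattern with and without it. This is a minimal sketch; the sample text and patterns are made up for illustration.

```php
<?php
$text = "first line\nSecond line\n";  // hypothetical input

var_dump(preg_match('/second/',  $text));      // int(0): case-sensitive
var_dump(preg_match('/second/i', $text));      // int(1): i ignores case
var_dump(preg_match('/^Second/',  $text));     // int(0): ^ is start of subject
var_dump(preg_match('/^Second/m', $text));     // int(1): m lets ^ match after \n
var_dump(preg_match('/line.Second/',  $text)); // int(0): . stops at \n
var_dump(preg_match('/line.Second/s', $text)); // int(1): s lets . match \n
var_dump(preg_match('/line$/',  $text));       // int(1): $ allows a final \n
var_dump(preg_match('/line$/D', $text));       // int(0): D means very end only
var_dump(preg_match('/line/A', $text));        // int(0): A anchors at offset 0

// U makes .* lazy, so the match stops at the first closing tag.
preg_match('/<b>.*<\/b>/U', '<b>one</b> <b>two</b>', $m);
echo $m[0], "\n"; // <b>one</b>
```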
