Advanced regular expression techniques

Regular expressions (abbreviated regex) are a powerful way to find the information you need in a large body of text. They work by describing character patterns with a conventional expression syntax. Simple regular expressions, however, are not powerful enough for some advanced applications. When the filtering rules become more complex, you may need advanced regular expression techniques.

This article describes advanced techniques for regular expressions. It picks out eight commonly used concepts and walks through an example of each; every one is a simple way of meeting a complex requirement. If you are not yet comfortable with the basics of regular expressions, read an introductory article, a tutorial, or the Wikipedia entry first.

The regular expression syntax used here is the one PHP uses, which is compatible with Perl.



1. Greedy and lazy matching

All quantifiers that can repeat more than once are greedy. They match as much of the target string as possible, so the result is as long as possible. This is not always what we want, so a "lazy" qualifier exists to solve the problem: adding "?" after a greedy quantifier makes the expression match the shortest possible string. The "U" modifier also makes repeating quantifiers lazy. Understanding the difference between greedy and lazy matching is the basis for using advanced regular expressions.

Greedy operator
The * operator matches the preceding expression zero or more times. It is a greedy operator. Take a look at the following example:

    preg_match('/<h1>.*<\/h1>/', '<h1>This is a headline.</h1><h1>This is another.</h1>', $matches);

A period (.) stands for any character other than a line break. The regular expression above matches an h1 tag and all the content inside it, using a period (.) and an asterisk (*) to match that content. The result is:

    <h1>This is a headline.</h1><h1>This is another.</h1>

The entire string is returned. The * operator matches everything, even the closing h1 tag in the middle. Because it is greedy, it takes as much as it can, and here that is the whole string.

Lazy operator
Make the expression lazy by modifying the expression above slightly and adding a question mark (?):

    /<h1>.*?<\/h1>/

Now the expression considers its job done as soon as it reaches the first closing h1 tag.

Another greedy operator with similar behavior is {n,}. It repeats the preceding pattern n or more times. Without a question mark it looks for as many repetitions as possible; with one, it repeats as few times as possible (the minimum, of course, being n times).


    # Build the string
    $str = 'hihihi oops hi';
    # Match with the greedy {n,} operator
    preg_match('/(hi){2,}/', $str, $matches);   # $matches[0] will be 'hihihi'
    # Match with the lazy {n,}? operator
    preg_match('/(hi){2,}?/', $str, $matches);  # $matches[0] will be 'hihi'
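
The three ways of writing the h1 pattern can be compared side by side. The following is a minimal sketch, assuming a two-heading sample string; the variable names are illustrative only.

```php
<?php
// Sample input for illustration; variable names are hypothetical.
$html = '<h1>This is a headline.</h1><h1>This is another.</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? stops at the first closing tag.
preg_match('/<h1>.*?<\/h1>/', $html, $lazy);

// U modifier: the plain .* now behaves lazily.
preg_match('/<h1>.*<\/h1>/U', $html, $ungreedy);

echo $greedy[0], "\n";   // the whole string
echo $lazy[0], "\n";     // <h1>This is a headline.</h1>
echo $ungreedy[0], "\n"; // <h1>This is a headline.</h1>
```

Note that under the U modifier a trailing "?" makes a quantifier greedy again, so mixing both styles in one pattern is best avoided.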

2. Back references

What are they for?
Back referencing is a way of referring, from within a regular expression, to content that an earlier part of the same expression has captured. For example, the simple example below is meant to match the text inside a pair of quotation marks:

    # Create the array that will hold the matches
    $matches = array();

    # Build the string
    $str = "\"This is a 'string'\"";

    # Capture content with a regular expression
    preg_match("/(\"|').*?(\"|')/", $str, $matches);

    # Output the whole matched string
    echo $matches[0];

It will output:

    "This is a '

Obviously, this is not what we want.

The expression starts matching at the opening double quotation mark and wrongly ends the match at the first single quote it meets. This is because the expression uses ("|'), which accepts either a double quote (") or a single quote ('). To fix the problem, use a back reference. The expressions \1, \2, ..., \9 are the sequence numbers of the groups captured so far, and can be used as "pointers" to those groups. In this case, the first matched quotation mark is represented by \1.

How are they used?
In the example above, replace the closing quotation-mark group with \1:

    preg_match('/("|\').*?\1/', $str, $matches);

This returns the string correctly:

    "This is a 'string'"
Exercise:

If the quotation marks are Chinese quotation marks, where the opening and closing marks are different characters, what should you do?

Do you remember the PHP function preg_replace? It supports back references too. There, instead of \1 ... \9, we use $1 ... $n (any number) as the reference. For example, to replace all the paragraph tags <p> with text:

    $text = preg_replace('/<p>(.*?)<\/p>/',
        '&lt;p&gt;$1&lt;/p&gt;', $html);

The $1 parameter is a back reference that stands for the text inside the paragraph tag <p>; it is inserted into the replacement text. This easy-to-use expression gives us a simple way to get at the matched text even while replacing it.
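
Both kinds of back reference can be tried together. The following is a minimal sketch, assuming a small quoted string and a two-paragraph HTML fragment as sample input; the bracketed replacement format is chosen only to make the output easy to read.

```php
<?php
// Hypothetical sample inputs for illustration.
$str  = "\"This is a 'string'\"";
$html = '<p>Hello</p><p>World</p>';

// Without a back reference, the closing quote may differ from the opening one.
preg_match('/("|\').*?("|\')/', $str, $loose);

// \1 must repeat exactly what the first group captured.
preg_match('/("|\').*?\1/', $str, $strict);

// In the replacement string, $1 inserts the text captured by group 1.
$text = preg_replace('/<p>(.*?)<\/p>/', '[$1]', $html);

echo $loose[0], "\n";   // "This is a '
echo $strict[0], "\n";  // "This is a 'string'"
echo $text, "\n";       // [Hello][World]
```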

3. Named capture groups
When an expression uses back references more than once, it is easy to lose track of which subpattern each number (\1 ... \9) stands for. An alternative to numbered back references is a capture group with a name (a "named group" from here on). A named group is written (?P<name>pattern), where name is the group name and pattern is the regular expression the group matches. Take a look at the following example:

    /(?P<quote>"|').*?(?P=quote)/

Here quote is the name of the group, and "|' is the pattern it matches. The later (?P=quote) refers back to the named group quote. This expression has the same effect as the back-reference example above, except that it is implemented with a named group. Isn't it easier to read and understand?

Named groups also help when working with the array of matches: a group's name can be used as a key in that array.

    preg_match('/(?P<quote>"|\')/', "'String'", $matches);

    # The following statement outputs "'" (without the double quotes)
    echo $matches[1];

    # Accessing the match by group name also outputs "'"
    echo $matches['quote'];

So named groups do not just make code easier to write; they also help keep it organized.
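
A named group can serve as a back reference and as an array key in the same call. The following is a minimal sketch; the second group name, body, and the sample string are assumptions added for illustration.

```php
<?php
// Hypothetical sample input.
$str = "say 'hello' now";

// quote captures the opening quotation mark, body the text between the marks,
// and (?P=quote) requires the same mark again at the end.
preg_match('/(?P<quote>"|\')(?P<body>.*?)(?P=quote)/', $str, $m);

echo $m['quote'], "\n"; // '
echo $m['body'], "\n";  // hello
echo $m[2], "\n";       // hello (named groups keep their numeric index too)
```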

4. Word boundaries

A word boundary is a position in a string between a word character (a letter, digit, or underscore; whether characters from other scripts, such as Chinese, count as word characters depends on the matching mode, for example the u modifier) and a non-word character. What makes it special is that it does not match any actual character: its length is zero. \b matches every word boundary.

Unfortunately, word boundaries are often overlooked, and few people pay attention to their practical value. For example, suppose you want to match the word "import":

    /import/

Watch out! Regular expressions can be sneaky. The following string also matches the expression above:

    important

You might think that adding a space before and after import would make it match only the standalone word:

    / import /

But what about this case:

    the trader voted for the import

When the word import is at the beginning or end of the string, the modified expression no longer works. So each situation has to be handled:

    /(^import | import | import$)/i

Don't panic, we are not finished yet. What about punctuation? Just to match this one word, your expression may need to look like this:

    /(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That is a lot of fuss just to match one word. This is why word boundaries matter so much. To cover the requirements above, and many other variants, all we need to write with word boundaries is:

    /\bimport\b/

This handles all the cases above. The flexibility of \b comes from its being a zero-length match: it matches only the imaginary position between two actual characters. It checks whether, of two adjacent characters, one is a word character and the other is not. If so, it matches. At the beginning or end of the string, \b treats the missing neighbor as a non-word character. Since the i in import is a word character, import is matched.

Note that \b has an opposite, \B, which matches a position between two word characters or between two non-word characters. So, to match "hi" inside a word, you can use:

    \Bhi\B

"This" returns a match, because hi has a word character on each side. "Hi there" does not match, and neither does "hight", where hi starts the word and so has a word boundary in front of it.

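The behavior of \b and \B can be checked with a few strings. The following is a minimal sketch; the sample strings are assumptions, and only the two patterns come from the text. Note that "hight" yields 0 for \Bhi\B, because its hi sits at the start of the word.

```php
<?php
// Hypothetical sample strings; only the patterns come from the text.
$samples = ['important', 'the trader voted for the import', 'Import: done', 'reimported'];

$wholeWord = [];
foreach ($samples as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    $wholeWord[$s] = preg_match('/\bimport\b/i', $s);
}

$inside = [];
foreach (['This', 'hight', 'Hi there'] as $s) {
    // \B on both sides requires word characters around "hi".
    $inside[$s] = preg_match('/\Bhi\B/', $s);
}

print_r($wholeWord); // 0, 1, 1, 0 in the order of $samples
print_r($inside);    // This => 1, hight => 0, Hi there => 0
```
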
5. Atomic groups

An atomic group is a special regular expression group that does not capture. It is often used to improve the performance of regular expressions and to rule out specific matches. An atomic group is written (?>pattern), where pattern is the expression to match.

    /(?>his|this)/

Once the regex engine has matched an atomic group, it discards all the backtracking positions recorded inside the group. If a later part of the expression fails, the engine cannot go back into the group to try another alternative or a shorter match. The alternatives are still tried in order while the group itself is being matched: against the word "smashing", the engine looks for "his" and then "this" at each position and finds neither, and against the word "this" it still finds "this". The group changes the result, or the amount of work, only when something follows it in the expression.

The example above is not practical, and /t?his/ achieves the same result. Take a look at another example:

    /\b(engineer|engrave|end)\b/

If this is matched against "engineering", the regex engine first matches "engineer", but the word boundary \b that must follow is not there, so this attempt fails. The engine then tries the next alternative, "engrave": it matches "eng", the rest does not fit, and the attempt fails. Finally it tries "end", which fails as well. Look closely and you will see that once "engineer" has matched and the word boundary has failed, "engrave" and "end" can no longer succeed: neither of them can match the text that "engineer" just matched, so the engine should not waste effort on them.

    /\b(?>engineer|engrave|end)\b/

This version saves the regex engine matching time and makes the code more efficient.
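
Because an atomic group refuses to backtrack, it can change the result as well as the speed. The following is a minimal sketch; the in|integer pattern is an assumption added to show the difference, while the engineer pattern comes from the text.

```php
<?php
// Hypothetical pair of patterns that differ only in the group type.
$plain  = '/\b(in|integer)\b/';
$atomic = '/\b(?>in|integer)\b/';

// Plain group: "in" matches, \b fails, the engine backtracks and tries "integer".
$plainHit = preg_match($plain, 'integer');   // 1

// Atomic group: "in" matches, \b fails, and the group cannot be re-entered.
$atomicHit = preg_match($atomic, 'integer'); // 0

// The pattern from the text still accepts words it is meant to accept.
$engineering = preg_match('/\b(?>engineer|engrave|end)\b/', 'engineering'); // 0
$end         = preg_match('/\b(?>engineer|engrave|end)\b/', 'the end');     // 1

var_dump($plainHit, $atomicHit, $engineering, $end);
```

The order of alternatives matters inside an atomic group: putting a shorter alternative before a longer one that begins with it, as in the hypothetical pattern above, can rule out matches you wanted.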

6. Recursion

Recursion is used to match nested structures, such as nested parentheses, (this (that)), or nested HTML tags, <div><div></div></div>. We use (?R) to stand for the whole pattern during recursion. Here is an example that matches nested parentheses:

    /\(((?>[^()]+)|(?R))*\)/

The outermost layer uses backslash-escaped parentheses, \( and \), to match the beginning and end of the nested structure. Inside is an alternation (|) that either matches a run of characters other than parentheses, (?>[^()]+), or matches the entire expression again through the subpattern (?R). Note that the * after the group makes it match as many nested sets as possible.

Another example of recursion:

    /<([\w]+).*>((?>[^<>]+)|(?R))*<\/\1>/

The expression above combines character classes, greedy operators, back references, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), captures the tag name for later use. Once this opening angle-bracket tag has been found, the expression tries to match the rest of the tag's content. The next parenthesized subexpression is very similar to the previous example: it either matches a run of characters that are not angle brackets, (?>[^<>]+), or recursively matches the entire expression, (?R). The last part of the expression is the closing tag in angle brackets, <\/\1>.
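
The parentheses pattern is the easier of the two to experiment with. The following is a minimal sketch, assuming two made-up input strings, that extracts balanced parenthesized groups.

```php
<?php
// Pattern from the text: a parenthesized group that may contain
// non-parenthesis runs or further balanced groups via (?R).
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// First balanced group in a string with a stray opening parenthesis.
preg_match($pattern, 'f(a(b)c)(d', $first);

// All top-level balanced groups in a hypothetical expression.
preg_match_all($pattern, 'x (1 + (2 * 3)) y (4)', $all);

echo $first[0], "\n"; // (a(b)c)
print_r($all[0]);     // (1 + (2 * 3)) and (4)
```

The unmatched "(d" at the end of the first string is skipped, since no closing parenthesis follows it.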

7. Callbacks

Specific parts of a match sometimes need special modification. When the modifications are numerous or complex, regular expression callbacks are useful. A callback is a way to modify a string dynamically through the function preg_replace_callback. You pass a function to preg_replace_callback as an argument; it receives the array of match results as its parameter and returns the string to use as the replacement.

For example, suppose we want to capitalize the first letter of every word in a string. Unfortunately, PHP regular expressions have no operator that changes letter case directly. A callback can do the job. First, we need an expression that matches every letter to be capitalized:

    /\b\w/

This expression uses both a word boundary and a character class. The expression alone is not enough; we also need a callback function:

    function upper_case($matches) {
        return strtoupper($matches[0]);
    }

The function upper_case receives the array of match results and converts the entire match to uppercase. In this case, $matches[0] is the letter that needs to be capitalized. We then use preg_replace_callback to apply the callback:

    preg_replace_callback('/\b\w/', 'upper_case', $str);

Such is the power of a simple callback.
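
Put together, the callback example runs as follows. This is a minimal sketch, assuming a lowercase sample string; the anonymous-function variant is an addition that is not part of the original example.

```php
<?php
// Callback from the text: receives the match array, returns the replacement.
function upper_case($matches) {
    return strtoupper($matches[0]);
}

// Hypothetical sample input.
$str = 'advanced regular expressions';

// Pass the callback by name...
$byName = preg_replace_callback('/\b\w/', 'upper_case', $str);

// ...or as an anonymous function, which avoids a global function name.
$byClosure = preg_replace_callback('/\b\w/', function ($m) {
    return strtoupper($m[0]);
}, $str);

echo $byName, "\n";    // Advanced Regular Expressions
echo $byClosure, "\n"; // Advanced Regular Expressions
```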

8. Comments

Comments are not used to match strings, but they are among the most important parts of a regular expression. The longer and more complex your expressions become, the harder it gets to work out what is being matched. Adding comments inside the regular expression is the best way to keep future confusion to a minimum.

To add a comment inside a regular expression, use the format (?#comment). Replace "comment" with your own comment text:

    /(?#digit)\d/

If you intend to make your code public, commenting regular expressions is particularly important. It makes it easier for others to read and modify your code. As with comments anywhere else, it also helps when you come back to your own program later.

Consider using the "x" or "(?x)" modifier to lay out the comments. This modifier makes the regex engine ignore whitespace inside the expression. "Useful" spaces can still be matched with [ ] or \ (a backslash followed by a space).

    /
    \d    #digit
    [ ]   #space
    \w+   #word
    /x

The code above is equivalent to the following expression:

    /\d(?#digit)[ ](?#space)\w+(?#word)/

Always keep the readability of your code in mind.
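
The two commenting styles can be confirmed to match the same text. The following is a minimal sketch, assuming a made-up subject string; the patterns are the ones shown above, stored in PHP string variables.

```php
<?php
// Verbose form: with the x modifier, whitespace and # comments are ignored.
$verbose = '/
    \d      # digit
    [ ]     # space (kept, because it is inside a character class)
    \w+     # word
/x';

// Compact form with inline (?#...) comments.
$compact = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

// Hypothetical subject string.
$subject = 'room 7 west';

preg_match($verbose, $subject, $a);
preg_match($compact, $subject, $b);

echo $a[0], "\n"; // 7 west
echo $b[0], "\n"; // 7 west
```

One practical caution: a comment must not contain the pattern delimiter (here "/") unless it is escaped, because PHP looks for the closing delimiter before the regex engine sees the comment.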

Pattern modifiers
Pattern modifiers enhance and supplement regular expressions. They are written outside the pattern, after the closing delimiter.
Example: in /regex/U, the U is a pattern modifier.
Some modifiers commonly used in PHP (note: they are case-sensitive):
i  Match without regard to letter case (matching is case-sensitive by default)
m  Multi-line mode: ^ and $ match at the start and end of each line, not only of the whole string
s  Make the dot (.) match line breaks as well, so the string is treated as a single unit
x  Ignore whitespace in the pattern
A  Force the match to start at the beginning of the string
D  Make $ match only at the very end of the string, not before a trailing \n
U  Invert greediness: quantifiers stop at the nearest possible match; commonly used in scraping programs
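
Each modifier in the list can be seen in isolation with a one-line test. The following is a minimal sketch; all patterns and subject strings are assumptions chosen to make the effect of each modifier visible.

```php
<?php
$r = [];

// i: case-insensitive matching.
$r['i'] = preg_match('/HELLO/i', 'hello');

// m: ^ matches at the start of every line.
$r['m'] = preg_match_all('/^\w+/m', "one two\nthree four", $lines);

// s: the dot also matches a line break.
$r['s'] = [preg_match('/a.b/', "a\nb"), preg_match('/a.b/s', "a\nb")];

// x: whitespace in the pattern is ignored.
$r['x'] = preg_match('/a b/x', 'ab');

// A: the match must begin at the start of the subject.
$r['A'] = [preg_match('/bar/', 'foobar'), preg_match('/bar/A', 'foobar')];

// D: $ no longer matches before a trailing line break.
$r['D'] = [preg_match('/end$/', "end\n"), preg_match('/end$/D', "end\n")];

// U: quantifiers become lazy.
preg_match('/a+/U', 'aaa', $lazy);

print_r($r);
print_r($lines[0]); // one, three
echo $lazy[0], "\n"; // a
```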
