Advanced Regular Expression Techniques, Explained with Examples

The original English article was published by Smashing Magazine. Translated by Woole. Please credit the source when reprinting.

Regular expressions (abbreviated regex) are a powerful way to find the information you need in a large body of text. They work by describing patterns of characters with a conventional syntax. Unfortunately, simple regular expressions are not powerful enough for some advanced tasks. When the filtering you need becomes more complex, you may have to reach for advanced regular expression features.

This article introduces advanced regular expression techniques. We have selected eight commonly used concepts and explain each with examples; each is a simple way to satisfy a complex requirement. If you are not yet familiar with the basics of regular expressions, read an introductory article, a tutorial, or the Wikipedia entry first.

The regex syntax used here is the one accepted by PHP's preg functions, which is Perl-compatible.

1. Greediness and laziness

All quantifiers, the operators that allow something to repeat more than once, are greedy. They match as much of the target string as possible, so the result is as long as possible. Unfortunately, that is not always what we want, which is why lazy quantifiers exist. Adding "?" after a greedy quantifier makes it match as little as possible. Alternatively, the "U" modifier inverts the default, making every quantifier in the pattern lazy unless it is followed by "?". Understanding the difference between greedy and lazy matching is the foundation for using advanced regular expressions.

Greedy operator

The operator * matches the preceding expression zero or more times. It is a greedy operator. Take a look at the following example:

preg_match( '/<h1>.*<\/h1>/', '<h1>This is a title.</h1><h1>This is another one.</h1>', $matches );

The period (.) stands for any character except a line break. The regular expression above matches an H1 tag and everything inside it, using a period (.) and an asterisk (*) to match the content.

The entire string is returned. The * operator matches everything, even the closing H1 tag in the middle. Because it is greedy, matching the whole string is exactly what maximizes its match.

Lazy operator

Make the expression lazy by modifying the pattern above slightly and adding a question mark (?):

/<h1>.*?<\/h1>/

Now the engine considers its job done as soon as it reaches the first closing H1 tag.
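To see the difference concretely, here is a minimal sketch that runs a greedy pattern, a lazy pattern and a pattern with the U modifier against the same input. The sample markup in $html is made up for illustration.

```php
<?php
// Hypothetical sample markup with two headings.
$html = '<h1>First</h1><h1>Second</h1>';

// Greedy: .* runs on to the last closing tag.
preg_match( '/<h1>.*<\/h1>/', $html, $greedy );

// Lazy: .*? stops at the first closing tag.
preg_match( '/<h1>.*?<\/h1>/', $html, $lazy );

// U modifier: quantifiers are lazy by default, so plain .* behaves like .*?
preg_match( '/<h1>.*<\/h1>/U', $html, $ungreedy );

echo $greedy[0], "\n";   // <h1>First</h1><h1>Second</h1>
echo $lazy[0], "\n";     // <h1>First</h1>
echo $ungreedy[0], "\n"; // <h1>First</h1>
```

Mixing the U modifier with explicit "?" quantifiers is easy to misread, so it is usually clearer to pick one style per pattern.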

Another greedy operator with similar behavior is {n,}. It matches the preceding pattern n or more times. Without a question mark it looks for as many repetitions as possible; with one, it settles for as few as possible (and n repetitions is, of course, the fewest).

# Create the string
$str = 'hihihi oops hi';
# Match with the greedy {n,} operator
preg_match( '/(hi){2,}/', $str, $matches );  # $matches[0] will be 'hihihi'
# Match with the lazy {n,}? operator
preg_match( '/(hi){2,}?/', $str, $matches );  # $matches[0] will be 'hihi'

2. Back referencing

What is it for?

Back referencing is a way of referring to previously captured content from within the same regular expression. For example, the following simple example is meant to match the text inside a pair of quotation marks:

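# Create the array that will hold the matches
$matches = array();
# Create the string
$str = "\"This is a 'string'\"";
# Capture the content with a regular expression
preg_match( "/(\"|').*?(\"|')/", $str, $matches );
# Print the whole matched string
echo $matches[0];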

It will output:

"This is a '

Obviously, this is not what we want.

The expression starts matching at the opening double quotation mark and wrongly ends the match at the first single quote. This is because the closing group, ("|'), accepts either a double quote (") or a single quote ('). To fix this, use a back reference. The expressions \1, \2, ..., \9 refer by number to the groups captured so far and act as "pointers" to what each group matched. In this case, the first quotation mark matched is available as \1.

How to use it?

In the example above, replace the closing quotation mark group with \1:

preg_match( '/("|\').*?\1/', $str, $matches );

This returns the string correctly:

"This is a 'string'"

Exercise:

If the quotation marks are Chinese-style (or curly) quotes, where the opening and closing marks are different characters, how would you handle it?
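Returning to the straight-quote case, the following sketch applies the \1 back reference to a string that mixes both quote styles. The input sentence is invented for illustration, and preg_match_all is used so that every quoted run is collected.

```php
<?php
// Hypothetical input that mixes both quote styles.
$str = "He said \"it's fine\" and then 'a \"quoted\" word'.";

// (["']) captures the opening quote; \1 demands the same character to close.
preg_match_all( '/(["\'])(.*?)\1/', $str, $matches );

// $matches[2] holds the text found between each matching pair of quotes.
print_r( $matches[2] );
// Array
// (
//     [0] => it's fine
//     [1] => a "quoted" word
// )
```

The apostrophe in "it's" does not end the first match, because the closing quote must be the same character that opened it.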

Remember PHP's preg_replace function? It supports back references too, except that in the replacement string we write $1 ... $n (any group number) rather than \1 ... \9. For example, to replace every pair of paragraph tags with escaped text that displays them literally:

$text = preg_replace( '/<p>(.*?)<\/p>/',
"&lt;p&gt;$1&lt;/p&gt;", $html );

The $1 parameter is a back reference to the text inside the paragraph tags, and it is inserted into the replacement text. This easy-to-use notation gives us a simple way to reuse matched text, even while replacing.
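As a further illustration of numbered references in a replacement string, here is a small sketch that reorders the parts of a date. The input text and the date format are assumptions made up for this example.

```php
<?php
// Hypothetical input: dates written as YYYY-MM-DD.
$text = 'Released 2010-04-12, updated 2011-01-03.';

// $1, $2 and $3 refer to the year, month and day groups respectively.
$result = preg_replace( '/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', $text );

echo $result; // Released 12/04/2010, updated 03/01/2011.
```

The replacement is written in single quotes here so that PHP passes $1, $2 and $3 through to preg_replace untouched.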

3. Named groups

When you use several back references in one expression, it is easy to lose track of which number (1 ... 9) stands for which captured content. An alternative is to give capturing groups names (hereafter "named groups"). A named group is written (?P<name>pattern), where name is the group name and pattern is the regular expression the group must match. Take a look at the following example:

/(?P<quote>"|').*?(?P=quote)/

In the expression above, quote is the group name and "|' is the pattern the group matches. The (?P=quote) that follows is a back reference to the group named quote. The expression does the same job as the back reference example above, only with a named group. Isn't it easier to read and understand?

Named groups also help when working with the array of matches: the group name can be used as a key into that array.

preg_match( '/(?P<quote>"|\')/', "'String'", $matches );
# The following statement prints ' (a single quote)
echo $matches[1];
# Accessing the group by name prints ' as well
echo $matches['quote'];

So named groups do not just make code easier to write; they also help keep it organized.
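A slightly larger sketch shows how named groups keep the match array readable when there are several groups. The date pattern and the group names year, month and day are illustrative choices, not part of the original article.

```php
<?php
// Each part of the date gets a descriptive name instead of a bare number.
$pattern = '/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/';

preg_match( $pattern, 'Published on 2010-04-12', $m );

// Named keys and numeric indexes both point at the same captures.
echo $m['year'], ' ', $m['month'], ' ', $m['day'], "\n"; // 2010 04 12
echo $m[1], "\n";                                        // 2010
```

If a group is later added or removed, code that reads $m['month'] keeps working, while code that reads $m[2] may silently pick up the wrong group.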

4. Word boundaries

A word boundary is a position in a string between a word character (a letter, digit or underscore) and a non-word character. What makes it special is that it does not match any actual character: its length is zero. \b matches every word boundary.

Unfortunately, word boundaries are often overlooked, and most people do not appreciate how useful they are. For example, suppose you want to match the word "import":

/import/

Watch out! Regular expressions can be mischievous. The following string also matches the pattern above:

important

You might think that adding a space before and after import would restrict the match to the standalone word:

/ import /

But what about a string like this:

The trader voted for the import

When the word import is at the beginning or end of the string, the modified expression does not match. So every case has to be considered:

/(^import | import | import$)/i

Don't panic, we are not finished yet. What about punctuation? Just to match this one word, your regular expression might end up like this:

/(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|\!)?$)/i

That is a lot of fuss for matching a single word. This is where word boundaries prove their worth. To cover all of the cases above, and many other variations, all we need to write with word boundaries is:

/\bimport\b/

This handles all of the cases above. The flexibility of \b comes from the fact that it is a zero-length match: it matches only the imaginary position between two actual characters. It checks whether, of two adjacent characters, one is a word character and the other is not; if so, the position matches. At the beginning or end of the string, \b treats the missing neighbor as a non-word character. Since the i of import is a word character, import is matched.

Note that \b has a counterpart, \B, which matches every position that is not a word boundary: between two word characters or between two non-word characters. So, to match "hi" inside a word, you can use:

\Bhi\B

Words such as "this" and "thigh" produce a match, while "hi there" does not.
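The behavior of \b and \B can be checked with a short script. The sample strings below are hypothetical and were chosen to mirror the cases discussed in this section.

```php
<?php
// Hypothetical sample strings covering the cases discussed above.
$samples = array(
    'import',
    'important',
    'The trader voted for the import',
    'import: goods',
    'reimported',
);

foreach ( $samples as $s ) {
    // preg_match returns 1 when the pattern is found and 0 when it is not.
    $found = preg_match( '/\bimport\b/', $s );
    echo str_pad( $s, 34 ), $found ? 'match' : 'no match', "\n";
}

// \B demands the opposite: "hi" must sit strictly inside a word.
$inside  = preg_match( '/\Bhi\B/', 'this' );     // 1
$outside = preg_match( '/\Bhi\B/', 'hi there' ); // 0
```

Only "important" and "reimported" are rejected by the \b pattern, because in both the letters next to "import" are word characters.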

5. Atomic groups

An atomic group is a special non-capturing group. It is typically used to improve the performance of a regular expression and to rule out certain matches. An atomic group is written (?>pattern), where pattern is the expression to match.

/(?>his|this)/

When the regex engine leaves an atomic group, it throws away every backtracking position it remembered inside the group. In other words, once one alternative in the group has matched, the engine never comes back to try the remaining alternatives, even if the rest of the pattern fails later. Take the word "smashing": matched against the expression above, the engine looks for "his" and "this" and finds neither. Since nothing follows the group in this pattern, there is nothing to backtrack for, and the atomic group makes no difference here.

The example above is therefore not very practical, and /t?his/ achieves the same result. Take a look at another example:

/\b(engineer|engrave|end)\b/

Matched against "engineering", the engine first matches "engineer", but the \b that follows fails because the word continues, so this attempt is unsuccessful. The engine then backtracks and tries the next alternative, "engrave": it matches "eng", the rest does not fit, and the attempt fails. Finally it tries "end", which fails as well. Look closely and you will see that once "engineer" has matched and the word boundary has failed, "engrave" and "end" cannot possibly succeed: they are different words starting at the same position, so the engine should not waste time on them.

/\b(?>engineer|engrave|end)\b/

Replacing the group with an atomic one, as above, saves the engine matching time and makes the code more efficient.
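Timing differences are hard to show in a few lines, but the way an atomic group discards backtracking positions can be observed directly. The sketch below uses a deliberately tiny pattern, (a|ab)c, which is an assumption made for illustration and not taken from the article.

```php
<?php
// Plain group: "a" matches, "c" fails against "b", so the engine
// backtracks into the group, tries "ab", and then "c" succeeds.
$plain = preg_match( '/(a|ab)c/', 'abc' );    // 1

// Atomic group: once "a" has matched, the "ab" alternative is discarded,
// so there is nothing to fall back on when "c" fails.
$atomic = preg_match( '/(?>a|ab)c/', 'abc' ); // 0

// The article's pattern still accepts a complete word and rejects a longer one.
$whole  = preg_match( '/\b(?>engineer|engrave|end)\b/', 'an engineer' ); // 1
$longer = preg_match( '/\b(?>engineer|engrave|end)\b/', 'engineering' ); // 0

var_dump( $plain, $atomic, $whole, $longer );
```

The first pair shows that an atomic group can change what matches, not only how fast it matches, so the order of alternatives inside it matters.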

6. Recursion

Recursion is used to match nested structures, such as nested parentheses, (this (that)), or nested HTML tags, <div><div></div></div>. We use (?R) to stand for the whole pattern within the recursion. The following example matches nested parentheses:

/\(((?>[^()]+)|(?R))*\)/

The outermost layer uses an escaped parenthesis, \(, to match the start of the nested structure. Then comes an alternation (|) repeated with *: each repetition matches either a run of characters other than parentheses, (?>[^()]+), or, through the subpattern (?R), the entire expression again. Because * is greedy, it matches as many nested groups as possible. The final \) matches the closing parenthesis.
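Here is a minimal sketch of the parentheses pattern at work. The input expression is invented for illustration.

```php
<?php
// The recursive pattern from above: (?R) re-enters the whole expression.
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// Hypothetical input with parentheses nested three levels deep.
$str = 'f(a, (b + c) * (d - (e))) + g';

preg_match( $pattern, $str, $m );

echo $m[0]; // (a, (b + c) * (d - (e)))
```

The match starts at the first opening parenthesis and ends at the parenthesis that balances it, however deep the nesting in between.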

Another example of recursion is as follows:

/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/

The expression above combines character classes, greedy operators, back references and atomic groups to match nested tags. The first group, ([\w]+), captures the tag name for later use. Once an opening tag in angle brackets has been found, the expression tries to match the rest of the tag's content. The next parenthesized subexpression is very similar to the previous example: it either matches any characters other than angle brackets, (?>[^<>]+), or recursively matches the entire expression with (?R). The expression ends with <\/\1>, the closing tag.
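The tag pattern can be tried on a small fragment. The markup below is a made-up example, and the sketch only covers well-formed, properly nested tags; it is not a substitute for an HTML parser.

```php
<?php
// The recursive tag pattern from above.
$pattern = '/<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/';

// Hypothetical fragment: a div nested inside another div.
$html = 'x <div>a<div>b</div>c</div> y';

preg_match( $pattern, $html, $m );

echo $m[0], "\n"; // <div>a<div>b</div>c</div>
echo $m[1], "\n"; // div
```

The inner div is consumed by the (?R) recursion, so the closing tag matched at the end is the one that belongs to the outer div.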

7. Callbacks

Specific content in a match sometimes needs special modification. When the modifications are numerous or complex, regular expression callbacks come in handy. Callbacks are used with the preg_replace_callback function to modify strings dynamically. You pass preg_replace_callback a function as a parameter; this function receives the array of matches as its argument and returns the modified text, which is used as the replacement.

For example, suppose we want to capitalize the first letter of every word in a string. Unfortunately, PHP's regular expressions have no operator that changes letter case. A callback can do the job. First, write an expression that matches every letter that needs to be capitalized:

/\b\w/

This expression uses both a word boundary and a character class. The expression alone is not enough; we also need a callback function:

function upper_case( $matches ) {
    return strtoupper( $matches[0] );
}

The function upper_case receives the array of matches and converts the entire match to uppercase. Here, $matches[0] is the letter that needs to be capitalized. We then wire up the callback with preg_replace_callback:

preg_replace_callback( '/\b\w/', "upper_case", $str );

Even a simple callback is that powerful.
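Put together, the pieces above form a complete script. This sketch passes an anonymous function instead of a function name, which is a stylistic choice and not something the article prescribes; the input string is made up.

```php
<?php
// Hypothetical input string.
$str = 'advanced regular expressions';

$result = preg_replace_callback(
    '/\b\w/', // the first character of every word
    function ( array $matches ) {
        // $matches[0] is the single letter that was matched.
        return strtoupper( $matches[0] );
    },
    $str
);

echo $result; // Advanced Regular Expressions
```

Because the callback is ordinary PHP, it can do anything a function can: look values up in an array, format numbers, or call other functions on the matched text.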

8. Commenting

Comments do not match anything in the string, yet they are one of the most important parts of a regular expression. The more complex your expressions become, the harder it gets to work out what is being matched. Adding comments inside the regular expression is the best way to minimize future confusion.

To add a comment inside a regular expression, use the (?#comment) format, replacing "comment" with your own text:

/(?#digit)\d/

If you intend to publish your code, commenting regular expressions is particularly important: it makes it easier for others to read and modify your code. As with comments elsewhere, it also helps when you revisit your own programs later.

Consider using the "x" modifier, or (?x) inside the pattern, to lay out the comments. This modifier makes the regex engine ignore whitespace in the pattern and treat everything from an unescaped # to the end of the line as a comment. Whitespace you actually want to match can still be written as [ ], as \s, or as an escaped space (a backslash followed by a space).

/
\d    #digit
[ ]   #space
\w+   #word
/x

The code above is equivalent to the following expression:

/\d(?#digit)[ ](?#space)\w+(?#word)/

Always be aware of the readability of your code.
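The two commenting styles can be compared side by side in PHP. In this sketch the sample input 'Room 7 east' is an assumption chosen for illustration; note that a comment in an x-mode pattern must not contain the pattern delimiter (here, the slash).

```php
<?php
// Free-spacing version: the x modifier ignores layout whitespace and # comments.
$spaced = '/
    \d    # digit
    [ ]   # space
    \w+   # word
/x';

// Inline version using (?#...) comments.
$inline = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

// Hypothetical input string.
$str = 'Room 7 east';

preg_match( $spaced, $str, $a );
preg_match( $inline, $str, $b );

echo $a[0], "\n"; // 7 east
echo $b[0], "\n"; // 7 east
```

Both patterns compile to the same expression; the free-spacing form simply leaves room to explain each piece on its own line.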

More Resources (English)

    • Regular-expressions.info Comprehensive website on Regular Expressions
    • Cheat Sheet Informative Regular Expressions Cheat sheet
    • Regex Generator JavaScript Regular Expressions Generator

About the author


Karthik Viswanathan is a high school student who enjoys programming and building websites. You can see his work on his blog, Lateral Code, and in his online Twitter apps.
