Advanced Regular Expression Techniques: Key Concepts with Examples [translation]

Source: Internet
Author: User
Document directory
  • 1. Greedy/lazy
  • 2. Back referencing
  • 3. Named groups
  • 4. Word boundaries
  • 5. Atomic groups
  • 6. Recursion
  • 7. Callbacks
  • 8. Commenting
  • More resources

Regular expressions (abbreviated RegEx) are powerful and can be used to find the information you need in a large body of text. They describe text with patterns built from ordinary and special characters. Unfortunately, simple regular expressions are not capable enough for some advanced applications. For those, you may need advanced regular expressions.

This article introduces advanced regular expression techniques. We have selected eight commonly used concepts and paired each with worked examples. Each example is a simple way to meet a more complicated requirement. If you are not yet familiar with the basics of regular expressions, read an introductory article or tutorial, or the Wikipedia entry, first.

The regular expression syntax used here is that of PHP's preg functions, which is Perl-compatible.

 

1. Greedy/lazy

All regular expression quantifiers are greedy by default. They match as much of the target string as possible, so the match is as long as possible. Unfortunately, this is not always what we want. The "lazy" variants solve the problem: adding "?" after a greedy quantifier makes the expression match as little as possible. In addition, the "U" modifier makes every quantifier in the pattern lazy by default (a quantifier followed by "?" then becomes greedy). Understanding the difference between greedy and lazy matching is the foundation for using advanced regular expressions.

Greedy Operator

The operator * matches the previous expression zero or more times. It is a greedy operator. See the following example:


    preg_match( '/<H1>.*<\/H1>/', '<H1> This is a title. </H1><H1> This is another one. </H1>', $matches );

A period (.) stands for any character except a line break. The regular expression above matches the H1 tag and everything inside it, using a period (.) and an asterisk (*) to match all of the tag's content. The match is:


    <H1> This is a title. </H1><H1> This is another one. </H1>

The entire string is returned. The * operator keeps matching everything, including the closing H1 tag in the middle. Because it is greedy, matching the whole string is exactly what it is designed to do: match as much as possible.

Lazy Operator

Adding a question mark (?) after the quantifier makes the expression lazy:


    /<H1>.*?<\/H1>/

Now the expression considers its job done as soon as it reaches the first closing H1 tag.
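The greedy and lazy forms can be compared side by side in a minimal sketch. The sample string and variable names below are made up for illustration:

```php
<?php
// Hypothetical sample input: two headings on one line.
$html = '<h1>First</h1><h1>Second</h1>';

// Greedy: .* runs to the end of the line, then backtracks to the LAST </h1>.
preg_match('/<h1>.*<\/h1>/', $html, $greedy);

// Lazy: .*? expands one character at a time and stops at the FIRST </h1>.
preg_match('/<h1>.*?<\/h1>/', $html, $lazy);

// The U modifier makes quantifiers lazy by default, so plain .* acts like .*?
preg_match('/<h1>.*<\/h1>/U', $html, $ungreedy);

echo $greedy[0], "\n";    // <h1>First</h1><h1>Second</h1>
echo $lazy[0], "\n";      // <h1>First</h1>
echo $ungreedy[0], "\n";  // <h1>First</h1>
```

Adding "?" to individual quantifiers is usually easier to read than the U modifier, because the laziness is visible at the exact place where it applies.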

Another greedy operator with similar behavior is {n,}. It means the preceding pattern repeats n or more times. Without a question mark it looks for as many repetitions as possible; with one, it matches as few repetitions as possible (though never fewer than n).


    # Create a string
    $str = 'hihihi oops hi';

    # Match with the greedy {n,} quantifier
    preg_match( '/(hi){2,}/', $str, $matches );   # $matches[0] will be 'hihihi'

    # Match with the lazy {n,}? quantifier
    preg_match( '/(hi){2,}?/', $str, $matches );  # $matches[0] will be 'hihi'

2. Back referencing

What is the purpose?

Back referencing means referring, inside a regular expression, to content that the expression has already captured. For example, the following simple example aims to match the content inside quotation marks:


    # Create the matches array
    $matches = array();

    # Create a string
    $str = "\"This is a 'string'\"";

    # Capture content with a regular expression
    preg_match( "/(\"|').*?(\"|')/", $str, $matches );

    # Output the entire matched string
    echo $matches[0];

It will output:


    "This is a '

Obviously, this is not what we want.

This expression starts matching at the opening double quotation mark and wrongly ends the match at the first single quotation mark. That is because the expression contains ("|'), which accepts either a double quotation mark (") or a single quotation mark ('). To fix this, use a back reference. The expressions \1, \2, ..., \9 refer, in order, to the groups that have been captured so far, and can be used as "pointers" to those groups. In this example, the first matched quotation mark is represented by \1.

How to use it?

Replace the closing quotation mark in the example above with \1:


    preg_match( '/("|\').*?\1/', $str, $matches );

This correctly returns the string:


    "This is a 'string'"

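Translator's note:

What if the opening and closing quotation marks are not the same character, as with Chinese quotation marks? A back reference only repeats the exact text that was captured, so it cannot pair two different characters.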

Remember the PHP function preg_replace? It supports back references too, but instead of \1 ... \9 we use $1 ... $9 ... $n (n can be any number) as the back reference. For example, to turn every paragraph tag <p> into plain text by replacing the tags with their HTML-escaped form:


    $text = preg_replace( '/<p>(.*?)<\/p>/',
        "&lt;p&gt;$1&lt;/p&gt;", $html );

The $1 parameter is a back reference that stands for the text inside the paragraph tag <p>, and it is inserted into the replacement text. This simple, easy-to-use notation gives us a way to reuse matched text even while replacing it.
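Both kinds of back reference can be tried in a short sketch: \1 inside the pattern and $1, $2 inside the replacement string. The input strings and variable names are assumptions chosen for illustration:

```php
<?php
// Hypothetical input mixing both quote styles.
$str = "He said \"it's fine\" and 'so \"what\"'.";

// \1 must repeat whatever group 1 captured, so a string opened with "
// can only be closed by ", and one opened with ' only by '.
preg_match_all('/("|\').*?\1/', $str, $matches);
print_r($matches[0]);

// In the replacement string, $1 and $2 refer to the captured groups.
// Single quotes keep PHP from treating $1 as anything special.
$swapped = preg_replace('/(\w+)@(\w+)/', '$2 at $1', 'user@example');
echo $swapped, "\n"; // example at user
```

The first call finds the two quoted passages, "it's fine" in double quotes and 'so "what"' in single quotes, without being confused by the other quote style inside each one.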

3. Named groups

When back references are used several times in one expression, things get confusing quickly, and working out which group each number (1 ... 9) refers to is tedious. An alternative is to use named capture groups (hereafter "named groups"). A named group is written (?P<name>pattern), where name is the group name and pattern is the regular expression the group matches. See the following example:


    /(?P<quote>"|').*?(?P=quote)/

In the expression above, quote is the group name and "|' is the pattern the group matches. The later (?P=quote) refers back to the named group quote. This expression has the same effect as the back reference example above, but it is implemented with a named group. Isn't it easier to read and understand?

Named groups also help when working with the array of matched content. The name assigned to a group can be used as a key for that group's content in the matches array.


    preg_match( '/(?P<quote>"|\')/', "'String'", $matches );

    # The following statement outputs "'" (without the double quotation marks)
    echo $matches[1];

    # Accessing the match by group name also outputs "'"
    echo $matches['quote'];

So named groups not only make expressions easier to write, they also help organize the code that uses them.
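As a small sketch of how named groups organize results, the following parses a date into named parts. The date string and the group names year, month, and day are hypothetical, chosen only for this example:

```php
<?php
// Hypothetical ISO-style date; the group names are chosen for this example.
$date = '2024-03-15';

preg_match('/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/', $date, $m);

// Each named group is available under its name and under its numeric index.
echo $m['year'], "\n";   // 2024
echo $m[1], "\n";        // 2024
echo $m['month'], "\n";  // 03
echo $m['day'], "\n";    // 15
```

Reading $m['month'] stays correct even if another group is later added in front of it, which is not true of a numeric index.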

4. Word boundaries

A word boundary is a position in a string between a word character (a letter, digit, or underscore) and a non-word character. It does not match an actual character; its length is zero. \b matches every word boundary.

Unfortunately, word boundaries are often overlooked, and most people pay little attention to their practical value. For example, suppose you want to match the word "import":


    /import/

Watch out! Regular expressions can be tricky. The following string also matches the expression above:


    important

You might think that adding a space before and after import would make the expression match only the standalone word:


    / import /

But what about this case:


    The trader voted for the import

When the word import is at the start or end of the string, the modified expression no longer matches. So the different cases have to be covered one by one:


    /(^import | import | import$)/i

We are not done yet. What if there is punctuation? Just to match this one word, your regular expression may need to look like this:


    /(^import(:|;|,)? | import(:|;|,)? | import(\.|\?|!)?$)/i

That is a lot of work for matching a single word, which is why word boundaries matter. To handle the cases above, and many other variants, with word boundaries, all we need to write is:


    /\bimport\b/

This handles all the cases above. The flexibility of \b comes from being a zero-length match. It matches only a position between two characters, checking that one of the adjacent characters is a word character and the other is a non-word character. If so, it matches. At the start or end of the string, \b treats the missing neighbor as a non-word character. Because the i of import is still a word character, import is matched.
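The cases discussed above can be checked with a short sketch. The sentences are sample inputs assumed for illustration:

```php
<?php
// Hypothetical sentences used only for this illustration.
$subjects = [
    'This is important.',               // "import" only inside a longer word
    'The trader voted for the import',  // at the end of the string
    'Import: duties apply',             // at the start, followed by punctuation
];

$results = [];
foreach ($subjects as $s) {
    // preg_match returns 1 on a match and 0 otherwise.
    $hit = preg_match('/\bimport\b/i', $s);
    $results[] = $hit;
    echo $s, ' => ', $hit ? 'match' : 'no match', "\n";
}
```

Only the first sentence is rejected, because there "import" is followed directly by another word character.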

Note that \b has an opposite, \B. This operator matches the position between two word characters or between two non-word characters. So, if you want to match 'hi' inside a word, you can use:


    \Bhi\B

"This" and "thigh" return a match, while "hi there" does not.

5. Atomic groups

An atomic group is a special non-capturing regular expression group. It is usually used to improve the performance of regular expressions, and it can also rule out certain matches. An atomic group is written (?>pattern), where pattern is the expression to match.


    /(?>his|this)/

When the regular expression engine exits an atomic group, it discards every backtracking position remembered inside the group. Take the word "smashing" as an example. With the expression above, the engine looks for "his" and then "this" at each position in "smashing" and finds neither, so the match fails. The atomic group only makes a difference after one of its alternatives has matched: the engine then discards the remaining alternatives and will not come back to try them, even if a later part of the expression fails.

The example above is not very practical; /t?his?/ would achieve a similar result. Let's take a look at the following example:


    /\b(engineer|engrave|end)\b/

If "engineering" is the subject, the engine first matches "engineer", but then it reaches the word boundary, \b, which fails because the next character is "i". The engine then backtracks and tries the next alternative, engrave. It matches "eng" and then fails. Finally it tries "end", which fails as well. On closer inspection, once engineer has matched and the word boundary has failed, neither "engrave" nor "end" can possibly match at that position, so the engine should not make these unnecessary attempts.


    /\b(?>engineer|engrave|end)\b/

Written with an atomic group, the expression spares the engine those attempts and makes the code more efficient.
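The effect of discarding backtracking positions is easiest to see in a case where it changes the result. The following is a minimal sketch. The pattern (?>a|ab)c and the test strings are assumptions chosen for illustration and do not come from the original article:

```php
<?php
// Non-atomic: after "a" matches and "c" fails, the engine backtracks
// into the group and tries the second alternative "ab".
$plain = preg_match('/(?:a|ab)c/', 'abc');

// Atomic: once "a" has matched, the other alternative is thrown away,
// so the engine cannot go back and try "ab".
$atomic = preg_match('/(?>a|ab)c/', 'abc');

echo $plain, ' ', $atomic, "\n"; // 1 0

// The whole-word pattern from the text behaves as expected.
$pattern = '/\b(?>engineer|engrave|end)\b/';
echo preg_match($pattern, 'engineering'), "\n"; // 0
echo preg_match($pattern, 'the end'), "\n";     // 1
```

Because an atomic group can reject strings that the plain group would accept, it is safest where, as in the whole-word pattern, an alternative that matched makes the remaining alternatives impossible anyway.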

6. Recursion

Recursion is used to match nested structures, such as nested parentheses, (this (that)), and nested HTML tags, <div><div></div></div>. We use (?R) to make the pattern call itself recursively. The following is an example that matches nested parentheses:


    /\(((?>[^()]+)|(?R))*\)/

The outermost escaped parenthesis "\(" matches the beginning of the nested structure. Then comes an alternation (|), which either matches a run of characters other than parentheses, "(?>[^()]+)", or uses the subpattern "(?R)" to match the entire expression again. Note that the * operator matches as many nested groups as possible.
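A short sketch shows the recursive pattern at work. The input string is made up for illustration, and the literal parentheses in the pattern are escaped with backslashes:

```php
<?php
// The recursive pattern, with the literal parentheses escaped.
$pattern = '/\(((?>[^()]+)|(?R))*\)/';

// Hypothetical input containing one nested group and one flat group.
$text = 'f(a (b (c)) d) and (e)';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]); // the two balanced groups: (a (b (c)) d) and (e)

// An unbalanced string contains no complete group, so nothing matches.
$unbalanced = preg_match($pattern, '(a (b');
echo $unbalanced, "\n"; // 0
```

Each (?R) call has to consume one complete, balanced group, which is why the inner "(b (c))" is absorbed into the outer match instead of being reported separately.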

Another example of recursion is as follows:


    /<([\w]+).*?>((?>[^<>]+)|((?R)))*<\/\1>/

The expression above combines character classes, quantifiers, back references, and atomic groups to match nested tags. The first parenthesized group, ([\w]+), captures the tag name for later use. Once this angle-bracket tag is found, the expression tries to match the rest of the tag's content. The next parenthesized subexpression is very similar to the previous example: it either matches a run of characters other than angle brackets, (?>[^<>]+), or recursively matches the entire expression, (?R). The last part of the expression is the closing angle-bracket tag, <\/\1>.

7. Callbacks

Sometimes specific content in a match needs special modification. When several complex modifications are required, regular expression callbacks come in handy. Callbacks are used with the function preg_replace_callback. You pass a function to preg_replace_callback as an argument. That function receives the array of match results as its parameter and returns the replacement string.

For example, suppose we want to capitalize the first letter of every word in a string. Unfortunately, PHP has no regular expression operator that changes case directly. To do this, you can use a callback. First, the expression must match every letter that needs to be capitalized:


    /\b\w/

The expression uses both a word boundary and a character class. On its own it is not enough; we also need a callback function:


    function upper_case( $matches ) {
        return strtoupper( $matches[0] );
    }

The function upper_case receives the array of match results and converts the entire match to uppercase. In this example, $matches[0] is the letter to be capitalized. Then we use preg_replace_callback to apply the callback:


    preg_replace_callback( '/\b\w/', "upper_case", $str );

A simple callback can be this powerful.
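Put together, the steps above can be run as one short script. This sketch passes an anonymous function instead of a named one, and the input string is an assumption for illustration:

```php
<?php
// Hypothetical input string.
$str = 'advanced regular expressions';

// The callback receives the match array; $m[0] is the first letter of a word.
// Its return value replaces that match in the result.
$title = preg_replace_callback(
    '/\b\w/',
    function ($m) {
        return strtoupper($m[0]);
    },
    $str
);

echo $title, "\n"; // Advanced Regular Expressions
```

preg_replace_callback returns the modified string and leaves $str unchanged, so the result has to be assigned or used directly.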

8. Commenting

Comments are not used to match strings, but they are arguably the most important part of a regular expression. The more involved an expression becomes, the more complicated it is to write and the harder it is to work out exactly what it matches. Adding comments to a regular expression is the best way to minimize confusion later.

To add a comment inside a regular expression, use the (?#comment) format. Replace "comment" with your comment text:


    /(?#digit)\d/

If you plan to publish your code, commenting your regular expressions is especially important. It makes it easier for others to understand and modify your code. As with comments anywhere else, it also helps you when you revisit programs you wrote earlier.

Consider using the "x" or "(?x)" modifier to format comments. This modifier makes the engine ignore whitespace in the expression and treat everything from an unescaped # to the end of the line as a comment. "Meaningful" spaces can still be matched with [ ] or with "\ " (a backslash followed by a space).


    /
    \d    #digit
    [ ]   #space
    \w+   #word
    /x

The code above does the same job as the following expression:


    /\d(?#digit)[ ](?#space)\w+(?#word)/
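That the two forms are equivalent can be checked with a small sketch. The subject string and variable names are assumptions for illustration:

```php
<?php
// Free-spacing version: with the x modifier, whitespace in the pattern is
// ignored and # starts a comment that runs to the end of the line.
$commented = '/
    \d    # digit
    [ ]   # space
    \w+   # word
/x';

// Compact version with inline (?#...) comments.
$compact = '/\d(?#digit)[ ](?#space)\w+(?#word)/';

$subject = 'Room 7 north'; // hypothetical input

preg_match($commented, $subject, $a);
preg_match($compact, $subject, $b);

echo $a[0], "\n";           // 7 north
var_dump($a[0] === $b[0]);  // bool(true)
```

The free-spacing form takes more lines but lets each part of the pattern sit beside its explanation.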

Always pay attention to the readability of your code.

More resources
  • Regular-Expressions.info: a comprehensive website on regular expressions
  • Cheat sheet: an informative regular expressions cheat sheet
  • RegEx generator: a JavaScript regular expressions generator
About the author

Karthik Viswanathan is a high school student who enjoys programming and building websites. You can view his work on his blog, Lateral Code. You can also take a look at his online Twitter application.
