PHP Regular Expressions in Practice (II): Extracting HTML Elements

Last Update:2016-06-23 Source: Internet

Author: User

This article introduces pattern modifiers, greedy and non-greedy matching, Unicode mode, and lookaround in regular expressions, using the extraction of HTML elements as a running example.
Before reading it, it is best to carefully read the previous article in this series, PHP Regular Expressions in Practice (I): Validating Mobile Phone Numbers.

Basic extraction

Suppose we have the following table:

User name	Career
Kobe Bryant	Basketball player
Jay Chou	Singer, songwriter, producer, actor, director
Lionel Messi	Football player

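Its source code is as follows:

<table>
  <thead>
    <tr>
      <th>User name</th>
      <th>Career</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td>
      <td>Basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td>
      <td>Singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td>
      <td>Football player</td>
    </tr>
  </tbody>
</table>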

Now suppose we want to extract the first data row, that is, the first <tr> element that contains <td> cells. The simplest regular expression would look like this:

<tr>\s+<td>.*<\/tr>

where:

\s is one of the shorthand character classes introduced in PHP Regular Expressions in Practice (I): Validating Mobile Phone Numbers; it represents whitespace characters such as carriage return, space, and tab.
The quantifier + means that the character or character class it modifies occurs one or more times.
The dot . is a special metacharacter in regular expressions that matches "any character".
The slash / in the closing tag is also the pattern delimiter of a regular expression in PHP, so it must be escaped as \/ to represent a literal slash.

In fact, this expression cannot extract the row we want from the source above.

The main problem is that, by default, the dot . does not match the line break \n. There are two ways to solve this:

Use the pattern modifier s, so the regular expression becomes /<tr>\s+<td>.*<\/tr>/s or (?s)<tr>\s+<td>.*<\/tr>. The pattern modifier s lets the dot . match line breaks as well.
Use [\s\S], [\w\W], or [\d\D] instead of the dot . to match any character, so the regular expression becomes <tr>\s+<td>[\s\S]*<\/tr>.
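
The two fixes can be compared side by side. The following is a minimal sketch, assuming a hypothetical single-row fragment whose <tr> and <td> tags stand in for the article's table markup:

```php
<?php
// Hypothetical fragment standing in for one row of the article's table.
$html = "<tr>\n  <td>Kobe Bryant</td>\n  <td>Basketball player</td>\n</tr>";

// Without the s modifier, "." stops at a line break, so nothing matches.
$plain = preg_match('/<tr>\s+<td>.*<\/tr>/', $html);

// With the s modifier (trailing or inline), "." also matches "\n".
$trailing = preg_match('/<tr>\s+<td>.*<\/tr>/s', $html);
$inline   = preg_match('/(?s)<tr>\s+<td>.*<\/tr>/', $html);

// [\s\S] matches any character without needing a modifier.
$charClass = preg_match('/<tr>\s+<td>[\s\S]*<\/tr>/', $html);

var_dump($plain, $trailing, $inline, $charClass); // int(0) int(1) int(1) int(1)
```

Only the first call fails to match; the other three succeed because they let the pattern cross line breaks.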

Pattern modifiers deserve a closer look here (the PHP manual lists all the pattern modifiers that PHP supports). Pattern modifiers change some of the default rules of a regular expression. Commonly used ones include i, s, U, and u. Several of them are used later in this article, so their individual roles are explained where they first appear rather than here. What follows compares the two notations, /.../{modifier} and ...(?{modifier})....

Pattern modifiers	/.../{modifier}	...(?{modifier})...
Example	/<tr>.*<\/tr>/s	(?s)<tr>.*<\/tr>
Name (PHP manual)	Pattern modifiers	In-pattern modifiers
Name ("Regular Guide")	Predefined constants	Pattern modifiers
Scope	The entire regular expression	Outside a group (subexpression), it applies to the part of the regular expression that follows it; inside a group, it applies to the remainder of that group. When it is placed at the very start of the regular expression and outside any group, it is equivalent to /.../{modifier}.
Level of support	Supports all pattern modifiers	Supports only some pattern modifiers
Other programming languages	May not be supported	Generally supported
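
The scope rules in the table can be checked with a short sketch. It uses the i (case-insensitive) modifier rather than s purely because its effect is easier to see; the test strings are made up for illustration:

```php
<?php
// A trailing modifier and a leading inline modifier behave the same.
$trailing = preg_match('/abc/i', 'ABC');        // 1
$leading  = preg_match('/(?i)abc/', 'ABC');     // 1

// Outside a group, (?i) affects only what follows it: "a" stays case-sensitive.
$afterOnlyHit  = preg_match('/a(?i)b/', 'aB');  // 1
$afterOnlyMiss = preg_match('/a(?i)b/', 'AB');  // 0

// Inside a group, (?i) ends with the group: "c" stays case-sensitive.
$groupHit  = preg_match('/(a(?i)b)c/', 'aBc');  // 1
$groupMiss = preg_match('/(a(?i)b)c/', 'aBC');  // 0

var_dump($trailing, $leading, $afterOnlyHit, $afterOnlyMiss, $groupHit, $groupMiss);
```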

As the GIF in the original article shows, the extracted result contains three <tr> elements, not just one. This is because quantifiers in regular expressions are greedy by default. Here, .* first matches every remaining character up to the end of the string. It then backtracks until the <\/tr> in the regular expression can match, which happens at the last </tr> in the string. That completes the match, so the final result spans all three rows.

You can use the pattern modifier U to make the entire regular expression non-greedy (it inverts the default greediness of every quantifier), or use a non-greedy quantifier to make a single quantifier non-greedy:

Make the entire regular expression non-greedy:
- /<tr>\s+<td>.*<\/tr>/Us
- or (?Us)<tr>\s+<td>.*<\/tr>
Use a non-greedy quantifier:
- /<tr>\s+<td>.*?<\/tr>/s
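
The difference between these variants can be observed by counting how many rows a single match spans. This is a minimal sketch: the table markup (a <th> header row followed by three <td> rows) is an assumed reconstruction, and rowsMatched is a hypothetical helper.

```php
<?php
// Assumed markup for the article's table: one <th> header row, three <td> rows.
$html = <<<HTML
<table>
  <thead>
    <tr>
      <th>User name</th>
      <th>Career</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td>
      <td>Basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td>
      <td>Singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td>
      <td>Football player</td>
    </tr>
  </tbody>
</table>
HTML;

// Hypothetical helper: number of <tr> tags inside the first match.
function rowsMatched(string $pattern, string $subject): int
{
    preg_match($pattern, $subject, $m);
    return substr_count($m[0] ?? '', '<tr>');
}

$greedy         = rowsMatched('/<tr>\s+<td>.*<\/tr>/s', $html);     // 3
$lazyQuantifier = rowsMatched('/<tr>\s+<td>.*?<\/tr>/s', $html);    // 1
$ungreedyFlag   = rowsMatched('/<tr>\s+<td>.*<\/tr>/Us', $html);    // 1
$ungreedyInline = rowsMatched('/(?Us)<tr>\s+<td>.*<\/tr>/', $html); // 1

var_dump($greedy, $lazyQuantifier, $ungreedyFlag, $ungreedyInline);
```

Under these assumptions, the greedy pattern runs from the first data row to the last </tr>, while the three non-greedy variants stop at the end of the first data row.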

The complete list of greedy quantifiers (also called match-priority quantifiers) and non-greedy quantifiers (ignore-priority quantifiers) is shown in the following table:

Greedy quantifier	Non-greedy quantifier	Number of occurrences
*	*?	Zero or more times, no upper limit
+	+?	At least once, no upper limit
?	??	0 or 1 time
{m,n}	{m,n}?	At least m times and at most n times
{m,}	{m,}?	At least m times, no upper limit
{0,n}	{0,n}?	From 0 to n times
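
Each pair in the table can be tried against the same subject. The sketch below uses a made-up subject string, 'aaaa', and a hypothetical helper, firstMatch, to show that the greedy form takes as many characters as allowed while the non-greedy form takes as few as allowed:

```php
<?php
// Hypothetical helper: the text matched by $pattern in $subject.
function firstMatch(string $pattern, string $subject): string
{
    preg_match($pattern, $subject, $m);
    return $m[0];
}

$subject = 'aaaa';
$results = [
    'a*'      => firstMatch('/a*/', $subject),      // 'aaaa'
    'a*?'     => firstMatch('/a*?/', $subject),     // ''
    'a+'      => firstMatch('/a+/', $subject),      // 'aaaa'
    'a+?'     => firstMatch('/a+?/', $subject),     // 'a'
    'a?'      => firstMatch('/a?/', $subject),      // 'a'
    'a??'     => firstMatch('/a??/', $subject),     // ''
    'a{2,3}'  => firstMatch('/a{2,3}/', $subject),  // 'aaa'
    'a{2,3}?' => firstMatch('/a{2,3}?/', $subject), // 'aa'
    'a{2,}'   => firstMatch('/a{2,}/', $subject),   // 'aaaa'
    'a{2,}?'  => firstMatch('/a{2,}?/', $subject),  // 'aa'
];

print_r($results);
```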

Extracting rows that contain specified content

Suppose we want to extract the rows for athletes in the table. The original Chinese table uses the word 运动员 ("athlete", rendered as "player" in the table above), so we might use a regular expression such as /<tr>.*运动员.*<\/tr>/s.

This expression finds a match when the text is in a Unicode encoding, but not in a GBK environment. We can enable Unicode mode with the pattern modifier u:

/<tr>.*运动员.*<\/tr>/us

In Unicode mode, we can even use code points instead of the Chinese characters themselves:

/<tr>.*\x{8fd0}\x{52a8}\x{5458}.*<\/tr>/us

PHP uses the form \x{hex} to represent the code point of a Unicode character. The advantage of code points is that they can be combined with a character class to express a range. For example, [\x{4e00}-\x{9fff}] matches the Chinese characters in the CJK Unified Ideographs block.
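
The effect of the u modifier and of \x{hex} code points can be sketched as follows. The row is a hypothetical fragment, and the file is assumed to be saved as UTF-8:

```php
<?php
// Hypothetical row; 篮球运动员 means "basketball player". File encoding: UTF-8.
$row = "<tr>\n<td>Kobe Bryant</td>\n<td>篮球运动员</td>\n</tr>";

// The literal characters and their code points are interchangeable under u.
$literal    = preg_match('/<tr>.*运动员.*<\/tr>/us', $row);
$codePoints = preg_match('/<tr>.*\x{8fd0}\x{52a8}\x{5458}.*<\/tr>/us', $row);

// Without u, "." matches one byte; with u, it matches one whole character.
$byteMode    = preg_match('/^.$/', '员');   // 0: 员 is three bytes in UTF-8
$unicodeMode = preg_match('/^.$/u', '员');  // 1

// A code point range inside a character class picks out runs of Chinese characters.
preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $row, $m);
$chineseRuns = $m[0];                       // ['篮球运动员']

var_dump($literal, $codePoints, $byteMode, $unicodeMode, $chineseRuns);
```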

The expression above finds a match, but not the correct one: it matches everything from the first <tr> in the string to the last </tr>.
Intuitively, we want the regular expression to find 运动员 ("athlete") first, then the nearest <tr> to its left and the nearest </tr> to its right. In reality, a regular expression is matched from left to right, starting from the beginning of the pattern. The whole match breaks down as shown in the following table (whitespace characters are not shown).

Expression	Matched text
/
<tr>	The first <tr> in the string (the header row)
.*	Everything from the header cells through the Kobe Bryant and Jay Chou rows, up to the text just before the last 运动员 in the Lionel Messi row
运动员	The last 运动员 in the string
.*	The rest of the Lionel Messi row
<\/tr>	The last </tr> in the string
/us

Taken together, the two .* match far more than expected, for two separate reasons. The first reason is that quantifiers are greedy by default: each .* first consumes every remaining character up to the end of the string and then backtracks only as far as necessary. The first .* therefore gives characters back only until the last 运动员 can match, and the second .* only until the last </tr> can match. This can be addressed by using non-greedy matching. The second reason is that a regular expression is matched from left to right: the <tr> in the expression matches the first <tr> in the string, so the first .* always starts from the header row, whichever kind of quantifier is used.

When non-greedy matching is used, the second .* matches exactly what we want, stopping at the nearest </tr>. That leaves the first .*: how do we stop it from matching more than expected?
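
The greedy and non-greedy results can be compared directly. This is a minimal sketch in which the table markup is an assumed reconstruction and the Career cells use 运动员 ("athlete") as in the original Chinese table; the file is assumed to be saved as UTF-8:

```php
<?php
// Assumed markup; 篮球运动员 = basketball player, 歌手 = singer, 足球运动员 = football player.
$html = <<<HTML
<table>
  <thead>
    <tr>
      <th>User name</th>
      <th>Career</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td>
      <td>篮球运动员</td>
    </tr>
    <tr>
      <td>Jay Chou</td>
      <td>歌手</td>
    </tr>
    <tr>
      <td>Lionel Messi</td>
      <td>足球运动员</td>
    </tr>
  </tbody>
</table>
HTML;

preg_match('/<tr>.*运动员.*<\/tr>/us', $html, $greedy);
preg_match('/<tr>.*?运动员.*?<\/tr>/us', $html, $lazy);

// Greedy: from the header row to the last </tr>, four rows in total.
$greedyRows = substr_count($greedy[0], '<tr>');
// Non-greedy: ends at the right </tr>, but still starts at the header row.
$lazyRows = substr_count($lazy[0], '<tr>');

var_dump($greedyRows, $lazyRows); // int(4) int(2)
```

Under these assumptions the non-greedy match still contains the header row, which is the remaining problem with the first .*.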

Using only what this series has covered so far, there is already a way to solve it. We can first match every row (<tr>...</tr>) from left to right by calling PHP's preg_match_all function with a non-greedy pattern, and then loop over the rows and keep only those that contain 运动员 ("athlete").
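
The two-step approach described above can be sketched as follows. The table markup is an assumed reconstruction with 运动员 ("athlete") in the Career cells, and the variable names are illustrative only:

```php
<?php
// Assumed markup; save the file as UTF-8.
$html = <<<HTML
<table>
  <thead>
    <tr>
      <th>User name</th>
      <th>Career</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td>
      <td>篮球运动员</td>
    </tr>
    <tr>
      <td>Jay Chou</td>
      <td>歌手</td>
    </tr>
    <tr>
      <td>Lionel Messi</td>
      <td>足球运动员</td>
    </tr>
  </tbody>
</table>
HTML;

// Step 1: collect every row with a non-greedy pattern.
preg_match_all('/<tr>.*?<\/tr>/us', $html, $matches);
$allRows = $matches[0];

// Step 2: keep only the rows that contain the keyword.
$athleteRows = array_values(array_filter($allRows, function (string $row): bool {
    return strpos($row, '运动员') !== false;
}));

echo count($allRows), ' rows, ', count($athleteRows), " athlete rows\n"; // 4 rows, 2 athlete rows
```

This keeps the regular expression simple at the cost of a second pass over the rows in PHP.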

Of course, we can also solve this problem purely with regular expressions. Readers with some regular expression experience may immediately think of the negated character class. We have already introduced the character class [...], which lists the characters that may appear at a given position. A negated character class lists the characters that may not appear at that position. Its form is [^...]: a ^ placed immediately after the opening bracket [ marks the class as negated. For example, [^\d] matches any character except a digit.
If there were such a thing as a negated subexpression, something like (^<tr>)*, we would only need to make the first .* exclude <tr>. Unfortunately, regular expressions have no negated subexpressions or negated groups. In this case we have to use lookaround:

/<tr>(.(?!<tr>))*运动员.*?<\/tr>/us

Lookaround does not consume any characters; it "stops in place and looks around". The expression above uses negative lookahead, whose form is (?!...). Taking (.(?!<tr>))* as an example: each time . matches a character, the lookahead looks to the right, and the match succeeds only if <tr> does not appear immediately to the right of the character just matched. The second quantifier is written as the non-greedy .*? so that the match ends at the nearest </tr>.

The complete set of lookarounds is:

Name	Notation	Meaning
Positive lookahead	(?=...)	Looks right; matches if the content of the lookaround appears on the right
Negative lookahead	(?!...)	Looks right; matches if the content of the lookaround does not appear on the right
Positive lookbehind	(?<=...)	Looks left; matches if the content of the lookaround appears on the left
Negative lookbehind	(?<!...)	Looks left; matches if the content of the lookaround does not appear on the left
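
The four kinds of lookaround can be tried on small strings. The subjects below ('foobar foobaz' and 'foobar xxbaz') are made up for illustration; the offsets show that the lookaround itself is not part of the matched text:

```php
<?php
// Positive lookahead: "foo" only where "bar" follows; "bar" is not consumed.
preg_match('/foo(?=bar)/', 'foobar foobaz', $m, PREG_OFFSET_CAPTURE);
$posAhead = $m[0];   // ['foo', 0]

// Negative lookahead: "foo" only where "bar" does not follow.
preg_match('/foo(?!bar)/', 'foobar foobaz', $m, PREG_OFFSET_CAPTURE);
$negAhead = $m[0];   // ['foo', 7]

// Positive lookbehind: "ba" plus one character, only where "foo" precedes.
preg_match('/(?<=foo)ba./', 'foobar xxbaz', $m);
$posBehind = $m[0];  // 'bar'

// Negative lookbehind: the same, only where "foo" does not precede.
preg_match('/(?<!foo)ba./', 'foobar xxbaz', $m);
$negBehind = $m[0];  // 'baz'

var_dump($posAhead, $negAhead, $posBehind, $negBehind);
```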

Because the regular expression above contains a group (subexpression), the match result has an index 1 in addition to index 0. The result at index 1 is of no use here, so we can use the non-capturing group described earlier:

/<tr>(?:.(?!<tr>))*运动员.*?<\/tr>/us

Our real goal is to extract all the rows that contain 运动员 ("athlete"), but preg_match extracts only the first one, so we need to change the preg_match function to preg_match_all.
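
Putting the pieces together, a minimal sketch of the final extraction might look like this. The table markup is an assumed reconstruction with 运动员 ("athlete") in the Career cells, and the pattern uses a non-greedy .*? before the closing tag so that each match ends at the nearest </tr>:

```php
<?php
// Assumed markup; save the file as UTF-8.
$html = <<<HTML
<table>
  <thead>
    <tr>
      <th>User name</th>
      <th>Career</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td>
      <td>篮球运动员</td>
    </tr>
    <tr>
      <td>Jay Chou</td>
      <td>歌手</td>
    </tr>
    <tr>
      <td>Lionel Messi</td>
      <td>足球运动员</td>
    </tr>
  </tbody>
</table>
HTML;

// Non-capturing group: the result holds only index 0 (the full matches).
preg_match_all('/<tr>(?:.(?!<tr>))*运动员.*?<\/tr>/us', $html, $matches);
$rows = $matches[0];

// Capturing group, for comparison: the result also holds an unused index 1.
preg_match_all('/<tr>(.(?!<tr>))*运动员.*?<\/tr>/us', $html, $withGroup);

echo count($rows), " rows matched\n";                  // 2 rows matched
echo count($matches), ' vs ', count($withGroup), "\n"; // 1 vs 2
```

Under these assumptions, the two matches are the Kobe Bryant row and the Lionel Messi row; the header row and the Jay Chou row are skipped because the lookahead prevents a match from crossing into the next <tr>.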



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
