PHP Regular Expressions in Practice (II): Extracting HTML Elements

Source: Internet
Author: User
This article uses the task of extracting HTML elements to introduce pattern modifiers, greedy and non-greedy matching, Unicode mode, and lookaround in regular expressions.
Before reading it, it is best to read the previous article in this series, "PHP Regular Expressions in Practice (I): Validating Mobile Phone Numbers", carefully.

Basic extraction

Suppose we have the following table:

User name | Career
Kobe Bryant | Basketball player
Jay Chou | Singer, songwriter, producer, actor, director
Lionel Messi | Football player

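Its source code is as follows:

<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>Basketball player</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>Singer, songwriter, producer, actor, director</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>Football player</td>
  </tr>
</table>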

Suppose you now want to extract the first <tr> element that holds a data row. The simplest regular expression might look like this:

<tr>\s+<td>.*<\/tr>

where:

    • \s is one of the shorthand character classes introduced in "PHP Regular Expressions in Practice (I): Validating Mobile Phone Numbers". It matches carriage returns, spaces, tabs, and other whitespace characters.

    • The quantifier + means that the character or character class it modifies must occur one or more times.

    • The dot . is a special metacharacter in regular expressions that matches "any character".

    • The slash / in the closing tag is also the pattern delimiter of a regular expression in PHP, so it must be escaped to represent a literal slash.

In practice, however, this expression cannot extract the first <tr> element from the source above.

The main problem is that, by default, the dot . cannot match the line break \n. There are two ways to solve this:

    • Use the pattern modifier s. The regular expression becomes /<tr>\s+<td>.*<\/tr>/s or (?s)<tr>\s+<td>.*<\/tr>. The pattern modifier s lets the dot match line breaks as well.

    • Use [\s\S], [\w\W], or [\d\D] instead of the dot to match every character. The regular expression becomes <tr>\s+<td>[\s\S]*<\/tr>
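As a minimal sketch of the two fixes above, the following PHP snippet runs each variant against a small table. The markup in $html is an assumption made for illustration (header cells in <th>, data cells in <td>), not necessarily the article's exact source.

```php
<?php
// Hypothetical sample markup, used only for this illustration.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>Basketball player</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>Singer</td>
  </tr>
</table>
HTML;

// Without the s modifier, "." stops at a line break, so no </tr> can be reached.
$plain = preg_match('/<tr>\s+<td>.*<\/tr>/', $html);

// With the s modifier, "." also matches line breaks.
$dotAll = preg_match('/<tr>\s+<td>.*<\/tr>/s', $html);

// The in-pattern form (?s) and the [\s\S] class give the same outcome here.
$inline = preg_match('/(?s)<tr>\s+<td>.*<\/tr>/', $html);
$class  = preg_match('/<tr>\s+<td>[\s\S]*<\/tr>/', $html);

var_dump($plain, $dotAll, $inline, $class);
```

Only the first call reports no match; the other three each find one.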

Pattern modifiers deserve a closer look (the PHP manual lists all the pattern modifiers that PHP supports). Pattern modifiers change some of the default rules of a regular expression. Commonly used ones include i, s, U, and u. We will use some of them later, so their roles are not expanded on here and are introduced where they are used. The main point here is to compare the two notations, /.../{modifier} and ...(?{modifier})...

Item | /.../{modifier} | ...(?{modifier})...
Example | /<tr>\s+<td>.*<\/tr>/s | (?s)<tr>\s+<td>.*<\/tr>
Name (PHP manual) | Pattern modifiers | In-pattern modifiers
Name ("Regular Guide") | Pre-defined constants | Pattern modifiers
Scope | The entire regular expression | Outside a group (subexpression), it applies to the whole of the regular expression that follows it; inside a group, it applies to the rest of that group. Placed at the very front of the expression and outside any group, it is equivalent to /.../{modifier}.
Level of support | Supports all pattern modifiers | Supports only some pattern modifiers
Other programming languages | May not be supported | Generally supported
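The scope rule for in-pattern modifiers can be checked with a short experiment. This sketch uses the i (case-insensitive) modifier rather than s only because its effect is easier to see; the patterns and subject strings are made up for illustration.

```php
<?php
// (?i) outside any group: everything after it is case-insensitive.
$tail = preg_match('/a(?i)bc/', 'aBC');

// (?i) inside a group: only the rest of that group is affected,
// so the "c" after the group is still case-sensitive.
$groupMiss = preg_match('/a((?i)b)c/', 'aBC');
$groupHit  = preg_match('/a((?i)b)c/', 'aBc');

// Placed first and outside any group, it behaves like the trailing modifier.
$leading  = preg_match('/(?i)abc/', 'ABC');
$trailing = preg_match('/abc/i', 'ABC');

var_dump($tail, $groupMiss, $groupHit, $leading === $trailing);
```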

As the demonstration shows, the extracted result contains three <tr> elements, not just one. This is because quantifiers in regular expressions are greedy by default. Here, .* first matches every character up to the end of the string and then backtracks until it reaches the last position where <\/tr> can match, which completes the match. The final result therefore contains three <tr> elements.

You can use the pattern modifier U to make the whole regular expression non-greedy, or use a non-greedy quantifier to make a single quantifier non-greedy:

    • Make the whole regular expression non-greedy:

      • /<tr>\s+<td>.*<\/tr>/Us

      • or (?Us)<tr>\s+<td>.*<\/tr>

    • Use a non-greedy quantifier:
      /<tr>\s+<td>.*?<\/tr>/s
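The difference between the greedy and non-greedy forms can be sketched by counting how many closing </tr> tags each match contains. The sample markup is again an assumption for illustration.

```php
<?php
// Hypothetical sample markup: one <th> header row and three <td> data rows.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>Basketball player</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>Singer</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>Football player</td>
  </tr>
</table>
HTML;

preg_match('/<tr>\s+<td>.*<\/tr>/s', $html, $greedy);    // default: greedy
preg_match('/<tr>\s+<td>.*<\/tr>/Us', $html, $ungreedy); // U flips every quantifier
preg_match('/<tr>\s+<td>.*?<\/tr>/s', $html, $lazy);     // only .* is made non-greedy

echo substr_count($greedy[0], '</tr>'), "\n";   // rows swallowed by the greedy match
echo substr_count($ungreedy[0], '</tr>'), "\n"; // rows in the U-modifier match
echo substr_count($lazy[0], '</tr>'), "\n";     // rows in the .*? match
```

With this sample, the greedy match runs from the first data row to the last </tr>, while both non-greedy variants stop at the end of the first data row.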

The complete list of greedy quantifiers (also called match-first quantifiers) and non-greedy quantifiers (also called ignore-first quantifiers) is shown in the following table:

Greedy quantifier | Non-greedy quantifier | Number of occurrences
* | *? | May or may not appear; no limit on the number of occurrences
+ | +? | At least once, no upper limit
? | ?? | 0 or 1 time
{m,n} | {m,n}? | At least m times and at most n times
{m,} | {m,}? | At least m times, no upper limit
{0,n} | {0,n}? | 0 to n times
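A quick way to get a feel for the table is to run each pair against the same subject. The subject string 'aaaa' below is made up for illustration; each greedy form takes as many characters as it is allowed, and each non-greedy form takes as few as it is allowed.

```php
<?php
$subject = 'aaaa';

preg_match('/a*/', $subject, $starGreedy);      // takes all four characters
preg_match('/a*?/', $subject, $starLazy);       // takes none (empty match)
preg_match('/a+/', $subject, $plusGreedy);      // takes all four characters
preg_match('/a+?/', $subject, $plusLazy);       // takes exactly one
preg_match('/a?/', $subject, $optGreedy);       // takes one
preg_match('/a??/', $subject, $optLazy);        // takes none (empty match)
preg_match('/a{2,3}/', $subject, $rangeGreedy); // takes three
preg_match('/a{2,3}?/', $subject, $rangeLazy);  // takes two

var_dump($starGreedy[0], $starLazy[0], $plusGreedy[0], $plusLazy[0]);
var_dump($optGreedy[0], $optLazy[0], $rangeGreedy[0], $rangeLazy[0]);
```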

Extracting rows that contain specified content

Suppose we want to extract the rows for athletes from the table. We might try a regular expression such as /<tr>.*运动员.*<\/tr>/s, where 运动员 is the Chinese word for "athlete" (it appears as "player" in the translated table above).

This expression can produce a match in a Unicode encoding environment, but not in a GBK environment. We can specify Unicode mode with the pattern modifier u:

/<tr>.*运动员.*<\/tr>/us

In Unicode mode, we can even use code point values instead of the Chinese characters:

/<tr>.*\x{8fd0}\x{52a8}\x{5458}.*<\/tr>/us

PHP uses the form \x{hex} to represent the code point of a Unicode character. The advantage of code points is that they can be combined with a character class to express a range, such as [\x{4e00}-\x{9fff}], which matches the Chinese characters in the basic CJK Unified Ideographs block.
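The following sketch shows the u modifier and the \x{hex} notation on a short string. It assumes the PHP source file itself is saved as UTF-8; the subject strings are made up for illustration.

```php
<?php
// The source file is assumed to be saved as UTF-8.
$text = '篮球运动员';

// Under u, literal characters and \x{hex} code points are interchangeable.
$literal = preg_match('/运动员/u', $text);
$escaped = preg_match('/\x{8fd0}\x{52a8}\x{5458}/u', $text);

// Under u, "." consumes one whole character; without u, a single byte.
preg_match('/^./u', $text, $char);
preg_match('/^./', $text, $byte);

// A code point range inside a character class.
$hanCount = preg_match_all('/[\x{4e00}-\x{9fff}]/u', 'Kobe 篮球 24', $han);

var_dump($literal, $escaped, strlen($char[0]), strlen($byte[0]), $hanCount);
```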

The expression above does produce a match, but not the right one: it matches the whole string from the first <tr> to the last </tr>.
Intuitively, we would like the regular expression to find "运动员" first and then take the nearest <tr> to its left and the nearest </tr> to its right. In fact, a regular expression is matched from left to right, starting at the beginning of the string, and the parts of the expression match as shown in the following table (whitespace characters are not shown).

Expression | Matched value
/ |
<tr> | The first <tr> in the string
.* | Everything from there up to the last "运动员": the header row, the Kobe Bryant and Jay Chou rows, and the start of the Lionel Messi row
运动员 | The last "运动员", in the Lionel Messi row
.* | The rest of the Lionel Messi row before its closing tag
<\/tr> | The last </tr>
/us |

The match covers more than expected for two reasons. The first is that quantifiers are greedy by default: each .* consumes every remaining character up to the end of the string and then backtracks only as far as it must, so the first .* stops at the last "运动员" and the second at the last </tr>. Specifying non-greedy matching resolves this. The second reason is that the regular expression is matched from left to right: <tr> matches the first <tr> in the string, and the .* that follows it covers everything between that <tr> and "运动员", however many rows lie in between. Non-greedy matching alone does not change this.

Let's look at the result of using non-greedy matching:

As you can see, the second .* now matches exactly what we want. So how do we solve the problem of the first .* matching more than expected?
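To see what each part matches, one option is to wrap the two .* in capturing groups. The sketch below compares the greedy and non-greedy forms; the capturing groups and the sample markup (with the career cells in Chinese so that 运动员 can match) are assumptions added for illustration.

```php
<?php
// Hypothetical sample markup; the file is assumed to be saved as UTF-8.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>足球运动员</td>
  </tr>
</table>
HTML;

// Greedy: the match runs from the first <tr> to the last </tr>.
preg_match('/<tr>(.*)运动员(.*)<\/tr>/us', $html, $greedy);

// Non-greedy: the right end is now correct, but the match still
// starts at the first <tr>, which belongs to the header row.
preg_match('/<tr>(.*?)运动员(.*?)<\/tr>/us', $html, $lazy);

echo substr_count($greedy[0], '<tr>'), "\n"; // rows covered by the greedy match
echo substr_count($lazy[0], '<tr>'), "\n";   // rows covered by the non-greedy match
echo trim($lazy[2]), "\n";                   // what the second .*? matched
```

In this sample the non-greedy match still spans two rows, the header row and the Kobe Bryant row, because matching begins at the leftmost <tr>.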

Even with only the knowledge introduced in these articles so far, there is a way to solve it. We can first match all the rows (<tr>...</tr>) from left to right by using PHP's preg_match_all function with a non-greedy pattern, and then loop over the rows and keep only those that contain "运动员".
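The two-step approach just described might be sketched as follows. The sample markup and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample markup; the file is assumed to be saved as UTF-8.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>足球运动员</td>
  </tr>
</table>
HTML;

// Step 1: collect every row with a non-greedy pattern.
preg_match_all('/<tr>.*?<\/tr>/s', $html, $rows);

// Step 2: keep only the rows that contain the keyword.
$athletes = array_values(array_filter(
    $rows[0],
    function (string $row): bool {
        return strpos($row, '运动员') !== false;
    }
));

echo count($rows[0]), ' rows, ', count($athletes), " with the keyword\n";
```

The filtering step is plain string search, so the regular expression stays simple; the cost is a second pass over the rows.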

Of course, we can also solve this problem with a regular expression alone. Readers with some regular expression experience may think of the negated character class. We have already introduced the character class [...], which lists the characters that may appear at a given position. A negated character class lists the characters that must not appear at that position. It is written [^...]: a ^ immediately after the opening bracket [ marks the class as negated. For example, [^\d] matches any character except a digit.
If there were a negated subexpression, something like (^<tr>)*, we would only need to make the first .* exclude <tr>. Unfortunately, regular expressions have no negated subexpressions or negated groups. In this case, we can only use lookaround:

/<tr>(.(?!<tr>))*运动员.*?<\/tr>/us

Here the second quantifier is written as the non-greedy .*? so that the match ends at the nearest </tr>.

Lookaround does not match any characters; it is used to "stop in place and look around". The expression above uses a negative lookahead, written (?!...). Taking (.(?!<tr>))* specifically: each time . matches a character, the engine looks to the right, and the match succeeds only if <tr> does not appear immediately to the right of the character just matched.

The complete set of lookarounds is:

Name | Notation | Meaning
Positive lookahead | (?=...) | Looks right; succeeds only if the lookaround content appears on the right
Negative lookahead | (?!...) | Looks right; succeeds only if the lookaround content does not appear on the right
Positive lookbehind | (?<=...) | Looks left; succeeds only if the lookaround content appears on the left
Negative lookbehind | (?<!...) | Looks left; succeeds only if the lookaround content does not appear on the left
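Each of the four lookarounds can be tried on a short string. The patterns and subject strings below are made up for illustration; note that the text inside a lookaround is never part of the returned match.

```php
<?php
$snippet = '<td>Kobe</td><th>Name</th>';

// Positive lookahead: a word that is followed by </td>.
preg_match_all('/\w+(?=<\/td>)/', $snippet, $ahead);

// Negative lookahead: tags that start with "t" but not with "th".
preg_match_all('/<t(?!h)[a-z]+>/', '<tr><th><td>', $notAhead);

// Positive lookbehind: a word that is preceded by <td>.
preg_match_all('/(?<=<td>)\w+/', $snippet, $behind);

// Negative lookbehind: a capitalized word that is not preceded by <td>.
preg_match_all('/(?<!<td>)\b[A-Z]\w+/', $snippet, $notBehind);

print_r([$ahead[0], $notAhead[0], $behind[0], $notBehind[0]]);
```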

Because the regular expression above contains a group (subexpression), the match result has an element at index 1 in addition to the one at index 0. The element at index 1 is of no use to us, so we can use the non-capturing group described previously:

/<tr>(?:.(?!<tr>))*运动员.*?<\/tr>/us
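The effect of the non-capturing group on the result array can be sketched as follows. The sample markup is an assumption for illustration, and in both patterns the second quantifier is written as the non-greedy .*? so that the match ends at the nearest </tr>.

```php
<?php
// Hypothetical sample markup; the file is assumed to be saved as UTF-8.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>篮球运动员</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>足球运动员</td>
  </tr>
</table>
HTML;

// Capturing group: the result has index 0 (whole match) and index 1.
preg_match('/<tr>(.(?!<tr>))*运动员.*?<\/tr>/us', $html, $withGroup);

// Non-capturing group: the result has index 0 only.
preg_match('/<tr>(?:.(?!<tr>))*运动员.*?<\/tr>/us', $html, $noGroup);

echo count($withGroup), ' ', count($noGroup), "\n";
echo $noGroup[0], "\n"; // a single row, no longer starting at the header row
```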

Our real goal is to extract all the rows that contain "运动员", but so far only the first one has been extracted, so we need to change the preg_match function to preg_match_all.
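Putting the pieces together, a minimal sketch of the final step might look like this. The sample markup is an assumption for illustration, and the second quantifier is written as the non-greedy .*? so that each match ends at its own </tr>.

```php
<?php
// Hypothetical sample markup; the file is assumed to be saved as UTF-8.
$html = <<<'HTML'
<table>
  <tr>
    <th>User name</th><th>Career</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td><td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td><td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td><td>足球运动员</td>
  </tr>
</table>
HTML;

// preg_match_all keeps searching after each match instead of stopping at the first.
$found = preg_match_all('/<tr>(?:.(?!<tr>))*运动员.*?<\/tr>/us', $html, $matches);

echo $found, "\n";
foreach ($matches[0] as $row) {
    echo $row, "\n";
}
```

With this sample, the rows for Kobe Bryant and Lionel Messi are returned and the header and Jay Chou rows are skipped.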
