PHP Regular Expressions (2): Extracting HTML Elements

Source: Internet
Author: User
Tags php regular expression
This article uses the task of extracting HTML elements to introduce several regular expression topics: pattern modifiers, greedy matching, non-greedy matching, Unicode mode, and look-around.
Before reading this article, it is best to read the first article in this series carefully: "PHP Regular Expressions (1): Verifying Mobile Phone Numbers".

Basic extraction

Consider the following table:

| User name | Occupation |
|---|---|
| Kobe Bryant | Basketball players |
| Jay Chou | Singers, Song creators, producers, actors, directors |
| Lionel Messi | Football players |

Its source code is as follows:

 
 
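<table>
  <tr>
    <th>User name</th>
    <th>Occupation</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
    <td>Basketball players</td>
  </tr>
  <tr>
    <td>Jay Chou</td>
    <td>Singers, Song creators, producers, actors, directors</td>
  </tr>
  <tr>
    <td>Lionel Messi</td>
    <td>Football players</td>
  </tr>
</table>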

To extract the first <tr> element that holds data cells, the simplest regular expression looks like this:

/<tr>\s+<td>.*<\/tr>/

Where:

  • \s is one of the character group shorthands introduced in "PHP Regular Expressions (1): Verifying Mobile Phone Numbers". It matches whitespace characters such as carriage returns, spaces, and tabs.

  • The quantifier + means that the character or character group it modifies appears one or more times.

  • The dot . is a special metacharacter in regular expressions that matches "any character".

  • The slash in the closing tag </tr> must be escaped, because / is the pattern delimiter in PHP regular expressions and a literal slash therefore has to be written as \/.
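The escaping rule in the last bullet can be checked with a minimal sketch. The one-line `$row` string is a made-up sample, not the article's table. PHP also accepts other delimiter characters, such as `#`, in which case the slash no longer needs an escape.

```php
<?php
// Hypothetical one-line sample; any string containing "</tr>" would do.
$row = '<tr><td>Kobe Bryant</td></tr>';

// With "/" as the delimiter, the slash in "</tr>" has to be escaped.
$withSlash = preg_match('/<\/tr>/', $row);

// With "#" as the delimiter, the same slash can be written as-is.
$withHash = preg_match('#</tr>#', $row);

var_dump($withSlash, $withHash); // both are int(1)
```

Choosing a delimiter that does not occur in the pattern is a common way to keep HTML-related expressions readable.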

In fact, however, this expression cannot extract that <tr> element.

The main problem is that, by default, the dot cannot match the line break \n. There are two ways to solve this:

  • Use the pattern modifier s. The regular expression becomes /<tr>\s+<td>.*<\/tr>/s or /(?s)<tr>\s+<td>.*<\/tr>/. The pattern modifier s makes the dot match line breaks as well.

  • Use [\s\S], [\w\W], or [\d\D] in place of the dot to match any character. The regular expression becomes /<tr>\s+<td>[\s\S]*<\/tr>/.
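The following sketch compares the plain pattern with the two fixes. The markup in `$html` is an assumption (header cells in `<th>`, data cells in `<td>`, one tag per line), as is the exact form of the pattern with `<tr>` and `<td>`.

```php
<?php
// Hypothetical markup: one tag per line, so every row spans several lines.
$html = <<<HTML
<table>
  <tr>
    <th>User name</th>
    <th>Occupation</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
    <td>Basketball player</td>
  </tr>
</table>
HTML;

// No modifier: the dot stops at the first "\n", so </tr> is never reached.
$noModifier = preg_match('/<tr>\s+<td>.*<\/tr>/', $html);

// Fix 1: the s modifier, trailing or inline, lets the dot match "\n".
$withS   = preg_match('/<tr>\s+<td>.*<\/tr>/s', $html);
$inlineS = preg_match('/(?s)<tr>\s+<td>.*<\/tr>/', $html);

// Fix 2: replace the dot with a character group that matches everything.
$charGroup = preg_match('/<tr>\s+<td>[\s\S]*<\/tr>/', $html);

var_dump($noModifier, $withS, $inlineS, $charGroup); // int(0), then three times int(1)
```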

Pattern modifiers deserve a closer look here (the PHP manual lists every pattern modifier that PHP supports). Pattern modifiers change some of the default rules of regular expressions. Common pattern modifiers include i, s, U, and u. Some of them are used later in this article, so this section does not describe each one; they are introduced in detail where they appear. Here we mainly compare the two ways of writing a modifier: /.../{modifier} and ...(?{modifier})....

| | /.../{modifier} | ...(?{modifier})... |
|---|---|---|
| Example | /<tr>.*<\/tr>/s | /(?s)<tr>.*<\/tr>/ |
| Name (PHP manual) | Pattern modifier | Inline (in-pattern) modifier |
| Name (regular expression guide) | Predefined constants | Pattern modifier |
| Scope | The entire regular expression | Outside a group (subexpression), it applies to everything after it in the regular expression; inside a group, it applies to the rest of that group. When it is outside any group and placed at the very beginning, it is equivalent to /.../{modifier}. |
| Degree of support | All pattern modifiers are supported | Only some pattern modifiers are supported |
| Other programming languages | Not always supported | Generally supported |
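The scope rules in the table can be illustrated with a small sketch. It uses the i (case-insensitive) modifier and made-up subject strings because the effect is easy to see; the same scoping applies to s and U.

```php
<?php
// Trailing modifier: case-insensitivity applies to the whole pattern.
$whole = preg_match('/ab/i', 'AB');            // 1

// Inline modifier at the very start: equivalent to the trailing form.
$leading = preg_match('/(?i)ab/', 'AB');       // 1

// Inline modifier outside any group: applies only to what follows it.
$tailHit  = preg_match('/a(?i)b/', 'aB');      // 1, "b" is case-insensitive
$tailMiss = preg_match('/a(?i)b/', 'AB');      // 0, "a" is still case-sensitive

// Inline modifier inside a group: applies to the rest of that group only.
$groupHit  = preg_match('/(a(?i)b)c/', 'aBc'); // 1
$groupMiss = preg_match('/(a(?i)b)c/', 'aBC'); // 0, "c" lies outside the group

var_dump($whole, $leading, $tailHit, $tailMiss, $groupHit, $groupMiss);
```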

With either fix applied, the extracted result contains three <tr>...</tr> pairs instead of only one. This is because quantifiers in regular expressions are greedy by default. Here, .* first consumes every remaining character up to the end of the string and then backtracks until the last </tr> in the string matches <\/tr> in the regular expression, which completes the match. The final result therefore contains three <tr> elements.

You can use the pattern modifier U to make the entire regular expression non-greedy, or use a non-greedy quantifier to make one particular quantifier non-greedy:

  • Make the entire regular expression non-greedy:

    • /<tr>\s+<td>.*<\/tr>/Us

    • or /(?Us)<tr>\s+<td>.*<\/tr>/

  • Use a non-greedy quantifier:
    /<tr>\s+<td>.*?<\/tr>/s

The complete list of greedy quantifiers and their non-greedy counterparts is shown in the following table:

| Greedy quantifier | Non-greedy quantifier | Number of repetitions |
|---|---|---|
| * | *? | Zero or more times, no upper limit |
| + | +? | At least once, no upper limit |
| ? | ?? | 0 or 1 time |
| {m,n} | {m,n}? | At least m times and at most n times |
| {m,} | {m,}? | At least m times, no upper limit |
| {0,n} | {0,n}? | 0 to n times |
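The difference between the greedy and non-greedy forms can be observed with a short sketch. The markup in `$html` is an assumed stand-in for the article's table (header cells in `<th>`, data cells in `<td>`), and counting `<tr>` with `substr_count` is just a convenient way to see how far each match extends.

```php
<?php
// Hypothetical markup: one header row and three data rows.
$html = <<<HTML
<table>
  <tr>
    <th>User name</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
  </tr>
  <tr>
    <td>Jay Chou</td>
  </tr>
  <tr>
    <td>Lionel Messi</td>
  </tr>
</table>
HTML;

// Greedy: .* runs to the end and backtracks to the LAST </tr>.
preg_match('/<tr>\s+<td>.*<\/tr>/s', $html, $greedy);

// Non-greedy quantifier: .*? stops at the FIRST </tr>.
preg_match('/<tr>\s+<td>.*?<\/tr>/s', $html, $lazy);

// U modifier: every quantifier in the pattern becomes non-greedy.
preg_match('/<tr>\s+<td>.*<\/tr>/Us', $html, $ungreedy);

echo substr_count($greedy[0], '<tr>'), "\n";   // 3
echo substr_count($lazy[0], '<tr>'), "\n";     // 1
var_dump($lazy[0] === $ungreedy[0]);           // bool(true)
```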

Extracting the rows that contain specified content

Suppose we want to extract the rows for athletes, that is, the rows whose occupation contains 运动员 ("athlete") in the original Chinese table. We might try a regular expression such as /<tr>.*运动员.*<\/tr>/s.

This expression finds a match when the pattern and the subject string are Unicode (UTF-8) encoded, but not necessarily in a GBK environment. We can specify Unicode mode with the pattern modifier u:

/<tr>.*运动员.*<\/tr>/us

In Unicode mode, we can even use code point values in place of the Chinese characters:

/<tr>.*\x{8fd0}\x{52a8}\x{5458}.*<\/tr>/us

In PHP regular expressions, \x{hex} represents a Unicode code point. The advantage of using code point values is that they can be combined with character groups to express a range. For example, [\x{4e00}-\x{9fff}] matches any character in the CJK Unified Ideographs block, which covers the commonly used Chinese characters.
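A minimal sketch of Unicode mode follows. It assumes the PHP source file is saved as UTF-8, so that the string literals are UTF-8 too; the cell text 篮球运动员 ("basketball athlete") is a made-up sample.

```php
<?php
// Assumption: this file is saved as UTF-8.
$cell = '<td>篮球运动员</td>';

// Literal characters and code point escapes describe the same text in u mode.
$literal = preg_match('/运动员/u', $cell);
$byCode  = preg_match('/\x{8fd0}\x{52a8}\x{5458}/u', $cell);

// A code point range inside a character group: one match per CJK ideograph.
$count = preg_match_all('/[\x{4e00}-\x{9fff}]/u', $cell, $chars);

var_dump($literal, $byCode);   // int(1), int(1)
echo $count, "\n";             // 5
```

The patterns are written in single quotes so that PHP passes `\x{...}` through to the regular expression engine unchanged.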

The expression above finds a match, but the result is wrong: it matches from the first <tr> of the entire string to the last </tr>.
Intuitively, we would like the regular expression to find 运动员 first, then look for the nearest <tr> to its left and the nearest </tr> to its right. In fact, regular expressions match from left to right, so the search starts from <tr>. The matching of the entire regular expression is shown in the following table (whitespace characters are not shown).

| Expression | Matched value |
|---|---|
| / | |
| <tr> | The first <tr> in the string |
| .* | Everything from the header row up to the text just before the last 运动员: the header row, Kobe Bryant's row, Jay Chou's row, and the start of Lionel Messi's row |
| 运动员 | 运动员 (the last occurrence) |
| .* | The rest of the last row, up to the last </tr> |
| <\/tr> | The last </tr> |
| /us | |

The match covers far more than expected. Because quantifiers are greedy by default, each .* first consumes every remaining character up to the end of the string and then backtracks: the first .* gives characters back only until 运动员 can match, and the second only until the last </tr> can match. Specifying non-greedy matching fixes the second .*. The first .* is a different matter. Regular expressions match from left to right, so <tr> matches the first <tr> in the string, and the .* that follows has to span everything between that <tr> and 运动员, however many rows lie in between.

Consider first what happens when non-greedy matching is used.

The second .* now matches exactly what we want: it stops at the first </tr> after 运动员. So how can we solve the problem that the first .* matches more characters than expected?
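The behavior described above can be reproduced with a sketch. The markup in `$html` is an assumption: a header row plus three data rows, with Chinese occupation text chosen so that two rows contain 运动员. The file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical markup; 篮球运动员 and 足球运动员 contain 运动员, 歌手 does not.
$html = <<<HTML
<table>
  <tr>
    <th>User name</th>
    <th>Occupation</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
    <td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td>
    <td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td>
    <td>足球运动员</td>
  </tr>
</table>
HTML;

// Greedy: from the first <tr> all the way to the last </tr>.
preg_match('/<tr>.*运动员.*<\/tr>/us', $html, $greedy);

// Non-greedy (U): the match ends at the right place but still starts too early.
preg_match('/<tr>.*运动员.*<\/tr>/Uus', $html, $lazy);

echo substr_count($greedy[0], '<tr>'), "\n";          // 4
echo substr_count($lazy[0], '<tr>'), "\n";            // 2
var_dump(strpos($lazy[0], 'User name') !== false);    // bool(true): header row included
```

Even in non-greedy mode the match begins at the header row, because the engine commits to the first `<tr>` it finds and only then looks for 运动员.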

This problem can be solved using only the knowledge covered so far. We can first match every row (<tr>...</tr>) from left to right, using PHP's preg_match_all function together with non-greedy matching, and then iterate over the rows and keep only those that contain 运动员 ("athlete").
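The two-step approach can be sketched as follows. The markup in `$html` and the variable names are assumptions for illustration, and `strpos` is just one simple way to do the filtering step.

```php
<?php
// Hypothetical markup (file assumed to be saved as UTF-8).
$html = <<<HTML
<table>
  <tr><th>User name</th><th>Occupation</th></tr>
  <tr><td>Kobe Bryant</td><td>篮球运动员</td></tr>
  <tr><td>Jay Chou</td><td>歌手</td></tr>
  <tr><td>Lionel Messi</td><td>足球运动员</td></tr>
</table>
HTML;

// Step 1: collect every row with a non-greedy quantifier.
preg_match_all('/<tr>.*?<\/tr>/s', $html, $rows);

// Step 2: keep only the rows that contain the keyword.
$athleteRows = array_values(array_filter($rows[0], function ($row) {
    return strpos($row, '运动员') !== false;
}));

echo count($rows[0]), "\n";       // 4
echo count($athleteRows), "\n";   // 2
```

Splitting the work this way keeps each regular expression simple, at the cost of a second pass over the rows.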

Of course, the problem can also be solved with a pure regular expression. If you have experience with regular expressions, you may readily think of the excluded character group. We have already introduced the character group [...], which lists the characters that may appear at a given position. An excluded character group lists the characters that must not appear at that position. It is written [^...], with a ^ immediately after the opening square bracket. For example, [^\d] matches any character other than a digit.
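As a brief illustration of the excluded character group, the sketch below strips everything that is not a digit from a made-up phone-number string.

```php
<?php
// Hypothetical input with separators between the digits.
$raw = '138-0013 8000';

// [^\d] matches any single character that is NOT a digit.
$digits = preg_replace('/[^\d]/', '', $raw);

// An anchored excluded group: true only if the string has no digits at all.
$noDigitsA = preg_match('/^[^\d]+$/', 'abc');
$noDigitsB = preg_match('/^[^\d]+$/', 'ab1');

echo $digits, "\n";               // 13800138000
var_dump($noDigitsA, $noDigitsB); // int(1), int(0)
```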
If there were an excluded subexpression, something like (^<tr>)*, we could simply substitute it for the first .* and be done. Unfortunately, regular expressions have no excluded subexpression or excluded group. In this case, we can only use look-around:

/<tr>(.(?!<tr>))*运动员.*<\/tr>/Us

Look-around does not consume any characters; it "stands in place and looks around". The expression above uses negative lookahead, written (?!...). To analyze (.(?!<tr>))*: each time . matches a character, the engine looks to the right, and the match at that position succeeds only if <tr> does not appear immediately to the right of the character just matched.
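A sketch of the look-around version with preg_match follows. The markup in `$html` is an assumed stand-in for the article's table, with Chinese occupation text, and the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical markup; two of the three data rows contain 运动员.
$html = <<<HTML
<table>
  <tr>
    <th>User name</th>
    <th>Occupation</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
    <td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td>
    <td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td>
    <td>足球运动员</td>
  </tr>
</table>
HTML;

// (.(?!<tr>))* may not step onto a character that is directly followed by
// <tr>, so a match can never run from one row into the next.
$found = preg_match('/<tr>(.(?!<tr>))*运动员.*<\/tr>/Us', $html, $m);

echo substr_count($m[0], '<tr>'), "\n";             // 1
var_dump(strpos($m[0], 'Kobe Bryant') !== false);   // bool(true)
var_dump(strpos($m[0], 'User name') !== false);     // bool(false)
echo count($m), "\n";                               // 2: index 0 plus the group at index 1
```

Starting from the header row, the repeated group gets stuck just before the next `<tr>` without having seen 运动员, so that attempt fails and the engine moves on to the next `<tr>`.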

The complete set of look-around constructs:

| Name | Notation | Description |
|---|---|---|
| Positive lookahead | (?=...) | Looks to the right; matches only if the content of the look-around appears on the right. |
| Negative lookahead | (?!...) | Looks to the right; matches only if the content of the look-around does not appear on the right. |
| Positive lookbehind | (?<=...) | Looks to the left; matches only if the content of the look-around appears on the left. |
| Negative lookbehind | (?<!...) | Looks to the left; matches only if the content of the look-around does not appear on the left. |
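The four constructs in the table can be tried out with a short sketch. The subject strings are made-up examples chosen so that each construct selects a different position.

```php
<?php
$text = 'foobar foobaz';

// Positive lookahead: "foo" only where "bar" follows.
preg_match('/foo(?=bar)/', $text, $posAhead, PREG_OFFSET_CAPTURE);

// Negative lookahead: "foo" only where "bar" does NOT follow.
preg_match('/foo(?!bar)/', $text, $negAhead, PREG_OFFSET_CAPTURE);

$prices = '$100 and 200';

// Positive lookbehind: digits directly preceded by "$".
preg_match('/(?<=\$)\d+/', $prices, $posBehind);

// Negative lookbehind: a whole number NOT preceded by "$".
// \b keeps the match from starting in the middle of "100".
preg_match('/(?<!\$)\b\d+/', $prices, $negBehind);

echo $posAhead[0][1], "\n";   // 0  (the "foo" in "foobar")
echo $negAhead[0][1], "\n";   // 7  (the "foo" in "foobaz")
echo $posBehind[0], "\n";     // 100
echo $negBehind[0], "\n";     // 200
```

In every case the look-around itself contributes nothing to the matched text: the lookahead matches are just "foo", and the lookbehind matches do not include the "$".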

Because the regular expression above contains a group (subexpression), the match results include index 0 and index 1. The result at index 1 is of no use, so we can use a non-capturing group:

/<tr>(?:.(?!<tr>))*运动员.*<\/tr>/Us

Our real goal is to extract all the rows that contain 运动员 ("athlete"), whereas the above extracts only the first one. We therefore need to replace the preg_match function with preg_match_all.
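Putting the pieces together, a minimal sketch of the final extraction might look like this. The markup in `$html` is an assumption (Chinese occupation text, file saved as UTF-8), not the article's exact source.

```php
<?php
// Hypothetical markup; two of the three data rows contain 运动员.
$html = <<<HTML
<table>
  <tr>
    <th>User name</th>
    <th>Occupation</th>
  </tr>
  <tr>
    <td>Kobe Bryant</td>
    <td>篮球运动员</td>
  </tr>
  <tr>
    <td>Jay Chou</td>
    <td>歌手</td>
  </tr>
  <tr>
    <td>Lionel Messi</td>
    <td>足球运动员</td>
  </tr>
</table>
HTML;

// Non-capturing group plus preg_match_all: every athlete row, nothing else.
$count = preg_match_all('/<tr>(?:.(?!<tr>))*运动员.*<\/tr>/Us', $html, $m);

echo $count, "\n";       // 2
echo count($m), "\n";    // 1: only index 0, since (?:...) captures nothing

foreach ($m[0] as $row) {
    echo $row, "\n";     // the Kobe Bryant row, then the Lionel Messi row
}
```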
