From zero-width assertions to matching HTML tag content in Python


Copyright notice: This is an original article by the blog author. If you reprint it, please include the original URL http://www.cnblogs.com/wbchanblog/p/7411750.html. Thank you!

Tip: This article is mainly about zero-width assertions, so reading it requires some grounding in regular expressions.

Concept

We know that the metacharacters "\b", "^" and "$" each match a position, and that the position must satisfy a certain condition (for example, "\b" denotes a word boundary). We call such a condition an assertion, or a zero-width assertion. There are two important points here: first, an assertion is really just a condition; second, it has no character width. It identifies a position and does not consume any characters.
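As a minimal sketch of what "matching a position" means, the snippet below lists where \b, ^ and $ match in a sample string (the sample text and variable names are made up for illustration). Every match starts and ends at the same offset, so no characters are consumed.

```python
import re

text = "hello world"  # illustrative sample string

# \b matches at word boundaries; each match is an empty string at a position
boundaries = [(m.start(), m.end(), m.group()) for m in re.finditer(r"\b", text)]
print(boundaries)  # [(0, 0, ''), (5, 5, ''), (6, 6, ''), (11, 11, '')]

# ^ and $ likewise match positions only
start = re.search(r"^", text)
end = re.search(r"$", text)
print(start.span(), end.span())  # (0, 0) (11, 11)
```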

Zero-width assertions fall into two categories, positive and negative, and each category comes in two types, lookahead and lookbehind:

§ Zero-width positive lookahead assertion, or positive lookahead for short. The syntax is (?=exp). It asserts that what follows this position matches the expression exp.

§ Zero-width positive lookbehind assertion, or positive lookbehind for short. The syntax is (?<=exp). It asserts that what precedes this position matches the expression exp.

§ Zero-width negative lookahead assertion, or negative lookahead for short. The syntax is (?!exp). It asserts that what follows this position does not match the expression exp.

§ Zero-width negative lookbehind assertion, or negative lookbehind for short. The syntax is (?<!exp). It asserts that what precedes this position does not match the expression exp.
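The four forms can be tried side by side in Python. The following is only an illustrative sketch: the sample string and the variable names are assumptions chosen to keep each result easy to check by hand. Note that the text tested by the assertion never appears in the match itself.

```python
import re

text = "a1 b2 a3"  # illustrative sample string

# (?=exp)  positive lookahead: a letter that IS followed by "2"
pos_ahead = re.findall(r"[a-z](?=2)", text)
# (?<=exp) positive lookbehind: a digit that IS preceded by "a"
pos_behind = re.findall(r"(?<=a)\d", text)
# (?!exp)  negative lookahead: a letter that is NOT followed by "2"
neg_ahead = re.findall(r"[a-z](?!2)", text)
# (?<!exp) negative lookbehind: a digit that is NOT preceded by "a"
neg_behind = re.findall(r"(?<!a)\d", text)

print(pos_ahead)   # ['b']
print(pos_behind)  # ['1', '3']
print(neg_ahead)   # ['a', 'a']
print(neg_behind)  # ['2']
```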

All right, by now you may feel lost in the fog. When I first read these formal definitions I was confused too, so here is an example to help explain what an assertion is. Anyone who has written a Python crawler has probably had to extract the content of HTML tags. Given <div>hello world</div>, suppose we want to extract the "hello world" inside the div tag. With assertions, it can be written like this:

Regular expression: (?<=<div>).*(?=</div>)

Match result: hello world

Breaking this expression down, we can see that it uses two assertions: (?<=<div>) and (?=</div>).

Look at the first assertion, (?<=<div>). Its form is the same as the syntax (?<=exp), so it is a positive lookbehind, and exp here is <div>. It asserts that what precedes this position matches the expression <div>. This is hard to grasp mainly because it is unclear what "this position" refers to. In practice, "this position" can be read as the target string, that is, the content we want to extract. The statement then becomes: it asserts that the target string is preceded by something matching <div>. Put more vividly: I assert that the content in front of the target string I want to extract must match <div>. With this condition alone, matching against <div>hello world</div> gives hello world</div>.

Now look at the second assertion, (?=</div>). Its form is the same as the syntax (?=exp), so it is a positive lookahead, and exp here is </div>. It means: I assert that the content behind the target string I want to extract must match </div>. Applying this condition to the hello world</div> obtained in the previous step gives the final match, hello world.
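The two steps described above can be reproduced in Python as a short sketch. Both lookbehinds here are fixed-width, so the re module accepts them; the variable names step1 and step2 are only illustrative.

```python
import re

s = "<div>hello world</div>"

# Lookbehind only: everything after "<div>" matches, closing tag included
step1 = re.search(r"(?<=<div>).*", s).group()
print(step1)  # hello world</div>

# Adding the lookahead: the match must end where "</div>" follows
step2 = re.search(r"(?<=<div>).*(?=</div>)", s).group()
print(step2)  # hello world
```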

Here I would like to recommend a tool called Regex Match Tracer, which can help us learn regular expressions.

How to write regular expressions with assertions

As shown above, assertions are useful when we need to extract a string. Taking the string <div>hello world</div> again, to get the content of the div tag we can build the regular expression step by step:

First, the target string is hello world, so it can be generalized as .*;

Second, the target string is preceded by <div>. Because it comes in front, the meanings of the four assertions point to a positive lookbehind (?<=exp). Placing it in front of the target string gives (?<=<div>).*; generalizing div as [a-zA-Z]+ then gives (?<=<[a-zA-Z]+>).*;

Finally, the target string is followed by </div>. Because it comes behind, the meanings of the four assertions point to a positive lookahead (?=exp). Placing it behind the target string gives (?<=<[a-zA-Z]+>).*(?=</[a-zA-Z]+>);

Going further, [a-zA-Z]+ appears in both assertions, so a group with a backreference avoids the repetition: (?<=<([a-zA-Z]+)>).*(?=</\1>). Named groups can be used as well, but they are not covered here.

To sum up, here are a few mnemonics for writing assertions:

    Something in front: positive lookbehind (?<=exp), placed in front;

    Something behind: positive lookahead (?=exp), placed behind;

    Nothing in front: negative lookbehind (?<!exp), placed in front;

    Nothing behind: negative lookahead (?!exp), placed behind.

Keep in mind that "in front" and "behind" are relative to the target string, which is the string you want to extract.

Using assertions in Python

Everything so far has been about the expressions themselves. Different programming languages have their own extensions to, and restrictions on, regular expressions, and Python is no exception. Take a look at the following code:

    import re

    # Variable-length lookbehind plus a lookahead with a backreference
    pattern = re.compile(r'(?<=<([a-zA-Z]+)>).*(?=</\1>)')
    s = '<html>hello world</html>'
    ret = re.search(pattern, s)
    print(ret.group())
    # Result:
    # Traceback (most recent call last):
    #   raise error("look-behind requires fixed-width pattern")
    # sre_constants.error: look-behind requires fixed-width pattern

The Python interpreter raised an error. What is going on? Don't worry, read on:

    import re

    # Same lookbehind, with the lookahead removed
    pattern = re.compile(r'(?<=<([a-zA-Z]+)>).*')
    s = '<html>hello world</html>'
    ret = re.search(pattern, s)
    print(ret.group())
    # Result:
    # Traceback (most recent call last):
    #   raise error("look-behind requires fixed-width pattern")
    # sre_constants.error: look-behind requires fixed-width pattern

    import re

    # Lookahead only, with the lookbehind removed
    pattern = re.compile(r".*(?=</[a-zA-Z]+>)")
    s = "<html>hello world</html>"
    ret = re.search(pattern, s)
    print(ret.group())
    # Result:
    # <html>hello world

Did you see that? Compare the second and third snippets with the first. The second removes the lookahead and still raises the error. The third removes the lookbehind (with the backreference replaced by [a-zA-Z]+ written out in full) and produces a match. Together with the error message "sre_constants.error: look-behind requires fixed-width pattern", we can conclude that Python's re module does not support variable-length lookbehind assertions; only fixed-length lookbehind is supported.
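The fixed-width restriction can be checked directly. In the sketch below, try_compile is a hypothetical helper written for this illustration (it is not part of the re module); it returns the error message when a pattern fails to compile. The last pattern relies on the documented rule that alternatives inside a lookbehind are allowed when every branch has the same length.

```python
import re

s = "<html>hello world</html>"

def try_compile(pattern):
    """Hypothetical helper: return None on success, else the error message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Variable-length lookbehind: rejected when the pattern is compiled
variable_err = try_compile(r"(?<=<[a-zA-Z]+>).*")
print(variable_err)

# Fixed-length lookbehind: accepted, since "<html>" is always 6 characters
fixed = re.search(r"(?<=<html>).*(?=</html>)", s).group()
print(fixed)  # hello world

# Alternatives are accepted when every branch has the same width
same_width = re.search(r"(?<=<html>|<body>).*(?=</)", s).group()
print(same_width)  # hello world
```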

What now? Does this mean we cannot extract the content of an HTML tag? Don't worry, see the following code:

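    import re

    # Group 1 captures the tag name, group 2 captures the tag content
    pattern = re.compile(r'<([a-zA-Z]+)>(.*)</\1>')
    s = '<html>hello world</html>'
    ret = re.search(pattern, s)
    print('ret.group() →', ret.group())
    print('ret.group(2) →', ret.group(2))
    # Result:
    # ret.group() → <html>hello world</html>
    # ret.group(2) → hello world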

We can use grouping to extract a specific string. The code above wraps .* in a group, which is the second group counting from left to right, so we can get the target string from the match result with group(2).
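The same idea can be written with named groups, which the article mentions but does not expand on. The following is a sketch under two assumptions that are not in the original: the group names tag and content are arbitrary, and .* is swapped for the non-greedy .*? so that several tags in one string are matched separately.

```python
import re

# (?P<name>...) defines a named group; (?P=name) is a backreference to it
pattern = re.compile(r"<(?P<tag>[a-zA-Z]+)>(?P<content>.*?)</(?P=tag)>")

s = "<html>hello world</html>"
ret = pattern.search(s)
print(ret.group("tag"))      # html
print(ret.group("content"))  # hello world

# With several tags in one string, the non-greedy .*? keeps each match short
multi = "<b>one</b> and <i>two</i>"
pairs = [(m.group("tag"), m.group("content")) for m in pattern.finditer(multi)]
print(pairs)  # [('b', 'one'), ('i', 'two')]
```

Named groups make the call site easier to read than group(2), at the cost of a longer pattern.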
