Using HTMLParser to parse HTML in Python, with examples

Last Update:2017-01-19 Source: Internet

Author: User

A few days ago I ran into a problem: I needed to pick out part of the content of a web page, so I looked up two libraries, urllib and HTMLParser. urllib can download the page, and HTMLParser then parses it. This was my first time using HTMLParser, and I ran into some problems while reading the official documentation, so I am writing them down here to share with you.

An example

The code is as follows:

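# Python 3 import; in Python 2 this was: from HTMLParser import HTMLParser
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called automatically by feed() for every start tag
    def handle_starttag(self, tag, attrs):
        print("A start tag:", tag, self.getpos())

parser = MyHTMLParser()
parser.feed('<div><p>"hello"</p></div>')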

In this example, HTMLParser is the base class; we override its handle_starttag method to print some information. parser is an instance of MyHTMLParser, and calling its feed method starts the parsing. Note that handle_starttag runs without being called explicitly.

The way HTMLParser's methods get called puzzled me for a long time. Only after reading many blog posts did it suddenly dawn on me: the methods of HTMLParser fall into two categories, those that must be called explicitly and those that need not be.

Methods that do not need to be called explicitly

The following methods are triggered during parsing, but by default they do nothing, so we override them according to our needs.

1. HTMLParser.handle_starttag(tag, attrs): called when a start tag is encountered, such as <p class='para'>. The tag parameter is the tag name, here 'p'; attrs is a list of (name, value) pairs for all attributes of the tag, here [('class', 'para')].

2. HTMLParser.handle_endtag(tag): called when an end tag is encountered; tag is the tag name.

3. HTMLParser.handle_data(data): called when text between tags is encountered, such as <style> p {color: blue} </style>; the data parameter is the content between the opening and closing tags. Note that for <div><p>...</p></div> it is called only for the text inside p, not for the div, because the div has no text of its own.

Of course, there are other functions, which are not introduced here.
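The three handlers above can be sketched together as follows. This is only a minimal illustration, assuming Python 3 (where the module is html.parser); the class name EventRecorder and its events list are hypothetical names introduced for this example.

```python
from html.parser import HTMLParser  # Python 3 module name


class EventRecorder(HTMLParser):
    """Hypothetical subclass that records each callback as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Only reached for text between tags
        self.events.append(("data", data))


recorder = EventRecorder()
recorder.feed("<div><p class='para'>hello</p></div>")
recorder.close()

for event in recorder.events:
    print(event)
```

None of the handlers is called directly; feed triggers them in document order, and the only data event is the text inside p.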

Methods that are called explicitly

1. HTMLParser.feed(data): the parameter is the HTML string to be parsed; parsing starts when this method is called.

2. HTMLParser.getpos(): returns the current line number and offset, such as (23, 5).

3. HTMLParser.get_starttag_text(): returns the text of the most recently opened start tag.
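A small sketch of how the explicitly called methods fit together, assuming Python 3. The class name PositionParser and the seen list are hypothetical; getpos and get_starttag_text are called from inside a handler here so that they describe the tag currently being processed.

```python
from html.parser import HTMLParser


class PositionParser(HTMLParser):
    """Hypothetical subclass that notes where each start tag was found."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() gives (line, offset); get_starttag_text() gives the raw tag text
        self.seen.append((tag, self.getpos(), self.get_starttag_text()))


html = "<div>\n  <p class='para'>hello</p>\n</div>"
position_parser = PositionParser()
position_parser.feed(html)   # explicit call: starts the parsing
position_parser.close()

for tag, pos, text in position_parser.seen:
    print(tag, pos, text)
```

Line numbers start at 1 and offsets at 0, so the div is reported at (1, 0) and the indented p on the second line at (2, 2).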

That covers everything. One final note: HTMLParser is only a simple module, and its HTML parsing is not complete. For example, it cannot accurately tell an ordinary start tag from a self-closing tag: <br> is reported as a start tag and only <br/> as a self-closing one. Look at the following code:

The code is as follows:

# Python 3 import; in Python 2 this was: from HTMLParser import HTMLParser
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print('begin tag', tag)

    # Called for self-closing tags written as <tag/>
    def handle_startendtag(self, tag, attrs):
        print('begin end tag', tag)

str1 = '<br>'
str2 = '<br/>'
parser = MyHTMLParser()

parser.feed(str1)  # prints "begin tag br"
parser.feed(str2)  # prints "begin end tag br"
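A related point, sketched under the assumption of Python 3: if handle_startendtag is not overridden, its default implementation simply calls handle_starttag followed by handle_endtag. The class name DefaultStartEnd and the calls list are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser


class DefaultStartEnd(HTMLParser):
    """Hypothetical subclass that leaves handle_startendtag at its default."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def handle_starttag(self, tag, attrs):
        self.calls.append(("start", tag))

    def handle_endtag(self, tag):
        self.calls.append(("end", tag))


default_parser = DefaultStartEnd()
default_parser.feed("<br>")    # only a start event
default_parser.feed("<br/>")   # a start event followed by an end event
default_parser.close()

print(default_parser.calls)
```

So <br> produces a start event with no matching end event, while <br/> produces both, which is the same inconsistency seen in the code above.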

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
