Applications such as search engines, file indexing, document conversion, data retrieval, and site backup or migration frequently need to parse Web pages (that is, HTML files). In fact, the modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. This article describes Python modules that simplify opening HTML documents, whether stored locally or on the Web, and shows how to quickly parse the data in an HTML file to handle specific content such as links, images, and cookies. It also describes how to normalize the formatting of an HTML file's tags.
I. Extracting links from HTML documents
The Python language provides a very useful module, HTMLParser, which enables us to parse HTML documents in a concise and efficient manner based on the tags they contain. As a result, HTMLParser is one of the most commonly used modules when working with HTML documents.
import HTMLParser
import urllib

class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

lparser = parseLinks()
lparser.feed(urllib.urlopen("http://www.python.org/index.html").read())
When working with HTML documents, we often need to extract all the links from them. With the HTMLParser module, this task becomes a breeze. First, we define a new HTMLParser subclass that overrides the handle_starttag() method; we'll use this method to display the href attribute value of every A tag.
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and display the links it contains, use the read() function to pass the data to the parser object's feed() function, which parses the data through the methods defined in the subclass. Note that if the data passed to feed() is incomplete, the incomplete tag is saved and parsed the next time feed() is called. This feature is useful when the HTML file is large and needs to be sent to the parser in segments. Here's a concrete example.
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

# Create an instance of the HTML parser
lparser = parseLinks()

# Open the HTML file
lparser.feed(urllib.urlopen( \
    "http://www.python.org/index.html").read())
lparser.close()
The output of this code is quite long, so run it yourself to see the result.
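The examples above target Python 2, where the module is named HTMLParser. As a rough sketch of the same technique on Python 3 (where the module was renamed html.parser and urllib was reorganized), the link extractor can be run against an in-memory page; the sample HTML, class name, and the choice to collect links in a list rather than print each one are illustrative assumptions:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href value of every <a> tag we encounter
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = ('<html><body><a href="http://www.python.org">Python</a>'
        '<a href="docs.html">Docs</a></body></html>')
parser = LinkParser()
parser.feed(page)
parser.close()
print(parser.links)
```

Collecting the links in a list instead of printing them makes the parser easier to reuse and test; a real crawler would feed it the result of urllib.request.urlopen(url).read().decode() instead of a literal string.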
II. Extracting images from an HTML document
When working with HTML documents, we often need to extract all the images from them. With the HTMLParser module, this task becomes a breeze. First, we define a new HTMLParser subclass that overrides the handle_starttag() method to find IMG tags and save the files referenced by their src attribute values.
import HTMLParser
import urllib

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

u = urllib.urlopen(urlString)
lparser.feed(u.read())
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and save the images it contains, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. The following is a concrete example:
import HTMLParser
import urllib
import sys

urlString = "http://www.python.org"

# Save the image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

# Define the HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

# Create an instance of the HTML parser
lparser = parseImages()

# Open the HTML file
u = urllib.urlopen(urlString)
print "Opening url\n======================"
print u.info()

# Pass the HTML file to the parser
lparser.feed(u.read())
lparser.close()
The result of the above code is as follows:
Opening url
======================
Date: Fri, June 10:54:49 GMT
Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
Last-Modified: Thu, June 09:44:54 GMT
ETag: "105800d-46e7-46d29136f7180"
Accept-Ranges: bytes
Content-Length: 18151
Connection: close
Content-Type: text/html
Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving afnic.fr.png
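On Python 3, the same idea can be sketched with html.parser and urllib.parse.urljoin, which resolves relative src paths against the page URL more robustly than string concatenation. The sketch below only collects the resolved image URLs rather than downloading them; the sample HTML and base URL are illustrative assumptions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    # Resolve relative paths against the page's base URL
                    self.images.append(urljoin(self.base_url, value))

page = '<body><img src="/images/python-logo.gif"><img src="trans.gif"></body>'
parser = ImageParser('http://www.python.org/index.html')
parser.feed(page)
parser.close()
print(parser.images)
```

A downloading step analogous to the article's getImage() could then fetch each collected URL with urllib.request.urlretrieve().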
III. Extracting text from an HTML document
When working with HTML documents, we often need to extract all the text from them. With the HTMLParser module, this task becomes very simple. First, we define a new HTMLParser subclass that overrides the handle_data() method, which we use to collect the text data.
import HTMLParser
import urllib

class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

lparser = parseText()
lparser.feed(urllib.urlopen( \
    "http://docs.python.org/lib/module-HTMLParser.html").read())
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and display the text it contains, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. Note that if the data passed to feed() is incomplete, the incomplete tag is saved and parsed the next time feed() is called. This feature is useful when the HTML file is large and needs to be sent to the parser in segments. The following is a specific code example:
import HTMLParser
import urllib

urlText = []

# Define the HTML parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

# Create an instance of the HTML parser
lparser = parseText()

# Pass the HTML file to the parser
lparser.feed(urllib.urlopen( \
    "http://docs.python.org/lib/module-HTMLParser.html" \
    ).read())
lparser.close()

for item in urlText:
    print item
The output of this code is also quite long, so it is omitted here.
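A Python 3 sketch of the same text-extraction technique follows; filtering on data.strip() rather than comparing against a single newline (a small deviation from the article's version) drops all whitespace-only runs, and the sample HTML is an illustrative assumption:

```python
from html.parser import HTMLParser

class TextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Keep only text runs that contain something besides whitespace
        if data.strip():
            self.pieces.append(data.strip())

page = '<html><body><h1>Title</h1><p>Hello, world.</p></body></html>'
parser = TextParser()
parser.feed(page)
parser.close()
print(parser.pieces)
```

As the article notes, feed() may also be called repeatedly with successive chunks of a large document; the parser buffers any incomplete tag at the end of one chunk and finishes parsing it with the next.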
IV. Extracting cookies from HTML documents
Very often we need to deal with cookies, and fortunately the cookielib module of the Python language provides classes that automatically handle HTTP cookies. These classes are useful when working with HTML documents that require cookies to be set on the client.
import urllib2
import cookielib
from urllib2 import urlopen, Request

cJar = cookielib.LWPCookieJar()
opener = urllib2.build_opener( \
    urllib2.HTTPCookieProcessor(cJar))
urllib2.install_opener(opener)
r = Request(testURL)
h = urlopen(r)
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)
cJar.save(cookieFile)
To extract cookies from an HTML document, first use the LWPCookieJar() function of the cookielib module to create a cookie jar instance. LWPCookieJar() returns an object that can load cookies from disk and also store cookies back to disk.
Next, use the build_opener([handler, ...]) function of the urllib2 module to create an opener object that will handle cookies when the HTML file is opened. build_opener accepts zero or more handlers, which are chained in the order they are specified, and returns an opener object.
Note that if you want urlopen() to use this opener object to open HTML files, call install_opener(opener) to install it; otherwise, use the opener object's own open(url) function to open the HTML file.
Once the opener object has been created and installed, use the Request(url) function of the urllib2 module to create a Request object, and then use urlopen(request) to open the HTML file.
When a page is opened, all of its cookies are stored in the LWPCookieJar object; you can then write them to disk with the object's save(filename) function.
import os
import urllib2
import cookielib
from urllib2 import urlopen, Request

cookieFile = "cookies.dat"
testURL = 'http://maps.google.com/'

# Create an instance of the cookie jar
cJar = cookielib.LWPCookieJar()

# Create an opener with an HTTPCookieProcessor
opener = urllib2.build_opener( \
    urllib2.HTTPCookieProcessor(cJar))

# Install the HTTPCookieProcessor opener
urllib2.install_opener(opener)

# Create a Request object
r = Request(testURL)

# Open the HTML file
h = urlopen(r)
print "Page header\n========================"
print h.info()
print "Page cookies\n========================"
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)

# Save the cookies
cJar.save(cookieFile)
The result of the above code is as follows:
Page header
========================
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=5d9692b55f029733:NW=1:TM=1246015608:LM=1246015608:S=frfx--b3xt73taea; expires=Sun, 26-Jun-2011 11:26:48 GMT; path=/; domain=.google.com
Date: Fri, June 11:26:48 GMT
Server: mfe
Expires: Fri, June 11:26:48 GMT
Transfer-Encoding: chunked
Connection: close
Page cookies
========================
0 -
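On Python 3, cookielib became http.cookiejar and urllib2 became urllib.request. The sketch below avoids any network access: it builds a cookie by hand purely for illustration (a real cookie would arrive in a server's Set-Cookie header and be captured by the HTTPCookieProcessor automatically), stores it in an LWPCookieJar, and saves the jar to disk. The domain, name, value, and filename are all made-up assumptions:

```python
import time
import urllib.request
from http.cookiejar import Cookie, LWPCookieJar

# Create the cookie jar and an opener that would manage cookies for us
jar = LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Build a cookie by hand; every field here is an illustrative assumption
cookie = Cookie(
    version=0, name='PREF', value='ID=abc123', port=None,
    port_specified=False, domain='.example.com', domain_specified=True,
    domain_initial_dot=True, path='/', path_specified=True, secure=False,
    expires=int(time.time()) + 3600, discard=False, comment=None,
    comment_url=None, rest={})
jar.set_cookie(cookie)

for ind, c in enumerate(jar):
    print("%d - %s" % (ind, c))

# Persist the jar in LWP format, as the article does with cJar.save()
jar.save('cookies.dat')
```

With the opener installed via urllib.request.install_opener(opener), any subsequent urlopen() call would send and record cookies through this jar.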
V. Adding quotation marks to attribute values in an HTML document
Earlier we parsed HTML files with a single handler of the HTML parser, but sometimes we need to use all the handlers to process an HTML document. Thankfully, parsing every element of an HTML file with the HTMLParser module is not much harder than handling links or images.
import HTMLParser
import urllib

class parseAttrs(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        ...

attrParser = parseAttrs()
attrParser.init_parser()
attrParser.feed(urllib.urlopen("test2.html").read())
Here we discuss how to use the HTMLParser module to parse an HTML file and enclose its unquoted attribute values in quotation marks. First, we define a new HTMLParser subclass that overrides all of the following handlers in order to add quotation marks to the attribute values.
handle_starttag(tag, attrs)
handle_charref(name)
handle_endtag(tag)
handle_entityref(ref)
handle_data(text)
handle_comment(text)
handle_pi(text)
handle_decl(text)
handle_startendtag(tag, attrs)
We also need to define a function in the parser class that initializes the variable used to store the parsed data, and another function that returns the parsed data.
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object, and initialize the parser with the init function we created. We can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and add quotation marks to the attribute values, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. The following is a specific example:
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseAttrs(HTMLParser.HTMLParser):
    def init_parser(self):
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixedAttrs = ""
        for name, value in attrs:
            fixedAttrs += '%s="%s" ' % (name, value)
        self.pieces.append("<%s %s>" % (tag, fixedAttrs))

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % ref)

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % text)

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % text)

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % text)

    def parsed(self):
        return "".join(self.pieces)

# Create an instance of the HTML parser
attrParser = parseAttrs()

# Initialize the parser data
attrParser.init_parser()

# Pass the HTML file to the parser
attrParser.feed(urllib.urlopen("test2.html").read())

# Show the original file content
print "Original file\n=========================="
print open("test2.html").read()

# Show the parsed file
print "Parsed file\n=========================="
print attrParser.parsed()
attrParser.close()
We also need to create a test file named test2.html; its contents can be seen in the output of the code above, which is as follows:
Original file
==========================
<meta content="text/html; charset=utf-8"
http-equiv="Content-Type" />
<title>Web page</title>
<body>
<h1>Web page list</h1>
<a href=http://www.python.org>Python website</a>
<a href=test.html>Local page</a>
</body>

Parsed file
==========================
<meta content="text/html; charset=utf-8"
http-equiv="Content-Type"></meta>
<title>Web page</title>
<body>
<h1>Web page list</h1>
<a href="http://www.python.org">Python website</a>
<a href="test.html">Local page</a>
</body>
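The attribute-quoting parser translates directly to Python 3's html.parser. The minimal sketch below covers only start tags, end tags, and text data (the remaining handlers would be added exactly as in the full example above), and assumes every attribute has a value; the input string is an illustrative assumption:

```python
from html.parser import HTMLParser

class AttrQuoter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the tag with every attribute value wrapped in quotes
        fixed = ''.join(' %s="%s"' % (name, value) for name, value in attrs)
        self.pieces.append('<%s%s>' % (tag, fixed))

    def handle_endtag(self, tag):
        self.pieces.append('</%s>' % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def parsed(self):
        return ''.join(self.pieces)

quoter = AttrQuoter()
quoter.feed('<body><a href=test.html>Local page</a></body>')
quoter.close()
print(quoter.parsed())
```

Note that html.parser reports a valueless attribute (such as a bare disabled) with value None, so a production version would need to handle that case separately.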
VI. Summary
Applications such as search engines, file indexing, document conversion, data retrieval, and site backup or migration frequently need to parse Web pages (that is, HTML files). In fact, the modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. This article has detailed how to use Python modules to quickly parse the data in an HTML file and handle specific content such as links, images, and cookies, and has also given an example of normalizing the formatting of HTML tags. I hope this article is helpful to you.