Applications such as search engines, file indexing, document conversion, data retrieval, and site backup or migration frequently need to parse Web pages (that is, HTML files). In fact, the modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. This article describes Python modules that simplify opening HTML documents, whether stored locally or on the Web, and shows how to quickly parse the data in an HTML file to handle specific content such as links, images, and cookies. It also describes how to normalize the formatting of an HTML file's tags.
I. Extracting links from HTML documents
The Python language provides a very useful module, HTMLParser, which enables us to parse HTML documents in a concise and efficient manner based on the tags they contain. As a result, HTMLParser is one of the most commonly used modules when working with HTML documents.
import HTMLParser
import urllib

class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

lparser = parseLinks()
lparser.feed(urllib.urlopen("http://www.python.org/index.html").read())
When working with HTML documents, we often need to extract all the links from them. With the HTMLParser module, this task becomes a breeze. First, we define a new HTMLParser subclass that overrides the handle_starttag() method; we'll use this method to display the href attribute value of every A tag.
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and display the links it contains, use the read() function to pass the data to the parser object's feed() function, which parses the data through the methods defined in the subclass. Note that if the data passed to feed() is incomplete, the incomplete tag is saved and parsed the next time feed() is called. This feature is useful when the HTML file is large and needs to be sent to the parser in segments. Here's a concrete example.
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print value
                    print self.get_starttag_text()

# Create an instance of the HTML parser
lparser = parseLinks()

# Open the HTML file
lparser.feed(urllib.urlopen( \
    "http://www.python.org/index.html").read())
lparser.close()
The output of this code is quite long, so run it yourself to see the result.
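The examples above target Python 2, where the module is named HTMLParser. As a rough sketch of the same technique on Python 3 (where the module was renamed html.parser and urllib was reorganized), the link extractor can be run against an in-memory page; the sample HTML, class name, and the choice to collect links in a list rather than print each one are illustrative assumptions:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href value of every <a> tag we encounter
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

page = ('<html><body><a href="http://www.python.org">Python</a>'
        '<a href="docs.html">Docs</a></body></html>')
parser = LinkParser()
parser.feed(page)
parser.close()
print(parser.links)
```

Collecting the links in a list instead of printing them makes the parser easier to reuse and test; a real crawler would feed it the result of urllib.request.urlopen(url).read().decode() instead of a literal string.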
II. Extracting images from an HTML document
When working with HTML documents, we often need to extract all the images from them. With the HTMLParser module, this task becomes a breeze. First, we define a new HTMLParser subclass that overrides the handle_starttag() method to find IMG tags and save the files referenced by their src attribute values.
import HTMLParser
import urllib

def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()

class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

u = urllib.urlopen(urlString)
lparser.feed(u.read())
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and save the images it contains, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. The following is a concrete example:
import HTMLParser
import urllib
import sys

urlString = "http://www.python.org"

# Save the image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

# Define the HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    getImage(urlString + "/" + value)

# Create an instance of the HTML parser
lparser = parseImages()

# Open the HTML file
u = urllib.urlopen(urlString)
print "Opening url\n======================"
print u.info()

# Pass the HTML file to the parser
lparser.feed(u.read())
lparser.close()
The result of the above code is as follows:
Opening url
======================
Date: Fri, June 10:54:49 GMT
Server: Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_ssl/2.2.9 OpenSSL/0.9.8g mod_wsgi/2.3 Python/2.5.2
Last-Modified: Thu, June 09:44:54 GMT
ETag: "105800d-46e7-46d29136f7180"
Accept-Ranges: bytes
Content-Length: 18151
Connection: close
Content-Type: text/html
Saving python-logo.gif
Saving trans.gif
Saving trans.gif
Saving afnic.fr.png
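On Python 3, the same idea can be sketched with html.parser and urllib.parse.urljoin, which resolves relative src paths against the page URL more robustly than string concatenation. The sketch below only collects the resolved image URLs rather than downloading them; the sample HTML and base URL are illustrative assumptions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    # Resolve relative paths against the page's base URL
                    self.images.append(urljoin(self.base_url, value))

page = '<body><img src="/images/python-logo.gif"><img src="trans.gif"></body>'
parser = ImageParser('http://www.python.org/index.html')
parser.feed(page)
parser.close()
print(parser.images)
```

A downloading step analogous to the article's getImage() could then fetch each collected URL with urllib.request.urlretrieve().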
III. Extracting text from an HTML document
When working with HTML documents, we often need to extract all the text from them. With the HTMLParser module, this task becomes very simple. First, we define a new HTMLParser subclass that overrides the handle_data() method, which we use to collect the text data.
import HTMLParser
import urllib

class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

lparser = parseText()
lparser.feed(urllib.urlopen( \
    "http://docs.python.org/lib/module-HTMLParser.html").read())
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and display the text it contains, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. Note that if the data passed to feed() is incomplete, the incomplete tag is saved and parsed the next time feed() is called. This feature is useful when the HTML file is large and needs to be sent to the parser in segments. The following is a specific code example:
import HTMLParser
import urllib

urlText = []

# Define the HTML parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

# Create an instance of the HTML parser
lparser = parseText()

# Pass the HTML file to the parser
lparser.feed(urllib.urlopen( \
    "http://docs.python.org/lib/module-HTMLParser.html" \
    ).read())
lparser.close()

for item in urlText:
    print item
The output of this code is also quite long, so it is omitted here.
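A Python 3 sketch of the same text-extraction technique follows; filtering on data.strip() rather than comparing against a single newline (a small deviation from the article's version) drops all whitespace-only runs, and the sample HTML is an illustrative assumption:

```python
from html.parser import HTMLParser

class TextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        # Keep only text runs that contain something besides whitespace
        if data.strip():
            self.pieces.append(data.strip())

page = '<html><body><h1>Title</h1><p>Hello, world.</p></body></html>'
parser = TextParser()
parser.feed(page)
parser.close()
print(parser.pieces)
```

As the article notes, feed() may also be called repeatedly with successive chunks of a large document; the parser buffers any incomplete tag at the end of one chunk and finishes parsing it with the next.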
IV. Extracting cookies from HTML documents
Very often we need to deal with cookies, and fortunately the cookielib module of the Python language provides classes that automatically handle HTTP cookies. These classes are useful when working with HTML documents that require cookies to be set on the client.
import urllib2
import cookielib
from urllib2 import urlopen, Request

cJar = cookielib.LWPCookieJar()
opener = urllib2.build_opener( \
    urllib2.HTTPCookieProcessor(cJar))
urllib2.install_opener(opener)
r = Request(testURL)
h = urlopen(r)
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)
cJar.save(cookieFile)
To extract cookies from an HTML document, first use the LWPCookieJar() function of the cookielib module to create a cookie jar instance. LWPCookieJar() returns an object that can load cookies from disk and also store cookies back to disk.
Next, use the build_opener([handler, ...]) function of the urllib2 module to create an opener object that will handle cookies when the HTML file is opened. build_opener accepts zero or more handlers, which are chained in the order they are specified, and returns an opener object.
Note that if you want urlopen() to use this opener object to open HTML files, call install_opener(opener) to install it; otherwise, use the opener object's own open(url) function to open the HTML file.
Once the opener object has been created and installed, use the Request(url) function of the urllib2 module to create a Request object, and then use urlopen(request) to open the HTML file.
When a page is opened, all of its cookies are stored in the LWPCookieJar object; you can then write them to disk with the object's save(filename) function.
import os
import urllib2
import cookielib
from urllib2 import urlopen, Request

cookieFile = "cookies.dat"
testURL = 'http://maps.google.com/'

# Create an instance of the cookie jar
cJar = cookielib.LWPCookieJar()

# Create an opener with an HTTPCookieProcessor
opener = urllib2.build_opener( \
    urllib2.HTTPCookieProcessor(cJar))

# Install the HTTPCookieProcessor opener
urllib2.install_opener(opener)

# Create a Request object
r = Request(testURL)

# Open the HTML file
h = urlopen(r)
print "Page header\n========================"
print h.info()
print "Page cookies\n========================"
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)

# Save the cookies
cJar.save(cookieFile)
The result of the above code is as follows:
Page header
========================
Cache-Control: private
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=5d9692b55f029733:NW=1:TM=1246015608:LM=1246015608:S=frfx--b3xt73taea; expires=Sun, 26-Jun-2011 11:26:48 GMT; path=/; domain=.google.com
Date: Fri, June 11:26:48 GMT
Server: mfe
Expires: Fri, June 11:26:48 GMT
Transfer-Encoding: chunked
Connection: close
Page cookies
========================
0 -
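On Python 3, cookielib became http.cookiejar and urllib2 became urllib.request. The sketch below avoids any network access: it builds a cookie by hand purely for illustration (a real cookie would arrive in a server's Set-Cookie header and be captured by the HTTPCookieProcessor automatically), stores it in an LWPCookieJar, and saves the jar to disk. The domain, name, value, and filename are all made-up assumptions:

```python
import time
import urllib.request
from http.cookiejar import Cookie, LWPCookieJar

# Create the cookie jar and an opener that would manage cookies for us
jar = LWPCookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# Build a cookie by hand; every field here is an illustrative assumption
cookie = Cookie(
    version=0, name='PREF', value='ID=abc123', port=None,
    port_specified=False, domain='.example.com', domain_specified=True,
    domain_initial_dot=True, path='/', path_specified=True, secure=False,
    expires=int(time.time()) + 3600, discard=False, comment=None,
    comment_url=None, rest={})
jar.set_cookie(cookie)

for ind, c in enumerate(jar):
    print("%d - %s" % (ind, c))

# Persist the jar in LWP format, as the article does with cJar.save()
jar.save('cookies.dat')
```

With the opener installed via urllib.request.install_opener(opener), any subsequent urlopen() call would send and record cookies through this jar.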
V. Adding quotation marks to attribute values in an HTML document
Earlier we parsed HTML files with a single handler of the HTML parser, but sometimes we need to use all the handlers to process an HTML document. Thankfully, parsing every element of an HTML file with the HTMLParser module is not much harder than handling links or images.
import HTMLParser
import urllib

class parseAttrs(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        ...

attrParser = parseAttrs()
attrParser.init_parser()
attrParser.feed(urllib.urlopen("test2.html").read())
Here we discuss how to use the HTMLParser module to parse an HTML file and enclose its unquoted attribute values in quotation marks. First, we define a new HTMLParser subclass that overrides all of the following handlers in order to add quotation marks to the attribute values.
handle_starttag(tag, attrs)
handle_charref(name)
handle_endtag(tag)
handle_entityref(ref)
handle_data(text)
handle_comment(text)
handle_pi(text)
handle_decl(text)
handle_startendtag(tag, attrs)
We also need to define a function in the parser class that initializes the variable used to store the parsed data, and another function that returns the parsed data.
Once the new HTMLParser subclass is defined, create an instance of it to obtain a parser object, and initialize the parser with the init function we created. We can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.
To parse the contents of the HTML file and add quotation marks to the attribute values, pass the data to the parser object with the feed(data) function, which parses the data through the methods defined in the subclass. The following is a specific example:
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseAttrs(HTMLParser.HTMLParser):
    def init_parser(self):
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixedAttrs = ""
        for name, value in attrs:
            fixedAttrs += '%s="%s" ' % (name, value)
        self.pieces.append("<%s %s>" % (tag, fixedAttrs))

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % ref)

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % text)

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % text)

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % text)

    def parsed(self):
        return "".join(self.pieces)

# Create an instance of the HTML parser
attrParser = parseAttrs()

# Initialize the parser data
attrParser.init_parser()

# Pass the HTML file to the parser
attrParser.feed(urllib.urlopen("test2.html").read())

# Show the original file content
print "Original file\n=========================="
print open("test2.html").read()

# Show the parsed file
print "Parsed file\n=========================="
print attrParser.parsed()
attrParser.close()
We also need to create a test file named test2.html; its contents can be seen in the output of the code above, which is as follows:
Original file
==========================
<meta content="text/html; charset=utf-8"
http-equiv="Content-Type" />
<title>Web page</title>
<body>
<h1>Web page list</h1>
<a href=http://www.python.org>Python website</a>
<a href=test.html>Local page</a>
</body>

Parsed file
==========================
<meta content="text/html; charset=utf-8"
http-equiv="Content-Type"></meta>
<title>Web page</title>
<body>
<h1>Web page list</h1>
<a href="http://www.python.org">Python website</a>
<a href="test.html">Local page</a>
</body>
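The attribute-quoting parser translates directly to Python 3's html.parser. The minimal sketch below covers only start tags, end tags, and text data (the remaining handlers would be added exactly as in the full example above), and assumes every attribute has a value; the input string is an illustrative assumption:

```python
from html.parser import HTMLParser

class AttrQuoter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the tag with every attribute value wrapped in quotes
        fixed = ''.join(' %s="%s"' % (name, value) for name, value in attrs)
        self.pieces.append('<%s%s>' % (tag, fixed))

    def handle_endtag(self, tag):
        self.pieces.append('</%s>' % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def parsed(self):
        return ''.join(self.pieces)

quoter = AttrQuoter()
quoter.feed('<body><a href=test.html>Local page</a></body>')
quoter.close()
print(quoter.parsed())
```

Note that html.parser reports a valueless attribute (such as a bare disabled) with value None, so a production version would need to handle that case separately.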
VI. Summary
Applications such as search engines, file indexing, document conversion, data retrieval, and site backup or migration frequently need to parse Web pages (that is, HTML files). In fact, the modules available in the Python language allow us to parse and manipulate HTML documents without using a Web server or Web browser. This article has detailed how to use Python modules to quickly parse the data in an HTML file and handle specific content such as links, images, and cookies, and has also given an example of normalizing the formatting of HTML tags. I hope this article is helpful to you.