Python uses HTMLParser and cookielib to crawl and parse Web pages: extracting links, images, text, and cookies from HTML documents


I. Extracting links from HTML documents

The HTMLParser module enables us to parse HTML documents in a concise and efficient manner, based on the tags the document contains.

When working with HTML documents, we often need to extract all the links from them. With the HTMLParser module, this task becomes a breeze. First, we define a new HTMLParser subclass that overrides the handle_starttag() method; we will use this method to display the href attribute value of every <a> tag.

Once you have defined the new HTMLParser subclass, create an instance of it. You can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.

To parse the contents of the HTML file and display the links contained in it, pass the data to the HTMLParser object with its feed(data) function, which parses the data through the handlers defined on the subclass. Note that if the data passed to feed() is incomplete, the incomplete tag is saved and parsed the next time feed() is called. This feature is useful when the HTML file is large and needs to be sent to the parser in segments. Here is a concrete example.

# -*- coding: utf-8 -*-
__author__ = 'paul'
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseLinks(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a':                          ## the tag is <a>
            for name, value in attrs:
                if name == 'href':              ## the attribute is href
                    print value                 ## show the attribute value
                    print self.get_starttag_text()  ## show the whole start tag

# Create an instance of the HTML parser
lParser = parseLinks()
# Open the HTML file
lParser.feed(urllib.urlopen("http://www.python.org/index.html").read())
lParser.close()
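The example above is Python 2; in Python 3 the module was renamed to html.parser. A rough Python 3 equivalent, parsing a literal HTML snippet instead of fetching a live page, might look like the sketch below. It also demonstrates the buffering behavior described above: a tag split across two feed() calls is still parsed correctly.

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []                     # collected href values

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkParser()
# feed() may receive the document in pieces; an incomplete tag is
# buffered and parsed when the next chunk arrives.
parser.feed('<a hr')
parser.feed('ef="http://www.python.org/">Python</a>')
parser.feed('<a href="/about/">About</a>')
parser.close()
print(parser.links)    # prints ['http://www.python.org/', '/about/']
```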

II. Extracting images from an HTML document

Extracting images works the same way: define a new HTMLParser subclass that overrides handle_starttag() to look for <img> tags and read their src attributes. Once the class is defined, create an instance of it; you can then use urllib.urlopen(url) to open the HTML document and read the contents of the HTML file.

To parse the contents of the HTML file and save the images contained in it, send the data to the HTMLParser object with the feed(data) function; the parser will dispatch the data to the handlers defined on the subclass. The following is a concrete example:

# -*- coding: utf-8 -*-
__author__ = 'paul'
import HTMLParser
import urllib
import sys

urlString = 'http://www.python.org'
#urlString = "http://www.baidu.com"

# Save the image file to disk
def getImage(addr):
    u = urllib.urlopen(addr)
    data = u.read()
    splitPath = addr.split('/')
    fName = splitPath.pop()
    print "Saving %s" % fName
    f = open(fName, 'wb')
    f.write(data)
    f.close()

# Define the HTML parser
class parseImages(HTMLParser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    #value1 = value[2:]
                    #print value1
                    print self.get_starttag_text()
                    # When the image's src is a relative path, join it to the base URL
                    getImage(urlString + "/" + value)
                    # When the image's src is already absolute, use it directly;
                    # note that urllib.urlopen needs a protocol prefix such as
                    # http:// or https://
                    #getImage("http://" + value1)

# Create an instance of the HTML parser
lParser = parseImages()
# Open the HTML file
u = urllib.urlopen(urlString)
print "Opening url\n===================="
print u.info()
# Pass the HTML file to the parser
lParser.feed(u.read())
lParser.close()
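The manual path concatenation in the Python 2 example above is fragile. A Python 3 sketch of the same idea, using urllib.parse.urljoin to resolve both relative and absolute src values against a base URL (the URLs here are illustrative, and the snippet parses a literal string rather than downloading anything):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageParser(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []                    # absolute image URLs

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    # urljoin handles relative and absolute src values alike,
                    # so no manual string concatenation is needed
                    self.images.append(urljoin(self.base_url, value))

parser = ImageParser('http://www.python.org/index.html')
parser.feed('<img src="/static/logo.png"><img src="http://cdn.example.com/a.gif">')
parser.close()
print(parser.images)
# prints ['http://www.python.org/static/logo.png', 'http://cdn.example.com/a.gif']
```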

III. Extracting text from an HTML document

When working with HTML documents, we often need to extract all the text from them. With the HTMLParser module, this task becomes very simple. First, we define a new HTMLParser subclass that overrides the handle_data() method, which is used to collect the text data.

# -*- coding: utf-8 -*-
__author__ = 'paul'
import HTMLParser
import urllib

urlText = []

# Define the HTML parser
class parseText(HTMLParser.HTMLParser):
    def handle_data(self, data):
        if data != '\n':
            urlText.append(data)

# Create an instance of the HTML parser
lParser = parseText()
# Pass the HTML file to the parser
lParser.feed(urllib.urlopen("http://www.baidu.com").read())
lParser.close()
for item in urlText:
    print item
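A Python 3 sketch of the same technique, again parsing a literal snippet so it runs offline; filtering on data.strip() skips all whitespace-only runs, not just bare newlines:

```python
from html.parser import HTMLParser

class TextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []                      # collected text fragments

    def handle_data(self, data):
        if data.strip():                    # skip whitespace-only runs
            self.text.append(data.strip())

parser = TextParser()
parser.feed('<html><body><h1>Hello</h1>\n<p>world</p></body></html>')
parser.close()
print(parser.text)     # prints ['Hello', 'world']
```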

IV. Extracting cookies from HTML documents

To extract cookies from an HTML document, first use the LWPCookieJar() function of the cookielib module to create an instance of a cookie jar. LWPCookieJar() returns an object that can load cookies from disk and also store cookies on disk.

Next, use the urllib2 module's build_opener([handler, ...]) function to create an opener object that will handle cookies when the HTML file is opened. build_opener can receive zero or more handlers, which are chained in the order in which they are specified, and returns an opener object.
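In Python 3, urllib2 was folded into urllib.request, but build_opener works the same way. A minimal sketch of the handler chaining just described (the empty ProxyHandler is merely an illustrative second handler):

```python
import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()
# build_opener accepts zero or more handlers, chains them in the order
# given, and returns an OpenerDirector that routes requests through them.
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar),   # stores cookies in jar
    urllib.request.ProxyHandler({}),           # empty mapping: no proxies
)
print(type(opener).__name__)                   # prints OpenerDirector
# install_opener makes urllib.request.urlopen() use this opener globally
urllib.request.install_opener(opener)
```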

Note that if you want urlopen() to use the opener object to open HTML files, call the install_opener(opener) function and pass the opener object to it. Otherwise, use the opener object's open(url) function to open the HTML file.

Once you have created and installed the opener object, you can use the Request(url) function in the urllib2 module to create a Request object, and then use the urlopen(request) function to open the HTML file.

When you open an HTML page, all the cookies for that page are stored in the LWPCookieJar object; you can then use the LWPCookieJar object's save(filename) function to save them to disk.

# -*- coding: utf-8 -*-
__author__ = 'paul'
import os
import urllib2
import cookielib
from urllib2 import urlopen, Request

cookieFile = "cookies.dat"
testURL = 'http://www.baidu.com/'

# Create an instance for the cookie jar
cJar = cookielib.LWPCookieJar()
# Create an opener with an HTTPCookieProcessor
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cJar))
# Install the HTTPCookieProcessor opener
urllib2.install_opener(opener)
# Create a Request object
r = Request(testURL)
# Open the HTML file
h = urlopen(r)
print "Page header\n======================"
print h.info()
print "Page cookies\n======================"
print cJar
for ind, cookie in enumerate(cJar):
    print "%d - %s" % (ind, cookie)
    # Save the cookies
    cJar.save(cookieFile)
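In Python 3, cookielib became http.cookiejar. The sketch below shows the save/load round trip described above without any network access: it constructs a cookie by hand (the domain, name, and value are made up for the demonstration) and writes the jar to a temporary file.

```python
import os
import tempfile
from http.cookiejar import LWPCookieJar, Cookie

# Build a cookie by hand; all field values here are illustrative
cookie = Cookie(
    version=0, name='session', value='abc123',
    port=None, port_specified=False,
    domain='www.example.com', domain_specified=True, domain_initial_dot=False,
    path='/', path_specified=True,
    secure=False, expires=None, discard=False,
    comment=None, comment_url=None, rest={}, rfc2109=False,
)

jar = LWPCookieJar()
jar.set_cookie(cookie)

# Round-trip the jar through a file on disk; ignore_discard also keeps
# session cookies that would normally be dropped on save/load
cookie_file = os.path.join(tempfile.gettempdir(), 'cookies.dat')
jar.save(cookie_file, ignore_discard=True)

jar2 = LWPCookieJar()
jar2.load(cookie_file, ignore_discard=True)
print([c.name for c in jar2])   # prints ['session']
```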

V. Adding quotation marks to attribute values in an HTML document

Some HTML pages leave attribute values unquoted. The parser below overrides each handler to re-emit every piece of the document, rebuilding the page with all attribute values quoted:

# -*- coding: utf-8 -*-
__author__ = 'paul'
import HTMLParser
import urllib
import sys

# Define the HTML parser
class parseAttrs(HTMLParser.HTMLParser):
    def init_parser(self):
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixedAttrs = ""
        for name, value in attrs:
            fixedAttrs += "%s=\"%s\" " % (name, value)
        self.pieces.append("<%s %s>" % (tag, fixedAttrs))

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % (name))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % (tag))

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % (ref))

    def handle_data(self, text):
        self.pieces.append(text)

    def handle_comment(self, text):
        self.pieces.append("<!--%s-->" % (text))

    def handle_pi(self, text):
        self.pieces.append("<?%s>" % (text))

    def handle_decl(self, text):
        self.pieces.append("<!%s>" % (text))

    def parsed(self):
        return "".join(self.pieces)

# Create an instance of the HTML parser
attrParser = parseAttrs()
# Initialize the parser data
attrParser.init_parser()
# Pass the HTML file to the parser
attrParser.feed(urllib.urlopen("test2.html").read())
# Show the original file contents
print "Original file\n========================"
print open("test2.html").read()
# Show the parsed file
print "Parsed file\n========================"
print attrParser.parsed()
attrParser.close()
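A rough Python 3 port of the same technique, run against a literal snippet. One Python 3 wrinkle: convert_charrefs must be disabled in the constructor, or handle_entityref() and handle_charref() never fire.

```python
from html.parser import HTMLParser

class AttrQuoter(HTMLParser):
    def __init__(self):
        # convert_charrefs=False keeps handle_entityref/handle_charref active
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        fixed = "".join(' %s="%s"' % (name, value) for name, value in attrs)
        self.pieces.append("<%s%s>" % (tag, fixed))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def handle_entityref(self, ref):
        self.pieces.append("&%s;" % ref)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_comment(self, data):
        self.pieces.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.pieces.append("<!%s>" % decl)

    def parsed(self):
        return "".join(self.pieces)

quoter = AttrQuoter()
quoter.feed('<img src=logo.png width=120>')
quoter.close()
print(quoter.parsed())   # prints <img src="logo.png" width="120">
```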

