This is a case study of using XPath; for more information, see the Python Learning Guide.
Case: Crawler using XPath
Now let's use XPath to build a simple crawler: it crawls all the posts in a Tieba forum and downloads the images from each floor of every post to local disk.
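Before looking at the full crawler, here is a small, self-contained demo of the two XPath queries it relies on, run against an inline HTML snippet instead of a live page (requires lxml; the class names `threadlist_lz clearfix`, `noreferrer`, and `BDE_Image` mirror the ones used in the crawler code and may differ on the current Tieba site):

```python
from lxml import etree

# a tiny stand-in for a Tieba forum page and a post page
html = """
<div class="threadlist_lz clearfix">
  <div><a rel="noreferrer" href="/p/4884069807">a post</a></div>
</div>
<img class="BDE_Image" src="http://example.com/1.png"/>
"""

selector = etree.HTML(html)

# post links: the href attribute of every matching <a>
links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')
print(links)  # ['/p/4884069807']

# image sources: the src attribute of every matching <img>
srcs = selector.xpath('//img[@class="BDE_Image"]/@src')
print(srcs)   # ['http://example.com/1.png']
```

Note that an XPath expression ending in `/@href` returns the attribute values themselves as a list of strings, not the elements, which is exactly what the crawler needs.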
```python
# -*- coding: utf-8 -*-
# tieba_xpath.py
"""
Case: a simple crawler using XPath.
Crawls all the posts in a Tieba forum and downloads
the images from each post to local disk.
"""

import os
import urllib
import urllib2
from lxml import etree


class Spider:
    def __init__(self):
        self.tiebaName = raw_input("Please enter the forum you want to visit: ")
        self.beginPage = int(raw_input("Please enter the start page: "))
        self.endPage = int(raw_input("Please enter the end page: "))
        self.url = "http://tieba.baidu.com/f"
        self.ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
        # image counter, used to name the saved files
        self.userName = 1

    def tiebaSpider(self):
        for page in range(self.beginPage, self.endPage + 1):
            pn = (page - 1) * 50  # page offset: Tieba shows 50 posts per page
            word = {'pn': pn, 'kw': self.tiebaName}
            word = urllib.urlencode(word)  # convert to url-encoded format (string)
            myUrl = self.url + "?" + word
            # example: http://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&pn=50
            # call the page-processing function loadPage
            # and get the links of all posts on the page
            links = self.loadPage(myUrl)

    # get the page content
    def loadPage(self, url):
        req = urllib2.Request(url, headers=self.ua_header)
        html = urllib2.urlopen(req).read()

        # parse the html into an HTML DOM document
        selector = etree.HTML(html)

        # grab the second half of each post's url on the current page,
        # i.e. the post number
        # e.g. "p/4884069807" in http://tieba.baidu.com/p/4884069807
        links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')

        # links is a list of etree ElementString objects;
        # iterate over the list, build each full post address,
        # and call the image-processing function loadImage
        for link in links:
            link = "http://tieba.baidu.com" + link
            self.loadImage(link)
        return links

    # get the images
    def loadImage(self, link):
        req = urllib2.Request(link, headers=self.ua_header)
        html = urllib2.urlopen(req).read()
        selector = etree.HTML(html)

        # get the src paths of all images in this post
        imageLinks = selector.xpath('//img[@class="BDE_Image"]/@src')

        # take out each image path in turn and download and save it
        for imageLink in imageLinks:
            self.writeImages(imageLink)

    # save the image content
    def writeImages(self, imageLink):
        """Write the binary content of the image into a file named by userName."""
        print(imageLink)
        print("Saving file %d ..." % self.userName)

        # make sure the target directory exists
        if not os.path.exists('./images'):
            os.makedirs('./images')

        # 1. open a file, which returns a file object
        file = open('./images/' + str(self.userName) + '.png', 'wb')

        # 2. fetch the binary content of the image
        images = urllib2.urlopen(imageLink).read()

        # 3. call the file object's write() method to write the content to the file
        file.write(images)

        # 4. finally, close the file
        file.close()

        # increment the counter by 1
        self.userName += 1


# simulate a __main__ function:
if __name__ == '__main__':
    # first create the spider object
    mySpider = Spider()
    # call the spider object's method to get to work
    mySpider.tiebaSpider()
```
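The listing above targets Python 2 (`urllib2`, `raw_input`). As a rough sketch of how the URL construction and request setup translate to Python 3, where `urllib2` was split into `urllib.request` and `urllib.parse` (the helper name `page_url` is mine, not from the original):

```python
# Python 3 equivalent of the page-URL construction and request setup
from urllib.parse import urlencode
from urllib.request import Request

def page_url(kw, page):
    """Build the forum page URL; Tieba paginates in steps of 50 posts."""
    pn = (page - 1) * 50
    return "http://tieba.baidu.com/f?" + urlencode({'kw': kw, 'pn': pn})

url = page_url('python', 2)
print(url)  # http://tieba.baidu.com/f?kw=python&pn=50

# the request object would then be opened with urllib.request.urlopen
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
```

The rest of the crawler ports the same way: replace `urllib2.Request`/`urllib2.urlopen` with their `urllib.request` counterparts and `raw_input` with `input`.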