This is a case study of using XPath; for more information, see the Python Learning Guide.
Case: Crawler using XPath
Now let's use XPath to build a simple crawler: it crawls all the posts in a Tieba forum and downloads the images from each floor of every post to local disk.
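Before looking at the full crawler, here is a small, self-contained demo of the two XPath queries it relies on, run against an inline HTML snippet instead of a live page (requires lxml; the class names `threadlist_lz clearfix`, `noreferrer`, and `BDE_Image` mirror the ones used in the crawler code and may differ on the current Tieba site):

```python
from lxml import etree

# a tiny stand-in for a Tieba forum page and a post page
html = """
<div class="threadlist_lz clearfix">
  <div><a rel="noreferrer" href="/p/4884069807">a post</a></div>
</div>
<img class="BDE_Image" src="http://example.com/1.png"/>
"""

selector = etree.HTML(html)

# post links: the href attribute of every matching <a>
links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')
print(links)  # ['/p/4884069807']

# image sources: the src attribute of every matching <img>
srcs = selector.xpath('//img[@class="BDE_Image"]/@src')
print(srcs)   # ['http://example.com/1.png']
```

Note that an XPath expression ending in `/@href` returns the attribute values themselves as a list of strings, not the elements, which is exactly what the crawler needs.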
```python
# -*- coding: utf-8 -*-
# tieba_xpath.py
"""
Case: a simple crawler using XPath.
Crawls all the posts in a Tieba forum and downloads
the images from each post to local disk.
"""

import os
import urllib
import urllib2
from lxml import etree


class Spider:
    def __init__(self):
        self.tiebaName = raw_input("Please enter the forum you want to visit: ")
        self.beginPage = int(raw_input("Please enter the start page: "))
        self.endPage = int(raw_input("Please enter the end page: "))
        self.url = "http://tieba.baidu.com/f"
        self.ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
        # image counter, used to name the saved files
        self.userName = 1

    def tiebaSpider(self):
        for page in range(self.beginPage, self.endPage + 1):
            pn = (page - 1) * 50  # page offset: Tieba shows 50 posts per page
            word = {'pn': pn, 'kw': self.tiebaName}
            word = urllib.urlencode(word)  # convert to url-encoded format (string)
            myUrl = self.url + "?" + word
            # example: http://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&pn=50
            # call the page-processing function loadPage
            # and get the links of all posts on the page
            links = self.loadPage(myUrl)

    # get the page content
    def loadPage(self, url):
        req = urllib2.Request(url, headers=self.ua_header)
        html = urllib2.urlopen(req).read()

        # parse the html into an HTML DOM document
        selector = etree.HTML(html)

        # grab the second half of each post's url on the current page,
        # i.e. the post number
        # e.g. "p/4884069807" in http://tieba.baidu.com/p/4884069807
        links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')

        # links is a list of etree ElementString objects;
        # iterate over the list, build each full post address,
        # and call the image-processing function loadImage
        for link in links:
            link = "http://tieba.baidu.com" + link
            self.loadImage(link)
        return links

    # get the images
    def loadImage(self, link):
        req = urllib2.Request(link, headers=self.ua_header)
        html = urllib2.urlopen(req).read()
        selector = etree.HTML(html)

        # get the src paths of all images in this post
        imageLinks = selector.xpath('//img[@class="BDE_Image"]/@src')

        # take out each image path in turn and download and save it
        for imageLink in imageLinks:
            self.writeImages(imageLink)

    # save the image content
    def writeImages(self, imageLink):
        """Write the binary content of the image into a file named by userName."""
        print(imageLink)
        print("Saving file %d ..." % self.userName)

        # make sure the target directory exists
        if not os.path.exists('./images'):
            os.makedirs('./images')

        # 1. open a file, which returns a file object
        file = open('./images/' + str(self.userName) + '.png', 'wb')

        # 2. fetch the binary content of the image
        images = urllib2.urlopen(imageLink).read()

        # 3. call the file object's write() method to write the content to the file
        file.write(images)

        # 4. finally, close the file
        file.close()

        # increment the counter by 1
        self.userName += 1


# simulate a __main__ function:
if __name__ == '__main__':
    # first create the spider object
    mySpider = Spider()
    # call the spider object's method to get to work
    mySpider.tiebaSpider()
```
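The listing above targets Python 2 (`urllib2`, `raw_input`). As a rough sketch of how the URL construction and request setup translate to Python 3, where `urllib2` was split into `urllib.request` and `urllib.parse` (the helper name `page_url` is mine, not from the original):

```python
# Python 3 equivalent of the page-URL construction and request setup
from urllib.parse import urlencode
from urllib.request import Request

def page_url(kw, page):
    """Build the forum page URL; Tieba paginates in steps of 50 posts."""
    pn = (page - 1) * 50
    return "http://tieba.baidu.com/f?" + urlencode({'kw': kw, 'pn': pn})

url = page_url('python', 2)
print(url)  # http://tieba.baidu.com/f?kw=python&pn=50

# the request object would then be opened with urllib.request.urlopen
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
```

The rest of the crawler ports the same way: replace `urllib2.Request`/`urllib2.urlopen` with their `urllib.request` counterparts and `raw_input` with `input`.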