Python Crawler (13): Case: Crawler using XPath

Source: Internet
Author: User
Tags: xpath

This is a case study of using XPath; for more information, see the Python Learning Guide.

Case: Crawler using XPath

Now we will use XPath to build a simple crawler: it crawls all the posts in a Tieba forum and downloads the images from each floor (reply) of every post to the local disk.
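Before looking at the full script, it helps to understand how Tieba paginates its listing pages: each page holds 50 posts, and the "pn" query parameter is the post offset, so page 1 is pn=0, page 2 is pn=50, and so on. The full script below is written for Python 2; here is a small equivalent Python 3 sketch of just the URL construction, so the pagination math is easy to verify:

```python
# Sketch: building a Tieba listing URL for a given page (Python 3).
# Tieba paginates with a "pn" offset of 50 posts per page.
from urllib.parse import urlencode

def tieba_page_url(keyword, page):
    pn = (page - 1) * 50  # page 1 -> pn=0, page 2 -> pn=50, ...
    return "http://tieba.baidu.com/f?" + urlencode({'kw': keyword, 'pn': pn})

print(tieba_page_url('python', 2))
# http://tieba.baidu.com/f?kw=python&pn=50
```

The same arithmetic appears in the tiebaSpider method of the script below.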

# -*- coding:utf-8 -*-
# tieba_xpath.py

"""
Role: This case uses XPath to make a simple crawler that crawls
all the posts in a given Tieba forum.
"""

import os
import urllib
import urllib2
from lxml import etree


class Spider:
    def __init__(self):
        self.tiebaName = raw_input("Please enter the forum you need to visit:")
        self.beginPage = int(raw_input("Please enter the start page:"))
        self.endPage = int(raw_input("Please enter the end page:"))
        self.url = "http://tieba.baidu.com/f"
        self.ua_header = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;"}
        # Image counter
        self.userName = 1

    def tiebaSpider(self):
        for page in range(self.beginPage, self.endPage + 1):
            pn = (page - 1) * 50  # page offset
            word = {'pn': pn, 'kw': self.tiebaName}
            word = urllib.urlencode(word)  # convert to URL-encoded format (string)
            myUrl = self.url + "?" + word
            # Example: http://tieba.baidu.com/f?kw=%E7%BE%8E%E5%A5%B3&pn=50
            # Call the page-processing function loadPage
            # and get all post links on the page
            links = self.loadPage(myUrl)

    # Get the page content
    def loadPage(self, url):
        req = urllib2.Request(url, headers=self.ua_header)
        html = urllib2.urlopen(req).read()

        # Parse the HTML into an HTML DOM document
        selector = etree.HTML(html)

        # Grab the second half of the URL of every post on the current page,
        # which is the post number,
        # e.g. "p/4884069807" in http://tieba.baidu.com/p/4884069807
        links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')

        # links is a list of etree element strings
        # Walk the list, build the full post address, and call the image handler loadImage
        for link in links:
            link = "http://tieba.baidu.com" + link
            self.loadImage(link)

    # Get the images
    def loadImage(self, link):
        req = urllib2.Request(link, headers=self.ua_header)
        html = urllib2.urlopen(req).read()
        selector = etree.HTML(html)

        # Get the src paths of all images in this post
        imageLinks = selector.xpath('//img[@class="BDE_Image"]/@src')

        # Take each image path in turn, download and save it
        for imageLink in imageLinks:
            self.writeImages(imageLink)

    # Save the page content
    def writeImages(self, imageLink):
        """
        Write the binary image content into files numbered by self.userName
        """
        print(imageLink)
        print("Storing file %d..." % self.userName)

        # 1. Open a file, which returns a file object
        file = open('./images/' + str(self.userName) + '.png', 'wb')

        # 2. Fetch the image content
        images = urllib2.urlopen(imageLink).read()

        # 3. Call the file object's write() method to write the content to the file
        file.write(images)

        # 4. Finally, close the file
        file.close()

        # Increment the counter by 1
        self.userName += 1


if __name__ == '__main__':
    # First create the spider object
    mySpider = Spider()
    # Call the spider object's method to get to work
    mySpider.tiebaSpider()
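The script above targets Python 2 (urllib2, raw_input) and needs a live network connection, which makes it awkward to test. The core technique, however, is the two XPath queries. Here is a minimal, self-contained sketch of the post-link extraction step run against a static HTML snippet (the snippet is a hypothetical stand-in for a real Tieba listing page), which works under both Python 2 and 3:

```python
# Offline sketch of the XPath extraction used in loadPage.
# The HTML below is a made-up stand-in for a Tieba listing page.
from lxml import etree

html = '''
<div class="threadlist_lz clearfix">
  <div><a rel="noreferrer" href="/p/4884069807">Post A</a></div>
  <div><a rel="noreferrer" href="/p/1234567890">Post B</a></div>
</div>
'''

selector = etree.HTML(html)

# Same expression as in loadPage: grab the href of every post link
links = selector.xpath('//div[@class="threadlist_lz clearfix"]/div/a[@rel="noreferrer"]/@href')

# Merge each relative href into a full post address
full_links = ["http://tieba.baidu.com" + link for link in links]
print(full_links)
```

Testing the expressions offline like this is a good habit: XPath queries fail silently (they return an empty list rather than raise), so a typo in a class name is otherwise easy to miss.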

