# Saw the Tieba regulars posting pictures, so get ready to grab them.
# This only crawls the images in a single thread.
1. Create a new Scrapy project first
scrapy startproject TuBaEx
2. Create a new spider
scrapy genspider Tubaex https://tieba.baidu.com/p/4092816277
3. Write the items first
# Store the image URL
Img_url = scrapy.Field()
4. Start writing the spider
# -*- coding: utf-8 -*-
import scrapy
from TuBaEx.items import TubaexItem


class TubaexSpider(scrapy.Spider):
    name = "Tubaex"
    # allowed_domains = ["https://tieba.baidu.com/p/4092816277"]
    baseurl = "https://tieba.baidu.com/p/4092816277?pn="
    # Page number appended to baseurl when turning pages
    offset = 0
    # The page to crawl
    start_urls = [baseurl + str(offset)]

    def parse(self, response):
        # Get the number of the last page
        end_page = response.xpath("//div[@id='thread_theme_5']/div/ul/li[2]/span[2]/text()").extract()
        # The image class name was found with the browser's Inspect Element; use XPath to get the src
        img_list = response.xpath("//img[@class='BDE_Image']/@src").extract()
        for img in img_list:
            item = TubaexItem()
            item['Img_url'] = img
            yield item
        # Turn the page
        # xpath returns a list, so take the first element (and skip paging if it is empty)
        if end_page and self.offset < int(end_page[0]):
            self.offset += 1
            yield scrapy.Request(self.baseurl + str(self.offset), callback=self.parse)
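To try the extraction and paging logic of the spider offline, here is a minimal sketch using only the standard library instead of Scrapy. The helper names (ImageSrcParser, extract_image_urls, page_urls) and the sample markup are made up for illustration, and BDE_Image is assumed to be the class Tieba puts on post images. Unlike the XPath above, which compares the whole class attribute, this sketch matches the class name as one token of the class list.

```python
from html.parser import HTMLParser


class ImageSrcParser(HTMLParser):
    """Collect the src of every <img> whose class list contains class_name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self.class_name in classes and attr.get("src"):
            self.srcs.append(attr["src"])


def extract_image_urls(html, class_name="BDE_Image"):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageSrcParser(class_name)
    parser.feed(html)
    return parser.srcs


def page_urls(baseurl, last_page):
    """Build one URL per page, from page 1 up to and including last_page."""
    return [baseurl + str(n) for n in range(1, last_page + 1)]


# Hypothetical sample markup, not taken from the real thread
SAMPLE_HTML = """
<div><img class="BDE_Image" src="https://example.com/a.jpg">
<img class="avatar" src="https://example.com/me.png">
<img class="BDE_Image" src="https://example.com/b.jpg"></div>
"""

if __name__ == "__main__":
    print(extract_image_urls(SAMPLE_HTML))
    print(page_urls("https://tieba.baidu.com/p/4092816277?pn=", 3))
```

Building the page URLs up front like this is an alternative to incrementing self.offset inside parse; which one reads better is a matter of taste.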
5. Use ImagesPipeline. Not much to say about this one; I don't fully understand it myself.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline
from TuBaEx import settings


class TubaexPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Ask Scrapy to download the image behind the stored URL
        img_link = item['Img_url']
        yield scrapy.Request(img_link)

    def item_completed(self, results, item, info):
        # Called once the download for this item has finished
        images_store = "c:/users/ll/desktop/py/tubaex/images/"
        img_path = item['Img_url']
        return item
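For what item_completed could do with its results argument, here is a small sketch. As I understand the Scrapy docs, results is a list of (success, info) tuples, and for a successful download info is a dict that includes a 'path' key relative to IMAGES_STORE. The helper name downloaded_paths and the sample data are hypothetical, and a plain Exception stands in for the failure object Scrapy would pass.

```python
def downloaded_paths(results):
    """Return the stored paths of the images that downloaded successfully.

    results is shaped like the list Scrapy hands to item_completed:
    (True, info_dict) for a success, (False, failure) otherwise.
    """
    return [info["path"] for ok, info in results if ok]


# Hypothetical results, only shaped like the real ones
sample_results = [
    (True, {"url": "https://example.com/a.jpg", "path": "full/aaa.jpg", "checksum": "123"}),
    (False, Exception("download failed")),  # stand-in for a failed download
]

if __name__ == "__main__":
    print(downloaded_paths(sample_results))
```

Inside the pipeline, a helper like this could be used to log or store the saved file paths before returning the item.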
6. Configure settings.py
IMAGES_STORE = 'c:/users/ll/desktop/py/tubaex/images/'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TuBaEx (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Enable the pipeline
ITEM_PIPELINES = {
    'TuBaEx.pipelines.TubaexPipeline': 300,
}
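To know where the files end up under the image store folder: as far as I know, the stock ImagesPipeline saves each image in a full/ subfolder, named after the SHA-1 hash of its URL with a .jpg extension. The sketch below only mimics that naming with the standard library; the function name default_image_path is hypothetical, and a pipeline that overrides file_path would behave differently.

```python
import hashlib


def default_image_path(images_store, url):
    """Mimic the assumed default layout: <images_store>/full/<sha1 of url>.jpg"""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return images_store.rstrip("/") + "/full/" + image_guid + ".jpg"


if __name__ == "__main__":
    store = "c:/users/ll/desktop/py/tubaex/images/"
    # Hypothetical image URL, used only to show the shape of the result
    print(default_image_path(store, "https://example.com/a.jpg"))
```

Because the name is derived from the URL, the same image URL always maps to the same file.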
7. Run the spider
scrapy crawl Tubaex
8. Enjoy the results
Python: crawling the pictures in a Tieba thread