Crawling the images in a Baidu Tieba thread with Python

Source: Internet
Author: User
Tags: xpath

I saw a Tieba regular posting pictures, so I got ready to grab them.

This only crawls the images in a single thread.

1. Create a new Scrapy project first

scrapy startproject TuBaEx

2. Create a new crawler

scrapy genspider tubaex https://tieba.baidu.com/p/4092816277

3. Define the item first (in items.py)

import scrapy

class TubaexItem(scrapy.Item):
    # Stores the URL of the image
    img_url = scrapy.Field()

4. Write the spider

  

# -*- coding: utf-8 -*-
import scrapy
from TuBaEx.items import TubaexItem


class TubaexSpider(scrapy.Spider):
    name = "tubaex"
    #allowed_domains = ["https://tieba.baidu.com/p/4092816277"]
    baseurl = "https://tieba.baidu.com/p/4092816277?pn="

    # Page number appended to baseurl when turning pages
    offset = 0
    # The page to crawl first
    start_urls = [baseurl + str(offset)]

    def parse(self, response):
        # Get the number of the last page
        end_page = response.xpath("//div[@id='thread_theme_5']/div/ul/li[2]/span[2]/text()").extract()
        # The class name of the images was found with the browser's element inspector;
        # select their src attributes with XPath
        img_list = response.xpath("//img[@class='BDE_Image']/@src").extract()

        for img in img_list:
            item = TubaexItem()
            item['img_url'] = img
            yield item

        # Turn to the next page. extract() returns a list, so take its first
        # element and stop if the page count could not be found.
        if end_page and self.offset < int(end_page[0]):
            self.offset += 1
            yield scrapy.Request(self.baseurl + str(self.offset), callback=self.parse)
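
The selection and paging logic in parse can be tried offline before pointing the spider at the real site. The following is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and SAMPLE_PAGE, the helper function names, and the example.com URLs are all made up for illustration. The real Tieba markup is much larger and is not valid XML, so this shows the idea, not the actual page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one page of the thread.
SAMPLE_PAGE = """
<html><body>
  <div id="thread_theme_5"><div><ul>
    <li><span>120</span></li>
    <li><span>pages</span><span>3</span></li>
  </ul></div></div>
  <img class="BDE_Image" src="https://example.com/a.jpg"/>
  <img class="other" src="https://example.com/skip.png"/>
  <img class="BDE_Image" src="https://example.com/b.jpg"/>
</body></html>
"""

BASE_URL = "https://tieba.baidu.com/p/4092816277?pn="


def extract_images(root):
    """Return the src of every <img> whose class is exactly BDE_Image."""
    return [img.get("src") for img in root.findall(".//img[@class='BDE_Image']")]


def extract_last_page(root):
    """Return the last page number as an int, or None if it is missing."""
    span = root.find(".//div[@id='thread_theme_5']/div/ul/li[2]/span[2]")
    if span is None or not (span.text or "").strip().isdigit():
        return None
    return int(span.text)


def remaining_page_urls(offset, last_page):
    """URLs the spider would still request after the page at `offset`."""
    return [BASE_URL + str(n) for n in range(offset + 1, last_page + 1)]


root = ET.fromstring(SAMPLE_PAGE)
image_urls = extract_images(root)
last_page = extract_last_page(root)
next_urls = remaining_page_urls(0, last_page)

print(image_urls)
print(last_page)
print(next_urls)
```

The spider reaches the same set of pages one request at a time, by incrementing offset inside parse until it equals the last page number.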

5. Use ImagesPipeline. There is not much to say here, and I don't fully understand it myself.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class TubaexPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Ask Scrapy to download the image this item points to
        img_link = item['img_url']
        yield scrapy.Request(img_link)

    def item_completed(self, results, item, info):
        # Nothing else to do once the download finishes; pass the item on
        return item
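
Since the pipeline above does not set file names itself, it may help to know where the images end up. To my knowledge, ImagesPipeline by default saves each image under the storage directory as full/<SHA-1 hash of the image URL>.jpg. The sketch below reproduces that naming with the standard library only; default_image_path is a hypothetical helper written for illustration, not part of Scrapy, and the URL is a placeholder.

```python
import hashlib


def default_image_path(url):
    """Mimic the assumed default naming scheme: full/<sha1 of the URL>.jpg"""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/" + digest + ".jpg"


# Placeholder URL; any image URL yielded by the spider would work the same way.
path = default_image_path("https://example.com/a.jpg")
print(path)
```

Because the name depends only on the URL, the same image requested twice maps to the same file.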

6. Configure settings.py

IMAGES_STORE = 'c:/users/ll/desktop/py/tubaex/images/'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TuBaEx (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Enable the pipeline
ITEM_PIPELINES = {
    'TuBaEx.pipelines.TubaexPipeline': 300,
}

7. Run the spider

scrapy crawl tubaex

8. Enjoy the results

  
