Python Show-me-the-code No. 0013: grab girl pictures using Scrapy

Source: Internet
Author: User
Tags: xpath

Question No. 0013: write a Python program that crawls pictures, and use it to crawl the Japanese girl pictures at this link :-)


Ideas:

Strictly speaking, Scrapy is not required here; regular-expression matching plus the requests library would be enough for the task. But I wanted to practise Scrapy, so I used Scrapy for this one.

Only one page of pictures needs to be crawled, so there is no need to write any follow rules, which keeps things simple. Inspecting the girl pictures on the linked page shows that the images in the Baidu Tieba post all carry the BDE_Image class, so the job is straightforward: use XPath to pull out every img tag with that class, put the needed fields into an item, and hand the item over to the pipeline.
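The heart of that extraction is a single XPath expression. Below is a minimal sketch of it outside a Scrapy project (an illustration only: it uses requests plus Scrapy's Selector directly, and assumes the post's images still carry the BDE_Image class and a bdwater attribute):

# -*- coding: utf-8 -*-
# Minimal sketch: fetch the page with requests and pull out every img tag
# that carries the BDE_Image class, printing name and URL for each one.
import requests
from scrapy.selector import Selector

html = requests.get('http://tieba.baidu.com/p/2166231880').text
for img in Selector(text=html).xpath('//img[@class="BDE_Image"]'):
    print img.xpath('@bdwater').extract()[0]  # picture name (watermark attribute)
    print img.xpath('@src').extract()[0]      # picture URL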

In the pipeline I first check whether the item's information is complete, then check whether the picture has already been downloaded; if it has, I skip it, otherwise I download it. For convenience, besides saving the picture itself, I also store the picture's information (name and storage path) in MongoDB.
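Stripped of the Scrapy plumbing, that per-item logic boils down to something like the following (a sketch only: handle_image is a made-up helper name, and the MongoDB host, database and collection mirror the settings shown further down):

# -*- coding: utf-8 -*-
# Sketch of the pipeline logic: skip the file if it already exists,
# otherwise stream it to disk and record name/URL/path in MongoDB.
import os
import requests
import pymongo

def handle_image(img_name, img_url, dir_path):
    file_path = os.path.join(dir_path, img_url.split('/')[-1])
    if os.path.exists(file_path):
        return file_path  # already downloaded, skip
    response = requests.get(img_url, stream=True)
    with open(file_path, 'wb') as handle:
        for block in response.iter_content(1024):
            if block:
                handle.write(block)
    collection = pymongo.MongoClient('localhost', 27017)['meizidb']['meizi']
    collection.insert({'img_name': img_name, 'img_url': img_url,
                       'file_path': file_path})
    return file_path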

Steps:
  • Generate a Scrapy project called baidutieba: scrapy startproject baidutieba
  • Enter the project folder: cd baidutieba
  • Generate a spider called meizi: scrapy genspider meizi baidu.com
  • Write the relevant code
  • Run the spider: scrapy crawl meizi
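For reference, the genspider step above produces a spider skeleton roughly like the one below (a sketch; the exact template depends on the Scrapy version), which the meizi.py shown later fills in:

# -*- coding: utf-8 -*-
import scrapy

class MeiziSpider(scrapy.Spider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass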
Code:

Spider
meizi.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from baidutieba.items import BaidutiebaItem
from scrapy.selector import Selector
import sys
reload(sys)
sys.setdefaultencoding('utf-8')


class MeiziSpider(CrawlSpider):
    name = "meizi"
    allowed_domains = ["baidu.com"]
    print "begin to crawl the girl pictures"
    start_urls = (
        'http://tieba.baidu.com/p/2166231880',
    )

    # Parse the response and extract the image information
    def parse(self, response):
        # Find all img tags with the BDE_Image class
        allimg = Selector(response).xpath('//img[@class="BDE_Image"]')
        for img in allimg:
            item = BaidutiebaItem()
            item['img_name'] = img.xpath('@bdwater').extract()[0]
            item['img_url'] = img.xpath('@src').extract()[0]
            yield item

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
import requests
import os


class ImageDownloadAndMongoDBPipeline(object):

    def __init__(self):
        # Create a MongoDB connection
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT'])
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        valid = True
        # Check whether the item is complete
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            # Build the directory path
            dir_path = '%s/%s' % (settings['IMAGES_STORE'], spider.name)
            # Create the directory if it does not exist
            if not os.path.exists(dir_path):
                log.msg("No directory exists, creating it",
                        level=log.DEBUG, spider=spider)
                os.makedirs(dir_path)
            image_url = item['img_url']
            # Build the file name from the URL
            us = image_url.split('/')[3:]
            image_file_name = '_'.join(us)
            file_path = '%s/%s' % (dir_path, image_file_name)
            if not os.path.exists(file_path):
                # The picture has not been downloaded yet, so download it
                with open(file_path, 'wb') as handle:
                    response = requests.get(image_url, stream=True)
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
                item['file_path'] = file_path
                log.msg("Downloaded picture!",
                        level=log.DEBUG, spider=spider)
                # Record it in the database
                self.collection.insert(dict(item))
                log.msg("Stored in database!",
                        level=log.DEBUG, spider=spider)
            else:
                log.msg("The picture has already been downloaded, skipping",
                        level=log.DEBUG, spider=spider)
        return item


class ImageDownloadPipeline(object):

    def process_item(self, item, spider):
        print item
        return item

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BaidutiebaItem(scrapy.Item):
    img_name = scrapy.Field()
    img_url = scrapy.Field()
    file_path = scrapy.Field()
    pass

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for the baidutieba project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'baidutieba'

SPIDER_MODULES = ['baidutieba.spiders']
NEWSPIDER_MODULE = 'baidutieba.spiders'

ITEM_PIPELINES = {'baidutieba.pipelines.ImageDownloadAndMongoDBPipeline': 1}

# Path where the pictures are stored
IMAGES_STORE = '/home/bill/pictures'

# MongoDB configuration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "meizidb"
MONGODB_COLLECTION = "meizi"

# Crawl responsibly by identifying yourself (and your website) on the User-Agent
#USER_AGENT = 'baidutieba (+http://www.yourdomain.com)'

Crawl process:

Database:

The crawled girl pictures:

