This article shows how to use Python's Scrapy crawler framework to crawl images from a website and save them locally, with the full implementation code.
You can clone all of the source code from GitHub.
Github: https://github.com/williamzxl/Scrapy_CrawlMeiziTu
Scrapy official documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
The project essentially follows the workflow described in the documentation.
Step 1:
A new Scrapy project must be created before crawling. Go to the directory where you want to store the code and run the following command:
scrapy startproject CrawlMeiziTu
This command creates a CrawlMeiziTu directory with the following contents:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...

Then enter the project directory and generate a spider for the first list page:

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html
The genspider command adds a Meizitu.py spider file under the spiders directory:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            Meizitu.py
            __init__.py
            ...
The files we mainly edit are items.py, settings.py, pipelines.py, the spider Meizitu.py under spiders/, and a newly added main.py.
main.py was added by hand afterwards and contains only two lines:
from scrapy import cmdline
cmdline.execute("scrapy crawl Meizitu".split())
It simply makes it convenient to start the crawl by running main.py instead of typing the scrapy crawl command each time.
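As a small aside, a slightly more defensive variant (not in the original repo) wraps the call in a main guard, so that importing the module does not accidentally start a crawl:

# A variant sketch (not in the original repo): the same two lines wrapped in
# a main guard, so importing main.py does not trigger a crawl.
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl Meizitu".split())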
Step 2: Edit settings.py as shown below.
BOT_NAME = 'CrawlMeiziTu'
SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'
ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True
This mainly sets the USER_AGENT, the image download path (IMAGES_STORE), the download delay, and the item pipeline.
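If you want to double-check these values at runtime, Scrapy can load the project settings for you. The following is a minimal sketch (not part of the original project) and assumes it is run from the project root, i.e. the directory containing scrapy.cfg:

# Read the project settings at runtime to confirm the values above.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('IMAGES_STORE'))         # e.g. 'D://pic2'
print(settings.getfloat('DOWNLOAD_DELAY'))  # e.g. 0.3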
Step 3: Edit items.py.
The Item is used to hold the information captured by the spider. Since we are crawling image galleries, we need to record each gallery's title, link, and tags, plus the link and name of every image.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title is used as the folder name
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # image link
    src = scrapy.Field()
    # alt is the image name
    alt = scrapy.Field()
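To make the role of each field concrete, here is a minimal sketch (not from the original post) of how a spider callback fills in this item; all values are made-up placeholders:

# Example of populating CrawlmeizituItem; every value here is a placeholder.
from CrawlMeiziTu.items import CrawlmeizituItem

item = CrawlmeizituItem()
item['title'] = ['example gallery']                   # used as the folder name
item['url'] = ['http://www.meizitu.com/a/0000.html']  # gallery page link
item['tags'] = ['example tag']
item['src'] = ['http://example.com/0001.jpg']         # image links to download
item['alt'] = ['example_pic']                         # image file names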
Step 4: Edit pipelines.py.
The pipeline processes the information collected in the items: it builds folder and file names from the title and image names, then downloads each image from its link.
# -*- coding: utf-8 -*-
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE


class CrawlmeizituPipeline(object):
    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # Use the cookie from your own browser session here; otherwise the
            # downloaded images cannot be viewed.
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # Put all the images in one folder.
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        if len(item['src']) == 0:
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}'.format(dir_path, file_name)
            images.append(file_path)
            if os.path.exists(file_path) or os.path.exists(file_name):
                continue
            with open('{}//{}.jpg'.format(dir_path, file_name), 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
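To check the pipeline logic outside of a full crawl, it can be exercised by hand. This is a minimal sketch (not from the original post) with made-up item values; 'src' is left empty so the pipeline takes its fallback path and only records the gallery in ..//check.txt instead of downloading anything:

# Exercise the pipeline manually with a fake item (placeholder values).
from CrawlMeiziTu.pipelines import CrawlmeizituPipeline

fake_item = {
    'title': ['example gallery'],
    'url': ['http://www.meizitu.com/a/0000.html'],
    'tags': ['example'],
    'src': [],   # no image links -> logged to check.txt, nothing downloaded
    'alt': [],
}
CrawlmeizituPipeline().process_item(fake_item, spider=None)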
Step 5: Edit the spider, Meizitu.py.
This is the most important part of the program:
# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
# from CrawlMeiziTu.items import CrawlmeizituItemPage
import time


class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    # allowed_domains = ["meizitu.com/"]
    start_urls = []
    last_url = []
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        # item = CrawlmeizituItemPage()
        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        all_urls = []
        if '下一页' in next_pages_text:  # '下一页' is the "next page" link text on the site
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request
        all_info = selector.xpath('//h3[@class="tit"]/a')
        # Read the link of each gallery
        for info in all_info:
            links = info.xpath('//h3[@class="tit"]/a/@href').extract()
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request
        # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # if '下一页' in next_link_text:
        #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
        #     item['page_url'] = nextPage
        #     yield item

    # Capture the information of each gallery
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)
        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
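Note that the spider reads its start URL from a url.txt file one level above the working directory, and appends every discovered "next page" URL to the same file. Here is a minimal sketch (an assumption about the expected layout, not code from the repo) that creates this seed file with the first list page:

# Create the ..//url.txt seed file that MeizituSpider reads at startup.
with open('..//url.txt', 'w') as fp:
    fp.write('http://www.meizitu.com/a/list_1_1.html\n')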
Summary
This article has shown how to use the Scrapy crawler framework to crawl images from an entire site and save them locally, with the full implementation code. If you have any questions, please leave a comment and we will get back to you promptly!