Using the Scrapy crawler framework in Python to crawl images and save them locally

You can clone all of the source code from GitHub.

Github: https://github.com/williamzxl/Scrapy_CrawlMeiziTu

Scrapy official documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html

The project basically follows the workflow described in the documentation, step by step.

Step 1:

A new Scrapy project must be created before crawling. Go to the directory where you want to store the code and run the following command:

scrapy startproject CrawlMeiziTu

This command creates a CrawlMeiziTu directory containing the following:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...

Then change into the project directory and generate the spider:

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html

After these commands, the directory contains the following:

CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            Meizitu.py
            __init__.py
            ...
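At this point spiders/Meizitu.py is only an auto-generated stub, roughly like the following (the exact boilerplate depends on the Scrapy version); it is rewritten completely in Step 5:

# -*- coding: utf-8 -*-
import scrapy


class MeizituSpider(scrapy.Spider):
    name = 'Meizitu'
    allowed_domains = ['meizitu.com']
    start_urls = ['http://www.meizitu.com/a/list_1_1.html']

    def parse(self, response):
        pass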

The main files to edit are settings.py, items.py, pipelines.py and the spider Meizitu.py, plus a new main.py.

main.py was added by hand afterwards and contains just two lines:

from scrapy import cmdline
cmdline.execute("scrapy crawl Meizitu".split())

It is there only to make running the spider more convenient.
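With this helper in place, running python main.py from the project directory starts the crawl; it is equivalent to running scrapy crawl Meizitu on the command line.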

Step 2: Edit Settings, as shown below:

BOT_NAME = 'CrawlMeiziTu'

SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'

ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True

This mainly sets the USER_AGENT, the download path (IMAGES_STORE), the download delay, and the item pipeline.

Step 3: Edit Items.

items.py defines the fields used to hold the information captured by the spider. Since we are crawling image galleries, we need to capture each image's name, link, tags, and so on.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title is the folder name
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # image link
    src = scrapy.Field()
    # alt is the image name
    alt = scrapy.Field()

Step 4: Edit Pipelines

pipelines.py processes the information carried by the items: it builds a folder and file names from the title, and downloads each image from its link.

# -*- coding: utf-8 -*-
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE


class CrawlmeizituPipeline(object):

    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # You need to look up the cookie of the image page yourself;
            # otherwise the downloaded images cannot be viewed.
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # Put all the images in one folder
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        if len(item['src']) == 0:
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}'.format(dir_path, file_name)
            images.append(file_path)
            if os.path.exists(file_path) or os.path.exists(file_name):
                continue
            with open('{}//{}.jpg'.format(dir_path, file_name), 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
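As a side note (not part of the original article), Scrapy also ships a built-in ImagesPipeline that handles downloading and storing images under IMAGES_STORE. A minimal sketch of that alternative, assuming the item keeps its links in the 'src' field defined in items.py, could look like the following (it requires Pillow, an entry in ITEM_PIPELINES, and the cookie header used above would still have to be attached to the requests):

# Minimal sketch (assumption, not the article's code): reuse Scrapy's
# built-in ImagesPipeline instead of calling requests directly.
from scrapy.pipelines.images import ImagesPipeline
import scrapy


class MeizituImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # The item stores its image links under 'src' (see items.py),
        # so each link is turned into a download request here.
        for image_url in item.get('src', []):
            yield scrapy.Request(image_url)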

Step 5: Edit the main spider program, Meizitu.py.

This is the most important part of the project:

# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
# from CrawlMeiziTu.items import CrawlmeizituItemPage
import time


class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    # allowed_domains = ["meizitu.com/"]
    start_urls = []
    last_url = []
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        # item = CrawlmeizituItemPage()

        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        all_urls = []
        if '下一页' in next_pages_text:  # '下一页' is the site's "next page" link text
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request

        all_info = selector.xpath('//h3[@class="tit"]/a')
        # Read the link of each image gallery
        for info in all_info:
            links = info.xpath('//h3[@class="tit"]/a/@href').extract()
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request

        # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # if '下一页' in next_link_text:
        #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
        #     item['page_url'] = nextPage
        #     yield item

    # Capture the information of each gallery
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)

        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        # //*[@id="maincontent"]/div/p/img/@alt
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
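Note that the spider reads its start URL from a url.txt file (the relative path '..//url.txt' above) and appends every discovered next page to the same file, so the file must exist before the first run. A minimal, hypothetical way to seed it with the list page used for genspider earlier:

# Hypothetical one-off snippet to seed url.txt before the first crawl;
# the path mirrors the '..//url.txt' used by the spider above.
with open('..//url.txt', 'w') as fp:
    fp.write('http://www.meizitu.com/a/list_1_1.html\n')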

Summary

This article has shown how to use the Scrapy crawler framework to crawl images from an entire site and save them locally. If you have any questions, please leave a message and the editor will reply as soon as possible!
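To recap the full run: seed url.txt as described above, make sure the IMAGES_STORE directory (D://pic2 in the settings) is writable, then start the crawl with python main.py or scrapy crawl Meizitu.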
