This article shows how to use Python's Scrapy crawler framework to crawl images from a website and save them locally, with the full implementation code.
You can clone all of the source code from GitHub.
Github: https://github.com/williamzxl/Scrapy_CrawlMeiziTu
Scrapy official documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/index.html
The project essentially follows the workflow described in the documentation.
Step 1:
A new Scrapy project must be created before crawling. Go to the directory where you want to store the code and run the following command:
scrapy startproject CrawlMeiziTu
This command creates a CrawlMeiziTu directory with the following contents:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            __init__.py
            ...

Then enter the project directory and generate a spider for the first list page:

cd CrawlMeiziTu
scrapy genspider Meizitu http://www.meizitu.com/a/list_1_1.html
The genspider command adds a Meizitu.py spider file under the spiders directory:
CrawlMeiziTu/
    scrapy.cfg
    CrawlMeiziTu/
        __init__.py
        items.py
        pipelines.py
        settings.py
        middlewares.py
        spiders/
            Meizitu.py
            __init__.py
            ...
The files we mainly edit are items.py, settings.py, pipelines.py, the spider Meizitu.py under spiders/, and a newly added main.py.
main.py was added by hand afterwards and contains only two lines:
from scrapy import cmdline
cmdline.execute("scrapy crawl Meizitu".split())
It simply makes it convenient to start the crawl by running main.py instead of typing the scrapy crawl command each time.
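As a small aside, a slightly more defensive variant (not in the original repo) wraps the call in a main guard, so that importing the module does not accidentally start a crawl:

# A variant sketch (not in the original repo): the same two lines wrapped in
# a main guard, so importing main.py does not trigger a crawl.
from scrapy import cmdline

if __name__ == '__main__':
    cmdline.execute("scrapy crawl Meizitu".split())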
Step 2: Edit settings.py as shown below.
BOT_NAME = 'CrawlMeiziTu'
SPIDER_MODULES = ['CrawlMeiziTu.spiders']
NEWSPIDER_MODULE = 'CrawlMeiziTu.spiders'
ITEM_PIPELINES = {
    'CrawlMeiziTu.pipelines.CrawlmeizituPipeline': 300,
}
IMAGES_STORE = 'D://pic2'
DOWNLOAD_DELAY = 0.3
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
ROBOTSTXT_OBEY = True
This mainly sets the USER_AGENT, the image download path (IMAGES_STORE), the download delay, and the item pipeline.
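If you want to double-check these values at runtime, Scrapy can load the project settings for you. The following is a minimal sketch (not part of the original project) and assumes it is run from the project root, i.e. the directory containing scrapy.cfg:

# Read the project settings at runtime to confirm the values above.
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('IMAGES_STORE'))         # e.g. 'D://pic2'
print(settings.getfloat('DOWNLOAD_DELAY'))  # e.g. 0.3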
Step 3: Edit items.py.
The Item is used to hold the information captured by the spider. Since we are crawling image galleries, we need to record each gallery's title, link, and tags, plus the link and name of every image.
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlmeizituItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # title is used as the folder name
    title = scrapy.Field()
    url = scrapy.Field()
    tags = scrapy.Field()
    # image link
    src = scrapy.Field()
    # alt is the image name
    alt = scrapy.Field()
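To make the role of each field concrete, here is a minimal sketch (not from the original post) of how a spider callback fills in this item; all values are made-up placeholders:

# Example of populating CrawlmeizituItem; every value here is a placeholder.
from CrawlMeiziTu.items import CrawlmeizituItem

item = CrawlmeizituItem()
item['title'] = ['example gallery']                   # used as the folder name
item['url'] = ['http://www.meizitu.com/a/0000.html']  # gallery page link
item['tags'] = ['example tag']
item['src'] = ['http://example.com/0001.jpg']         # image links to download
item['alt'] = ['example_pic']                         # image file names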
Step 4: Edit pipelines.py.
The pipeline processes the information collected in the items: it builds folder and file names from the title and image names, then downloads each image from its link.
# -*- coding: utf-8 -*-
import os
import requests
from CrawlMeiziTu.settings import IMAGES_STORE


class CrawlmeizituPipeline(object):
    def process_item(self, item, spider):
        fold_name = "".join(item['title'])
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            # Use the cookie from your own browser session here; otherwise the
            # downloaded images cannot be viewed.
            'Cookie': 'b963ef2d97e050aaf90fd5fab8e78633',
        }
        images = []
        # Put all the images in one folder.
        dir_path = '{}'.format(IMAGES_STORE)
        if not os.path.exists(dir_path) and len(item['src']) != 0:
            os.mkdir(dir_path)
        if len(item['src']) == 0:
            with open('..//check.txt', 'a+') as fp:
                fp.write("".join(item['title']) + ":" + "".join(item['url']))
                fp.write("\n")
        for jpg_url, name, num in zip(item['src'], item['alt'], range(0, 100)):
            file_name = name + str(num)
            file_path = '{}//{}'.format(dir_path, file_name)
            images.append(file_path)
            if os.path.exists(file_path) or os.path.exists(file_name):
                continue
            with open('{}//{}.jpg'.format(dir_path, file_name), 'wb') as f:
                req = requests.get(jpg_url, headers=header)
                f.write(req.content)
        return item
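To check the pipeline logic outside of a full crawl, it can be exercised by hand. This is a minimal sketch (not from the original post) with made-up item values; 'src' is left empty so the pipeline takes its fallback path and only records the gallery in ..//check.txt instead of downloading anything:

# Exercise the pipeline manually with a fake item (placeholder values).
from CrawlMeiziTu.pipelines import CrawlmeizituPipeline

fake_item = {
    'title': ['example gallery'],
    'url': ['http://www.meizitu.com/a/0000.html'],
    'tags': ['example'],
    'src': [],   # no image links -> logged to check.txt, nothing downloaded
    'alt': [],
}
CrawlmeizituPipeline().process_item(fake_item, spider=None)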
Step 5: Edit the spider, Meizitu.py.
This is the most important part of the program:
# -*- coding: utf-8 -*-
import scrapy
from CrawlMeiziTu.items import CrawlmeizituItem
# from CrawlMeiziTu.items import CrawlmeizituItemPage
import time


class MeizituSpider(scrapy.Spider):
    name = "Meizitu"
    # allowed_domains = ["meizitu.com/"]
    start_urls = []
    last_url = []
    with open('..//url.txt', 'r') as fp:
        crawl_urls = fp.readlines()
        for start_url in crawl_urls:
            last_url.append(start_url.strip('\n'))
    start_urls.append("".join(last_url[-1]))

    def parse(self, response):
        selector = scrapy.Selector(response)
        # item = CrawlmeizituItemPage()
        next_pages = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        next_pages_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        all_urls = []
        if '下一页' in next_pages_text:  # '下一页' is the "next page" link text on the site
            next_url = "http://www.meizitu.com/a/{}".format(next_pages[-2])
            with open('..//url.txt', 'a+') as fp:
                fp.write('\n')
                fp.write(next_url)
                fp.write("\n")
            request = scrapy.http.Request(next_url, callback=self.parse)
            time.sleep(2)
            yield request
        all_info = selector.xpath('//h3[@class="tit"]/a')
        # Read the link of each gallery
        for info in all_info:
            links = info.xpath('//h3[@class="tit"]/a/@href').extract()
            for link in links:
                request = scrapy.http.Request(link, callback=self.parse_item)
                time.sleep(1)
                yield request
        # next_link = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/@href').extract()
        # next_link_text = selector.xpath('//*[@id="wp_page_numbers"]/ul/li/a/text()').extract()
        # if '下一页' in next_link_text:
        #     nextPage = "http://www.meizitu.com/a/{}".format(next_link[-2])
        #     item['page_url'] = nextPage
        #     yield item

    # Capture the information of each gallery
    def parse_item(self, response):
        item = CrawlmeizituItem()
        selector = scrapy.Selector(response)
        image_title = selector.xpath('//h2/a/text()').extract()
        image_url = selector.xpath('//h2/a/@href').extract()
        image_tags = selector.xpath('//div[@class="metaRight"]/p/text()').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@src').extract():
            image_src = selector.xpath('//*[@id="picture"]/p/img/@src').extract()
        else:
            image_src = selector.xpath('//*[@id="maincontent"]/div/p/img/@src').extract()
        if selector.xpath('//*[@id="picture"]/p/img/@alt').extract():
            pic_name = selector.xpath('//*[@id="picture"]/p/img/@alt').extract()
        else:
            pic_name = selector.xpath('//*[@id="maincontent"]/div/p/img/@alt').extract()
        item['title'] = image_title
        item['url'] = image_url
        item['tags'] = image_tags
        item['src'] = image_src
        item['alt'] = pic_name
        print(item)
        time.sleep(1)
        yield item
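Note that the spider reads its start URL from a url.txt file one level above the working directory, and appends every discovered "next page" URL to the same file. Here is a minimal sketch (an assumption about the expected layout, not code from the repo) that creates this seed file with the first list page:

# Create the ..//url.txt seed file that MeizituSpider reads at startup.
with open('..//url.txt', 'w') as fp:
    fp.write('http://www.meizitu.com/a/list_1_1.html\n')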
Summary
This article has shown how to use the Scrapy crawler framework to crawl images from an entire site and save them locally, with the full implementation code. If you have any questions, please leave a comment and we will get back to you promptly!