Python: crawling beauty pictures with multi-level sub-directory storage

Recently I had a need: download the pictures from the site https://mm.meiji2.com/. So I did a quick study of web crawlers. I'm organizing the results here, partly as a record for myself and partly to give later readers some direction.
Crawl Results
The whole study took about 2-3 days. Besides the material itself, working through it will more or less expose gaps in your own knowledge; reading straight through should give you a preliminary understanding of crawler techniques.
The approximate steps:
Analyze the pages and write the crawl rules
Download the pictures; if a listing is paginated, follow the pagination
Crawl across multiple pages and save locally into sub-directories, i.e. multi-level storage (see the pipeline sketch after the spider code below)
Cope with anti-crawler measures (a settings sketch follows this list)
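On the anti-crawler point: the usual first line of defence in Scrapy is to slow down and disguise the crawler via settings.py. The following is only a minimal sketch under common assumptions; the concrete values, the pipeline path znns.pipelines.ZnnsImagesPipeline and the IMAGES_STORE directory are illustrative, not taken from the original project:

# settings.py - minimal anti-crawler sketch; all values are illustrative assumptions
BOT_NAME = 'znns'

# Present a normal browser user agent instead of Scrapy's default one
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36')

ROBOTSTXT_OBEY = False   # ignore robots.txt, which would otherwise block the crawl
DOWNLOAD_DELAY = 0.5     # throttle requests so the site is less likely to ban the IP
COOKIES_ENABLED = False  # avoid session tracking, a common anti-crawler technique

# Register the image pipeline (sketched after the spider code) and the storage root
ITEM_PIPELINES = {'znns.pipelines.ZnnsImagesPipeline': 300}
IMAGES_STORE = 'E:/znns'  # hypothetical local root; images land in sub-folders below it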
The above is some of the material I read while studying.
Now let me post a piece of my own: it crawls a three-level directory, and the last level has its own next-page pagination.
import scrapy
from znns.items import ZnnsItem


class NvshenSpider(scrapy.Spider):
    name = 'znns'
    allowed_domains = ['mm.meiji2.com']
    start_urls = ['https://mm.meiji2.com/']
    # browser-like default headers
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    }

    # Leaderboard page: loop through its pagination and every profile link
    def parse(self, response):
        exp = u'//div[@class="pagesYY"]//a[text()="下一页"]/@href'  # "下一页" is the site's "next page" link
        _next = response.xpath(exp).extract_first()
        if _next:
            yield scrapy.Request(response.urljoin(_next), callback=self.parse, dont_filter=True)
        for p in response.xpath('//li[@class="rankli"]//div[@class="rankli_imgdiv"]//a/@href').extract():
            # Profile detail page for one person
            item_page = "https://mm.meiji2.com/" + p + "album/"  # stitch together the albums page
            yield scrapy.Request(item_page, callback=self.parse_item, dont_filter=True)

    # Single profile detail page: collect all of this person's albums
    def parse_item(self, response):
        item = ZnnsItem()
        # The person's name, which becomes the first-level folder
        item['name'] = response.xpath('//div[@id="post"]//div[@id="map"]//div[@class="browse"]/a[2]/@title').extract()[0].strip()
        exp = '//li[@class="igalleryli"]//div[@class="igalleryli_div"]//a/@href'
        for p in response.xpath(exp).extract():  # traverse all the albums
            item_page = "https://mm.meiji2.com/" + p  # stitch together the picture detail page
            yield scrapy.Request(item_page, meta={'item': item}, callback=self.parse_item_details, dont_filter=True)

    # Album home page: start crawling the pictures
    def parse_item_details(self, response):
        item = response.meta['item']
        item['image_urls'] = response.xpath('//ul[@id="hgallery"]//img/@src').extract()  # image links
        item['albumname'] = response.xpath('//h1[@id="htilte"]/text()').extract()[0].strip()  # second-level folder
        yield item
        new_url = response.xpath('//div[@id="pages"]//a[text()="下一页"]/@href').extract_first()  # album's next page
        if new_url:
            new_url = "https://mm.meiji2.com/" + new_url
            yield scrapy.Request(new_url, meta={'item': item}, callback=self.parse_item_details, dont_filter=True)
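The spider imports ZnnsItem from znns.items. The items.py file is not shown in this post; a minimal definition consistent with the fields the spider fills in would look like this (a sketch, not the original file):

# items.py - minimal sketch inferred from the fields used by the spider
import scrapy

class ZnnsItem(scrapy.Item):
    name = scrapy.Field()        # person's name, used as the first-level folder
    albumname = scrapy.Field()   # album title, used as the second-level folder
    image_urls = scrapy.Field()  # list of image URLs consumed by the images pipeline
    images = scrapy.Field()      # download results filled in by the images pipeline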
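The multi-level storage itself happens in the item pipeline. Scrapy's built-in ImagesPipeline saves everything flat into a single full/ directory, so to get name/albumname sub-folders you typically override file_path() and pass the item along in get_media_requests(). This is a sketch of that common approach, with the class name ZnnsImagesPipeline assumed to match the settings sketch above; it is not necessarily the original author's code:

# pipelines.py - sketch of per-item sub-directory storage (assumed, not the original code)
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class ZnnsImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Attach the item to each image request so file_path() can read its fields
        for url in item['image_urls']:
            yield scrapy.Request(url, meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        image_name = request.url.split('/')[-1]  # keep the original file name
        # Stored as IMAGES_STORE/<person name>/<album name>/<file name>
        return u'{0}/{1}/{2}'.format(item['name'], item['albumname'], image_name)

With IMAGES_STORE set as in the settings sketch, every picture ends up under E:/znns/<name>/<albumname>/<file name>, which is the sub-directory multi-level storage from the title. Run the whole thing with scrapy crawl znns.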