Objective: use Fiddler to capture the mobile version of the Douyu live-streaming site and crawl the nickname and image link of every host (anchor).
About the Fiddler settings for capturing the phone's traffic:
Put the phone and the computer running Fiddler on the same network segment (the same Wi-Fi). On the phone, tap the active Wi-Fi connection, switch the proxy to manual, set the host address to the IP of the Fiddler computer, and set the port to 8888 (Fiddler's default port). The phone's traffic can then be captured.
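Two practical details worth noting, both standard Fiddler usage rather than anything specific to this project: Fiddler must be set to accept remote connections (Tools > Options > Connections > "Allow remote computers to connect" in Fiddler Classic), and the host address to enter on the phone is the Fiddler computer's LAN IP, which on Windows can be read with:

ipconfig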
1 Create the crawler project douyumeinv
scrapy startproject douyumeinv
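For reference, the startproject command generates Scrapy's standard project layout (the exact files vary slightly by Scrapy version):

douyumeinv/
    scrapy.cfg
    douyumeinv/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py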
2 Edit the items.py file to define the fields that will hold the crawled data
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DouyumeinvItem(scrapy.Item):
    # Host (anchor) nickname
    nickname = scrapy.Field()
    # Image link of the host's room
    roomSrc = scrapy.Field()
    # Local path where the downloaded photo is saved
    imagesPath = scrapy.Field()
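A Scrapy Item behaves like a dictionary but only accepts the declared fields; a quick sketch of how the spider below fills one (the values are placeholders):

item = DouyumeinvItem()
item['nickname'] = 'some_host'
item['roomSrc'] = 'https://example.com/cover.jpg'
# item['foo'] = 1  would raise KeyError: DouyumeinvItem does not support field: foo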
3 Create the spider file douyuspider.py in the spiders folder; the code is as follows
# -*- coding: utf-8 -*-
import scrapy
import json
from douyumeinv.items import DouyumeinvItem


class DouyuSpider(scrapy.Spider):
    # Crawler name, used when running from the terminal, e.g. scrapy crawl douyu
    name = 'douyu'
    # Domain range the crawl is allowed to touch
    allowed_domains = ['douyu.com']
    # Page number currently being crawled
    num = 1
    # Running count of hosts
    n = 0
    # Number of pages that actually contain hosts
    pageCount = 0
    url = 'https://m.douyu.com/api/room/mixList?page=' + str(num) + '&type=qmxx'
    # List of URLs to start crawling from
    start_urls = [url]

    def parse(self, response):
        '''Parse function'''
        # Convert the returned JSON data into a Python dict
        data = json.loads(response.text)['data']
        # Get the number of pages that actually contain hosts
        self.pageCount = int(data['pageCount'])
        for each in data['list']:
            self.n += 1
            item = DouyumeinvItem()
            # Image link of the host's room
            # (.encode() keeps a Python 2 str; drop these calls under Python 3)
            item['roomSrc'] = each['roomSrc'].encode('utf-8')
            # Host nickname
            item['nickname'] = each['nickname'].encode('utf-8')
            # Hand the item to the pipeline
            yield item

        # Send the next request
        self.num += 1
        # Only send a request for pages that actually have hosts
        if self.num <= self.pageCount:
            self.url = 'https://m.douyu.com/api/room/mixList?page=' + str(self.num) + '&type=qmxx'
            yield scrapy.Request(self.url, callback=self.parse)
        print('\nCrawled page %d of %d, %d hosts so far\n' % (self.num - 1, self.pageCount, self.n))
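Judging from the fields parse() reads, the mixList endpoint returns JSON of roughly the following shape (the field names come from the code above; the values are invented placeholders):

{
    "data": {
        "pageCount": 10,
        "list": [
            {"nickname": "host_a", "roomSrc": "https://example.com/cover_a.jpg"},
            {"nickname": "host_b", "roomSrc": "https://example.com/cover_b.jpg"}
        ]
    }
}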
4 Set up the pipelines.py pipeline file; subclass images.ImagesPipeline to request each image link and process the downloaded image
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os
import scrapy
from scrapy.pipelines import images
from scrapy.utils.project import get_project_settings


class DouyumeinvPipeline(object):
    def process_item(self, item, spider):
        return item


class ImagesPipeline(images.ImagesPipeline):
    IMAGES_STORE = get_project_settings().get('IMAGES_STORE')
    # count tracks how many images were actually downloaded and successfully renamed
    count = 0

    def get_media_requests(self, item, info):
        '''Deliver a Request object for each image link; the responses are
        passed to item_completed() as its "results" argument. The default
        implementation is:
            return [Request(x) for x in item.get(self.images_urls_field, [])]
        '''
        image_url = item['roomSrc']
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of tuples, each of the form (success, image_info_or_failure).
        # If success is True, image_info_or_failure is a dict with the three keys
        # url, path and checksum.
        image_path = [x['path'] for ok, x in results if ok]
        # Because of the yield above, image_path is a one-element list per item, e.g.
        # ['full/0c1c1f78e7084f5e3b07fd1b0a066c6c49dd30e0.jpg']
        # There is no need to create the target folder here; plain string
        # concatenation is enough to build both paths.
        old_file = self.IMAGES_STORE + '/' + image_path[0]
        new_file = self.IMAGES_STORE + '/full/' + item['nickname'] + '.jpg'
        # If the image was downloaded successfully, count it and rename it
        # after the host's nickname
        if os.path.exists(old_file):
            self.count += 1
            os.rename(old_file, new_file)
        item['imagesPath'] = self.IMAGES_STORE + '/full/' + item['nickname']
        print('#' * 60)
        print('Successfully downloaded %d images' % self.count)
        print('#' * 60)
        return item
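For reference, the results argument that Scrapy hands to item_completed() looks roughly like this (the values here are placeholders):

results = [
    (True, {
        'url': 'https://example.com/cover_a.jpg',  # original image URL
        'path': 'full/0c1c1f78e7084f5e3b07fd1b0a066c6c49dd30e0.jpg',  # path relative to IMAGES_STORE
        'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',  # MD5 checksum of the image
    }),
]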
5 Configure the settings.py file; a sketch of the relevant entries follows
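The post does not reproduce settings.py in text form, so this is a minimal sketch with assumed values; only ITEM_PIPELINES and IMAGES_STORE are strictly required by the code above. Note that Scrapy's ImagesPipeline also requires the Pillow library to be installed.

# settings.py -- only the entries relevant to this project; values are assumptions
BOT_NAME = 'douyumeinv'
SPIDER_MODULES = ['douyumeinv.spiders']
NEWSPIDER_MODULE = 'douyumeinv.spiders'

# Register the custom pipeline defined in pipelines.py
ITEM_PIPELINES = {
    'douyumeinv.pipelines.ImagesPipeline': 300,
}

# Folder the pipeline reads via get_project_settings().get('IMAGES_STORE')
IMAGES_STORE = './images'

# A mobile User-Agent, since the target is the mobile API (assumed value)
USER_AGENT = 'Mozilla/5.0 (Linux; Android 8.0; Mobile) AppleWebKit/537.36'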
6 Test the result; the run is sketched below
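To reproduce the test, run the crawl command named in the spider class from the project root:

scrapy crawl douyu

During the run, parse() prints how many pages and hosts have been crawled so far, and the pipeline prints a running count of successfully downloaded images.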