A few words up front:

As a Java programmer taking my first plunge into the big-data pit, a crawler was my very first project. The project details are not worth repeating here; after several rounds of struggle I finally decided to give up on a Java crawler and write it in Python instead, and a Python crawler naturally cannot get around the excellent Scrapy framework.

Setting up the environment and installing the various packages is something I believe every newcomer like me has had to live through, painful yet happy. In the end I abandoned the 2.7 branch and chose the 3.5 release; after all, mastering a new technology always gives you a sense of achievement.

Listening to Douban Music while watching the crawler leisurely collect movie data, there was no thrill and no giddy joy, just a kind of relaxation, the complete peace of mind of a technical person.
Seeing is believing:
About IP proxy pools:

I had heard that Douban bans IPs, so the first thing I did was search for ProxyPool-related projects. I tried two approaches in all.
The first is to crawl free proxies from the domestic high-anonymity proxy sites, generate a proxy_list.json, and copy that file into your own project's root directory; on every request you pick a random IP from the JSON file. The idea sounds fine, but can free proxies really be trusted? After reading the code I dropped the illusion. I went back and forth for a whole morning and got nowhere.
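A minimal sketch of that first approach, assuming proxy_list.json in the project root holds entries like {"type": "http", "host": "1.2.3.4", "port": 8080} (the exact layout used by the project I found is an assumption):

import json
import random

def random_proxy(path='proxy_list.json'):
    # Pick one proxy at random from the JSON file in the project root
    # (file name and key names are assumed, see note above).
    with open(path, encoding='utf-8') as f:
        proxies = json.load(f)
    p = random.choice(proxies)
    return '%s://%s:%s' % (p['type'], p['host'], p['port'])

# In a Scrapy spider the proxy would typically be attached per request:
# yield scrapy.Request(url, meta={'proxy': random_proxy()}, callback=self.parse)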
The second is similar: the reasonably well-known GitHub project ProxyPool-master also crawls free proxies from the big free-proxy sites, but stores them in Redis and serves them over HTTP; opening http://127.0.0.1:5000/random in a local browser returns one proxy. What is worth learning is the scoring scheme: each proxy is stored with a score of 10, raised to 100 once the asynchronous check succeeds, decremented from 10 on each failure, and removed from the pool when it reaches 0. Even so, it could not shake off the bad luck of free proxies, and in the end I gave it up too.
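A hedged sketch of how a spider might pull a proxy from that locally running pool; it only assumes the /random endpoint mentioned above is up on 127.0.0.1:5000:

import requests

def get_proxy():
    # Ask the local ProxyPool service for one random proxy, e.g. '1.2.3.4:8080'.
    resp = requests.get('http://127.0.0.1:5000/random', timeout=5)
    return resp.text.strip()

# In Scrapy this would usually live in a downloader middleware:
# request.meta['proxy'] = 'http://' + get_proxy()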
The leisurely crawler:
Project structure and directory layout
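The tree below is reconstructed from the files discussed in this post, so the names of the auxiliary files (scrapy.cfg, pipelines.py, the __init__.py files) are assumptions based on a standard Scrapy project:

douban/
├── scrapy.cfg
└── douban/
    ├── __init__.py
    ├── settings.py
    ├── items.py
    ├── excelexport.py
    ├── pipelines.py
    └── spiders/
        ├── __init__.py
        └── movie_hot.py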
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for the douban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douban'

SPIDER_MODULES = ['douban.spiders']
NEWSPIDER_MODULE = 'douban.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'douban.pipelines.DoubanPipeline': 300,
#}

# All other generated settings (Telnet console, default request headers,
# middlewares, extensions, AutoThrottle, HTTP caching) stay at their
# commented-out defaults.

# Export to an Excel file
FEED_EXPORTERS = {
    'excel': 'douban.excelexport.ExcelItemExporter',
}

# Column order of the exported document
FEED_EXPORT_FIELDS = [
    'title', 'year', 'score', 'alias', 'commentcount', 'director', 'writer',
    'performer', 'categories', 'website', 'area', 'language', 'pub', 'time',
    'imdb', 'start', 'better', 'image', 'description',
]
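With the 'excel' format registered in FEED_EXPORTERS, the crawl can be written straight to an .xls file from the command line using the Scrapy 1.x-style -t format switch; the output file name here is only an example:

scrapy crawl movie_hot -o douban_movies.xls -t excel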
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Data outside the info block
    title = scrapy.Field()
    year = scrapy.Field()
    score = scrapy.Field()
    commentcount = scrapy.Field()
    start = scrapy.Field()
    better = scrapy.Field()
    image = scrapy.Field()
    description = scrapy.Field()

    # Data inside the info block
    director = scrapy.Field()
    writer = scrapy.Field()
    performer = scrapy.Field()
    categories = scrapy.Field()
    website = scrapy.Field()
    area = scrapy.Field()
    language = scrapy.Field()
    pub = scrapy.Field()
    time = scrapy.Field()
    alias = scrapy.Field()
    imdb = scrapy.Field()
excelexport.py
import xlwt
from scrapy.exporters import BaseItemExporter


class ExcelItemExporter(BaseItemExporter):
    """Write each exported item as one row of an Excel worksheet using xlwt."""

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.workbook = xlwt.Workbook()
        self.worksheet = self.workbook.add_sheet('scrapy')
        self.row = 0

    def finish_exporting(self):
        self.workbook.save(self.file)

    def export_item(self, item):
        fields = self._get_serialized_fields(item)
        for col, (name, value) in enumerate(fields):
            self.worksheet.write(self.row, col, value)
        self.row += 1
The spider: movie_hot.py
# -*- coding: utf-8 -*-
import scrapy
import json
import re
import time

from douban.items import DoubanItem


class MovieHotSpider(scrapy.Spider):
    name = "movie_hot"
    allowed_domains = ["movie.douban.com"]

    # Template for Douban's movie-list JSON API
    BASE_URL = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%s&sort=recommend&page_limit=%s&page_start=%s'
    MOVIE_TAG = '最新'   # "latest"
    PAGE_LIMIT = 20      # movies requested per page
    page_start = 0

    domains = BASE_URL % (MOVIE_TAG, PAGE_LIMIT, page_start)

    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Connection": "keep-alive",
        "Host": "movie.douban.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36",
    }

    # The crawl starts here
    def start_requests(self):
        print('~~~ crawling list page: ' + self.domains)
        yield scrapy.Request(
            url=self.domains,
            headers=self.headers,
            callback=self.request_movies
        )

    # Parse a list page
    def request_movies(self, response):
        # Parse the JSON response
        infos = json.loads(response.text)

        # Iterate over the movie list
        for movie_info in infos['subjects']:
            print('~~~ crawling movie: ' + movie_info['title'] + '/' + movie_info['rate'])
            # Take the movie's detail-page URL and request it
            yield scrapy.Request(
                url=str(movie_info['url']),
                headers=self.headers,
                callback=self.request_movie,
                dont_filter=True
            )

        # If the JSON result holds as many movies as requested, keep paging;
        # otherwise there are no more movies to fetch
        if len(infos['subjects']) == self.PAGE_LIMIT:
            self.page_start += self.PAGE_LIMIT
            url = self.BASE_URL % (self.MOVIE_TAG, self.PAGE_LIMIT, self.page_start)
            time.sleep(5)
            print('~~~ crawling list page: ' + url)
            yield scrapy.Request(
                url=url,
                headers=self.headers,
                callback=self.request_movies,
                dont_filter=True
            )

    # Parse a detail page
    def request_movie(self, response):
        # Assemble the item
        movie_item = DoubanItem()

        # Data outside the info block
        movie_item['title'] = response.css('div#content>h1>span:nth-child(1)::text').extract_first()
        movie_item['year'] = response.css('div#content>h1>span.year::text').extract_first()[1:-1]
        movie_item['score'] = response.css('strong.rating_num::text').extract_first()
        movie_item['commentcount'] = response.css('div.rating_sum>a.rating_people>span::text').extract_first()
        movie_item['start'] = '/'.join(response.css('span.rating_per::text').extract())
        movie_item['better'] = '/'.join(response.css('div.rating_betterthan>a::text').extract())
        movie_item['description'] = response.css('#link-report>span::text').extract_first().strip()
        movie_item['image'] = response.css('#mainpic>a>img::attr(src)').extract_first()

        # Grab the whole info block as one string
        info = response.css('div.subject div#info').xpath('string(.)').extract_first()
        # Extract all field labels (they are Chinese on the page)
        fields = [s.strip().replace(':', '') for s in response.css('div#info span.pl::text').extract()]
        # Extract the value of every field
        values = [re.sub(r'\s+', '', s.strip())
                  for s in re.split(r'\s*(?:%s):\s*' % '|'.join(fields), info)][1:]

        # Map the Chinese labels to the item's field names
        for i in range(len(fields)):
            if '导演' == fields[i]:           # director
                fields[i] = 'director'
            if '编剧' == fields[i]:           # screenwriter
                fields[i] = 'writer'
            if '主演' == fields[i]:           # starring
                fields[i] = 'performer'
            if '类型' == fields[i]:           # genre
                fields[i] = 'categories'
            if '官方网站' == fields[i]:       # official website
                fields[i] = 'website'
            if '制片国家/地区' == fields[i]:   # country/region
                fields[i] = 'area'
            if '语言' == fields[i]:           # language
                fields[i] = 'language'
            if '上映日期' == fields[i]:       # release date
                fields[i] = 'pub'
            if '片长' == fields[i]:           # running time
                fields[i] = 'time'
            if '又名' == fields[i]:           # also known as
                fields[i] = 'alias'
            if 'IMDb链接' == fields[i]:       # IMDb link
                fields[i] = 'imdb'

        # Fill all info fields into the item
        movie_item.update(dict(zip(fields, values)))

        # Fill in any missing fields with a placeholder
        for field in ('director', 'writer', 'performer', 'categories', 'website',
                      'area', 'language', 'pub', 'time', 'alias', 'imdb'):
            if field not in movie_item.keys():
                movie_item[field] = '/'

        yield movie_item