The Self-Cultivation of Crawlers (4)
I. Introduction to the Scrapy framework
Scrapy is an application framework written in pure Python for crawling web site data and extracting structured data, and it is very widely applicable.
Thanks to the power of the framework, users only need to customize and develop a few modules to easily implement a crawler that scrapes web content and all kinds of images, which is very convenient.
Scrapy uses the Twisted asynchronous networking framework (its main competitor is Tornado) to handle network communication. This speeds up downloads, spares us from implementing an asynchronous framework ourselves, and provides a variety of middleware interfaces so that all kinds of requirements can be met flexibly.
Scrapy architecture diagram (the green lines are the data flow):
- Scrapy Engine (引擎): responsible for the communication, signals, and data transfer among the Spider, Item Pipeline, Downloader, and Scheduler.
- Scheduler (调度器): responsible for accepting the Requests sent by the Engine, arranging them in a queue in a certain way, and returning them to the Engine when the Engine needs them.
- Downloader (下载器): responsible for downloading all the Requests sent by the Scrapy Engine and handing the Responses it obtains back to the Engine, which passes them to the Spider for processing.
- Spider (爬虫): handles all Responses, extracts data from them to fill the Item fields, and submits the URLs that need to be followed up to the Engine, which sends them to the Scheduler again.
- Item Pipeline (管道): responsible for processing the Items retrieved from the Spider and for post-processing them (detailed analysis, filtering, storage, and so on).
- Downloader Middlewares (下载中间件): a component you can customize to extend the download functionality.
- Spider Middlewares (Spider中间件): a component you can customize to extend and operate on the communication between the Engine and the Spider (for example, Responses entering the Spider and Requests leaving the Spider).
The flow starts with the spider we wrote: it sends Requests to the Scrapy Engine, the Engine passes them to the Scheduler, and the Scheduler queues them. When the Engine needs them, they come out of the queue in order and the Engine hands them to the Downloader. Once the download finishes, the Response goes back to the Engine, which hands it to our spider. While processing the Response, we send any URLs that still need to be crawled back to the Engine (repeating the steps above), and anything that needs to be saved is sent to the Item Pipeline for processing.
It takes four steps to build a Scrapy crawler:
- Create a new project (scrapy startproject xxx): create a new crawler project (command: scrapy startproject <project name>).
- Define the targets (write items.py): define the data you want to crawl (see the items.py sketch after this list).
- Write the spider (spiders/xxspider.py): write the spider that starts crawling the pages.
- Store the content (pipelines.py): design the pipeline that stores the crawled content.
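As a hedged illustration of step 2, a minimal items.py might look like the sketch below. The project name (mySpider) and the fields (name, title, info) are assumptions for illustration, not something prescribed by Scrapy or this post.

```python
# mySpider/items.py -- a minimal sketch; the field names are illustrative only
import scrapy

class TeacherItem(scrapy.Item):
    # declare one scrapy.Field() for every piece of data you want to crawl
    name = scrapy.Field()    # teacher's name
    title = scrapy.Field()   # job title
    info = scrapy.Field()    # short bio
```

The spider written in step 3 fills these fields, and the pipeline designed in step 4 receives the resulting items.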
II. Scrapy Selectors
Scrapy's selectors have built-in support for XPath and CSS selector expressions.
A Selector has four basic methods; the most commonly used is xpath():
- xpath(): takes an XPath expression and returns the selector list of all nodes matching the expression.
- extract(): serializes the selected nodes to unicode strings and returns them as a list.
- css(): takes a CSS expression and returns the selector list of all nodes matching the expression (standard CSS selector syntax, the same style BeautifulSoup4's select() uses).
- re(): extracts data according to the given regular expression and returns a list of unicode strings.
Examples of XPath expressions and their meanings:
- /html/head/title: selects the <title> element inside the <head> of the HTML document.
- /html/head/title/text(): selects the text of that <title> element.
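As a rough sketch of the four methods in action (the HTML fragment, tag names, and regular expression below are made up for the example):

```python
from scrapy.selector import Selector

# A tiny, made-up HTML fragment just to exercise the four methods.
html = ("<html><head><title>Teacher list</title></head>"
        "<body><a href='/page2'>next</a><p>Tel: 12345</p></body></html>")

sel = Selector(text=html)

# xpath(): returns the SelectorList of nodes matching the XPath expression
title_sel = sel.xpath('/html/head/title/text()')

# extract(): serializes the selected nodes to a list of unicode strings
print(title_sel.extract())                   # ['Teacher list']

# css(): same idea, but with a CSS expression
print(sel.css('a::attr(href)').extract())    # ['/page2']

# re(): applies a regular expression and returns a list of strings
print(sel.xpath('//p/text()').re(r'\d+'))    # ['12345']
```

Inside a spider you normally call the same methods directly on the response object, for example response.xpath(...).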
See the two blog posts mentioned above for the rest.
III. Item Pipeline
When an Item has been collected in the Spider, it is passed to the Item Pipeline, and the Item Pipeline components process it in the order in which they are defined.
Each item pipeline is a Python class that implements a few simple methods, for example deciding whether an item should be discarded or stored. Here are some typical uses of an item pipeline:
- Validating crawled data (checking that the item contains certain fields, such as a name field).
- Duplicate checking (and dropping duplicates; a sketch follows this list).
- Saving the crawl results to a file or a database.
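As a hedged sketch of the duplicate-checking case mentioned above: a pipeline can remember what it has already seen and raise DropItem for repeats. Using the 'name' field as the key is an assumption; adjust it to fit your own Item.

```python
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    """Illustrative duplicate filter: drops items whose 'name' was already seen."""

    def __init__(self):
        self.names_seen = set()

    def process_item(self, item, spider):
        name = item.get('name')          # assumed field; change to fit your Item
        if name in self.names_seen:
            # dropped items are not processed by later pipeline components
            raise DropItem("Duplicate item found: %s" % name)
        self.names_seen.add(name)
        return item
```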
Writing an Item Pipeline
Writing an item pipeline is simple: an item pipeline component is a standalone Python class in which the process_item() method must be implemented:
```python
import json

class XingePipeline(object):

    def __init__(self):
        # Optional: parameter initialization and so on.
        # __init__ and close_spider run only once; process_item runs once per
        # item, so do not open the file inside process_item.
        self.file = open('Teacher.json', 'wb')

    def process_item(self, item, spider):
        # item (Item object)   -- the crawled item
        # spider (Spider object) -- the spider that crawled the item
        # This method must be implemented. Every item pipeline component calls it,
        # and it must return an Item; a dropped item is not processed by the
        # later pipeline components.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(content.encode('utf-8'))
        return item

    def open_spider(self, spider):
        # spider (Spider object) -- the spider that was opened
        # Optional: called when the spider is opened.
        pass

    def close_spider(self, spider):
        # spider (Spider object) -- the spider that was closed
        # Optional: called when the spider is closed.
        self.file.close()
```
To enable the pipeline, you must uncomment the ITEM_PIPELINES setting in the settings file:
```python
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    "mySpider.pipelines.ItcastJsonPipeline": 300,
}
```
IV. The Spider class
The Spider class defines how to crawl a site (or some sites), including the crawling actions (for example, whether to follow links) and how to extract structured data (the crawled Items) from the page content. In other words, the Spider is where you define the crawling actions and analyze a page (or some pages).
class scrapy.Spider is the most basic Spider class; every crawler must inherit from it.
Its main functions, in the order they are called, are (a minimal spider sketch follows this list):
- __init__(): initializes the spider name and the start_urls list.
- start_requests(): calls make_requests_from_url() to generate Request objects, which Scrapy downloads and returns as Responses.
- parse(): parses a Response and returns Items or Requests (a callback function needs to be specified). Items are passed to the Item Pipeline for persistence, and Requests are downloaded by Scrapy and handled by the specified callback (parse() by default), looping until all data has been processed.
Main properties and methods
Working rules for the parse() method
1. Because yield is used rather than return, the parse function is a generator. Scrapy pulls the results produced by parse() one at a time and determines what type each result is.
2. If the result is a Request, it is added to the crawl queue; if it is an Item, it is handed to the pipeline for processing; any other type raises an error.
3. Scrapy does not send the Requests from the first part immediately; it just puts them in the queue and then keeps pulling from the generator.
4. After taking the Requests from the first part, it takes the Items from the second part; each Item is placed in the corresponding pipeline for processing.
5. The parse() method is assigned to a Request as its callback function, for example scrapy.Request(url, callback=self.parse), which tells Scrapy that parse() should handle those requests.
6. The Request objects are scheduled and executed, producing scrapy.http.Response objects that are sent back to parse(), until there are no Requests left in the Scheduler (a recursive idea).
7. When parse() finishes, the engine performs the corresponding operations according to the contents of the queue and the pipelines.
8. Before extracting items from each page, the program processes all requests in the request queue and only then extracts the items.
9. The Scrapy Engine and the Scheduler take care of all of this until the very end.
Small Tips
Why use yield? With yield the function becomes a generator: it can yield an Item to return data, and it can also yield the next Request. If you used return instead, the function would end immediately, and if you needed to return a list containing hundreds or thousands of elements, building it would consume a lot of memory and time. Using yield avoids that.
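A hedged sketch of what that looks like in a parse() callback: items are yielded one at a time, and a follow-up Request for the next page can be yielded as well, so nothing is accumulated in a big list. The XPath expressions and the next-page link are assumptions for illustration.

```python
import scrapy

class TeacherListSpider(scrapy.Spider):
    name = "teacher_list"
    start_urls = ["http://example.com/teachers?page=1"]

    def parse(self, response):
        # yield each item as soon as it is extracted instead of returning a list
        for row in response.xpath('//div[@class="teacher"]'):
            yield {
                "name": row.xpath('./h3/text()').extract_first(),
                "title": row.xpath('./p/text()').extract_first(),
            }

        # yielding a Request hands the next page back to the engine/scheduler,
        # and the same parse() method will be called with its response
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```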
Settings file
```python
# -*- coding: utf-8 -*-

# Scrapy settings for douyuscripy project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douyuscripy'                    # project name

SPIDER_MODULES = ['douyuscripy.spiders']    # spider file path
NEWSPIDER_MODULE = 'douyuscripy.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'douyuscripy (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True   # whether to obey the site's crawling rules; for our own crawler we usually don't, so just comment this out

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32   # number of concurrent requests, default is 16

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3   # wait time between requests

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16   # max concurrent requests per domain, default 8
#CONCURRENT_REQUESTS_PER_IP = 16       # max concurrent requests per IP, default 0;
                                       # if non-zero, the limit applies per IP instead of per domain

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False   # whether to keep cookies, default is True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False   # whether the Telnet console is enabled, default is True

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {   # request headers
    "User-Agent": "DYZB/1 CFNetwork/808.2.16 Darwin/16.3.0",
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'douyuscripy.middlewares.DouyuscripySpiderMiddleware': 543,   # spider middleware; the lower the value, the higher the priority
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'douyuscripy.middlewares.MyCustomDownloaderMiddleware': 543,   # downloader middleware
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'douyuscripy.pipelines.DouyuscripyPipeline': 300,   # which pipelines to use; with several, the smaller value runs first
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```