1. Scrapy Introduction
Scrapy is an application framework written to crawl websites and extract structured data. It can be used in a wide range of programs for data mining, information processing, or archiving historical data.
It was originally designed for web scraping (more precisely, web crawling), but it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is roughly as described below.
Scrapy mainly includes the following components (a short settings sketch after this list shows how several of them are enabled in a project):
(1) Engine (Scrapy Engine): handles the data flow of the entire system and triggers events; it is the core of the framework.
(2) Scheduler: accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks for them again. It can be thought of as a priority queue of URLs (the URLs, or links, of the pages to crawl); it decides which URL to crawl next and removes duplicate URLs.
(3) Downloader: downloads page content and hands it back to the spiders. (The Scrapy downloader is built on Twisted, an efficient asynchronous model.)
(4) Spiders: spiders do the main work of extracting the information they need, the so-called items, from particular web pages. The user can also extract links from the pages so that Scrapy continues crawling further pages.
(5) Item Pipeline: responsible for processing the items that the spiders extract from web pages. Its main jobs are persisting items, validating them, and discarding unwanted data. After a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a specific order.
(6) Downloader middlewares: a hook framework between the Scrapy engine and the downloader; it mainly processes the requests and responses exchanged between them.
(7) Spider middlewares: a hook framework between the Scrapy engine and the spiders; it mainly processes the spiders' response input and request output.
(8) Scheduler middlewares: middleware between the Scrapy engine and the scheduler that processes the requests and responses sent between them.
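As a concrete sketch of how several of these components are wired into a project, the snippet below shows the relevant settings in a project's settings.py. The module paths and priority numbers are illustrative placeholders, not part of any real project:

# settings.py (module paths and priorities are illustrative)

# Downloader middlewares sit between the engine and the downloader.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}

# Spider middlewares sit between the engine and the spiders.
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,
}

# Item pipelines process the items returned by the spiders,
# in ascending order of the number assigned to them.
ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,
}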
The Scrapy workflow runs roughly as follows:
First, the engine takes a URL (link) from the scheduler for the next page to crawl.
The engine wraps the URL in a Request and hands it to the downloader; the downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If items are parsed out, they are handed to the item pipeline for further processing.
If links (URLs) are parsed out, they are handed to the scheduler to wait to be crawled.
2. Installing Scrapy
Use the following commands:

sudo pip install virtualenv   # install the virtualenv tool
virtualenv ENV                # create a virtual environment directory
source ./ENV/bin/activate     # activate the virtual environment
pip install Scrapy
pip list                      # verify that the installation succeeded

The output looks like this:

cffi (0.8.6)
cryptography (0.6.1)
cssselect (0.9.1)
lxml (3.4.1)
pip (1.5.6)
pycparser (2.10)
pyOpenSSL (0.14)
queuelib (1.2.2)
Scrapy (0.24.4)
setuptools (3.6)
six (1.8.0)
Twisted (14.0.2)
w3lib (1.10.0)
wsgiref (0.1.2)
zope.interface (4.1.1)
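Scrapy can also be checked directly from the Python interpreter; the version printed will match whatever pip installed (0.24.4 in the listing above):

>>> import scrapy
>>> scrapy.__version__
'0.24.4'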
For more operations on virtual environments, see my earlier blog post.
3. Scrapy Tutorial
Before crawling, you need to create a new Scrapy project. Go to a directory where you want to save the code, and then execute:
$ scrapy startproject tutorial
This command creates a new tutorial directory under the current directory, with the following structure:
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
These files are, briefly:
(1) scrapy.cfg: the project configuration file
(2) tutorial/: the project's Python module; you will add your code here later
(3) tutorial/items.py: the project's items file
(4) tutorial/pipelines.py: the project's pipelines file
(5) tutorial/settings.py: the project's settings file
(6) tutorial/spiders/: the directory where spiders are placed
3.1. Define the item
Items are the containers that hold the crawled data. They work much like Python dictionaries, but they provide extra protection, such as raising an error when an undefined field is filled in, which guards against typos.
An item is declared by creating a class that inherits from scrapy.Item and defining its fields as class attributes of type scrapy.Field.
We model the data we want from dmoz.org as an item containing the site name, URL, and description, so we define fields for these three attributes. Edit the items.py file in the tutorial directory:
from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    name = Field()
    description = Field()
    url = Field()
3.2. Writing the Spider
Spiders are user-written classes that scrape information from a domain (or group of domains). They define an initial list of URLs to download, how to follow links, and how to parse page content to extract items.
To build a spider, inherit from the scrapy.Spider base class and define three main, mandatory attributes:
name: the spider's name. It must be unique; every spider must have a different name.
start_urls: the list of URLs the spider crawls at startup, so the first pages fetched will come from this list. Subsequent URLs are extracted from the data retrieved from these initial URLs. Regular expressions can be used to define and filter the links that should be followed.
parse(): a method of the spider. When called, the Response object generated for each downloaded initial URL is passed to this method as its only argument. The method is responsible for parsing the returned response data, extracting it into items, and generating Request objects for further URLs to process.
Create dmoz_spider.py under the tutorial/tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
3.3. Crawling
Current project structure
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── dmoz_spider.py
Go to the project root directory and run:
$ scrapy crawl dmoz
Output:
2014-12-15 09:30:59+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2014-12-15 09:30:59+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-12-15 09:30:59+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled item pipelines:
2014-12-15 09:30:59+0800 [dmoz] INFO: Spider opened
2014-12-15 09:30:59+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) (referer: None)
2014-12-15 09:31:00+0800 [dmoz] INFO: Closing spider (finished)
2014-12-15 09:31:00+0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16338,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 15, 1, 31, 0, 666214),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 12, 15, 1, 30, 59, 533207)}
2014-12-15 09:31:00+0800 [dmoz] INFO: Spider closed (finished)
3.4. Extract Items
3.4.1. Introduction to Selectors
There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath or CSS expressions: Scrapy Selectors.
Examples of XPath expressions and their corresponding meanings:
- /html/head/title: selects the <title> element inside the <head> of the HTML document
- /html/head/title/text(): selects the text inside the <title> element
- //td: selects all <td> elements
- //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
Many more powerful features are available; see the XPath tutorial for details.
To make XPath easier to use, Scrapy provides the Selector class, which has four basic methods (a short Scrapy shell example follows this list):
- xpath(): returns a list of selectors, each representing a node selected by the XPath expression passed as the argument.
- css(): returns a list of selectors, each representing a node selected by the CSS expression passed as the argument.
- extract(): returns the data selected by the selector, serialized as a unicode string.
- re(): returns a list of unicode strings extracted by applying the regular expression passed as the argument.
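As a quick illustration of these four methods, you can experiment in the Scrapy shell. The snippet below assumes the DMOZ Books page used throughout this tutorial, so the exact output depends on the live page:

# start the shell first with:
#   scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

# xpath(): a list of selectors for the nodes matched by the XPath expression
response.xpath('//title')

# css(): a list of selectors for the nodes matched by the CSS expression
response.css('title')

# extract(): serialize the selected nodes to unicode strings
response.xpath('//title/text()').extract()

# re(): apply a regular expression to the selected text and return the matches
response.xpath('//title/text()').re(r'(\w+):')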
3.4.2. Extracting data
- First, you can use the Chrome developer tools to look at the page source and figure out where the data you want lives (this approach is rather tedious). A simpler way is to right-click the element you are interested in and choose Inspect Element, which jumps straight to the relevant part of the page source.
After inspecting the page source, you can see that the site information is contained in the second <ul> element, for example:
- Core Python programming-by Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; Professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]
... (part omitted) ...
You can then extract the data as follows:
# list elements
sel.xpath('//ul/li')

# site descriptions
sel.xpath('//ul/li/text()').extract()

# site titles
sel.xpath('//ul/li/a/text()').extract()

# site links
sel.xpath('//ul/li/a/@href').extract()
As mentioned earlier, each xpath() call returns a list of selectors, so we can chain xpath() calls to dig into deeper nodes. We will use that here:
for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
Modify the code in the existing spider file accordingly:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
3.4.3. Using the item
Item objects are custom Python dictionaries; you can use standard dictionary syntax to access the value of each field (the fields are the attributes we declared earlier with Field):
>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
In general, a spider returns the crawled data in Item objects. So finally, we modify the spider to use Item to store the data, as follows:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items
3.5. Using the Item Pipeline
After an item has been collected in a spider, it is passed to the Item Pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes simply called an item pipeline) is a Python class that implements a simple method. It receives an item, performs some action on it, and also decides whether the item continues through the pipeline or is dropped and no longer processed.
Here are some typical uses of item pipelines:
- Cleaning HTML data
- Validating crawled data (checking that items contain certain fields)
- Checking for (and discarding) duplicates
- Storing the scraped results, for example in a database or in XML, JSON, or other files
Writing your own item pipeline is simple. Each item pipeline component is a standalone Python class that must implement the process_item() method and may also implement the other methods described below (a small end-to-end sketch follows these descriptions):
(1) process_item(item, spider): every item pipeline component must implement this method. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any later pipeline components.
Parameters:
item: the Item object returned by the parse method
spider: the Spider object that crawled this item
(2) open_spider(spider): called when the spider is opened.
Parameter:
spider: (Spider object) the spider that was opened
(3) close_spider(spider): called when the spider is closed; it can be used for any final data processing after the crawl.
Parameter:
spider: (Spider object) the spider that was closed
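To sketch how these three methods fit together, here is a hypothetical pipeline that writes each item to a JSON Lines file (one of the typical uses listed above). The class name JsonWriterPipeline and the file name items.jl are illustrative, not part of the tutorial project:

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        # called when the spider is opened: create the output file
        self.file = open('items.jl', 'wb')

    def close_spider(self, spider):
        # called when the spider is closed: release the file handle
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item as one JSON line, then pass it on unchanged
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item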
As an example, here is a pipeline that drops items whose description contains certain forbidden words:
from scrapy.exceptions import DropItem

class TutorialPipeline(object):
    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            return item
To activate an item pipeline, add it to the ITEM_PIPELINES setting in settings.py (which is empty by default):
ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
3.6. Storing data
Use the following command to store the scraped items in JSON format:
scrapy crawl dmoz -o items.json
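The feed exports support several other formats as well; the format can be given explicitly with the -t option, for example:

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml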
4. Example
4.1. The simplest spider (the default Spider)
Request objects are constructed from the URLs in the start_urls attribute.
The framework takes care of executing these requests.
The Response object returned for each request is passed to the parse() method for analysis.
The simplified source code:
class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this class."""

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError

BaseSpider = create_deprecated_class('BaseSpider', Spider)
An example in which the callback function yields multiple Requests as well as Items:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Only two arguments are needed to construct a Request object: the URL and the callback function.
4.2. CrawlSpider
Usually we need to decide in the spider which links on a page should be followed and which pages are the end of the line, with no need to follow their links. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawl simple: you just tell Scrapy in the rules which links need to be followed.
Recall the spider we used to crawl the Mininova website:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
# TorrentItem is assumed to be defined in the project's items.py

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
The rule in the code above means: responses from links matching /tor/\d+ are handed to parse_torrent, and links on those responses are not followed any further.
An example is also available in the official documentation:
rules = (
    # extract links matching 'category.php' (but not 'subsection.php') and follow them
    # (with no callback, follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),
    # extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
)
In addition to Spider and CrawlSpider, there are also XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
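As a brief, hedged illustration of one of these, here is a minimal SitemapSpider sketch. The spider, its name, and the sitemap URL are hypothetical and only show the general shape (the import path is the one used in the Scrapy 0.24 series):

from scrapy.contrib.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    # SitemapSpider downloads this sitemap and requests every page it lists
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # called for each page discovered through the sitemap
        self.log("Visited %s" % response.url)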