1. Scrapy Introduction
Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
Although it was originally designed for web scraping (more precisely, web crawling), it can also be used to extract data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network traffic. The overall architecture is roughly as follows.
Scrapy mainly includes the following components:
(1) Engine (Scrapy Engine): handles the data flow across the whole system and triggers events; it is the core of the framework.
(2) Scheduler: accepts requests from the engine, pushes them onto a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the pages or links to crawl); it decides which URL to crawl next and removes duplicate URLs.
(3) Downloader: downloads page content and hands it back to the spiders. The Scrapy downloader is built on Twisted, an efficient asynchronous model.
(4) Spiders: where the main work happens. A spider extracts the information it needs from specific pages, the so-called entities (Items). It can also extract links from those pages so that Scrapy continues crawling the next pages.
(5) Item Pipeline: responsible for processing the entities the spiders extract from pages. Its main jobs are persisting the entities, validating them, and cleaning out unwanted information. When a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a specific order.
(6) Downloader middlewares: hooks that sit between the Scrapy engine and the downloader and process the requests and responses passing between them.
(7) Spider middlewares: hooks that sit between the Scrapy engine and the spiders; their main task is processing the spiders' response input and request output.
(8) Scheduler middlewares: hooks between the Scrapy engine and the scheduler that process the requests and responses sent between them.
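As a rough sketch of how these components are wired into a project (the class paths below are hypothetical examples, not part of this tutorial's project), middlewares and pipelines are registered in settings.py:

# settings.py (illustrative sketch; the module paths are made up)

# Downloader middlewares sit between the engine and the downloader.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,
}

# Spider middlewares sit between the engine and the spiders.
SPIDER_MIDDLEWARES = {
    'myproject.middlewares.MySpiderMiddleware': 543,
}

# Item pipelines receive and post-process the items the spiders produce.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}

The integer values determine the order in which the components are applied; for item pipelines, lower values run first.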
The Scrapy workflow is roughly as follows:
First, the engine pulls a link (URL) from the scheduler for the next crawl.
The engine wraps the URL in a Request and hands it to the downloader; the downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If an entity (Item) is parsed out, it is handed to the item pipeline for further processing.
If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling.
2. Install Scrapy
Use the following commands:
sudo pip install virtualenv  # install the virtualenv tool
virtualenv ENV               # create a virtual environment directory
source ./ENV/bin/activate    # activate the virtual environment
pip install scrapy
# verify that the installation succeeded
pip list
# the output looks like this:
cffi (0.8.6)
cryptography (0.6.1)
cssselect (0.9.1)
lxml (3.4.1)
pip (1.5.6)
pycparser (2.10)
pyOpenSSL (0.14)
queuelib (1.2.2)
Scrapy (0.24.4)
setuptools (3.6)
six (1.8.0)
Twisted (14.0.2)
w3lib (1.10.0)
wsgiref (0.1.2)
zope.interface (4.1.1)
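You can also check that the scrapy command-line tool itself works; on the 0.24 series it prints its version like this (your exact version may differ):

$ scrapy version
Scrapy 0.24.4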
For more on working with virtual environments, see my earlier blog post.
3. Scrapy Tutorial
Before you crawl, you need to create a new Scrapy project. Enter a directory where you want to save the code, and then execute:
$ scrapy startproject tutorial
This command creates a new directory tutorial in the current directory, which is structured as follows:
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
These files are:
(1) scrapy.cfg: the project configuration file
(2) tutorial/: the project's Python module; you will add your code here later
(3) tutorial/items.py: the project's items file
(4) tutorial/pipelines.py: the project's pipelines file
(5) tutorial/settings.py: the project's settings file
(6) tutorial/spiders/: the directory where you place your spiders
3.1. Define Item
Items are containers for the data that will be scraped. They work like Python dictionaries, but provide extra protection: assigning to an undeclared field raises an error, which guards against typos.
You declare an Item by creating a class that subclasses scrapy.Item and defining its attributes as scrapy.Field objects.
We model the data we want to get from dmoz.org, such as the site's name, URL and description, so we define fields for these three properties. Edit the items.py file in the tutorial directory:
from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here, like:
    name = Field()
    description = Field()
    url = Field()
3.2. Writing the Spider
A Spider is a user-written class used to scrape information from a domain (or group of domains). It defines an initial list of URLs to download, how to follow links, and how to parse page content to extract Items.
To create a Spider, subclass scrapy.Spider and define three main, mandatory attributes:
name: the spider's identifier. It must be unique; you must give different spiders different names.
start_urls: the list of URLs the spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs. Regular expressions can be used to define and filter the links that should be followed.
parse(): a method of the spider. When called, the Response object generated after each initial URL finishes downloading is passed to it as its only argument. The method is responsible for parsing the returned data (the response), extracting data (generating Items), and generating Request objects for URLs that need further processing.
In other words, this method parses the returned data, matches the scraped data (parsing it into Items) and follows further URLs.
Create dmoz_spider.py in the tutorial/tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
3.3. Crawl
The current project structure:
.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── dmoz_spider.py
Go to the project root directory and run the command:

$ scrapy crawl dmoz

The output looks like this:
2014-12-15 09:30:59+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2014-12-15 09:30:59+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-12-15 09:30:59+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled item pipelines:
2014-12-15 09:30:59+0800 [dmoz] INFO: Spider opened
2014-12-15 09:30:59+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-12-15 09:31:00+0800 [dmoz] INFO: Closing spider (finished)
2014-12-15 09:31:00+0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16338,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 15, 1, 31, 0, 666214),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 12, 15, 1, 30, 59, 533207)}
2014-12-15 09:31:00+0800 [dmoz] INFO: Spider closed (finished)
3.4. Extract Items
3.4.1. Introducing Selector
There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath or CSS expressions: Scrapy Selectors.
Some examples of XPath expressions and their meanings:
- /html/head/title: selects the <title> element inside the <head> of the HTML document
- /html/head/title/text(): selects the text inside that <title> element
- //td: selects all <td> elements
- //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
XPath offers many more powerful features; see an XPath tutorial for details.
To make XPath and CSS expressions easier to use, Scrapy provides the Selector class, which has four basic methods (a short example follows the list):
- xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as argument
- css(): returns a list of selectors, each representing a node selected by the CSS expression given as argument
- extract(): returns a unicode string with the data selected by the selector
- re(): returns a list of unicode strings extracted by applying the regular expression given as argument
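The four methods can be tried on any HTML, not just a downloaded page. Here is a small self-contained sketch (the HTML snippet below is made up purely for illustration):

from scrapy.selector import Selector

body = '<html><body><ul><li><a href="http://example.com">Example</a> - a sample description</li></ul></body></html>'
sel = Selector(text=body)

print sel.xpath('//li/a/text()').extract()      # [u'Example']
print sel.css('li a::attr(href)').extract()     # [u'http://example.com']
print sel.xpath('//li/text()').re(r'-\s*(.*)')  # [u'a sample description']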
3.4.2. Extracting the Data
First, you can use the developer tools in Google Chrome to inspect the page source and work out what form the data you need takes (this way is a bit cumbersome); a simpler way is to right-click the element you are interested in and choose "Inspect element", which shows the relevant source directly.
Looking at the page source, the site information is inside the second <ul>:
<ul class="directory-url" style="margin-left:0;">
  <li>
    <a href="http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2ben-uss_01dbc.html" class="listinglink">Core Python Programming</a>
    - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]
    <div class="flag"><a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252ben-uss_01dbc.html">...</a></div>
  </li>
  ... (remaining entries omitted) ...
</ul>
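A convenient way to experiment with these expressions before writing them into a spider is the Scrapy shell, which downloads a page and opens an interactive session; on the 0.24 series the shell exposes a sel variable bound to a Selector for the downloaded response (the exact set of shell variables depends on your Scrapy version):

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"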
You can then extract the data like this:

# select each <li> element on the page:
sel.xpath('//ul/li')

# the site descriptions:
sel.xpath('//ul/li/text()').extract()

# the site titles:
sel.xpath('//ul/li/a/text()').extract()

# the site links:
sel.xpath('//ul/li/a/@href').extract()
As mentioned earlier, each xpath() call returns a list of selectors, so we can chain xpath() calls to dig into deeper nodes. We will use exactly that here:
for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
Modify the code in the existing spider file:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc
3.4.3. Using the Item
An Item object is a custom Python dictionary: you can use standard dictionary syntax to get the value of each of its fields (the fields are the attributes we declared with Field earlier).
>>> item = DmozItem()
>>> item['name'] = 'Example title'
>>> item['name']
'Example title'
In general, a Spider returns the crawled data in Item objects. Finally, modify the spider class to store the data in Items, as in the following code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items
3.5. Using the Item Pipeline
After an Item has been collected in a Spider, it is passed to the Item Pipeline, where several components process it in a defined order.
Each item pipeline component (sometimes simply called an item pipeline) is a Python class that implements a few simple methods. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and no longer processed.
Here are some typical applications for item pipeline:
- Clean up HTML data
- Validating scraped data (checking that items contain certain fields)
- Checking for (and dropping) duplicates
- Storing the scraped results, for example in a database or in an XML or JSON file
Writing your own item pipeline is simple: each item pipeline component is a separate Python class that must implement the following method:
(1) process_item(item, spider): called for every item that passes through the pipeline component. It must return an Item (or any subclass) object, or raise a DropItem exception; a dropped item is not processed by any further pipeline components.
Parameters:
item: the Item object returned by the parse method
spider: the Spider object that scraped this Item
A pipeline component may additionally implement:
(2) open_spider(spider): called when the spider is opened.
Parameters:
spider: the Spider that was opened
(3) close_spider(spider): called when the spider is closed; it can be used to do any remaining data processing after the crawl finishes.
Parameters:
spider: the Spider that was closed
Here is a pipeline that drops items whose description contains forbidden words:
from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            return item
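As a second sketch, this time exercising open_spider() and close_spider(), a pipeline can write every item to a file as one JSON object per line (the class name and the output filename items.jl are made up for illustration):

import json

class JsonWriterPipeline(object):
    # Illustrative pipeline: writes each item as one JSON line.

    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item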
To activate an item pipeline, add it to the ITEM_PIPELINES setting in settings.py (empty by default):

ITEM_PIPELINES = {'tutorial.pipelines.FilterWordsPipeline': 1}
3.6. Storing data
To save the scraped data in JSON format, use the following command:

scrapy crawl dmoz -o items.json
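The feed exports also support other formats such as CSV and XML; on the 0.24 series you can pass the format explicitly with -t (newer versions can usually infer it from the file extension):

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml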
4. Example
4.1 The simplest spider (the default Spider)
It builds Request objects from the URLs in its start_urls attribute.
The framework takes care of executing those requests.
The Response returned for each request is passed to the parse method for analysis.
Simplified source code:
class Spider(object_ref):
    """Base class for Scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError

BaseSpider = create_deprecated_class('BaseSpider', Spider)
An example of a single callback that returns multiple Requests and Items:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Constructing a Request object takes only two parameters: URL and callback function
4.2 CrawlSpider
Usually we need to decide inside a spider which page links to follow and which pages to stop at without following their links. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawling task simple: you just tell Scrapy in the rules which links need to be followed.
Recall the spider we used to crawl the Mininova website:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
The rule in the code above means: links matching /tor/\d+ are extracted and their responses handled by parse_torrent, and links on those response pages are not followed further.
The official documentation has another example:
rules = (
    # Extract links matching 'category.php' (but not 'subsection.php')
    # and follow links from them (since there is no callback, follow defaults to True).
    Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method.
    Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
)
In addition to Spider and CrawlSpider, Scrapy also provides XMLFeedSpider, CSVFeedSpider and SitemapSpider.
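As a brief, hedged illustration of one of these, a minimal SitemapSpider might look like the following (the sitemap URL is made up; by default every URL found in the sitemap is passed to parse):

from scrapy.contrib.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    # hypothetical sitemap URL, for illustration only
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        # each URL listed in the sitemap is downloaded and handed to this method
        self.log("Visited %s" % response.url)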