Python Crawler Programming Framework Scrapy: Getting Started Tutorial


1. About Scrapy
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and storing historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to extract data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:


Scrapy mainly includes the following components:

(1) Scrapy Engine: processes the data flow of the entire system and triggers events (the core of the framework).

(2) Scheduler: accepts requests from the engine, pushes them into a queue, and returns them when the engine asks for them again. It can be thought of as a priority queue of URLs (the URLs of the pages to be crawled): it decides which URL to crawl next and removes duplicate URLs.

(3) Downloader: downloads web page content and returns it to the spiders (Scrapy's downloader is built on Twisted's efficient asynchronous model).

(4) Spiders: user-written classes that extract the information they need from specific web pages, that is, the so-called entities (Items). A spider can also extract links from a page so that Scrapy continues crawling the next page.

(5) Item Pipeline: processes the entities extracted from web pages by the spiders. Its main jobs are persisting entities, validating them, and removing unwanted data. After a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by its components in a specific order.

(6) Downloader Middlewares: a framework of hooks between the Scrapy engine and the downloader. It mainly processes the requests and responses exchanged between the engine and the downloader.

(7) Spider Middlewares: a framework of hooks between the Scrapy engine and the spiders. It mainly processes the spiders' response input and request output.

(8) Scheduler Middlewares: middleware between the Scrapy engine and the scheduler. It processes the requests and responses sent between the engine and the scheduler.
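
These components are registered and ordered through the project's settings.py. The snippet below is a minimal, hypothetical sketch of how custom middlewares and pipelines would be enabled there; the module paths are placeholders, not part of this tutorial's project:

# settings.py (sketch)
# The integer values control ordering: for middlewares, lower numbers sit
# closer to the engine; for item pipelines, lower numbers run first.

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,   # hypothetical
}

SPIDER_MIDDLEWARES = {
    'myproject.middlewares.CustomSpiderMiddleware': 543,       # hypothetical
}

ITEM_PIPELINES = {
    'myproject.pipelines.CustomPipeline': 300,                 # hypothetical
}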

The running process of Scrapy is roughly as follows:

First, the engine takes a link (URL) from the Scheduler as the next page to crawl.
The engine wraps the URL in a Request and sends it to the Downloader; the Downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If entities (Items) are parsed out, they are handed to the item pipeline for further processing.
If URLs are parsed out, they are handed back to the Scheduler to be crawled later.

2. Install Scrapy
Run the following command:

sudo pip install virtualenv #Install virtual environment tools
virtualenv ENV #Create a virtual environment directory
source ./ENV/bin/activate #Activate the virtual environment
pip install Scrapy
#Verify that the installation was successful
pip list
#Output is as follows
cffi (0.8.6)
cryptography (0.6.1)
cssselect (0.9.1)
lxml (3.4.1)
pip (1.5.6)
pycparser (2.10)
pyOpenSSL (0.14)
queuelib (1.2.2)
Scrapy (0.24.4)
setuptools (3.6)
six (1.8.0)
Twisted (14.0.2)
w3lib (1.10.0)
wsgiref (0.1.2)
zope.interface (4.1.1)
For more virtual environment operations, see my blog post.

3. Scrapy Tutorial
Before fetching, you need to create a new Scrapy project. Go to a directory where you want to save the code, and then execute:

$ scrapy startproject tutorial
This command will create a new directory named tutorial under the current directory. Its structure is as follows:

.
├── scrapy.cfg
└── tutorial
 ├── __init__.py
 ├── items.py
 ├── pipelines.py
 ├── settings.py
 └── spiders
  └── __init__.py
These files are mainly:

(1) scrapy.cfg: the project configuration file
(2) tutorial/: the project's Python module; you will add your code here
(3) tutorial/items.py: the project's items file
(4) tutorial/pipelines.py: the project's pipelines file
(5) tutorial/settings.py: the project's settings file
(6) tutorial/spiders/: the directory where spiders are placed

3.1. Defining Item
Items are the containers that will hold the scraped data. They work like Python dictionaries, but provide additional protection, such as raising an error when an undeclared field is filled in, which prevents typos.

An Item is declared by creating a class that subclasses scrapy.Item and defining class attributes of type scrapy.Field.
We model the site data we want to obtain from dmoz.org with the items we need. For example, we want to capture the site name, URL and description, so we define fields for these three attributes. Edit the items.py file in the tutorial directory:

from scrapy.item import Item, Field


class DmozItem(Item):
    # define the fields for your item here, for example:
    name = Field()
    description = Field()
    url = Field()
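
As a quick illustration of the protection just mentioned, assigning to a field that was not declared on the Item raises a KeyError (a minimal sketch, run for example from a Python shell in the project directory):

from tutorial.items import DmozItem

item = DmozItem(name='Example site')
item['url'] = 'http://www.example.com/'      # declared field: fine

# a typo in the field name is caught immediately
try:
    item['ulr'] = 'http://www.example.com/'  # 'ulr' is not a declared field
except KeyError as e:
    print 'unknown field:', e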

3.2. Writing Spiders
A Spider is a user-written class used to scrape information from a domain (or group of domains). It defines the initial list of URLs to download, which links to follow, and how to parse page content to extract items.

To build a spider, subclass scrapy.Spider and define three main, mandatory attributes:

name: the spider's identifier. It must be unique; you must give different spiders different names.
start_urls: the list of URLs the spider crawls at startup, so the first pages retrieved come from this list. Subsequent URLs are extracted from the data obtained from these initial URLs; regular expressions can be used to define and filter the links that should be followed up.
parse(): a method of the spider. When called, the Response object generated for each downloaded initial URL is passed to it as the only argument. The method is responsible for parsing the response data, extracting the scraped data (as items), and generating Request objects for URLs that need further processing.
Create dmoz_spider.py under the tutorial/tutorial/spiders directory:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

3.3. Crawling
Current project structure

├── scrapy.cfg
└── tutorial
 ├── __init__.py
 ├── items.py
 ├── pipelines.py
 ├── settings.py
 └── spiders
  ├── __init__.py
  └── dmoz_spider.py
Go to the project root directory and run the command:

$ scrapy crawl dmoz

The output looks like this:
2014-12-15 09:30:59+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2014-12-15 09:30:59+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-12-15 09:30:59+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled item pipelines:
2014-12-15 09:30:59+0800 [dmoz] INFO: Spider opened
2014-12-15 09:30:59+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-12-15 09:31:00+0800 [dmoz] INFO: Closing spider (finished)
2014-12-15 09:31:00+0800 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 516,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 16338,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 12, 15, 1, 31, 0, 666214),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 12, 15, 1, 30, 59, 533207)}
2014-12-15 09:31:00+0800 [dmoz] INFO: Spider closed (finished)
3.4. Extracting Items
3.4.1. Introducing Selector
There are many ways to extract data from a web page. Scrapy uses a mechanism based on XPath or CSS expressions: Scrapy Selectors

Examples of XPath expressions and their corresponding meanings:

/html/head/title: selects the <title> element inside the <head> tag of the HTML document
/html/head/title/text(): selects the text inside the <title> element
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have a class="mine" attribute
For more powerful features, see the XPath tutorial.

To make working with XPath easier, Scrapy provides a Selector class with four basic methods (demonstrated interactively right after this list):

xpath(): returns a list of selectors, each representing a node selected by the XPath expression passed as an argument.
css(): returns a list of selectors, each representing a node selected by the CSS expression passed as an argument.
extract(): returns a unicode string containing the data selected by the selector.
re(): returns a list of unicode strings extracted by applying the regular expression passed as an argument.
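
To experiment with these methods interactively, the Scrapy shell is handy. A quick sketch (the exact output depends on the page being inspected):

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

# inside the shell, a response object is already available:
>>> response.xpath('//title')                      # list of selectors for <title> nodes
>>> response.xpath('//title/text()').extract()     # the page title as a list of unicode strings
>>> response.css('title::text').extract()          # the same selection with a CSS expression
>>> response.xpath('//title/text()').re('(\w+):')  # apply a regular expression to the selection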
3.4.2. Retrieving data

First, use the Chrome developer tools to inspect the page source and find the data you want to extract (reading the raw source directly is more troublesome; the easier way is to right-click the element of interest and choose "Inspect element", which jumps straight to the relevant markup).
Looking at the page source, the site listings are in the second <ul> element:

<ul class="directory-url" style="margin-left:0;">

 <li><a href="http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html" class="listinglink">Core Python Programming</a>
- By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]
<div class="flag"><a href="/public/flag?cat=Computers%2FProgramming%2FLanguages%2FPython%2FBooks&url=http%3A%2F%2Fwww.pearsonhighered.com%2Feducator%2Facademic%2Fproduct%2F0%2C%2C0130260363%2C00%252Ben-USS_01DBC.html"><img src="/img/flag.png" alt="[!]" title="report an issue with this listing"></a></div>
</li>
... part omitted ...
</ul>

Then you can extract the data in the following way

# Select each <li> element on the page:
sel.xpath('//ul/li')

# Site descriptions:
sel.xpath('//ul/li/text()').extract()

# Site titles:
sel.xpath('//ul/li/a/text()').extract()

# Site links:
sel.xpath('//ul/li/a/@href').extract()

As mentioned earlier, each xpath() call returns a list of selectors, so xpath() calls can be chained to drill into deeper nodes. We use that here:

for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc
Modify the existing spider file accordingly:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

3.4.3. Using items
Item objects are custom Python dictionaries; you can read the value of each field with standard dictionary syntax.

>>> item = DmozItem()
>>> item['name'] = 'Example site'
>>> item['name']
'Example site'
In general, a spider returns the crawled data as Item objects. Finally, modify the spider class so that it uses Item to store the data. The code is as follows:

from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = DmozItem()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items

3.5. Using Item Pipeline
After an item has been collected in a spider, it is passed to the item pipeline, whose components process it in a defined order.
Each item pipeline component is a Python class that implements a few simple methods. A component receives an item, performs an action on it, and decides whether the item continues through the pipeline or is dropped and no longer processed.
The following are some typical applications of the item pipeline:

Clean up HTML data
Validate the crawled data (check that the item contains certain fields)
Check for duplicates (and discard them)
Save the crawl results, such as to a database, XML, JSON, etc.
Writing your own item pipeline is simple. Each item pipeline component is a separate Python class; process_item() must be implemented, and open_spider() and close_spider() can be implemented as well:

(1) process_item(item, spider)  # Called for every item by every pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by any further pipeline components.

# Parameters:

item: the Item object returned by the spider's parse method

spider: the Spider object that scraped this item

(2) open_spider(spider)  # Called when the spider is opened.

# Parameters:

spider: (Spider object) the spider that was opened

(3) close_spider(spider)  # Called when the spider is closed; use it to finalize or flush data after the crawl ends.

# Parameters:

spider: (Spider object) the spider that was closed

The example below is a pipeline that filters out items whose description contains forbidden words:

from scrapy.exceptions import DropItem

class TutorialPipeline(object):

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # the for loop's else clause runs only when no forbidden word was found
            return item
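
The pipeline interface also covers the persistence use case mentioned earlier (saving results as JSON, to a database, and so on). The following is a hypothetical sketch, not part of the tutorial project, of a pipeline that writes each item to a JSON Lines file and uses all three methods described above:

import json

class JsonWriterPipeline(object):
    # writes every processed item as one JSON object per line

    def open_spider(self, spider):
        # called once when the spider starts: open the output file
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # serialize the item, append it to the file, and pass the item on
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: close the file
        self.file.close()

Such a pipeline would be enabled by adding it to ITEM_PIPELINES (see below) with an integer value that sets its position in the processing order.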

To activate an item pipeline, add it to the ITEM_PIPELINES setting in settings.py (empty by default):

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
3.6. Stored Data
Use the following command to save the scraped data in JSON format:

scrapy crawl dmoz -o items.json
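
Other feed formats work in much the same way. Depending on the Scrapy version, the output format is inferred from the file extension or can be given explicitly with the -t option (assumed behavior; check scrapy crawl --help for your version):

scrapy crawl dmoz -o items.csv
scrapy crawl dmoz -o items.xml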

4. Example
4.1 The simplest spider (default Spider)
It builds Request objects from the URLs in the start_urls instance attribute.
The framework is responsible for executing those requests.
The Response object returned by each request is passed to the parse method for analysis.

Simplified source code:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError


BaseSpider = create_deprecated_class('BaseSpider', Spider)

An example where the callback yields multiple items and requests:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Only two parameters are required to construct a Request object: URL and callback function

4.2 CrawlSpider
Usually we need to decide in a spider which links on a page should be followed up, and which pages are endpoints whose links need not be followed. CrawlSpider provides a useful abstraction, Rule, that makes this kind of crawl simple: you just tell Scrapy in the rules which links need to be followed up.
Recall the spider we used to crawl the mininova website:

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent
The rule in the code above means: responses whose URL matches /tor/\d+ are handed to parse_torrent, and links on those responses are not followed any further.
There is also an example in the official documentation:

rules = (
    # Extract links matching 'category.php' (but not 'subsection.php')
    # and follow them (no callback means follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

    # Extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
)

In addition to Spider and CrawlSpider, Scrapy also provides XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
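
As a brief illustration of one of these, a minimal SitemapSpider might be sketched as follows (the sitemap URL and the rule pattern are placeholders; in Scrapy 0.24 the class lives in scrapy.contrib.spiders):

from scrapy.contrib.spiders import SitemapSpider

class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    # sitemap(s) to read URLs from (placeholder URL)
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # send URLs containing '/product/' to parse_product; other URLs are ignored
    sitemap_rules = [('/product/', 'parse_product')]

    def parse_product(self, response):
        # parse an individual product page here
        pass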
