Python crawler programming framework Scrapy Getting Started Tutorial

Source: Internet
Author: User
Tags: virtualenv
One of Python's great strengths is how easy it makes writing web crawlers, and the hugely popular Scrapy is a powerful framework for programming crawlers in Python. Here is a getting-started tutorial for it.

1. about Scrapy
Scrapy is an application framework written to crawl websites and extract structured data. It can be used for a range of purposes, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous network library to process network communication. The overall architecture is roughly as follows:

Scrapy mainly includes the following components:

(1) Engine (Scrapy Engine): handles the data flow of the whole system and triggers events (the core of the framework).

(2) Scheduler: accepts requests sent by the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the URLs of the web pages to be crawled); it decides which URL to crawl next and removes duplicate URLs.

(3) Downloader: downloads web page content and returns it to the spiders (Scrapy's downloader is built on Twisted's efficient asynchronous model).

(4) Spiders: used to extract the information you need from specific web pages, that is, the so-called entities (Items). You can also extract links from them so that Scrapy continues to crawl the next pages.

(5) Item Pipeline: processes the entities that the spiders extract from web pages. Its main jobs are persisting entities, validating them, and cleaning out unwanted information. After a page is parsed by a spider, the items are sent to the pipeline and processed by its components in a specific order.

(6) Downloader Middlewares: a framework hooked in between the Scrapy engine and the downloader; it mainly processes the requests and responses passed between them.

(7) Spider Middlewares: a framework hooked in between the Scrapy engine and the spiders; it mainly processes the spiders' response input and request output.

(8) Scheduler Middlewares: middleware between the Scrapy engine and the scheduler; it processes the requests and responses passed between them.

The running process of Scrapy is roughly as follows:

First, the engine takes a URL from the scheduler for crawling.
The engine wraps the URL in a Request and sends it to the downloader; the downloader fetches the resource and wraps it in a Response.
The spider then parses the Response.
If an entity (Item) is parsed out, it is handed to the item pipeline for further processing.
If a URL is parsed out, it is handed back to the scheduler to be crawled.

2. install Scrapy
Run the following command:

sudo pip install virtualenv    # install the virtual environment tool
virtualenv ENV                 # create a virtual environment directory
source ./ENV/bin/activate      # activate the virtual environment
pip install Scrapy
pip list                       # verify that the installation succeeded

# The output should look like the following:
cffi (0.8.6)
cryptography (0.6.1)
cssselect (0.9.1)
lxml (3.4.1)
pip (1.5.6)
pycparser (2.10)
pyOpenSSL (0.14)
queuelib (1.2.2)
Scrapy (0.24.4)
setuptools (3.6)
six (1.8.0)
Twisted (14.0.2)
w3lib (1.10.0)
wsgiref (0.1.2)
zope.interface (4.1.1)
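As an extra, optional check (not part of the original steps), you can also ask the freshly installed command-line tool for its version, which should print something like:

$ scrapy version
Scrapy 0.24.4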

For more on working with virtual environments, see my blog.

3. Scrapy Tutorial
Before starting to scrape, you need to create a new Scrapy project. Change into a directory where you want to keep your code and run:

$ scrapy startproject tutorial

This command creates a new directory named tutorial in the current directory. Its structure is as follows:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

These files are mainly:

(1) scrapy.cfg: the project configuration file
(2) tutorial/: the project's Python module; you will add your code here
(3) tutorial/items.py: the project's items file
(4) tutorial/pipelines.py: the project's pipelines file
(5) tutorial/settings.py: the project's settings file
(6) tutorial/spiders/: the directory where spiders are placed

3.1. define Item
Items are the containers that will be loaded with the scraped data. They work like Python dictionaries but offer extra protection, such as guarding against populating undeclared fields, which prevents typos.

An Item is declared by creating a class that subclasses scrapy.Item and defining its attributes as scrapy.Field objects.
We model the item on the data we want to capture from dmoz.org: to obtain the site name, URL and description, we define fields for these three attributes. Edit the items.py file in the tutorial directory:

from scrapy.item import Item, Field

class DmozItem(Item):
    # define the fields for your item here like:
    name = Field()
    description = Field()
    url = Field()

3.2. write a Spider
A Spider is a user-written class used to scrape information from a domain (or group of domains). It defines the initial list of URLs to download, how to follow links, and how to parse page content to extract Items.

To create a Spider, subclass scrapy.Spider and define three main, mandatory attributes:

name: identifies the spider. It must be unique; you have to give different names to different spiders.
start_urls: the list of URLs the spider starts crawling from. The first pages downloaded are these, and subsequent URLs are extracted from the data in the responses. We can use regular expressions to define and filter which links to follow up.
parse(): a method of the spider. When called, the Response object generated for each initial URL is passed to it as its only argument. The method parses the returned data (the response), extracts the scraped data (generating Items), and generates Request objects for the URLs that should be processed next.
Create dmoz_spider.py in the tutorial/spiders directory:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)

3.3. crawling
The current project structure:

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── dmoz_spider.py

Go to the project root directory and run the following command:

$ scrapy crawl dmoz

Running result:

2014-12-15 09:30:59+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: tutorial)
2014-12-15 09:30:59+0800 [scrapy] INFO: Optional features available: ssl, http11
2014-12-15 09:30:59+0800 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled downloader middlewares: ... UserAgentMiddleware, RetryMiddleware, ... HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ... DownloaderStats
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, ... DepthMiddleware
2014-12-15 09:30:59+0800 [scrapy] INFO: Enabled item pipelines:
2014-12-15 09:30:59+0800 [dmoz] INFO: Spider opened
2014-12-15 09:30:59+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-15 09:30:59+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-15 09:31:00+0800 [dmoz] DEBUG: Crawled (200) ...

3.4. extract Items
3.4.1. Introduction to Selector
There are many ways to extract data from web pages. Scrapy uses a mechanism based on XPath and CSS expressions: Scrapy Selectors.

Some examples of XPath expressions and their meanings:

  • /html/head/title: selects the <title> element inside the <head> of the HTML document
  • /html/head/title/text(): selects the text inside that <title> element
  • //td: selects all <td> elements
  • //p[@class="mine"]: selects all <p> elements with the attribute class="mine"

For more powerful features, see the XPath tutorial.

To make working with XPath easier, Scrapy provides a Selector class with four basic methods:

  • xpath(): returns a list of selectors, each representing a node selected by the given XPath expression.
  • css(): returns a list of selectors, each representing a node selected by the given CSS expression.
  • extract(): serializes the selected nodes and returns them as a list of unicode strings.
  • re(): returns a list of unicode strings extracted by applying the regular expression given as a parameter.
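These methods are easiest to try out in Scrapy's interactive shell. A minimal sketch (the URL is the Books page used throughout this tutorial; output is omitted because it depends on the live page):

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

# inside the shell, the response and a ready-made selector named sel are available:
>>> sel.xpath('//title/text()').extract()      # list of unicode strings
>>> sel.css('title::text').extract()           # the same selection via CSS
>>> sel.xpath('//title/text()').re('(\w+)')    # apply a regular expression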

3.4.2. retrieve data

  • First, use the Chrome developer tools to view the page source and the format of the data you need to retrieve (reading the whole source this way is cumbersome); the simpler way is to right-click the element you are interested in and inspect it to see the relevant source directly.

After viewing the source, you will find that the website information we want is inside a <ul> element (the one with class="directory-url"), one <li> entry per site, for example:

  • Core Python Programming - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]
  • ... (remaining entries omitted) ...

Then we can extract the data with expressions like the following:

# select all the <li> elements
sel.xpath('//ul/li')
# the website descriptions
sel.xpath('//ul/li/text()').extract()
# the website titles
sel.xpath('//ul/li/a/text()').extract()
# the website links
sel.xpath('//ul/li/a/@href').extract()

As mentioned above, each xpath() call returns a list of selectors, so we can chain xpath() calls to dig into deeper nodes and use them like this:

for sel in response.xpath('//ul/li'):
    title = sel.xpath('a/text()').extract()
    link = sel.xpath('a/@href').extract()
    desc = sel.xpath('text()').extract()
    print title, link, desc

Modify the existing spider file accordingly:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print title, link, desc

3.4.3. use Item
Item objects are custom Python dictionaries; you can use standard dictionary syntax to get the value of each field (the fields are the attributes we assigned with Field earlier):

>>> item = DmozItem()
>>> item['name'] = 'Example site'
>>> item['name']
'Example site'
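Because Item behaves like a dict, the usual dictionary operations also work. A small illustration continuing the session above (the values are only examples):

>>> item.get('description')     # a declared but unset field: returns None
>>> 'name' in item              # True, because the field has been set
True
>>> dict(item)                  # convert to a plain dict, e.g. for serialization
{'name': 'Example site'}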

Generally, a Spider returns the crawled data as Item objects. Modify the spider class so that it stores the data in Items; the code is as follows:

from scrapy.spider import Spider
from scrapy.selector import Selector
from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)
        return items

3.5. use the Item Pipeline
After an Item is collected in a Spider, it is passed to the Item Pipeline, where several components process it in order.
Each item pipeline component (sometimes simply called an item pipeline) is a Python class that implements a simple method. It receives an Item, performs some action on it, and decides whether the Item continues through the pipeline or is dropped and no longer processed.
Typical uses of the item pipeline include:

  • Cleaning up HTML data
  • Validating the crawled data (checking that items contain certain fields)
  • Checking for duplicates (and dropping them)
  • Storing the crawled result, for example in a database or in an XML or JSON file

Writing your own item pipeline is easy. Each item pipeline component is an independent Python class and must implement the following methods:

(1) process_item(self, item, spider): called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception; dropped items are not processed by subsequent pipeline components.

Parameters:

item: the Item object returned by the parse method

spider: the Spider object that scraped this Item

(2) open_spider(self, spider): called when the spider is opened.

Parameters:

spider: the Spider object that was opened

(3) close_spider(self, spider): called when the spider is closed; cleanup after the crawl can be done here.

Parameters:

spider: the Spider object that was closed

As an example, write a pipeline that drops items whose description contains forbidden words:

from scrapy.exceptions import DropItem

class TutorialPipeline(object):
    # the words to filter on, all lowercase
    words_to_filter = ['politics', 'religion']

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        else:
            # the for/else branch runs when no forbidden word was found
            return item

To activate the item pipeline, set ITEM_PIPELINES in settings.py (it is empty by default):

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 1}
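A pipeline does not have to filter; it can also persist items (the "store the crawled result" use case from the list above). The sketch below is modeled on the JSON-writer example in the Scrapy documentation; the class name JsonWriterPipeline and the output file items.jl are illustrative, and it would be activated through ITEM_PIPELINES in exactly the same way:

import json

class JsonWriterPipeline(object):
    # illustrative pipeline: write every scraped item as one line of JSON

    def open_spider(self, spider):
        # called when the spider is opened: open the output file
        self.file = open('items.jl', 'wb')

    def close_spider(self, spider):
        # called when the spider is closed: release the file handle
        self.file.close()

    def process_item(self, item, spider):
        # serialize the item, append it to the file, and pass the item on unchanged
        self.file.write(json.dumps(dict(item)) + "\n")
        return item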

3.6. store data
Use the following command to store the crawled data in JSON format:

$ scrapy crawl dmoz -o items.json
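The feed exporter handles other formats the same way; a quick sketch, using -t to pick the export format explicitly:

$ scrapy crawl dmoz -o items.jl -t jsonlines
$ scrapy crawl dmoz -o items.csv -t csv
$ scrapy crawl dmoz -o items.xml -t xml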

4. example
4.1 the simplest spider (the default Spider)
It builds Request objects from the URLs in its start_urls attribute.
The framework executes the requests.
The Response object returned by each request is passed to the parse method for analysis.

Simplified source code:

class Spider(object_ref):
    """Base class for scrapy spiders. All spiders must inherit from this
    class.
    """

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, 'name', None):
            raise ValueError("%s must have a name" % type(self).__name__)
        self.__dict__.update(kwargs)
        if not hasattr(self, 'start_urls'):
            self.start_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        return Request(url, dont_filter=True)

    def parse(self, response):
        raise NotImplementedError


BaseSpider = create_deprecated_class('BaseSpider', Spider)

Example of a callback function returning multiple Requests:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        sel = scrapy.Selector(response)
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

To construct a Request object, you generally only need two parameters: a URL and a callback.

4.2 CrawlSpider
Usually we need to decide in the spider which pages should be followed up and which should not. CrawlSpider gives us a useful abstraction, Rule, that makes this kind of crawling easy: you simply tell Scrapy in the rules what needs to be followed.
Recall the spider we used to crawl the mininova website:

# imports for the Scrapy 0.24-era CrawlSpider and LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
# TorrentItem is assumed to be defined in the project's items module

class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/yesterday']
    rules = [Rule(LinkExtractor(allow=['/tor/\d+']), 'parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//p[@id='description']").extract()
        torrent['size'] = response.xpath("//p[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

In the code above, the rule means: responses whose URL matches /tor/\d+ are handed to parse_torrent for processing, and the links on those responses are not followed any further.
There is also an example in the official documentation:

rules = (
    # extract links matching 'category.php' (but not 'subsection.php')
    # and follow them (no callback means follow defaults to True)
    Rule(LinkExtractor(allow=('category\.php',), deny=('subsection\.php',))),

    # extract links matching 'item.php' and parse them with the spider's parse_item method
    Rule(LinkExtractor(allow=('item\.php',)), callback='parse_item'),
)

Besides Spider and CrawlSpider, Scrapy also provides XMLFeedSpider, CSVFeedSpider, and SitemapSpider.
