A simple example of writing a web crawler using the Python Scrapy framework

Tags: xpath, python, scrapy
In this tutorial, we assume that you have already installed Scrapy. If you have not, you can refer to the installation guide.

We will use the Open Directory Project (DMOZ) as the example site to crawl.

This tutorial will walk you through the following tasks:

    • Create a new Scrapy project
    • Define the items you will extract
    • Write a spider to crawl a site and extract items
    • Write an item pipeline to store the extracted items

Scrapy is written in Python. If Python is new to you, you may want to get a feel for the language first in order to get the most out of Scrapy. If you are already familiar with other similar languages and want to learn Python quickly, we recommend an in-depth Python resource. If you are new to programming and want to start with Python, try a list of Python resources for non-programmers.

Create a project

Before you start crawling, you must create a new Scrapy project. Go into the directory where you want to store your code and run the following command:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:


tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Here is a brief overview of these files:

    • scrapy.cfg: the project configuration file.
    • tutorial/: the project's Python module; you will import your code from here later.
    • tutorial/items.py: the project's items file.
    • tutorial/pipelines.py: the project's pipelines file.
    • tutorial/settings.py: the project's settings file.
    • tutorial/spiders/: the directory where you will put your spiders.


Define our item

Items are containers for the data we scrape. They work like simple Python dictionaries, but provide extra protection, such as raising an error when you assign to an undeclared field, which guards against typos.

They are declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (if you are unfamiliar with ORMs, you will see that this is an easy task).

We want to model the data we will scrape from dmoz.org: the name, URL, and description of each site, so we define fields for these three attributes. To do that, we edit the items.py file found in the tutorial directory. Our Item class looks like this:


from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem complicated at first, but defining the item lets other Scrapy components know what your item contains.
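As a quick illustration of the protection mentioned above (a minimal sketch; the exact exception text may vary between Scrapy versions), assigning to a field that was never declared raises an error instead of silently creating a new key:

>>> item = DmozItem()
>>> item['title'] = 'Example'          # 'title' is a declared field, so this works
>>> item['publisher'] = 'Unknown'      # 'publisher' was never declared as a Field
Traceback (most recent call last):
    ...
KeyError: 'DmozItem does not support field: publisher'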

Our first spider

Spiders are user-written classes used to scrape information from a site (or a group of sites).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of pages to extract items. To create a spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

name: identifies the spider. It must be unique; that is, you cannot use the same name for different spiders.

start_urls: the list of URLs the spider will begin crawling from. The first pages downloaded will be these, and subsequent URLs will be generated from the data contained in them.

parse(): a method of the spider that is called with the Response object downloaded from each URL as its only parameter.

This method is responsible for parsing the response data, extracting the scraped data (as Item objects), and following more URLs (as Request objects).

This is the code for our first spider. Save it as dmoz_spider.py in the tutorial/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Crawling

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider on the dmoz.org site. You will get output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider closed (finished)

Note the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting ones, they have no referrers, which is why the end of each of those lines shows (referer: <None>).

More interestingly, because of our parse method, two files were created, Books and Resources, containing the body of the two URLs.

What just happened?

Scrapy creates a scrapy.http.Request object for each URL in start_urls and assigns the spider's parse method as their callback function.

These requests are scheduled and executed, and a scrapy.http.Response object is returned for each of them and fed back to the spider through the parse() method.
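As a rough sketch of that behaviour (an illustration written against the older scrapy.http.Request API, not the framework's actual source code), the default spider behaves roughly as if it did the following:

from scrapy.http import Request

class BaseSpiderSketch(object):
    # Simplified illustration of how start_urls turn into requests.
    start_urls = []

    def start_requests(self):
        # One Request per start URL, with parse() registered as the callback
        # that will receive the downloaded Response.
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Subclasses override this to extract items and follow more links.
        raise NotImplementedError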

Extract Items
Selector Introduction

There are several ways to extract data from web pages. Scrapy uses XPath expressions, usually called XPath selectors. If you want to learn more about selectors and the mechanics of extracting data, see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

    • /html/head/title: selects the <title> element inside the <head> element of the HTML document
    • /html/head/title/text(): selects the text inside the <title> element
    • //td: selects all <td> elements
    • //div[@class="mine"]: selects all <div> elements with a class attribute of "mine"

These are just a few simple examples; XPath expressions are in fact far more powerful. If you want to learn more about XPath, we recommend this XPath tutorial.

To work with XPaths, Scrapy provides an XPathSelector class, which comes in two flavors: HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). To use them, you instantiate them with a Response object.

You can think of selectors as objects that represent nodes in the document structure. The first selector you instantiate is therefore associated with the root node, that is, with the entire document.

Selectors have three methods (you can see the full API documentation by clicking on each method):

    • select(): returns a list of selectors, each representing a node selected by the given XPath expression.
    • extract(): returns a Unicode string with the data selected by the XPath selector.
    • re(): returns a list of Unicode strings extracted by applying the regular expression given as an argument.

Using selectors in the shell

To illustrate the use of selectors we will use the Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To use the shell, go to the project's top-level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

The shell will display output like this:

[ ... Scrapy log here ... ]
[s] Available Scrapy objects:
[s] 2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()            Print this help
[s]   fetch(req_or_url)  Fetch a new request or URL and update shell objects
[s]   view(response)     View the response in a browser
In [1]:

After the shell loads, you will have the response available as a local variable, so if you type response.body you will see the body of the response, and if you type response.headers you will see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable). Let's try them out:

In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]

In [2]: hxs.select('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: hxs.select('//title/text()')
Out[3]: [<HtmlXPathSelector (text) xpath=//title/text()>]

In [4]: hxs.select('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: hxs.select('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Extracting the data

Now let's try to extract some real information from these pages.

You could type response.body in the console and inspect the source code to work out the XPaths you need. However, inspecting raw HTML is very tedious. To make the task easier, you can use a Firefox extension such as Firebug. For more information, see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you will find that the site information is contained inside a <ul> element, in fact the second <ul> element on the page.

So we can select each <li> element belonging to the sites list with this code:

hxs.select('//ul/li')

The site descriptions can be extracted with:

hxs.select('//ul/li/text()').extract()

The site titles:

hxs.select('//ul/li/a/text()').extract()

And the site links:

hxs.select('//ul/li/a/@href').extract()

As mentioned earlier, each select() call returns a list of selectors, so we can chain select() calls to dig into deeper nodes. We will use that here:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc


Note: if you want to learn more about nested selectors, see the Nesting selectors and Working with relative XPaths sections of the selectors documentation.
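The point of relative XPaths, briefly (a small sketch reusing the hxs selector from the shell session above; the variable names are just for illustration): inside the loop, an expression that starts with // searches the whole document again, while a relative expression only searches within the current <li> node:

sites = hxs.select('//ul/li')
for site in sites:
    # Relative to the current <li> node: only the links inside this entry.
    links_here = site.select('a/@href').extract()
    # Starts with //, so it ignores the current node and scans the whole page.
    all_links = site.select('//a/@href').extract()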
Let's add this code to our spider:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

Now crawl dmoz.org again and you will see the sites printed in the output. Run the command:

scrapy crawl dmoz

Using our item

Item objects are custom Python dictionaries; you can access the values of their fields (the attributes we defined earlier) using the standard dictionary syntax:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

Spiders are expected to return the data they scrape inside Item objects. So, to return the data we have scraped so far, the final spider code looks like this:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
Note: you can find a fully functional version of this spider in the dirbot project, available at https://github.com/scrapy/dirbot

Now crawling dmoz.org again yields scraped items in the log:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text Processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML Processing with Python']}


Storing the scraped data

The simplest way to store the scraped data is to use a Feed Export, with the following command:


scrapy crawl dmoz -o items.json -t json

This will generate an items.json file containing all the scraped items, serialized as JSON.
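For example, once the crawl has produced items.json in the project directory, you can read it back with Python's standard json module (a minimal sketch, independent of Scrapy):

import json

# items.json holds a JSON array of objects with the fields title, link and desc.
with open('items.json') as f:
    items = json.load(f)

for item in items:
    print item['title'], item['link']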

In small projects (like the one in this tutorial) that should be enough. However, if you want to perform more complex processing of the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was created along with the project, in tutorial/pipelines.py. If you only want to store the scraped items, you do not need to implement any item pipeline.
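As an illustration only, and not part of the original tutorial, a minimal item pipeline in tutorial/pipelines.py could validate each item before it is stored; the class name TutorialPipeline and the missing-link check below are assumptions made for this sketch, and the pipeline would still need to be enabled through the ITEM_PIPELINES setting in settings.py:

from scrapy.exceptions import DropItem

class TutorialPipeline(object):
    """Sketch of a pipeline that drops items scraped without a link."""

    def process_item(self, item, spider):
        if not item.get('link'):
            # Discard incomplete items instead of exporting them.
            raise DropItem("Missing link in %s" % item)
        return item

In older Scrapy versions ITEM_PIPELINES is a list of class paths (for example ['tutorial.pipelines.TutorialPipeline']); newer versions use a dict mapping class paths to priorities.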
