A simple example of writing a web crawler using the Python Scrapy framework


In this tutorial, we assume that you have already installed Scrapy. If you have not, you can refer to the installation guide.

We will use the Open Directory Project (DMOZ) as the example site to crawl.

This tutorial will walk you through the following tasks:

    • Create a new Scrapy project
    • Define the item that you will extract
    • Write a spider to crawl the site and extract items.
    • Write an item pipeline to store the extracted items

Scrapy is written in Python. If you are new to Python, you may want to get a feel for the language first and learn how to get the most out of it. If you are already familiar with other similar languages and want to learn Python quickly, we recommend an in-depth Python tutorial. If you are new to programming and want to start with Python, have a look at the list of Python resources for non-programmers.

Create a project

Before you start crawling, you have to create a new Scrapy project. Go into the directory where you want to store your code and run the following command:

scrapy startproject tutorial

This will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Here is some basic information about these files:

    • scrapy.cfg: the project's configuration file.
    • tutorial/: the project's Python module; you will later add your code here.
    • tutorial/items.py: the project's items file.
    • tutorial/pipelines.py: the project's pipelines file.
    • tutorial/settings.py: the project's settings file.
    • tutorial/spiders/: the directory where you will put your spiders.


Define our item

Items are containers that hold the data we crawl. They work like simple Python dictionaries, but provide extra protection, such as raising an error when you try to fill a field that was never defined.

Items are declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (if you are unfamiliar with ORMs, don't worry, you will see that this is a simple task).

We need to model the item so that it can hold the data we obtain from dmoz.org; since we are going to capture each site's name, URL and description, we define fields for these three properties. To do that, we edit the items.py file in the tutorial directory. Our Item class looks like this:


from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may look complicated at first, but defining these fields lets other Scrapy components know what your item contains.
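
For example, the protection mentioned above means that, unlike a plain dict, an Item rejects fields that were never declared. Here is a minimal sketch, assuming the DmozItem defined above and run from the project directory so that tutorial.items is importable:

from tutorial.items import DmozItem

item = DmozItem()
item['title'] = 'Example title'   # 'title' was declared in DmozItem, so this works
print item['title']               # prints: Example title

item['category'] = 'Books'        # 'category' was never declared, so Scrapy raises
                                  # a KeyError here instead of silently accepting
                                  # what might be a typo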

Our first spider

Spiders are user-written classes used to scrape information from a website (or a group of websites).
They define an initial list of URLs to download, how to follow links, and how to parse the contents of pages to extract items. To create a spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes.

name: the spider's identifier. It must be unique; that is, you cannot give the same name to different spiders.

start_urls: the list of URLs the spider will start crawling from. The first pages downloaded will be those listed here; the other URLs will be generated from data found in these starting pages.

parse(): a method of the spider which is called with the Response object downloaded from each URL. The response is the method's only argument.

This method is responsible for parsing the response data, extracting the scraped data (as Item objects) and finding more URLs to follow (as Request objects).

This is the code for our first spider; save it as dmoz_spider.py in the tutorial/spiders directory:

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # name the output file after the last path segment of the URL
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
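
To see where the filenames come from, here is the URL split from parse() applied by hand to the first start URL (plain Python, nothing Scrapy-specific):

>>> "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/".split("/")[-2]
'Books'

The second start URL yields 'Resources' in the same way, which is why the crawl below leaves behind files named Books and Resources.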

Crawling

To put our spider to work, go to the project's top-level directory and run:

scrapy crawl dmoz

The crawl dmoz command runs the spider against the dmoz.org website. You will get output similar to this:

2008-08-20 03:51:13-0300 [scrapy] INFO: Started project: dmoz
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled extensions: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled downloader middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled spider middlewares: ...
2008-08-20 03:51:13-0300 [tutorial] INFO: Enabled item pipelines: ...
2008-08-20 03:51:14-0300 [dmoz] INFO: Spider opened
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: <None>)
2008-08-20 03:51:14-0300 [dmoz] DEBUG: Crawled <http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: <None>)

Pay attention to the lines containing [dmoz], which correspond to our spider. You can see a log line for each URL defined in start_urls. Because these URLs are the starting pages, they have no referrers, so at the end of each line you see (referer: <None>).

More interestingly, two files have been created by our parse method, Books and Resources, containing the contents of the two URLs.

What just happened?

Scrapy creates a scrapy.http.Request object for each URL in start_urls and assigns the spider's parse method as the callback function.

The requests are scheduled, then executed, and the resulting scrapy.http.Response objects are fed back to the spider through the parse() method.
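
Roughly speaking, that default behaviour is equivalent to the sketch below. The spider class and name here are invented for illustration; Request, BaseSpider and the start_requests hook are real Scrapy APIs, but you do not need to override start_requests for this tutorial.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class IllustrationSpider(BaseSpider):
    name = "illustration"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def start_requests(self):
        # what BaseSpider does by default: one Request per start URL, with
        # parse() registered as the callback that will receive the Response
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass  # the downloaded Response for each Request ends up here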

Extract Items
Introduction to Selectors

There are several ways to extract data from web pages. Scrapy uses XPath expressions, through a mechanism called XPath selectors. If you want to learn more about selectors and how to extract data with them, see the XPath selectors documentation.

Here are some examples of XPath expressions and their meanings:

    • /html/head/title: selects the <title> element inside the <head> element of the HTML document
    • /html/head/title/text(): selects the text inside that <title> element
    • //td: selects all the <td> elements
    • //div[@class="mine"]: selects all <div> elements whose class attribute is "mine"

These are just a few simple examples of what you can do with XPath; XPath expressions are actually much more powerful. If you want to learn more, we recommend this XPath tutorial.

To work with XPaths, Scrapy provides an XPathSelector class, which comes in two flavours: HtmlXPathSelector (for HTML data) and XmlXPathSelector (for XML data). To use them, you instantiate one with a Response object.

You can think of selectors as objects that represent nodes in the document structure. The first selector you instantiate is associated with the root node, that is, the entire document.

Selectors have three methods (see the full API documentation for details):

    • select(): returns a list of selectors, each representing a node selected by the XPath expression given as an argument.
    • extract(): returns a unicode string with the data selected by the XPath selector.
    • re(): returns a list of unicode strings extracted by applying the regular expression given as an argument. (A short sketch using all three methods follows this list.)
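
Here is a small self-contained sketch that exercises all three methods. The HTML snippet and URL are invented purely for illustration; HtmlResponse and HtmlXPathSelector are the same classes used later in this tutorial.

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# an invented page, just to have something to select against
body = '<html><body><ul><li><a href="http://example.com/">Example site - books</a></li></ul></body></html>'
response = HtmlResponse(url='http://example.com/', body=body)

hxs = HtmlXPathSelector(response)
links = hxs.select('//ul/li/a')               # select(): a list of selectors
print links.select('text()').extract()        # extract(): [u'Example site - books']
print links.select('text()').re(r'(\w+)$')    # re(): [u'books']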


Using selectors inside the shell

To demonstrate the use of selectors more interactively, we will use the Scrapy shell, which also requires IPython (an extended Python console) to be installed on your system.

To start the shell, go to the project's top-level directory and run:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

The shell will display output similar to the following:

[ ... Scrapy log here ... ]

[s] Available Scrapy objects:
[s]   2010-08-19 21:45:59-0300 [default] INFO: Spider closed (finished)
[s]   hxs        <HtmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s]   item       Item()
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   spider     <BaseSpider 'default' at 0x1b6c2d0>
[s]   xxs        <XmlXPathSelector (http://www.dmoz.org/Computers/Programming/Languages/Python/Books/) xpath=None>
[s] Useful shortcuts:
[s]   shelp()              Print this help
[s]   fetch(req_or_url)    Fetch a new request or URL and update shell objects
[s]   view(response)       View response in a browser

In [1]:

Once the shell has loaded, the response is available as a local variable, so if you type response.body you will see the body of the response, and if you type response.headers you will see its headers.

The shell also instantiates two selectors, one for HTML (in the hxs variable) and one for XML (in the xxs variable). Let's try them out:

In [1]: hxs.select('//title')
Out[1]: [<HtmlXPathSelector (title) xpath=//title>]
 

Extracting the data

Now let's try to extract some real information from these pages.

You could type response.body in the console and inspect the source code to work out the XPaths you need, but checking raw HTML by hand is very tedious. To make the job easier, you can use a Firefox extension such as Firebug. For more information, see Using Firebug for scraping and Using Firefox for scraping.

After inspecting the page source, you will find that the site information is placed inside a <ul> element, the second <ul> element to be exact.

So we select each <li> element using the following code:

hxs.select('//ul/li')

The site descriptions can be extracted with:

hxs.select('//ul/li/text()').extract()

The site titles:

hxs.select('//ul/li/a/text()').extract()

And the site links:

hxs.select('//ul/li/a/@href').extract()

As mentioned earlier, each select() call returns a list of selectors, so we can chain select() calls to dig into deeper nodes. We are going to use that here:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc




Note: if you want to learn more about nested selectors, refer to the Nesting selectors and Working with relative XPaths sections of the XPath selectors documentation.
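
The point about relative XPaths is worth a quick sketch: inside a loop like the one above, an expression that starts with // ignores the node you are iterating over, while a relative expression is evaluated against it. This mirrors the warning in the Scrapy selector documentation; the snippet assumes the sites variable from the loop above.

for site in sites:
    site.select('.//a/text()')   # relative: only the links inside this <li>
    site.select('//a/text()')    # absolute: every link in the whole document,
                                 # regardless of which <li> we are looking at
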
Let's add this code to our spider:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            print title, link, desc

Now if we crawl dmoz.org again, you will see the sites printed in the output. Run the command:

scrapy crawl dmoz

Use our item

Item objects are custom Python dictionaries; you can access their fields (the attributes we defined earlier) with the standard dictionary syntax:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

Spiders are expected to return the scraped data in Item objects. So, to return the data we have crawled, the final spider code looks like this:


from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
Note: you can find a fully working version of this spider in the dirbot project, available at https://github.com/scrapy/dirbot

Now crawling dmoz.org produces DmozItem objects:

[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u'- by David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [Author website, Gnosis Software, Inc.\n'],
      'link': [u'http://gnosis.cx/TPiP/'],
      'title': [u'Text processing in Python']}
[dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
     {'desc': [u'- by Sean McGrath; Prentice Hall PTR, Watts, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, the new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
      'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
      'title': [u'XML processing with Python']}


To store crawled data

The simplest way to store the crawled data is to use Feed exports, with the following command:


scrapy crawl dmoz -o items.json -t json

That will produce an items.json file containing all the crawled items, serialized as JSON.
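
Feed exports support other serialization formats as well; for example, assuming the same old-style -o/-t options used above, the items can be written to CSV instead:

scrapy crawl dmoz -o items.csv -t csv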

In small projects (like the one in this tutorial), that is usually enough. However, if you want to do more complex things with the crawled items, you can write an Item Pipeline. A placeholder file for item pipelines was created along with the project, at tutorial/pipelines.py. If you only want to store the crawled items, you do not need to implement any pipeline.
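
To give a rough idea of what such a pipeline looks like, here is a minimal sketch for tutorial/pipelines.py. The class name and the clean-up it performs are invented for illustration; only the process_item hook itself is the real pipeline interface.

class StripDescriptionPipeline(object):
    """An invented example pipeline that tidies up the scraped descriptions."""

    def process_item(self, item, spider):
        # called once for every item the spider returns; whatever is
        # returned here is handed on to the next pipeline, if any
        if 'desc' in item:
            item['desc'] = [d.strip() for d in item['desc']]
        return item

To make it actually run, you would also have to list the class in the ITEM_PIPELINES setting in tutorial/settings.py (the exact format of that setting depends on your Scrapy version).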
