Python's scrapy Getting Started tutorial


Before reading this article, I assume you have already learned Python; what follows introduces a Python framework that builds on that knowledge.

In this introductory tutorial, we assume that you have already installed Scrapy. If you have not, please refer to the installation guide.

We will use the Open Directory Project (DMOZ) as an example of crawling.

This introductory tutorial will guide you through the following tasks:

    1. Create a new Scrapy project
    2. Define the Items you will extract
    3. Write a spider to crawl the site and extract Items
    4. Write an Item Pipeline to store the extracted Items

Scrapy is written in Python. If you are new to Python, you may want to start by learning Python itself so you can get the most out of Scrapy. If you are already familiar with other programming languages and want to pick up Python quickly, Dive Into Python is recommended. If you are new to programming and want to start with Python, take a look at the list of Python resources for non-programmers.

New Project

Before crawling, you need to create a new Scrapy project. Go to a directory where you want to save the code, and then execute:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

T:\>scrapy startproject tutorial
T:\>

This command creates a new directory named tutorial under the current directory, structured as follows:

T:\tutorial>tree /f
Folder PATH listing
Volume serial number is 0006EFCF C86A:7C52
T:.
│  scrapy.cfg
│
└─tutorial
    │  items.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

These files are, briefly:

    • scrapy.cfg: the project configuration file
    • tutorial/: the project's Python module; the code you write will be imported from here
    • tutorial/items.py: the project's Items file
    • tutorial/pipelines.py: the project's pipelines file
    • tutorial/settings.py: the project's settings file
    • tutorial/spiders/: the directory where spiders are placed

Defining an Item

Items are the containers that will hold the crawled data. They work like Python dictionaries, but provide extra protection, such as raising an error when you fill in an undeclared field, which guards against typos.

Items are declared by creating a class that inherits from scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the data we want from dmoz.org, namely the site's name, URL, and description, by defining fields for these three attributes. To do this, edit the items.py file in the tutorial directory; our Item class will look like this:

from scrapy.item import Item, Field

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem confusing at first, but defining these Items lets the other Scrapy components know what your Item looks like.
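For example (a quick interactive sketch with made-up values, assuming the DmozItem class defined above), assigning to a field that was never declared raises an error, which is the spelling protection mentioned earlier:

>>> from tutorial.items import DmozItem
>>> item = DmozItem(title='Free Python Books')
>>> item['title']              # a declared field behaves like a dict key
'Free Python Books'
>>> item['titel'] = 'oops'     # a misspelled, undeclared field is rejected
Traceback (most recent call last):
    ...
KeyError: 'DmozItem does not support field: titel'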

Our first spider

Spiders are user-written classes used to crawl information from a domain (or domain group).

They define an initial list of URLs to download, how to follow links, and how to parse the content of those pages to extract Items.

To build a spider, you must subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

    • name: the identifier of the spider. It must be unique; every spider you define must have a different name.
    • start_urls: the list of URLs the spider starts crawling from. The first pages downloaded are the ones listed here, and subsequent URLs are generated from the data contained in these starting pages.
    • parse(): a method of the spider that is called with the Response object downloaded from each URL; the response is its only parameter.

This method is responsible for parsing the returned data, extracting the scraped data (into Items), and following further URLs.

Here is the code of our first spider; save it as dmoz_spider.py in the tutorial\spiders directory.

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

Crawling

To set our spider to work, go back to the project's top-level directory and run the following command:

T:\tutorial>scrapy crawl dmoz

The crawl dmoz command starts the spider for the dmoz.org domain. You will get output similar to this:

T:\tutorial>scrapy crawl dmoz
2012-07-13 19:14:45+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: tutorial)
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-07-13 19:14:45+0800 [dmoz] INFO: Spider opened
2012-07-13 19:14:45+0800 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-13 19:14:45+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-13 19:14:46+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-07-13 19:14:46+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2012-07-13 19:14:46+0800 [dmoz] INFO: Closing spider (finished)
2012-07-13 19:14:46+0800 [dmoz] INFO: Dumping spider stats:
        {'downloader/request_bytes': 486,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 13063,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2012, 7, 13, 11, 14, 46, 703000),
         'scheduler/memory_enqueued': 2,
         'start_time': datetime.datetime(2012, 7, 13, 11, 14, 45, 500000)}
2012-07-13 19:14:46+0800 [dmoz] INFO: Spider closed (finished)
2012-07-13 19:14:46+0800 [scrapy] INFO: Dumping global stats: {}

Note the lines containing [dmoz], which correspond to our spider. There is a log line for each URL defined in start_urls. Because these URLs are the starting pages, they have no referrers, so at the end of each line you will see (referer: None).
More interestingly, thanks to our parse method, two files were created, Books and Resources, containing the content of the two URLs.

What happened?

Scrapy creates a scrapy.http.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse method as its callback function.
Each request is scheduled and then executed; the resulting scrapy.http.Response object is then fed back to the spider through the parse() method.
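To make that concrete, here is a minimal sketch, not part of the original tutorial, of what the framework does for you: building the Request objects explicitly and registering parse() as their callback. The spider name and the log message are purely illustrative.

from scrapy.http import Request
from scrapy.spider import BaseSpider

class DmozRequestSpider(BaseSpider):
    """Hypothetical spider, equivalent to simply listing the URLs in start_urls."""
    name = "dmoz_requests"   # illustrative name, not used elsewhere in this tutorial

    def start_requests(self):
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        # One Request per start URL; parse() is registered as the callback.
        return [Request(url, callback=self.parse) for url in urls]

    def parse(self, response):
        # Scrapy calls this with the scrapy.http.Response of each Request.
        self.log("Got a response from %s" % response.url)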

Extracting Items

Selector Introduction

There are several ways to extract data from web pages. Scrapy uses a mechanism called XPath selectors, based on XPath expressions. If you want to learn more about selectors and other extraction mechanisms, see http://doc.scrapy.org/topics/selectors.html#topics-selectors
Here are some examples of XPath expressions and their meanings:

    • /html/head/title: selects the <title> tag inside the <head> of the HTML document
    • /html/head/title/text(): selects the text inside the <title> element mentioned above
    • //td: selects all <td> elements
    • //div[@class="mine"]: selects all <div> elements that have a class="mine" attribute

These are just a few simple examples of what XPath can do; it is actually far more powerful. To learn more, we recommend this XPath tutorial: http://www.w3schools.com/XPath/default.asp
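As a quick, self-contained illustration (a sketch using a made-up in-memory page, not from the original tutorial), you can exercise these expressions with the selector class introduced in the next section:

from scrapy.http import HtmlResponse
from scrapy.selector import HtmlXPathSelector

# A tiny made-up page, just to try the XPath expressions above.
body = """<html><head><title>Demo page</title></head>
<body><div class="mine"><table><tr><td>cell</td></tr></table></div></body></html>"""
response = HtmlResponse(url="http://www.example.com", body=body)
hxs = HtmlXPathSelector(response)

print hxs.select('/html/head/title').extract()          # the whole <title> tag
print hxs.select('/html/head/title/text()').extract()   # [u'Demo page']
print hxs.select('//td').extract()                      # every <td> element
print hxs.select('//div[@class="mine"]').extract()      # divs with class="mine"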

To make working with XPath easier, Scrapy provides XPathSelector classes in two flavors: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data). To use them, you must instantiate them with a Response object. You will find that selector objects mirror the node structure of the document, so the first selector you instantiate is associated with the root node, that is, the entire document.
Selectors have three methods:

    • select(): returns a list of selectors, each representing the nodes selected by the XPath expression passed as an argument.
    • extract(): returns a unicode string with the data selected by the XPath selector.
    • re(): returns a list of unicode strings extracted by applying the regular expression passed as an argument (see the brief sketch below).
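The first two methods are used throughout the rest of this tutorial. As a brief sketch of the third (reusing the hypothetical hxs object from the example above):

# re() applies a regular expression to the selected data and returns the
# captured groups as a list of unicode strings.
print hxs.select('//title/text()').re(r'(\w+) (\w+)')   # [u'Demo', u'page']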

Trying selectors in the shell

To demonstrate how selectors are used, we will work with the built-in Scrapy shell, which requires IPython (an extended interactive Python console) to be installed on your system.

IPython can be downloaded here: http://pypi.python.org/pypi/ipython#downloads

To start the shell, first go to the project top-level directory and enter

T:\tutorial>scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

The output looks like this:

2012-07-16 10:58:13+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: tutorial)
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Enabled item pipelines:
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-16 10:58:13+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-16 10:58:13+0800 [dmoz] INFO: Spider opened
2012-07-16 10:58:18+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   hxs

After the shell loads, the fetched page is stored in the local variable response: type response.body to see the body of the response, or response.headers to view its headers.
The shell also instantiates two selectors: hxs, which parses the HTML, and xxs, which parses XML. Let's see what is inside:

In [1]: hxs.select('//title')
Out[1]: [

Extracting data

Now let's try to extract data from the web pages.
You can type response.body in the console and inspect the source to check whether the XPaths you plan to use match what you expect. However, poring over raw HTML is tedious. To make it easier, you can use the Firefox extension Firebug; for more information, see Using Firebug for scraping and Using Firefox for scraping.
txw1958 note: I actually use Google Chrome's Inspect Element feature, which can also copy the XPath of an element.
After inspecting the source, you will find that the data we need is inside a <ul> element, in fact the second <ul>.
We can select each <li> element belonging to a site with the following command:

hxs.select('//ul/li')

And from them, the site descriptions:

hxs.select('//ul/li/text()').extract()

Website title:

hxs.select('//ul/li/a/text()').extract()

Site Links:

hxs.select('//ul/li/a/@href').extract()

As mentioned earlier, each select() call returns a list of selectors, so we can chain select() calls to dig into deeper nodes. We will use that like this:

sites = hxs.select('//ul/li')
for site in sites:
    title = site.select('a/text()').extract()
    link = site.select('a/@href').extract()
    desc = site.select('text()').extract()
    print title, link, desc

Note
For more information on nested selectors, please read Nesting selectors and working with relative XPaths
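One pitfall worth a small illustration (not from the original tutorial): inside a loop over nested selectors, an XPath that starts with // still searches the whole document, while a relative path searches only within the current node.

# Relative vs. absolute XPaths inside nested selectors (illustrative only;
# hxs is the HtmlXPathSelector from the shell session above).
sites = hxs.select('//ul/li')
for site in sites:
    inside = site.select('a/text()')        # relative: only <a> under this <li>
    also_inside = site.select('.//a/text()')
    whole_page = site.select('//a/text()')  # absolute: every <a> in the document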

Now let's add this code to our spider:

txw1958 note: the code below differs slightly from the official tutorial; the commented-out lines (shown in green in the original post) are the official tutorial's version.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        #sites = hxs.select('//ul/li')
        for site in sites:
            title = site.select('a/text()').extract()
            link = site.select('a/@href').extract()
            desc = site.select('text()').extract()
            #print title, link, desc
            print title, link

Now crawl dmoz.org again and you will see the sites printed in the output. Run the command:

T:\tutorial>scrapy crawl dmoz

Using Items

Item objects are custom Python dictionaries. You can get the value of a field (that is, one of the attributes defined in the class earlier) using the standard dictionary syntax:
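For example (a quick interactive sketch with made-up values, using the field names defined earlier):

>>> from tutorial.items import DmozItem
>>> item = DmozItem()
>>> item['title'] = 'Example title'       # set a field
>>> item['title']                         # read it back
'Example title'
>>> item.get('desc', 'no description')    # standard dict-style access works too
'no description'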

Spiders are expected to put the data they scrape into Item objects. So, in order to return the data we have crawled, the final version of our spider should look like this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//fieldset/ul/li')
        #sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
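As a side note on style (a sketch, not the tutorial's code): parse() does not have to build up a list; it can also be written as a generator that yields one item at a time, and Scrapy handles the yielded items the same way. The method below is a drop-in replacement for the parse() method of the DmozSpider class above.

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for site in hxs.select('//fieldset/ul/li'):
            item = DmozItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            yield item   # yielded items are collected just like a returned list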

Now we crawl again:

2012-07-16 14:52:36+0800 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
        {'desc': [u'\n\t\t\t\n\t', u'\n\t\t\t\n\t\t\t\t\t\n - Free Python books and tutorials.\n \n'],
         'link': [u'http://www.techbooksforfree.com/perlpython.shtml'],
         'title': [u'Free Python Books']}
2012-07-16 14:52:36+0800 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
        {'desc': [u'\n\t\t\t\n\t', u'\n\t\t\t\n\t\t\t\t\t\n - Annotated list of free online books on Python scripting language. Topics range from beginner to advanced.\n \n'],
         'link': [u'http://www.freetechbooks.com/python-f6.html'],
         'title': [u'FreeTechBooks: Python Scripting Language']}
2012-07-16 14:52:36+0800 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2012-07-16 14:52:36+0800 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
        {'desc': [u'\n\t\t\t\n\t', u'\n\t\t\t\n\t\t\t\t\t\n - A directory of free Python and Zope hosting providers, with reviews and ratings.\n \n'],
         'link': [u'http://www.oinko.net/freepython/'],
         'title': [u'Free Python and Zope Hosting Directory']}
2012-07-16 14:52:36+0800 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
        {'desc': [u'\n\t\t\t\n\t', u'\n\t\t\t\n\t\t\t\t\t\n - Features Python books, resources, news and articles.\n \n'],
         'link': [u'http://oreilly.com/python/'],
         'title': [u"O'Reilly Python Center"]}
2012-07-16 14:52:36+0800 [dmoz] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>
        {'desc': [u'\n\t\t\t\n\t', u'\n\t\t\t\n\t\t\t\t\t\n - Resources for reporting bugs, accessing the Python source tree with CVS and taking part in the development of Python.\n\n'],
         'link': [u'http://www.python.org/dev/'],
         'title': [u"Python Developer's Guide"]}


Save the crawled data

The simplest way to save the scraped data is to use a Feed export, with the following command:

T:\tutorial>scrapy crawl dmoz -o items.json -t json

All the scraped items are saved in the newly generated items.json file, in JSON format.

For a small project like this tutorial, that is enough. However, if you want to do something more complex with the scraped items, you can write an Item Pipeline. A placeholder file for item pipelines was created along with the project, at tutorial/pipelines.py. If you only need to collect the scraped items, you do not need to implement any pipeline.
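To give a flavor of what such a pipeline looks like, here is a minimal, hypothetical sketch (the class name and output file are illustrative; it would be enabled through the ITEM_PIPELINES setting in settings.py):

# tutorial/pipelines.py (illustrative sketch)
import json

class JsonLinesPipeline(object):
    """Writes every scraped item as one JSON line to items.jl.
    Enable it with ITEM_PIPELINES = ['tutorial.pipelines.JsonLinesPipeline']."""

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item   # return the item so any later pipelines can process it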



Conclusion

This tutorial briefly describes the use of scrapy, but many other features are not mentioned.

For an understanding of the fundamental ideas, please see the Basic Concepts section of the documentation.

We recommend continuing with the example Scrapy project dirbot, from which you will learn more; it contains the DMOZ spider described in this tutorial.

The dirbot project is located at https://github.com/scrapy/dirbot

The project contains a README file that describes the contents of the project in detail.

If you are familiar with Git, you can check out the source code; otherwise you can download it as a tarball or zip file by clicking Downloads.

There is also a code-snippet sharing site with spiders, middlewares, extensions, scripts and more, called Scrapy snippets. If you have good code, remember to share it :-)

The source files will be attached and uploaded later...
