[Python] Web Crawler (12): A First Crawler Example with the Scrapy Framework (Getting Started Tutorial)

We will use dmoz.org as the target of a small crawl and pick up some techniques along the way.








First, we need to answer a question.



Q: How many steps does it take to crawl a website?



The answer is simple: four steps.



New project (Project): create a new crawler project



Define the target (Items): identify the data you want to scrape



Make the crawler (Spider): write the spider that crawls the web pages



Store the content (Pipeline): design a pipeline to store the scraped content






OK, now that the basic process is determined, the next step is to complete it.






1. New Project (Project)



In an empty directory, hold down the SHIFT key, right-click, select "Open Command Window Here", and enter the command:


scrapy startproject tutorial


Where tutorial is the project name.



You can see that a tutorial folder will be created with the following directory structure:


tutorial/  
    scrapy.cfg  
    tutorial/  
        __init__.py  
        items.py  
        pipelines.py  
        settings.py  
        spiders/  
            __init__.py  
            ...


Here's a brief look at the role of each file:


    • scrapy.cfg: the project's configuration file

    • tutorial/: the project's Python module; you will import your code from here

    • tutorial/items.py: the project's items file

    • tutorial/pipelines.py: the project's pipelines file

    • tutorial/settings.py: the project's settings file

    • tutorial/spiders/: the directory where spiders are stored





2. Define the Target (Item)



In Scrapy, an Item is a container for the scraped content. It behaves a bit like a Python dict, but provides some extra protection to reduce errors.



In general, an item is created by subclassing the scrapy.item.Item class, and its attributes are defined as scrapy.item.Field objects (you can think of this as an ORM-like mapping).



Next, we start to build the item model.



First of all, what we want is:


    • Name (name)

    • Link (URL)

    • Description (description)





Modify the items.py file in the tutorial directory and add our own class after the original one.



Since we want to capture content from the dmoz.org site, we can name it DmozItem:


# Define here the models for your scraped items  
#  
# See documentation in:  
# http://doc.scrapy.org/en/latest/topics/items.html  
  
from scrapy.item import Item, Field  
  
class TutorialItem(Item):  
    # define the fields for your item here like:  
    # name = Field()  
    pass  
  
class DmozItem(Item):  
    title = Field()  
    link = Field()  
    desc = Field()


At first this may seem a bit pointless, but defining these fields lets other Scrapy components know what your item contains.



An item can simply be understood as an encapsulated class object.
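To see the "extra protection" mentioned above in action, here is a quick interactive sketch (assuming the DmozItem class above has been imported; the 'author' field is a made-up example of a field we never declared):

>>> item = DmozItem(title='Example')
>>> item['title']
'Example'
>>> item['author'] = 'nobody'   # 'author' was never declared as a Field, so Scrapy raises a KeyError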






3. Making the Crawler (Spider)



Making a crawler takes two overall steps: first crawl, then extract.



In other words, first you fetch the entire content of the page, and then you take out the parts that are useful to you.



3.1 Crawl



Spiders are classes that users write themselves to crawl information from a domain (or group of domains).



They define a list of URLs to download, a scheme for following links, and a way to parse page content to extract items.



To build a spider, you must subclass scrapy.spider.BaseSpider and define three mandatory attributes:



name: the crawler's name. It must be unique; different crawlers must be given different names.



start_urls: the list of URLs to crawl. The crawler starts from here, so the first pages downloaded will be these URLs; further URLs are derived from these starting pages.



parse(): the parsing method. It is called with the response object returned for each URL as its only argument, and is responsible for parsing the downloaded data (extracting items) and following further URLs.






To help understand this, you can refer to the idea described in the earlier breadth-first crawler tutorial: [Java] episode 5, using the HttpClient toolkit with a breadth-first crawler.



That is, the starting URLs are stored and used as seeds from which the crawl gradually spreads out: every eligible page URL that is found is stored and crawled in turn.



Let's write the first crawler, named dmoz_spider.py, and save it in the tutorial\spiders directory.



The dmoz_spider.py code is as follows:


from scrapy.spider import Spider  
  
class DmozSpider(Spider):  
    name = "dmoz"  
    allowed_domains = ["dmoz.org"]  
    start_urls = [  
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",  
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"  
    ]  
  
    def parse(self, response):  
        filename = response.url.split("/")[-2]  
        open(filename, 'wb').write(response.body)


allowed_domains is the domain scope of the crawl, i.e. the crawler's constrained area: the crawler will only crawl pages under this domain.



As you can see from the parse function, the second-to-last segment of each URL's path is used as the file name for storage.
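A quick plain-Python check of what that split produces (using one of the start URLs):

# The trailing slash means the last element of the split is an empty string,
# so index [-2] is the final path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print url.split("/")[-2]    # prints: Books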



Then let's run it: hold down SHIFT and right-click in the tutorial directory, open a command window there, and enter:


scrapy crawl dmoz


Run results






The error:



UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)



Hitting an error on the very first Scrapy run is a bit unlucky.



It looks like an encoding problem; a quick Google search turns up a solution:






Create a new sitecustomize.py under Python's Lib\site-packages folder:


import sys
sys.setdefaultencoding('gb2312')
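A side note on this workaround: sitecustomize.py runs before Python 2 removes sys.setdefaultencoding, so the two lines above work there as-is. If you try the same trick from an ordinary script, the function has already been hidden and must be restored first, roughly like this (Python 2 only; changing the default encoding this way is a hack and can mask real bugs):

# Workaround sketch for an ordinary Python 2 script (not needed in sitecustomize.py).
import sys
reload(sys)                       # restores sys.setdefaultencoding, which site.py deletes at startup
sys.setdefaultencoding('gb2312')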


Run it again: OK, problem solved. Let's look at the results:









The last line, INFO: Closing spider (finished), indicates that the crawler ran successfully and shut itself down.



The lines containing [dmoz] correspond to the output of our crawler.



You can see that each URL defined in start_urls has its own log line.



Do you remember our start_urls?



http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources



Because these URLs are start pages, they have no referrers, so you'll see (Referer: <None>) at the end of each of their log lines.



As a result of the parse method, two files were created, Books and Resources, containing the content of the two URLs.






So what actually happened behind the scenes?



First, Scrapy creates a scrapy.http.Request object for each URL in the crawler's start_urls attribute and designates the crawler's parse method as the callback.



These requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the crawler through the parse() method.
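As a rough sketch of what that implicit behaviour would look like if written out by hand (start_requests and scrapy.http.Request are part of Scrapy's public API; the spider name and class name here are made up, and the body is abbreviated):

from scrapy.http import Request
from scrapy.spider import Spider

class RequestSketchSpider(Spider):
    name = "request_sketch"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    # Equivalent to Scrapy's default behaviour: one Request per start URL,
    # with parse() registered as the callback.
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass  # the downloaded response arrives here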






3.2 Extract



Having crawled the whole page, the next step is extraction.



Merely storing the whole web page is not enough.



In the basic crawlers we wrote earlier, this step was done with regular expressions.



Scrapy uses a mechanism called XPath selectors, based on XPath expressions.



If you want to learn more about selectors and other mechanisms, you can refer to the Scrapy selectors documentation.






Here are some examples of XPath expressions and their meanings:


/html/head/title: selects the <title> element inside the <head> of the HTML document


/html/head/title/text(): selects the text content of the <title> element above


//td: selects all <td> elements


//div[@class="mine"]: selects all <div> elements that have a class="mine" attribute



These are just a few simple examples of using XPath, but in fact XPath is very powerful.
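If you want to try these expressions yourself, they can be typed directly into the Scrapy shell introduced in section 3.3 below (sel is the selector object the shell provides; the class name "mine" is just an example):

sel.xpath('/html/head/title')          # the <title> element
sel.xpath('/html/head/title/text()')   # its text content
sel.xpath('//td')                      # every <td> in the document
sel.xpath('//div[@class="mine"]')      # <div> elements with class="mine"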



For a fuller treatment, you can refer to an XPath tutorial or reference.






To make XPath easier to use, Scrapy provides XPathSelector classes. There are two of them: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data).



They must be instantiated through a Response object.



You will find that selector objects mirror the node structure of the document, so the first selector you instantiate is associated with the root node, i.e. the whole document.
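As an illustration of that older API, here is a minimal sketch (assuming a pre-unified-Selector Scrapy release where these classes and their select() method are available; the spider name and class name are made up):

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class OldStyleSpider(BaseSpider):
    name = "oldstyle"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def parse(self, response):
        # Instantiate the selector from the response, then query it with select().
        hxs = HtmlXPathSelector(response)
        print hxs.select('//title/text()').extract()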



In Scrapy, there are four basic methods of selectors (click to view API documentation):


    • xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as argument

    • css(): returns a list of selectors, each representing a node selected by the CSS expression given as argument

    • extract(): returns the selected data as unicode strings

    • re(): returns a list of unicode strings extracted by applying the given regular expression
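A quick illustration of the four methods on the shell's sel object (assuming the dmoz Books page used in this tutorial is loaded):

sel.xpath('//title')                       # list of selectors matching the XPath expression
sel.css('title')                           # the same node selected with a CSS expression
sel.xpath('//title/text()').extract()      # the selected data as unicode strings
sel.xpath('//title/text()').re('(\w+):')   # unicode strings captured by a regular expression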





3.3 XPath Experiments



Let's try out selectors in the Scrapy shell.



Site of the experiment: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/






Now that we know our lab rat, the next step is to use the shell to fetch the page.



In cmd, go to the project's top-level directory, i.e. the outer tutorial folder, and enter:


scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/


When you enter, you can see the following content:






After the shell loads, the downloaded response is stored in the local variable response.



So if you enter response.body, you will see the body of the response, i.e. the content of the fetched page:






Or enter response.headers to view its headers:






Now it's as if we have a handful of sand that hides the gold we want; the next step is to use a sieve to shake out the impurities and pick out the key content.



Selector is such a sieve.



In older versions, the shell instantiated two selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.



Now the shell prepares a single selector object for us, sel, which automatically chooses the best parsing scheme (XML or HTML) based on the type of data returned.



Then let's get cracking!



To understand this properly, you first have to know what the fetched page looks like.



For example, say we want to extract the page's title, i.e. the <title> tag:






You can enter:


sel.xpath('//title')


The result is:






This extracts the tag, and further processing can be done with extract() and text().



Note: A simple list of useful XPath path expressions:









    • nodename: selects all child nodes of the named node

    • / : selects from the root node

    • // : selects nodes in the document, starting from the current node, that match the selection, no matter where they are

    • . : selects the current node

    • .. : selects the parent of the current node

    • @ : selects attributes









The results of the experiments are as follows; In[i] indicates the i-th input and Out[i] the corresponding output:


In [1]: sel.xpath('//title')  
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]  
  
In [2]: sel.xpath('//title').extract()  
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']  
  
In [3]: sel.xpath('//title/text()')  
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]  
  
In [4]: sel.xpath('//title/text()').extract()  
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']  
  
In [5]: sel.xpath('//title/text()').re('(\w+):')  
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']


Of course the title tag is not of much value to us; next we'll crawl something genuinely meaningful.



Using Firefox's Inspect Element, we can clearly see that what we need looks like this:






We can grab these <li> tags with the following code:


sel.xpath('//ul/li')


From the <li> tags, you can get the site descriptions like this:


sel.xpath('//ul/li/text()').extract()


You can get the title of the Web site this way:


sel.xpath('//ul/li/a/text()').extract()


You can get a hyperlink to a Web site like this:


sel.xpath('//ul/li/a/@href').extract()


Of course, the preceding examples fetch the data directly.



Notice that xpath() returns a list of selector objects,

so we can also call xpath() on the objects in this list to dig into deeper nodes

(see Nesting selectors and Working with relative XPaths in the Selectors documentation):






sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc






3.4 XPath in Practice



We've spent a long time in the shell; finally we can apply what we've learned to the dmoz_spider crawler.



In the original crawler's parse function, make the following changes:


from scrapy.spider import Spider  
from scrapy.selector import Selector  
  
class DmozSpider(Spider):  
    name = "dmoz"  
    allowed_domains = ["dmoz.org"]  
    start_urls = [  
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",  
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"  
    ]  
  
    def parse(self, response):  
        sel = Selector(response)  
        sites = sel.xpath('//ul/li')  
        for site in sites:  
            title = site.xpath('a/text()').extract()  
            link = site.xpath('a/@href').extract()  
            desc = site.xpath('text()').extract()  
            print title


Note that we imported the Selector class from scrapy.selector and instantiated a new selector object, so that we can work with XPath just as we did in the shell.



Let's try to enter a command to run the crawler (in the tutorial root directory):


scrapy crawl dmoz


The results of the operation are as follows:






Sure enough, it successfully grabbed all the titles. But something isn't quite right: why were navigation-bar entries like Top and Python also scraped?



We only need the content in the red circle:






It seems our XPath expression is a bit too broad: besides the entries we need, it also catches some innocent elements that happen to match the same XPath.



Inspecting the elements, we find that the <ul> we need has the attribute class="directory-url",



so we just change the XPath expression to sel.xpath('//ul[@class="directory-url"]/li').



Make the following adjustments to the XPath statement:


from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title


This time we grabbed all the titles and nothing else; no innocent bystanders were harmed:






3.5 Using Item



Now let's take a look at how to use item.



As we said earlier, an Item object is a custom Python dictionary; you can access its field values using standard dict syntax:


>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'


The spider should store the data it crawls in Item objects. To actually return the scraped data, the spider's final code should look like this:


from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items


4. Storing the Content (Pipeline)



The simplest way to save the scraped information is to use Feed exports. There are four main formats: JSON, JSON lines, CSV, and XML.



Here we export the results as JSON, the most commonly used format, with the following command:


scrapy crawl dmoz -o items.json -t json


-o is followed by the export file name, and -t by the export format.
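The other formats listed above work the same way; for example (the output file names here are arbitrary):

scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml
scrapy crawl dmoz -o items.jl -t jsonlines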



Then take a look at the exported results by opening the JSON file in a text editor (for easier display, all item fields except title were removed):






Because this is just a small example, this simple approach is enough.



If you want to do something more complicated with the scraped items, you can write an Item Pipeline.
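As a small taste of what that looks like, here is a minimal sketch of a pipeline that drops items missing a title (the class name is our own invention; to take effect it would also have to be enabled via ITEM_PIPELINES in settings.py, which this tutorial hasn't covered yet):

# tutorial/pipelines.py -- a sketch, not part of the steps above.
from scrapy.exceptions import DropItem

class RequireTitlePipeline(object):
    def process_item(self, item, spider):
        # Called for every item the spider returns; drop it if 'title' is empty.
        if not item.get('title'):
            raise DropItem("missing title in %s" % item)
        return item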



We'll play with that later ^_^.



That concludes [Python] Web Crawler (12): a first crawler example with the Scrapy framework.

