Writing a Python Crawler from Zero with the Scrapy Framework

Source: Internet
Author: User
A web crawler is a program that collects data from the web; here we use one to fetch the HTML of particular web pages. Although a crawler can be developed with individual libraries, using a framework greatly improves efficiency and shortens development time. Scrapy is written in Python, is lightweight and simple, and is very handy to use. With Scrapy you can collect online data very conveniently: it already does a great deal of the work for you, so there is no need to build everything yourself.

First, we need to answer a question.
Q: How many steps does it take to turn a website into a crawler?
The answer is simple, four steps:
New Project (Project): create a new crawler project
Clear Goals (Items): define the targets you want to crawl
Make the Spider (Spider): write the crawler and start crawling web pages
Store the Content (Pipeline): design a pipeline to store the crawled content

OK, now that the basic process is determined, the next step is to complete it.

1. New Project (Project)
In an empty directory, hold down the SHIFT key, right-click, select "Open Command Window Here", and enter the command:

scrapy startproject tutorial

Where tutorial is the project name.
You can see that a tutorial folder will be created with the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

Here is a brief look at the role of each file:
scrapy.cfg: the project's configuration file
tutorial/: the project's Python module; you will import your code from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory where the spiders are stored

2. Clear Goals (Item)
In Scrapy, items are the containers used to hold the crawled content; they work a bit like dictionaries in Python, but provide some extra protection against mistakes.
In general, an item is created with the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (which can be understood as an ORM-like mapping).
Next, we start to build the item model.
First of all, what we want to capture is:
Title (title)
Link (link)
Description (desc)

Modify the items.py file in the tutorial directory, adding our own class after the original one.
Because we want to capture content from the dmoz.org site, we can name it DmozItem:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

At first this may seem a bit pointless, but defining these items lets other components know what is in your item.
An item can simply be understood as an encapsulated class object.

3. Making the Crawler (Spider)

Making the crawler takes two steps overall: first crawl, then extract.
In other words, first you fetch all the content of the entire page, and then you take out the parts that are useful to you.

3.1 Crawl
A Spider is a class the user writes to crawl information from a domain (or group of domains).
It defines the list of URLs to download, a scheme for following links, and a way to parse web page content to extract items.
To build a Spider, you must subclass scrapy.spider.BaseSpider and define three mandatory attributes:
name: the name of the crawler; it must be unique, so different crawlers must be given different names.
start_urls: the list of URLs to crawl. The crawler starts fetching data from here, so the first pages downloaded will be these URLs; other URLs are then generated by following links from these starting pages.
parse(): the parsing method. When called, it is passed the Response object returned for each URL as its only argument; it is responsible for parsing the fetched data (matching it into items) and following further URLs.

Here you can refer to the idea described in the earlier breadth-first crawler tutorial to help understanding: [Java] The Chin, Episode 5: using the HttpClient toolkit and the breadth-first crawler.
That is, URLs are stored and used as starting points from which the crawl gradually spreads outward, grabbing and storing every eligible web page URL so that crawling can continue.
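To make the idea concrete, here is a tiny framework-free sketch of a breadth-first crawl. It is only an illustration of the "queue of URLs spreading outward" idea, not how Scrapy works internally; fetch_links is a hypothetical function you would supply that fetches a page and returns the links found on it.

from collections import deque

def breadth_first_crawl(start_urls, fetch_links, max_pages=100):
    # queue of URLs still to visit, seeded with the start URLs
    queue = deque(start_urls)
    # URLs already queued, so each page is crawled only once
    visited = set(start_urls)
    crawled = []

    while queue and len(crawled) < max_pages:
        url = queue.popleft()          # oldest URL first: breadth-first order
        crawled.append(url)
        for link in fetch_links(url):  # hypothetical: fetch the page, return its links
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return crawled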

Let's write our first crawler, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The dmoz_spider.py code is as follows:

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the set of domains to search, i.e. the crawler's constrained area: it tells the crawler to crawl only web pages under these domain names.
As can be seen in the parse function, the second-to-last segment of each URL is taken as the file name for storage.
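To see why the second-to-last piece becomes the file name, here is a tiny stand-alone illustration of that expression, using one of the start URLs:

# The URL ends with "/", so splitting on "/" gives an empty string as the last element;
# the second-to-last element is therefore the final real path segment.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
parts = url.split("/")
print(parts[-1])   # '' (empty, because of the trailing slash)
print(parts[-2])   # 'Books' -> used as the file name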
Then let's run it: hold down SHIFT and right-click in the tutorial directory, open a command window there, and enter:

scrapy crawl dmoz

Running it, however, produces an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
Running the first Scrapy project really is ill-fated.
It looks like an encoding problem; a quick search turns up a solution:
create a new sitecustomize.py under Python's Lib\site-packages folder:

import sys
# Python 2 only: sitecustomize.py runs early enough that setdefaultencoding is still available
sys.setdefaultencoding('gb2312')

Run it again: OK, the problem is solved. Take a look at the results:

The last line, INFO: Closing spider (finished), indicates that the crawler ran successfully and shut itself down.
The lines containing [dmoz] correspond to the output of our crawler.
You can see that there is a log line for each URL defined in start_urls.
Do you remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources
Because these URLs are the start pages, they have no referrers, so at the end of each of their log lines you will see (referer: <None>).
Under the action of the parse method, two files are created, Books and Resources, each containing the page content of the corresponding URL.

So what actually happened behind the scenes just now?
First, Scrapy creates a scrapy.http.Request object for each URL in the crawler's start_urls attribute and designates the crawler's parse method as the callback function.
These requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the crawler through the parse() method.
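To make that mechanism visible, one can build the Request objects by hand by overriding start_requests instead of relying on start_urls. The sketch below is only an equivalent illustration of what Scrapy does automatically, written against the old scrapy.spider API used throughout this article (in current Scrapy versions the import is simply scrapy.Spider):

from scrapy.http import Request
from scrapy.spider import Spider

class DmozRequestsSpider(Spider):
    name = "dmoz_requests"
    allowed_domains = ["dmoz.org"]

    def start_requests(self):
        # This is roughly what Scrapy does for us when start_urls is defined:
        # one Request per URL, with parse() registered as the callback.
        urls = [
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
            "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
        ]
        for url in urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Called with the Response object of each downloaded page.
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)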

3.2 Extract
Having crawled the whole page, the next step is the extraction process.
Storing an entire web page is not enough by itself.
In a basic crawler, this step could be done with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, based on XPath expressions.
If you want to learn more about selectors and the other mechanisms, you can consult the Scrapy selectors documentation.

Here are a few examples of XPath expressions and their meanings:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text content of the <title> element mentioned above
//td: selects all <td> elements
//div[@class="mine"]: selects all <div> elements that have the attribute class="mine"

These are just a few simple examples of XPath, but XPath is in fact very powerful; any XPath tutorial can be consulted for more.

To make it convenient to use XPath, Scrapy provides selector classes. In older versions there were two of them: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data).
They must be instantiated with a Response object.
You will find that the Selector object represents the node structure of the document, so the first selector you instantiate corresponds to the root node, that is, the whole document.
In Scrapy, selectors have four basic methods (see the API documentation):
xpath(): returns a list of selectors, each representing a node selected by the XPath expression
css(): returns a list of selectors, each representing a node selected by the CSS expression
extract(): returns a unicode string with the selected data
re(): returns a list of unicode strings extracted with the given regular expression

3.3 XPath experiments
Let's try out selectors in the shell.
Experiment site: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Now that we are familiar with our lab mouse, the next step is to fetch the page with the shell.
Enter the top-level directory of the project, that is, the first tutorial folder, and type in cmd:

scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

After pressing Enter you can see the shell's output.

Once the shell has loaded, you get a response, stored in the local variable response.
If you enter response.body you will see the body of the response, which is the content of the crawled page; entering response.headers shows its headers.

Now it is as if you are holding a pile of sand that contains the gold we want, so the next step is to shake it through a sieve a couple of times, discard the impurities, and pick out the key content.
A selector is exactly such a sieve.
In older versions, the shell instantiated two selectors: hxs for parsing HTML and xxs for parsing XML.
The current shell prepares a selector object for us, sel, which automatically chooses the best parsing scheme (XML or HTML) according to the type of data returned.
Then let's dig in!
To understand this thoroughly, you first need to know what the fetched page actually looks like.
For example, suppose we want to grab the title of the page, that is, the <title> tag. We can enter:

sel.xpath('//title')

The result is a list of selectors, so you can take this tag out and process it further with extract() and text().

Note: a short list of useful XPath path expressions:
nodename  selects all child nodes named nodename
/         selects from the root node
//        selects matching nodes anywhere in the document, regardless of their position
.         selects the current node
..        selects the parent of the current node
@         selects attributes

All the experimental results are listed below; In [i] is the input of the i-th experiment and Out[i] is its output:

In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programmi'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']
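If the experiment site happens to be unreachable, the same four methods can be tried offline by building a Selector directly from a string of HTML. This is only a minimal sketch; the HTML fragment is made up purely for illustration:

from scrapy.selector import Selector

# A made-up fragment standing in for a downloaded page.
html = """
<html>
  <head><title>Open Directory - Computers: Programming</title></head>
  <body>
    <ul class="directory-url">
      <li><a href="http://example.com/">Example Book</a> - a short description</li>
    </ul>
  </body>
</html>
"""

sel = Selector(text=html)

print(sel.xpath('//title'))                        # xpath(): a list of selectors
print(sel.css('ul.directory-url > li > a'))        # css(): also a list of selectors
print(sel.xpath('//title/text()').extract())       # extract(): unicode strings
print(sel.xpath('//title/text()').re(r'(\w+):'))   # re(): strings captured by the regex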

Of course, this title tag is not of much value to us; next we will crawl something genuinely meaningful.
Using Firefox's Inspect Element feature, we can clearly see that what we need looks like the following:

We can use the following code to crawl this <li> tag:

sel.xpath('//ul/li')

From within the <li> tag, you can get the description of a site like this:

sel.xpath('//ul/li/text()').extract()

    You can get the title of the Web site this way:

sel.xpath('//ul/li/a/text()').extract()

    You can get a hyperlink to a Web site like this:

sel.xpath('//ul/li/a/@href').extract()

Of course, the preceding examples are ways of fetching content and attributes directly.
Notice that xpath() returns a list of selector objects,
so we can also call the methods of the objects in this list to dig into deeper nodes
(see "Nesting selectors" and "Working with relative XPaths" in the selectors documentation):

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

3.4 XPath in action
We have been using the shell for quite a while; now we can finally apply what we learned above to the dmoz_spider crawler.
In the original crawler's parse function, make the following changes:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

Note that we import the Selector class from scrapy.selector and instantiate a new Selector object, so that we can work with XPath just as in the shell.
Let's try running the crawler again with the command (in the tutorial root directory):

scrapy crawl dmoz

    The results of the operation are as follows:

Sure enough, it successfully grabbed all the titles. But something is not quite right: how did the Top and Python entries of the navigation bar get crawled as well?
We only need the content in the red circle:

It seems our XPath statement has a small problem: it grabbed not only the names of the items we need, but also some innocent elements that happen to match the same XPath syntax.
Inspecting the elements, we find that what we need

has the attribute class="directory-url",
so we only need to change the XPath statement to sel.xpath('//ul[@class="directory-url"]/li').
Make the following adjustment to the XPath statement:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time all the titles are grabbed successfully, with no innocent bystanders caught:

      3.5 Using Item
      Now let's take a look at how to use item.
As we said earlier, an Item object is a custom Python dictionary; you can get the value of a field using standard dictionary syntax:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

A spider is expected to store the data it crawls in Item objects. In order to return the crawled data, the spider's final code should look like this:

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
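As a side note, instead of collecting the items in a list and returning them all at once, the parse method can also yield each item as it is built; Scrapy accepts either style, and the generator form is what more recent examples tend to use. A minimal sketch of a drop-in replacement for the parse method above:

# inside the DmozSpider class above:
def parse(self, response):
    # Same extraction as before, but yielding items one by one
    # instead of building an items list and returning it.
    sel = Selector(response)
    sites = sel.xpath('//ul[@class="directory-url"]/li')
    for site in sites:
        item = DmozItem()
        item['title'] = site.xpath('a/text()').extract()
        item['link'] = site.xpath('a/@href').extract()
        item['desc'] = site.xpath('text()').extract()
        yield item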

4. Store the Content (Pipeline)
The simplest way to save the scraped information is to use Feed exports; there are four main formats: JSON, JSON lines, CSV and XML.
Here we export the results as JSON, the most commonly used format, with the following command:

scrapy crawl dmoz -o items.json -t json

-o is followed by the name of the export file, and -t by the export format.
Then take a look at the exported result by opening the JSON file with a text editor (for easier display, all fields except title were removed from the item):

Because this is only a small example, such simple handling is enough.
If you want to do something more complicated with the crawled items, you can write an Item Pipeline.
We'll play with that later. ^_^
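As a small taste of what that looks like, here is a minimal, hypothetical pipeline sketch that appends every crawled item to a JSON-lines file. It is not part of this tutorial's project, and it would still need to be registered under ITEM_PIPELINES in settings.py before Scrapy would use it:

import json

class JsonWriterPipeline(object):
    # Open the output file once when the spider starts.
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    # Close it when the spider finishes.
    def close_spider(self, spider):
        self.file.close()

    # Called for every item the spider returns or yields.
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on so later pipelines can see it too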

The above is the entire process of making a crawler with the Python crawler framework Scrapy and crawling a site's content. Quite detailed, isn't it? I hope it helps you; if you need anything, feel free to contact me so we can improve together.
