Writing a Python crawler from scratch: using the Scrapy framework to write crawlers

In the previous article we covered installing and configuring the Python crawler framework Scrapy, along with other basics. In this article we will look at how to use the Scrapy framework to capture the content of a website easily and quickly. A web crawler is a program that crawls data on the Internet; it can be used to capture the HTML of specific webpages. Although a crawler can be developed with just a few libraries, a framework can greatly improve efficiency and shorten development time. Scrapy is a lightweight, simple, easy-to-use crawler framework written in Python. It can be used to collect online data conveniently, and it already does a lot of the work for us instead of requiring us to build everything ourselves.

First, you need to answer one question.
Q: How many steps does it take to turn a website into crawled data?
The answer is simple: four steps.
New project (Project): create a new crawler project.
Clear goals (Items): define the data you want to capture.
Make a crawler (Spider): write the crawler that starts crawling webpages.
Storage content (Pipeline): design a pipeline to store the crawled content.

Okay, now that we know the basic process, we can complete it step by step.
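At a glance, and assuming the project is named tutorial as in the rest of this article, the four steps map onto the following commands and files (a rough roadmap; each step is detailed below):

The code is as follows:


scrapy startproject tutorial               # 1. create a new crawler project
# edit tutorial/items.py                   # 2. define the Items you want to capture
# write tutorial/spiders/dmoz_spider.py    # 3. write the spider, then run: scrapy crawl dmoz
# edit tutorial/pipelines.py               # 4. design a pipeline (or use feed exports) to store the content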

1. Create a Project (Project)
In an empty directory, hold Shift and right-click, select "Open command window here", and enter the following command:

The code is as follows:


scrapy startproject tutorial

Here, tutorial is the project name.
You can see that a tutorial folder will be created. the directory structure is as follows:

The code is as follows:


tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The following describes the functions of each file:
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; the code will be imported from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: the directory for storing crawlers

2. Define the target (Item)
In Scrapy, Items are containers used to load the captured content. They work a bit like a Python dict (dictionary), but they provide some additional protection to reduce errors.
In general, an Item is created by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (which can be understood as an ORM-like mapping).
Next, we start building the Item model.
First, we want:
Name (title)
Link (url)
Description (desc)

Modify the items.py file under the tutorial directory and add our own class after the original class.
Since we want to capture content from the dmoz.org website, we can name it DmozItem:

The code is as follows:


# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem a little hard to understand at first, but defining these Items lets the other components know what your Item contains.
You can simply think of an Item as an encapsulated class object.
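For instance, here is a quick sketch of that "additional protection": assigning to a field that was not declared on the Item class raises a KeyError, which catches typos early (the field name price below is just a hypothetical, undeclared field used for illustration):

The code is as follows:


>>> from tutorial.items import DmozItem
>>> item = DmozItem()
>>> item['title'] = 'Example'   # fine: title is a declared Field
>>> item['price'] = 10          # raises a KeyError because 'price' was never declared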

3. Make a crawler (Spider)

Making the crawler takes two steps: first crawl, then extract.
That is, you first fetch the entire content of the webpage, and then take out the parts that are useful.
3.1 Crawling
A Spider is a class you write yourself, used to capture information from a domain (or group of domains).
It defines the list of URLs to download, a scheme for following links, and a way of parsing webpage content to extract Items.
To create a Spider, you subclass scrapy.spider.BaseSpider and define three mandatory attributes:
name: the identifier of the crawler. It must be unique; different crawlers must be given different names.
start_urls: the list of URLs to crawl. The crawler starts capturing data from these, so the first downloads are these URLs; the other URLs are derived from these starting URLs.
parse(): the parsing method. When it is called, the Response object returned for each URL is passed in as its only argument. It is responsible for parsing the captured data (resolving it into Items) and for following further URLs.

For help understanding this, you can refer to the ideas in the breadth-first crawler tutorial: [Java] Zhihu crawler, part 5: using the HttpClient toolkit and a breadth-first crawler.
That is to say, store the starting URLs and gradually spread out from them: capture every qualifying webpage URL, store it, and keep crawling.
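As a rough, framework-agnostic sketch of that breadth-first idea (this is not Scrapy code; fetch and extract_links are hypothetical stand-ins for downloading a page and pulling URLs out of it):

The code is as follows:


from collections import deque

def bfs_crawl(start_urls, fetch, extract_links, max_pages=100):
    seen = set(start_urls)          # URLs we have already queued
    queue = deque(start_urls)       # FIFO queue: oldest URLs are crawled first
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        body = fetch(url)           # download the page
        pages[url] = body           # store its content
        for link in extract_links(body):
            if link not in seen:    # only spread to URLs we have not seen yet
                seen.add(link)
                queue.append(link)
    return pages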

Next we write our first crawler, named dmoz_spider.py, and save it in the tutorial\spiders directory.
The dmoz_spider.py code is as follows:

The code is as follows:


from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the range of domains to search, i.e. the crawler's restricted area: it tells the crawler to crawl only webpages under this domain name.
From the parse function we can see that the second-to-last segment of the URL path is extracted and used as the file name for storage.
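For example (a quick sanity check in a plain Python shell, not part of the spider):

The code is as follows:


>>> url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> url.split("/")               # the trailing slash leaves an empty last element
['http:', '', 'www.dmoz.org', 'Computers', 'Programming', 'Languages', 'Python', 'Books', '']
>>> url.split("/")[-2]           # so the second-to-last element is 'Books'
'Books'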
Now run it: in the tutorial directory, hold Shift, right-click, open the command window here, and enter:

The code is as follows:


scrapy crawl dmoz

Running result

Error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
An error is reported when running our very first Scrapy project.
It looks like an encoding problem; a quick Google search turns up the solution:
create a sitecustomize.py file in Python's Lib\site-packages folder:

The code is as follows:


import sys
sys.setdefaultencoding('gb2312')

Run it again. OK, the problem is solved. Let's look at the result:

The final INFO: Closing spider (finished) indicates that the crawler ran successfully and shut itself down.
The lines containing [dmoz] correspond to the output of our crawler.
You can see that there is a log line for each URL defined in start_urls.
Do you still remember our start_urls?
http://www.dmoz.org/Computers/Programming/Languages/Python/Books/
http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
Because these URLs are the starting pages, they have no referrers, so at the end of each of their log lines you will see (referer: ).
Under the effect of the parse method, two files are created, Books and Resources, containing the content of those URL pages.

So what actually happened in that flurry of output?
First, Scrapy creates a scrapy.http.Request object for each URL in the crawler's start_urls attribute and assigns the crawler's parse method as the callback function.
These Requests are then scheduled and executed, and the resulting scrapy.http.Response objects are fed back to the crawler through the parse() method.
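In other words, the behaviour described above is roughly equivalent to the following sketch of a start_requests method on the spider (a simplified illustration of what Scrapy does for us, not code you need to add):

The code is as follows:


from scrapy.spider import Spider
from scrapy.http import Request

class DmozSpider(Spider):
    name = "dmoz"
    start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

    def start_requests(self):
        # roughly what Scrapy does with start_urls behind the scenes:
        # each URL becomes a Request with the spider's parse() as its callback
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        pass  # parsing as in the spider above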

3.2 Extracting
After crawling the whole webpage, the next step is extraction.
Just storing an entire HTML page is not enough.
In a basic crawler, this step could be handled with regular expressions.
In Scrapy, a mechanism called XPath selectors is used instead, based on XPath expressions.
For more information about selectors and other mechanisms, see the official documentation.

Here are some examples of XPath expressions and their meanings.
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text content of the <title> element mentioned above
//td: selects all <td> elements
//p[@class="mine"]: selects all <p> elements that have the attribute class="mine"
These are just a few simple examples of XPath usage, but XPath is actually very powerful. For more information, see the W3C tutorial.

To make XPaths convenient to use, Scrapy provides the XPathSelector class, in two flavors: HtmlXPathSelector (for parsing HTML data) and XmlXPathSelector (for parsing XML data). They must be instantiated with a Response object.
The Selector object represents the node structure of the document, so the first selector you instantiate corresponds to the root node, that is, the whole document.
In Scrapy, selectors have four basic methods (see the API documentation):
xpath(): returns a list of selectors, each representing a node selected by the xpath expression passed as the argument
css(): returns a list of selectors, each representing a node selected by the css expression passed as the argument
extract(): returns a unicode string containing the selected data
re(): returns a list of unicode strings extracted with the regular expression passed as the argument

3.3 XPath experiments
Let's try out Selector usage in the Shell.
Experiment URL: http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Now that we know what we are experimenting on, let's crawl the webpage with the Shell.
Enter the top-level directory of the project, that is, the first-level tutorial folder, and enter:

The code is as follows:


scrapy shell http://www.dmoz.org/Computers/Programming/Languages/Python/Books/

Press Enter and you will see output like the following:

After the Shell loads, you receive a response, which is stored in the local variable response.
If you enter response.body, you will see the body of the response, that is, the content of the captured page:

Or enter response.headers to view its headers:

Now it is like holding a handful of sand that hides the gold we want, so the next step is to shake it through a sieve a couple of times to get rid of the impurities and pick out the key content.
The selector is that sieve.
In older versions, the Shell instantiated two selectors: an hxs variable for parsing HTML and an xxs variable for parsing XML.
The current Shell prepares a selector object for us, sel, which automatically chooses the best parsing scheme (XML or HTML) based on the type of the returned data.
Let's try it out!
To really understand this, you first need to know what the captured page looks like.
For example, we want to capture the webpage title, that is, the <title> tag. You can enter:

The code is as follows:


sel.xpath('//title')

The result is:

In this way the tag can be picked out, and it can be processed further with extract() and text().
Note: here is a brief list of the available XPath path expressions:
Expression: Description
nodename: selects all child nodes of the named node.
/: selects from the root node.
//: selects matching nodes in the document from the current node, regardless of their position.
.: selects the current node.
..: selects the parent node of the current node.
@: selects attributes.
All of the experiment results are shown below. In[i] indicates the input of the i-th experiment, and Out[i] indicates the output of the i-th result (for details, see the W3C tutorial):

The code is as follows:


In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Progr'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Of course, the title tag is not of much value to us; next we will capture something more meaningful.
Using Firefox's Inspect Element, we can clearly see that what we need looks like this:

We can use the following code to capture these <li> tags:

The code is as follows:


sel.xpath('//ul/li')

From the <li> tags, the website descriptions can be obtained like this:

The code is as follows:


sel.xpath('//ul/li/text()').extract()

The website titles can be obtained like this:

The code is as follows:


sel.xpath('//ul/li/a/text()').extract()

And the website URLs can be obtained like this:

The code is as follows:


sel.xpath('//ul/li/a/@href').extract()

Of course, the preceding examples show how to obtain attributes directly.
We noticed that xpath returns a list of objects,
so we can also call the attributes of the objects in this list directly to dig into deeper nodes
(see Nesting selectors and Working with relative XPaths in the Selectors documentation):

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

3.4 XPath in practice
We have been playing in the shell for quite a while; now it is finally time to apply what we have learned to the dmoz_spider crawler.
Make the following changes in the original crawler's parse function:

The code is as follows:


from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

Note: we imported the Selector class from scrapy.selector and instantiated a new Selector object, so that we can operate on xpath just as we did in the Shell.
Let's try running the crawler by entering the command (in the tutorial root directory):

The code is as follows:


scrapy crawl dmoz

The running result is as follows:

Sure enough, all the titles were captured. But something doesn't look quite right: why were "Top" and the "Python" navigation bar captured as well?
We only need the content in the red circle:

It seems our xpath statement has a problem: it did not just capture the project names we need, it also caught some innocent elements that happen to match the same xpath syntax.
Inspecting the elements again, we find that the <ul> we need has the class="directory-url" attribute,
so we change the xpath statement to sel.xpath('//ul[@class="directory-url"]/li').
Make the following adjustment to the xpath statement:

The code is as follows:


from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time all the titles are captured successfully, and no innocent bystanders are caught:

3.5 Using Item
Next, let's look at how to use Item.
As mentioned earlier, an Item object is a custom Python dictionary; you can use the standard dictionary syntax to get the value of an attribute:

The code is as follows:


>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

As a crawler, the Spider should store the captured data in Item objects. To return the captured data, the final version of the spider code should look like this:

The code is as follows:


from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items
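As a side note, instead of collecting the items in a list and returning it, parse() can also be written as a generator that yields each item as soon as it is ready; Scrapy accepts either form. A minimal sketch of that variant (a drop-in replacement for the parse method in the spider above):

The code is as follows:


    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath('//ul[@class="directory-url"]/li'):
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            yield item  # hand each item to Scrapy as soon as it is ready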

4. Storage content (Pipeline)
The simplest way to save the information is through Feed exports. There are four main formats: JSON, JSON lines, CSV, and XML.
To export the results in JSON format, the command is as follows:

The code is as follows:


scrapy crawl dmoz -o items.json -t json

-o is followed by the name of the export file, and -t is followed by the export format.
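The other feed formats work the same way, by changing the file name and type (the exact flags can vary a little between Scrapy versions; in newer versions the format is inferred from the file extension and -t can be omitted):

The code is as follows:


scrapy crawl dmoz -o items.csv -t csv
scrapy crawl dmoz -o items.xml -t xml
scrapy crawl dmoz -o items.jl -t jsonlines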
Now let's look at the export result by opening the json file with a text editor (to keep the display simple, everything except title has been removed from the item):

This is just a small example, so simple processing like this is enough.
If you want to do something more complex with the captured Items, you can write an Item Pipeline.
We will try that later.
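For reference, a minimal Item Pipeline is just a class in tutorial/pipelines.py with a process_item() method, enabled in settings.py; the sketch below only passes items through unchanged and is meant as a starting point, not the pipeline used in this article:

The code is as follows:


# tutorial/pipelines.py
class TutorialPipeline(object):
    def process_item(self, item, spider):
        # clean, validate or store the item here;
        # return the item so that any later pipelines can keep processing it
        return item

# and in tutorial/settings.py enable it, e.g.:
# ITEM_PIPELINES = ['tutorial.pipelines.TutorialPipeline']
# (newer Scrapy versions use a dict with an order value instead:
#  ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300})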

That is the whole process of crawling a website's content with the Python crawler framework Scrapy, explained in detail. I hope it is helpful to you; if you need anything, feel free to contact me and we can make progress together.
