[Python] web crawler (12): Getting started with the crawler framework Scrapy

We use the dmoz.org website to show our skills.


First, you need to answer a question.

Q: How many steps does it take to put a website into a crawler?

The answer is simple: four steps.

New project (Project): create a new crawler project.

Define the target (Item): define the content you want to capture.

Make a crawler (Spider): write a crawler that starts crawling webpages.

Store the content (Pipeline): design a pipeline to store the crawled content.


Okay, now that the basic process is clear, we can work through it step by step.


1. Create a new project (Project)

In an empty directory, hold Shift and right-click, select "Open command window here", and enter the following command:

scrapy startproject tutorial

Here, tutorial is the project name.

You can see that a tutorial folder is created. The directory structure is as follows:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

The following describes the functions of each file:

  • scrapy.cfg: the project's configuration file

  • tutorial/: the project's Python module; the code you write goes here

  • tutorial/items.py: the project's items file

  • tutorial/pipelines.py: the project's pipelines file

  • tutorial/settings.py: the project's settings file

  • tutorial/spiders/: the directory where crawlers are stored


2. Define the target (Item)

In Scrapy, an Item is a container used to hold the scraped content. It is a bit like a dict in Python, but it provides some additional protection to reduce errors.

Generally, an Item is created by subclassing the scrapy.item.Item class, and its attributes are defined with scrapy.item.Field objects (which can be understood as an ORM-like mapping).

Next, we start to build the item model (Model).

First, we want:

  • Name (title)

  • Link (link)

  • Description (desc)


Modify the items.py file under the tutorial directory and add our own class after the original one.

Because we want to capture the content of the dmoz.org website, we can name it DmozItem:

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field

class TutorialItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

At first this may look a little puzzling, but defining these items lets the other components know exactly what your item contains.

You can simply think of an Item as an encapsulated class object.
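To see what that "additional protection" means in practice, here is a quick interactive sketch using the DmozItem defined above (the exact error message may vary between Scrapy versions): assigning to a field that was never declared raises a KeyError instead of silently creating a misspelled key.

>>> item = DmozItem(title='Example title')
>>> item['title']
'Example title'
>>> item['publisher'] = 'Example'   # 'publisher' was never declared as a Field
Traceback (most recent call last):
  ...
KeyError: 'DmozItem does not support field: publisher'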


3. Make a crawler (Spider)

Making a crawler takes two steps: first crawl, then extract.

That is to say, first you need to get all the content of the entire web page, and then retrieve the useful parts.

3.1 Crawling

A Spider is a class you write yourself, used to scrape information from a domain (or group of domains).

They define a list of URLs for download, a scheme for tracking links, and a method for parsing webpage content to extract items.

To create a Spider, you subclass scrapy.spider.BaseSpider (simply Spider in later versions, as in the code below) and define three mandatory attributes:

name: identifies the crawler. It must be unique; different crawlers must be given different names.

start_urls: the list of URLs to crawl. The crawler starts fetching from here, so the first downloads are these URLs; other sub-URLs are generated from these starting URLs.

parse(): the parsing method. When called, it receives the Response object returned from each URL as its only argument. It is responsible for parsing and matching the captured data (resolving it into items) and for following further URLs.

Here, the ideas from the earlier breadth-first crawler tutorial may help with understanding: [Java] Zhihu crawler, part 5: using the HttpClient toolkit and a breadth-first crawler.

That is to say, you store the starting URLs and gradually spread out from there, saving every qualifying webpage URL and continuing to crawl from it, as the conceptual sketch below shows.
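The breadth-first idea itself is independent of Scrapy. A rough conceptual sketch (fetch, extract_links and is_allowed are hypothetical placeholders here, not Scrapy APIs) might look like this:

from collections import deque

def bfs_crawl(start_urls, fetch, extract_links, is_allowed):
    # Breadth-first crawl: visit pages level by level, starting from start_urls.
    seen = set(start_urls)
    queue = deque(start_urls)
    while queue:
        url = queue.popleft()
        page = fetch(url)                  # download the page
        for link in extract_links(page):   # collect the sub-URLs on it
            if is_allowed(link) and link not in seen:
                seen.add(link)             # remember it so it is crawled only once
                queue.append(link)

Scrapy handles this scheduling for us; we only have to describe where to start and what to extract.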

Next we will write our first crawler, dmoz_spider.py, and save it in the tutorial\spiders directory.

The dmoz_spider.py code is as follows:

from scrapy.spider import Spider

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)

allowed_domains is the search domain range, i.e. the crawler's restricted area: it confines the crawler to webpages under this domain name.

From the parse function, we can see that the second-to-last segment of each URL is extracted and used as the file name when the page body is saved.
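For example, for the Books start URL, response.url.split("/")[-2] works out like this (the trailing slash makes the last list element an empty string, so index -2 is the directory name):

>>> "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/".split("/")[-2]
'Books'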

To run it, hold Shift and right-click in the tutorial directory, open a command window there, and enter:

scrapy crawl dmoz

Running result: the first run trips over an encoding problem (the screenshot of the output is not reproduced here). To work around it, create a sitecustomize.py in Python's Lib\site-packages folder:

import sys
sys.setdefaultencoding('gb2312')

Run it again: OK, the problem is solved. This time the crawl completes and two files, Books and Resources, appear in the directory, holding the raw page bodies.
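The In/Out excerpts that follow come from a scrapy shell session on the Books page. Assuming the first URL from start_urls, the shell would have been started with a command like:

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

In the shell, sel is the ready-made Selector bound to the downloaded response (in this Scrapy version), so we can test XPath expressions interactively.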


All the experiment results are shown below. In [i] indicates the input of the i-th step and Out[i] the corresponding output (for XPath syntax, we recommend referring to the W3C tutorial):

In [1]: sel.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>Open Directory - Computers: Progr'>]

In [2]: sel.xpath('//title').extract()
Out[2]: [u'<title>Open Directory - Computers: Programming: Languages: Python: Books</title>']

In [3]: sel.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'Open Directory - Computers: Programming:'>]

In [4]: sel.xpath('//title/text()').extract()
Out[4]: [u'Open Directory - Computers: Programming: Languages: Python: Books']

In [5]: sel.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Computers', u'Programming', u'Languages', u'Python']

Of course, the title tag is not of much value to us; next we will capture something meaningful.

Using Firefox's Inspect Element, we can clearly see that what we need sits inside the <li> entries of a <ul class="directory-url"> list: each entry holds an <a> tag with the title and link, followed by a text description.

3.4 XPath practice

We have spent a long time in the shell; finally, we can apply what we have learned to the dmoz_spider crawler.

Make the following changes to the original crawler's parse function:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

Note: we imported the Selector class from scrapy.selector and instantiated a new Selector object. This lets us operate on XPath just as we did in the shell.

Let's run the crawler again (from the tutorial root directory):

scrapy crawl dmoz

The output (screenshot not reproduced here) shows that //ul/li also picks up list items outside the directory listing, so we adjust the XPath statement to restrict it:

from scrapy.spider import Spider
from scrapy.selector import Selector

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()
            desc = site.xpath('text()').extract()
            print title

This time all the titles are captured successfully, with no unrelated entries caught in the net.

Next, let's take a look at how to use Item.

As mentioned above, an Item object is a custom Python dictionary; you can use standard dictionary syntax to get the value of a field:

>>> item = DmozItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'

A Spider is expected to store the data it captures in Item objects. To return the captured data, the final spider code should look like this:

from scrapy.spider import Spider
from scrapy.selector import Selector

from tutorial.items import DmozItem

class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            item['desc'] = site.xpath('text()').extract()
            items.append(item)
        return items

4. Store the content (Pipeline)

The simplest way to save information is through Feed exports. There are four main types: JSON, JSON lines, CSV, and XML.

To export the results in JSON format, the command is as follows:

scrapy crawl dmoz -o items.json -t json

-o is followed by the export file name, and -t by the export format.
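The same two flags cover the other formats as well; for example, a CSV export of the same run would look like this:

scrapy crawl dmoz -o items.csv -t csv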

Next, open the exported json file in a text editor to look at the result (for easier display, all attributes except title were removed from the item):

Because this is only a small example, this simple treatment is enough.

If you want to do something more complex with the captured items, you can write an Item Pipeline.

We will explore that properly later; a small preview is sketched below.
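As a small preview (a hypothetical sketch, not part of the original tutorial): an Item Pipeline is just a class with a process_item method, placed in tutorial/pipelines.py and enabled through the ITEM_PIPELINES setting in settings.py. A pipeline that writes each item as one JSON line might look roughly like this:

import json

class JsonWriterPipeline(object):
    # Hypothetical example: write every scraped item as one line of JSON.

    def open_spider(self, spider):
        self.file = open('items_pipeline.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + '\n')
        return item   # return the item so any later pipeline can keep processing it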

The above is [Python] web crawler (12): a getting-started tutorial for the first sample crawler built with the Scrapy framework.
