A brief exploration of Scrapy: writing a simple crawler


Outline:
0 Introduction
1 Project Setup
2 A Simple Crawler
3 Running the Crawler
4 File Downloads
5 Summary

0 Introduction
Scrapy is a crawler framework.
The framework lets us focus on the core logic of the crawler. Its drawback, as with most frameworks, is that it is not as flexible as hand-written code.

1 Project Setup
Prerequisite: Python and Scrapy are already installed.
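
If Scrapy is not installed yet, it can usually be installed with pip (assuming a working Python environment):

pip install scrapy
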
In cmd, change into the directory where you want the project to live and run:

scrapy startproject spider_name

This generates the following directory structure:

spider_name/
    scrapy.cfg
    spider_name/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

items.py: defines all the items used by the crawler.
spiders/ directory: where the spider code is added.
scrapy.cfg: the crawler's configuration file.
pipelines.py: defines the crawler's pipelines.
settings.py: the project settings (used later for the User-Agent, pipelines, and so on).

Tip: an Item is a container for the crawled data; it is used much like a Python dictionary.
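
A minimal sketch of that dict-like usage (PageItem here is a hypothetical item class, just for illustration):

import scrapy

class PageItem(scrapy.Item):
    url = scrapy.Field()

item = PageItem()
item['url'] = 'https://example.com/'   # assign a field like a dict key
print(item['url'])                     # read it back the same way
print(dict(item))                      # convert to a plain dict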

2 A Simple Crawler
This section walks through a simple crawler demo.

<1> Add a .py file to the spiders directory; here it is called myspider.py. This file holds the code of the crawler.

<2> Define your own item class, which inherits from scrapy.Item. In it, declare the fields you want to save; each field is initialized with scrapy.Field(). The item is usually placed in items.py, although that is not mandatory.

import scrapy

class MyItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()

This defines MyItem with two fields to save: the URL of the crawled page and the page's name (its title).

<3> The body of the crawler is a subclass of CrawlSpider. This class defines the crawler's behaviour. Its main attributes are:
name: the spider's name; spider names must be unique within a project.
allowed_domains: a list of the domains the spider is allowed to crawl.
start_urls: a list of the URLs the crawl starts from.
rules: a list of Rule objects describing which links the spider follows.

A Rule usually takes a few common parameters (see the sketch after this list):
LinkExtractor object: accepts parameters such as allow and deny, regular expressions describing which links to follow and which to reject.
callback: the name of the callback function, passed as a string. Do not use parse as the callback, because CrawlSpider implements its own parse internally and overriding it breaks the crawl logic.
follow: a boolean indicating whether links found on matched pages should themselves be followed. If no callback is given, it defaults to True.
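
A minimal sketch of a Rule combining these parameters (the URL patterns and the callback name are made up for illustration):

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = [
    Rule(
        LinkExtractor(allow=[r'/subject/\d+/?$'],   # follow links matching this pattern
                      deny=[r'/login']),            # but never follow login pages
        callback='parse_it',                        # handler defined on the spider
        follow=True,                                # keep traversing from matched pages
    ),
]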

Two functions then need to be written:
1> the callback function referenced in the rules;
2> parse_start_url: a method of the parent class that can be overridden to define what is done with the pages in start_urls.

The code for the entire demo is as follows:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from spider.items import MyItem


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['douban.com']
    start_urls = ['https://xxxxx.xxxxxx.com/chart/']
    rules = [Rule(LinkExtractor(allow=[r'https://xxxxx.xxxxxx.com/subject/\d+/?$']),
                  'parse_it', follow=True)]

    def parse_it(self, response):
        myitem = MyItem()
        myitem['url'] = response.url
        myitem['name'] = response.xpath("//title/text()").extract()
        return myitem

    def parse_start_url(self, response):
        pass

In this demo we simply scrape the title of each page. In practice you can do much more, such as gathering statistics about the site or downloading its images; that logic goes into parse_start_url and parse_it (a small sketch follows).
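
For example, as a minimal sketch, parse_it inside MySpider could also record how many links each page contains (link_count is a hypothetical extra field that would have to be added to MyItem):

    def parse_it(self, response):
        myitem = MyItem()
        myitem['url'] = response.url
        myitem['name'] = response.xpath("//title/text()").extract()
        # hypothetical extra statistic: number of links on the page
        myitem['link_count'] = len(response.xpath("//a/@href").extract())
        return myitem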

3 Running the Crawler
Run the crawler with the following command:

scrapy crawl myspider

Run this command from inside the project directory; the third argument is the spider name defined in the code. The crawled data can also be saved to a file:

scrapy crawl myspider -o items.json
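
Scrapy infers the export format from the file extension, so the same data can also be saved as, for example, CSV or JSON lines:

scrapy crawl myspider -o items.csv
scrapy crawl myspider -o items.jl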

To avoid switching to the command line every time, you can use Scrapy's cmdline module: add a main.py in the same directory as scrapy.cfg and then simply run that script (for example from PyCharm) whenever you want to start the crawl.

# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

Part of the crawled JSON data is shown below:

[
{"url": "https://xxxxx.xxxxxx.com/subject/26844922/", "name": ["\n \u6770\u51fa\u516c\u6c11\n"]},
{"url": "https://xxxxx.xxxxxx.com/subject/26279289/", "name": ["\n \u6012\n"]},
{"url": "https://xxxxx.xxxxxx.com/subject/25765735/", "name": ["\n \u91d1\u521a\u72fc3\uff1a\u6b8a\u6b7b\u4e00\u6218\n"]},
{"url": "https://xxxxx.xxxxxx.com/subject/6873143/", "name": ["\n \u4e00\u6761\u72d7\u7684\u4f7f\u547d\n"]},
......
]

As you can see, the crawled content is exactly what was defined in the item.

Some sites use anti-crawler mechanisms. A simple workaround is to send a browser-like User-Agent. Add the following setting to settings.py:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

4 File Downloads
Files can be downloaded using a pipeline; the commonly used ones are FilesPipeline and ImagesPipeline. This section briefly records how to use FilesPipeline.

Two steps in settings.py are required before the pipeline will work:
① Activate the pipeline in the configuration:

ITEM_PIPELINES = {
    'spider.pipelines.MyPipeline': 3,
}

The number after the class path controls the order in which pipelines run; values are conventionally kept in the 0~1000 range.
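
As a sketch of how that ordering works when several pipelines are enabled (the second entry here is the stock FilesPipeline, included purely for illustration), the pipeline with the lower number runs first:

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,   # runs first (lower number)
    'spider.pipelines.MyPipeline': 3,            # runs afterwards
}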

② Define a valid storage directory:

FILES_STORE = '/pic'

Both steps are required; if either one is missing, the pipeline will not be enabled.

The next, and most important, job is to define the pipeline class.
Define your own pipeline class in pipelines.py, inheriting from FilesPipeline.

Within the class, there are mainly two methods to implement.
get_media_requests: called for each item passing through the pipeline; it yields the download requests for the URLs held in the item.
item_completed: called once the downloads for an item are finished; renaming and similar post-processing go here. This method is optional; if files are not renamed, each downloaded file is simply named with a hash of its URL.

Here's a simple demo:

from scrapy import Request
from scrapy.pipelines.files import FilesPipeline


class MyPipeline(FilesPipeline):
    def get_media_requests(self, item, info):
        for url in item['file_urls']:
            yield Request(url)

    def item_completed(self, results, item, info):
        # nothing extra to do; just pass the item along
        return item
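
For illustration, item_completed could record where the successfully downloaded files ended up. The results argument is a list of (success, file_info) tuples; this sketch assumes a 'files' field has been declared on the item:

    def item_completed(self, results, item, info):
        # keep only the storage paths of files that downloaded successfully
        item['files'] = [file_info['path'] for ok, file_info in results if ok]
        return item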

The file_urls field has to be populated in the MySpider class. To extract the URLs conveniently you can use Scrapy's Selector. The following line collects the src attribute of every img tag whose src contains .jpg:

myitem['file_urls'] = Selector(text=response.text).xpath("//img[contains(@src, '.jpg')]/@src").extract()
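
For FilesPipeline to pick these URLs up, the item also needs the corresponding fields. A minimal sketch of the extended MyItem (the files field is filled in by the pipeline after download):

import scrapy

class MyItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    file_urls = scrapy.Field()   # URLs the FilesPipeline should download
    files = scrapy.Field()       # populated by the pipeline with download results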

Sometimes nothing gets downloaded, and looking at the output you will find 403 errors. This is because Scrapy obeys the robots.txt protocol by default. Add the following configuration to settings.py to tell Scrapy not to obey robots.txt:

ROBOTSTXT_OBEY = False


5 Summary
Scrapy is a handy and fast crawler framework: a complete crawler can be implemented with little more than the regular expressions describing the URLs of interest and a few key callbacks. The price is a certain loss of flexibility.
