The Python crawler: basics of the Scrapy framework

This tutorial walks through creating a Scrapy project, defining the Item you will extract, writing a spider to crawl a site and extract Items, and writing an Item Pipeline to store the extracted Items (i.e. the data).

Scrapy is written in Python. If you are new to the language and want to learn more about it before diving into Scrapy, Learn Python the Hard Way is a good choice for programmers who already know other languages and want to pick up Python quickly; for beginners starting from scratch, a list of Python learning materials for non-programmers is a better fit.

Create a project

Before you start crawling, you must create a new Scrapy project. Enter the directory where you intend to store the code, and run the following command:

scrapy startproject tutorial

The command will create a tutorial directory that contains the following:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module. You will add your code here.
tutorial/items.py: the project's Item definitions.
tutorial/pipelines.py: the project's pipelines file.
tutorial/settings.py: the project's settings file.
tutorial/spiders/: the directory where spider code is placed.

Define Item

The Item is the container that holds the crawled data. It works like a Python dictionary, and additionally protects against field-name typos by raising an error when an undefined field is used.

As you would in an ORM, you declare an Item by creating a class that subclasses scrapy.Item and defining its attributes as scrapy.Field objects. (If you are not familiar with ORMs, don't worry; this step is very simple.)

First, model the Item after the data you want to capture from dmoz.org. We need the name, URL, and description of each site listed on DMOZ, so we define a field in the Item for each of them. Edit the items.py file in the tutorial directory:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
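
As a quick illustration of the dictionary-like behaviour described above, here is a short sketch that uses the DmozItem defined here; the values are placeholders, and the import path assumes the project layout shown earlier:

from tutorial.items import DmozItem  # import path assumes the tutorial project above

item = DmozItem(title="Example site", link="http://example.com/")

# Fields are read and written with dictionary syntax.
item["desc"] = "A short description of the site"
print(item["title"], item["link"], item["desc"])

# Assigning to a field that was never declared raises KeyError; this is the
# protection against misspelled field names mentioned earlier.
try:
    item["ttile"] = "typo"
except KeyError as error:
    print(error)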

This may seem a bit complicated at first, but defining the Item lets you conveniently use other Scrapy facilities, and those facilities need to know how your Item is defined.

Write the first Spider

A Spider is a class that you write to crawl data from a single Web site (or a group of Web sites).

It defines the initial URLs to download, how to follow links on a page, and how to parse page content to extract Items.

To create a Spider, you must subclass scrapy.Spider and define the following three attributes:

name: identifies the Spider. The name must be unique; you may not give different Spiders the same name.
start_urls: a list of URLs the Spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the Spider. When called, it receives the Response object generated for each downloaded initial URL as its only argument. The method is responsible for parsing the returned data (response), extracting data (generating Items), and generating Request objects for URLs that need further processing.

The following is our first Spider, saved as dmoz_spider.py in the tutorial/spiders directory:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name each output file after the second-to-last segment of the URL.
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)
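
The parse() method above only saves the raw pages to files. To show how parse() can instead generate Items, as described earlier, here is a minimal sketch of a variant spider; the spider name is made up, and the XPath expressions are assumptions about the DMOZ page markup that would need to be adjusted to the real pages:

import scrapy

from tutorial.items import DmozItem

class DmozItemSpider(scrapy.Spider):
    name = "dmoz_items"  # hypothetical name, distinct from the spider above
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # The selectors below are assumptions about the page layout, not
        # verified against dmoz.org; adapt them to the actual markup.
        for sel in response.xpath("//ul/li"):
            item = DmozItem()
            item["title"] = sel.xpath("a/text()").extract()
            item["link"] = sel.xpath("a/@href").extract()
            item["desc"] = sel.xpath("text()").extract()
            yield item

Each yielded Item is passed on to whatever Item Pipelines are configured; parse() could also yield scrapy.Request objects for pages that need further processing.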
Crawling

Go to the root of the project and start the spider by executing the following command:

scrapy crawl dmoz

crawl dmoz starts the spider crawling dmoz.org, and you will see output similar to the following:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 ...
