Python-scrapy Creating the first project

Create a project

Before you start crawling, you must create a new Scrapy project. Go to the directory where you want to store the code, and run the following command:

scrapy startproject tutorial

This command will create a tutorial directory with the following contents:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

    • scrapy.cfg: the project's configuration file.
    • tutorial/: the project's Python module. You will later add your code here.
    • tutorial/items.py: the item definitions for the project.
    • tutorial/pipelines.py: the pipelines file for the project.
    • tutorial/spiders/: the directory where you place your spider code.

Define Item
Item is a container for the crawled data: it is used much like a Python dictionary, but adds a protection mechanism that guards against undefined-field errors caused by typos.
Similar to what you would do in an ORM, you define an Item by creating a class that subclasses scrapy.Item and declaring class attributes of type scrapy.Field.
The item is modeled on the data we need to get from dmoz.org: the name, URL, and description of each site. For this, the corresponding fields are defined in the item. Edit the items.py file in the tutorial directory:

import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()

This may seem a bit complicated at first, but defining the item lets you use other Scrapy facilities, and those facilities need to know how your item is defined. A short usage sketch follows.
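
For illustration, here is a minimal sketch of how the item behaves once defined (assuming the items.py above and a Python shell started in the project root, so that the tutorial package is importable):

from tutorial.items import DmozItem

# Items are created and populated like dictionaries.
item = DmozItem(title="Example site")
item["link"] = "http://example.com/"
print(item["title"], item["link"])

# Fields that were never declared on the Item raise a KeyError;
# this is the protection mechanism mentioned above.
try:
    item["author"] = "someone"
except KeyError as err:
    print("Undeclared field:", err)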

Write the first crawler (spider)
Spiders are classes that you write to crawl data from a single site (or a group of sites).
They contain the initial URLs to download, rules for following links in the pages, and the logic for parsing page content to extract items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:

    • name: identifies the spider. The name must be unique; you cannot use the same name for different spiders.
    • start_urls: a list of URLs that the spider crawls at startup. The first pages fetched come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
    • parse(): a method of the spider. When called, the Response object produced by downloading each initial URL is passed to it as its only argument. The method is responsible for parsing the returned data (the response), extracting the data (generating items), and generating Request objects for URLs that need further processing.

Here is our first spider code, saved in the file dmoz_spider.py in the tutorial/spiders directory:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Name the output file after the second-to-last URL path segment
        # (e.g. "Books", "Resources") and save the raw page body to it.
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)

Crawl
Enter the project's root directory and execute the following command to start the spider:

scrapy crawl dmoz

crawl dmoz starts the spider that crawls dmoz.org, and you will get output similar to the following:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)

Looking at the output for dmoz, you can see that the log lists the initial URLs defined in start_urls and that they correspond one to one with those in the spider. You can also see in the log that these requests did not come from another page (referer: None).

In addition, something more interesting happened: as our parse method specifies, two files containing the contents of the respective URLs were created, named Books and Resources.
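
Those names come from the way parse() splits the URL; a quick illustration using the Books start URL from the spider above:

# How parse() derives a filename from a URL:
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
print(url.split("/")[-2])  # prints "Books"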

What just happened?
Scrapy creates a scrapy.Request object for each URL in the spider's start_urls attribute and assigns the spider's parse() method to each request as its callback function.

The Request objects are scheduled and executed, and the resulting scrapy.http.Response objects are passed back to the spider's parse() method.
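
As a rough, hand-written equivalent of that default behaviour (not part of the tutorial code; Scrapy already does this for you), the same scheduling could be expressed with an explicit start_requests() method:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    # Approximately what Scrapy's default start_requests() does: turn each
    # start URL into a Request whose callback is the spider's parse() method.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Received response from %s", response.url)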
