In this tutorial you will: create a Scrapy project, define an Item, write a spider to crawl a site and extract Items, and write an Item Pipeline to store the extracted Items (i.e. the data).
Scrapy is written in Python. If this is your first contact with the language and you are wondering about both Python and the details of Scrapy, we recommend Learn Python the Hard Way for programmers who already know other languages and want to pick up Python quickly; for beginners who want to start by learning Python itself, a list of Python learning materials for non-programmers will be a better choice.

Create a project
Before you start crawling, you must create a new Scrapy project. Enter the directory where you intend to store the code, and run the following command:
scrapy startproject tutorial
The command will create a tutorial directory that contains the following:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...
These files are:

scrapy.cfg: the project's configuration file.
tutorial/: the project's Python module. You will add your code here later.
tutorial/items.py: the project's Item file.
tutorial/pipelines.py: the project's pipelines file.
tutorial/settings.py: the project's settings file.
tutorial/spiders/: the directory where spider code is placed.

Define Item
An Item is the container that holds the crawled data. It is used like a Python dictionary, and it additionally protects against errors caused by setting undefined fields (for example, through spelling mistakes).
Much as you would in an ORM, you define an item by creating a subclass of scrapy.Item and defining its attributes as scrapy.Field objects. (If you don't know ORMs, don't worry; you will find this step very simple.)
First, we model the item after the data we want to capture from dmoz.org: the name, URL, and description of each site. For this, we define the corresponding fields in the item. Edit the items.py file in the tutorial directory:
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
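The dictionary-like behavior and field protection described above can be illustrated with a minimal stand-in class (this sketch only mimics the behavior for illustration; it is not Scrapy's actual implementation, and FakeItem is an invented name):

```python
# Minimal stand-in mimicking scrapy.Item's field protection (not Scrapy's real code).
class FakeItem(dict):
    fields = {"title", "link", "desc"}  # declared fields, like the scrapy.Field attributes

    def __setitem__(self, key, value):
        # Reject any field that was not declared, as scrapy.Item does.
        if key not in self.fields:
            raise KeyError(f"FakeItem does not support field: {key}")
        super().__setitem__(key, value)

item = FakeItem()
item["title"] = "Python Books"   # a declared field works like a normal dict entry
print(item["title"])             # -> Python Books

try:
    item["titel"] = "oops"       # a misspelled field is rejected
except KeyError as e:
    print(e)                     # prints the KeyError message
```

This is the "additional protection" a plain dictionary would not give you: a typo in a field name fails loudly instead of silently creating a new key.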
This may seem a bit complicated at first, but defining the item lets you easily use other Scrapy facilities, which need to know how your item is defined.

Write the first spider (Spider)
A spider is a class that you write to crawl data from a single website (or a group of websites).
It contains an initial list of URLs to download, rules for following links in the pages, and methods for parsing page contents to extract items.
To create a spider, you must subclass scrapy.Spider and define the following three attributes:

name: identifies the spider. The name must be unique; you may not set the same name for different spiders.
start_urls: a list of URLs the spider crawls at startup. The first pages fetched will come from this list; subsequent URLs are extracted from the data retrieved from these initial URLs.
parse(): a method of the spider. When called, the Response object generated for each downloaded initial URL is passed to it as its only argument. The method is responsible for parsing the returned data (response), extracting data (generating items), and generating Request objects for URLs that need further processing.
Here is the code of our first spider. Save it in a file named dmoz_spider.py in the tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, "wb") as f:
            f.write(response.body)
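The filename expression in parse() takes the second-to-last path segment of the URL; because the URL ends with a slash, the last element of the split is an empty string. This logic can be checked standalone, using one of the start URLs above:

```python
# How parse() derives a filename from the response URL.
url = "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

parts = url.split("/")
# The trailing slash makes the last element an empty string,
# so index [-2] is the final path segment.
print(parts[-2])  # -> Books
```

So running this spider writes two files, one per start URL, each named after the last path segment of the page's address.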
Crawling
Go to the root of the project and start the spider by executing the following command:
scrapy crawl dmoz
crawl dmoz starts the spider that crawls dmoz.org. You will get output similar to:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23