Scrapy crawler (1)


This article is based on Scrapy 0.20.2 (the latest version at the time of writing is 0.22.0) and Python 2.7.6.

The development environment is Windows 7 SP1.

 

Valuable references on the Internet

1. Scrapy getting started tutorial

A translation of the official documentation. It focuses on basic Scrapy usage: creating a project and working with items and spiders. We will briefly go over the content of the official tutorial.

2. Using scrapy for large-scale crawling

This article does not go into code details; it shares practical scrapy experience, though it is based on the older version 0.9. We will briefly expand on the topics it covers.

 

1. Install scrapy

Because scrapy's dependency libraries vary greatly between versions, and the installation tutorials online target versions that are too old, it is recommended to install scrapy directly and then run it: the error messages will tell you which libraries are still missing, and you can install them one by one.

After scrapy is installed, a new Scripts directory appears under your Python installation directory. Remember to add it to the PATH environment variable so that you can use the "scrapy" command directly on the command line.
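For example, on Windows the setup and check might look like this (the Python installation path below is an assumption; adjust it to your own):

    C:\> pip install scrapy==0.20.2
    C:\> set PATH=%PATH%;C:\Python27\Scripts
    C:\> scrapy version
    Scrapy 0.20.2

If "scrapy version" complains about a missing library (lxml, Twisted, pywin32, and so on), install that library and run the command again.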

 

2. Create a project

Run "scrapy startproject project name" on the command line to directly create the scrapy project in the current directory.

The project directory contains the following files:

scrapy.cfg

The configuration file, rarely touched during development.

ajian\items.py

Defines the Item classes, i.e. the structure of the page data that crawling returns. (The recursive crawling itself is done by spiders; see section 3.)

ajian\pipelines.py

Returned items are delivered here for further processing; store them in a database or analyze them however you like.

ajian\settings.py

The settings file has many interesting options, such as depth-first or breadth-first crawling, the number of concurrent requests per IP address, the number of concurrent requests per domain name, the total number of concurrent requests, and the crawl delay. A sketch of these settings follows this list.

ajian\__init__.py

ajian\spiders\__init__.py

The above two files are required for Python packages and have little to do with scrapy itself.
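As a minimal sketch, the settings mentioned above look roughly like this in ajian\settings.py (the values are illustrative assumptions, not recommendations):

    BOT_NAME = 'ajian'
    SPIDER_MODULES = ['ajian.spiders']
    NEWSPIDER_MODULE = 'ajian.spiders'

    DEPTH_PRIORITY = 1                   # a positive value biases the crawl toward breadth-first
    CONCURRENT_REQUESTS = 32             # total number of concurrent requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # concurrent requests per domain name
    CONCURRENT_REQUESTS_PER_IP = 1       # per-IP limit; when non-zero it replaces the per-domain limit
    DOWNLOAD_DELAY = 0.5                 # crawl delay in seconds between requests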

 

3. Build the first spider

By default, a project built with the command generates pipelines.py but no spider file, which is confusing. It doesn't matter; we can simply write one ourselves. Create a new file ajian_spider.py in the spiders directory with the following content.

    from scrapy.spider import BaseSpider

    class AjianSpider(BaseSpider):
        # the identifier used with "scrapy crawl <name>"
        name = "ajian"
        # the initial list of URLs to crawl
        start_urls = ["http://www.163.com"]

        def parse(self, response):
            # simply print the raw page content
            print response.body

The name and start_urls variables are mandatory. name is the spider name specified when scrapy is started; start_urls is a list of the initial URLs to crawl, and multiple URLs can be specified. A simple spider is now ready. It has only one function, which prints the page content of 163.com.

This spider is too crude, so we will add the most basic features: returning the page content and crawling recursively.

Before returning the page content and performing recursive crawling, you should first learn some background knowledge.

The parse function in the spider file is actually a generator. Python has a keyword, yield, which other articles explain in detail; here is a brief description. When a function reaches yield, it not only returns a value but also records its current state. The next time the function is called, execution resumes at the statement after the yield. It is as if the function runs in installments: it returns one result to the caller, then carries on from where it left off.
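A tiny, generic example of yield (Python 2 syntax, unrelated to scrapy itself):

    def count_up_to(n):
        i = 1
        while i <= n:
            yield i   # return i to the caller and pause here
            i += 1    # execution resumes on this line at the next call

    for x in count_up_to(3):
        print x       # prints 1, then 2, then 3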

parse can return two kinds of objects, Item and Request, which correspond to page content and recursive requests respectively.

If a Request is returned, it is placed on the queue of pages waiting to be crawled; if an Item is returned, it is handed to pipelines.py for processing.
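Putting the two together, a minimal sketch of a recursive parse might look like this (the AjianItem class with a 'title' field and the XPath expressions are assumptions for illustration, not part of this article):

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from ajian.items import AjianItem   # hypothetical Item with a 'title' field

    class AjianSpider(BaseSpider):
        name = "ajian"
        start_urls = ["http://www.163.com"]

        def parse(self, response):
            sel = Selector(response)
            # yield an Item: it is passed on to pipelines.py
            item = AjianItem()
            item['title'] = sel.xpath('//title/text()').extract()
            yield item
            # yield Requests: each link goes back onto the crawl queue
            for href in sel.xpath('//a/@href').extract():
                if href.startswith('http'):
                    yield Request(href, callback=self.parse)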

 

To be continued...
