This tutorial is based on Scrapy 0.20.2 (the latest version is 0.22.0) and Python 2.7.6.
The development environment is Windows 7 SP1.
Valuable references on the Internet
1. Scrapy getting started tutorial
A translation of the documents on the official website, focused on the basic use of Scrapy: creating a project and using Item and Spider. We will briefly go over the content of the official tutorial.
2. Using Scrapy for large-scale crawling
This article does not go into code details; it shares practical Scrapy experience, although it is based on the much older version 0.9. We will briefly expand on the topics it covers.
1. Install scrapy
Because Scrapy's dependency libraries vary greatly between versions and the installation tutorials online target versions that are too old, it is recommended to simply install Scrapy directly and then run it: the error messages will tell you which library is still missing, and you can install each one as it comes up.
After Scrapy is installed, a Scripts directory appears under your Python installation. Remember to add it to the PATH environment variable so that you can use the "scrapy" command directly on the command line.
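Once the command works, a quick sanity check from the Python shell confirms which version is actually picked up (a minimal sketch; it only assumes Scrapy was installed into the default Python):

# check that Scrapy is importable and which version was installed
import scrapy
print scrapy.version_info   # e.g. (0, 20, 2)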
2. Create a project
Run "scrapy startproject project name" on the command line to directly create the scrapy project in the current directory.
The project directory contains the following files:
scrapy.cfg
Configuration file, which is rarely used in development.
Ajian\items.py
The Item classes are defined here; they describe the structured page data that the spider returns.
Ajian\pipelines.py
The returned data ends up here for further processing: store it in a database or analyze it however you like (see the sketches after this list).
Ajian\settings.py
The settings file has many interesting options, such as depth-first versus breadth-first crawling, the number of concurrent requests per IP address, the number of concurrent requests per domain name, the total number of concurrent requests, and the download delay (see the sketches after this list).
Ajian\__init__.py
Ajian\spiders\__init__.py
The two files above are required for Python packages and have little to do with Scrapy itself.
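To make the settings above concrete, here is a minimal settings.py sketch. The option names are real Scrapy settings; the values are illustrative assumptions, not recommendations.

# Ajian/settings.py (sketch; values chosen only for illustration)
BOT_NAME = 'ajian'
SPIDER_MODULES = ['ajian.spiders']
NEWSPIDER_MODULE = 'ajian.spiders'

DEPTH_PRIORITY = 1                   # positive values push the crawl toward breadth-first
DEPTH_LIMIT = 3                      # do not recurse deeper than this
CONCURRENT_REQUESTS = 32             # total number of concurrent requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # concurrent requests per domain name
CONCURRENT_REQUESTS_PER_IP = 4       # concurrent requests per IP; overrides the per-domain limit when non-zero
DOWNLOAD_DELAY = 0.5                 # seconds to wait between requests to the same site

And a minimal pipelines.py sketch; it is hypothetical and simply appends every item to a text file, but a database insert or any analysis could go here instead. Remember that a pipeline also has to be registered in the ITEM_PIPELINES setting before Scrapy will call it.

# Ajian/pipelines.py (sketch)
class AjianPipeline(object):
    def process_item(self, item, spider):
        # called once for every item the spider returns
        with open('items.txt', 'a') as f:
            f.write(repr(dict(item)) + '\n')
        return item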
3. Build the first spider
A project built with the startproject command generates pipelines.py by default but no spider file, which is confusing. It doesn't matter; we simply write one ourselves. Create a new file ajian_spider.py under the spiders directory with roughly the following content (a minimal sketch; it assumes the site being crawled is www.163.com, the 163 page referred to below):
from scrapy.spider import BaseSpider

class AjianSpider(BaseSpider):
    name = "ajian"
    start_urls = ["http://www.163.com"]   # the 163 homepage; assumed from the text below

    def parse(self, response):
        print response.body
The name variable and start_urls are mandatory. name is the spider name you specify when starting Scrapy, and start_urls is the list of initial URLs to crawl; multiple URLs can be given. Now a simple spider is ready. It has only one function, which prints the content of the 163 page.
This spider is still very crude, so let us add the most basic capabilities: returning page content and crawling recursively.
Before returning page content and crawling recursively, you should first learn some background knowledge.
The parse function in the spider file is actually a generator. Python has a keyword, yield, which is explained in detail in another article; here is a brief description. When yield is executed, it not only returns a value but also records the current state of the function. The next time the function is resumed, execution continues from the statement right after the yield. In effect the function runs halfway: it first hands a result back to the caller and then carries on.
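A tiny standalone illustration of yield (not Scrapy-specific; the function name is made up for the example):

# each call to next() resumes the function right after the previous yield
def count_up():
    print "before the first yield"
    yield 1
    print "between the two yields"
    yield 2

g = count_up()
print g.next()   # prints "before the first yield", then 1
print g.next()   # prints "between the two yields", then 2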
parse can return two kinds of objects, Item and Request, which correspond to page content and a recursive request respectively.
If a Request is returned, it is placed on the queue of pages to be crawled; if an Item is returned, it is passed to pipelines.py for processing.
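Putting the two together, here is a sketch of a parse function that yields both. Only the Item/Request mechanics come from the description above; the AjianItem class with its single title field and the XPath expressions are hypothetical.

import urlparse

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import Selector
from ajian.items import AjianItem   # hypothetical Item with a 'title' field

class AjianSpider(BaseSpider):
    name = "ajian"
    start_urls = ["http://www.163.com"]

    def parse(self, response):
        sel = Selector(response)

        # 1. return page content: the Item is handed to pipelines.py
        item = AjianItem()
        item['title'] = sel.xpath('//title/text()').extract()
        yield item

        # 2. recursive crawling: every Request goes back onto the list of pages to crawl
        for href in sel.xpath('//a/@href').extract():
            url = urlparse.urljoin(response.url, href)
            yield Request(url, callback=self.parse)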
To be continued...