Scrapy Operation Guide

Last Update: 2018-10-21 Source: Internet

Author: User

Scrapy installation (the following method is recommended because it also installs the packages Scrapy depends on):

Install Anaconda first, then run: conda install scrapy

To create a Scrapy project:

1. scrapy startproject <project name>
2. cd <project name>
3. scrapy genspider <spider name> www.baidu.com (the domain of the website to crawl)
4. Create a new run.py under the project root directory:

from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'quotes'])

Running this script starts the spider directly, so you do not have to type the crawl command every time.

Scrapy Selector Usage

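response.css('.text::text').extract()  # extracts the text of every element with class="text"; returns a list
response.css('.text::text').extract_first()  # takes only the first match; returns a str
response.css("div span::attr(href)").extract()  # extracts the value of the given attribute (here href)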

You can also use the XPath selector:

response.xpath("//a[@class='tag']/text()").extract()

How to save Scrapy output to a file

scrapy crawl quotes -o quotes.json  # save as JSON
scrapy crawl quotes -o quotes.jl  # save as JSON Lines
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle  # for data analysis
scrapy crawl quotes -o quotes.marshal  # for data analysis
scrapy crawl quotes -o ftp://user:[email protected]/path/to/quotes.csv  # save to a remote server
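A .jl (JSON Lines) file stores one JSON object per line, so it can be read back item by item with the standard library alone. A minimal sketch; the function name read_json_lines is hypothetical:

```python
import json
from typing import Iterator


def read_json_lines(path: str) -> Iterator[dict]:
    """Yield one item per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for item in read_json_lines("quotes.jl"): print(item)` would print each scraped item in turn without loading the whole file into memory.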

Using the Scrapy Spider

scrapy.Spider is the simplest spider, and every other spider must inherit from it. It provides no special functionality: it simply requests the given start_urls (or the requests produced by start_requests) and calls the spider's parse method on each response.

name
A string that defines the spider's name. It must be unique; name is the spider's most important attribute and is required.
allowed_domains
Optional. A list of domain names the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in the list are not followed.
start_urls
A list of URLs from which the spider starts crawling.
start_requests
This method must return an iterable containing the first requests the spider will crawl.

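def start_requests(self):
    # The original leaves the URL blank; iterating over start_urls is an assumption.
    for url in self.start_urls:
        yield scrapy.Request(url=url, callback=self.menu, method='POST')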

Using Scrapy settings

DOWNLOAD_DELAY = 2 ----------- Delay, in seconds, between requests
DEFAULT_REQUEST_HEADERS ----------- Default request header information
ROBOTSTXT_OBEY = True ----------- If enabled, Scrapy obeys the site's robots.txt policy
AUTOTHROTTLE_START_DELAY = 5 ----------- Initial download delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = ----------- Maximum download delay AutoThrottle may set
CONCURRENT_REQUESTS = ----------- Maximum number of concurrent requests, default 16
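In a project these settings are written in settings.py as plain Python assignments. The fragment below is a sketch: DOWNLOAD_DELAY, ROBOTSTXT_OBEY and AUTOTHROTTLE_START_DELAY use the values from the list above, while the header values and the AUTOTHROTTLE_MAX_DELAY value are placeholders chosen for illustration.

```python
# settings.py (fragment)

DOWNLOAD_DELAY = 2                # seconds to wait between requests
ROBOTSTXT_OBEY = True             # respect the site's robots.txt

DEFAULT_REQUEST_HEADERS = {       # placeholder header values (assumption)
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en",
}

AUTOTHROTTLE_ENABLED = True       # AutoThrottle is disabled unless switched on
AUTOTHROTTLE_START_DELAY = 5      # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60       # assumed value; the original leaves it blank
CONCURRENT_REQUESTS = 16          # Scrapy's default
```

The AUTOTHROTTLE_* delays only take effect when AUTOTHROTTLE_ENABLED is True.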

Crawling recursively by calling the same callback

next_page = response.css('.next::attr(href)').extract_first()
self.page += 1  # define page as a class attribute first
if self.page <= 3:  # limit the recursion depth
    yield scrapy.Request(url=next_page, callback=self.parse)

meta is a dictionary used mainly to pass values from one parse (callback) function to the next

# In the first function
yield scrapy.Request(title_urls, self.get_pics, meta={'title_name': title_name})

# In the next function
s = response.meta['title_name']

Crawler pause and restart

scrapy crawl <spider name> -s JOBDIR=<directory in which to save the crawl state>
For example: scrapy crawl cnblogs -s JOBDIR=zant/001
This command launches the specified spider and records its state in the specified directory.

Once the crawler has started, press Ctrl+C to stop it. After it stops, the record folder contains three more files. The p0 file in the requests.queue folder is the URL record file: while it exists, there are unfinished URLs, and it is deleted automatically once all URLs have been processed.

When we re-execute the command scrapy crawl cnblogs -s JOBDIR=zant/001, the crawler resumes from where it stopped.
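Based on the directory layout described above, a small standard-library helper can tell whether a job directory still holds pending requests. This is only a sketch under that assumption: the requests.queue folder and the p-prefixed file names follow this article's description, the function name has_pending_requests is hypothetical, and the on-disk layout is an implementation detail that may differ between Scrapy versions.

```python
from pathlib import Path


def has_pending_requests(jobdir: str) -> bool:
    """Return True if <jobdir>/requests.queue contains a file such as p0."""
    queue_dir = Path(jobdir) / "requests.queue"
    if not queue_dir.is_dir():
        return False
    # Assumption from the article: pending requests live in files named p0, p1, ...
    return any(
        entry.is_file() and entry.name.startswith("p")
        for entry in queue_dir.iterdir()
    )
```

For example, has_pending_requests("zant/001") could be checked before deciding whether to rerun the crawl command with the same JOBDIR.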

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.