Scrapy installation (the following method is recommended because it also installs the packages Scrapy depends on):
Install Anaconda first, then run: conda install scrapy
To create a Scrapy project:
1. scrapy startproject <project name>
2. cd <project name>
3. scrapy genspider <spider name> www.baidu.com (the domain of the website to crawl)
4. Create a new run.py in the project root directory:
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'quotes'])
Running this script starts the spider directly, so you do not have to type the command every time.
Scrapy Selector Usage
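response.css('.text::text').extract()
This extracts the text of every element that has the attribute class='text' and returns a list.
response.css('.text::text').extract_first()
This takes only the first match and returns a str.
response.css("div span::attr(href)").extract()
This extracts the value of the named attribute (here href) instead of the text.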
You can also use the XPath selector:
response.xpath("//a[@class='tag']/text()").extract()
How to save Scrapy output to a file
scrapy crawl quotes -o quotes.json  # save as JSON
scrapy crawl quotes -o quotes.jl  # save as JSON Lines
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle  # for data analysis
scrapy crawl quotes -o quotes.marshal  # for data analysis
scrapy crawl quotes -o ftp://user:[email protected]/path/to/quotes.csv  # save to a remote server
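The difference between the .json and .jl outputs is easiest to see on a tiny example. The sketch below uses only Python's standard json module and two made-up items; it imitates the shape of the two formats and is not Scrapy's own exporter code.

```python
import json

# Two hypothetical scraped items.
items = [
    {"text": "First quote", "author": "A"},
    {"text": "Second quote", "author": "B"},
]

# JSON: the whole crawl is one array, so it has to be parsed in one piece.
as_json = json.dumps(items)

# JSON Lines: one self-contained JSON object per line, so the file can be
# read (or appended to) line by line.
as_json_lines = "\n".join(json.dumps(item) for item in items)

# Reading JSON Lines back, one record at a time.
parsed = [json.loads(line) for line in as_json_lines.splitlines()]
```

For large crawls the line-based form is usually easier to process incrementally, which is one reason to prefer quotes.jl over quotes.json.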
Using the Scrapy Spider class
Spider is the simplest spider class, and every other spider must inherit from it. It provides no special functionality: it simply requests the given start_urls (or the requests produced by start_requests) and calls the spider's parse method on each returned response.
name
A string that defines the name of the spider. It must be unique; name is the most important attribute of the spider and is required.
allowed_domains
Optional. A list of the domain names the spider is allowed to crawl. When OffsiteMiddleware is enabled, URLs whose domain is not in the list are not followed.
start_urls
A list of URLs from which the spider starts crawling.
start_requests
This method must return an iterable containing the first requests that the spider will crawl.
def start_requests(self):
    # replace ... with the URL to request; self.menu is the callback
    yield scrapy.Request(url=..., callback=self.menu, method='POST')
Using Scrapy settings
- DOWNLOAD_DELAY = 2 ----------- Sets the interval between requests, in seconds
- DEFAULT_REQUEST_HEADERS ----------- Sets the default request header information
- ROBOTSTXT_OBEY = True ----------- If enabled, Scrapy obeys the site's robots.txt policy
- AUTOTHROTTLE_START_DELAY = 5 ----------- Initial download delay, in seconds
- AUTOTHROTTLE_MAX_DELAY ----------- Maximum download delay under high latency
- CONCURRENT_REQUESTS ----------- Maximum number of concurrent requests (not threads), default 16
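Taken together, these options might appear in the project's settings.py roughly as follows. DOWNLOAD_DELAY, ROBOTSTXT_OBEY and AUTOTHROTTLE_START_DELAY use the values from the list above, and CONCURRENT_REQUESTS is the default of 16 that the list mentions. The header values, AUTOTHROTTLE_MAX_DELAY = 60 and AUTOTHROTTLE_ENABLED = True are assumptions added for illustration; the AUTOTHROTTLE_* delays only take effect when the AutoThrottle extension is enabled.

```python
# settings.py (fragment)

DOWNLOAD_DELAY = 2                 # seconds to wait between requests

DEFAULT_REQUEST_HEADERS = {        # example header values (assumed)
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

ROBOTSTXT_OBEY = True              # respect the site's robots.txt

AUTOTHROTTLE_ENABLED = True        # needed for the two delays below to apply
AUTOTHROTTLE_START_DELAY = 5       # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60        # assumed upper bound for the delay

CONCURRENT_REQUESTS = 16           # maximum number of concurrent requests
```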
Crawling recursively by requesting the same callback again
next_page = response.css('.next::attr(href)').extract_first()
self.page += 1  # page must first be defined as a class variable
if self.page <= 3:  # limit the recursion depth
    yield scrapy.Request(url=next_page, callback=self.parse)
meta is a dictionary that is mainly used to pass values between callback (parsing) functions
# in the first callback
yield scrapy.Request(title_urls, self.get_pics, meta={'title_name': title_name})
# in the next callback
s = response.meta['title_name']
Pausing and resuming a crawl
scrapy crawl <spider name> -s JOBDIR=<path where the crawl state is saved>
For example: scrapy crawl cnblogs -s JOBDIR=zant/001
This command launches the specified spider and records its state in the specified directory.
Once the spider has started, press CTRL + C to stop it. If you then look at the state directory, you will find three new files. The p0 file in the requests.queue folder is the record of pending URLs: while it exists there are unfinished URLs, and it is deleted automatically once all URLs have been processed.
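The check described in the previous paragraph can be scripted. The helper below is a hypothetical convenience function, not part of Scrapy, and it assumes the directory layout described here (a p0 file inside requests.queue); the exact file names may differ between Scrapy versions.

```python
import tempfile
from pathlib import Path


def has_unfinished_requests(jobdir):
    """Return True if <jobdir>/requests.queue/p0 exists."""
    return (Path(jobdir) / "requests.queue" / "p0").is_file()


# Offline check with a temporary directory standing in for zant/001.
with tempfile.TemporaryDirectory() as jobdir:
    before = has_unfinished_requests(jobdir)   # nothing saved yet

    queue_dir = Path(jobdir) / "requests.queue"
    queue_dir.mkdir()
    (queue_dir / "p0").write_bytes(b"")        # simulate a paused crawl

    after = has_unfinished_requests(jobdir)
```

In practice one would call has_unfinished_requests("zant/001") to decide whether a crawl still has work left before starting it again.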
When we re-execute the same command, scrapy crawl cnblogs -s JOBDIR=zant/001, the crawler resumes from where it stopped.