No. 353, Python distributed crawler builds a search engine, Scrapy explained: Scrapy pause and restart
Every spider in Scrapy can record its paused state, including which URLs have already been crawled at the moment of pausing; when the spider is restarted, it resumes from that paused state and crawls only the URLs that have not been crawled yet.
Implementing pause and restart with state recording
1. First cd into the Scrapy project directory.
2. Create a folder inside the Scrapy project to hold the record information.
3. Execute the command:
scrapy crawl <spider name> -s JOBDIR=<path to the record folder>
For example: scrapy crawl cnblogs -s JOBDIR=zant/001
Executing this command starts the specified spider and records its state in the specified directory.
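Since JOBDIR is an ordinary Scrapy setting, passing it with -s on the command line is not the only option; it can also be fixed in the project's settings.py (or in a spider's custom_settings dict). A minimal sketch, reusing the zant/001 path from the example above:

# settings.py (sketch): fix the record directory so every run of this
# project resumes from the same JOBDIR without passing -s each time.
JOBDIR = "zant/001"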
Once the spider is running, we can press Ctrl+C to stop it. Press Ctrl+C only once: Scrapy then shuts down gracefully and saves its state, while pressing it a second time forces an immediate shutdown and the state may not be saved correctly.
After stopping the spider, if we look inside the record folder we will see three new entries.
The p0 file inside the requests.queue folder is the record of pending request URLs; as long as this file exists there are still unfinished URLs, and it is deleted automatically once all URLs have been crawled.
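For orientation, the record folder typically looks roughly like this after stopping (names as produced by a typical Scrapy run; requests.seen holds the fingerprints of requests already seen, and spider.state holds the spider's persisted state dict):

zant/001/
    requests.queue/
        p0
    requests.seen
    spider.state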
When we re-execute the command scrapy crawl cnblogs -s JOBDIR=zant/001, the spider picks up the p0 file and continues crawling from where it was stopped.
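A related detail of the same Scrapy jobs feature: while the crawl runs with a JOBDIR, the dict exposed as the spider's state attribute is also persisted across pause and restart (it is what the spider.state file stores), so a spider can keep counters or other small progress markers between runs. A minimal sketch, where the spider name cnblogs comes from the example above but the start URL, the link-following logic, and the pages_seen key are placeholders chosen for illustration:

import scrapy

class CnblogsSpider(scrapy.Spider):
    # Sketch of a spider that keeps a counter across pause/restart.
    name = "cnblogs"
    start_urls = ["https://news.cnblogs.com/"]  # placeholder start URL

    def parse(self, response):
        # self.state is an ordinary dict; when the crawl is started with
        # -s JOBDIR=..., Scrapy writes it to the spider.state file on
        # shutdown and loads it again when the crawl is resumed.
        self.state["pages_seen"] = self.state.get("pages_seen", 0) + 1
        self.logger.info("pages seen so far, across restarts: %d",
                         self.state["pages_seen"])
        # Placeholder link following, just to keep the crawl going.
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)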