The official tutorial only shows how to list the URLs to crawl in start_urls. Suppose I now want to crawl a site's articles, where the URLs differ only in an ID that runs from 1 to 1,000,000. I obviously can't write a million URLs into the start_urls list, so what should I do?
To build that many URLs, my first thought was to set start_urls to a generator:
def urls():
    # no "self" here: this runs at class-definition time to build the generator
    i = 1
    while i < 1000000:
        yield "http://example/articles/%d" % i
        i += 1

start_urls = urls()
Luckily, it actually works!
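Incidentally, the helper function isn't even necessary; a generator expression does the same thing in one line (same hypothetical URL pattern as above):

# equivalent: build the million URLs lazily with a generator expression
start_urls = ("http://example/articles/%d" % i for i in range(1, 1000000))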
After digging a little deeper, though, I found that hijacking start_urls like this is not a good idea; it is better to replace it with the following:
start_urls = []  # the default empty list is fine

def start_requests(self):
    i = 1
    while i < 1000000:
        url = "http://example/articles/%d" % i
        yield self.make_requests_from_url(url)
        i += 1
The parent Spider class already provides a start_requests method, and the spider gets its initial URLs from it when it starts up, so all we have to do is override it.
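For context, here is roughly what that parent implementation looks like in older Scrapy releases (a from-memory sketch of the base class internals, not the literal library source; Request comes from scrapy):

# inside scrapy.Spider (approximate, older releases):

def start_requests(self):
    # walk start_urls and turn every entry into a Request
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    # dont_filter=True so start URLs are never dropped by the dupe filter
    return Request(url, dont_filter=True)

Because the stock start_requests simply iterates start_urls, assigning a generator to start_urls happened to work in the earlier trick; overriding start_requests just expresses the same intent directly.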
Implementing Pause and Resume
Going one step further: since start_urls is no longer actually used, the built-in pause/resume support can no longer record how far the crawl has progressed, so that needs some manual handling:
def start_requests(self):
    # .get() lets us fall back to 0 when there is no saved state yet
    i = self.state.get('urlcursor', 0) + 1
    while i < 1000000:
        url = "http://example/articles/%d" % i
        yield self.make_requests_from_url(url)
        self.state['urlcursor'] = i  # remember the last ID handed out
        i += 1
Now the spider saves this cursor value every time it is paused, and on the next run it picks the loop back up from where it left off.
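To tie it together, here is a minimal sketch of the whole spider; the name, URL pattern, and parse logic are placeholders, and it yields scrapy.Request directly because newer Scrapy versions deprecate make_requests_from_url. Note that self.state is only created and saved when the crawl is run with a job directory.

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"  # placeholder name

    def start_requests(self):
        # self.state only exists (and gets persisted) when JOBDIR is configured
        i = self.state.get('urlcursor', 0) + 1
        while i < 1000000:
            yield scrapy.Request("http://example/articles/%d" % i)
            self.state['urlcursor'] = i  # remember the last ID handed out
            i += 1

    def parse(self, response):
        # placeholder extraction logic
        yield {"url": response.url, "title": response.css("title::text").get()}

Run it with something like scrapy crawl articles -s JOBDIR=crawls/articles-1; a single Ctrl-C pauses the crawl gracefully, and rerunning the same command resumes from the saved urlcursor.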