The official tutorial only shows how to list the URLs to crawl in start_urls. Suppose I now want to crawl a site's articles, where the URLs differ only in an ID that runs from 1 to 1,000,000. I obviously can't write a million URLs into the start_urls list, so what should I do?
To build that many URLs, my first thought was to set start_urls to a generator:
def urls():
    # no "self" here: this runs at class-definition time to build the generator
    i = 1
    while i < 1000000:
        yield "http://example/articles/%d" % i
        i += 1

start_urls = urls()
Luckily, it actually works!
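Incidentally, the helper function isn't even necessary; a generator expression does the same thing in one line (same hypothetical URL pattern as above):

# equivalent: build the million URLs lazily with a generator expression
start_urls = ("http://example/articles/%d" % i for i in range(1, 1000000))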
After digging a little deeper, though, I found that hijacking start_urls like this is not a good idea; it is better to replace it with the following:
start_urls = []  # the default empty list is fine

def start_requests(self):
    i = 1
    while i < 1000000:
        url = "http://example/articles/%d" % i
        yield self.make_requests_from_url(url)
        i += 1
The parent Spider class already provides a start_requests method, and the spider gets its initial URLs from it when it starts up, so all we have to do is override it.
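For context, here is roughly what that parent implementation looks like in older Scrapy releases (a from-memory sketch of the base class internals, not the literal library source; Request comes from scrapy):

# inside scrapy.Spider (approximate, older releases):

def start_requests(self):
    # walk start_urls and turn every entry into a Request
    for url in self.start_urls:
        yield self.make_requests_from_url(url)

def make_requests_from_url(self, url):
    # dont_filter=True so start URLs are never dropped by the dupe filter
    return Request(url, dont_filter=True)

Because the stock start_requests simply iterates start_urls, assigning a generator to start_urls happened to work in the earlier trick; overriding start_requests just expresses the same intent directly.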
Implementing Pause and Resume
Going one step further: since start_urls is no longer actually used, the built-in pause/resume support can no longer record how far the crawl has progressed, so that needs some manual handling:
def start_requests(self):
    # .get() lets us fall back to 0 when there is no saved state yet
    i = self.state.get('urlcursor', 0) + 1
    while i < 1000000:
        url = "http://example/articles/%d" % i
        yield self.make_requests_from_url(url)
        self.state['urlcursor'] = i  # remember the last ID handed out
        i += 1
Now the spider saves this cursor value every time it is paused, and on the next run it picks the loop back up from where it left off.
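To tie it together, here is a minimal sketch of the whole spider; the name, URL pattern, and parse logic are placeholders, and it yields scrapy.Request directly because newer Scrapy versions deprecate make_requests_from_url. Note that self.state is only created and saved when the crawl is run with a job directory.

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"  # placeholder name

    def start_requests(self):
        # self.state only exists (and gets persisted) when JOBDIR is configured
        i = self.state.get('urlcursor', 0) + 1
        while i < 1000000:
            yield scrapy.Request("http://example/articles/%d" % i)
            self.state['urlcursor'] = i  # remember the last ID handed out
            i += 1

    def parse(self, response):
        # placeholder extraction logic
        yield {"url": response.url, "title": response.css("title::text").get()}

Run it with something like scrapy crawl articles -s JOBDIR=crawls/articles-1; a single Ctrl-C pauses the crawl gracefully, and rerunning the same command resumes from the saved urlcursor.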