Python Scrapy: crawling iterated URLs with pause and resume support

Source: Internet
Author: User

The official tutorial only shows start_urls as the way to define the list of URLs to crawl. Suppose I want to crawl the articles of a site where only the ID part of each URL changes, with IDs running from 1 to 1,000,000. I cannot write a million URLs into the start_urls list, so what should I do?
To build that many URLs, my first thought was to set start_urls to a generator:

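    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # start_urls only needs to be iterable, so a generator works
        self.start_urls = self.urls()

    def urls(self):
        # Lazily yield one article URL per ID, 1 through 1,000,000
        i = 1
        while i <= 1000000:
            yield "http://example/articles/%d" % i
            i += 1

Fortunately, this works!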
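The reason a generator suits this case can be seen outside Scrapy: URLs are produced one at a time, on demand, instead of being stored as a million-element list. The sketch below is plain Python; the function name article_urls and its parameters are illustrative and not part of Scrapy.

```python
from itertools import islice


def article_urls(first_id=1, last_id=1000000):
    """Lazily yield one article URL per ID, first_id through last_id inclusive."""
    for article_id in range(first_id, last_id + 1):
        yield "http://example/articles/%d" % article_id


urls = article_urls()

# Pull only the first three URLs; the remaining ones are not built yet.
first_three = list(islice(urls, 3))
print(first_three)

# The generator remembers its position, so the next URL is for ID 4.
next_url = next(urls)
print(next_url)
```

A generator can only be consumed once, which is fine here because the spider reads its start URLs a single time at startup.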

After further study, I found that changing start_urls directly is not a good approach; it is probably better to override start_requests instead, as follows.

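    start_urls = []  # the default is fine

    def start_requests(self):
        i = 1
        while i <= 1000000:
            url = "http://example/articles/%d" % i
            # Note: make_requests_from_url is deprecated in later Scrapy
            # releases; scrapy.Request(url, dont_filter=True) is the equivalent.
            yield self.make_requests_from_url(url)
            i += 1

The parent Spider class already defines a start_requests method, and the spider calls it at startup to obtain its initial requests, so we only need to override it.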

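Implementing pause and resume
Going one step further: because start_urls is no longer used, Scrapy's built-in pause and resume support (enabled by running the crawl with a JOBDIR setting) cannot record how far the loop has progressed, so start_requests would begin again from ID 1 on every resume. The progress has to be saved by hand.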

    def start_requests(self):
        # get() lets us fall back to 0 when no cursor has been saved yet;
        # self.state is only available and persisted when JOBDIR is set
        i = self.state.get('urlcursor', 0) + 1
        while i <= 1000000:
            url = "http://example/articles/%d" % i
            # Save the ID being issued, so that "+ 1" on resume skips nothing
            self.state['urlcursor'] = i
            yield self.make_requests_from_url(url)
            i += 1

Now the spider saves the cursor in its state every time it pauses, and when it is resumed the loop continues from the ID where it last stopped.
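The cursor logic can be checked without running a crawl. In the sketch below a plain dict stands in for self.state, and resumable_urls and last_id are hypothetical names chosen for illustration. It assumes the cursor stores the last ID already handed out, so a resumed run starts at the cursor plus one.

```python
from itertools import islice


def resumable_urls(state, last_id=10):
    """Yield article URLs after the saved cursor, recording each ID before it is issued."""
    i = state.get("urlcursor", 0) + 1
    while i <= last_id:
        state["urlcursor"] = i  # remember the ID being handed out
        yield "http://example/articles/%d" % i
        i += 1


state = {}  # stand-in for the dict Scrapy keeps in self.state

# First run: take three URLs, then "pause" by abandoning the generator.
first_run = list(islice(resumable_urls(state), 3))

# Second run: a fresh generator picks up right after the saved cursor.
second_run = list(islice(resumable_urls(state), 2))

print(first_run)
print(second_run)
print(state)
```

If the cursor were instead updated after incrementing i, the "+ 1" on resume would skip one ID, which is why the order of the two statements matters.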
