Usage of the Python scheduler

Continuing from the previous article, where we wrote the crawler scheduler. The scheduler is the "brain" of the whole crawler, or its command center, and what we need to do now is write the other components the scheduler uses. The first is the URL manager. As a manager, it must distinguish between URLs that are still to be crawled and URLs that have already been crawled; otherwise pages would be crawled repeatedly. In this tutorial the two groups of URLs are kept in two sets, that is, in memory, since the amount of data is small compared with what we crawl; they could of course be stored elsewhere, such as a cache or a relational database.
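As a quick illustration of why a set is convenient here (this snippet is only an illustration, not part of the crawler):

# Quick illustration: a Python set automatically ignores duplicates,
# so the same URL can never be queued twice.
urls = set()
urls.add('http://example.com/page1')
urls.add('http://example.com/page1')   # adding the same URL again has no effect
print(len(urls))   # prints 1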

The URL manager appears in the scheduler five times:

The first time is in the scheduler's initialization function, where the UrlManager object is created.

The second time is when the add_new_url method is called to add the initial URL to the set of URLs to be crawled.

The third time is when, during crawling, the scheduler checks whether there are still URLs to be crawled.

The fourth time is when a URL to be crawled is taken out of the set.

The fifth time is when the new URLs parsed from a page are added back into the set of URLs to be crawled.

So what we're going to do next is use code to implement these features:

class UrlManager(object):
    """docstring for UrlManager"""
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    # Add a single new URL to the manager
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Add a batch of URLs parsed from crawled pages
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # Check whether there are still URLs to be crawled
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Take a URL out of the manager to crawl
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

OK, that's it; the URL manager is done!
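To make the five call sites listed above concrete, here is a minimal sketch of how a scheduler loop might drive this class. The seed URL and the empty list of parsed URLs are placeholders of my own, not code from the previous article:

# Minimal sketch of the five call sites described above (assumes the
# UrlManager class defined earlier). The seed URL and the empty list of
# parsed URLs are placeholders, not the real downloader/parser output.
manager = UrlManager()                              # 1st: created when the scheduler is initialized
manager.add_new_url('http://example.com/seed')      # 2nd: add the initial URL

while manager.has_new_url():                        # 3rd: are there URLs left to crawl?
    url = manager.get_new_url()                     # 4th: take one URL out to crawl
    # ... download and parse the page here ...
    found_urls = []                                 # placeholder for URLs parsed from the page
    manager.add_new_urls(found_urls)                # 5th: add newly found URLs back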

Next is the HTML downloader. Its job is very simple: fetch the page the program wants to access and hand back its contents.

The downloader appears in the scheduler only twice:

The first time is when it is created during initialization.

The second time is immediately after a URL is taken from the manager, when the scheduler calls the downloader to fetch the page.

For the downloader, the original tutorial uses the urllib library, which I find a bit cumbersome, so I replaced it with a better library: requests. This library hides a lot of the technical details, fetches the page we want directly, and is very simple to use.

import requests

class HtmlDownloader(object):
    """docstring for HtmlDownloader"""
    def download(self, url):
        if url is None:
            return
        response = requests.get(url, timeout=0.1)
        response.encoding = 'utf-8'
        if response.status_code == requests.codes.ok:
            return response.text
        else:
            return

A brief description of this code:

A. First import the requests library. Because it is a third-party library, you need to install it yourself; enter the following at the command line: pip install requests

B. Then write the downloader class, which has only one method, download. This method first receives the URL you pass in and checks whether it is None.

C. Then call requests' get method, which here is given two arguments: the URL and a timeout.

The timeout is something I added myself; it is the access timeout. Without it, the program can appear to hang: it will keep waiting for the page to respond and never raise an exception.
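When the timeout is exceeded, requests raises requests.exceptions.Timeout. A small sketch of catching it so one slow page does not stop the whole crawl (the URL is a placeholder; 0.1 seconds just mirrors the value used above):

import requests

# Sketch: catch the timeout so a single slow page does not stop the crawl.
# The URL is a placeholder; 0.1 s mirrors the value used in the downloader above.
try:
    response = requests.get('http://example.com', timeout=0.1)
except requests.exceptions.Timeout:
    print('request timed out, skipping this URL')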

D. Then set the encoding on the returned response. The Baidu Encyclopedia pages we crawl are utf-8, so it is best to set it explicitly; requests does guess the encoding on its own, but setting it manually is more reliable.
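If you would rather not hardcode utf-8, one alternative (not from the original tutorial) is to use the encoding that requests itself detects from the page body:

import requests

# Alternative sketch (assumption, not the tutorial's approach): let requests'
# own detection choose the encoding instead of hardcoding utf-8.
response = requests.get('http://example.com', timeout=0.1)   # placeholder URL
response.encoding = response.apparent_encoding                # guessed from the page body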

E. Then check whether the page responded normally. requests.codes.ok is actually 200, which means the page responded normally; writing response.status_code == 200 directly here is also fine.

F. Finally, return the full contents of the page; response.text is a string containing all the code of the page (HTML, CSS, JS).
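A quick usage sketch of the downloader on its own (the URL is just a placeholder, and it assumes the HtmlDownloader class defined above):

# Sketch: using HtmlDownloader by itself; the URL is a placeholder.
downloader = HtmlDownloader()
html = downloader.download('http://example.com')
if html is not None:
    print(html[:200])   # first 200 characters of the page source
else:
    print('download failed')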
