Usage of the Python scheduler

Continuing from the previous article, where we wrote the crawler scheduler. The scheduler is the "brain" of the whole crawler, or its command center, and what we need to do now is write the other components the scheduler uses. The first is the URL manager. As a manager, it must distinguish between URLs that are still to be crawled and URLs that have already been crawled; otherwise pages would be crawled repeatedly. In this tutorial the two groups of URLs are kept in two sets, that is, in memory, since the amount of data is small compared with what we crawl; they could of course be stored elsewhere, such as a cache or a relational database.
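As a quick illustration of why a set is convenient here (this snippet is only an illustration, not part of the crawler):

# Quick illustration: a Python set automatically ignores duplicates,
# so the same URL can never be queued twice.
urls = set()
urls.add('http://example.com/page1')
urls.add('http://example.com/page1')   # adding the same URL again has no effect
print(len(urls))   # prints 1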

The URL manager appears in the scheduler five times:

The first time is in the scheduler's initialization function, where the UrlManager object is created.

The second time is when the add_new_url method is called to add the initial URL to the set of URLs to be crawled.

The third time is when, during crawling, the scheduler checks whether there are still URLs to be crawled.

The fourth time is when a URL to be crawled is taken out of the set.

The fifth time is when the new URLs parsed from a page are added back into the set of URLs to be crawled.

So what we're going to do next is use code to implement these features:

class UrlManager(object):
    """docstring for UrlManager"""
    def __init__(self):
        self.new_urls = set()   # URLs waiting to be crawled
        self.old_urls = set()   # URLs that have already been crawled

    # Add a single new URL to the manager
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Add a batch of URLs parsed from crawled pages
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # Check whether there are still URLs to be crawled
    def has_new_url(self):
        return len(self.new_urls) != 0

    # Take a URL out of the manager to crawl
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

OK, that's it; the URL manager is done!
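To make the five call sites listed above concrete, here is a minimal sketch of how a scheduler loop might drive this class. The seed URL and the empty list of parsed URLs are placeholders of my own, not code from the previous article:

# Minimal sketch of the five call sites described above (assumes the
# UrlManager class defined earlier). The seed URL and the empty list of
# parsed URLs are placeholders, not the real downloader/parser output.
manager = UrlManager()                              # 1st: created when the scheduler is initialized
manager.add_new_url('http://example.com/seed')      # 2nd: add the initial URL

while manager.has_new_url():                        # 3rd: are there URLs left to crawl?
    url = manager.get_new_url()                     # 4th: take one URL out to crawl
    # ... download and parse the page here ...
    found_urls = []                                 # placeholder for URLs parsed from the page
    manager.add_new_urls(found_urls)                # 5th: add newly found URLs back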

Next is the HTML downloader. Its job is very simple: fetch the page the program wants to access and hand back its contents.

The downloader appears in the scheduler only twice:

The first time is when it is created during initialization.

The second time is immediately after a URL is taken from the manager, when the scheduler calls the downloader to fetch the page.

For the downloader, the original tutorial uses the urllib library, which I find a bit cumbersome, so I replaced it with a better library: requests. This library hides a lot of the technical details, fetches the page we want directly, and is very simple to use.

import requests

class HtmlDownloader(object):
    """docstring for HtmlDownloader"""
    def download(self, url):
        if url is None:
            return
        response = requests.get(url, timeout=0.1)
        response.encoding = 'utf-8'
        if response.status_code == requests.codes.ok:
            return response.text
        else:
            return

A brief description of this code:

A. First import the requests library. Because it is a third-party library, you need to install it yourself; enter the following at the command line: pip install requests

B. Then write the downloader class, which has only one method, download. This method first receives the URL you pass in and checks whether it is None.

C. Then call requests' get method, which here is given two arguments: the URL and a timeout.

The timeout is something I added myself; it is the access timeout. Without it, the program can appear to hang: it will keep waiting for the page to respond and never raise an exception.
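When the timeout is exceeded, requests raises requests.exceptions.Timeout. A small sketch of catching it so one slow page does not stop the whole crawl (the URL is a placeholder; 0.1 seconds just mirrors the value used above):

import requests

# Sketch: catch the timeout so a single slow page does not stop the crawl.
# The URL is a placeholder; 0.1 s mirrors the value used in the downloader above.
try:
    response = requests.get('http://example.com', timeout=0.1)
except requests.exceptions.Timeout:
    print('request timed out, skipping this URL')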

D. Then set the encoding on the returned response. The Baidu Encyclopedia pages we crawl are utf-8, so it is best to set it explicitly; requests does guess the encoding on its own, but setting it manually is more reliable.
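If you would rather not hardcode utf-8, one alternative (not from the original tutorial) is to use the encoding that requests itself detects from the page body:

import requests

# Alternative sketch (assumption, not the tutorial's approach): let requests'
# own detection choose the encoding instead of hardcoding utf-8.
response = requests.get('http://example.com', timeout=0.1)   # placeholder URL
response.encoding = response.apparent_encoding                # guessed from the page body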

E. Then check whether the page responded normally. requests.codes.ok is actually 200, which means the page responded normally; writing response.status_code == 200 directly here is also fine.

F. Finally, return the full contents of the page; response.text is a string containing all the code of the page (HTML, CSS, JS).
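A quick usage sketch of the downloader on its own (the URL is just a placeholder, and it assumes the HtmlDownloader class defined above):

# Sketch: using HtmlDownloader by itself; the URL is a placeholder.
downloader = HtmlDownloader()
html = downloader.download('http://example.com')
if html is not None:
    print(html[:200])   # first 200 characters of the page source
else:
    print('download failed')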
