Python Crawler & Problem Solving & Thinking (3)


Continuing from the previous article. There we wrote the crawler scheduler, which is the "brain" of the whole crawler program; you could also call it the command center. Now we need to write the other components the scheduler uses. The first is the URL manager. Since it acts as a manager, it has to distinguish the URLs still to be crawled from the URLs that have already been crawled; otherwise pages would be crawled repeatedly. The tutorial keeps both groups of URLs in two sets, that is, in memory, which is enough given how little data this crawler handles; they could also be stored elsewhere, for example in a cache or a relational database.
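As an aside, if the crawled URLs needed to survive a restart, the same bookkeeping could sit in a relational database instead of in-memory sets. The sketch below only illustrates that idea with the standard-library sqlite3 module; the class name, table name, and schema are my own assumptions, not part of the tutorial.

import sqlite3

class SqliteUrlManager(object):
    """Hypothetical URL manager backed by SQLite instead of in-memory sets."""
    def __init__(self, path='urls.db'):
        self.conn = sqlite3.connect(path)
        # One row per URL; crawled = 0 means "to be crawled", 1 means "already crawled".
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, crawled INTEGER)')

    def add_new_url(self, url):
        if url is None:
            return
        # INSERT OR IGNORE skips URLs that are already in the table.
        self.conn.execute('INSERT OR IGNORE INTO urls VALUES (?, 0)', (url,))
        self.conn.commit()

    def has_new_url(self):
        row = self.conn.execute('SELECT COUNT(*) FROM urls WHERE crawled = 0').fetchone()
        return row[0] != 0

    def get_new_url(self):
        row = self.conn.execute('SELECT url FROM urls WHERE crawled = 0 LIMIT 1').fetchone()
        if row is None:
            return None
        self.conn.execute('UPDATE urls SET crawled = 1 WHERE url = ?', (row[0],))
        self.conn.commit()
        return row[0]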

Looking back at the scheduler code, the URL manager is used five times in total:

The first time is in the scheduler's initialization function, where the UrlManager object is created;

The second is the call to add_new_url that adds the initial URL to the set of URLs to be crawled;

The third is checking, during the crawl loop, whether there is still a URL waiting to be crawled;

The fourth is taking a URL to be crawled out of the set;

The fifth is adding the batch of new URLs parsed from a page to the set of URLs to be crawled.

Then, we need to implement these functions using code:

class UrlManager(object):
    """docstring for UrlManager"""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    # Add a single new url to the manager
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # Batch-add urls (parsed from crawled pages) to the manager
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # Determine whether there is a new url left to crawl
    def has_new_url(self):
        return (len(self.new_urls) != 0)

    # Retrieve a new url from the manager
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url

Okay. Now, the url manager is ready!
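To make the five call sites listed above concrete, here is a minimal sketch of how a scheduler might drive this UrlManager. It is an illustration, not the tutorial's actual scheduler: the HtmlDownloader is only written further down, and parse_links stands in for the HTML parser, which this part does not cover.

# Hypothetical crawl loop; parse_links is a placeholder for the HTML parser.
def crawl(root_url, max_pages=10):
    urls = UrlManager()                       # 1. created during initialization
    urls.add_new_url(root_url)                # 2. add the initial url
    downloader = HtmlDownloader()
    count = 0
    while urls.has_new_url():                 # 3. anything left to crawl?
        new_url = urls.get_new_url()          # 4. take one url out of the set
        html = downloader.download(new_url)
        if html is None:
            continue
        urls.add_new_urls(parse_links(html))  # 5. queue the urls found on the page
        count += 1
        if count >= max_pages:
            break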

The next step is the URL downloader. It is simply a component that fetches the page the program wants to visit and hands back its content.

  

The downloader appears only twice in the scheduler:

The first time is when it is created during initialization;

The second is right after a URL is obtained, when the downloader is called to fetch that page.

For the downloader, the original tutorial uses the urllib library, which I find a little cumbersome, so I switched to a better library: requests. It hides a lot of the low-level details and lets us fetch the pages we want to visit directly. It is very easy to use.

  

import requests

class HtmlDownloader(object):
    """docstring for HtmlDownloader"""
    def download(self, url):
        if url is None:
            return
        response = requests.get(url, timeout=0.1)
        response.encoding = 'utf-8'
        if response.status_code == requests.codes.ok:
            return response.text
        else:
            return

 

A brief description of this code:

A. First, import the requests library. Because it is a third-party library, you need to install it yourself by entering pip install requests on the command line.

B. Then write the downloader class. It has only one method, download, which takes the given URL and first checks that it is not None.

C. Call the get method of requests, which takes two arguments here: the url and a timeout.

The timeout is something I added myself; it is the access timeout. Without it the program can hang, waiting for a response from the page without ever raising an exception.
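Note that when the timeout does expire, requests raises an exception instead of returning, so a crawler that wants to keep going has to catch it. A minimal sketch, assuming we simply skip the page (the helper name safe_download and the 0.1 second value are just for illustration):

import requests

def safe_download(url):
    try:
        # requests.exceptions.Timeout covers both connect and read timeouts.
        response = requests.get(url, timeout=0.1)
    except requests.exceptions.Timeout:
        return None   # give up on this page and let the crawler move on
    except requests.exceptions.RequestException:
        return None   # other network errors (DNS failure, refused connection, ...)
    response.encoding = 'utf-8'
    return response.text if response.status_code == requests.codes.ok else None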

D. Then set the encoding of the returned response. Because the Baidu Baike pages being crawled are UTF-8, it is best to set it explicitly here. requests can guess the encoding on its own, but setting it manually is more reliable.

E. Check whether the page responded normally. requests.codes.ok is simply 200, the status code for a normal response, so you could just as well write response.status_code == 200 here.
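The two spellings are interchangeable. requests also provides raise_for_status(), which turns 4xx/5xx responses into exceptions; this is just an alternative style, not what the tutorial uses (the URL below is only an example):

import requests

response = requests.get('https://www.baidu.com', timeout=5)
print(response.status_code == requests.codes.ok)   # same check as status_code == 200

# Alternative: raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx.
try:
    response.raise_for_status()
    print('page responded normally')
except requests.exceptions.HTTPError as err:
    print('bad response:', err)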

F. Finally, return the full content of the page. response.text is a string containing all of the page's code (HTML, CSS, JS).
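Putting it together, a quick way to try the downloader on its own might look like the snippet below. The URL is just an example of a Baidu Baike entry page, and with the very short 0.1 second timeout above a real request can easily time out, which is why the exception is caught here:

import requests   # already imported in the downloader module; repeated here for clarity

downloader = HtmlDownloader()
try:
    html = downloader.download('https://baike.baidu.com/item/Python')
except requests.exceptions.Timeout:
    html = None   # the 0.1 second timeout is easy to hit on a real network
if html is None:
    print('download failed')
else:
    print('downloaded %d characters' % len(html))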

 
