Python Crawler & Problem Solving & Thinking (3)
Continuing from the previous article: there we wrote the crawler scheduler, which is the "brain" of the whole crawler program, or its command center. Now we need to write the other components the scheduler uses. The first is the url manager. As the manager, it must distinguish the URLs waiting to be crawled from the URLs that have already been crawled; otherwise pages would be crawled repeatedly. Here the tutorial stores both kinds of URLs in Python sets, that is, in memory. The amount of URL data for this crawler is small, but it could also be stored elsewhere, such as in a cache or a relational database.
Looking at the scheduler, the url manager is used five times in total:
The first time is in the scheduler's initialization function, where the UrlManager object is created.
The second time is the call to the add_new_url method, which adds the initial url to the set of URLs to crawl.
The third time is during the crawl loop, to check whether there are still URLs waiting to be crawled.
The fourth time is to take the next url to crawl out of the set.
The fifth time is to add the batch of URLs parsed from the page to the set of URLs to crawl.
Then, we need to implement these functions using code:
class UrlManager(object):
    """docstring for UrlManager"""
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    # add a new url to the manager
    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    # batch add urls to the manager (from the parsed data)
    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    # determine whether there is a new url to crawl
    def has_new_url(self):
        return len(self.new_urls) != 0

    # retrieve a new url from the manager
    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
Okay. Now, the url manager is ready!
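To make those five usage points concrete, here is a minimal sketch of how the scheduler from the previous article might drive the manager. The class name SpiderMain and the stubbed-out download/parse step are my own assumptions for illustration, not the tutorial's exact code:

class SpiderMain(object):
    def __init__(self):
        self.urls = UrlManager()                  # 1st use: created during initialization

    def craw(self, root_url):
        self.urls.add_new_url(root_url)           # 2nd use: seed the set with the initial url
        while self.urls.has_new_url():            # 3rd use: is there anything left to crawl?
            new_url = self.urls.get_new_url()     # 4th use: take the next url out of the set
            # downloading and parsing would happen here; pretend the parser
            # found no further links so this sketch terminates
            parsed_urls = set()
            self.urls.add_new_urls(parsed_urls)   # 5th use: queue the urls parsed from the page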
The next component is the url downloader. It is a simple piece of code that fetches the content of the page the program visits.
Looking at the scheduler, the downloader appears only twice:
The first time is when it is created during initialization.
The second time is right after a url is taken from the manager, when the downloader is called to download that page.
For the downloader, the original tutorial uses the urllib library, which I find a bit cumbersome, so I switched to a nicer library: requests. It hides many of the low-level details and lets us fetch the pages we want directly. It is very easy to use.
import requests

class HtmlDownloader(object):
    """docstring for HtmlDownloader"""
    def download(self, url):
        if url is None:
            return
        response = requests.get(url, timeout=0.1)
        response.encoding = 'utf-8'
        if response.status_code == requests.codes.ok:
            return response.text
        else:
            return
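Used on its own, the downloader is a single call; the Baidu Baike url below is just an example I made up, not one taken from the tutorial:

downloader = HtmlDownloader()
html = downloader.download('https://baike.baidu.com/item/Python')  # example url
if html:
    print(html[:200])  # print the first 200 characters of the page source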
A brief walk-through of the downloader code:
A. First, import the requests library. Because it is a third-party library, you need to install it yourself by running pip install requests on the command line.
B. Then write the downloader class. It has only one method, download, which first accepts the given url and checks whether it is None.
C. Call the get method of requests, which is passed two arguments here: the url and a timeout.
The timeout is my own addition; it is the access timeout. Without it, the program can hang indefinitely, waiting for the page to respond without ever raising an exception; with it, a slow request raises a timeout error that can be handled (see the sketch after this list).
D. Then set the encoding of the returned response. Because the Baidu Baike pages we crawl are UTF-8, it is best to set it explicitly here. requests can guess the encoding on its own, but setting it manually is more reliable.
E. Check whether the page responded successfully. requests.codes.ok is actually 200, which means the page responded normally; you could equally write response.status_code == 200 here.
F. Finally, return the full content of the page. response.text is a string containing all the source of the page (HTML, CSS, JS).
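Points C to F can be seen together in a slightly more defensive version of download. The try/except around the request is my own addition (a sketch, not the tutorial's code); it catches the timeout error mentioned in point C instead of letting it crash the crawler:

import requests

class HtmlDownloader(object):
    """Downloader with explicit timeout and error handling (illustrative sketch)."""
    def download(self, url):
        if url is None:
            return None
        try:
            # the timeout keeps the program from hanging on a slow page (point C)
            response = requests.get(url, timeout=0.1)
        except requests.exceptions.RequestException:
            # covers Timeout, ConnectionError and other request failures
            return None
        response.encoding = 'utf-8'                    # point D: force UTF-8 decoding
        if response.status_code == requests.codes.ok:  # point E: same as == 200
            return response.text                       # point F: the full page source
        return None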