Python crawler multithreading: explanation and example code
Python supports multithreading mainly through the thread and threading modules. The thread module is the lower-level interface, and the threading module wraps it in a more convenient, object-oriented API. (In Python 3 the thread module was renamed _thread.)
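As a minimal sketch of the higher-level interface, assuming Python 3 rather than the Python 2 used in this article, a thread can be started by handing a target function to `threading.Thread` without writing a subclass. The function `greet` and the list `results` are hypothetical names used only for illustration.

```python
import threading

results = []  # shared list the worker appends to (illustrative only)

def greet(name):
    # Runs inside the worker thread.
    results.append("hello, " + name)

# target/args tell the Thread object what to call once it is started.
worker = threading.Thread(target=greet, args=("crawler",))
worker.start()  # begin executing greet("crawler") in a new thread
worker.join()   # block until the worker has finished

print(results)
```

Passing `target` suits one-off jobs; subclassing `threading.Thread`, as the article does later, is handy when the thread needs to carry its own state.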
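Because of the Global Interpreter Lock (GIL), only one thread executes Python bytecode at a time, so Python threads do not run in parallel on CPU-bound work. They can still significantly speed up I/O-bound tasks such as crawling, because a thread releases the GIL while it waits on the network.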
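The effect can be illustrated without any network access. The sketch below assumes Python 3 and uses `time.sleep` as a stand-in for a blocking download, since sleeping releases the GIL in the same way that waiting on a socket does. The names `fake_download`, `run_sequential` and `run_threaded` are hypothetical helpers, not part of the article's code.

```python
import threading
import time

def fake_download(delay):
    # Stand-in for a blocking network call; the GIL is released while sleeping.
    time.sleep(delay)

def run_sequential(n, delay):
    # Perform n fake downloads one after another and return the elapsed time.
    start = time.monotonic()
    for _ in range(n):
        fake_download(delay)
    return time.monotonic() - start

def run_threaded(n, delay):
    # Perform n fake downloads in n threads and return the elapsed time.
    start = time.monotonic()
    threads = [threading.Thread(target=fake_download, args=(delay,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start

sequential_time = run_sequential(5, 0.2)
threaded_time = run_threaded(5, 0.2)
print("sequential: %.2fs, threaded: %.2fs" % (sequential_time, threaded_time))
```

The waits overlap in the threaded run, so its total time stays close to a single delay instead of the sum of all delays. A CPU-bound loop in place of the sleep would not show this gain.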
The following example measures the speedup from multithreading. The code only retrieves the pages; it does not parse them.
# -*- coding: utf-8 -*-
import urllib2, time
import threading


class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        # apply() exists only in Python 2; it is equivalent to self.func(*self.args)
        apply(self.func, self.args)


def open_url(url):
    # download one page and print its length
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    print len(html)
    return html


if __name__ == '__main__':
    # construct the url list
    urlList = []
    for p in range(1, 10):
        urlList.append('http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=' + str(p))

    # General method
    n_start = time.time()
    for each in urlList:
        open_url(each)
    n_end = time.time()
    print 'The normal way take %s' % (n_end - n_start)

    # Multithreading
    t_start = time.time()
    threadList = [MyThread(open_url, (url,)) for url in urlList]
    for t in threadList:
        t.setDaemon(True)
        t.start()
    for i in threadList:
        i.join()
    t_end = time.time()
    print 'the thread way take %s' % (t_end - t_start)
The same slow pages were fetched both ways (range(1, 10) produces nine URLs). The sequential loop took about 50 s, and the multithreaded version took about 10 s.
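For readers on Python 3, where `urllib2` and `apply` no longer exist, the same idea might be sketched with `urllib.request` and `concurrent.futures.ThreadPoolExecutor`. This is a swapped-in technique, not the article's method. The names `fetch_all`, `slow_fake_fetch` and `demo_urls` are hypothetical, and the demo uses an offline stand-in for the download so that it runs without network access.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def open_url(url):
    # Python 3 counterpart of the urllib2 call used in the article.
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()

def fetch_all(urls, fetch=open_url, max_workers=10):
    # pool.map runs fetch on each URL in worker threads and yields
    # the results in the same order as the input list.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

def slow_fake_fetch(url):
    # Offline stand-in for a slow page (an assumption for this demo).
    time.sleep(0.2)
    return ("<html>%s</html>" % url).encode("utf-8")

# Placeholder URLs; they are never actually requested in this demo.
demo_urls = ["http://example.invalid/page?p=%d" % p for p in range(1, 10)]

start = time.monotonic()
pages = fetch_all(demo_urls, fetch=slow_fake_fetch)
elapsed = time.monotonic() - start
print("fetched %d pages in %.2fs" % (len(pages), elapsed))
```

Calling `fetch_all(urlList)` with the default `fetch=open_url` would perform real downloads. Unlike the `MyThread` class, the pool also hands back each page's content, so the return value of the download function is not lost.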
Multi-threaded code explanation:
# Create a thread class that inherits from threading.Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the constructor of the parent class
        self.args = args
        self.func = func

    def run(self):  # thread activity method, executed in the new thread after start()
        apply(self.func, self.args)
# Create one MyThread instance per URL and collect them in a list
threadList = [MyThread(open_url, (url,)) for url in urlList]
for t in threadList:
    t.setDaemon(True)  # mark as a daemon thread: it will not keep the program alive once the main thread exits
    t.start()          # start the thread
for i in threadList:
    i.join()           # block the main thread until this child thread has finished
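The difference between the daemon flag and `join()` is easy to confuse, so here is a small sketch, assuming Python 3, where the `daemon` attribute is the current spelling of `setDaemon(True)`. The function `worker` and the list `finished` are hypothetical names for illustration.

```python
import threading
import time

finished = []  # records which workers ran to completion

def worker(name, delay):
    time.sleep(delay)  # pretend to download a page
    finished.append(name)

t = threading.Thread(target=worker, args=("page-1", 0.3))
t.daemon = True  # the interpreter will not wait for this thread at exit
t.start()

alive_before_join = t.is_alive()  # the worker is still sleeping here
t.join()                          # this call is what makes the main thread wait
alive_after_join = t.is_alive()

print(alive_before_join, alive_after_join, finished)
```

Without the `join()` call, a script that ends immediately could exit while the daemon thread is still downloading, and the page would be lost.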
That is all for this article; I hope it helps with your learning.