A Simple Example of a Python Multi-threaded Crawler
Python supports multiple threads mainly through the thread and threading modules. The thread module is relatively low-level; the threading module wraps it and is more convenient to use.
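For example, a minimal sketch of the threading interface (this snippet is not from the original article; the greet function is just an illustration):

```python
import threading

def greet(name):
    print('hello, %s' % name)

# threading.Thread wraps the low-level thread module: pass a callable
# and its arguments instead of managing the thread by hand
t = threading.Thread(target=greet, args=('world',))
t.start()
t.join()  # wait for the thread to finish
```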
Although Python multithreading is constrained by the GIL and therefore does not achieve true parallel execution, it can still significantly improve the efficiency of I/O-bound work such as crawling.
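To make the GIL point concrete, here is a small illustrative benchmark (not from the original article; timings will vary by machine): a CPU-bound counting loop gains nothing from threads because only one thread can hold the GIL at a time, while time.sleep, standing in for a network wait, releases the GIL so the waits overlap:

```python
import threading
import time

def cpu_task():
    # CPU-bound: holds the GIL, so threads effectively run one at a time
    n = 0
    while n < 5 * 10 ** 6:
        n += 1

def io_task():
    # I/O-bound stand-in: sleeping releases the GIL, so the waits overlap
    time.sleep(1)

for task in (cpu_task, io_task):
    start = time.time()
    threads = [threading.Thread(target=task) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('%s with 4 threads: %.2fs' % (task.__name__, time.time() - start))
```

Typically the four sleep-based threads finish in about 1 s total, while the four CPU-bound threads take roughly four times as long as a single one.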
The following example verifies the speedup from multithreading. The code only fetches pages; it does not parse them.
```python
# -*- coding: utf-8 -*-
import urllib2
import time
import threading

class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        apply(self.func, self.args)

def open_url(url):
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    print len(html)
    return html

if __name__ == '__main__':
    # construct the url list
    urlList = []
    for p in range(1, 10):
        urlList.append('http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=' + str(p))

    # normal (sequential) fetch
    n_start = time.time()
    for each in urlList:
        open_url(each)
    n_end = time.time()
    print 'the normal way takes %s' % (n_end - n_start)

    # multi-threaded fetch
    t_start = time.time()
    threadList = [MyThread(open_url, (url,)) for url in urlList]
    for t in threadList:
        t.setDaemon(True)
        t.start()
    for i in threadList:
        i.join()
    t_end = time.time()
    print 'the thread way takes %s' % (t_end - t_start)
```
Fetching these slow pages both ways, the sequential loop typically takes about 50 s, while the multi-threaded version takes about 10 s.
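Note that the listing above is Python 2 (urllib2, print statements, apply). For Python 3 readers, a rough equivalent using the standard library's urllib.request might look like the sketch below; it has not been benchmarked against the original site:

```python
# Python 3 sketch of the same benchmark
import threading
import time
import urllib.request

def open_url(url):
    html = urllib.request.urlopen(url).read()
    print(len(html))
    return html

if __name__ == '__main__':
    urlList = ['http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=%d' % p
               for p in range(1, 10)]

    # sequential fetch
    n_start = time.time()
    for url in urlList:
        open_url(url)
    print('the normal way takes %s' % (time.time() - n_start))

    # multi-threaded fetch (daemon= keyword requires Python 3.3+)
    t_start = time.time()
    threads = [threading.Thread(target=open_url, args=(url,), daemon=True)
               for url in urlList]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('the thread way takes %s' % (time.time() - t_start))
```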
Multi-threaded code explanation:
```python
# Create a thread class that inherits from threading.Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the constructor of the parent class
        self.args = args
        self.func = func

    def run(self):  # the thread's activity method
        apply(self.func, self.args)  # call func with the unpacked args
```
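One caveat: apply() is a legacy Python 2 built-in that was removed in Python 3; the equivalent call using argument unpacking works in both versions:

```python
def run(self):
    self.func(*self.args)  # same effect as apply(self.func, self.args)
```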
```python
threadList = [MyThread(open_url, (url,)) for url in urlList]  # create one thread per URL
for t in threadList:
    t.setDaemon(True)  # daemon thread: the interpreter may exit without waiting for it
    t.start()          # start the thread
for i in threadList:
    i.join()           # block until this thread finishes, so the main thread waits for all workers
```
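As an aside, the same fan-out-and-wait pattern can also be written with a thread pool. A minimal sketch using multiprocessing.dummy, a thread-backed Pool in the standard library (open_url and urlList here are the ones from the listing above):

```python
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

pool = Pool(4)                         # 4 worker threads
results = pool.map(open_url, urlList)  # fetch all URLs, results in input order
pool.close()
pool.join()
```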
That is all for this article; I hope it helps you learn.