Python crawler detailed and instance code

Python crawler detailed and instance code _python

Last Update:2017-01-18 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Python supports multithreading, mainly through thread and threading two modules. Thread module is the lower level of the module, threading module is the thread made some packaging, can be more convenient to use.

While Python's multithreading is limited by the Gil, it's not really multithreaded, but it can be significantly more efficient for I/O-intensive computing, such as crawlers.
Here is an example to verify the efficiency of multithreading. The code only involves page fetching and does not parse it out.

#-*-coding:utf-8-*-
import urllib2, time
import threading

Class Mythread (threading. Thread):
 def __init__ (self, Func, args):
  threading. Thread.__init__ (self)
  Self.args = args
  Self.func = func

 def run (self):
  apply (Self.func, Self.args)

def open_url (URL):
 request = Urllib2. Request (URL)
 HTML = urllib2.urlopen (Request). Read ()
 print len (HTML) return
 HTML

if __name__ = = ' __main__ ':
 # construct URL list
 urllist = [] for
 p in range (1):
  urllist.append (' http:// s.wanfangdata.com.cn/paper.aspx?q=%e5%8c%bb%e5%ad%a6&p= ' + str (p))

 # general Way
 N_start = Time.time () for each in
 urllist:
  open_url (each)
 n_end = Time.time ()
 print ' Normal way take%s s '% (N_end-n_start)

# multithreading
 T_start = Time.time ()
 threadlist = [Mythread (open_url, (URL,)) for URL in urllist] for
 T in Threadlist:
  T.setdaemon (True)
  T.start () to
 i in Threadlist:
  i.join ()
 t_end = Time.time ()
 print ' The thread way take%s s '% (T_end-t_start)

In two ways to get 10 of slow access to the Web page, the general way time-consuming 50s, multi-threaded time consuming 10s.
Multithreading Code Interpretation:

# Create thread class, Inherit thread class
Mythread (threading. Thread):
 def __init__ (self, Func, args):
  threading. Thread.__init__ (self) # invokes the constructor of the parent class
  Self.args = args
  Self.func = func

 def run (self): # thread Activity method
  apply ( Self.func, Self.args)




Threadlist = [Mythread (open_url, (URL,)) for URL in urllist] # The calling thread class creates a new thread, returns the thread list for
 T in Threadlist:
  T.setdaemon ( True) # Set up the daemon thread, and the parent thread waits for the child thread to execute before exiting
  the T.start () # thread
 to open for I-in threadlist:
  i.join () # Wait for the thread to terminate and then execute the parent thread after the child thread has finished executing

The above is the entire content of this article, I hope to help you learn.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Python crawler detailed and instance code _python

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support