A Simple Python Multithreaded Crawler Example

Source: Internet
Author: User
Python supports multithreading mainly through two modules, thread and threading. The thread module is relatively low-level; the threading module wraps thread and is more convenient to use.
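
As a minimal sketch of the higher-level interface, the following assumes Python 3, where the low-level module is named `_thread` and `threading` is the usual entry point. The `greet` function and the `results` list are hypothetical names used only for illustration.

```python
import threading

results = []

def greet(name):
    # In CPython, list.append is atomic, so no explicit lock is used here
    results.append("hello " + name)

# Create one thread per name by passing a target callable and its arguments
threads = [threading.Thread(target=greet, args=(name,)) for name in ("a", "b", "c")]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish

print(sorted(results))
```

Passing `target` and `args` avoids writing a subclass when the thread only needs to call one function.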

Although Python's multithreading is constrained by the GIL (Global Interpreter Lock) and is therefore not truly parallel, it can still be significantly more efficient for I/O-intensive work such as crawlers, because a thread releases the GIL while it waits on I/O.
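
The effect can be sketched without any network access by using `time.sleep` as a stand-in for a slow request, since sleeping also releases the GIL. This is an illustration under that assumption, written for Python 3; `fake_download`, `run_sequential`, and `run_threaded` are hypothetical names, and the delays are arbitrary.

```python
import threading
import time

def fake_download(delay):
    # time.sleep releases the GIL, much like waiting on a network socket
    time.sleep(delay)

def run_sequential(n, delay):
    start = time.time()
    for _ in range(n):
        fake_download(delay)
    return time.time() - start

def run_threaded(n, delay):
    start = time.time()
    threads = [threading.Thread(target=fake_download, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start

sequential = run_sequential(5, 0.2)
threaded = run_threaded(5, 0.2)
print("sequential: %.2f s, threaded: %.2f s" % (sequential, threaded))
```

The sequential run waits for each delay in turn, while the threaded run overlaps the waits, so its total time should be close to a single delay.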

The following example is used to verify the efficiency of multithreading. The code targets Python 2 (it uses urllib2 and the print statement); it only fetches pages and does not parse them.

# -*- coding: utf-8 -*-
import urllib2, time
import threading


class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        apply(self.func, self.args)


def open_url(url):
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    print len(html)
    return html


if __name__ == '__main__':
    # Build the URL list
    urllist = []
    for p in range(1, 10):
        urllist.append('http://s.wanfangdata.com.cn/paper.aspx?q=%e5%8c%bb%e5%ad%a6&p=' + str(p))

    # Sequential approach
    n_start = time.time()
    for each in urllist:
        open_url(each)
    n_end = time.time()
    print 'the normal way take %s s' % (n_end - n_start)

    # Multithreaded approach
    t_start = time.time()
    threadlist = [MyThread(open_url, (url,)) for url in urllist]
    for t in threadlist:
        t.setDaemon(True)
        t.start()
    for i in threadlist:
        i.join()
    t_end = time.time()
    print 'the thread way take %s s' % (t_end - t_start)

Fetching about 10 slow-loading pages both ways, the sequential approach took about 50 s and the multithreaded approach about 10 s.
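
For readers on Python 3, the same comparison could be approached with the standard library's `concurrent.futures` thread pool. This is a sketch and not the article's method: `fetch_all`, its `fetch` parameter, the 30-second timeout, and the offline stand-in used at the bottom are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def open_url(url):
    # Download the page body and return it as bytes
    with urlopen(url, timeout=30) as response:
        return response.read()

def fetch_all(urls, fetch=open_url, max_workers=10):
    # executor.map returns results in the same order as the input URLs
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))

# Hypothetical offline stand-in so the sketch runs without network access
pages = fetch_all(["a", "bb", "ccc"], fetch=lambda u: u.encode("utf-8"))
print([len(p) for p in pages])
```

To fetch real pages, call `fetch_all(urllist)` with the default `fetch`; the pool limits how many requests run at once, which a one-thread-per-URL design does not.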

Explanation of the multithreaded code:

# Create a thread class that inherits from threading.Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the parent class constructor
        self.args = args
        self.func = func

    def run(self):  # the thread's activity; executed when start() is called
        apply(self.func, self.args)
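
In Python 3 the built-in `apply()` no longer exists, so a port of this class would unpack the argument tuple instead. The sketch below assumes Python 3; the `result` attribute is a hypothetical addition that keeps the function's return value, which the original class discards.

```python
import threading

class MyThread(threading.Thread):
    def __init__(self, func, args):
        super().__init__()
        self.func = func
        self.args = args
        self.result = None  # hypothetical addition: keep the return value

    def run(self):
        # Python 3 has no apply(); unpack the argument tuple instead
        self.result = self.func(*self.args)

t = MyThread(len, ("hello",))
t.start()
t.join()
print(t.result)  # 5
```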

threadlist = [MyThread(open_url, (url,)) for url in urllist]  # create one thread object per URL and collect them in a list
for t in threadlist:
    t.setDaemon(True)  # mark as a daemon thread: it will not keep the program alive after the main thread exits
    t.start()  # start the thread
for i in threadlist:
    i.join()  # block the main thread until this child thread has finished
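
The difference between the daemon flag and `join()` can be sketched as follows: the daemon flag only controls whether a still-running thread keeps the interpreter alive at exit, while `join()` is what actually makes the caller wait. The example assumes Python 3; `worker` and `finished` are hypothetical names, and the 0.1-second sleep stands in for a page fetch.

```python
import threading
import time

finished = []

def worker(n):
    time.sleep(0.1)  # stand-in for a slow page fetch
    finished.append(n)

threads = [threading.Thread(target=worker, args=(n,)) for n in range(3)]
for t in threads:
    t.daemon = True  # attribute form of setDaemon(True); must be set before start()
    t.start()

for t in threads:
    t.join()  # block until this worker has finished

print(sorted(finished))                    # [0, 1, 2]
print(any(t.is_alive() for t in threads))  # False
```

Without the `join()` loop, the main thread could reach the end of the program while the daemon workers were still sleeping, and their results would be lost.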

That is all for this article; I hope it helps with your studies.
