Python multithreaded crawler: detailed explanation and example code

Source: Internet
Author: User

Python supports multithreading mainly through two modules, thread and threading. The thread module is the lower-level of the two; the threading module wraps it in an interface that is more convenient to use.
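As a small illustration of the higher-level module, the sketch below starts threads by passing a target function to threading.Thread rather than subclassing it. It assumes Python 3 (where the low-level thread module is named _thread); the names worker, run_workers, and results are hypothetical and exist only for this example.

```python
import threading

results = []
results_lock = threading.Lock()


def worker(n):
    # Hypothetical task: square a number and store the result.
    # The lock makes the shared-list update explicit and safe.
    with results_lock:
        results.append(n * n)


def run_workers(values):
    # Create one thread per value, start them all, then wait for all of them.
    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':
    run_workers([1, 2, 3])
    print(sorted(results))
```

Passing target and args is usually enough for simple jobs; subclassing Thread and overriding run() is the other common pattern.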

Python's multithreading is constrained by the GIL (global interpreter lock), so threads do not run Python code truly in parallel. Even so, it can be significantly more efficient for I/O-bound work such as crawlers.
The following example measures the efficiency of multithreading. The code only fetches pages and does not parse them.

# -*- coding: utf-8 -*-
# Python 2 code (uses urllib2, apply and the print statement)
import urllib2
import time
import threading


class Mythread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        # Python 2 built-in; equivalent to self.func(*self.args)
        apply(self.func, self.args)


def open_url(url):
    # Fetch the page and print its length; no parsing is done
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    print len(html)
    return html


if __name__ == '__main__':
    # Build the URL list
    urllist = []
    # Widen this range to fetch more pages; the timings quoted below are for ten pages
    for p in range(1):
        urllist.append('http://s.wanfangdata.com.cn/paper.aspx?q=%e5%8c%bb%e5%ad%a6&p=' + str(p))

    # Sequential approach
    n_start = time.time()
    for each in urllist:
        open_url(each)
    n_end = time.time()
    print 'Normal way took %s s' % (n_end - n_start)

    # Multithreaded approach
    t_start = time.time()
    threadlist = [Mythread(open_url, (url,)) for url in urllist]
    for t in threadlist:
        t.setDaemon(True)
        t.start()
    for i in threadlist:
        i.join()
    t_end = time.time()
    print 'Thread way took %s s' % (t_end - t_start)

Fetching 10 slow-loading web pages with each approach, the sequential way took 50 s and the multithreaded way took 10 s.
Explanation of the multithreading code:
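A speedup of this kind is what one would expect when threads overlap their waiting time. The sketch below is a minimal simulation of that effect, assuming Python 3; time.sleep stands in for a slow page download, and slow_fetch, run_sequential, and run_threaded are hypothetical names, not part of the original code.

```python
import threading
import time


def slow_fetch(delay):
    # Stand-in for a slow page download: the thread simply waits.
    time.sleep(delay)


def run_sequential(n, delay):
    # Perform n simulated downloads one after another and return the elapsed time.
    start = time.perf_counter()
    for _ in range(n):
        slow_fetch(delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    # Perform n simulated downloads in n threads and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=slow_fetch, args=(delay,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print('sequential: %.2f s' % run_sequential(5, 0.2))
    print('threaded:   %.2f s' % run_threaded(5, 0.2))
```

The sequential time grows with the number of downloads, while the threaded time stays close to a single delay, because a thread that is blocked waiting does not hold the GIL.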

# Create a thread class that inherits from threading.Thread
class Mythread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the constructor of the parent class
        self.args = args
        self.func = func

    def run(self):  # the thread's activity; executed in the new thread after start()
        apply(self.func, self.args)  # Python 2 built-in; equivalent to self.func(*self.args)



# Instantiate the thread class once per URL and collect the threads in a list
threadlist = [Mythread(open_url, (url,)) for url in urllist]
for t in threadlist:
    t.setDaemon(True)  # mark as a daemon thread: the program will not wait for it when the main thread exits
    t.start()  # start the thread
for i in threadlist:
    i.join()  # block the main thread until this child thread has finished
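To make the roles of the daemon flag and join() concrete, here is a small sketch assuming Python 3, where the daemon attribute replaces the older setDaemon() call. The names task, run_and_wait, and finished are hypothetical.

```python
import threading
import time

finished = []


def task(name, delay):
    # Simulate some work, then record that this task completed.
    time.sleep(delay)
    finished.append(name)


def run_and_wait():
    t = threading.Thread(target=task, args=('worker', 0.1))
    t.daemon = True               # daemon thread: it would not keep the program alive on its own
    t.start()
    alive_before = t.is_alive()   # the worker is still sleeping at this point
    t.join()                      # block until the worker has finished
    alive_after = t.is_alive()
    return alive_before, alive_after


if __name__ == '__main__':
    print(run_and_wait())
    print(finished)
```

Without the join() call, a program whose main thread ends first could exit before a daemon thread completes its work, so it is join() that guarantees every page is fetched before the timing is taken.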

That is all for this article; I hope it helps you learn.
