How to implement parallel tasks with one line of Python code (with code)


Python has something of a bad reputation when it comes to parallelizing programs. Technical issues aside, such as the thread implementation and the GIL, I think misguided teaching is the main problem. The classic Python multi-threading and multi-processing tutorials tend to feel "heavy", and they often bury the parts that are actually useful in day-to-day work.

Traditional examples

Search for "Python multi-threading tutorial" and you will find that almost every tutorial offers examples built around classes and queues:

# Example.py
'''Standard Producer/Consumer Threading Pattern'''

import time
import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # queue.get() blocks the current thread until
            # an item is retrieved.
            msg = self._queue.get()
            # Checks if the current message is
            # the "Poison Pill"
            if isinstance(msg, str) and msg == 'quit':
                # if so, exits the loop
                break
            # "Processes" (or in our case, prints) the queue item
            print "I'm a thread, and I received %s!!" % msg
        # Always be friendly!
        print 'Bye byes!'

def Producer():
    # Queue is used to share items between
    # the threads.
    queue = Queue.Queue()
    # Create an instance of the worker
    worker = Consumer(queue)
    # start calls the internal run() method to
    # kick off the thread
    worker.start()
    # variable to keep track of when we started
    start_time = time.time()
    # While under 5 seconds..
    while time.time() - start_time < 5:
        # "Produce" a piece of work and stick it in
        # the queue for the Consumer to process
        queue.put('something at %s' % time.time())
        # Sleep a bit just to avoid an absurd number of messages
        time.sleep(1)
    # This is the "poison pill" method of killing a thread.
    queue.put('quit')
    # wait for the thread to close down
    worker.join()

if __name__ == '__main__':
    Producer()

Ha, it looks a bit like Java, doesn't it?

I am not saying it is wrong to use the producer/consumer model for multi-threaded/multi-process tasks (in fact, this model has its place). However, for routine script tasks we can use a more efficient model.

The problem is...

First, you need a boilerplate class;
Second, you need a queue through which to pass objects;
In addition, you need methods on both ends of the channel to do the actual work (and if you want two-way communication or want to save the results, you need to introduce yet another queue, as the sketch below shows).
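To make that last point concrete, here is a minimal sketch (my own illustration, not from any of the tutorials discussed here) of the extra plumbing a second, result-carrying queue demands:

# A sketch of the "second queue" boilerplate; the names task_queue
# and result_queue are mine. Python 2, to match the rest of the article.
import threading
import Queue  # the module is named 'queue' in Python 3

def worker(task_queue, result_queue):
    while True:
        item = task_queue.get()
        if item == 'quit':          # the familiar poison pill
            break
        result_queue.put(item * 2)  # "process" the item

task_queue = Queue.Queue()
result_queue = Queue.Queue()

t = threading.Thread(target=worker, args=(task_queue, result_queue))
t.start()

for n in range(5):
    task_queue.put(n)
task_queue.put('quit')
t.join()

# Drain the results back out -- yet more channel-tending code.
print [result_queue.get() for _ in range(5)]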

The more workers there are, the more problems there are.

Following this line of thinking, you now need a pool of worker threads. The following comes from a classic IBM tutorial: speeding up web retrieval with multiple threads.

# Example2.py
'''A more realistic thread pool example'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com'
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

This code runs correctly, but look closely at everything we had to do: write different methods, track a series of threads, and, to avoid annoying deadlocks, perform a sequence of join operations. And this is just the beginning...

So far, we have reviewed the classic multi-threading tutorials. Riddled with holes, aren't they? Boilerplate-heavy and error-prone, this style is clearly unsuited to everyday work. Fortunately, there is a better way.

Why not try map?

Map, a small and elegant function, is the key to easily parallelizing Python programs. Map comes from functional programming languages such as Lisp. It applies a function over a sequence:

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

The two lines above pass each element of the urls sequence to urlopen as an argument and collect all the results in the results list. The result is roughly equivalent to:

results = []
for url in urls:
    results.append(urllib2.urlopen(url))

The map function handles the iteration over the sequence for us, applies the function to each item, and stores all the results.

Why does this matter? Because, with the right libraries, map makes parallelization trivial.

In Python, there are two libraries that contain a map function of this kind: multiprocessing and its little-known submodule multiprocessing.dummy.

A few more words here: multiprocessing.dummy? A thread-based clone of the multiprocessing library? Is that even a thing? Even the official documentation of the multiprocessing library spends only one sentence on this submodule, and that sentence essentially boils down to "well, this exists, now you know." Believe me, this library is seriously underestimated!

dummy is a complete clone of the multiprocessing module. The only difference is that multiprocessing works with processes, while the dummy module works with threads (and therefore comes with all of Python's usual multithreading restrictions).

Therefore, swapping one library for the other is trivial: you can choose between them depending on whether your tasks are IO-bound or CPU-bound.
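As a quick illustration of that interchangeability (a sketch of my own, in Python 2 to match the rest of the article), the same code runs on processes or on threads depending only on which Pool you import:

# The two Pool classes expose the same API; only the backing
# workers (processes vs. threads) differ.
from multiprocessing import Pool as ProcessPool       # CPU-bound tasks
from multiprocessing.dummy import Pool as ThreadPool  # IO-bound tasks

def square(x):
    return x * x

if __name__ == '__main__':
    for pool_class in (ProcessPool, ThreadPool):
        pool = pool_class(4)
        print pool.map(square, range(10))  # identical call either way
        pool.close()
        pool.join()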

Try it

Use the following two lines to import the libraries that contain the parallel map function:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Instantiate the Pool object:

pool = ThreadPool()

This one simple statement replaces the seven lines of the build_worker_pool function in example2.py. It spawns a series of worker threads, initializes them, and stores them in a variable for easy access.

The Pool object takes several parameters, but for now the only one worth our attention is the first: processes. It sets the number of workers in the pool and defaults to the number of CPU cores on the current machine.

Generally speaking, for CPU-intensive tasks, the more cores you engage, the faster things run. But for network-intensive tasks the optimum is hard to predict, and it is wise to determine the pool size through experiments.

pool = ThreadPool(4) # Sets the pool size to 4

When too many threads exist, the time consumed by switching threads may even exceed the actual working time. It is a good idea to try to find the optimal value of the thread pool size for different jobs.
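Here is one way to run that experiment (a sketch under the same Python 2 setup as the surrounding examples; the url list and candidate sizes are placeholders):

# Time pool.map over the same url list with different pool sizes.
import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.python.org/about/']  # etc.

for size in (1, 2, 4, 8, 13):
    pool = ThreadPool(size)
    start = time.time()
    pool.map(urllib2.urlopen, urls)
    pool.close()
    pool.join()
    print '%2d workers: %.2f seconds' % (size, time.time() - start)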

Once the Pool object is created, the parallel program is ready to go. Let's look at the rewritten example2.py:

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()

Only four lines of code actually do the work, and only one of them is critical. The map function effortlessly replaces the 40-plus-line example above. To make it more interesting, I also timed the different approaches with different pool sizes.

# results = []
# for url in urls:
#     result = urllib2.urlopen(url)
#     results.append(result)

# ------- VERSUS ------- #

# ------- 4 Pool ------- #
# pool = ThreadPool(4)
# results = pool.map(urllib2.urlopen, urls)

# ------- 8 Pool ------- #
# pool = ThreadPool(8)
# results = pool.map(urllib2.urlopen, urls)

# ------- 13 Pool ------- #
# pool = ThreadPool(13)
# results = pool.map(urllib2.urlopen, urls)

Result:

# Single thread: 14.4 seconds
# 4 Pool:        3.1 seconds
# 8 Pool:        1.4 seconds
# 13 Pool:       1.3 seconds

Impressive results, aren't they? They also show why the pool size needs to be determined experimentally: on my machine, pool sizes greater than 9 brought very limited additional benefit.

Another real example

Generate thumbnails of thousands of images

This is a CPU-intensive task and is very suitable for parallelization.

Basic single process version

import os
import PIL

from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)

The main job of the code above is to walk the image files in the given folder, generate a thumbnail for each of them, and save the thumbnails in a dedicated directory.

On this machine, this program takes 27.9 seconds to process 6000 images.

If we use the map function to replace the for loop:

import os
import PIL

from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

5.6 seconds!

Although we modified only a few lines of code, we significantly improved the program's execution speed. In production, you can squeeze out even more speed by choosing the multiprocessing library for CPU-intensive tasks and the multiprocessing.dummy library for IO-intensive tasks. This approach also neatly sidesteps the deadlock problem. And since map relieves us of manual thread management, debugging becomes exceptionally simple.

With that, we have achieved (basically) parallel execution in a single line of Python.

That's all for this article; I hope it is helpful for your learning.
