One Line of Python for Parallelization: A New Approach to Everyday Multithreading Tasks

Source: Internet
Author: User

Python has a somewhat notorious reputation when it comes to parallelism. Technical issues aside, such as the threading implementation and the GIL, I think the real problem is misguided teaching. The classic Python multithreading and multiprocessing tutorials tend to be "heavy", and they often miss the point, never touching on the parts that are most useful in day-to-day work.

The traditional example

A quick search for "Python multithreading tutorial" shows that almost every tutorial gives examples involving classes and queues:

# Example.py
'''Standard Producer/Consumer Threading Pattern'''

import time
import threading
import Queue

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            # queue.get() blocks the current thread until
            # an item is retrieved.
            msg = self._queue.get()
            # Checks if the current message is
            # the "poison pill"
            if isinstance(msg, str) and msg == 'quit':
                # if so, exit the loop
                break
            # "Processes" (or in our case, prints) the queue item
            print "I'm a thread, and I received %s!!" % msg
        # Always be friendly!
        print 'Bye byes!'

def Producer():
    # Queue is used to share items between
    # the threads.
    queue = Queue.Queue()

    # Create an instance of the worker
    worker = Consumer(queue)
    # start() calls the internal run() method to
    # kick off the thread
    worker.start()

    # Variable to keep track of when we started
    start_time = time.time()
    # While under 5 seconds...
    while time.time() - start_time < 5:
        # "Produce" a piece of work and stick it in
        # the queue for the Consumer to process
        queue.put('something at %s' % time.time())
        # Sleep a bit just to avoid an absurd number of messages
        time.sleep(1)

    # This is the "poison pill" method of killing a thread.
    queue.put('quit')
    # Wait for the thread to close down
    worker.join()

if __name__ == '__main__':
    Producer()

Ha, looks a little like Java, doesn't it?

I'm not saying that using the producer/consumer model for multithreaded or multiprocess tasks is wrong (in fact, this model has its place). It's just that for everyday scripting tasks we can use a more efficient model.

The problem is ...

First, you need a boilerplate class;
Second, you need a queue to pass objects through;
And you need methods on both ends of the pipe to do the actual work (introducing yet another queue if you want two-way communication or to store results).

More workers, more problems

Following this line of thinking, you now need a pool of worker threads. Below is an example from a classic IBM tutorial: speeding up web page retrieval with multithreading.

# Example2.py
'''A more realistic thread pool example'''

import time
import threading
import Queue
import urllib2

class Consumer(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self._queue = queue

    def run(self):
        while True:
            content = self._queue.get()
            if isinstance(content, str) and content == 'quit':
                break
            response = urllib2.urlopen(content)
        print 'Bye byes!'

def Producer():
    urls = [
        'http://www.python.org', 'http://www.yahoo.com',
        'http://www.scala.org', 'http://www.google.com',
        # etc..
    ]
    queue = Queue.Queue()
    worker_threads = build_worker_pool(queue, 4)
    start_time = time.time()

    # Add the urls to process
    for url in urls:
        queue.put(url)
    # Add the poison pill, one per worker
    for worker in worker_threads:
        queue.put('quit')
    for worker in worker_threads:
        worker.join()

    print 'Done! Time taken: {}'.format(time.time() - start_time)

def build_worker_pool(queue, size):
    workers = []
    for _ in range(size):
        worker = Consumer(queue)
        worker.start()
        workers.append(worker)
    return workers

if __name__ == '__main__':
    Producer()

This code works, but look closely at everything we had to do: write different methods, keep track of a list of threads, and, to avoid the dreaded deadlock problem, perform a series of join operations. And this is just the beginning...

So far we have reviewed the classic multithreading tutorial. How useful is it? Boilerplate-heavy and error-prone, this style is clearly ill-suited to daily work. Fortunately, we have a better way.

Why not try map?

map is a compact and elegant function, and it is the key to easily parallelizing Python programs. map comes from functional languages such as Lisp. It applies a function to every element of a sequence, mapping the input sequence onto a sequence of results.
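As a toy illustration (my own example, not from the original article), here is map applying a simple squaring function to a list of numbers:

def square(x):
    return x * x

numbers = [1, 2, 3, 4]
squares = map(square, numbers)
# squares is now [1, 4, 9, 16] (map returns a list in Python 2)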

urls = ['http://www.yahoo.com', 'http://www.reddit.com']
results = map(urllib2.urlopen, urls)

These two lines of code pass each element of the urls sequence, in turn, as an argument to urllib2.urlopen and save all the results in the results list. It is roughly equivalent to:

results = []
for url in urls:
    results.append(urllib2.urlopen(url))

The map function single-handedly takes care of a whole series of operations: iterating over the sequence, passing each item to the function, and collecting the results.

Why is this important? This is because with the right library, map makes it easy to parallelize.

There are two libraries in Python that contain such a map function: multiprocessing, and its little-known sub-package, multiprocessing.dummy.

A few more words here: multiprocessing.dummy? A threaded clone of the multiprocessing library? What on earth is that? Even the official documentation of the multiprocessing package devotes only one sentence to this sub-module, and that sentence essentially says: "Well, such a thing exists; now you know." Trust me, this library is grossly undervalued!

dummy is a complete clone of the multiprocessing module; the only difference is that multiprocessing works with processes, while dummy works with threads (and therefore comes with all of Python's usual multithreading limitations).
Swapping one library for the other is therefore exceptionally easy: you can choose one for IO-bound tasks and the other for CPU-bound tasks.
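To make the swap concrete, here is a minimal sketch of my own (the fetch and crunch helpers are hypothetical placeholders): the only thing that changes between the two variants is which Pool you import.

import urllib2
from multiprocessing import Pool as ProcessPool        # workers are processes
from multiprocessing.dummy import Pool as ThreadPool   # workers are threads

def fetch(url):
    # Hypothetical IO-bound task: the thread mostly waits on the network,
    # so the GIL is not a bottleneck here.
    return urllib2.urlopen(url).read()

def crunch(n):
    # Hypothetical CPU-bound task: separate processes sidestep the GIL.
    return sum(i * i for i in xrange(n))

if __name__ == '__main__':
    urls = ['http://www.python.org', 'http://www.yahoo.com']

    thread_pool = ThreadPool(4)        # threads for the IO-bound work
    pages = thread_pool.map(fetch, urls)
    thread_pool.close()
    thread_pool.join()

    process_pool = ProcessPool(4)      # processes for the CPU-bound work
    totals = process_pool.map(crunch, [10 ** 6] * 4)
    process_pool.close()
    process_pool.join()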

Hands-on Try

Use the following two lines to import the libraries that contain the parallelized map function:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

Instantiate the Pool object:

pool = ThreadPool()

This one simple statement replaces the seven lines of the build_worker_pool function in example2.py: it spawns a set of worker threads, finishes their initialization, and stores them in a variable for easy access.

The Pool object takes a few parameters, but the only one we need to focus on here is the first: processes. It sets the number of workers in the pool and defaults to the number of CPU cores on the current machine.

In general, for CPU-bound tasks, more cores means faster execution. But when dealing with network-bound tasks, things are harder to predict, and it is wise to determine the pool size by experiment.

pool = ThreadPool(4) # Sets the pool size to 4

When there are too many threads, the time spent switching between them can exceed the actual working time. For each job, it's worth experimenting to find the sweet spot for the pool size.
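One way to run that experiment (a quick sketch of my own; the urls list here is a stand-in for whatever workload you are tuning) is to loop over candidate pool sizes and time each run:

import time
import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = ['http://www.python.org', 'http://www.yahoo.com']  # stand-in workload

if __name__ == '__main__':
    for size in [1, 2, 4, 8, 13]:
        pool = ThreadPool(size)
        start = time.time()
        pool.map(urllib2.urlopen, urls)
        pool.close()
        pool.join()
        print 'Pool of %2d threads: %.2f seconds' % (size, time.time() - start)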

Once the Pool object is created, the parallelized program is ready to go. Let's look at the rewritten example2.py.

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# Close the pool and wait for the work to finish
pool.close()
pool.join()

The code that actually does the work is only 4 lines, and only one of them is critical. The map function easily replaces the 40-plus lines of the earlier example. To make things more interesting, I timed the different approaches with different pool sizes.

# results = []
# for url in urls:
#     result = urllib2.urlopen(url)
#     results.append(result)

# # ------- VERSUS ------- #

# # ------- 4 Pool ------- #
# pool = ThreadPool(4)
# results = pool.map(urllib2.urlopen, urls)

# # ------- 8 Pool ------- #
# pool = ThreadPool(8)
# results = pool.map(urllib2.urlopen, urls)

# # ------- 13 Pool ------- #
# pool = ThreadPool(13)
# results = pool.map(urllib2.urlopen, urls)

Results:

# Single thread:  14.4 seconds
#        4 Pool:   3.1 seconds
#        8 Pool:   1.4 seconds
#       13 Pool:   1.3 seconds

Pretty great results, aren't they? They also show why it pays to experiment with the pool size: on my machine, any pool larger than 9 threads brought only marginal gains.

Another real-world example

Create thumbnails of thousands of images
This is a CPU-intensive task and is well suited for parallelization.

Basic Single Process version
import os
import PIL

from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    for image in images:
        create_thumbnail(image)

The main job of this code is to walk through the image files in the given folder, generate a thumbnail for each one, and save the thumbnails to a dedicated sub-folder.

On my machine, it takes 27.9 seconds to process 6000 images with this program.

If we use the map function instead of the for loop:

import os
import PIL

from multiprocessing import Pool
from PIL import Image

SIZE = (75, 75)
SAVE_DIRECTORY = 'thumbs'

def get_image_paths(folder):
    return (os.path.join(folder, f)
            for f in os.listdir(folder)
            if 'jpeg' in f)

def create_thumbnail(filename):
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    base, fname = os.path.split(filename)
    save_path = os.path.join(base, SAVE_DIRECTORY, fname)
    im.save(save_path)

if __name__ == '__main__':
    folder = os.path.abspath(
        '11_18_2013_R000_IQM_Big_Sur_Mon__e10d1958e7b766c3e840')
    os.mkdir(os.path.join(folder, SAVE_DIRECTORY))

    images = get_image_paths(folder)

    pool = Pool()
    pool.map(create_thumbnail, images)
    pool.close()
    pool.join()

5.6 seconds!

Although we changed only a few lines of code, we clearly improved the program's execution speed. In a production environment, you can go further by choosing the multiprocess library for CPU-bound tasks and the multithread library for IO-bound tasks. Moreover, because map takes manual thread management out of your hands (which also does away with the deadlock headaches from earlier), debugging becomes surprisingly simple.
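As a sketch of what that mix-and-match could look like (the download_image helper and its URL list are my own hypothetical additions, not part of the original example):

import os
import urllib2
from multiprocessing import Pool                       # processes: CPU-bound stage
from multiprocessing.dummy import Pool as ThreadPool   # threads: IO-bound stage
from PIL import Image

SIZE = (75, 75)

def download_image(url):
    # Hypothetical IO-bound stage: fetch an image over the network.
    filename = os.path.basename(url)
    with open(filename, 'wb') as f:
        f.write(urllib2.urlopen(url).read())
    return filename

def create_thumbnail(filename):
    # CPU-bound stage, same idea as the thumbnail examples above.
    im = Image.open(filename)
    im.thumbnail(SIZE, Image.ANTIALIAS)
    im.save('thumb_' + filename)

if __name__ == '__main__':
    urls = ['http://example.com/a.jpeg', 'http://example.com/b.jpeg']

    thread_pool = ThreadPool(8)               # threads while we wait on the network
    files = thread_pool.map(download_image, urls)
    thread_pool.close()
    thread_pool.join()

    process_pool = Pool()                     # processes while we burn CPU resizing
    process_pool.map(create_thumbnail, files)
    process_pool.close()
    process_pool.join()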

And there we have it: parallelization achieved in (basically) one line of Python.

The original site is as follows:

1190000000414339
