Anatomy of a Python process pool (ii)

Source: Internet
Author: User
Tags: iterable

The previous article described the process pool (the Pool class) that comes with Python's multiprocessing module, and gave a simple analysis of the data structures inside the pool and the relationships between its threads. This section looks at how a client assigns tasks to the pool and obtains the results.

We know that the worker processes are triggered to work whenever the task queue in the pool is non-empty. So how do tasks get added to that queue? The Pool class has two key pairs of methods for creating tasks: apply/apply_async and map/map_async. The pool's apply and map methods are similar to the Python built-ins of the same name, and apply_async and map_async are their non-blocking (asynchronous) counterparts.

Let's look at apply_async first; its source code is as follows:

def apply_async(self, func, args=(), kwds={}, callback=None):
    assert self._state == RUN
    result = ApplyResult(self._cache, callback)
    self._taskqueue.put(([(result._job, None, func, args, kwds)], None))
    return result

func is the function that executes the task, and args and kwds are its positional and keyword arguments. callback is a single-parameter function: when the result is returned, callback is called with the result of the task execution as its argument.

Each call to apply_async actually adds one task to _taskqueue. Note that this is a non-blocking (asynchronous) call: the task created in apply_async is simply added to the task queue, not executed, and there is no need to wait; the method returns the newly created ApplyResult object immediately. Also note that when the ApplyResult object is created, it is placed in the pool's cache, _cache.

Once the task queue has a newly created task, the processing flow analyzed in the previous section takes over: the pool's _task_handler thread takes the task out of _taskqueue and puts it into _inqueue, which triggers a worker process to call func with args and kwds. When the call finishes, the worker puts the result into _outqueue; the pool's _handle_results thread then fetches the result from _outqueue, finds the corresponding ApplyResult object in the _cache cache, and calls its _set method to store the run result, waiting for the caller to retrieve it.
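As a minimal usage sketch of this round trip (square, on_result and the pool size are invented here for illustration):

from multiprocessing import Pool

def square(x):
    # an invented task function; must be importable so it can be pickled
    return x * x

def on_result(value):
    # called by the pool's _handle_results thread once the result arrives
    print('callback got %r' % (value,))

if __name__ == '__main__':
    pool = Pool(processes=4)
    # apply_async returns an ApplyResult immediately, without waiting
    res = pool.apply_async(square, (7,), callback=on_result)
    # get() blocks on the ApplyResult's condition variable until _set runs
    print(res.get())   # 49
    pool.close()
    pool.join()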

Since apply_async is asynchronous, how does the caller know when the task has finished and obtain its result? For that we need to understand two key methods of the ApplyResult class:

def get(self, timeout=None):
    self.wait(timeout)
    if not self._ready:
        raise TimeoutError
    if self._success:
        return self._value
    else:
        raise self._value

def _set(self, i, obj):
    self._success, self._value = obj
    if self._callback and self._success:
        self._callback(self._value)
    self._cond.acquire()
    try:
        self._ready = True
        self._cond.notify()
    finally:
        self._cond.release()
    del self._cache[self._job]
The _set method saves the run result in ApplyResult._value and wakes up the get method blocked on the condition variable. The client obtains the run result by calling get.
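One detail worth noting in get above: on failure the worker stores (False, exception), so _value may hold an exception object and get re-raises it in the caller. A small sketch, with an invented failing function boom:

from multiprocessing import Pool

def boom(x):
    # raises inside the worker process
    raise ValueError('bad input: %r' % (x,))

if __name__ == '__main__':
    pool = Pool(2)
    res = pool.apply_async(boom, (1,))
    try:
        res.get()              # _success is False, so get() raises _value
    except ValueError as e:
        print('caught: %s' % e)
    pool.close()
    pool.join()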

The apply method obtains the worker's result in a blocking way. Its implementation is simple: it also calls apply_async, but instead of returning the ApplyResult it calls get on it and returns the result of the worker process directly:

def apply(self, func, args=(), kwds={}):
    assert self._state == RUN
    return self.apply_async(func, args, kwds).get()
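A minimal blocking counterpart of the earlier sketch (square is again an invented task function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    # apply is just apply_async(...).get(): it blocks until the worker returns
    print(pool.apply(square, (7,)))   # 49
    pool.close()
    pool.join()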

The apply/apply_async methods above only allow you to assign one task to the pool at a time; to assign a batch of tasks at once, use map/map_async. Let's start by looking at how map_async is defined:

def map_async(self, func, iterable, chunksize=None, callback=None):
    assert self._state == RUN
    if not hasattr(iterable, '__len__'):
        iterable = list(iterable)

    if chunksize is None:
        chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
        if extra:
            chunksize += 1
    if len(iterable) == 0:
        chunksize = 0

    task_batches = Pool._get_tasks(func, iterable, chunksize)
    result = MapResult(self._cache, chunksize, len(iterable), callback)
    self._taskqueue.put((((result._job, i, mapstar, (x,), {})
                          for i, x in enumerate(task_batches)), None))
    return result

func is the function that executes each task; iterable is the sequence of task arguments; chunksize means the iterable is split into groups of chunksize elements, and each group is submitted to the pool as a single task; callback is a single-parameter function that, when the overall result is returned, is called with the result of the task execution as its argument.

From the source we can see that map_async is more complex than apply_async. First it groups the sequence of task arguments according to chunksize, which gives the number of task arguments in each group. With the default chunksize=None, the group size is computed from the length of the argument sequence and the number of processes in the pool: chunksize, extra = divmod(len(iterable), len(self._pool) * 4). Suppose the pool has len(self._pool) = 4 processes and the task argument sequence is iterable = range(123); then divmod gives chunksize = 7 and extra = 11, and since extra is non-zero, chunksize is incremented to 8, meaning the argument sequence is split into groups of 8 elements each. The tasks are actually grouped as follows:
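That arithmetic can be checked with a few lines of plain Python (assuming a pool of 4 worker processes):

# reproduce the default-chunksize calculation from map_async
n_workers = 4                 # stands in for len(self._pool)
iterable = range(123)

chunksize, extra = divmod(len(iterable), n_workers * 4)   # chunksize=7, extra=11
if extra:
    chunksize += 1                                        # chunksize=8
groups = len(iterable) // chunksize + bool(len(iterable) % chunksize)
print(chunksize)   # 8
print(groups)      # 16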

task_batches = Pool._get_tasks(func, iterable, chunksize)

def _get_tasks(func, it, size):
    it = iter(it)
    while 1:
        x = tuple(itertools.islice(it, size))
        if not x:
            return
        yield (func, x)

Here yield makes _get_tasks a generator. For a sequence such as range(123) grouped with chunksize=8, it produces 16 groups of the form:

(func, (0, 1, 2, 3, 4, 5, 6, 7))
...
(func, (112, 113, 114, 115, 116, 117, 118, 119))
(func, (120, 121, 122))
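The same grouping can be reproduced outside the pool with itertools.islice; a standalone sketch, leaving out the (func, x) wrapping:

import itertools

def batches(it, size):
    # same idea as Pool._get_tasks, without wrapping each group with func
    it = iter(it)
    while True:
        x = tuple(itertools.islice(it, size))
        if not x:
            return
        yield x

groups = list(batches(range(123), 8))
print(len(groups))   # 16
print(groups[0])     # (0, 1, 2, 3, 4, 5, 6, 7)
print(groups[-1])    # (120, 121, 122)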

After grouping, a MapResult object is created: result = MapResult(self._cache, chunksize, len(iterable), callback). It inherits from the ApplyResult class and likewise provides the get and _set interfaces. The grouped tasks are then put into the task queue, and the newly created result object is returned.

self._taskqueue.put((((result._job, i, mapstar, (x,), {})
                      for i, x in enumerate(task_batches)), None))

With the task argument sequence range(123), for example, this actually puts a sequence of 16 tuples into the task queue, of the form:

(result._job, 0, mapstar, ((func, (0, 1, 2, 3, 4, 5, 6, 7)),), {})
(result._job, 1, mapstar, ((func, (8, 9, 10, 11, 12, 13, 14, 15)),), {})
...

Note the i in each tuple: it records the position of that group within the whole batch of task tuples, and it is through i that the _handle_results thread will later fill the worker results into the MapResult object in the correct order.

Note that put is called only once here: the 16 groups of tuples are placed into the task queue as a single sequence. Does the _task_handler thread then pass that entire sequence on to _inqueue as one item? If so, only one worker process in the pool would receive the whole task sequence, defeating the multi-process approach. Let's look at how the _handle_tasks thread function deals with it:

def _handle_tasks(taskqueue, put, outqueue, pool, cache):
    thread = threading.current_thread()

    for taskseq, set_length in iter(taskqueue.get, None):
        i = -1
        for i, task in enumerate(taskseq):
            if thread._state:
                debug('task handler found thread._state != RUN')
                break
            try:
                put(task)
            except Exception as e:
                job, ind = task[:2]
                try:
                    cache[job]._set(ind, (False, e))
                except KeyError:
                    pass
        else:
            if set_length:
                debug('doing set_length()')
                set_length(i+1)
            continue
        break
    else:
        debug('task handler got sentinel')

Notice the statement for i, task in enumerate(taskseq): after the _task_handler thread gets the task sequence from taskqueue, it does not put the whole sequence into _inqueue at once. Instead, it walks the sequence and puts the tasks into _inqueue group by group, following the grouping done earlier; each task in the loop is one of the task tuples shown above, such as (result._job, 0, mapstar, ((func, (0, 1, 2, 3, 4, 5, 6, 7)),), {}). The worker processes are then triggered; each worker obtains one group of tasks at a time and processes it:

job, i, func, args, kwds = task
try:
    result = (True, func(*args, **kwds))
except Exception, e:
    result = (False, e)
try:
    put((job, i, result))
except Exception as e:
    wrapped = MaybeEncodingError(e, result[1])
    debug("Possible encoding error while sending result: %s" % (wrapped,))
    put((job, i, (False, wrapped)))

Matching these up by position, func here is mapstar, args is ((func, (0, 1, 2, 3, 4, 5, 6, 7)),), and kwds is {}. Note that the func inside the inner tuple is the task function the client specified when assigning the task. Now look at how mapstar is defined:

def mapstar(args):
    return map(*args)

So after the task arguments have been grouped, each group of tasks is executed through the built-in map. When it finishes, the worker calls put((job, i, result)) to place the result into _outqueue; the _handle_results thread then pulls the result out of _outqueue, finds the MapResult object in the _cache cache, and calls its _set method to store the run result.
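To make the nesting concrete, here is a small standalone sketch of what mapstar does with one group (square is an invented task function; in Python 2 map returns the list directly, in Python 3 wrap it in list()):

def square(x):
    return x * x

def mapstar(args):
    # same shape as the helper in multiprocessing.pool
    return map(*args)

group = (square, (0, 1, 2, 3, 4, 5, 6, 7))
result = mapstar(group)   # equivalent to map(square, (0, 1, ..., 7))
print(list(result))       # [0, 1, 4, 9, 16, 25, 36, 49]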

Now let's summarize how the pool's map_async method works. Take range(123) as the task argument sequence and pass it to map_async with chunksize unspecified, on a pool of 4 worker processes (for example, a 4-core machine). The sequence is split into 16 groups: groups 0 through 14 hold 8 elements each, and the last group holds 3. After grouping, the tasks are put into the task queue, 16 groups in all, so on average each process handles 4 of them; each time, the 8 tasks in the group are executed via the built-in map, the result is put into _outqueue, the MapResult object is found in the _cache cache, and _set stores the group's run result, waiting for the client to fetch it. Since map_async uses several worker processes to handle the tasks, each worker hands its result to _outqueue as soon as it finishes, and the _handle_results thread writes the results into the MapResult object; how, then, is the order of the result sequence kept consistent with the order of the task arguments passed to map_async? For that, look at the implementation of the MapResult constructor and its _set method:

def __init__(self, cache, chunksize, length, callback):
    ApplyResult.__init__(self, cache, callback)
    self._success = True
    self._value = [None] * length
    self._chunksize = chunksize
    if chunksize <= 0:
        self._number_left = 0
        self._ready = True
        del cache[self._job]
    else:
        self._number_left = length // chunksize + bool(length % chunksize)

def _set(self, i, success_result):
    success, result = success_result
    if success:
        self._value[i*self._chunksize:(i+1)*self._chunksize] = result
        self._number_left -= 1
        if self._number_left == 0:
            if self._callback:
                self._callback(self._value)
            del self._cache[self._job]
            self._cond.acquire()
            try:
                self._ready = True
                self._cond.notify()
            finally:
                self._cond.release()
    else:
        self._success = False
        self._value = result
        del self._cache[self._job]
        self._cond.acquire()
        try:
            self._ready = True
            self._cond.notify()
        finally:
            self._cond.release()

In the MapResult class, _value holds the result of map_async. At initialization it is a list whose elements are all None, with the same length as the task argument sequence. _chunksize records how many tasks each group contains after grouping, and _number_left records how many groups the whole task sequence was split into. The _handle_results thread saves each worker's results into _value through the _set method. How are those results filled into the correct positions of _value? Recall that when map_async filled the task queue, each group carried an index i giving its position within the whole batch; the _set method uses that group number i to write the group's results into the slice _value[i*chunksize:(i+1)*chunksize], and decrements _number_left. When _number_left reaches 0, every task in the argument sequence has been processed by a worker and _value is fully populated, so _set wakes up the condition variable that get is blocking on, and the client can obtain the run result.
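A small sketch of that slice arithmetic, outside the pool; the group results arrive here in a deliberately scrambled order to show that the group index i still puts each of them in the right place:

chunksize = 8
length = 123
value = [None] * length

def set_group(i, result):
    # same slice assignment as MapResult._set
    value[i*chunksize:(i+1)*chunksize] = result

# pretend groups 1, 0 and 15 finish in that order
set_group(1, [x*x for x in range(8, 16)])
set_group(0, [x*x for x in range(0, 8)])
set_group(15, [x*x for x in range(120, 123)])

print(value[0:4])    # [0, 1, 4, 9]
print(value[8:12])   # [64, 81, 100, 121]
print(value[120:])   # [14400, 14641, 14884]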

The map function is the blocking version of map_async; on top of map_async it simply calls get and blocks until the result is returned:

def map(self, func, iterable, chunksize=None):
    assert self._state == RUN
    return self.map_async(func, iterable, chunksize).get()
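Putting it all together, a minimal end-to-end sketch (square is again an invented task function). The returned list has the same order as range(123), no matter which worker handled which group:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    # blocks; internally the 123 arguments are split into 16 groups of chunksize 8
    results = pool.map(square, range(123))
    print(results[:5])    # [0, 1, 4, 9, 16]
    print(len(results))   # 123
    pool.close()
    pool.join()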

This section focused on the two pairs of interfaces for assigning tasks to the process pool: apply/apply_async and map/map_async. The apply methods handle one task at a time, and different tasks may use different execution functions and arguments, while the map methods process a whole sequence of tasks at once, with every task executed by the same function.

To be Continued ...
