Meet PPL: Parallelism and Asynchrony in C++

Last Update:2018-12-08 Source: Internet

Author: User

Written by Allen Lee

You held it all, but you were careless to let it fall. You held it all, and I was by your side powerless.
-Linkin Park, Powerless

Calculate the sine value in parallel

Suppose we have an array of randomly generated floating-point numbers and need to calculate the sine of each one. If you have read my article Meet C++ Lambda, you may think of using the for_each function, as shown in Code 1. To replace each number in the array with its sine, declare the lambda parameter as a reference. If you want to keep the original numbers, create a new array to store the results.

Code 1

Note that the begin and end functions are used here to obtain the start and end positions of the array, which is the approach recommended in C++11. Previously, we called the begin and end member functions of an STL container to obtain these positions, but that approach does not cover C-style arrays. C++11 introduces the begin and end free functions so that C-style arrays and STL containers are written the same way. It is easy to see that the new style makes code more consistent.
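As a minimal sketch of the serial version described above, the following applies for_each to a C-style array and to a vector with the same begin/end spelling. The function name sines_in_place is hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

// Replaces every element of a C-style array with its sine, in place.
// The lambda takes its parameter by reference so the write lands in the array.
template <std::size_t N>
void sines_in_place(double (&nums)[N]) {
    std::for_each(std::begin(nums), std::end(nums),
                  [](double& n) { n = std::sin(n); });
}

// The same call, spelled identically, works for an STL container.
void sines_in_place(std::vector<double>& nums) {
    std::for_each(std::begin(nums), std::end(nums),
                  [](double& n) { n = std::sin(n); });
}
```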

The for_each function provided by the STL runs serially. To take full advantage of multiple cores, consider switching to the parallel_for_each function provided by PPL (Parallel Patterns Library). The conversion takes only three steps:

1. #include <ppl.h>
2. using namespace concurrency;
3. Change for_each to parallel_for_each, as shown in Code 2.

Code 2
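A sketch of how the three steps might look in practice. The function name parallel_sines_in_place is hypothetical. Because PPL ships only with Visual C++, the sketch assumes a fallback to the serial for_each on other compilers so that it still builds there.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>  // PPL ships with Visual C++
#endif

// Replaces every element with its sine. Each element is written by exactly
// one invocation of the lambda, so no synchronization is needed.
void parallel_sines_in_place(std::vector<double>& nums) {
    auto to_sine = [](double& n) { n = std::sin(n); };
#if defined(_MSC_VER)
    concurrency::parallel_for_each(nums.begin(), nums.end(), to_sine);
#else
    // Assumption: without PPL, fall back to the serial algorithm.
    std::for_each(nums.begin(), nums.end(), to_sine);
#endif
}
```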

Note that if you use PPL with Visual C++ 2010, you need to reference the Concurrency namespace (with an uppercase C). The lowercase concurrency namespace referenced here is a namespace alias that PPL added in Visual C++ 2012 to be consistent with other common all-lowercase namespaces (such as std).

If you do not want to overwrite the original numbers, you can create a new array and store the results in it through the parallel_for function, as shown in Code 3. Here, parallel_for uses an index to manage the correspondence between the elements of the two arrays. When you need to walk several arrays in step, for example to compute (A + B)/(C - D) on the corresponding elements of sets A, B, C, and D, the parallel_for function is very intuitive.

Code 3
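To illustrate index-based correspondence, the sketch below computes (A + B)/(C - D) element by element. The name combine_elementwise and the vector parameters are assumptions made for this example; outside Visual C++ the sketch assumes a plain loop in place of parallel_for.

```cpp
#include <cstddef>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Computes (a[i] + b[i]) / (c[i] - d[i]) for every index i.
// Assumes all four inputs have the same length and c[i] != d[i].
std::vector<double> combine_elementwise(const std::vector<double>& a,
                                        const std::vector<double>& b,
                                        const std::vector<double>& c,
                                        const std::vector<double>& d) {
    std::vector<double> out(a.size());
    // The index i ties together the matching elements of all five vectors.
    auto body = [&](std::size_t i) { out[i] = (a[i] + b[i]) / (c[i] - d[i]); };
#if defined(_MSC_VER)
    concurrency::parallel_for(std::size_t(0), out.size(), body);
#else
    // Assumption: without PPL, run the same body in a serial loop.
    for (std::size_t i = 0; i < out.size(); ++i) body(i);
#endif
    return out;
}
```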

If you do not want to manage the correspondence between elements yourself, consider the parallel_transform function, as shown in Code 4. The first two parameters of parallel_transform specify the start and end positions of the input container, and the third parameter specifies the start position of the output container. The number of elements between the positions given by the first two parameters must be less than or equal to the number of elements between the position given by the third parameter and the end of the output container; otherwise, an error occurs.

Code 4
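A minimal sketch of the transform-based variant, assuming a hypothetical function named sines that leaves its input untouched. Outside Visual C++ the sketch assumes std::transform as a serial stand-in, which takes the same first four arguments.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Returns a new vector holding the sine of every input element.
std::vector<double> sines(const std::vector<double>& nums) {
    // The output is sized up front: it must have room for every input element.
    std::vector<double> out(nums.size());
    auto sine = [](double n) { return std::sin(n); };
#if defined(_MSC_VER)
    concurrency::parallel_transform(nums.begin(), nums.end(), out.begin(), sine);
#else
    // Assumption: without PPL, use the serial STL algorithm.
    std::transform(nums.begin(), nums.end(), out.begin(), sine);
#endif
    return out;
}
```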

Counting in parallel

In Meet C++ Lambda, we used the for_each function to count the odd numbers in a set of randomly generated integers. Can this process be parallelized? Yes. The usual approach is to declare a variable to hold the count and increment it during iteration whenever an odd number is found. Because multiple threads are involved, you can use the InterlockedIncrement function provided by the system to keep the increment safe, as shown in Code 5.

Code 5
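InterlockedIncrement is a Win32 function, so the sketch below swaps in std::atomic from C++11 as a portable stand-in with the same cost profile: one synchronized increment per odd number. The name count_odds_atomic is hypothetical, and outside Visual C++ the sketch assumes the serial for_each.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Counts odd numbers using one shared counter that every thread updates.
long count_odds_atomic(const std::vector<int>& nums) {
    std::atomic<long> count(0);
    auto tally = [&count](int n) {
        if (n % 2 != 0) count.fetch_add(1);  // synchronized on every odd number
    };
#if defined(_MSC_VER)
    concurrency::parallel_for_each(nums.begin(), nums.end(), tally);
#else
    // Assumption: without PPL, fall back to the serial algorithm.
    std::for_each(nums.begin(), nums.end(), tally);
#endif
    return count.load();
}
```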

The code above produces the correct result, but it has a problem: the InterlockedIncrement function must be called every time an odd number is found. If odd numbers make up the majority of the nums array, the overhead of calling InterlockedIncrement may cancel out the benefit of parallelism and make execution as slow as, or even slower than, the serial version. To avoid this, we can replace the combination of a volatile variable and InterlockedIncrement with the combinable object provided by PPL, as shown in Code 6.

Code 6

How does a combinable object help the parallel_for_each function run more efficiently? This requires a little understanding of how parallel_for_each works. Simply put, it divides the data we pass to it into N blocks and hands them to N threads for parallel processing. However, each block is processed serially within its thread, which means that code working on a single block needs no synchronization. The combinable object takes advantage of this to cut unnecessary synchronization, which improves the efficiency of parallel_for_each.

The combinable object provides thread-local storage for each thread. Each thread's local storage is initialized using the lambda supplied when the object is created. We can call the local member function to access the current thread's local storage. Because the combinable object guarantees that the object returned by local belongs to the current thread, we can operate on it directly without worry. After every thread has finished, we call the combine member function to merge the per-thread results. This does involve synchronization between threads, but that work is the responsibility of the combinable object; we only need to tell it how to merge. In our example, the merge logic is the plus function object provided by the STL.
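The local/combine pattern described above might be sketched as follows. The name count_odds_combinable is hypothetical; outside Visual C++, where combinable is unavailable, the sketch assumes a single-threaded count, which needs only one counter.

```cpp
#include <algorithm>
#include <functional>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Counts odd numbers with one counter per thread, merged once at the end.
long count_odds_combinable(const std::vector<int>& nums) {
#if defined(_MSC_VER)
    // The lambda initializes each thread's private counter to zero.
    concurrency::combinable<long> counts([] { return 0L; });
    concurrency::parallel_for_each(nums.begin(), nums.end(), [&counts](int n) {
        if (n % 2 != 0) ++counts.local();  // touches only this thread's counter
    });
    // std::plus tells combine how to merge two per-thread counters.
    return counts.combine(std::plus<long>());
#else
    // Assumption: without PPL there is a single thread, so no merging is needed.
    return static_cast<long>(std::count_if(
        nums.begin(), nums.end(), [](int n) { return n % 2 != 0; }));
#endif
}
```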

Combining the parallel_for_each function with a combinable object is essentially a reduce process. PPL provides the parallel_reduce function for exactly this kind of requirement, as shown in Code 7. It makes explicit the two-stage process that the combination of parallel_for_each and combinable hides.

Code 7

In the first stage, the parallel_reduce function divides the data we pass to it into N blocks and hands them to N threads for parallel processing. The code each thread executes is specified by the fourth parameter, which in our example is a lambda. Through the lambda's parameters, parallel_reduce tells us the start and end positions of each block and the initial value for the calculation; this initial value comes from the third parameter of parallel_reduce. The lambda body itself is plain serial code. After all threads have finished, the second stage merges their results, using the method specified by the fifth parameter.
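The two stages could be sketched like this. The names count_odds_reduce and count_chunk are hypothetical. Outside Visual C++ the sketch assumes the whole range is one block and simply calls the first-stage lambda directly, which also shows that the lambda is ordinary serial code.

```cpp
#include <functional>
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Counts odd numbers as an explicit two-stage reduction.
long count_odds_reduce(const std::vector<int>& nums) {
    typedef std::vector<int>::const_iterator iter;
    // Stage 1: serial code for one block. `init` arrives as the identity
    // value (0 here), taken from the third argument of parallel_reduce.
    auto count_chunk = [](iter first, iter last, long init) -> long {
        for (; first != last; ++first)
            if (*first % 2 != 0) ++init;
        return init;
    };
#if defined(_MSC_VER)
    // Stage 2: std::plus merges the per-block counts.
    return concurrency::parallel_reduce(nums.begin(), nums.end(), 0L,
                                        count_chunk, std::plus<long>());
#else
    // Assumption: without PPL, treat the whole range as a single block.
    return count_chunk(nums.begin(), nums.end(), 0L);
#endif
}
```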

The parallel_reduce function and the previously mentioned parallel_transform function can be combined to implement parallel MapReduce operations, while the transform and accumulate functions provided by STL can be combined to implement serial MapReduce operations.
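For the serial case, a minimal sketch using only the STL: transform plays the map role and accumulate plays the reduce role. The sum-of-squares task and the name sum_of_squares are assumptions picked for illustration.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Map: square each element. Reduce: add the squares together.
long sum_of_squares(const std::vector<int>& nums) {
    std::vector<long> squares(nums.size());
    std::transform(nums.begin(), nums.end(), squares.begin(),
                   [](int n) { return static_cast<long>(n) * n; });
    return std::accumulate(squares.begin(), squares.end(), 0L);
}
```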

Execute different tasks at the same time

Suppose our task is to take a group of random integers and divide the sum of all its odd numbers by its first prime number. The usual approach is to perform the following steps in order:

1. Generate a group of random integers
2. Calculate the sum of all odd numbers
3. Find the first prime number
4. Calculate the result

Steps 2 and 3 are independent of each other and depend only on the result of step 1, so we can execute them at the same time to improve the overall efficiency of the program. How can we execute two different pieces of code at the same time? Use the parallel_invoke function, as shown in Code 8.

Code 8
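A sketch of running steps 2 and 3 side by side. The names odds_over_first_prime and is_prime are hypothetical helpers, and the input is passed in rather than generated randomly so the result is predictable. Outside Visual C++ the sketch assumes the two lambdas are simply called one after the other.

```cpp
#include <vector>
#if defined(_MSC_VER)
#include <ppl.h>
#endif

// Hypothetical helper: trial-division primality test.
bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d)
        if (n % d == 0) return false;
    return true;
}

// Divides the sum of the odd numbers by the first prime in nums.
// Assumes nums contains at least one prime.
double odds_over_first_prime(const std::vector<int>& nums) {
    long sum_of_odds = 0;
    int first_prime = 0;
    // Each lambda only reads nums and writes its own variable, so the two
    // can run at the same time without synchronization.
    auto sum_odds = [&] {
        for (int n : nums) if (n % 2 != 0) sum_of_odds += n;
    };
    auto find_prime = [&] {
        for (int n : nums) if (is_prime(n)) { first_prime = n; break; }
    };
#if defined(_MSC_VER)
    concurrency::parallel_invoke(sum_odds, find_prime);  // returns when both finish
#else
    // Assumption: without PPL, run the two steps in sequence.
    sum_odds();
    find_prime();
#endif
    return static_cast<double>(sum_of_odds) / first_prime;
}
```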

The parallel_invoke function accepts up to 10 parameters; in other words, it can execute up to 10 different pieces of code at the same time. What if we need to execute more than 10? In that case, we can create an array of lambdas and hand it to the parallel_for_each or parallel_for function, as shown in Code 9.

Code 9

All of this code produces the correct result, but it shares one disadvantage: it blocks the current thread. Think about it: parallel programming usually involves a large amount of computation, and making users wait until it finishes is likely to annoy them. Yet if we do not wait for it to finish, the subsequent steps may not work properly. What should we do?

Async + continuation

We can use a task object to execute the first step asynchronously, and then use continuations to chain the subsequent steps in the required order. This avoids blocking the current thread while still guaranteeing the correct execution order.

First, move the variables that the steps share in front of the first step, as shown in Code 10. These variables will be captured and used by the corresponding steps.

Code 10

Then, create a task object with the create_task function to execute the first step asynchronously, as shown in Code 11. The create_task function creates a task object from the lambda we pass to it. This lambda may return a value but must not take any parameters; otherwise, compilation fails. When the lambda needs input from outside, it can capture variables in its closure or call other functions.

Code 11

Next, call the then function on the task object returned by create_task to create a continuation, as shown in Code 12. This continuation starts only after the previous task has completed, which ensures that the data required by steps 2 and 3 is ready before they run.

Code 12

Finally, call the then function on the task object returned by the previous then call to create another continuation that executes step 4, as shown in Code 13. In theory, you can create any number of continuations with the then function. Note that in Metro-style applications, continuations run on the UI thread by default, so you can update UI controls directly in a continuation without using the Dispatcher object. If you want a continuation to run in the background instead, pass task_continuation_context::use_arbitrary() to the _ContinuationContext parameter of the then function.

Code 13

If you put this code together in the main function and add a cin.get() at the end to wait for the result, everything works normally. However, if you put it in a work function and call that function from main, you may hit an exception, probably reporting that we have read something we should not. This is because our tasks execute asynchronously: while they are running, the work function may already have returned and the variables allocated on its stack have been destroyed, so accessing those variables at that point causes an error. How can this be solved?

As mentioned earlier, the lambda passed to the create_task function can return a value, and that value is passed to the following continuation as its parameter. We can use this mechanism to move those variables inside the lambdas, as shown in Code 14.

Code 14

Note that we pass the results of steps 2 and 3 to step 4 through a tuple object and then unpack the tuple into two variables with the tie function. This statement is similar to "let sum_of_odds, first_prime = operands" in F#.
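A sketch of a chain in which every value travels through return values and parameters rather than shared stack variables. All names here (quotient_of_odds_and_prime, generate, analyze, divide, is_prime) are hypothetical, fixed numbers stand in for random ones, and steps 2 and 3 run in one continuation for brevity. The final get() is there only so the sketch can return a value to its caller. Outside Visual C++ the sketch assumes plain function composition in place of create_task and then.

```cpp
#include <tuple>
#include <vector>
#if defined(_MSC_VER)
#include <ppltasks.h>
#endif

// Hypothetical helper: trial-division primality test.
bool is_prime(int n) {
    if (n < 2) return false;
    for (int d = 2; d <= n / d; ++d)
        if (n % d == 0) return false;
    return true;
}

double quotient_of_odds_and_prime() {
    // Step 1: produce the data (fixed values stand in for random ones).
    auto generate = []() -> std::vector<int> {
        return std::vector<int>{8, 9, 5, 1};
    };
    // Steps 2 and 3: both results move on together in one tuple.
    auto analyze = [](std::vector<int> nums) -> std::tuple<long, int> {
        long sum_of_odds = 0;
        int first_prime = 0;
        for (int n : nums) {
            if (n % 2 != 0) sum_of_odds += n;
            if (first_prime == 0 && is_prime(n)) first_prime = n;
        }
        return std::make_tuple(sum_of_odds, first_prime);
    };
    // Step 4: std::tie unpacks the tuple into two named variables.
    auto divide = [](std::tuple<long, int> operands) -> double {
        long sum_of_odds = 0;
        int first_prime = 0;
        std::tie(sum_of_odds, first_prime) = operands;
        return static_cast<double>(sum_of_odds) / first_prime;
    };
#if defined(_MSC_VER)
    return concurrency::create_task(generate).then(analyze).then(divide).get();
#else
    // Assumption: without PPL, compose the same three steps serially.
    return divide(analyze(generate()));
#endif
}
```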

In addition, if you are worried that passing a vector between tasks may hurt performance, you can wrap it in a smart pointer, as shown in Code 15. The smart pointer is itself an object and is destroyed when the work function returns, so it must be captured by value.

Code 15
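The capture-by-value point can be shown without PPL. In the sketch below, a hypothetical function make_sum_job returns a lambda that runs after the function has returned, which is the same situation a continuation is in.

```cpp
#include <functional>
#include <memory>
#include <numeric>
#include <vector>

// Returns a deferred computation that outlives this function's stack frame.
std::function<long()> make_sum_job() {
    auto nums = std::make_shared<std::vector<int>>();
    for (int i = 1; i <= 4; ++i) nums->push_back(i);
    // Capturing `nums` by value copies the pointer, not the vector, and the
    // copy keeps the vector alive after make_sum_job returns. Capturing by
    // reference would leave the lambda with a dangling reference.
    return [nums] { return std::accumulate(nums->begin(), nums->end(), 0L); };
}
```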

So far, our code has no exception handling. How can we handle an exception thrown by one of the tasks? We can add a special continuation at the end of the task chain, as shown in Code 16. Its parameter is a task object, and an exception thrown by any task in the chain is propagated to it. The exception is thrown again when the get member function is called, so we wrap the call to get in a try...catch statement and handle the exception there.

Code 16
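A sketch of such a trailing handler, with a first step that always throws. The name failure_message is hypothetical, and the wait() at the end exists only so the sketch can return the captured message. Outside Visual C++ the sketch assumes std::future as a stand-in, whose get member function rethrows a stored exception in the same way.

```cpp
#include <stdexcept>
#include <string>
#if defined(_MSC_VER)
#include <ppltasks.h>
#else
#include <future>
#endif

// Runs a chain whose first step throws and returns the message that the
// handler at the end of the chain observed.
std::string failure_message() {
    std::string message;
    auto failing_step = []() -> int { throw std::runtime_error("step 1 failed"); };
#if defined(_MSC_VER)
    concurrency::create_task(failing_step)
        .then([](int n) { return n + 1; })  // skipped, because step 1 threw
        .then([&message](concurrency::task<int> previous) {
            try {
                previous.get();             // rethrows the stored exception
            } catch (const std::exception& e) {
                message = e.what();
            }
        })
        .wait();
#else
    // Assumption: without PPL, std::future::get plays the role of task::get.
    std::future<int> previous = std::async(std::launch::async, failing_step);
    try {
        previous.get();
    } catch (const std::exception& e) {
        message = e.what();
    }
#endif
    return message;
}
```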

Questions you may ask

1. What are the conditions required to use PPL?

Functions such as parallel_for, parallel_for_each, and parallel_invoke can be used in Visual Studio 2010 by including the ppl.h header file and referencing the Concurrency namespace. The parallel_transform and parallel_reduce functions and the task-related parts require Visual Studio 2012, where you need to include the ppl.h and ppltasks.h header files.

2. Can you recommend some PPL references?

For the PPL functions and types mentioned in this article, refer to the concurrency namespace documentation on MSDN. In addition, MSDN's Parallel Patterns Library (PPL) documentation and the book Parallel Programming with Microsoft Visual C++: Design Patterns for Decomposition and Coordination on Multicore Architectures are also good learning materials.

3. Does STL provide a substitute for tasks?

The STL in C++11 provides the std::future class. Combined with the std::async function, it can achieve the asynchronous effect of tasks, as shown in Code 17. However, std::future currently does not support continuations; the result can only be obtained through the get member function, and if the related code is still running when get is called, the current thread is blocked.

Code 17
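A minimal sketch of the std::async and std::future pairing, assuming a hypothetical function named sum_async. The launch policy std::launch::async is an explicit choice made here to request a separate thread.

```cpp
#include <future>
#include <numeric>
#include <vector>

// Sums the numbers on another thread and returns the result.
long sum_async(std::vector<int> nums) {
    // std::async starts the work and hands back a future for its result.
    std::future<long> result = std::async(std::launch::async, [nums] {
        return std::accumulate(nums.begin(), nums.end(), 0L);
    });
    // Other work could run here while the sum is being computed.
    return result.get();  // blocks until the result is ready
}
```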

4. Can PPL be used on platforms other than Windows?

PPL can only be used on Windows. If you want to do similar parallel programming on other platforms, consider Intel Threading Building Blocks (TBB), which supports Windows, Mac OS X, and Linux and provides an API similar to that of PPL. TBB is open source, and Intel offers it under two licenses: a commercial license and GPLv2.

5. Can you recommend some references for TBB?

Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism is a good learning resource. In addition, Intel provides rich sample code.

* Statement: This article was first published on the InfoQ Chinese site. All rights reserved. If you reprint "Meet PPL: Parallelism and Asynchrony in C++", please include this statement. Thank you.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
