Learn about the Intel Threading Building Blocks open source library
Brief introduction
A powerful alternative to POSIX threads and Windows threads: Intel Threading Building Blocks, a C++-based framework designed for parallel programming.
Arpan Sen, independent writer
February 27, 2012
Parallel programming is the way of the future, but how do you achieve high-performance parallel programming that makes effective use of multi-core CPUs? Using a threading library such as POSIX threads is certainly an option, but the POSIX threads framework was originally conceived with the C language in mind. It is a very low-level approach: for example, you have no access to any concurrent containers, and you cannot use any ready-made concurrent algorithms. To fill this gap, Intel launched Intel® Threading Building Blocks (Intel TBB), a C++-based framework for parallel programming that provides a number of interesting features and a higher level of abstraction than raw threads.

Common abbreviations
POSIX: Portable Operating System Interface for UNIX
Downloading and installing Intel TBB is simple: the unpacked directory hierarchy resembles a UNIX® system, with include, bin, lib, and doc folders. For the purposes of this article, I chose the tbb30_20110427oss stable release.

Start using Intel TBB
Intel TBB offers a number of benefits. Its notable features include:

- Unlike threads, you work at the higher level of abstraction of tasks. Intel claims that on Linux® systems, starting and terminating a task is 18 times faster than starting and terminating a thread.
- Intel TBB comes with a task scheduler that efficiently handles load balancing across multiple logical and physical cores. The default task-scheduling policy in Intel TBB differs from the round-robin policy of most thread schedulers.
- Intel TBB provides a number of ready-to-use thread-safe containers, such as concurrent_vector and concurrent_queue.
- Generic parallel algorithms, such as parallel_for and parallel_reduce, are available.
- Lock-free (also known as mutex-free) concurrent programming is supported through the template class atomic. This support makes Intel TBB well suited to high-performance applications, because it avoids the cost of locking and unlocking mutexes.
- All of this is implemented in C++. There are no language extensions or macros; Intel TBB uses only the language itself and a liberal amount of templates.
Using Intel TBB has a few prerequisites. Before you begin, you should:

- Know some C++ templates and understand the Standard Template Library (STL).
- Understand threads, whether POSIX threads or Windows® threads.

Although not strictly necessary, the lambda feature in C++0x is quite useful with Intel TBB.
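For readers new to C++0x lambdas: a lambda is an anonymous function object, playing the same role as the hand-written functor classes that appear in the listings throughout this article. A minimal standard-C++ taste (the function name count_even is purely illustrative):

```cpp
#include <algorithm>
#include <vector>

// A lambda replaces a hand-written class with operator() --
// the same kind of object TBB's algorithms and task_group accept.
int count_even(const std::vector<int>& v) {
    return std::count_if(v.begin(), v.end(),
                         [](int x) { return x % 2 == 0; });
}
```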
This article introduces Intel TBB by creating and studying tasks and synchronization primitives (mutexes), then looks at concurrent containers and parallel algorithms. Finally, it describes how to use the atomic template to achieve lock-free programming.
Intel TBB tasks
At its heart, Intel TBB is based on the concept of tasks. You define your own tasks, derived from tbb::task, which is declared in tbb/task.h. You must override the pure virtual method task* task::execute() in your own code. Some properties of every Intel TBB task:

- When the Intel TBB task scheduler chooses to run a task, it invokes the task's execute method. This is the entry point.
- The execute method returns a task*, which tells the scheduler the next task it will run. If it returns NULL, the scheduler is free to choose the next task.
- task::~task() is virtual, and whatever resources your task acquired must be released in this destructor.
- Tasks are allocated by calling task::allocate_root().
- The main task runs a task to completion by calling task::spawn_root_and_wait(task).
Listing 1 below shows the first task and how it is invoked.

Listing 1. Create the first Intel TBB task
#include "tbb/tbb.h"
#include <iostream>
using namespace tbb;
using namespace std;

class first_task : public task {
public:
    task* execute() {
        cout << "Hello World!\n";
        return NULL;
    }
};

int main()
{
    task_scheduler_init init(task_scheduler_init::automatic);
    first_task& f1 = *new(tbb::task::allocate_root()) first_task();
    tbb::task::spawn_root_and_wait(f1);
}
To run an Intel TBB program, you must properly initialize the task scheduler. The scheduler argument in Listing 1 is automatic, which lets the scheduler determine the number of threads on its own. You can override this behavior if you want to control the maximum number of threads spawned, but in production code, unless you really know what you are doing, it is best to let the scheduler decide the optimal number of threads.
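The decision the automatic policy makes, namely how many worker threads to use, can be sketched in standard C++. The helper name pick_thread_count below is hypothetical, not TBB API; it only illustrates the "let the hardware decide, unless the caller overrides" behavior:

```cpp
#include <thread>

// Hypothetical helper mirroring task_scheduler_init::automatic:
// use the hardware's thread count unless the caller asks for an
// explicit number; fall back to 2 if detection fails.
unsigned pick_thread_count(unsigned requested = 0) {
    if (requested > 0)
        return requested;  // caller override, like passing an explicit count
    unsigned hw = std::thread::hardware_concurrency();
    return hw > 0 ? hw : 2;  // hardware_concurrency() may legally return 0
}
```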
Now that you have created your first task, let's use the first_task from Listing 1 to spawn some child tasks. Listing 2 introduces several new concepts:

- Intel TBB provides a container named task_list that can be used as a collection of tasks.
- Each parent task creates child tasks with the allocate_child function call.
- The parent task must call set_ref_count before spawning any child tasks. Failure to do so results in undefined behavior. If you intend to spawn child tasks and then wait for them to complete, the count must be the number of child tasks + 1; otherwise, the count equals the number of child tasks. More on this later.
- The purpose of spawn_and_wait_for_all can be inferred from its name: it spawns the child tasks and waits for all of them to complete.
Here is the code.

Listing 2. Create more than one child task
#include "tbb/tbb.h"
#include <iostream>
using namespace tbb;
using namespace std;

class first_task : public task {
public:
    task* execute() {
        cout << "Hello World!\n";
        task_list list1;
        // note: each first_task spawns two more first_tasks, so this
        // example recurses; real code would spawn a different task type
        list1.push_back(*new(allocate_child()) first_task());
        list1.push_back(*new(allocate_child()) first_task());
        set_ref_count(3); // 2 (1 per child task) + 1 (for the wait)
        spawn_and_wait_for_all(list1);
        return NULL;
    }
};

int main()
{
    first_task& f1 = *new(tbb::task::allocate_root()) first_task();
    tbb::task::spawn_root_and_wait(f1);
}
So why does Intel TBB require an explicit call to set_ref_count? The documentation indicates that this is done primarily for performance reasons. You must always set the ref count for a task before spawning its children. See Resources for links to more information.
You can also create task groups. The following code creates a task group that spawns two tasks and waits for them to complete. The task_group run method has the signature:

template<typename Func> void run(const Func& f)

The run method spawns a task that computes f() but does not block the calling task, so control returns immediately. To wait for the child tasks to complete, the calling task calls wait (see Listing 3).

Listing 3. Create a task_group
#include "tbb/tbb.h"
#include <iostream>
using namespace tbb;
using namespace std;

class say_hello {
    const char* message;
public:
    say_hello(const char* str) : message(str) { }
    void operator()() const {
        cout << message << endl;
    }
};

int main()
{
    task_group tg;
    tg.run(say_hello("Child 1")); // spawn a task and return
    tg.run(say_hello("Child 2")); // spawn another task and return
    tg.wait(); // wait for tasks to complete
}
Note how concise the task_group syntax is: no placement-new calls for memory allocation are required, and you need not fiddle with ref counts as you do when working with tasks directly. There is much more you can do with Intel TBB tasks; refer to the Intel TBB documentation for details. Next, we'll explore concurrent containers.
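The run/wait pattern of task_group has a rough analogue in standard C++11, std::async with futures. This is a sketch of the same spawn-then-wait idea, not TBB code; the function name run_children is illustrative:

```cpp
#include <future>

// Standard-C++ sketch of the task_group pattern from Listing 3:
// std::async spawns work without blocking, like tg.run(), and
// future::get() waits for completion, like tg.wait().
int run_children() {
    auto f1 = std::async(std::launch::async, [] { return 1; });
    auto f2 = std::async(std::launch::async, [] { return 2; });
    return f1.get() + f2.get(); // waits for both children
}
```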
Concurrent containers: vector
Now let's look at one of Intel TBB's concurrent containers: concurrent_vector. The container is declared in the header file tbb/concurrent_vector.h, and its basic interface is similar to the STL vector:
template<typename T, class A = cache_aligned_allocator<T> >
class concurrent_vector;
Multiple threads can append to the vector without any explicit locking. According to the Intel TBB manual, concurrent_vector has the following properties:

- It provides random access to its elements; indexing starts at position 0.
- It can safely be grown concurrently: multiple threads can append to it at the same time.
- Appending new elements does not invalidate existing indexes or iterators.
However, there is a price to pay for this concurrency. Unlike an STL vector, where appending new elements may involve moving data, concurrent_vector never moves its data: the container maintains a series of contiguous memory fragments. Obviously, this increases the container's overhead.
There are three ways to grow a concurrent_vector:

- push_back: appends an element to the end of the vector.
- grow_by(N): appends N consecutive elements of type T to the concurrent_vector and returns an iterator pointing to the first appended element. Each element is initialized with T().
- grow_to_at_least(N): grows the vector to size N if its current size is less than N.
Here is how to append a string to a concurrent_vector:
void append(concurrent_vector<char>& cv, const char* str1) {
    size_t count = strlen(str1) + 1;
    std::copy(str1, str1 + count, cv.grow_by(count));
}
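To see what concurrent_vector saves you, here is the same append written against a plain std::vector in standard C++ (a sketch, not TBB code; the name locked_append is illustrative). Every call must be serialized by hand with a mutex, which is exactly the locking that grow_by makes unnecessary:

```cpp
#include <vector>
#include <mutex>
#include <cstring>
#include <algorithm>

std::vector<char> cv;   // a plain vector shared between threads
std::mutex cv_lock;

// With std::vector, every concurrent append must take a lock;
// the vector may also move its data when it reallocates, which
// concurrent_vector never does.
void locked_append(const char* str1) {
    size_t count = std::strlen(str1) + 1;
    std::lock_guard<std::mutex> guard(cv_lock);
    size_t old_size = cv.size();
    cv.resize(old_size + count);   // grow, like grow_by(count)
    std::copy(str1, str1 + count, cv.begin() + old_size);
}
```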
Out-of-the-box parallel algorithms in Intel TBB
One of the great advantages of Intel TBB is that it lets you parallelize portions of your source code without having to create and maintain threads yourself. The most common parallel algorithm is parallel_for. Consider the following example:
void serial_only(int* array, int size) {
    for (int count = 0; count < size; ++count)
        apply_transformation(array[count]);
}
Now, if the apply_transformation routine in the preceding fragment has no side effects beyond, say, applying some transformation to a single array element, you can distribute the load across multiple CPU cores. You need two classes from the Intel TBB library to do this: blocked_range (from tbb/blocked_range.h) and parallel_for (from tbb/parallel_for.h).
The blocked_range class creates objects that provide an iteration range to parallel_for, so you create something like blocked_range<int>(0, size) and pass it as the first input to parallel_for. The second and last argument parallel_for requires is a class that meets the requirements shown in Listing 4 (pasted from the parallel_for.h header file).

Listing 4. Requirements on the second argument of parallel_for
/** \page parallel_for_body_req Requirements on parallel_for body
    Class \c Body implementing the concept of parallel_for body must define:
    - \code Body::Body( const Body& ); \endcode        Copy constructor
    - \code Body::~Body(); \endcode                    Destructor
    - \code void Body::operator()( Range& r ) const; \endcode
                  Function call operator applying the body to range \c r.
**/
The code says that you need to create your own class with an operator() that takes a blocked_range as an argument, and inside the operator() definition write the sequential for loop created earlier. The class's constructor and destructor should be public; the defaults provided by the compiler suffice. Listing 5 shows the relevant code.

Listing 5. Create the second argument of parallel_for
#include "tbb/blocked_range.h"
using namespace tbb;

class apply_transform {
    int* array;
public:
    apply_transform(int* a) : array(a) {}
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); i++) {
            apply_transformation(array[i]);
        }
    }
};
Now that you have created the second argument, just call parallel_for, as shown in Listing 6.

Listing 6. Using parallel_for to parallelize the loop
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
using namespace tbb;

void do_parallel_the_tbb_way(int* array, int size) {
    parallel_for(blocked_range<int>(0, size), apply_transform(array));
}
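What blocked_range and parallel_for do under the hood can be approximated by hand in standard C++ (a sketch under the assumption that no TBB is installed; the names do_parallel_by_hand and the doubling apply_transformation are illustrative stand-ins). The index range is split into blocks and each block is given to its own thread:

```cpp
#include <thread>
#include <vector>
#include <algorithm>

// Illustrative stand-in for the article's apply_transformation.
void apply_transformation(int& x) { x *= 2; }

// Hand-rolled sketch of parallel_for(blocked_range<int>(0, size), body):
// split [0, size) into per-thread blocks and run the body on each block.
// TBB's scheduler does this splitting recursively and with work stealing.
void do_parallel_by_hand(int* array, int size, unsigned nthreads = 4) {
    std::vector<std::thread> workers;
    int chunk = (size + nthreads - 1) / nthreads;  // ceiling division
    for (unsigned t = 0; t < nthreads; ++t) {
        int begin = t * chunk;
        int end = std::min(size, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([array, begin, end] {
            // blocks never overlap, so no locking is needed
            for (int i = begin; i < end; ++i)
                apply_transformation(array[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```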
Other parallel algorithms in Intel TBB
Intel TBB provides several other parallel algorithms, such as parallel_reduce (declared in tbb/parallel_reduce.h). This time, instead of applying a transformation to each array element, let's compute the sum of all the elements. Here is the sequential code:
int serial_only(int* array, int size) {
    int sum = 0;
    for (int count = 0; count < size; ++count)
        sum += array[count];
    return sum;
}
In principle, running this code in parallel means that each thread of control sums some portion of the array, and the partial totals must then be combined somewhere using a join method. Listing 7 shows the Intel TBB code.

Listing 7. Summing the array elements with parallel_reduce
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
using namespace tbb;

int sum_with_parallel_reduce(int* array, int size) {
    summation_helper helper(array);
    parallel_reduce(blocked_range<int>(0, size, 5), helper);
    return helper.sum;
}
When splitting the array into smaller pieces for individual threads, you want to control the granularity (for example, each thread sums N elements, where N should be neither too large nor too small); that is what the third argument to blocked_range is for. Intel TBB requires the summation_helper class to meet two conditions: it must provide a method named join that adds partial sums together, and a constructor with special arguments, called the splitting constructor. Listing 8 shows the relevant code.

Listing 8. The summation_helper class with a join method and a splitting constructor
class summation_helper {
    int* partial_array;
public:
    int sum;
    void operator()(const blocked_range<int>& r) {
        for (int count = r.begin(); count != r.end(); ++count)
            sum += partial_array[count];
    }
    summation_helper(summation_helper& x, split) :
        partial_array(x.partial_array), sum(0)
    {
    }
    summation_helper(int* array) : partial_array(array), sum(0)
    {
    }
    void join(const summation_helper& temp) {
        sum += temp.sum; // required method
    }
};
Next, Intel TBB invokes the splitting constructor (the second argument, of type split, is a dummy argument that Intel TBB requires) and hands each copy a portion of the elements (the number of elements per portion is the grain size defined in blocked_range). When the sum over a sub-range completes, the join method adds the results together. It may sound complicated at first glance, but remember that you only need three methods: operator() to sum over a range, join to add partial results, and the splitting constructor to start a new worker.
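The split/accumulate/join cycle of parallel_reduce can likewise be mimicked in standard C++ (a sketch, not the TBB implementation; the name sum_by_hand is illustrative). Each worker gets its own accumulator, playing the role of a split copy of summation_helper, and the partial sums are joined at the end:

```cpp
#include <thread>
#include <vector>
#include <numeric>
#include <algorithm>

// Sketch of parallel_reduce's split/accumulate/join cycle:
// one accumulator per worker (the "split"), a loop over each
// block (operator()), and a final combination step (join).
int sum_by_hand(const int* array, int size, unsigned nthreads = 4) {
    std::vector<int> partial(nthreads, 0);  // one accumulator per split
    std::vector<std::thread> workers;
    int chunk = (size + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        int begin = t * chunk;
        int end = std::min<int>(size, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&partial, t, array, begin, end] {
            for (int i = begin; i < end; ++i)
                partial[t] += array[i];  // operator() over one block
        });
    }
    for (auto& w : workers) w.join();
    // The "join" step: add the partial results together.
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```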
Intel TBB also provides several other useful algorithms; parallel_sort is one of the most useful. See the Intel TBB reference manual (see Resources) for more detailed information.
Using Intel TBB for lock-free programming
A recurring problem in multithreaded programming is the CPU cycles wasted locking and unlocking mutexes. If you know POSIX threads, the Intel TBB atomic template will pleasantly surprise you: it is faster than a mutex, and you no longer need to lock and unlock anything in your code. Does atomic solve all concurrency problems? No: there are strict limits on its use. But where it applies, it works very well for high-performance code. Here is how to declare an integer and a pointer as atomic types:
#include "tbb/atomic.h"
using namespace TBB;
atomic<int> count;
atomic<float*> pointer_to_float;
Now assume that the variable count can be accessed from multiple threads of control. Typically you would protect writes to count with a mutex, but with atomic<int> that is no longer necessary. See Listing 9.

Listing 9. With atomic's fetch_and_add, no locking is needed
// Writing with a mutex; count is declared as int count;
{
    // ... code
    pthread_mutex_lock(&lock);
    count += 1000;
    pthread_mutex_unlock(&lock);
    // ... code continues
}

// Writing without a mutex; count is declared as atomic<int> count;
{
    // ... code
    count.fetch_and_add(1000); // no explicit locking/unlocking
    // ... code continues
}
Instead of using +=, you call the fetch_and_add method of the atomic<T> class, and no mutex is involved. When fetch_and_add executes, count is incremented by 1000 atomically: either all threads see the updated value of count, or all threads still see the old value. That is the point of declaring count as an atomic variable: operations on count are atomic and cannot be interrupted by process or thread scheduling. No matter how the threads are scheduled, count never appears different to different threads. For an in-depth discussion of lock-free programming, see Resources.
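Standard C++11 offers the same lock-free idiom through std::atomic, so the fetch_and_add pattern from Listing 9 can be tried without TBB (a sketch using std::atomic's fetch_add rather than the TBB API; the function name is illustrative):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> count(0);

// Two threads each add 1000 with no mutex; fetch_add is atomic,
// so no increment is lost however the threads interleave.
void add_from_two_threads() {
    std::thread t1([] { count.fetch_add(1000); });
    std::thread t2([] { count.fetch_add(1000); });
    t1.join();
    t2.join();
}
```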
The atomic<T> class provides the following five basic operations:

y = x;                    // atomic read
x = b;                    // atomic write
x.fetch_and_store(y);     // x = y and return the old value of x
x.fetch_and_add(y);       // x += y and return the old value of x
x.compare_and_swap(y, z); // if (x == z) x = y; in either case, return the old value of x
In addition, the operators +=, -=, ++, and -- are supported for convenience; they are implemented on top of fetch_and_add. Listing 10 shows how these operators are defined in tbb/atomic.h.

Listing 10. The operators ++, --, += and -= defined using fetch_and_add
value_type operator+=(D addend) {
    return fetch_and_add(addend) + addend;
}

value_type operator-=(D addend) {
    // Additive inverse of addend computed using binary minus,
    // instead of unary minus, for sake of avoiding compiler warnings.
    return operator+=(D(0) - addend);
}

value_type operator++() {
    return fetch_and_add(1) + 1;
}

value_type operator--() {
    return fetch_and_add(__TBB_MINUS_ONE(D)) - 1;
}

value_type operator++(int) {
    return fetch_and_add(1);
}

value_type operator--(int) {
    return fetch_and_add(__TBB_MINUS_ONE(D));
}
Note that the type T in atomic<T> can only be an integral type, an enumeration type, or a pointer type.
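Of the five basic operations, compare_and_swap is the least obvious. Its standard-C++ counterpart is std::atomic's compare_exchange_strong, sketched here as an emulation (not TBB code; the wrapper function name mirrors TBB's for readability):

```cpp
#include <atomic>

// Emulates TBB's x.compare_and_swap(y, z) with std::atomic:
// if v == z, v becomes y; in either case, the value v held
// before the attempt is returned.
int compare_and_swap(std::atomic<int>& v, int y, int z) {
    int expected = z;
    v.compare_exchange_strong(expected, y); // swaps only when v == z
    // On success, expected still holds z (the old value);
    // on failure, compare_exchange_strong wrote v's current value into it.
    return expected;
}
```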
Conclusion
Space does not permit a full description of the Intel TBB library in this article. However, Intel's Web site offers a wide range of articles describing its various aspects. The purpose of this article was simply to introduce some of the interesting facilities Intel TBB provides, such as tasks, concurrent containers, algorithms, and ways to write lock-free code. I hope this introduction arouses your interest in Intel TBB and turns you into as avid a user as this author.