Concurrency and parallelism (I)

I. Introduction

A couple of days ago I saw a question on GitHub asking how to explain concurrency and parallelism to a five-year-old child. Someone answered with this picture:

The picture is interesting: it uses coffee machines as a metaphor for concurrency and parallelism. The most direct impression it gives is that concurrency is stateful, in that a thread executes one task at a time and proceeds to the next stage only when the current one finishes, while parallel execution is stateless.

In recent years, the processing capability of computers has grown exponentially, and workloads that once required a workstation can now run on laptops or handheld devices. However, as single-core processing speed has approached its limits, processors have moved toward multiple cores, and one of the simplest ways to improve program performance is to make full use of the computing resources of a multi-core processor. Writing a program that exploits a multi-core processor is not that simple, though. This is one reason functional programming languages such as F#, Scala, and Erlang have become increasingly popular: their emphasis on immutability and recursion simplifies parallel and concurrent programming to a certain extent.

This article briefly discusses parallel programming in .NET from two aspects: task parallelism and data parallelism. It cannot cover all the APIs, frameworks, tools, and design patterns. If you are interested, read dedicated books on concurrency and parallelism in .NET, such as Concurrent Programming on Windows, which treat the topic in depth; this article mainly draws on .NET Performance.

II. Problems and challenges

One challenge for parallel processing is the heterogeneity of modern multi-core systems. CPU manufacturers now ship 4-core, 8-core, and larger CPUs for consumer systems, and a current workstation or high-end laptop (mobile workstation) usually also has a powerful graphics processor (GPU) that can run hundreds or even thousands of threads concurrently. Deciding which work can exploit the GPU's computing power and how to divide tasks between the CPU and the GPU is not obvious; this CPU/GPU heterogeneity complicates parallel development to a certain extent.

Nevertheless, parallelism and asynchrony can deliver real performance gains. I/O-bound applications can move I/O operations to another thread; by executing I/O asynchronously, the application becomes more responsive and scales more easily. CPU-bound applications can use all available CPU cores in parallel, or offload computation-heavy work to the GPU, to improve performance substantially. Later we will see that a simple array multiplication can run about 130 times faster after changing just a few lines so that the code runs on the GPU.

Parallelism also brings problems of its own: deadlocks, race conditions, starvation, and memory corruption that is hard to reproduce under single-step debugging. Current parallel frameworks, such as the Task Parallel Library (TPL) in .NET 4.0 and C++ AMP, reduce the complexity of writing parallel programs to a certain extent and improve the performance of parallel code.

III. Why concurrency and parallelism?

Through concurrency and parallelism, applications can make full use of the computing capabilities of multiple cores and GPUs to improve performance, for example in the following aspects:

  • Asynchronous I/O operations can improve the responsiveness of applications. Most GUI applications use a single thread to control all updates to the UI; that UI thread should never be occupied for long, or the interface stops responding to the user.
  • Parallelizing work across threads makes better use of system resources. Modern computers with multiple CPU cores and GPUs can dramatically improve the performance of CPU-bound applications through parallel processing.
  • Performing several I/O operations at once (for example, fetching data from multiple websites simultaneously) improves overall throughput, because while one I/O operation is pending we can initiate new operations or process results that have already arrived (see the sketch after this list).
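To make the last point concrete, here is a minimal sketch (my addition, not from the original article) that issues several web requests at once and waits for them all, using the .NET 4.0-era Task.Factory.FromAsync bridge over WebRequest's Begin/End methods. It assumes the System, System.IO, System.Net, and System.Threading.Tasks namespaces; the method and variable names are illustrative only:

public static string[] DownloadAll(string[] urls)
{
    Task<string>[] downloads = new Task<string>[urls.Length];
    for (int i = 0; i < urls.Length; i++)
    {
        WebRequest request = WebRequest.Create(urls[i]);
        // Start the response asynchronously; nothing blocks while requests are in flight.
        downloads[i] = Task.Factory
            .FromAsync<WebResponse>(request.BeginGetResponse, request.EndGetResponse, null)
            .ContinueWith(t =>
            {
                using (WebResponse response = t.Result)
                using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            });
    }
    Task.WaitAll(downloads);   // all requests were outstanding simultaneously
    return Array.ConvertAll(downloads, t => t.Result);
}

Because every request is started before any response is awaited, total time is roughly that of the slowest request rather than the sum of all of them.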


IV. Evolution of concurrency and asynchrony in .NET

1. Threads -> thread pool -> tasks

To solve parallel problems, the concept of multithreading was introduced first. Threads are also the most direct way to express parallel and asynchronous operations.

To illustrate the problem, we use the example of finding prime numbers: given a range of natural numbers, find all the primes in it and store them in a collection. First, we write a conventional version that runs on a single CPU thread:

public static IEnumerable<int> PrimesInRange_Sequential(int start, int end)
{
    List<int> primes = new List<int>();
    for (int i = start; i < end; i++)
    {
        if (IsPrime(i)) primes.Add(i);
    }
    return primes;
}

public static bool IsPrime(int number)
{
    if (number < 2) return false;   // guard added: 0, 1, and negatives are not prime
    if (number == 2) return true;
    // Deliberately naive trial division; an O(sqrt(n)) refinement is discussed below.
    for (int divisor = 2; divisor < number; divisor += 1)
    {
        if (number % divisor == 0) return false;
    }
    return true;
}


This is the simplest way to find primes, and it may seem fast enough at first. Now suppose the range is large, say [100, 100000). On my i5 notebook, the algorithm above takes nearly 12,000 ms, so there is plenty of room for optimization.

The first optimization is algorithmic: for example, we can reduce the per-number test from linear time to O(√n) time by testing divisors only up to the square root of the candidate; a sketch of that refinement follows this paragraph, though for the demonstration we keep the naive version. But no matter how well the test is tuned, the algorithm does not by itself become parallel. If you think about it, though, whether 4977 is prime and whether 3221 is prime are completely independent questions, so the range can be split across threads and tested concurrently. Obviously, when parallelizing we must pay attention to multi-thread synchronization: the shared result collection must not be modified by several threads at the same time.
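For reference, here is a minimal sketch of the O(√n) refinement mentioned above (my addition, not the article's code); it assumes the System namespace for Math.Sqrt:

public static bool IsPrime_Sqrt(int number)
{
    if (number < 2) return false;
    if (number % 2 == 0) return number == 2;        // 2 is the only even prime
    int limit = (int)Math.Sqrt(number);             // no divisor above sqrt(n) is needed
    for (int divisor = 3; divisor <= limit; divisor += 2)
    {
        if (number % divisor == 0) return false;
    }
    return true;
}

With that noted, below is the revised version of the search: it divides the range into equal chunks and runs each chunk on its own thread: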

public static IEnumerable<int> PrimesInRange_Thread(int start, int end)
{
    List<int> primes = new List<int>();
    int range = end - start;
    int threadNum = Environment.ProcessorCount;    // one thread per logical core
    int chunk = range / threadNum;
    Thread[] threads = new Thread[threadNum];
    for (int i = 0; i < threadNum; i++)
    {
        int chunkStart = start + i * chunk;        // each thread gets its own sub-range
        int chunkEnd = chunkStart + chunk;
        threads[i] = new Thread(() =>
        {
            for (int number = chunkStart; number < chunkEnd; ++number)
            {
                if (IsPrime(number))
                {
                    lock (primes)                  // List<int> is not thread-safe
                    {
                        primes.Add(number);
                    }
                }
            }
        });
        threads[i].Start();
    }
    foreach (Thread thread in threads)
    {
        thread.Join();                             // wait for all workers to finish
    }
    return primes;
}

Now the data is divided evenly, a quarter of the range to each of the four threads.

The sequential algorithm takes about 12,000 ms, while the multi-threaded version takes only about 4,000 ms; on a machine with more cores it would run faster still. However, a concurrency profiler reveals a problem. My machine has an i5 processor with four logical cores: at the beginning all four cores are busy, but toward the end only one is still running, so overall CPU usage drops from four cores down to one.

The four threads do not run for the same length of time: some finish earlier than others, which is why CPU usage falls well below 100%.

With more cores the program would indeed run faster, but several issues deserve attention:

  • How many threads should be created? If the system has eight CPU cores, should we create eight threads?
  • How do we ensure we do not monopolize system resources or place an excessive burden on the machine? For example, what if another thread in our process needs to run a parallel computation of its own at the same time?
  • How should threads synchronize access to the result set? Concurrent access to a List<int> from multiple threads is unsafe and can corrupt the data, but taking a lock every time we add a result is costly and becomes a serious bottleneck, especially as the algorithm scales to machines with more processor cores (one common mitigation is sketched after this list).
  • Is it worth creating threads at all for small computations? In such cases it may be better to run the work synchronously on a single thread, because creating and destroying a thread is expensive.
  • How do we ensure all threads receive a comparable workload? Some threads finish their chunk sooner than others: dividing [100, 100000) into four equal ranges, finding the primes in 100-25075 completes much faster than in 75025-100000, because our primality test gets slower as the numbers grow.
  • What happens if a thread throws an exception? Here IsPrime seems unlikely to throw, but in real programs concurrent work can always fail. (The CLR's default policy is to terminate the whole process when a thread has an unhandled exception; that policy is reasonable, but it gives PrimesInRange_Thread no chance to handle the exception.)
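One common way to mitigate the locking cost raised above is to give each thread a private list and merge the lists once at the end, so the shared lock is taken once per thread instead of once per prime. A minimal sketch (my addition, following the same chunking as PrimesInRange_Thread; the method name is illustrative):

public static IEnumerable<int> PrimesInRange_LocalLists(int start, int end)
{
    List<int> primes = new List<int>();
    int threadNum = Environment.ProcessorCount;
    int chunk = (end - start) / threadNum;
    Thread[] threads = new Thread[threadNum];
    for (int i = 0; i < threadNum; i++)
    {
        int chunkStart = start + i * chunk;
        int chunkEnd = (i == threadNum - 1) ? end : chunkStart + chunk; // last chunk takes the remainder
        threads[i] = new Thread(() =>
        {
            List<int> local = new List<int>();     // private to this thread, no lock needed
            for (int number = chunkStart; number < chunkEnd; ++number)
            {
                if (IsPrime(number)) local.Add(number);
            }
            lock (primes) { primes.AddRange(local); }  // one lock per thread, not per prime
        });
        threads[i].Start();
    }
    foreach (Thread t in threads) t.Join();
    return primes;
}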

These questions cannot be answered in a few words. What is needed is a framework that avoids creating too many threads, distributes the workload evenly across them, reports errors, and produces reliable results. This is exactly the job of the Task Parallel Library (TPL), discussed later.

Starting from manual thread management, the natural next step is the thread pool. A thread pool manages a set of threads: instead of creating a thread to run a specific operation, we hand a job to the pool, which picks a suitable thread and executes the given method on it. The thread pool addresses several of the problems listed above: it limits the total number of threads, choosing an appropriate count for the given workload; it amortizes the cost of creating and destroying threads when work items are small; and it helps us avoid monopolizing and overusing system resources.

In our example, we split the range into chunks and queue each chunk to the thread pool:

public static IEnumerable<int> PrimesInRange_ThreadPool(int start, int end)
{
    List<int> primes = new List<int>();
    const int chunkSize = 100;
    int completed = 0;
    ManualResetEvent allDone = new ManualResetEvent(false);
    int chunks = (end - start) / chunkSize;        // assumes the range divides evenly into chunks
    for (int i = 0; i < chunks; i++)
    {
        int chunkStart = start + i * chunkSize;
        int chunkEnd = chunkStart + chunkSize;
        ThreadPool.QueueUserWorkItem(_ =>
        {
            for (int number = chunkStart; number < chunkEnd; number++)
            {
                if (IsPrime(number))
                {
                    lock (primes)
                    {
                        primes.Add(number);
                    }
                }
            }
            // The last chunk to finish signals the waiting caller.
            if (Interlocked.Increment(ref completed) == chunks)
            {
                allDone.Set();
            }
        });
    }
    allDone.WaitOne();
    return primes;
}

This version scales better than the previous one. Compared with the hand-rolled threads, the thread pool version takes only about 2,000 ms; moreover, the concurrency profiler shows CPU usage close to 100%, with every thread finishing at almost the same time.


In CLR 4.0, the thread pool consists of several cooperating parts. When a thread that does not belong to the pool, such as the main thread, queues a task, the task is pushed onto a global FIFO queue. Each pool thread also has its own local queue, which it processes in LIFO order. When a pool thread's local queue is empty, it tries to take work from the local queues of other threads, removing items in FIFO order (this is known as work stealing). Only when all the local queues are empty does a thread go back to the global FIFO queue and take work from there.


Why the mix of FIFO and LIFO? When a task sits in the global queue, no particular worker thread has an affinity for it, so plain FIFO order is appropriate there. But when a pool thread picks up the task it most recently queued locally, the data and instructions that task needs are likely still in the CPU's caches, so LIFO order exploits data- and instruction-level cache locality. Furthermore, accessing a thread's own local queue needs far less synchronization than the global queue, where threads contend with one another for items. And when a thread steals from another thread's queue it takes the oldest item (FIFO), which preserves the cache-friendly LIFO behavior for the queue's owner.

In short, the thread pool's management machinery frees developers from thread lifecycle management and scheduling. The CLR thread pool exposes a few control APIs, such as ThreadPool.SetMinThreads and ThreadPool.SetMaxThreads to bound the number of threads, but no APIs to control the priority of individual threads or work items. Still, it scales easily to larger systems and avoids creating and destroying threads for short-lived work.
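A minimal sketch of those control APIs (my addition; the calls exist in the BCL, but the specific values chosen here are arbitrary examples):

using System;
using System.Threading;

static class PoolSettingsDemo
{
    static void Main()
    {
        int workers, io;
        ThreadPool.GetMinThreads(out workers, out io);
        Console.WriteLine("Min workers: {0}, min I/O: {1}", workers, io);

        // Raise the floor so bursts of work need not wait for the pool's
        // gradual thread-injection heuristic; returns false if rejected.
        bool accepted = ThreadPool.SetMinThreads(Environment.ProcessorCount * 2, io);
        Console.WriteLine("SetMinThreads accepted: {0}", accepted);

        ThreadPool.GetMaxThreads(out workers, out io);
        Console.WriteLine("Max workers: {0}", workers);
    }
}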

Thread pool work items are, however, quite primitive: they are stateless, cannot carry exception information, do not support continuations or cancellation, and cannot return a result when they finish. In the Task Parallel Library of .NET 4.0, a task is a higher-level abstraction of a thread pool work item: a structured abstraction over threads and thread-pool work items.

2. Task parallelism

Task parallelism means splitting a large task into a series of smaller tasks through a set of APIs and executing them in parallel on multiple threads. The Task Parallel Library (TPL) provides APIs that can manage thousands of tasks simultaneously on top of the thread pool. The core of the TPL is the System.Threading.Tasks.Task class, which represents a small unit of work. The Task class provides the following capabilities (illustrated in the sketch after the list):

  • Schedule tasks to run independently on unspecified threads. To run a task on a particular thread you use a task scheduler; the default scheduler queues tasks to the CLR thread pool, but other schedulers can dispatch tasks to specific threads, such as the UI thread.
  • Wait for a task to finish and obtain its result.
  • Provide a mechanism for follow-up work to run as soon as a task completes. This is usually called a callback, but here we use the term "continuation".
  • Handle exceptions thrown by a single task, or by a hierarchy of tasks, on the originating thread or on any thread that observes the task's result.
  • Cancel a task before it has started, or request cancellation while it is running.
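A compact sketch (my addition, not from the article) exercising these capabilities: a result-returning task, a continuation, cooperative cancellation, and exception handling:

using System;
using System.Threading;
using System.Threading.Tasks;

static class TaskFeaturesDemo
{
    static void Main()
    {
        var cts = new CancellationTokenSource();

        // A task that computes a result and observes a cancellation token.
        Task<long> sum = Task.Factory.StartNew(() =>
        {
            long total = 0;
            for (int i = 0; i < 1000000; i++)
            {
                cts.Token.ThrowIfCancellationRequested();  // cooperative cancellation point
                total += i;
            }
            return total;
        }, cts.Token);

        // A continuation that runs when the task finishes.
        Task report = sum.ContinueWith(t => Console.WriteLine("Sum = {0}", t.Result));

        try
        {
            report.Wait();   // task exceptions surface here, wrapped in AggregateException
        }
        catch (AggregateException ex)
        {
            foreach (Exception inner in ex.InnerExceptions)
                Console.WriteLine("Task failed: {0}", inner.Message);
        }
    }
}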

We can think of tasks as a higher-level abstraction over threads, so we could rewrite the prime-finding code with tasks instead of threads. Tasks actually make the code shorter: we no longer need to count completed chunks or signal a ManualResetEvent to track progress; a sketch of such a rewrite follows this paragraph. As we will see later, however, the data parallel APIs of the TPL suit a loop like the prime search even better.
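A hedged sketch (my addition) of what that task-based rewrite might look like: each chunk becomes a Task<List<int>>, and Task.WaitAll replaces the manual counter and event; the method name is illustrative:

public static IEnumerable<int> PrimesInRange_Tasks(int start, int end)
{
    const int chunkSize = 100;
    int chunks = (end - start) / chunkSize;    // assumes the range divides evenly
    var tasks = new Task<List<int>>[chunks];
    for (int i = 0; i < chunks; i++)
    {
        int chunkStart = start + i * chunkSize;
        int chunkEnd = chunkStart + chunkSize;
        tasks[i] = Task.Run(() =>
        {
            var local = new List<int>();
            for (int number = chunkStart; number < chunkEnd; number++)
            {
                if (IsPrime(number)) local.Add(number);
            }
            return local;                      // no shared state, so no lock needed
        });
    }
    Task.WaitAll(tasks);                       // also propagates any task exceptions
    List<int> primes = new List<int>();
    foreach (var t in tasks) primes.AddRange(t.Result);
    return primes;
}

Since the data parallel APIs handle such loops more directly, let us take a different example.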

Quicksort is a well-known comparison sort based on recursion, with an average time complexity of O(n log n), and its recursive structure makes it easy to parallelize. The sequential code is as follows:

public static void QuickSort_Sequential<T>(T[] items) where T : IComparable<T>
{
    QuickSort_Sequential(items, 0, items.Length);
}

private static void QuickSort_Sequential<T>(T[] items, int left, int right) where T : IComparable<T>
{
    if (left == right) return;
    int pivot = Partition(items, left, right);
    QuickSort_Sequential(items, left, pivot);
    QuickSort_Sequential(items, pivot + 1, right);
}

private static int Partition<T>(T[] items, int left, int right) where T : IComparable<T>
{
    // Fixed: the original computed (right - left) / 2, which is wrong once left > 0.
    int pivotPosition = left + (right - left) / 2;
    T pivotValue = items[pivotPosition];
    Swap(ref items[right - 1], ref items[pivotPosition]);   // park the pivot at the end
    int store = left;
    for (int i = left; i < right - 1; ++i)
    {
        if (items[i].CompareTo(pivotValue) < 0)
        {
            Swap(ref items[i], ref items[store]);
            ++store;
        }
    }
    Swap(ref items[right - 1], ref items[store]);           // move the pivot into place
    return store;
}

private static void Swap<T>(ref T a, ref T b)
{
    T temp = a;
    a = b;
    b = temp;
}

Each recursive step of quicksort can be parallelized: sorting the left part and the right part are independent tasks that require no synchronization between them, which is easy to express with tasks. Here is a first attempt at parallelizing quicksort with tasks:

private static void QuickSort_Parallel<T>(T[] items) where T : IComparable<T>
{
    QuickSort_Parallel(items, 0, items.Length);
}

private static void QuickSort_Parallel<T>(T[] items, int left, int right) where T : IComparable<T>
{
    if (left == right) return;
    int pivot = Partition(items, left, right);
    // The two halves touch disjoint parts of the array, so no locking is needed.
    Task leftTask = Task.Run(() => QuickSort_Parallel(items, left, pivot));
    Task rightTask = Task.Run(() => QuickSort_Parallel(items, pivot + 1, right));
    Task.WaitAll(leftTask, rightTask);
}

The Task.Run method creates a new task and schedules it immediately (equivalent to constructing a Task and calling its Start method). The static Task.WaitAll method blocks until all the given tasks have completed. Note that we wrote no logic for creating or destroying threads, or for tracking when the work finishes.

The TPL also offers a helper, Parallel.Invoke, which executes a set of delegates in parallel and returns once all of them have completed. The core of the parallel quicksort can be rewritten with it:

Parallel.Invoke(() => QuickSort_Parallel(items, left, pivot),
                () => QuickSort_Parallel(items, pivot + 1, right));

Whether we use Parallel.Invoke or create the tasks by hand, comparing this version with the sequential one shows that the parallel version is actually slower, even on a well-configured machine: sorting 1,000,000 random integers takes about 2,500 ms with the sequential version but about 4,100 ms with the parallel version.

Why is the parallel version slower than the sequential one? The problem is that parallelism needs enough data to pay for itself. As the recursion splits the array into smaller and smaller pieces, parallelizing those tiny sub-sorts stops being worthwhile: for small arrays, most of the cost of the parallel version goes into creating Task objects, queuing them to the thread pool, and waiting for them to complete, and that overhead far exceeds the time spent actually comparing elements.

3. Controlling parallelism in recursive algorithms

Several strategies can restore the lost performance of the parallel algorithm above:

  • Use the parallel version only while the array segment is larger than a threshold; below it, fall back to the sequential version.
  • Use the parallel version only while the recursion depth is below a threshold; beyond it, use the sequential version. (This is equivalent to the first option only when the pivot always lands near the middle of each segment.)
  • Use the parallel version only while the number of outstanding tasks is below a threshold; otherwise run sequentially. (This criterion places no other limit on parallelism, such as recursion depth or input size.)

In this example, limiting parallelism by segment size gives excellent results: on my machine the result is more than four times faster than the sequential version. The code change is tiny; we only add a threshold, whose value has to be found by experiment.

private static void QuickSort_Parallel_Threshold<T>(T[] items, int left, int right) where T : IComparable<T>
{
    if (left == right) return;
    int pivot = Partition(items, left, right);
    if (right - left > 500)   // threshold found by experiment
    {
        Parallel.Invoke(() => QuickSort_Parallel_Threshold(items, left, pivot),
                        () => QuickSort_Parallel_Threshold(items, pivot + 1, right));
    }
    else
    {
        QuickSort_Sequential(items, left, pivot);
        QuickSort_Sequential(items, pivot + 1, right);
    }
}

The same technique parallelizes many other recursive algorithms. In fact, most recursive algorithms decompose their input into several independently processed parts and then merge the partial results.

As we have seen, the TPL makes it easy to parallelize a recursive algorithm like quicksort with good results. The general recipe is:

  1. For an algorithm that can be expressed recursively, first write the ordinary sequential recursive version.
  2. Parallelize the independent recursive calls at each step.
  3. Determine the thresholds between parallel and sequential processing by experiment.

Beyond the example above, many other recursive algorithms parallelize just as easily, such as the well-known merge sort and Strassen matrix multiplication; a merge sort sketch appears after the link below. There are many samples of parallel processing with the TPL; if you are interested, take a look at:

Http://code.msdn.microsoft.com/windowsdesktop/Samples-for-Parallel-b4b76364
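As one illustration, here is a hedged sketch (my addition, not taken from those samples) applying the same three-step recipe to merge sort. It assumes the System and System.Threading.Tasks namespaces, and the threshold is a placeholder to be tuned by experiment:

private const int MergeThreshold = 2048;   // placeholder; tune by experiment

public static void MergeSort_Parallel<T>(T[] items, T[] scratch, int left, int right)
    where T : IComparable<T>
{
    if (right - left <= 1) return;          // zero or one element: already sorted
    int mid = left + (right - left) / 2;
    if (right - left > MergeThreshold)
    {
        // The two halves are independent, so sort them in parallel.
        Parallel.Invoke(
            () => MergeSort_Parallel(items, scratch, left, mid),
            () => MergeSort_Parallel(items, scratch, mid, right));
    }
    else
    {
        MergeSort_Parallel(items, scratch, left, mid);
        MergeSort_Parallel(items, scratch, mid, right);
    }
    // Merge the two sorted halves through the scratch buffer.
    int i = left, j = mid, k = left;
    while (i < mid && j < right)
        scratch[k++] = items[i].CompareTo(items[j]) <= 0 ? items[i++] : items[j++];
    while (i < mid) scratch[k++] = items[i++];
    while (j < right) scratch[k++] = items[j++];
    Array.Copy(scratch, left, items, left, right - left);
}

A caller would invoke it as MergeSort_Parallel(data, new T[data.Length], 0, data.Length); parallel siblings write to disjoint ranges of the scratch buffer, so no locking is needed.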

V. Conclusion

This article briefly introduced the problems faced by parallel and concurrent programming, walked through the evolution of concurrent programming in .NET using the prime-finding example, and finally showed how to use the TPL to parallelize recursive algorithms. I hope it helps you understand concurrency and parallel programming.
