The popularization of multi-core computing platforms has made parallel (or concurrent) program design, commonly called parallel programming, a mainstream programming technology. In fact, parallel computing software has existed for decades, but it originally served mainly applications such as high-performance computing, so parallel programming long carried the aura of a rarefied, specialist art. Now that we talk about multi-core programming and about the many software tools and parallel programming models available, it is still difficult for beginners to find their way.
In fact, parallel programming does follow rules. Parallel program design can be divided into the following four stages, in the order of the development process:
1. Description and analysis of a feasible algorithm (solution)
2. Decomposition: analysis of dependencies and of synchronization and communication overhead
3. Selection of a programming (implementation) model
4. Performance check and optimization
In the initial design stage, developers should first find a feasible solution or algorithm for the problem. For sorting, for example, bubble sort, quicksort, binary tree sort, and other known methods can all serve as the basic algorithm to be parallelized. Before any concrete parallel design is done, a very important analysis (or evaluation) should be carried out: a necessity analysis of parallelism. That is, estimate the computational workload of the target problem. If the workload is small, for example sorting only 30 integers, a traditional serial program already takes little time, so there may be no need to design a parallel program at all, since parallelization brings additional costs of its own.

The choice of the basic algorithm also deserves care. Some algorithms expose little concurrency, for example Kruskal's algorithm for the Minimum Spanning Tree (MST) in graph theory, while others scale well, for example Borůvka's algorithm for MST (for examples of parallelizing MST algorithms, see the ISN academic community courseware). Determining the parallelism of an algorithm generally requires the help of some theory and tools. Amdahl's law is the theorem most designers use to estimate the upper bound of the parallel speedup, and Gustafson's law is a powerful theoretical guide for analyzing the scalability of parallel programs. To apply these theorems, software (profiling) tools are also needed to determine some of the parameters they require, such as the parallelizable fraction P of the serial program. Commonly used tools that provide this function include Intel's performance analyzer VTune and perfmon on Windows. For the use of the theory and the software tools, refer to the academic community courseware. Settling on the basic algorithm usually requires the developer to look ahead to the next step and iterate several times.
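To make Amdahl's law concrete, the following small C sketch computes the speedup upper bound 1 / ((1 - P) + P/n) for a parallelizable fraction P, assuming P (here 0.80) has already been measured with a profiling tool such as VTune; the value is illustrative only:

#include <stdio.h>

/* Amdahl's law: upper bound on the speedup of a program whose
 * parallelizable fraction is p when run on n processing units. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    double p = 0.80;   /* assumed: profiling showed 80% of the run time is parallelizable */
    for (int n = 1; n <= 16; n *= 2)
        printf("cores = %2d  max speedup = %.2f\n", n, amdahl_speedup(p, n));
    return 0;
}

Even with 16 cores, a program that is only 80% parallelizable cannot exceed a speedup of 4; this kind of quick estimate is exactly what the necessity analysis is for.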
After the candidate basic algorithm is determined, the developer needs to perform a very important step in parallel programming: decomposition. Decomposition is an analysis of the basic algorithm that divides it into several relatively independent parts (or operations), so that in the next step (selecting an appropriate programming model) these parts can be assigned to multiple execution (processing) units. In general, work decomposition methods fall into task decomposition and data decomposition. Task decomposition splits the algorithm into several sub-tasks that can be executed simultaneously, according to the dependencies among the operations; for example, independent functions such as h, q, r, and s can be executed in parallel, as in the sketch below.
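As a minimal sketch of task decomposition, assuming the functions h, q, r, and s have no dependencies among them (the functions here are placeholders), the independent calls can be expressed, for example, with OpenMP sections:

#include <omp.h>
#include <stdio.h>

/* Placeholder sub-tasks; assumed to be independent of one another. */
static void h(void) { printf("h ran on thread %d\n", omp_get_thread_num()); }
static void q(void) { printf("q ran on thread %d\n", omp_get_thread_num()); }
static void r(void) { printf("r ran on thread %d\n", omp_get_thread_num()); }
static void s(void) { printf("s ran on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    /* Task decomposition: each independent sub-task may run on a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        h();
        #pragma omp section
        q();
        #pragma omp section
        r();
        #pragma omp section
        s();
    }
    return 0;
}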
Data decomposition means dividing a large data set to be processed, such as an array, into several subsets so that the different subsets can be operated on simultaneously. There is also another commonly used decomposition method, the pipeline. Its basic principle follows the operation of a production line: a large task is broken down into several closely linked stages, so that the work unit of each stage is kept busy and overall efficiency improves. The online courseware of the academic community has a detailed explanation of work decomposition, so I will not go into details here.

Applying these decomposition methods well requires a lot of analysis practice; developers accumulate experience and gradually improve the correctness and efficiency of their decompositions. On the other hand, some general rules of thumb can be borrowed. For example, a loop body in an algorithm usually suggests that data decomposition may apply: different iterations (if there are no dependencies between them) can be divided into subsets and assigned to different threads or processes. Similarly, procedures and functions are the means by which programmers package common operations for readability and maintainability, which implies that function bodies are also good candidates for task decomposition. In applications such as media stream processing, the pipeline is a common means of decomposition.

Attentive readers may have noticed that the decomposition methods introduced above mention dependence (dependency) several times. Dependence is an important criterion for work decomposition, because the dependencies between the decomposed parts directly affect their parallelism. For example, if dependence-graph analysis shows that each loop iteration has a read/write dependence on an element a[i] written by another iteration, then the loop body cannot be parallelized through data decomposition, and that decomposition scheme is not feasible (of course, some dependencies can be removed by optimization; for reasons of length, refer to the academic community courseware for dependence analysis).
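The two cases can be illustrated with a small sketch (the arrays a and b and the operations are illustrative): the first loop has independent iterations and can be data-decomposed, while the second has a loop-carried read/write dependence on a[i-1] and cannot be split across threads as it stands:

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    /* Data decomposition works: iterations are independent,
       so they can be distributed across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    /* Data decomposition fails: iteration i reads a[i-1], which is
       written by the previous iteration (a loop-carried read/write
       dependence), so the iterations cannot simply be divided up. */
    for (int i = 1; i < N; i++)
        a[i] = a[i - 1] + b[i];

    return 0;
}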
It should be pointed out that most applications (algorithms) cannot be broken into completely independent parts; a certain amount of dependence and coordination is needed to complete the work. The goal of decomposition should therefore be to break the basic algorithm into parts with the smallest possible mutual dependence, because the coordination, synchronization, and communication between the parts incur additional overhead: this is the cost of parallelism. Developers should make sure that the benefit of parallelism outweighs this overhead (think like a merchant weighing gains against costs). Dependence analysis during work decomposition also helps us determine the nature of resources such as variables and data structures, for example whether they are shared or private, and thus determine which objects need synchronization and communication and which means apply (such as semaphores or mutexes).
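As a sketch of how this classification plays out, assume each thread accumulates a sum in a private local variable and then adds it into a shared total; only the shared total needs a synchronization primitive, here a Pthreads mutex (the names and workload are illustrative):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define CHUNK    1000

static long shared_total = 0;                  /* shared resource: needs synchronization */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    long local_sum = 0;                        /* private resource: no synchronization needed */
    for (int i = 0; i < CHUNK; i++)
        local_sum += i;

    pthread_mutex_lock(&lock);                 /* synchronization cost paid once per thread */
    shared_total += local_sum;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}

Keeping the per-thread work in a private variable and locking only for the final update is one way to keep the parallel overhead smaller than the parallel gain.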
After the basic algorithm has been decomposed, a parallel algorithm (solution) begins to take shape. At this point the designer should choose a programming model (experienced developers may also mix several) to implement it. Since introductory material on this topic is plentiful, I will only share some experience. In general, beginners and developers who want fast parallelization can use implicit threading models such as OpenMP. Experienced or hard-core developers can use the Windows Win32 threading API, the POSIX threads (Pthreads) API, or MPI for distributed platforms; explicit threading gives relatively low-level control over details such as thread creation, destruction, and synchronization. If you want a programming model that offers both high-level (implicit) parallelization and low-level (explicit) control, Java and Intel's Threading Building Blocks (TBB) are good choices.
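For comparison with the explicit Pthreads sketch above, the same shared-sum pattern written with implicit threading needs only an OpenMP reduction clause; the runtime takes care of thread creation, work distribution, and the synchronization of the partial sums (again an illustrative sketch, not a recommendation of one model over another):

#include <stdio.h>

int main(void)
{
    long total = 0;

    /* Implicit threading: the OpenMP runtime creates the threads,
       splits the iterations, and combines the private partial sums. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 4 * 1000; i++)
        total += i % 1000;

    printf("total = %ld\n", total);
    return 0;
}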
After the first three steps, the developer should have a working parallel program. At this point, ask yourself: is the parallelization effect sufficient, and does it meet your needs? If the answer is no, ask whether you have enough resources (manpower, time) for further optimization. If you do, you can repeat the previous three steps to exploit more parallelism and reduce synchronization and communication overhead, thereby improving performance.
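A rough way to answer the first question, assuming an OpenMP implementation, is simply to time the serial and parallel runs of the same kernel and compute the measured speedup (the kernel and sizes here are illustrative; a profiling tool such as VTune gives a much more detailed picture):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 50000000

/* Time one pass over the array; `parallel` toggles the OpenMP worksharing. */
static double run(double *a, int parallel)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel for if(parallel)
    for (int i = 0; i < N; i++)
        a[i] = (double)i * 0.5;
    return omp_get_wtime() - t0;
}

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    double t_serial   = run(a, 0);
    double t_parallel = run(a, 1);
    printf("serial %.3fs  parallel %.3fs  speedup %.2f\n",
           t_serial, t_parallel, t_serial / t_parallel);
    free(a);
    return 0;
}

Note that a memory-bound kernel like this one may show only a modest speedup; that result is itself a useful signal when deciding whether further optimization is worth the effort.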