MPI Programming and Performance Optimization

Section 1: Introduction to MPI

1.1 MPI and its history

Like OpenMP, the Message Passing Interface (MPI) is a programming interface standard rather than a specific programming language. The standard is discussed and maintained by the Message Passing Interface Forum (MPIF).

Drafting of the MPI standard began in 1992. The first version, MPI-1 (MPI v1.0, later refined as v1.1 and v1.2), was released in 1994, and the second version, MPI-2 (MPI v2.0), was released in 1997. MPI has since become the de facto standard for message-passing parallel programming and the most popular parallel programming interface of this kind.

Because MPI defines a unified interface, the standard is widely supported on a variety of parallel platforms, which gives MPI programs good portability. MPI currently supports several programming languages, including Fortran 77, Fortran 90, C, and C++. It runs on a wide range of operating systems, including most Unix-like systems and Windows systems (Windows 2000, Windows XP, and so on), and it supports diverse hardware platforms such as multicore processors, symmetric multiprocessors (SMP), and clusters.

1.2 Introduction to typical MPI implementations

1. MPICH

MPICH is the most influential MPI implementation and the one with the largest user base.

The main characteristics of MPICH are:

Open source;

Developed in step with the MPI standard;

Support for multiple-program multiple-data (MPMD) programming and for heterogeneous cluster systems;

Bindings for C/C++, Fortran 77, and Fortran 90, with Fortran support provided both through the header file mpif.h and through a Fortran module;

Support for Unix-like and Windows NT platforms;

Support for a wide range of environments, including multicore processors, SMP systems, clusters, and large-scale parallel computing systems.

In addition, the MPICH software package integrates parallel programming environment components, including a parallel performance visualization/analysis tool and performance testing tools.

2. Intel MPI

Intel MPI is an MPI implementation from Intel Corporation that follows the MPI-2 standard. Its latest version, 3.0, features flexible support for multiple architectures.

Intel MPI provides a middle layer called the Direct Access Programming Library (DAPL) to support multiple architectures. It is compatible with a variety of network hardware and protocols and optimizes the network interconnect. Figure 7.1 shows the Intel MPI library and its DAPL-based interconnect architecture.

Figure 7.1 Intel MPI Library and its DAPL-based interconnect architecture

As Figure 7.1 shows, Intel MPI transparently supports TCP/IP and shared memory, and uses DAPL to support a variety of high-performance interconnects. Intel MPI also provides good thread safety, so multithreaded MPI programs are not restricted in their use of MPI.

1.3 Characteristics of MPI programs

An MPI program is a parallel program based on message passing. Message passing means that each process executing in parallel has its own independent stack and code segment and executes independently, as if it were a separate program; all information exchange between processes is accomplished by explicitly calling communication functions.

Message-passing parallel programs can be divided into two kinds: single-program multiple-data (SPMD) and multiple-program multiple-data (MPMD). An SPMD program uses one program to process multiple different data sets in parallel; the program instances executing in parallel are fully equal peers. Correspondingly, an MPMD program uses different programs to process multiple data sets, cooperating to solve the same problem.

SPMD is the most common parallel model in MPI programming. Figure 7.2 is a schematic diagram of the SPMD execution model: the same program, prog_a, runs on different processing cores and processes different data sets.

Figure 7.2 SPMD Execution Model

Figure 7.5 shows the execution models of three typical MPMD programs.

(a) is a manager (master)/worker type of MPMD program.

(b) is another type of MPMD program, a coupled data analysis program: for most of the time the different programs perform their own tasks independently and exchange data only at specific times.

(c) is a pipelined MPMD program, made up of prog_a, prog_b, and prog_c; the execution of these three programs resembles an assembly line in a factory.


Figure 7.5 MPMD Execution Model

By studying the characteristics of MPI programs described in this section, together with the various SPMD and MPMD execution models, readers will be able to design different kinds of MPI parallel programs flexibly in their future development work.

The advanced features introduced by MPI-2 are briefly described in Section 7.6.





Section 2: MPI Programming Basics

2.1 A simple MPI program example

First, let us look at a simple MPI program example. Like the first program in most languages, the first MPI program is "Hello World".

/* Example 1: hellow.c */

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    int size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello World from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

As described earlier, we compile and link this program with the following command:

mpicc -o hellow hellow.c

Run this example by executing mpiexec -np 4 ./hellow in the directory that contains the executable. The output is as follows:

Hello World from process 0 of 4
Hello World from process 1 of 4
Hello World from process 2 of 4
Hello World from process 3 of 4

In each of the processes started by the MPI run, this program prints the MPI process number (0 to 3) and the total number of processes (4).

It is worth noting that, because the four processes execute in parallel, the order of the output lines may vary; the program places no constraint on which process prints first and which prints last.

2.2 Four basic functions of an MPI program

1. MPI_Init and MPI_Finalize

MPI_Init initializes the MPI execution environment, establishes links among the MPI processes, and prepares for subsequent communication. MPI_Finalize ends the MPI execution environment.

Just as OpenMP defines a parallel region, these two functions delimit the parallel region of an MPI program. That is, apart from functions that detect whether MPI has been initialized, no other MPI function should be called outside the region defined by this pair.

2. MPI_Comm_rank

Section 7.1 introduced the SPMD program form and gave an example in which the process identifier determines which part of the data a process handles. MPI_Comm_rank identifies each MPI process, answering the question "who am I?" for the process that calls it. MPI_Comm_rank returns an integer error value and takes two arguments:

comm, of type MPI_Comm: the communication domain that identifies the MPI process group participating in the computation;

rank, an integer pointer: returns the process number of the calling process within the corresponding process group. Process numbers start from 0.

3. MPI_Comm_size

This function reports how many processes are in the corresponding process group.

2.3 Point-to-point communication in MPI

Point-to-point communication is the foundation of MPI programming. In this section we focus on the two most important MPI functions, MPI_Send and MPI_Recv.

int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);

Input parameters include:

buf: the starting address of the send buffer, which can be a pointer to an array or structure of any type.

count: an integer giving the number of data items to send; it must be non-negative.

datatype: the data type of the data being sent, described in detail below.

dest: an integer, the process number of the destination process.

tag: an integer message tag, described further below.

comm: the communication domain in which the MPI process group resides, described further below.

The function has no output parameters and returns an error code.

The meaning of this function is: send data to process dest in the communication domain comm. The message data are stored at buf, have type datatype, and number count. The message is marked with tag, which distinguishes it from other messages this process sends to the same destination process.

int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);

Compared with MPI_Send, MPI_Recv has one additional parameter: status, a pointer to an MPI_Status structure through which status information about the received message is returned.

The definition of the MPI_Status structure can be found in mpi.h:

/* The order of these elements must match that in mpif.h */
typedef struct MPI_Status {
    int count;
    int cancelled;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} MPI_Status;

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count);

count returns the number of data items actually received, measured in units of the given data type.
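To make the pairing of MPI_Send and MPI_Recv concrete, here is a short illustrative sketch (not taken from the original text) that assumes the program is started with at least two processes: process 0 sends an array of ten integers to process 1, which receives it and uses MPI_Get_count together with the status fields to inspect what arrived.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, data[10], i, received;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < 10; i++)
            data[i] = i;                                       /* fill the send buffer */
        MPI_Send(data, 10, MPI_INT, 1, 99, MPI_COMM_WORLD);    /* dest = 1, tag = 99   */
    } else if (rank == 1) {
        MPI_Recv(data, 10, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_INT, &received);            /* how many ints actually arrived */
        printf("process 1 received %d ints from process %d\n",
               received, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
}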

2.4 The seven elements of a message

The most important function of MPI is message passing. As we have seen, MPI_Send and MPI_Recv are responsible for sending and receiving messages between two processes. In summary, point-to-point message communication is governed by the following seven parameters:

(1) the send or receive buffer buf;

(2) the number of data items count;

(3) the data type datatype;

(4) the destination or source process dest/source;

(5) the message tag tag;

(6) the communication domain comm;

(7) the message status status, which appears only in the receive function.

1. Message data type

Of the parameters that describe the message buffer, the most noteworthy is datatype, the message data type.

Why does a message data type need to be defined? There are two main reasons: first, to support interoperability across heterogeneous platforms; second, to make it easy to compose messages from data located in non-contiguous memory areas and from data items of different types. MPI programs have strict data type matching requirements. Type matching involves two levels: the type in the host language (the C or Fortran data type) must match the type specified in the communication operation, and the types used by the sender and the receiver must match each other. MPI satisfies these requirements with predefined basic data types and derived data types.

(1) Basic data types

As mentioned earlier, for sending and receiving contiguous data MPI provides predefined data types for programmers to use.

Table 7.3.1 Correspondence between MPI predefined data types and C data types

MPI predefined data type      Corresponding C data type
MPI_CHAR                      signed char
MPI_SHORT                     signed short int
MPI_INT                       signed int
MPI_LONG                      signed long int
MPI_UNSIGNED_CHAR             unsigned char
MPI_UNSIGNED_SHORT            unsigned short int
MPI_UNSIGNED                  unsigned int
MPI_UNSIGNED_LONG             unsigned long int
MPI_FLOAT                     float
MPI_DOUBLE                    double
MPI_LONG_DOUBLE               long double
MPI_BYTE                      no corresponding type
MPI_PACKED                    no corresponding type

For beginners, the important thing is to make sure, as far as possible, that the data types used for sending and receiving are exactly the same.

(2) Derived data types

In addition to these basic data types, MPI allows discontiguous, and even differently typed, data elements to be combined into new data types. Such user-defined data types are called derived data types.
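As a small illustration (a minimal sketch, not from the original text, assumed to run inside an already initialized MPI program with a process 1 to receive), the fragment below uses MPI_Type_vector to describe one column of a row-major 4x4 matrix, whose elements are not contiguous in memory, and sends the whole column in a single message:

double a[4][4];                 /* row-major storage: column elements are 4 doubles apart */
MPI_Datatype column_type;

/* 4 blocks of 1 element each, with a stride of 4 elements between blocks */
MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column_type);
MPI_Type_commit(&column_type);

/* send column 2 of the matrix (a[0][2], a[1][2], a[2][2], a[3][2]) as one message;
   the receiver may receive it either with the same derived type or as 4 MPI_DOUBLEs */
MPI_Send(&a[0][2], 1, column_type, 1, 0, MPI_COMM_WORLD);

MPI_Type_free(&column_type);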

The type matching rules can be summarized as follows:

for communication of typed data, the sender and receiver use the same data type;

for communication of untyped data, both the sender and receiver use MPI_BYTE as the data type;

for communication of packed data, both the sender and receiver use MPI_PACKED.

2. Communication domain

A communication domain (communicator, comm) consists of a process group and a communication context. A process group is a finite, ordered set of processes. The communication domain defines the scope within which messages are passed.

Every MPI implementation predefines two communication domains: MPI_COMM_SELF, whose process group contains only the process itself, and MPI_COMM_WORLD, whose process group contains all the MPI processes that were started. MPI also provides a number of management functions for communication domains, including:

(1) Communication domain comparison, int MPI_Comm_compare(comm1, comm2, result): if comm1 and comm2 are the same handle, result is MPI_IDENT; if the two process groups have the same members with the same ranks, result is MPI_CONGRUENT; if the members are the same but the ranks differ, result is MPI_SIMILAR; otherwise, result is MPI_UNEQUAL.

(2) Communication domain duplication, int MPI_Comm_dup(comm, newcomm): duplicates comm to obtain a new communication domain newcomm.

(3) Communication domain splitting, int MPI_Comm_split(comm, color, key, newcomm): this function must be executed by every process in the process group of comm, and each process specifies a color (an integer). The call forms a new process group from the processes that supplied the same color value, and a newly created communication domain corresponds to each of these process groups. The ranks of the processes within a new communication domain are determined by the key values (integers): the smaller a process's key, the smaller its rank in the new communication domain; if two processes supply the same key, their order follows their ranks in the original communication domain. A process may supply the color value MPI_UNDEFINED, in which case its newcomm returns MPI_COMM_NULL.

(4) Communication domain destruction, int MPI_Comm_free(comm): releases the given communication domain.

All of the above functions return an error code.
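The following fragment is an illustrative sketch (variable names are assumptions, and an initialized MPI environment is assumed): it splits MPI_COMM_WORLD into two communication domains according to the parity of each process's rank, using the rank itself as the key so that the original ordering is preserved.

int world_rank, sub_rank;
MPI_Comm newcomm;

MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

/* processes with the same color (0 = even rank, 1 = odd rank) join the same new domain */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &newcomm);

MPI_Comm_rank(newcomm, &sub_rank);
printf("world rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);

MPI_Comm_free(&newcomm);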

2.5 Timing

MPI provides two timing functions, MPI_Wtime and MPI_Wtick. MPI_Wtime returns a double-precision number giving the number of seconds elapsed from some point in the past to the current moment. MPI_Wtick returns the resolution of the MPI_Wtime result.
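The usual timing pattern looks like the sketch below (illustrative only; do_work() is a hypothetical routine standing in for the computation being measured, and an initialized MPI program is assumed):

double t_start, t_end;

MPI_Barrier(MPI_COMM_WORLD);          /* optional: line the processes up before timing */
t_start = MPI_Wtime();

do_work();                            /* hypothetical routine representing the computation */

t_end = MPI_Wtime();
printf("elapsed %.6f s (timer resolution %.3e s)\n", t_end - t_start, MPI_Wtick());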

2.6 Error management

MPI provides a rich set of interface functions for error management; here we introduce only the simplest part of that interface.

The error code of a received message can be obtained from status.MPI_ERROR.

MPI_Abort terminates the execution of an MPI program:

int MPI_Abort(MPI_Comm comm, int errorcode)

It causes all processes in the communication domain comm to exit and returns errorcode to the calling environment. Calling this function from any single process in comm is enough to end the execution of all processes in that communication domain.




Section 3: MPI Collective Communication

In addition to the point-to-point communication described above, MPI also provides collective communication, which covers one-to-many, many-to-one, and many-to-many process communication patterns. Its defining feature is that multiple processes participate in the communication. Below we describe several of the most commonly used MPI collective communication functions.

3.1 Synchronization

The interface of this function is: int MPI_Barrier(MPI_Comm comm).

This function acts like a roadblock: all processes in the communication domain comm synchronize with one another, that is, they wait until every process has reached its own MPI_Barrier call before any of them proceeds to the subsequent code. The synchronization function is an effective means of controlling execution order in a parallel program.

3.2 Broadcast

As the name suggests, broadcast is a one-to-many transmission: it sends a message from a root process to every other process in the group. Its interface is:

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

Figure 7.13 illustrates the broadcast operation.


Figure 7.13 Broadcast operation schematic
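As a quick illustration (a sketch assuming an initialized MPI program), the fragment below has the root process set a parameter and broadcast it, after which every process holds the same value:

int rank, n = 0;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
    n = 100;                          /* only the root knows the value initially */

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* from this point on, n == 100 on every process in MPI_COMM_WORLD */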

3.3 Gather

The gather function MPI_Gather is a many-to-one communication function. Its interface is:

int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
               void *recvbuf, int recvcnt, MPI_Datatype recvtype,
               int root, MPI_Comm comm)

The root process receives the messages sent by every member process of the communication group (including root itself). These n messages are concatenated in process-number order and placed in the receive buffer of the root process. Each send buffer is identified by the triple (sendbuf, sendcnt, sendtype). On all non-root processes the receive buffer is ignored, while the receive buffer of the root process is identified by the triple (recvbuf, recvcnt, recvtype). Figure 7.14 gives a schematic of the gather operation.


Figure 7.14 Gather operation schematic
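For illustration (a sketch assuming an initialized MPI program and that <stdlib.h> is included for malloc), each process contributes its own rank and the root collects the ranks into an array ordered by process number:

int rank, size;
int *all_ranks = NULL;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (rank == 0)
    all_ranks = (int *) malloc(size * sizeof(int));   /* receive buffer matters only at root */

MPI_Gather(&rank, 1, MPI_INT, all_ranks, 1, MPI_INT, 0, MPI_COMM_WORLD);

if (rank == 0) {
    /* all_ranks now holds 0, 1, ..., size-1, in process-number order */
    free(all_ranks);
}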

3.4 Scatter

int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype,
                void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                int root, MPI_Comm comm)

The scatter function MPI_Scatter performs one-to-many message delivery. Unlike broadcast, however, the root process sends a different message to each process. Scatter is effectively the inverse of the gather operation.
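Conversely, in the following sketch (illustrative, with the same assumptions as the gather example above: an initialized MPI program and <stdlib.h> included), the root prepares one distinct integer per process and scatters them, so that block i of the send buffer ends up on process i:

int rank, size, my_value;
int *send_data = NULL;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (rank == 0) {
    send_data = (int *) malloc(size * sizeof(int));
    for (int i = 0; i < size; i++)
        send_data[i] = 10 * i;                        /* block i is destined for process i */
}

MPI_Scatter(send_data, 1, MPI_INT, &my_value, 1, MPI_INT, 0, MPI_COMM_WORLD);

printf("process %d received %d\n", rank, my_value);
if (rank == 0)
    free(send_data);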

3.5 Extended gather and scatter operations

MPI_Allgather lets every process collect the messages from all other processes; it is equivalent to every process executing MPI_Gather, and afterwards every process holds identical contents in its receive buffer. In other words, each process sends an identical message to all processes, hence the name Allgather. The interface of this function is:

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  MPI_Comm comm)

Figure 7.15 illustrates the extended gather and scatter operation.


Figure 7.15 Extended gather and scatter operation schematic

3.6 Global exchange

In MPI_Allgather each process sends the same message to all processes; in MPI_Alltoall, by contrast, each process sends different data to different processes, so its send buffer is also an array of blocks. Block j sent by process i is received by process j and stored in block i of process j's receive buffer recvbuf. The sendcount and sendtype specified by each process must match the recvcount and recvtype of every other process, which means that the amount of data sent between any pair of processes must equal the amount of data received. The function interface is:

int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm)

The global exchange operation is illustrated in Figure 7.4.4.


Figure 7.4.4 Global exchange operation schematic

3.7 Reduction and scan

MPI provides two kinds of reduction-style operations: reduction (reduce) and scan.

1. Reduction

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
               MPI_Op op, int root, MPI_Comm comm)

The data to be reduced by each process is stored in sendbuf, which can hold either a scalar or a vector. All processes combine these values using the operator op into a final result, which is deposited in the recvbuf of the root process. The data type of the data items is given in the datatype field. The available reduction operations include:

MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical AND
MPI_BAND     bitwise AND
MPI_LOR      logical OR
MPI_BOR      bitwise OR
MPI_LXOR     logical exclusive OR
MPI_BXOR     bitwise exclusive OR
MPI_MAXLOC   maximum value and its location
MPI_MINLOC   minimum value and its location

The data type combinations allowed for the reduction operations are shown in Table 7.4.1.

Table 7.4.1 Correspondence between reduction operations and allowed data types

Operation                        Allowed data types
MPI_MAX, MPI_MIN                 C integer, Fortran integer, floating point
MPI_SUM, MPI_PROD                C integer, Fortran integer, floating point, complex
MPI_LAND, MPI_LOR, MPI_LXOR      C integer, logical
MPI_BAND, MPI_BOR, MPI_BXOR      C integer, Fortran integer, byte

In MPI, all the predefined reduction operations are both associative and commutative. Users may also specify their own operations; these must be associative, but need not be commutative.
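A brief illustrative sketch (assuming an initialized MPI program; compute_local_sum() is a hypothetical routine producing this process's partial result): each process contributes one partial sum and the root receives the global total.

int rank;
double local_sum, global_sum;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
local_sum = compute_local_sum();      /* hypothetical: this process's partial result */

MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0)
    printf("global sum = %f\n", global_sum);          /* result is valid only on the root */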

2. Scan

int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
             MPI_Op op, MPI_Comm comm)

MPI_Scan is commonly used to perform a prefix reduction on data distributed across the group. The operation returns, in the receive buffer of the process with rank i, the reduction of the values held in the send buffers of processes 0 through i (inclusive). The data types, operations, and restrictions on the send and receive buffers are the same as for the reduction. Compared with the reduction, the scan operation omits the root argument, because a scan combines the partial values into n final results stored in the recvbuf of the n processes. The specific scan operation is defined by the op field.

The reduction and scan operations of MPI allow each process to contribute vector values, not just scalars; the length of the vector is given by count. MPI also supports user-defined reduction operations.
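A minimal prefix-sum sketch (illustrative, assuming an initialized MPI program): after the call, the process with rank i holds the sum of the values contributed by processes 0 through i.

int rank, prefix;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

/* on process i, prefix == 0 + 1 + ... + i */
printf("process %d: prefix sum = %d\n", rank, prefix);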


Section 4: MPI Performance Analysis and Optimization Examples

4.1 Choosing the computation granularity

When communication becomes the bottleneck of a parallel program's performance, choosing a coarser computation granularity generally reduces the communication overhead between processes. For example, suppose three unrelated tasks A, B, and C are carried out with 7 processes, where the amount of work in B is twice that of A and the amount in C is four times that of A.

One strategy is to exploit parallelism within each task, as shown in Figure 7.16(a): each task in turn is executed in parallel on the 7 processes, so each task requires its own data distribution and data collection. The task-parallel mode, that is, a coarser-grained distribution of whole tasks across processes (Figure 7.16(b)), needs only one data distribution and one data collection, saving two rounds of collective communication.

(a) Parallelism within tasks (b) Parallelism across tasks

Figure 7.16 Schematic diagram of parallel patterns with different granularity

4.2 Aggregating messages

One way to reduce the number of communications is to aggregate small messages and send them together; this optimization is called message aggregation. When there are many fragmented messages, message aggregation can bring a substantial performance improvement.
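For instance (an illustrative sketch in which mass, velocity, and pressure are assumed variables holding three small values that would otherwise be sent in three separate messages), the values can be packed into one buffer and sent in a single message:

double params[3];

params[0] = mass;                     /* assumed variables holding the three small values */
params[1] = velocity;
params[2] = pressure;

/* one send carries all three values; the receiver unpacks them from the same array */
MPI_Send(params, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);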

4.3 Solving load-balancing problems

In parallel computing, if the work on the individual processors (cores) takes different amounts of time to complete, the processors that finish first must wait for the unfinished ones, wasting computational resources. If this situation is serious, a strategy should be adopted to balance the processor load as evenly as possible. There are generally two kinds of strategies: static load balancing and dynamic load balancing. The former is suitable when the total load can be determined accurately before the computation and is easy to divide evenly among the processes. When the total load is not known in advance, or cannot easily be divided evenly, dynamic load partitioning may be needed instead.

Dense matrix-vector multiplication is an example of static load balancing. Suppose the matrix is of order n x m and there are p identical processors available for the computation. The matrix can be decomposed by rows, giving each processor a block of rows, or by columns, giving each processor a block of columns, as shown in Figure 7.19. Of course, the matrix can also be decomposed into rectangular blocks, in which case the work is assigned according to the size of each block.

For dynamic load balancing, we take triangular matrix-vector multiplication as an example. A manager node sends unprocessed rows of the matrix to the worker nodes; when a worker finishes its task, it actively requests new work from the manager, and when no unfinished tasks remain, the manager sends a termination signal to all processes, as shown in Figure 7.20. This is an example of dynamic load balancing in master/worker style, which in effect maintains a task pool.

Figure 7.19 Static load balancing for matrix-vector multiplication

Figure 7.20 Schematic diagram of dynamic load balancing
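The skeleton below is an illustrative sketch of such a master/worker scheme, not the code behind Figure 7.20: the manager hands out row indices one at a time, each worker returns a result to request more work, and a separate tag signals termination. NROWS and the per-row "computation" are placeholders.

#include <stdio.h>
#include "mpi.h"

#define WORK_TAG 1
#define STOP_TAG 2
#define NROWS    100                  /* assumed total number of rows to process */

int main(int argc, char *argv[]) {
    int rank, size, row, result, dummy = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                                  /* manager: maintains the task pool */
        int next_row = 0, active = 0;
        for (int w = 1; w < size; w++) {
            if (next_row < NROWS) {                   /* prime each worker with one row */
                MPI_Send(&next_row, 1, MPI_INT, w, WORK_TAG, MPI_COMM_WORLD);
                next_row++;
                active++;
            } else {                                  /* more workers than rows: stop the extras */
                MPI_Send(&dummy, 1, MPI_INT, w, STOP_TAG, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (next_row < NROWS) {                   /* hand the requester the next row */
                MPI_Send(&next_row, 1, MPI_INT, status.MPI_SOURCE, WORK_TAG, MPI_COMM_WORLD);
                next_row++;
            } else {                                  /* nothing left: terminate this worker */
                MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, STOP_TAG, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                                          /* worker: request-and-compute loop */
        while (1) {
            MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == STOP_TAG) break;
            result = row;                             /* stand-in for the real row computation */
            MPI_Send(&result, 1, MPI_INT, 0, WORK_TAG, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}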




Source: http://jpck.zju.edu.cn/eln/200805131515180671/page.jsp?cosid=1423&jspfile=page&listfile=list&chapfile=listchapter&path=200805131515180671&rootid=6380&nodeid=6403&docid=8717
