MPI Parallel Programming, Series 3: Parallel Sorting by Regular Sampling (PSRS)


The parallel quicksort algorithm is fairly efficient; under ideal conditions its time complexity can reach O(n). However, parallel quicksort has a serious problem: it can cause severe load imbalance, and in the worst case the complexity of the algorithm degrades to O(n^2). In this article we introduce a parallel sorting algorithm that achieves load balance through even partitioning: parallel sorting by regular sampling (PSRS).

I. Basic Idea of the Algorithm

Assume there are n elements to be sorted and P processors.

First, the n elements are divided evenly into P parts of n/P elements each; every processor takes one part and sorts it locally. To determine where each local sorted sequence falls within the whole, every processor selects a few representative elements from its local sorted sequence; these representatives are gathered and sorted, and P-1 of them are chosen as pivots. Each processor then splits its local sorted sequence into P segments according to the P-1 pivots. The segments are redistributed by a global exchange, so that processor i ends up holding the i-th segment of every processor. Each processor merges the P sorted segments it receives. Finally, concatenating the processors' results in processor order yields the globally sorted sequence.
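As a small illustration: take n = 18 and P = 3, so each processor sorts 6 elements and w = n/P^2 = 2. Each processor picks the elements at positions w = 2 and 2w = 4 of its sorted run; processor P0 gathers these 6 samples, sorts them, and takes the 2nd and 4th as the two pivots; every processor then cuts its run into 3 segments at the pivots, ships segment i to processor i, and finally merges the 3 runs it receives.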

II. Algorithm Description

Following the basic idea above, the algorithm can be described as follows:

Input: an unsorted sequence of n elements.

Output: a globally ordered data sequence, distributed across the processors.

1) Division and local sorting of the unordered sequence

Using the block data division method (see Series 1), the unordered sequence is divided into P parts, and each processor sorts its part with sequential quicksort. Each processor then holds a locally sorted sequence.

2) Select representative elements

Each processor selects P-1 elements from its locally sorted sequence, at positions w, 2w, ..., (P-1)w, where w = n/P^2.
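The helper that performs this sampling (array_sample in the implementation below) is not listed in the article; a minimal sketch of such a regular sampler, assuming it picks count elements at stride step from a sorted run, could look like this:

#include <stdlib.h>

/* Hypothetical regular-sampling helper (the article's array_sample is
 * not shown): return a new array holding the elements at 1-based
 * positions step, 2*step, ..., count*step of a sorted run. */
int *sample_sorted_run(const int *sorted, int count, int step)
{
    int *samples = (int *) malloc(sizeof(int) * count);
    int i;
    for (i = 0; i < count; i++)
        samples[i] = sorted[(i + 1) * step - 1];
    return samples;
}

With count = P-1 and step = w this yields exactly the positions w, 2w, ..., (P-1)w described above.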

3) Determine the pivots

Each processor sends its selected representative elements to processor P0. P0 merges the P sorted sample runs with a multiway merge, then selects the elements at positions P-1, 2(P-1), ..., (P-1)(P-1), a total of P-1 elements, as the pivots.

4) Distribute the pivots

P0 broadcasts the P-1 pivots to all processors.

5) Division of the local sorted sequences

After receiving the pivots, each processor splits its locally sorted sequence into P segments according to them.
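The article's splitting helper (get_array_sepator_resp in the code below) is likewise not listed; a minimal sketch, under the assumption that it only needs to report the length of each of the P segments, could be:

/* Hypothetical sketch of the pivot partitioning step: given a sorted
 * run of n elements and p-1 pivots, compute the p segment lengths. */
void split_by_pivots(const int *sorted, int n,
                     const int *pivots, int p, int *seg_len)
{
    int i, pos = 0;
    for (i = 0; i < p - 1; i++) {
        int start = pos;
        /* linear scan; a binary search would also work on a sorted run */
        while (pos < n && sorted[pos] <= pivots[i])
            pos++;
        seg_len[i] = pos - start;
    }
    seg_len[p - 1] = n - pos;   /* everything above the last pivot */
}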

6) Distribution of the P sorted segments

Each processor sends its i-th segment to processor i, so that processor i ends up holding the i-th segment of every processor.
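In section IV this exchange is written out with pairwise MPI_Send/MPI_Recv calls. The same step can also be expressed with the single collective MPI_Alltoallv; the following sketch assumes seg_len[] holds the local segment lengths from step 5 and recv_len[] the lengths arriving from the other processes (obtainable beforehand with an MPI_Alltoall of the lengths):

#include <mpi.h>

/* Alternative segment exchange using one collective call.
 * seg_len[i]  - length of the segment this process sends to process i
 * recv_len[i] - length of the segment this process receives from process i */
void exchange_segments(int *local_sorted, int *seg_len,
                       int *recv_buf, int *recv_len, int p)
{
    int sdispls[p], rdispls[p], i;   /* C99 variable-length arrays */
    sdispls[0] = rdispls[0] = 0;
    for (i = 1; i < p; i++) {
        sdispls[i] = sdispls[i - 1] + seg_len[i - 1];
        rdispls[i] = rdispls[i - 1] + recv_len[i - 1];
    }
    MPI_Alltoallv(local_sorted, seg_len, sdispls, MPI_INT,
                  recv_buf, recv_len, rdispls, MPI_INT, MPI_COMM_WORLD);
}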

7) Multiway merge

Each processor merges the P sorted segments obtained in the previous step.
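The article's merge helper (mul_merger in the code below) is not listed either; a straightforward, if not optimal, P-way merge that repeatedly takes the smallest head element might look like this (a heap over the run heads would cut the cost from O(total * P) to O(total * log P)):

/* Hypothetical P-way merge sketch: runs[] holds p sorted runs
 * back to back, run_len[] their lengths; out[] receives the merge. */
void p_way_merge(const int *runs, const int *run_len, int p, int *out)
{
    int offs[p], head[p], i, k, total;   /* C99 variable-length arrays */
    for (i = 0, k = 0; i < p; i++) {
        offs[i] = k;       /* start of run i inside runs[] */
        head[i] = 0;       /* next unread element of run i */
        k += run_len[i];
    }
    total = k;
    for (k = 0; k < total; k++) {
        int best = -1;
        for (i = 0; i < p; i++)
            if (head[i] < run_len[i] &&
                (best < 0 ||
                 runs[offs[i] + head[i]] < runs[offs[best] + head[best]]))
                best = i;
        out[k] = runs[offs[best] + head[best]];
        head[best]++;
    }
}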

After these seven steps, reading the data from the processors in order yields a globally sorted sequence.

III. Algorithm Analysis

1) Load balancing analysis

This algorithm is designed for load balance, as the even division in step 1 shows, but the balance is not perfect: the division by pivots in step 5 can still produce segments of unequal size, so the amount of data each processor ends up merging may differ.

2) Time complexity analysis

PSRS is meant for large data volumes (which is the reason for parallelizing in the first place). When n > P^3, the time complexity reaches O((n/P) log n): the local quicksort costs O((n/P) log(n/P)), sorting the P(P-1) samples costs O(P^2 log P), and the final P-way merge costs O((n/P) log P), so for n > P^3 the local sort dominates. A step-by-step analysis is omitted here, since every step uses an ordinary serial sorting algorithm.

IV. Algorithm Implementation

Because the algorithm is involved and the code is long, only the main code is listed here:

 
void psrs_mpi(int *argc, char ***argv) {

    int process_id;
    int process_size;

    int *init_array;              // initial array
    int init_array_length;        // initial array length

    int *local_sample;            // representative elements selected by this process
    int local_sample_length;      // length of the representative-element array

    int *sample;                  // collected representative elements (used by process 0)
    int *sorted_sample;           // sorted representative elements
    int sample_length;            // length of the collected sample

    int *primary_sample;          // pivots

    int *resp_array;              // offset array: length of each segment of this process's array

    int *section_resp_array;      // offset array: lengths of the segments this process receives from the others

    int *section_array;           // segments collected from every process
    int *sorted_section_array;
    int section_array_length;     // total length

    int section_index = 0;        // also used as a message tag, so initialize it

    int i, j;                     // loop variables

    MPI_Request handle;
    MPI_Status status;

    mpi_start(argc, argv, &process_size, &process_id, MPI_COMM_WORLD);
    resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);

    // build an array for each process
    // and sort it with sequential quicksort
    init_array_length = array_length;    // array_length: defined elsewhere
    init_array = (int *) my_mpi_malloc(process_id, sizeof(int) * init_array_length);
    array_builder_seed(init_array, init_array_length, process_id);

    quick_sort(init_array, 0, init_array_length - 1);

    // each processor selects process_size - 1 representative elements
    // from its sorted sequence and sends them to process 0
    local_sample_length = process_size - 1;
    local_sample = array_sample(init_array, local_sample_length,
                                init_array_length / process_size, process_id);

    if (process_id)
        MPI_Send(local_sample, local_sample_length, MPI_INT, 0,
                 sample_data, MPI_COMM_WORLD);

    // process 0 receives the representative elements of every processor
    // and sorts them with a multiway merge
    if (!process_id) {
        sample = (int *) my_mpi_malloc(0, sizeof(int) * process_size * local_sample_length);
        sorted_sample = (int *) my_mpi_malloc(0, sizeof(int) * process_size * local_sample_length);
        array_copy(sample, local_sample, local_sample_length);

        // each request must be completed before the handle is reused
        for (i = 1; i < process_size; i++) {
            MPI_Irecv(sample + local_sample_length * i, local_sample_length,
                      MPI_INT, i, sample_data, MPI_COMM_WORLD, &handle);
            MPI_Wait(&handle, &status);
        }

        for (i = 0; i < process_size; i++)
            resp_array[i] = local_sample_length;

        mul_merger(sample, sorted_sample, resp_array, process_size);

        // select process_size - 1 pivots from the sorted representative elements
        primary_sample = array_sample(sorted_sample, process_size - 1,
                                      process_size - 1, process_id);
    }
    if (process_id)
        primary_sample = (int *) my_mpi_malloc(process_id,
                                               sizeof(int) * (process_size - 1));

    MPI_Bcast(primary_sample, process_size - 1, MPI_INT, 0, MPI_COMM_WORLD);

    // divide the data on each processor into process_size segments by the pivots
    get_array_sepator_resp(init_array, primary_sample, resp_array,
                           init_array_length, process_size);
    if (process_id == ID) {       // ID: debug rank, defined elsewhere
        printf("process %d resp array is: ", process_id);
        array_int_print(process_size, resp_array);
    }

    // each processor sends its i-th segment to processor i;
    // first, exchange the segment lengths
    section_resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);
    section_resp_array[process_id] = resp_array[process_id];

    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++)
                if (i != j)
                    MPI_Send(&(resp_array[j]), 1, MPI_INT, j, section_index,
                             MPI_COMM_WORLD);
        }
        else
            MPI_Recv(&(section_resp_array[i]), 1, MPI_INT, i, section_index,
                     MPI_COMM_WORLD, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    section_array_length = get_array_element_total(section_resp_array, 0, process_size - 1);
    section_array = (int *) my_mpi_malloc(process_id, sizeof(int) * section_array_length);
    sorted_section_array = (int *) my_mpi_malloc(process_id, sizeof(int) * section_array_length);
    section_index = 0;

    // then exchange the segments themselves
    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++) {
                if (j)
                    section_index = get_array_element_total(resp_array, 0, j - 1);
                if (i == j)
                    array_int_copy(section_array, init_array, section_index,
                                   section_index + resp_array[j]);
                if (i != j) {
                    if (j)
                        section_index = get_array_element_total(resp_array, 0, j - 1);
                    MPI_Send(&(init_array[section_index]), resp_array[j], MPI_INT,
                             j, section_data, MPI_COMM_WORLD);
                }
            }
        }
        else {
            if (i)
                section_index = get_array_element_total(section_resp_array, 0, i - 1);
            MPI_Recv(&(section_array[section_index]), section_resp_array[i], MPI_INT,
                     i, section_data, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // multiway merge of the received segments
    mul_merger(section_array, sorted_section_array, section_resp_array, process_size);

    array_int_print(section_array_length, sorted_section_array);

    // release the memory
    free(resp_array);
    free(init_array);
    free(local_sample);
    free(primary_sample);
    free(section_array);
    free(sorted_section_array);
    free(section_resp_array);

    if (!process_id) {
        free(sample);
        free(sorted_sample);
    }

    MPI_Finalize();
}
 
 
 
 
 
V. MPI Function Analysis
 
 
 
The code above uses MPI's nonblocking communication function MPI_Irecv, whose sending counterpart is MPI_Isend. Nonblocking calls return immediately, so communication and computation can proceed at the same time. No discussion of nonblocking communication can omit MPI_Wait, which blocks the calling process until the given communication request has completed. These three functions are normally used together.
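As a minimal, self-contained illustration of how these three functions cooperate (this program is not part of the article's code), rank 0 sends one integer to rank 1; both ranks are free to compute until MPI_Wait forces the requests to complete:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    /* ... computation could overlap with the communication here ... */

    if (rank < 2) {
        MPI_Wait(&req, &status);   /* block until the request completes */
        if (rank == 1)
            printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}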
 
 
 
Next, we will introduce the KMP string matching algorithm and its parallelization.
