MPI Parallel Programming, Series 3: Parallel Sorting by Regular Sampling (PSRS)


The parallel quicksort algorithm is fairly efficient; under ideal conditions its time complexity can reach O(n). However, parallel quicksort has a serious problem: it can cause severe load imbalance, and in the worst case the complexity of the algorithm degrades to O(n^2). In this article we introduce a parallel sorting algorithm that achieves load balance through even partitioning: parallel sorting by regular sampling (PSRS).

I. Basic Idea of the Algorithm

Assume there are n elements to be sorted and P processors.

First, the n elements are divided evenly into P parts of n/P elements each; every processor takes one part and sorts it locally. To determine where each local sorted sequence falls within the whole, every processor selects a few representative elements from its local sorted sequence; these representatives are gathered and sorted, and P-1 of them are chosen as pivots. Each processor then splits its local sorted sequence into P segments according to the P-1 pivots. The segments are redistributed by a global exchange, so that processor i ends up holding the i-th segment of every processor. Each processor merges the P sorted segments it receives. Finally, concatenating the processors' results in processor order yields the globally sorted sequence.
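As a small illustration: take n = 18 and P = 3, so each processor sorts 6 elements and w = n/P^2 = 2. Each processor picks the elements at positions w = 2 and 2w = 4 of its sorted run; processor P0 gathers these 6 samples, sorts them, and takes the 2nd and 4th as the two pivots; every processor then cuts its run into 3 segments at the pivots, ships segment i to processor i, and finally merges the 3 runs it receives.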

II. Algorithm Description

Following the basic idea above, the algorithm can be described as follows:

Input: an unsorted sequence of n elements.

Output: a globally ordered data sequence, distributed across the processors.

1) Division and local sorting of the unordered sequence

Using the block data division method (see Series 1), the unordered sequence is divided into P parts, and each processor sorts its part with sequential quicksort. Each processor then holds a locally sorted sequence.

2) Select representative elements

Each processor selects P-1 elements from its locally sorted sequence, at positions w, 2w, ..., (P-1)w, where w = n/P^2.
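The helper that performs this sampling (array_sample in the implementation below) is not listed in the article; a minimal sketch of such a regular sampler, assuming it picks count elements at stride step from a sorted run, could look like this:

#include <stdlib.h>

/* Hypothetical regular-sampling helper (the article's array_sample is
 * not shown): return a new array holding the elements at 1-based
 * positions step, 2*step, ..., count*step of a sorted run. */
int *sample_sorted_run(const int *sorted, int count, int step)
{
    int *samples = (int *) malloc(sizeof(int) * count);
    int i;
    for (i = 0; i < count; i++)
        samples[i] = sorted[(i + 1) * step - 1];
    return samples;
}

With count = P-1 and step = w this yields exactly the positions w, 2w, ..., (P-1)w described above.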

3) Determine the pivots

Each processor sends its selected representative elements to processor P0. P0 merges the P sorted sample runs with a multiway merge, then selects the elements at positions P-1, 2(P-1), ..., (P-1)(P-1), a total of P-1 elements, as the pivots.

4) Distribute the pivots

P0 broadcasts the P-1 pivots to all processors.

5) Division of the local sorted sequences

After receiving the pivots, each processor splits its locally sorted sequence into P segments according to them.
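The article's splitting helper (get_array_sepator_resp in the code below) is likewise not listed; a minimal sketch, under the assumption that it only needs to report the length of each of the P segments, could be:

/* Hypothetical sketch of the pivot partitioning step: given a sorted
 * run of n elements and p-1 pivots, compute the p segment lengths. */
void split_by_pivots(const int *sorted, int n,
                     const int *pivots, int p, int *seg_len)
{
    int i, pos = 0;
    for (i = 0; i < p - 1; i++) {
        int start = pos;
        /* linear scan; a binary search would also work on a sorted run */
        while (pos < n && sorted[pos] <= pivots[i])
            pos++;
        seg_len[i] = pos - start;
    }
    seg_len[p - 1] = n - pos;   /* everything above the last pivot */
}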

6) Distribution of the P sorted segments

Each processor sends its i-th segment to processor i, so that processor i ends up holding the i-th segment of every processor.
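In section IV this exchange is written out with pairwise MPI_Send/MPI_Recv calls. The same step can also be expressed with the single collective MPI_Alltoallv; the following sketch assumes seg_len[] holds the local segment lengths from step 5 and recv_len[] the lengths arriving from the other processes (obtainable beforehand with an MPI_Alltoall of the lengths):

#include <mpi.h>

/* Alternative segment exchange using one collective call.
 * seg_len[i]  - length of the segment this process sends to process i
 * recv_len[i] - length of the segment this process receives from process i */
void exchange_segments(int *local_sorted, int *seg_len,
                       int *recv_buf, int *recv_len, int p)
{
    int sdispls[p], rdispls[p], i;   /* C99 variable-length arrays */
    sdispls[0] = rdispls[0] = 0;
    for (i = 1; i < p; i++) {
        sdispls[i] = sdispls[i - 1] + seg_len[i - 1];
        rdispls[i] = rdispls[i - 1] + recv_len[i - 1];
    }
    MPI_Alltoallv(local_sorted, seg_len, sdispls, MPI_INT,
                  recv_buf, recv_len, rdispls, MPI_INT, MPI_COMM_WORLD);
}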

7) Multiway merge

Each processor merges the P sorted segments obtained in the previous step.
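The article's merge helper (mul_merger in the code below) is not listed either; a straightforward, if not optimal, P-way merge that repeatedly takes the smallest head element might look like this (a heap over the run heads would cut the cost from O(total * P) to O(total * log P)):

/* Hypothetical P-way merge sketch: runs[] holds p sorted runs
 * back to back, run_len[] their lengths; out[] receives the merge. */
void p_way_merge(const int *runs, const int *run_len, int p, int *out)
{
    int offs[p], head[p], i, k, total;   /* C99 variable-length arrays */
    for (i = 0, k = 0; i < p; i++) {
        offs[i] = k;       /* start of run i inside runs[] */
        head[i] = 0;       /* next unread element of run i */
        k += run_len[i];
    }
    total = k;
    for (k = 0; k < total; k++) {
        int best = -1;
        for (i = 0; i < p; i++)
            if (head[i] < run_len[i] &&
                (best < 0 ||
                 runs[offs[i] + head[i]] < runs[offs[best] + head[best]]))
                best = i;
        out[k] = runs[offs[best] + head[best]];
        head[best]++;
    }
}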

After these seven steps, reading the data from the processors in order yields a globally sorted sequence.

III. Algorithm Analysis

1) Load balancing analysis

This algorithm is designed for load balance, as the even division in step 1 shows, but the balance is not perfect: the division by pivots in step 5 can still produce segments of unequal size, so the amount of data each processor ends up merging may differ.

2) Time complexity analysis

PSRS is meant for large data volumes (which is the reason for parallelizing in the first place). When n > P^3, the time complexity reaches O((n/P) log n): the local quicksort costs O((n/P) log(n/P)), sorting the P(P-1) samples costs O(P^2 log P), and the final P-way merge costs O((n/P) log P), so for n > P^3 the local sort dominates. A step-by-step analysis is omitted here, since every step uses an ordinary serial sorting algorithm.

IV. Algorithm Implementation

Because the algorithm is involved and the code is long, only the main code is listed here:

 
void psrs_mpi(int *argc, char ***argv) {

    int process_id;
    int process_size;

    int *init_array;              // initial array
    int init_array_length;        // initial array length

    int *local_sample;            // representative elements selected by this process
    int local_sample_length;      // length of the representative-element array

    int *sample;                  // collected representative elements (used by process 0)
    int *sorted_sample;           // sorted representative elements
    int sample_length;            // length of the collected sample

    int *primary_sample;          // pivots

    int *resp_array;              // offset array: length of each segment of this process's array

    int *section_resp_array;      // offset array: lengths of the segments this process receives from the others

    int *section_array;           // segments collected from every process
    int *sorted_section_array;
    int section_array_length;     // total length

    int section_index = 0;        // also used as a message tag, so initialize it

    int i, j;                     // loop variables

    MPI_Request handle;
    MPI_Status status;

    mpi_start(argc, argv, &process_size, &process_id, MPI_COMM_WORLD);
    resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);

    // build an array for each process
    // and sort it with sequential quicksort
    init_array_length = array_length;    // array_length: defined elsewhere
    init_array = (int *) my_mpi_malloc(process_id, sizeof(int) * init_array_length);
    array_builder_seed(init_array, init_array_length, process_id);

    quick_sort(init_array, 0, init_array_length - 1);

    // each processor selects process_size - 1 representative elements
    // from its sorted sequence and sends them to process 0
    local_sample_length = process_size - 1;
    local_sample = array_sample(init_array, local_sample_length,
                                init_array_length / process_size, process_id);

    if (process_id)
        MPI_Send(local_sample, local_sample_length, MPI_INT, 0,
                 sample_data, MPI_COMM_WORLD);

    // process 0 receives the representative elements of every processor
    // and sorts them with a multiway merge
    if (!process_id) {
        sample = (int *) my_mpi_malloc(0, sizeof(int) * process_size * local_sample_length);
        sorted_sample = (int *) my_mpi_malloc(0, sizeof(int) * process_size * local_sample_length);
        array_copy(sample, local_sample, local_sample_length);

        // each request must be completed before the handle is reused
        for (i = 1; i < process_size; i++) {
            MPI_Irecv(sample + local_sample_length * i, local_sample_length,
                      MPI_INT, i, sample_data, MPI_COMM_WORLD, &handle);
            MPI_Wait(&handle, &status);
        }

        for (i = 0; i < process_size; i++)
            resp_array[i] = local_sample_length;

        mul_merger(sample, sorted_sample, resp_array, process_size);

        // select process_size - 1 pivots from the sorted representative elements
        primary_sample = array_sample(sorted_sample, process_size - 1,
                                      process_size - 1, process_id);
    }
    if (process_id)
        primary_sample = (int *) my_mpi_malloc(process_id,
                                               sizeof(int) * (process_size - 1));

    MPI_Bcast(primary_sample, process_size - 1, MPI_INT, 0, MPI_COMM_WORLD);

    // divide the data on each processor into process_size segments by the pivots
    get_array_sepator_resp(init_array, primary_sample, resp_array,
                           init_array_length, process_size);
    if (process_id == ID) {       // ID: debug rank, defined elsewhere
        printf("process %d resp array is: ", process_id);
        array_int_print(process_size, resp_array);
    }

    // each processor sends its i-th segment to processor i;
    // first, exchange the segment lengths
    section_resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);
    section_resp_array[process_id] = resp_array[process_id];

    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++)
                if (i != j)
                    MPI_Send(&(resp_array[j]), 1, MPI_INT, j, section_index,
                             MPI_COMM_WORLD);
        }
        else
            MPI_Recv(&(section_resp_array[i]), 1, MPI_INT, i, section_index,
                     MPI_COMM_WORLD, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    section_array_length = get_array_element_total(section_resp_array, 0, process_size - 1);
    section_array = (int *) my_mpi_malloc(process_id, sizeof(int) * section_array_length);
    sorted_section_array = (int *) my_mpi_malloc(process_id, sizeof(int) * section_array_length);
    section_index = 0;

    // then exchange the segments themselves
    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++) {
                if (j)
                    section_index = get_array_element_total(resp_array, 0, j - 1);
                if (i == j)
                    array_int_copy(section_array, init_array, section_index,
                                   section_index + resp_array[j]);
                if (i != j) {
                    if (j)
                        section_index = get_array_element_total(resp_array, 0, j - 1);
                    MPI_Send(&(init_array[section_index]), resp_array[j], MPI_INT,
                             j, section_data, MPI_COMM_WORLD);
                }
            }
        }
        else {
            if (i)
                section_index = get_array_element_total(section_resp_array, 0, i - 1);
            MPI_Recv(&(section_array[section_index]), section_resp_array[i], MPI_INT,
                     i, section_data, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    // multiway merge of the received segments
    mul_merger(section_array, sorted_section_array, section_resp_array, process_size);

    array_int_print(section_array_length, sorted_section_array);

    // release the memory
    free(resp_array);
    free(init_array);
    free(local_sample);
    free(primary_sample);
    free(section_array);
    free(sorted_section_array);
    free(section_resp_array);

    if (!process_id) {
        free(sample);
        free(sorted_sample);
    }

    MPI_Finalize();
}
 
 
 
 
 
V. MPI Function Analysis
 
 
 
The code above uses MPI's nonblocking communication function MPI_Irecv, whose sending counterpart is MPI_Isend. Nonblocking calls return immediately, so communication and computation can proceed at the same time. No discussion of nonblocking communication can omit MPI_Wait, which blocks the calling process until the given communication request has completed. These three functions are normally used together.
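As a minimal, self-contained illustration of how these three functions cooperate (this program is not part of the article's code), rank 0 sends one integer to rank 1; both ranks are free to compute until MPI_Wait forces the requests to complete:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    else if (rank == 1)
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);

    /* ... computation could overlap with the communication here ... */

    if (rank < 2) {
        MPI_Wait(&req, &status);   /* block until the request completes */
        if (rank == 1)
            printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}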
 
 
 
Next, we will introduce the KMP string matching algorithm and its parallelization.
