The parallel quicksort algorithm is fairly efficient: under ideal conditions its time complexity can reach O(n). However, parallel quicksort has a serious problem: it can cause severe load imbalance, and in the worst case its complexity degrades to O(n^2). In this article we introduce a parallel sorting algorithm that divides the data evenly and balances load well: PSRS (Parallel Sorting by Regular Sampling).
I. Basic Idea of the Algorithm
Assume there are n elements to be sorted and p processors.
First, the n elements are divided evenly into p parts of n/p elements each; each processor takes one part and sorts it locally. To determine where its local sorted sequence falls in the global sequence, each processor selects several representative elements (samples) from its local sorted sequence; these samples are gathered and sorted, and p-1 of them are chosen as pivots. Each processor then uses the p-1 pivots to split its local sorted sequence into p segments. Through a global exchange, the segments are redistributed so that processor i receives the i-th segment of every processor. Each processor then merges the p sorted segments it received. Finally, concatenating the processors' sequences in processor order yields the globally sorted sequence.
II. Algorithm Description
Following the basic idea above, the algorithm can be described as follows:
Input: an unordered sequence of n elements
Output: a globally ordered data sequence, distributed across the processors.
1) Division and local sorting of the unordered sequence
Using the block division scheme (see the first article in this series), the unordered sequence is divided into p parts, and each processor sorts its own part with sequential quicksort. After this step, every processor holds a locally sorted sequence.
2) Select representative elements (samples)
Each processor selects p-1 samples from its local sorted sequence, at positions w, 2w, ..., (p-1)w, where w = n/p^2.
3) Determine the pivots
Each processor sends its samples to processor P0. P0 performs a multiway merge of the p sorted sample sequences, then selects the elements at positions p-1, 2(p-1), ..., (p-1)(p-1), a total of p-1 elements, as the pivots.
4) Distribute the pivots
P0 broadcasts the p-1 pivots to all processors.
5) Partition the local sorted sequences
After receiving the pivots, each processor uses them to split its local sorted sequence into p segments.
6) Redistribute the p segments
Each processor sends its i-th segment to processor i, so that processor i ends up holding the i-th segment of every processor.
7) Multiway merge
Each processor merges the p sorted segments obtained in the previous step.
After these seven steps, reading the processors' data in processor order yields a globally ordered sequence.
III. Algorithm Analysis
1) Load balancing analysis
Thanks to regular sampling, the algorithm balances load well, but not perfectly: the pivots only approximate an even split, so the partition in step 5) can produce segments of different sizes, and the amounts of data exchanged and merged in steps 6) and 7) may therefore be uneven.
2) Time complexity analysis
PSRS is intended for large volumes of data, which is exactly when parallel processing pays off. When n > p^3, the time complexity of the algorithm reaches O((n/p) log n). We omit a step-by-step analysis here, since every step uses a standard serial sorting or merging algorithm.
IV. Algorithm Implementation
Since the algorithm is fairly involved and the full code is long, only the main routine is listed here:
/* sample_data, section_index and section_data are message-tag constants,
 * and array_length and ID are configuration constants; all of them, like
 * the helper functions (mpi_start, my_mpi_malloc, quick_sort, array_sample,
 * mul_merger, ...), are defined elsewhere in the project. */
void psrs_mpi(int *argc, char ***argv) {

    int process_id;
    int process_size;

    int *init_array;            /* initial array */
    int init_array_length;      /* initial array length */

    int *local_sample;          /* samples selected by this process */
    int local_sample_length;    /* number of local samples */

    int *sample;                /* gathered samples (used by process 0) */
    int *sorted_sample;         /* sorted gathered samples */
    int sample_length;          /* number of gathered samples */

    int *primary_sample;        /* pivots */

    int *resp_array;            /* lengths of this process's p segments */

    int *section_resp_array;    /* lengths of the segments received from the
                                   other processes */

    int *section_array;         /* concatenation of the received segments */
    int *sorted_section_array;
    int section_array_length;   /* total length of the received segments */

    int offset;                 /* running offset into an array */

    int i, j;                   /* loop variables */

    MPI_Request *handles;       /* one request per pending receive */
    MPI_Status status;

    /* mpi_start wraps MPI_Init / MPI_Comm_size / MPI_Comm_rank */
    mpi_start(argc, argv, &process_size, &process_id, MPI_COMM_WORLD);
    resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);

    /* build this process's block and sort it with sequential quicksort */
    init_array_length = array_length;
    init_array = (int *) my_mpi_malloc(process_id, sizeof(int) * init_array_length);
    array_builder_seed(init_array, init_array_length, process_id);

    quick_sort(init_array, 0, init_array_length - 1);

    /* each processor picks process_size - 1 samples from its sorted block
       and sends them to process 0 */
    local_sample_length = process_size - 1;
    local_sample = array_sample(init_array, local_sample_length,
                                init_array_length / process_size, process_id);

    if (process_id)
        MPI_Send(local_sample, local_sample_length, MPI_INT, 0,
                 sample_data, MPI_COMM_WORLD);

    /* process 0 receives the samples from every processor and merges the
       p sorted sample runs */
    if (!process_id) {
        sample_length = process_size * local_sample_length;
        sample = (int *) my_mpi_malloc(0, sizeof(int) * sample_length);
        sorted_sample = (int *) my_mpi_malloc(0, sizeof(int) * sample_length);
        handles = (MPI_Request *) my_mpi_malloc(0, sizeof(MPI_Request) * (process_size - 1));
        array_copy(sample, local_sample, local_sample_length);

        for (i = 1; i < process_size; i++)
            MPI_Irecv(sample + local_sample_length * i, local_sample_length,
                      MPI_INT, i, sample_data, MPI_COMM_WORLD, &handles[i - 1]);

        /* wait for every pending receive, not just the last one */
        MPI_Waitall(process_size - 1, handles, MPI_STATUSES_IGNORE);

        for (i = 0; i < process_size; i++)
            resp_array[i] = local_sample_length;

        mul_merger(sample, sorted_sample, resp_array, process_size);

        /* pick the process_size - 1 pivots from the sorted samples */
        primary_sample = array_sample(sorted_sample, process_size - 1,
                                      process_size - 1, process_id);
    }
    if (process_id)
        primary_sample = (int *) my_mpi_malloc(process_id,
                                               sizeof(int) * (process_size - 1));

    MPI_Bcast(primary_sample, process_size - 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* split the local block into process_size segments using the pivots */
    get_array_sepator_resp(init_array, primary_sample, resp_array,
                           init_array_length, process_size);
    if (process_id == ID) {      /* ID: rank whose partition is printed */
        printf("process %d resp array is: ", process_id);
        array_int_print(process_size, resp_array);
    }

    /* each processor will send its i-th segment to processor i;
       first, exchange the segment lengths */
    section_resp_array = (int *) my_mpi_malloc(process_id, sizeof(int) * process_size);
    section_resp_array[process_id] = resp_array[process_id];

    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++)
                if (i != j)
                    MPI_Send(&(resp_array[j]), 1, MPI_INT, j, section_index,
                             MPI_COMM_WORLD);
        }
        else
            MPI_Recv(&(section_resp_array[i]), 1, MPI_INT, i, section_index,
                     MPI_COMM_WORLD, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    section_array_length = get_array_element_total(section_resp_array, 0,
                                                   process_size - 1);
    section_array = (int *) my_mpi_malloc(process_id,
                                          sizeof(int) * section_array_length);
    sorted_section_array = (int *) my_mpi_malloc(process_id,
                                                 sizeof(int) * section_array_length);

    /* then exchange the segments themselves */
    for (i = 0; i < process_size; i++) {
        if (i == process_id) {
            for (j = 0; j < process_size; j++) {
                offset = j ? get_array_element_total(resp_array, 0, j - 1) : 0;
                if (i == j)
                    array_int_copy(section_array, init_array,
                                   offset, offset + resp_array[j]);
                else
                    MPI_Send(&(init_array[offset]), resp_array[j], MPI_INT,
                             j, section_data, MPI_COMM_WORLD);
            }
        }
        else {
            offset = i ? get_array_element_total(section_resp_array, 0, i - 1) : 0;
            MPI_Recv(&(section_array[offset]), section_resp_array[i], MPI_INT,
                     i, section_data, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Barrier(MPI_COMM_WORLD);

    /* multiway merge of the received segments */
    mul_merger(section_array, sorted_section_array, section_resp_array,
               process_size);

    array_int_print(section_array_length, sorted_section_array);

    /* release memory */
    free(resp_array);
    free(init_array);
    free(local_sample);
    free(primary_sample);
    free(section_array);
    free(sorted_section_array);
    free(section_resp_array);

    if (!process_id) {
        free(sample);
        free(sorted_sample);
        free(handles);
    }

    MPI_Finalize();
}
V. MPI Function Analysis
The code above uses MPI's non-blocking receive, MPI_Irecv (its sending counterpart is MPI_Isend). Non-blocking calls return immediately, so communication can overlap with computation. Whenever non-blocking communication is used, MPI_Wait (or MPI_Waitall for a whole set of requests) must also be mentioned: it blocks the caller until the communication associated with a request handle has completed. These functions are normally used together: post the non-blocking operation, do other work, then wait on the handle before touching the buffer.
In the next article, we will introduce the KMP string-matching algorithm and its parallelization.