CUDA (vi). Understanding parallel thinking through parallel sorting: GPU implementations of bubble sort, merge sort, and bitonic sort


In the fifth lecture we studied three important basic parallel algorithms on the GPU: reduce, scan, and histogram, and analyzed their uses and their serial and parallel implementations. In this sixth lecture, we take bubble sort, merge sort, and bitonic sort (a sorting-network method) as examples, explain how to convert the serial sorting algorithms from the data-structures class into parallel sorts, and attach GPU implementation code.

When designing a parallel method, there are several hardware characteristics we need to keep in mind; I hope that after reading this article you will have at least a basic feel for designing parallel algorithms on the GPU. The features that need attention are:
1. Make the most of the hardware (try not to leave any SM idle or perpetually waiting)
2. Limit branch divergence (see CUDA Series Learning (ii))
3. Try to keep memory accesses coalesced (that is, avoid scattered accesses)

(The sorting algorithms we learned in the data-structures class usually pay no attention to these points.)




CUDA Series Learning catalog:

CUDA Series Learning (i): An Introduction to GPU and CUDA

CUDA Series Learning (ii): CUDA memory & variables - different memory and variable types

CUDA Series Learning (iii): GPU design and structure QA & coding exercises

CUDA Series Learning (iv): Parallel task types and memory allocation

CUDA Series Learning (v): GPU basic algorithms: reduce, scan, histogram




I. Bubble Sort

Bubble sort should be familiar to everyone. The classic bubble sort orders an array with n rounds of bubbling (n is the length of the array to be sorted); its time complexity is O(n^2) and its space complexity is O(1).

So how do we turn bubble sort into a parallel algorithm? Some dependencies need to be broken: can the serial dependency between the n rounds of bubbling be removed, and can the serial dependencies within each round be removed, so that the same O(n^2) compare-and-swap operations can run in parallel and the step complexity drops?
In 1996, J. Kornerup addressed these questions with the odd-even sort algorithm and proved its correctness in the paper.


I.1 from bubble sort to odd-even sort

Let's take a look at how odd-even sort works:



Figure 1.1

The basic method of odd-even sort:
In odd steps, each odd-indexed element array[i] is compared (and swapped if out of order) with its right neighbor array[i+1];
In even steps, each odd-indexed element array[i] is compared with its left neighbor array[i-1];

In this way, all the adjacent comparisons within one step touch disjoint pairs and can run in parallel.

PS: The same is true for arrays with an even number of elements:



Figure 1.2



I.2 Odd-even Sort Complexity

In odd-even sort the total number of comparisons is still the original O(n^2), but thanks to parallelism the time complexity drops to O(n), i.e.

Step complexity = O(n)
Work complexity = O(n^2)

For code, see <Bubble sort and its variants> in the references.
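As a CPU reference for the structure above (a sketch added here for illustration, not the referenced code; the function name is my own), note that every compare-and-swap inside one phase touches a disjoint pair of elements, which is exactly what lets a GPU run one thread per pair:

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <utility>

// CPU sketch of odd-even transposition sort. Within each phase the
// compare-and-swap pairs are disjoint, so on a GPU every iteration of the
// inner loop could run in its own thread.
void odd_even_sort(std::vector<int>& a) {
    const int n = static_cast<int>(a.size());
    for (int phase = 0; phase < n; ++phase) {
        // even phases compare pairs (0,1),(2,3),...;
        // odd phases compare pairs (1,2),(3,4),...
        for (int i = phase % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
        }
    }
}
```

The n phases themselves stay serial; only the work inside a phase parallelizes, which is where the O(n) step complexity comes from.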





II. Merge Sort

Having seen odd-even sort, let's see how to parallelize merge sort. We all learned the classic merge sort in the data-structures class: based on the divide-and-conquer idea, each step splits the array into two halves, sorts each, and then merges the two ordered sequences. From the recurrence T(n) = 2T(n/2) + n we get complexity O(n log n). As in I.1, let's look at which steps of merge sort can be parallelized.
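For reference, the serial baseline can be sketched as follows (a minimal CPU version with names of my choosing, just to pin down the recurrence before we parallelize it):

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <iterator>

// Classic serial divide-and-conquer merge sort: split, recurse, merge.
// The recurrence T(n) = 2T(n/2) + n gives O(n log n).
std::vector<int> merge_sort(std::vector<int> a) {
    if (a.size() <= 1) return a;
    std::vector<int> left(a.begin(), a.begin() + a.size() / 2);
    std::vector<int> right(a.begin() + a.size() / 2, a.end());
    left = merge_sort(left);
    right = merge_sort(right);
    std::vector<int> out;
    // the final merge of two sorted halves is the O(n) term
    std::merge(left.begin(), left.end(), right.begin(), right.end(),
               std::back_inserter(out));
    return out;
}
```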

Sorting large-scale data with merge sort can be divided into three stages. After the divide step, the data is distributed as follows:



Figure 2.1

The bottom stage is a huge number of very short sequences to merge;
the middle stage is a moderate number of medium-length sequences;
the top stage is a small number of very long sequences.

We parallelize these three stages differently; the reason will become clear below.



II.1 Step 1: Huge number of small tasks

At this stage each individual merge is very cheap, but there are very many of them. So we assign one thread to each merge, and each thread runs a serial merge internally.


II.2 Step 2: Moderate number of medium tasks

At this stage there is a moderate number of tasks, and the workload of each task has grown. So we assign one SM (one block) to each merge, and run the merge with multiple threads in parallel inside the block. The main difference from Step 1 is that the merge inside each task changes from serial to parallel. So how do we do that?

As shown below, suppose we have two sorted arrays of 4 elements each and want to merge them; the merged result, at indices 0-7, is written into the row below the data.



Figure 2.2

Method:
For each number in either array, look up two positions:
1. Its position in its own sequence: how many elements precede it
2. Its position relative to the other sequence: how many elements in the other sequence are smaller than it (found with a binary search)

The sum of the two is its index in the merged output. The first lookup is O(1) and the second is O(log n) per element, and all elements can do their lookups in parallel; the whole merge result is stored in shared memory.
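The two lookups can be sketched on the CPU as follows (an illustrative sketch: the function name and the lower_bound/upper_bound tie-breaking choice are my own, not the article's kernel):

```cpp
#include <cassert>
#include <vector>
#include <algorithm>

// Rank-based merge from Step 2: every element's final position is
// (index in its own sorted array) + (count of smaller elements in the other
// sorted array, found by binary search). Each output slot is computed
// independently, so each loop iteration could be one GPU thread.
std::vector<int> merge_by_rank(const std::vector<int>& x,
                               const std::vector<int>& y) {
    std::vector<int> out(x.size() + y.size());
    for (size_t i = 0; i < x.size(); ++i) {
        // lower_bound: how many y-elements are strictly smaller than x[i]
        size_t rank = std::lower_bound(y.begin(), y.end(), x[i]) - y.begin();
        out[i + rank] = x[i];
    }
    for (size_t j = 0; j < y.size(); ++j) {
        // upper_bound on the other side breaks ties, so equal keys
        // land in distinct output slots
        size_t rank = std::upper_bound(x.begin(), x.end(), y[j]) - x.begin();
        out[j + rank] = y[j];
    }
    return out;
}
```

Note the asymmetric binary searches: without them, two equal keys (one from each array) would claim the same output index.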



II.3 Step 3: Small number of huge tasks

In the third stage, the top-most merges, each merge task has very many elements but there are very few tasks. If we used the Step 2 method here, in the worst case only one big task would be running, keeping a single SM busy while all the others sit idle. So here we try to split one task across multiple SMs.

Method: as Figure 2.3 shows, each input sequence is cut into segments of 256 elements, giving the two sequences to merge, In1 and In2. The segment boundary elements are then sorted, e.g. E A B F C G D H. As in Step 2, we compute the position of each boundary element in the other sequence (a binary search within a short, length-256 range), and then only the independent pieces between consecutive boundaries need to be merged, with one SM assigned to each piece.

Taking the E~A piece as an example:
1. Compute the position of E in In1, posE1, and the position of A in In2, posA2
2. Merge the elements posE1~A of In1 with the elements E~posA2 of In2



Figure 2.3
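The splitter idea can be sketched serially like this (an illustration with all names my own, and k = 2 standing in for the 256-element segments in the text; each std::merge call below touches disjoint ranges, so on the GPU each would go to its own SM):

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <iterator>
#include <utility>

// Step-3 sketch: cut the first sorted input every k elements, locate each cut
// value in the second input by binary search, and merge the resulting
// independent segments. Concatenating the segment merges yields the full merge.
std::vector<int> segmented_merge(const std::vector<int>& x,
                                 const std::vector<int>& y, size_t k) {
    // matching cut positions (cx, cy): cx at multiples of k in x,
    // cy = rank of x[cx] in y
    std::vector<std::pair<size_t, size_t>> cuts{{0, 0}};
    for (size_t cx = k; cx < x.size(); cx += k) {
        size_t cy = std::lower_bound(y.begin(), y.end(), x[cx]) - y.begin();
        cuts.push_back({cx, cy});
    }
    cuts.push_back({x.size(), y.size()});

    std::vector<int> out;
    for (size_t s = 0; s + 1 < cuts.size(); ++s) {
        // segments cover disjoint ranges of x and y: independent merges
        std::merge(x.begin() + cuts[s].first,  x.begin() + cuts[s + 1].first,
                   y.begin() + cuts[s].second, y.begin() + cuts[s + 1].second,
                   std::back_inserter(out));
    }
    return out;
}
```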



II.4 Merge sort on the GPU

Take the Step 1 merges above as an example; the kernel function in the GPU code is as follows.

Here temp is the output buffer: each kernel call merges two already-sorted blocks of size sortedsize, writing 2 * sortedsize elements into temp. In other words, sortedsize is the size of one sorted block.

__global__ void mergeBlocks(int *a, int *temp, int sortedsize)
{
    int id = blockIdx.x;
    int index1 = id * 2 * sortedsize;
    int endIndex1 = index1 + sortedsize;
    int index2 = endIndex1;
    int endIndex2 = index2 + sortedsize;
    int targetIndex = id * 2 * sortedsize;
    int done = 0;
    while (!done)
    {
        if ((index1 == endIndex1) && (index2 < endIndex2))
            temp[targetIndex++] = a[index2++];
        else if ((index2 == endIndex2) && (index1 < endIndex1))
            temp[targetIndex++] = a[index1++];
        else if (a[index1] < a[index2])
            temp[targetIndex++] = a[index1++];
        else
            temp[targetIndex++] = a[index2++];
        if ((index1 == endIndex1) && (index2 == endIndex2))
            done = 1;
    }
}

In the main function, set the number of blocks and invoke the kernel in a loop, halving the block count and doubling the sorted block size each pass:

        int blocks = BLOCKS / 2;
        int sortedsize = THREADS;
        while (blocks > 0)
        {
            mergeBlocks<<<blocks, 1>>>(dev_a, dev_temp, sortedsize);
            cudaMemcpy(dev_a, dev_temp, N * sizeof(int), cudaMemcpyDeviceToDevice);
            blocks /= 2;
            sortedsize *= 2;
        }
        cudaMemcpy(a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);





III. Bitonic Sort



III.1 Bitonic sequences

Unlike the two methods above, bitonic sort is a sorting-network method. I remember that during a grad-school interview at Zhejiang University, one professor's lab asked candidates to implement bitonic sort and keep optimizing it; at the time I was clueless and had never even heard of the algorithm, so I just wrote a multi-threaded version and never revisited it. Now let's see what bitonic sort actually is.

Bitonic sort is one of the fastest sorting networks. A sorting network is a data-independent sort: the sequence of comparisons does not depend on the data, which makes it particularly suitable for hardware parallelization.

Before we get to the bitonic sorting algorithm, let's first look at what a bitonic sequence is. A bitonic sequence is one that first monotonically increases and then monotonically decreases (or first decreases and then increases).



III.2 The bitonic sorting algorithm

Suppose we are given a bitonic sequence; how do we sort it in ascending order? Intuitively, we fold the sequence into two halves and, as in Figure 3.1, stack the two halves on top of each other and compare them pairwise, swapping so that the smaller element of each pair goes to the left half. This yields a bitonic sequence in each half (think about why both results are bitonic), and every element of the left bitonic sequence is smaller than every element of the right one. Iterating this process keeps splitting each bitonic sequence into two bitonic sub-sequences, until the sub-sequences have length 2 and are therefore monotone, at which point the original long bitonic sequence has become sorted. The whole procedure is shown in Figure 3.2.


Figure 3.1



Figure 3.2
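The fold-and-compare step of Figures 3.1 and 3.2 can be sketched serially as follows (an illustrative CPU version with my own naming; on the GPU all compare-exchanges at one level run as parallel threads, and the length is assumed to be a power of two):

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <utility>

// Sort an ascending-capable bitonic range a[lo, lo+len): fold it in half,
// compare-exchange element i with element i + len/2, then recurse on both
// halves. After the pass, every element of the left half is <= every element
// of the right half, and both halves are again bitonic.
void bitonic_merge(std::vector<int>& a, size_t lo, size_t len) {
    if (len < 2) return;
    size_t half = len / 2;
    for (size_t i = lo; i < lo + half; ++i) {
        // independent compare-exchanges: one GPU thread each
        if (a[i] > a[i + half]) std::swap(a[i], a[i + half]);
    }
    bitonic_merge(a, lo, half);
    bitonic_merge(a, lo + half, half);
}
```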



III.3 Generating a bitonic sequence from an arbitrary sequence

III.2 showed how to sort a bitonic sequence; the question now is how to generate a bitonic sequence from an arbitrary one. Reference 3 at the end of this article explains this in great detail. The process is called bitonic merge and is really the divide-and-conquer idea again. Reversing the idea in III.2, we can view two adjacent monotone runs of opposite direction as one bitonic sequence: each round merges pairs of adjacent, oppositely-ordered monotone runs into new bitonic sequences and then sorts them (using III.2) in alternating directions. As long as every two consecutive length-n runs have opposite monotonicity, their concatenation is a bitonic sequence of length 2n. n starts at 1 and doubles each round until it equals the array length, at which point only one final sort in a single direction is needed.



Figure 3.3

Take a 16-element array as an example:
1. Merge adjacent pairs of elements to form 8 monotone runs of alternating direction;
2. Merge those runs pairwise to form 4 bitonic sequences, and sort them in alternating directions;
3. Merge adjacent pairs of the 4 length-4, oppositely-ordered runs to form 2 bitonic sequences of length 8, and sort each;
4. Merge the 2 length-8, oppositely-ordered runs into 1 bitonic sequence of length 16, and sort it.


Finally done! So how do we implement it? What do we need to keep track of during this process? As shown, for a len=16 array the process has 4 rounds, and each round ends with monotone runs of length i. Within each round, j denotes the comparison distance; the values of i and j at each moment are shown in Figure 3.4.



Figure 3.4
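The i/j control structure of Figure 3.4 can be written out serially as a preview of the kernel later in the text (a CPU sketch with my own naming, assuming a power-of-two length):

```cpp
#include <cassert>
#include <vector>
#include <algorithm>
#include <utility>

// CPU version of the bitonic-sort control loops: the outer loop i is the
// monotone-run length being built, the inner loop j is the compare distance.
// idx ^ j picks each element's partner; (idx & i) == 0 selects the sort
// direction so that adjacent runs alternate.
void bitonic_sort_cpu(std::vector<int>& a) {
    const size_t n = a.size();
    for (size_t i = 2; i <= n; i <<= 1) {
        for (size_t j = i >> 1; j > 0; j >>= 1) {
            // every idx below is independent: one GPU thread each
            for (size_t idx = 0; idx < n; ++idx) {
                size_t partner = idx ^ j;
                if (partner > idx) { // handle each pair only once
                    bool ascending = (idx & i) == 0;
                    if (ascending ? a[idx] > a[partner]
                                  : a[idx] < a[partner])
                        std::swap(a[idx], a[partner]);
                }
            }
        }
    }
}
```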



III.4 Parallel implementation of bitonic sort

In the spirit of "Talk is cheap, show me the code", here is a rough GPU implementation of bitonic sort:

/*
 * Author: Rachel
 * <[email protected]>
 *
 * File: bitonic_sort.cu
 * Create date: 2015-08-05 17:10:44
 */
#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include "gputimer.h"
#include <time.h>

#define NTHREADS 8
#define NBLOCKS 4
#define NUM (NTHREADS * NBLOCKS)

using namespace gadgetron;

__device__ void swap(int &a, int &b)
{
    int t = a;
    a = b;
    b = t;
}

__global__ void bitonic_sort(int *arr)
{
    extern __shared__ int shared_arr[];
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    shared_arr[tid] = arr[tid];
    __syncthreads();

    for (unsigned int i = 2; i <= NUM; i <<= 1) {
        for (unsigned int j = i >> 1; j > 0; j >>= 1) {
            unsigned int tid_comp = tid ^ j;
            if (tid_comp > tid) {
                if ((tid & i) == 0) {
                    // ascending
                    if (shared_arr[tid] > shared_arr[tid_comp])
                        swap(shared_arr[tid], shared_arr[tid_comp]);
                } else {
                    // descending
                    if (shared_arr[tid] < shared_arr[tid_comp])
                        swap(shared_arr[tid], shared_arr[tid_comp]);
                }
            }
            __syncthreads();
        }
    }
    arr[tid] = shared_arr[tid];
}

int main(int argc, char *argv[])
{
    GpuTimer timer;
    int *arr = (int *)malloc(NUM * sizeof(int));

    // init array with random values
    time_t t;
    srand((unsigned)time(&t));
    for (int i = 0; i < NUM; i++) {
        arr[i] = rand() % 100; /* modulus garbled in the source; 100 assumed */
    }

    // init device variable
    int *ptr;
    cudaMalloc((void **)&ptr, NUM * sizeof(int));
    cudaMemcpy(ptr, arr, NUM * sizeof(int), cudaMemcpyHostToDevice);

    for (int i = 0; i < NUM; i++) {
        printf("%d\t", arr[i]);
    }
    printf("\n");

    dim3 blocks(NBLOCKS, 1);
    dim3 threads(NTHREADS, 1);
    timer.Start();
    // __syncthreads() only synchronizes within one block, so the whole array
    // must live in a single block for the compare-exchanges to be safe
    bitonic_sort<<<1, NUM, NUM * sizeof(int)>>>(ptr);
    // bitonic_sort<<<blocks, threads, NUM * sizeof(int)>>>(ptr);
    timer.Stop();

    cudaMemcpy(arr, ptr, NUM * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < NUM; i++) {
        printf("%d\t", arr[i]);
    }
    printf("\n");
    cudaFree(ptr);
    return 0;
}



In the code:
tid ^ j determines each thread's comparison partner, and the tid_comp > tid check ensures every pair of elements is compared only once;
(tid & i) == 0 decides whether this part should be sorted ascending or descending; since the direction is consistent within each monotone run of length i, i is used to pick the direction.
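A quick way to convince yourself of the tid ^ j pairing (a toy CPU check, not GPU code; the function is my own):

```cpp
#include <cassert>
#include <set>
#include <utility>

// For a fixed stride j, idx ^ j maps each index to a unique partner, and
// keeping only partner > idx visits every pair exactly once: n/2 disjoint
// pairs, no element compared twice.
std::set<std::pair<unsigned, unsigned>> xor_pairs(unsigned n, unsigned j) {
    std::set<std::pair<unsigned, unsigned>> pairs;
    for (unsigned idx = 0; idx < n; ++idx) {
        unsigned partner = idx ^ j;
        if (partner > idx) pairs.insert({idx, partner});
    }
    return pairs;
}
```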


References:
1. Bubble sort and its variants
2. NVIDIA's merge sort implementation
3. The easy-to-understand bitonic sort document I used



Comments and exchanges are welcome.

Copyright notice: this is the author's original article; please do not reproduce it without the author's permission.

