Intro to Parallel Programming
How do you dig a hole faster?
GPU Concept
Many, many simple computational units;
Explicitly parallel computing model;
Focus on throughput rather than latency.
CPU: host
GPU: device
A Typical GPU Program
1, CPU allocates storage on the GPU (cudaMalloc)
2, CPU copies input data from CPU to GPU (cudaMemcpy)
3, CPU launches kernel(s) on the GPU to process the data (kernel launch)
4, CPU copies results back to the CPU from the GPU (cudaMemcpy)
It is best to leave copying the results back from the GPU as the last step.
Defining the GPU computation
BIG idea
Kernels look like SERIAL programs.
Write your program as if it will run on ONE thread.
The GPU will then run it on MANY threads.
Make sure you understand this.
This is important.
What is the GPU good at?
1, efficiently launching LOTS of threads
2, running LOTS of threads in parallel
Simple EXAMPLE:
In:  float array [0 1 2 ... 63]
Out: float array [0 1*1 2*2 ... 63*63]
   = [0 1 4 9 ...]
Kernel:square
CPU code: square each element of an array

for (i = 0; i < 64; i++) {
    out[i] = in[i] * in[i];
}

1, only ONE thread of execution
("thread" = one independent path of execution through the code)
2, NO explicit parallelism
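The same computation written as a CUDA kernel (a minimal sketch): the loop disappears, and the body is written as if it runs on one thread, with the GPU running one copy per element.

```cuda
// GPU code: each thread squares the single element at its own index.
// The serial loop over i is replaced by many threads running this body,
// e.g. launched as square<<<1, 64>>>(d_out, d_in).
__global__ void square(float *d_out, float *d_in) {
    int i = threadIdx.x;   // this thread's index within its block
    float f = d_in[i];
    d_out[i] = f * f;
}
```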
GPU Code: a high-level view
CPU:
  allocates memory
  copies data to/from the GPU
  launches kernel: specifies the degree of parallelism
GPU:
  expresses out = in * in
  says NOTHING about the degree of parallelism
CPU code: square_kernel<<<64>>>(out_array, in_array)
But how does it work if I launch 64 instances of the same program?
CPU launches threads:
#include <stdio.h>

__global__ void cube(float *d_out, float *d_in) {
    // TODO: fill in this function
}

int main(int argc, char **argv) {
    const int array_size = 96;
    const int array_bytes = array_size * sizeof(float);

    // generate the input array on the host
    float h_in[array_size];
    for (int i = 0; i < array_size; i++) {
        h_in[i] = float(i);
    }
    float h_out[array_size];

    // declare GPU memory pointers
    float *d_in;
    float *d_out;

    // allocate GPU memory
    cudaMalloc((void **) &d_in, array_bytes);
    cudaMalloc((void **) &d_out, array_bytes);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, array_bytes, cudaMemcpyHostToDevice);

    // launch the kernel
    cube<<<1, array_size>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, array_bytes, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (int i = 0; i < array_size; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}
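One way to fill in the TODO above, assuming the kernel is launched with a single block of array_size threads as in main: each thread cubes the element at its own threadIdx.x.

```cuda
// A possible body for the cube kernel: one thread per element,
// valid for a single-block launch like cube<<<1, array_size>>>.
__global__ void cube(float *d_out, float *d_in) {
    int idx = threadIdx.x;   // this thread's element index
    float f = d_in[idx];
    d_out[idx] = f * f * f;  // cube instead of square
}
```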
Configuring the Kernel Launch
square<<<1, 64>>>(d_out, d_in)
<<<number of blocks, threads per block>>>
1, you can run more than one block at a time
2, maximum number of threads per block: 512 (older GPUs),
   1024 (newer GPUs)
Need 128 threads?  square<<<1, 128>>>(...)
Need 1280 threads? square<<<10, 128>>>(...)
                or square<<<5, 256>>>(...)
kernel<<<grid of blocks, block of threads>>>(...)
kernel<<<1-, 2-, or 3-D, 1-, 2-, or 3-D>>>(...)
dim3(x, y, z)
dim3(w, 1, 1) == dim3(w) == w
square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
kernel<<<grid of blocks, block of threads, shared memory per block in bytes>>>(...)
square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(...)
i.e. a grid of bx*by*bz blocks, each a block of tx*ty*tz threads,
plus shared memory per block in bytes
threadIdx: thread within block
  (threadIdx.x, threadIdx.y)
blockDim: size of a block
blockIdx: block within grid
gridDim: size of grid
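Once there is more than one block, these built-in variables combine to give each thread a unique global index. The following kernel is a sketch of that common idiom (the bounds check guards a partially filled last block; the kernel name and the n parameter are illustrative):

```cuda
__global__ void square_multiblock(float *d_out, float *d_in, int n) {
    // global index = this block's offset into the grid
    //              + this thread's offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // last block may have unused threads
        float f = d_in[idx];
        d_out[idx] = f * f;
    }
}
```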
MAP
Set of elements to process [floats]
Function to run on each element ["square"]
map(elements, function)
GPUs are good at map:
-- GPUs have many parallel processors
-- GPUs optimize for throughput
struct uchar4 {
    unsigned char x;
    unsigned char y;
    unsigned char z;
    unsigned char w;
};
Converting color to black and white
I = (r + g + b) / 3
I = .299f*r + .587f*g + .114f*b
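As a map over uchar4 pixels, the weighted formula becomes a kernel along these lines (a sketch, not the course's exact code; rgba_to_grey, d_grey, d_rgba, and num_pixels are assumed names, with the image flattened to num_pixels RGBA values and one thread per pixel):

```cuda
__global__ void rgba_to_grey(unsigned char *d_grey, uchar4 *d_rgba,
                             int num_pixels) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_pixels) {
        uchar4 p = d_rgba[idx];  // x = r, y = g, z = b, w = alpha
        d_grey[idx] = (unsigned char)(.299f * p.x + .587f * p.y
                                      + .114f * p.z);
    }
}
```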
Intro to Parallel Programming Course Note 001