Intro to Parallel Programming Course Note 001

Source: Internet
Author: User

Intro to Parallel Programming

How do you dig a hole faster?

GPU Concept

Many, many simple computational units;

Explicitly parallel programming model;

Focus on throughput rather than latency;

CPU: host

GPU: device

A Typical GPU Program

1. CPU allocates storage on the GPU (cudaMalloc)

2. CPU copies input data from CPU to GPU (cudaMemcpy)

3. CPU launches kernel(s) on the GPU to process the data (kernel launch)

4. CPU copies results back from GPU to CPU (cudaMemcpy)

It's best to leave copying the results back from the GPU as the last step.

Defining the GPU computation

BIG idea

Kernels look like SERIAL programs.

Write your program as if it will run on ONE thread.

The GPU will run that program on MANY threads.

Make sure you understand this. This is important.

What is the GPU good at?

1. Efficiently launching LOTS of threads

2. Running LOTS of threads in parallel

Simple example:

In: float array [0 1 2 ... 63]

Out: float array [0 1x1 2x2 ... 63x63]

= [0 1 4 9 ...]

Kernel: square

CPU code: square each element of an array

for (int i = 0; i < 64; i++) {
    out[i] = in[i] * in[i];
}

1. Only one thread of execution

("thread" = one independent path of execution through the code)

2. No explicit parallelism

GPU code: a high-level view

CPU:

Allocates memory

Copies data to/from the GPU

Launches kernel (specifies the degree of parallelism)

GPU:

Expresses out = in * in

Says nothing about the degree of parallelism

CPU code: square_kernel<<< >>>(outarray, inarray)

But how does it work if I launch 64 instances of the same program?

CPU launches threads:

#include <stdio.h>

__global__ void cube(float *d_out, float *d_in) {
    // each thread cubes one element, indexed by its thread id
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f * f;
}

int main(int argc, char **argv) {
    const int array_size = 96;
    const int array_bytes = array_size * sizeof(float);

    // generate the input array on the host
    float h_in[array_size];
    for (int i = 0; i < array_size; i++) {
        h_in[i] = float(i);
    }
    float h_out[array_size];

    // declare GPU memory pointers
    float *d_in;
    float *d_out;

    // allocate GPU memory
    cudaMalloc((void **) &d_in, array_bytes);
    cudaMalloc((void **) &d_out, array_bytes);

    // transfer the input array to the GPU
    cudaMemcpy(d_in, h_in, array_bytes, cudaMemcpyHostToDevice);

    // launch the kernel
    cube<<<1, array_size>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, array_bytes, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (int i = 0; i < array_size; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}

Configuring the Kernel Launch

square<<<1, 64>>>(d_out, d_in)

<<<number of blocks, threads per block>>>

1. You can run more than one block at a time.

2. There is a maximum number of threads per block (512 on older GPUs, 1024 on newer ones).

128 elements: square<<<1, 128>>>(...)

1280 elements: square<<<10, 128>>>(...)

or square<<<5, 256>>>(...)

kernel<<<grid of blocks, block of threads>>>(...)

kernel<<<1-, 2-, or 3-D grid, 1-, 2-, or 3-D block>>>(...)

dim3(x, y, z)

dim3(w, 1, 1) == dim3(w) == w

square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>

kernel<<<grid of blocks, block of threads, shared memory per block in bytes>>>(...)

square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(...)

i.e. a grid of bx*by*bz blocks, each a block of tx*ty*tz threads, with shmem bytes of shared memory per block.

threadIdx: thread within block (threadIdx.x, threadIdx.y, ...)

blockDim: size of a block

blockIdx: block within grid

gridDim: size of grid

MAP

Set of elements to process [floats]

Function to run on each element ["square"]

map(elements, function)

GPUs are good at map:

-- GPUs have many parallel processors

-- GPUs optimize for throughput

struct uchar4 {
    unsigned char x;
    unsigned char y;
    unsigned char z;
    unsigned char w;
};

Converting color to black and white

I = (r + g + b) / 3

I = 0.299f*r + 0.587f*g + 0.114f*b

