Intro to Parallel Programming
How do you dig a hole faster?
GPU Concept
Many, many simple computational units;
Explicitly parallel computing model;
Focus on throughput rather than latency.
CPU: host
GPU: device
A Typical GPU Program
1, CPU allocates storage on the GPU (cudaMalloc)
2, CPU copies input data from CPU to GPU (cudaMemcpy)
3, CPU launches kernel(s) on the GPU to process the data (kernel launch)
4, CPU copies results back to the CPU from the GPU (cudaMemcpy)
It is best to leave copying the results back from the GPU as the last step.
Defining the GPU computation
BIG idea
Kernels look like SERIAL programs.
Write your program as if it will run on ONE thread.
The GPU will then run it on MANY threads.
Make sure you understand this.
This is important.
What is the GPU good at?
1, efficiently launching LOTS of threads
2, running LOTS of threads in parallel
Simple EXAMPLE:
In:  float array [0 1 2 ... 63]
Out: float array [0 1*1 2*2 ... 63*63]
   = [0 1 4 9 ...]
Kernel:square
CPU code: square each element of an array

for (i = 0; i < 64; i++) {
    out[i] = in[i] * in[i];
}

1, only ONE thread of execution
("thread" = one independent path of execution through the code)
2, NO explicit parallelism
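The same computation written as a CUDA kernel (a minimal sketch): the loop disappears, and the body is written as if it runs on one thread, with the GPU running one copy per element.

```cuda
// GPU code: each thread squares the single element at its own index.
// The serial loop over i is replaced by many threads running this body,
// e.g. launched as square<<<1, 64>>>(d_out, d_in).
__global__ void square(float *d_out, float *d_in) {
    int i = threadIdx.x;   // this thread's index within its block
    float f = d_in[i];
    d_out[i] = f * f;
}
```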
GPU Code: a high-level view
CPU:
  allocates memory
  copies data to/from the GPU
  launches kernel: specifies the degree of parallelism
GPU:
  expresses out = in * in
  says NOTHING about the degree of parallelism
CPU code: square_kernel<<<64>>>(out_array, in_array)
But how does it work if I launch 64 instances of the same program?
CPU launches threads:
#include <stdio.h>

__global__ void cube(float *d_out, float *d_in) {
    // TODO: fill in this function
}

int main(int argc, char **argv) {
    const int array_size = 96;
    const int array_bytes = array_size * sizeof(float);

    // generate the input array on the host
    float h_in[array_size];
    for (int i = 0; i < array_size; i++) {
        h_in[i] = float(i);
    }
    float h_out[array_size];

    // declare GPU memory pointers
    float *d_in;
    float *d_out;

    // allocate GPU memory
    cudaMalloc((void **) &d_in, array_bytes);
    cudaMalloc((void **) &d_out, array_bytes);

    // transfer the array to the GPU
    cudaMemcpy(d_in, h_in, array_bytes, cudaMemcpyHostToDevice);

    // launch the kernel
    cube<<<1, array_size>>>(d_out, d_in);

    // copy back the result array to the CPU
    cudaMemcpy(h_out, d_out, array_bytes, cudaMemcpyDeviceToHost);

    // print out the resulting array
    for (int i = 0; i < array_size; i++) {
        printf("%f", h_out[i]);
        printf(((i % 4) != 3) ? "\t" : "\n");
    }

    // free GPU memory
    cudaFree(d_in);
    cudaFree(d_out);

    return 0;
}
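One way to fill in the TODO above, assuming the kernel is launched with a single block of array_size threads as in main: each thread cubes the element at its own threadIdx.x.

```cuda
// A possible body for the cube kernel: one thread per element,
// valid for a single-block launch like cube<<<1, array_size>>>.
__global__ void cube(float *d_out, float *d_in) {
    int idx = threadIdx.x;   // this thread's element index
    float f = d_in[idx];
    d_out[idx] = f * f * f;  // cube instead of square
}
```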
Configuring the Kernel Launch
square<<<1, 64>>>(d_out, d_in)
<<<number of blocks, threads per block>>>
1, you can run more than one block at a time
2, maximum number of threads per block: 512 (older GPUs),
   1024 (newer GPUs)
Need 128 threads?  square<<<1, 128>>>(...)
Need 1280 threads? square<<<10, 128>>>(...)
                or square<<<5, 256>>>(...)
kernel<<<grid of blocks, block of threads>>>(...)
kernel<<<1-, 2-, or 3-D, 1-, 2-, or 3-D>>>(...)
dim3(x, y, z)
dim3(w, 1, 1) == dim3(w) == w
square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
kernel<<<grid of blocks, block of threads, shared memory per block in bytes>>>(...)
square<<<dim3(bx, by, bz), dim3(tx, ty, tz), shmem>>>(...)
i.e. a grid of bx*by*bz blocks, each a block of tx*ty*tz threads,
plus shared memory per block in bytes
threadIdx: thread within block
  (threadIdx.x, threadIdx.y)
blockDim: size of a block
blockIdx: block within grid
gridDim: size of grid
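Once there is more than one block, these built-in variables combine to give each thread a unique global index. The following kernel is a sketch of that common idiom (the bounds check guards a partially filled last block; the kernel name and the n parameter are illustrative):

```cuda
__global__ void square_multiblock(float *d_out, float *d_in, int n) {
    // global index = this block's offset into the grid
    //              + this thread's offset within the block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {               // last block may have unused threads
        float f = d_in[idx];
        d_out[idx] = f * f;
    }
}
```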
MAP
Set of elements to process [floats]
Function to run on each element ["square"]
map(elements, function)
GPUs are good at map:
-- GPUs have many parallel processors
-- GPUs optimize for throughput
struct uchar4 {
    unsigned char x;
    unsigned char y;
    unsigned char z;
    unsigned char w;
};
Converting color to black and white
I = (r + g + b) / 3
I = .299f*r + .587f*g + .114f*b
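As a map over uchar4 pixels, the weighted formula becomes a kernel along these lines (a sketch, not the course's exact code; rgba_to_grey, d_grey, d_rgba, and num_pixels are assumed names, with the image flattened to num_pixels RGBA values and one thread per pixel):

```cuda
__global__ void rgba_to_grey(unsigned char *d_grey, uchar4 *d_rgba,
                             int num_pixels) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < num_pixels) {
        uchar4 p = d_rgba[idx];  // x = r, y = g, z = b, w = alpha
        d_grey[idx] = (unsigned char)(.299f * p.x + .587f * p.y
                                      + .114f * p.z);
    }
}
```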
Intro to Parallel Programming Course Note 001