CUDA Parallel Computing Framework (II): Case Analysis


Starting from this part, we analyze CUDA's performance and feasibility in combination with a demo program.

One. An overview of the execution process.

CUDA executes by having the host launch a kernel on the graphics hardware (GPU), organized as a thread grid (grid). Each grid contains multiple thread blocks (block), and each block contains multiple threads (thread).

Each kernel is handed to a grid to complete. The grid divides its work among blocks, and each block divides its work among threads; the task of each grid is fixed. The index relationship for a two-dimensional thread block is as follows:
unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
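
As a minimal sketch of these indices in context (the kernel name, array layout, and parameters are my own assumptions for illustration), a two-dimensional kernel would use them like this:

// Hypothetical kernel: each thread scales one element of a width x height array.
__global__ void scaleKernel(float* data, unsigned int width, unsigned int height, float factor)
{
    unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
    if (xIndex < width && yIndex < height)        // guard threads that fall outside the array
        data[yIndex * width + xIndex] *= factor;  // row-major 2D indexing
}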

Each thread has its own registers and local memory; all threads in a block share one shared memory; and all blocks in a grid share one global memory.

Within each clock cycle, a warp (the group of threads in a block that run together) contains a fixed number of threads, currently set at 32. A block contains at most 16 warps, so a block contains a maximum of 512 threads.

Each device (that is, the video card) processes only one grid.

The following is a description of the hardware execution model.

Suppose that, for some reason, the company's office has been requisitioned for an event, leaving only a small room for the development team. Each clock cycle the hardware runs one warp (at run time, the group of threads in a block that execute together: if a block contains 512 threads but only 32 can run at once, those 32 threads form one running warp). Each warp contains a fixed number of threads, currently set at 32; whether this will change in the future, only the CUDA developers know. Each device (that is, the graphics card) processes only one grid (a limitation that may be lifted in future hardware supporting DirectX 11).

If a department has x people, the office has n tables, and 32 people can sit at each table, then everyone takes turns developing at the tables. A table here can be understood as a multiprocessor (SM). Each SM contains 8 scalar stream processors (SPs), and the so-called "number of cores" of a GPU is the number of SPs. A kernel function in CUDA is in essence executed block by block: threads in the same block must be able to share data, so the whole block has to be issued to the same SM, and each thread in the block is then dispatched to an SP for execution.

A question: given this warp-size limit, why allow a block to hold more threads than a warp?
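To make these numbers concrete, here is a hedged host-side sketch of a launch configuration, reusing the hypothetical scaleKernel from the sketch above (the 1024 x 1024 problem size is also assumed):

// Host-side sketch: launching scaleKernel over a 1024 x 1024 array.
void launch(float* d_data)
{
    dim3 block(32, 16);                        // 32 * 16 = 512 threads = 16 warps of 32
    dim3 grid((1024 + block.x - 1) / block.x,  // enough blocks to cover the array,
              (1024 + block.y - 1) / block.y); // rounding up in each dimension
    scaleKernel<<<grid, block>>>(d_data, 1024, 1024, 2.0f);
    // Each block lands on one SM; its 16 warps take turns on the SM's 8 SPs.
}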

Two. Demo

Install in order: driver, Toolkit, then SDK. A CUDA project supports four build configurations: Release, Debug, EmuRelease, and EmuDebug. The first two require a GPU that actually supports CUDA; the latter two emulate the GPU on the CPU. To check whether your computer supports CUDA, run the DeviceQuery.exe program.


Let's look at a few things in the DeviceQuery output: first, that there is a device supporting CUDA; then the compute capability (1.x here), the size of device memory, the number of cores, the number of multiprocessors, the size of constant memory, the size of shared memory per block, the warp size, and so on.
To see CUDA applied in the graphics field, run the SmokeParticles.exe sample.
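
The same information that DeviceQuery.exe prints can also be read programmatically through the runtime API. A minimal sketch (the output formatting is my own):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);                 // how many CUDA devices exist
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);      // fill in the device description
        printf("Device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
        printf("  multiprocessors: %d, warp size: %d\n",
               prop.multiProcessorCount, prop.warpSize);
        printf("  constant memory: %zu bytes, shared memory per block: %zu bytes\n",
               prop.totalConstMem, prop.sharedMemPerBlock);
    }
    return 0;
}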

In my demo, the .cpp files mainly handle CPU-side processing, while the .cu files contain the GPU kernel functions and CUDA API calls. My_kernel encapsulates a specific kernel implementation. The Cudatool project is the CUDA application; Cudaproviders is my bridge layer connecting C# and CUDA; Cudaweb is our usual web project; and Cudawinapp contains some small feature demos.
The CUDA function type qualifiers are described below.
__device__ functions execute on the device and can only be called from the device.
__global__ declares a kernel function, which executes on the device and can only be called from the host.
__host__ functions execute on the host and can only be called from the host; this is the default.
__device__ and __global__ functions do not support recursion, cannot declare static variables in the function body, cannot take a variable number of arguments, and cannot have their address taken as a device function pointer. __global__ and __host__ cannot be combined. A __global__ function must return void, must declare its execution configuration, and is called asynchronously; its parameters are currently passed through shared memory, and their total size cannot exceed 256 bytes.
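A short sketch tying these rules together (the function names are made up for illustration):

__device__ float square(float x)   // device-only helper, callable only from device code
{
    return x * x;
}

__global__ void squareAll(float* data, int n)   // kernel: must return void
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}

void runSquareAll(float* d_data, int n)   // __host__ by default
{
    // The execution configuration <<<grid, block>>> must be declared at the call,
    // and the call returns immediately (asynchronously).
    squareAll<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // block the host until the kernel finishes
}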
Variable type qualifiers are divided into __device__ (the variable resides on the device), __constant__ (it resides in constant memory), and __shared__ (it resides in a block's shared memory), plus the volatile keyword, used when a variable's value may be changed by other threads.
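
A minimal sketch of these qualifiers in use (the kernel name and array sizes are assumptions; it assumes blockDim.x <= 256):

__constant__ float coeff[16];   // resides in constant memory; set with cudaMemcpyToSymbol
__device__ int callCount;       // resides in device global memory

__global__ void scaleRow(const float* in, float* out, int n)
{
    __shared__ float tile[256];   // one copy per block, shared by all its threads
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];   // stage data in fast shared memory
    __syncthreads();                 // every thread in the block reaches the barrier
    if (i < n)
        out[i] = tile[threadIdx.x] * coeff[0];
}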
