CUDA Series Learning (III): GPU Design and Structure Q&A & Coding Exercises


What? You have finished CUDA Series Learning (I) and (II) and still don't know why we use the GPU for acceleration? Oh, right... from the feedback on Weibo I quietly sense that only a few readers actually raised this question, but more of you probably read part (I), felt it was too far from your own work, and hurriedly unfollowed and ran away... and I never did write a CUDA Series Learning (0)... Well, this post tackles exactly that: through a series of Q&A, plus some coding exercises, I hope you'll feel that getting close to CUDA is actually easy ~ ~


Please note that the Q&A items build on one another; read them in order, or they will be hard to follow!



Q: What are the usual ways to speed things up at the hardware level?
A:
- More processors
- Higher clock frequency
- More memory

Figure 1. Transistor size over time
The approach: make transistors faster, smaller, and lower-power, so that more transistors fit on each chip.
At the macro level, this means a single processor can process more data at the same time.






Q: Referring to Figure 2: one of the three common ways to accelerate is to raise the clock frequency, yet in recent years clock speeds have clearly stopped going up. Why is that?

Figure 2. Clock frequency over time

A: Is it because transistors can no longer be made faster and smaller? Wrong!

- The key issue is cooling! Even with ever-smaller transistors, the heat is hard to dissipate. So the focus now should be on building many small, power-efficient processors to achieve speedup.





Q: What is the difference between CPU and GPU design?
A:
CPU: complex control hardware
- Flexible in performance :)
- Expensive in terms of power :(

GPU: simple control hardware
- More hardware for computation :)
- More power efficient :)
- More restrictive programming model :(

So, as just discussed, building many small, power-efficient processors is the central idea of the GPU: the CPU focuses on optimizing (minimizing) latency, while the GPU focuses on optimizing (maximizing) throughput.



Q: What is latency and what is throughput?
A:
For example, suppose we need to travel 5000 km from A to B.
Method 1:
By taxi, 200 km/h, carrying 2 people: latency = 25 h, throughput = 2/25 person/h
Method 2:
By bus, 50 km/h, carrying 10 people: latency = 100 h, throughput = 10/100 person/h
So the CPU prefers the taxi, because each trip gets there sooner, while the GPU prefers the bus, because of its higher throughput.



Q: What is CUDA? What is the software-level structure of CUDA programming?

A: CUDA is NVIDIA's parallel computing platform and programming model for its GPUs. At the software level, the host (CPU) launches kernels on the device (GPU); each kernel runs as a grid of thread blocks, and each block contains many threads. The kernel launch parameters discussed below describe exactly this grid/block/thread hierarchy.

Q: What should you keep in mind when programming in CUDA?

A: Keep in mind what the GPU is good at!

- Efficiently launching lots of threads

- Running lots of threads in parallel



Q: Is there a limit on the parameters when launching a kernel?

A:

As covered in CUDA Series Learning (I): An Introduction to GPU and CUDA, a kernel launch is written as kernel<<<grid of blocks, block of threads>>>. The limit is the maximum number of threads per block: 1024 on newer GPUs (512 on older ones). For example, if I want 2048 threads running in parallel, I cannot simply write kernel<<<1, 2048>>>; instead I must split the launch as <<<a, b>>> such that a*b = 2048 and b <= 1024.
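For example (a minimal sketch; the kernel name increment and the device pointer d_arr are made up for illustration, and d_arr is assumed to already hold 2048 floats on the device):

__global__ void increment(float *d_arr) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    d_arr[i] += 1.0f;                               // one thread per element
}

// Not allowed: 2048 threads in a single block exceeds the 1024-thread limit.
// increment<<<1, 2048>>>(d_arr);

// OK: 2 blocks * 1024 threads/block = 2048 threads in total.
increment<<<2, 1024>>>(d_arr);

// Also OK: 4 blocks * 512 threads/block.
increment<<<4, 512>>>(d_arr);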



Q: Specifically, what is the general form of a kernel launch, and what do the parameters mean?

A: In general: kernel<<<grid of blocks, block of threads, shmem>>>
shmem: shared memory per block, in bytes

Let G denote the grid of blocks and B denote the block of threads; both are of type dim3.
Each dimension of dim3(x, y, z) defaults to 1, so dim3(w, 1, 1) == dim3(w) == int w, i.e. these three ways of writing it are equivalent.
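A small sketch of these equivalences (the kernel name myKernel and the sizes are made up for illustration):

__global__ void myKernel() { /* ... */ }

// 1D launch: the following three lines are equivalent (128 blocks of 64 threads).
myKernel<<<dim3(128, 1, 1), dim3(64, 1, 1)>>>();
myKernel<<<dim3(128), dim3(64)>>>();
myKernel<<<128, 64>>>();

// 2D launch with the optional third parameter:
// G = 8x8 grid of blocks, B = 16x16 block of threads, 0 bytes of shared memory per block.
dim3 G(8, 8);    // grid of blocks
dim3 B(16, 16);  // block of threads
myKernel<<<G, B, 0>>>();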





Q: What are the common built-in variables that can be used inside a CUDA kernel?
A: threadIdx, blockDim, blockIdx, gridDim
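A minimal sketch of how these built-ins are typically combined (the kernel whoAmI and its arguments are illustrative):

__global__ void whoAmI(int *d_out, int n) {
    // threadIdx : index of this thread within its block
    // blockDim  : number of threads per block
    // blockIdx  : index of this block within the grid
    // gridDim   : number of blocks in the grid (so gridDim.x * blockDim.x = total threads)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
    if (idx < n)          // guard: the launch may create more threads than elements
        d_out[idx] = idx; // every thread writes its own global index
}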




Q: Can you show a CUDA program template?

A: Calling it a template is a stretch; it is just a simple example of the usual routine. You can also refer to CUDA Series Learning (II): CUDA memory & variables - different memory and variable types. The steps are as follows (a minimal sketch following them appears after the list):

1. Declare the host variables, allocate space, and initialize them

2. Declare the device variables and allocate space

3. cudaMemcpy the initialized host variables to the device variables

4. Call the CUDA kernel to run the threads in parallel

5. Copy the results from the device back to the host

6. Free the device variables' memory
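A minimal, runnable sketch that follows the six steps above (the kernel addOne and all variable names are made up for illustration):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *d_arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_arr[i] += 1.0f;
}

int main() {
    const int N = 64;
    const int BYTES = N * sizeof(float);

    // 1. Declare the host variable, allocate space, and initialize it
    float h_arr[N];
    for (int i = 0; i < N; ++i) h_arr[i] = (float)i;

    // 2. Declare the device variable and allocate space
    float *d_arr = NULL;
    cudaMalloc(&d_arr, BYTES);

    // 3. Copy the initialized host variable to the device variable
    cudaMemcpy(d_arr, h_arr, BYTES, cudaMemcpyHostToDevice);

    // 4. Call the CUDA kernel to run the threads in parallel
    addOne<<<1, N>>>(d_arr, N);

    // 5. Copy the result from the device back to the host
    cudaMemcpy(h_arr, d_arr, BYTES, cudaMemcpyDeviceToHost);

    // 6. Free the device memory
    cudaFree(d_arr);

    printf("h_arr[63] = %f\n", h_arr[63]);  // expect 64.000000
    return 0;
}

Compile with nvcc, e.g. nvcc template.cu -o template (file name is arbitrary).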





========================================



Exercise 1:

Input: float array [0, 1, 2, ..., 63]

Output: float array [0^2, 1^2, 2^2, ..., 63^2]


Exercise 2:

Given a color image, convert it to grayscale.

Tips:
In CUDA, each pixel can be represented by the built-in struct uchar4:
unsigned char x, y, z, w; // four channels: R, G, B, and w, the alpha channel carrying transparency information
Grayscale conversion formula: I = 0.299*R + 0.587*G + 0.114*B (the different weights for R, G, B come from the human eye's different sensitivity to the three channels)
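As a hint only, here is a sketch of the per-pixel conversion, assuming the image is already on the device as a flat uchar4 array with x = R, y = G, z = B, w = alpha (that channel layout is an assumption; check your own input format):

__global__ void rgbaToGrey(unsigned char *d_grey, const uchar4 *d_rgba, int numPixels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels) {
        uchar4 p = d_rgba[i];  // assumed layout: x = R, y = G, z = B, w = alpha
        d_grey[i] = (unsigned char)(0.299f * p.x + 0.587f * p.y + 0.114f * p.z);
    }
}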



For implementation details, please refer to CUDA Series Learning (II): CUDA memory & variables - different memory and variable types. You are welcome to post your exercise code and run times in the replies, ideally compared against a CPU version ~



Resources:

Udacity CS344: Intro to Parallel Programming



