An Analysis of the Real Technology in CUDA 4.0

Last Update:2018-12-06 Source: Internet

Author: User

When reprinting, please credit the KlayGE game engine as the source. The address of this article is http://www.klayge.org/?p=961

Last week's post mentioned that NVIDIA had announced CUDA 4. Yesterday I received an email from NVIDIA saying that the CUDA 4.0 RC is available for download. Registered developers can find it at http://developer.nvidia.com/object/cuda_4_0_rc_downloads.html.

I had not planned to comment, but I happened to see a so-called "new feature analysis" on some website: a typical puff piece written by an editor who does not understand the technology. So I will offer my own modest analysis here, to keep Chinese readers from being misled.

The updates in CUDA 4.0 fall mainly into three areas:

Simplify the porting of parallel programs
Accelerate multi-GPU programming
Better tool chain support

Simplify parallel program porting

Before CUDA (and, for that matter, AMD's Stream) appeared, parallel programs could only be ported to the GPU by writing shaders directly, with many restrictions. The code was inflexible, and algorithms basically had to be rewritten rather than ported. CUDA improved the situation, and CUDA 4.0 makes porting simpler still. NVIDIA lists the following new features:

1. A GPU can be shared among multiple CPU threads
2. A single CPU thread can access all GPUs
3. System memory addresses can be mapped directly to the GPU without copying
4. New CUDA C/C++ features
5. Thrust template primitives library
6. NPP image/video processing library
7. Layered textures

Items 1 and 2 are really the same thing. In earlier CUDA versions, using multiple GPUs meant starting multiple threads, each responsible for one GPU (in other words, different GPUs had to live in different contexts). This was a silly limitation, and CUDA 4 removes it: any thread can talk to any GPU, which simplifies context management. All one can say is that it is now less stupid.
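As a rough illustration of a single CPU thread driving every GPU, the sketch below loops over the devices with cudaSetDevice and launches a trivial kernel on each. The kernel name fill and the buffer size are made up for this example, and error handling is kept to a minimum.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: every element gets the same value.
__global__ void fill(int *out, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no CUDA device found\n");
        return 0;
    }

    const int n = 256;
    int host[n];

    // One host thread visits each GPU in turn; no extra threads are created.
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);                       // later calls target this GPU

        int *d = NULL;
        cudaMalloc((void **)&d, n * sizeof(int));
        fill<<<1, n>>>(d, dev, n);                // write the device index
        cudaMemcpy(host, d, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);

        printf("device %d wrote %d\n", dev, host[0]);
    }
    return 0;
}
```

On a machine with one GPU the loop simply runs once; the point is only that the same thread can switch devices freely.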

Item 3 is, I think, the most important update in CUDA 4. Earlier CUDA versions partially supported memory address mapping: system memory allocated with cudaHostAlloc could be mapped to the GPU so that the GPU accessed data in system memory directly. This gets around the limit on GPU memory size (at some cost in performance). What CUDA 4 adds is that any memory obtained from malloc/new can be registered with the GPU by calling cudaHostRegister, and unregistered again with cudaHostUnregister, with no cudaMemcpy needed. In fact, PCIe itself supports mapping a host address to the device, and D3D and OpenGL handle large data such as buffers and textures this way, so this is nothing new. Beyond performance, this feature also makes it possible to convert a large CPU program into a GPU program bit by bit, verifying the results at each step. Previously, a lot of code such as cudaMalloc and cudaMemcpy had to be inserted before and after each converted piece, which was tedious and error-prone.
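The registration flow described above might look roughly like the following sketch. The kernel scale and the buffer size are invented for the example. It assumes a device that can map host memory, and some toolkit versions document alignment requirements for the registered range, so check the documentation for the version in use.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: multiply each element in place.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    // Ask for mapped host memory support before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Ordinary malloc'ed memory, as an existing CPU program would have.
    float *host = (float *)malloc(bytes);
    if (!host) return 1;
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Pin the existing allocation and map it into the GPU address space.
    if (cudaHostRegister(host, bytes, cudaHostRegisterMapped) != cudaSuccess) {
        printf("cudaHostRegister failed (mapping may be unsupported here)\n");
        free(host);
        return 0;
    }

    // Get the device-side pointer that aliases the same host memory.
    float *dev = NULL;
    cudaHostGetDevicePointer((void **)&dev, host, 0);

    // The kernel reads and writes system memory directly; no cudaMemcpy.
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);
    cudaDeviceSynchronize();

    printf("host[0] after kernel: %f\n", host[0]);

    cudaHostUnregister(host);
    free(host);
    return 0;
}
```

Because the CPU code keeps its own malloc'ed buffer, a single function can be moved to the GPU and checked against the CPU result before the next one is touched.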

Item 4 is just a compiler update. It adds C++-style new/delete, virtual functions, inline PTX assembly, and so on to the CUDA compiler. D3D11-class GPUs support function pointers, so virtual functions are not difficult. As for new/delete and inline PTX assembly, these amount to little more than changing a few lines of compiler code; the real question is why they were not supported before.
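A small sketch of these three language features used together is given below. The types Shape and Square and the helper lane_id are hypothetical names for this example. It assumes a GPU of compute capability 2.0 or later, which device-side new/delete and virtual functions require.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical class hierarchy, created and used entirely on the device.
struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ Square(float s) : side(s) {}
    __device__ float area() const { return side * side; }
};

// Inline PTX: read the lane index of the calling thread within its warp.
__device__ unsigned lane_id() {
    unsigned id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
}

__global__ void demo(float *areas, unsigned *lanes) {
    int i = threadIdx.x;
    Shape *s = new Square((float)i);       // device-side new
    areas[i] = s ? s->area() : -1.0f;      // virtual call; -1 if allocation failed
    delete s;                              // device-side delete
    lanes[i] = lane_id();
}

int main() {
    const int n = 32;
    float *d_areas = NULL;
    unsigned *d_lanes = NULL;
    cudaMalloc((void **)&d_areas, n * sizeof(float));
    cudaMalloc((void **)&d_lanes, n * sizeof(unsigned));

    demo<<<1, n>>>(d_areas, d_lanes);

    float areas[n];
    unsigned lanes[n];
    cudaMemcpy(areas, d_areas, sizeof(areas), cudaMemcpyDeviceToHost);
    cudaMemcpy(lanes, d_lanes, sizeof(lanes), cudaMemcpyDeviceToHost);

    printf("thread 3: area = %f, lane = %u\n", areas[3], lanes[3]);

    cudaFree(d_areas);
    cudaFree(d_lanes);
    return 0;
}
```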

Thrust, item 5, is a third-party library. It follows the style of the C++ STL to wrap some CUDA data structures and primitive algorithms (such as scan, reduce, and sort), so that people with no CUDA background can use GPU acceleration in C++ programs. The library was released long ago and works fine with earlier CUDA versions. CUDA 4 merely bundles it; it is nothing new.
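To give a feel for the STL-like style, here is a minimal sketch that runs the three primitives named above (sort, reduce, scan) on a small device_vector. The input values are arbitrary.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>

int main() {
    int raw[] = {5, 3, 8, 1};

    // Constructing a device_vector from a host range copies the data to the GPU.
    thrust::device_vector<int> d(raw, raw + 4);

    thrust::sort(d.begin(), d.end());                    // 1 3 5 8
    int sum = thrust::reduce(d.begin(), d.end(), 0);     // sum of all elements

    // Inclusive prefix sum of the sorted values.
    thrust::device_vector<int> prefix(d.size());
    thrust::inclusive_scan(d.begin(), d.end(), prefix.begin());

    // Assigning to a host_vector copies the result back.
    thrust::host_vector<int> h = prefix;

    printf("sum = %d, prefix sums =", sum);
    for (size_t i = 0; i < h.size(); ++i) printf(" %d", h[i]);
    printf("\n");   // prints: sum = 17, prefix sums = 1 4 9 17
    return 0;
}
```

No kernel is written by hand and no cudaMalloc or cudaMemcpy appears, which is exactly the appeal for C++ programmers new to CUDA.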

NPP, item 6, was likewise an existing CUDA library; it is developed by NVIDIA itself and is now bundled.

The layered texture, item 7, is really just the texture array of D3D10+. I do not know why it was never exposed in CUDA before, and now it is listed as a "new feature".

Accelerate multi-GPU programming

The new technology is called GPUDirect 2.0. There was previously a little-known GPUDirect 1.0; this upgrade enables, for the first time, peer-to-peer memory access, data transfer, and synchronization between GPUs. It may need a specific chipset for support. This is indeed good news for multi-GPU setups. In the past, if multiple GPUs were to work on the same task, the same data had to be copied to each GPU in turn before work could start, so copy time swamped compute time (the copy was exclusive, and multiple GPUs could not copy in parallel). The more GPUs, the slower it got. Worse, to access the results computed by another GPU, you had to cudaMemcpy to host memory and then cudaMemcpy again to the other GPU. Now a single cudaMemcpy is enough.
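A GPU-to-GPU copy of the kind described above can be sketched as follows. The example needs at least two GPUs and exits early otherwise. The buffer size and fill value are arbitrary, and whether the transfer is truly direct depends on hardware and chipset support.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count < 2) {
        printf("this sketch needs at least two GPUs\n");
        return 0;
    }

    // Ask whether device 0 can address memory that lives on device 1.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    printf("device 0 can access device 1 directly: %s\n", canAccess ? "yes" : "no");

    const size_t bytes = 1 << 20;
    void *buf0 = NULL;
    void *buf1 = NULL;

    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaMemset(buf0, 42, bytes);                  // data produced on GPU 0
    if (canAccess) cudaDeviceEnablePeerAccess(1, 0);

    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // One call moves the data from GPU 0 to GPU 1. Without peer access the
    // runtime, as far as I know, stages the copy through host memory itself.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

    unsigned char first = 0;
    cudaMemcpy(&first, buf1, 1, cudaMemcpyDeviceToHost);
    printf("first byte on device 1: %d\n", (int)first);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```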

In fact, this also requires that host memory and the memory of multiple devices be addressed uniformly (that is, UVA, Unified Virtual Addressing), so that cudaMemcpy can tell where the source and destination are. NVIDIA illustrates this with a figure (not reproduced here).

With UVA, you can pass cudaMemcpyDefault instead of specifying cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice. Why is this only supported as of 4.0? You would have to ask NVIDIA.
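The following sketch shows cudaMemcpyDefault in both directions. It first checks the unifiedAddressing device property and exits if UVA is unavailable. The array contents are arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.unifiedAddressing) {
        printf("unified virtual addressing is not available on device 0\n");
        return 0;
    }

    const int n = 4;
    float src[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    float dst[n] = {0.0f, 0.0f, 0.0f, 0.0f};

    float *d = NULL;
    cudaMalloc((void **)&d, sizeof(src));

    // With UVA the runtime works out the direction from the pointer values,
    // so the same flag serves for host-to-device and device-to-host.
    cudaMemcpy(d, src, sizeof(src), cudaMemcpyDefault);
    cudaMemcpy(dst, d, sizeof(src), cudaMemcpyDefault);

    printf("round trip: %f %f %f %f\n", dst[0], dst[1], dst[2], dst[3]);

    cudaFree(d);
    return 0;
}
```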

Better tool chain support

From the text-mode profiler to the Visual Profiler, the CUDA tool chain has been improving steadily. Besides showing kernel execution time, utilization, and the cost of various instructions as before, the profiler in CUDA 4 can also give overall statistics and optimization hints, which is more practical. cuda-gdb has also been updated to support C++ debugging.

CUDA x86

Finally, NVIDIA also mentions the latest progress on PGI CUDA x86. Version 1.0 is slated for May 4 (very basic, without even multi-core support), and version 1.1 for May (with multi-core and SSE/AVX support).

Summary

I think most of the updates in CUDA 4 are improvements to the upper layers and the periphery. The highlights among the core updates are system memory address mapping and unified addressing; the rest are hardly worth mentioning. Perhaps NVIDIA has picked up Google's version-numbering approach, bumping the major version at the slightest excuse. CUDA 4 is worth looking forward to, but presenting things that should have been there all along as new features is not a good look.
