CUDA Memory Access (I), Improving Step by Step: GPU Revolution

Source: Internet
Author: User

At first glance, memory access is just a handful of API calls and hardly worth discussing: once you know cudaMalloc, cudaMemcpy, and cudaFree, you can allocate device memory and use it on the device. But it is like watching a track meet: when you see someone run 100 meters in under 10 seconds, you may only say that he runs really fast. How many people understand the tempering it took to get there? To really understand how memory accesses are scheduled on the device, and to make our programs run faster, we need a deeper understanding of what happens during a memory access.
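As a reminder of what those three calls look like together, here is a minimal sketch. The kernel doubleElements, the helper roundTrip, and the CHECK macro are hypothetical names made up for illustration; only cudaMalloc, cudaMemcpy, and cudaFree come from the text above.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from the article): abort on any CUDA error.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                    cudaGetErrorString(err_), __LINE__);          \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Illustrative kernel: each thread doubles one element.
__global__ void doubleElements(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Sends n floats to the device, doubles them there, and copies them back.
// Returns true if every value came back doubled.
bool roundTrip(int n)
{
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    if (h == NULL) return false;
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d = NULL;
    CHECK(cudaMalloc((void **)&d, bytes));                    // allocate device memory
    CHECK(cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice));   // host -> device

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    doubleElements<<<blocks, threads>>>(d, n);
    CHECK(cudaGetLastError());

    // This copy waits for the kernel to finish before reading the result.
    CHECK(cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost));   // device -> host
    CHECK(cudaFree(d));                                       // release device memory

    bool ok = true;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f * (float)i) ok = false;
    free(h);
    return ok;
}

int main()
{
    printf("round trip %s\n", roundTrip(1 << 20) ? "ok" : "FAILED");
    return 0;
}
```

This is the easy part; the rest of the article is about what happens underneath these calls.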

I remember that university programming courses cover allocating and releasing memory early on, but memory alignment and memory layout usually appear only at the end of the book. Take C++: you may have studied it for years, but do you know the memory layout of a class? Do you know how a call through a virtual table accesses memory? Or take network programming: at some point you will find that data sent from one end arrives changed at the other end, with wrong contents, because the fields of the struct you defined are misaligned on the receiving side. And when you use SSE to speed up data processing, have you run into memory alignment problems? Having said so much, perhaps some readers are getting annoyed, so back to the topic. These issues may only come to mind once you hit them in real work, but I still hope we can acquire the ability to solve problems before the problems appear. Don't wait until you meet a beautiful woman to start shaving; good habits should be cultivated beforehand.

The preceding paragraphs are just some life experience; you can skip them and come back to them later. Let us go on explaining device memory access step by step. After so many years of development, memory keeps getting cheaper, but who remembers that back when memory was woven by hand, many workers ruined their eyesight doing it? Let us pay tribute to their contribution to computing.

I still remember watching gunfight films as a child: Xiao Ma Ge ("Little Brother Ma") firing away with his guns, so cool! But as I grew a little older, a doubt nagged at me: do the bullets never run out? How many rounds can such a small magazine hold? An 8-round revolver in the hero's hands takes down more than 10 people without reloading. Embarrassing! Automatic pistols usually hold 8 or 14 rounds, and even the Bokeqiang (the Mauser C96 pistol) holds at most 20. You may object that in "First Blood 4" (Rambo 4), Stallone fires a vehicle-mounted M2HB 12.7 mm heavy machine gun, which does not use a magazine and can be loaded with thousands of rounds. Yes, but that is DMA (direct memory access): the data does not go through a magazine, and the processor does not have to relay it. On G80, a single memory access by a thread reads or writes 4 bytes, 8 bytes, or 16 bytes. In other words, G80 has three kinds of magazine: one holds 4 rounds, one holds 8, and one holds 16.
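The three magazine sizes map naturally onto CUDA's built-in types: float is 4 bytes, float2 is 8 bytes, and float4 is 16 bytes. The sketch below is an illustration under that assumption, not code from the original article; the kernel copyWords and the helper copyFloat4 are hypothetical names.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread copies exactly one word of type T.
// With T = float, float2, or float4, that is 4, 8, or 16 bytes per thread.
template <typename T>
__global__ void copyWords(const T *in, T *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Hypothetical helper: copies n float4 words on the device and
// checks the result on the host. Returns true if the copy matches.
bool copyFloat4(int n)
{
    std::vector<float4> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i)
        h_in[i] = make_float4((float)i, i + 0.25f, i + 0.5f, i + 0.75f);

    size_t bytes = n * sizeof(float4);
    float4 *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copyWords<float4><<<blocks, threads>>>(d_in, d_out, n);  // 16 bytes per thread

    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < n && ok; ++i)
        ok = h_out[i].x == h_in[i].x && h_out[i].y == h_in[i].y &&
             h_out[i].z == h_in[i].z && h_out[i].w == h_in[i].w;

    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    printf("float: %zu bytes, float2: %zu bytes, float4: %zu bytes\n",
           sizeof(float), sizeof(float2), sizeof(float4));
    printf("float4 copy %s\n", copyFloat4(1 << 18) ? "ok" : "FAILED");
    return 0;
}
```

Instantiating the same template with float or float2 gives the 4-round and 8-round versions.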

Global memory accesses are not cached, so they work like an old musket: fire a shot, reload, then fire again. Each access has a latency of 400 to 600 clock cycles (core clock). That is why memory access is one of the bottlenecks in CUDA programming. The bandwidthTest sample provided with the SDK measures transfer performance from host to device, device to host, and device to device. Although PCIe has a theoretical rate of 3.2 GB/s, actual transfers do not reach it. Device-to-device transfers reach 89 GB/s on a GTX 260, close to the theoretical 90 GB/s for that card. These numbers will differ from system to system: with a different motherboard or a different configuration, you will not necessarily get the same results.
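If you want to see roughly what such a measurement involves, the following is a minimal sketch that times cudaMemcpy with CUDA events. It is not the SDK's bandwidthTest source; the helper copyBandwidthGBs, the 32 MB buffer size, and the 10 repetitions are assumptions chosen for illustration, and the host buffer is ordinary pageable memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: times `reps` copies of `bytes` bytes and returns the
// rate in GB/s (10^9 bytes per second), counting each byte copied once.
float copyBandwidthGBs(void *dst, const void *src, size_t bytes,
                       cudaMemcpyKind kind, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r)
        cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);          // wait until all copies have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (float)((double)bytes * reps / (ms * 1.0e-3) / 1.0e9);
}

int main()
{
    const size_t bytes = 32u << 20;      // 32 MB per copy (arbitrary choice)
    const int reps = 10;

    void *h = malloc(bytes);             // pageable host memory
    void *d_a = NULL, *d_b = NULL;
    if (h == NULL ||
        cudaMalloc(&d_a, bytes) != cudaSuccess ||
        cudaMalloc(&d_b, bytes) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h, 0, bytes);

    printf("host to device:   %.2f GB/s\n",
           copyBandwidthGBs(d_a, h, bytes, cudaMemcpyHostToDevice, reps));
    printf("device to host:   %.2f GB/s\n",
           copyBandwidthGBs(h, d_a, bytes, cudaMemcpyDeviceToHost, reps));
    // A device-to-device copy reads every byte once and writes it once,
    // so the memory traffic is twice the number of bytes copied.
    printf("device to device: %.2f GB/s\n",
           copyBandwidthGBs(d_b, d_a, bytes, cudaMemcpyDeviceToDevice, reps));

    cudaFree(d_a);
    cudaFree(d_b);
    free(h);
    return 0;
}
```

As with the SDK sample, the figures printed depend on the card, the motherboard, and the system configuration.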

An active warp on the device has 32 threads, but only 16 of them, a half-warp, actually run at the same time. When the 16 threads of a half-warp access memory, it is best for them to access consecutive memory addresses in thread order, which guarantees coalesced access. A figure illustrating this can be found in the slides at http://www.isi.edu/~ddavis/GPU/Course/Slides/GPU+CUDA.pdf.
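To make the difference concrete, here is a small sketch with two copy kernels. In copyCoalesced, thread i touches element i, so the 16 threads of a half-warp hit 16 consecutive addresses. In copyStrided, neighbouring threads touch addresses 17 elements apart. The kernel names, the helper runCopy, and the stride of 17 are assumptions made for illustration and are not taken from the slides.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i accesses element i: consecutive threads, consecutive addresses.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Thread i accesses element (i * stride) % n: neighbouring threads are
// `stride` elements apart, so a half-warp's accesses are scattered.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[j] = in[j];
    }
}

// Hypothetical helper: runs one kernel, stores its time in *ms, and
// returns true if the output buffer equals the input buffer.
bool runCopy(bool strided, float *ms)
{
    const int n = 1 << 20;
    const int stride = 17;  // odd and n is a power of two: every index is hit once
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    size_t bytes = n * sizeof(float);
    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0xFF, bytes);  // fill with a pattern that is not valid data

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start, 0);
    if (strided)
        copyStrided<<<blocks, threads>>>(d_in, d_out, n, stride);
    else
        copyCoalesced<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(ms, start, stop);

    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < n && ok; ++i) ok = (h_out[i] == h_in[i]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    float msCoalesced = 0.0f, msStrided = 0.0f;
    bool ok1 = runCopy(false, &msCoalesced);
    bool ok2 = runCopy(true, &msStrided);
    printf("coalesced: %.3f ms (%s)\n", msCoalesced, ok1 ? "ok" : "FAILED");
    printf("strided:   %.3f ms (%s)\n", msStrided, ok2 ? "ok" : "FAILED");
    return 0;
}
```

Both kernels copy the same data and produce the same result; only the order in which threads touch memory differs. How large the timing gap is depends on the hardware.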
