7. CUDA Memory Access (I), Step-by-Step Improvement: GPU Revolution

Last Update:2018-12-03 Source: Internet

Author: User



Preface: Almost three months have passed since the previous article, "CUDA Programming Interface (II): 18 Weapons", and I wonder how everyone spent the summer vacation and what you experienced. It took me two weeks of bedtime reading to finish the fifth volume of "Those Things of the Ming Dynasty". Reading about the weapons of the Ming Dynasty made me think of aircraft design, the major I studied. Ming weapons were the most advanced in the world at the time, yet as someone trained in aircraft design, when I look at how far aircraft development abroad has come, my face burns. After centuries of trauma, when will we restore that flourishing age?

While people are still counting China's gold medals at the Olympic Games, how many really understand what sport means? The spirit of sport is not just a gold medal. Compare China's gold medals with its total medal count, and then look at the total medal count of the United States. Perhaps all we can say is that reform and opening up let some people get rich first, and sport is no exception. The Olympic program hosted by Cui Yongyuan paid more attention to the athletes who did not win gold, which is worth reflecting on.

I went from the junior high school football team to the high school track and field team, and took part in many sports meets at university. From the first primary school sports meet to the last, from high school records to university records: how many people understand how many years of training stand behind a school record? How many understand the meaning of persistence? Perhaps only athletes do. Participation, passion, and persistence: how much tempering lies behind each start, and how many people pay attention to each finish line?

Understand with your heart, feel with your heart, and treat others with a more tolerant mind, and others will be more tolerant in return. I have been through some new experiences over the past three months, in relationships, money, and career. Every process, every state of mind, every bit of persistence, and everything that has passed is just a small episode. Go on!

Now, down to business. Memory access may look like nothing more than a few API calls, and it can seem there is little to say: once you know cudaMalloc, cudaMemcpy, and cudaFree, you can allocate memory on the device, use it, and release it. But this is like watching a 100-meter race that lasts less than 10 seconds: you might just say the winner runs really fast, and how many people are aware of the training behind it? To really understand how memory is scheduled and accessed on the device, and to make our programs faster, we need a deeper understanding of how memory access works.

I remember that my first programming course at university covered memory allocation and release, while memory alignment and memory layout were usually left to the end of the book. Take C++: you may have studied and used it for years, but do you know the memory layout of a class? How is a virtual table accessed in memory? Or take networking: at some point you find that the data received differs from the data sent, because the struct you defined ends up misaligned when it is passed to the other side. And isn't memory alignment a headache when SSE is used to accelerate data processing?

Many readers may be bored by now, so let's get to the point. You may only think about these things when you run into problems in practice, but I hope you can acquire the ability to solve them before they arise, rather than starting to shave only once the pretty girl is already in front of you. The paragraphs above are just some of my life experiences; you can skip them and come back later. :) Let's go through memory access on the device step by step.
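As a starting point, the three calls named above can be sketched as one host-to-device round trip. This is a minimal sketch, not the author's code: the kernel addOne and the helper roundTrip are hypothetical names introduced for illustration, and error handling is reduced to a pass/fail result.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to every element of the array.
__global__ void addOne(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] += 1.0f;
}

// Hypothetical helper: allocate, copy in, run the kernel, copy out, free.
// Returns true if every element came back as its input value plus one.
bool roundTrip(int n)
{
    size_t bytes = n * sizeof(float);
    float *h_data = (float *)malloc(bytes);
    if (!h_data) return false;
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        free(h_data);
        return false;
    }

    bool ok = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        addOne<<<blocks, threads>>>(d_data, n);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    for (int i = 0; ok && i < n; ++i) ok = (h_data[i] == (float)i + 1.0f);

    cudaFree(d_data);
    free(h_data);
    return ok;
}

int main()
{
    printf("round trip %s\n", roundTrip(1 << 20) ? "ok" : "failed");
    return 0;
}
```

Nothing in this sketch says anything about how fast the accesses are; that depends on the access pattern inside the kernel, which is the subject of the rest of this chapter.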
After so many years of development, memory keeps getting cheaper, but who remembers that back when memory was woven by hand, many workers ruined their eyesight, a selfless contribution to the computers of the past?

I still remember that as a child I loved watching gunfight films, with the handsome hero spraying bullets everywhere. When I got a little older I began to wonder: doesn't he ever run out of bullets? How many rounds fit into such a small magazine? In the hero's hands, an eight-round revolver could take down more than a dozen people without ever being reloaded. Typical automatic pistols hold 8 or 14 rounds, and even the largest "box cannon" (the Mauser pistol) holds only 20. You might point out that in Rambo 4, Sylvester Stallone fires the vehicle-mounted M2HB 12.7 mm heavy machine gun, which can be loaded with thousands of rounds at a time without a magazine. That is like accessing memory directly through DMA: no magazine is used, and no processor is needed to move the data.

The G80 can access memory 4 bytes, 8 bytes, or 16 bytes at a time. In other words, the G80 has three magazines: one that holds 4 rounds, one that holds 8, and one that holds 16. Global memory is not cached, so, like an old muzzle-loading gun, it has to be reloaded before every shot. Each access has a latency of … clock cycles (core clock).

In CUDA programming, memory access is one of the bottlenecks. The bandwidthTest sample provided with the SDK measures transfer performance from host to device, from device to host, and from device to device. Although PCIe has a theoretical value of 3.2 Gbit/s, actual transfers do not reach that. Device-to-device transfers can reach about 89 GB/s on a GTX 260, against a theoretical value of 90 GB/s.
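To get a feeling for these three transfer directions without opening the SDK sample, one can time a single cudaMemcpy with CUDA events. The following is a rough sketch under stated assumptions, not a reimplementation of bandwidthTest: the helpers copyBandwidthGBs and measureAll are hypothetical, the host buffer is ordinary pageable memory, and one untimed warm-up is omitted, so the numbers will be lower and noisier than the sample's.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: times one cudaMemcpy with events and returns the rate
// in GB/s (bytes copied per second / 1e9), or -1 on error.
float copyBandwidthGBs(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaError_t err = cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the copy and the stop event are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    if (err != cudaSuccess || ms <= 0.0f) return -1.0f;
    return ((float)bytes / 1.0e9f) / (ms / 1.0e3f);
}

// Hypothetical helper: measures all three directions for one buffer size.
bool measureAll(size_t bytes)
{
    void *h_buf = malloc(bytes);
    void *d_a = NULL, *d_b = NULL;
    if (!h_buf) return false;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) { free(h_buf); return false; }
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); free(h_buf); return false; }
    memset(h_buf, 0, bytes);

    float h2d = copyBandwidthGBs(d_a, h_buf, bytes, cudaMemcpyHostToDevice);
    float d2h = copyBandwidthGBs(h_buf, d_a, bytes, cudaMemcpyDeviceToHost);
    float d2d = copyBandwidthGBs(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);
    printf("host->device   %.2f GB/s\n", h2d);
    printf("device->host   %.2f GB/s\n", d2h);
    printf("device->device %.2f GB/s\n", d2d);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_buf);
    return h2d > 0.0f && d2h > 0.0f && d2d > 0.0f;
}

int main()
{
    return measureAll(32u << 20) ? 0 : 1;  // 32 MB per transfer
}
```

The results depend heavily on the graphics card, the motherboard, and whether the host buffer is pinned, so they should be read as a rough indication only.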
These figures vary from person to person, since everyone has a different graphics card, a different motherboard, and a different system configuration.

An active warp on the device has 32 threads, but only 16 of them, a half-warp, run at the same time. When the 16 threads of a half-warp access memory, they should address consecutive memory locations in order, so that the access is coalesced. Like http://www.isi.edu /~ Ddavis/GPU/Course/slides/gpu?cuda= ~. If the addresses cross over one another, the access is uncoalesced. When I played with a toy revolver as a child, the cylinder rotated automatically and the rounds were quickly fired off in order; I never saw anyone fire the 1st round, then the 3rd, and then come back for the 2nd. I will not explain this further here.

Following on from the above, here is an example: optimizing uncoalesced float3 code into coalesced code. A float3 is 12 bytes, so each thread reads three floats. Remember that a magazine holds 4, 8, or 16 bytes, and 12 bytes matches none of these. Do you still remember how active threads work? If not, it is best to refer to the previous chapter on thread execution. When a warp is working, 16 threads work together, so 16 threads access memory at the same time, covering 16 × 3 × 4 = 192 bytes (16 threads, 3 floats per float3, 4 bytes per float). Because each thread fetches a 12-byte element rather than one of the supported sizes, the half-warp's reads do not line up as consecutive words, and the access is uncoalesced.

To see how to solve this problem, let's walk through how the access proceeds. We set the block size to 256 threads, and every thread executes the first store instruction, s_data[threadIdx.x] = g_in[index]. Do you still remember the thread execution model within a block? All threads execute the same instruction, and all 256 threads must have executed it before the results are used by the next step (across warps, this ordering is what a __syncthreads() barrier guarantees).
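The float3 optimization described above can be sketched as follows. This is a reconstruction of the well-known shared-memory staging pattern, not the author's original listing: only the names s_data, g_in, index, and the block size of 256 come from the text, while the kernel names, the "add 2" operation, and the host check checkFloat3Kernels are assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block, matching the text

// Naive version: each thread reads and writes one 12-byte float3 directly.
__global__ void addTwoNaive(const float3 *g_in, float3 *g_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float3 a = g_in[i];
    a.x += 2.0f; a.y += 2.0f; a.z += 2.0f;
    g_out[i] = a;
}

// Staged version: the block moves its 256 float3 values as 3 * 256 plain
// floats, so consecutive threads always touch consecutive 4-byte words.
// Assumes blockDim.x == BLOCK and a float3 count that is a multiple of BLOCK.
__global__ void addTwoShared(const float *g_in, float *g_out)
{
    __shared__ float s_data[BLOCK * 3];
    int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;

    s_data[threadIdx.x]             = g_in[index];
    s_data[threadIdx.x + BLOCK]     = g_in[index + BLOCK];
    s_data[threadIdx.x + 2 * BLOCK] = g_in[index + 2 * BLOCK];
    __syncthreads();  // all three stores must land before any float3 is read

    float3 a = ((float3 *)s_data)[threadIdx.x];
    a.x += 2.0f; a.y += 2.0f; a.z += 2.0f;
    ((float3 *)s_data)[threadIdx.x] = a;
    __syncthreads();  // all results must be in shared memory before write-back

    g_out[index]             = s_data[threadIdx.x];
    g_out[index + BLOCK]     = s_data[threadIdx.x + BLOCK];
    g_out[index + 2 * BLOCK] = s_data[threadIdx.x + 2 * BLOCK];
}

// Hypothetical host check: runs both kernels and verifies out == in + 2.
bool checkFloat3Kernels(int blocks)
{
    int nFloats = 3 * BLOCK * blocks;
    size_t bytes = nFloats * sizeof(float);
    std::vector<float> h_in(nFloats), h_out(nFloats);
    for (int i = 0; i < nFloats; ++i) h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    bool ok = true;
    for (int pass = 0; pass < 2 && ok; ++pass) {
        cudaMemset(d_out, 0, bytes);
        if (pass == 0) addTwoNaive<<<blocks, BLOCK>>>((const float3 *)d_in, (float3 *)d_out);
        else           addTwoShared<<<blocks, BLOCK>>>(d_in, d_out);
        ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < nFloats; ++i) ok = (h_out[i] == h_in[i] + 2.0f);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    printf("float3 kernels %s\n", checkFloat3Kernels(64) ? "agree" : "failed");
    return 0;
}
```

Both kernels compute the same result; the difference lies only in how the half-warp's addresses are laid out during the global loads and stores. On newer GPUs, where global memory accesses are cached, the gap between the two versions may be much smaller than on G80-class hardware.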
Shared memory is used as a staging area so that the global memory accesses themselves stay coalesced. Alternatively, if the size of a struct is not 4, 8, or 16 bytes, its memory can be forcibly aligned to make the global access coalesced, for example with __align__(X), although this wastes some space. If the float3 struct is declared with __align__(16), one float's worth of space is spent on padding.

To sum up, this chapter covered alignment for global memory access: when the accesses cannot be made consecutive, there are two ways to fix them, staging through shared memory and forced alignment. In fact, after reading everything above, this last sentence is the one to take away. The next chapter will introduce bank conflicts in shared memory access, which are actually very simple and can be worked out by drawing a picture. PS: Draw lots of pictures. If you don't understand something, draw it out :)
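The forced-alignment alternative can be illustrated with a small sketch. The struct names Vec3Packed and Vec3Aligned, the kernel addTwoAligned, and the host helper runAligned are hypothetical and introduced only for this example; __align__ itself is the CUDA alignment specifier mentioned above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

struct Vec3Packed { float x, y, z; };                 // 12 bytes, like float3
struct __align__(16) Vec3Aligned { float x, y, z; };  // padded to 16 bytes

// The padding costs one float per element, as described in the text.
static_assert(sizeof(Vec3Packed) == 12, "three floats without padding");
static_assert(sizeof(Vec3Aligned) == 16, "three floats plus 4 bytes of padding");

// Each thread handles one 16-byte element, a size the hardware supports,
// so the compiler is free to use a single 16-byte access per element.
__global__ void addTwoAligned(const Vec3Aligned *g_in, Vec3Aligned *g_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Vec3Aligned a = g_in[i];
        a.x += 2.0f; a.y += 2.0f; a.z += 2.0f;
        g_out[i] = a;
    }
}

// Hypothetical host check: verifies that every component came back as in + 2.
bool runAligned(int n)
{
    std::vector<Vec3Aligned> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) {
        h_in[i].x = (float)i; h_in[i].y = (float)i + 0.25f; h_in[i].z = (float)i + 0.5f;
    }
    size_t bytes = n * sizeof(Vec3Aligned);

    Vec3Aligned *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    addTwoAligned<<<(n + threads - 1) / threads, threads>>>(d_in, d_out, n);
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) {
        ok = h_out[i].x == h_in[i].x + 2.0f &&
             h_out[i].y == h_in[i].y + 2.0f &&
             h_out[i].z == h_in[i].z + 2.0f;
    }
    cudaFree(d_in);
    cudaFree(d_out);
    return ok;
}

int main()
{
    printf("aligned struct kernel %s\n", runAligned(1 << 16) ? "ok" : "failed");
    return 0;
}
```

Compared with staging through shared memory, this approach keeps the kernel simple but stores 16 bytes for every 12 bytes of data, so a quarter of the buffer and of the memory traffic is padding.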

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
