Since the launch of NVIDIA's CUDA (Compute Unified Device Architecture), it has been eagerly embraced by countless NVIDIA fans, and many engineers outside the graphics and imaging field have started playing with CUDA as well. I am a bit lazy: apart from the theory and the architecture, I only pick up the rest (language details and so on) when actual work demands it, so it took me a while to dig up CUDA tutorials on the Internet.
But as the title suggests, I did not write this post to praise how "powerful" CUDA is, but to rant about it :-)
There are bound to be mistakes, but I will still summarize its limitations as I understand them. Corrections and criticism are welcome.
First, at the syntax and API level:
1. Dynamic memory cannot be allocated in code executing on the GPU. That is to say, inside a compute kernel you can only use shared memory to hold temporary data, or have the CPU allocate the buffers in advance (a kernel sketch after this list shows the pattern);
2. CUDA is based on the C language. Never mind the C99 standard, and exception handling is obviously out of the question (more on this later); even the function pointers of ANSI C cannot be used in CUDA. As far as I know, wherever plain C is used, for example in embedded devices and many network devices, function pointers are almost always used to implement something like the virtual functions of C++;
3. Blocks/threads must be allocated by the CPU (the host side of a CUDA program), whereas in today's multi-threaded programs a subordinate worker thread often spawns lower-level worker threads of its own, transparently to the layers above it;
4. The only inter-thread communication primitive is synchronization. When we write ordinary multi-threaded programs, we do from time to time pass data between threads in one way or another (global variables, shared memory, and so on).
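To make the points above concrete, here is a minimal sketch (the kernel and variable names are mine, not from any tutorial): the host allocates all device memory up front, fixes the block/thread layout at launch time, and inside the kernel the only tools a block of threads has for cooperating are a statically sized piece of shared memory and __syncthreads().

```cuda
#include <cuda_runtime.h>

// Each block reverses its own 256-element tile in place.
__global__ void reverse_block(float *data)
{
    __shared__ float tile[256];                 // statically sized scratch; no malloc in device code
    int i = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[i] = data[base + i];
    __syncthreads();                            // the one cooperation primitive within a block
    data[base + i] = tile[blockDim.x - 1 - i];
}

int main()
{
    const int n = 1024;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));  // all allocation happens on the CPU side
    cudaMemset(d_data, 0, n * sizeof(float));
    reverse_block<<<n / 256, 256>>>(d_data);          // grid/block layout fixed by the host at launch
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```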
These limitations are at the language level, and when we try to understand something, that is only the surface. So what is the essence? Here is what I consider more essential:
1. Why function pointers cannot be used in a compute kernel. The way a GPU stores its program is quite different from an x86 CPU. The x86 CPU unifies the storage space for executable code and for data; this has been true since the very first 8086, which uses a pair of registers (e.g. CS, EIP) to indicate where the program is currently executing. From the GPU hardware specs I have dealt with, instruction code lives in dedicated instruction slots on the GPU chip, which are not addressed uniformly with display memory. So for branching, the CPU can offer JMP, JNE and the like, plus calls that automatically handle the stack push/pop for you; those instructions can take an immediate, a register, or some other addressing mode as the target, and that is exactly what makes function pointers implementable. GPU jump instructions can only take an immediate target (think about why), so... The C idiom this rules out is sketched below.
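For reference, this is the kind of construct I mean: a function-pointer "vtable" written in plain C (the struct and function names are invented for illustration). The indirect call at the bottom is exactly what immediate-only jump targets make impossible in device code.

```c
#include <stdio.h>

/* A table of function pointers standing in for C++ virtual functions --
 * the everyday C idiom used in embedded and network code. */
typedef struct shape {
    float (*area)(const struct shape *self);   /* "virtual" method slot */
    float w, h;
} shape;

static float rect_area(const shape *s) { return s->w * s->h; }
static float tri_area(const shape *s)  { return 0.5f * s->w * s->h; }

int main(void)
{
    shape shapes[2] = { { rect_area, 3.0f, 4.0f }, { tri_area, 3.0f, 4.0f } };
    for (int i = 0; i < 2; ++i)
        printf("area = %f\n", shapes[i].area(&shapes[i]));  /* indirect jump through a register */
    return 0;
}
```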
2. About memory allocation. The explanation is simple as well. Why can memory be allocated on the CPU side? Because of the support of the OS and the runtime. What does the operating system do underneath? It maps physical memory into the virtual address space of the current process by adding a PTE (page table entry) to that process's page tables. Why can't the GPU allocate memory? Because there is no runtime support at all: GPU programs essentially run on bare metal. In the future dynamic memory allocation could perhaps be provided by library routines in the video card's BIOS, but a software standard would have to be proposed first;
3. The GPU has no stack! This makes it hard to port almost any of today's multi-threaded-architecture applications (a typical example being a renderer) to the CUDA architecture. Take ray tracing, which any renderer is bound to support: implementing it on a system without a stack takes a lot of tricks. I have read quite a few papers on doing ray tracing on the GPU:
A. either trade space for time (for example, using a uniform grid to represent the scene);
B. or accept a performance cost just so that ray tracing can run on the GPU at all (for example, some of the algorithms in Stanford's "Interactive k-D Tree GPU Raytracing");
C. or use cleverer designs that exploit existing hardware mechanisms on the GPU (such as mipmaps), for example "Fast GPU Ray Tracing of Dynamic Meshes Using Geometry Images"; these, however, come with considerable limitations.
So can ray tracing run on the GPU at a reasonable speed? The fact is that the conclusions of these papers only claim speeds "comparable" to a high-end CPU on the same scenes. The lack of a stack on the GPU really does limit the range of applications CUDA can serve. Perhaps the DX11 hardware that NV and AMD launch next year will be the last generation of stackless GPUs. (The standard workaround is sketched right below.)
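For the record, here is roughly what the workaround in those papers boils down to, sketched with a node layout I made up: the recursion of a tree traversal is flattened into a loop, and a small fixed-size array in registers/local memory plays the part of the missing stack. It works, but the depth has to be bounded in advance, and real tracers still need extra tricks (short stacks, restarts) on top of it.

```cuda
// Node layout and names are made up here; a real tracer would also carry
// ray data, bounding boxes and primitive lists.
struct Node { int left, right, is_leaf; };

__device__ int count_leaves(const Node *nodes, int root)
{
    int stack[32];                 // fixed-size array standing in for the missing call stack
    int top = 0;
    int leaves = 0;
    stack[top++] = root;
    while (top > 0) {
        Node n = nodes[stack[--top]];
        if (n.is_leaf) {
            ++leaves;              // a real kernel would intersect primitives here
        } else if (top <= 30) {    // guard: the "stack" cannot grow past its fixed depth
            stack[top++] = n.left; // a real traversal would cull children against the ray first
            stack[top++] = n.right;
        }
    }
    return leaves;
}
```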
4. The GPU's performance nightmare: random memory access. As I understand CUDA, random access is strongly discouraged in CUDA programs; moreover, the CUDA tutorials spend a great deal of time explaining which memory access patterns are cache friendly, the goal being to reduce the cache miss rate. But isn't memory clocked very high these days, say 2 GHz GDDR5? To see why that does not help you need some knowledge of memory latency, which I will not lecture on here. Why can the CPU cope with all kinds of random memory access and dynamic branching? Because it spends a considerable amount of circuitry on a caching mechanism far more intelligent than the GPU's, on branch prediction, and on out-of-order execution. Compared with the CPU, the GPU lags far behind in this respect, and NVIDIA does not seem to be investing heavily in it either, believing instead that software (driver, compiler) can optimize it away, which I personally think will not work. The contrast between friendly and unfriendly access patterns is sketched below.
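To illustrate, here are two versions of the same trivial copy (kernel names are mine, not from the CUDA SDK). In the first, neighbouring threads touch neighbouring addresses and the hardware can merge each warp's accesses into a few wide transactions; in the second, the addresses are deliberately scattered and the DRAM latency gets paid over and over, no matter how high the nominal memory clock is.

```cuda
// Same trivial copy, two access patterns.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                    // neighbouring threads read neighbouring addresses
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];     // reads jump around global memory
}
```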
That is all for now. In a word, my personal view is: CUDA departs so far from traditional programming thinking, and is incompatible with so many algorithms, that its range of application is severely limited. At present CUDA is really only suitable for static image processing, and for video encoding and decoding it can only be used in certain stages.