The emergence of the GPU in recent years has played an immeasurable role in advancing high-performance computing. The order-of-magnitude performance gains of GPU computing have produced success stories that traditional solutions cannot match.
Many technologists have embraced, championed, or adopted GPU technology, yet many others, for a variety of reasons, still sit on the sidelines. This article addresses the latter group and summarizes the most common objections, concerns, and subjective assumptions in the field of GPU computing.
The rest of this article attempts to answer these objections by reconsidering them in light of the progress of GPU computing and our forecast of future technological development. Of course, GPGPU is not the ultimate solution for every HPC application, but many have found the technology's price/performance advantage compelling, along with its success in fields such as seismic imaging, electromagnetics, molecular dynamics, financial pricing models, and medical imaging.
1. I do not want to rewrite my code or learn a new language
If you want to use a GPU, rewriting code is unavoidable. In fact, turning your current serial program into a parallel CPU program also requires rewriting. The key question is your target platform. If the target is a multicore CPU, its parallel processing rests on a three-level model at the process, thread, and register levels, and you need MPI, OpenMP/pthreads, and the SSE/AVX extensions. CUDA programming for the GPU is, in fact, no harder than that, and it has the advantage of delivering significant performance gains for both compute-bound and memory-bound code, as we discuss below.
If you already have parallel code, what benefit would a GPU implementation bring? In chip-to-chip comparisons, codes typically see a 5x to 40x improvement. This is borne out by the many published results for GPU-based solutions; over the past few years these comparisons have been based on Intel and NVIDIA products.
CUDA is a C extension that experienced programmers pick up easily. The existing parallel programming models are unrealistic for exascale computing, but I believe the eventual solution will look more like the CUDA parallel model than like CPU parallelism. As noted above, CUDA forces the programmer to think about how to map the irreducible parallelism of a problem onto threads. This is a good parallel programming model, and it lets problems solved on a single GPU scale well to multiple GPUs and multiple nodes.
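The "map your parallelism onto threads" idea can be illustrated even without a GPU: CUDA assigns each element index `i` to a (block, thread) pair via `blockIdx.x * blockDim.x + threadIdx.x`. The sketch below emulates that decomposition in plain Python; it is illustrative only, not GPU code, and the `launch` helper is an invention for this sketch.

```python
# Sketch of CUDA's 1-D index decomposition: every element i of a
# problem of size n is owned by one (block, thread) pair, as in
#   i = blockIdx.x * blockDim.x + threadIdx.x
# Illustrative Python, not GPU code.

def launch(kernel, n, block_dim=256):
    """Emulate a 1-D CUDA launch by walking the grid serially."""
    grid_dim = (n + block_dim - 1) // block_dim   # ceil-divide, as in CUDA
    for block in range(grid_dim):
        for thread in range(block_dim):
            i = block * block_dim + thread
            if i < n:                             # the usual bounds guard
                kernel(i)

# Example: a SAXPY-style update, y[i] = 2*x[i] + y[i]
x = list(range(10))
y = [0.0] * 10

def saxpy(i):
    y[i] = 2.0 * x[i] + y[i]

launch(saxpy, 10)
print(y)   # each element was computed by its own (block, thread) pair
```

The point of the model is that each index is independent work, which is why the same decomposition scales from one GPU to many.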
Academia has produced very good results in this direction, for example Global Memory for Accelerators (GMAC). On the commercial side there is the highly scalable HUESPACE API from HUE, a company based in Oslo, Norway, and its sister company Headwave, which specializes in GPU applications for oil and gas exploration.
2. I do not know what performance to expect from GPU computing
HPC code is generally either compute-bound or memory-bound. For compute-bound code, we can compare the NVIDIA Fermi M2090 with an Intel Westmere: Fermi has 512 cores at 1.3 GHz, while Westmere has 6 cores at 3.4 GHz. Comparing cores times clock, the former is about 32 times the latter. If your CPU code uses the SSE instructions effectively, the CPU side may gain a further 4x, leaving the GPU about 8 times the CPU (close to the ratio of peak GFLOPS).
For memory-bound code, the GPU's memory bandwidth is 177 GB/s versus 32 GB/s for the CPU, a factor of about 5.5. The upshot: if your code is compute-bound, expect GPU performance of 5x the CPU (against code highly optimized with SSE) up to 20x (for certain specific codes); if your code is memory-bound, expect roughly a 5x improvement in a chip-to-chip comparison.
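The chip-to-chip arithmetic above can be sketched in a few lines. The core counts, clock rates, and bandwidths are the figures quoted in this article; real applications will deviate from these back-of-the-envelope ratios:

```python
# Back-of-the-envelope chip comparison using the figures quoted above.

# Compute-bound comparison: cores x clock (GHz)
fermi_m2090 = 512 * 1.3      # NVIDIA Fermi M2090: 512 cores at 1.3 GHz
westmere    = 6 * 3.4        # Intel Westmere: 6 cores at 3.4 GHz

raw_ratio = fermi_m2090 / westmere          # ~32x in core-GHz
sse_ratio = fermi_m2090 / (westmere * 4)    # ~8x if the CPU uses 4-wide SSE well

# Memory-bound comparison: peak bandwidth in GB/s
bw_ratio = 177 / 32                         # ~5.5x

print(f"core-GHz ratio:         {raw_ratio:.1f}x")
print(f"with 4-wide SSE on CPU: {sse_ratio:.1f}x")
print(f"bandwidth ratio:        {bw_ratio:.1f}x")
```

These are peak ratios; the observed 5x to 20x speedups mentioned above sit between the bandwidth-bound and compute-bound limits.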
When discussing parallel solutions, it helps to think in terms of marginal cost.
If the code is memory-bound, you should consider the cheapest way to add bandwidth: adding a GPU card costs roughly $15 per GB/s, while adding a node costs nominally about $80 per GB/s; the latter also adds compute capacity and operating-system overhead.
If the code is compute-bound, an analogous calculation gives the marginal cost per gigaflop.
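The marginal-cost reasoning can be sketched as follows. The card and node prices below are illustrative assumptions chosen to match the article's nominal $15 and $80 per GB/s figures, not vendor quotes:

```python
# Marginal cost of memory bandwidth, using this article's nominal figures.
# The dollar prices are illustrative assumptions, not vendor quotes.

gpu_card_price, gpu_card_bw = 2650.0, 177.0   # assumed card price ($), GB/s
cpu_node_price, cpu_node_bw = 2560.0, 32.0    # assumed node price ($), GB/s

cost_per_gbs_gpu  = gpu_card_price / gpu_card_bw   # ~$15 per GB/s
cost_per_gbs_node = cpu_node_price / cpu_node_bw   # ~$80 per GB/s

print(f"GPU card: ${cost_per_gbs_gpu:.0f} per GB/s")
print(f"CPU node: ${cost_per_gbs_node:.0f} per GB/s")
```

The same division, with peak GFLOPS in the denominator instead of bandwidth, gives the marginal cost per gigaflop for compute-bound code.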
If the code is both compute-bound and memory-bound, as most codes are, the GPU does better at hiding memory latency while also improving compute throughput.
3. PCIe bandwidth will seriously hurt my performance
Some question the computational efficiency of the GPU because of PCIe bandwidth limits, and the answer indeed comes down to computational intensity. Computational intensity has many definitions; a common one is the number of floating-point operations performed per datum transferred. There is a threshold: each piece of data shipped across PCIe to the GPU board must have enough work done on it to justify the cost of the transfer.
For example, PCIe v2.0 x16 delivers about 6 GB per second of usable bandwidth, so filling the 6 GB of memory on an M2090 board takes about one second, while the board computes at a peak of 665 GFLOPS. The M2090 is a floating-point monster and can chew through an enormous amount of work per second. If, in this example, you want the PCIe transfer time to be no more than one-tenth of the compute time (so that the transfer does not materially affect the computation), the M2090 must perform on the order of ten thousand floating-point operations on each value before the next batch of data arrives, so data must be kept on the board as long as possible.
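That threshold can be made concrete with a little arithmetic, using the 6 GB/s effective PCIe bandwidth and the 665 GFLOPS peak quoted above; the 10% target is this article's example, not a general rule:

```python
# How much arithmetic per datum is needed so that PCIe transfer time
# is at most one tenth of compute time (figures from the article).

pcie_bw = 6e9       # effective PCIe v2.0 x16 bandwidth, bytes/s
data    = 6e9       # bytes transferred (fills the M2090's 6 GB)
peak    = 665e9     # M2090 peak, FLOP/s

transfer_time = data / pcie_bw          # 1 second
compute_time  = 10 * transfer_time      # transfer <= 10% of compute

total_flops     = peak * compute_time   # work needed to hide the transfer
doubles         = data / 8              # 8-byte double-precision values
flops_per_value = total_flops / doubles

print(f"required computational intensity: {flops_per_value:.0f} FLOPs per double")
```

This is why algorithms with heavy data reuse (stencils, particle interactions) fare so much better on the GPU than streaming operations that touch each value once.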
In addition, CUDA allows PCIe data transfers to be overlapped asynchronously with computation. Used flexibly, this hides some or all of the PCIe transfer time behind the calculation. Success stories include Finite Difference Time Domain (FDTD) algorithms in physics and the N^2 particle-particle interactions of molecular dynamics, both of which benefit from significant data reuse and high computational intensity.
Some algorithms are less effective on the GPU, for example a simple vector dot product, which does very little computation per datum. If a problem must be coordinated across multiple GPUs, the data-transfer time should be minimized.
4. What about the implications of Amdahl's law?
Amdahl's law quantifies the fact that if you accelerate one portion of a partly serial code, the overall speedup is limited no matter how fast you make that portion. In short, if 50% of a program must run serially, the speedup can at most double, regardless of how many processors are available; if only 10% must run serially, the maximum speedup is close to 10x. Amdahl's law also quantifies the efficiency cost of serial overhead: on a system with 10 processors, a program that is 10% serial can be accelerated at most 5.3x (53% utilization); with 100 processors the limit is 9.2x (9% utilization). This inefficiency makes it impossible to get the full 10x out of ten processors (see: http://sesame.iteye.com/blog/428011).
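Amdahl's law is the formula S(N) = 1 / (s + (1 - s)/N), where s is the serial fraction and N the number of processors. A few lines of code reproduce the numbers above:

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Maximum speedup of a code with the given serial fraction on N processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# 10% serial code on 10 and 100 processors (the cases quoted above):
print(f"10 procs:  {amdahl_speedup(0.10, 10):.1f}x")    # ~5.3x
print(f"100 procs: {amdahl_speedup(0.10, 100):.1f}x")   # ~9.2x

# The asymptotic limit as N grows is 1/s, e.g. 10x for a 10% serial code:
print(f"limit: {1 / 0.10:.0f}x")
```

Note the limit depends only on the serial fraction, which is exactly why minimizing serial code matters more than the choice of processor.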
Given this, the most effective response to the Amdahl objection is the observation that modern computer architectures of every kind improve performance by massively parallelizing as much of the code as possible and reducing serial code as much as possible, whether on a CPU platform or a GPU platform. The question is not whether to parallelize your code, but whether you do it on the CPU or on the GPU.
5. What should I do if NVIDIA goes bankrupt?
HPC history is full of supercomputing companies dedicated to pushing parallel computing to new heights, such as Thinking Machines, Maspar, KSR, and Mitrion. Their hard work, and the creative thinking of the people behind them, gave us a deep understanding of what works and what does not. They deserve our thanks and respect.
But NVIDIA is not a supercomputing company. Most of its nearly $5 billion in revenue comes from graphics cards and embedded processors sold into the PC gaming market. This relative independence is an advantage for HPC: even if every HPC use of the GPU disappeared, NVIDIA would still survive comfortably, as long as there are serious gamers around. In fact, NVIDIA is more competitive in its market than those HPC evergreens ever were, and a safer bet.
Moreover, NVIDIA has published its vision and a six-year technology roadmap, showing its ambition to move the GPU from its traditional graphics-acceleration role to the center of the computer architecture. Along the way, it plans to release ever more powerful compute engines.
(BBF note: In fact, OpenCL is also an option here; it is backed by an industry consortium and is similar to CUDA, but it is still quite immature.)
6. The GPU board cannot provide enough memory for my problem
The M2090 and M2070 boards carry 6 GB of on-board memory. This is a problem for algorithms whose working set exceeds that limit; it can be handled by spreading the data across several cards in a single node. (The author cites the Dell C410 PCIe chassis, which can hold 16 NVIDIA GPU cards, as an example; we will not elaborate here.)
The hardest case is an algorithm that requires essentially random access to a large array, such as a large hash table or other random table lookups. Current GPU boards have no effective solution for such problems. However, memory keeps getting cheaper and denser, and I believe future GPU boards will carry much more cost-effective memory.
7. I can wait for more CPU cores, or for Knights Corner
More cores help compute-bound applications, but one should realize that as cores are added to the CPU, the same thing happens on the GPU. Compare the CPU and GPU roadmaps and you can see the gap in both compute and bandwidth; that situation will continue. For bandwidth-bound problems the outlook is, if anything, worse, because it is easier to add cores than to add bandwidth.
Intel announced Knights Corner more than a year ago, recognizing that the GPU is a competitor to x86 for parallel data processing. Details of Knights Corner are still unknown; we estimate roughly 50 cores at 1.2 GHz, each with a 512-bit vector processing unit and support for 4 parallel threads, which would make it a strong HPC competitor. But the development model, price, release date, and many other key pieces of information have so far not been announced.
Knights Corner may well succeed, given the dominance of the x86 architecture in HPC. But the reclusive scientists of the HPC world are not a large market; a vendor needs a broader market to sustain high-performance computing, and graphics may be that market, where NVIDIA and AMD have done well.
8. I do not like proprietary languages
A proprietary language here means one controlled by a single organization; it may evolve in unknown or undesirable directions, or lose that organization's support. CUDA can be classified as such a language. But the benefits of using CUDA are obvious: 1. it can exploit optimizations specific to NVIDIA hardware; 2. there is no committee slowing down roadmap decisions; 3. it supports new NVIDIA hardware features sooner.
However, if a proprietary language is unacceptable to your organization, OpenCL, which is being developed as a non-proprietary language, is an excellent choice. OpenCL, backed by Apple, NVIDIA, AMD, Intel, and many other well-known vendors, offers portable functionality across hardware platforms. I stress functionality here, because portability comes at a price in performance. OpenCL kernels are quite similar to CUDA kernels; the larger differences lie in the host-side setup and launch code.
9. I am waiting for a magic CPU-to-GPU translation tool to appear
There is good news and bad news here. The good news is that CPU-to-GPU translators already exist; the bad news is that the code they produce cannot compare in performance with code written by an expert. To test them, try The Portland Group's (PGI) PGI Workstation and/or the CAPS HMPP workbench.
10. I have N codes to optimize but a limited IT budget
To put it plainly, this is the dilemma of "all or nothing". Adding GPU-enabled nodes to a fixed-budget organization's infrastructure means choosing between two options: fewer but more powerful heterogeneous GPU nodes, or more but less powerful traditional CPU nodes. For future system upgrades, from an economic point of view, some organizations will either go 100% GPU nodes or skip them entirely. This is especially true for clusters at commercial installations that run around the clock under market competition. Analysis of such an IT infrastructure implies a complex scheduling system and, in the worst case, two versions of everything: cluster management scripts, schedulers, compilers, test and validation, application code, and so on.
Large commercial organizations must also consider return on investment (ROI). The "all or nothing" argument shows how far-sighted, thoughtful organizations face the dilemma of weighing the unknown costs of a technology transition against measurable, known costs. As with point 9, this comes down partly to costs (code development, staff skills, new hardware, retraining) and partly to rewards (performance, scalability, power consumption).
Each company must work out its own ROI formula for these issues. Using traditional financial analysis, capital investment must benefit shareholders and must be weighed against the company's other investment options. (BBF note: translated loosely here; the point is to consider all aspects of the investment.)
In short, GPU computing has won a growing share of investment in the HPC market, with significant benefits over the past four years. The ten objections above come from individuals and organizations who want them resolved. GPGPU is not the solution to every HPC problem, but you should not pass up a performance-enhancing technology for the wrong reasons.
Finally, organizations should take the step toward GPU computing because it is not just this year's solution but a deliberate strategy. This strategy not only addresses today's cost issues but also points toward the best solutions for future architectures, programming models, and exascale computing.
Translator: Chen Xiaowei (please credit the source when reproducing: http://blog.csdn.net/babyfacer/article/details/6902985)