Top 10 Objections to GPU Computing, Reconsidered



Original article: http://www.hpcwire.com/hpcwire/2011-06-09/top_10_objections_to_gpu_computing_reconsidered.html
By Dr. Vincent Natoli, Stone Ridge Technology (http://www.stoneridgetechnology.com/)
Translator: Chen Xiaowei (when reprinting, please credit http://blog.csdn.net/babyfacer/article/details/6902985)

Note: The title of the original article (Top 10 Objections to GPU Computing Reconsidered) has been translated loosely.
Parts of this article are not literal translations; the aim is smooth expression and faithful coverage of the substance. If you find any shortcomings or errors, please kindly point them out.

In recent years, the emergence of the GPU has given an immeasurable boost to the high-performance computing field. Order-of-magnitude performance gains have made GPU computing far more successful than traditional solutions.

A great many technical people love, pursue, or use GPU technology, but even more remain wary of it for various reasons. This article summarizes the most common questions, concerns, and misconceptions in the GPU computing field.

The sections below address these objections one by one, reconsidering them in light of progress in GPU computing and our predictions about future technological development. Of course, GPGPU is not the ultimate solution for every HPC application, but many will find its cost-effectiveness compelling, and it has been applied successfully in many fields, for example seismic imaging, electromagnetics, molecular dynamics, financial pricing and valuation models, and medical imaging.

1. I don't want to rewrite my code or learn a new language.

If you want to use a GPU, you must rewrite your code; but you must also rewrite a serial program to run it in parallel on a CPU. The key question is what your target platform is. If the target is a multi-core CPU, parallel processing rests on a three-level model at the process, thread, and register levels, expressed with MPI, OpenMP/pthreads, and the SSE/AVX extensions. GPU programming with CUDA is in fact no harder than that, and its advantages for both compute-bound and memory-bound code are discussed later.

If you already have parallel code, what do you gain from a GPU implementation? In chip-to-chip comparisons, computation typically speeds up by 5 to 40 times. This is evidenced by the many publications on GPU-based solutions comparing Intel and NVIDIA products over the past few years.

CUDA is an extension of C, easy for experienced programmers to pick up. The existing parallel programming models are unrealistic paths to exascale, and I believe the eventual solution will look more like the CUDA parallel model than the CPU approach. As I said before, CUDA forces the programmer to think about how to map their inevitably parallel problem onto threads. It is a good parallel programming model, one that lets a problem scale well both across multiple GPUs in a single node and across multiple nodes.
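To illustrate the thread-mapping idea without GPU hardware: in the CUDA model each output element is handled by its own logical thread, indexed by a block ID and a thread ID. Below is a plain Python sketch of that indexing scheme (purely illustrative; a real kernel would be written in CUDA C, and the function name here is my own):

```python
def saxpy_grid(a, x, y, block_dim=256):
    """Emulate a CUDA-style 1D grid: one logical thread per element,
    indexed as blockIdx * blockDim + threadIdx."""
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim   # ceil-divide, as in CUDA launches
    out = [0.0] * n
    for block_idx in range(grid_dim):             # on a GPU these loops run in parallel
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            if i < n:                             # the usual bounds guard in a kernel
                out[i] = a * x[i] + y[i]
    return out

print(saxpy_grid(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```

The two nested loops are exactly what the GPU replaces with hardware parallelism: each (block, thread) pair becomes an independent thread, which is why the mapping of problem to threads is the central design decision.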

The academic community has already produced notable work in this direction, such as Global Memory for Accelerators (GMAC), and the commercial world offers the highly scalable HueSpace API (from Hue, a company based in Oslo, Norway) and its sister company Headwave, which focuses on GPU applications for oil and gas.


2. I don't know what performance can be achieved with GPU computing.

HPC code is generally either compute-bound or memory-bound. For compute-bound code, we can compare the NVIDIA Fermi M2090 against an Intel Westmere. The Fermi has 512 cores at 1.3 GHz; the Westmere has 6 cores at 3.4 GHz. On the product of cores and clock, the former is about 32 times the latter. If your CPU code uses the SSE instructions effectively, that may buy a further 4x on the CPU side, leaving the GPU about 8 times faster (close to the ratio of peak GFLOPS).

For memory-bound code, GPU memory bandwidth is 177 GB/s against roughly 32 GB/s for the CPU, 5.5 times higher. So if your code is compute-bound, expect GPU performance around 5 times the CPU's (against highly SSE-optimized code) and up to 20 times for some specific codes. If your code is memory-bound, expect an improvement of about 5 times in a chip-to-chip comparison.
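The ratios quoted above follow from simple arithmetic on the published chip figures; as a sanity check (an estimate only, not a benchmark):

```python
# Rough chip-to-chip ratios, using the figures quoted in the text.

# Compute-bound: core count x clock (GHz)
gpu_throughput = 512 * 1.3      # Fermi M2090: 512 cores at 1.3 GHz
cpu_throughput = 6 * 3.4        # Westmere: 6 cores at 3.4 GHz
raw_ratio = gpu_throughput / cpu_throughput
sse_ratio = raw_ratio / 4       # assume SSE gives the CPU a 4x boost

# Memory-bound: bandwidth ratio, GB/s
bw_ratio = 177 / 32             # M2090 vs. Westmere

print(f"raw compute ratio : {raw_ratio:.0f}x")   # ~33x, the text rounds to 32x
print(f"vs. SSE-tuned CPU : {sse_ratio:.0f}x")   # ~8x
print(f"bandwidth ratio   : {bw_ratio:.1f}x")    # ~5.5x
```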

When discussing parallel solutions, it is helpful to consider marginal costs.

  • If the code is memory-bound, consider the cheapest way to add bandwidth: adding a GPU card costs roughly $15 per GB/s, while adding a node costs nominally about $80 per GB/s. The latter also adds compute capability and operating-system overhead.
  • If the code is compute-bound, the calculation is similar, yielding a marginal cost per GFLOPS.
  • If the code has both compute-bound and memory-bound phases, as most code does, the GPU can hide memory latency while also raising compute throughput.
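The per-GB/s figures in the first bullet can be sketched numerically. The card and node prices below are my own illustrative assumptions, chosen only to land near the article's ~$15 and ~$80 marks; actual prices vary:

```python
# Marginal cost of memory bandwidth, with assumed (not quoted) prices:
# a ~$2500 GPU card adding ~170 GB/s, versus a ~$5000 node adding ~64 GB/s.
gpu_card_price, gpu_card_bw = 2500, 170     # USD, GB/s  (assumption)
node_price, node_bw = 5000, 64              # USD, GB/s  (assumption)

print(f"GPU card : ${gpu_card_price / gpu_card_bw:.0f} per GB/s")   # ~$15
print(f"CPU node : ${node_price / node_bw:.0f} per GB/s")           # ~$78, i.e. ~$80
```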

3. PCIe bandwidth will seriously affect my performance

Some question GPU computing efficiency on the grounds of the PCIe bandwidth limit; this is really a question of computational intensity. Computational intensity has several definitions; one is the number of floating-point operations performed per datum transferred to the GPU board. Each datum shipped across PCIe must clear a threshold of work for the transfer to pay off.

For example, PCIe 2.0 x16 bandwidth is about 6 GB per second, so filling the 6 GB of memory on an M2090 board takes about one second. The M2090 is a floating-point monster with a double-precision peak of 665 GFLOPS, able to chew through an enormous amount of data per second. In this example, if you want PCIe transfer time to be no more than one tenth of compute time (so that transfer does not limit computation), the M2090 must perform thousands of floating-point operations on each datum transferred.
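The "thousands of operations" claim can be made concrete with the figures above (6 GB/s over PCIe, 665 GFLOPS peak), in a rough model that ignores latency and overlap:

```python
pcie_bw = 6e9        # bytes/s over PCIe 2.0 x16
peak_flops = 665e9   # M2090 double-precision peak, FLOP/s
bytes_per_value = 8  # one double

# Time to ship N doubles vs. time to do k flops on each:
#   transfer = N * 8 / pcie_bw,   compute = N * k / peak_flops
# Requiring transfer <= 0.1 * compute and solving for k:
k_min = (bytes_per_value / pcie_bw) * peak_flops / 0.1
print(f"need at least {k_min:.0f} flops per transferred double")  # ~8900
```

Roughly nine thousand operations per double, which is why only algorithms with substantial data reuse cross the threshold comfortably.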

In addition, CUDA allows PCIe transfers to be overlapped asynchronously with computation, hiding some or all of the transfer time. Successful cases include Finite Difference Time Domain (FDTD) algorithms in physics and the N² particle-particle interactions of molecular dynamics, both of which achieve significant data reuse and high computational intensity.

Some algorithms are less effective on GPUs, such as a simple vector product, which does very little computation per datum. If a problem requires cooperation across multiple GPUs, data-transfer time should be minimized as far as possible.

4. What about Amdahl's law?

Amdahl's law quantifies the fact that if you accelerate one part of a large serial code, you will see little overall gain unless you accelerate the dominant part. Simply put, if 50% of a program's processing must be serial, the speedup can be at most 2x regardless of how many threads are available; if 10% must be serial, the speedup tops out at 10x. Amdahl's law also quantifies the efficiency cost of serialization: on a system with 10 processors, a 10%-serial program can speed up by at most 5.3x (53% utilization); with 100 processors the number reaches only 9.2x (9% utilization). No amount of hardware pushes the speedup past 10x (see: http://sesame.iteye.com/blog/428011).
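Amdahl's law as used above is a one-line function, and the figures in the paragraph fall out directly:

```python
def amdahl_speedup(serial_fraction: float, n_procs: float) -> float:
    """Speedup of a code whose serial_fraction cannot be parallelized,
    run on n_procs processors (Amdahl's law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)

print(amdahl_speedup(0.10, 10))    # ~5.26x on 10 processors (the text's 5.3x)
print(amdahl_speedup(0.10, 100))   # ~9.17x on 100 processors (the text's 9.2x)
print(amdahl_speedup(0.10, 1e12))  # approaches but never reaches 10x
```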

Given this, the most effective counter to the objection is the observation that to get performance out of any modern computer architecture, all code must be parallelized as extensively as possible and serial code minimized, whether the platform is a CPU or a GPU. The question is simply: do you want to parallelize your code on the CPU or on the GPU?

5. What if NVIDIA fails?

The history of HPC is full of supercomputing companies that strove to push parallel computing to a new level, such as Thinking Machines, MasPar, KSR, and Mitrion. Their hard work and creative thinking give us a deep understanding of what is feasible and what is not; they deserve our thanks and respect.

NVIDIA, however, is not a supercomputing company. With revenue of nearly $5 billion, it earns most of its income from graphics cards and embedded processors sold into the PC gaming market. This relative independence is an advantage for HPC: even if all HPC use of GPUs disappeared, NVIDIA would still do fine as long as serious gamers keep buying its products. In fact, NVIDIA holds a stronger, safer market position than the HPC stalwarts ever did.

Furthermore, NVIDIA has published its vision and a six-year technology roadmap; reading between the lines, its ambition is to move the GPU from its traditional role as a graphics accelerator to the center of the computer architecture (BBF note: that is, taking the place of the CPU). Along this path it plans to release ever more capable compute engines.

(BBF note: OpenCL is also an option here. It is backed by a consortium and is similar to CUDA, but it is still quite immature.)

6. The GPU board cannot provide enough memory for my problem.

The M2090 and M2070 boards carry 6 GB of on-board memory, which can be a problem for algorithms whose data exceeds that limit. Using several cards in one node for parallel processing can work around it (the author uses a Dell C410x PCIe chassis holding 16 NVIDIA GPU cards; details omitted here).

The hardest common case is an algorithm that requires essentially random access to a large array, for instance a large hash table or other random-lookup structure. Current GPU boards have no effective solution for such problems. But memory keeps getting cheaper and denser, and I believe future GPU boards will carry more memory at better cost-effectiveness.

7. I can wait for more CPU cores, or for Knights Corner

More cores help compute-bound applications, but one should realize that as cores are added to the CPU, the same is happening on the GPU. Compare the CPU and GPU roadmaps and the gap in compute and bandwidth persists. Bandwidth-bound problems may fare even worse, since adding cores is easier than adding bandwidth.

Intel announced Knights Corner more than a year ago, recognizing the GPU as a competitor for x86 data-parallel processing. Details remain scarce; we estimate roughly 50 cores at 1.2 GHz, each with a 512-bit vector processing unit and support for 4 parallel threads, which would make it a strong HPC contender. But the development model, price, release date, and much other key information are still unknown.

A common argument is that Knights Corner will succeed because the x86 architecture dominates the HPC field. But those living in the HPC world need a broader market to sustain high-performance computing; graphics may be that market, and NVIDIA and AMD already do well there.

8. I don't like proprietary languages

A proprietary language is one controlled by a single organization; it may evolve in unknown or unwanted directions, or lose that organization's support. CUDA can be classed as such a language. But the advantages of using CUDA are also obvious: 1. it can exploit optimization features unique to NVIDIA hardware; 2. no committee slows decisions about the roadmap; 3. it supports new NVIDIA hardware features sooner.

If proprietary languages are unacceptable in your organization, OpenCL is an excellent non-proprietary choice for development. OpenCL is backed by Apple, NVIDIA, AMD, Intel, and many other major vendors, and offers functional portability across hardware platforms. I stress functional portability as opposed to performance portability, which comes at a cost. OpenCL kernels look quite similar to CUDA kernels; the bigger differences lie in the host-side setup and launch code.

9. I'm waiting for a magic CPU-to-GPU code converter

There is good news and bad news here. The good news is that CPU-to-GPU converters already exist; the bad news is that the code they generate cannot match what an expert writes. You can try the Portland Group's (PGI) PGI Workstation and/or CAPS HMPP Workbench.

10. I have N codes to optimize, but a limited IT budget

To put it bluntly, this is an all-or-nothing dilemma. Adding GPU-enabled nodes to an organization's infrastructure under a fixed budget forces a choice between fewer, more powerful heterogeneous GPU nodes and more numerous, less powerful traditional CPU nodes. For future upgrades, economics pushes some organizations either to go 100% GPU nodes or to skip them entirely. This is especially true for clusters at commercial organizations that run around the clock under market competition. In the worst case, such a complex production IT system requires two versions of everything, CPU and GPU: cluster management scripts, scheduling, compilers, testing and validation, and application code.

Large commercial organizations must weigh ROI when adopting technology. The all-or-nothing argument highlights the dilemma facing some far-sighted, careful organizations: known, quantifiable costs set against the unknown cost of a technology transition. In the end, as with the points above, it comes down to investment (code development, staff skills, new hardware, retraining costs) versus return (performance, scalability, energy consumption).

Each company must work out its own ROI formula for these questions. By traditional financial analysis, a capital investment must benefit shareholders and be weighed against the company's other investment options (BBF note: the translation here is condensed; the point is that all aspects of the investment must be considered).

In short, continued investment in GPU computing in the HPC market has shown significant gains over the last four years. The ten objections above come from individuals and institutions wrestling with these questions. GPGPU is not the solution to every HPC problem, but a technology that can deliver significant performance gains should not be passed over for the wrong reasons.

Finally, organizations should move toward GPU computing because it is not just this year's solution but a considered strategy: one that not only addresses today's costs but also points toward the architectures, programming models, and exascale computing of the future.
