Heterogeneous computing:
Heterogeneous computing uses different types of processors to handle different types of computing tasks. Common computing units include CPUs, GPGPUs, GPDSPs, ASICs, FPGAs, and other kinds of processor cores.
Many accelerator cards and coprocessors are used to increase system performance. The common ones are:
GPGPU is the most common accelerator card, connected over PCI-E. The GPU started out as a graphics processing card (the graphics card) and gradually evolved into an accelerator card. In 2010, Tianhe-1 took first place on the TOP500 with a CPU+GPU heterogeneous architecture. The original Tianhe-1 used AMD GPUs, while Tianhe-1A used NVIDIA GPU cards.
Xeon Phi is a coprocessor produced by Intel, also connected via PCI-E. Its goal was to compete with the GPU, since graphics is not one of Intel's strengths. Tianhe-2 uses Xeon E5 CPUs plus Xeon Phi coprocessors.
The FPGA accelerator card also appeared in 2014. At SC14, Alpha Data presented the ADM-PCIE-7V3 FPGA accelerator board, which carries a Xilinx Virtex-7 series FPGA and connects to the host CPU over PCI-E. FPGAs were first used to validate logic designs, that is, as development boards for verifying a design before tape-out and ASIC fabrication. Now the FPGA is also being used as a plug-and-play accelerator card.
GPDSP: because the United States banned Intel from exporting the Xeon Phi, NUDT (the National University of Defense Technology) proposed the GPDSP as a coprocessor; it is still in development.
The following is reproduced:
****************************************************************************************************
Heterogeneous computing refers to carrying out computing tasks on a system that combines different kinds of processing units, whether a single standalone computer supporting both SIMD and MIMD execution or a set of independent computers interconnected by a high-speed network. A heterogeneous computing architecture uses at least two types of processors: the general-purpose CPU handles complex scheduling logic and serial tasks, while the accelerator handles highly parallel tasks to provide computational acceleration. In high-performance computing in particular, the mainstream approach pairs general-purpose CPUs with GPU or many-core accelerators. Take the US Titan and China's Tianhe-2 as examples. Titan has 18,688 compute nodes, each consisting of a 16-core AMD Opteron 6274 processor and an NVIDIA Tesla K20 accelerator, totaling 299,008 CPU cores. Tianhe-2 has 16,000 compute nodes, each consisting of 2 Intel E5-2692 processors and 3 Xeon Phi cards, for a total of 32,000 E5-2692 and 48,000 Xeon Phi. Besides Titan and Tianhe-2, the Dawning 6000 and Tianhe-1 also use heterogeneous computing architectures.
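Spelled out, those totals follow directly from the node counts quoted above: Titan has 18,688 nodes x 16 CPU cores per node = 299,008 CPU cores, and Tianhe-2 has 16,000 nodes x 2 = 32,000 E5-2692 processors plus 16,000 x 3 = 48,000 Xeon Phi cards.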
Large-scale scientific computing can generally be parallelized to a high degree: the work can be divided into a huge number of small computing tasks that many small cores execute in parallel. As a result, there are three main options for the accelerator:
The first is to use a GPGPU as the accelerator. Because the GPU has a very wide parallel structure (a massive array of SIMD compute units) and high-end GPUs integrate an enormous number of resources, it can easily reach a very high theoretical double-precision floating-point throughput. NVIDIA's latest accelerator card, the K80, consumes 300 W and delivers up to 2.9 TFlops of double-precision floating point.
The second is to use a many-core chip as the accelerator. On one hand, floating-point and vector instructions (such as Intel's AVX and FMA, or Godson's LoongSIMD) are added to raise floating-point performance (see the sketch after this list); on the other hand, the core count is increased. For example, Intel's first-generation Xeon Phi has 60 cores, about 1 TFlops of double-precision performance, and 300 W power consumption. Godson also had a 16-core Godson 3C program, but it was abandoned because of schedule slips.
The third is to use a GPDSP as the accelerator. NUDT has developed the Matrix 2000 to replace Intel's Xeon Phi. The Matrix 2000 reaches up to 2.4 TFlops of double-precision floating point at 200 W of power consumption. Although it falls short of the second-generation Xeon Phi (about 3 TFlops double precision), its performance and performance-per-watt comfortably exceed the first-generation Xeon Phi used in Tianhe-2, making it an ideal replacement for the Xeon Phi compute cards in the Tianhe-2A upgrade program.
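As a minimal sketch of the vector-instruction point in option two (this example is not from the article; it assumes an x86-64 CPU with AVX2/FMA support and compilation with something like gcc -O2 -mavx2 -mfma), a single fused multiply-add instruction operates on four double-precision values at once, which is how wide vector units raise per-core floating-point throughput:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        double a[4] = {1.0, 2.0, 3.0, 4.0};
        double b[4] = {0.5, 0.5, 0.5, 0.5};
        double c[4] = {10.0, 10.0, 10.0, 10.0};
        double r[4];

        __m256d va = _mm256_loadu_pd(a);   /* load 4 doubles into one 256-bit register */
        __m256d vb = _mm256_loadu_pd(b);
        __m256d vc = _mm256_loadu_pd(c);

        /* r = a * b + c for all 4 lanes, done by a single FMA instruction */
        __m256d vr = _mm256_fmadd_pd(va, vb, vc);
        _mm256_storeu_pd(r, vr);

        for (int i = 0; i < 4; i++)
            printf("%.1f ", r[i]);         /* expected: 10.5 11.0 11.5 12.0 */
        printf("\n");
        return 0;
    }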
****************************************************************************************************
Advantages and disadvantages of GPGPU and GPDSP
The GPU's wide parallel structure can reach very high theoretical double-precision floating-point throughput (NVIDIA's K80 accelerator card reaches 2.9 TFlops double precision). But because the CPU and GPU programming models are inconsistent, GPGPU programming is inconvenient: it can only run OpenCL, OpenACC, or CUDA code, not ordinary OpenMP parallel code. In addition, a GPGPU accelerator card does not share memory with the CPU, so the programmer must copy data explicitly, which slows down data access. As a result, the GPGPU is relatively cumbersome to program, relatively inefficient to develop for, and less versatile, but it offers a high performance-per-watt ratio.
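To make the explicit-copy point concrete, here is a minimal sketch (not from the article; it assumes an OpenACC-capable compiler such as nvc for the accelerator version and any OpenMP compiler for the CPU version) of the same loop written both ways. The OpenACC data clauses spell out the host-to-device transfers that the shared-memory OpenMP version never needs:

    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 2.0f * i; }

        /* Accelerator version: x and y must be copied across PCI-E to
           device memory, and y copied back when the kernel finishes. */
        #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = 3.0f * x[i] + y[i];

        /* CPU version: threads simply share the host's memory,
           so no explicit data movement is required. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = 3.0f * x[i] + y[i];

        printf("y[10] = %f\n", y[10]);
        return 0;
    }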
GPDSP is NUDT's technology reserve and secret weapon for dealing with the US ban on Xeon Phi exports. NUDT started GPDSP research and development no later than 2013. The Matrix 2000 released this year uses a 40 nm process, has 16 cores running at 1 GHz, delivers 2.4 TFlops of double-precision floating point, and consumes 200 W. So although the Matrix 2000 trails the GPGPU in raw performance, owing to domestic manufacturing-process and design limitations, it is already slightly better than the GPGPU in performance per watt (2.4 TFlops/200 W versus 2.91 TFlops/300 W), and it is clearly superior to the first-generation Xeon Phi compute cards currently used in Tianhe-2 (2.4 TFlops/200 W versus 1 TFlops/300 W).
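Working out those ratios from the figures quoted above: the Matrix 2000 delivers 2.4 TFlops / 200 W = 12 GFlops/W, the K80 2.91 TFlops / 300 W = about 9.7 GFlops/W, and the first-generation Xeon Phi 1 TFlops / 300 W = about 3.3 GFlops/W.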
Compared with the GPGPU, the GPDSP is closer to a CPU: it can run an operating system (Linux or another real-time kernel) on its own, and it is somewhat easier to program than a GPGPU (though in practice still much harder than a CPU). The Matrix 2000 is a many-core processor with branching capability, similar in kind to the Xeon Phi compute card, and in theory, extending the GPDSP compiler with some instructions would also allow it to run OpenMP code. Of course, the GPDSP can also run OpenCL and OpenACC parallel (heterogeneous) code.
As a result, although the GPDSP delivers less raw performance than the GPGPU, it is already slightly better in performance per watt (2.4 TFlops/200 W versus 2.91 TFlops/300 W), and it outperforms the GPGPU in programming efficiency and versatility.
When a GPU is used for parallel computing, the TMUs, ROPs, and other units inherited from the traditional rendering architecture are useless, yet they still occupy transistor resources. A DSP, by contrast, is a pure vector machine: it does not spend transistors on raster rendering the way a GPU does, and its pipeline structure is not burdened by graphics.
Although the GPU's wide parallel structure and the very high integration of high-end GPUs make it easy to reach a very high theoretical double-precision floating-point throughput, at the same process node and level of integration a DSP that discards the graphics portion uses its transistors more efficiently, and its memory-access efficiency is higher than that of the GPU's traditional graphics-rendering memory path.
As a result, the GPDSP has an inherent efficiency advantage over the GPGPU: it can most likely borrow the scheduling and execution structure of the GPU's shader units without dragging along so much legacy baggage from the GPU. In developing Tianhe-1 and Tianhe-2, NUDT tried both many-core processors and GPUs as accelerators, so the choice of the GPDSP route is presumably the result of careful deliberation. Given that China lags behind NVIDIA, IBM, Intel, and the other foreign giants in manufacturing process and in designing very large, highly integrated chips, the GPDSP route is an effective way to narrow the absolute performance gap with foreign products.
According to information released by NUDT, the main I/O structure of Tianhe-2 is retained, the compute-node processor is still the E5-2692 v2, and the number of compute nodes is increased to 18,000. With 2 E5 processors and 3 accelerators per compute node, the Tianhe-2A therefore needs 36,000 E5 processors and 54,000 Matrix 2000 cards; the 54,000 Matrix 2000 alone give a theoretical floating-point peak of 129.6 PFlops.
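That peak figure follows from the per-card number quoted earlier: 18,000 nodes x 3 cards = 54,000 Matrix 2000, and 54,000 x 2.4 TFlops = 129,600 TFlops = 129.6 PFlops, not counting the contribution of the E5 CPUs.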
To turn this blueprint into reality, the main technical difficulty is no longer the design and manufacture of the chip but the software stack, including the GPDSP driver, operating system, compiler, and base libraries, which is an enormous amount of work.
Heterogeneous computing with various accelerator cards