Research on the performance of Hadoop+GPU


Hadoop's parallel processing can multiply performance, and GPUs are taking on an increasingly large share of computing tasks. The Altoros research and development team has been exploring the possibilities of Hadoop+GPU and its implementation in real large-scale systems; this article presents part of their research results.

Hadoop's parallel processing can improve performance many times over. The question now is what happens if part of the computing work is moved from the CPU to the GPU. In theory, processes that are optimized for parallel computing can run 50 to 100 times faster on a GPU than on a CPU. As a big data specialist and PaaS enabler, the Altoros research and development team has been exploring the possibilities of Hadoop+GPU, as well as its implementation in real-world large-scale systems, and this article presents part of their findings. The author, Vladimir Starostenkov, is a senior research and development engineer at Altoros. He has five years of experience implementing complex software architectures, including data-intensive systems and Hadoop-driven applications, and is also interested in AI and machine learning algorithms.

State of the technology:

Over the years, many research projects have applied Hadoop or MapReduce to the GPU. Mars may have been the first successful GPU MapReduce framework. With Mars, the performance of analyzing web data (search and logs) and processing web documents improved by 1.5 to 1.6 times. Following the principles behind Mars, many research institutions have developed similar tools to improve the performance of their data-intensive systems. Use cases include molecular dynamics, mathematical modeling (e.g. Monte Carlo), block-based matrix multiplication, financial analysis, image processing, and so on.

There is also BOINC, a rapidly developing, volunteer-driven middleware system for grid computing. Although it does not use Hadoop, BOINC has become the foundation for accelerating many research projects. For example, GPUGRID is a BOINC-based project for GPU and distributed computing that performs molecular simulations to help us understand the different roles of proteins in health and disease. Most BOINC projects in medicine, physics, mathematics, biology, and other fields could also use Hadoop+GPU technology.

There is therefore a real demand for accelerating parallel computing systems with GPUs. Organizations either invest in GPU supercomputers or develop their own solutions. Hardware vendors such as Cray have released machines equipped with GPUs and with Hadoop preinstalled. Amazon has also launched Amazon Elastic MapReduce (EMR), which lets users run Hadoop on servers equipped with GPUs.

Supercomputers deliver high performance but cost up to millions of dollars, and Amazon EMR is suitable only for projects that last a few months. For larger research projects (lasting years), it is more cost-effective to invest in your own hardware. Even though using GPUs in a Hadoop cluster can increase computing speed, data transfer can create a performance bottleneck. These issues are described in detail below.

Working principle

During data processing, data is inevitably exchanged among the HDD, DRAM, CPU and GPU. The following figure shows how data moves when the CPU and the GPU perform computations together.

Figure: Data exchange between components during data processing

Arrow A: data is transferred from HDD to DRAM (the initial step for both CPU and GPU computation)

Arrow B: the CPU processes data (data flow: DRAM -> chipset -> CPU)

Arrow C: the GPU processes data (data flow: DRAM -> chipset -> CPU -> chipset -> GPU -> GDRAM -> GPU)

The total amount of time required to complete any task includes:

The time required for a CPU or GPU to carry out the computation

The time needed to transfer data between components
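The two components of task time listed above can be written as a simple additive model. The following Python sketch is only an illustration: the function name, its parameters, and the assumption that computation and transfer do not overlap are ours, not part of the original research.

```python
def task_time_seconds(flop_count, device_gflops, bytes_moved, link_gb_per_s):
    """Estimate total task time as compute time plus transfer time.

    Assumes (for illustration) that computation and data transfer
    happen one after the other and do not overlap.
    """
    compute_seconds = flop_count / (device_gflops * 1e9)
    transfer_seconds = bytes_moved / (link_gb_per_s * 1e9)
    return compute_seconds + transfer_seconds


# Hypothetical workload: 1e12 floating-point operations on a 100 GFLOPS
# device, with 2 GB of data moved over a 1 GBps link.
example_total = task_time_seconds(1e12, 100, 2e9, 1)
```

With these made-up numbers the computation takes about 10 seconds and the transfer about 2 seconds, so the transfer share is already noticeable.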

According to Tom's Hardware 2012 CPU charts, average CPU performance ranges from 15 to 130 GFLOPS, while NVIDIA GPUs range from 100 to 3000+ GFLOPS. These are statistical values and depend heavily on the type of task and the algorithm. In some cases, a GPU can make a node 5 to 25 times faster. Some developers claim that if your cluster includes multiple nodes, performance can increase by 50 to 200 times. For example, the Mithra project achieved a 254-fold performance improvement.

Performance bottlenecks:

So how does the GPU affect data transfer? Different types of hardware transfer data at different rates. Supercomputers have been optimized for GPU use, while an ordinary computer or server may be much slower at data transfer. The data transfer rate between a CPU and the chipset is usually between 10 and 20 GBps (point Y in the figure), while the GPU exchanges data with DRAM at between 1 and 10 GBps (point X in the figure). Although some systems reach up to about 10 GBps (PCIe v3), in most standard configurations data flows between GDRAM and DRAM at about 1 GBps. (It is recommended that you measure the actual values on real hardware, because the memory bandwidth at points X and Y, and therefore the data transfer rates along arrows C and B, may differ by roughly a factor of 10.)

Although the GPU provides faster computation, data transfer between GPU memory and CPU memory (point X) is a performance bottleneck. Therefore, for each particular project you need to measure the time spent transferring data to and from the GPU (arrow C) and compare it with the time saved by GPU acceleration. The best approach is to estimate the behavior of a larger system from the measured performance of a small cluster.
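The comparison between transfer time and time saved can be sketched as an effective speedup calculation. This is a minimal Python illustration under stated assumptions: the function name and all numbers are hypothetical, and measured values from your own hardware should replace them.

```python
def effective_speedup(cpu_seconds, gpu_compute_seconds, transfer_bytes, link_gb_per_s):
    """Return CPU time divided by (GPU compute time + transfer time).

    A result above 1.0 means the GPU path is faster overall, even after
    paying for the data transfer along arrow C.
    """
    transfer_seconds = transfer_bytes / (link_gb_per_s * 1e9)
    return cpu_seconds / (gpu_compute_seconds + transfer_seconds)


# Hypothetical task: 100 s on the CPU, 4 s of pure GPU computation
# (a 25x raw speedup), and 10 GB of data to move.
slow_link = effective_speedup(100.0, 4.0, 10e9, 1.0)    # 1 GBps link
fast_link = effective_speedup(100.0, 4.0, 10e9, 10.0)   # 10 GBps link
```

In this made-up case the 25x raw speedup shrinks to roughly 7x on the slow link and to 20x on the fast link, which is why the text recommends measuring the real transfer rate.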

Because data transfer can be quite slow, ideally the amount of input/output data per GPU should be small relative to the number of computations performed on it. Remember two things: first, the type of task must match the capabilities of the GPU; second, the task must be divisible by Hadoop into parallel, independent subprocesses. Complex mathematical calculations (such as matrix multiplication), generation of large numbers of random values, similar scientific modeling tasks, and other general-purpose GPU applications fall into this category.
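The ratio of computation to data moved can be made concrete by counting operations and bytes. The sketch below compares dense matrix multiplication with element-wise vector addition; the operation counts (about 2n^3 for an n x n matrix product, n for a vector sum) are standard, while the function names and the choice of 8-byte double-precision values are our assumptions.

```python
BYTES_PER_DOUBLE = 8


def matmul_flops_per_byte(n):
    """FLOPs per byte transferred for an n x n double-precision matrix product.

    Roughly 2*n^3 operations; two input matrices and one output matrix
    cross the bus, i.e. 3*n^2 values.
    """
    flops = 2 * n ** 3
    bytes_moved = 3 * n ** 2 * BYTES_PER_DOUBLE
    return flops / bytes_moved


def vector_add_flops_per_byte(n):
    """FLOPs per byte transferred for adding two length-n double vectors.

    n additions; two input vectors and one output vector cross the bus.
    """
    flops = n
    bytes_moved = 3 * n * BYTES_PER_DOUBLE
    return flops / bytes_moved
```

For matrix multiplication the ratio grows with n (it equals n/12 under these assumptions), so large matrices amortize the transfer cost. For vector addition it stays at 1/24 regardless of size, which makes it a poor candidate for offloading over a slow link.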

Available Technologies

1. JCuda. The JCuda project provides Java bindings for NVIDIA CUDA and related libraries, such as JCublas, JCusparse (a library for working with matrices), JCufft (Java bindings for general signal processing), JCurand (a library for generating random numbers on the GPU), and so on. However, it works only with NVIDIA GPUs.

2. Java Aparapi. Aparapi converts Java bytecode to OpenCL at run time and executes it on the GPU. Of all the Hadoop+GPU computing approaches, Aparapi and OpenCL have the most promising prospects. Aparapi was developed by AMD Java Labs and released as open source in 2011; some of its practical applications can be seen on the official website of the AMD Fusion Developer Summit. OpenCL is an open, cross-platform standard that is supported by a large number of hardware vendors and allows the same code base to be written for CPUs and GPUs. If a machine has no GPU, OpenCL can run on the CPU instead.

3. Write native code to access the GPU. Native code that accesses the GPU for complex mathematical calculations performs much better than bindings and connectors. However, if you need to deliver a solution in as short a time as possible, use a framework such as Aparapi. Then, if you are not satisfied with its performance, you can rewrite some or all of the code as native code. You can use a C API (NVIDIA CUDA or OpenCL) to create native code that lets Hadoop use the GPU through JNA (for a Java application) or Hadoop Streaming (for a C application).
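For the Hadoop Streaming route mentioned in the last option, a mapper is simply a program that reads records from standard input and writes tab-separated key/value lines to standard output. The sketch below shows that shape in Python. The record format (`key<TAB>v1,v2,...`) and the `kernel` function are hypothetical; `kernel` is a plain CPU stand-in marking the place where a call into native GPU code would go.

```python
import sys


def kernel(values):
    """Hypothetical stand-in for a GPU kernel: a CPU sum of squares."""
    return sum(v * v for v in values)


def map_lines(lines, compute=kernel):
    """Yield 'key<TAB>result' for each input line 'key<TAB>v1,v2,...'.

    The input record format is an assumption made for this example.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, _, payload = line.partition("\t")
        values = [float(x) for x in payload.split(",") if x]
        yield f"{key}\t{compute(values)}"


def main():
    """Entry point for use as a streaming mapper: stdin in, stdout out."""
    for out in map_lines(sys.stdin):
        print(out)
```

A script used as a streaming mapper would end with a call to `main()`. Keeping the record parsing in `map_lines` separate from the computation makes it possible to swap the CPU stand-in for a GPU-backed function without touching the Hadoop-facing part.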

GPU-Hadoop Frameworks

You can also try one of the custom GPU-Hadoop frameworks that appeared after Mars, including Grex, Panda, C-MR, GPMR, Shredder, STEAMMR, and so on. However, these GPU-Hadoop frameworks were mostly built for specific research projects and are no longer supported. It would also be difficult to apply a framework built for Monte Carlo simulation to, say, a bioinformatics project based on other algorithms.

Processor technology is also evolving. Revolutionary new architectures are appearing in the Sony PlayStation 4, Adapteva's multi-core microprocessors, ARM's Mali GPU, and so on. Both Adapteva and the Mali GPU will be compatible with OpenCL.

Intel has also launched the Xeon Phi coprocessor, which supports OpenCL. It is a 60-core coprocessor with an x86-like architecture that uses the PCI Express standard. Its performance can reach 1 TFLOPS in double-precision calculations, while its power consumption is only 300 W. Today's fastest supercomputer, Tianhe-2, uses this coprocessor.

It is difficult to say which of these frameworks will become mainstream in high-performance and distributed computing. As they continue to improve, our understanding of big data processing may change.
