The MapReduce paradigm does not always perform well on large-scale, compute-intensive algorithms. To address this bottleneck, a small startup team has built a product named ParallelX, which leverages the computing power of GPUs to significantly speed up Hadoop jobs.
Tony Diepenbrock, co-founder of ParallelX, describes it as a "GPU compiler that translates code written in Java into OpenCL and runs it on the Amazon AWS GPU cloud." The end product is a service similar to Amazon Elastic MapReduce, except that it runs on EC2 GPU instance types.
Amazon is certainly not the only cloud provider offering GPU servers; companies such as IBM/SoftLayer and Nimbix also offer servers with NVIDIA GPUs. However, when asked whether ParallelX will support cloud providers other than Amazon, Tony replied, "No, but we will have an SDK for customers who run in-house Hadoop clusters. Most GPU cloud providers offer their GPUs in an HPC cloud, but we want to make GPUs in the cloud available at a relatively low price. After all, that was the original idea behind Hadoop: cheap commodity hardware."
To better understand what the ParallelX compiler can do, it helps to know that different types of GPUs come with different parallel computing platforms, such as CUDA or OpenCL. Tony explained which of these ParallelX targets: "The compiler translates JVM bytecode into OpenCL 1.2 code, which the OpenCL compiler then compiles into shader assembly to run on the GPU. There is also FPGA hardware that can run OpenCL code, but support for generalized parallel hardware may take some time." Although ParallelX does not support reflection or native calls in Java source code, its goal is to keep the changes developers must make to their MapReduce job code to a minimum: the fewer, the better.
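The article does not spell out ParallelX's exact input requirements, but the constraints above suggest the kind of code it is aimed at. The following is a minimal, illustrative sketch (the class name and per-line CSV format are assumptions) of a compute-heavy Hadoop mapper written in plain Java, with tight loops over primitive arrays and no reflection or native (JNI) calls:

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative compute-heavy mapper: primitive arrays, tight numeric loops,
// no reflection and no JNI -- the style of bytecode a JVM-to-OpenCL
// translator could plausibly parallelize.
public class VectorNormMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed input format per line: id,v1,v2,...,vn
        String[] parts = value.toString().split(",");
        double[] vector = new double[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vector[i - 1] = Double.parseDouble(parts[i]);
        }

        // Data-parallel inner loop over a primitive array.
        double norm = 0.0;
        for (int i = 0; i < vector.length; i++) {
            norm += vector[i] * vector[i];
        }

        context.write(new Text(parts[0]), new DoubleWritable(Math.sqrt(norm)));
    }
}
```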
As the ParallelX team looked into improving the throughput of I/O-bound jobs, Tony noted that their product "will also support real-time processing, queries expressed in Pig and Hive code, and streaming of large datasets for I/O-bound jobs. In our tests using our pipelining framework, I/O throughput nearly keeps up with the GPU's compute throughput."
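ParallelX's pipelining framework itself is not described in the article, but the underlying idea of overlapping I/O with compute is standard. A minimal sketch in plain Java (all names and batch sizes are hypothetical): one thread reads input batches while another processes them, so the reader is never idle waiting for the "GPU" stage and vice versa.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Generic producer/consumer pipeline: reading and processing overlap
// instead of alternating. The read and compute steps are stand-ins.
public class OverlappedPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<float[]> queue = new ArrayBlockingQueue<>(4);
        final float[] POISON = new float[0];           // end-of-stream marker

        Thread reader = new Thread(() -> {
            try {
                for (int batch = 0; batch < 100; batch++) {
                    float[] data = new float[1 << 20];   // pretend this came from HDFS
                    queue.put(data);
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread worker = new Thread(() -> {
            try {
                for (float[] data = queue.take(); data != POISON; data = queue.take()) {
                    float sum = 0f;
                    for (float v : data) sum += v;       // stand-in for a GPU kernel launch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        reader.start();
        worker.start();
        reader.join();
        worker.join();
    }
}
```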
Although the ParallelX team is currently focusing its efforts on Amazon's Hadoop distribution, it plans to support other popular Hadoop distributions, such as Cloudera's CDH; their enhancements to Hive and Pig would no doubt also benefit from running in a ParallelX environment.
ParallelX has had a unique evolution. In a post, Tony recounted the history of this two-and-a-half-year project: first a social network built for a community, then a widget plug-in for Facebook, and then a tool for detecting code plagiarism. These projects shared common threads, namely graph analysis and GPU-based algorithms, out of which the concept of ParallelX emerged naturally.
ParallelX is suitable for many different workloads, but its main focus is on analysis-heavy, high-performance computing and graph processing, such as machine learning. To illustrate its capabilities, the ParallelX team gave an example: clustering a large community network, which previously took an hour on six machines running in parallel, can be done on a single GPU in under a second. In practice there is no restriction: any program written for MapReduce can be compiled by ParallelX into code that runs on the GPU.
The ParallelX team plans to release data and a white paper in the future demonstrating how this Hadoop-to-GPU compiler performs on real-world workloads. Reactions to the announcement have been somewhat mixed; some people are waiting to read the white paper before deciding whether to switch to ParallelX. After the news appeared on Hacker News, comments along those lines could be found in the discussion, for example: "Extraordinary claims require extraordinary evidence."
In the meantime, developers can get a taste of GPU power on Hadoop by using Aparapi, a Java API that converts Java bytecode into OpenCL so that specific code blocks can run on the GPU; these blocks can be embedded in any MapReduce job written in Java.
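As a rough illustration of what that looks like, here is a minimal Aparapi kernel (a sketch; depending on the release, the package is `com.aparapi` or the older `com.amd.aparapi`). The bytecode of the kernel's `run()` method is translated to OpenCL at runtime and executed on the GPU, falling back to a Java thread pool if no OpenCL device is available; such a kernel could be invoked from inside a mapper or reducer.

```java
import com.aparapi.Kernel;
import com.aparapi.Range;

public class VectorAddExample {
    public static void main(String[] args) {
        final int size = 1_000_000;
        final float[] a = new float[size];
        final float[] b = new float[size];
        final float[] sum = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = i;
            b[i] = 2f * i;
        }

        // Aparapi translates run() into an OpenCL kernel and launches
        // one work-item per element of the Range.
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();
                sum[gid] = a[gid] + b[gid];
            }
        };
        kernel.execute(Range.create(size));
        kernel.dispose();
    }
}
```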
ParallelX could prove to be a far-reaching step forward for Hadoop as research groups place ever-greater demands on complex algorithms. For example, graph analysis algorithms can achieve excellent performance using the Bulk Synchronous Parallel (BSP) computing model promoted by Apache Hama. If ParallelX can be combined with projects such as Apache Giraph, which runs graph analysis algorithms as MapReduce jobs, it would add a valuable tool to any data scientist's graph analysis toolbox.
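To make the BSP model concrete, the following is a minimal single-source shortest paths computation following Giraph's standard programming model (a sketch against the Giraph 1.1 `BasicComputation` API; the class name and source vertex id are illustrative). In each superstep a vertex takes the smallest distance it has received and propagates distance plus edge weight to its neighbours:

```java
import java.io.IOException;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Single-source shortest paths in Giraph's BSP model.
public class ShortestPathsComputation extends
        BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final long SOURCE_ID = 1L;   // illustrative source vertex

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
            vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
        }
        double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
        for (DoubleWritable message : messages) {
            minDist = Math.min(minDist, message.get());
        }
        if (minDist < vertex.getValue().get()) {
            vertex.setValue(new DoubleWritable(minDist));
            for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
                sendMessage(edge.getTargetVertexId(),
                            new DoubleWritable(minDist + edge.getValue().get()));
            }
        }
        vertex.voteToHalt();   // reactivated only if a new message arrives
    }
}
```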
You can already sign up online with an email address for the ParallelX beta. ParallelX plans to offer a freemium plan that provides access to powerful GPUs with a limited amount of storage.