How can I run a Hadoop task on a GPU? ParallelX may...

Last Update:2014-05-08 Source: Internet

Author: User

For large-scale, compute-intensive algorithms, the MapReduce paradigm does not always perform well. To address this bottleneck, a small startup team has built a product called ParallelX, which uses the GPU's computing power to speed up Hadoop tasks significantly.

Tony Diepenbrock, co-founder of ParallelX, described it as a "GPU compiler that converts code written in Java into OpenCL and runs it on the Amazon AWS GPU cloud." The final product is a service similar to Amazon Elastic MapReduce, except that it will use EC2 GPU instance types.

Amazon is certainly not the only cloud provider that offers GPU servers. Other companies, such as IBM/SoftLayer and Nimbix, also offer servers with NVIDIA GPUs. However, when asked whether ParallelX will support cloud providers other than Amazon, Tony replied: "No, but we will have an SDK for customers who run in-house Hadoop clusters. Most GPU cloud providers offer GPUs in HPC clouds, but we want to use GPUs in the cloud at a relatively low price. After all, that is what Hadoop was originally designed for: cheap commodity hardware."

To understand what the ParallelX compiler can do, it helps to know that different GPUs support different parallel computing platforms, such as CUDA or OpenCL. Tony explained how ParallelX works: "The compiler converts JVM bytecode to OpenCL 1.2 code, which the OpenCL compiler then compiles into shader assembly to run on the GPU. Some FPGA hardware can also run OpenCL code. However, support for generalized parallel hardware may take some time." ParallelX does not support reflection or native calls in Java source code, but its goal is for developers to make as few changes as possible to the code of their MapReduce tasks.

When the ParallelX team began looking at throughput for I/O-bound tasks, Tony noted that the product "also supports real-time processing, queries expressed in Pig and Hive, and streaming of large datasets for I/O-bound tasks. In our tests, using our pipeline framework, I/O throughput can almost reach the level of GPU computing throughput."

Although the ParallelX team is currently focusing on Amazon's Hadoop distribution, it plans to support other popular Hadoop distributions, such as Cloudera's CDH. Improving Hive and Pig on these commercial distributions with ParallelX would undoubtedly be very beneficial.

ParallelX has an unusual origin story. In an article, Tony recounted the history of this two-and-a-half-year epic: first a social network built for a community, then a widget plug-in for Facebook, then a tool for detecting code plagiarism. These projects had something in common: almost all of them involved graph analysis and GPU-based algorithms, and the idea for ParallelX emerged naturally from them.

ParallelX is suitable for many different workloads, but it focuses mainly on heavy analytics in high-performance computing and graph processing, such as machine learning. The ParallelX team gave an example to illustrate its capabilities: it can cluster a large community network on a single GPU within one second, a job that previously took one hour on six computers running in parallel. In principle, there is no restriction: any program written for MapReduce can be compiled with ParallelX into code that runs on the GPU.

The ParallelX team plans to release data and a white paper in the future to demonstrate how this "Hadoop-to-GPU" compiler performs on real-world workloads. Reactions to the announcement have been mixed. Some people are waiting to read the white paper before deciding whether to switch to ParallelX. After the news was posted on Hacker News, one comment there expressed a similar view: "Extraordinary claims require extraordinary evidence."

For now, developers can use Aparapi to get a feel for using the GPU with Hadoop. Aparapi is a Java API that converts Java bytecode into OpenCL so that developers can run specific code segments on the GPU. These code segments can be embedded in any MapReduce task written in Java.
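
The kind of code segment that suits bytecode-to-OpenCL translation can be sketched in plain Java. The sketch below is not the Aparapi or ParallelX API: `KernelStyleSketch`, `squareAndScale`, and `launch` are hypothetical names chosen for illustration, and a parallel CPU loop stands in for the GPU launch. It only shows the shape such code tends to take: a body executed once per index that touches nothing but primitive arrays and uses no reflection or native calls.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical illustration only; none of these names come from Aparapi or ParallelX.
final class KernelStyleSketch {

    // Body of one "work item": reads in[gid] and writes out[gid], nothing else.
    static void squareAndScale(int gid, float[] in, float[] out, float scale) {
        out[gid] = in[gid] * in[gid] * scale;
    }

    // CPU stand-in for a GPU launch: runs the body once per index, in parallel.
    // Each index writes a distinct array slot, so the work items are independent.
    static float[] launch(float[] in, float scale) {
        float[] out = new float[in.length];
        IntStream.range(0, in.length)
                 .parallel()
                 .forEach(gid -> squareAndScale(gid, in, out, scale));
        return out;
    }

    public static void main(String[] args) {
        float[] result = launch(new float[] {1f, 2f, 3f}, 0.5f);
        System.out.println(Arrays.toString(result)); // prints [0.5, 2.0, 4.5]
    }
}
```

In an actual Aparapi program, the per-index body would typically sit in the `run()` method of a `Kernel` subclass and read its index with `getGlobalId()`. Treat that mapping as a rough guide and check the Aparapi documentation for the exact API.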

ParallelX may become an important step in advancing Hadoop as research groups place increasingly heavy demands on complex algorithms. For example, graph analysis algorithms can achieve excellent performance with the bulk synchronous parallel (BSP) computing model promoted by Apache Hama. If ParallelX can be combined with projects such as Apache Giraph, which runs graph analysis algorithms as MapReduce tasks, it will add a valuable tool to any data scientist's graph analysis toolbox.
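
As a rough illustration of the synchronous parallel model mentioned above (commonly called bulk synchronous parallel, or BSP), the sketch below labels the connected components of a small graph in supersteps. It uses neither the Hama nor the Giraph API: `BspSketch` and `connectedComponents` are hypothetical names, and the supersteps run sequentially on one machine. In each superstep every vertex sends its current label to its neighbors, all vertices wait at a barrier, and each vertex then keeps the smallest label it has seen. The loop stops when a superstep changes nothing.

```java
import java.util.Arrays;

// Hypothetical single-machine illustration of BSP supersteps; not the Hama or Giraph API.
final class BspSketch {

    // adjacency[v] lists the neighbors of vertex v in an undirected graph.
    // Returns, for each vertex, the smallest vertex id in its connected component.
    static int[] connectedComponents(int[][] adjacency) {
        int n = adjacency.length;
        int[] label = new int[n];
        for (int v = 0; v < n; v++) {
            label[v] = v; // every vertex starts with its own id as its label
        }

        boolean changed = true;
        while (changed) { // one loop iteration corresponds to one superstep
            changed = false;

            // Communication phase: each vertex sends its current label to its neighbors.
            // inbox[u] keeps the smallest label seen by u, including its own.
            int[] inbox = label.clone();
            for (int v = 0; v < n; v++) {
                for (int u : adjacency[v]) {
                    inbox[u] = Math.min(inbox[u], label[v]);
                }
            }

            // Barrier, then compute phase: labels are updated only after all messages are in.
            for (int v = 0; v < n; v++) {
                if (inbox[v] < label[v]) {
                    label[v] = inbox[v];
                    changed = true;
                }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        // Two components: 0-1-2 and 3-4.
        int[][] adjacency = {{1}, {0, 2}, {1}, {4}, {3}};
        System.out.println(Arrays.toString(connectedComponents(adjacency))); // prints [0, 0, 0, 3, 3]
    }
}
```

Because labels change only after the barrier, every vertex in a superstep sees the same snapshot of its neighbors. That independence is what makes the per-vertex work a candidate for parallel hardware.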

You can now sign up online for the ParallelX beta with an email address. ParallelX plans to offer a freemium plan that provides access to powerful GPUs with limited storage space.

Hadoop Jobs on GPU with ParallelX

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
