GridGain recently released its In-Memory Accelerator for Hadoop at Spark Summit 2014, bringing the benefits of in-memory computing to Hadoop applications.
The technology consists of two components: an in-memory file system compatible with Hadoop HDFS, and a MapReduce implementation optimized for in-memory processing. Together they extend disk-based HDFS and traditional MapReduce to deliver better performance for big-data processing.
The In-Memory Accelerator eliminates the overhead associated with the JobTracker and TaskTracker in the traditional Hadoop architecture, and it works with existing MapReduce applications without any changes to the original MapReduce code or to the HDFS and YARN environment.
Below is InfoQ's interview with GridGain CTO Nikita Ivanov about the In-Memory Accelerator for Hadoop and its architectural details.
InfoQ: The key features of the In-Memory Accelerator for Hadoop are GridGain's in-memory file system and in-memory MapReduce. Can you describe how these two components work together?
Nikita: GridGain's In-Memory Accelerator for Hadoop is a free, open-source, plug-and-play solution that speeds up traditional MapReduce jobs. You can download and install it in about 10 minutes and get performance improvements of dozens of times without changing a line of code. The product includes the industry's first dual-mode, high-performance in-memory file system, together with a MapReduce implementation optimized for in-memory processing. The file system is compatible with Hadoop's HDFS. In-memory HDFS and in-memory MapReduce extend disk-based HDFS and traditional MapReduce in an easy-to-use way that brings significant performance improvements.
Simply put, GridGain's in-memory file system, GGFS, provides a high-performance, distributed, HDFS-compatible in-memory platform in which the data is stored, and on top of that data we can optimize YARN-based MapReduce processing. Both components are needed to deliver the dozens-of-times performance improvement (and in some edge cases even more).
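Because GGFS is HDFS-compatible, applications address it through the standard Hadoop file-system API, just as they would HDFS. Below is a minimal sketch of that idea, assuming the accelerator is installed and its file-system scheme is registered; the ggfs://ggfs@localhost URI, the path, and the class name are illustrative assumptions, not GridGain's documented settings.

```java
// A minimal sketch: talking to an HDFS-compatible in-memory file system through
// the standard Hadoop FileSystem API. The "ggfs://ggfs@localhost" URI is a
// hypothetical example; the real scheme/authority and file-system registration
// come from GridGain's configuration and are not shown here.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GgfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Resolve the in-memory file system the same way any HDFS-compatible
        // file system is resolved -- only the URI differs from plain HDFS.
        FileSystem fs = FileSystem.get(URI.create("ggfs://ggfs@localhost"), conf);

        // Write and read back a small file; existing HDFS client code stays unchanged.
        Path path = new Path("/tmp/example.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("stored in the in-memory file system");
        }
        System.out.println("File exists: " + fs.exists(path));
    }
}
```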
InfoQ: How do these two combinations compare: in-memory HDFS with in-memory MapReduce on one side, and disk-based HDFS with traditional MapReduce on the other?
Nikita: The biggest differences between GridGain's in-memory solution and the traditional HDFS/MapReduce stack are:
- In GridGain's in-memory computing platform, data is stored in memory in a distributed fashion.
- GridGain's MapReduce implementation is optimized from the ground up to take full advantage of in-memory data storage, while also addressing some shortcomings of the earlier Hadoop architecture. In GridGain's MapReduce implementation, the execution path goes directly from the client application's job submitter to the data nodes, where processing runs in-process against the data partitions held in each data node's memory. This bypasses the traditional JobTracker, TaskTracker, and NameNode components and avoids their latency.
By contrast, in the traditional MapReduce implementation data is stored on slow disks, and the implementation is optimized around that assumption.
InfoQ: Can you describe how the dual-mode, high-performance in-memory file system behind the In-Memory Accelerator for Hadoop works? How does it differ from traditional file systems?
Nikita: GridGain's in-memory file system, GGFS, supports two modes: in one it serves as the primary file system of a standalone Hadoop cluster, and in the other it connects to HDFS and acts as an intelligent caching layer in front of it.
As a caching layer, GGFS provides read-through and write-through logic that is highly tunable, letting you choose exactly which files and directories are cached and how. In both cases GGFS can be used as a drop-in replacement for, or an extension of, traditional HDFS, with an immediate performance improvement.
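To make the caching-layer behavior concrete, here is a minimal conceptual sketch of read-through and write-through logic layered over HDFS. This is not GGFS code and not GridGain's API: a plain in-JVM map stands in for the distributed in-memory layer, purely to illustrate the pattern Nikita describes.

```java
// Conceptual sketch only: a ConcurrentHashMap stands in for the distributed
// in-memory layer, and plain HDFS is the backing store. This is NOT GGFS or
// GridGain API code.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWriteThroughCacheSketch {
    private final FileSystem hdfs;                         // backing disk-based HDFS
    private final ConcurrentMap<String, byte[]> memory = new ConcurrentHashMap<>();

    public ReadWriteThroughCacheSketch(Configuration conf) throws IOException {
        this.hdfs = FileSystem.get(conf);
    }

    /** Read-through: serve from memory if cached, otherwise load from HDFS and cache it. */
    public byte[] read(String path) throws IOException {
        byte[] cached = memory.get(path);
        if (cached != null)
            return cached;

        try (InputStream in = hdfs.open(new Path(path));
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            IOUtils.copyBytes(in, buf, 4096, false);
            byte[] data = buf.toByteArray();
            memory.put(path, data);
            return data;
        }
    }

    /** Write-through: update the in-memory copy and persist to HDFS in the same call. */
    public void write(String path, byte[] data) throws IOException {
        memory.put(path, data);
        try (FSDataOutputStream out = hdfs.create(new Path(path), true)) {
            out.write(data);
        }
    }
}
```

The real GGFS additionally lets caching behavior be tuned per file and per directory, as described above; this sketch omits that.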
InfoQ: How does GridGain's in-memory MapReduce compare with real-time streaming solutions such as Storm or Apache Spark?
Nikita: The most fundamental difference is that GridGain's In-Memory Accelerator is plug-and-play. Unlike Storm or Spark (both great projects, by the way), which require you to completely rewrite your existing Hadoop MapReduce code, GridGain gives you the same or even greater performance gains without modifying a single line of code.
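To illustrate the plug-and-play claim, the sketch below is a completely standard Hadoop MapReduce word count. Under the approach described above, a job like this would run unmodified; only the cluster and file-system configuration, not the job code, would point at the accelerator. The input and output paths are placeholders supplied on the command line.

```java
// A standard, unmodified Hadoop MapReduce word-count job. Nothing here is
// GridGain-specific; the point is that job code like this stays as-is.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit (word, 1) for every token in the input line.
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the counts for each word.
            int sum = 0;
            for (IntWritable val : values)
                sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. input path on the configured file system
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. output path on the configured file system
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```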
InfoQ: Under what circumstances should we use the In-Memory Accelerator for Hadoop?
Nikita: Essentially, whenever you hear the words "real-time analytics", you are hearing a use case for the In-Memory Accelerator for Hadoop. As you know, there is nothing real-time about traditional Hadoop. We have seen such use cases in the emerging HTAP (hybrid transactional and analytical processing) space, for example fraud prevention, game analytics, algorithmic trading, and portfolio analysis and optimization.
InfoQ: Can you talk about GridGain's Visor and its GUI-based file system analysis tools, and how they help monitor and manage Hadoop jobs?
Nikita: GridGain's In-Memory Accelerator for Hadoop integrates with GridGain Visor, the management and monitoring solution for GridGain products. Visor provides direct support for the In-Memory Accelerator: it offers a fine-grained file manager and an HDFS profiler for HDFS-compatible file systems, so you can view and analyze various real-time performance metrics related to HDFS.
InfoQ: What about the product roadmap?
Nikita: We will continue to invest, together with our open-source community, in providing performance improvements for Hadoop-related products and technologies, including Hive, Pig, and HBase.
Taneja Group has also published a related report ("Memory Is the Hidden Secret to Success with Big Data"; registration is required to download the full report) that discusses how GridGain's In-Memory Accelerator for Hadoop integrates with existing Hadoop clusters, traditional disk-based database systems, and batch-oriented MapReduce technology.
About the Interviewee
Nikita Ivanov is the founder and CTO of GridGain Systems. GridGain was founded in 2007, and its investors include RTP Ventures and Almaz Capital. Nikita has led GridGain's development of its distributed in-memory data processing technology, the leading Java in-memory computing platform, which today is started somewhere in the world every 10 seconds. Nikita has over 20 years of experience in software application development, has built HPC and middleware platforms, and has contributed to startups and well-known enterprises including Adaptec, Visa, and BEA Systems. Nikita was also a pioneer in server-side development and applications using Java technology; in 1996 he did integration work for large European systems.
View the original English article: Nikita Ivanov on GridGain's In-Memory Accelerator for Hadoop