13 Open Source Tools Based on the Hadoop Big Data Analysis System

Last Update:2014-12-18 Source: Internet

Author: User


Hadoop is a distributed big data infrastructure developed under the Apache Foundation. Its earliest version dates back to around 2003, when Doug Cutting (later of Yahoo!) began building it on the basis of academic papers published by Google. Users can easily develop and run applications that process massive amounts of data on Hadoop without knowing the underlying details of the distributed system. Low cost, high reliability, high scalability, high efficiency, and high fault tolerance have made Hadoop the most popular big data analysis system. Yet HDFS and MapReduce, the components it depends on, once put it in a difficult position: because they work in batch mode, Hadoop suited only offline data processing and was of little use in scenarios that require real-time response. As a result, a variety of Hadoop-based tools have emerged. This article shares 13 of the most commonly used open source tools in the Hadoop ecosystem, covering resource scheduling, stream computing, and various business-oriented scenarios. We start with resource management.

Unified resource management/scheduling systems

In companies and organizations, servers tend to be split into multiple clusters according to business logic, and data-intensive processing frameworks keep emerging: MapReduce for offline processing, Storm and Impala for online processing, Spark for iterative computing, and S4 for stream processing. They were born in different laboratories and each has its own strengths. To reduce management costs and increase resource utilization, a common idea arose: run all of these frameworks on the same cluster. As a result, there are now many unified resource management/scheduling systems, such as Google's Borg, Apache YARN, Mesos (used at Twitter and since contributed to the Apache Foundation), Tencent Search's Torca, and Facebook's Corona (open source). This section focuses on Apache Mesos and YARN:

1. Apache Mesos

Code hosting address: Apache SVN

Mesos provides efficient resource isolation and sharing across distributed applications and frameworks, and supports Hadoop, MPI, Hypertable, Spark, and so on.

Mesos is an open source project in the Apache Incubator. It uses ZooKeeper for fault-tolerant replication, isolates tasks using Linux Containers, and supports scheduling across multiple resource types (memory and CPU). It provides Java, Python, and C++ APIs for developing new parallel applications, and a web-based user interface for viewing cluster status.

2. Hadoop YARN

Code hosting address: Apache SVN

YARN, also known as MapReduce 2.0, borrows from Mesos. It proposes the container as its resource isolation solution, but this is not yet mature and only provides isolation of Java virtual machine memory.

Compared with MapReduce 1.x, the YARN architecture does not change much on the client side and keeps most of the compatibility in its calling APIs and interfaces. Inside YARN, however, ResourceManager, ApplicationMaster, and NodeManager replace the core JobTracker and TaskTracker of the original framework. ResourceManager is a central service responsible for scheduling and for starting the ApplicationMaster that each job belongs to; it also monitors whether each ApplicationMaster is still alive. NodeManager is responsible for maintaining the state of containers and for keeping a heartbeat to the ResourceManager (RM). ApplicationMaster is responsible for all work within one job's lifecycle, similar to JobTracker in the old framework.
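The division of labor among the three YARN roles can be illustrated with a small, self-contained Python simulation. This is only a sketch under stated assumptions: the class names mirror the roles described above, but the methods (`allocate`, `submit`, `run`, `heartbeat`) and the "most free slots" placement rule are hypothetical and are not YARN's actual API or scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the containers on one node and reports them via heartbeat."""
    name: str
    capacity: int                                   # container slots on this node
    containers: list = field(default_factory=list)

    def free_slots(self):
        return self.capacity - len(self.containers)

    def heartbeat(self):
        # Status report a node would send periodically to the ResourceManager.
        return {"node": self.name, "running": list(self.containers)}


class ResourceManager:
    """Central service: schedules containers and starts one master per job."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.masters = {}

    def allocate(self, label):
        # Hypothetical policy: place the container on the node with most free slots.
        node = max(self.nodes, key=NodeManager.free_slots)
        if node.free_slots() == 0:
            raise RuntimeError("cluster is full")
        node.containers.append(label)
        return node.name

    def submit(self, job_id, num_tasks):
        self.allocate(f"{job_id}/master")           # the master runs in a container too
        master = ApplicationMaster(job_id, num_tasks, self)
        self.masters[job_id] = master
        return master


class ApplicationMaster:
    """Owns a single job's lifecycle, the part JobTracker used to do for all jobs."""

    def __init__(self, job_id, num_tasks, rm):
        self.job_id, self.num_tasks, self.rm = job_id, num_tasks, rm

    def run(self):
        # Ask the ResourceManager for one container per task.
        return [self.rm.allocate(f"{self.job_id}/task{i}")
                for i in range(self.num_tasks)]


nodes = [NodeManager("node1", 2), NodeManager("node2", 2)]
rm = ResourceManager(nodes)
placements = rm.submit("job1", 3).run()
```

The point of the split is that per-job bookkeeping lives in the ApplicationMaster, so the central service only has to hand out containers and watch heartbeats.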

Real-time solutions on Hadoop

As noted above, Internet companies tend to use a variety of computing frameworks depending on their business logic. A company in the search business, for example, might build its web index with MapReduce and do natural language processing with Spark. This section covers three frameworks: Storm, Impala, and Spark.

3. Cloudera Impala

Code hosting address: GitHub

Impala is an open source massively parallel processing (MPP) query engine developed by Cloudera. It shares the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, and provides fast, interactive SQL queries directly on HDFS or HBase. Impala was inspired by Dremel, and its first version was released at the end of 2012.

Impala no longer uses the slow Hive+MapReduce batch path. Instead it uses a distributed query engine (consisting of Query Planner, Query Coordinator, and Query Exec Engine) similar to those in commercial parallel relational databases. It can query data directly from HDFS or HBase using SELECT, JOIN, and aggregate functions, which greatly reduces latency.
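A rough feel for the coordinator/exec-engine split in an MPP query can be given in plain Python. This is a sketch only, not Impala code: the partition data, the function names `exec_fragment` and `coordinate`, and the use of a thread pool to stand in for cluster nodes are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, one per node: (country, amount) rows.
PARTITIONS = [
    [("us", 3), ("de", 1)],
    [("us", 2), ("fr", 5)],
    [("de", 4)],
]


def exec_fragment(partition):
    """Exec engine: partial SUM(amount) GROUP BY country on local data only."""
    partial = Counter()
    for country, amount in partition:
        partial[country] += amount
    return partial


def coordinate(partitions):
    """Coordinator: fan the fragment out to every partition, then merge."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(exec_fragment, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)        # Counter.update adds counts together
    return dict(total)


totals = coordinate(PARTITIONS)
```

Each fragment touches only its own partition, so the work runs in parallel and only the small partial results travel to the coordinator.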

4. Spark

Code hosting address: Apache

Spark is an open source cluster computing framework for data analysis, originally developed by the AMPLab at the University of California, Berkeley, and built on top of HDFS. Like Hadoop, Spark is used to build large-scale, low-latency data analysis applications. Spark is implemented in Scala and uses Scala as its application framework.

Spark uses in-memory distributed datasets to optimize iterative workloads and interactive queries. Unlike Hadoop, Spark is tightly integrated with Scala, which lets Scala manipulate distributed datasets as easily as local collection objects. Spark supports iterative jobs on distributed datasets and can in fact run alongside Hadoop on the Hadoop file system (via YARN, Mesos, and so on).
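Why keeping a dataset in memory helps iterative workloads can be shown with a toy example in plain Python. This is a sketch, not Spark's API: the `Dataset` class, its `cache` flag, and the `iterate_mean` routine are hypothetical names, and the `loader` function stands in for an expensive read from disk.

```python
class Dataset:
    """A toy dataset that is either re-read on every pass or cached in memory."""

    def __init__(self, loader, cache=False):
        self.loader = loader        # stands in for an expensive read from disk
        self.cache = cache
        self._data = None
        self.loads = 0              # how many times the loader actually ran

    def rows(self):
        if self.cache and self._data is not None:
            return self._data
        self.loads += 1
        data = self.loader()
        if self.cache:
            self._data = data
        return data


def iterate_mean(dataset, steps):
    """Iteratively move an estimate toward the mean, one data pass per step."""
    estimate = 0.0
    for _ in range(steps):
        rows = dataset.rows()
        gradient = sum(estimate - x for x in rows) / len(rows)
        estimate -= 0.5 * gradient
    return estimate


def load():
    return [1.0, 2.0, 3.0, 4.0]


uncached = Dataset(load)
cached = Dataset(load, cache=True)
result_uncached = iterate_mean(uncached, 20)
result_cached = iterate_mean(cached, 20)
```

Both runs compute the same estimate, but the uncached dataset is loaded once per iteration while the cached one is loaded a single time; with real disk or network reads, that difference dominates the run time of an iterative job.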

5. Storm

Code hosting address: GitHub

Storm is a distributed, fault-tolerant real-time computing system developed by BackType and later acquired by Twitter. Storm is a stream processing platform, mostly used for real-time computation and database updates. It can also be used for "continuous computation": running continuous queries over data streams and delivering the results to users as a stream while the computation proceeds. Finally, it can be used for "distributed RPC", running expensive operations in parallel.
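The idea of a continuous query, where results are emitted as a stream while input keeps arriving, can be sketched with Python generators. This is not Storm code; the function names and the two sample sentences are assumptions, and a finite generator stands in for an unbounded stream.

```python
from collections import Counter


def sentences():
    """Stand-in for an unbounded input stream."""
    yield "storm processes streams"
    yield "streams never end"


def split_words(stream):
    """First processing stage: break each sentence into words."""
    for sentence in stream:
        yield from sentence.split()


def running_counts(words):
    """Continuous query: emit an updated count every time a word arrives."""
    counts = Counter()
    for word in words:
        counts[word] += 1
        yield word, counts[word]


# Collect the emitted updates; a real consumer would read them one at a time.
updates = list(running_counts(split_words(sentences())))
```

Nothing waits for the input to finish: every stage handles one item at a time and passes its result downstream immediately, which is what distinguishes this from a batch job.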

Other solutions on Hadoop

As the previous section said, real-time business needs have led various laboratories to develop Storm, Impala, Spark, Samza, and other streaming and real-time processing tools. In this section, we share open source solutions that target performance, compatibility, and particular data types, including Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari.

6. Shark

Code hosting address: GitHub

Shark, short for "Hive on Spark", is a large-scale data warehouse system built for Spark and compatible with Apache Hive. It can run HiveQL queries up to 100 times faster without modifying existing data or queries.

Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, and integrates seamlessly with existing Hive deployments, making it a faster and more powerful alternative.

7. Phoenix

Code hosting address: GitHub

Phoenix is a SQL layer built on top of Apache HBase. It is written entirely in Java and provides a JDBC driver that can be embedded in the client. The Phoenix query engine converts a SQL query into one or more HBase scans and runs them to produce a standard JDBC result set. Because it uses the HBase API, coprocessors, and custom filters directly, performance is on the order of milliseconds for simple queries and on the order of seconds for queries over millions of rows. Phoenix is hosted entirely on GitHub.

Noteworthy Phoenix features include: 1, an embedded JDBC driver that implements most of the java.sql interfaces, including the metadata API; 2, columns can be modeled through multiple row keys or key/value cells; 3, DDL support; 4, a versioned schema repository; 5, DML support; 6, limited transaction support through client-side batching; 7, close adherence to the ANSI SQL standard.
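How a SQL predicate on the row key can become a range scan over a sorted key-value table is sketched below in plain Python. This is an illustration only, not Phoenix's query compiler or the HBase API: the table contents and the names `scan` and `select_where_key_between` are hypothetical.

```python
from bisect import bisect_left

# Rows kept sorted by row key, as in an HBase table: (row_key, value).
TABLE = sorted([("user03", 30), ("user01", 10), ("user04", 40), ("user02", 20)])
KEYS = [key for key, _ in TABLE]


def scan(start_key, stop_key):
    """Range scan over [start_key, stop_key); only matching rows are read."""
    lo = bisect_left(KEYS, start_key)
    hi = bisect_left(KEYS, stop_key)
    return TABLE[lo:hi]


def select_where_key_between(low, high):
    """Translate `WHERE key >= low AND key < high` into one scan."""
    return scan(low, high)


rows = select_where_key_between("user02", "user04")
```

Because the keys are sorted, the predicate maps directly to a start and stop position and no other rows are touched, which is one reason simple key-based queries can return so quickly.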

8. Apache Accumulo

Code hosting address: Apache SVN

Apache Accumulo is a reliable, scalable, high-performance, sorted, distributed key-value store featuring cell-based access control and customizable server-side processing. It follows the design of Google Bigtable and is built on Apache Hadoop, ZooKeeper, and Thrift. Accumulo was first developed by the NSA and later donated to the Apache Foundation.

Compared with Google Bigtable, Accumulo's main improvements are cell-based access control and a server-side programming mechanism. The latter allows Accumulo to modify key-value pairs at any point in the data processing pipeline.
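Cell-based access control means that each individual cell, rather than a whole table or row, carries its own visibility requirement. The following Python sketch shows the idea under a simplifying assumption: each cell lists a set of labels and a reader must hold all of them. Accumulo's real visibility expressions are richer than this, and the data and the `scan` function here are hypothetical.

```python
# Each cell: (row, column, required_labels, value).
CELLS = [
    ("alice", "email", frozenset(), "alice@example.com"),
    ("alice", "salary", frozenset({"hr"}), 5000),
    ("alice", "diagnosis", frozenset({"hr", "medical"}), "none"),
]


def scan(cells, authorizations):
    """Return only the cells whose required labels the reader fully holds."""
    auths = set(authorizations)
    return [(row, col, value)
            for row, col, labels, value in cells
            if labels <= auths]          # subset test: every label is held


public_view = scan(CELLS, [])
hr_view = scan(CELLS, ["hr"])
full_view = scan(CELLS, ["hr", "medical"])
```

Two readers scanning the same row therefore see different columns, and the filtering happens during the scan rather than in application code.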

9. Apache Drill

Code hosting address: GitHub

In essence, Apache Drill is an open source implementation of Google Dremel: a distributed MPP query layer that supports SQL and several other query languages over NoSQL and Hadoop data storage systems, helping Hadoop users query massive datasets faster. At present Drill is no more than a framework, containing only the initial functionality of the Drill vision.

Drill aims to support a wider range of data sources, data formats, and query languages. It is designed as a distributed system for interactive analysis of large datasets, able to scan petabytes of data quickly (in roughly a few seconds).

10. Apache Giraph

Code hosting address: GitHub

Apache Giraph is a scalable, distributed, iterative graph processing system inspired by BSP (Bulk Synchronous Parallel) and Google's Pregel. It differs from them in being open source and built on a Hadoop-based architecture.

The Giraph processing platform is suited to large-scale graph computations such as page ranking, shared links, and personalized ranking. Giraph focuses on social graph computing, and Facebook uses it as the core of its Open Graph tool, processing trillions of connections between users and their behavior within minutes.

11. Apache Hama

Code hosting address: GitHub

Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework built on Hadoop, modeled on Google's Pregel. It is used for large-scale scientific computing, especially matrix and graph computations. In a cluster environment, the system architecture consists of three parts: BSPMaster/GroomServer (computation engine), ZooKeeper (distributed locking), and HDFS/HBase (storage systems).
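The BSP model behind both Giraph and Hama proceeds in supersteps: every vertex sends messages, all participants wait at a barrier, and only then does each vertex compute on what it received. The Python sketch below propagates the largest value through a small graph in this style. It is a single-process illustration, not the Giraph or Hama API; the function name `bsp_max` and the sample graph are assumptions.

```python
def bsp_max(graph, values):
    """Propagate the largest value to every vertex using BSP supersteps."""
    values = dict(values)
    active = set(graph)              # vertices whose value changed last superstep
    supersteps = 0
    while active:
        # Communication phase: each active vertex sends its value to neighbours.
        inbox = {v: [] for v in graph}
        for v in active:
            for neighbour in graph[v]:
                inbox[neighbour].append(values[v])
        # Barrier: all messages are delivered before any vertex computes.
        active = set()
        for v, messages in inbox.items():
            best = max(messages, default=values[v])
            if best > values[v]:
                values[v] = best
                active.add(v)        # a changed vertex speaks again next superstep
        supersteps += 1
    return values, supersteps


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, steps = bsp_max(graph, {"a": 3, "b": 1, "c": 2})
```

The computation stops when a superstep changes nothing, so the number of supersteps grows with the distance a value has to travel rather than with the size of the data.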

12. Apache Tez

Code hosting address: GitHub

Apache Tez is a DAG (directed acyclic graph) computing framework based on Hadoop YARN. It splits the map and reduce processes into several sub-processes and can combine multiple map/reduce tasks into one larger DAG task, which reduces the file storage needed between map/reduce stages. It also combines the sub-processes sensibly to shorten task run time. Tez is developed and primarily supported by Hortonworks.
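The saving from merging several map/reduce jobs into one DAG task can be illustrated by counting intermediate writes. The Python sketch below is not Tez code: the `Storage` class stands in for files written between jobs, the three stages form a simple linear DAG, and all names are hypothetical.

```python
class Storage:
    """Counts writes, standing in for intermediate files on the file system."""

    def __init__(self):
        self.writes = 0

    def write(self, data):
        self.writes += 1
        return data


def tokenize(lines):
    return [word for line in lines for word in line.split()]


def count(words):
    totals = {}
    for word in words:
        totals[word] = totals.get(word, 0) + 1
    return totals


def top(totals):
    return max(totals, key=totals.get)


STAGES = [tokenize, count, top]


def run_as_separate_jobs(data, storage):
    # Each job persists its output before the next job can start.
    for stage in STAGES:
        data = storage.write(stage(data))
    return data


def run_as_dag(data, storage):
    # Stages hand data to each other directly; only the final result is stored.
    for stage in STAGES:
        data = stage(data)
    return storage.write(data)


lines = ["a b a", "b a"]
jobs_storage, dag_storage = Storage(), Storage()
jobs_result = run_as_separate_jobs(lines, jobs_storage)
dag_result = run_as_dag(lines, dag_storage)
```

Both approaches produce the same answer, but the chained jobs write once per stage while the single DAG task writes only the final output.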

13. Apache Ambari

Code hosting address: Apache SVN

Apache Ambari is an open source framework for provisioning, managing, and monitoring Apache Hadoop clusters. It provides an intuitive operations tool and a robust Hadoop API that hide complex Hadoop operations and make running a cluster much simpler. The first version was released in June 2012.

Apache Ambari is now one of Apache's top-level projects. As early as August 2011, Hortonworks introduced Ambari as an Apache Incubator project, with the vision of making Hadoop cluster management as simple as possible. Over more than two years the development community has grown significantly, from a small team to contributors from Hortonworks and various other organizations. The Ambari user base has been growing steadily, with many institutions relying on Ambari to deploy and manage Hadoop clusters at scale in their large data centers.

The Hadoop components currently supported by Apache Ambari include HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
