A Survey of the Hadoop Ecosystem: 13 Open Source Tools That Make the Elephant Fly

Last Update:2014-12-22 Source: Internet

Author: User



Hadoop is a distributed system infrastructure for big data developed under the Apache Foundation. Its earliest version dates back to 2003, when Doug Cutting, formerly of Yahoo!, built it on the basis of academic papers published by Google. Users can easily develop and run applications that process massive amounts of data on Hadoop without knowing the underlying details of the distributed system. Low cost, high reliability, high scalability, high efficiency, and high fault tolerance have made Hadoop the most popular big data analysis system. Yet the HDFS and MapReduce components on which it depends once held it back: their batch-processing model suits only offline data processing and is of little use in scenarios that require real-time responses. As a result, a variety of Hadoop-based tools emerged. This article shares the 13 most commonly used open source tools in the Hadoop ecosystem, covering resource scheduling, stream computing, and various business-specific scenarios. First, we look at resource management.


Unified resource management/scheduling systems

In companies and organizations, servers tend to be split into multiple clusters along business lines, and data-intensive processing frameworks keep emerging: MapReduce for offline processing, Storm and Impala for online processing, Spark for iterative computing, and S4 for stream processing. They were born in different laboratories and each has its own strengths. To reduce management costs and increase resource utilization, a common idea arose: run all of these frameworks on the same cluster. This led to a number of unified resource management/scheduling systems, such as Google's Borg, Apache YARN, Mesos (adopted by Twitter and since contributed to the Apache Foundation), Tencent Search's Torca, and Facebook's Corona (open source). Here we highlight Apache Mesos and YARN:

1. Apache Mesos

Code hosting: Apache SVN

Mesos provides efficient resource isolation and sharing across distributed applications and frameworks, and supports Hadoop, MPI, Hypertable, Spark, and others.

Mesos is an open source project in the Apache Incubator. It uses ZooKeeper for fault-tolerant replication, isolates tasks with Linux containers, and supports scheduling of multiple resource types (memory and CPU). It provides Java, Python, and C++ APIs for developing new parallel applications, along with a web-based user interface for viewing cluster status.

2. Hadoop YARN

Code hosting: Apache SVN

YARN, also known as MapReduce 2.0, borrows ideas from Mesos. YARN's resource isolation mechanism, the Container, is not yet mature and provides isolation only for Java virtual machine memory.

Compared with MapReduce 1.x, the YARN architecture changes little on the client side and keeps most of the calling APIs and interfaces compatible. Within YARN, however, ResourceManager, ApplicationMaster, and NodeManager replace the JobTracker and TaskTracker at the core of the original framework. ResourceManager is a central service responsible for scheduling and for starting the ApplicationMaster that belongs to each job, and it also monitors whether each ApplicationMaster is alive. NodeManager maintains the state of Containers and sends heartbeats to the ResourceManager (RM). ApplicationMaster is responsible for all work within a job's lifecycle, similar to the JobTracker in the old framework.
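The division of labor among the three YARN roles can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class and method names below are hypothetical and are not the real YARN API, and the first-fit placement rule is chosen purely for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent that tracks the containers running on it."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def heartbeat(self):
        # Report the node's current state to the ResourceManager.
        return {"node": self.name,
                "free_memory_mb": self.free_memory_mb,
                "containers": len(self.containers)}


class ResourceManager:
    """Toy central scheduler that hands out containers on known nodes."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id, memory_mb):
        # First-fit placement (an assumption; real schedulers are richer).
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                container = (app_id, node.name, memory_mb)
                node.containers.append(container)
                return container
        return None  # no node has enough free memory


class ApplicationMaster:
    """Toy per-job coordinator that requests one container per task."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_job(self, task_count, memory_mb):
        granted = [self.rm.allocate(self.app_id, memory_mb)
                   for _ in range(task_count)]
        return [c for c in granted if c is not None]


nodes = [NodeManager("node-1", 2048), NodeManager("node-2", 1024)]
rm = ResourceManager(nodes)
am = ApplicationMaster("job-001", rm)
containers = am.run_job(task_count=3, memory_mb=1024)
reports = [n.heartbeat() for n in nodes]
```

In this sketch the per-job logic lives entirely in the ApplicationMaster, while the ResourceManager only decides where containers go, which mirrors the split of the old JobTracker's duties described above.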

Real-time solutions on Hadoop

As noted above, Internet companies tend to use a variety of computing frameworks to match their business logic. A company in the search business, for example, might build its web index with MapReduce and do natural language processing with Spark. This section covers three frameworks: Storm, Impala, and Spark.

3. Cloudera Impala

Code hosting: GitHub

Impala, developed by Cloudera, is an open source massively parallel processing (MPP) query engine. It shares the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, and provides fast, interactive SQL queries directly on HDFS or HBase. Impala was inspired by Dremel, and its first version was released at the end of 2012.

Instead of the slow Hive+MapReduce batch approach, Impala uses a distributed query engine similar to those in commercial parallel relational databases, made up of a Query Planner, a Query Coordinator, and a Query Exec Engine. It can query data directly from HDFS or HBase using SELECT, JOIN, and aggregate functions, which greatly reduces latency.

4. Spark

Code hosting: Apache

Spark is an open source cluster computing framework for data analysis, originally developed by the AMPLab at the University of California, Berkeley, and built on top of HDFS. Like Hadoop, Spark is used to build large-scale, low-latency data analysis applications. Spark is implemented in Scala and uses Scala as its application framework.

Spark uses in-memory distributed datasets to optimize iterative workloads and interactive queries. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as if they were local collection objects. Spark supports iterative jobs on distributed datasets and can run alongside Hadoop on the Hadoop file system (via YARN, Mesos, and so on).
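The idea of treating a partitioned, in-memory dataset like a local collection can be sketched in a few lines. The `ToyDataset` class below is a hypothetical single-process stand-in written for illustration; it is not Spark's API, and it only imitates the programming style, not the distribution.

```python
from functools import reduce


class ToyDataset:
    """Hypothetical stand-in for a partitioned dataset held in memory."""

    def __init__(self, partitions):
        # Materialize every partition so later queries reuse it from memory.
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def parallelize(cls, items, num_partitions):
        items = list(items)
        # Round-robin the items across the requested number of partitions.
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        return ToyDataset([fn(x) for x in part] for part in self.partitions)

    def filter(self, pred):
        return ToyDataset([x for x in part if pred(x)]
                          for part in self.partitions)

    def reduce(self, fn):
        # Reduce each non-empty partition, then combine the partial results.
        partials = [reduce(fn, part) for part in self.partitions if part]
        return reduce(fn, partials)


data = ToyDataset.parallelize(range(1, 11), num_partitions=3)
squares = data.map(lambda x: x * x)          # computed once, kept in memory

# Two different queries reuse the same in-memory intermediate dataset.
total = squares.reduce(lambda a, b: a + b)
even_total = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)
```

Reusing `squares` for both queries is the point of the sketch: an iterative or interactive workload avoids recomputing or re-reading the intermediate data on each pass.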

5. Storm

Code hosting: GitHub

Storm is a distributed, fault-tolerant real-time computing system developed by BackType, which was later acquired by Twitter. Storm is a stream processing platform used mostly for real-time computation and database updates. It can also be used for "continuous computation": running a continuous query over a data stream and streaming the results to users as they are computed. It can further be used for "distributed RPC", running expensive operations in parallel.
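Continuous computation over a stream can be sketched with plain Python generators. The function names borrow Storm's spout/bolt vocabulary, which the text above does not cover, and the code is a single-process illustration rather than the Storm API; the input sentences are made up.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream; a real source would be unbounded, not a list."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Split each incoming sentence into lowercase words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream):
    """Emit an updated (word, count) pair every time a word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


sentences = ["the elephant flies", "the stream never ends"]
updates = list(count_bolt(split_bolt(sentence_spout(sentences))))
```

Each stage consumes items as they arrive and emits results immediately, so the running counts are available as a stream instead of only after a batch completes.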

Other solutions on Hadoop

As mentioned above, real-time business needs have led various laboratories to develop Storm, Impala, Spark, Samza, and other real-time stream processing tools. In this section, we share open source solutions that those laboratories developed with performance, compatibility, and data types in mind: Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari.

6. Shark

Code hosting: GitHub

Shark, short for "Hive on Spark", is a large-scale data warehouse system designed for Spark and compatible with Apache Hive. It can run Hive QL up to 100 times faster without modifying existing data or queries.

Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, and integrates seamlessly with existing Hive deployments, making it a faster and more powerful alternative.

7. Phoenix

Code hosting: GitHub

Phoenix is a SQL layer built on top of Apache HBase. It is written entirely in Java and provides a JDBC driver that can be embedded in the client. The Phoenix query engine translates a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set. By using the HBase API, coprocessors, and custom filters directly, it achieves performance on the order of milliseconds for simple queries and seconds for queries over millions of rows. Phoenix is hosted entirely on GitHub.

Notable Phoenix features include: 1) an embedded JDBC driver that implements most of the java.sql interfaces, including the metadata API; 2) columns that can be modeled through multiple row keys or key/value cells; 3) DDL support; 4) a versioned schema repository; 5) DML support; 6) limited transaction support through client-side batching; 7) adherence to the ANSI SQL standard.
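The translation of a SQL-style query into a scan over sorted row keys can be illustrated with a toy in-memory table. Everything here is an assumption made for illustration: the rows, the `range_scan` and `select` helpers, and the key format are hypothetical and are unrelated to the actual Phoenix or HBase APIs.

```python
import bisect

# Toy table: rows kept sorted by row key, as in an HBase-like store.
ROWS = sorted(
    [
        ("user#001", {"name": "ann", "visits": 3}),
        ("user#002", {"name": "bob", "visits": 7}),
        ("user#003", {"name": "cy", "visits": 1}),
        ("user#004", {"name": "dee", "visits": 9}),
        ("user#005", {"name": "eve", "visits": 5}),
    ],
    key=lambda item: item[0],
)
ROW_KEYS = [key for key, _ in ROWS]


def range_scan(start_key, stop_key):
    """Yield rows with start_key <= row key < stop_key (a bounded scan)."""
    lo = bisect.bisect_left(ROW_KEYS, start_key)
    hi = bisect.bisect_left(ROW_KEYS, stop_key)
    for key, columns in ROWS[lo:hi]:
        yield key, columns


def select(columns, start_key, stop_key, predicate=lambda row: True):
    """Toy SELECT: a key-range condition becomes one scan plus a filter."""
    return [
        tuple(row[c] for c in columns)
        for _, row in range_scan(start_key, stop_key)
        if predicate(row)
    ]


# Roughly: SELECT name, visits WHERE key >= 'user#002' AND key < 'user#005'
#          AND visits >= 5
result = select(["name", "visits"], "user#002", "user#005",
                lambda row: row["visits"] >= 5)
```

Because the rows are sorted by key, the key-range condition narrows the scan before any row is examined, and only the remaining condition is applied as a filter on the scanned rows.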

8. Apache Accumulo

Code hosting: Apache SVN

Apache Accumulo is a reliable, scalable, high-performance, sorted, distributed key-value store with cell-based access control and customizable server-side processing. It follows the design of Google Bigtable and is built on Apache Hadoop, ZooKeeper, and Thrift. Accumulo was first developed by the NSA and later donated to the Apache Foundation.

Compared with Google Bigtable, Accumulo's main improvements are cell-based access control and a server-side programming mechanism. The latter allows Accumulo to modify key-value pairs at various points in the data processing pipeline.
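Cell-based access control means that every stored cell carries its own visibility requirement, and a scan returns only the cells the reader is authorized to see. The sketch below is a toy model with hypothetical names (`Cell`, `ToyTable`), not the Accumulo client API, and it assumes a simplified rule where a reader must hold every label on a cell.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    visibility: frozenset  # labels a reader must hold (toy AND-only rule)
    value: str


class ToyTable:
    """Toy sorted key-value table with a per-cell visibility check."""

    def __init__(self):
        self.cells = []

    def put(self, row, column, value, visibility=()):
        self.cells.append(Cell(row, column, frozenset(visibility), value))
        # Keep cells sorted by key, as a sorted key-value store would.
        self.cells.sort(key=lambda c: (c.row, c.column))

    def scan(self, authorizations):
        auths = frozenset(authorizations)
        # A cell is returned only if the reader holds all of its labels.
        return [(c.row, c.column, c.value)
                for c in self.cells if c.visibility <= auths]


table = ToyTable()
table.put("patient1", "name", "Ann")                        # no label: public
table.put("patient1", "diagnosis", "flu", ["medical"])
table.put("patient1", "billing", "42.00", ["finance", "audit"])

public_view = table.scan([])
doctor_view = table.scan(["medical"])
auditor_view = table.scan(["finance", "audit"])
```

Two readers scanning the same row therefore see different cells, without the application having to split the data into separate tables.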

9. Apache Drill

Code hosting: GitHub

Apache Drill is essentially an open source implementation of Google's Dremel: a distributed MPP query layer that supports SQL and some other query languages over NoSQL and Hadoop data storage systems, helping Hadoop users query massive datasets faster. At present Drill is little more than a framework, containing only the initial functionality of the Drill vision.

Drill aims to support a wider range of data sources, data formats, and query languages. It is a distributed system designed for interactive analysis of large datasets, able to scan petabytes of data quickly (in roughly a few seconds).

10. Apache Giraph

Code hosting: GitHub

Apache Giraph is a scalable, distributed, iterative graph processing system inspired by BSP (Bulk Synchronous Parallel) and Google's Pregel. It differs from Pregel in being open source and built on Hadoop.

The Giraph platform is suited to large-scale logical computations such as page ranking, shared links, and personalized ranking. Giraph focuses on social graph computing, and Facebook uses it as the core of its Open Graph tool, processing trillions of connections between users and their actions within minutes.
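The BSP, vertex-centric style behind systems of this kind can be sketched with a small single-process example: in each superstep every vertex reads the messages sent to it in the previous superstep, updates its value, and sends new messages that are delivered only after a barrier. The function below propagates the maximum value through a toy graph. It is a hypothetical illustration, not Giraph's API.

```python
def propagate_max(graph, values):
    """Vertex-centric max propagation in BSP supersteps (toy, undirected)."""
    values = dict(values)

    # Superstep 0: every vertex sends its own value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():       # compute phase
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:              # only changed vertices send
                    outbox[n].append(values[v])
        inbox = outbox                          # barrier: deliver next step
        supersteps += 1
    return values, supersteps


# A small chain a - b - c - d, with adjacency lists in both directions.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
initial = {"a": 3, "b": 6, "c": 2, "d": 1}
final_values, steps = propagate_max(graph, initial)
```

Each vertex only ever looks at its own value and its incoming messages, which is what lets the same logic be spread across many machines with a synchronization barrier between supersteps.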

11. Apache Hama

Code hosting: GitHub

Apache Hama is a computing framework built on Hadoop and based on BSP (Bulk Synchronous Parallel), modeled on Google's Pregel. It is used for large-scale scientific computation, especially matrix and graph computation. In a clustered environment, its architecture consists of three parts: BSPMaster/GroomServer (computation engine), ZooKeeper (distributed locking), and HDFS/HBase (storage system).

12. Apache Tez

Code hosting: GitHub

Apache Tez is a DAG (directed acyclic graph) computing framework built on top of Hadoop YARN. It splits the map/reduce process into several sub-processes and can combine multiple map/reduce jobs into one larger DAG job, reducing the file storage needed between map/reduce stages. Combining the sub-processes sensibly also shortens job running time. Tez is developed and primarily supported by Hortonworks.
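The benefit of running several stages as one DAG can be sketched as follows: stages execute in dependency order and hand their results to downstream stages in memory, instead of each job writing its output to storage for the next job to read. The `run_dag` helper and the stage names are hypothetical and are not part of the Tez API; the sketch relies on `graphlib` from the Python standard library (Python 3.9 or later).

```python
from collections import Counter
from graphlib import TopologicalSorter


def run_dag(stages, dependencies, source):
    """Run stage functions in dependency order, passing results in memory."""
    results = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        upstream = [results[d] for d in dependencies.get(name, ())]
        # Stages without upstream stages read the original input.
        inputs = upstream if upstream else [source]
        results[name] = stages[name](*inputs)
    return results, order


# Three stages that would otherwise be separate jobs with files in between.
stages = {
    "tokenize": lambda lines: [w for line in lines for w in line.split()],
    "count": lambda words: Counter(words),
    "top": lambda counts: counts.most_common(1)[0],
}
# Each stage maps to the list of stages it depends on.
dependencies = {"tokenize": [], "count": ["tokenize"], "top": ["count"]}

results, order = run_dag(stages, dependencies, ["a b a", "b a"])
```

Here the output of `tokenize` and `count` never leaves memory, which is the kind of intermediate storage the DAG approach is meant to avoid.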

13. Apache Ambari

Code hosting: Apache SVN

Apache Ambari is an open source framework for provisioning, managing, and monitoring Apache Hadoop clusters. It provides intuitive operational tools and a robust Hadoop API that hide the complexity of Hadoop and simplify cluster operations. The first version was released in June 2012.

Apache Ambari is now an Apache top-level project. As early as August 2011, Hortonworks introduced Ambari as an Apache Incubator project with the vision of making Hadoop cluster management extremely simple. In the two-plus years since, the development community has grown significantly, from a small team to contributors from Hortonworks and various other organizations. The Ambari user base has grown steadily, and many institutions rely on Ambari to deploy and manage Hadoop clusters at scale in their large data centers.

Hadoop components currently supported by Apache Ambari include HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
