Hadoop is a distributed big data infrastructure developed by the Apache Foundation. Its earliest version was created by Doug Cutting, who later joined Yahoo!, based on research papers Google published starting in 2003.
Users can easily develop and run applications that process massive amounts of data on Hadoop without knowing the underlying details of the distributed system. Low cost, high reliability, high scalability, high efficiency, and high fault tolerance have made Hadoop the most popular big data analytics platform. However, the HDFS and MapReduce components it relies on have also been a source of trouble: their batch-oriented design makes Hadoop suitable only for offline data processing and of little use in scenarios that demand real-time performance.
Therefore, a variety of Hadoop-based tools have emerged. Here we share the 13 most commonly used open source tools in the Hadoop ecosystem, covering resource scheduling, stream computing, and various business-oriented scenarios. First, we look at resource management.
Unified resource management/scheduling systems
In companies and organizations, servers are often split into multiple clusters according to business logic. At the same time, data-intensive processing frameworks keep emerging: MapReduce for offline processing, Storm and Impala for online processing, Spark for iterative computing, and S4 for stream processing. They were born in different laboratories and each has its own strengths.
In order to reduce management costs and improve resource utilization, a common idea arises: let these frameworks run on the same cluster. Hence the many unified resource management/scheduling systems. This time we will focus on Apache Mesos and YARN:
1. Apache Mesos
Code hosting address: Apache SVN
Mesos provides efficient isolation and sharing of resources across distributed applications and frameworks, supporting Hadoop, MPI, Hypertable, Spark, and more.
Mesos is an open source project in the Apache Incubator. It uses ZooKeeper for fault-tolerant replication and Linux Containers to isolate tasks, and it supports resource allocation along multiple dimensions (memory and CPU). Java, Python, and C++ APIs are provided for developing new parallel applications, along with a web-based user interface for viewing cluster state.
2. Hadoop YARN
Code hosting address: Apache SVN
YARN is also known as MapReduce 2.0. Drawing on Mesos, YARN proposed its own resource isolation mechanism, the Container, but it is not yet mature and only provides isolation of Java virtual machine memory.
Compared with MapReduce 1.x, the YARN architecture has not changed much on the client side, and it maintains most compatibility in its calling APIs and interfaces. In YARN, however, developers work with the ResourceManager, ApplicationMaster, and NodeManager instead of the original framework's core JobTracker and TaskTracker. The ResourceManager is a central service responsible for scheduling and starting the ApplicationMaster that each job belongs to, and for monitoring that ApplicationMaster's liveness. The NodeManager maintains Container state and keeps a heartbeat to the ResourceManager. The ApplicationMaster is responsible for all the work in a job's lifecycle, similar to the JobTracker in the old framework.
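To make the division of labor concrete, here is a minimal sketch in Java of submitting an application through the YARN client API. It only names the application; a real client would also fill in the ApplicationMaster's launch context and resource request, which are elided here.

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ResourceManager using the cluster configuration.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Ask the ResourceManager for a new application; the RM will later
        // launch and monitor the ApplicationMaster described in this context.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");
        // ... a real client sets the AM's ContainerLaunchContext and Resource here ...

        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
        yarnClient.stop();
    }
}
```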
Real-time solution on Hadoop
As we said before, Internet companies often use a variety of computing frameworks according to their business requirements. A company offering search services, for example, might use MapReduce for web indexing and Spark for natural language processing.
3. Cloudera Impala
Code hosting address: GitHub
Impala is an open source Massively Parallel Processing (MPP) query engine developed by Cloudera. Sharing the same metadata, SQL syntax, ODBC driver, and user interface (Hue Beeswax) as Hive, it provides fast, interactive SQL queries directly on data in HDFS or HBase. Impala was inspired by Google's Dremel, and its first version was released in late 2012.
Impala does not use the slow Hive+MapReduce batch path. Instead, through a distributed query engine similar to those in commercial parallel relational databases (composed of a Query Planner, Query Coordinator, and Query Exec Engine), it can run SELECT, JOIN, and aggregate-function queries directly against data in HDFS or HBase, which greatly reduces latency.
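As an illustration, here is a hedged Java sketch of an interactive Impala query over JDBC. Impala's daemons speak the HiveServer2 protocol, so the Hive JDBC driver can connect to them; the host, port, and the events table are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuerySketch {
    public static void main(String[] args) throws Exception {
        // Impala daemons accept HiveServer2-protocol clients (port 21050 by default).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://impalad-host:21050/;auth=noSasl"; // placeholder host
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Hypothetical table; the query runs interactively, no MapReduce job.
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM events GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```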
4. Spark
Code hosting address: Apache
Spark is an open source data analytics cluster computing framework originally developed at AMPLab at the University of California, Berkeley, and built on top of HDFS. Like Hadoop, Spark is used to build large-scale data analysis applications, but with low latency. Spark is implemented in Scala and uses Scala as its application framework.
Spark uses in-memory distributed datasets to optimize iterative workloads and interactive queries. Unlike Hadoop, Spark is tightly integrated with Scala, which makes working with distributed datasets feel like working with local collection objects. Spark supports iterative tasks on distributed datasets and can run alongside Hadoop on Hadoop file systems (via YARN, Mesos, and the like).
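The following minimal Java sketch shows the idea: an RDD loaded from a hypothetical HDFS file is cached in memory once and then reused across multiple passes, which is exactly the pattern that makes iterative and interactive workloads fast.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load a (hypothetical) HDFS file and pin it in cluster memory.
            JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log").cache();

            // Both passes reuse the in-memory dataset instead of re-reading HDFS.
            long errors = lines.filter(l -> l.contains("ERROR")).count();
            long warnings = lines.filter(l -> l.contains("WARN")).count();
            System.out.println(errors + " errors, " + warnings + " warnings");
        }
    }
}
```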
5. Storm
Code hosting address: GitHub
Storm is a distributed, fault-tolerant real-time computing system developed at BackType, which was later acquired by Twitter. Storm is a stream processing platform used for real-time computation and for updating databases in real time. Storm can also be used for "continuous computation," running standing queries over data streams and emitting results to users as a stream, and for "distributed RPC," running expensive operations in parallel.
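A minimal Storm topology in Java looks like the sketch below: a spout emits a stream of tuples and a bolt consumes them, updating in-memory state as each tuple arrives. It uses Storm's bundled TestWordSpout as the source; package names assume a recent Apache Storm release (older releases used the backtype.storm packages).

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class StormSketch {
    // A bolt that keeps a running count per word (state held in memory).
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            counts.merge(word, 1L, Long::sum);
            System.out.println(word + " -> " + counts.get(word));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);  // built-in demo spout
        builder.setBolt("counts", new CountBolt(), 4)
               .shuffleGrouping("words");                   // randomly distribute tuples

        // Run in-process; StormSubmitter.submitTopology would target a real cluster.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(10_000); // let the stream run briefly
        cluster.shutdown();
    }
}
```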
Other solutions on Hadoop
As mentioned above, driven by real-time business needs, various laboratories invented real-time processing tools such as Storm, Impala, Spark, and Samza. In this section we share the labs' open source solutions addressing performance, compatibility, and data types, including Shark, Phoenix, Apache Accumulo, Apache Drill, Apache Giraph, Apache Hama, Apache Tez, and Apache Ambari.
6. Shark
Code hosting address: GitHub
Shark, short for "Hive on Spark," is a large-scale data warehouse system built for Spark and compatible with Apache Hive. It can execute HiveQL up to 100 times faster without modifying existing data or queries.
Shark supports Hive's query language, metastore, serialization formats, and user-defined functions, so it integrates seamlessly with existing Hive deployments as a faster, more powerful alternative.
7. Phoenix
Code hosting address: GitHub
Phoenix is a SQL middle layer built on top of Apache HBase, written entirely in Java, that ships a client-embeddable JDBC driver. The Phoenix query engine converts a SQL query into one or more HBase scans and orchestrates their execution to produce a standard JDBC result set. Because it uses the HBase API, coprocessors, and custom filters directly, latency is on the order of milliseconds for simple queries and seconds for queries over millions of rows.
Phoenix features worth noting include: (1) an embedded JDBC driver that implements most of the java.sql interfaces, including the metadata API; (2) the ability to model columns with multiple row keys or key/value cells; (3) DDL support; (4) a versioned schema repository; (5) DML support; (6) limited transaction support through client-side batching; (7) adherence to the ANSI SQL standard.
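A hedged sketch of the JDBC workflow, with a hypothetical metrics table: the URL names the HBase cluster's ZooKeeper quorum, DDL and queries are ordinary SQL, and writes use Phoenix's UPSERT statement.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixSketch {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URLs name the ZooKeeper quorum of the HBase cluster.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(true);
            // DDL and DML are plain SQL; Phoenix maps them onto HBase scans/puts.
            stmt.execute("CREATE TABLE IF NOT EXISTS metrics ("
                    + "host VARCHAR NOT NULL, ts BIGINT NOT NULL, value DOUBLE "
                    + "CONSTRAINT pk PRIMARY KEY (host, ts))");
            stmt.executeUpdate("UPSERT INTO metrics VALUES ('web1', 1, 0.75)");
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT host, MAX(value) FROM metrics GROUP BY host")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " " + rs.getDouble(2));
                }
            }
        }
    }
}
```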
8. Apache Accumulo
Code hosting address: Apache SVN
Apache Accumulo is a reliable, scalable, high-performance, sorted, distributed key-value store with cell-level access control and customizable server-side processing. It follows Google BigTable's design ideas and is built on Apache Hadoop, ZooKeeper, and Thrift. Accumulo was first developed by the NSA and later donated to the Apache Foundation.
Compared to Google BigTable, Accumulo's main advances are cell-level access control and a server-side programming mechanism; the latter allows Accumulo to modify key-value pairs at any point in the data processing pipeline.
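The cell-level access control looks like this in a hedged Java sketch (the instance name, ZooKeeper host, credentials, and table are placeholders, and the table is assumed to exist): each value is written together with a visibility expression, and only scanners holding a matching authorization can read it back.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloSketch {
    public static void main(String[] args) throws Exception {
        // Instance name, ZooKeeper host, and credentials are placeholders.
        Connector conn = new ZooKeeperInstance("accumulo", "zk-host")
                .getConnector("user", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());
        Mutation m = new Mutation("row1");
        // The visibility expression is stored with the cell itself; only
        // scanners holding the "admin" or "audit" authorization can read it.
        m.put("info", "ssn", new ColumnVisibility("admin|audit"),
              new Value("123-45-6789".getBytes()));
        writer.addMutation(m);
        writer.close();
    }
}
```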
9. Apache Drill
Code hosting address: GitHub
Apache Drill is essentially an open source implementation of Google's Dremel: a distributed MPP query layer that supports SQL and several other query languages against NoSQL and Hadoop data storage systems, helping Hadoop users query massive datasets faster. At the moment, Drill only counts as a framework, containing just the initial features of the Drill vision.
Drill's goal is to support a wider range of data sources, data formats, and query languages, to analyze petabytes of data quickly (within seconds), and to become a distributed analysis system for large datasets.
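For flavor, here is a hedged Java sketch using Drill's JDBC driver in embedded mode to query a raw JSON file with no schema declared up front; cp.`employee.json` is a sample file Drill ships on its classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillSketch {
    public static void main(String[] args) throws Exception {
        // "zk=local" runs Drill embedded in-process; point zk= at a ZooKeeper
        // quorum to query a distributed Drillbit cluster instead.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a JSON file directly; cp.`employee.json` is a bundled sample.
             ResultSet rs = stmt.executeQuery(
                     "SELECT full_name FROM cp.`employee.json` LIMIT 3")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}
```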
10. Apache Giraph
Code hosting address: GitHub
Apache Giraph is a scalable, distributed, iterative graph processing system inspired by the BSP (bulk synchronous parallel) model and Google's Pregel; it differs from Pregel in being an open source, Hadoop-based architecture.
The Giraph platform is ideal for running large-scale logical computations such as page ranking, shared connections, and personalized rankings. Giraph focuses on social-graph computing and is at the core of Facebook's Open Graph tool, processing trillions of connections between users and their behavior in minutes.
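A Giraph computation is written from the vertex's point of view. The hedged Java sketch below is a simplified PageRank: in each BSP superstep a vertex sums the rank messages it received, updates its own value, and sends its rank share along its outgoing edges.

```java
import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Each superstep is one BSP round: consume messages, update the vertex,
// send messages to neighbors, then block on the global barrier.
public class PageRankSketch extends BasicComputation<
        LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

    private static final int MAX_SUPERSTEPS = 30;

    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                        Iterable<DoubleWritable> messages) throws IOException {
        if (getSuperstep() >= 1) {
            double sum = 0;
            for (DoubleWritable msg : messages) {
                sum += msg.get();
            }
            vertex.setValue(new DoubleWritable(
                    0.15 / getTotalNumVertices() + 0.85 * sum));
        }
        if (getSuperstep() < MAX_SUPERSTEPS && vertex.getNumEdges() > 0) {
            sendMessageToAllEdges(vertex,
                    new DoubleWritable(vertex.getValue().get() / vertex.getNumEdges()));
        } else {
            vertex.voteToHalt(); // computation ends when all vertices halt
        }
    }
}
```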
11. Apache Hama
Code hosting address: GitHub
Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework built on Hadoop, modeled on Google's Pregel. It is used for large-scale scientific computation, especially matrix and graph computation. In a cluster environment, the system architecture consists of three main pieces: BSPMaster/GroomServer (the computation engine), ZooKeeper (distributed locking), and HDFS/HBase (the storage systems).
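The BSP programming model is easiest to see in code. The following hedged Java sketch is the classic Hama "hello world": every peer sends a message to every peer, sync() forms the global barrier that ends the superstep, and the received messages are then drained.

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hama.bsp.BSP;
import org.apache.hama.bsp.BSPPeer;
import org.apache.hama.bsp.sync.SyncException;

// One BSP superstep: every peer does local work, exchanges messages,
// then waits at the barrier created by sync().
public class HamaSketch extends
        BSP<NullWritable, NullWritable, Text, NullWritable, Text> {

    @Override
    public void bsp(BSPPeer<NullWritable, NullWritable, Text, NullWritable, Text> peer)
            throws IOException, SyncException, InterruptedException {
        // Send a greeting to every peer in the cluster (including ourselves).
        for (String other : peer.getAllPeerNames()) {
            peer.send(other, new Text("hello from " + peer.getPeerName()));
        }
        peer.sync(); // global barrier: all sends complete before any receive

        Text msg;
        while ((msg = peer.getCurrentMessage()) != null) {
            peer.write(msg, NullWritable.get()); // emit to the job's output
        }
    }
}
```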
12. Apache Tez
Code hosting address: GitHub
Apache Tez is a DAG (Directed Acyclic Graph) computing framework built on Hadoop YARN. It splits the Map/Reduce process into several sub-processes and can combine multiple Map/Reduce tasks into one large DAG task, eliminating the intermediate file writes between Map/Reduce stages; it also combines sub-processes sensibly to reduce overall job runtime. Tez is developed and primarily supported by Hortonworks.
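A hedged Java sketch of assembling and submitting a Tez DAG follows; com.example.MyProcessor stands in for a user-written processor, and a real job would add more vertices connected by Edges.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.tez.client.TezClient;
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.TezConfiguration;
import org.apache.tez.dag.api.Vertex;

public class TezSketch {
    public static void main(String[] args) throws Exception {
        TezConfiguration conf = new TezConfiguration(new Configuration());
        TezClient client = TezClient.create("dag-demo", conf);
        client.start();

        // A one-vertex DAG; "com.example.MyProcessor" is a hypothetical
        // user-written processor. Real DAGs add more vertices joined by Edges.
        DAG dag = DAG.create("demo");
        dag.addVertex(Vertex.create("work",
                ProcessorDescriptor.create("com.example.MyProcessor"), 2));

        client.submitDAG(dag).waitForCompletion();
        client.stop();
    }
}
```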
13. Apache Ambari
Code hosting address: Apache SVN
Apache Ambari is an open source framework for provisioning, managing, and monitoring Apache Hadoop clusters. It provides an intuitive operations tool and a robust API that hide the complexity of Hadoop operations and greatly simplify cluster management. Its first release came in June 2012.
Apache Ambari is now a top-level Apache project. Hortonworks brought Ambari into the Apache Incubator back in August 2011, setting out a vision of radically simple Hadoop cluster management. In more than two years of development since, the community has grown significantly, from a small team at Hortonworks to contributors across many organizations. Ambari's user base has grown steadily, with many organizations relying on it to deploy and manage Hadoop clusters at scale in their large data centers.
The Hadoop components currently supported by Apache Ambari include HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.
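Beyond its web UI, Ambari exposes a REST API, which the hedged Java sketch below exercises to list the clusters a server manages; the host, port, and admin credentials are placeholder defaults for a real deployment.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class AmbariSketch {
    public static void main(String[] args) throws Exception {
        // List the clusters an Ambari server manages; host and credentials
        // are placeholders (8080 and admin:admin are common defaults).
        URL url = new URL("http://ambari-host:8080/api/v1/clusters");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON describing each cluster
            }
        }
    }
}
```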