Hadoop Learning: Ecosystem Overview

Source: Internet
Author: User
Tags: zookeeper, hadoop, ecosystem, sqoop

1. Background

Google, the world's leading search engine, faces the problem of handling massive amounts of search data every day. After a long period of practice, Google built its own big data framework. It did not open-source the code, but it did publish papers elaborating its ideas, including the MapReduce approach. These papers were later read by Doug Cutting, who would come to be known as the father of Hadoop, and they aroused his great interest.

At the time, he was working on a project that required processing large amounts of data in parallel across many tasks, and despite repeated attempts he and his partners had not achieved satisfactory results.

So Doug and his team decided to develop a new framework based on Google's MapReduce ideas.

After a period of effort, in the fall of 2005, the work was formally brought into the Apache Foundation as part of Nutch, a sub-project of Lucene; it later grew into the Hadoop project.

The name Hadoop is not an abbreviation but a made-up word. Doug Cutting, the project's creator, explained the name this way: "It's the name my kid gave to a brownish-yellow stuffed elephant toy."

A recommended reference book for learning Hadoop is Hadoop: The Definitive Guide (currently in its 3rd edition in Chinese and its 4th edition in English). Its author, Tom White, has been a core Hadoop contributor since the project's early days and is a member of the Hadoop Project Management Committee.

A truly top-notch figure!!

2. Ecosystem Overview

After a long period of development, Hadoop has formed an ecosystem of its own.

Some of these frameworks were developed by the Facebook team, and others by big companies such as Yahoo!. Let's take a look at the ecosystem diagram:

As the diagram shows, the Apache Hadoop ecosystem contains the following major components:

* HDFS and MapReduce: This is the core framework of Hadoop (developed by Doug Cutting and his team)

* HBase, Hive, Pig: These three frameworks are primarily responsible for data storage and querying; they were developed by different companies, and we'll cover them later

* Flume, Sqoop: Responsible for data import and export

* Mahout: Machine learning and analysis

* ZooKeeper: Distributed coordination

* Ambari: Cluster Management

* Avro: Storage and serialization of data

* HCatalog: Metadata management

3. The Components in Detail

1) Apache HBase

Because HDFS is an append-only file system, it does not allow data to be modified in place.

So Apache HBase was born.

HBase is a distributed, random-access, column-oriented database system.

HBase runs on top of HDFS, allowing application developers to read and write the data stored in HDFS directly.

The only drawback, however, is that HBase does not support SQL statements.

So, it's also a NoSQL database.

However, it provides a command-line interface and a rich set of API functions for updating data.

It should be mentioned that HBase data is stored in the HDFS file system as key-value pairs.
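As a minimal sketch of what that looks like from an application, here is how a cell can be written and read back with the HBase Java client API. The table name "users", column family "info", row key, and the ZooKeeper quorum address are assumptions made up for this example, not anything from the original article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper quorum; point this at your own cluster.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```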

2) Apache Pig

Apache Pig was developed by Yahoo!; it provides an abstraction layer on top of MapReduce.

It provides a language called Pig Latin that is used to create MapReduce programs.

Programmers use Pig Latin to write data-analysis programs, which Pig turns into tasks that are executed in parallel.

This makes it possible to utilize Hadoop's distributed clusters more efficiently.
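For a rough illustration (not from the original article), a tiny Pig Latin word-count script can be submitted from Java through Pig's PigServer API. The input file name, field layout, output directory, and local execution mode below are all assumptions; on a real cluster you would use MapReduce mode.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // A tiny Pig Latin word-count script (the input path is an assumption).
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // Writing the result to a directory triggers execution of the pipeline.
        pig.store("counts", "wordcount_out");
    }
}
```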

Pig has been used in a number of successful projects at large companies such as eBay, LinkedIn, and Twitter.

3) Apache Hive

Hive serves as a data warehouse for big data, and it also uses the HDFS file system to store its data.

In Hive we don't write MapReduce programs, because Hive provides an SQL-like language called HiveQL.

This lets developers quickly write ad-hoc queries that resemble the SQL queries used with relational databases.
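As a minimal sketch (not from the original article), such an ad-hoc HiveQL query can be issued from Java through the standard Hive JDBC driver against a HiveServer2 instance. The connection URL, credentials, and the "page_views" table with its columns are placeholders assumed for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // An ad-hoc HiveQL query over a hypothetical "page_views" table.
            ResultSet rs = stmt.executeQuery(
                "SELECT user_id, COUNT(*) AS views " +
                "FROM page_views GROUP BY user_id ORDER BY views DESC LIMIT 10");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```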

4) Apache ZooKeeper

A Hadoop cluster does its work through communication between nodes.

ZooKeeper is used to manage and coordinate these nodes.

In addition to managing nodes, it maintains configuration information and groups the services of a distributed system.

Unlike most other components in the ecosystem, ZooKeeper can run independently of Hadoop.

Because ZooKeeper keeps the information it manages in memory, its performance is relatively high.
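To make the "configuration information" idea concrete, here is a minimal sketch using the ZooKeeper Java client to store a small configuration value as a znode and read it back. The connection string, znode path, and stored value are assumptions for illustration only.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connection string and session timeout are assumptions; the watcher is a no-op.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Store a small piece of configuration under a znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "batch.size=128".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the configuration back; other processes could watch this path for changes.
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data));

        zk.close();
    }
}
```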

5) Apache Mahout

Mahout is an open-source machine learning library that enables Hadoop users to efficiently perform a series of operations such as data analysis, data mining, and clustering.

Mahout is particularly effective on large datasets; the algorithms it provides are performance-optimized to run efficiently over the MapReduce framework and the HDFS file system.
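As a loose illustration (not from the article), Mahout's "Taste" collaborative-filtering API can build a simple user-based recommender from a ratings file. The file name, neighborhood size, and user ID below are placeholders, and this particular API runs in memory rather than as MapReduce jobs.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // CSV of "userID,itemID,rating" lines; the file name is an assumption.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```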

6) Apache HCatalog

HCatalog provides metadata management services on top of Hadoop.

All software running on top of Hadoop can use HCatalog to store its schemas in the HDFS file system.

HCatalog enables third-party software to create and edit table definitions, and it exposes the generated metadata in the form of a REST API.

With HCatalog, therefore, we do not need to know the physical location of the data.

HCatalog also provides data definition (DDL) statements, through which MapReduce, Pig, and Hive jobs can be queued for execution, and if necessary their progress can be monitored.
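As a rough sketch of the REST side, HCatalog's HTTP interface (WebHCat) can be queried with a plain HTTP client, for example to list the tables of a database. The host, port (50111 is the common WebHCat default), and user name below are assumptions for the example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HCatalogExample {
    public static void main(String[] args) throws Exception {
        // WebHCat endpoint; host, port, and user.name are assumptions.
        String url = "http://localhost:50111/templeton/v1/ddl/database/default/table?user.name=hadoop";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        // The response is a JSON document listing the tables in the "default" database.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```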

7) Apache Ambari

Ambari is used to monitor Hadoop clusters.

It provides a series of features, such as an installation wizard, system alerts, cluster management, job performance monitoring, and so on.

Ambari also provides RESTful APIs for integration with other software.
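For illustration only, a client can list the clusters an Ambari server manages through that REST API. The host, port (8080 is a common default), and admin credentials below are assumptions, not values from the article.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class AmbariExample {
    public static void main(String[] args) throws Exception {
        // Host, port, and credentials are assumptions for this sketch.
        String url = "http://localhost:8080/api/v1/clusters";
        String auth = Base64.getEncoder().encodeToString("admin:admin".getBytes());

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Basic " + auth)
                .header("X-Requested-By", "ambari")  // Ambari requires this header for modifying calls; harmless on a GET
                .GET()
                .build();

        // The JSON response enumerates the clusters this Ambari server manages.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```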

8) Apache Avro

How can other programming languages organize big data for Hadoop effectively? Avro exists for exactly this purpose.

Avro provides compression and storage of data on the individual nodes.

Data stored with Avro can easily be read from scripting languages such as Python, as well as from non-scripting languages such as Java.

In addition, Avro can also be used to serialize data in the MapReduce framework.
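As a minimal sketch (the schema, field names, and file name are made up for illustration), Avro's Java API can serialize records into a self-describing container file and read them back; the schema travels with the data.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A tiny schema defined inline; real projects usually keep .avsc files.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        File file = new File("users.avro");

        // Serialize one record into an Avro container file.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read the records back; the embedded schema tells the reader how to decode them.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("name") + " " + rec.get("age"));
            }
        }
    }
}
```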

9) Apache Sqoop

Sqoop is used to efficiently move large datasets into and out of Hadoop. For example, it allows developers to easily import data from, and export data to, sources such as relational databases, enterprise data warehouses, and even other applications.

10) Apache Flume

Flume is often used for log aggregation, and it also serves as an ETL (extract-transform-load) tool.

Well, the Hadoop ecosystem and its main components are covered here!
