What is the Hadoop ecosystem?

Last Update:2015-08-03 Source: Internet

Author: User

Tags hadoop mapreduce hadoop ecosystem



Https://www.facebook.com/Hadoopers

Some Teiid articles and examples describe using Hadoop as a data source through Hive. A Hadoop environment used to build data virtualization examples, such as Hortonworks Data Platform or Cloudera QuickStart, bundles a large number of open-source projects. This article gives a first overview of the Hadoop ecosystem. For details about the open-source projects described below, see the Hadoop ecosystem table.

MapReduce - MapReduce is a programming model that uses cluster parallelism and distributed algorithms to process large datasets. Apache MapReduce is derived from Google's MapReduce, which was designed to simplify data processing on large clusters. The current version of Apache MapReduce is built on the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". YARN can also run applications that do not follow the MapReduce model; it is Apache Hadoop's attempt to move beyond MapReduce as its only data processing model.
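The MapReduce model can be illustrated with a minimal, single-process sketch in plain Python. It does not use Hadoop at all, and the function names (map_phase, shuffle_phase, reduce_phase) are made up for this illustration; a real job would run these phases in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Each phase only sees its own inputs, which is what allows a framework to spread map and reduce calls over many machines.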


HDFS - The Hadoop Distributed File System (HDFS) provides a way to store large files across multiple machines. Hadoop and HDFS are derived from the Google File System (GFS). Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. The HDFS high availability feature, together with ZooKeeper, solves this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration.
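As a rough sketch of how one large file ends up spread over several machines, the Python below cuts a file into fixed-size blocks and assigns each block to several nodes. The block size (128 MiB), the replication factor (3), the node names, and the round-robin placement are all assumptions made for this illustration; actual HDFS placement is more involved.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size in bytes (128 MiB)
REPLICATION = 3                 # assumed number of copies per block

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    This placement rule is a simplification for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("not enough nodes for the requested replication")
    return [
        [datanodes[(block + copy) % len(datanodes)] for copy in range(replication)]
        for block in range(num_blocks)
    ]

MIB = 1024 * 1024
blocks = split_into_blocks(300 * MIB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print([size // MIB for size in blocks])
print(placement)
```

Under these assumptions a 300 MiB file becomes three blocks, and losing any single node still leaves two copies of every block.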


HBase - Inspired by Google Bigtable, HBase is an open-source implementation of it. Just as Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS. Just as Google runs MapReduce to process the massive data stored in Bigtable, HBase uses Hadoop MapReduce to process the massive data stored in HBase. And where Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper.
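For readers unfamiliar with the Bigtable-style data model that HBase follows, here is a toy in-memory sketch in Python. It models a table as a sorted map from row key to column to timestamped versions. The class name TinyTable and its methods are hypothetical and unrelated to the real HBase client API.

```python
import bisect

class TinyTable:
    """Toy model of a Bigtable-style table: row key -> column -> versions."""

    def __init__(self):
        self._row_keys = []  # kept sorted so that range scans are cheap
        self._rows = {}      # row key -> {"family:qualifier": [(timestamp, value)]}

    def put(self, row_key, column, value, timestamp):
        """Store one version of a cell."""
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        versions = self._rows[row_key].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row_key, column):
        """Return the newest value of a cell, or None if the cell is absent."""
        versions = self._rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Return row keys in sorted order, start inclusive, stop exclusive."""
        low = bisect.bisect_left(self._row_keys, start_row)
        high = bisect.bisect_left(self._row_keys, stop_row)
        return self._row_keys[low:high]

table = TinyTable()
table.put("user#002", "info:name", "Bob", timestamp=1)
table.put("user#001", "info:name", "Alice", timestamp=1)
table.put("user#001", "info:name", "Alice B.", timestamp=2)
print(table.get("user#001", "info:name"))
print(table.scan("user#001", "user#003"))
```

Keeping rows sorted by key is what makes range scans over adjacent keys efficient in this kind of store.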



Hive - A data warehouse infrastructure developed by Facebook for summarizing, querying, and analyzing data. Hive provides HiveQL, a SQL-like language that is not fully compatible with SQL-92.
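To give a feel for the SQL-like style of query that HiveQL supports, the sketch below runs a simple aggregation. It uses Python's built-in sqlite3 module as a stand-in, not Hive itself, and the table and column names are hypothetical. Hive-specific aspects, such as how tables map to files in HDFS, are not modeled.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_name TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/docs"), ("carol", "/home")],
)

# A plain aggregation: count the views per URL, most viewed first.
query = """
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC, url
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

On a real cluster, a query of this shape would be executed by Hive over data stored in Hadoop rather than by an embedded database.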

Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes a large number of traditional data operations (join, sort, filter, and so on) and lets users develop their own functions for reading, processing, and writing data. Pig runs on Hadoop and uses both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing: it compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs and then executes them. Pig Latin looks different from most programming languages; it has no if statements or for loops.
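The data-flow style described above can be approximated in plain Python: each step takes a relation (here a list of tuples) and produces a new one, and the script is just a chain of such steps. This is only an illustration of the idea, not Pig Latin, and the helper names and sample data are invented for the example.

```python
# Sample relations; the field layout is hypothetical.
users = [("alice", 34), ("bob", 17), ("carol", 25)]            # (name, age)
clicks = [("alice", "/home"), ("carol", "/docs"),
          ("alice", "/docs"), ("bob", "/home")]                # (name, url)

def filter_by(relation, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in relation if predicate(row)]

def join(left, right, left_key=0, right_key=0):
    """Inner join: concatenate every pair of rows whose key fields match."""
    return [l + r for l in left for r in right if l[left_key] == r[right_key]]

def order_by(relation, key_index):
    """Sort rows by one field; the sort is stable for equal keys."""
    return sorted(relation, key=lambda row: row[key_index])

# The data flow: filter, then join, then sort.
adults = filter_by(users, lambda row: row[1] >= 18)
joined = join(adults, clicks)
result = order_by(joined, 3)
print(result)
```

The flow itself contains no branching or looping constructs, only named intermediate relations passed from one operation to the next, which is the shape a Pig Latin script takes.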



ZooKeeper - ZooKeeper started as a subproject of Hadoop and is now a top-level Apache project. It is a reliable coordination system for large-scale distributed systems, providing configuration maintenance, a naming service, distributed synchronization, and group services. The goal of ZooKeeper is to encapsulate key services that are complex and error-prone, and to offer users simple interfaces backed by a system that is fast and stable. ZooKeeper is an open-source implementation of Google's Chubby and an efficient, reliable coordination system. It can be used for leader election and for maintaining configuration information, for example when a distributed system needs a single master instance or shared configuration to keep file writes consistent.
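Leader election, mentioned above, can be sketched with a small in-memory simulation in Python. It assumes a commonly described approach in which each candidate registers under an increasing sequence number and the candidate with the lowest number leads. The ToyCoordinator class is hypothetical; it is not a ZooKeeper client and involves no network or failure detection.

```python
import itertools

class ToyCoordinator:
    """In-memory stand-in for a coordination service (not a ZooKeeper client)."""

    def __init__(self):
        self._sequence = itertools.count()
        self._candidates = {}  # sequence number -> candidate name

    def join(self, name):
        """Register a candidate and return its sequence number."""
        number = next(self._sequence)
        self._candidates[number] = name
        return number

    def leave(self, number):
        """Remove a candidate, for example because its session ended."""
        self._candidates.pop(number, None)

    def leader(self):
        """The live candidate with the lowest sequence number leads."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]

coordinator = ToyCoordinator()
first = coordinator.join("node-a")
second = coordinator.join("node-b")
third = coordinator.join("node-c")
leader_before = coordinator.leader()

coordinator.leave(first)  # the current leader goes away
leader_after = coordinator.leader()
print(leader_before, leader_after)
```

Because the rule depends only on which registrations are still present, every surviving node reaches the same answer about who the master is.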


Mahout - A machine learning and math library built on MapReduce.



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
