What is the Hadoop ecosystem?
Some Teiid articles and examples describe using Hadoop as a data source through Hive. When you build data virtualization examples on a Hadoop environment such as Hortonworks Data Platform or Cloudera QuickStart, you will encounter a large number of open-source projects. This article gives a preliminary overview of the Hadoop ecosystem. For details about the open-source projects below, see the Hadoop ecosystem table.
MapReduce - MapReduce is a programming model that processes large datasets with parallel, distributed algorithms on a cluster. Apache MapReduce is derived from Google's paper "MapReduce: Simplified Data Processing on Large Clusters". The current Apache MapReduce version is built on the Apache YARN framework. YARN stands for "Yet Another Resource Negotiator". YARN can also run applications that do not follow the MapReduce model; it is Apache Hadoop's attempt to go beyond MapReduce for data processing.
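The programming model can be illustrated with a minimal, single-process sketch of a word count. This is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and everything runs in one Python process, whereas a real cluster runs the map and reduce steps in parallel on many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster, each document (input split) would be mapped on a
    # different node; here the map calls simply run one after another.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

The only parts a user writes in this model are the map and reduce functions; the grouping step between them is what the framework does on the user's behalf.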
HDFS - The Hadoop Distributed File System (HDFS) provides a way to store large files across multiple machines. Hadoop and HDFS are derived from the Google File System (GFS). Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) for an HDFS cluster. The HDFS high availability feature, together with ZooKeeper, solves this problem by providing the option to run two redundant NameNodes in the same cluster in an active/passive configuration.
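Storing a large file across multiple machines comes down to cutting it into fixed-size blocks and keeping copies of each block on several nodes. The sketch below is a toy model under stated assumptions: split_into_blocks and place_blocks are hypothetical helpers, the node names are made up, and the round-robin placement is a deliberate simplification that does not reproduce the placement policy HDFS actually uses.

```python
def split_into_blocks(file_size, block_size):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simplified round-robin)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


if __name__ == "__main__":
    MB = 1024 * 1024
    # Illustrative values: a 300 MB file, 128 MB blocks, 3 replicas per block.
    blocks = split_into_blocks(300 * MB, 128 * MB)
    placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
    for block_id, (offset, length) in enumerate(blocks):
        print(block_id, offset, length, placement[block_id])
```

Because every block lives on more than one node, losing a single machine does not lose data; the mapping from files to blocks and nodes is the metadata that the NameNode holds, which is why it used to be the single point of failure.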
HBase - HBase is an open-source implementation of Google Bigtable. Just as Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS. Just as Google runs MapReduce to process the massive data in Bigtable, HBase uses Hadoop MapReduce to process the massive data in HBase. And where Bigtable uses Chubby as its coordination service, HBase uses ZooKeeper.
Hive - Hive is a data warehouse infrastructure developed by Facebook, used to aggregate, query, and analyze data. Hive provides a SQL-like language, HiveQL, which is not SQL-92 compatible.
Pig - Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many traditional data operations (join, sort, filter, etc.) and lets users develop their own functions for reading, processing, and writing data. Pig runs on Hadoop and uses both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing: it compiles a Pig Latin script into a series of one or more MapReduce jobs and then executes them. Pig Latin looks different from most programming languages: it has no if statements or for loops.
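The idea of a data flow, where each step takes a whole relation and produces a new one, can be sketched in plain Python. This is not Pig Latin and does not use Pig: filter_rows, group_by, order_by, and total_per_user are hypothetical helpers, and the "visits" records are made-up sample data.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key):
    """Collect rows into groups that share the same value for `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def order_by(rows, key, descending=False):
    """Return the rows sorted by the value of `key`."""
    return sorted(rows, key=lambda row: row[key], reverse=descending)


def total_per_user(visits, min_seconds):
    # The flow itself is a straight sequence of relation-to-relation steps:
    # filter, then group, then aggregate, then sort.
    long_visits = filter_rows(visits, lambda v: v["seconds"] >= min_seconds)
    by_user = group_by(long_visits, "user")
    totals = [
        {"user": user, "seconds": sum(v["seconds"] for v in rows)}
        for user, rows in by_user.items()
    ]
    return order_by(totals, "seconds", descending=True)


if __name__ == "__main__":
    visits = [
        {"user": "ann", "seconds": 30},
        {"user": "bob", "seconds": 5},
        {"user": "ann", "seconds": 20},
        {"user": "bob", "seconds": 60},
    ]
    print(total_per_user(visits, min_seconds=10))
```

The body of total_per_user contains no branching or explicit looping over records, only named intermediate relations; that is the shape a data-flow script takes, and it is what makes such a script easy to translate into a chain of batch jobs.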
ZooKeeper - ZooKeeper started as a Hadoop subproject. It is a reliable coordination system for large-scale distributed systems, providing configuration maintenance, naming, distributed synchronization, and group services. Its goal is to encapsulate key services that are complex and error-prone, and to give users easy-to-use interfaces on top of a fast, stable system. ZooKeeper is an open-source implementation of Google's Chubby. Typical uses are leader election and maintaining configuration information: in a distributed environment, for example, a single master instance or shared configuration is often needed to keep file writes consistent.
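Leader election on top of a coordination service commonly works by having every candidate register a numbered, session-bound entry and treating the holder of the lowest number as leader. The sketch below simulates only that rule in memory. It is not a ZooKeeper client: ElectionRegistry and its methods are hypothetical names, and a real deployment would rely on the service's own ephemeral, sequentially numbered nodes and change notifications.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a coordination service (not a real ZooKeeper client)."""

    def __init__(self):
        self._counter = itertools.count()
        self._nodes = {}  # sequence number -> candidate name

    def join(self, candidate):
        """Register a candidate and return its unique, increasing sequence number."""
        sequence = next(self._counter)
        self._nodes[sequence] = candidate
        return sequence

    def leave(self, candidate):
        """Remove a candidate's entries, as if its session had ended."""
        self._nodes = {
            seq: name for seq, name in self._nodes.items() if name != candidate
        }

    def leader(self):
        """The candidate holding the lowest sequence number is the leader."""
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


if __name__ == "__main__":
    registry = ElectionRegistry()
    for name in ["master-a", "master-b", "master-c"]:
        registry.join(name)
    print(registry.leader())    # the first candidate to join
    registry.leave("master-a")  # simulate the current leader failing
    print(registry.leader())    # the next candidate in sequence takes over
```

Because the rule depends only on the registered entries, every participant that reads the registry reaches the same answer, and a failed leader is replaced as soon as its entry disappears.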
Mahout - Mahout is a machine learning and math library built on MapReduce.