1. Hadoop Ecosystem
2. HDFS (Hadoop Distributed File System)
Derived from the GFS paper published by Google in October 2003; HDFS is the Hadoop clone of GFS.
HDFS is the foundation of data storage and management in the Hadoop system. It is a highly fault-tolerant system capable of detecting and responding to hardware failures, and it is designed to run on low-cost commodity hardware. HDFS simplifies the file consistency model and provides high-throughput data access through streaming reads, which suits applications with large data sets. The roles in an HDFS cluster are listed below, followed by a small client-side sketch.
Client: splits files, accesses HDFS, interacts with the NameNode to obtain file location information, and interacts with DataNodes to read and write data.
NameNode: the master node (only one in Hadoop 1.x); manages the HDFS namespace and block mapping information, configures the replica policy, and handles client requests.
DataNode: a slave node; stores the actual data blocks and reports storage information to the NameNode.
Secondary NameNode: assists the NameNode and shares part of its workload by periodically merging the fsimage and edits files and pushing the result to the NameNode; in an emergency it can help recover the NameNode, but it is not a hot standby for the NameNode.
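To make the client role concrete, here is a minimal sketch (not from the original article) of writing and reading a file through the HDFS Java client API; the NameNode address and the file path are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/hello.txt");

            // Write: the client asks the NameNode where to place blocks,
            // then streams the bytes to DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: block locations come from the NameNode,
            // the data itself from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```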
3. MapReduce (Distributed Computing Framework)
Based on the MapReduce paper published by Google in December 2004; Hadoop MapReduce is the clone of Google's MapReduce.
MapReduce is a computational model for processing large volumes of data. Map specifies an operation applied to each element of the dataset, producing key-value pairs as intermediate results; Reduce aggregates all the values that share the same key in the intermediate results to produce the final result. This division of work makes MapReduce well suited to data processing in a distributed, parallel environment composed of a large number of machines.
JobTracker: the master node (only one); manages all jobs, monitors jobs and tasks, handles failures, decomposes each job into a series of tasks, and assigns them to TaskTrackers.
TaskTracker: a slave node; runs Map tasks and Reduce tasks and interacts with the JobTracker to report task status.
Map Task: parses each data record, passes it to the user-written map() function, and writes the output to local disk (for a map-only job, the output is written directly to HDFS).
Reduce Task: remotely reads the outputs of the Map tasks, sorts them, and passes the data, grouped by key, to the user-written reduce() function.
The MapReduce process, taking WordCount as an example:
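Below is a minimal WordCount job in the Hadoop MapReduce Java API, close to the standard example that ships with Hadoop; the input and output paths are taken from the command line and the class names are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the 1s emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```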
4. Hive (Hadoop-based Data Warehouse)
Open-sourced by Facebook, originally built to solve the problem of computing statistics over massive amounts of structured log data.
Hive defines a SQL-like query language (HQL) and translates the SQL into MapReduce jobs executed on Hadoop.
Typically used for offline analysis.
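As an illustration only (the HiveServer2 address, credentials, and the web_logs table are assumptions), an HQL query can be submitted through Hive's JDBC driver; Hive compiles the aggregation into a MapReduce job behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, user, and table are assumed for illustration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // This HQL aggregation is translated by Hive into a MapReduce job.
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) FROM web_logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```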
5. HBase (Distributed Column-Oriented Database)
Based on the BigTable paper published by Google in November 2006; HBase is the clone of Google's BigTable.
HBase is a scalable, highly reliable, high-performance, distributed, column-oriented database for structured data with dynamic schemas. Unlike traditional relational databases, HBase adopts the BigTable data model: an enhanced, sparse, sorted mapping table (key/value), where the key is composed of a row key, a column key, and a timestamp. HBase provides random, real-time read and write access to large-scale data, and data stored in HBase can be processed with MapReduce, which combines data storage and parallel computing neatly.
Data model: schema --> table --> column family --> column --> row key --> timestamp --> value
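A minimal sketch of that model through the HBase Java client API (the ZooKeeper quorum, table name, and column family are assumed for illustration): each cell is addressed by row key, column family, column qualifier, and timestamp.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumed ZooKeeper quorum used to locate the HBase cluster.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user_profile"))) {

            // Write one cell: row key "user-001", column family "info", column "city".
            Put put = new Put(Bytes.toBytes("user-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);

            // Random read by row key; the most recent timestamped value is returned.
            Result result = table.get(new Get(Bytes.toBytes("user-001")));
            byte[] city = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
            System.out.println(Bytes.toString(city));
        }
    }
}
```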
6. ZooKeeper (Distributed Coordination Service)
Based on the Chubby paper published by Google in November 2006; ZooKeeper is the clone of Chubby.
It solves data management problems in a distributed environment: unified naming, state synchronization, cluster management, configuration synchronization, and so on.
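For illustration (the ensemble address and znode path are assumptions), here is a minimal ZooKeeper client sketch that stores a shared configuration value in a znode and reads it back, the kind of primitive the use cases above build on.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigSketch {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Assumed ZooKeeper ensemble address.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        String path = "/app-config";  // hypothetical znode holding shared configuration
        byte[] value = "jdbc:mysql://db:3306/app".getBytes(StandardCharsets.UTF_8);

        // Create a persistent znode if it does not exist yet.
        if (zk.exists(path, false) == null) {
            zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any node in the cluster can read the same configuration value.
        byte[] stored = zk.getData(path, false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));

        zk.close();
    }
}
```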
7. Sqoop (Data Sync Tool)
Sqoop is short for SQL-to-Hadoop and is used primarily for transferring data between traditional databases and Hadoop.
The import and export of data is essentially a MapReduce program, which takes advantage of MapReduce's parallelism and fault tolerance.
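As a rough sketch only (the connection string, credentials, table, and target directory are all assumptions, and the programmatic entry point shown is the Sqoop 1.x tool runner), an import that Sqoop turns into a parallel MapReduce job might look like this:

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Hypothetical connection string, credentials, table, and target directory.
        // Sqoop executes this import as a MapReduce job with 4 parallel map tasks.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://db-host:3306/shop",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/warehouse/orders",
            "--num-mappers", "4"
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```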
8. Pig (Hadoop-Based Data Flow System)
Open-sourced by Yahoo!, designed to provide a MapReduce-based ad-hoc (computation happens at query time) data analysis tool.
It defines a data flow language, Pig Latin, and translates Pig Latin scripts into MapReduce jobs executed on Hadoop.
Typically used for offline analysis.
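A small hypothetical example (the input path, field layout, and output path are assumptions), embedding a Pig Latin data flow through the PigServer Java API; Pig compiles the flow into MapReduce jobs when the result is stored.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode submits the compiled plan to the Hadoop cluster.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registered Pig Latin statement extends the data flow;
        // the whole flow runs as MapReduce jobs when the result is stored.
        pig.registerQuery("logs = LOAD '/data/access_logs' USING PigStorage('\\t') "
                + "AS (ip:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_ip = GROUP logs BY ip;");
        pig.registerQuery("traffic = FOREACH by_ip GENERATE group AS ip, SUM(logs.bytes) AS total;");
        pig.store("traffic", "/data/traffic_by_ip");

        pig.shutdown();
    }
}
```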
9. Mahout (Data mining algorithm library)
Mahout originated in 2008 as a sub-project of Apache Lucene; it grew considerably in a short period of time and is now a top-level Apache project.
Mahout's main goal is to provide scalable implementations of classic machine learning algorithms so that developers can create intelligent applications more quickly and easily. Mahout includes a wide range of data mining methods such as clustering, classification, recommendation engines (collaborative filtering), and frequent itemset mining. In addition to the algorithms, Mahout includes input/output tools for data and integration with other storage systems such as databases, MongoDB, or Cassandra.
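As one illustration of the recommendation side (the ratings.csv file, neighborhood size, and user ID are assumptions), a user-based collaborative-filtering sketch with Mahout's Taste API might look like this:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // Assumed CSV of "userID,itemID,preference" lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: find similar users, then recommend their items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```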
10. Flume (Log Collection Tool)
Flume is Cloudera's open-source log collection system, featuring distribution, high reliability, high fault tolerance, and easy customization and extension. It abstracts the path data takes, from generation through transmission and processing to finally being written to a target, into a data flow; the data source supports custom data senders, so Flume can collect data arriving over different protocols. The Flume data flow also provides simple processing of log data, such as filtering and format conversion. In addition, Flume can write logs to a variety of (customizable) data targets. Overall, Flume is a scalable log collection system suited to large-scale, complex environments.
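For illustration (the agent host, port, and event body are assumptions), an application can hand a log line to a Flume agent's Avro source through the Flume RPC client API; the agent's source, channel, and sink then carry the event to its target.

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientSketch {
    public static void main(String[] args) throws Exception {
        // Assumed address of a Flume agent whose source accepts Avro RPC events.
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-agent", 41414);
        try {
            // One log line becomes one event in the agent's source -> channel -> sink flow.
            Event event = EventBuilder.withBody("app started", StandardCharsets.UTF_8);
            client.append(event);
        } finally {
            client.close();
        }
    }
}
```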
Hadoop Core Components