Apache Hadoop and the Hadoop Ecosystem
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation.
It lets users develop distributed programs without having to understand the low-level details of the distributed system, harnessing the power of a cluster for high-speed computation and storage.
Hadoop implements a distributed filesystem, the Hadoop Distributed File System, referred to as HDFS.
HDFS features high fault tolerance and is designed to be deployed on inexpensive (low-cost) hardware. It provides high-throughput access to application data, making it well suited to applications with very large datasets.
HDFS relaxes some POSIX requirements so that data in the filesystem can be accessed in streaming fashion.
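For illustration, here is a minimal sketch of streaming reads through the HDFS Java API. The NameNode address and file path are hypothetical placeholders; in a real deployment fs.defaultFS would come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster address

        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/data/example.txt")); // hypothetical path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            // Data is consumed sequentially as a stream rather than by random access.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```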
The core of the Hadoop framework's design is HDFS and MapReduce:
HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.
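The classic word count job shows this division of labor: input files live in HDFS, the map and reduce functions run across the cluster, and the results are written back to HDFS. The sketch below follows the standard Hadoop word count pattern, with input and output paths passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```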
Although Hadoop is best known for MapReduce and its distributed filesystem HDFS, the name Hadoop is also used collectively for a group of related projects that use the underlying platform for distributed computing and massive data processing.
Hadoop Common:
A set of components and interfaces for distributed filesystems and general-purpose I/O (serialization, Java RPC, and persistent data structures).
HDFS: The Hadoop Distributed File System, which runs on large clusters of commodity machines.
MapReduce:
A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HBase:
A distributed, column-oriented database. HBase uses HDFS as its underlying storage and supports both MapReduce-based batch computation and point queries (random reads).
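As a sketch of a point query, the following uses the HBase Java client to fetch one row directly by its key; the table name, column family, and row key here are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            // A Get by row key is a random read served directly by the region server;
            // no MapReduce scan is involved.
            Get get = new Get(Bytes.toBytes("user-42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(name == null ? "(not found)" : Bytes.toString(name));
        }
    }
}
```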
Hive: A distributed data warehouse tool, contributed by Facebook. Hive manages data stored in HDFS and provides a SQL-based query language (translated by its runtime engine into MapReduce jobs) for querying that data.
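A minimal sketch of querying Hive from Java over JDBC follows; the HiveServer2 URL, credentials, and the logs table are hypothetical, and the hive-jdbc driver is assumed to be on the classpath. Hive compiles the SQL into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hiveserver:10000/default"; // hypothetical HiveServer2 address
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // Hive translates this query into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                 "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```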
ZooKeeper: A distributed lock facility providing Google Chubby-like features, originally contributed by Yahoo!.
A distributed, highly available coordination service, it supplies primitives such as distributed locks for building distributed applications.
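As a sketch of the distributed-lock idea, the code below tries to create an ephemeral znode: whichever client succeeds holds the lock, and ZooKeeper deletes the znode automatically if that client's session dies. The connect string and lock path are hypothetical (the /locks parent is assumed to exist), and a production lock would use sequential znodes with watches, or a recipe library such as Apache Curator.

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleZkLock {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk1:2181", 15000, event -> connected.countDown());
        connected.await(); // wait for the session to be established
        try {
            // Ephemeral znode: removed automatically when this session ends,
            // so a crashed client can never hold the lock forever.
            zk.create("/locks/my-resource", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            System.out.println("lock acquired, doing work...");
        } catch (KeeperException.NodeExistsException e) {
            System.out.println("lock held by another client");
        } finally {
            zk.close(); // releases the lock if we held it
        }
    }
}
```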
Avro: A serialization system that supports efficient, cross-language RPC and persistent data storage. This newer serialization format and transfer tool is gradually replacing Hadoop's original IPC mechanism.
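A minimal sketch of Avro's data-file serialization with a generic record follows; the User schema is hypothetical. Because the schema is embedded in the file, any language with an Avro library can read the data back.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    // Hypothetical record schema with two fields.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);

        File file = new File("users.avro");
        // Write: the schema travels with the file.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back without any generated classes.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord rec : reader) {
                System.out.println(rec);
            }
        }
    }
}
```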
Pig:
A big data analytics platform that offers users a variety of interfaces. It provides a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
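As a sketch, a Pig Latin dataflow can also be driven from Java through the PigServer class; the input path and field layout here are hypothetical, and in practice scripts are more often run with the pig command-line tool.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registered statement is one step in the dataflow.
        pig.registerQuery("logs = LOAD 'input/logs.txt' AS (level:chararray, msg:chararray);");
        pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
        pig.store("errors", "output/errors"); // storing triggers execution
    }
}
```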
Ambari:
A Hadoop management tool for quickly deploying, managing, and monitoring clusters.
Sqoop:
A tool for efficient bulk data transfer between relational databases and HDFS.