http://hadoop.apache.org/
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly available service on top of a cluster of computers, each of which may be prone to
failures.
The project includes these modules:
Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides
high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
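To make the "simple programming models" mentioned above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using word counting as the job. It does not use any Hadoop API: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration, and a real MapReduce job would run these phases in parallel across a cluster rather than in one Python process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop stores data", "hadoop processes data"]))
```

The point of the split is that map_phase and reduce_phase only ever see one record or one key at a time, which is what lets a framework distribute them over many machines.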
Other Hadoop-related projects at Apache include:
Ambari™: A web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive,
HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard
for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig and Hive
applications visually, along with features to diagnose their performance characteristics in a
user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large
tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc
querying.
Mahout™: A scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel
computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple
and expressive programming model that supports a wide range of applications, including
ETL, machine learning, stream processing, and graph computation.
Welcome to Apache™ Hadoop®!
Copyright © 2014 The Apache Software Foundation. All rights reserved.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN,
which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to
process data for both batch and interactive use cases. Tez is being adopted by Hive™,
Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial
software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution
engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
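The Tez entry above describes executing "an arbitrary DAG of tasks". As a rough illustration of that idea only, the sketch below runs a small task graph in dependency order using Python's standard graphlib module (Python 3.9 or later). It is not the Tez API: run_dag, the task names and the shape of the inputs are all hypothetical, and a real engine would schedule independent tasks concurrently on a cluster.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks: maps a task name to a callable that takes a dict of upstream results.
    dependencies: maps a task name to the set of task names it depends on.
    (Both structures are assumptions made for this sketch.)
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](inputs)
    return results


if __name__ == "__main__":
    tasks = {
        "load": lambda inputs: [3, 1, 2],
        "sort": lambda inputs: sorted(inputs["load"]),
        "total": lambda inputs: sum(inputs["load"]),
        "report": lambda inputs: (inputs["sort"], inputs["total"]),
    }
    dependencies = {
        "sort": {"load"},
        "total": {"load"},
        "report": {"sort", "total"},
    }
    print(run_dag(tasks, dependencies)["report"])
```

Unlike a fixed map-then-reduce pipeline, the graph here can have any shape: "sort" and "total" both read from "load", and "report" joins their results.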
What is Apache Hadoop?