I have been using Hadoop for some time now: from initial confusion, through all sorts of experiments, to today combining its products in real applications. Once you get involved in data processing, it is hard to avoid Hadoop. Hadoop's success in the big data field has in turn accelerated its own development, and the Hadoop family has now grown to more than twenty products.
It is worth organizing this knowledge and stringing the products and technologies together. Doing so not only deepens one's impression of them, but also lays the groundwork for future technical directions and technology selection.
Product Introduction:
Apache Hadoop: An open-source distributed computing framework from the Apache Software Foundation. It provides a distributed file system subproject (HDFS) and a software architecture that supports MapReduce distributed computing.
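To make the MapReduce model concrete, here is a minimal word-count sketch against the classic Hadoop Java API. It shows only the map and reduce phases; the class names and tokenization are illustrative, and a real program would also need a job driver.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the 1s emitted for each distinct word.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```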
Apache Hive: A data warehouse tool built on Hadoop. It maps structured data files to database tables and runs simple MapReduce statistics through SQL-like statements, which makes it well suited to statistical analysis of a data warehouse without developing dedicated MapReduce applications.
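As a rough sketch of that SQL-like workflow from Java, the following assumes a HiveServer2 endpoint at localhost:10000 and a hypothetical logs table; Hive compiles the query into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, database, and table are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM logs GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```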
Apache Pig: A large-scale data analysis tool built on Hadoop. It provides a SQL-like language called Pig Latin, whose compiler translates SQL-like data analysis requests into a series of optimized MapReduce operations.
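A small sketch of driving Pig Latin from Java through the PigServer class; the input path, field names, and aliases are hypothetical.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Each registered statement extends the logical plan; Pig compiles
        // the script into optimized MapReduce jobs when results are stored.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("logs = LOAD 'input/logs' AS (user:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");
        pig.store("totals", "output/totals");
    }
}
```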
Apache HBase: A highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large structured storage clusters can be built on inexpensive PC servers.
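A minimal sketch with the HBase Java client API, assuming a users table with an info column family already exists.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Write one cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("conan"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```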
Apache Sqoop: A tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into Hadoop's HDFS, and can also export HDFS data back into a relational database.
Apache ZooKeeper: A distributed, open-source coordination service for distributed applications. It mainly solves data management problems that distributed applications frequently encounter, simplifying the coordination and management of distributed applications and providing high-performance distributed services.
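A minimal sketch of the ZooKeeper Java client used for sharing a piece of configuration, one of those common data management problems; the connection string, paths, and payload are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // block until the session is established

        // Publish a small piece of shared configuration under /app/config.
        zk.create("/app", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/app/config", "db=mysql".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any other client in the cluster can now read the same value.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```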
Apache Mahout: A distributed framework for machine learning and data mining built on Hadoop. Mahout implements a number of data mining algorithms with MapReduce, solving the problem of mining in parallel.
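As an illustration, here is a user-based collaborative filtering recommender built with Mahout's Taste API; the ratings.csv file and its userID,itemID,preference layout are assumptions for this sketch.

```java
import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // Each line of ratings.csv: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1, based on similar users' preferences.
        for (RecommendedItem item : recommender.recommend(1, 3)) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```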
Apache Cassandra: An open-source distributed NoSQL database system. Originally developed by Facebook to store simple-format data, it combines Google BigTable's data model with Amazon Dynamo's fully distributed architecture.
Apache Avro: A data serialization system designed to support data-intensive, high-volume data exchange applications. Avro is a new data serialization format and transfer tool that will gradually replace Hadoop's existing IPC mechanism.
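A small sketch of Avro's generic API (no code generation), writing one record to a container file; the User schema here is made up for illustration.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Schemas are plain JSON, so they can be defined at runtime.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "conan");
        user.put("age", 30);

        // The schema is embedded in the file, so readers need no extra metadata.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}
```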
Apache Ambari: A web-based tool that supports the provisioning, management, and monitoring of Hadoop clusters.
Apache Chukwa: An open-source data collection system for monitoring large distributed systems. It collects various types of data into files suitable for Hadoop processing and stores them in HDFS, where Hadoop can run various MapReduce operations on them.
Apache Hama: A BSP (Bulk Synchronous Parallel) computing framework built on top of HDFS. Hama can be used for large-scale, big-data computations, including graph, matrix, and network algorithms.
Apache Flume: A distributed, reliable, and highly available system for aggregating logs at scale. It can be used for log data collection, log processing, and log transmission.
Apache Giraph: A scalable, distributed, iterative graph processing system built on the Hadoop platform, inspired by BSP (Bulk Synchronous Parallel) and Google's Pregel.
Apache Oozie: A workflow engine server that manages and coordinates tasks running on the Hadoop platform (HDFS, Pig, and MapReduce).
Apache Crunch: A Java library, based on Google's FlumeJava library, for creating MapReduce programs. Like Hive and Pig, Crunch provides libraries of common patterns for tasks such as joining data, performing aggregations, and sorting records.
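A minimal Crunch word count in the spirit of its documentation: the pipeline is expressed as transformations on PCollections, and Crunch plans and runs the underlying MapReduce jobs; the input and output paths are placeholders.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
        PCollection<String> lines = pipeline.readTextFile("input");

        // Split lines into words; parallelDo becomes a map task.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // count() is one of the built-in aggregation patterns.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, "output");
        pipeline.done(); // plan and execute the MapReduce job(s)
    }
}
```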
Apache Whirr: A set of class libraries for running cloud services (including Hadoop), providing a high degree of cloud-provider neutrality. Whirr supports Amazon EC2 and Rackspace services.
Apache Bigtop: A tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.
Apache HCatalog: Table and storage management for Hadoop data. It implements centralized metadata and schema management, spans Hadoop and RDBMSs, and provides relational views through Pig and Hive.
Cloudera Hue: A web-based monitoring and management system that provides web-based operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.
When reprinting, please cite the source:
http://blog.fens.me/hadoop-family-roadmap/