Hadoop Family Learning Roadmap-Reprint

Last Update:2015-04-30 Source: Internet

Author: User

Tags cassandra sqoop

Original address: http://blog.fens.me/hadoop-family-roadmap/

Sep 6, Tags: hadoop, hadoop family, roadmap

This series of Hadoop family articles mainly introduces the Hadoop family of products. Commonly used projects include Hadoop, Hive, Pig, HBase, Sqoop, Mahout, Zookeeper, Avro, Ambari, and Chukwa; newer additions include YARN, HCatalog, Oozie, Cassandra, Hama, Whirr, Flume, Bigtop, Crunch, Hue, and others.

Since 2011, China has entered the era of big data, and the family of software represented by Hadoop has taken over a large share of big data processing. In the open source community and among vendors, nearly all data software is moving closer to Hadoop. Hadoop has gone from a niche, elite field to the standard for big data development. Building on Hadoop's original technology, the Hadoop family of products has emerged and keeps innovating around the concept of "big data".

As IT developers, we have to keep up with the pace, seize the opportunity, and grow together with Hadoop!

About

Zhang Dan (Conan), programmer: Java, R, PHP, JavaScript
Weibo: @Conan_Z
Blog: http://blog.fens.me
Email: [Email protected]

When reprinting, please cite the source:
http://blog.fens.me/hadoop-family-roadmap/

Objective

I have been using Hadoop for some time: from confusion at the beginning, through all kinds of experiments, to combining it with applications today. Gradually, everything I do that involves data processing has become inseparable from Hadoop. Hadoop's success in the big data field has in turn accelerated its own growth. The Hadoop family now includes more than 20 products.

It is worth organizing my own knowledge and tying the products and technologies together. Doing so not only deepens my understanding, but also lays a foundation for future technical direction and technology selection.

This article is the first in the "Hadoop family" series: the Hadoop family learning roadmap.

Contents

Hadoop Family Products
Hadoop Family Learning Roadmap

1. Hadoop Family Products

As of 2013, according to Cloudera, the Hadoop family had reached 20 products!
http://blog.cloudera.com/blog/2013/01/apache-hadoop-in-2013-the-state-of-the-platform/

Below, I divide the 20 products into 2 categories.

The first category is what I have already mastered.
The second category is the to-do list that I plan to keep learning.

One-sentence product descriptions:

Apache Hadoop: An open source distributed computing framework from the Apache Software Foundation. It provides a distributed file system subproject (HDFS) and a software architecture that supports MapReduce distributed computing.
Apache Hive: A Hadoop-based data warehouse tool that maps structured data files to database tables and runs simple MapReduce statistics through SQL-like statements, so no dedicated MapReduce application has to be developed. It is well suited to statistical analysis in a data warehouse.
Apache Pig: A Hadoop-based tool for large-scale data analysis. It provides a SQL-like language called Pig Latin, whose compiler translates SQL-like data analysis requests into a series of optimized MapReduce operations.
Apache HBase: A highly reliable, high-performance, column-oriented, scalable distributed storage system. HBase can be used to build large structured storage clusters on inexpensive PC servers.
Apache Sqoop: A tool for transferring data between Hadoop and relational databases. It can import data from a relational database (MySQL, Oracle, Postgres, etc.) into Hadoop's HDFS, and export HDFS data into a relational database.
Apache Zookeeper: A distributed, open source coordination service designed for distributed applications. It mainly solves data management problems that distributed applications frequently encounter, simplifies their coordination and management, and provides high-performance distributed services.
Apache Mahout: A Hadoop-based distributed framework for machine learning and data mining. Mahout implements a number of data mining algorithms with MapReduce, which addresses the problem of mining in parallel.
Apache Cassandra: An open source distributed NoSQL database system. It was originally developed by Facebook to store data in simple formats, and combines the data model of Google BigTable with the fully distributed architecture of Amazon Dynamo.
Apache Avro: A data serialization system designed to support data-intensive applications that exchange large volumes of data. Avro is a new data serialization format and transfer tool that is intended to gradually replace Hadoop's original IPC mechanism.
Apache Ambari: A web-based tool that supports the provisioning, management, and monitoring of Hadoop clusters.
Apache Chukwa: An open source data collection system for monitoring large distributed systems. It collects all kinds of data into files suitable for Hadoop processing and stores them in HDFS for Hadoop's MapReduce operations.
Apache Hama: An HDFS-based BSP (Bulk Synchronous Parallel) parallel computing framework. Hama can be used for large-scale big data computations, including graph, matrix, and network algorithms.
Apache Flume: A distributed, reliable, highly available system for aggregating large volumes of logs. It can be used for log data collection, log data processing, and log data transfer.
Apache Giraph: A scalable, distributed, iterative graph processing system built on the Hadoop platform, inspired by BSP (Bulk Synchronous Parallel) and Google's Pregel.
Apache Oozie: A workflow engine server that manages and coordinates jobs running on the Hadoop platform (HDFS, Pig, and MapReduce).
Apache Crunch: A Java library, modeled on Google's FlumeJava library, for creating MapReduce programs. Like Hive and Pig, Crunch provides a library of patterns for common tasks such as joining data, performing aggregations, and sorting records.
Apache Whirr: A set of libraries for running cloud services, including Hadoop, that provides a high degree of complementarity. Whirr supports Amazon EC2 and Rackspace services.
Apache Bigtop: A tool for packaging, distributing, and testing Hadoop and its surrounding ecosystem.
Apache HCatalog: Hadoop-based table and storage management that provides central metadata and schema management across Hadoop and RDBMS, and offers a relational view through Pig and Hive.
Cloudera Hue: A web-based monitoring and management system that provides web-based operation and management of HDFS, MapReduce/YARN, HBase, Hive, and Pig.
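
To make the MapReduce model mentioned in several of the descriptions above more concrete, here is a minimal sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are illustrative assumptions, and the shuffle step only imitates in memory what the framework would do across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key (done in memory here;
    a real framework would do this across machines)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, one string per "record".
    sample = ["hadoop hive pig", "hive hbase", "hadoop hadoop"]
    print(word_count(sample))
```

Tools such as Hive and Pig, as described above, generate jobs of this general map/reduce shape from SQL-like statements, so that the user does not have to write the two phases by hand.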

2. Hadoop Family Learning Roadmap

Below I introduce the installation and use of each product separately, and summarize my learning route based on my own experience.

Hadoop

Hadoop Learning Roadmap
YARN Learning Roadmap
Build Hadoop projects with Maven
Hadoop Historical Version Installation
Calling HDFS from Hadoop programs
Massive web log analysis: extracting KPI statistics with Hadoop
Build a movie recommendation system with Hadoop
Create a Hadoop parent virtual machine
Adding Hadoop nodes by cloning virtual machines
R brings statistical lifeblood to Hadoop
RHadoop Practice series, part 1: Hadoop environment setup
Using MapReduce to achieve matrix multiplication
Parallel implementation of PageRank algorithm
PeopleRank: discovering individual value from social networks

Hive

Hive Learning Roadmap
Hive Installation and Usage tips
Testing a 10 GB data import into Hive
R Sword NoSQL series: Hive
Extracting reverse repurchase information from historical data using RHive

Pig

Pig Learning Roadmap

Zookeeper

Zookeeper Learning Roadmap
Installation and use of a Zookeeper pseudo-distributed cluster
Implementing a distributed queue with Zookeeper
Implementing a distributed FIFO queue with Zookeeper
A case study of distributed queue system integration based on Zookeeper

HBase

HBase Learning Roadmap
Installing HBase in Ubuntu
RHadoop Practice series, part 4: rhbase installation and use

Mahout

Mahout Learning Roadmap
Analyzing Mahout's user-based collaborative filtering algorithm (UserCF) with R
RHadoop Practice series, part 3: implementing a MapReduce collaborative filtering algorithm in R
Build Mahout projects with Maven
Mahout recommendation algorithm API
Analyzing the Mahout recommendation engine from its source code
Mahout distributed program development: item-based collaborative filtering (ItemCF)
Mahout distributed program development: K-means clustering
Building a job recommendation engine with Mahout
Building a book recommendation system with Mahout

Sqoop

Sqoop Learning Roadmap

Cassandra

Cassandra Learning Roadmap
Cassandra single-cluster experiment with 2 nodes
R Sword NoSQL series: Cassandra

Still following the pace of innovation (to-do list, updated from time to time):

Avro, Ambari, Chukwa, Hama, Flume, Giraph, Oozie, Crunch, Whirr, Bigtop, Hcatalog, Hue

You are welcome to leave a comment with your suggestions!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.