In the previous article (A Beginner's Guide to Hadoop Learning), we introduced the considerations beginners should keep in mind when starting with Hadoop. This article covers the core knowledge of Hadoop.
Hadoop Core Knowledge Learning:
Hadoop is divided into Hadoop 1.x and Hadoop 2.x, and beyond those there is the wider Hadoop ecosystem. We can only introduce it step by step here; you can't master it all in one bite.
Let's take Hadoop 2.x as our example and go through it in detail.
The core of Hadoop is MapReduce and HDFS.
MapReduce: MapReduce is a threshold many people have to cross, and it is genuinely hard to grasp. Sometimes we can even write a MapReduce program and still not see what is really going on. We all know MapReduce is a programming model, but what does it actually do for us? Why is it that the map function and the reduce function we write can run across many machines? Questions like these tend to confuse beginners.
So we need to understand the following (a minimal code sketch follows this list):
What is MapReduce?
How does MapReduce work?
What is the workflow of MapReduce?
What is the programming model for MapReduce?
What is the shuffle?
What is a partition?
What is a combiner?
How do the three (shuffle, partition, and combiner) relate to each other?
What determines the number of map tasks, and how is it calculated?
What determines the number of reduce tasks, and how is it calculated?
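To make the programming model concrete, below is a minimal WordCount sketch in Java against the Hadoop 2.x MapReduce API. It is an illustration rather than a production job: the class names are our own, the input and output paths come from the command line, and the choice of two reduce tasks is arbitrary.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // The map function: emits (word, 1) for every word in a line of input.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce function: sums the counts for each word after the shuffle.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // the combiner: a local reduce on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The number of reduce tasks is set by the job; the number of map tasks
        // is driven by the input splits (roughly one per HDFS block).
        job.setNumReduceTasks(2);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note how the comments touch two of the questions above: the combiner is just a map-side reduce, and the reduce count is set by the job while the map count follows from the input splits.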
Even once MapReduce is familiar, a practical problem still plagues beginners: a Java foundation alone is not enough, because we also need a working development environment. How do we build one?
We need to learn how to use Eclipse on Windows to connect remotely to a Hadoop cluster and develop programs against it, as in the connectivity sketch below.
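Whatever Eclipse plugin or project setup you use, the essential step is that the client configuration points at the remote cluster. Here is a minimal connectivity check, assuming a NameNode reachable at the hypothetical address namenode.example.com:8020:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS tells the client where the remote NameNode lives;
        // the host and port here are placeholders for your own cluster.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        FileSystem fs = FileSystem.get(conf);
        // List the root directory as a quick check that the connection works.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```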
The reason is that running a MapReduce job always goes hand in hand with HDFS, much as traditional application development is inseparable from a database. You can loosely think of HDFS as playing the role the database plays in traditional programming, but it is not actually a database; the real database here is the Hadoop database, HBase. With that said, let's look at how to learn HDFS.
HDFS: we should learn at least the following (a short read/write sketch follows this list):
What is HDFS, and how is its architecture designed?
An introduction to the HDFS architecture, with its pros and cons?
How does HDFS store data?
How does HDFS read data?
How does HDFS write files?
What is the HDFS replica placement policy?
How do I access HDFS?
How does HDFS replicate data?
How does the NameNode hot standby work?
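As a first taste of the HDFS client API, here is a minimal Java sketch that writes a small file and reads it back through org.apache.hadoop.fs.FileSystem; the path /tmp/hdfs-demo.txt and its contents are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hdfs-demo.txt"); // hypothetical path

        // Write: the client streams data to DataNodes in a pipeline;
        // the NameNode only records metadata (blocks and replica locations).
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for block locations,
        // then pulls the bytes directly from the DataNodes.
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```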
There is much more in the Hadoop ecosystem, but the most commonly used components are Hive and HBase.
Hive is the best entry point for beginners into the big data (Hadoop) industry, because it provides simple SQL-like statements that let learners who cannot yet write a MapReduce program get into big data anyway. So we recommend that you (especially learners starting from zero) center your Hadoop study on Hive, and in particular on becoming proficient with Hive statements; a small JDBC sketch follows. Naturally, students with a database background will find Hive easier to pick up.
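For a sense of how little code a Hive query needs, here is a minimal sketch that runs a SQL-like HiveQL statement through the HiveServer2 JDBC driver; the host, port, user, and the logs table are all hypothetical stand-ins for your own environment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // The HiveServer2 JDBC driver; older setups may need this explicit load.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Host, port, database, and credentials below are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // A plain SQL-like query, no MapReduce code required;
             // Hive compiles it into jobs behind the scenes.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM logs GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
            }
        }
    }
}
```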
HBase is a NoSQL database, and it only shows its real strength when data volumes are very large, on the order of terabytes or petabytes. So students set on joining a large company can dig deeper into HBase, especially table design, rowkey design, performance tuning, and how HBase combines with Hive and Impala; a basic client sketch follows.
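As a first look at the HBase client API, here is a minimal Java sketch that writes and reads a single cell. The table user_profile, the column family info, and the rowkey u10001 are invented for illustration; in real work the rowkey design is exactly where the careful thinking goes, since it determines how data spreads across regions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath (ZooKeeper quorum, etc.).
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) { // hypothetical table

            byte[] rowkey = Bytes.toBytes("u10001"); // hypothetical rowkey

            // Write one cell: column family "info", qualifier "name".
            Put put = new Put(rowkey);
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read the same cell back by rowkey.
            Result result = table.get(new Get(rowkey));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```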
YARN is the distributed cluster resource management framework, and it is also one of the obvious differences between Hadoop 2.x and Hadoop 1.x, so we still need a detailed understanding of YARN's principles, architecture, and components.
As for the other Hadoop ecosystem components, such as the massive-log collection tool Flume, the data import/export tool Sqoop, and the application coordination service ZooKeeper, students can learn their principles and usage through hands-on projects.
Those who want to work in data mining can learn more about Mahout, machine learning, algorithms, and related topics, according to their own career choices and interests; we still suggest that students starting from zero begin with Hive.
Storm is a stream computing framework and Spark is an in-memory computing framework; both differ from the MapReduce computing model, but they play the same role of data processing and analysis. We recommend that beginners dig into Storm and Spark only after they have a solid grasp of MapReduce, and remember not to bite off more than you can chew.
If you want to go deeper and broaden your knowledge, you can choose to learn shell and Python scripting and NoSQL databases such as Redis and MongoDB; if you want to do Hadoop operations, you can also learn monitoring tools such as Ganglia and Nagios.
Finally, we suggest that throughout the learning process you go from simple to complex and combine theory with practice. Since the Hadoop ecosystem has many tools, each with a different focus, remember not to take on too much at once and not to get impatient; only with a solid foundation will follow-up learning be easier, faster, and more efficient.
Reprinted from: http://www.dajiangtai.com/community/17981.do