Hadoop Learning Summary (2) -- Hadoop Introduction


1. Introduction to Hadoop


Hadoop is an open-source distributed computing platform under the Apache Software Foundation. It provides users with a distributed system infrastructure whose low-level details are transparent, and with Hadoop it is possible to organize the computing resources of a large number of inexpensive machines to solve massive data-processing problems that a single machine cannot handle. Hadoop has several advantages:

High reliability: its ability to store and process data bit by bit is trusted by users.

High scalability: clusters can easily be scaled out from a handful of machines to thousands of nodes.

High efficiency: it provides a concurrent, distributed computing framework that processes data very quickly.

High fault tolerance: jobs can still complete automatically even when a small number of nodes fail.

The core components of Hadoop are the MapReduce computing framework and the HDFS file system, both of which are briefly introduced in this article.

2. Description of the MapReduce Framework

"Hadoop Map/Reduce is an easy-to-use software framework with which applications can be written to run on large clusters of thousands of commodity machines and to process terabyte-scale datasets in parallel in a reliable, fault-tolerant manner." [1] Hadoop's MapReduce divides a computation into two stages: the map phase and the reduce phase. Each phase takes key/value pairs as its input and output, and the programmer can choose their types. The data flow through the map and reduce phases is described below.
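To make the two phases concrete, here is a minimal word-count sketch in Java against the standard org.apache.hadoop.mapreduce API. The class names are illustrative, and the input key/value types (a byte offset and a line of text) are what the default TextInputFormat supplies; they are one common choice, not the only one.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: turn each input record (offset, line) into (word, 1) pairs.
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce phase: all values that share a key arrive together and are aggregated.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum)); // emit (word, total)
        }
    }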

So why does the Hadoop distributed framework use the MapReduce computing model? In other words, why not just a map, or just a reduce? Unfortunately, many articles only describe what MapReduce is and what its steps are; few explain why the MapReduce model is used at all. I once came across an interesting introduction to MapReduce and am reposting it here to share: a story about a programmer explaining to his wife what MapReduce is. The story is quite long, so please read it patiently (or skip ahead to my summary below it).
I asked my wife, "Do you really want to figure out what MapReduce is?" She replied firmly, "Yes." So I asked:
Me: How did you prepare the onion chili sauce? (What follows is not an accurate recipe; do not try it at home.)
Wife: I take an onion, chop it up, mix it with salt and water, and then grind the mixture in a mixer grinder. That's how you get onion chili sauce.
Wife: But what does this have to do with MapReduce?
Me: Wait a minute. Let me tell you the complete story so that you can understand MapReduce within 15 minutes.
Wife: Okay, all right.
Me: Now, suppose you want a bottle of mixed chili sauce made from mint, onion, tomato, chili and garlic. What would you do?
Wife: I'll take a pinch of mint leaves, an onion, a tomato, a chili and a clove of garlic, chop them all up, mix them with salt and water, and then grind the mixture in a mixer grinder. That gives you a bottle of mixed chili sauce.
Me: Yes, let's apply the concepts of MapReduce to the recipe. Map and reduce are actually two kinds of operations; let me explain them in detail. Map (mapping): Chopping the onions, tomatoes, chilies and garlic is a map operation that acts on each of these objects. So if you give the map an onion, the map chops the onion. Similarly, you feed the chilies, garlic and tomatoes to the map one by one, and you get all kinds of pieces. So when you chop a vegetable such as an onion, you are performing a map operation. The map operation applies to every vegetable, and for each one it produces one or more pieces; in our case, vegetable pieces. The map operation may also run into a bad onion, in which case you simply throw the bad onion away. So if a bad onion appears, the map operation filters it out and produces no pieces from it.
Reduce (reduction): At this stage, you put all the vegetable pieces into the grinder and grind them, and you get a bottle of chili sauce. This means that to make a bottle of chili sauce, you must grind all the ingredients together. The grinder therefore aggregates the vegetable pieces produced by the map operation.
Wife: So, this is MapReduce?
Me: You can say yes, or you can say no. In fact, this is only part of MapReduce; the real power of MapReduce lies in distributed computing.
Wife: Distributed computing? What is that? Please explain it to me.
Me: No problem.
Me: Suppose you entered a chili sauce contest and your recipe won the Best Chili Sauce award. After the award, your chili sauce recipe became popular, so you want to start selling homemade chili sauce. Suppose you need to produce 10,000 bottles of chili sauce every day. What would you do?
Wife: I will find a supplier who can provide me with a lot of raw materials.
Me: Yes, exactly. But can you finish the production all by yourself? That is, can you chop all the raw materials alone? Can a single grinder meet the demand? And now we also need to supply different kinds of chili sauce, such as onion chili sauce, green pepper chili sauce, tomato chili sauce and so on.
Wife: Of course not. I will hire more workers to chop the vegetables, and I will also need more grinders so that I can produce chili sauce more quickly.
Me: Yes, so now you have to divide up the work: you will need several people to chop vegetables together. Each person handles a full bag of vegetables, and each person effectively performs a simple map operation. Each person keeps taking vegetables out of the bag, processes only one vegetable at a time (that is, chops it up), and continues until the bag is empty. In this way, when all the workers are finished, the work stations (where everyone works) hold onion pieces, tomato pieces, garlic pieces, and so on.
Wife: But how can I make the different kinds of chili sauce?
Me: Now you will see the stage of MapReduce we have not mentioned yet: the stirring stage (in MapReduce terms, the shuffle). MapReduce mixes together all the vegetable pieces the workers have produced, each of which was emitted by a map operation under a key. The shuffling happens automatically, and you can think of the key as the name of an ingredient, such as "onion". So all the pieces with the key "onion" are gathered together and transferred to the grinder that grinds onions, and you get onion chili sauce. In the same way, all the tomato pieces are transferred to the grinder labeled "tomato", and tomato chili sauce is made.
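(Stepping out of the story for a moment: to make the "labeled grinders" idea concrete, here is a minimal sketch of the key-based routing the shuffle performs. It mirrors what Hadoop's default HashPartitioner already does, so the class below is purely illustrative.)

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Routes every occurrence of a key ("onion", "tomato", ...) to the same
    // reduce task, like sending all onion pieces to the grinder labeled "onion".
    class IngredientPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Same formula as Hadoop's default HashPartitioner; shown here only
            // to make the key-based routing of the shuffle explicit.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }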

That is, when we run a large program, we are bound to encounter steps that do not depend on each other (such as chopping the onions and chopping the tomatoes in the example); handling them one after another wastes time. Because such steps can be executed asynchronously in parallel (that is, several people prepare the onions and the tomatoes at the same time), the running time is greatly shortened. This step is called the map phase in MapReduce, and it is the existence of map that makes parallel computing possible. But a program cannot be parallel from start to finish (otherwise we could simply treat it as several independent programs); after certain steps are completed, we need to combine the partial results (the prepared onions and tomatoes in the example). This combining step may not be parallelizable, but it is necessary, and that is why there must also be a reduce phase. With a single reduce task, the overall MapReduce flow is roughly: parallel map tasks process the input splits, their output is shuffled and sorted by key, and the single reduce task merges everything into the final result.
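Under the same assumptions as the earlier sketch (the illustrative WordCountMapper and WordCountReducer classes, with input and output paths supplied on the command line), a driver that runs this whole flow with a single reduce task might look like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);   // parallel map phase
            job.setReducerClass(WordCountReducer.class); // aggregation phase
            job.setNumReduceTasks(1);                    // single reduce task, as described above

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Each input split gets its own map task; because a single reduce task is configured, all map output is shuffled to one reducer, matching the single-reduce flow described above.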
3. Introduction to HDFS

HDFS, the Hadoop Distributed File System, is a file system that stores data across multiple computers. Because the system exists only over a network, it must deal not only with file storage but also with the complications the network introduces. The benefits of HDFS are nevertheless self-evident: because HDFS can store data across many machines, the storage capacity of the system can be expanded almost without limit. More importantly, on the Hadoop platform these distributed files can be processed in parallel, greatly reducing program running time. HDFS also makes few demands on the reliability of individual computers: it can be deployed on ordinary commodity hardware and still provide fault tolerance.

The advantages of HDFS are therefore fault tolerance, scalability, support for very large files, and so on. However, because networks are unreliable and latency is high, there are also scenarios for which HDFS is not appropriate:

Low-latency data access: if you have strict real-time requirements, such as reads within tens of milliseconds, HDFS is not suitable.

Large numbers of small files: HDFS is optimized for storing large files, and file metadata is kept in memory on the metadata node (the NameNode), so storing huge numbers of small files would exhaust the NameNode's memory.

So why does Hadoop use HDFS? A key reason is the concept of the block. Because of distributed storage and file chunking, different computers can store different blocks of a file, which is very convenient for concurrent processing in the map phase: different map tasks can process different blocks in parallel, and the results are then combined by the reduce phase. Following the principle of data locality, blocks and map tasks correspond one to one, and the node that runs a map task is normally the node that stores that task's block.

HDFS uses a master/slave architecture. An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is the master server: it manages the file system namespace and client access to files, performs namespace operations such as opening, closing and renaming files or directories, and is responsible for mapping data blocks to specific DataNodes. The DataNodes manage the data stored on their nodes and handle the actual reading and writing of file blocks.
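To make the NameNode/DataNode division of labor concrete, here is a small, hedged sketch that reads a file from HDFS through Hadoop's Java FileSystem API. The NameNode address and the file path are placeholders; in a real deployment, fs.defaultFS normally comes from core-site.xml rather than being set in code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

            try (FileSystem fs = FileSystem.get(conf)) {
                Path path = new Path("/data/input.txt"); // hypothetical file

                // Metadata (size, replication, block size) is served by the NameNode...
                FileStatus status = fs.getFileStatus(path);
                System.out.println("block size: " + status.getBlockSize());

                // ...while the actual bytes are streamed from the DataNodes
                // that hold the file's blocks.
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }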
4. References
[1] Hadoop Map/Reduce Tutorial: https://hadoop.apache.org/docs/r1.0.4/cn/mapred_tutorial.html
[2] Hadoop: The Definitive Guide, 4th Edition
