Keywords: hadoop, hadoop architecture, hadoop big data
1. A Hadoop deployment is divided into three parts: client machines, master nodes, and slave nodes. The master nodes supervise Hadoop's two key functional modules, HDFS and MapReduce: the JobTracker monitors and schedules the parallel processing of data with MapReduce, while the NameNode monitors and manages HDFS. The slave nodes make up most of the machines in the cluster and are responsible for all data storage and computation. Each slave node acts as a DataNode and also runs daemons that communicate with the master nodes: the TaskTracker daemon reports to the JobTracker, and the DataNode daemon reports to the NameNode.
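To make the client machine's role concrete, here is a minimal, hedged Java sketch of a client connecting to the cluster and listing the root directory of HDFS. The NameNode address hdfs://namenode-host:9000 is a placeholder, and on Hadoop 1.x the configuration property is fs.default.name rather than fs.defaultFS; treat this as an illustration under those assumptions rather than a definitive setup.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; replace with the master node of your cluster.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        // The client asks the NameNode (master) for metadata; the actual block
        // data is served by the DataNodes running on the slave machines.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}
```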
2. The core of Hadoop is HDFS and MapReduce. These two are the underlying foundations, not specific high-level applications that can be used directly.
The design features of HDFS are:
1. Large data files: HDFS is well suited to storing terabyte-scale files or large collections of big data files; for files of only a few gigabytes or less, it offers little benefit.
2. Block-based storage: HDFS splits a complete large file into equal-sized blocks stored on different machines. The benefit is that when a file is read, its blocks can be fetched from multiple hosts at the same time, which is far more efficient than reading from a single host.
3. Streaming data access: write once, read many times (see the sketch after this list). Unlike traditional file systems, HDFS does not support modifying file contents in place; a file is written once and not changed afterward, and changes can only be made by appending content at the end of the file.
4. Inexpensive hardware: HDFS can run on ordinary PCs, which allows companies to support a big data cluster with dozens of cheap machines.
5. Hardware failure: HDFS assumes that any machine may fail. To prevent a failed host from making its block files unreadable, it places copies of each file block on several other hosts; if one host fails, a copy can quickly be fetched from another.
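As a concrete illustration of the write-once, read-many model, here is a minimal sketch using the HDFS Java FileSystem API. The path /tmp/example.txt is hypothetical, the client is assumed to already be configured to reach the cluster, and the final append step only works if append support is enabled on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt");  // hypothetical path

        // Write once: the file is created, written, and closed;
        // its contents are not modified in place afterwards.
        try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
            out.writeUTF("hello hdfs");
        }

        // Read many times: any client can stream the blocks back.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        // The only allowed change is appending at the end of the file
        // (requires append support to be enabled on the cluster).
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeUTF(" more data");
        }
    }
}
```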
The key elements of HDFS:
1) Block: a file is divided into blocks, typically 64 MB each (128 MB by default in newer Hadoop versions).
2) NameNode: stores the directory information, file information, and block information for the entire file system, and is held by a single host; if that host fails, the NameNode fails with it. Starting with Hadoop 2.*, an active-standby mode is supported: if the primary NameNode fails, a standby host is started to run the NameNode.
3) DataNode: distributed across inexpensive machines and used to store the block files. The sketch below shows how these three elements fit together from a client's point of view.
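The following hedged sketch (again with a hypothetical file path) asks the NameNode for a file's metadata and then for the DataNodes that hold each block's replicas.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/example.txt");  // hypothetical path

        // The NameNode holds the metadata: file length, block size, replication factor.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size:  " + status.getBlockSize());
        System.out.println("replication: " + status.getReplication());

        // Each block is mapped to the DataNodes that store its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
    }
}
```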
MapReduce:
Suppose we want to count all the books in a library. You count bookshelf number one and I count bookshelf number two; this is "Map". The more of us there are, the faster the counting goes.
Then we come together and add all of our counts up; this is "Reduce".
In plain terms, MapReduce is a programming model for extracting and analyzing elements from massive source data and finally returning a result set. Storing files distributed across disks is the first step; extracting and analyzing the content we need from that massive data is what MapReduce does.
The basic principle of MapReduce is to split the analysis of big data into smaller pieces, analyze each piece separately, and finally combine and summarize the extracted data to get the result we want. How the data is partitioned and how the Reduce operation is performed can of course be very complicated, but Hadoop already provides the implementation of the data analysis; we only need to write simple code expressing our requirements to obtain the data we want, as in the word count sketch below.
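The book-counting analogy maps directly onto the classic word count example. Below is a minimal, hedged sketch of the map and reduce steps written against Hadoop's org.apache.hadoop.mapreduce API; the class and field names are illustrative, and a driver that submits the job is sketched later in the article.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: each mapper works on its own split of the input and emits
    // (word, 1) for every word it sees -- one person counting one bookshelf.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all counts for the same word are brought together and summed --
    // everyone adding their bookshelf counts into one total.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```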
Typical applications of Hadoop include search, log processing, recommendation systems, data analysis, video and image analysis, data storage, and so on.
3. A Hadoop cluster is mainly composed of the NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.
1) NameNode: records how each file is split into blocks and which DataNode nodes store those blocks.
2) NameNode: saves the running state information of the file system.
3) DataNode: stores the split blocks.
4) Secondary NameNode: helps the NameNode collect state information about the running file system.
5) JobTracker: when a task is submitted to the Hadoop cluster, it is responsible for running the job and for scheduling multiple TaskTrackers.
6) TaskTracker: responsible for executing a single map or reduce task (see the driver sketch below for how a job reaches them).
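To make this division of labor concrete, here is a minimal, hedged sketch of a job driver that wires up the mapper and reducer from the earlier word count sketch and submits them to the cluster. The class names WordCount.TokenizerMapper and WordCount.IntSumReducer refer to that sketch, and /input and /output are placeholder HDFS paths; on a classic Hadoop 1.x cluster the submitted job is accepted by the JobTracker, which schedules its map and reduce tasks onto TaskTrackers (YARN fills this role in Hadoop 2.x and later).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Wire up the mapper and reducer from the earlier sketch.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths for input data and results.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        // Submitting the job hands it to the cluster's scheduler
        // (JobTracker in Hadoop 1.x, YARN in later versions), which
        // farms the individual map and reduce tasks out to the workers.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```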