Hadoop: A First Look at Hadoop

Source: Internet
Author: User

What is Hadoop?

Before doing something, the first step is to know what it is, then why it matters, and finally how to do it. After years of project work, however, many developers fall into the habit of how first, then what, and finally why. This only makes them impatient, and it often leads to technologies being misused in unsuitable scenarios.

The core designs in the Hadoop framework are MapReduce and HDFS. The idea behind MapReduce is widely known from a Google paper; in one sentence, MapReduce is "task decomposition and result aggregation". HDFS is short for Hadoop Distributed File System, which provides the underlying support for distributed computing and storage.

The name MapReduce roughly reveals how it works. It contains two verbs: "map" means breaking a job into multiple tasks, and "reduce" means summarizing the results of those decomposed tasks to obtain the final analysis result. This is not a new idea; you can find traces of it in multithreaded, multi-task designs. Whether in the real world or in programming, a job can often be split into multiple tasks, and the relationships between tasks fall into two types: unrelated tasks, which can be executed in parallel, and interdependent tasks, whose order cannot be reversed and which therefore cannot be processed in parallel. Back in college, a classroom exercise was to analyze the critical path of a job, simply finding the most time-saving way to decompose and execute its tasks.

In a distributed system, a machine cluster can be seen as a hardware resource pool: parallel tasks are split up and submitted to whichever machines are idle, greatly improving computing efficiency. At the same time, the independence of these resources provides the best design guarantee for expanding the computing cluster. (In fact, I have always thought that Hadoop's cartoon icon should be an ant rather than a small elephant: distributed computing is like ants eating an elephant. A group of cheap machines can rival any high-performance computer, and the curve of vertical scaling always loses to the straight line of horizontal scaling.) After the tasks are decomposed and processed, the results must be summarized, which is the job of reduce.
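The decompose-then-summarize idea can be sketched in plain Python. This is a local simulation only, not Hadoop's API; the function names (`map_phase`, `reduce_phase`) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(task_fn, subtasks, workers=4):
    """Run independent subtasks in parallel (the "map" step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task_fn, subtasks))

def reduce_phase(combine_fn, partials, initial):
    """Summarize the partial results (the "reduce" step)."""
    return reduce(combine_fn, partials, initial)

# Example: sum the squares of 1..10 by splitting the range into two chunks.
chunks = [range(1, 6), range(6, 11)]
partials = map_phase(lambda chunk: sum(x * x for x in chunk), chunks)
total = reduce_phase(lambda a, b: a + b, partials, 0)
print(total)  # 385
```

Note that the two chunks here are unrelated tasks in the sense above, which is exactly why the map phase can run them in parallel.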

 

Hadoop solves two problems: massive data storage and massive data analysis.
It provides a reliable, shared storage and analysis system: HDFS (Hadoop Distributed File System) implements storage, and MapReduce implements analysis and processing. These two are the core of Hadoop.

 

Hadoop maximizes memory utilization, disk utilization, and CPU utilization.

HBase: a NoSQL database that maximizes memory usage.

 

HDFS Architecture Design:

NameNode:
1. Maps each file to a set of blocks and maps those blocks to DataNode (DN) nodes; handles cluster configuration management, block management, and replication.
2. Transaction log processing: records file creation and deletion.
3. Because all NameNode (NN) metadata is stored in memory, the memory size of the NN determines the storage capacity of the entire cluster.
4. Data kept in NN memory: the file list, the block list of each file, the block list of each DN, and file attributes: creation time, replication factor, and file permissions (ACLs).
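The metadata kept in NN memory can be pictured with a toy in-memory model. This is purely illustrative (real HDFS holds this in Java INode structures, not Python dicts), and all paths, block IDs, and node names below are made up:

```python
# Toy model of the three metadata maps a NameNode keeps in memory.
namenode_meta = {
    "file_to_blocks": {               # file -> ordered list of block IDs
        "/logs/app.log": ["blk_1", "blk_2"],
    },
    "block_to_datanodes": {           # block ID -> DNs holding a replica
        "blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"],
    },
    "file_attrs": {                   # creation time, replication, permissions
        "/logs/app.log": {"ctime": 1700000000, "replication": 3,
                          "perm": "rw-r--r--"},
    },
}

def locate(path):
    """Resolve a file to the DataNodes serving each of its blocks."""
    return [(blk, namenode_meta["block_to_datanodes"][blk])
            for blk in namenode_meta["file_to_blocks"][path]]

print(locate("/logs/app.log"))
```

Because every lookup like this must come out of RAM, the model also makes point 3 above concrete: the more files and blocks, the more NN memory is consumed.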

DataNode:
Stores data blocks in the local file system, along with block metadata used for CRC verification.
Responds to client requests for blocks and block metadata.
Periodically reports all block information stored on the DN to the NN.
When a client needs to store data, it obtains a list of DN locations for each block from the NN. The client sends the block to the first DN; that DN receives the data and forwards it to the next DN through a pipeline stream. Once a block has been written to all nodes, the client continues with the next block. Each DN sends a heartbeat packet to the NN every 3 seconds; if the NN stops receiving heartbeats even after retrying, it marks the DN as dead.
When the NN detects that a DN node has failed, it selects a new node and re-replicates the lost blocks.
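The heartbeat-based failure detection can be sketched as follows. The 3-second interval comes from the text above; the retry threshold here is an assumed value for the sketch, not the HDFS default:

```python
HEARTBEAT_INTERVAL = 3   # DNs report every 3 seconds (per the text above)
DEAD_AFTER_MISSED = 10   # assumed threshold for this sketch only

class HeartbeatMonitor:
    """Toy NN-side tracker that marks silent DataNodes as dead."""
    def __init__(self):
        self.last_seen = {}          # DN name -> time of last heartbeat

    def heartbeat(self, dn, now):
        self.last_seen[dn] = now

    def dead_nodes(self, now):
        timeout = HEARTBEAT_INTERVAL * DEAD_AFTER_MISSED
        return [dn for dn, t in self.last_seen.items() if now - t > timeout]

mon = HeartbeatMonitor()
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=28)
print(mon.dead_nodes(now=31))   # dn1 silent for 31s > 30s, so it is dead
```

In real HDFS the NN would then schedule re-replication of every block whose replica list included the dead node, using the block-to-DN map it keeps in memory.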
Secondary NameNode:
Secondary NameNode is a confusing name. It is actually a server that assists the NN in processing the fsimage and the transaction log. It copies the fsimage and transaction log from the NN to a temporary directory, merges them to generate a new fsimage, and uploads the new fsimage to the NN; the NN then replaces its fsimage and clears the old transaction log.
Simply put, this is a checkpointing function. Because the NN appends every namespace change to the transaction log, the log would otherwise grow without bound; by regularly merging it into the fsimage, the Secondary NameNode keeps the log small and allows the NN to restart quickly.
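The checkpoint cycle can be modeled in a few lines. This is a deliberately simplified sketch: the namespace is a set of paths, and the edit log holds only create/delete operations:

```python
def apply_edit(namespace, edit):
    """Replay one logged operation onto the namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    return namespace

def checkpoint(fsimage, edit_log):
    """Merge the edit log into the fsimage; return (new fsimage, empty log)."""
    new_image = set(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []

fsimage = {"/a", "/b"}
edits = [("create", "/c"), ("delete", "/a")]
new_image, new_log = checkpoint(fsimage, edits)
print(sorted(new_image))  # ['/b', '/c']
```

After the merge, the NN adopts `new_image` as its fsimage and starts logging into the fresh, empty edit log.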

 

I have drawn a structure diagram of NN and DN based on my understanding.

 

A flow chart captured from the Internet.

 

MapReduce:
Maximizes CPU utilization; analyzes and processes large-scale datasets.

Put simply, the same operation is executed in parallel on multiple processors; each processor handles one part of the work, and the results are then summarized together, which reduces the overall time.
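The classic word-count example shows the full map, shuffle-by-key, reduce flow in miniature. This is a single-process simulation of the model, not Hadoop's Java API:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: summarize one key's values into a final count."""
    return key, sum(values)

lines = ["hadoop stores data", "hadoop analyzes data"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

In a real cluster, each mapper runs on the node holding its input split, the shuffle moves data across the network, and the reducers run in parallel; only the structure of the computation is the same as here.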

