Hadoop: A First Look at Hadoop

Source: Internet
Author: User

What is Hadoop?

Before doing something, the first step is to know what it is, then why it matters, and finally how to do it. After years of project work, however, many developers fall into the habit of how first, then what, and finally why. This only makes them impatient, and it often leads to technologies being misused in unsuitable scenarios.

The core designs in the Hadoop framework are MapReduce and HDFS. The idea behind MapReduce is widely known from a Google paper; in one sentence, MapReduce is "task decomposition and result aggregation". HDFS is short for Hadoop Distributed File System, which provides the underlying support for distributed computing and storage.

The name MapReduce roughly reveals how it works. It contains two verbs: "map" means breaking a job into multiple tasks, and "reduce" means summarizing the results of those decomposed tasks to obtain the final analysis result. This is not a new idea; you can find traces of it in multithreaded, multi-task designs. Whether in the real world or in programming, a job can often be split into multiple tasks, and the relationships between tasks fall into two types: unrelated tasks, which can be executed in parallel, and interdependent tasks, whose order cannot be reversed and which therefore cannot be processed in parallel. Back in college, a classroom exercise was to analyze the critical path of a job, simply finding the most time-saving way to decompose and execute its tasks.

In a distributed system, a machine cluster can be seen as a hardware resource pool: parallel tasks are split up and submitted to whichever machines are idle, greatly improving computing efficiency. At the same time, the independence of these resources provides the best design guarantee for expanding the computing cluster. (In fact, I have always thought that Hadoop's cartoon icon should be an ant rather than a small elephant: distributed computing is like ants eating an elephant. A group of cheap machines can rival any high-performance computer, and the curve of vertical scaling always loses to the straight line of horizontal scaling.) After the tasks are decomposed and processed, the results must be summarized, which is the job of reduce.
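The decompose-then-summarize idea can be sketched in plain Python. This is a local simulation only, not Hadoop's API; the function names (`map_phase`, `reduce_phase`) are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_phase(task_fn, subtasks, workers=4):
    """Run independent subtasks in parallel (the "map" step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(task_fn, subtasks))

def reduce_phase(combine_fn, partials, initial):
    """Summarize the partial results (the "reduce" step)."""
    return reduce(combine_fn, partials, initial)

# Example: sum the squares of 1..10 by splitting the range into two chunks.
chunks = [range(1, 6), range(6, 11)]
partials = map_phase(lambda chunk: sum(x * x for x in chunk), chunks)
total = reduce_phase(lambda a, b: a + b, partials, 0)
print(total)  # 385
```

Note that the two chunks here are unrelated tasks in the sense above, which is exactly why the map phase can run them in parallel.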

 

Hadoop solves two problems: massive data storage and massive data analysis.
It provides a reliable, shared storage and analysis system: HDFS (Hadoop Distributed File System) implements storage, and MapReduce implements analysis and processing. These two are the core of Hadoop.

 

Hadoop maximizes memory utilization, disk utilization, and CPU utilization.

HBase: a NoSQL database that maximizes memory usage.

 

HDFS Architecture Design:

NameNode:
1. Maps each file to a set of blocks and maps those blocks to DataNode (DN) nodes; handles cluster configuration management, block management, and replication.
2. Transaction log processing: records file creation and deletion.
3. Because all NameNode (NN) metadata is stored in memory, the memory size of the NN determines the storage capacity of the entire cluster.
4. Data kept in NN memory: the file list, the block list of each file, the block list of each DN, and file attributes: creation time, replication factor, and file permissions (ACLs).
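The metadata kept in NN memory can be pictured with a toy in-memory model. This is purely illustrative (real HDFS holds this in Java INode structures, not Python dicts), and all paths, block IDs, and node names below are made up:

```python
# Toy model of the three metadata maps a NameNode keeps in memory.
namenode_meta = {
    "file_to_blocks": {               # file -> ordered list of block IDs
        "/logs/app.log": ["blk_1", "blk_2"],
    },
    "block_to_datanodes": {           # block ID -> DNs holding a replica
        "blk_1": ["dn1", "dn2", "dn3"],
        "blk_2": ["dn2", "dn3", "dn4"],
    },
    "file_attrs": {                   # creation time, replication, permissions
        "/logs/app.log": {"ctime": 1700000000, "replication": 3,
                          "perm": "rw-r--r--"},
    },
}

def locate(path):
    """Resolve a file to the DataNodes serving each of its blocks."""
    return [(blk, namenode_meta["block_to_datanodes"][blk])
            for blk in namenode_meta["file_to_blocks"][path]]

print(locate("/logs/app.log"))
```

Because every lookup like this must come out of RAM, the model also makes point 3 above concrete: the more files and blocks, the more NN memory is consumed.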

DataNode:
Stores data blocks in the local file system, along with block metadata used for CRC verification.
Responds to client requests for blocks and block metadata.
Periodically reports all block information stored on the DN to the NN.
When a client needs to store data, it obtains a list of DN locations for each block from the NN. The client sends the block to the first DN; that DN receives the data and forwards it to the next DN through a pipeline stream. Once a block has been written to all nodes, the client continues with the next block. Each DN sends a heartbeat packet to the NN every 3 seconds; if the NN stops receiving heartbeats even after retrying, it marks the DN as dead.
When the NN detects that a DN node has failed, it selects a new node and re-replicates the lost blocks.
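The heartbeat-based failure detection can be sketched as follows. The 3-second interval comes from the text above; the retry threshold here is an assumed value for the sketch, not the HDFS default:

```python
HEARTBEAT_INTERVAL = 3   # DNs report every 3 seconds (per the text above)
DEAD_AFTER_MISSED = 10   # assumed threshold for this sketch only

class HeartbeatMonitor:
    """Toy NN-side tracker that marks silent DataNodes as dead."""
    def __init__(self):
        self.last_seen = {}          # DN name -> time of last heartbeat

    def heartbeat(self, dn, now):
        self.last_seen[dn] = now

    def dead_nodes(self, now):
        timeout = HEARTBEAT_INTERVAL * DEAD_AFTER_MISSED
        return [dn for dn, t in self.last_seen.items() if now - t > timeout]

mon = HeartbeatMonitor()
mon.heartbeat("dn1", now=0)
mon.heartbeat("dn2", now=28)
print(mon.dead_nodes(now=31))   # dn1 silent for 31s > 30s, so it is dead
```

In real HDFS the NN would then schedule re-replication of every block whose replica list included the dead node, using the block-to-DN map it keeps in memory.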
Secondary NameNode:
Secondary NameNode is a confusing name. It is actually a server that assists the NN in processing the fsimage and the transaction log. It copies the fsimage and transaction log from the NN to a temporary directory, merges them to generate a new fsimage, and uploads the new fsimage to the NN; the NN then replaces its fsimage and clears the old transaction log.
Simply put, this is a checkpointing function. Because the NN appends every namespace change to the transaction log, the log would otherwise grow without bound; by regularly merging it into the fsimage, the Secondary NameNode keeps the log small and allows the NN to restart quickly.
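The checkpoint cycle can be modeled in a few lines. This is a deliberately simplified sketch: the namespace is a set of paths, and the edit log holds only create/delete operations:

```python
def apply_edit(namespace, edit):
    """Replay one logged operation onto the namespace."""
    op, path = edit
    if op == "create":
        namespace.add(path)
    elif op == "delete":
        namespace.discard(path)
    return namespace

def checkpoint(fsimage, edit_log):
    """Merge the edit log into the fsimage; return (new fsimage, empty log)."""
    new_image = set(fsimage)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []

fsimage = {"/a", "/b"}
edits = [("create", "/c"), ("delete", "/a")]
new_image, new_log = checkpoint(fsimage, edits)
print(sorted(new_image))  # ['/b', '/c']
```

After the merge, the NN adopts `new_image` as its fsimage and starts logging into the fresh, empty edit log.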

 

I have drawn a structure diagram of NN and DN based on my understanding.

 

A flow chart captured from the Internet.

 

MapReduce:
Maximizes CPU utilization; analyzes and processes large-scale datasets.

Put simply, the same operation is executed in parallel on multiple processors; each processor handles one part of the work, and the results are then summarized together, which reduces the overall time.
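The classic word-count example shows the full map, shuffle-by-key, reduce flow in miniature. This is a single-process simulation of the model, not Hadoop's Java API:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce: summarize one key's values into a final count."""
    return key, sum(values)

lines = ["hadoop stores data", "hadoop analyzes data"]
pairs = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'analyzes': 1}
```

In a real cluster, each mapper runs on the node holding its input split, the shuffle moves data across the network, and the reducers run in parallel; only the structure of the computation is the same as here.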

