Detailed description of Hadoop operating principles



Introduction

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It is based on a paper published by Google describing GFS (the Google File System).

HDFS has several notable features:

① Multiple replicas of each block are kept, with a fault-tolerance mechanism that automatically recovers replicas lost to disk failure or node downtime. By default, three replicas are saved.

② It runs on cheap commodity machines.

③ It is suited to processing large files. HDFS splits a file into blocks, 64 MB per block by default, and stores the block-to-location mapping as key-value pairs in the NameNode's memory. If there are too many small files, this memory burden becomes very heavy.
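To make the small-file point concrete, here is a minimal sketch in plain Java (the 64 MB block size matches the default described above; everything else is illustrative): a file occupies ceil(size / blockSize) blocks, and every block costs the NameNode an in-memory metadata entry.

public class BlockMath {
    // 64 MB, the default block size described above.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    // Number of blocks a file of the given size occupies (ceiling division).
    static long blocksFor(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // One 1 GB file -> 16 blocks -> 16 metadata entries.
        System.out.println(blocksFor(1024L * 1024 * 1024));
        // 10,000 files of 1 KB each -> 10,000 blocks, one per file:
        // far more NameNode memory than the single large file above.
        System.out.println(10_000 * blocksFor(1024));
    }
}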

HDFS follows a Master/Slave structure, with three roles: NameNode, SecondaryNameNode, and DataNode.

NameNode: the Master node, the "boss" of the cluster. It manages the mapping from data blocks to DataNodes, processes client read/write requests, applies the replica placement policy, and manages the HDFS namespace.

SecondaryNameNode: the "younger brother" that shares part of the NameNode's workload. It is a cold backup of the NameNode: it merges fsimage and edits and sends the result back to the NameNode.

DataNode: the Slave node that does the actual work. It stores the data blocks sent by the client and performs the block reads and writes.

Hot backup: B is a hot backup of A; if A breaks, B immediately takes over A's work.

Cold backup: B is a cold backup of A; if A breaks, B cannot take over immediately, but B stores some of A's information, which reduces the loss after A fails.

fsimage: the metadata image file (the directory tree of the file system).

edits: the metadata operation log (a record of modifications to the file system).

What the NameNode holds in memory = fsimage + edits.

By default, once an hour the SecondaryNameNode fetches the fsimage and edits from the NameNode, merges them, and sends the merged result back, reducing the NameNode's workload.
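The hourly merge interval, the replica count, and the block size are all ordinary configuration values. A minimal sketch that reads them through the Hadoop client library (assuming the Hadoop 1.x-era key names this generation of HDFS used):

import org.apache.hadoop.conf.Configuration;

public class CheckpointConf {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // How often the SecondaryNameNode merges fsimage + edits, in seconds.
        System.out.println(conf.getLong("fs.checkpoint.period", 3600));
        // Default number of replicas per block.
        System.out.println(conf.getInt("dfs.replication", 3));
        // Default block size in bytes (64 MB).
        System.out.println(conf.getLong("dfs.block.size", 67108864));
    }
}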

Working Principle

Write operation:

There is a file FileA of between 64 MB and 128 MB (so it occupies two blocks). The Client writes FileA to HDFS.

HDFS uses its default configuration.

The cluster is spread across three racks: Rack1, Rack2, and Rack3.

A. The Client splits FileA into 64 MB blocks, giving two blocks: block1 and block2;

B. The Client sends a write request to the NameNode;

C. The NameNode records the block information and returns the available DataNodes for each block:

Block1: host2, host1, host3

Block2: host7, host8, host4
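Once a file has been written, an application can ask the NameNode for exactly this block-to-DataNode mapping through the standard FileSystem API. A minimal sketch (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/fileA"));
        // Each BlockLocation lists the hosts storing one block's replicas.
        for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(b.getOffset() + " -> " + String.join(", ", b.getHosts()));
        }
    }
}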

Principle:

The NameNode has a rack-awareness (RackAware) feature, which can be enabled through configuration.

If the client is itself a DataNode, the block placement rule is: copy 1 on the same node as the client; copy 2 on a node in a different rack; copy 3 on another node in the same rack as copy 2; further copies on randomly chosen nodes.

If the client is not a DataNode, the rule is: copy 1 on a randomly chosen node; copy 2 on a rack different from copy 1's; copy 3 on another node in the same rack as copy 2; further copies on randomly chosen nodes. A toy sketch of the first case follows.
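The sketch below picks three replica targets from a made-up two-rack topology when the writing client is a DataNode. It only illustrates the rule as stated above; it is not Hadoop's actual BlockPlacementPolicy:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class PlacementSketch {
    // Illustrative topology: rack name -> nodes on that rack.
    static final Map<String, List<String>> RACKS = Map.of(
            "rack1", List.of("host1", "host2", "host3"),
            "rack2", List.of("host4", "host5", "host6"));

    // Choose three replica targets when the writing client is a DataNode.
    static List<String> choose(String clientNode, String clientRack) {
        Random rnd = new Random();
        // Copy 1: on the client's own node.
        String replica1 = clientNode;
        // Copy 2: on a node in a different rack.
        List<String> otherRacks = new ArrayList<>(RACKS.keySet());
        otherRacks.remove(clientRack);
        String rack2 = otherRacks.get(rnd.nextInt(otherRacks.size()));
        List<String> nodes = new ArrayList<>(RACKS.get(rack2));
        String replica2 = nodes.remove(rnd.nextInt(nodes.size()));
        // Copy 3: on another node in the same rack as copy 2.
        String replica3 = nodes.get(rnd.nextInt(nodes.size()));
        return List.of(replica1, replica2, replica3);
    }

    public static void main(String[] args) {
        System.out.println(choose("host1", "rack1"));
    }
}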

D. The client sends block1 to the DataNodes. The sending process is a streaming (pipelined) write.

The streaming write proceeds as follows:

1> The 64 MB block1 is divided into 64 KB packets;

2> The client sends the first packet to host2;

3> After host2 receives the first packet, it forwards it to host1, while the client sends the second packet to host2;

4> host1 receives the first packet, forwards it to host3, and at the same time receives the second packet from host2;

5> And so on down the pipeline, until all of block1 has been sent;

6> host2, host1, and host3 then notify the NameNode, and host2 notifies the Client, that block1 has been received;

7> On receiving host2's notification, the client tells the NameNode that block1 has been written completely, which finishes block1;

8> After block1 is done, the client sends block2 to host7, host8, and host4 in the same way;

9> After block2 is sent, host7, host8, and host4 notify the NameNode, and host7 notifies the Client;

10> The client tells the NameNode that block2 has been written completely. At this point the whole write is finished.
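From application code, all ten steps above happen behind a single output stream: the client library splits the data into blocks, asks the NameNode for targets, and runs the pipeline. A minimal write sketch using the standard FileSystem API (path and contents are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteFileA {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Block splitting, DataNode selection, and replica pipelining
        // all happen inside this stream.
        try (FSDataOutputStream out = fs.create(new Path("/user/fileA"))) {
            out.writeBytes("hello hdfs");
        }
    }
}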

Analyzing this write process, we can see that:

① To write a 1 TB file, we need 3 TB of storage and 3 TB of network traffic.

② During reading and writing, the NameNode and DataNodes communicate through heartbeats to confirm that each DataNode is alive. If a DataNode is found to be dead, the data it held is re-replicated onto other nodes, and reads are served from the surviving replicas.

③ It does not matter if one node goes down, since other nodes hold backup replicas; even if an entire rack goes down, there are still replicas on other racks.
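A toy sketch of the heartbeat bookkeeping described in ②; the timeout and host name are made up, and the real check lives inside the NameNode:

import java.util.HashMap;
import java.util.Map;

public class HeartbeatSketch {
    // Declare a DataNode dead after 10 minutes of silence (assumed value).
    static final long DEAD_AFTER_MS = 10 * 60 * 1000;
    static final Map<String, Long> lastBeat = new HashMap<>();

    // Called whenever a DataNode reports in.
    static void onHeartbeat(String node) {
        lastBeat.put(node, System.currentTimeMillis());
    }

    // NameNode-side sweep: a node silent for too long is declared dead,
    // and its blocks are scheduled for re-replication on other nodes.
    static void sweep() {
        long now = System.currentTimeMillis();
        lastBeat.forEach((node, t) -> {
            if (now - t > DEAD_AFTER_MS) {
                System.out.println(node + " is dead; re-replicate its blocks");
            }
        });
    }

    public static void main(String[] args) {
        onHeartbeat("host2");
        sweep();
    }
}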

Read operation:

The read operation is simpler. The client reads FileA from the DataNodes; FileA consists of block1 and block2.

The read operation process is as follows:

A. The client sends a read request to the NameNode.

B. The NameNode looks up its metadata and returns the block locations for FileA:

Block1: host2, host1, host3

Block2: host7, host8, host4

C. The block locations are returned in order: block1 is read first, then block2. Block1 is read from host2, and block2 from host7;

In the preceding example, the client is located outside the racks. If the client is on a DataNode inside a rack, say host6, then the following rule applies when reading data:

Data on the local rack is read in preference to data on remote racks.
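As with writing, block lookup and replica selection are hidden behind the FileSystem API when reading; the client library contacts the NameNode and then streams from the chosen DataNodes. A minimal read sketch (the path is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFileA {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() returns a stream; the nearest replica of each block
        // is chosen inside the client library.
        try (FSDataInputStream in = fs.open(new Path("/user/fileA"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}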

Commands commonly used in HDFS

1. hadoop fs

hadoop fs -ls /
hadoop fs -lsr
hadoop fs -mkdir /user/hadoop
hadoop fs -put a.txt /user/hadoop/
hadoop fs -get /user/hadoop/a.txt /
hadoop fs -cp src dst
hadoop fs -mv src dst
hadoop fs -cat /user/hadoop/a.txt
hadoop fs -rm /user/hadoop/a.txt
hadoop fs -rmr /user/hadoop/a.txt
hadoop fs -text /user/hadoop/a.txt
hadoop fs -copyFromLocal localsrc dst (similar to hadoop fs -put)
hadoop fs -moveFromLocal localsrc dst (uploads the local file to HDFS, then deletes the local copy)

2. hadoop dfsadmin

hadoop dfsadmin -report
hadoop dfsadmin -safemode enter | leave | get | wait
hadoop dfsadmin -setBalancerBandwidth 1000

3. hadoop fsck (checks HDFS health, reporting missing, corrupt, or under-replicated blocks)

4. start-balancer.sh (rebalances the distribution of blocks across DataNodes)
