Operating principle of HDFS

Brief introduction

HDFS (Hadoop Distributed File System) is Hadoop's distributed filesystem. Its design is based on a paper published by Google describing GFS (the Google File System).

HDFS has many features:

It saves multiple replicas and provides fault tolerance: lost replicas and failed nodes are recovered automatically. Three copies are saved by default.

It runs on cheap commodity machines.

It is suited to processing big data. How big, how small? By default HDFS divides a file into blocks of 64 MB each; each block is stored in HDFS, and the mapping from blocks to their locations is kept in the NameNode's memory. If there are too many small files, each still occupies a block entry, so the memory burden on the NameNode becomes heavy.
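
As a quick illustration, the block-size and replication defaults can be read from a running cluster through the HDFS Java API. This is a minimal sketch, assuming a reachable cluster configured on the classpath; the path /user/hadoop is just an assumed example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDefaults {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path p = new Path("/user/hadoop"); // example path, assumed to exist
            // Default block size (64 MB in older releases, 128 MB in Hadoop 2.x+)
            System.out.println("block size:  " + fs.getDefaultBlockSize(p));
            // Default replication factor (3 unless overridden)
            System.out.println("replication: " + fs.getDefaultReplication(p));

            fs.close();
        }
    }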

HDFS is built on a master/slave architecture, with three roles: NameNode, SecondaryNameNode, and DataNode.

NameNode: the master node, the "big boss". It manages the data block mappings, handles read and write requests from clients, configures the replica policy, and manages the HDFS namespace.

SecondaryNameNode: the "younger brother" that shares the NameNode's workload; it is a cold backup of the NameNode. It merges the fsimage and edits files and sends the result back to the NameNode.

DataNode: a slave node that does the actual work. It stores the data blocks sent by clients and performs read and write operations on those blocks.

Hot backup: B is a hot backup of A; if A fails, B takes over A's work immediately.

Cold backup: B is a cold backup of A; if A fails, B cannot take over A's work immediately, but B stores some of A's information, reducing the loss when A goes down.

fsimage: the metadata image file (the filesystem directory tree).

edits: the metadata operation log (a record of modification operations on the filesystem).

The metadata kept in the NameNode's memory = fsimage + edits.

By default, once every hour the SecondaryNameNode fetches the fsimage and edits from the NameNode, merges them, and sends the result back to the NameNode, reducing the NameNode's workload.
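
The one-hour interval is just the default of a configuration property. A minimal sketch of reading it (in Hadoop 2.x the property is dfs.namenode.checkpoint.period; older 1.x releases called it fs.checkpoint.period):

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointPeriod {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // SecondaryNameNode checkpoint interval in seconds; default 3600 (1 hour)
            long seconds = conf.getLong("dfs.namenode.checkpoint.period", 3600);
            System.out.println("checkpoint every " + seconds + " s");
        }
    }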

Working principle

Write operation:

Suppose there is a file, FileA, 100 MB in size, and a client wants to write FileA to HDFS.

HDFS uses its default configuration (64 MB blocks, three replicas per block).

The HDFS cluster is spread over three racks: Rack1, Rack2, and Rack3.

a. The client splits FileA into 64 MB pieces, producing two blocks: Block1 (64 MB) and Block2 (36 MB).

b. The client sends a write request to the NameNode (the blue dashed line ① in the figure).

c. The NameNode records the block information and returns the available DataNodes (the pink dashed line ②), for example:

Block1: host2, host1, host3

Block2: host7, host8, host4

Principle:

The NameNode has a rack-awareness feature, which can be configured.

If the client is running on a DataNode, the rule is: replica 1 on the same node as the client; replica 2 on a node in a different rack; replica 3 on another node in the same rack as replica 2; any further replicas on randomly chosen nodes.

If the client is not running on a DataNode, the rule is: replica 1 on a randomly chosen node; replica 2 on a node in a different rack from replica 1; replica 3 on another node in the same rack as replica 2; any further replicas on randomly chosen nodes.
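
To make these two rules concrete, here is a small self-contained sketch that simulates them. It is illustrative logic only, not Hadoop's actual BlockPlacementPolicy implementation; the rack and host names are made up:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class PlacementSketch {
        // Made-up topology: rack name -> hosts on that rack
        static final Map<String, List<String>> RACKS = Map.of(
                "rack1", List.of("host1", "host2", "host3"),
                "rack2", List.of("host4", "host5", "host6"),
                "rack3", List.of("host7", "host8", "host9"));

        static String rackOf(String host) {
            return RACKS.entrySet().stream()
                    .filter(e -> e.getValue().contains(host))
                    .map(Map.Entry::getKey).findFirst().orElseThrow();
        }

        // Pick a random unused host; rack == null means any rack,
        // otherwise sameRack selects inside or outside that rack.
        static String pick(Random rnd, String rack, boolean sameRack, List<String> used) {
            List<String> candidates = new ArrayList<>();
            for (var e : RACKS.entrySet())
                if (rack == null || e.getKey().equals(rack) == sameRack)
                    for (String h : e.getValue())
                        if (!used.contains(h)) candidates.add(h);
            return candidates.get(rnd.nextInt(candidates.size()));
        }

        // clientHost == null models a client that is not on a DataNode
        static List<String> place(String clientHost, Random rnd) {
            List<String> replicas = new ArrayList<>();
            // replica 1: the client's own node if it is a DataNode, otherwise random
            replicas.add(clientHost != null ? clientHost : pick(rnd, null, true, replicas));
            // replica 2: a node on a different rack than replica 1
            replicas.add(pick(rnd, rackOf(replicas.get(0)), false, replicas));
            // replica 3: another node on the same rack as replica 2
            replicas.add(pick(rnd, rackOf(replicas.get(1)), true, replicas));
            return replicas;
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            System.out.println("client on host2:    " + place("host2", rnd));
            System.out.println("client off-cluster: " + place(null, rnd));
        }
    }

Hadoop's real placement policy also weighs node load and free space when choosing among candidates; this sketch captures only the rack logic.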

d. The client sends Block1 to the DataNodes; the data is written as a stream.

The streaming write proceeds as follows:

1> Block1 (64 MB) is divided into 64 KB packets.

2> The client sends the first packet to host2.

3> After host2 receives it, host2 forwards the first packet to host1, while the client sends the second packet to host2.

4> host1 receives the first packet, forwards it to host3, and at the same time receives the second packet from host2.

5> And so on (the red solid line in the figure), until Block1 has been sent completely.

6> host2, host1, and host3 report to the NameNode, and host2 reports to the client, that Block1 has been received (the pink solid line).

7> Having received host2's report, the client notifies the NameNode that Block1 is written, which completes Block1 (the thick yellow solid line).

8> After Block1 is sent, the client sends Block2 to host7, host8, and host4 (the blue solid line).

9> After Block2 is sent, host7, host8, and host4 report to the NameNode, and host7 reports to the client (the light green solid line).

10> The client notifies the NameNode that the write is finished (the thick yellow solid line). The write is complete.
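
From the client's point of view, all of this pipelining is invisible. A minimal client-side write sketch using the Hadoop Java API (the destination path is an assumed example); splitting into blocks, packet streaming, and replication all happen beneath the create()/write() calls:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.nio.charset.StandardCharsets;

    public class WriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path dst = new Path("/user/hadoop/fileA"); // assumed example path
            // create() asks the NameNode for target DataNodes; write() streams
            // packets down the DataNode pipeline as in steps 1> to 10> above.
            try (FSDataOutputStream out = fs.create(dst)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            fs.close();
        }
    }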

Analyzing this write process, we can see:

Writing a 1 TB file requires 3 TB of storage and 3 TB of network traffic.

While reading or writing, the NameNode and the DataNodes communicate through heartbeats, which tell the NameNode which DataNodes are alive. If a DataNode is found to be dead, the data it held is re-replicated onto other nodes, and reads are served from the remaining replicas.

Losing a single node does not matter, because other nodes hold backup replicas; losing an entire rack does not matter either, because other racks hold backups too.

Read operation:

The read operation is simpler: the client reads FileA from the DataNodes. FileA consists of Block1 and Block2.

The read flow is:

a. The client sends a read request to the NameNode.

b. The NameNode looks up the metadata and returns the locations of FileA's blocks:

Block1: host2, host1, host3

Block2: host7, host8, host4

c. The blocks are read in sequence: Block1 first, then Block2. Block1 is read from host2, and Block2 from host7.

In the example above the client is outside the racks. If the client is on a DataNode inside a rack, say the client is host6, then the read follows one rule:

prefer the replica on the client's own rack.
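
A matching client-side read sketch: the block-location lookup of step b can be observed directly with getFileBlockLocations, and step c is just sequential reads on the opened stream (the path is an assumed example):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path src = new Path("/user/hadoop/fileA"); // assumed example path

            // Step b made visible: which hosts hold each block of the file
            FileStatus st = fs.getFileStatus(src);
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + loc.getOffset() + " on "
                        + Arrays.toString(loc.getHosts()));
            }

            // Step c: the stream reads the blocks in sequence
            try (FSDataInputStream in = fs.open(src)) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
            fs.close();
        }
    }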

Commands that are commonly used in HDFS

1. hadoop fs

    hadoop fs -ls /
    hadoop fs -lsr
    hadoop fs -mkdir /user/hadoop
    hadoop fs -put a.txt /user/hadoop/
    hadoop fs -get /user/hadoop/a.txt /
    hadoop fs -cp SRC DST
    hadoop fs -mv SRC DST
    hadoop fs -cat /user/hadoop/a.txt
    hadoop fs -rm /user/hadoop/a.txt
    hadoop fs -rmr /user/hadoop/a.txt
    hadoop fs -text /user/hadoop/a.txt
    hadoop fs -copyFromLocal LOCALSRC DST   (same function as hadoop fs -put)
    hadoop fs -moveFromLocal LOCALSRC DST   (uploads a local file to HDFS and deletes the local copy)
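
Many of these shell commands have one-line Java API equivalents. A minimal sketch of -put and -get (the local and HDFS paths are assumed examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutGetExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // equivalent of: hadoop fs -put a.txt /user/hadoop/
            fs.copyFromLocalFile(new Path("a.txt"), new Path("/user/hadoop/a.txt"));
            // equivalent of: hadoop fs -get /user/hadoop/a.txt /tmp/a.txt
            fs.copyToLocalFile(new Path("/user/hadoop/a.txt"), new Path("/tmp/a.txt"));
            fs.close();
        }
    }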

2. hadoop dfsadmin

    hadoop dfsadmin -report
    hadoop dfsadmin -safemode enter | leave | get | wait
    hadoop dfsadmin -setBalancerBandwidth <bandwidth in bytes per second>

3. hadoop fsck

4. start-balancer.sh
