Hadoop Learning Note 7: Distributed File System HDFS -- Datanode Architecture


 

1. Overview

Datanode: provides the storage service for the actual file data.

Block: the most basic storage unit (a concept borrowed from operating systems such as Linux). A file's content is split into pieces of a fixed size, in order, starting from offset 0 of the file; each piece is numbered, and each numbered piece is called a block.

Unlike a Linux file system block, if a file smaller than the block size is uploaded, it occupies only the space of its actual size, not a full block.

 

 

2. The default block size in hdfs-default.xml

<property>
  <name>dfs.block.size</name>
  <value>67108864</value>
  <description>The default block size for new files.</description>
</property>

This shows that the default HDFS block size is 64 MB (67,108,864 bytes). A 256 MB file is therefore divided into 256/64 = 4 blocks. The namenode places these blocks on different datanodes, so all the blocks of one file are not necessarily stored on a single datanode.

In HDFS, if a file is smaller than a data block, it does not occupy an entire block's worth of storage space.
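To see how a file actually maps to blocks, you can run fsck on it; a minimal sketch (the path /demo/big.file is just a placeholder):

hadoop fsck /demo/big.file -files -blocks -locations
# Lists every block (blk_...) of the file together with the datanodes
# holding it, confirming that the blocks of one file can be spread
# across different datanodes.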

 

 

3. Finding where the datanode stores blocks

<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
  <description>Determines where on the local filesystem an DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
</property>

Go to the /usr/local/hadoop/tmp/dfs/data/current directory.
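Inside that directory the blocks are stored as plain files; roughly like this (the block ID and generation stamp are illustrative):

cd /usr/local/hadoop/tmp/dfs/data/current
ls -l
# blk_2925685650502377512            <- the raw block data
# blk_2925685650502377512_1099.meta  <- checksum data for that block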

 

Each blk_XXX file has a companion .meta file, which stores the checksums used to verify the block data.

The Linux command stat (e.g. stat /) can be used to view the block size and other information of a Linux file system.
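For comparison, a quick sketch of stat on a local Linux filesystem (output abbreviated):

stat /
#   Size: 4096   Blocks: 8   IO Block: 4096   directory
# "IO Block" is the local filesystem's block size (typically 4 KB),
# orders of magnitude smaller than an HDFS block.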

 

4. Verify the file size

 

1) Upload hadoop-xxx.tar.gz, which is 61,927,560 bytes in size.
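The commands behind this step look roughly like the following (the block ID is illustrative):

hadoop fs -put hadoop-xxx.tar.gz /
ls -l /usr/local/hadoop/tmp/dfs/data/current
# -rw-r--r-- ... 61927560 ... blk_-123456789   <- one block, exactly the
#                                                 original file size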

 

As you can see, the block file's size is also 61,927,560 bytes (the original file size): the file is smaller than 64 MB, so it occupies a single block of exactly its own size.

 

2) Then upload jdk-xxx.bin and look at the data directory again.

 

This time two new block files appear, and their sizes add up to the size of the original file (the jdk binary is larger than 64 MB, so it is split into two blocks).
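Schematically (the second block's size is hypothetical; only the 64 MB first block is fixed by definition):

ls -l /usr/local/hadoop/tmp/dfs/data/current
# ... 67108864 ... blk_...   <- first block: a full 64 MB
# ... 18273645 ... blk_...   <- second block: the remaining bytes
# first + second = size of the original jdk-xxx.bin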

 

Summary:

When a datanode stores a file, if the file is larger than 64 MB it is split into blocks of 64 MB each (the last block holds the remainder); if it is smaller than 64 MB there is only one block, and the disk space occupied is the actual size of the original file.

 

 

If you manually copy a file into the datanode's data directory, you cannot see it with hadoop fs -ls.

This bypasses the namenode, and it is the namenode that maintains the HDFS directory structure and the metadata recording where each file's blocks are stored.

Every storage block of every file is tracked by the namenode and occupies space in its memory; therefore, the more blocks there are, the greater the pressure on the namenode. For example, three 2 KB files have almost no impact on datanode storage, since each fits comfortably within a single block and occupies only its actual size on disk, but each of them still adds entries to the namenode's memory. With a massive number of small files, the pressure is enormous!
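A back-of-envelope sketch of that pressure, assuming the commonly cited estimate of roughly 150 bytes of namenode heap per file or block object (an approximation, not an exact constant):

# 10 million small files => ~10M file objects + ~10M block objects:
echo $(( 20000000 * 150 / 1024 / 1024 ))   # => 2861  (~2.8 GB of namenode heap)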

On the other hand, a block size that is too large is also bad: reading and writing a single block becomes slow, and retransmitting a block after an error is expensive.

Conversely, the smaller the blocks, the more of them there are, and the greater the memory pressure on the namenode.

Therefore, the block size should be chosen according to the actual workload; 64 MB and 128 MB are common choices.


To change it: copy the dfs.block.size property from hdfs-default.xml into hdfs-site.xml and modify its value (settings in hdfs-site.xml override the defaults).
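A minimal sketch of such an override in hdfs-site.xml (128 MB is chosen purely as an example):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <!-- 134217728 bytes = 128 MB; affects newly created files only -->
</property>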

 

5. Replication: multiple copies (the default is three)

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>
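The replication factor can also be controlled per file with standard hadoop fs commands; a sketch (paths are placeholders):

# Upload a file with a non-default replication factor:
hadoop fs -D dfs.replication=2 -put local.file /demo/local.file
# Change the replication factor of an existing file (-w waits for completion):
hadoop fs -setrep -w 2 /demo/local.file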

 

6. View the HDFS directory structure in a browser

Browser address bar:

http://hadoop:50070 (the namenode's built-in web interface; "hadoop" here is the namenode's hostname)

 

Appendix:

Cluster: the basic unit for reading and writing data on disk in a Windows file system. If the cluster size is 8 KB, the file system reads and writes files in 8 KB units; therefore, a 4 KB file still occupies 8 KB on disk (a waste of space).

