HDFS: Hadoop's Distributed File System Sub-framework

Architecture


HDFS mainly contains the following functional components:
NameNode: stores file metadata and the directory structure of the entire file system.
DataNode: stores the file blocks themselves; blocks are redundantly replicated across DataNodes.
This introduces the concept of a block. Like a local file system, HDFS stores data in blocks, but its blocks are much larger: the default is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later). A file smaller than one block is stored in a single block and does not occupy a full block's worth of disk space.
Note that HDFS is not well suited to storing small files, not because they waste disk space, but because the metadata for every block is held in the NameNode's memory. A large number of small files means a large number of blocks, which heavily consumes NameNode memory.
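As a minimal sketch of working with block sizes through Hadoop's Java API (the path and sizes here are illustrative), a per-file block size can be set at creation time and read back afterwards:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo.txt"); // illustrative path
        // create(path, overwrite, bufferSize, replication, blockSize):
        // give this file a 128 MB block size, overriding the cluster default
        // (dfs.blocksize in Hadoop 2.x+, dfs.block.size in 1.x).
        fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024).close();

        // Report the block size actually recorded for the file's metadata.
        System.out.println(fs.getFileStatus(file).getBlockSize());
    }
}
```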

Functionally, the NameNode provides data location and the DataNodes provide data transfer: a client accessing the file system reads data directly from the DataNodes, not from the NameNode.

From the deployment perspective, the NameNode is the only single point in the system. If it fails, the consequences are severe: the entire file system becomes unavailable. HDFS introduced the Secondary NameNode for this reason, to back up the NameNode's data.

From the perspective of its on-disk directory structure, the NameNode's data consists of two parts:
fsimage encapsulates the file system's metadata (including file access times, permissions, and block information).
edits records the file system's operation log.
When the NameNode starts, it loads fsimage into memory and then replays the records in edits, so that the in-memory data reflects the latest state.
After startup, all accesses to file system metadata are served from memory, not from fsimage; fsimage and edits exist only to persist the metadata. Memory-based storage systems generally adopt this approach.
The advantage is fast metadata reads and updates (they happen directly in memory); the drawback is that when edits grows large, NameNode startup becomes very slow.
To address this, the Secondary NameNode merges fsimage and edits: it copies the data from the NameNode, performs the merge, returns the merged result to the NameNode, and keeps a local backup. This both speeds up NameNode startup and adds redundancy for the NameNode's data.
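The snapshot-plus-log startup described above is a general pattern rather than anything HDFS-specific. The following is a purely conceptual sketch, with hypothetical class and method names that are not Hadoop APIs, of how a NameNode-like service rebuilds its in-memory state:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the fsimage + edits startup sequence.
public class NamespaceStartup {
    // In-memory metadata map: path -> metadata (stands in for fsimage contents).
    private final Map<String, String> namespace = new HashMap<>();

    // Load the last persisted snapshot (the fsimage) into memory.
    void loadImage(Map<String, String> fsimage) {
        namespace.putAll(fsimage);
    }

    // Replay the operation log (the edits) on top of the snapshot, so
    // memory reflects every mutation made since the last checkpoint.
    void replayEdits(List<String[]> edits) {
        for (String[] op : edits) { // op = {action, path, value}
            switch (op[0]) {
                case "create": namespace.put(op[1], op[2]); break;
                case "delete": namespace.remove(op[1]);     break;
            }
        }
    }

    // After startup, all metadata reads are served from memory.
    String lookup(String path) {
        return namespace.get(path);
    }
}
```

The longer the edits list grows between checkpoints, the longer replayEdits takes, which is exactly why merging edits back into the snapshot speeds up startup.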

I/O Operations

HDFS File Reading Process

First, the client connects to the distributed file system and asks the NameNode for the blocks of the file to be accessed and the storage location of each block.
Then, it connects to the identified DataNodes to read the file data.
Note: block locations are loaded into NameNode memory only after Hadoop starts; they are not persisted on the NameNode's local disk.
The NameNode and DataNodes communicate through heartbeats: the NameNode regularly receives reports from the DataNodes, including block location information.
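A minimal read sketch using Hadoop's Java client (the URI and path are illustrative): the FileSystem handle obtains block locations from the NameNode, while the bytes themselves stream from the DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import java.net.URI;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/input.txt"; // illustrative URI
        Configuration conf = new Configuration();
        // FileSystem.get asks the NameNode for metadata; the file's bytes
        // are then streamed directly from the DataNodes holding its blocks.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        try (FSDataInputStream in = fs.open(new Path(uri))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```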

HDFS File Writing Process

First, the client connects to the distributed file system and sends a file creation request to the NameNode.
After the NameNode records the file's metadata, it designates specific DataNodes to receive the data stream. Once written, each block is redundantly replicated, with copies stored on different machine nodes to avoid a single point of failure (SPOF).
When HDFS stores data, each block must be replicated at least once; the default is three copies. If replication is not satisfied, or an exception occurs during replication, the write does not succeed.
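A minimal write sketch along the same lines (URI and replication value illustrative): create() registers the file with the NameNode, which then assigns the DataNodes that receive the data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        String uri = "hdfs://namenode:9000/user/demo/output.txt"; // illustrative URI
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies of each block are kept.
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        // create() contacts the NameNode, which records the file's metadata
        // and selects the DataNodes that will store the replicated blocks.
        try (FSDataOutputStream out = fs.create(new Path(uri))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```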

Scenarios where HDFS is not applicable

1. Low-latency data access
HDFS is designed mainly for large files and is mostly used for offline data analysis. For online applications with strict latency requirements, consider HBase instead.
2. Large numbers of small files
Small files exhaust NameNode memory. SequenceFile or MapFile can serve as a container that packs many small files together (see the sketch after this list).
3. Multi-threaded and random writes
HDFS allows only one writer per file at a time, and writes can only append at the end of the file.
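As a minimal sketch of the container approach from item 2, assuming Hadoop's SequenceFile API with illustrative paths and payloads, many small payloads can be packed into one file so the NameNode tracks a single file's blocks instead of thousands:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path container = new Path("/user/demo/smallfiles.seq"); // illustrative path
        // Each small file becomes one key/value record: the key holds the
        // original file name, the value holds its contents.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(container),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            writer.append(new Text("small-1.txt"), new BytesWritable("a".getBytes()));
            writer.append(new Text("small-2.txt"), new BytesWritable("b".getBytes()));
        }
    }
}
```

The records can later be iterated back out with SequenceFile.Reader, so the container stays usable as input to MapReduce jobs.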
