HDFS storage mechanism (repost)


The storage mechanism of HDFS in Hadoop

HDFS (Hadoop Distributed File System) is the data storage system in Hadoop distributed computing. It was designed for storing very large files and accessing them with a streaming data access pattern. This article first introduces some basic concepts in HDFS, then describes the read and write flows in HDFS, and finally analyzes the advantages and disadvantages of HDFS.

1. Basic Concepts in HDFS

Block: The basic storage unit in HDFS is the data block, 64 MB per block by default. As in an ordinary file system, files in HDFS are stored split into 64 MB blocks. The difference is that in HDFS, a file smaller than a block does not occupy the block's full storage space.

NameNode: The metadata node, which acts as the master. It manages the file system namespace, keeping the metadata for every file and directory in a file system tree. This metadata is persisted on disk as the namespace image (fsimage) and the edit log, both discussed below. The NameNode also records which blocks make up each file and on which DataNodes those blocks reside; this mapping is not persisted on disk but is rebuilt from the DataNodes' block reports when the system starts.

DataNode: The data node, where HDFS actually stores data. Clients and the metadata node (NameNode) can ask a DataNode to write or read data blocks. A DataNode also periodically reports the blocks it stores back to the NameNode.

Secondary NameNode: The secondary metadata node. It is not a hot standby that takes over when the NameNode fails; its main job is to periodically merge the NameNode's namespace image with the edit log so the log file does not grow too large. The merged namespace image is also kept on the Secondary NameNode, so it can be used for recovery if the NameNode fails.

Edit log: The modification log. When a file system client performs a write, the change is first recorded in the edit log; only after the log entry is recorded does the NameNode modify its in-memory data structures. The edit log is synced to the file system before each write operation is reported as successful.

Fsimage: The namespace image, a checkpoint of the in-memory metadata written to the hard disk. When the NameNode fails, the metadata of the latest checkpoint is loaded from fsimage into memory, and the operations in the edit log are then replayed. The Secondary NameNode is what helps the NameNode checkpoint its in-memory metadata to disk.
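As a concrete illustration of blocks and the block-to-DataNode mapping kept by the NameNode, here is a minimal Java sketch using the standard Hadoop FileSystem API; the cluster URI hdfs://namenode:8020 and the path /tmp/example.txt are placeholders, not values from the original article.

    // Minimal sketch: list a file's blocks and the DataNodes that hold them.
    // The NameNode answers this query from its in-memory block map, which is
    // rebuilt from DataNode block reports at startup.
    // "hdfs://namenode:8020" and "/tmp/example.txt" are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;

    public class HdfsBlockInfo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt"));
            System.out.println("Block size: " + status.getBlockSize());

            // One BlockLocation per block; getHosts() lists the DataNodes
            // holding the replicas of that block.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }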

The checkpoint process works as follows (refer to the Hadoop cluster blog):

1) The Secondary NameNode notifies the NameNode to roll a new edit log file; subsequent edits are written to the new file.
2) The Secondary NameNode fetches the fsimage file and the old edit log from the NameNode with HTTP GET.
3) The Secondary NameNode loads the fsimage into memory, replays the operations in the edit log, and generates a new fsimage file.
4) The Secondary NameNode sends the new fsimage back to the NameNode with HTTP POST.
5) The NameNode replaces the old fsimage and old edit log with the new fsimage and the new edit log (created in step 1), then updates the fstime file to record the time of this checkpoint.

In this way, the fsimage on the NameNode always holds the metadata of the latest checkpoint, and the edit log starts over from a small size instead of growing without bound.

2. File read and write flow in HDFS

In HDFS, reading or writing a file is a process of interaction among the client, the NameNode, and the DataNodes. As noted above, the NameNode manages the file system's metadata and the DataNodes store the actual data, so the client contacts the NameNode to obtain a file's metadata, while the actual file data is read from and written to the DataNodes directly.

The process of writing a file:

1) The client calls create() on DistributedFileSystem, which makes an RPC call to the metadata node (NameNode) to create a new file in the file system namespace. The NameNode first checks that the file does not already exist and that the client has permission to create it, then creates the new file.
2) DistributedFileSystem returns a DFSOutputStream, which the client uses to write data.
3) As the client writes, DFSOutputStream splits the data into packets and places them on a data queue. The data queue is read by the DataStreamer, which asks the NameNode to allocate DataNodes to store the blocks (each block is replicated to 3 DataNodes by default). The allocated DataNodes form a pipeline.
4) The DataStreamer writes each packet to the first DataNode in the pipeline; the first DataNode forwards it to the second, and the second forwards it to the third.
5) DFSOutputStream also keeps an ack queue of packets that have been sent, waiting for the DataNodes in the pipeline to acknowledge that the data has been written successfully.

If a DataNode fails during the write:

The pipeline is closed, and the packets in the ack queue are put back at the front of the data queue.
The current block on the DataNodes that were written successfully is given a new identity by the NameNode, so that the partial block on the failed node will be detected as stale and deleted when that node recovers.
The failed DataNode is removed from the pipeline, and the rest of the block's data is written to the remaining two DataNodes in the pipeline.
The NameNode notices that the block is under-replicated and arranges for a further replica to be created later.

When the client finishes writing data, it calls close() on the stream. This flushes all remaining packets to the DataNode pipeline and waits for the ack queue to report success. Finally, the NameNode is notified that the write is complete.
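The write path above can be exercised through the standard Hadoop FileSystem API. The following is a minimal sketch, assuming a placeholder NameNode address hdfs://namenode:8020 and a placeholder path /tmp/example.txt:

    // Minimal sketch of the write path: create() makes the RPC that adds the
    // file to the NameNode's namespace; data written to the stream is packed
    // into packets and pushed through the DataNode pipeline; close() flushes
    // remaining packets, waits for acks, and completes the file.
    // "hdfs://namenode:8020" and "/tmp/example.txt" are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.net.URI;
    import java.nio.charset.StandardCharsets;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
        }
    }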

The process of reading a file:

1) The client opens the file by calling open() on the FileSystem (DistributedFileSystem), which makes an RPC call to the NameNode to obtain the file's block information. For each block, the NameNode returns the addresses of the DataNodes that hold a copy.
2) DistributedFileSystem returns an FSDataInputStream to the client for reading the data.
3) The client calls read() on the stream. DFSInputStream connects to the closest DataNode that holds the first block of the file.
4) Data is streamed from that DataNode to the client. When the end of the block is reached, DFSInputStream closes the connection and connects to the closest DataNode holding the next block of the file.
5) When the client has finished reading, it calls close() on the FSDataInputStream.

If the client encounters an error while communicating with a DataNode, it tries the next DataNode that holds the block. The failed DataNode is remembered and is not contacted again.
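The corresponding read path, again as a minimal sketch against the standard FileSystem API with the same placeholder URI and path:

    // Minimal sketch of the read path: open() makes the RPC that fetches the
    // block locations from the NameNode; the returned stream then reads each
    // block from the closest DataNode that holds it.
    // "hdfs://namenode:8020" and "/tmp/example.txt" are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    import java.net.URI;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }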

3. Advantages and disadvantages of HDFS

Advantages:

1) Ability to handle very large files.

2) Streaming access to data. HDFS handles "write once, read many" workloads very well: once a dataset is generated, it is copied to different storage nodes and then serves a variety of data analysis requests. In most cases an analysis task touches most of the data in the dataset, so for HDFS, reading the whole dataset is more efficient than reading individual records.

3) Can run on clusters of relatively inexpensive commodity machines.

Disadvantages and improvement strategies:

1) Not suitable for low-latency data access: HDFS is designed for analysis of large datasets, primarily for big-data throughput, so latency can be high. Improvement strategy: for applications with low-latency requirements, HBase is a better choice; this higher-level data management project compensates for the deficiency as far as possible, brings a large improvement in response time, and its slogan is "goes real time". Using a cache or a multi-master design can also reduce the data request pressure from clients and lower latency. HDFS itself could be modified internally as well, but that requires trading off high throughput against low latency.

2) Cannot efficiently store large numbers of small files: because the NameNode keeps the file system's metadata in memory, the number of files the file system can hold is limited by the size of the NameNode's memory. Roughly speaking, each file, directory, and block occupies about 150 bytes, so 1 million files each occupying one block amounts to about 2 million objects and needs at least 300 MB of memory. Millions of files are still feasible, but scaling to billions is beyond what current hardware can support. Another problem is that the number of map tasks is determined by the number of splits, so using MapReduce to process a large number of small files generates too many map tasks, and the thread management overhead lengthens the job. For example, when processing 10,000 MB of data, splits of 1 MB produce 10,000 map tasks and a great deal of thread overhead, whereas splits of 100 MB produce only 100 map tasks, each of which does more work, and the thread management overhead drops considerably. Improvement strategies: there are several ways to help HDFS handle small files. One is to archive small files using SequenceFile, MapFile, HAR, and similar mechanisms; the idea is to manage small files by packing them into archives (a minimal SequenceFile sketch follows this list), which is also the principle HBase builds on. With this method, retrieving the contents of an original small file requires knowing its mapping to the archive file. Another is to scale out: since a single Hadoop cluster can manage only a limited number of small files, several Hadoop clusters can be placed behind a virtual server to form one large logical cluster; Google has done something similar. A third is a multi-master design, whose benefit is obvious; the GFS II then under development was also planned as a distributed multi-master design, supporting master failover and a block size reduced to 1 MB, deliberately tuned for handling small files.
Alibaba's DFS is likewise a multi-master design; it separates metadata storage from metadata management, with multiple metadata storage nodes and a single query master node.

3) Does not support multiple writers or arbitrary modification of files: a file in HDFS has only one writer at a time, and writes can only happen at the end of the file, that is, only append operations are supported. Currently HDFS does not support multiple users writing to the same file, nor modifying the file at arbitrary positions.
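As mentioned under point 2), one archiving strategy is to pack many small files into a single SequenceFile keyed by file name, so the NameNode only has to track one large file. The following is a minimal sketch of that idea using Hadoop's SequenceFile API; the local input directory /local/small-files and the HDFS target path are placeholders:

    // Minimal sketch of the SequenceFile archiving idea: many small local
    // files are packed into one SequenceFile, keyed by file name, so the
    // NameNode tracks one large file instead of many small ones.
    // "/local/small-files" and the target HDFS path are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    import java.io.File;
    import java.nio.file.Files;

    public class PackSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path target = new Path("hdfs://namenode:8020/archive/small-files.seq");

            File[] inputs = new File("/local/small-files").listFiles();
            if (inputs == null) {
                return; // nothing to archive
            }

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(target),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // Each small file becomes one record: key = file name, value = bytes.
                for (File f : inputs) {
                    byte[] data = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(data));
                }
            }
        }
    }

Retrieving an original small file then means looking up its record in the archive by key, which is the mapping relationship the article refers to.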
