Hadoop: HDFS Data Storage and Segmentation


This introduction to Hadoop covers HDFS data storage and segmentation. In Hadoop, data storage is handled by HDFS, the storage cornerstone of Hadoop's distributed computing. Hadoop's distributed file system shares many characteristics with other distributed file systems. So what distinguishes HDFS from other file systems? A brief summary of its key features:

There is a single namespace for the entire cluster.

Data consistency. HDFS uses a write-once, read-many model; a file is not visible to clients until it has been successfully created.

Files are split into multiple blocks, each of which is allocated to a data node, and replica blocks are maintained to ensure the safety of the data.

Data storage in Hadoop involves three important HDFS roles: the NameNode, the DataNode, and the client.

The NameNode can be regarded as the manager of the distributed file system. It is mainly responsible for managing the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system's metadata in memory, which mainly includes file information, information about each file's blocks, and the mapping of each block to the DataNodes that hold it.
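To make this metadata concrete, a client can ask the NameNode which DataNodes hold each block of a file. The following is a minimal sketch using the standard Hadoop FileSystem API; the file path is a hypothetical example, not something from this tutorial:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical file path, for illustration only.
            FileStatus status = fs.getFileStatus(new Path("/user/demo/example.txt"));

            // Each BlockLocation pairs one block of the file with the DataNodes
            // holding its replicas, i.e. the mapping the NameNode keeps in memory.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + block.getOffset()
                        + ", length " + block.getLength()
                        + ", hosts " + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }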

The DataNode is the basic unit of file storage. It stores blocks in its local file system, keeps the metadata of each block, and periodically reports all of its blocks to the NameNode. The client is the application that needs to access files in the distributed file system. The read and write procedures in data storage are shown in Figure 1-3.

As Figure 1-3 shows, there are three operations in the data storage process, which illustrate the interactions among the NameNode, the DataNodes, and the client. Based on Figure 1-3, we briefly analyze the basic steps of writing and reading data in HDFS.

The basic process for writing a file to HDFS is as follows (a minimal Java sketch follows the list):

1) The client sends a file write request to the NameNode.

2) The NameNode returns information about the DataNodes it manages to the client, according to the file size and the block configuration.

3) The client divides the file into multiple blocks and writes them to the DataNodes in order, based on the DataNode address information.
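These three steps map directly onto the standard Hadoop Java client API. Below is a minimal sketch; it assumes a configured Hadoop client on the classpath, and the path /user/demo/example.txt is a hypothetical example. The FileSystem object hides the NameNode interaction, and the output stream handles the block-by-block pipeline to the DataNodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            // Load the client-side Hadoop configuration (core-site.xml, hdfs-site.xml).
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical target path, for illustration only.
            Path file = new Path("/user/demo/example.txt");

            // create() contacts the NameNode (step 1); the returned stream
            // writes blocks to the DataNodes the NameNode chose (steps 2-3).
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello HDFS\n".getBytes("UTF-8"));
            }
            fs.close();
        }
    }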

The basic process for reading a file from HDFS is as follows (see the sketch after the list):

1) The client sends a file read request to the NameNode.

2) The NameNode returns the DataNode information for the blocks of the stored file.

3) The client reads the file data from those DataNodes.
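Reading follows the same pattern through the same client API. A minimal sketch, reusing the hypothetical path from the write example; open() obtains the block locations from the NameNode, and the stream pulls the data directly from the DataNodes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path, matching the write example.
            Path file = new Path("/user/demo/example.txt");

            // open() asks the NameNode where the blocks live (steps 1-2),
            // then the client streams the data from the DataNodes (step 3).
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }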

The basic process for replicating file blocks in HDFS is as follows (a sketch follows the list):

1) The NameNode finds that some file blocks do not meet the minimum replica count, or that some DataNodes have failed.

2) The NameNode notifies the DataNodes to replicate the affected blocks to one another.

3) The DataNodes replicate the blocks among themselves.
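Replication is driven by the NameNode, but a client can change the target replica count of a file, which triggers the same DataNode-to-DataNode copying when the count is raised. A minimal sketch using the standard setReplication call, with the same hypothetical path as above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/example.txt");

            // Request three replicas; the NameNode schedules DataNode-to-DataNode
            // copies asynchronously until the target is met.
            fs.setReplication(file, (short) 3);

            // This reads the target replication factor from the NameNode's
            // metadata, not the number of replicas that exist right now.
            short replication = fs.getFileStatus(file).getReplication();
            System.out.println("Target replication factor: " + replication);

            fs.close();
        }
    }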

With the above three processes, we have a basic understanding of how Hadoop uses HDFS to store data. But how is the data in Hadoop split? When HDFS stores file data, it first divides the data into logical blocks; subsequent writes, reads, and replication all use the block as their unit. So how is the data stored on HDFS actually split?

As the HDFS write process shows, the client loads its Hadoop configuration files when it interacts with the NameNode. If the user sets the block size property dfs.block.size, the file is logically divided according to the user-defined size; if not, the cluster's default block size is used. The file is therefore logically sliced when the data is written, and by default MapReduce derives the number of map tasks from the split size and split count, so the default number of maps is fixed at write time. The user can also control how file data is split for a job by setting the mapred.min.split.size parameter on the client when the job is submitted, as the sketch below shows.
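A minimal sketch of both settings, using the older property names this tutorial refers to (newer Hadoop releases call them dfs.blocksize and mapreduce.input.fileinputformat.split.minsize); the 64 MB values are arbitrary examples:

    import org.apache.hadoop.conf.Configuration;

    public class SplitSizeConfigExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // Block size the client uses when it writes a file; this fixes
            // the logical slicing (and thus the default number of map tasks).
            conf.setLong("dfs.block.size", 64L * 1024 * 1024);

            // Minimum input split size, settable at job submission time;
            // raising it reduces the number of map tasks.
            conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);

            System.out.println("dfs.block.size = "
                    + conf.getLong("dfs.block.size", 0));
            System.out.println("mapred.min.split.size = "
                    + conf.getLong("mapred.min.split.size", 0));
        }
    }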
