Introduction to HDFS architecture and its advantages and disadvantages

1 Overview of HDFS architecture and advantages and disadvantages
1.1 Introduction to Architecture


HDFS uses a master/slave architecture. From an end user's perspective it looks like a traditional file system: you can perform CRUD (Create, Read, Update, Delete) operations on files through directory paths. Because of its distributed nature, however, an HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode manages the file system's metadata, while the DataNodes store the actual data. A client accesses the file system by interacting with both: it contacts the NameNode to obtain a file's metadata, while the actual file I/O is performed directly against the DataNodes. The following figure illustrates the overall structure of HDFS.


1.1.1 NameNode

The NameNode can be regarded as the manager of the distributed file system. It is mainly responsible for managing the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system's metadata in memory, which mainly includes file information, the blocks that make up each file, and the DataNodes on which each block is stored. As the master, the NameNode:

  • Manages the HDFS namespace
  • Manages block mapping information
  • Configures the replica policy
  • Handles client read and write requests
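The in-memory metadata described above can be sketched as two simple maps: files to blocks, and blocks to the DataNodes holding their replicas. The class and method names below are hypothetical, chosen for illustration; they are not Hadoop's actual classes.

```python
# Minimal sketch of the metadata a NameNode keeps in memory (hypothetical
# structures, not Hadoop's actual classes): the file namespace, the ordered
# list of blocks per file, and the DataNodes holding each block replica.

class NameNodeMetadata:
    def __init__(self):
        self.file_to_blocks = {}   # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode addresses

    def add_file(self, path, block_ids):
        self.file_to_blocks[path] = list(block_ids)

    def report_block(self, block_id, datanode):
        # DataNodes periodically report the blocks they hold.
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return the (block_id, locations) pairs a client needs to read a file."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_to_blocks[path]]

meta = NameNodeMetadata()
meta.add_file("/logs/app.log", ["blk_1", "blk_2"])
meta.report_block("blk_1", "dn1:50010")
meta.report_block("blk_1", "dn2:50010")
meta.report_block("blk_2", "dn3:50010")
print(meta.locate("/logs/app.log"))
# [('blk_1', ['dn1:50010', 'dn2:50010']), ('blk_2', ['dn3:50010'])]
```

Note that a client never receives file data from the NameNode, only this kind of location metadata; the data itself flows directly between client and DataNodes.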

1.1.2 Secondary Namenode

The Secondary NameNode is not a backup NameNode. It assists the NameNode and shares part of its workload by periodically merging the fsimage and edits files and pushing the result back to the NameNode. In an emergency, it can help recover the NameNode.

1.1.3 DataNode

The DataNode is the basic unit of file storage. It stores blocks in its local file system, keeps the block metadata, and periodically reports all of its blocks to the NameNode. As a slave, the DataNode stores the actual data blocks and performs block reads and writes.

1.1.4 Client

The client splits files into blocks, interacts with the NameNode to obtain file location information, interacts with DataNodes to read or write data, and provides commands to manage and access HDFS.

1.1.5 File Write

1) The client sends a file write request to the NameNode.
2) Based on the file size and block configuration, the NameNode returns information about the DataNodes it has assigned to hold the data.
3) The client divides the file into blocks and writes each block in sequence to the assigned DataNodes.
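Step 3 of the write path can be sketched as follows: the client cuts the file into fixed-size blocks and places each block on one of the DataNodes the NameNode returned. The block size, round-robin placement, and function names here are illustrative assumptions for the example; real HDFS defaults to much larger blocks (128 MB in recent versions) and uses a pipeline with replication.

```python
# Hedged sketch of the client-side write path: split a file into fixed-size
# blocks, then assign each block to a DataNode. Tiny block size chosen so the
# example is readable; real HDFS blocks default to 128 MB.

BLOCK_SIZE = 4  # bytes, for illustration only

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def write_file(data, datanodes):
    """Round-robin each block onto one of the DataNodes chosen by the NameNode."""
    placement = {}
    for i, block in enumerate(split_into_blocks(data)):
        placement[f"blk_{i}"] = (datanodes[i % len(datanodes)], block)
    return placement

layout = write_file(b"hello hdfs", ["dn1", "dn2"])
print(layout)
# {'blk_0': ('dn1', b'hell'), 'blk_1': ('dn2', b'o hd'), 'blk_2': ('dn1', b'fs')}
```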

1.1.6 File Read

1) The client sends a file read request to the NameNode.
2) The NameNode returns the DataNode locations where the file's blocks are stored.
3) The client reads the blocks directly from those DataNodes.
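The read path is the mirror image of the write path: given the block list from the NameNode, the client fetches each block directly from a DataNode and concatenates the results. The names below are illustrative, not Hadoop's actual API; `fetch_block` stands in for a direct client-to-DataNode transfer.

```python
# Sketch of the client-side read path: the NameNode supplies the ordered block
# IDs, the client fetches each block from a DataNode and reassembles the file.

def read_file(block_ids, fetch_block):
    """fetch_block(block_id) stands in for a direct client -> DataNode read."""
    return b"".join(fetch_block(b) for b in block_ids)

# A fake DataNode store for the example.
stored = {"blk_0": b"hell", "blk_1": b"o hd", "blk_2": b"fs"}
data = read_file(["blk_0", "blk_1", "blk_2"], stored.__getitem__)
print(data)  # b'hello hdfs'
```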

In a typical HDFS deployment, the NameNode runs on a dedicated machine and each of the other machines in the cluster runs one DataNode; it is also possible to run a DataNode on the same machine as the NameNode, or to run multiple DataNodes on a single machine. Having only one NameNode per cluster greatly simplifies the system architecture.

1.2 Advantages

1.2.1 Handling very large files


Very large files here means files that are hundreds of megabytes up to terabytes in size. In real-world applications, HDFS is already used to store and manage petabytes of data.

1.2.2 Streaming data access

HDFS is designed around the write-once, read-many access pattern. Once a data set is generated by a data source, it is copied and distributed to different storage nodes, and then serves a variety of data analysis requests. In most cases an analysis task touches a large portion of the data set, so for HDFS it is more efficient to stream through the entire data set than to seek to an individual record.

1.2.3 Running on cheap commodity hardware clusters

Hadoop is designed to have low hardware requirements: it runs on clusters of low-cost commodity hardware rather than expensive high-availability machines. Cheap commodity machines also mean a high probability of node failure in a large cluster, which requires the design of HDFS to take data reliability, safety, and high availability into account.

1.3 Disadvantages

1.3.1 Not suitable for low-latency data access


HDFS is not a good fit for applications that require low-latency access, where users expect a response within a short time. HDFS is designed for the analysis of large data sets and optimizes primarily for high data throughput, which may come at the cost of high latency.

Improvement strategies

For applications with low-latency requirements, HBase is a better choice: a higher-level data management project that compensates for this deficiency as far as possible, has improved performance considerably, and whose slogan is "goes real time". Using caching or a multi-master design can also reduce the data request pressure from clients and lower latency. Modifying HDFS internals means weighing high throughput against low latency; HDFS is not a universal silver bullet.

1.3.2 Cannot efficiently store large numbers of small files

Because the NameNode keeps the file system's metadata in memory, the number of files the system can hold is limited by the size of the NameNode's memory. As a rule of thumb, each file, directory, and block occupies about 150 bytes, so 1 million files, each occupying one block, need at least 300 MB of memory. Millions of files are still feasible today, but scaling to billions is beyond what current hardware can achieve.

Another problem is that the number of map tasks is determined by the number of splits, so using MapReduce to process a large number of small files produces far too many map tasks, and the thread management overhead increases job time. For example, when processing 10,000 MB of data, splits of 1 MB produce 10,000 map tasks and a great deal of thread overhead; splits of 100 MB produce only 100 map tasks, each map task does more work, and the thread management overhead drops considerably.
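The arithmetic in the paragraph above can be spelled out directly. The 150-bytes-per-object figure is the rough rule of thumb cited in the text, not an exact measurement, and the function names are ours.

```python
# The small-files arithmetic from the text: NameNode memory cost (~150 bytes
# per file, directory, or block object) and map task count per split size.

BYTES_PER_OBJECT = 150  # rough per-object cost cited in the text

def namenode_memory_bytes(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects.
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

def num_map_tasks(total_mb, split_mb):
    return total_mb // split_mb

print(namenode_memory_bytes(1_000_000) / 1024 / 1024)  # ~286 MB (the "300 MB" cited)
print(num_map_tasks(10_000, 1))    # 10000 map tasks with 1 MB splits
print(num_map_tasks(10_000, 100))  # 100 map tasks with 100 MB splits
```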

Improvement strategies

There are several ways to make HDFS handle small files better:

  • Archive small files using SequenceFile, MapFile, HAR (Hadoop Archive), and similar formats. The principle is to manage small files by packing them into archives; HBase builds on this idea. To retrieve the content of an original small file, you must know its mapping to the archive file.
  • Scale horizontally. A single Hadoop cluster can manage only a limited number of small files, so place several Hadoop clusters behind a virtual server to form one large cluster. Google has done something similar.
  • Use a multi-master design, whose benefit here is obvious. The GFS II then in development was also planned as a distributed multi-master design with master failover, and its block size was changed to 1 MB, deliberately tuned for small files.
  • Alibaba's DFS design is also multi-master; it separates metadata mapping storage from management, consisting of multiple metadata storage nodes and a single query master node.

1.3.3 Does not support multiple writers or arbitrary file modification

A file in HDFS has only one writer at a time, and writes can only go to the end of the file; that is, only append operations are supported. HDFS currently does not support multiple users writing to the same file, nor modifying it at arbitrary offsets.
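The single-writer, append-only constraint can be modeled in a few lines. This is an illustrative toy, not the real HDFS lease protocol: the "lease" here is just a field recording the one client allowed to write.

```python
# Toy model of HDFS write semantics: at most one writer holds the lease, and
# writes may only extend the end of the file (append-only).

class AppendOnlyFile:
    def __init__(self):
        self.data = bytearray()
        self.writer = None  # at most one lease holder at a time

    def open_for_append(self, client):
        if self.writer is not None:
            raise RuntimeError("file already has a writer")
        self.writer = client

    def append(self, client, chunk):
        if client != self.writer:
            raise RuntimeError("client does not hold the write lease")
        self.data += chunk  # only the end of the file can change

    def close(self, client):
        if client == self.writer:
            self.writer = None

f = AppendOnlyFile()
f.open_for_append("client-A")
f.append("client-A", b"line1\n")
try:
    f.open_for_append("client-B")  # a second concurrent writer is rejected
except RuntimeError as e:
    print(e)  # file already has a writer
```

There is deliberately no `seek` or `overwrite` method: the model, like HDFS, exposes no way to modify existing bytes in place.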

