Back-end Distributed Systems Series: Distributed Storage - HDFS Architecture Analysis

Source: Internet
Author: User

This article uses the Hadoop Distributed File System (HDFS) as an example to expand on the key design points of a distributed storage service architecture.

Architectural goals

Every software framework or service is created to solve a specific problem. Recall the concerns described in the earlier article "Distributed Storage - Overview". A distributed file system is the file-oriented data model within distributed storage, and it must solve the capacity-expansion and fault-tolerance problems that a single-machine file system faces.

HDFS is therefore designed around two architectural goals:

    1. Support very large files and large file datasets
    2. Automatically detect partial hardware failures and recover quickly

Given these goals, and to simplify design and implementation, HDFS assumes a write-once-read-many file access model. This model fits many real-world business scenarios, so the assumption is a reasonable one. The same assumption, however, also limits where HDFS can be applied.
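To make the write-once-read-many assumption concrete, here is a minimal in-memory sketch in Python. The class `WriteOnceFile` and its methods are hypothetical names chosen for illustration and are not part of any HDFS API; the sketch only shows the access pattern, not how HDFS enforces it.

```python
class WriteOnceFile:
    """Hypothetical illustration of a write-once-read-many file."""

    def __init__(self):
        self._chunks = []      # data written before the file is closed
        self._closed = False   # once True, the content can no longer change

    def write(self, data: bytes) -> None:
        if self._closed:
            raise PermissionError("file is closed; content can no longer change")
        self._chunks.append(data)

    def close(self) -> None:
        self._closed = True

    def read(self) -> bytes:
        # Reads are allowed any number of times once the file is closed.
        if not self._closed:
            raise RuntimeError("file is still being written")
        return b"".join(self._chunks)
```

Because the content never changes after `close()`, every replica of the file stays identical without any update protocol, which is the simplification the assumption buys.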

Architecture Overview

Here is the architecture diagram from the official documentation:

As the diagram shows, the HDFS architecture consists of three parts, each with clearly defined responsibilities.

    1. NameNode
    2. DataNode
    3. Client

HDFS thus uses a centralized master architecture, in which the NameNode is the central node of the cluster.

NameNode

The NameNode's primary responsibility is to manage the metadata of the entire file system, which mainly includes:

    • File system namespace
      Like a single-machine file system, HDFS organizes files as a directory tree, called the file system namespace.
    • Replication factor
      The number of replicas of a file, set per file.
    • Mapping of blocks to DataNodes
      The mapping from file blocks to the DataNodes that store them.

In the architecture diagram above, the "Metadata ops" arrow pointing to the NameNode mainly covers creating, deleting, and reading files and setting their replica counts, so no file operation can bypass the NameNode. The NameNode is also responsible for managing DataNodes: new DataNodes joining the cluster, old DataNodes leaving it, balancing the distribution of file blocks across DataNodes, and so on. The NameNode's design and implementation will be analyzed in a separate article.
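The three kinds of metadata and the metadata operations described above can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the NameNode's actual implementation: the names `FileMeta` and `NameNodeMetadata` are hypothetical, the directory tree is flattened into a path-keyed dictionary for brevity, and the default of three replicas is an assumed value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FileMeta:
    replication: int                                    # replication factor, set per file
    block_ids: List[str] = field(default_factory=list)  # ordered blocks of the file


class NameNodeMetadata:
    """Hypothetical in-memory model of the three kinds of metadata."""

    def __init__(self):
        self.namespace: Dict[str, FileMeta] = {}         # path -> file metadata
        self.block_locations: Dict[str, Set[str]] = {}   # block id -> DataNode ids

    def create(self, path: str, replication: int = 3) -> None:
        # Assumed default of 3 replicas; callers may pass another value.
        if path in self.namespace:
            raise FileExistsError(path)
        self.namespace[path] = FileMeta(replication)

    def add_block(self, path: str, block_id: str, datanodes: Set[str]) -> None:
        self.namespace[path].block_ids.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def set_replication(self, path: str, replication: int) -> None:
        self.namespace[path].replication = replication

    def locate(self, path: str) -> List[Set[str]]:
        # A read starts here: which DataNodes hold each block of the file?
        return [self.block_locations[b] for b in self.namespace[path].block_ids]

    def delete(self, path: str) -> None:
        # Removing the file also drops the location entries of its blocks.
        for block_id in self.namespace.pop(path).block_ids:
            self.block_locations.pop(block_id, None)
```

Even in this toy form, every create, read, delete, and replication change has to pass through the one object holding the metadata, which is why no file operation can bypass the NameNode.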

DataNode

The DataNode's responsibilities are as follows:

    • Store file blocks
    • Serve the Client's file read and write requests
    • Perform file block creation, deletion, and replication

In the architecture diagram, a "Block ops" arrow points from the NameNode to the DataNodes, which can give the mistaken impression that the NameNode actively sends commands to DataNodes. In fact, the NameNode never calls a DataNode. Each DataNode periodically sends a heartbeat to the NameNode, and the NameNode carries its instructions back in the heartbeat reply.
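The heartbeat mechanism just described can be sketched as follows. This is a minimal single-process illustration, not the real RPC protocol: the classes `NameNode` and `DataNode`, the methods `schedule`, `heartbeat`, and `send_heartbeat`, and the string commands are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class NameNode:
    """Hypothetical NameNode that never calls a DataNode directly."""

    def __init__(self):
        # Commands wait here until the target DataNode next reports in.
        self._pending: Dict[str, Deque[str]] = defaultdict(deque)

    def schedule(self, datanode_id: str, command: str) -> None:
        self._pending[datanode_id].append(command)

    def heartbeat(self, datanode_id: str) -> List[str]:
        # The reply to a heartbeat carries every queued command for the sender.
        queue = self._pending[datanode_id]
        commands = list(queue)
        queue.clear()
        return commands


class DataNode:
    def __init__(self, datanode_id: str, namenode: NameNode):
        self.datanode_id = datanode_id
        self.namenode = namenode
        self.executed: List[str] = []

    def send_heartbeat(self) -> None:
        # Periodic in a real system; here it is invoked by hand.
        for command in self.namenode.heartbeat(self.datanode_id):
            self.executed.append(command)
```

In this sketch a scheduled command does nothing until the DataNode's next `send_heartbeat()`, so all communication is initiated by the DataNode and the NameNode only ever replies.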

The diagram also explicitly marks Rack 1 and Rack 2, indicating that HDFS is rack-aware when placing the multiple replicas of a file block. We will not go into the details here; the DataNode's design and implementation will be analyzed in a separate article.
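As a rough illustration of what rack awareness means, the sketch below places three replicas across two racks. The article does not spell out the policy, so this follows one commonly documented HDFS placement for a replication factor of three (one replica on the writer's node, two on different nodes of a single remote rack) and should be read as an assumption. The function `place_replicas` and the rack layout are hypothetical, and the choice is deterministic rather than random to keep the example simple.

```python
from typing import Dict, List


def place_replicas(writer: str, racks: Dict[str, List[str]]) -> List[str]:
    """Pick three DataNodes for a block, spread across two racks.

    `racks` maps a rack name to the DataNodes in it.
    """
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
    local_rack = rack_of[writer]

    for rack, nodes in racks.items():
        # 1st replica: the writer's own node.
        # 2nd and 3rd replicas: two different nodes on one remote rack, so
        # losing any single rack cannot destroy every copy of the block.
        if rack != local_rack and len(nodes) >= 2:
            return [writer, nodes[0], nodes[1]]
    raise ValueError("need a remote rack with at least two DataNodes")
```

With `racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}`, a writer on `dn1` gets the placement `["dn1", "dn3", "dn4"]`: the block survives the loss of either rack, while only one of the two copies after the first has to cross between racks.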

Client

Because interacting with HDFS is complex, a Client for specific programming languages is provided to simplify usage. The Client's responsibilities are as follows:

    • Provide a consistent API in the application's programming language, simplifying application programming
    • Improve access performance

The Client can improve performance because it can cache data for reads and buffer writes so that they are sent in batches. We will not go into the details here; the Client's design and implementation will be analyzed in a separate article.
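The write-side buffering mentioned above can be sketched like this. It is a simplified illustration, not the actual HDFS client: the class `BufferedBlockWriter`, the `send` callback standing in for the network transfer to a DataNode, and the 64 KB default packet size are all assumptions made for the example.

```python
from typing import Callable


class BufferedBlockWriter:
    """Hypothetical client-side writer that batches small writes."""

    def __init__(self, send: Callable[[bytes], None], packet_size: int = 64 * 1024):
        self._send = send                # ships one packet to a DataNode
        self._packet_size = packet_size  # assumed packet size in bytes
        self._buffer = bytearray()

    def write(self, data: bytes) -> None:
        self._buffer.extend(data)
        # Ship full packets; keep the remainder buffered.
        while len(self._buffer) >= self._packet_size:
            self._send(bytes(self._buffer[:self._packet_size]))
            del self._buffer[:self._packet_size]

    def close(self) -> None:
        # Flush whatever is left as a final, possibly short, packet.
        if self._buffer:
            self._send(bytes(self._buffer))
            self._buffer.clear()
```

An application that issues many tiny `write()` calls therefore triggers only one network transfer per full packet, which is where the performance gain comes from.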

Summary

I originally intended to cover the whole HDFS architecture in one article, but found while writing that this was not feasible. Distributed storage is among the most complex kinds of distributed systems, and every architectural trade-off deserves careful scrutiny. Once started, this article felt endless, so it gives only an overall pass here and leaves the design and implementation details of each part to dedicated follow-up articles.

References

[1] Hadoop documentation. HDFS Architecture.
[2] Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas. The Hadoop Distributed File System.

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
