3.1 HDFS Architecture

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from them are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX (Portable Operating System Interface) requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now part of the Apache Hadoop core project. The project URL is http://hadoop.apache.org/.

Assumptions and Goals

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system's data. The fact that there are a huge number of components, each with a non-trivial probability of failure, means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general-purpose applications that typically run on general-purpose file systems. HDFS is designed more for batch processing than for interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed by applications targeted for HDFS, and POSIX semantics in a few key areas have been traded to increase data throughput rates.

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file, once created, written, and closed, need not be changed except for appends and truncates. Content can be appended to the end of a file but cannot be updated at arbitrary points. This assumption simplifies data coherency issues and enables high-throughput data access. A MapReduce application or a web crawler application fits this model perfectly.

Moving Computation Is Cheaper than Moving Data

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. Moving the computation minimizes network congestion and increases the overall throughput of the system. It is often better to migrate the computation closer to where the data is located than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.
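
One such interface in the Java FileSystem API lets a client ask the NameNode where a file's blocks physically reside, so that work can be scheduled on or near those hosts. The sketch below is a minimal illustration, assuming a reachable cluster configured via fs.defaultFS; the file path is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();           // reads core-site.xml / hdfs-site.xml
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/data/input/part-00000"); // hypothetical file
                FileStatus status = fs.getFileStatus(file);
                // Ask the NameNode which DataNodes hold each block of the file.
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());
                for (BlockLocation block : blocks) {
                    System.out.printf("offset=%d length=%d hosts=%s%n",
                            block.getOffset(), block.getLength(),
                            String.join(",", block.getHosts()));
                }
            }
        }
    }

Frameworks such as MapReduce use this kind of location information to schedule tasks on the nodes, or at least the racks, that already hold the data.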

Portability across heterogeneous hardware and software platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as the platform of choice for a large set of applications.

NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
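
From a client's perspective, this division of labor is hidden behind the Hadoop FileSystem API: namespace operations go to the NameNode, while block data is streamed to and from DataNodes, yet the application only sees ordinary file calls. A minimal read/write sketch, assuming a running cluster; the path is hypothetical.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsHelloWorld {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();          // picks up fs.defaultFS from core-site.xml
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/hello.txt");  // hypothetical path

                // Write: the client asks the NameNode to create the file,
                // then streams the block data to DataNodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the NameNode supplies block locations; data is read from DataNodes.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }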

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or DataNode software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.

The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.

The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS supports user quotas and access permissions. HDFS does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features in the future.
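
These namespace operations map directly onto methods of the Java FileSystem API. A small sketch, assuming a reachable cluster; the directory and file names are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            try (FileSystem fs = FileSystem.get(new Configuration())) {
                Path dir = new Path("/user/demo/reports");     // hypothetical directory
                Path oldFile = new Path(dir, "draft.txt");
                Path newFile = new Path(dir, "final.txt");

                fs.mkdirs(dir);                                // create a directory
                fs.create(oldFile, true).close();              // create an empty file
                fs.rename(oldFile, newFile);                   // move/rename within the namespace
                fs.delete(newFile, false);                     // remove the file (non-recursive)
                fs.delete(dir, true);                          // remove the directory recursively
            }
        }
    }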

The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.

Data Replication

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file.

All blocks in a file except the last block are the same size. Since support for variable-length blocks was added for append and hsync, users can start a new block without filling the last block up to the configured block size.

An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
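
Both settings are exposed per file through the FileSystem API. The sketch below is illustrative only: the path is hypothetical, and the 128 MB block size and replication factor of 2 are arbitrary example values.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileReplication {
        public static void main(String[] args) throws Exception {
            try (FileSystem fs = FileSystem.get(new Configuration())) {
                Path file = new Path("/user/demo/big.dat");    // hypothetical path

                short replication = 2;                         // example replication factor
                long blockSize = 128L * 1024 * 1024;           // example block size: 128 MB
                int bufferSize = 4096;

                // Specify the replication factor and block size at creation time.
                try (FSDataOutputStream out =
                             fs.create(file, true, bufferSize, replication, blockSize)) {
                    out.writeBytes("payload");
                }

                // The replication factor can also be changed after the file exists.
                fs.setReplication(file, (short) 3);
            }
        }
    }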

The NameNode makes all decisions regarding block replication. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of the rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation of the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation for testing and researching more sophisticated policies.

Large HDFS instances usually run on a cluster of computers spread across many racks. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.

The NameNode determines the rack ID each DataNode belongs to via the process outlined in Hadoop Rack Awareness. A simple but non-optimal policy is to place replicas on unique racks. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. This policy evenly distributes replicas in the cluster, which makes it easy to balance load on component failure. However, it increases the cost of writes because a write needs to transfer blocks to multiple racks.

In the common case, when the replication factor is three, HDFS's placement policy is to put one replica on the local machine if the writer is on a DataNode (otherwise on a random DataNode), another replica on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of node failure, so this policy does not impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file do not distribute evenly across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the other third are evenly distributed across the remaining racks. This policy improves write performance without compromising data reliability or read performance.

If the replication factor is greater than 3, the placement of the 4th and following replicas is determined randomly while keeping the number of replicas per rack below the upper limit, which is basically (replicas - 1) / racks + 2.
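
As a concrete illustration of that upper limit, the sketch below evaluates the formula, read as integer division, for a few combinations of replication factor and rack count. It is an interpretation of the stated formula, not the NameNode's actual placement code.

    public class ReplicaRackLimit {
        // Upper bound on replicas per rack: (replicas - 1) / racks + 2, using integer division.
        static int maxReplicasPerRack(int replicas, int racks) {
            return (replicas - 1) / racks + 2;
        }

        public static void main(String[] args) {
            int[][] cases = { {3, 2}, {5, 3}, {10, 4} };       // {replication factor, racks}
            for (int[] c : cases) {
                System.out.printf("replicas=%d racks=%d -> at most %d per rack%n",
                        c[0], c[1], maxReplicasPerRack(c[0], c[1]));
            }
        }
    }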

Because the NameNode does not allow a DataNode to hold multiple replicas of the same block, the maximum number of replicas that can be created is the total number of DataNodes at that time.

After support for Storage Types and Storage Policies was added to HDFS, the NameNode takes those policies into account for replica placement in addition to rack awareness. The NameNode chooses nodes based on rack awareness first, then checks that the candidate node has the storage required by the policy associated with the file. If the candidate node does not have that storage type, the NameNode looks for another node. If enough nodes to place replicas cannot be found in the first pass, the NameNode looks for nodes having fallback storage types in a second pass.

The current, default replica placement policy described here is a work in progress.

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred to satisfy the read request. If the HDFS cluster spans multiple data centers, a replica that is resident in the local data center is preferred over any remote replica.

Safemode

On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur while the NameNode is in Safemode. The NameNode receives Heartbeat and Blockreport messages from the DataNodes. A Blockreport contains the list of data blocks that a DataNode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that block has checked in with the NameNode. After a configurable percentage of safely replicated data blocks has checked in with the NameNode (plus an additional 30 seconds), the NameNode exits Safemode. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas and replicates these blocks to other DataNodes.

The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses a file in its local host OS file system to store the EditLog. The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode's local file system as well.

The NameNode keeps an image of the entire file system namespace and file Blockmap in memory. When the NameNode starts up, or when a checkpoint is triggered by a configurable threshold, it reads the FsImage and EditLog from disk, applies all the transactions from the EditLog to the in-memory representation of the FsImage, and flushes this new version out to a new FsImage on disk. It can then truncate the old EditLog because its transactions have been applied to the persistent FsImage. This process is called a checkpoint. The purpose of a checkpoint is to ensure that HDFS has a consistent view of the file system metadata by taking a snapshot of that metadata and saving it to the FsImage. Even though it is efficient to read an FsImage, it is not efficient to make incremental edits directly to an FsImage. Instead of modifying the FsImage for each edit, the edits are persisted in the EditLog; during a checkpoint, the changes from the EditLog are applied to the FsImage. A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period, expressed in seconds) or after a given number of file system transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
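
For reference, a minimal hdfs-site.xml fragment setting those two thresholds on the NameNode; the property names come from hdfs-default.xml, and the values shown are illustrative rather than recommendations.

    <!-- hdfs-site.xml (illustrative values) -->
    <configuration>
      <property>
        <name>dfs.namenode.checkpoint.period</name>
        <value>3600</value>    <!-- checkpoint at least once per hour (value in seconds) -->
      </property>
      <property>
        <name>dfs.namenode.checkpoint.txns</name>
        <value>1000000</value> <!-- or after one million uncheckpointed transactions -->
      </property>
    </configuration>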

The DataNode stores HDFS data in files in its local file system. The DataNode has no knowledge of HDFS files; it stores each block of HDFS data in a separate file in its local file system. The DataNode does not create all files in the same directory. Instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories appropriately. It is not optimal to create all local files in the same directory, because the local file system might not efficiently support a huge number of files in a single directory. When a DataNode starts up, it scans its local file system, generates a list of all HDFS data blocks that correspond to these local files, and sends this report to the NameNode. This report is called the Blockreport.

Communication Protocol

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on the NameNode machine and talks the ClientProtocol to the NameNode. The DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both the ClientProtocol and the DataNode Protocol. By design, the NameNode never initiates any RPCs. Instead, it only responds to RPC requests issued by DataNodes or clients.

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode failures, DataNode failures, and network partitions.

Data disk failure, heartbeat, and re-replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of Heartbeat messages. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is no longer available to HDFS. The death of a DataNode may cause the replication factor of some blocks to fall below their specified value. The NameNode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The need for re-replication may arise for many reasons: a DataNode may become unavailable, a replica may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be increased.

The timeout for marking DataNodes dead is conservatively long (over 10 minutes by default) in order to avoid replication storms caused by DataNode state flapping. For performance-sensitive workloads, users can configure a shorter interval to mark DataNodes as stale and avoid reading from and/or writing to stale nodes.
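
The stale-node behavior is governed by a few NameNode settings; a sketch of the relevant hdfs-site.xml entries follows, using the property names from hdfs-default.xml with illustrative values.

    <!-- hdfs-site.xml (illustrative values) -->
    <configuration>
      <property>
        <name>dfs.namenode.stale.datanode.interval</name>
        <value>30000</value> <!-- milliseconds without a Heartbeat before a DataNode is considered stale -->
      </property>
      <property>
        <name>dfs.namenode.avoid.read.stale.datanode</name>
        <value>true</value>  <!-- list stale DataNodes last when returning block locations for reads -->
      </property>
      <property>
        <name>dfs.namenode.avoid.write.stale.datanode</name>
        <value>true</value>  <!-- avoid stale DataNodes when choosing targets for writes -->
      </property>
    </configuration>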

Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if the free space on a DataNode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.
