HDFS Architecture Guide

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: HDFS provides highly reliable file service on low-cost hardware together with high data-access throughput, which makes it suitable for applications that have large datasets. HDFS does not fully comply with the POSIX file system standard, because its target environment differs from the environments POSIX was designed for; relaxing a few POSIX requirements enables streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project and is now a subproject of Apache Hadoop. The project address is http://hadoop.apache.org/hdfs/

Assumptions and objectives

Hardware failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. With so many components, each having a non-trivial probability of failure, some component of HDFS is effectively always failing. Failure detection and rapid recovery are therefore core design goals of HDFS.

Stream Data Access

Applications that run on HDFS need streaming access to their datasets. They are not the general-purpose applications that typically run on ordinary file systems: HDFS is designed for batch processing rather than interactive use, and it emphasizes high throughput of data access over low latency. Several POSIX requirements conflict with these goals, so POSIX semantics have been traded away in a few areas to increase data throughput.

Large data sets

Applications that run on HDFS manipulate large datasets. A typical HDFS file is gigabytes to terabytes in size, so HDFS is tuned to support large files. A single file's blocks can be spread across thousands of nodes in the cluster to provide high aggregate data bandwidth, and a single instance should support tens of millions of files.

Append-writes and file syncs

Most HDFS applications write a file once and read it many times. Beyond this, HDFS provides two advanced features: hflush and append. Hflush makes the last part of a file visible to readers while providing read consistency and data durability; append provides a mechanism for reopening a closed file to add additional data.

For more information about hflush and append, see the Append/Hflush/Read design document.
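
As a rough illustration, the sketch below uses the Hadoop FileSystem Java API to exercise both features in sequence (the path and record contents are invented, and append must be explicitly enabled in some Hadoop versions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushAppendExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/tmp/hflush-demo.log"); // hypothetical path

            // hflush: everything written so far becomes visible to readers
            // of the still-open file.
            FSDataOutputStream out = fs.create(file);
            out.writeBytes("first record\n");
            out.hflush();
            out.close();

            // append: reopen the closed file and add data at the end.
            FSDataOutputStream appendOut = fs.append(file);
            appendOut.writeBytes("appended record\n");
            appendOut.close();
        }
    }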

Moving computation is cheaper than moving data

A computation is much more efficient when it executes near the data it operates on, especially when the dataset is very large. Moving the computation close to the data minimizes network congestion and increases overall system throughput. Following this principle, it is often better to migrate the computation to where the data is stored than to move the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

Portability across heterogeneous hardware and software platforms

HDFS can be easily ported from one platform to another. This makes it attractive as a platform for a large set of applications.

Namenode and datanodes

HDFS has a master/slave architecture. An HDFS cluster contains a single namenode, a master server that manages the file system namespace and regulates clients' access to files. In addition, there are a number of datanodes, which manage the storage attached to the machines they run on (usually one datanode per machine). HDFS exposes a file system namespace through the namenode and allows user data to be stored as HDFS files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of datanodes. The namenode executes file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to datanodes. The datanodes serve read and write requests from clients, and also create, delete, and replicate blocks.

The namenode and datanode are pieces of software that run on ordinary machines, typically under a GNU/Linux operating system. HDFS is implemented in Java, so any machine that supports Java can run the namenode or datanode software. Use of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment dedicates one machine to the namenode; each of the other machines in the cluster runs a single datanode. The architecture itself does not preclude running multiple datanodes on one machine, but this is rare in real deployments.

Having a single namenode in the cluster greatly simplifies the system architecture. The namenode is the arbiter and repository for all HDFS metadata; user data never flows through the namenode.

The file system namespace

HDFS supports a traditional hierarchical file organization. A user or application can create directories and store files inside them. The file system namespace hierarchy is similar to that of most existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS implements user quotas for the number of names and the amount of data in a directory. In addition, HDFS supports symbolic links.

The namenode maintains the file system namespace. Any change to the namespace or its properties is recorded by the namenode. An application can specify the number of replicas of a file that HDFS should maintain; this replication factor is stored in the namenode.
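
For example, assuming the Hadoop FileSystem Java API and a hypothetical file path, an application could set the replication factor of an existing file like this:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask HDFS to keep 3 replicas of this file; the factor is
            // recorded by the namenode and can be changed again later.
            boolean accepted = fs.setReplication(new Path("/foodir/myfile.txt"), (short) 3);
            System.out.println("replication change accepted: " + accepted);
        }
    }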

Data Replication

HDFS reliably stores very large files across the machines of a cluster. Each file is stored as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file; the replication factor can be specified when the file is created and modified later. Files in HDFS have strictly one writer at any time.

The namenode makes the decisions about block replication. It periodically receives a heartbeat and a blockreport from each datanode in the cluster. Receipt of a heartbeat means the datanode is functioning properly; a blockreport contains a list of all blocks on that datanode.

Replica placement: the first baby steps

The placement of replicas is critical to HDFS reliability and performance, and optimized replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs much tuning and experience. The purpose of the rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current policy is only a first step in this direction; the short-term goal is to validate it in real deployments and use that as a foundation to test and research more sophisticated policies.

Large HDFS instances are usually distributed across multiple racks. Two nodes in different racks must communicate through switches between racks; in most cases, the network bandwidth between machines in the same rack is greater than that between machines in different racks.

The namenode determines the rack ID of each datanode via the Hadoop rack awareness process. A simple but non-optimal policy is to place each replica on a distinct rack. This prevents data loss when an entire rack fails, allows reads to use the bandwidth of multiple racks, and spreads the load evenly across the remaining replicas when a node fails. However, it increases the cost of writes, because each write must transfer blocks across multiple racks.

When the replication factor is 3, HDFS's placement policy is to put one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node of that same remote rack. This policy cuts the inter-rack write traffic and thus improves write performance. The chance of a rack failure is far smaller than that of a node failure, so the policy does not materially affect reliability or availability. It does, however, reduce the aggregate network bandwidth used when reading data, since a block is placed on two racks rather than three. Overall, the policy improves write performance without compromising data reliability or read performance.

In addition to the default placement policy described above, HDFS also provides a pluggable interface for block placement. For more information, see BlockPlacementPolicy.
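
To make the default policy concrete, here is a toy sketch of choosing targets for a replication factor of 3. It is not Hadoop's actual BlockPlacementPolicy code; the node-to-rack map is an invented stand-in for the rack awareness process, and the cluster is assumed to span at least two racks:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    class PlacementSketch {
        static List<String> chooseTargets(String writer, Map<String, String> rackOf,
                                          List<String> nodes) {
            List<String> targets = new ArrayList<>();
            targets.add(writer);                 // replica 1: the writer's local node
            String localRack = rackOf.get(writer);
            String remote = null;
            for (String n : nodes) {             // replica 2: a node on a remote rack
                if (!rackOf.get(n).equals(localRack)) { remote = n; break; }
            }
            targets.add(remote);
            for (String n : nodes) {             // replica 3: a different node on the
                if (n.equals(remote)) continue;  // same remote rack
                if (rackOf.get(n).equals(rackOf.get(remote))) { targets.add(n); break; }
            }
            return targets;
        }
    }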

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica closest to the reader. If a replica exists on the same rack as the reader node, that replica is preferred. If the HDFS cluster spans multiple data centers, a replica in the local data center is preferred over any remote replica.

Safemode

On startup, the namenode enters a special state called safemode. Replication of data blocks does not occur while the namenode is in safemode. The namenode receives heartbeat and blockreport messages from the datanodes; a blockreport lists the data blocks a datanode is hosting. Every block has a specified minimum number of replicas, and a block is considered safe once the namenode has confirmed that minimum number of replicas for it. After a configurable percentage of blocks are confirmed safe (plus an additional 30 seconds), the namenode exits safemode. It then determines the list of blocks, if any, that still have fewer than the minimum number of replicas, and replicates those blocks to other datanodes.

The persistence of File System metadata

The HDFS namespace is stored on the namenode. The namenode uses a transaction log called the editlog to persistently record every change to file system metadata. For example, creating a new file in HDFS causes the namenode to insert a record into the editlog; changing a file's replication factor likewise inserts a new record. The namenode keeps the editlog as a file in its local OS file system. The entire file system namespace, including the mapping of files to blocks and the file system properties, is stored in a file called the fsimage, which also resides in the namenode's local file system.

The namenode keeps an image of the entire file system namespace and file-to-block map in memory. This key metadata is very compact: a namenode with 4 GB of RAM can support a huge number of directories and files. When the namenode starts up, it reads the fsimage and editlog from its local disk, applies all the transactions from the editlog to the in-memory representation of the fsimage, and flushes this new version out to disk. It can then truncate the old editlog, because its transactions have been permanently stored in the on-disk fsimage. This process is called a checkpoint. The checkpoint node is a separate daemon, independent of the namenode, that periodically creates checkpoints from the fsimage and editlog and uploads them to the namenode. The backup node is similar to the checkpoint node, but in addition to creating checkpoints it maintains an up-to-date copy of the namespace in its own memory.
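
The toy sketch below shows the essence of a checkpoint; the record format is invented and the real namenode code is far more involved. Editlog transactions are replayed onto the in-memory image, the merged image is persisted, and only then is the old log truncated:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class CheckpointSketch {
        // Toy in-memory "fsimage": path -> replication factor.
        private final Map<String, Short> image = new HashMap<>();

        // Apply one editlog record, e.g. {"CREATE", "/a", "3"} or {"DELETE", "/a"}.
        private void apply(String[] edit) {
            if (edit[0].equals("DELETE")) {
                image.remove(edit[1]);
            } else { // CREATE or SET_REPLICATION
                image.put(edit[1], Short.parseShort(edit[2]));
            }
        }

        void checkpoint(List<String[]> editLog) {
            for (String[] edit : editLog) apply(edit); // merge edits into the image
            persistToDisk(image);                      // write out the new fsimage
            editLog.clear();                           // old editlog can be truncated
        }

        private void persistToDisk(Map<String, Short> img) { /* write to local disk */ }
    }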

A datanode stores HDFS data in files in its local file system; it has no knowledge of HDFS file metadata. Each HDFS data block is stored in a separate local file. The datanode does not create all files in the same directory; instead, it uses a heuristic to determine the optimal number of files per directory and creates subdirectories accordingly, because the local file system might not efficiently support a huge number of files in a single directory. When a datanode starts up, it scans its local file system, generates a list of the HDFS data blocks that correspond to its local files, and sends this report to the namenode: this is the blockreport.

The communication protocols

All HDFS communication protocols are layered on top of TCP/IP. A client connects to a configurable TCP port on the namenode machine and speaks the client protocol with the namenode; the datanodes talk to the namenode using the datanode protocol. A remote procedure call (RPC) abstraction layer wraps both protocols. By design, the namenode never initiates an RPC; it only responds to RPC requests issued by datanodes or clients.
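
As a hypothetical, much-simplified picture of what the two protocols carry (the real Hadoop interfaces differ in names and signatures):

    // Invented, simplified signatures for illustration only. By design the
    // namenode merely implements these; it never initiates calls of its own.
    interface ClientProtocolSketch {
        String[] getBlockLocations(String path, long offset, long length);
        void create(String path, short replication, long blockSize);
        boolean rename(String src, String dst);
    }

    interface DatanodeProtocolSketch {
        void sendHeartbeat(String datanodeId, long capacity, long remaining);
        void blockReport(String datanodeId, long[] blockIds);
    }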

Robustness

A major objective of the HDFS design is to store data reliably even in the presence of failures. The three common types of failure are namenode failures, datanode failures, and network failures.

Data disk failure, heartbeats and re-replication

Each datanode periodically sends a heartbeat message to the namenode. A network problem can cause a subset of datanodes to lose connectivity with the namenode, which the namenode detects through the absence of heartbeats. The namenode marks datanodes without recent heartbeats as dead and stops forwarding any new I/O requests to them. As a result, the replication of some blocks may fall below their specified value. The namenode constantly tracks which blocks need to be re-replicated and initiates replication whenever necessary. Re-replication can be triggered for several reasons: a datanode becomes unavailable, a replica becomes corrupted, a disk on a datanode fails, or the replication factor of a file is increased.
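
The namenode's dead-node bookkeeping can be pictured with the toy sketch below; the 10-minute threshold is an assumption for illustration (the real timeout is configurable):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class HeartbeatMonitor {
        private static final long DEAD_AFTER_MS = 10 * 60 * 1000; // assumed threshold
        private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

        void onHeartbeat(String datanodeId) {   // called when a heartbeat arrives
            lastSeen.put(datanodeId, System.currentTimeMillis());
        }

        boolean isDead(String datanodeId) {     // no recent heartbeat => mark dead
            Long t = lastSeen.get(datanodeId);
            return t == null || System.currentTimeMillis() - t > DEAD_AFTER_MS;
        }
    }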

Cluster rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one datanode to another when the free space on a datanode falls below a given threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of rebalancing schemes are not yet implemented.

Data Integrity

A block of data fetched from a datanode may arrive corrupted, whether because of storage device faults, network errors, or buggy software. The HDFS client software performs checksum verification on HDFS file contents. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a hidden file in the same HDFS namespace. When a client retrieves file contents, it verifies that the data received from each datanode matches the checksum stored in the associated checksum file. If not, the client retrieves that block from another datanode that holds a replica.
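
Conceptually, the client-side verification works like the sketch below. For simplicity it checksums a whole block with CRC32; real HDFS checksums much smaller chunks of each block:

    import java.util.zip.CRC32;

    class ChecksumSketch {
        static long checksum(byte[] blockData) {
            CRC32 crc = new CRC32();
            crc.update(blockData);
            return crc.getValue();
        }

        // True if the block fetched from a datanode matches the stored checksum;
        // on false, the client would fetch the block from another replica.
        static boolean verify(byte[] fetchedBlock, long storedChecksum) {
            return checksum(fetchedBlock) == storedChecksum;
        }
    }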

Metadata disk failure

The fsimage and editlog are the core data structures of HDFS; if these files are corrupted, HDFS cannot work normally. For this reason, the namenode can be configured to maintain multiple copies of the fsimage and editlog. Any update to either file causes all copies to be updated synchronously. This synchronous updating may reduce the rate of namespace transactions the namenode can support, but the degradation is acceptable: even though HDFS applications are data-intensive, they are not metadata-intensive. When a namenode restarts, it selects the latest consistent fsimage and editlog to use.

The namenode is a single point of failure (SPOF) for an HDFS cluster. If the namenode machine fails, manual intervention is necessary; the current software does not support automatic restart and failover of the namenode to another machine.

Snapshots

A snapshot is a copy of the data at a particular instant of time. One use of the snapshot feature is to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots, but they will be supported in a future release.

Data Organization

Data blocks

HDFS is designed to support very large files. The applications that are compatible with HDFS are those that process large datasets; they write their data once but read it many times, at streaming speeds. HDFS supports write-once-read-many semantics for such files. A typical HDFS block size is 64 MB, so an HDFS file is chopped up into 64 MB chunks; if possible, each chunk resides on a different datanode.
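
For example, with the 64 MB default, a 200 MB file occupies three full blocks plus one 8 MB block; the block count is a simple ceiling division:

    long blockSize = 64L * 1024 * 1024;                      // 64 MB
    long fileSize  = 200L * 1024 * 1024;                     // a 200 MB file
    long numBlocks = (fileSize + blockSize - 1) / blockSize; // = 4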

Replication pipelining

Suppose a client is writing data to an HDFS file with a replication factor of 3. The namenode uses a replication-target-choosing algorithm to produce a list of the datanodes that will hold the replicas of the block. The client then writes to the first datanode. That datanode receives the data in small portions (4 KB), writes each portion to its local storage, and transfers it to the second datanode in the list. The second datanode in turn writes each portion to its local storage and forwards it to the third datanode, which writes it to its local storage. A datanode can thus be receiving data from the previous node in the pipeline while forwarding data to the next node: the data flows as if through a pipeline.
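
The flow can be pictured with the toy sketch below (invented method names; real datanodes do this concurrently over the network rather than by recursion):

    import java.util.List;

    class PipelineSketch {
        // Each datanode stores the 4 KB portion locally, then forwards it
        // downstream, so receiving and sending overlap along the pipeline.
        static void writePortion(byte[] portion, List<String> pipeline, int i) {
            if (i == pipeline.size()) return;        // past the last datanode
            storeLocally(pipeline.get(i), portion);
            writePortion(portion, pipeline, i + 1);
        }

        static void storeLocally(String datanode, byte[] portion) { /* local write */ }
    }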

Accessibility

HDFS can be accessed in several ways. It provides a native Java API for applications, and a C-language interface that wraps this Java API is also available. In addition, an HTTP browser can be used to browse the files of an HDFS instance. Access to HDFS through the WebDAV protocol is under development.
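
A minimal sketch of reading a file through the Java API, reusing the /foodir/myfile.txt path from the FS shell examples below:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/foodir/myfile.txt"))));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // print the file's contents line by line
            }
            in.close();
        }
    }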

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a command line interface called the FS shell that lets users interact with data in HDFS. The syntax of these commands is similar to that of other shells that users are already familiar with. The following table shows some sample actions and the corresponding commands.

Action                                                  Command
Create a directory named /foodir                        bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                        bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt    bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is intended for applications that require scripts to process data.

Dfsadmin

The dfsadmin command set is used to manage HDFS clusters. These commands are only used by the HDFS administrator. The following table lists some actions and examples of corresponding commands.

Action                                                  Command
Put the cluster in safemode                             bin/hadoop dfsadmin -safemode enter
Generate a list of datanodes                            bin/hadoop dfsadmin -report
Recommission or decommission datanode(s)                bin/hadoop dfsadmin -refreshNodes

Browser Interface

In a typical HDFS deployment, a web server is configured to expose the HDFS namespace. This allows users to navigate the HDFS namespace and view the contents of its files with a web browser.

Space Reclamation

File deletes and undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file under the /trash directory. The file can be restored quickly as long as it remains in /trash. The length of time a file stays in /trash is configurable; after that time expires, HDFS deletes the file from the namespace. Deleting the file causes the data blocks associated with it to be freed. Note that there can be an appreciable delay between a file's deletion and the corresponding increase in free space in HDFS.

A user can undelete a file as long as it remains in the /trash directory: browse /trash and retrieve the file. The /trash directory contains only the latest copy of each deleted file. /trash is much like any other directory, with one special feature: HDFS automatically deletes files from it according to policy. The current default policy is to delete files that have been in /trash for more than 6 hours. In the future, this policy will be configurable through a well-defined interface.
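
For example, assuming the /trash layout described above, a deleted file could be restored with a command along the lines of:

    bin/hadoop dfs -mv /trash/foodir/myfile.txt /foodir/myfile.txt

(the exact path under /trash depends on the Hadoop version and the user who deleted the file).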

Decrease replication factor

When the replication factor of a file is reduced, the namenode selects excess replicas to delete and passes this information to the datanodes via heartbeat replies. The datanodes remove the corresponding blocks, and the freed space becomes available to the cluster. Note again that there may be a delay between the completion of the setReplication API call and the appearance of the corresponding free space in the cluster.

