Original address: http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html
Translator: Dennis Zhuang (killme2008@gmail.com); please point out any mistakes, thank you.
Purpose
This document is a starting point for users working with the Hadoop Distributed File System (HDFS), either as part of a Hadoop cluster or as a standalone distributed file system. While HDFS is designed to just work in many environments, a working knowledge of HDFS helps greatly with configuration improvements and diagnostics on a cluster.
Overview
HDFS is the primary distributed storage used by Hadoop applications. An HDFS cluster consists of a NameNode that manages the file system metadata and DataNodes that store the actual data. The HDFS architecture is described in detail here. This user guide primarily deals with the interaction of users and administrators with an HDFS cluster. The diagram in the HDFS architecture document depicts the basic interactions among the NameNode, the DataNodes, and the clients. Essentially, clients contact the NameNode to obtain or modify file metadata, and perform actual file IO directly with the DataNodes.
The following list highlights the salient features of HDFS that most users care about. The italicized terms are described in more detail later.
1) Hadoop, including HDFS, is well suited for distributed storage and distributed processing on inexpensive commodity machines. It is fault tolerant, scalable, and extremely simple to expand. MapReduce, well known for its simplicity and applicability, is an integral part of Hadoop.
2) The default configuration of HDFS is suitable for most installations. Typically, the defaults need to be changed only for very large clusters.
3) HDFS is written in Java and is supported on all major platforms.
4) HDFS supports shell-like commands for interacting with HDFS directly.
5) The NameNode and DataNodes have built-in web servers that make it easy to check the current status of the cluster.
6) New features and improvements are regularly implemented in HDFS. The following is a subset of the useful features in HDFS:
File permissions and authentication.
Rack awareness: takes a node's physical location into account while scheduling tasks and allocating storage.
Safemode: an administrative state for maintenance.
fsck: a utility to diagnose the health of the file system and to find missing files or blocks.
Rebalancer: a tool to rebalance the cluster when data is unevenly distributed among DataNodes.
Upgrade and rollback: after a Hadoop software upgrade, it is possible to roll HDFS back to its state before the upgrade in case unexpected problems arise.
Secondary NameNode: helps the NameNode keep the file containing the log of HDFS modifications (the edits log file, described below) within a size limit.
Prerequisites
The following documentation describes the installation and setup of a Hadoop cluster:
Hadoop Quickstart for first-time users
Hadoop Cluster Setup for large, distributed clusters
The remainder of this document assumes that you have already set up and are running HDFS with at least one DataNode. For the purposes of this document, the NameNode and a DataNode may run on the same machine.
Web Interface
The NameNode and the DataNodes each run a built-in web server that displays basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode:50070/ (where namenode is the hostname or IP address of the NameNode machine). This page lists all the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page).
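The front page can also be fetched from the command line for a quick liveness check (a sketch; namenode is a placeholder hostname, as above):

    curl http://namenode:50070/    # returns the NameNode status page as HTML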
Shell Commands
Hadoop includes various shell-like commands that directly interact with HDFS and the other file systems that Hadoop supports. The command bin/hadoop fs -help lists the commands supported by the Hadoop shell. Further, bin/hadoop fs -help command-name displays detailed help for a particular command. These commands support most of the normal file system operations, such as copying files and changing file permissions. Some HDFS-specific operations are also supported, such as changing the replication factor of a file.
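A few examples, as a minimal sketch; the paths and the replication factor below are illustrative, not part of the original guide:

    bin/hadoop fs -ls /                              # list the root of the file system
    bin/hadoop fs -put local.txt /user/demo          # copy a local file into HDFS
    bin/hadoop fs -chmod 644 /user/demo/local.txt    # change file permissions
    bin/hadoop fs -setrep -w 3 /user/demo/local.txt  # change the replication factor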
dfsadmin Command
The 'bin/hadoop dfsadmin' command supports a few HDFS administration operations. 'bin/hadoop dfsadmin -help' lists all the commands currently supported. For example:
-report: reports basic statistics of HDFS. Some of this information is also available on the NameNode front page.
-safemode: though usually not required, an administrator can manually enter or leave safemode.
-finalizeUpgrade: removes the previous backup of the cluster made during the last upgrade.
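As a brief illustration of the commands just listed (a sketch; output is omitted):

    bin/hadoop dfsadmin -help              # list all supported administration commands
    bin/hadoop dfsadmin -report            # capacity, usage, and per-DataNode statistics
    bin/hadoop dfsadmin -finalizeUpgrade   # discard the backup from the last upgrade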
Secondary NameNode
The NameNode stores modifications to the file system as a log appended to a native file system file (a file named edits). When a NameNode starts up, it reads the HDFS state from an image file (fsimage), applies the edits from the edits log file to that in-memory state, and then writes the new HDFS state back to fsimage. Normal operation then begins with an empty edits log file. Since the NameNode merges fsimage and edits only during startup, the edits file can grow very large over time on a busy cluster. A side effect of a large edits file is that the next restart of the NameNode takes a long time. The Secondary NameNode solves this problem: it periodically merges the fsimage and edits log files and keeps the size of the edits log within a limit. It usually runs on a different machine than the primary NameNode, since its memory requirements are on the same order as those of the primary NameNode. The Secondary NameNode is started by 'bin/start-dfs.sh' on the nodes specified in the conf/masters configuration file.
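A minimal sketch of the convention just described; the hostname is hypothetical:

    echo "checkpoint-host" > conf/masters   # hosts listed here run the Secondary NameNode
    bin/start-dfs.sh                        # starts HDFS, including a Secondary NameNode on checkpoint-host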
Rebalancer
HDFS data might not always be placed uniformly across DataNodes. One common reason is the addition of new DataNodes to an existing cluster. While placing blocks, the NameNode considers various parameters before choosing the DataNodes that receive them. Some of the considerations are:
1) Keep one replica of a block on the same node as the node that is writing the block.
2) Spread different replicas of a block across racks, so that the cluster can survive the loss of a whole rack.
3) Keep one replica on a node in the same rack as the writing node, reducing cross-rack network IO.
4) Spread HDFS data uniformly across the DataNodes in the cluster.
Because of these competing considerations, data might not be placed uniformly across the DataNodes. HDFS provides administrators with a tool that analyzes block placement and rebalances data across the DataNodes. This feature is not yet implemented; a description of it is available in the PDF document attached to HADOOP-1652.
Rack Awareness
Typically, large Hadoop clusters are arranged in racks, and network traffic between nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block across several racks for improved fault tolerance. Hadoop lets the cluster administrator decide which rack a node belongs to through the configuration variable dfs.network.script. When this script is configured, each node runs it to determine its rack id. A default installation assumes that all nodes belong to the same rack. This feature and configuration are further described in the PDF document attached to HADOOP-692.
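As a loose sketch of what such a script could look like (everything here is hypothetical; the guide does not prescribe a script format, and real deployments derive rack ids from their own network layout):

    #!/bin/sh
    # hypothetical dfs.network.script: print this node's rack id on stdout,
    # derived here (purely for illustration) from the third octet of its IP
    ip=`hostname -i`
    echo "/rack-`echo $ip | cut -d. -f3`"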
Safemode
During startup, the NameNode loads the file system state from the fsimage and edits log files. It then waits for the DataNodes to report their blocks, so that it does not prematurely start replicating blocks before confirming that enough replicas already exist. During this time the NameNode is in the so-called safemode state. Safemode for the NameNode is essentially a read-only mode for the HDFS cluster: no modifications to the file system or to blocks are allowed. Normally, the NameNode leaves safemode automatically after startup. If required, HDFS can be placed in safemode explicitly using the 'bin/hadoop dfsadmin -safemode' command. The NameNode front page shows whether safemode is on or off. A more detailed description and configuration is available in the Javadoc for the setSafeMode() method.
In more detail, regarding the safemode configuration parameters: while in safemode, the NameNode waits for the DataNodes to report their blocks, in order to check whether each block has the minimum required number of replicas. This minimum is configured with the dfs.replication.min parameter and defaults to 1, meaning at least one replica is required. The NameNode leaves safemode once a certain percentage of blocks have reached the minimum number of replicas. This percentage is also configurable, through the dfs.safemode.threshold.pct parameter, and defaults to 0.999f (that is, 99.9% of blocks must satisfy the minimum). Even when the threshold is reached, the NameNode does not leave safemode immediately; it waits through an extension period so that the remaining DataNodes can report their blocks. The extension defaults to 30 seconds and is configured, in milliseconds, with the dfs.safemode.extension parameter.
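A short sketch of manipulating safemode by hand with the dfsadmin subcommands; besides enter and leave, the get and wait subcommands mentioned here are additional dfsadmin options, listed for completeness:

    bin/hadoop dfsadmin -safemode get     # report whether safemode is ON or OFF
    bin/hadoop dfsadmin -safemode enter   # put the cluster in safemode explicitly
    bin/hadoop dfsadmin -safemode leave   # force the NameNode out of safemode
    bin/hadoop dfsadmin -safemode wait    # block until the NameNode exits safemode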
Fsck
HDFS supports the fsck command to check for various inconsistencies. It is designed to report problems with various files, such as missing blocks for a file or under-replicated blocks. Unlike the traditional fsck utility for native file systems, this command does not correct the errors it detects. Normally, the NameNode automatically corrects most of the recoverable failures. HDFS fsck is not a Hadoop shell command; it can be run as 'bin/hadoop fsck' and applied to the whole file system or to a subset of files.
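For example (a sketch; the path and the extra flags below are illustrative):

    bin/hadoop fsck /                           # check the entire file system
    bin/hadoop fsck /user/demo -files -blocks   # check one subtree, printing files and their blocks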
Upgrade and Rollback
When Hadoop is upgraded on an existing cluster, as with any software upgrade, new bugs or incompatible changes may surface that affect existing applications and were not discovered earlier. In any non-trivial HDFS installation, losing any data is not an option, let alone restarting HDFS from scratch. HDFS therefore allows administrators to go back to the earlier version of Hadoop and roll the cluster back to the state it was in before the upgrade. HDFS upgrade details are described on the upgrade wiki page. HDFS can keep only one such backup at a time, so before upgrading, administrators need to remove the existing backup using the 'bin/hadoop dfsadmin -finalizeUpgrade' command. The following briefly describes the typical upgrade procedure (a condensed command sketch follows the list):
1) Before upgrading Hadoop software, finalize if there is an existing backup. 'bin/hadoop dfsadmin -upgradeProgress status' can tell whether the cluster needs to be finalized.
2) Stop the cluster and distribute the new version of Hadoop.
3) Run the new version with the -upgrade option, e.g. bin/start-dfs.sh -upgrade.
4) Most of the time, the cluster works just fine after the upgrade. Once the new HDFS has run without problems for a few days, finalize the upgrade. Note that until the cluster is finalized, files deleted before the upgrade do not free up their actual disk space on the DataNodes.
5) If there is a need to move back to the old version of Hadoop:
a) stop the cluster and distribute the earlier version of Hadoop;
b) start the cluster with the rollback option, e.g. bin/start-dfs.sh -rollback.
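The same procedure condensed into commands (a sketch of the steps above; distributing the release itself is site-specific and elided):

    bin/hadoop dfsadmin -upgradeProgress status   # does the cluster still need a finalize?
    bin/hadoop dfsadmin -finalizeUpgrade          # if so, finalize the previous upgrade
    bin/stop-dfs.sh                               # stop HDFS
    # ... install the new Hadoop release on all nodes ...
    bin/start-dfs.sh -upgrade                     # start HDFS with the new version
    # after the new version has proven itself:
    bin/hadoop dfsadmin -finalizeUpgrade
    # or, to roll back instead:
    bin/stop-dfs.sh
    # ... restore the earlier Hadoop release ...
    bin/start-dfs.sh -rollback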
File Permissions and Security
The file permissions are designed to be similar to those of other familiar platforms such as Linux. In the current implementation, security is limited to simple file permissions. The user that starts the NameNode is treated as the HDFS superuser. Future versions of HDFS will support network authentication, such as user authentication with Kerberos (an authentication system developed at MIT), along with encryption of data transfers. A more detailed discussion is available in the Permissions User and Administrator Guide.
Scalability
Hadoop currently runs on clusters with thousands of nodes. The PoweredBy Hadoop page lists some of the organizations that deploy Hadoop on large clusters. HDFS has one NameNode per cluster, and the memory available on the NameNode machine is the primary scalability limitation at present. On very large clusters, increasing the average size of the files stored in HDFS helps increase cluster capacity without increasing the NameNode's memory requirements. The default configuration may not be suitable for very large clusters. The Hadoop FAQ page lists suggested configuration improvements for large Hadoop clusters.
Related Documentation
This user guide is a good starting point for working with HDFS. While it continues to improve, there is a large wealth of valuable documentation about Hadoop and HDFS. The following list provides further starting points for exploration:
Hadoop Home Page: the start page for everything Hadoop.
Hadoop Wiki: documentation maintained by the community.
FAQ from the Hadoop Wiki.
Hadoop JavaDoc API.
Hadoop user mailing list: core-user[at]hadoop.apache.org.
Explore conf/hadoop-default.xml: it includes brief descriptions of the currently available configuration variables.