Hadoop Distributed System 2

Source: Internet
Author: User
Tags: hadoop, mapreduce

Configure HDFS 

Configuring HDFS is not difficult. First edit the HDFS configuration files, then format the namenode.


Configure Cluster 

Here, we assume that you have downloaded a version of hadoop and decompressed it.

The conf directory under the Hadoop installation directory is where Hadoop stores its configuration files. Several XML files there need to be configured. The conf/hadoop-defaults.xml file contains the default values for every Hadoop parameter and should be treated as read-only. You override the defaults by setting new values in conf/hadoop-site.xml. The hadoop-site.xml files on all machines in the cluster should be identical.

The configuration file itself is a collection of key-value pairs.

XML Code

<property>
  <name>property-name</name>
  <value>property-value</value>
</property>

If a property contains a line such as <final>true</final>, that property cannot be overridden by user applications.

The following attributes must be configured to run HDFS:

 

Key               Value                        Example
fs.default.name   protocol://servername:port   hdfs://alpha.milkman.org:9000
dfs.data.dir      pathname                     /home/username/hdfs/data
dfs.name.dir      pathname                     /home/username/hdfs/name

These attributes have the following meanings:

 fs.default.name - This is a URI describing the namenode of the cluster (including the protocol, host name, and port number). Every machine in the cluster needs to know the namenode address. Datanodes register with the namenode first so that their data becomes available. Standalone client programs use this URI to interact with datanodes and retrieve file block lists.

 dfs.data.dir - This is the path on the local file system where the datanode stores its data. The path does not need to be identical on every datanode, because each machine's environment may differ. However, configuring the same path on every machine makes administration easier. By default the value is under /tmp; that path is suitable only for testing, because data placed there may be lost, so it is best to override this value.

 dfs.name.dir - This is the local file system path where the namenode stores Hadoop file system metadata. The value matters only for the namenode and is not used by datanodes. The warning about /tmp above also applies here; in production it is best to override it.

We will also introduce a configuration parameter called dfs.replication. It determines how many copies of each file block the system keeps. For a production deployment it should be set to 3 (there is no hard upper limit, but more replicas give little additional benefit and consume more space). Fewer than three replicas may affect data reliability (data loss can occur when a node fails).
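
If you want to set it explicitly, a property such as the following can be added to conf/hadoop-site.xml. This is just a minimal sketch; the value 3 matches the recommendation above and is also the usual default, so the property can be omitted entirely.

XML Code

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>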

The following is a template file.

XML Code

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>

The host name your.server.name.com must be replaced with the correct host name for your namenode; the port number can be chosen arbitrarily (9000 is commonly used).

 

After configuring the conf/hadoop-site.xml file, copy the conf directory to the other machines in the cluster.

The master node needs to know the addresses of the other machines in the cluster so that the startup scripts can run properly. The conf/slaves file lists all available host names, one host name per line. In a cluster environment this file does not need to contain the master's address. On a single machine, however, the master must be listed, because otherwise there would be no machine to act as a datanode.
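
For example, a conf/slaves file for a small cluster might look like this (the host names below are placeholders for your own machines):

Shell code

someone@namenode:hadoop$ cat conf/slaves
slave01.example.com
slave02.example.com
slave03.example.com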

Then, create the directory we want to use:

Shell code

user@EachMachine$ mkdir -p $HOME/hdfs/data
user@namenode$ mkdir -p $HOME/hdfs/name

The user who runs Hadoop needs read and write permission on these directories; the chmod command can be used to grant it. In a large cluster, we recommend creating a dedicated hadoop user on every machine and using it to run all Hadoop-related programs. On a single-machine setup your own user name is fine. Running Hadoop as root is not recommended.
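
A minimal sketch of setting this up, assuming a dedicated hadoop user whose home directory is /home/hadoop and holds the hdfs directories created above (adjust paths and user name to your setup):

Shell code

root@EachMachine# chown -R hadoop:hadoop /home/hadoop/hdfs   # only needed if another account created the directories
hadoop@EachMachine$ chmod -R u+rwx $HOME/hdfs                # ensure the hadoop user can read and write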

 Start HDFS

Now, first format the Hadoop file system we just configured.

Shell code

user@namenode:hadoop$ bin/hadoop namenode -format 

This operation only needs to be performed once. When formatting is complete, we can start the distributed file system.

Shell code

user@namenode:hadoop$ bin/start-dfs.sh

This command starts the namenode program on the master machine and the datanode programs on the slave machines. In a single-machine cluster, slave and master are the same machine. In a real cluster environment, this command logs in to each slave via SSH and starts its datanode program.

 

Interacting with HDFS

In this section, we become familiar with the commands needed to interact with HDFS: storing files and retrieving files.

Most commands are executed through the bin/hadoop script. It loads the Hadoop system in a Java virtual machine and runs the user's command. The commands usually take the following form.

Shell code

user@machine:hadoop$ bin/hadoop moduleName -cmd args...

moduleName tells the script which Hadoop module to use; -cmd is the name of a command supported by that module. The command's arguments follow the command name.

HDFS provides two such modules: dfs and dfsadmin. Their usage is described below.
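
For example, both of the following invocations follow the moduleName -cmd pattern (the dfsadmin -report command is described later in this article):

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -ls /          # dfs module, ls command
someone@anynode:hadoop$ bin/hadoop dfsadmin -report   # dfsadmin module, report command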

Examples of Common Operations

The dfs module, also known as "FsShell", provides basic file manipulation commands. Some of its uses are described below.

A cluster is useful because it holds data we are interested in, so the first operation to introduce is writing data into the cluster. Assume the user is "someone"; adjust this to your actual situation. The commands can be run on any machine that can access the cluster, provided its conf/hadoop-site.xml points at the cluster's namenode. The commands are run from the installation directory, which might be /home/someone/src/hadoop or /home/foo/hadoop, depending on your setup. The commands below cover importing data into HDFS, verifying that the data was actually imported, and exporting it again.

 List files

If we try to inspect HDFS at this point, we will not find anything interesting there:

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -ls
someone@anynode:hadoop$

The -ls command with no arguments returned nothing. By default it lists the contents of your HDFS "home" directory, which is not the same as /home/$USER on an ordinary system (HDFS is independent of the local file system). There is no concept of a current working directory, and no cd command, in HDFS.

Give -ls an argument such as /, and you may see the following:

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2008-09-20 19:40 /hadoop
drwxr-xr-x   - hadoop supergroup          0 2008-09-20 20:08 /tmp

These entries were created by the system. In the listing, "hadoop" is the user we used to start Hadoop, and "supergroup" is a group containing that user. These directories allow the Hadoop MapReduce system to move data between nodes; the details are described in Module 4.

We need to create our own home directory and import some files.

 Import Data

A typical Unix or Linux system stores user files under /home/$USER, but HDFS stores them under /user/$USER. For some commands, such as ls, a directory argument is optional; when it is omitted, the default (home) directory is used. (Other commands generally require explicit source and destination paths.) Relative paths used in HDFS are resolved against this base directory, i.e. the user's home directory.

 Step 1: If your user directory does not exist, create it.

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user

If the /user directory does not exist, it must be created first. (It would actually be created automatically when needed, but for the purpose of this introduction we create it manually.)

Now we can create our home directory.

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -mkdir /user/someone

Replace /user/someone with /user/yourUserName.

Step 2: Import a file using the "put" command.

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -put /home/someone/interestingfile.txt /user/yourUserName/

This copies /home/someone/interestingfile.txt from the local file system to /user/yourUserName/interestingfile.txt in HDFS.

Step 3: Verify the operation. Either of the following two commands can be used; they are equivalent here:

Shell code
someone@anynode:hadoop$ bin/hadoop dfs -ls /user/yourUserName
someone@anynode:hadoop$ bin/hadoop dfs -ls

You should see a file listing, preceded by "Found 1 items", that contains the file you just uploaded.

The following table shows example uses of the put command and their effects:

 

Command                                    Prerequisites                              Output
bin/hadoop dfs -put foo bar                No file /user/$USER/bar exists in HDFS     Uploads local file foo to HDFS as /user/$USER/bar
bin/hadoop dfs -put foo bar                /user/$USER/bar is a directory             Uploads local file foo to HDFS as /user/$USER/bar/foo
bin/hadoop dfs -put foo somedir/somefile   /user/$USER/somedir does not exist         Uploads local file foo to HDFS as /user/$USER/somedir/somefile, creating the missing directory
bin/hadoop dfs -put foo bar                /user/$USER/bar already exists as a file   No operation is performed; an error is returned to the user

When a "put" operation is executed, the result is all-or-nothing. When uploading a file, the data is first copied to the datanodes; only when all datanodes have received the data is the file handle closed and the upload complete. Based on the return value of the put command, we know whether the operation succeeded or failed entirely. A file can never be half-uploaded: if the upload is interrupted partway through, HDFS treats it as if it never happened.
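
Because of this all-or-nothing behavior, a script can simply test the command's exit status. A minimal sketch (bigfile.txt is a placeholder, and this assumes the command exits with a nonzero status on failure, as is typical):

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -put bigfile.txt /user/someone/ || echo "upload failed; nothing was written to HDFS"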

 Step 4: The put command can upload multiple files at once. It can also upload an entire directory to HDFS.

Create a local directory and copy some files into it, as shown below.

Shell code

someone@anynode:hadoop$ ls -R myfiles
myfiles:
file1.txt  file2.txt  subdir/

myfiles/subdir:
anotherFile.txt
someone@anynode:hadoop$

The directory myfiles/ can then be copied into HDFS as follows:

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -put myfiles /user/myUserName
someone@anynode:hadoop$ bin/hadoop dfs -ls
Found 1 items
/user/someone/myfiles   <dir>    2008-06-12 20:59    rwxr-xr-x    someone    supergroup
user@anynode:hadoop$ bin/hadoop dfs -ls myfiles
Found 3 items
/user/someone/myfiles/file1.txt   <r 1>   186731  2008-06-12 20:59  rw-r--r--  someone   supergroup
/user/someone/myfiles/file2.txt   <r 1>   168     2008-06-12 20:59  rw-r--r--  someone   supergroup
/user/someone/myfiles/subdir      <dir>           2008-06-12 20:59  rwxr-xr-x  someone   supergroup

The listing above confirms that the whole directory was copied. Note the <r 1> next to each file path: the number 1 is the replication factor of that file. The ls command also shows each file's size, modification time, permissions, and owner.

 Another name for -put is -copyFromLocal. Their function and usage are identical.
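
For example, the upload from Step 2 could equally have been written as:

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -copyFromLocal /home/someone/interestingfile.txt /user/yourUserName/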

 Exporting data from HDFS

There are several ways to get data out of HDFS. The simplest is to use "cat" to print a file's contents to standard output. (The output can, of course, also be piped into another program or redirected elsewhere.)

Step 1: Cat command.

In this example, we assume that you have uploaded some files to your HDFS.

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -cat foo
(contents of foo are displayed here)
someone@anynode:hadoop$

 Step 2: Copy a file from HDFS to the local system.

The "get" command does the opposite of "put": it copies a file or directory from HDFS to the local file system. The alias for "get" is -copyToLocal.

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -get foo localFoo
someone@anynode:hadoop$ ls
localFoo
someone@anynode:hadoop$ cat localFoo
(contents of foo are displayed here)

Like the "put" command, "get" can operate on both files and directories.
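
For instance, the myfiles directory uploaded earlier could be retrieved into a local directory (localMyfiles is just a placeholder name):

Shell code

someone@anynode:hadoop$ bin/hadoop dfs -get myfiles localMyfiles
someone@anynode:hadoop$ ls -R localMyfiles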

Shut down HDFS

If you want to shut down the HDFS cluster, run the following command on the namenode:

Shell code

someone@namenode:hadoop$ bin/stop-dfs.sh

 

This command must be executed by the same user who started HDFS.

 HDFS command reference

There are many more bin/hadoop dfs commands than those shown above, but they are enough to get you started with HDFS. Running bin/hadoop dfs with no arguments lists all FsShell commands, and bin/hadoop dfs -help commandName prints the usage guide for a specific command.
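
For example (mkdir is used here only as a sample command name):

Shell code

someone@anynode:hadoop$ bin/hadoop dfs                 # lists all FsShell commands
someone@anynode:hadoop$ bin/hadoop dfs -help mkdir     # usage guide for the mkdir command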

All commands are listed below as a table. Here are some notes for this table:

  • Italics indicate variables that must be supplied by the user.
  • "path" is a file or directory name.
  • "path..." is one or more file or directory names.
  • "file" is any file name.
  • "src" and "dest" are path names in HDFS.
  • "localSrc" and "localDest" are paths on the local file system; all other paths refer to locations inside HDFS.
  • Parameters in [ ] are optional.

 

Command                                Operation
-ls path                               Lists the contents of the directory specified by path, showing the names, permissions, owner, size, and modification date of each entry.
-lsr path                              Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du path                               Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.
-dus path                              Like -du, but prints a summary of disk usage of all files/directories in the path.
-mv src dest                           Moves the file or directory indicated by src to dest, within HDFS.
-cp src dest                           Copies the file or directory identified by src to dest, within HDFS.
-rm path                               Removes the file or empty directory identified by path.
-rmr path                              Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).
-put localSrc dest                     Copies the file or directory from the local file system identified by localSrc to dest within the DFS.
-copyFromLocal localSrc dest           Identical to -put.
-moveFromLocal localSrc dest           Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.
-get [-crc] src localDest              Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
-getmerge src localDest [addnl]        Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the local file system identified by localDest.
-cat filename                          Displays the contents of filename on stdout.
-copyToLocal [-crc] src localDest      Identical to -get.
-moveToLocal [-crc] src localDest      Works like -get, but deletes the HDFS copy on success.
-mkdir path                            Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).
-setrep [-R] [-w] rep path             Sets the target replication factor for files identified by path to rep. (The actual replication factor will move toward the target over time.)
-touchz path                           Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.
-test -[ezd] path                      Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.
-stat [format] path                    Prints information about path. format is a string which accepts file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).
-tail [-f] file                        Shows the last 1 KB of file on stdout.
-chmod [-R] mode,mode,... path...      Changes the file permissions associated with one or more objects identified by path... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.
-chown [-R] [owner][:[group]] path...  Sets the owning user and/or group for files or directories identified by path... Sets owner recursively if -R is specified.
-chgrp [-R] group path...              Sets the owning group for files or directories identified by path... Sets group recursively if -R is specified.
-help cmd                              Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.

dfsadmin command reference

The "dfs" module provides commands for operating on files and directories, while "dfsadmin" provides commands for managing the file system as a whole.

Global status information: The bin/hadoop dfsadmin -report command produces a global status report. It contains basic information about the HDFS cluster as well as the status of each individual machine.

 Detailed status information: The bin/hadoop dfsadmin -metasave filename command records detailed block information in the file named filename. Note: despite what the command's help text says, the namenode's primary data structures cannot be restored from this output; it does, however, show how the namenode has stored HDFS file blocks.

Safe mode: In safe mode, HDFS is read-only; any replication, creation, or deletion operation is prohibited. When the namenode starts, it automatically enters this mode. The datanodes then register with the namenode and report which file blocks they hold, so the namenode learns which block replicas are missing. Once a certain percentage of blocks have been reported (configured by dfs.safemode.threshold.pct), the namenode exits safe mode automatically and HDFS resumes normal operation. Safe mode can also be controlled manually with dfsadmin -safemode <what>; the possible values of <what> are listed below, followed by a short example.

 

  • enter - enter safe mode
  • leave - force the namenode to exit safe mode
  • get - report whether safe mode is on or off
  • wait - wait until safe mode has ended
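
For example, to check the current state, or to make a script block until safe mode ends (the exact output wording may vary between Hadoop versions):

Shell code

someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode get
Safe mode is OFF
someone@namenode:hadoop$ bin/hadoop dfsadmin -safemode wait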

Changing HDFS membership - When removing a node, it should be decommissioned gradually so that no data is lost. The decommissioning command is discussed later.


Upgrading HDFS - When HDFS is upgraded from one version to another, the file formats used by the namenode and datanodes may change. The first time you start the new version, run bin/start-dfs.sh -upgrade to tell Hadoop to change the HDFS version (otherwise the new version will not take effect). The upgrade then begins. You can check its progress with bin/hadoop dfsadmin -upgradeProgress status, or get more detail with bin/hadoop dfsadmin -upgradeProgress details. If the upgrade becomes stuck, bin/hadoop dfsadmin -upgradeProgress force forces it to continue (think carefully before using this command).
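
A sketch of this sequence, assuming the new Hadoop version has already been installed in place of the old one:

Shell code

someone@namenode:hadoop$ bin/stop-dfs.sh                              # stop the old version first
someone@namenode:hadoop$ bin/start-dfs.sh -upgrade                    # start the new version and begin the upgrade
someone@namenode:hadoop$ bin/hadoop dfsadmin -upgradeProgress status  # check how far the upgrade has progressed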

After HDFS is upgraded, Hadoop retains information about the old version so that HDFS can easily be downgraded again. The downgrade is performed with bin/start-dfs.sh -rollback.


Hadoop keeps only one previous version at a time. After the new version has run for a few days, the bin/hadoop dfsadmin -finalizeUpgrade command can be used to delete the old version's backup from the system. After that, the rollback command no longer works. Finalizing is also required before another version upgrade can be performed.

Getting help - As with the dfs module, bin/hadoop dfsadmin -help cmd shows usage information for a specific command.
