Hadoop Distributed File System (HDFS) in Detail


This article takes a detailed look at the Hadoop Distributed File System (HDFS).

Outline:

1. HDFS design goals

2. Namenode and Datanode in HDFS

3. Two ways to operate HDFS

1. HDFS Design Goals

Hardware failure

Hardware failure is the norm rather than the exception. (Every time I read this I can't help thinking: programmer overtime is also not the exception.) An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. Because the number of components in the system is huge and any component can fail, some part of HDFS is effectively always out of service. Detecting faults and recovering from them quickly and automatically is therefore a core architectural goal of HDFS.

Streaming data access

The idea behind the HDFS design is "write once, read many times". Hadoop is typically used for long-running analysis in which each job reads most of a dataset or the entire dataset, so the time to read the whole dataset matters more than the latency of reading the first record. HDFS would rather be slow to return the first bytes but fast to stream the entire dataset than fast on the first bytes and slow on everything that follows.
Large data sets

Very large files: data at the GB, TB, and even PB scale.

A simple consistency model

HDFS applications need a write-once-read-many access model for files. A file does not need to change once it has been created, written, and closed. This assumption simplifies data consistency issues and makes high-throughput data access possible. MapReduce applications and web crawlers fit this model well. There are plans to extend the model in the future to support appending writes to files.

"Mobile computing is more cost-effective than moving data"

Running a computation on the node that stores its data reduces data exchange over the network and yields higher throughput.

Portability across heterogeneous hardware and software platforms

HDFS is designed to be easily portable from one platform to another. This helps HDFS spread as a platform for large-scale data applications.

2. Namenode and Datanode in HDFS

HDFS uses a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. A Datanode typically runs one per node in the cluster (a node can be understood as a host) and manages the storage attached to that node. HDFS exposes a file system namespace and lets users store data in the form of files. Internally, a file is split into one or more blocks, which are stored on a set of Datanodes. The Namenode executes namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to specific Datanodes. Datanodes serve read and write requests from the file system's clients, and they create, delete, and replicate blocks under the coordination of the Namenode.
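The mapping of blocks to Datanodes can be inspected from the command line with hadoop fsck. A minimal sketch, where the file path is illustrative:

    # list the blocks of a file and the Datanodes that hold each block
    hadoop fsck /user/hadoop/file1 -files -blocks -locations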

Data blocks: a computer's disk has a default block size, and a file system's block size is an integer multiple of the disk block size, typically a few kilobytes, while a disk block is usually 512 bytes. HDFS also has the concept of a block, 64 MB by default. Why so large? Disk I/O is slow because seeking is a mechanical operation, and Hadoop mainly processes large data sets, so grouping data into large contiguous blocks saves addressing time and reduces the overall time spent processing the data.
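The block size actually used for a stored file can be checked with the FS shell's stat command. This is a sketch that assumes the %o (block size) and %r (replication) format specifiers are supported by this Hadoop version; the path is illustrative:

    hadoop fs -stat "block size: %o, replication: %r" /user/hadoop/file1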

Hadoop provides an abstract file system with a variety of interfaces and implementations; HDFS is just one implementation of this abstract file system.
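One consequence of this abstraction is that the same FS shell command works against different file system implementations, selected purely by the URI scheme. A small sketch (the namenode host and port are illustrative):

    hadoop fs -ls file:///tmp                        # the local file system implementation
    hadoop fs -ls hdfs://master:9000/user/hadoop     # the HDFS implementation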

An example:

Suppose I want to upload a 2 GB movie to HDFS. The process looks like this:

1. For data safety, HDFS keeps 3 copies of the movie by default (call them the original, backup one, and backup two). The number of copies can be adjusted (see the command sketch after this list).

2. The 2 GB of movie data is split into 32 blocks (2048 MB / 64 MB per block = 32) that are stored on the Datanodes. The metadata about the movie is kept on the Namenode.

3. The original and backup one are placed on the same rack, and backup two on another rack. If the original is lost for some reason, backup one can be used instead (advantage: fast, avoiding data transfer across the network between racks); if the whole rack is destroyed, the data can still be found on the second rack (advantage: safety).
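As mentioned in step 1, the number of copies is adjustable. A sketch, assuming the default comes from the dfs.replication property in hdfs-site.xml; the file path below is illustrative:

    # change the replication factor of an existing file to 3 and wait for it to take effect
    hadoop fs -setrep -w 3 /user/hadoop/movie.avi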

3. Operating HDFS

First start Hadoop. The startup session looks like this:

    [root@master ~]# su hadoop                              // switch to the hadoop user
    [hadoop@master root]$ cd /usr/hadoop/hadoop-1.0.4/bin   // enter the bin directory
    [hadoop@master bin]$ ls
    hadoop              start-all.sh                stop-balancer.sh
    hadoop-config.sh    start-balancer.sh           stop-dfs.sh
    hadoop-daemon.sh    start-dfs.sh                stop-jobhistoryserver.sh
    hadoop-daemons.sh   start-jobhistoryserver.sh   stop-mapred.sh
    rcc                 start-mapred.sh             task-controller
    slaves.sh           stop-all.sh
    [hadoop@master bin]$ ./start-all.sh                     // start all daemons
    starting namenode, logging to /usr/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-namenode-master.hadoop.out
    192.168.81.129: starting datanode, logging to /usr/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-datanode-slave01.hadoop.out
    192.168.81.130: ssh: connect to host 192.168.81.130 port 22: No route to host
    192.168.81.131: ssh: connect to host 192.168.81.131 port 22: No route to host
    192.168.81.128: starting secondarynamenode, logging to /usr/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-secondarynamenode-master.hadoop.out
    starting jobtracker, logging to /usr/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-jobtracker-master.hadoop.out
    192.168.81.129: starting tasktracker, logging to /usr/hadoop/hadoop-1.0.4/libexec/../logs/hadoop-hadoop-tasktracker-slave01.hadoop.out
    192.168.81.130: ssh: connect to host 192.168.81.130 port 22: No route to host
    192.168.81.131: ssh: connect to host 192.168.81.131 port 22: No route to host
    [hadoop@master bin]$
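A quick way to check that the daemons really came up is the JDK's jps tool. A sketch; the exact process list depends on which daemons run on the node:

    [hadoop@master bin]$ jps
    # on this master node one would expect NameNode, SecondaryNameNode and JobTracker (plus Jps itself);
    # on the slave nodes, DataNode and TaskTracker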

There are two ways to operate HDFS. 1. The FS shell

The FS shell commands are described in detail below.

cat

How to use: hadoop fs -cat URI [URI ...]

Outputs the contents of the files at the given paths to stdout.

Example:
    hadoop fs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
    hadoop fs -cat file:///file3 /user/hadoop/file4

return value:
Returns 0 on success and -1 on failure.

chgrp

How to use: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Changes the group that a file belongs to. With -R, the change is applied recursively through the directory structure. The user running the command must be the owner of the file or the superuser.
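A short example (the group name and path are illustrative):

    hadoop fs -chgrp -R hadoop /user/hadoop/dir1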

chmod

How to use: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Changes the permissions of files. With -R, the change is applied recursively through the directory structure. The user running the command must be the owner of the file or the superuser.
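A short example (the paths are illustrative):

    hadoop fs -chmod -R 755 /user/hadoop/dir1
    hadoop fs -chmod a+r /user/hadoop/file1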

chown

How to use: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]

Changes the owner of files. With -R, the change is applied recursively through the directory structure. The user running the command must be the superuser.
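A short example (owner, group, and path are illustrative):

    hadoop fs -chown -R hadoop:hadoop /user/hadoop/dir1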

copyFromLocal

How to use: hadoop fs -copyFromLocal <localsrc> URI

Similar to the put command, except that the source path must be a local file.
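A short example (the paths are illustrative):

    hadoop fs -copyFromLocal localfile.txt /user/hadoop/localfile.txt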

copyToLocal

How to use: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to the get command, except that the destination path must be a local file.
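A short example (the paths are illustrative):

    hadoop fs -copyToLocal /user/hadoop/file1 ./file1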

cp

How to use: hadoop fs -cp URI [URI ...] <dest>

Copies the file from the source path to the target path. This command allows multiple source paths, at which point the destination path must be a directory.
Example:
    hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
    hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

return value:

Returns 0 on success and -1 on failure.

du

How to use: hadoop fs -du URI [URI ...]

Displays the size of all files in the directory, or the size of the file when only one file is specified.
Example:
    hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://host:port/user/hadoop/dir1
return value:
Returns 0 on success and -1 on failure.

dus

How to use: hadoop fs -dus <args>

Displays a summary of file sizes.
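A short example (the path is illustrative):

    hadoop fs -dus /user/hadoop/dir1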

expunge

How to use: hadoop fs -expunge

Empties the trash. Refer to the HDFS design documentation for more information on the trash feature.

get

How to use: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>

Copies files to the local file system. Files that fail the CRC check can be copied with the -ignorecrc option. Use the -crc option to copy the files along with their CRC information.

Example:
    hadoop fs -get /user/hadoop/file localfile
    hadoop fs -get hdfs://host:port/user/hadoop/file localfile

return value:

Returns 0 on success and -1 on failure.

getmerge

How to use: hadoop fs -getmerge <src> <localdst> [addnl]

Takes a source directory and a destination file as input and concatenates all files in the source directory into the local destination file. addnl is optional and specifies that a newline character should be appended at the end of each file.
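A short example (the paths are illustrative):

    hadoop fs -getmerge /user/hadoop/dir1 ./merged.txt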

ls

How to use: hadoop fs -ls <args>

If it is a file, information about the file is returned in the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid groupid
If it is a directory, it returns a list of its immediate children, as in Unix. A directory is listed as:
dirname <dir> modification_date modification_time permissions userid groupid
Example:
    hadoop fs -ls /user/hadoop/file1 /user/hadoop/file2 hdfs://host:port/user/hadoop/dir1 /nonexistentfile
return value:
Returns 0 on success and -1 on failure.

lsr

How to use: hadoop fs -lsr <args>

The recursive version of the ls command. Similar to ls -R in Unix.

mkdir

How to use: hadoop fs -mkdir <paths>

Takes URI paths as arguments and creates the directories.
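A short example (the paths are illustrative):

    hadoop fs -lsr /user/hadoop
    hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2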
