Deep Dive into Hadoop HDFS (II)


1. HDFS Architecture Introduction

1.1 HDFS Architecture Challenges

1.2 Architecture Introduction

1.3 File System Namespace

1.4 Data Replication

1.5 Metadata Persistence

1.6 Information Exchange Protocol

2. HDFS Data Accessibility

2.1 Web Interface

2.2 Shell Commands

<1>. HDFS Architecture Introduction

1.1 HDFS Architecture Challenges

HDFS shares many features with existing distributed file systems, but it also has characteristics of its own: high fault tolerance, high data throughput, and so on. To deliver these characteristics, HDFS has to address a number of tricky issues:

1. Hardware failure: an HDFS cluster may consist of a large number of servers, and every one of them can suffer a hardware failure, so HDFS must be able to detect a failed server automatically and recover from the failure automatically. This goal is the primary concern of the HDFS architecture.

2. Streaming data access: HDFS needs to provide streaming access to data for applications.

3. Large file support: files stored on HDFS may be gigabytes or even terabytes in size, so HDFS needs to support large files. It also needs to support a very large number of files in a single instance (on the order of tens of millions of files).

4. Data consistency: HDFS needs to support the "write-once-read-many" access model (a minimal sketch of this pattern follows below).
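As a rough illustration of the write-once-read-many pattern, here is a minimal sketch using the Hadoop Java FileSystem API; the NameNode address hdfs://localhost:9000 and the file path are assumptions made up for this example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://localhost:9000 is an assumed NameNode address for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        Path file = new Path("/foodir/example.txt");

        // Write the file exactly once, then close it.
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // From then on the file is read as many times as needed.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}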

With these architectural requirements in mind, let's look at how HDFS meets them.

1.2 Architecture Introduction

HDFS uses a master/slave model. An HDFS cluster contains a single NameNode and a number of DataNodes. The NameNode plays the master role: it manages the HDFS file system and accepts requests from clients. The DataNodes are mainly used to store the data files: HDFS splits a file into one or more blocks, and these blocks may be stored on one DataNode or spread across multiple DataNodes.
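To make the block placement concrete, the hedged sketch below asks the NameNode where the blocks of one file live, using FileSystem#getFileBlockLocations from the Hadoop Java API; the NameNode address and file path are invented for the example.

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // hdfs://localhost:9000 is an assumed NameNode address for this sketch.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());

        // Ask the NameNode for the metadata of one (hypothetical) file ...
        FileStatus status = fs.getFileStatus(new Path("/foodir/example.txt"));
        // ... and for the location of every block of that file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block reports the DataNodes that hold a replica of it.
            System.out.println("offset " + block.getOffset()
                    + " stored on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}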


Building on the architecture requirements above, Hadoop adopts this master/slave architecture, which consists of the following parts:

1. NameNode: plays the role of the master. It keeps the file system metadata and directs the low-level file I/O operations that the DataNodes carry out.

2. DataNode: runs on the slave machines and is responsible for reading and writing the actual underlying files. When a client program issues a command to read or write a file on HDFS, the file is divided into so-called blocks; the NameNode tells the client which DataNodes the block data is stored on, and the client then interacts with those DataNodes directly.

3. Secondary NameNode: mainly takes periodic snapshots of the NameNode's metadata, so that as little data as possible is lost if the NameNode crashes.

4. JobTracker: acts as the bridge between client programs and Hadoop; there is only one JobTracker instance in the entire Hadoop system.

5. TaskTracker: responsible for executing the individual tasks assigned to it by the JobTracker.

1.3 File System Namespace

HDFS supports the directory structure of traditional file systems: applications can create directories, store files in those directories, create files, move files, remove files, and rename files. Hard links and soft links, however, are not supported.
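The same namespace operations are available programmatically. A minimal sketch with the Hadoop Java FileSystem API might look like the following; the NameNode address and the directory and file names are made up for the example.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; paths below are hypothetical.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());

        fs.mkdirs(new Path("/foodir/subdir"));                                // create a directory
        fs.rename(new Path("/foodir/old.txt"), new Path("/foodir/new.txt"));  // move / rename a file
        fs.delete(new Path("/foodir/new.txt"), false);                        // remove a file

        fs.close();
    }
}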

1.4 Data Replication

HDFS divides a file into blocks and then stores the blocks on different DataNodes. How, then, can the data be kept intact when a DataNode dies? The usual technique is to keep backup copies of the data, and HDFS uses this strategy as well.


Now consider what happens at system startup. The NameNode first enters safe mode, and while in this mode no replication of data takes place. The DataNodes send a Heartbeat and a Blockreport to the NameNode, which lets the NameNode learn which data blocks are stored on each DataNode. The NameNode then checks which blocks have not yet reached the required number of replicas, and it schedules replication for those blocks.
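The replication factor can also be adjusted per file from client code. The sketch below is only an illustration, assuming the hypothetical file /foodir/example.txt already exists; it uses FileSystem#setReplication and reads the factor back from the file status.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address and file path.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        Path file = new Path("/foodir/example.txt");

        // Request three replicas for this file; the NameNode schedules the extra copies.
        fs.setReplication(file, (short) 3);

        // Read the replication factor back from the file's metadata.
        short replication = fs.getFileStatus(file).getReplication();
        System.out.println("replication factor: " + replication);

        fs.close();
    }
}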

1.5 Metadata Persistence

HDFS uses a logging mechanism: all file system operations are recorded in a single log file, while the complete file system image (the mapping of blocks to files and the file system properties) is stored in a file called FsImage on the local file system of the NameNode host. Both the FsImage and the log can be kept in multiple copies, and HDFS guarantees the consistency of these backup files.
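Keeping several copies of these metadata files is a matter of configuration: the classic property dfs.name.dir (dfs.namenode.name.dir in newer releases) takes a comma-separated list of local directories, and the NameNode writes its metadata into each of them. The sketch below only shows how such a value could be set programmatically; in practice it normally lives in hdfs-site.xml, and the directory paths here are invented.

import org.apache.hadoop.conf.Configuration;

public class NameDirExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Two invented local directories; the NameNode keeps a full copy of its
        // metadata (FsImage plus the edit log) in each of them.
        conf.set("dfs.name.dir", "/data/1/dfs/name,/data/2/dfs/name");
        System.out.println("name directories: " + conf.get("dfs.name.dir"));
    }
}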

1.6 Information Exchange Protocol

The previous section mentioned that "the DataNodes send a Heartbeat and a Blockreport to the NameNode", which obviously requires a protocol. The HDFS communication protocols are built on top of TCP/IP: clients exchange information with the NameNode through the ClientProtocol, and the NameNode exchanges information with the DataNodes through the DataNode Protocol.
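Application code never speaks these RPC protocols directly; it goes through the FileSystem client, which wraps the ClientProtocol on top of a TCP connection. A minimal, hedged sketch of opening such a connection (the NameNode address is assumed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConnectToNameNode {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The hdfs:// URI is the NameNode's RPC address; beneath this call the client
        // opens a TCP connection and talks to the NameNode via the ClientProtocol.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        System.out.println("connected to: " + fs.getUri());
        fs.close();
    }
}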

<2>. HDFS Data Accessibility

2.1 Web Interface

The directories and files stored in HDFS can be browsed through a web interface, for example at http://localhost:50075.

2.2 Shell Commands

1. Create a directory

~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop dfs -mkdir /foodir

2. Delete a directory

~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop dfs -rmr /foodir

3. Upload files

~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop dfs -put ./conf/* /foodir

4. View a file

~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop dfs -cat /foodir/capacity-scheduler.xml

5. Delete a file

~/hadoop/src/hadoop-0.21.0$ ./bin/hadoop dfs -rm /foodir/capacity-scheduler.xml

More operations: http://hadoop.apache.org/common/docs/current/file_system_shell.html
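These shell operations can also be driven from Java through the FsShell tool class; a short, hedged sketch (assuming the Hadoop jars and configuration are on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.util.ToolRunner;

public class FsShellExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent of "hadoop dfs -mkdir /foodir" followed by "hadoop dfs -ls /foodir".
        ToolRunner.run(conf, new FsShell(), new String[] {"-mkdir", "/foodir"});
        ToolRunner.run(conf, new FsShell(), new String[] {"-ls", "/foodir"});
    }
}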

//------------------------------------------

Last updated: 2011-06-01, Children's Day.


  Xu Qiang

1. The articles in this blog are a summary of my own learning and project development; there are inevitably shortcomings, and comments are welcome. 2. The copyright of this article belongs to the author and the blog site; please keep this link when reposting.
