HDFS Main Features and Architecture

Introduction

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications that work with large-scale datasets. HDFS relaxes a few POSIX requirements in order to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch search engine project and is now part of the Apache Hadoop core project. The project address is http://hadoop.apache.org/core/.

Main Features of HDFS

1. HDFS has the following key features:

Handling very large files: a single file can hold gigabytes, terabytes, or even petabytes of data.

Dynamic cluster expansion: nodes can join the cluster dynamically, and a cluster can scale to hundreds or thousands of nodes.

Streaming data read-write: HDFS is designed around the idea of "write once, read many times". Once a dataset is generated by a data source, it is copied and distributed to different storage nodes, where it then serves a variety of data analysis requests. Applications on HDFS are generally batch jobs rather than interactive ones, so the design favors data throughput over data access latency. (A minimal write/read sketch follows this list.)

Running on clusters of cheap commodity machines: HDFS takes reliability, security, and high availability into account in its design, so Hadoop places low demands on hardware and can run on clusters of inexpensive commodity machines without requiring expensive, highly reliable servers.
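
As a concrete illustration of the write-once, read-many access pattern mentioned above, the following minimal Java sketch writes a file to HDFS once and then streams it back using the Hadoop FileSystem API. The NameNode URI "hdfs://namenode-host:9000" and the path /data/events.log are placeholders, not values from this article, and must be adapted to a real cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        // Placeholder NameNode address; replace with the actual cluster URI.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        Path file = new Path("/data/events.log");  // hypothetical file path

        // Write once: the file is created and data is written sequentially.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("one record per line\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: any number of analysis jobs can stream the data back.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}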

2. Limitations of HDFS:

Not suitable for low-latency data access: HDFS is built to handle large datasets and is optimized primarily for high data throughput, which comes at the cost of higher latency. Applications that need access latencies on the order of ten milliseconds or less should not use HDFS directly; HBase can compensate for this shortcoming.

Cannot efficiently store large numbers of small files: the NameNode keeps the entire file system's metadata in memory, so the number of files it can manage is limited; the metadata for each file occupies roughly 150 bytes. (A rough estimate of the resulting memory footprint is sketched after this list.)

Multiple writers and arbitrary file modification are not supported: several users cannot write to the same file at once, and write operations can only add data at the end of the file, that is, append operations.
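
To make the small-files limitation concrete, here is a rough back-of-the-envelope estimate. It assumes the approximate figure of 150 bytes of NameNode heap per namespace object cited above and treats each small file as one file entry plus one block; the file count is purely illustrative.

public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long files = 10_000_000L;     // hypothetical number of small files
        long bytesPerObject = 150L;   // rough rule of thumb per namespace object
        long objectsPerFile = 2L;     // one file entry + one block for a small file

        long bytes = files * objectsPerFile * bytesPerObject;
        System.out.printf("~%.1f GB of NameNode heap for %,d small files%n",
                bytes / 1e9, files);  // ~3.0 GB in this example
    }
}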

Architecture of HDFS

On a fully configured cluster, running HDFS means running a set of daemons on servers distributed across the network. Each daemon has its own role, and together they form a distributed file system.

Data block

Like the Linux file system, HDFS has the concept of a block, except that its default block size is 64 MB. As in an ordinary file system, files in HDFS are split into blocks; each block is an independent storage unit and is stored on a data node as an ordinary file of the node's local file system. The data block is the unit of file storage in HDFS.

HDFS is designed to support large files and targets applications that process large-scale datasets. These applications write data only once but read it one or more times, and reads must be fast enough for streaming access, which is why HDFS files follow write-once, read-many semantics. With a typical block size of 64 MB, a file in HDFS is cut into 64 MB chunks, and whenever possible each chunk is stored on a different DataNode, as the sketch below illustrates.
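
The block layout of an existing file can be inspected through the Hadoop FileSystem API. The sketch below lists each block's offset, length, and the DataNodes holding its replicas; the NameNode URI and the file path are placeholder values, not part of this article.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode-host:9000"), new Configuration());

        Path file = new Path("/data/large-input.dat");  // hypothetical large file
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode for the location of every block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}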

Storing files as data blocks brings the following benefits:

1. HDFS can store files larger than any single disk in a storage node

The blocks of one file can be placed on different disks and different nodes.

2. It simplifies the storage subsystem

Separating block management from file management simplifies storage management and removes the complexity of distributed file metadata management from the storage nodes.

3. It makes fault tolerance and data replication convenient

Each data block is replicated across different machines (typically 3 copies, stored in 3 different places).

Why does HDFS use such a large block size?

A large block reduces the number of blocks per file and therefore the overhead required to manage data blocks. A sketch of setting the block size and replication factor through the Java API follows.
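
Block size and replication factor can also be set per file when it is created. The minimal sketch below overrides both through the Java API; cluster-wide defaults would normally be set with the dfs.block.size (dfs.blocksize in newer releases) and dfs.replication properties in hdfs-site.xml. The URI, path, and values here are only illustrative.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

        Path file = new Path("/data/big-output.dat");  // hypothetical path
        short replication = 3;                         // 3 replicas, the usual default
        long blockSize = 64L * 1024 * 1024;            // 64 MB blocks
        int bufferSize = 4096;

        // This FileSystem.create overload takes replication and block size explicitly.
        try (FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write(new byte[]{1, 2, 3});
        }
        fs.close();
    }
}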

NameNode and DataNode

HDFS uses a master/slave architecture. An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and regulates client access to files. The DataNodes, usually one per node in the cluster, manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and lets users store data in the form of files. Internally, a file is split into one or more data blocks, and these blocks are stored on a set of DataNodes. The NameNode executes namespace operations on the file system, such as opening, closing, and renaming files and directories; it also determines the mapping of data blocks to specific DataNodes. The DataNodes serve read and write requests from the file system's clients and create, delete, and replicate data blocks under the unified scheduling of the NameNode.

The NameNode and DataNode are designed to run on ordinary commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run a NameNode or DataNode; thanks to the highly portable Java language, HDFS can be deployed on many types of machines. A typical deployment runs a single NameNode instance on one dedicated machine, while every other machine in the cluster runs one DataNode instance. The architecture does not rule out running multiple DataNodes on one machine, but this is rarely done.

Having a single NameNode in the cluster greatly simplifies the system architecture. The NameNode is the arbiter and manager of all HDFS metadata, and user data never flows through the NameNode.

Client

The client is the means by which users interact with HDFS. HDFS provides a variety of clients, including a command-line interface, a Java API, a Thrift interface, a C library, and a user-space (FUSE) file system. A small Java API example follows.
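
As a small example of the Java API client, the sketch below lists the contents of an HDFS directory; the NameNode URI and the /data directory are placeholders chosen for illustration.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode-host:9000"), new Configuration());

        // Roughly equivalent to "hadoop fs -ls /data" on the command-line client.
        for (FileStatus entry : fs.listStatus(new Path("/data"))) {
            System.out.printf("%s\t%d bytes\t%s%n",
                    entry.isDirectory() ? "dir " : "file",
                    entry.getLen(), entry.getPath());
        }
        fs.close();
    }
}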

Resources

1. http://hadoop.apache.org/docs/r1.0.4/cn/hdfs_design.html

2. Hadoop Technology Insider: In-depth Analysis of Hadoop Common and HDFS Architecture Design and Implementation Principles
