Introduction to the Hadoop file system

The two most important parts of the Hadoop family are MapReduce and HDFS. MapReduce is a programming paradigm suited to batch computation in a distributed environment; HDFS is the Hadoop Distributed File System. Hadoop is compatible with a variety of file systems: at the API level it defines an abstract file system interface, which can have many implementations, including the local file system and distributed file systems such as HDFS.
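
To make that abstraction concrete, here is a minimal sketch using Hadoop's Java FileSystem API. Which implementation FileSystem.get() returns is decided by the fs.defaultFS setting; the namenode host and file path below are hypothetical.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS decides which implementation FileSystem.get() returns:
        // "file:///" yields the local file system, while an hdfs:// URI
        // yields the distributed one. Client code is identical either way.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical host
        FileSystem fs = FileSystem.get(conf);

        // Read through the same interface regardless of the backing store.
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            System.out.println("first byte: " + in.read());
        }
    }
}
```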

I: The design goals of HDFS

HDFS was designed above all to store data across multiple computers (a cluster), and it differs from traditional file systems in several ways.

(1) HDFS is designed mainly to store large files, from hundreds of megabytes to hundreds of gigabytes or terabytes; Hadoop clusters running HDFS can currently support petabyte-scale data storage.

(2) HDFS is built for a write-once, read-many access pattern; modifying small portions of existing files is not what Hadoop specializes in.

(3) HDFS runs on clusters built from commodity servers. Because no expensive hardware is used, failures of individual nodes are expected rather than exceptional, so HDFS must be highly fault tolerant and keep working when a node fails.

(4) HDFS is optimized for high data throughput at the cost of latency; applications that need fast access to small amounts of data are not a good fit.

(5) HDFS uses a master-slave architecture in which a namenode manages the entire system and holds the file system's metadata. To serve requests quickly, the namenode keeps the metadata for every file in memory, so the namenode's memory is the bottleneck on the number of files the system can hold; for this reason it is not appropriate to store a large number of small files in HDFS. The federated namenode mechanism, described later, can be used to scale the namespace.

(6) HDFS does not allow multiple concurrent writers to a file, and data is only ever written at the end of a file; writing at an arbitrary position is not supported. The sketch after this list illustrates the resulting create-then-append pattern.
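
A minimal sketch of that write model, using the standard FileSystem create and append calls. The file path is hypothetical, and append must be supported by the underlying file system and cluster configuration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/log.txt"); // hypothetical path

        // Writing always happens at the end of the file: either by creating
        // it fresh or by appending. There is no call for writing at an
        // arbitrary offset, and only one writer may hold the file open.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first line\n");
        }
        // append() throws if the underlying file system does not support it.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended line\n");
        }
    }
}
```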

II: Basic concepts: data blocks in HDFS

1. Disks are where files are stored, and file systems are built on top of disks. Disks have the concept of a data block, the basic unit of disk reads and writes; a disk block is typically 512 bytes. File systems also have a block concept, and a file system block is generally an integer multiple of the disk block size, usually a few kilobytes.

2. HDFS also has a block concept: as in an ordinary file system, a file is divided into block-sized chunks that are stored as independent units. Blocks in HDFS are large, 64 MB by default (128 MB in more recent versions). Unlike ordinary file systems, a file in HDFS that is smaller than one block does not occupy the whole block's space. The block size is set this large to minimize the relative cost of seeks: the time spent transferring a block's data from disk should be much greater than the time spent locating where the block begins.

3. Benefits of the block abstraction: a single file can be larger than any disk in the network; it simplifies the design of the storage subsystem (blocks are separated from file metadata); and blocks are a convenient unit of replication, which improves fault tolerance. The sketch below shows how a client can inspect a file's block size and where each block's replicas live.
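
A minimal sketch using FileSystem.getFileBlockLocations(); the file path is hypothetical, and the output depends on the cluster's block size and replication settings:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file; a large file will span multiple blocks.
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat"));

        System.out.println("block size: " + status.getBlockSize());

        // One entry per block; each block reports the hosts holding a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset %d, length %d, hosts %s%n",
                    b.getOffset(), b.getLength(),
                    String.join(",", b.getHosts()));
        }
    }
}
```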

III: Namenode and Datanode

1. HDFS works in master-slave mode: a namenode acts as the master node and datanodes act as the slave nodes. The namenode manages the entire file system namespace, which it persists as two files on its local disk: the namespace image and the edit log. The namenode also holds the mapping from file blocks to datanodes, but this information is not persisted, because it is rebuilt from the block reports datanodes send when they start up.

2. Datanodes are the working nodes of the file system: they store and retrieve data blocks and periodically send the namenode a list of the blocks they hold. The sketch below asks the namenode for the view of the cluster that results from those reports.
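
As a hedged illustration, the DistributedFileSystem class exposes the namenode's current view of the datanodes, which is built from exactly these heartbeats and block reports:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Only meaningful when the default file system is actually HDFS.
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // The namenode knows the datanodes only through their heartbeats
            // and block reports; this returns that live view.
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.printf("%s capacity=%d remaining=%d%n",
                        dn.getHostName(), dn.getCapacity(), dn.getRemaining());
            }
        }
    }
}
```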

3. There are two namenode fault-tolerance mechanisms: one is to back up the files that make up the persistent state of the file system metadata; the other is to run a secondary namenode, which periodically merges the namespace image with the edit log.

4. Federated HDFS: when the file system holds a very large number of files, the namenode's memory, which must hold all the metadata, becomes the bottleneck. The federated HDFS mechanism addresses this by running multiple namenodes, each managing a disjoint subset of the file system namespace. A configuration sketch follows.
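
A minimal sketch of the shape of a federated setup. In practice these properties live in hdfs-site.xml; the nameservice names and hosts below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

public class FederationConfDemo {
    public static void main(String[] args) {
        // Setting the properties in code just makes the structure visible;
        // a real deployment puts them in hdfs-site.xml.
        Configuration conf = new Configuration();
        // Two independent namenodes, each with its own RPC address.
        conf.set("dfs.nameservices", "ns1,ns2");
        conf.set("dfs.namenode.rpc-address.ns1", "namenode1:8020");
        conf.set("dfs.namenode.rpc-address.ns2", "namenode2:8020");
        // Each namenode then manages its own slice of the namespace,
        // e.g. ns1 could serve /user while ns2 serves /data.
    }
}
```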
