Hadoop study notes: HDFS architecture

HDFS Introduction

HDFS is fault-tolerant and designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suited to applications with large data sets.

1. HDFS has the following main features:

Handling large files: a single stored file can reach several gigabytes, terabytes, or even petabytes.

Dynamic cluster expansion: nodes can be added to the cluster dynamically, and a cluster can scale to hundreds or thousands of nodes.

Streaming data access: HDFS follows a "write once, read many times" design. Once a data set is generated from a data source, it is copied and distributed to different storage nodes, which then serve a variety of data analysis job requests.

Runs on clusters of cheap commodity machines: HDFS was designed with reliability, security, and high availability in mind, so Hadoop has modest hardware requirements and can run on clusters of inexpensive commodity machines without costly highly available hardware.

2. HDFS limitations:

Not suitable for low-latency data access: HDFS is designed to handle large data sets and is optimized for high data throughput, which may come at the cost of high latency. Accesses that need responses within about 10 milliseconds are not a good fit for HDFS; HBase can make up for this shortcoming.

Cannot efficiently store large numbers of small files: the NameNode keeps the entire file system's metadata in memory, and the metadata for each file takes roughly 150 bytes, so the total number of files is limited by the NameNode's memory.

Does not support concurrent writers or arbitrary file modification: multiple users cannot operate on the same file at once, and write operations can only be performed at the end of a file, that is, appends.

HDFS architecture

HDFS basic concepts:

Block:

HDFS files are stored as blocks, and the block size defaults to 64 MB, which is much larger than in most file systems: a file system block is usually a few kilobytes, and a disk block is 512 B.

The block is made much larger than a disk block in order to reduce addressing overhead; if blocks were too small, a large share of time would be spent locating blocks on disk.

When an HDFS file is smaller than the block size, it does not occupy the entire block's storage space.
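
A minimal sketch, assuming the Hadoop client libraries are on the classpath and using a hypothetical file path, of how a client can list a file's blocks and the DataNodes that hold them:

    // Sketch: list the blocks of an HDFS file and the DataNodes hosting them.
    // The path is hypothetical; configuration is read from core-site.xml / hdfs-site.xml.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/user/demo/large-file.dat"));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset() + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }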

HDFS architecture description

HDFS uses a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes.

Namenode is a central server that manages the file system namespace and client-side access to files.

Namenode performs file system namespace operations, such as opening, closing, renaming files and directories, and decides the mapping of blocks to specific Datanode nodes.

The DataNode handles file system read and write requests and performs block creation, deletion, and replication under the direction of the NameNode.

A file is actually split into one or more blocks, and these blocks are stored on a set of DataNodes.
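
A minimal client-side sketch of this division of labor, assuming a hypothetical NameNode address and file path: the FileSystem client obtains metadata and block locations from the NameNode, while the file contents are streamed from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // hypothetical NameNode address
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                // Block locations come from the NameNode; the bytes themselves are read from DataNodes.
                reader.lines().forEach(System.out::println);
            }
        }
    }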

NameNode:

NameNode role: responsible for managing the file system namespace (metadata), maintaining the entire file system directory tree and the index of the files in those directories.

NameNode file structure (figure from the book Hadoop in Action):

fsimage: Binary file that stores HDFS files and directory metadata

edits: Binary file; all HDFS operations performed between one save of fsimage and the next are recorded in the edits file. Each operation on the file system, such as opening, closing, or renaming files and directories, generates an edits record.

fstime: Binary file; after fsimage completes a checkpoint, the latest checkpoint timestamp is written to fstime.

VERSION: text file; the contents of the file are shown in the figure (from the book Hadoop in Action):

The namespaceID is the file system's unique identifier, created when the file system is formatted for the first time. All DataNodes must share this identifier with the NameNode: the NameNode uses it to recognize new DataNodes, and a DataNode only obtains the namespaceID after registering with the NameNode.

Metadata

Ownership and permissions of files and directories;

Which blocks make up each file, the number of blocks, and the number of replicas of each block;

The DataNodes on which each block is stored (reported by DataNodes at startup);

The metadata structure stored in fsimage is illustrated in the figure from Hadoop in Action (not reproduced here).
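
As a client-side illustration only (this is not the on-disk fsimage format), the per-file metadata listed above can be read through the FileStatus API; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowFileMetadata {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/user/demo/input.txt"));
            System.out.println("owner=" + st.getOwner() + " group=" + st.getGroup());
            System.out.println("permission=" + st.getPermission());
            System.out.println("replication=" + st.getReplication());
            System.out.println("blockSize=" + st.getBlockSize() + " length=" + st.getLen());
            fs.close();
        }
    }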

Metadata classification: metadata is divided into in-memory metadata and metadata files.

Metadata files: fsimage and edits, stored on the local disk and on NFS to prevent data loss if the NameNode's disk fails.

In-memory metadata: an image containing the fsimage contents and the BlocksMap. At startup, the NameNode loads the fsimage and edits files into memory, merges them, writes the latest fsimage back to the local disk and NFS, and overwrites the old fsimage file.

NameNode fsimage file processing flow during startup

Step 1: Load the fsimage and edits files from the hard disk, merge them in memory, and write the new fsimage back to disk; this process is called a checkpoint.

(Generally, the NameNode is configured with two directories for storing the fsimage and edits files, one on the local disk and one on NFS, so that data is not lost if the NameNode's disk fails.

At startup, the NameNode loads the most recent fsimage by comparing the checkpoint times recorded in fstime on NFS and on the local disk.)

Step 2: After the NameNode finishes loading and merging the fsimage and edits files, it writes the merged result to both the local disk and NFS. At this point the disk holds a copy of the original fsimage file and a checkpoint file, fsimage.ckpt, and the edits file is emptied.

Step 3: After the checkpoint completes, fsimage.ckpt is renamed to fsimage (overwriting the original fsimage), and the latest timestamp is written to the fstime file.
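
The three steps can be mirrored by a toy sketch that uses a local temporary directory in place of the real NameNode storage directory. The file names match those in the text, but the contents and the "merge" are simulated; this is not the real metadata format.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.util.ArrayList;
    import java.util.List;

    public class NameNodeStartupSketch {
        public static void main(String[] args) throws Exception {
            Path dir = Files.createTempDirectory("namenode-current");
            Files.write(dir.resolve("fsimage"), List.of("/user/a.txt"));
            Files.write(dir.resolve("edits"), List.of("CREATE /user/b.txt"));

            // Step 1: load fsimage and edits from disk and merge them in memory (checkpoint).
            List<String> namespace = new ArrayList<>(Files.readAllLines(dir.resolve("fsimage")));
            for (String edit : Files.readAllLines(dir.resolve("edits"))) {
                if (edit.startsWith("CREATE ")) namespace.add(edit.substring("CREATE ".length()));
            }

            // Step 2: write the merged image as fsimage.ckpt and empty the edits file.
            Files.write(dir.resolve("fsimage.ckpt"), namespace);
            Files.write(dir.resolve("edits"), List.<String>of());

            // Step 3: rename fsimage.ckpt to fsimage and record the checkpoint time in fstime.
            Files.move(dir.resolve("fsimage.ckpt"), dir.resolve("fsimage"), StandardCopyOption.REPLACE_EXISTING);
            Files.write(dir.resolve("fstime"), List.of(Long.toString(System.currentTimeMillis())));
            System.out.println("checkpoint complete in " + dir);
        }
    }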

DataNode

The role of DataNode:

Stores data blocks

When the DataNode thread starts, it reports its block information to the NameNode

If the NameNode does not receive a heartbeat from a DataNode for 10 minutes, it considers the DataNode lost and replicates the blocks that were on it to other DataNodes
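
Where the roughly 10-minute figure comes from: in Hadoop 2.x the dead-node timeout is commonly computed as 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which with the usual defaults works out to about 10.5 minutes. Treat the property names and defaults below as assumptions, since they vary across versions.

    // Rough calculation of the DataNode dead-node timeout from the heartbeat settings.
    public class HeartbeatTimeout {
        public static void main(String[] args) {
            long recheckIntervalMs = 5 * 60 * 1000; // dfs.namenode.heartbeat.recheck-interval, assumed default 5 min
            long heartbeatIntervalMs = 3 * 1000;    // dfs.heartbeat.interval, assumed default 3 s
            long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
            System.out.println("DataNode considered dead after " + timeoutMs / 1000.0 + " s (~10.5 minutes)");
        }
    }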

DataNode file structure (figure from the book Hadoop in Action):

blk_<id>: an HDFS data block file that stores the raw block content.

blk_<id>.meta: the metadata file for a block, consisting of a header with version and type information followed by a series of checksums for sections of the block.

VERSION: text file, whose contents are described below:

The namespaceID, cTime, and layoutVersion are consistent with the NameNode's values; the namespaceID is obtained the first time the DataNode connects to the NameNode. The storageID is unique to the DataNode and is used by the NameNode to identify it.

DataNode startup process

When a DataNode starts, it scans its local disks and reports the block information stored on that DataNode to the NameNode.

After the NameNode receives the block report from each DataNode, it stores the reported block information, together with the DataNode it came from, in memory.

The NameNode saves the block -> DataNode-list mapping table in the BlocksMap (as shown in the figure).
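
A simplified sketch of such a block -> DataNode-list table using plain Java collections; the class and names are illustrative, not the actual org.apache.hadoop.hdfs internals.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BlocksMapSketch {
        // blockId -> addresses of the DataNodes that reported the block
        static Map<Long, List<String>> blocksMap = new HashMap<>();

        // Called when a DataNode's block report arrives.
        static void processBlockReport(String datanode, List<Long> blockIds) {
            for (long id : blockIds) {
                blocksMap.computeIfAbsent(id, k -> new ArrayList<>()).add(datanode);
            }
        }

        public static void main(String[] args) {
            processBlockReport("dn1:50010", List.of(1001L, 1002L));
            processBlockReport("dn2:50010", List.of(1001L));
            // Prints the block -> DataNode-list table, e.g. {1001=[dn1:50010, dn2:50010], 1002=[dn1:50010]}
            System.out.println(blocksMap);
        }
    }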

Secondary NameNode

To improve the reliability of the NameNode, Hadoop provides a Secondary NameNode.

Secondary NameNode role

fsimage is the HDFS metadata file. It is not updated after every HDFS file operation (such as opening, querying, creating, or modifying files); instead, each HDFS file operation appends a record to edits. As a result, the number of edits records keeps growing.

This design does not hurt the system's resilience: if the NameNode fails, the latest state of the metadata can be recovered by reloading the fsimage file from disk and replaying the edits records, which is exactly what the NameNode does when it restarts. However, if the edits log is large, the NameNode will take a long time to replay it at startup, and during that time the HDFS file system is unavailable.

To solve this problem, Hadoop runs a Secondary NameNode process on a node other than the NameNode. The Secondary NameNode periodically copies fsimage and edits from the NameNode to a temporary directory and merges them into a new fsimage. It then uploads the new fsimage to the NameNode, which replaces its fsimage and deletes the original edits log. This process is called a checkpoint. The specific process is as follows:

Description:

Step 1: The Secondary NameNode asks the NameNode to roll its edits log, so the NameNode starts writing subsequent operations to a new edits file.

Step 2: The Secondary NameNode retrieves fsimage and edits from the NameNode over HTTP.

Step 3: The Secondary NameNode loads fsimage into memory, applies each operation recorded in edits, and creates a new, consolidated fsimage file.

Step 4: The Secondary NameNode sends the new fsimage to the NameNode via HTTP

Step 5: The NameNode replaces the old fsimage with the new one, the old edits file is replaced by the new edits file started in step 1, and the fstime file is updated to record the checkpoint time.
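
When the Secondary NameNode starts a checkpoint is controlled by configuration. The sketch below assumes the classic settings of older Hadoop releases, fs.checkpoint.period (default 3600 s) and fs.checkpoint.size (default 64 MB); the property names and defaults differ in newer versions, so treat them as assumptions.

    // Sketch of the two checkpoint triggers: a time limit and an edits-size limit.
    public class CheckpointTrigger {
        static final long CHECKPOINT_PERIOD_SECS = 3600;             // fs.checkpoint.period (assumed default)
        static final long CHECKPOINT_SIZE_BYTES = 64L * 1024 * 1024; // fs.checkpoint.size (assumed default)

        static boolean shouldCheckpoint(long secsSinceLastCheckpoint, long editsSizeBytes) {
            return secsSinceLastCheckpoint >= CHECKPOINT_PERIOD_SECS
                    || editsSizeBytes >= CHECKPOINT_SIZE_BYTES;
        }

        public static void main(String[] args) {
            System.out.println(shouldCheckpoint(1800, 10 * 1024 * 1024)); // false: neither limit reached
            System.out.println(shouldCheckpoint(4000, 10 * 1024 * 1024)); // true: period exceeded
        }
    }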

Secondary NameNode file structure (figure from the book Hadoop in Action):

Secondary NameNode shortcomings:

Because the Secondary NameNode does not checkpoint in real time, the edits accumulated on the NameNode between the last checkpoint and a failure cannot be recovered from it. If the NameNode suffers a hardware failure and the metadata was not also stored on NFS, that data is lost: the Secondary NameNode holds only the fsimage from the last checkpoint and has no copy of the latest edits file, so the operations from that period cannot be recovered.

The Secondary NameNode is not a backup of the NameNode. If the NameNode goes down, the cluster stops working even though the Secondary NameNode is still running. To restore the cluster, the fsimage file on the Secondary NameNode must be copied manually to a new NameNode.

To solve these problems, Hadoop 2.0 introduced a highly available (HA) NameNode.
