Concept and Characteristics of HDFS [Hdfs_1]



1. What is HDFS

HDFS (Hadoop Distributed File System) is a distributed file system that provides high-throughput access to application data and solves the problem of storing massive amounts of data.

2. Background and Design Premises of HDFS

As the Internet has grown, data is generated in ever larger volumes and at ever higher speeds. Traditional file systems rely on expensive servers; scaling them up (vertical scaling) has hit cost and technical bottlenecks and can no longer meet today's needs.

HDFS is designed to store large data sets across many commodity computers (scale-out) and to provide high-reliability, high-throughput service; the cluster is expanded simply by adding nodes. HDFS therefore has the following design premises:

Good for storing large files; not suited to storing large numbers of small files.
Streaming access to data, favoring high throughput over low-latency response.
Simple consistency model: files are written once and read many times; multi-user writes and arbitrary file modification are not supported.
Redundant replication, trading space for reliability (Hadoop 3 introduces erasure coding, which must perform computation to recover data, trading time for space; if interested, see how RAID is implemented).
"Moving computation is cheaper than moving data": to support big-data processing, HDFS provides interfaces that let applications move computation close to where the data is stored.

3. Advantages and disadvantages of HDFS

3.1 Advantages of HDFS

HDFS is designed as a distributed file system that runs on common, inexpensive hardware. It has much in common with existing distributed file systems, but the differences are also obvious. HDFS was developed for streaming access to very large files, and has the following key features:

"1. Handling Oversized Files "

The oversized files here are typically gigabytes, terabytes, or even petabytes in size. By splitting such files into blocks and distributing them across hundreds or thousands of nodes, Hadoop can easily scale to store and process them.
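The splitting described above can be sketched with some simple arithmetic. This is not Hadoop code, just a conceptual model of how a file's size maps onto fixed-size blocks (128 MB is the Hadoop 2.x default):

```python
# Rough sketch (not Hadoop code): how a large file is divided into
# fixed-size blocks, as HDFS does with a default 128 MB block size.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the Hadoop 2.x default


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))                   # 3
print(blocks[-1] // (1024 * 1024))   # 44
```

Note that, unlike this sketch suggests, a block that is smaller than the block size (the 44 MB tail here) only occupies its actual size on disk.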

"2. Run on a cheap commercial machine cluster "

HDFS makes low demands on hardware: it is meant to run on clusters of inexpensive commodity machines rather than on expensive, highly available hardware. Data reliability, security, and high availability were instead built into the HDFS design itself.

"3. High level of fault tolerance and high reliability "

Because the HDFS design assumes that inexpensive hardware is unreliable, each block of data is automatically stored as multiple replicas (configurable, typically three), and fault tolerance can be raised by increasing the replica count. If one replica is lost, HDFS automatically re-replicates it from a copy on another machine.

Losing all replicas of a block at once is possible in principle, but HDFS spreads replicas across nodes and across racks, so the probability is very low; HDFS also provides several replica-placement policies to meet different levels of fault-tolerance requirements.
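The default rack-aware placement can be sketched as follows. This is a simplified model, not Hadoop's actual implementation: the first replica goes on the writer's node, the second on a node in a different rack, and the third on another node in that second rack.

```python
# Simplified model (not Hadoop's actual code) of the default rack-aware
# replica placement policy for a replication factor of three.

def place_replicas(writer, topology):
    """topology maps rack name -> list of nodes; writer is (rack, node)."""
    writer_rack, writer_node = writer
    placements = [(writer_rack, writer_node)]          # 1st: writer's node
    # 2nd replica: a node in a different rack, for rack-failure tolerance.
    other_rack = next(r for r in topology if r != writer_rack)
    second_node = topology[other_rack][0]
    placements.append((other_rack, second_node))
    # 3rd replica: a different node in the same remote rack, so only one
    # of the three transfers has to cross racks.
    third_node = next(n for n in topology[other_rack] if n != second_node)
    placements.append((other_rack, third_node))
    return placements


topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(("rack1", "n1"), topology))
# [('rack1', 'n1'), ('rack2', 'n3'), ('rack2', 'n4')]
```

Putting two of the three replicas in one remote rack is a deliberate trade-off: it survives the loss of any single rack while keeping cross-rack write traffic low.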

"4. Streaming Access Data "

The design of HDFS is based on "write-once, read-many" workloads: once a data set is generated by a data source, it is replicated and distributed to storage nodes and then serves a variety of analysis tasks. In most cases an analysis task touches a large fraction of the data set, so for HDFS it is more efficient to stream the whole data set than to fetch individual records.

3.2 Disadvantages of HDFS

The features above make HDFS well suited to batch processing of large data volumes, but they also impose limitations, mainly in the following respects:

"1. Not suitable for low latency data access "

HDFS is not suitable for applications that require low-latency access (millisecond- or second-level response times). It is designed for large data sets and optimized for high data throughput, with jobs typically running for minutes or even hours.

For applications with low-latency requirements, especially millisecond-level responses over massive data sets, HBase is a better choice. Note, however, that HBase is designed for access to single rows or small ranges of data, and every access must supply a row key or a row-key range.

"2. Cannot efficiently store large numbers of small files "

"3. Multi-user write and random file modification not supported

A file in HDFS has only one writer at a time, and writes can only occur at the end of the file; that is, only append operations are supported.
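This single-writer, append-only model can be sketched as a toy class. It is not HDFS code; the lease idea loosely mirrors how HDFS grants write access to one client at a time:

```python
# Toy model (not HDFS code) of HDFS write semantics: one writer holds a
# lease on the file, and writes may only append to the end.

class AppendOnlyFile:
    def __init__(self):
        self.data = bytearray()
        self.lease_holder = None

    def open_for_append(self, client: str):
        if self.lease_holder is not None:
            raise IOError("lease already held by " + self.lease_holder)
        self.lease_holder = client

    def append(self, client: str, chunk: bytes):
        if client != self.lease_holder:
            raise IOError("no lease")
        self.data += chunk   # only the end of the file can change

    def close(self, client: str):
        if client == self.lease_holder:
            self.lease_holder = None


f = AppendOnlyFile()
f.open_for_append("client-a")
f.append("client-a", b"hello")
try:
    f.open_for_append("client-b")   # a second concurrent writer is rejected
except IOError as e:
    print("rejected:", e)
```

There is no `seek`-and-overwrite operation in this model, which is exactly the restriction the paragraph above describes.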

4. HDFS Architecture

"HDFS Frame composition"

  

HDFS is responsible for distributed storage.

HDFS employs a master/slave architecture: an HDFS cluster consists of one NameNode and a number of DataNodes.

1. HDFS Client

The client accesses HDFS through commands: it contacts the NameNode to obtain a file's metadata, then reads and writes the file data by interacting directly with DataNodes.

The client is also responsible for slicing files: when a file is uploaded, the client divides it into blocks for storage.

2. NameNode

The NameNode, as the master, stores a file's metadata (type, size, path, permissions, and so on) and maintains the mapping from data blocks to the specific DataNodes that hold them.

The NameNode also performs namespace operations on the file system, such as opening, closing, and renaming files and directories.


3. DataNode

DataNodes, as slaves, serve read and write requests from file system clients, and create, delete, and replicate data blocks under the unified scheduling of the NameNode.

DataNodes manage the data stored in the cluster. HDFS exposes data to users as files; internally, each file is divided into data blocks, and these blocks are stored on a set of DataNodes.

4. Secondary NameNode

The Secondary NameNode is an auxiliary node to the NameNode; its checkpoints help recover the namespace if the NameNode fails. Note that it is not a hot standby of the NameNode.

Its main task is to periodically merge the NameNode's edit log into the fsimage file.
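That merge (a "checkpoint") can be sketched conceptually: replay the logged namespace operations against the last fsimage to produce a new, merged fsimage. Real fsimage and edit-log files are binary formats; dicts and tuples stand in for them here:

```python
# Conceptual sketch of a Secondary NameNode checkpoint: apply the edit
# log to a copy of the fsimage, yielding a new merged fsimage. The
# operation names and dict layout are illustrative, not Hadoop's formats.

def checkpoint(fsimage: dict, edit_log: list) -> dict:
    """Replay logged namespace operations against a copy of the fsimage."""
    image = dict(fsimage)
    for op, path, *args in edit_log:
        if op == "create":
            image[path] = {"size": args[0]}
        elif op == "rename":
            image[args[0]] = image.pop(path)
        elif op == "delete":
            image.pop(path, None)
    return image


fsimage = {"/a.txt": {"size": 10}}
edits = [("create", "/b.txt", 20), ("rename", "/a.txt", "/c.txt")]
print(checkpoint(fsimage, edits))
# {'/b.txt': {'size': 20}, '/c.txt': {'size': 10}}
```

After a checkpoint, the NameNode can start a fresh edit log; without it, the log would grow without bound and restarts would take increasingly long.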

5. Size of the HDFS File Block

The default HDFS block size in Hadoop 1.x is 64 MB; in Hadoop 2.x and later it is 128 MB.

If you want to customize the block size, modify the hdfs-site.xml file, for example:
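A minimal sketch of such a configuration fragment follows; the 256 MB value is only an example, and the property name shown is the Hadoop 2.x+ one:

```xml
<!-- hdfs-site.xml: set a custom block size. The Hadoop 2.x+ property
     name is dfs.blocksize (Hadoop 1.x used dfs.block.size). -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB, expressed in bytes -->
</property>
```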


