HDFS: Easy Getting Started

Last Update: 2015-04-08 Source: Internet

Author: User

Original article: http://www.cnblogs.com/archimedes/p/hadoop-simple.html. Please cite the source address when reprinting.

Why we need HDFS

The file system consists of three parts: software related to file management, managed files, and the data structures required to implement file management.

Reading all the data on a single disk takes a long time, and writing takes even longer (a write generally takes about 3 times as long as a read). Now suppose we need to handle a huge file. Even with a disk that transfers 10 GB/s (no such disk exists today), a 1 ZB file, or even a smaller 10 EB file, could not be read in any practical amount of time.
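To put rough numbers on this, here is a minimal back-of-the-envelope sketch. It assumes decimal units (1 GB = 10^9 bytes) and a sustained 10 GB/s rate with no seek or write overhead; the function name `read_time_years` is only illustrative.

```python
# Back-of-the-envelope read times at a fixed transfer rate (decimal units).
GB = 10**9
EB = 10**18
ZB = 10**21
SECONDS_PER_YEAR = 365 * 24 * 3600


def read_time_years(size_bytes, rate_bytes_per_s):
    """Years needed to read size_bytes at a constant rate (idealized)."""
    return size_bytes / rate_bytes_per_s / SECONDS_PER_YEAR


rate = 10 * GB  # the hypothetical 10 GB/s disk from the text
for label, size in [("10 EB", 10 * EB), ("1 ZB", 1 * ZB)]:
    print(f"{label}: about {read_time_years(size, rate):,.0f} years")
```

Under these assumptions the 10 EB file takes decades and the 1 ZB file takes thousands of years, which is why a single disk is not an option.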

When the size of a dataset exceeds the storage capacity of a single physical computer, it is necessary to partition it and store it on several separate computers.

As the concept map suggests, a distributed file system is more complex than a local one: the distributed system itself is loosely structured, and the network programming it introduces adds further complexity.

How does HDFS solve the problems above?

HDFS stores files for a streaming access pattern.

Write once, read many times. A dataset is typically generated by a source or copied directly from one, and is then analyzed repeatedly over time; the large dataset itself does not need to be moved.

A DFS handles files as streams, and every file in the system can be mapped to a local image. Users therefore do not need to care about a file's format or its physical location; they simply fetch the file from the DFS.

In general, file handling cannot guarantee that a file arrives intact. A traditional file system uses local checksums to ensure data integrity. Once a file is split up and dispersed, does every shard need its own verification code?

The number and size of shards are not fixed. Massive data already requires a great deal of verification, and once files are sharded, tracking the verification of every shard is like standing on the ground and counting the stars in the sky.

The HDFS solution is shard redundancy plus local checksums.

Data is stored redundantly: multiple copies of each shard are sent directly to the shard storage servers, which verify them.

Redundant shards have an additional benefit: as long as one copy of a shard is intact, the other copies can be restored from it through coordinated repair.

With coordinated verification, the files in the system as a whole remain complete despite transmission errors, I/O errors, or the outage of an individual server.
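The idea of redundant shards with local checksums can be illustrated with a toy sketch. This is not HDFS's actual implementation: the tiny block size, the replica count, the in-memory lists, and the names `store` and `read_and_repair` are all assumptions made for illustration.

```python
import zlib

BLOCK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow
REPLICAS = 3    # assumed number of copies kept for every block


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store(data):
    """Return the replicas of every block and the checksum recorded per block."""
    blocks = split_into_blocks(data)
    checksums = [zlib.crc32(block) for block in blocks]
    replicas = [[bytes(block) for _ in range(REPLICAS)] for block in blocks]
    return replicas, checksums


def read_and_repair(replicas, checksums):
    """Read each block from a replica that passes its checksum, then
    overwrite any corrupt replicas of that block with the good copy."""
    out = []
    for copies, expected in zip(replicas, checksums):
        good = next((c for c in copies if zlib.crc32(c) == expected), None)
        if good is None:
            raise IOError("all replicas of a block are corrupt")
        for i, copy in enumerate(copies):
            if zlib.crc32(copy) != expected:
                copies[i] = good
        out.append(good)
    return b"".join(out)


replicas, checksums = store(b"hello hdfs")
replicas[1][0] = b"XXXX"  # simulate corruption of one replica of block 1
restored = read_and_repair(replicas, checksums)
print(restored)
```

Each block is verified locally against its own checksum, so one damaged copy is detected and replaced without re-checking the whole file.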

A distributed file system has one unavoidable problem: because a file does not live on a single disk, read access incurs latency. This is the main problem HDFS currently faces.

At present, HDFS is configured for high data throughput, possibly at the cost of high latency. Fortunately, HDFS is highly flexible and can be tuned for specific applications.

The concept of HDFS

HDFS can be represented by the following abstract diagram

What is metadata?

Metadata is information that describes a dataset or a series of datasets: its features, coverage, quality, management, owner, how it is provided, and so on. Put simply, it is data about data.

HDFS turns one huge piece of data into a large number of smaller pieces of data.

PS: A disk stores a file in data blocks; the data block is the minimum read/write unit of the disk. Data blocks are also called disk blocks. A file system built on a single disk manages its storage through disk blocks, and the file system block size is generally an integer multiple of the disk block size. Notably, in a single-disk file system a file smaller than a disk block still consumes the entire block. A disk block is typically 512 bytes.

HDFS also has the concept of a block, which defaults to 64 MB; each block is an independent storage unit.

Unlike in other file systems, a file in HDFS that is smaller than the block size does not occupy a whole block of space. The reasons are described below. Here is why a file block is 64 MB:
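As a rough sketch of how a file maps onto blocks, the helper below (a hypothetical `block_layout`, working in whole megabytes for simplicity) lists the size of each block a file would be split into. Only the last block can be smaller than the block size, and a small file yields a single short block rather than a full 64 MB one.

```python
BLOCK_SIZE_MB = 64  # the default block size mentioned in the text


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Sizes (in MB) of the blocks a file of the given size is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only what is left
    return sizes


print(block_layout(200))  # [64, 64, 64, 8]
print(block_layout(1))    # [1]
```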

When a file system stores a file, it must first locate the data on the disk and then transfer it.

Locating the position on the disk takes time, and transferring the file takes time as well.

$T_{\text{storage}} = T_{\text{seek}} + T_{\text{transfer}}$

If each block to be transferred is made large enough, the time to transfer its data from the disk is significantly greater than the time needed to seek to the start of the block:

$T_{\text{seek}} \ll T_{\text{transfer}}$

so that, approximately:

$T_{\text{storage}} \approx T_{\text{transfer}}$
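A small sketch can show how quickly the seek term shrinks as blocks grow. The 10 ms seek time and 100 MB/s transfer rate are assumed values chosen for illustration, and `seek_fraction` is a hypothetical helper name.

```python
SEEK_S = 0.010         # assumed seek (positioning) time per block: 10 ms
RATE_MB_PER_S = 100.0  # assumed sustained transfer rate


def seek_fraction(block_mb):
    """Share of the per-block storage time that is spent seeking."""
    transfer_s = block_mb / RATE_MB_PER_S
    return SEEK_S / (SEEK_S + transfer_s)


for block_mb in (0.004, 1, 64):
    print(f"{block_mb} MB block: seek is {seek_fraction(block_mb):.1%} of the total")
```

With these numbers, seeking dominates for a 4 KB block, takes half the time for a 1 MB block, and drops below 2% for a 64 MB block.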

Let's take an example: we want to transfer a 10000 MB file.

On a single disk, the time needed to store one 10000 MB file (100 blocks, each taking 10 ms to locate and 1000 ms to transfer) is:

10 ms × 100 + 1000 ms × 100 = 101 s

With 10 data nodes working in parallel, each handling 10 blocks, the time taken to transfer the 10000 MB file is:

10 ms × 10 + 10 ms + 10 s = 10.11 s

These figures are theoretical; actual times are slightly longer.
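The arithmetic of this example can be reproduced with a short sketch. The constants come from the formulas in the example; the function names are illustrative, and the extra 10 ms in the 10-node case is simply carried over as a fixed term because the text does not say what it represents.

```python
SEEK_MS = 10        # positioning time per block, from the example
TRANSFER_MS = 1000  # transfer time per block, from the example
BLOCKS = 100        # the 10000 MB file split into 100 blocks
EXTRA_MS = 10       # the additional 10 ms term in the 10-node formula


def single_disk_seconds(blocks=BLOCKS):
    """All blocks are located and transferred one after another."""
    return (SEEK_MS * blocks + TRANSFER_MS * blocks) / 1000


def cluster_seconds(nodes, blocks=BLOCKS):
    """Blocks are spread evenly; nodes work in parallel (idealized)."""
    per_node = blocks // nodes  # assumes blocks divide evenly across nodes
    return (SEEK_MS * per_node + EXTRA_MS + TRANSFER_MS * per_node) / 1000


print(single_disk_seconds())  # 101.0
print(cluster_seconds(10))    # 10.11
```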

To summarize:

With this setting, the time to store a file is dominated by the transfer time, and the rate at which a file made up of multiple blocks is transferred depends on the block size. This is also a core technique of HDFS.

Of course, bigger blocks are not always better.

HDFS provides data services to MapReduce, and a MapReduce map task typically processes one block at a time. If there are too few tasks (fewer than the number of nodes in the cluster), the job cannot take advantage of multiple nodes and may even run as slowly as it would on a single node.
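The trade-off can be sketched as follows, under the simplifying assumption that there is exactly one map task per block and at most one task per node at a time. The helper names and the 1024 MB file on a 10-node cluster are hypothetical.

```python
import math


def map_tasks(file_mb, block_mb):
    """Number of map tasks, assuming one task per block."""
    return math.ceil(file_mb / block_mb)


def busy_nodes(file_mb, block_mb, cluster_nodes):
    """Nodes that have work to do; the rest of the cluster sits idle."""
    return min(map_tasks(file_mb, block_mb), cluster_nodes)


for block_mb in (64, 512, 1024):
    tasks = map_tasks(1024, block_mb)
    nodes = busy_nodes(1024, block_mb, cluster_nodes=10)
    print(f"block {block_mb} MB: {tasks} tasks, {nodes} of 10 nodes busy")
```

In this sketch a 64 MB block keeps all 10 nodes busy, while a 1024 MB block leaves the whole job to a single node.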

The advantages of the distributed file abstraction are:

1. A file can be larger than any single disk.

2. A file does not have to reside entirely on one disk.

3. The design of the storage subsystem is simplified.

Beyond that, metadata-based storage is well suited to backup, and backups provide data fault tolerance and availability.

Key operating mechanisms of HDFS

HDFS is built from components organized in a master/slave structure.

The detailed operating mechanism will be introduced in the next article ...

How to use HDFS

HDFS is ready to use once hadoop-0.20.2.tar.gz has been installed and configured successfully. The installation process is not covered here; see "Installing and Running Hadoop" and "Installing JDK8 under Ubuntu 14.04".

Whether you use shell scripts or the Web UI, you should understand the HDFS configuration before using it. This makes storage operations and tuning easier.

Common HDFS commands: see "HDFS API Operation practices" and "Common operations for HDFS".
