In-depth introduction to Hadoop HDFS


The Hadoop ecosystem has long been a hot topic in the big data field. It includes HDFS, the subject of today's article, as well as YARN, MapReduce, Spark, Hive, and HBase (to be covered later), and ZooKeeper (covered previously).

Today we are talking about HDFS, the Hadoop Distributed File System, which originated from Google's GFS. GFS, however, is written in C++, while Hadoop was written in Java by Doug Cutting at Yahoo. Hadoop became an Apache top-level project in 2008.

Application

What scenarios is HDFS suited for? Very large file storage, on the order of gigabytes or terabytes, since HDFS's basic unit, the block, is already 128 MB. Note the small file problem: a common misconception is that a 1 KB file also occupies a whole 128 MB block of disk. It does not; it still occupies only 1 KB of disk. The real bottleneck for small files is the namenode, which keeps the metadata of every file and block in memory, so a huge number of files exhausts the namenode's memory (for example, a million small files occupy about 300 MB of memory). HDFS federation can relieve namenode memory pressure by sharding the namespace, which will be detailed later.

HDFS also targets "write once, read many" workloads, and its writes are append-only, so existing data cannot be modified in place. If you need to rewrite data repeatedly, consider Cassandra (see my earlier article). A file can have only a single writer at a time, enforced through a lease mechanism. HDFS is popular with enterprises because it runs on ordinary commodity hardware and does not require expensive high-availability hardware. Finally, HDFS is not suitable for data access requiring low latency, because HDFS trades latency for high throughput.
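You can see for yourself that a small file does not consume a whole block by uploading a tiny file and inspecting its usage (a minimal sketch; the path /tmp/tiny.txt is illustrative, and depending on the Hadoop version, du may also print a second column showing disk space consumed across all replicas):

# Write a few bytes from stdin into HDFS
echo "tiny" | hdfs dfs -put - /tmp/tiny.txt

# Reports the file's actual size (a few bytes), not 128 MB
hdfs dfs -du /tmp/tiny.txt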

Concepts: Blocks

In HDFS, files are split into chunks of the block size, 128 MB per block by default. Why so large? Mainly to keep seek time a small proportion of total disk I/O time. For example, if a seek takes 5 ms and we want seeking to account for only about 0.5% of the total time, then the transfer itself should take about 1 s; at a disk transfer rate of 128 MB/s, roughly 128 MB can be transferred in 1 s, hence the 128 MB block size. This ensures that disk time is spent transferring data for the application rather than seeking. Of course, blocks should not be too large either: a MapReduce map task usually processes one block, so too few blocks means too little parallelism and lower MapReduce efficiency.
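The configured default can be checked, or overridden per file, from the command line (a small sketch; dfs.blocksize is the Hadoop 2+ property name, and the paths are illustrative):

# Print the default block size in bytes (134217728 = 128 MB)
hdfs getconf -confKey dfs.blocksize

# Choose a different block size for one file at write time (256 MB here)
hdfs dfs -D dfs.blocksize=268435456 -put localfile /data/localfile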

hdfs fsck $path -files -blocks -locations 

The command above prints a file's block information, such as each block's name and the machines it resides on, which you can then use to look up details of a specific block.

Namenodes and datanodes

The namenode manages the namespace: the tree structure of the file system and the metadata of every file and directory. This information is persisted to disk in two forms: the namespace image (fsimage) and the edit log. The namenode also holds block metadata, but only in memory; as mentioned above, a million small files consume on the order of 300 MB of namenode memory. Why is block information not persisted? Because it changes over time, so it is simply rebuilt from datanode block reports when the system restarts.
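Both on-disk structures can be inspected with Hadoop's offline viewers (a sketch; the file names under the namenode's storage directory are illustrative):

# Dump a checkpointed namespace image to XML
hdfs oiv -p XML -i /data/dfs/name/current/fsimage_0000000000000042 -o fsimage.xml

# Dump an edit log segment to XML
hdfs oev -i /data/dfs/name/current/edits_0000000000000001-0000000000000042 -o edits.xml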

There are two traditional ways to protect the namenode. One is to have it write its persistent state to multiple locations at once, typically the local disk plus a remote NFS mount. The other is to run a secondary namenode, which does not actually serve as a namenode; instead, it periodically merges the namespace image with the edit log to keep the edit log from growing too large, and it keeps a copy of the merged namespace image. When the primary fails, the metadata on NFS is copied to the secondary namenode, which then becomes the new primary.
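For the first approach, the namenode's storage directories are given as a comma-separated list, and each directory receives a full copy of the metadata. A minimal hdfs-site.xml sketch, assuming a hypothetical NFS mount at /mnt/nfs/namenode:

<property>
  <name>dfs.namenode.name.dir</name>
  <!-- one copy on local disk, one on the NFS mount; both paths are illustrative -->
  <value>file:///data/dfs/name,file:///mnt/nfs/namenode</value>
</property>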

In this process, both the edit log and the fsimage live on disk. The edit log is a write-ahead log (WAL; Cassandra's writes also go through a WAL — the pattern is popular enough to deserve its own article), while the fsimage is a checkpoint of the filesystem metadata. A write goes to the edit log first, and only then updates the in-memory representation of the filesystem metadata, which is what serves read requests.


Is there a better way? The preceding methods still do not provide HA; the namenode remains a single point of failure. A new primary must (1) load the namespace image into memory, (2) replay the edit log, and (3) receive enough block reports from the datanodes (remember, the block information mentioned above lives only in memory). This process can take 30 minutes or more, far longer than clients can wait.

Hadoop 2 provides HA support, running a pair of namenodes in an active-standby configuration:

  • The namenodes use highly available shared storage for the edit log. Every write by the active node is read by the standby and applied to its in-memory state.
  • Datanodes send block reports to both namenodes, since the block mapping lives only in memory.
  • Clients must be configured to handle namenode failover, which under the hood is leader election via ZooKeeper watches (see my earlier ZooKeeper article).
  • A secondary namenode is no longer needed: the standby periodically generates checkpoints.

The shared storage mentioned above usually means QJM (quorum journal manager), typically configured with 3 journal nodes (though I have also seen a 50-node cluster run 5 journal nodes); each write must be acknowledged by a quorum of them.
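A pared-down hdfs-site.xml sketch of such an HA pair over QJM (the nameservice mycluster, the namenode IDs nn1/nn2, and the journal-node hosts are all hypothetical):

<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- edits are written to a quorum of these journal nodes -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>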

This way, the standby can take over almost immediately when the active namenode fails, because both the latest edit log and an up-to-date block mapping are already in its memory.
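The current roles can be checked, or a failover triggered manually, with the HA admin tool (a sketch; nn1 and nn2 are the hypothetical namenode IDs from the config above):

# Ask each namenode whether it is active or standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2
hdfs haadmin -failover nn1 nn2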

HDFS write
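A minimal command-line sketch of writing to HDFS (paths are illustrative; recall that writes are append-only):

# Copy a local file into HDFS, block by block
hdfs dfs -put access.log /logs/access.log

# Append to an existing file; modifying data in place is not supported
hdfs dfs -appendToFile more.log /logs/access.log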

HDFS read
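A matching sketch for reading a file back (paths are illustrative):

# Stream a file's contents to stdout
hdfs dfs -cat /logs/access.log

# Or copy it back to the local filesystem
hdfs dfs -get /logs/access.log ./access.log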

CLI Example
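A few other commonly used commands (paths are illustrative):

# List a directory
hdfs dfs -ls /logs

# Create a directory tree
hdfs dfs -mkdir -p /data/raw

# Show cluster capacity and datanode status
hdfs dfsadmin -report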
