Hadoop Study Notes (5): Basic HDFS knowledge

Article Directory
    • 1. Blocks
    • 2. namenode and datanode
    • 3. Hadoop federation
    • 4. HDFS high availability

When a data set outgrows the storage capacity of a single physical machine, we can consider using a cluster. A file system that manages storage across a network of machines is called a distributed file system. Bringing in multiple nodes introduces new problems, the most important being how to ensure that data is not lost when a node fails. Hadoop has a core subproject, HDFS (Hadoop Distributed File System), that manages the cluster's storage. HDFS is not the only option, though: Hadoop has a general file system abstraction that lets it work with different kinds of file systems; for example, Hadoop can integrate with Amazon's S3.

HDFS Design Concept

1. Storing very large files
Here "very large" means files of hundreds of megabytes, gigabytes, or even terabytes.
2. Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is write-once, read-many-times. A dataset stored in HDFS is typically generated once and then analyzed many times over a long period, with each analysis touching most or even all of the data. The latency of reading the whole dataset therefore matters more than the latency of reading the first record. (Streaming reads minimize disk seek overhead: you seek once and then keep reading. The physics of disks mean that improvements in seek time lag far behind improvements in transfer rate, so streaming suits the hardware, and large files suit streaming. The opposite of streaming data access is random data access, which demands low latency for locating, querying, and modifying individual records and suits data that is read and written repeatedly after creation; traditional relational databases are a much better fit for that.)
3. Running on commodity hardware
One of HDFS's design goals is to run on ordinary, inexpensive servers; even when the hardware fails, fault-tolerance mechanisms keep the data highly available.

Where HDFS is a poor fit

1. Applications that need low-latency data access
HDFS is designed for high data throughput, and that comes at the cost of higher latency.
2. Storing huge numbers of small files
Metadata (the basic information about each file) is held in the namenode's memory, and the namenode is a single node; with enough small files, the namenode simply runs out of memory.
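As a rough illustration (the figure of roughly 150 bytes per namenode object is the estimate commonly quoted in the Hadoop documentation, not a number from this article): every file, directory, and block is an object in namenode memory, so one million small files, each occupying its own block, need on the order of

    1,000,000 files × 2 objects per file (1 file + 1 block) × 150 bytes ≈ 300 MB

of namenode memory, regardless of how little data the files actually contain.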

Basic concepts in HDFS

1. Blocks

Every disk has a block size, the minimum unit of data it reads or writes at a time. HDFS has a block concept as well, but HDFS blocks are far larger than the blocks of an ordinary disk (typically 512 bytes). Like an ordinary disk file system, HDFS splits files into block-sized chunks (unless stated otherwise, "block" below means the 64 MB HDFS block) and stores each chunk independently. Unlike an ordinary disk file system, though, a file smaller than a single block does not occupy a full block of underlying storage (imagine the waste if a file of a few hundred KB consumed an entire 64 MB block).
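To make the block concept concrete from a client's point of view, here is a minimal sketch using the Hadoop Java FileSystem API (the file path /user/data/big.log is a hypothetical example, not from this article); it prints a file's block size and the datanodes holding each block's replicas:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            // picks up fs.defaultFS etc. from the classpath configuration
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/data/big.log");   // hypothetical file
            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size: " + status.getBlockSize());
            // one BlockLocation per block, with the datanodes holding its replicas
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
            }
        }
    }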

Why does HDFS use such large blocks?

HDFS blocks are much larger than those of ordinary disk file systems in order to minimize seek time relative to transfer time. If the block is large enough, the time spent seeking to the start of the block is a small fraction of the time spent transferring its data, so the time to process a large file (HDFS's main workload) is dominated by the transfer rate.
If the average seek time is 10 ms and the transfer rate is 100 MB/s, then a rough calculation shows that to keep seek time down to about 1% of transfer time, each block needs to be about 100 MB. The default is in fact 64 MB (many HDFS installations use 128 MB), and the default may grow as transfer rates increase.
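The arithmetic behind that figure, using the numbers above:

    transfer time per block = block size / 100 MB/s
    target: seek time = 1% of transfer time
        => transfer time = 10 ms / 0.01 = 1 s
        => block size    = 1 s × 100 MB/s = 100 MB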
Block size cannot grow without bound, however. A map task in MapReduce normally processes one block at a time, so for a file of a given size, blocks that are too large mean too few map tasks to keep the cluster busy, and the job takes longer to run.

Making the block the unit of abstraction at the distributed-file-system level brings the following benefits:
1. A file can be larger than any single disk in the cluster, because nothing requires all of a file's blocks to sit on one disk; they can be spread across the cluster, so its full storage capacity can be used. It is even possible (though unusual) for a single file to occupy storage on every node in the cluster.
2. Making blocks (rather than files) the unit of abstraction simplifies the storage subsystem. Simplicity is a goal of every storage system and matters especially in distributed file systems, where the failure modes are so varied. The storage subsystem deals only with blocks, which simplifies storage management (blocks are fixed-size, so it is easy to calculate how many fit on a disk) and removes the burden of metadata management (a block is just a chunk of data to be stored; file metadata such as access permissions need not be stored with the block and can be managed separately, by the namenode).
3. Blocks fit naturally with the replication mechanism that provides fault tolerance and availability. In HDFS, each block is replicated to a small number of separate machines (three by default) to guard against corrupt blocks and disk or machine failure. If a block becomes unavailable, HDFS transparently copies one of the surviving replicas to another machine, bringing the cluster back to its previous level of data safety (and the replication factor can be raised to increase that level).
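As a small sketch of working with the replication factor through the same Java API (the path and the new factor of 5 are hypothetical examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // ask for 5 replicas of one (hypothetical) file instead of the
            // default 3; the namenode schedules the extra copies in the background
            boolean scheduled =
                fs.setReplication(new Path("/user/data/big.log"), (short) 5);
            System.out.println("re-replication scheduled: " + scheduled);
        }
    }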

HDFS's fsck command understands blocks. For example, the following command lists the blocks that make up each file in the file system:
% hadoop fsck / -files -blocks

2. namenode and datanode

An HDFS cluster has two types of node: namenodes and datanodes. The namenode manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local file system in two forms: the namespace image (which includes the inodes and block list of every file in the namespace) and the edit log (which records every change users make to HDFS). The namenode also knows which blocks make up each file and where they are stored, but block locations are not persisted to disk: they are rebuilt from datanode reports at startup and kept current by periodic reports from the datanodes. This information is highly dynamic.

A client acts on behalf of the user, talking to the namenode and datanodes to access the file system. The client exposes a file system interface similar to POSIX (Portable Operating System Interface), so user code does not need to know about namenodes and datanodes at all.
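A minimal sketch of that client-side view, again using the Java FileSystem API (the path is a hypothetical example); note that the code never mentions namenodes or datanodes:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CatFile {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // open() consults the namenode for block locations behind the
            // scenes; the stream then reads bytes directly from the datanodes
            try (FSDataInputStream in = fs.open(new Path("/user/data/big.log"));
                 BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }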

The namenode is the single point of failure of the whole distributed file system: without it, the file system cannot be used at all, because there is no way to reconstruct files from their blocks. Recovering promptly from a namenode failure is therefore essential, and there are two lines of defense:
1. The first is to back up the namenode's persistent state (the namespace image and edit log described above). The namenode can be configured to write its persistent state to multiple file systems; the writes are synchronous and atomic. The usual practice is to write to the local disk and to a remote NFS (network file system) mount; a configuration sketch follows this list.
2. The other is to run a secondary namenode. Despite the name, it does not act as a namenode. Its main job is to periodically merge the namespace image with the edit log (so the edit log never grows too large), keep a copy of the merged namespace image on its own local disk, and send the new image back to the namenode. Because the merge needs a lot of CPU and as much memory as the namenode itself, the secondary namenode usually runs on a separate machine. The state held by the secondary namenode always lags behind the namenode's (the unmerged edit log holds the difference), so if the namenode fails completely, this method alone is bound to lose some data.
The usual practice combines the two: when the namenode goes down, copy the namenode's metadata from the remote NFS mount to the secondary namenode, then run the secondary namenode as the new namenode.
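As a sketch of the multi-directory backup described in point 1 (the property name dfs.namenode.name.dir is the Hadoop 2.x one, and both paths are hypothetical; in practice this is set in hdfs-site.xml on the namenode rather than in code):

    import org.apache.hadoop.conf.Configuration;

    public class NameDirConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // the namenode writes its namespace image and edit log synchronously
            // and atomically to every directory in this comma-separated list:
            // here one local disk plus one remote NFS mount (both hypothetical)
            conf.set("dfs.namenode.name.dir",
                     "/disk1/hdfs/name,/mnt/remote-nfs/hdfs/name");
            System.out.println(conf.get("dfs.namenode.name.dir"));
        }
    }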

3. Hadoop federation

The namenode holds a reference to every file and directory in the file system in memory, which becomes a bottleneck as the cluster grows. The Hadoop 2.x release therefore introduces a new concept, HDFS federation, which lets a cluster have more than one namenode, each responsible for maintaining part of the file system: one namenode might maintain the /user directory, for example, and another the /share directory.
Under federation, each namenode maintains two things: 1) a namespace volume made up of its namespace metadata, and 2) a block pool containing the block information for all the files in the part of the file system it maintains. Namespace volumes are independent of one another: namenodes do not communicate, and the failure of one does not affect the availability of the namespaces managed by the others. Block pool storage, unlike namespace volumes, is not partitioned, so every datanode registers with every namenode in the cluster and stores blocks from multiple block pools.
To access a cluster with federation enabled, clients use client-side mount tables to map file paths to namenodes; this is configured with ViewFileSystem and viewfs:// URIs.
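A sketch of such a mount table, shown through the Configuration API for illustration (the cluster name clusterX, the host names, and the port are hypothetical; in practice these properties live in core-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class ViewFsMounts {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // clients see one namespace (the view), not individual namenodes
            conf.set("fs.defaultFS", "viewfs://clusterX");
            // /user is served by one namenode, /share by another
            conf.set("fs.viewfs.mounttable.clusterX.link./user",
                     "hdfs://nn1.example.com:8020/user");
            conf.set("fs.viewfs.mounttable.clusterX.link./share",
                     "hdfs://nn2.example.com:8020/share");
            System.out.println(conf.get("fs.defaultFS"));
        }
    }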

4. HDFS high availability

Backing up the namespace metadata to multiple file systems, and using a secondary namenode to merge the namespace image and edit log into periodic checkpoints, protects the cluster against data loss, but it does not make the cluster highly available: the namenode itself remains a single point of failure. If the namenode fails, no client or MapReduce job can read, write, or list files, because the namenode is the sole repository of the namespace metadata and of the file-to-block mapping.
To recover from a failed namenode, an administrator starts a new namenode and configures the datanodes and clients to use it. The new namenode cannot serve requests until it has:
1) loaded the namespace image backup into memory;
2) replayed the operations in the edit log; and
3) received enough block reports from the datanodes (that is, the block information held by each datanode, needed to rebuild the file-to-block mapping) to leave safe mode.
On a cluster with many nodes and files, this can take tens of minutes!

The Hadoop 2.x release avoids such long downtime by adding support for HDFS high availability (HA). In this implementation there is a pair of namenodes, configured as active and standby. When the active namenode fails, the standby takes over immediately and keeps serving clients, so the interruption is very short. Supporting this requires some structural changes:
1) The two namenodes share a highly available storage device for the edit log (the initial HA implementation shared the edit log over NFS; later versions offer more options, such as a BookKeeper-based journal built on ZooKeeper). When the standby namenode takes over, it replays the operations in the shared edit log (it also plays the secondary namenode's role, continually merging the old namespace image with the new edit log so the log never grows too large) and so quickly reaches the state the active namenode was in.
2) Datanodes send block reports to both namenodes, because the block mapping is held in memory, not on disk.
3) Clients must be configured to handle namenode failover transparently.
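A sketch of that client-side failover configuration (the nameservice name mycluster, the host names, and the port are hypothetical; the proxy provider class is the ConfiguredFailoverProxyProvider shipped with Hadoop 2.x HA; in practice these properties live in hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class HaClientConfig {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // one logical nameservice backed by an active/standby namenode pair
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1",
                     "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2",
                     "nn2.example.com:8020");
            // lets the client retry against the other namenode on failover
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                     "org.apache.hadoop.hdfs.server.namenode.ha."
                         + "ConfiguredFailoverProxyProvider");
            // clients address the nameservice, not a specific namenode
            conf.set("fs.defaultFS", "hdfs://mycluster");
            System.out.println(conf.get("fs.defaultFS"));
        }
    }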

What information is stored in the edit log?

All mutations to the file system namespace, such as file renames, permission changes, file creations, and block allocations, are written to a persistent write-ahead log by the namenode before success is returned to the client call. In addition to this edit log, periodic checkpoints of the file system, called the fsimage, are also created and stored on disk on the namenode. Block locations, on the other hand, are stored only in memory; the locations of all blocks are received via the "block reports" sent by the datanodes when the namenode starts.

With these changes in place, when the active namenode fails, the standby namenode holds the latest edit log state (plus the last checkpoint image file) and the latest block mapping, so it can take over within a few tens of seconds. In practice the observed failover time is longer, around a minute, because the system needs extra time to decide that the active namenode really has gone down.

Original source (please credit when reprinting): http://www.cnblogs.com/beanmoon/archive/2012/12/08/2809315.html
