HDFS stands for the Hadoop Distributed FileSystem.
When a dataset grows beyond the storage capacity of a single physical machine, it must be partitioned and stored across several separate machines. A file system that manages storage spread over multiple machines in a network is called a distributed file system (distributed FileSystem). Because such a system sits on top of the network, it inherits all the complexity of network programming, which makes a distributed file system more complicated than an ordinary disk file system. For example, making the file system tolerate node failures without losing any data is a major challenge.
Hadoop comes with a distributed file system called HDFS, short for Hadoop Distributed FileSystem (informal or older documents sometimes simply call it DFS). HDFS is Hadoop's flagship file system, but Hadoop actually provides a general file system abstraction, so other storage systems can be integrated as well, such as Amazon S3 or the local file system.
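As a minimal sketch of that abstraction (the namenode address, port, and bucket name are placeholders, and the S3 case assumes the s3a connector is on the classpath), the same org.apache.hadoop.fs.FileSystem API can be pointed at HDFS, the local file system, or Amazon S3 simply by changing the URI scheme:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class FileSystemSchemes {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The URI scheme selects the concrete FileSystem implementation.
            FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // HDFS
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);              // local file system
            FileSystem s3    = FileSystem.get(URI.create("s3a://my-bucket/"), conf);      // Amazon S3 via s3a

            System.out.println(hdfs.getClass().getName());
            System.out.println(local.getClass().getName());
            System.out.println(s3.getClass().getName());
        }
    }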
HDFS stores ultra-large files with streaming data access, running on clusters of commodity hardware. Its main characteristics are as follows:
1. Ultra-large file storage
"Oversized files" here refers to files with even MB, hundreds of GB, and hundreds of TB sizes, and there is already a Hadoop cluster that stores petabytes of data. (The world's largest Hadoop cluster in Yahoo, there are about 25,000 nodes, mainly used to support the advertising system and web search.) )
2. Streaming data access
HDFS is built around the idea that a write-once, read-many-times pattern is the most efficient way to access data. A dataset is typically generated by, or copied from, a data source and then analyzed repeatedly over a long period of time. Each analysis involves most or all of the dataset, so the time needed to read the entire dataset matters more than the latency of reading the first record (see the read sketch below).
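A small illustrative sketch of this pattern (the file path and namenode address are hypothetical): a dataset written once is later read back sequentially, end to end, through the FileSystem streaming API:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class StreamingRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            // Stream the whole file from start to finish; the analysis reads
            // the entire dataset rather than seeking to individual records.
            try (FSDataInputStream in = fs.open(new Path("/data/events.log"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }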
3. Commodity hardware
Hadoop does not need to run on expensive, highly reliable hardware. It is designed for clusters of ordinary commodity hardware, on which, at least for large clusters, the probability of node failure is high. HDFS is designed to keep running through such failures without the user noticing any obvious interruption. It is also worth knowing which applications are not a good fit for HDFS; currently, applications with strict real-time requirements are not well suited to running on it.
4. Low-latency data access
Applications that require low-latency access to data, for example in the tens-of-milliseconds range, are not suitable for HDFS. HDFS is optimized for high-throughput workloads, which may come at the cost of higher latency. For low-latency access requirements, HBase is currently a better choice.
5. Large numbers of small files
Because the namenode keeps the file system's metadata in memory, the total number of files the file system can hold is limited by the namenode's memory capacity. As a rule of thumb, the metadata for each file, directory, and data block takes about 150 bytes, so, for example, one million files that each occupy a single block need at least 300 MB of memory (see the worked figures below). Storing millions of files is therefore feasible, but storing billions of files is beyond the capability of current hardware.
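As a rough worked example of that estimate: 1,000,000 files that each occupy one block correspond to roughly 1,000,000 file objects plus 1,000,000 block objects in the namenode, and 2,000,000 × 150 bytes ≈ 300 MB of namenode memory. Extrapolating the same estimate, one billion such files would need on the order of 300 GB of memory, which is why billions of small files are impractical.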
6. Multiple writers and arbitrary file modifications
A file in HDFS can have only a single writer, and currently (hadoop-0.x through hadoop-2.6.0) data can only be appended at the end of the file. Multiple concurrent writers are not supported, nor are modifications at arbitrary positions within a file (see the sketch below).
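A minimal sketch of that restriction (the file path and namenode address are hypothetical, and the file must already exist): new data can only be added at the end of the file, and there is no call for overwriting bytes in the middle of it:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnly {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            // Append a record to the end of an existing file; HDFS offers no
            // API for modifying data at an arbitrary offset.
            try (FSDataOutputStream out = fs.append(new Path("/data/events.log"))) {
                out.writeBytes("one more record\n");
            }
        }
    }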
HDFS Data Block
Each disk has a default block size, which is the smallest unit of data the disk can read or write. A file system built on a single disk manages its data in file system blocks, which are an integer multiple of the disk block size. File system blocks are typically a few kilobytes, while disk blocks are typically 512 bytes. For example, the allocation unit of a file system on Windows is called a cluster, while on Linux it is called a block.
HDFS also has the concept of a block, but it is much larger: the default block size is 64 MB in hadoop-0.x and hadoop-1.x, and 128 MB in hadoop-2.0 and later versions. As with a single-disk file system, files in HDFS are divided into block-sized chunks, each stored as an independent unit. Unlike other file systems, though, a file in HDFS that is smaller than a block does not occupy the whole block's worth of space; as I understand it, HDFS blocks are logical, not physical. HDFS blocks are large in order to minimize seek overhead: if a block is large enough, the time spent transferring data from disk is significantly longer than the time spent locating the start of the block, and spending most of the time transferring data is clearly more efficient than spending it seeking. But blocks should not be too large either: a map task in MapReduce normally processes only one block at a time, so if there are too few tasks (fewer than the number of nodes in the cluster), the job runs more slowly.
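To make that trade-off concrete with illustrative figures (assumed here, not taken from the article): if seeking to the start of a block takes about 10 ms and the disk transfers data at about 100 MB/s, then keeping seek time to roughly 1% of transfer time means each seek should be followed by about one second of transfer, i.e. around 100 MB of data, which is the same order of magnitude as the 64 MB and 128 MB defaults.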
Blocks also fit naturally with replication for fault tolerance and availability. In a distributed cluster, HDFS keeps 3 replicas of each block by default, and each replica is placed on a different node. The benefit is that data is not lost when a disk or machine fails: if one copy of a block is unavailable, another copy is read from a different node, and the process is transparent to the user. A block lost to corruption or machine failure can be copied from another node to a healthy machine to bring the number of replicas back to the normal level. In addition, some applications may choose a higher replication factor for frequently used files, which spreads the read load across the cluster.
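As an illustrative sketch (the path and the factor of 5 are assumptions), an application can raise the replication factor of a frequently read file so that its blocks live on more nodes and reads are spread across the cluster:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

            // Ask the namenode to keep 5 replicas of this file's blocks instead
            // of the default 3; the extra copies are created in the background.
            boolean scheduled = fs.setReplication(new Path("/data/hot-lookup-table"), (short) 5);
            System.out.println("re-replication scheduled: " + scheduled);
        }
    }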