Hadoop HDFS (1)

HDFS is the Hadoop Distributed File System.
When data grows too large to fit on a single machine, it must be spread across multiple machines. A file system that manages storage across multiple computers over a network is called a distributed file system. The complexity of network programming makes distributed file systems far more complex than ordinary disk file systems; one of the biggest challenges, for example, is fault tolerance: data integrity must be preserved even when one or more nodes fail.
HDFS is Hadoop's flagship file system, but Hadoop also defines an abstract file system interface through which other file systems, such as the local file system, can be integrated.
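A minimal sketch of that abstraction: the same FileSystem API resolves to different concrete implementations depending on the URI scheme. The namenode address hdfs://namenode:8020 below is a placeholder, not a real cluster.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsAbstraction {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // The same abstract API returns different concrete file systems
        // depending on the URI scheme.
        FileSystem hdfs  = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder address
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);

        System.out.println(hdfs.getClass().getSimpleName());  // e.g. DistributedFileSystem
        System.out.println(local.getClass().getSimpleName()); // e.g. LocalFileSystem
    }
}
```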
HDFS is a file system designed for storing very large files. It is suited to streaming data access and runs on clusters of commodity hardware.
"Very large file": refers to several hundred MB, several GB or even several TB of data. Yahoo used to store Pb of data on 4000 nodes. "Streaming data access": The index data is written once and then read and analyzed and computed multiple times. In addition, most of the data in the file is used for each calculation, rather than the first row of data. "Common Commercial computer cluster": hadoop is designed to run on common computers instead of running on very expensive and highly reliable servers. This means that the node is unreliable and it is normal for the node to exit the cluster due to a fault. HDFS is designed to prevent users from being clearly aware of node faults.
HDFS also has some unsuitable scenarios.

"Low-latency data access": HDFS is optimized for high throughput and excels at processing large volumes of data, but it is not a good fit for applications that need to read a small amount of data and respond quickly.

"Large numbers of small files": file system metadata is held in the namenode's memory. As a rule of thumb, each file, directory, and block takes roughly 150 bytes, so a million single-block files consume on the order of 300 MB of namenode memory. Millions of files are therefore manageable, but billions of files exceed what mainstream hardware configurations can support.

"Multiple writers and arbitrary file modifications": HDFS currently supports only a single writer per file, and writes can only go to the end of the file; you cannot modify an arbitrary position within a file. (This may be supported in later versions, but even then it is likely to be inefficient.)
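To make the write restriction concrete, a sketch of the append path: HDFS exposes an append call (its availability depends on the Hadoop version; older releases gated it behind dfs.support.append), but there is no API for overwriting bytes in the middle of a file. The namenode address and path are again placeholders.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf); // placeholder address
        Path path = new Path("/data/events.log"); // hypothetical path

        // Supported: a single writer adding bytes at the end of the file.
        try (FSDataOutputStream out = fs.append(path)) {
            out.writeBytes("event-3\n");
        }

        // Not supported: there is no "seek then overwrite" on the write
        // path, so bytes in the middle of a file cannot be modified.
    }
}
```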