HDFS Core Principles
2016-01-11 Du Yishu
HDFS (Hadoop Distributed File System) is a distributed filesystem.
A file system is the disk-space management service provided by the operating system: we only need to specify where to put a file and from which path to read it, without caring how the file is actually stored on disk.
What happens when a file requires more space than the local disk can provide?
One option is to add disks, but there is a limit to how many a single machine can hold.
The second is to add machines, providing networked storage through remote shared directories. This can be seen as a prototype of a distributed file system: different files are placed on different machines, and when space runs out, more machines are added, breaking the storage-space limit.
But this approach has several problems:
(1) The load on a single machine may be extremely high
For example, if a file is hot and many users read it frequently, the machine holding that file comes under heavy access pressure.
(2) Data is not safe
If the machine holding a file fails, the file becomes inaccessible; reliability is poor.
(3) Files are difficult to organize
For example, to move some files to another machine, you must check whether the target machine has enough space and also keep track of every file's location. With many machines, this becomes extremely complex.
The HDFS solution
HDFS is an abstraction layer. Underneath, it relies on many independent servers; externally, it provides a unified file-management interface. To the user it feels like operating a single machine; the number of servers beneath HDFS is invisible.
For example, when a user accesses the file /a/b/c.mpg in HDFS, HDFS is responsible for reading it from the corresponding underlying servers and returning it, so the user only deals with HDFS and does not care how the file is stored.
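The abstraction-layer idea can be pictured as a thin facade over many servers. The sketch below is purely illustrative (the class and method names are made up, not the real HDFS API): the user calls `read(path)` and never learns which server holds the file.

```python
# Minimal sketch of a unified-namespace facade over several servers.
# All names are hypothetical; this is not the real HDFS client API.

class Server:
    def __init__(self, name):
        self.name = name
        self.files = {}          # path -> bytes stored on this server

class DistributedFS:
    def __init__(self, servers):
        self.servers = servers
        self.location = {}       # path -> the server that holds it

    def write(self, path, data):
        # naive placement: pick the server currently holding the fewest files
        target = min(self.servers, key=lambda s: len(s.files))
        target.files[path] = data
        self.location[path] = target

    def read(self, path):
        # the user supplies only a path; the facade finds the right server
        return self.location[path].files[path]

fs = DistributedFS([Server("A"), Server("B")])
fs.write("/a/b/c.mpg", b"movie bytes")
print(fs.read("/a/b/c.mpg"))   # the caller never sees which server was used
```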
Write File Example
For example, a user needs to save the file /a/b/xxx.avi.
HDFS first divides this file, for example into 4 blocks, and then places them on separate servers.
The benefit is that no file is too large to store, and the pressure of reading the file is not concentrated on a single server.
But if one server fails, the whole file can no longer be read.
So HDFS makes multiple backups (replicas) of each block to ensure reliability. For example, with four servers A, B, C, D and three replicas per block:
Block 1: A B C
Block 2: A B D
Block 3: B C D
Block 4: A C D
The file's reliability is greatly enhanced: even if one server fails, the file can still be read in full.
It also brings another great benefit: the file's concurrent-access capacity increases. For example, when multiple users all need Block 1, HDFS can choose which server to read it from according to how busy each server is.
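The splitting and replication described above can be sketched in a few lines. This is a toy model under stated assumptions (a 4-byte block size and a simple placement over server groups, mirroring the table above); real HDFS uses 128 MB blocks and rack-aware replica placement.

```python
# Toy model of block splitting, 3-way replication, and failure tolerance.
# Illustrative only; not how HDFS actually sizes or places blocks.
from itertools import combinations

BLOCK_SIZE = 4                       # tiny block size for the demo
SERVERS = ["A", "B", "C", "D"]
REPLICAS = 3

def split(data, size=BLOCK_SIZE):
    # cut the file into fixed-size blocks
    return [data[i:i + size] for i in range(0, len(data), size)]

def place(blocks):
    # assign each block to one 3-server group, like the table above
    groups = list(combinations(SERVERS, REPLICAS))
    return {i: groups[i % len(groups)] for i in range(len(blocks))}

data = b"0123456789abcdef"           # 16 bytes -> 4 blocks
blocks = split(data)
placement = place(blocks)

def read_all(failed=None):
    # the file stays readable as long as every block has one live replica
    out = b""
    for i, block in enumerate(blocks):
        live = [s for s in placement[i] if s != failed]
        if not live:
            raise IOError(f"block {i} lost")
        out += block                 # in reality, fetched from any server in `live`
    return out

assert read_all(failed="A") == data  # any single server failure is survivable
```

With three replicas per block, removing any one server still leaves at least two live copies of every block, which is exactly why the table above survives a single failure.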
Metadata Management
What files are stored in HDFS?
What blocks are the files divided into?
On which server is each block placed?
......
This information is called metadata. It is abstracted into a directory tree that records these complex correspondences.
The metadata is managed by a separate module called the NameNode.
The servers that actually hold the file blocks are called DataNodes.
So the process of a user accessing HDFS can be understood as:
user → HDFS → NameNode → DataNode
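The access flow above can be sketched as follows: the client first asks the NameNode where a file's blocks live, then fetches each block from a DataNode. The classes and block IDs here are hypothetical stand-ins, not Hadoop's real interfaces.

```python
# Sketch of the HDFS read path: NameNode lookup, then DataNode fetch.
# Class names and block IDs are illustrative, not real Hadoop classes.

class DataNode:
    def __init__(self):
        self.blocks = {}                 # block_id -> bytes

class NameNode:
    def __init__(self):
        self.tree = {}                   # path -> [(block_id, [datanodes])]

    def locate(self, path):
        # return the ordered block list with replica locations
        return self.tree[path]

# two DataNodes holding the blocks of one file
d1, d2 = DataNode(), DataNode()
d1.blocks["blk_1"] = b"hello "
d2.blocks["blk_2"] = b"world"
d2.blocks["blk_1"] = b"hello "           # replica of block 1

nn = NameNode()
nn.tree["/a/b/xxx.avi"] = [("blk_1", [d1, d2]), ("blk_2", [d2])]

def client_read(namenode, path):
    data = b""
    for block_id, datanodes in namenode.locate(path):  # step 1: ask the NameNode
        data += datanodes[0].blocks[block_id]          # step 2: read from a DataNode
    return data

print(client_read(nn, "/a/b/xxx.avi"))
```

Note the key design point this illustrates: file data never flows through the NameNode; it only answers "which DataNodes hold which blocks", and the client talks to DataNodes directly.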
HDFS Benefits
(1) Capacity can be extended linearly by adding machines
(2) The replica mechanism gives high storage reliability and increases read throughput
(3) Thanks to the NameNode, users access files simply by specifying an HDFS path, without caring where the blocks are stored