Brief introduction
HDFS (Hadoop Distributed File System) is Hadoop's distributed filesystem. It is based on a paper published by Google: GFS (the Google File System).
HDFS features:
1. Stores multiple copies of the data and provides a fault-tolerance mechanism: a lost replica or a failed node is recovered automatically. The default is 3 replicas.
2. Can run on cheap commodity machines.
3. Suitable for big data processing. HDFS splits a file into blocks, 64 MB each by default, distributes the blocks across the cluster, and keeps the block-to-location mapping in memory.
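The block-splitting rule above can be sketched in a few lines. This is an illustrative sketch, not Hadoop's actual code; the function name is made up for this example.

```python
# Sketch (not Hadoop source): dividing a file into fixed-size
# blocks, using the 64 MB default mentioned above.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file occupies 4 blocks; the last block holds only the
# remaining 8 MB and, unlike a local filesystem, does not reserve
# a full 64 MB on disk.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                      # 4
print(blocks[-1][1] // (1024 * 1024))   # 8
```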
As shown in the figure, HDFS also uses a master/slave structure, with three roles: NameNode, SecondaryNameNode, and DataNode.
NameNode: the master node, the manager. It manages the block mappings, handles read and write requests from clients, applies the replica placement policy, and manages the HDFS namespace.
Which DataNodes hold each block is not stored on the NameNode's disk; each DataNode reports its blocks to the NameNode at startup, and the NameNode keeps this information in memory.
Block location information is therefore never written back into the fsimage.
The edits file logs the client's operations on the filesystem, such as file creation and deletion.
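The in-memory block map described above can be sketched as follows. The class and method names here are illustrative only, not Hadoop's real API.

```python
# Illustrative sketch (not Hadoop source): the NameNode keeps the
# block -> DataNode mapping only in memory, rebuilt from the block
# reports each DataNode sends at startup.
from collections import defaultdict

class NameNodeBlockMap:
    def __init__(self):
        # block_id -> set of DataNode ids; never written to fsimage
        self.block_locations = defaultdict(set)

    def process_block_report(self, datanode_id, block_ids):
        """Called when a DataNode reports the blocks it stores."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_id)

    def locate(self, block_id):
        """Return the DataNodes known to hold this block."""
        return sorted(self.block_locations[block_id])

nn = NameNodeBlockMap()
nn.process_block_report("dn1", ["blk_1", "blk_2"])
nn.process_block_report("dn2", ["blk_2", "blk_3"])
print(nn.locate("blk_2"))  # ['dn1', 'dn2']
```

If the NameNode restarts, this map is empty until the DataNodes report again, which is why block locations never need to be persisted in the fsimage.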
SecondaryNameNode: shares part of the NameNode's workload and acts as a cold backup of the NameNode. It merges the fsimage and edits files and sends the result to the NameNode.
After merging fsimage and edits, it ships the new fsimage back to replace the NameNode's copy, and keeps a copy for itself.
That copy can be used to recover part of the filesystem metadata if the NameNode crashes or dies.
1. You can change the merge interval through fs.checkpoint.period; the default is 1 hour.
2. You can also configure the size of the edits log: fs.checkpoint.size sets the maximum edits file size at which the SecondaryNameNode triggers a merge; the default is 64 MB.
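The two parameters above could be set in a configuration fragment like the following. This is a sketch using the legacy property names cited in the text (in newer Hadoop releases these were renamed to dfs.namenode.checkpoint.period and related properties):

```xml
<!-- Checkpoint tuning, using the legacy property names from the text -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- seconds between merges; default 1 hour -->
</property>
<property>
  <name>fs.checkpoint.size</name>
  <value>67108864</value> <!-- merge once edits reaches 64 MB -->
</property>
```

Whichever limit is hit first triggers the checkpoint.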
The merge process is as follows:
DataNode: the slave node, the worker. It stores the data blocks sent by the client and performs read and write operations on those blocks.
Hot backup: B is a hot backup of A; if A fails, B immediately takes over A's work.
Cold backup: B is a cold backup of A; if A fails, B cannot take over A's work immediately, but B stores some of A's information, which reduces the loss when A goes down.
fsimage: the metadata image file (the filesystem directory tree).
edits: the metadata operation log (a record of modification operations on the filesystem).
fsimage + edits together form the metadata held in the NameNode's memory.
The SecondaryNameNode periodically (every 1 hour by default) fetches the fsimage and edits from the NameNode, merges them, and sends the result back, reducing the NameNode's workload.
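The merge step can be sketched as replaying the edit log on top of the fsimage snapshot. This is a simplified illustration with made-up operation names, not Hadoop's actual on-disk formats.

```python
# Sketch of the checkpoint merge (illustrative, not Hadoop source):
# fsimage is a snapshot of the namespace, edits is a log of
# operations; merging replays edits on top of fsimage.
def merge_checkpoint(fsimage, edits):
    """Return a new fsimage with the edit log applied."""
    namespace = dict(fsimage)  # copy: the old fsimage is preserved
    for op, path in edits:
        if op == "create":
            namespace[path] = {}
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/user": {}, "/user/a.txt": {}}
edits = [("create", "/user/b.txt"), ("delete", "/user/a.txt")]
new_fsimage = merge_checkpoint(fsimage, edits)
print(sorted(new_fsimage))  # ['/user', '/user/b.txt']
```

After the merge, the edits log can be truncated and restarted, which is what keeps it from growing without bound between checkpoints.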
HDFS pros and cons:
Advantages
1. High fault tolerance
Data automatically saves multiple copies
Automatic recovery after a copy is lost
2. Suitable for batch processing
Computation moves to the data, rather than the data to the computation
Data locations are exposed to the computing framework
3. Suitable for big data processing
GB, TB, even PB of data
More than a million files
10K+ nodes
4. Can be built on cheap commodity machines
Reliability is improved through replicas
Provides fault tolerance and recovery mechanisms
Disadvantages
1. Not suited to low-latency data access
2. Small files waste resources (each file occupies NameNode memory)
3. No concurrent writes (a file can have only one writer at a time), and files cannot be modified at random positions (only appends are supported)
Operating principle of HDFS