1. Hadoop Distributed File System (HDFS)
A distributed file system is a file system that lets files be shared across multiple hosts over a network, so that multiple users on multiple machines can share files and storage space.
HDFS is one such file system. It is designed for write-once, read-many access patterns; it does not support concurrent writers, and it is a poor fit for large numbers of small files.
2. HDFS Architecture
HDFS uses a master/slave architecture. An HDFS cluster consists of a single NameNode and a number of DataNodes. The NameNode is a central server that manages the file system namespace and regulates client access to files. A DataNode, typically one per node in the cluster, manages the storage attached to the node it runs on. HDFS exposes a file system namespace and lets users store data in the form of files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode executes namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to specific DataNodes. DataNodes serve read and write requests from file system clients, and perform block creation, deletion, and replication under instruction from the NameNode.
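As a concrete illustration of the client/NameNode interaction described above, here is a minimal sketch using the standard org.apache.hadoop.fs API. It lists a directory, which is a pure namespace operation answered by the NameNode; the cluster URI hdfs://localhost:9000 is an assumption, not taken from the text.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ListHdfsDir {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The NameNode address is an assumption; adjust fs.defaultFS for your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // listStatus is a namespace query: it is served by the NameNode, and no
        // DataNode is contacted because no file content is read.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```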
NameNode
- The management node of the entire file system. It maintains the directory tree of the whole file system, the metadata of every file and directory, and the list of data blocks that make up each file (illustrated in the sketch after this list). It also receives users' operation requests.
- Its persistent files include (see the directory configured by dfs.name.dir):
  - fsimage: the metadata image file, a snapshot of the NameNode's in-memory metadata at a point in time.
  - edits: the operation (edit) log file.
  - fstime: records the time of the last checkpoint.
- These files are stored in the local Linux file system.
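A quick way to observe the file-to-block mapping that the NameNode maintains is to ask for a file's block locations through the client API. The sketch below is an illustration only; the file path /data/example.txt is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical file path used only for illustration.
        Path file = new Path("/data/example.txt");
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this query from its block map: for each block of
        // the file, it returns the DataNodes that hold a replica.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```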
DataNode
- The storage service that holds the actual file data.
- File block: the basic unit of storage. A file's content is divided, starting from offset 0, into fixed-size pieces numbered in order; each piece is called a block. With the HDFS default block size of 128 MB, a 512 MB file is stored as 4 blocks.
- Unlike an ordinary file system, if a file in HDFS is smaller than one block, it does not occupy the full block of storage space.
- Replication: multiple copies of each block are kept; the default replication factor is 3 (see the sketch after this list).
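To make the block size and replication factor concrete, the sketch below creates a file through the FileSystem.create overload that takes both values explicitly. The path and the literal values are assumptions for illustration; normally they come from dfs.blocksize and dfs.replication in the configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class CreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/block-demo.txt");   // hypothetical path
        short replication = 3;                         // default replication factor
        long blockSize = 128L * 1024 * 1024;           // default block size: 128 MB
        try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // A 512 MB file written with this block size would occupy 4 blocks,
        // and each block would be replicated 3 times across DataNodes.
        fs.close();
    }
}
```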
SecondaryNameNode
- A component aimed at high availability (HA), but it does not provide hot standby; it must be configured separately.
- Execution process: it downloads the metadata files (fsimage and edits) from the NameNode, merges the two into a new fsimage, saves it locally, pushes it back to the NameNode, and the NameNode then resets its edits log (a toy sketch of this merge follows this list).
- By default it runs on the same node as the NameNode, but this is not safe!
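The checkpoint step can be pictured as replaying the edits log on top of the last fsimage. The toy sketch below models the namespace as a map and edit entries as simple operations; every type and operation name here is a hypothetical stand-in, not a real Hadoop class.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
    // Hypothetical edit-log entry: an operation plus the path it touches.
    record Edit(String op, String path) { }

    /** Replays edits on top of the old fsimage to produce the new fsimage. */
    static Map<String, String> merge(Map<String, String> fsimage, List<Edit> edits) {
        Map<String, String> merged = new LinkedHashMap<>(fsimage);
        for (Edit e : edits) {
            switch (e.op()) {
                case "CREATE" -> merged.put(e.path(), "file");
                case "MKDIR"  -> merged.put(e.path(), "dir");
                case "DELETE" -> merged.remove(e.path());
            }
        }
        return merged;   // saved locally, then pushed back to the NameNode
    }

    public static void main(String[] args) {
        Map<String, String> fsimage = new LinkedHashMap<>(Map.of("/data", "dir"));
        List<Edit> edits = List.of(new Edit("CREATE", "/data/a.txt"),
                                   new Edit("DELETE", "/data/a.txt"),
                                   new Edit("MKDIR", "/logs"));
        System.out.println(merge(fsimage, edits));   // => {/data=dir, /logs=dir}
    }
}
```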
HDFS Read Process
1. The client initializes the FileSystem and opens the file with FileSystem's open() method (a minimal client-side sketch follows this list).
2. FileSystem uses RPC to call the NameNode (metadata node) and obtains the file's block information; for each block, the NameNode returns the addresses of the DataNodes that store it.
3. FileSystem returns an FSDataInputStream to the client for reading, and the client calls the stream's read() method to start reading data.
4. DFSInputStream connects to the closest DataNode that holds the first block of the file, and data is read from that node to the client.
5. When the block has been fully read, DFSInputStream closes the connection to that DataNode and connects to the closest DataNode holding the next block of the file.
6. When the client has finished reading, it calls FSDataInputStream's close() method.
7. While reading, if the client encounters an error communicating with a DataNode, it tries the next DataNode that holds a replica of the block.
8. Failed DataNodes are recorded and are not contacted again for subsequent blocks.
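A minimal client-side read matching steps 1 through 6 above; the file path /data/example.txt is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // step 1: initialize FileSystem
        Path file = new Path("/data/example.txt");              // hypothetical file path
        // steps 1-3: open() asks the NameNode for block locations over RPC
        // and returns an FSDataInputStream backed by a DFSInputStream.
        try (FSDataInputStream in = fs.open(file)) {
            // steps 4-5: reading pulls bytes from the closest DataNode for each block in turn.
            IOUtils.copyBytes(in, System.out, 4096, false);
        }                                                        // step 6: close() on the stream
        fs.close();
    }
}
```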
HDFS Write Process
1. The client initializes the FileSystem and calls create() to create the file (a matching client-side sketch follows this list).
2. FileSystem uses RPC to call the NameNode to create the new file in the file system namespace; the NameNode first verifies that the file does not already exist and that the client has permission to create it, and then creates the file.
3. FileSystem returns a DFSOutputStream, and the client uses it to begin writing data.
4. DFSOutputStream splits the data into packets and writes them to a data queue. The data queue is read by the DataStreamer, which asks the NameNode to allocate DataNodes to store the blocks (each block is replicated 3 times by default). The allocated DataNodes form a pipeline: the DataStreamer writes each packet to the first DataNode in the pipeline, the first DataNode forwards it to the second, and the second forwards it to the third.
5. DFSOutputStream also keeps an ack queue for the packets it has sent, waiting for the DataNodes in the pipeline to acknowledge that the data has been written successfully.
6. When the client finishes writing, it calls the stream's close() method. This flushes all remaining packets to the DataNode pipeline, waits for the acknowledgements to come back, and finally notifies the NameNode that the write is complete.
7. If a DataNode fails during the write, the pipeline is closed and the packets in the ack queue are put back at the head of the data queue. The current block on the DataNodes that have already written it is given a new identity by the NameNode, so that when the failed node recovers it will detect that its copy of the block is stale and delete it. The failed DataNode is removed from the pipeline, and the rest of the block is written to the two remaining DataNodes. The NameNode then notices that the block is under-replicated and arranges for a third replica to be created later.
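A minimal client-side write matching steps 1 through 6 above; the file path /data/output.txt is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());   // step 1: initialize FileSystem
        Path file = new Path("/data/output.txt");               // hypothetical file path
        // steps 1-3: create() asks the NameNode to create the file in the namespace
        // and returns an FSDataOutputStream backed by a DFSOutputStream.
        try (FSDataOutputStream out = fs.create(file)) {
            // steps 4-5: write() queues packets that the DataStreamer pushes
            // through the DataNode pipeline; acks come back on the ack queue.
            out.write("written through the DataNode pipeline".getBytes(StandardCharsets.UTF_8));
        }                                                        // step 6: close() flushes and commits
        fs.close();
    }
}
```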
Official documentation
- HDFS Users Guide
- HDFS Architecture