HDFS is one of the most commonly used components in big data, and an indispensable framework in the Hadoop ecosystem. Anyone getting started with Hadoop should therefore have a solid understanding of it.
First, HDFS (Hadoop Distributed File System) is the distributed file system of the Hadoop ecosystem, used to store the massive data sets that big data deals with. It originated from the publication of Google's file-system paper and emerged along with the advent of the big data era.
Next we introduce the major components (excluding HA mode):
① NameNode: stores the metadata, keeps heartbeats with the DataNodes, and establishes communication with the client. A word on the communication method here: TCP is the common mechanism in Java, but Hadoop introduces RPC (Remote Procedure Call) to maintain sessions between our nodes.
② DataNode: where the data is actually stored. A multi-replica mechanism is normally configured (3 replicas is the usual choice).
③ SecondaryNameNode: remember that this is not a backup of the NameNode. Instead, it periodically merges the edit log and the fsimage (by default every hour, or after 1,000,000 transactions, whichever comes first) and ships the merged image back to the NameNode as the new fsimage.
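The replication and checkpoint behavior described above is controlled in `hdfs-site.xml`. A sketch of the relevant settings (the property names are the real Hadoop ones; the values shown are the usual defaults, so you would only write them out to change them):

```xml
<configuration>
  <!-- Number of copies kept for each block (the "3 replicas" above). -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- SecondaryNameNode checkpoint triggers: every 3600 seconds, or after
       1,000,000 uncheckpointed transactions, whichever comes first. -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
  </property>
</configuration>
```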
The HDFS write process (no fluff, let's get straight to it!):
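Before looking at the failure cases, it helps to see what the write path actually does with a file: the client splits it into fixed-size blocks, and each block is pipelined to several DataNodes. A minimal sketch in plain Java (no Hadoop dependency) of the arithmetic involved, assuming the common defaults of a 128 MB block size and replication factor 3 — both are configurable in a real cluster:

```java
// Sketch: how an HDFS write splits a file into fixed-size blocks,
// each of which is then replicated across DataNodes.
public class BlockSplitSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB, the common default
    static final int REPLICATION = 3;                  // the usual replica count

    // Number of blocks a file of the given length occupies (ceiling division).
    static long blockCount(long fileLength) {
        return (fileLength + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Total physical copies stored across the cluster for this file.
    static long totalReplicas(long fileLength) {
        return blockCount(fileLength) * REPLICATION;
    }

    public static void main(String[] args) {
        long fileLength = 300L * 1024 * 1024; // a hypothetical 300 MB file
        System.out.println(blockCount(fileLength));    // 3 blocks: 128 + 128 + 44 MB
        System.out.println(totalReplicas(fileLength)); // 9 physical copies
    }
}
```

So a 300 MB file becomes three blocks (blk1, blk2, blk3), and each block travels through its own pipeline of three DataNodes — which is exactly the situation the problems below break in various ways.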
Problem 1: node3 dies mid-write. blk1 continues uploading to the surviving nodes, and once node3 misses a certain number of heartbeats the NameNode removes it from the block's replica locations. If the lost replica has not yet been replaced when node3 restarts, the NameNode tells a neighboring DataNode to copy the data back to it; if the replica has already been replaced, node3's stale copy is not used again.
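The "certain number of heartbeats" above follows a well-known rule: with the default heartbeat interval of 3 s (`dfs.heartbeat.interval`) and recheck interval of 5 min (`dfs.namenode.heartbeat.recheck-interval`), the NameNode declares a DataNode dead after 2 × recheck-interval + 10 × heartbeat-interval = 10 min 30 s. A plain-Java sketch of that rule (the class and method names are illustrative, not Hadoop's):

```java
// Sketch of the NameNode's dead-node rule: a DataNode is considered dead
// once it has been silent longer than 2 * recheck + 10 * heartbeat.
public class HeartbeatSketch {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;   // dfs.heartbeat.interval default (3 s)
    static final long RECHECK_INTERVAL_MS = 300_000;   // dfs.namenode.heartbeat.recheck-interval default (5 min)

    // Timeout after which a silent DataNode is declared dead.
    static long deadNodeTimeoutMs() {
        return 2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;
    }

    static boolean isDead(long lastHeartbeatMs, long nowMs) {
        return nowMs - lastHeartbeatMs > deadNodeTimeoutMs();
    }

    public static void main(String[] args) {
        System.out.println(deadNodeTimeoutMs()); // 630000 ms = 10 min 30 s
    }
}
```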
Problem 2: the client retries the connection several times; if it still cannot connect, the node is considered unreachable.
Problem 3: the operation is retried several times, and if it keeps failing, the job finally fails.
Problem 4: the job is submitted several times; if it still cannot connect, the submission is considered failed.
Problem 5: the job fails.