Analysis of HDFS file writing principles in Hadoop

Source: Internet
Author: User

To prepare for the upcoming Big Data era, the following plain-language notes briefly record what HDFS does when a file is written in Hadoop, as a reference for future cluster troubleshooting.

On to the main topic

The process of creating a new file:

Step 1: The client calls the create() method on a DistributedFileSystem object to create a file. This makes an RPC call to the namenode, which creates a new file entry in the namespace. The namenode checks the client's permissions and whether the file already exists; if the checks pass, DistributedFileSystem returns an output stream to the client, otherwise an IOException is thrown. The returned stream wraps a DFSOutputStream, which handles the communication with the datanodes and the namenode.
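To make step 1 concrete, here is a minimal sketch of creating a file through the FileSystem API (a sketch only; the namenode URI and the file path are made-up examples, not values from this article):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Placeholder URI: point fs.defaultFS at your own namenode.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem.get() returns a DistributedFileSystem for an hdfs:// URI.
        FileSystem fs = FileSystem.get(conf);

        // create() triggers the RPC to the namenode described in step 1; it throws
        // an IOException if, for example, the file already exists or the client
        // lacks permission.
        Path file = new Path("/user/demo/example.txt");
        try (FSDataOutputStream out = fs.create(file, false /* do not overwrite */)) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
```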

Step 2: The client starts writing data through the output stream. DFSOutputStream splits the data into packets and appends them to an internal data queue, which is consumed by the DataStreamer. The DataStreamer asks the namenode to allocate new blocks and to pick a list of suitable datanodes to hold the replicas. For example, with dfs.replication = 3, each block must be stored on three datanodes, and those three datanodes are connected in a pipeline: the DataStreamer sends each packet to the first datanode in the pipeline, which stores it and forwards it to the second datanode; likewise, the second datanode stores the packet and forwards it to the third datanode in the pipeline.

(I will not draw a flowchart here; the description above should be enough to picture it. A small code sketch follows instead.)
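The packet queue and the datanode pipeline live entirely inside the client library, so application code never sees them; it just writes bytes. A sketch of writing a file with three replicas per block, using the same made-up namenode URI and an invented path:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Ask for three replicas for this file; everything written below is split
        // into packets by DFSOutputStream and pushed through the three-datanode
        // pipeline behind the scenes.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            for (int i = 0; i < 1000; i++) {
                out.writeBytes("line " + i + "\n");
            }
        }
    }
}
```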

Step 3: Strictly speaking, this step belongs to step 2. The DFSOutputStream mentioned above also maintains an internal ack queue of packets waiting to be acknowledged by the datanodes; a packet is removed from the ack queue only after every datanode in the pipeline has confirmed that it stored its copy. At this point you may ask: if a datanode in the pipeline fails while the data is being replicated, how does Hadoop handle it? This is where Hadoop's fault tolerance shows its strength:

First, the pipeline is closed and all packets in the ack queue are added back to the front of the data queue, so that no packet is lost and the order of packets is preserved.

Next, the current block on the healthy datanodes is given a new identity, and the namenode is told about the failed node, so that the incomplete replica on the failed datanode can be deleted once that node recovers.

Finally, the failed datanode is removed from the pipeline and the remaining packets are written to the healthy nodes. The namenode notices that the block is under-replicated (dfs.replication = 3) and arranges for an additional replica to be created on another datanode.
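Since the namenode tracks which datanodes hold each replica, you can ask it where a file's blocks ended up after a write (and after any re-replication). A small sketch, reusing the invented path from the earlier example:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Ask the namenode which datanodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + Arrays.toString(block.getHosts()));
        }
    }
}
```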

Now a question arises: what happens if datanodes fail on a large scale during the write?

In practice this rarely happens, but Hadoop is prepared for it anyway. There is a configuration option, dfs.replication.min, whose default value is 1: as long as the data is successfully written to at least that many datanodes, HDFS considers the write successful, and the block is later replicated asynchronously in the background until the target replication factor (dfs.replication) is reached.
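Related to this, the replication factor is not fixed at write time: it can be changed afterwards, and the namenode will schedule the extra copies (or removal of excess copies) in the background, just as it does when re-replicating after a failure. A hedged sketch of raising the replication of an existing file (path and target value are made up):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSetReplication {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Request five replicas for this file; the namenode sees that the file is
        // now under-replicated and creates the missing copies asynchronously.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```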

Finally, to pick up where we left off: when the client finishes writing, it calls the close() method on the output stream. This has a neat effect: it pushes all remaining packets from the data queue into the ack queue, waits for their acknowledgments, and then notifies the namenode that the file is complete. The namenode already has a record of which datanodes hold the replicas of each block.
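For completeness, here is a sketch of that last step from the client's point of view. The hflush() call is optional (available on FSDataOutputStream in Hadoop 2 and later) and pushes buffered packets to the datanodes before close(); everything else is as described above, with the usual made-up path:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCloseExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/close-demo.txt"), true);
        out.writeBytes("some data\n");

        // hflush() forces the buffered packets out to the datanode pipeline so
        // readers can see the data, without waiting for close().
        out.hflush();

        // close() sends the remaining packets, waits for their acknowledgments,
        // and then tells the namenode that the file is complete.
        out.close();
    }
}
```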

That is the theory; I hope this plain-language walkthrough makes it easier to digest.

