Analysis of HDFS file writing principles in Hadoop

Source: Internet
Author: User

To prepare for the upcoming Big Data era, the following plain-language notes briefly record what HDFS does when a file is written in Hadoop, as a reference for future cluster troubleshooting.

On to the main topic

The process of creating a new file:

Step 1: The client calls the create() method on a DistributedFileSystem object to create a file. This makes an RPC call to the namenode, which creates a new file entry in the namespace. The namenode checks the client's permissions and whether the file already exists; if the checks pass, DistributedFileSystem returns an output stream to the client, otherwise an IOException is thrown. The returned stream wraps a DFSOutputStream, which handles the communication with the datanodes and the namenode.
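To make step 1 concrete, here is a minimal sketch of creating a file through the FileSystem API (a sketch only; the namenode URI and the file path are made-up examples, not values from this article):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Placeholder URI: point fs.defaultFS at your own namenode.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        // FileSystem.get() returns a DistributedFileSystem for an hdfs:// URI.
        FileSystem fs = FileSystem.get(conf);

        // create() triggers the RPC to the namenode described in step 1; it throws
        // an IOException if, for example, the file already exists or the client
        // lacks permission.
        Path file = new Path("/user/demo/example.txt");
        try (FSDataOutputStream out = fs.create(file, false /* do not overwrite */)) {
            out.writeBytes("hello hdfs\n");
        }
    }
}
```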

Step 2: The client starts writing data through the output stream. DFSOutputStream splits the data into packets and appends them to an internal data queue, which is consumed by the DataStreamer. The DataStreamer asks the namenode to allocate new blocks and to pick a list of suitable datanodes to hold the replicas. For example, with dfs.replication = 3, each block must be stored on three datanodes, and those three datanodes are connected in a pipeline: the DataStreamer sends each packet to the first datanode in the pipeline, which stores it and forwards it to the second datanode; likewise, the second datanode stores the packet and forwards it to the third datanode in the pipeline.

(I will not draw a flowchart here; the description above should be enough to picture it. A small code sketch follows instead.)
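The packet queue and the datanode pipeline live entirely inside the client library, so application code never sees them; it just writes bytes. A sketch of writing a file with three replicas per block, using the same made-up namenode URI and an invented path:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Ask for three replicas for this file; everything written below is split
        // into packets by DFSOutputStream and pushed through the three-datanode
        // pipeline behind the scenes.
        try (FSDataOutputStream out = fs.create(file, (short) 3)) {
            for (int i = 0; i < 1000; i++) {
                out.writeBytes("line " + i + "\n");
            }
        }
    }
}
```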

Step 3: Strictly speaking, this step belongs to step 2. The DFSOutputStream mentioned above also maintains an internal ack queue of packets waiting to be acknowledged by the datanodes; a packet is removed from the ack queue only after every datanode in the pipeline has confirmed that it stored its copy. At this point you may ask: if a datanode in the pipeline fails while the data is being replicated, how does Hadoop handle it? This is where Hadoop's fault tolerance shows its strength:

First, the pipeline is closed and all packets in the ack queue are added back to the front of the data queue, so that no packet is lost and the order of packets is preserved.

Next, the current block on the healthy datanodes is given a new identity, and the namenode is told about the failed node, so that the incomplete replica on the failed datanode can be deleted once that node recovers.

Finally, the failed datanode is removed from the pipeline and the remaining packets are written to the healthy nodes. The namenode notices that the block is under-replicated (dfs.replication = 3) and arranges for an additional replica to be created on another datanode.
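Since the namenode tracks which datanodes hold each replica, you can ask it where a file's blocks ended up after a write (and after any re-replication). A small sketch, reusing the invented path from the earlier example:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocations {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Ask the namenode which datanodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + Arrays.toString(block.getHosts()));
        }
    }
}
```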

Now a question arises: what happens if datanodes fail on a large scale during the write?

In practice this rarely happens, but Hadoop is prepared for it anyway. There is a configuration option, dfs.replication.min, whose default value is 1: as long as the data is successfully written to at least that many datanodes, HDFS considers the write successful, and the block is later replicated asynchronously in the background until the target replication factor (dfs.replication) is reached.
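Related to this, the replication factor is not fixed at write time: it can be changed afterwards, and the namenode will schedule the extra copies (or removal of excess copies) in the background, just as it does when re-replicating after a failure. A hedged sketch of raising the replication of an existing file (path and target value are made up):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSetReplication {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/pipeline-demo.txt");

        // Request five replicas for this file; the namenode sees that the file is
        // now under-replicated and creates the missing copies asynchronously.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
    }
}
```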

Finally, to pick up where we left off: when the client finishes writing, it calls the close() method on the output stream. This has a neat effect: it pushes all remaining packets from the data queue into the ack queue, waits for their acknowledgments, and then notifies the namenode that the file is complete. The namenode already has a record of which datanodes hold the replicas of each block.
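For completeness, here is a sketch of that last step from the client's point of view. The hflush() call is optional (available on FSDataOutputStream in Hadoop 2 and later) and pushes buffered packets to the datanodes before close(); everything else is as described above, with the usual made-up path:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCloseExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder URI

        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/demo/close-demo.txt"), true);
        out.writeBytes("some data\n");

        // hflush() forces the buffered packets out to the datanode pipeline so
        // readers can see the data, without waiting for close().
        out.hflush();

        // close() sends the remaining packets, waits for their acknowledgments,
        // and then tells the namenode that the file is complete.
        out.close();
    }
}
```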

That is the theory; I hope this plain-language walkthrough makes it easier to digest.

