How to ensure that no data is lost when processing large data

Last Update:2017-02-27 Source: Internet

Author: User

On August 7 this year, power rationing in Hangzhou caused a machine failure in Alibaba's Yuhang data center, and the HDFS cluster there lost some data.

Prior to Hadoop 2.0.2-alpha, HDFS could lose data that was being written if a machine went down or crashed unexpectedly. In the recently released CDH4, HDFS provides an hsync() method (HDFS-744) on the client side to ensure that data is not lost if a machine crashes or loses power unexpectedly. This article briefly analyzes the implementation behind the new interface, in the hope of finding a strategy for using hsync() appropriately to avoid losing critical data.

The difference between sync(), hflush() and hsync() in HDFS

Before hsync() was added, HDFS already provided sync() and hflush(), and the three methods are hard to tell apart by name alone. Let's start with how they differ.

In HDFS, calling hflush() pushes the data held in the client-side buffer to the DataNodes, and the call returns only after every DataNode has acknowledged it. This guarantees that once hflush() returns, all clients read consistent data. The sync() method in HDFS is essentially a call to hflush().

hsync() guarantees that the data held in the client-side buffer is sent to the DataNodes and that the DataNodes write it to physical disk, so once hsync() returns, the data survives even if a DataNode machine loses power unexpectedly. With hflush(), data can still be lost on an unexpected power failure, because the data the client sent may still sit in the DataNode's cache without having been persisted to disk.

[Figure: flow of packets through HDFS after a write request from the client.]
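
The gap between "visible to readers" and "persisted to disk" can be illustrated with a local-file analogy that uses only the JDK. This is not HDFS code: flushing a BufferedOutputStream stands in for hflush() (the data leaves the writer's buffer and other readers can see it), and FileChannel.force(true) stands in for hsync() (the operating system is asked to push the data to the storage device). The class name FlushVersusSync and the method observeSizes are hypothetical names chosen for this sketch.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Local-file analogy only; the names here are hypothetical, not part of HDFS.
class FlushVersusSync {

    /**
     * Writes a payload (assumed smaller than the stream's default buffer) and
     * returns the file size a reader observes after write(), after flush(),
     * and after force(true).
     */
    static long[] observeSizes(Path file, byte[] payload) throws IOException {
        long[] sizes = new long[3];
        try (FileOutputStream fos = new FileOutputStream(file.toFile());
             BufferedOutputStream out = new BufferedOutputStream(fos)) {
            out.write(payload);            // data still sits in the writer's buffer
            sizes[0] = Files.size(file);

            out.flush();                   // analogous to hflush(): readers can see it
            sizes[1] = Files.size(file);

            fos.getChannel().force(true);  // analogous to hsync(): ask the OS to persist it
            sizes[2] = Files.size(file);
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("flush-vs-sync", ".dat");
        try {
            long[] sizes = observeSizes(file, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println("after write: " + sizes[0]);
            System.out.println("after flush: " + sizes[1]);
            System.out.println("after force: " + sizes[2]);
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

With a payload smaller than the stream's buffer, the first observed size is 0 and the other two equal the payload length. The force(true) step does not change what a reader sees; it only changes what survives a power failure, which is the same distinction the text draws between hflush() and hsync().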

How hsync() is implemented

When hsync() executes, it triggers an fsync system call on the corresponding DataNode machine, which writes the in-memory data of the affected files to disk.

When the client calls hsync(), the DataNode sees that the syncBlock_ field in the packet sent by the client is true, which tells it that the data in memory must be written to disk. The following statement is then executed in flushOrSync() in BlockReceiver.java:

((FileOutputStream) cout).getChannel().force(true);
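
The DataNode-side decision can be sketched with plain JDK classes. This is a simplified model, not the Hadoop source: the class PacketHandler, its handlePacket method and the syncCount counter are hypothetical, and only the syncBlock flag, the flushOrSync name and the force(true) call are taken from the description above.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified, hypothetical model of a receiver that syncs only when asked to.
class PacketHandler implements AutoCloseable {
    private final OutputStream cout;  // block file stream, named as in the statement above
    private int syncCount = 0;        // number of force(true) calls issued so far

    PacketHandler(Path blockFile) throws IOException {
        this.cout = new FileOutputStream(blockFile.toFile());
    }

    /** Writes one packet's data, then flushes or syncs depending on the flag. */
    void handlePacket(byte[] data, boolean syncBlock) throws IOException {
        cout.write(data);
        flushOrSync(syncBlock);
    }

    private void flushOrSync(boolean isSync) throws IOException {
        cout.flush();  // always hand the bytes to the operating system
        if (isSync && cout instanceof FileOutputStream) {
            // Only a sync request pays for the round trip to the disk.
            ((FileOutputStream) cout).getChannel().force(true);
            syncCount++;
        }
    }

    int syncCount() {
        return syncCount;
    }

    @Override
    public void close() throws IOException {
        cout.close();
    }

    public static void main(String[] args) throws IOException {
        Path blockFile = Files.createTempFile("blk", ".dat");
        try (PacketHandler handler = new PacketHandler(blockFile)) {
            handler.handlePacket(new byte[] {1, 2}, false);  // hflush-style packet
            handler.handlePacket(new byte[] {3}, true);      // hsync-style packet
            System.out.println("syncs issued: " + handler.syncCount());
        } finally {
            Files.deleteIfExists(blockFile);
        }
    }
}
```

Keeping the sync conditional on the flag reflects the trade-off the article is concerned with: every packet is flushed, but the costlier disk sync happens only when the client explicitly requests it.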

In the JDK, the FileChannel method force(boolean metaData) ultimately calls fsync or fdatasync in FileDispatcherImpl.c. It calls fsync when metaData is true and fdatasync when it is false.

JNIEXPORT jint JNICALL
Java_sun_nio_ch_FileDispatcherImpl_force0(JNIEnv *env, jobject this,
                                          jobject fdo, jboolean md)
{
    jint fd = fdval(env, fdo);
    int result = 0;

    if (md == JNI_FALSE) {
        /* metaData == false: flush file data only */
        result = fdatasync(fd);
    } else {
        /* metaData == true: flush file data and metadata */
        result = fsync(fd);
    }
    return handle(env, result, "Force failed");
}
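
From the Java side, the choice between the two system calls comes down to the boolean passed to force. The following is a minimal sketch on a local file, assuming a hypothetical helper named writeDurably in a hypothetical class DurableWrite; whether the two settings behave differently in practice depends on the operating system and file system.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class for illustration; not part of HDFS or the JDK.
class DurableWrite {

    /**
     * Writes the payload to the file and returns only after force() has completed.
     * syncMetadata == true corresponds to the fsync branch shown above,
     * syncMetadata == false to the fdatasync branch.
     */
    static void writeDurably(Path file, byte[] payload, boolean syncMetadata)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                ch.write(buf);  // a single write() may not consume the whole buffer
            }
            ch.force(syncMetadata);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("durable", ".dat");
        try {
            byte[] record = "critical record".getBytes(StandardCharsets.UTF_8);
            writeDurably(file, record, true);   // data and metadata
            writeDurably(file, record, false);  // data only
            System.out.println("bytes on file: " + Files.size(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The statement quoted from BlockReceiver.java passes true, so under the mapping described above an hsync() on the DataNode takes the fsync branch.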

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
