How to ensure that no data is lost when processing big data

On 7 August this year, power rationing in Hangzhou caused a machine failure in Alibaba's Yuhang data center, and as a result some data in the HDFS cluster was lost.

Prior to Hadoop 2.0.2-alpha, HDFS could lose data that was being written if a machine lost power or crashed unexpectedly. In the recently released CDH4, HDFS provides an hsync() method on the client side (HDFS-744) to ensure that data is not lost if a machine crashes or loses power unexpectedly. This article briefly analyzes the implementation details behind the new interface, in the hope of finding a strategy for using hsync() appropriately to avoid losing critical data.

The difference between sync(), hflush() and hsync() in HDFS

Before hsync() was introduced, HDFS already provided sync() and hflush(), and the method names alone make it hard to tell the three apart. Let's start with the differences between them.

In HDFS, calling hflush() pushes the data held in the client-side buffer to the DataNodes, and the call does not return until ack responses have been received from all of them. This ensures that once hflush() returns, every client reads consistent data. In HDFS, sync() is essentially just a call to hflush().

hsync() ensures not only that the data in the client-side buffer is pushed to the DataNodes, but also that the data on the DataNode side is written to the physical disk. After hsync() returns, the data is not lost even if the machine hosting the DataNode loses power unexpectedly. With hflush(), data may still be lost on an unexpected power failure, because the data the client sent to the DataNode may exist only in the DataNode's cache and not yet be persisted to disk. The following figure describes the flow of packets in HDFS after the client issues a write request.
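The distinction above can be illustrated with a small toy model. This is only a sketch and not the Hadoop API: the class DurabilityModel and its methods write(), visibleToReaders() and powerLoss() are hypothetical names invented for illustration, and the single in-memory "DataNode" stands in for a whole write pipeline. It shows only the guarantee described in the text: data moved by hflush() is readable but sits in cache, while data moved by hsync() is on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the HDFS write path described above; NOT the real Hadoop API. */
class DurabilityModel {
    private final List<String> clientBuffer = new ArrayList<>();  // not yet sent
    private final List<String> dataNodeCache = new ArrayList<>(); // lost on power failure
    private final List<String> dataNodeDisk = new ArrayList<>();  // survives power failure

    /** Buffers a record on the client side; nothing has left the client yet. */
    void write(String record) {
        clientBuffer.add(record);
    }

    /** Models hflush(): the client buffer reaches the DataNode's memory only. */
    void hflush() {
        dataNodeCache.addAll(clientBuffer);
        clientBuffer.clear();
    }

    /** Models hsync(): an hflush() followed by forcing the cached data to disk. */
    void hsync() {
        hflush();
        dataNodeDisk.addAll(dataNodeCache);
        dataNodeCache.clear();
    }

    /** What readers can see while the DataNode is running: disk plus cache. */
    List<String> visibleToReaders() {
        List<String> all = new ArrayList<>(dataNodeDisk);
        all.addAll(dataNodeCache);
        return all;
    }

    /** Models a sudden power loss on the DataNode; returns what survives. */
    List<String> powerLoss() {
        dataNodeCache.clear();
        return new ArrayList<>(dataNodeDisk);
    }

    public static void main(String[] args) {
        DurabilityModel m = new DurabilityModel();
        m.write("a");
        m.hsync();   // "a" is on disk
        m.write("b");
        m.hflush();  // "b" is visible but only cached
        System.out.println(m.visibleToReaders()); // prints [a, b]
        System.out.println(m.powerLoss());        // prints [a]
    }
}
```

In this model, record "b" is readable after hflush() but disappears on power loss, which is exactly the gap that hsync() is meant to close.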

How hsync() is implemented

When hsync() executes, it actually triggers an fsync system call on the corresponding DataNode machine, which writes the in-memory data of the relevant file to disk.

When the client executes hsync(), the DataNode sees that the syncBlock_ field in the packet sent by the client is true, which tells it that the data in memory must be written to disk. The following statement is then executed in flushOrSync() of BlockReceiver.java:

((FileOutputStream) cout).getChannel().force(true);

In the JDK, the FileChannel method force(boolean metaData) ultimately calls fsync or fdatasync in FileDispatcherImpl.c. It executes fsync when metaData is true and fdatasync when it is false.

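JNIEXPORT jint JNICALL
Java_sun_nio_ch_FileDispatcherImpl_force0(JNIEnv *env, jobject this,
                                          jobject fdo, jboolean md)
{
    jint fd = fdval(env, fdo);
    int result = 0;

    /* md is the metaData flag passed to FileChannel.force() */
    if (md == JNI_FALSE) {
        result = fdatasync(fd);  /* flush file data only */
    } else {
        result = fsync(fd);      /* flush file data and metadata */
    }
    return handle(env, result, "Force failed");
}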
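The same FileChannel.force() call can be tried on a local file with nothing but the JDK. The sketch below is not HDFS code: the class ForceDemo, the helper writeDurably() and the temporary file name are hypothetical and exist only for illustration. It mirrors what the DataNode does in flushOrSync(): write the bytes, then call force() so the call returns only after the operating system reports the data as written to the storage device.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Local-file illustration of FileChannel.force(); hypothetical helper, not HDFS code. */
class ForceDemo {
    /**
     * Writes data to the given path and forces it to the storage device.
     * metaData = true also forces file metadata (the fsync branch in the
     * JDK source above); false forces file content only (the fdatasync branch).
     */
    static void writeDurably(Path path, byte[] data, boolean metaData) throws IOException {
        try (FileOutputStream out = new FileOutputStream(path.toFile())) {
            out.write(data);                  // may still sit in the OS page cache
            out.getChannel().force(metaData); // blocks until the OS reports it written
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("force-demo", ".bin"); // hypothetical file name
        try {
            writeDurably(tmp, "critical record\n".getBytes(StandardCharsets.UTF_8), true);
            System.out.println("bytes on disk: " + Files.size(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because force() waits for the device, calling it after every small write is expensive. That cost is why forcing data to disk is best reserved for data that must not be lost.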
