Modifying the Flume-NG HDFS sink's timestamp-parsing source code to greatly improve write performance


Transferred from: http://www.cnblogs.com/lxf20061900/p/4014281.html

In Flume-NG's HDFS sink, the path name (the "hdfs.path" parameter, which must not be empty) and the file-name prefix (the "hdfs.filePrefix" parameter) support escape sequences that are resolved against the event timestamp, so directories and file prefixes can be created automatically by time.
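For reference, a minimal sink configuration using these built-in escape sequences might look like the following sketch (agent, sink, and host names are illustrative, not from the original article):

```properties
# illustrative agent/sink names; %Y, %m, %d, %H, %M are Flume's built-in
# escape sequences, resolved against the event's "timestamp" header
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/data/flume/%Y-%m-%d
a1.sinks.k1.hdfs.filePrefix = %H-%M
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```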

In practice, Flume's built-in escape-sequence parsing turns out to be quite time-consuming, with considerable room for improvement. (If you do not configure any timestamp escapes, this article will not be of much use to you.) The HDFS sink parses the timestamp in the process() method of org.apache.flume.sink.hdfs.HDFSEventSink, in these two lines of code:

// reconstruct the path name by substituting place holders
String realPath = BucketPath.escapeString(filePath, event.getHeaders(),
        timeZone, needRounding, roundUnit, roundValue, useLocalTime);
String realName = BucketPath.escapeString(fileName, event.getHeaders(),
        timeZone, needRounding, roundUnit, roundValue, useLocalTime);

Here realPath is the full path name after the timestamp escapes have been resolved, and the filePath argument is "hdfs.path" from the configuration file; likewise realName is the file-name prefix after escape resolution, and the fileName argument is "hdfs.filePrefix". The other arguments are the same for both calls: event.getHeaders() is a map that must contain a "timestamp" entry (which can be set in one of three ways: by an interceptor, by custom code, or via the sink's "hdfs.useLocalTimeStamp" parameter), and the remaining arguments are the time zone, whether to round, and the rounding unit and value.

BucketPath.escapeString() is the method that resolves the timestamp escapes. We will not analyze its internals here; instead, let's write a program to measure its performance. The test class can be run from inside the source tree:

public class Test {
    public static void main(String[] args) {
        HashMap<String, String> headers = new HashMap<String, String>();
        headers.put("timestamp", Long.toString(System.currentTimeMillis()));
        String filePath = "hdfs://xxxx.com:8020/data/flume/%Y-%m-%d";
        String fileName = "%H-%M";
        long start = System.currentTimeMillis();
        System.out.println("Start time is: " + start);
        for (int i = 0; i < 2400000; i++) {
            String realPath = BucketPath.escapeString(filePath, headers, null, false, Calendar.SECOND, 1, false);
            String realName = BucketPath.escapeString(fileName, headers, null, false, Calendar.SECOND, 1, false);
        }
        long end = System.currentTimeMillis();
        System.out.println("End time is: " + end + ".\nTotal time is: " + (end - start) + " ms.");
    }
}

The last five parameters of this method are generally not needed, so the values passed here have no real effect; the headers map just needs a "timestamp" entry. Looping over 2.4 million events gives the following result:

Start time is: 1412853253889
End time is: 1412853278210.
Total time is: 24321 ms.

More than 24 seconds. Bear in mind that my current traffic is about 40,000 events per second, and that is not even at peak. With full escape parsing on every event this simply cannot keep up. What to do?

The only option is to find a cheaper replacement for this parsing. Here is what I came up with; see the test program:

public class Test {
    private static SimpleDateFormat sdfYMD = null;
    private static SimpleDateFormat sdfHM = null;

    public static void main(String[] args) {
        sdfYMD = new SimpleDateFormat("yyyy-MM-dd");
        sdfHM = new SimpleDateFormat("HH-mm");
        HashMap<String, String> headers = new HashMap<String, String>();
        headers.put("timestamp", Long.toString(System.currentTimeMillis()));
        String filePath = "hdfs://dm056.tj.momo.com:8020/data/flume/%Y-%m-%d";
        String fileName = "%H-%M";
        long start = System.currentTimeMillis();
        System.out.println("Start time is: " + start);
        for (int i = 0; i < 2400000; i++) {
            //String realPath = BucketPath.escapeString(filePath, headers, null, false, Calendar.SECOND, 1, false);
            //String realName = BucketPath.escapeString(fileName, headers, null, false, Calendar.SECOND, 1, false);
            String realPath = getTime("yyyy-MM-dd", Long.parseLong(headers.get("timestamp")));
            String realName = getTime("HH-mm", Long.parseLong(headers.get("timestamp")));
        }
        long end = System.currentTimeMillis();
        System.out.println("End time is: " + end + ".\nTotal time is: " + (end - start) + " ms.");
    }

    public static String getTime(String format, long timestamp) {
        String time = "";
        if (format.equals("HH-mm"))
            time = sdfHM.format(timestamp);
        else if (format.equals("yyyy-MM-dd"))
            time = sdfYMD.format(timestamp);
        return time;
    }
}

Here we use Java's own SimpleDateFormat to do the formatting for one fixed pattern at a time, which means the whole path or name can no longer be passed in as a single template. The result:

Start time is: 1412853670246
End time is: 1412853672204.
Total time is: 1958 ms.

Unbelievable: under 2 seconds. (This was tested on my MacBook Pro: i5, 8 GB RAM, 128 GB SSD.) What is there to hesitate about?

Time to start changing the source...

We should make the parsing format configurable, and ideally keep the original prefix-name mechanism, since it may carry things like the host name. So the old prefix becomes an infix, and the parsed timestamp becomes the actual prefix.

1. We need two SimpleDateFormat objects, one for the path format and one for the name format, instantiated once at configuration time, plus the two format strings themselves. These could be global or local; here they are declared as fields (not strictly necessary, but convenient). The declarations:

    private SimpleDateFormat sdfPath = null;    // for the file path in HDFS
    private SimpleDateFormat sdfName = null;    // for the file name prefix
    private String filePathFormat;
    private String fileNameFormat;

2. The configure(Context context) method needs to initialize the objects above. This is straightforward; the relevant code is:

        filePath = Preconditions.checkNotNull(
                context.getString("hdfs.path"), "hdfs.path is required");
        filePathFormat = context.getString("hdfs.path.format", "yyyy/MM/dd");   // time format, e.g. "yyyy-MM-dd"
        sdfPath = new SimpleDateFormat(filePathFormat);
        fileName = context.getString("hdfs.filePrefix", defaultFileName);
        fileNameFormat = context.getString("hdfs.filePrefix.format", "HHmm");
        sdfName = new SimpleDateFormat(fileNameFormat);

The new lines are those that read "hdfs.path.format" and "hdfs.filePrefix.format" and instantiate the two SimpleDateFormat objects. The format strings now live only in these two options; no timestamp format strings appear anywhere else, and the built-in %H, %M escapes are no longer used. Both options have default format strings, which you can change as you see fit.
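With the patch applied, a sink would be configured with plain Java date patterns instead of the built-in escapes. A sketch, using the new option names introduced above (agent and sink names are illustrative):

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/data/flume
a1.sinks.k1.hdfs.path.format = yyyy/MM/dd
a1.sinks.k1.hdfs.filePrefix = %host
a1.sinks.k1.hdfs.filePrefix.format = HHmm
```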

3. Add a method to resolve the timestamp:

    public String getTime(String type, long timestamp) {
        String time = "";
        if (type.equals("name"))
            time = sdfName.format(timestamp);
        else if (type.equals("path"))
            time = sdfPath.format(timestamp);
        return time;
    }

The type parameter selects whether the file-name formatter or the path formatter is used.
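One caveat worth noting: SimpleDateFormat is not thread-safe. A Flume sink's process() is normally driven by a single SinkRunner thread, so the shared instances above should be fine in practice, but if the formatters were ever shared across threads, a ThreadLocal wrapper is the usual defensive pattern. A sketch, not part of the original patch (class and method names are made up for illustration):

```java
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class SafeFormats {
    // one SimpleDateFormat per thread, created lazily on first use
    private static final ThreadLocal<SimpleDateFormat> PATH_FMT =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    return new SimpleDateFormat("yyyy-MM-dd");
                }
            };

    public static String formatPath(long timestamp) {
        return PATH_FMT.get().format(timestamp);
    }

    public static void main(String[] args) {
        // fixed timestamp (the one from the benchmark above) and a fixed
        // UTC zone so the printed result is deterministic
        SimpleDateFormat f = PATH_FMT.get();
        f.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(f.format(1412853253889L)); // prints 2014-10-09
    }
}
```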

4. Now for the key part: without this change, none of the steps above has any effect. In the process() method, replace the two lines quoted at the beginning with the following code:

String realPath = filePath;
String realName = fileName;
if (realName.equals("%host") && event.getHeaders().get("host") != null)
    realName = event.getHeaders().get("host").toString();
if (event.getHeaders().get("timestamp") != null) {
    long time = Long.parseLong(event.getHeaders().get("timestamp"));
    realPath += DIRECTORY_DELIMITER + getTime("path", time);
    realName = getTime("name", time) + "." + realName;
}

The logic in these few lines: (a) the infix is configurable ("hdfs.filePrefix" can be a constant or "%host"; the latter resolves to the host name, provided you set up a host interceptor); (b) the default infix is the default "FlumeData"; (c) if the headers contain a timestamp, the getTime() method resolves it.
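To make the substitution concrete, here is a standalone sketch of the path/name construction above, using the benchmark's timestamp and a fixed UTC zone so the result is deterministic (the sink itself uses the JVM's default time zone; the base path and host value are invented for illustration):

```java
import java.text.SimpleDateFormat;
import java.util.HashMap;
import java.util.TimeZone;

public class PathNameDemo {
    private static final String DIRECTORY_DELIMITER = "/"; // stands in for the sink's constant

    public static void main(String[] args) {
        // the two formatters configured via hdfs.path.format / hdfs.filePrefix.format
        SimpleDateFormat sdfPath = new SimpleDateFormat("yyyy/MM/dd");
        SimpleDateFormat sdfName = new SimpleDateFormat("HHmm");
        sdfPath.setTimeZone(TimeZone.getTimeZone("UTC"));
        sdfName.setTimeZone(TimeZone.getTimeZone("UTC"));

        HashMap<String, String> headers = new HashMap<String, String>();
        headers.put("timestamp", "1412853253889");
        headers.put("host", "dm056");

        String realPath = "hdfs://namenode:8020/data/flume"; // illustrative hdfs.path
        String realName = "%host";                           // illustrative hdfs.filePrefix
        if (realName.equals("%host") && headers.get("host") != null)
            realName = headers.get("host");
        if (headers.get("timestamp") != null) {
            long time = Long.parseLong(headers.get("timestamp"));
            realPath += DIRECTORY_DELIMITER + sdfPath.format(time);
            realName = sdfName.format(time) + "." + realName;
        }
        System.out.println(realPath); // hdfs://namenode:8020/data/flume/2014/10/09
        System.out.println(realName); // 1114.dm056
    }
}
```

So the parsed timestamp ends up as the real prefix, and the configured prefix (here the host name) becomes the infix, as described above.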

5. Compile, package, replace, run...

For packaging: since only one class changed, I simply replaced the compiled class files beginning with HDFSEventSink in the original flume-hdfs-sink jar with the new ones... crude, I know... or just rebuild with Maven and be done with it.

My test results: with compression disabled, the performance improvement is more than 70%; with gzip compression enabled, it is more than 50%. These numbers are for reference only; different environments and different hosts will vary.

Looking forward to your test results...

