Analysis of the specific write flow of the HDFS sink


The previous article covered the implementation of HDFSEventSink. Here, starting from the HDFS sink configuration and following the call chain, we walk through the sink's entire HDFS data write process.
Several important settings for the HDFS sink in production:

hdfs.path = hdfs://xxxxx/%{logtypename}/%Y%m%d/%H
hdfs.rollInterval = 60
hdfs.rollSize = 0       // we want the file to roll by time only,
hdfs.rollCount = 0      // so size- and count-based rolling are disabled
hdfs.batchSize = 2000
hdfs.txnEventMax = 2000
hdfs.fileType = DataStream
hdfs.writeFormat = Text

hdfs.fileType and hdfs.writeFormat each correspond to a class: one defines the type of file stream, the other defines how the data is serialized.
1) hdfs.fileType has 3 options: SequenceFile / DataStream / CompressedStream. DataStream can be thought of as an HDFS text file; the default is SequenceFile; CompressedStream is used when compression is configured.
2) hdfs.writeFormat defines 3 serialization formats: TEXT writes only the event body, HEADER_AND_TEXT writes the event body and header, and AVRO_EVENT uses Avro serialization.

With the settings above, the data write call chain is roughly as follows:

SinkRunner.process -> SinkProcessor.process -> HDFSEventSink.process -> HDFSEventSink.append -> BucketWriter.append -> HDFSWriter.append -> HDFSDataStream.append -> BodyTextEventSerializer.write -> java.io.OutputStream.write

Simply put:
BucketWriter and HDFSWriter are instantiated in HDFSEventSink:

if (bucketWriter == null) {
  HDFSWriter hdfsWriter = writerFactory.getWriter(fileType);  // get the HDFSWriter object
  ...
  bucketWriter = new BucketWriter(rollInterval, rollSize, rollCount,
      batchSize, context, realPath, realName, inUsePrefix, inUseSuffix,
      suffix, codeC, compType, hdfsWriter, timedRollerPool,
      proxyTicket, sinkCounter, idleTimeout, idleCallback, lookupPath);  // build the BucketWriter from the HDFSWriter object
}

The HDFSWriter object is obtained here via the getWriter method of org.apache.flume.sink.hdfs.HDFSWriterFactory, which returns an instance of the concrete org.apache.flume.sink.hdfs.HDFSWriter implementation class based on the hdfs.fileType setting.
Currently only 3 types are supported:

static final String SequenceFileType = "SequenceFile";
static final String DataStreamType = "DataStream";
static final String CompStreamType = "CompressedStream";
...
public HDFSWriter getWriter(String fileType) throws IOException {
  if (fileType.equalsIgnoreCase(SequenceFileType)) {        // SequenceFile
    return new HDFSSequenceFile();
  } else if (fileType.equalsIgnoreCase(DataStreamType)) {   // DataStream
    return new HDFSDataStream();
  } else if (fileType.equalsIgnoreCase(CompStreamType)) {   // CompressedStream
    return new HDFSCompressedDataStream();
  } else {
    throw new IOException("File type " + fileType + " not supported");
  }
}

BucketWriter can be understood as an encapsulation of the underlying data operations; the data is actually written via a call to its append method. append has the following main steps:
1) First check whether the file is open:

if (!isOpen) {
  if (idleClosed) {
    throw new IOException("This bucket writer was closed due to idling and this handle " +
        "is thus no longer valid");
  }
  open();  // if not open, call open -> doOpen -> HDFSWriter.open to open bucketPath
           // (bucketPath is the temporary in-use path, e.g. the file ending in .tmp;
           //  targetPath is the final path)
}

The main steps of doOpen:
A. Set the two file names:

bucketPath = filePath + DIRECTORY_DELIMITER + inUsePrefix + fullFileName + inUseSuffix;
targetPath = filePath + DIRECTORY_DELIMITER + fullFileName;
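As a worked example (the values below are hypothetical, not taken from the Flume source): with the default hdfs.inUsePrefix (empty) and hdfs.inUseSuffix (".tmp"), the two names differ only by the in-use markers, which is why the file shows up with a .tmp suffix while it is still being written:

public class BucketPathExample {
    public static void main(String[] args) {
        String DIRECTORY_DELIMITER = "/";
        String filePath = "hdfs://xxxxx/mylog/20150212/10";  // hypothetical hdfs.path after escape expansion
        String fullFileName = "FlumeData.1423735200000";     // hypothetical prefix + timestamp
        String inUsePrefix = "";                              // hdfs.inUsePrefix, empty by default
        String inUseSuffix = ".tmp";                          // hdfs.inUseSuffix, ".tmp" by default

        String bucketPath = filePath + DIRECTORY_DELIMITER + inUsePrefix
                + fullFileName + inUseSuffix;
        String targetPath = filePath + DIRECTORY_DELIMITER + fullFileName;

        System.out.println(bucketPath);  // .../FlumeData.1423735200000.tmp  (written here)
        System.out.println(targetPath);  // .../FlumeData.1423735200000      (renamed to this on close)
    }
}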

B. Call the HDFSWriter.open method to open bucketPath:

if (codeC == null) {
  // need to get reference to FS using above config before underlying
  // writer does in order to avoid shutdown hook & IllegalStateExceptions
  fileSystem = new Path(bucketPath).getFileSystem(config);
  LOG.info("Creating " + bucketPath);
  writer.open(bucketPath);
} else {
  // need to get reference to FS before writer does to avoid shutdown hook
  fileSystem = new Path(bucketPath).getFileSystem(config);
  LOG.info("Creating " + bucketPath);
  writer.open(bucketPath, codeC, compType);
}

C. If rollInterval is set, schedule a timed task that calls the close method:

// if time-based rolling is enabled, schedule the roll
if (rollInterval > 0) {
  Callable<Void> action = new Callable<Void>() {
    public Void call() throws Exception {
      LOG.debug("Rolling file ({}): Roll scheduled after {} sec elapsed.",
          bucketPath, rollInterval);
      try {
        close();
      } catch (Throwable t) {
        LOG.error("Unexpected error", t);
      }
      return null;
    }
  };
  timedRollFuture = timedRollerPool.schedule(action, rollInterval,
      TimeUnit.SECONDS);
}
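As a standalone sketch of the same mechanism (not Flume code): timedRollerPool is a ScheduledExecutorService, so a time-based roll is simply a task scheduled rollInterval seconds in the future that closes, and thereby renames, the current file:

import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class TimedRollSketch {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService timedRollerPool = Executors.newScheduledThreadPool(1);
        long rollInterval = 2;  // seconds, stands in for hdfs.rollInterval

        Callable<Void> action = new Callable<Void>() {
            public Void call() {
                System.out.println("rolling file: close() would run here");
                return null;
            }
        };

        ScheduledFuture<Void> timedRollFuture =
                timedRollerPool.schedule(action, rollInterval, TimeUnit.SECONDS);
        timedRollFuture.get();       // wait for the roll to fire
        timedRollerPool.shutdown();
    }
}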

2) Determine whether the file needs to be rolled (i.e. whether the hdfs.rollSize or hdfs.rollCount threshold has been reached):

// check if it's time to rotate the file
if (shouldRotate()) {
  close();  // close calls flush + doClose; flush calls doFlush, which calls
            // HDFSWriter.sync to sync the data to HDFS
  open();
}

shouldRotate (the count-based and size-based roll modes) looks like this:

private boolean shouldRotate() {
  boolean doRotate = false;

  if ((rollCount > 0) && (rollCount <= eventCounter)) {
    // hdfs.rollCount is greater than 0 and the number of events processed has
    // reached hdfs.rollCount, so doRotate is set to true
    LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
    doRotate = true;
  }

  if ((rollSize > 0) && (rollSize <= processSize)) {
    // hdfs.rollSize is greater than 0 and the number of bytes processed has
    // reached hdfs.rollSize, so doRotate is set to true
    LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
    doRotate = true;
  }

  return doRotate;
}
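Note how this interacts with the settings at the top of the article: with hdfs.rollSize = 0 and hdfs.rollCount = 0 both branches are skipped, so the file is only ever rolled by the timed task. A quick standalone check of that predicate (a re-implementation for illustration, not the Flume method itself):

public class ShouldRotateSketch {
    // same predicate as above, with the counters passed in explicitly
    static boolean shouldRotate(long rollCount, long rollSize,
                                long eventCounter, long processSize) {
        boolean doRotate = false;
        if ((rollCount > 0) && (rollCount <= eventCounter)) {
            doRotate = true;
        }
        if ((rollSize > 0) && (rollSize <= processSize)) {
            doRotate = true;
        }
        return doRotate;
    }

    public static void main(String[] args) {
        // production settings: rollCount = 0, rollSize = 0 -> never rotates here
        System.out.println(shouldRotate(0, 0, 1000000, 1000000000L));  // false
        // count-based rolling: rollCount = 2000 and 2000 events processed -> rotates
        System.out.println(shouldRotate(2000, 0, 2000, 0));            // true
    }
}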

The main steps of doClose:
A. Call the HDFSWriter.close method.
B. Call the renameBucket method to rename the .tmp file to its final name:

if (bucketPath != null && fileSystem != null) {
  renameBucket();  // could block or throw IOException
  fileSystem = null;
}

Inside renameBucket:

fileSystem.rename(srcPath, dstPath)
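renameBucket therefore boils down to a Hadoop FileSystem rename. A minimal sketch of the same operation using the Hadoop API directly (the paths here are hypothetical examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path srcPath = new Path("hdfs://xxxxx/mylog/20150212/10/FlumeData.1423735200000.tmp");
        Path dstPath = new Path("hdfs://xxxxx/mylog/20150212/10/FlumeData.1423735200000");

        FileSystem fileSystem = srcPath.getFileSystem(conf);  // same way BucketWriter obtains the FS
        if (fileSystem.exists(srcPath)) {
            boolean renamed = fileSystem.rename(srcPath, dstPath);  // drops the .tmp suffix
            System.out.println("renamed: " + renamed);
        }
    }
}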

3) Call the HDFSWriter.append method to write the event:

writer.append(event);

4) Update the counters:

// update statistics
processSize += event.getBody().length;
eventCounter++;
batchCounter++;

5) Determine whether a flush is needed (i.e. whether hdfs.batchSize has been reached), writing the batch of data to HDFS:

if (batchCounter == batchSize) {
  flush();
}

When writing the event, BucketWriter.append calls the append method of the org.apache.flume.sink.hdfs.HDFSWriter implementation class, in this case HDFSDataStream. The main methods of HDFSDataStream:
configure, which sets up the serializer:

public void configure(Context context) {
  serializerType = context.getString("serializer", "TEXT");  // the default serialization is TEXT
  useRawLocalFileSystem = context.getBoolean("hdfs.useRawLocalFileSystem",
      false);
  serializerContext =
      new Context(context.getSubProperties(EventSerializer.CTX_PREFIX));
  logger.info("Serializer = " + serializerType + ", UseRawLocalFileSystem = "
      + useRawLocalFileSystem);
}

The append method writes the event by calling the EventSerializer.write method:

public void append(Event e) throws IOException {
  // shun flumeformatter...
  serializer.write(e);  // call EventSerializer.write to write the event
}

The main steps of the open method:
1) Open an existing file or create a new one, depending on the hdfs.append.support setting (false by default):

boolean appending = false;
if (conf.getBoolean("hdfs.append.support", false) == true && hdfs.isFile(dstPath)) {
  // hdfs.append.support is false by default
  outStream = hdfs.append(dstPath);
  appending = true;
} else {
  outStream = hdfs.create(dstPath);  // if append is not supported, create the file
}
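For reference, a self-contained sketch of the same create-or-append decision against the Hadoop FileSystem API (the path and written content are illustrative; whether append is usable also depends on the HDFS cluster configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dstPath = new Path("hdfs://xxxxx/mylog/20150212/10/FlumeData.1423735200000.tmp");
        FileSystem hdfs = dstPath.getFileSystem(conf);

        boolean appending = false;
        FSDataOutputStream outStream;
        if (conf.getBoolean("hdfs.append.support", false) && hdfs.isFile(dstPath)) {
            outStream = hdfs.append(dstPath);  // reopen the existing file
            appending = true;
        } else {
            outStream = hdfs.create(dstPath);  // otherwise create a new one
        }
        outStream.write("hello flume\n".getBytes("UTF-8"));
        outStream.close();
        System.out.println("appending = " + appending);
    }
}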

2) Create the EventSerializer object using the EventSerializerFactory.getInstance method:

serializer = EventSerializerFactory.getInstance(
    serializerType, serializerContext, outStream);  // instantiate the EventSerializer object

3) Throw an exception if hdfs.append.support is set to true but the EventSerializer object does not support reopen:

if (appending && !serializer.supportsReopen()) {
  outStream.close();
  serializer = null;
  throw new IOException("serializer (" + serializerType +
      ") does not support append");
}

4) Call the serializer's after-create or after-reopen hook:

if (appending) {
  serializer.afterReopen();
} else {
  serializer.afterCreate();
}

Finally, here are the 3 hdfs.writeFormat settings and their corresponding classes:

TEXT(BodyTextEventSerializer.Builder.class),                      // supports reopen
HEADER_AND_TEXT(HeaderAndBodyTextEventSerializer.Builder.class),  // supports reopen
AVRO_EVENT(FlumeEventAvroEventSerializer.Builder.class),          // does not support reopen

The default setting is TEXT, i.e. the BodyTextEventSerializer class:

private BodyTextEventSerializer(OutputStream out, Context ctx) {  // constructor
  this.appendNewline = ctx.getBoolean(APPEND_NEWLINE, APPEND_NEWLINE_DFLT);  // default is true
  this.out = out;
}
...
public void write(Event e) throws IOException {  // write method
  out.write(e.getBody());  // java.io.OutputStream.write, writes only the event body
  if (appendNewline) {     // append a newline after each event
    out.write('\n');
  }
}
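A tiny runnable check of that behaviour (using Flume's EventBuilder helper and an in-memory stream as a stand-in for the HDFS output stream): only the event body is written, followed by a newline when appendNewline is true; headers are dropped entirely.

import java.io.ByteArrayOutputStream;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class BodyTextWriteSketch {
    public static void main(String[] args) throws Exception {
        Event e = EventBuilder.withBody("hello flume".getBytes("UTF-8"));

        ByteArrayOutputStream out = new ByteArrayOutputStream();  // stand-in for the HDFS stream
        boolean appendNewline = true;                             // the serializer's default

        out.write(e.getBody());   // body only, headers are ignored by the TEXT format
        if (appendNewline) {
            out.write('\n');
        }
        System.out.print(out.toString("UTF-8"));  // prints: hello flume
    }
}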

This article is from the "Food and Light Blog" blog, please make sure to keep this source http://caiguangguang.blog.51cto.com/1652935/1618343
