The previous article covered the implementation of HDFSEventSink. Here, starting from the HDFS sink configuration and following the call chain, we look at the sink's entire HDFS write process.
Several important HDFS sink settings used in production:
hdfs.path = hdfs://xxxxx/%{logtypename}/%Y%m%d/%H
hdfs.rollInterval = 60
hdfs.rollSize = 0      // we want files to roll only by time
hdfs.rollCount = 0
hdfs.batchSize = 2000
hdfs.txnEventMax = 2000
hdfs.fileType = DataStream
hdfs.writeFormat = Text
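For context, these sink settings would sit inside a full agent configuration file like the following sketch (the agent, sink, and channel names here are hypothetical; only the hdfs.* values mirror the settings above):

```properties
# Hypothetical agent/sink/channel names; hdfs.* values mirror the settings above
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.channel = channel1
agent1.sinks.sink1.hdfs.path = hdfs://xxxxx/%{logtypename}/%Y%m%d/%H
agent1.sinks.sink1.hdfs.rollInterval = 60
agent1.sinks.sink1.hdfs.rollSize = 0
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.sinks.sink1.hdfs.batchSize = 2000
agent1.sinks.sink1.hdfs.txnEventMax = 2000
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
```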
Two kinds of classes are involved in hdfs.fileType and hdfs.writeFormat: one defines the file stream, and one defines the concrete data serialization.
1) hdfs.fileType has 3 options: SequenceFile / DataStream / CompressedStream. DataStream can be thought of as an HDFS text file; the default is SequenceFile; CompressedStream is used when compression is configured.
2) hdfs.writeFormat defines 3 serialization formats: TEXT writes only the event body, HEADER_AND_TEXT writes both the event body and its headers, and AVRO_EVENT is the Avro serialization format.
With the settings above, the data write path is roughly as follows:
SinkRunner.process -> SinkProcessor.process -> HDFSEventSink.process -> HDFSEventSink.append -> BucketWriter.append -> HDFSWriter.append -> HDFSDataStream.append -> BodyTextEventSerializer.write -> java.io.OutputStream.write
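The layering in this chain can be sketched standalone. The classes below are hypothetical simplified stand-ins, not the actual Flume types: a bucket wrapper delegates each append to a writer, which serializes the event body into an output stream, mirroring BucketWriter -> HDFSWriter -> OutputStream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical, simplified sketch of the delegation chain:
// bucket append -> writer append -> body bytes hit the OutputStream
public class AppendChainSketch {
    interface Writer { void append(byte[] body) throws IOException; }

    // stands in for an HDFSWriter implementation such as HDFSDataStream
    static class DataStreamWriter implements Writer {
        private final OutputStream out;
        DataStreamWriter(OutputStream out) { this.out = out; }
        public void append(byte[] body) throws IOException {
            out.write(body);   // like BodyTextEventSerializer: only the event body
            out.write('\n');   // plus a trailing newline by default
        }
    }

    // stands in for BucketWriter: wraps a Writer and counts events/bytes
    static class Bucket {
        private final Writer writer;
        long eventCounter, processSize;
        Bucket(Writer writer) { this.writer = writer; }
        void append(byte[] body) throws IOException {
            writer.append(body);          // delegate to the underlying writer
            eventCounter++;               // counters later drive roll decisions
            processSize += body.length;
        }
    }

    static String demo() {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            Bucket bucket = new Bucket(new DataStreamWriter(sink));
            bucket.append("hello".getBytes());
            bucket.append("world".getBytes());
            return sink.toString();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory stream
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```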
Simply put:
BucketWriter and HDFSWriter are instantiated in HDFSEventSink:
if (bucketWriter == null) {
  HDFSWriter hdfsWriter = writerFactory.getWriter(fileType); // get the HDFSWriter object
  ....
  bucketWriter = new BucketWriter(rollInterval, rollSize, rollCount, batchSize,
      context, realPath, realName, inUsePrefix, inUseSuffix, suffix, codeC,
      compType, hdfsWriter, timedRollerPool, proxyTicket, sinkCounter,
      idleTimeout, idleCallback, lookupPath); // build the BucketWriter around the HDFSWriter
}
The HDFSWriter object is obtained via the getWriter method of org.apache.flume.sink.hdfs.HDFSWriterFactory, which returns an instance of the concrete org.apache.flume.sink.hdfs.HDFSWriter implementation class based on the hdfs.fileType setting. Currently only 3 types are supported:
static final String SequenceFileType = "SequenceFile";
static final String DataStreamType = "DataStream";
static final String CompStreamType = "CompressedStream";
....
public HDFSWriter getWriter(String fileType) throws IOException {
  if (fileType.equalsIgnoreCase(SequenceFileType)) {        // SequenceFile
    return new HDFSSequenceFile();
  } else if (fileType.equalsIgnoreCase(DataStreamType)) {   // DataStream
    return new HDFSDataStream();
  } else if (fileType.equalsIgnoreCase(CompStreamType)) {   // CompressedStream
    return new HDFSCompressedDataStream();
  } else {
    throw new IOException("File type " + fileType + " not supported");
  }
}
BucketWriter can be understood as an encapsulation of the underlying data operations; writing data, for example, actually goes through a call to its append method. append has the following main steps:
1) First check whether the file is open:
if (!isOpen) {
  if (idleClosed) {
    throw new IOException("This bucket writer was closed due to idling and this handle " +
        "is thus no longer valid");
  }
  open(); // if not open, call open -> doOpen -> HDFSWriter.open to open bucketPath
          // (bucketPath is the temporary in-use path, e.g. a file ending in .tmp;
          //  targetPath is the final path)
}
The main steps of doOpen:
A. Set the two file paths:
bucketPath = filePath + DIRECTORY_DELIMITER + inUsePrefix + fullFileName + inUseSuffix;
targetPath = filePath + DIRECTORY_DELIMITER + fullFileName;
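A minimal sketch of how the in-use path and the final path differ. The sample directory and file names below are hypothetical; Flume's default inUsePrefix is empty and the default inUseSuffix is ".tmp":

```java
// Hypothetical sample values illustrating bucketPath vs targetPath
public class BucketPathSketch {
    static final String DIRECTORY_DELIMITER = "/";

    // temporary path the file is written under while it is open
    static String bucketPath(String filePath, String inUsePrefix,
                             String fullFileName, String inUseSuffix) {
        return filePath + DIRECTORY_DELIMITER + inUsePrefix + fullFileName + inUseSuffix;
    }

    // final path the file is renamed to when it is closed
    static String targetPath(String filePath, String fullFileName) {
        return filePath + DIRECTORY_DELIMITER + fullFileName;
    }

    public static void main(String[] args) {
        System.out.println(bucketPath("/logs/app/20150301", "", "events.1425200000000", ".tmp"));
        System.out.println(targetPath("/logs/app/20150301", "events.1425200000000"));
    }
}
```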
B. Call the HDFSWriter.open method to open bucketPath:
if (codeC == null) {
  // need to get reference to FS using above config before underlying
  // writer does in order to avoid shutdown hook & IllegalStateExceptions
  fileSystem = new Path(bucketPath).getFileSystem(config);
  LOG.info("Creating " + bucketPath);
  writer.open(bucketPath);
} else {
  // need to get reference to FS before writer does to avoid shutdown hook
  fileSystem = new Path(bucketPath).getFileSystem(config);
  LOG.info("Creating " + bucketPath);
  writer.open(bucketPath, codeC, compType);
}
C. If rollInterval is set, schedule a task that calls the close method:
// if time-based rolling is enabled, schedule the roll
if (rollInterval > 0) {
  Callable<Void> action = new Callable<Void>() {
    public Void call() throws Exception {
      LOG.debug("Rolling file ({}): Roll scheduled after {} sec elapsed.",
          bucketPath, rollInterval);
      try {
        close();
      } catch (Throwable t) {
        LOG.error("Unexpected error", t);
      }
      return null;
    }
  };
  timedRollFuture = timedRollerPool.schedule(action, rollInterval, TimeUnit.SECONDS);
}
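The same scheduling pattern can be reproduced standalone with a ScheduledExecutorService. This is a sketch: close() here is a stand-in for BucketWriter.close(), and the interval is taken in milliseconds so the demo runs quickly, whereas Flume schedules rollInterval in seconds:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of time-based rolling: schedule a close() once the roll interval elapses
public class TimedRollSketch {
    // schedules a stand-in close() and waits for it; returns true once it has fired
    static boolean rollOnce(long rollIntervalMs) {
        try {
            final AtomicBoolean closed = new AtomicBoolean(false);
            ScheduledExecutorService timedRollerPool = Executors.newScheduledThreadPool(1);
            Callable<Void> action = new Callable<Void>() {
                public Void call() {
                    closed.set(true); // stand-in for BucketWriter.close()
                    return null;
                }
            };
            ScheduledFuture<Void> timedRollFuture =
                    timedRollerPool.schedule(action, rollIntervalMs, TimeUnit.MILLISECONDS);
            timedRollFuture.get(); // block until the scheduled roll has run
            timedRollerPool.shutdown();
            return closed.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rollOnce(100));
    }
}
```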
2) Check whether the file needs to be rolled (when the hdfs.rollSize or hdfs.rollCount setting is reached):
// check if it's time to rotate the file
if (shouldRotate()) {
  close(); // close calls flush + doClose; flush calls doFlush, and doFlush
           // calls the HDFSWriter.sync method to sync the data to HDFS
  open();
}
where shouldRotate (count-based and size-based roll modes):
private boolean shouldRotate() {
  boolean doRotate = false;
  if ((rollCount > 0) && (rollCount <= eventCounter)) {
    // hdfs.rollCount is greater than 0 and the number of events processed
    // is greater than or equal to hdfs.rollCount: set doRotate to true
    LOG.debug("rolling: rollCount: {}, events: {}", rollCount, eventCounter);
    doRotate = true;
  }
  if ((rollSize > 0) && (rollSize <= processSize)) {
    // hdfs.rollSize is greater than 0 and the number of bytes processed
    // is greater than or equal to hdfs.rollSize: set doRotate to true
    LOG.debug("rolling: rollSize: {}, bytes: {}", rollSize, processSize);
    doRotate = true;
  }
  return doRotate;
}
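The rotation decision can be exercised in isolation. A standalone sketch of the same logic, with the counters passed in as parameters:

```java
// Standalone sketch of BucketWriter's count- and size-based rotation check
public class RollPolicySketch {
    static boolean shouldRotate(long rollCount, long eventCounter,
                                long rollSize, long processSize) {
        boolean doRotate = false;
        if (rollCount > 0 && rollCount <= eventCounter) {
            doRotate = true;  // enough events written since the last roll
        }
        if (rollSize > 0 && rollSize <= processSize) {
            doRotate = true;  // enough bytes written since the last roll
        }
        return doRotate;
    }

    public static void main(String[] args) {
        // with the production config above (rollSize = 0, rollCount = 0),
        // neither condition can ever fire, so only the time-based roll applies
        System.out.println(shouldRotate(0, 5000, 0, 1 << 30)); // false
        System.out.println(shouldRotate(2000, 2000, 0, 0));    // true
    }
}
```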
The main steps of doClose:
A. Call the HDFSWriter.close method.
B. Call the renameBucket method to rename the temporary file to its final name:
if (bucketPath != null && fileSystem != null) {
  renameBucket(); // could block or throw IOException
  fileSystem = null;
}
where renameBucket does:
fileSystem.rename(srcPath, dstPath);
3) Call the HDFSWriter.append method to write the event:
writer.append(event);
4) Update the counters:
// update statistics
processSize += event.getBody().length;
eventCounter++;
batchCounter++;
5) Check whether a flush is needed (when the hdfs.batchSize setting is reached); data is written to HDFS in batches:
if (batchCounter == batchSize) {
  flush();
}
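Batching in isolation: with hdfs.batchSize = 2000, every 2000th append triggers a flush. A minimal sketch of that behavior (the counter reset mirrors doFlush setting batchCounter back to 0):

```java
// Sketch of batchSize-driven flushing in BucketWriter.append
public class BatchFlushSketch {
    // appends `events` events with the given batchSize and returns the flush count
    static long run(long batchSize, int events) {
        long batchCounter = 0;
        long flushes = 0;
        for (int i = 0; i < events; i++) {
            batchCounter++;               // one event appended
            if (batchCounter == batchSize) {
                flushes++;                // flush: push buffered data to HDFS
                batchCounter = 0;         // doFlush resets the batch counter
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        // with hdfs.batchSize = 2000, 6000 events trigger 3 flushes
        System.out.println(run(2000, 6000));
    }
}
```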
When an event is written, BucketWriter's append method calls the append method of the org.apache.flume.sink.hdfs.HDFSWriter implementation class, such as the HDFSDataStream class here. The main methods of HDFSDataStream:
configure is used to set the serializer:
public void configure(Context context) {
  serializerType = context.getString("serializer", "TEXT"); // default serializer is TEXT
  useRawLocalFileSystem = context.getBoolean("hdfs.useRawLocalFileSystem", false);
  serializerContext = new Context(context.getSubProperties(EventSerializer.CTX_PREFIX));
  logger.info("Serializer = " + serializerType + ", UseRawLocalFileSystem = "
      + useRawLocalFileSystem);
}
The append method writes the event by calling the EventSerializer.write method:
public void append(Event e) throws IOException {
  // shun flumeformatter...
  serializer.write(e); // call EventSerializer.write to write the event
}
The main steps of the open method:
1) Open or create the file according to the hdfs.append.support setting (false by default):
boolean appending = false;
if (conf.getBoolean("hdfs.append.support", false) == true && hdfs.isFile(dstPath)) {
  // hdfs.append.support is false by default
  outStream = hdfs.append(dstPath);
  appending = true;
} else {
  outStream = hdfs.create(dstPath); // if append is not supported, create the file
}
2) Create the EventSerializer object using the EventSerializerFactory.getInstance method:
serializer = EventSerializerFactory.getInstance(
    serializerType, serializerContext, outStream); // instantiate the EventSerializer object
3) If hdfs.append.support is set to true but the EventSerializer does not support reopen, throw an exception:
if (appending && !serializer.supportsReopen()) {
  outStream.close();
  serializer = null;
  throw new IOException("serializer (" + serializerType + ") does not support append");
}
4) Call the serializer's after-create or after-reopen hook, depending on whether the file was created or reopened:
if (appending) {
  serializer.afterReopen();
} else {
  serializer.afterCreate();
}
Here are the 3 hdfs.writeFormat settings and their corresponding classes:
TEXT(BodyTextEventSerializer.Builder.class),                      // supports reopen
HEADER_AND_TEXT(HeaderAndBodyTextEventSerializer.Builder.class),  // supports reopen
AVRO_EVENT(FlumeEventAvroEventSerializer.Builder.class),          // does not support reopen
The default setting is TEXT, i.e. the BodyTextEventSerializer class:
private BodyTextEventSerializer(OutputStream out, Context ctx) { // constructor
  this.appendNewline = ctx.getBoolean(APPEND_NEWLINE, APPEND_NEWLINE_DFLT); // default is true
  this.out = out;
}
....
public void write(Event e) throws IOException { // write method
  out.write(e.getBody()); // java.io.OutputStream.write: writes only the event body
  if (appendNewline) {    // append a newline after each event
    out.write('\n');
  }
}
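The effect of the default TEXT format can be reproduced against a plain in-memory stream. This is a sketch of the same write logic (the sample log lines are made up); note that headers are dropped and only the body plus a newline reaches the stream:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch of BodyTextEventSerializer.write: body bytes plus an optional newline
public class BodyTextSketch {
    static void write(OutputStream out, byte[] body, boolean appendNewline)
            throws IOException {
        out.write(body);      // only the event body; headers are ignored
        if (appendNewline) {
            out.write('\n');  // default behavior appends a newline per event
        }
    }

    static String demo() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            write(out, "2015-03-01 INFO started".getBytes(), true);
            write(out, "2015-03-01 INFO stopped".getBytes(), true);
            return out.toString();
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory stream
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```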
This article is from the "Food and Light Blog" blog, please make sure to keep this source http://caiguangguang.blog.51cto.com/1652935/1618343
Analysis of the specific write flow of the HDFS sink