"Java" "Fulme" flume-ng source code reading Spooldirectorysource


org.apache.flume.source.SpoolDirectorySource is a commonly used Flume source that ingests file data from a directory on disk. Unlike other asynchronous sources, it does not lose data after a restart or a failed send. Flume monitors the directory and reads a file's data as soon as a new file appears. Once a given file has been fully read into the channel, the file is renamed to mark it as complete. A separate cleanup process is then needed to periodically remove the completed files.

The source can optionally insert the absolute path of the originating file into the header of each event. While a file is being read, the source buffers its data in memory, so make sure the bufferMaxLineLength option is set larger than the longest line in the input data.

Attention!!! Only uniquely named files may be placed in the spooling directory. If a file name is reused, or a file is modified while it is being read, the source fails and reports an exception. In that scenario, copy the file into the directory under a unique name, for example by appending a timestamp.
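As an illustration of that unique-naming workaround, here is a minimal sketch (the paths and the copyWithTimestamp helper are hypothetical, not part of Flume) that appends a timestamp before copying a file into the spooling directory:

import java.io.IOException;
import java.nio.file.*;

public class SpoolCopy {
    // Copy "source" into the spooling directory under a timestamped name,
    // so a second copy of the same file never collides with the first.
    public static Path copyWithTimestamp(Path source, Path spoolDir) throws IOException {
        String name = source.getFileName().toString();
        String unique = name + "." + System.currentTimeMillis();   // e.g. access.log.1700000000000
        return Files.copy(source, spoolDir.resolve(unique), StandardCopyOption.COPY_ATTRIBUTES);
    }

    public static void main(String[] args) throws IOException {
        copyWithTimestamp(Paths.get("/var/log/access.log"), Paths.get("/data/spool"));
    }
}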

First, the configure(Context context) method. The code is as follows:

public void configure(Context context) {
  spoolDirectory = context.getString(SPOOL_DIRECTORY);
  Preconditions.checkState(spoolDirectory != null,
      "Configuration must specify a spooling directory");

  completedSuffix = context.getString(SPOOLED_FILE_SUFFIX, DEFAULT_SPOOLED_FILE_SUFFIX);
  deletePolicy = context.getString(DELETE_POLICY, DEFAULT_DELETE_POLICY);
  fileHeader = context.getBoolean(FILENAME_HEADER, DEFAULT_FILE_HEADER);
  fileHeaderKey = context.getString(FILENAME_HEADER_KEY, DEFAULT_FILENAME_HEADER_KEY);
  batchSize = context.getInteger(BATCH_SIZE, DEFAULT_BATCH_SIZE);
  inputCharset = context.getString(INPUT_CHARSET, DEFAULT_INPUT_CHARSET);
  ignorePattern = context.getString(IGNORE_PAT, DEFAULT_IGNORE_PAT);
  trackerDirPath = context.getString(TRACKER_DIR, DEFAULT_TRACKER_DIR);

  deserializerType = context.getString(DESERIALIZER, DEFAULT_DESERIALIZER);
  deserializerContext = new Context(context.getSubProperties(DESERIALIZER + "."));

  // "Hack" to support backwards compatibility with previous generation of
  // spooling directory source, which did not support deserializers
  Integer bufferMaxLineLength = context.getInteger(BUFFER_MAX_LINE_LENGTH);
  if (bufferMaxLineLength != null && deserializerType != null &&
      deserializerType.equals(DEFAULT_DESERIALIZER)) {
    deserializerContext.put(LineDeserializer.MAXLINE_KEY, bufferMaxLineLength.toString());
  }
}

1. spoolDirectory is the directory to monitor; it has no default value and must not be empty. This source cannot monitor subdirectories, i.e. it does not recurse. If you need that, you have to implement it yourself; http://blog.csdn.net/yangbutao/article/details/8835563 has an implementation of recursive monitoring;

2. completedSuffix is the suffix appended to a file once it has been fully read; the default is ".COMPLETED";

3. deletePolicy controls whether a fully read file is deleted. The default is "never", i.e. do not delete; only "never" and "immediate" are currently supported;

4. fileHeader controls whether the file name is added to the event headers; boolean type;

5. fileHeaderKey is the header key whose value is the file name;

6. batchSize is the number of records processed per batch; the default is 100;

7. inputCharset is the character encoding; the default is "UTF-8";

8. ignorePattern ignores files whose names match the given pattern;

9. trackerDirPath is the directory where file metadata is stored; the default is ".flumespool";

10. deserializerType is the way data in a file is deserialized into events; the default is "LINE", i.e. org.apache.flume.serialization.LineDeserializer;

11. deserializerContext is used mainly by the deserializer, to set the output encoding outputCharset and the maximum line length maxLineLength.
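Putting these options together, a minimal source configuration might look like the following sketch. The property keys mirror the constants read in configure() above (spoolDir, fileSuffix, deletePolicy, and so on), but exact keys can differ slightly between Flume versions; the agent/component names (a1, r1, c1), the directory, and the ignore pattern are made up for illustration, and the channel/sink definitions are omitted:

a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /data/spool
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.deletePolicy = never
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
a1.sources.r1.batchSize = 100
a1.sources.r1.inputCharset = UTF-8
a1.sources.r1.ignorePattern = ^.*\.tmp$
a1.sources.r1.trackerDir = .flumespool
a1.sources.r1.deserializer = LINE
a1.sources.r1.deserializer.maxLineLength = 2048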

  

Second, the start() method. The code is as follows:

public void start() {
  logger.info("SpoolDirectorySource source starting with directory: {}", spoolDirectory);

  ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
  counterGroup = new CounterGroup();

  File directory = new File(spoolDirectory);
  try {
    reader = new ReliableSpoolingFileEventReader.Builder()
        .spoolDirectory(directory)
        .completedSuffix(completedSuffix)
        .ignorePattern(ignorePattern)
        .trackerDirPath(trackerDirPath)
        .annotateFileName(fileHeader)
        .fileNameHeader(fileHeaderKey)
        .deserializerType(deserializerType)
        .deserializerContext(deserializerContext)
        .deletePolicy(deletePolicy)
        .inputCharset(inputCharset)
        .build();
  } catch (IOException ioe) {
    throw new FlumeException("Error instantiating spooling event parser", ioe);
  }

  Runnable runner = new SpoolDirectoryRunnable(reader, counterGroup);
  executor.scheduleWithFixedDelay(runner, 0, POLL_DELAY_MS, TimeUnit.MILLISECONDS);

  super.start();
  logger.debug("SpoolDirectorySource source started");
}

1. It constructs an org.apache.flume.client.avro.ReliableSpoolingFileEventReader object, reader;

2. It schedules a SpoolDirectoryRunnable to run every POLL_DELAY_MS milliseconds (default 500).

Third, the process of reading and sending events. The code is as follows:

private class SpoolDirectoryRunnable implements Runnable {
  private ReliableSpoolingFileEventReader reader;
  private CounterGroup counterGroup;

  public SpoolDirectoryRunnable(ReliableSpoolingFileEventReader reader,
      CounterGroup counterGroup) {
    this.reader = reader;
    this.counterGroup = counterGroup;
  }

  @Override
  public void run() {
    try {
      while (true) {
        List<Event> events = reader.readEvents(batchSize);   // read up to batchSize records
        if (events.isEmpty()) {
          break;
        }
        counterGroup.addAndGet("spooler.events.read", (long) events.size());

        getChannelProcessor().processEventBatch(events);      // send the events to the channel in bulk
        reader.commit();
      }
    } catch (Throwable t) {
      logger.error("Uncaught exception in Runnable", t);
      if (t instanceof Error) {
        throw (Error) t;
      }
    }
  }
}

This Runnable reads the data the reader points to in batches and sends it to the channel.

Fourth, the constructor of org.apache.flume.client.avro.ReliableSpoolingFileEventReader. It first checks whether it can create, read, write and delete files in spoolDirectory, and then creates the "$spoolDirectory/.flumespool/.flumespool-main.meta" metadata file.
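A minimal sketch of that kind of permission probe follows; the class and method names, the canary-file approach, and the error messages are illustrative, not the exact Flume code:

import java.io.File;
import java.io.IOException;

// Illustrative only: probe create/read/write/delete permissions on the spool
// directory by creating and removing a throwaway file, then prepare the
// tracker directory that will hold the ".flumespool-main.meta" metadata file.
final class SpoolDirCheck {
  static void checkSpoolDirectory(File spoolDirectory, String trackerDirPath) throws IOException {
    File probe = File.createTempFile("flume-spooldir-perm-check-", ".canary", spoolDirectory);
    if (!probe.delete()) {
      throw new IOException("Unable to delete canary file " + probe);
    }
    File trackerDir = new File(spoolDirectory, trackerDirPath);   // e.g. ".flumespool"
    if (!trackerDir.exists() && !trackerDir.mkdir()) {
      throw new IOException("Unable to create tracker directory " + trackerDir);
    }
  }
}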

Fifth, the List<Event> events = reader.readEvents(batchSize) call in the SpoolDirectoryRunnable.run method above is org.apache.flume.client.avro.ReliableSpoolingFileEventReader.readEvents(batchSize):

public List<Event> readEvents(int numEvents) throws IOException {
  if (!committed) {
    if (!currentFile.isPresent()) {   // Optional.isPresent(): true if it holds a (non-null) reference
      throw new IllegalStateException("File should not roll when " +
          "commit is outstanding.");
    }
    logger.info("Last read was never committed - resetting mark position.");
    currentFile.get().getDeserializer().reset();
  } else {
    // Commit succeeded.
    // Check if new files have arrived since last call
    if (!currentFile.isPresent()) {           // nothing currently open, fetch the next file
      currentFile = getNextFile();
    }
    // Return empty list if no new files
    if (!currentFile.isPresent()) {           // still nothing: no readable file
      return Collections.emptyList();
    }
    // Otherwise currentFile is the file currently being read
  }

  EventDeserializer des = currentFile.get().getDeserializer();
  List<Event> events = des.readEvents(numEvents);   // fill the event bodies

  /* It's possible that the last read took us just up to a file boundary.
   * If so, try to roll to the next file, if there is one. */
  if (events.isEmpty()) {
    retireCurrentFile();                      // rename (or delete) the finished file
    currentFile = getNextFile();              // switch to the next file
    if (!currentFile.isPresent()) {
      return Collections.emptyList();
    }
    events = currentFile.get().getDeserializer().readEvents(numEvents);   // keep reading
  }

  if (annotateFileName) {
    String filename = currentFile.get().getFile().getAbsolutePath();
    for (Event event : events) {
      event.getHeaders().put(fileNameHeader, filename);   // add the file name to the headers
    }
  }

  committed = false;
  lastFileRead = currentFile;
  return events;
}

1. committed is true when the reader is initialized, so on the first call the branch that runs is getNextFile(), which fetches the file to be read; if there is none, an empty list is returned.

2. The deserializer (by default org.apache.flume.serialization.LineDeserializer) reads data in bulk via readEvents(numEvents) and wraps it into events.

3. If the batch of events comes back empty, the current file has been read to the end, so the finished file is "retired" with retireCurrentFile() (which also removes its metadata file); depending on deletePolicy it is either deleted or renamed with the completedSuffix. Since readEvents must still return events, the next file is fetched with getNextFile() and events = currentFile.get().getDeserializer().readEvents(numEvents) is called again. A hedged sketch of the rename-or-delete step appears after this list.

4. If annotateFileName is set, the file name is added to the header of each of these events.

5. committed = false; lastFileRead = currentFile; and the events are returned.
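Here is a minimal sketch of that rename-or-delete decision under the two supported policies; the class and method names are illustrative, not the exact Flume code:

import java.io.File;
import java.io.IOException;

// Illustrative only: what retiring a fully-read file boils down to.
// "never"     -> rename the file by appending the completed suffix;
// "immediate" -> delete it outright.
// The real reader also deletes the per-file tracker (metadata) file here.
final class RetirePolicySketch {
  static void retire(File fileToRoll, String deletePolicy, String completedSuffix)
      throws IOException {
    if ("never".equalsIgnoreCase(deletePolicy)) {
      File dest = new File(fileToRoll.getPath() + completedSuffix);   // e.g. foo.log.COMPLETED
      if (!fileToRoll.renameTo(dest)) {
        throw new IOException("Unable to rename " + fileToRoll + " to " + dest);
      }
    } else if ("immediate".equalsIgnoreCase(deletePolicy)) {
      if (!fileToRoll.delete()) {
        throw new IOException("Unable to delete " + fileToRoll);
      }
    } else {
      throw new IllegalArgumentException("Unsupported deletePolicy: " + deletePolicy);
    }
  }
}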

There are several points to explain in this approach:

The first is committed, which tracks whether the current batch of events has been processed correctly. As seen in point 5 above, every call to ReliableSpoolingFileEventReader.readEvents(batchSize) sets committed to false at the end. However, in the SpoolDirectoryRunnable.run() method you can also see that ReliableSpoolingFileEventReader.commit() is called right after readEvents; the code is as follows:

/** Commit the last lines which were read. */
@Override
public void commit() throws IOException {
  if (!committed && currentFile.isPresent()) {
    currentFile.get().getDeserializer().mark();
    committed = true;
  }
}

When both of its conditions are met, this method does two things. First, it writes the position read so far to the tracker file: mark() writes syncPosition into the tracker file, while position in ResettableFileInputStream only records a tentative location, and it is only at mark time that syncPosition is set to position. This is what makes it possible to recover from an exception without losing data. Second, it sets committed = true. The two conditions are: committed == false, which readEvents always leaves behind when it finishes, and currentFile being "present", meaning a file is currently being read. Conversely, if committed is still false when readEvents starts, it means either that something went wrong when the batch was handed to the channel and reader.commit() never ran, or that currentFile is "absent" and there is no readable file. Both cases are handled at the beginning of readEvents: with committed == false, if there is no readable file an exception "File should not roll when commit is outstanding." is thrown; if the hand-off to the channel failed, currentFile.get().getDeserializer().reset() rolls the stream back to the position of the last successful commit to the channel, which is what guarantees no data loss.
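A minimal sketch of that position/syncPosition bookkeeping may help; the class is hypothetical and only mimics the idea behind ResettableFileInputStream, not its actual implementation:

// Illustrative only: how mark()/reset() can make a read position recoverable.
final class ResettableCursorSketch {
  private long position;       // advances as characters are read (tentative)
  private long syncPosition;   // last committed position, persisted to the tracker file

  void advance(long bytesRead) {
    position += bytesRead;               // reading only moves the tentative cursor
  }

  void mark() {
    syncPosition = position;             // commit: remember the safe point
    // the real implementation also writes this position to .flumespool-main.meta
  }

  void reset() {
    position = syncPosition;             // roll back to the last committed point
    // the real implementation also seeks the underlying file stream back
  }
}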

The second is the getNextFile() method. This method first filters the contents of the monitored directory, excluding subdirectories (hence no recursion), hidden files (names starting with "."), files that have already been read (carrying the completedSuffix), and files matching ignorePattern. It then sorts the remaining files by modification time, creates the corresponding metadata file, constructs a ResettableFileInputStream to read the chosen file, passes that stream to the deserializer, and finally returns an Optional wrapping the next file together with its deserializer.
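A minimal sketch of that filter-and-sort step, under the assumption that we only want the oldest eligible file (the real getNextFile() additionally sets up the tracker file and the deserializer):

import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: pick the oldest eligible file in the spooling directory.
final class NextFileSketch {
  static File pickNextFile(File spoolDirectory, String completedSuffix, Pattern ignorePattern) {
    FileFilter filter = candidate ->
        !candidate.isDirectory()                                   // no recursion into subdirectories
        && !candidate.getName().startsWith(".")                    // skip hidden files
        && !candidate.getName().endsWith(completedSuffix)          // skip files already read
        && !ignorePattern.matcher(candidate.getName()).matches();  // skip ignored files

    File[] candidates = spoolDirectory.listFiles(filter);
    if (candidates == null || candidates.length == 0) {
      return null;   // nothing to read right now
    }
    List<File> sorted = new ArrayList<>(Arrays.asList(candidates));
    sorted.sort(Comparator.comparingLong(File::lastModified));     // oldest first
    return sorted.get(0);
  }
}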

The third is the readEvents(numEvents) method of LineDeserializer (the default). It calls readLine() numEvents times, each call producing one line of data that is wrapped into an event. readLine() keeps pulling data through org.apache.flume.serialization.ResettableFileInputStream.readChar(); after a full line has been read, it checks whether the line exceeds the configured maxLineLength. Besides reading one character at a time, readChar() also records the position of each character, so that the position can later be written to the metadata file (via deserializer.mark()).
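A minimal sketch of that line-assembly loop; readChar() here stands in for ResettableFileInputStream.readChar(), and the truncation behaviour is simplified relative to the real LineDeserializer:

import java.io.IOException;

// Illustrative only: build one line from a character-at-a-time stream and
// warn when the line exceeds maxLineLength, the way a line deserializer might.
abstract class LineReaderSketch {
  private final int maxLineLength;

  LineReaderSketch(int maxLineLength) {
    this.maxLineLength = maxLineLength;
  }

  /** Next character, or -1 at end of stream (stands in for ResettableFileInputStream.readChar()). */
  abstract int readChar() throws IOException;

  /** Returns the next line without its terminator, or null at end of stream. */
  String readLine() throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = readChar()) != -1) {
      if (c == '\n') {
        break;                      // end of line
      }
      sb.append((char) c);
      if (sb.length() > maxLineLength) {
        System.err.println("Line length exceeds max (" + maxLineLength + "), truncating line!");
        break;                      // give up on the rest of this line
      }
    }
    return (c == -1 && sb.length() == 0) ? null : sb.toString();
  }
}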



"Java" "Fulme" flume-ng source code reading Spooldirectorysource

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.