This article analyzes Flume's data-transfer transactions based on three components: ThriftSource, MemoryChannel, and HDFSSink. If you use other components, the transaction handling differs. Under normal circumstances MemoryChannel works well (it is what our company uses). FileChannel is slower, although it provides log-based data recovery; in general, as long as the machine does not lose power, MemoryChannel will not lose data.
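For orientation, a minimal agent configuration wiring these three components together might look like the sketch below. The agent name a1, the component names r1/c1/k1, the port, the capacities, and the HDFS path are all illustrative, not taken from any real deployment:

```properties
# hypothetical agent "a1": ThriftSource -> MemoryChannel -> HDFSSink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# ThriftSource listening on an assumed port
a1.sources.r1.type = thrift
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 9090
a1.sources.r1.channels = c1

# MemoryChannel with illustrative capacities
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFSSink writing to an example path
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```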
Flume provides transactional operations to guarantee the reliability of the user's data, mainly reflected in:
- When data is transferred to the next node (usually in batches), the data is rolled back if the receiving node throws an exception, such as a network error. The previous node then re-sends the batch, so duplicate data is possible.
- Within a single node, when the source writes data to the channel, if any event in a batch causes an exception, the whole batch is not written to the channel. The events already received are discarded, and the previous node re-sends the data.
Programming model
In Flume, put and take operations on a channel must be wrapped in a transaction, for example:
```java
Channel ch = new MemoryChannel();
Transaction txn = ch.getTransaction();
// transaction begins
txn.begin();
try {
    Event eventToStage = EventBuilder.withBody("Hello Flume!", Charset.forName("UTF-8"));
    // put the event into the transaction's staging buffer
    ch.put(eventToStage);
    // or ch.take()
    // commit the staged data to the channel
    txn.commit();
} catch (Throwable t) {
    txn.rollback();
    if (t instanceof Error) {
        throw (Error) t;
    }
} finally {
    txn.close();
}
```
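The take side follows the same pattern. Here is a minimal sketch mirroring the example above, using the same channel `ch` (what "send to the next hop" means depends on your sink, so it is left as a comment):

```java
Transaction txn = ch.getTransaction();
txn.begin();
try {
    // take an event out of the channel
    Event event = ch.take();
    if (event != null) {
        // ... send the event to the next hop or terminal destination ...
    }
    // commit: the taken events are permanently removed from the channel
    txn.commit();
} catch (Throwable t) {
    // on failure, the taken events are returned to the channel
    txn.rollback();
    if (t instanceof Error) {
        throw (Error) t;
    }
} finally {
    txn.close();
}
```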
Put transaction flow
Put transactions can be divided into the following stages:
- doPut: first write the batch of data to the staging buffer putList
- doCommit: check whether the channel's memory queue has enough space, then merge putList into it
- doRollback: if the channel's memory queue lacks space, discard the staged data
We analyze the put transaction by following the process of the source receiving data and writing it to the channel.
ThriftSource spawns multiple worker threads (ThriftSourceHandler) to process incoming data. The workers expose several data-handling interfaces; here we only look at the batch-processing one, appendBatch:
```java
@Override
public Status appendBatch(List<ThriftFlumeEvent> events) throws TException {
    List<Event> flumeEvents = Lists.newArrayList();
    for (ThriftFlumeEvent event : events) {
        flumeEvents.add(EventBuilder.withBody(event.getBody(), event.getHeaders()));
    }
    // the ChannelProcessor is passed in when the Source is initialized;
    // it writes the data to the corresponding Channel
    getChannelProcessor().processEventBatch(flumeEvents);
    ...
    return Status.OK;
}
```
The transaction logic is in the processEventBatch method:
```java
public void processEventBatch(List<Event> events) {
    ...
    // preprocess each event; interceptors are often used for ETL-style work
    events = interceptorChain.intercept(events);
    ...
    // classify the events, grouping them by their target channels
    // process required channels
    Transaction tx = reqChannel.getTransaction();
    ...
    try {
        // transaction begins; tx is an instance of MemoryTransaction
        tx.begin();
        List<Event> batch = reqChannelQueue.get(reqChannel);
        for (Event event : batch) {
            // the put operation actually calls Transaction.doPut
            reqChannel.put(event);
        }
        // commit: write the staged data into the channel's queue
        tx.commit();
    } catch (Throwable t) {
        // rollback
        tx.rollback();
        ...
    }
    ...
}
```
Each worker thread has its own transaction instance, stored in the channel's (BasicChannelSemantics) ThreadLocal variable currentTransaction.
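This wiring can be condensed into a short sketch (simplified from BasicChannelSemantics; precondition checks and error handling omitted):

```java
public abstract class BasicChannelSemantics extends AbstractChannel {
    // each thread gets its own transaction instance
    private ThreadLocal<BasicTransactionSemantics> currentTransaction =
        new ThreadLocal<BasicTransactionSemantics>();

    // MemoryChannel's implementation returns a new MemoryTransaction
    protected abstract BasicTransactionSemantics createTransaction();

    @Override
    public Transaction getTransaction() {
        BasicTransactionSemantics transaction = currentTransaction.get();
        // create a new transaction if this thread has none, or its last one is closed
        if (transaction == null || transaction.getState().equals(
                BasicTransactionSemantics.State.CLOSED)) {
            transaction = createTransaction();
            currentTransaction.set(transaction);
        }
        return transaction;
    }

    @Override
    public void put(Event event) {
        // Channel.put simply delegates to the calling thread's transaction,
        // which eventually invokes doPut(event)
        currentTransaction.get().put(event);
    }
}
```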
So what does the transaction actually contain?
In fact, the transaction instance holds two double-ended blocking queues of type LinkedBlockingDeque (a deque might seem unnecessary, since each thread only ever writes its own putList rather than sharing it across threads; note, however, that the take rollback below relies on addFirst/removeLast, which does require a deque), namely:
- putList: the staging buffer for put operations
- takeList: the staging buffer for take operations
A put transaction, of course, uses only putList. putList is a temporary buffer: data is first put into putList, and the commit method then checks whether the channel has enough free space before merging the buffered events into the channel's queue.
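In MemoryChannel's inner MemoryTransaction class, the two buffers look roughly like this (condensed from the Flume 1.x source; counters omitted):

```java
private class MemoryTransaction extends BasicTransactionSemantics {
    // staging buffer for events being put into the channel
    private LinkedBlockingDeque<Event> putList;
    // staging buffer for events taken out of the channel, kept so they
    // can be returned to the queue on rollback
    private LinkedBlockingDeque<Event> takeList;

    public MemoryTransaction(int transCapacity, ChannelCounter counter) {
        // both buffers are bounded by the transaction capacity
        putList = new LinkedBlockingDeque<Event>(transCapacity);
        takeList = new LinkedBlockingDeque<Event>(transCapacity);
        ...
    }
}
```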
Channel.put → Transaction.doPut:
```java
protected void doPut(Event event) throws InterruptedException {
    // compute the event's size in bytes (in units of byteCapacitySlotSize)
    int eventByteSize = (int) Math.ceil(estimateEventSize(event) / byteCapacitySlotSize);
    // write to the staging buffer putList
    if (!putList.offer(event)) {
        throw new ChannelException("Put queue for MemoryTransaction of capacity " +
            putList.size() + " full, consider committing more frequently, " +
            "increasing capacity or increasing thread count");
    }
    putByteCounter += eventByteSize;
}
```
Transaction.commit → doCommit:
```java
@Override
protected void doCommit() throws InterruptedException {
    // check whether the channel's queue has enough remaining space
    ...
    int puts = putList.size();
    ...
    synchronized (queueLock) {
        if (puts > 0) {
            while (!putList.isEmpty()) {
                // write into the channel's memory queue
                if (!queue.offer(putList.removeFirst())) {
                    throw new RuntimeException("Queue add failed, this shouldn't be able to happen");
                }
            }
        }
    }
    // clear the temporary queue
    putList.clear();
    ...
}
```
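The space check elided above ("...") is worth sketching. In MemoryChannel it is implemented with semaphores that track free queue slots and free byte capacity; roughly (a simplified sketch based on the Flume 1.x source, with error messages abbreviated):

```java
// a take frees space and a put consumes it, so only reserve room
// when the net change would shrink the remaining capacity
int remainingChange = takeList.size() - putList.size();
if (remainingChange < 0) {
    // first reserve byte capacity for the new event bodies ...
    if (!bytesRemaining.tryAcquire(putByteCounter, keepAlive, TimeUnit.SECONDS)) {
        throw new ChannelException("Cannot commit transaction: byte capacity reached");
    }
    // ... then reserve slots in the memory queue, releasing the bytes on failure
    if (!queueRemaining.tryAcquire(-remainingChange, keepAlive, TimeUnit.SECONDS)) {
        bytesRemaining.release(putByteCounter);
        throw new ChannelFullException("Space for commit to queue couldn't be acquired");
    }
}
```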
If an exception occurs during the transaction, for example when the channel has insufficient space, it rolls back:
```java
@Override
protected void doRollback() {
    ...
    // discard the staged data that was never merged into the channel's memory queue
    putList.clear();
    ...
}
```
Take transaction
The take transaction is divided into the following stages:
- doTake: first take the data into the temporary buffer takeList
- send the data to the next node
- doCommit: if all the data was sent successfully, clear the temporary buffer takeList
- doRollback: if an exception occurs during transmission, return the data in the temporary buffer takeList to the channel's memory queue
The sink is actually driven by a SinkRunner thread, which calls the Sink.process method to handle data. Every sink class has a process method implementing its data-transmission logic; let's look at HDFSEventSink's process method:
```java
public Status process() throws EventDeliveryException {
    ...
    Transaction transaction = channel.getTransaction();
    ...
    try {
        // transaction begins
        transaction.begin();
        ...
        for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
            // take data into the temporary buffer; this actually calls Transaction.doTake
            Event event = channel.take();
            if (event == null) {
                break;
            }
            ...
            // write the data to HDFS
            bucketWriter.append(event);
            ...
        }
        // flush all pending buckets before committing the transaction
        for (BucketWriter bucketWriter : writers) {
            bucketWriter.flush();
        }
        // commit
        transaction.commit();
        ...
    } catch (IOException eIO) {
        transaction.rollback();
        ...
    } finally {
        transaction.close();
    }
}
```
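For context, the SinkRunner mentioned above drives process() in a polling loop and backs off when the sink reports no data. A condensed sketch (simplified from SinkRunner.PollingRunner; counter bookkeeping and shutdown details omitted):

```java
// each call to process() runs one take transaction: begin / take+send / commit
while (!shouldStop.get()) {
    try {
        if (policy.process().equals(Sink.Status.BACKOFF)) {
            // channel was empty: sleep with growing backoff before polling again
            Thread.sleep(Math.min(++backoffCount * backoffSleepIncrement, maxBackoffSleep));
        } else {
            backoffCount = 0;
        }
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    } catch (EventDeliveryException e) {
        // delivery failed; the transaction was already rolled back inside process()
    }
}
```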
Approximate flowchart: (figure omitted)
Now look at Channel.take, which takes data from the channel into a temporary buffer; it actually calls Transaction.doTake:
```java
protected Event doTake() throws InterruptedException {
    ...
    // take an event from the channel's memory queue
    synchronized (queueLock) {
        event = queue.poll();
    }
    ...
    // put the event into the temporary buffer takeList
    takeList.put(event);
    ...
    return event;
}
```
Then the HDFS write thread (BucketWriter) writes the data to HDFS; once the whole batch has been written, the transaction commits:
```java
protected void doCommit() throws InterruptedException {
    ...
    takeList.clear();
    ...
}
```
Very simple: it just empties takeList. If an exception occurs while BucketWriter is writing data to HDFS, the transaction rolls back:
```java
protected void doRollback() {
    int takes = takeList.size();
    // check that the memory queue has enough free space to write takeList back
    synchronized (queueLock) {
        Preconditions.checkState(queue.remainingCapacity() >= takeList.size(),
            "Not enough space in memory channel " +
            "queue to rollback takes. This should never happen, please report");
        while (!takeList.isEmpty()) {
            // addFirst/removeLast returns the events to the head of the queue
            // in their original order (this is why a deque is needed)
            queue.addFirst(takeList.removeLast());
        }
        ...
    }
    ...
}
```