Understanding Flume NG's batchSize and transactionCapacity Parameters and the Principle of Transport Transactions (Repost)


This article analyzes Flume's data-transfer transactions based on three components: ThriftSource, MemoryChannel, and HDFSSink; if other components are used, Flume handles transactions differently. Flume's transaction-handling principle is that every put and take operation on a channel must be wrapped in a transaction, for example:
    Channel ch = new MemoryChannel();
    Transaction txn = ch.getTransaction();
    // start the transaction
    txn.begin();
    try {
        Event eventToStage = EventBuilder.withBody("Hello Flume!",
                Charset.forName("UTF-8"));
        // put data into the staging buffer
        ch.put(eventToStage);
        // or ch.take()
        // commit the data to the channel
        txn.commit();
    } catch (Throwable t) {
        txn.rollback();
        if (t instanceof Error) {
            throw (Error) t;
        }
    } finally {
        txn.close();
    }
Put transaction flow

Put transactions can be divided into the following stages:

    • doPut: write the batch of data into the staging buffer putList first
    • doCommit: check whether the channel's memory queue has enough room, then merge putList into it
    • doRollback: the channel's memory queue lacks space, so the staged data is discarded (as I understand it, data may be lost here)
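The three stages above can be sketched with plain JDK classes. This is a simplified model for illustration only, not Flume's actual MemoryTransaction; the class name and method bodies are assumptions:

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified model of a put transaction (illustrative, not Flume code):
// events are staged in a bounded putList, and only doCommit moves them
// into the channel's memory queue.
public class PutTransactionSketch {
    private final BlockingDeque<String> putList;      // staging buffer
    private final BlockingDeque<String> channelQueue; // channel's memory queue

    public PutTransactionSketch(int transCapacity, int channelCapacity) {
        putList = new LinkedBlockingDeque<>(transCapacity);
        channelQueue = new LinkedBlockingDeque<>(channelCapacity);
    }

    // doPut: stage the event; fails when the transaction exceeds its capacity
    public void doPut(String event) {
        if (!putList.offer(event)) {
            throw new IllegalStateException("putList full, commit more frequently");
        }
    }

    // doCommit: merge the staged events into the channel queue,
    // rolling back (and losing the batch) when the channel is full
    public void doCommit() {
        if (channelQueue.remainingCapacity() < putList.size()) {
            doRollback();
            throw new IllegalStateException("channel queue full, batch discarded");
        }
        while (!putList.isEmpty()) {
            channelQueue.offer(putList.removeFirst());
        }
    }

    // doRollback: discard the staged events (the potential data-loss point)
    public void doRollback() {
        putList.clear();
    }

    public int channelSize() {
        return channelQueue.size();
    }
}
```

Note how the sketch mirrors the data-loss caveat above: on rollback the staged events are simply dropped, since a memory channel has nowhere durable to keep them.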

We analyze the put transaction by following the path from the source receiving data to writing it into the channel.

ThriftSource spawns multiple worker threads (ThriftSourceHandler) to process data. Of the worker's data-handling interfaces, we look only at the batch-processing one, appendBatch:

    @Override
    public Status appendBatch(List<ThriftFlumeEvent> events) throws TException {
        List<Event> flumeEvents = Lists.newArrayList();
        for (ThriftFlumeEvent event : events) {
            flumeEvents.add(EventBuilder.withBody(event.getBody(), event.getHeaders()));
        }
        // The ChannelProcessor is passed in when the source is initialized;
        // it writes the data to the corresponding channels.
        getChannelProcessor().processEventBatch(flumeEvents);
        ...
        return Status.OK;
    }


The transaction logic lives in the processEventBatch method:

    public void processEventBatch(List<Event> events) {
        ...
        // preprocess each event; interceptors are often used for lightweight ETL
        events = interceptorChain.intercept(events);
        ...
        // classify the events, grouping them by destination channel
        // process the required channels
        Transaction tx = reqChannel.getTransaction();
        ...
        try {
            // start the transaction; tx is a MemoryTransaction instance
            tx.begin();
            List<Event> batch = reqChannelQueue.get(reqChannel);
            for (Event event : batch) {
                // put actually calls Transaction.doPut
                reqChannel.put(event);
            }
            // commit: the data is written into the channel's queue
            tx.commit();
        } catch (Throwable t) {
            // roll back
            tx.rollback();
            ...
        }
        ...
    }

Each worker thread has its own transaction instance, stored in the channel's (BasicChannelSemantics) ThreadLocal variable currentTransaction.
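The ThreadLocal pattern can be shown in isolation. This is a minimal sketch; ChannelSketch and Txn are invented names, not Flume classes:

```java
// Minimal sketch of the per-thread transaction pattern used by
// BasicChannelSemantics: each thread obtains its own transaction
// instance from a ThreadLocal, so no locking is needed to look it up.
public class ChannelSketch {
    public static class Txn { }

    private final ThreadLocal<Txn> currentTransaction =
            ThreadLocal.withInitial(Txn::new);

    // Repeated calls on the same thread return the same instance;
    // a different thread would get its own separate Txn.
    public Txn getTransaction() {
        return currentTransaction.get();
    }
}
```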

So, what does a transaction actually contain?

In fact, a transaction instance contains two double-ended blocking queues (LinkedBlockingDeque). A deque can feel unnecessary at first, since each thread writes only to its own putList rather than sharing it; the take-side rollback, however, relies on deque operations, as we will see. The two queues are:

    • putList
    • takeList

For the put transaction, of course, only putList is used. putList is a staging buffer: data is put into putList first, and the commit method checks whether the channel has enough free space before merging the staged queue into the channel.
Channel.put → Transaction.doPut:

    protected void doPut(Event event) throws InterruptedException {
        // compute the event's size in bytes
        int eventByteSize = (int) Math.ceil(estimateEventSize(event) / byteCapacitySlotSize);
        // write to the staging buffer putList
        if (!putList.offer(event)) {
            throw new ChannelException(
                "Put queue for MemoryTransaction of capacity " +
                putList.size() + " full, consider committing more frequently, " +
                "increasing capacity or increasing thread count");
        }
        putByteCounter += eventByteSize;
    }

Transaction.doCommit:

    @Override
    protected void doCommit() throws InterruptedException {
        // check that the channel's queue has enough remaining space
        ...
        int puts = putList.size();
        ...
        synchronized (queueLock) {
            if (puts > 0) {
                while (!putList.isEmpty()) {
                    // write into the channel's queue
                    if (!queue.offer(putList.removeFirst())) {
                        throw new RuntimeException("Queue add failed, this shouldn't be able to happen");
                    }
                }
            }
            // clear the staging queue
            putList.clear();
            ...
        }
        ...
    }

If an exception occurs during the transaction, for example when the channel has insufficient space, it rolls back:

    @Override
    protected void doRollback() {
        ...
        // discard the staged data instead of merging it into the channel's memory queue
        putList.clear();
        ...
    }




Take transaction flow

The take transaction is divided into the following stages:

    • doTake: take the data into the staging buffer takeList first
    • send the data to the next node
    • doCommit: if all the data was sent successfully, clear the staging buffer takeList
    • doRollback: if an exception occurs during transmission, return the data in takeList to the channel's memory queue
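These stages, too, can be modeled with JDK deques alone. The following is a simplified sketch with invented names, not Flume's real implementation:

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified model of a take transaction (illustrative, not Flume code):
// events move from the channel queue into takeList; doCommit forgets
// them, doRollback restores them to the queue head.
public class TakeTransactionSketch {
    private final BlockingDeque<String> queue;    // channel's memory queue
    private final BlockingDeque<String> takeList; // staging buffer

    public TakeTransactionSketch(int capacity) {
        queue = new LinkedBlockingDeque<>(capacity);
        takeList = new LinkedBlockingDeque<>(capacity);
    }

    public void offer(String event) {
        queue.offer(event);
    }

    // doTake: remove from the queue head, remember the event in takeList
    public String doTake() {
        String event = queue.poll();
        if (event != null) {
            takeList.offer(event);
        }
        return event;
    }

    // doCommit: all events were delivered downstream, so forget them
    public void doCommit() {
        takeList.clear();
    }

    // doRollback: push the taken events back onto the queue head,
    // preserving their original order
    public void doRollback() {
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
    }

    public int queueSize() {
        return queue.size();
    }
}
```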

A sink is actually driven by the SinkRunner thread, which calls Sink.process to handle data. Every sink class has a process method containing its data-transmission logic; let's look at HDFSEventSink's process method:

    public Status process() throws EventDeliveryException {
        ...
        Transaction transaction = channel.getTransaction();
        ...
        try {
            // start the transaction
            transaction.begin();
            ...
            for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
                // take an event into the staging buffer; actually calls Transaction.doTake
                Event event = channel.take();
                if (event == null) {
                    break;
                }
                ...
                // write the data to HDFS
                bucketWriter.append(event);
                ...
            }
            // flush all pending buckets before committing the transaction
            for (BucketWriter bucketWriter : writers) {
                bucketWriter.flush();
            }
            // commit
            transaction.commit();
            ...
        } catch (IOException eio) {
            transaction.rollback();
            ...
        } finally {
            transaction.close();
        }
    }

Approximate flowchart (image not reproduced in this repost).

Now look at Channel.take, which moves data into the staging buffer; it actually calls Transaction.doTake:

    protected Event doTake() throws InterruptedException {
        ...
        // fetch an event from the channel's memory queue
        synchronized (queueLock) {
            event = queue.poll();
        }
        ...
        // place it into the staging buffer
        takeList.put(event);
        ...
        return event;
    }

The HDFS write thread (BucketWriter) then writes the data to HDFS; once the batch has been written, the transaction commits:

    protected void doCommit() throws InterruptedException {
        ...
        takeList.clear();
        ...
    }

Very simple: it just empties takeList. If an exception occurs while BucketWriter writes to HDFS, the transaction rolls back:

    protected void doRollback() {
        int takes = takeList.size();
        // check that the memory queue has enough space to take takeList back
        synchronized (queueLock) {
            Preconditions.checkState(queue.remainingCapacity() >= takeList.size(),
                "Not enough space in memory channel " +
                "queue to rollback takes. This should never happen, please report");
            while (!takeList.isEmpty()) {
                queue.addFirst(takeList.removeLast());
            }
            ...
        }
        ...
    }
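This rollback also answers the earlier aside about why a double-ended queue is used: removeLast() plus addFirst() re-inserts the taken events at the head of the channel queue in their original relative order, which a plain FIFO queue cannot do. A stand-alone demonstration (TakeRollbackSketch is an invented name for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;

// Demonstrates why takeList is a deque: rolling back with
// removeLast() + addFirst() restores taken events to the head of the
// channel queue in their original order.
public class TakeRollbackSketch {
    public static List<String> takeThenRollback(List<String> channelEvents, int takes) {
        LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>(channelEvents);
        LinkedBlockingDeque<String> takeList = new LinkedBlockingDeque<>();

        // doTake: events leave the head of the channel queue
        for (int i = 0; i < takes; i++) {
            takeList.offer(queue.poll());
        }

        // doRollback: re-insert at the head, last taken first,
        // so the original order is preserved
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
        return new ArrayList<>(queue);
    }
}
```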

Having read the code, we can see:

batchSize is a concept that applies to sources and sinks; it limits how many events a source or sink processes as one batch.

Processing one batch of batchSize events corresponds to one transaction.

The larger the value, the more each transaction commits at once and the fewer takeList (and similar) operations are performed, so performance certainly improves; but if an error occurs, the amount of work rolled back also grows.

Next, look at the inner class MemoryTransaction in MemoryChannel:

    private class MemoryTransaction extends BasicTransactionSemantics {
        private LinkedBlockingDeque<Event> takeList;
        private LinkedBlockingDeque<Event> putList;
        private final ChannelCounter channelCounter;
        private int putByteCounter = 0;
        private int takeByteCounter = 0;

        public MemoryTransaction(int transCapacity, ChannelCounter counter) {
            putList = new LinkedBlockingDeque<Event>(transCapacity);
            takeList = new LinkedBlockingDeque<Event>(transCapacity);
            channelCounter = counter;
        }
    }

It is clear that the transactionCapacity parameter is simply the size of putList and takeList. In the Flume 1.5 SpillableMemoryChannel, the lengths of putList and takeList are instead controlled by the largestPutTxSize and largestTakeTxSize parameters.
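To tie the parameters together, here is a sketch of a typical agent configuration (the agent, channel, and sink names a1, c1, k1 are placeholders). A common rule of thumb is batchSize <= transactionCapacity <= capacity, since each transaction must fit batchSize events into putList/takeList, which in turn drain into the channel queue:

```properties
# memory channel: capacity is the channel queue size,
# transactionCapacity sizes putList/takeList
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink: batchSize events are taken per transaction,
# so it should not exceed the channel's transactionCapacity
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 1000
```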

