Understanding Flume NG's batchSize and transactionCapacity Parameters and the Principle of Transport Transactions (Repost)


This article analyzes Flume's data-transfer transactions based on three components: ThriftSource, MemoryChannel, and HDFSSink; if other components are used, Flume handles transactions differently. Flume's transaction-handling principle is that every put and take operation on a channel must be wrapped in a transaction, for example:
    Channel ch = new MemoryChannel();
    Transaction txn = ch.getTransaction();
    // start the transaction
    txn.begin();
    try {
        Event eventToStage = EventBuilder.withBody("Hello Flume!",
                Charset.forName("UTF-8"));
        // put data into the staging buffer
        ch.put(eventToStage);
        // or ch.take()
        // commit the data to the channel
        txn.commit();
    } catch (Throwable t) {
        txn.rollback();
        if (t instanceof Error) {
            throw (Error) t;
        }
    } finally {
        txn.close();
    }
Put transaction flow

Put transactions can be divided into the following stages:

    • doPut: write the batch of data into the staging buffer putList first
    • doCommit: check whether the channel's memory queue has enough room, then merge putList into it
    • doRollback: the channel's memory queue lacks space, so the staged data is discarded (as I understand it, data may be lost here)
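The three stages above can be sketched with plain JDK classes. This is a simplified model for illustration only, not Flume's actual MemoryTransaction; the class name and method bodies are assumptions:

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified model of a put transaction (illustrative, not Flume code):
// events are staged in a bounded putList, and only doCommit moves them
// into the channel's memory queue.
public class PutTransactionSketch {
    private final BlockingDeque<String> putList;      // staging buffer
    private final BlockingDeque<String> channelQueue; // channel's memory queue

    public PutTransactionSketch(int transCapacity, int channelCapacity) {
        putList = new LinkedBlockingDeque<>(transCapacity);
        channelQueue = new LinkedBlockingDeque<>(channelCapacity);
    }

    // doPut: stage the event; fails when the transaction exceeds its capacity
    public void doPut(String event) {
        if (!putList.offer(event)) {
            throw new IllegalStateException("putList full, commit more frequently");
        }
    }

    // doCommit: merge the staged events into the channel queue,
    // rolling back (and losing the batch) when the channel is full
    public void doCommit() {
        if (channelQueue.remainingCapacity() < putList.size()) {
            doRollback();
            throw new IllegalStateException("channel queue full, batch discarded");
        }
        while (!putList.isEmpty()) {
            channelQueue.offer(putList.removeFirst());
        }
    }

    // doRollback: discard the staged events (the potential data-loss point)
    public void doRollback() {
        putList.clear();
    }

    public int channelSize() {
        return channelQueue.size();
    }
}
```

Note how the sketch mirrors the data-loss caveat above: on rollback the staged events are simply dropped, since a memory channel has nowhere durable to keep them.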

We analyze the put transaction by following the path from the source receiving data to writing it into the channel.

ThriftSource spawns multiple worker threads (ThriftSourceHandler) to process data. Of the worker's data-handling interfaces, we look only at the batch-processing one, appendBatch:

    @Override
    public Status appendBatch(List<ThriftFlumeEvent> events) throws TException {
        List<Event> flumeEvents = Lists.newArrayList();
        for (ThriftFlumeEvent event : events) {
            flumeEvents.add(EventBuilder.withBody(event.getBody(), event.getHeaders()));
        }
        // The ChannelProcessor is passed in when the source is initialized;
        // it writes the data to the corresponding channels.
        getChannelProcessor().processEventBatch(flumeEvents);
        ...
        return Status.OK;
    }


The transaction logic lives in the processEventBatch method:

    public void processEventBatch(List<Event> events) {
        ...
        // preprocess each event; interceptors are often used for lightweight ETL
        events = interceptorChain.intercept(events);
        ...
        // classify the events, grouping them by destination channel
        // process the required channels
        Transaction tx = reqChannel.getTransaction();
        ...
        try {
            // start the transaction; tx is a MemoryTransaction instance
            tx.begin();
            List<Event> batch = reqChannelQueue.get(reqChannel);
            for (Event event : batch) {
                // put actually calls Transaction.doPut
                reqChannel.put(event);
            }
            // commit: the data is written into the channel's queue
            tx.commit();
        } catch (Throwable t) {
            // roll back
            tx.rollback();
            ...
        }
        ...
    }

Each worker thread has its own transaction instance, stored in the channel's (BasicChannelSemantics) ThreadLocal variable currentTransaction.
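The ThreadLocal pattern can be shown in isolation. This is a minimal sketch; ChannelSketch and Txn are invented names, not Flume classes:

```java
// Minimal sketch of the per-thread transaction pattern used by
// BasicChannelSemantics: each thread obtains its own transaction
// instance from a ThreadLocal, so no locking is needed to look it up.
public class ChannelSketch {
    public static class Txn { }

    private final ThreadLocal<Txn> currentTransaction =
            ThreadLocal.withInitial(Txn::new);

    // Repeated calls on the same thread return the same instance;
    // a different thread would get its own separate Txn.
    public Txn getTransaction() {
        return currentTransaction.get();
    }
}
```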

So, what does a transaction actually contain?

In fact, a transaction instance contains two double-ended blocking queues (LinkedBlockingDeque). A deque can feel unnecessary at first, since each thread writes only to its own putList rather than sharing it; the take-side rollback, however, relies on deque operations, as we will see. The two queues are:

    • putList
    • takeList

For the put transaction, of course, only putList is used. putList is a staging buffer: data is put into putList first, and the commit method checks whether the channel has enough free space before merging the staged queue into the channel.
Channel.put → Transaction.doPut:

    protected void doPut(Event event) throws InterruptedException {
        // compute the event's size in bytes
        int eventByteSize = (int) Math.ceil(estimateEventSize(event) / byteCapacitySlotSize);
        // write to the staging buffer putList
        if (!putList.offer(event)) {
            throw new ChannelException(
                "Put queue for MemoryTransaction of capacity " +
                putList.size() + " full, consider committing more frequently, " +
                "increasing capacity or increasing thread count");
        }
        putByteCounter += eventByteSize;
    }

Transaction.doCommit:

    @Override
    protected void doCommit() throws InterruptedException {
        // check that the channel's queue has enough remaining space
        ...
        int puts = putList.size();
        ...
        synchronized (queueLock) {
            if (puts > 0) {
                while (!putList.isEmpty()) {
                    // write into the channel's queue
                    if (!queue.offer(putList.removeFirst())) {
                        throw new RuntimeException("Queue add failed, this shouldn't be able to happen");
                    }
                }
            }
            // clear the staging queue
            putList.clear();
            ...
        }
        ...
    }

If an exception occurs during the transaction, for example when the channel has insufficient space, it rolls back:

    @Override
    protected void doRollback() {
        ...
        // discard the staged data instead of merging it into the channel's memory queue
        putList.clear();
        ...
    }




Take transaction flow

The take transaction is divided into the following stages:

    • doTake: take the data into the staging buffer takeList first
    • send the data to the next node
    • doCommit: if all the data was sent successfully, clear the staging buffer takeList
    • doRollback: if an exception occurs during transmission, return the data in takeList to the channel's memory queue
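These stages, too, can be modeled with JDK deques alone. The following is a simplified sketch with invented names, not Flume's real implementation:

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;

// Simplified model of a take transaction (illustrative, not Flume code):
// events move from the channel queue into takeList; doCommit forgets
// them, doRollback restores them to the queue head.
public class TakeTransactionSketch {
    private final BlockingDeque<String> queue;    // channel's memory queue
    private final BlockingDeque<String> takeList; // staging buffer

    public TakeTransactionSketch(int capacity) {
        queue = new LinkedBlockingDeque<>(capacity);
        takeList = new LinkedBlockingDeque<>(capacity);
    }

    public void offer(String event) {
        queue.offer(event);
    }

    // doTake: remove from the queue head, remember the event in takeList
    public String doTake() {
        String event = queue.poll();
        if (event != null) {
            takeList.offer(event);
        }
        return event;
    }

    // doCommit: all events were delivered downstream, so forget them
    public void doCommit() {
        takeList.clear();
    }

    // doRollback: push the taken events back onto the queue head,
    // preserving their original order
    public void doRollback() {
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
    }

    public int queueSize() {
        return queue.size();
    }
}
```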

A sink is actually driven by the SinkRunner thread, which calls Sink.process to handle data. Every sink class has a process method containing its data-transmission logic; let's look at HDFSEventSink's process method:

    public Status process() throws EventDeliveryException {
        ...
        Transaction transaction = channel.getTransaction();
        ...
        try {
            // start the transaction
            transaction.begin();
            ...
            for (txnEventCount = 0; txnEventCount < batchSize; txnEventCount++) {
                // take an event into the staging buffer; actually calls Transaction.doTake
                Event event = channel.take();
                if (event == null) {
                    break;
                }
                ...
                // write the data to HDFS
                bucketWriter.append(event);
                ...
            }
            // flush all pending buckets before committing the transaction
            for (BucketWriter bucketWriter : writers) {
                bucketWriter.flush();
            }
            // commit
            transaction.commit();
            ...
        } catch (IOException eio) {
            transaction.rollback();
            ...
        } finally {
            transaction.close();
        }
    }

Approximate flowchart (image not reproduced in this repost).

Now look at Channel.take, which moves data into the staging buffer; it actually calls Transaction.doTake:

    protected Event doTake() throws InterruptedException {
        ...
        // fetch an event from the channel's memory queue
        synchronized (queueLock) {
            event = queue.poll();
        }
        ...
        // place it into the staging buffer
        takeList.put(event);
        ...
        return event;
    }

The HDFS write thread (BucketWriter) then writes the data to HDFS; once the batch has been written, the transaction commits:

    protected void doCommit() throws InterruptedException {
        ...
        takeList.clear();
        ...
    }

Very simple: it just empties takeList. If an exception occurs while BucketWriter writes to HDFS, the transaction rolls back:

    protected void doRollback() {
        int takes = takeList.size();
        // check that the memory queue has enough space to take takeList back
        synchronized (queueLock) {
            Preconditions.checkState(queue.remainingCapacity() >= takeList.size(),
                "Not enough space in memory channel " +
                "queue to rollback takes. This should never happen, please report");
            while (!takeList.isEmpty()) {
                queue.addFirst(takeList.removeLast());
            }
            ...
        }
        ...
    }
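This rollback also answers the earlier aside about why a double-ended queue is used: removeLast() plus addFirst() re-inserts the taken events at the head of the channel queue in their original relative order, which a plain FIFO queue cannot do. A stand-alone demonstration (TakeRollbackSketch is an invented name for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;

// Demonstrates why takeList is a deque: rolling back with
// removeLast() + addFirst() restores taken events to the head of the
// channel queue in their original order.
public class TakeRollbackSketch {
    public static List<String> takeThenRollback(List<String> channelEvents, int takes) {
        LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>(channelEvents);
        LinkedBlockingDeque<String> takeList = new LinkedBlockingDeque<>();

        // doTake: events leave the head of the channel queue
        for (int i = 0; i < takes; i++) {
            takeList.offer(queue.poll());
        }

        // doRollback: re-insert at the head, last taken first,
        // so the original order is preserved
        while (!takeList.isEmpty()) {
            queue.addFirst(takeList.removeLast());
        }
        return new ArrayList<>(queue);
    }
}
```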

Having read the code, we can see:

batchSize is a concept that applies to sources and sinks; it limits how many events a source or sink processes as one batch.

Processing one batch of batchSize events corresponds to one transaction.

The larger the value, the more each transaction commits at once and the fewer takeList (and similar) operations are performed, so performance certainly improves; but if an error occurs, the amount of work rolled back also grows.

Next, look at the inner class MemoryTransaction in MemoryChannel:

    private class MemoryTransaction extends BasicTransactionSemantics {
        private LinkedBlockingDeque<Event> takeList;
        private LinkedBlockingDeque<Event> putList;
        private final ChannelCounter channelCounter;
        private int putByteCounter = 0;
        private int takeByteCounter = 0;

        public MemoryTransaction(int transCapacity, ChannelCounter counter) {
            putList = new LinkedBlockingDeque<Event>(transCapacity);
            takeList = new LinkedBlockingDeque<Event>(transCapacity);
            channelCounter = counter;
        }
    }

It is clear that the transactionCapacity parameter is simply the size of putList and takeList. In the Flume 1.5 SpillableMemoryChannel, the lengths of putList and takeList are instead controlled by the largestPutTxSize and largestTakeTxSize parameters.
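To tie the parameters together, here is a sketch of a typical agent configuration (the agent, channel, and sink names a1, c1, k1 are placeholders). A common rule of thumb is batchSize <= transactionCapacity <= capacity, since each transaction must fit batchSize events into putList/takeList, which in turn drain into the channel queue:

```properties
# memory channel: capacity is the channel queue size,
# transactionCapacity sizes putList/takeList
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink: batchSize events are taken per transaction,
# so it should not exceed the channel's transactionCapacity
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 1000
```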

