When Storm processes streaming data in real time, a common requirement is to process tuples in batches of a certain size rather than handling each tuple immediately as it arrives. This may be done for performance reasons, or because the business logic demands it.
For example, when querying or updating a database, generating and executing one SQL statement per tuple becomes very inefficient at high data volumes and hurts system throughput; batching the operations is much faster.
Of course, if you also want to use Storm's reliable processing mechanism, you should cache references to these tuples in an in-memory container until the whole batch has been processed, and only ack them afterwards.
The following is a simple example.
Suppose we already have a DbManager database access class with at least two methods:
(1) getConnection(): returns a java.sql.Connection object;
(2) getSql(tuple): generates a database operation statement from a tuple.
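The article does not show DbManager itself, so here is a minimal sketch of what such a helper might look like. The JDBC URL and the (table, columns, values) signature of getSql are assumptions for illustration; in a real bolt, getSql would take a Storm Tuple and extract the field values from it (e.g. tuple.getString(0)).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical sketch of the DbManager helper described above.
public class DbManager {

    // Assumed JDBC URL; a real deployment would read this from configuration.
    private static final String JDBC_URL =
            "jdbc:mysql://localhost:3306/test?user=storm&password=storm";

    // Returns a java.sql.Connection; in production this would come from a pool.
    public static Connection getConnection() {
        try {
            return DriverManager.getConnection(JDBC_URL);
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }

    // Builds an INSERT statement from a table name, column names and values.
    // The real getSql(tuple) would pull these values out of the Tuple instead.
    public static String getSql(String table, String[] columns, String[] values) {
        StringBuilder sb = new StringBuilder("INSERT INTO ").append(table).append(" (");
        sb.append(String.join(",", columns)).append(") VALUES (");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(",");
            sb.append("'").append(values[i].replace("'", "''")).append("'"); // naive escaping
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        System.out.println(getSql("users",
                new String[]{"id", "name"},
                new String[]{"1", "alice"}));
    }
}
```

Note that building SQL by string concatenation is only acceptable for trusted data; a PreparedStatement with addBatch() would be the safer choice.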
To cache a certain number of tuples in the bolt, an int n is passed to the bolt's constructor and assigned to the count member variable, specifying that every n tuples are processed as one batch.
To hold the tuples in memory, a ConcurrentLinkedQueue from java.util.concurrent is used to store them; whenever count tuples have accumulated, a batch is triggered.
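The queue-and-drain pattern can be illustrated on its own, with plain Strings standing in for Storm tuples. ConcurrentLinkedQueue is lock-free, so add() in execute() never blocks, and poll() returns null once the queue is empty, which makes draining straightforward:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal illustration of the queue-and-drain pattern used by the bolt.
public class QueueDrainDemo {

    // Drains everything currently in the queue and returns the batch size.
    public static int drain(Queue<String> queue) {
        int n = 0;
        while (queue.poll() != null) { // poll() returns null when the queue is empty
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Queue<String> q = new ConcurrentLinkedQueue<String>();
        q.add("t1");
        q.add("t2");
        q.add("t3");
        System.out.println(drain(q));    // prints 3
        System.out.println(q.isEmpty()); // prints true
    }
}
```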
In addition, because the incoming data rate may be low (the queue might take a long time to fill up, or count may simply be set too large), a timer check is added to the bolt so that the queued tuples are flushed at least once per second.
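The two flush triggers can be expressed as a single pure predicate, which makes the logic easy to test in isolation. The names and the millisecond interval parameter below are illustrative:

```java
// A batch is flushed either when `count` tuples have accumulated or when
// more than `intervalMs` milliseconds have passed since the last flush.
public class FlushPolicy {

    public static boolean shouldFlush(int queueSize, int count,
                                      long nowMs, long lastFlushMs, long intervalMs) {
        return queueSize >= count || nowMs >= lastFlushMs + intervalMs;
    }

    public static void main(String[] args) {
        System.out.println(shouldFlush(100, 100, 0L, 0L, 1000L));  // size trigger: true
        System.out.println(shouldFlush(5, 100, 1500L, 0L, 1000L)); // time trigger: true
        System.out.println(shouldFlush(5, 100, 500L, 0L, 1000L));  // neither: false
    }
}
```

One caveat of checking the clock inside execute() is that the time trigger only fires when a new tuple arrives; if the stream stops entirely, the last partial batch stays queued until the next tuple shows up.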
The complete bolt code follows (for reference only):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Tuple;

public class BatchingBolt implements IRichBolt {

    private static final long serialVersionUID = 1L;
    private OutputCollector collector;
    private Queue<Tuple> tupleQueue = new ConcurrentLinkedQueue<Tuple>();
    private int count;
    private long lastTime;
    private Connection conn;

    public BatchingBolt(int n) {
        count = n; // number of tuples processed per batch
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        conn = DbManager.getConnection();      // get a database connection through DbManager
        lastTime = System.currentTimeMillis(); // timestamp of the last batch
    }

    @Override
    public void execute(Tuple tuple) {
        tupleQueue.add(tuple);
        long currentTime = System.currentTimeMillis();
        // Flush every count tuples, or at least once per second
        if (tupleQueue.size() >= count || currentTime >= lastTime + 1000) {
            try {
                Statement stmt = conn.createStatement();
                conn.setAutoCommit(false);
                List<Tuple> batch = new ArrayList<Tuple>();
                Tuple tup;
                while ((tup = tupleQueue.poll()) != null) {
                    stmt.addBatch(DbManager.getSql(tup)); // generate and batch the SQL statement
                    batch.add(tup);
                }
                stmt.executeBatch(); // submit the batched SQL statements
                conn.commit();
                conn.setAutoCommit(true);
                stmt.close();
                for (Tuple t : batch) {
                    collector.ack(t); // ack only after the batch has been committed
                }
                System.out.println("Batch insert data into database, total records: " + batch.size());
                lastTime = currentTime;
            } catch (SQLException e) {
                collector.reportError(e);
            }
        }
    }

    @Override
    public void cleanup() {}

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}

Two details are worth noting. The database connection is obtained in prepare() rather than in the constructor, because the constructor runs on the machine that submits the topology and a Connection is not serializable. And the queue is drained with poll() until it returns null, rather than looping exactly count times, since a time-triggered flush may fire with fewer than count tuples queued; tuples are acked only after the batch commits, so Storm's reliability mechanism can replay them if the commit fails.