Kudu + Impala is a good fit for data analysis, but inserting rows directly into a Kudu table with Impala INSERT ... VALUES statements is inefficient: in testing, throughput was only about 80 rows/second. The reason is straightforward. Kudu itself handles writes efficiently, but Impala is not optimized for this pattern, and the overhead of executing each Impala statement is high, so frequent small-batch writes perform very poorly. Kudu officially recommends using the Java API or the Python API to write data. The test cases below use the Java API and also show the basic usage of the Kudu API.
=========================
Prepare the test table
=========================
-- Kudu table
CREATE TABLE kudu_testdb.tmp_test_perf (
  id STRING ENCODING PLAIN_ENCODING COMPRESSION SNAPPY,
  name STRING ENCODING DICT_ENCODING COMPRESSION SNAPPY,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 6
STORED AS KUDU
TBLPROPERTIES (
  'kudu.table_name' = 'Testdb.tmp_test_perf',
  'kudu.master_addresses' = '10.0.0.100:7051,10.0.0.101:7051,10.0.0.101:7051',
  'kudu.num_tablet_replicas' = '1'
);
=========================
Write the Java test program
=========================
package kudu_perf_test;

import java.sql.Timestamp;
import java.util.UUID;

import org.apache.kudu.client.*;

public class Test {

    // Buffer capacity of the session, in number of operations
    private final static int OPERATION_BATCH = 500;

    // Generic test case that supports all three flush modes
    public static void insertTestGeneric(KuduSession session, KuduTable table,
            SessionConfiguration.FlushMode mode, int recordCount) throws Exception {
        // SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND
        // SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC
        // SessionConfiguration.FlushMode.MANUAL_FLUSH
        session.setFlushMode(mode);
        if (SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC != mode) {
            session.setMutationBufferSpace(OPERATION_BATCH);
        }
        int uncommit = 0;
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addString("name", mode.name());
            session.apply(insert);
            // In manual mode the buffer must be flushed before it fills up;
            // here it is flushed once more than half of it is in use.
            if (SessionConfiguration.FlushMode.MANUAL_FLUSH == mode) {
                uncommit = uncommit + 1;
                if (uncommit > OPERATION_BATCH / 2) {
                    session.flush();
                    uncommit = 0;
                }
            }
        }

        // In manual mode, make sure the final partial batch is flushed
        if (SessionConfiguration.FlushMode.MANUAL_FLUSH == mode && uncommit > 0) {
            session.flush();
        }

        // In background mode, force the final flush, then check the pending
        // errors and throw an exception if anything failed
        if (SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND == mode) {
            session.flush();
            RowErrorsAndOverflowStatus error = session.getPendingErrors();
            if (error.isOverflowed() || error.getRowErrors().length > 0) {
                if (error.isOverflowed()) {
                    throw new Exception("Kudu overflow exception occurred.");
                }
                StringBuilder errorMessage = new StringBuilder();
                if (error.getRowErrors().length > 0) {
                    for (RowError errorObj : error.getRowErrors()) {
                        errorMessage.append(errorObj.toString());
                        errorMessage.append(";");
                    }
                }
                throw new Exception(errorMessage.toString());
            }
        }
    }

    // Test case that supports MANUAL_FLUSH only
    public static void insertTestManual(KuduSession session, KuduTable table,
            int recordCount) throws Exception {
        SessionConfiguration.FlushMode mode = SessionConfiguration.FlushMode.MANUAL_FLUSH;
        session.setFlushMode(mode);
        session.setMutationBufferSpace(OPERATION_BATCH);
        int uncommit = 0;
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addString("name", mode.name());
            session.apply(insert);
            // Flush before the buffer fills up: once more than half of it is in use
            uncommit = uncommit + 1;
            if (uncommit > OPERATION_BATCH / 2) {
                session.flush();
                uncommit = 0;
            }
        }

        // Make sure the final partial batch is flushed
        if (uncommit > 0) {
            session.flush();
        }
    }

    // Test case that supports AUTO_FLUSH_SYNC only
    public static void insertTestInAutoSync(KuduSession session, KuduTable table,
            int recordCount) throws Exception {
        SessionConfiguration.FlushMode mode = SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC;
        session.setFlushMode(mode);
        for (int i = 0; i < recordCount; i++) {
            Insert insert = table.newInsert();
            PartialRow row = insert.getRow();
            UUID uuid = UUID.randomUUID();
            row.addString("id", uuid.toString());
            row.addString("name", mode.name());
            // In AUTO_FLUSH_SYNC mode, apply() writes to Kudu immediately
            session.apply(insert);
        }
    }

    public static void test() throws KuduException {
        KuduClient client = new KuduClient.KuduClientBuilder(
                "10.0.0.100:7051,10.0.0.101:7051,10.0.0.101:7051").build();
        KuduSession session = client.newSession();
        KuduTable table = client.openTable("Testdb.tmp_test_perf");

        SessionConfiguration.FlushMode mode;
        Timestamp d1 = null;
        Timestamp d2 = null;
        long millis;
        long seconds;
        // Number of rows to insert per mode. The value 0 inserts nothing,
        // so set it to the desired test size before running.
        int recordCount = 0;

        try {
            mode = SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestGeneric(session, table, mode, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            // Total elapsed seconds (no "% 60", which would drop whole minutes)
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);

            mode = SessionConfiguration.FlushMode.AUTO_FLUSH_SYNC;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestInAutoSync(session, table, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);

            mode = SessionConfiguration.FlushMode.MANUAL_FLUSH;
            d1 = new Timestamp(System.currentTimeMillis());
            insertTestManual(session, table, recordCount);
            d2 = new Timestamp(System.currentTimeMillis());
            millis = d2.getTime() - d1.getTime();
            seconds = millis / 1000;
            System.out.println(mode.name() + " elapsed seconds: " + seconds);
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } finally {
            if (!session.isClosed()) {
                session.close();
            }
        }
    }

    public static void main(String[] args) {
        try {
            test();
        } catch (KuduException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        System.out.println("Done");
    }
}
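The test program prints elapsed seconds per flush mode, while the results are quoted in rows/second. A minimal sketch of that conversion, using only the JDK, might look like the following. The class ThroughputTimer and its method names are hypothetical and not part of the Kudu client, and the Runnable merely stands in for one of the insert test methods (which throw checked exceptions and would need wrapping).

```java
class ThroughputTimer {

    // Convert a row count and an elapsed time in nanoseconds to rows per second.
    static double rowsPerSecond(long rowCount, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return rowCount * 1_000_000_000.0 / elapsedNanos;
    }

    // Run the workload once and return its throughput in rows per second.
    // rowCount is the number of rows the workload is assumed to write.
    static double measure(long rowCount, Runnable workload) {
        long start = System.nanoTime();
        workload.run();
        // Guard against a zero reading for extremely short workloads.
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return rowsPerSecond(rowCount, elapsed);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real test would call an insert method here.
        double rate = measure(1000, () -> {
            for (int i = 0; i < 1000; i++) {
                Integer.toString(i);
            }
        });
        System.out.println("rows/second: " + rate);
    }
}
```

Using System.nanoTime() rather than wall-clock timestamps avoids jumps from clock adjustments, and dividing by the full elapsed time keeps runs longer than a minute comparable.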
=========================
Performance Test Results
=========================
MANUAL_FLUSH mode: 8000 rows/second
AUTO_FLUSH_BACKGROUND mode: 8000 rows/second
AUTO_FLUSH_SYNC mode: rows/second
Impala SQL INSERT statement: 80 rows/second
=========================
Kudu API Usage Summary
=========================
1. Prefer MANUAL_FLUSH where possible. It gives the best performance, and write failures surface at the flush() call (as a thrown exception or as row errors in the responses it returns), so the error-handling logic stays clear.
2. AUTO_FLUSH_SYNC is also a good choice when performance requirements are low.
3. Use AUTO_FLUSH_BACKGROUND only in demo scenarios, where the code can stay very simple and still perform well because exception handling is ignored. It is not recommended in production: rows may be written out of order, and once error handling is taken into account the code becomes cumbersome.
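The MANUAL_FLUSH pattern from item 1 comes down to three steps: buffer rows, flush once a threshold is reached, and flush one last time at the end. A minimal JDK-only sketch of that control flow is shown here. ManualFlushBatcher and its Consumer sink are hypothetical stand-ins for KuduSession.apply() and flush(), not Kudu API, and the sketch flushes when the pending count reaches the threshold, whereas the test program flushes once the count exceeds half the buffer size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ManualFlushBatcher<T> {
    private final int flushThreshold;
    // Hypothetical sink: receives each flushed batch (stands in for a real write).
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();
    private int flushCount = 0;

    ManualFlushBatcher(int flushThreshold, Consumer<List<T>> sink) {
        if (flushThreshold < 1) {
            throw new IllegalArgumentException("flushThreshold must be >= 1");
        }
        this.flushThreshold = flushThreshold;
        this.sink = sink;
    }

    // Buffer one item and flush as soon as the threshold is reached.
    void add(T item) {
        pending.add(item);
        if (pending.size() >= flushThreshold) {
            flush();
        }
    }

    // Flush whatever is left; call this once after the last add().
    void finish() {
        if (!pending.isEmpty()) {
            flush();
        }
    }

    // Hand a copy of the pending items to the sink and reset the buffer.
    private void flush() {
        sink.accept(new ArrayList<>(pending));
        pending.clear();
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }
}
```

Because every flush happens at a known point in the caller's own thread, a failure raised by the sink can be handled right where the batch was submitted, which is the property that makes the manual mode easy to reason about.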
Kudu Series: Java API usage and efficiency testing