This article compares HBase batch put write performance in two scenarios: a single column family with a single column qualifier, and a single column family with multiple column qualifiers. Below is a simple analysis of the observed behavior, based on the HBase source code.
1. Test results
When the client performs batch writes, TPS and RPC counts differ dramatically between single column qualifier and multiple column qualifier modes. Take the test data collected with WAL enabled and 10 client threads as an example:
In single column family, single column qualifier mode, TPS reaches 12403.87 and the actual RPC count is 53.
In single column family, multiple column qualifier mode, TPS is only 1730.68 and the actual RPC count is 478.
That is roughly a 7x difference in TPS and a 9x difference in RPC count. The detailed test environment is not listed here; we are concerned only with the performance difference between these two scenarios.
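The original test client is not listed in this article; purely as an illustration, here is a minimal sketch of such a batch-put test client, written against the 0.94-era HBase client API used elsewhere in this article. The table name "test_table", the family name "t", and the even split of value bytes across fields are assumptions, not details from the test:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_table");  // table name is an assumption
        table.setAutoFlush(false);                      // buffer puts on the client side
        table.setWriteBufferSize(5 * 1024 * 1024);      // 5MB write buffer, as in this test
        List<Put> puts = new ArrayList<Put>();
        for (int r = 0; r < 10000; r++) {
            byte[] rowKey = Bytes.toBytes(String.format("%064d", r)); // 64-byte row key
            Put put = new Put(rowKey);
            // multiple column qualifier mode: 53 fields, ~751 value bytes per row in total
            // (single column qualifier mode would instead do one add() with all 751 bytes)
            for (int i = 0; i < 53; i++) {
                put.add(Bytes.toBytes("t"), Bytes.toBytes("col" + i), new byte[751 / 53]);
            }
            puts.add(put);
        }
        table.put(puts);       // enters doPut(), analyzed in section 3
        table.flushCommits();  // push out whatever is still buffered
        table.close();
    }
}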
2. Rough Analysis
Let's start with a "rough" analysis based on HBase's storage principles:
The KeyValue class in HBase carries roughly 50~60 bytes of its own fields per instance (see the HBase source, org/apache/hadoop/hbase/KeyValue.java). Suppose the client puts one row of data (53 fields, a 64-byte row key, and 751 bytes of value in total):
1) WAL enabled, single column family, single column qualifier, batch put: (50~60) + 64 + 751 = 865~875 bytes;
2) WAL enabled, single column family, multiple column qualifiers, batch put: ((50~60) + 64) * 53 + 751 = 6793~7323 bytes.
Overall, then, the latter actually transmits (6793~7323 bytes)/(865~875 bytes) = 7.85~8.36 times as much data as the former, which basically matches the measured RPC ratio of 478/53 ≈ 9 (since the client write buffer size is fixed, the ratio of request counts reflects the ratio of the actual amount of data transferred).
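The same estimate, written out as a tiny runnable sketch (every number here is an assumption stated above, not a measurement):

public class RoughEstimate {
    public static void main(String[] args) {
        // per-KeyValue field overhead 50~60 bytes; row key 64 bytes;
        // 751 value bytes per row; 53 fields (all assumptions from above)
        for (int kvOverhead : new int[] { 50, 60 }) {
            int singleColumn = (kvOverhead + 64) + 751;      // one KeyValue carries the whole value
            int multiColumn = (kvOverhead + 64) * 53 + 751;  // 53 KeyValues, each repeating the overhead
            System.out.printf("%d vs %d bytes -> ratio %.2f%n",
                singleColumn, multiColumn, (double) multiColumn / singleColumn);
        }
        // prints the 865~875 vs 6793~7323 byte figures quoted above
    }
}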
3. Source Code Analysis
Next, let's validate the above theoretical estimate by analyzing the HBase client source code:
When the HBase client performs a put operation, Put.heapSize() is used to accumulate the size of the data in the current client buffer, and flushCommits() is invoked to submit the buffered data to the server when any of the following conditions is met:
1) Each call to the put method may pass in a List<Put>; every DOPUT_WB_CHECK entries (10 by default), the client checks whether the currently cached data exceeds writeBufferSize (set to 5MB in this test) and, if so, forces a flush;
2) If autoFlush is set to true, a flush is performed after each such put call;
3) If autoFlush is set to false but the currently cached data exceeds the configured writeBufferSize, a flush is performed.
private void doPut(final List<Put> puts) throws IOException {
    int n = 0;
    for (Put put : puts) {
        validatePut(put);
        writeBuffer.add(put);
        currentWriteBufferSize += put.heapSize();
        // we need to periodically see if the writebuffer is full instead
        // of waiting until the end of the List
        n++;
        if (n % DOPUT_WB_CHECK == 0 && currentWriteBufferSize > writeBufferSize) {
            flushCommits();
        }
    }
    if (autoFlush || currentWriteBufferSize > writeBufferSize) {
        flushCommits();
    }
}
As the code above shows, the client accumulates put.heapSize() over its cached data and uses that as the flush criterion. Following the actual test data, we can write code that builds the Put objects from the test (one row of 53 fields, 751 bytes of value in total) and prints how much client cache each one actually occupies:
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutHeapSize {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // single column put size
        byte[] rowKey = new byte[64];
        byte[] value = new byte[751];
        Put singleColumnPut = new Put(rowKey);
        singleColumnPut.add(Bytes.toBytes("t"), Bytes.toBytes("col"), value);
        System.out.println("single column put size: " + singleColumnPut.heapSize());

        // multiple columns put size
        value = null;
        Put multipleColumnsPut = new Put(rowKey);
        for (int i = 0; i < 53; i++) {
            multipleColumnsPut.add(Bytes.toBytes("t"), Bytes.toBytes("col" + i), value);
        }
        // the 751 value bytes are added back on top of the value-less puts
        System.out.println("multiple columns put size: " + (multipleColumnsPut.heapSize() + 751));
    }
}
The results of the program output are as follows:
single column put size: 1208
multiple columns put size: 10575
From these results, 10575/1208 = 8.75, which is very close to both the theoretical estimate above (7.85~8.36 times) and the actual test result (about 9 times), essentially verifying the accuracy of the analysis.
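Another way to connect heapSize to the RPC counts: the per-put footprints just measured determine how many puts fit into the 5MB write buffer before each flush. A small sketch of that arithmetic, using only the numbers printed above:

public class FlushCountEstimate {
    public static void main(String[] args) {
        long writeBufferSize = 5L * 1024 * 1024;  // 5MB, as configured in the test
        long singleColumnPut = 1208;              // PutHeapSize output above
        long multiColumnPut = 10575;              // PutHeapSize output above
        long perFlushSingle = writeBufferSize / singleColumnPut;  // ~4340 puts per flush
        long perFlushMulti = writeBufferSize / multiColumnPut;    // ~495 puts per flush
        System.out.println(perFlushSingle + " vs " + perFlushMulti + " puts per flush");
        // For the same number of rows written, multi-column mode needs about
        // 4340/495 = 8.8 times as many flushCommits() round trips,
        // consistent with the measured 478/53 = 9.
    }
}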
If you are also interested in the Put.heapSize() method, you can read its implementation; you will find that for a Put object, the size of its KeyValue objects is the main determinant of the object's overall heapSize. To verify this further with a concrete example, the following code calculates the KeyValue heapSize for one row of data in the single column and multiple columns cases:
import org.apache.hadoop.hbase.KeyValue;

public class KeyValueHeapSize {
    /**
     * @param args
     */
    public static void main(String[] args) {
        // single column KeyValue size
        byte[] row = new byte[64];       // test row length
        byte[] family = new byte[1];     // test family length
        byte[] qualifier = new byte[4];  // test qualifier length
        long timestamp = 123456L;        // ts
        byte[] value = new byte[751];    // test value length
        KeyValue singleColumnKv = new KeyValue(row, family, qualifier, timestamp, value);
        System.out.println("single column KeyValue size: " + singleColumnKv.heapSize());

        // multiple columns KeyValue size
        value = null;
        KeyValue multipleColumnsWithoutValueKv = new KeyValue(row, family, qualifier, timestamp, value);
        System.out.println("multiple columns KeyValue size: "
            + (multipleColumnsWithoutValueKv.heapSize() * 53 + 751));
    }
}
The results of the program output are as follows:
single column KeyValue size: 920
multiple columns KeyValue size: 10079
Comparing this with the output of the earlier PutHeapSize program, the KeyValue objects do indeed account for most of each Put object's heapSize. At the KeyValue level, the ratio of data between the two cases is 10079/920 = 10.9 times, again close to the actual test value.
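As a quick back-of-the-envelope check on those outputs (a sketch using only the numbers printed above):

public class KeyValueOverheadCheck {
    public static void main(String[] args) {
        // single column: one KeyValue carrying the full 751-byte value
        System.out.println(920 - 751);           // 169 bytes of per-KeyValue overhead
        // multiple columns: 53 value-less KeyValues, the 751 value bytes added back separately
        System.out.println((10079 - 751) / 53);  // 176 bytes per value-less KeyValue
    }
}

These per-KeyValue footprints (169~176 bytes) come out above the rough 50~60 byte field estimate plus the 64-byte row key because heapSize also counts Java object and array headers.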
4. Relevant conclusions
Through the above analysis, we can draw the following conclusions:
In actual scenarios, comparing the single column qualifier and multiple column qualifier cases: the longer the value, the shorter the row key, and the fewer the fields (column qualifiers), the smaller the difference in actual transmitted data between the two; otherwise, the difference will be large (see the sketch after these conclusions).
If you store data under multiple column qualifiers and the client writes in batches, you can, depending on the actual situation, increase the client's write buffer size appropriately to improve write throughput.
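To make the first conclusion concrete, here is a small sketch of the estimated transfer ratio as a function of row key length, field count, and total value length. It reuses the rough 50~60 byte per-KeyValue overhead from section 2 (taken as 55 bytes); the sample parameters are illustrative, not from the test:

public class TransferRatio {
    // estimated multi-column vs single-column transfer ratio for one row
    static double ratio(int rowKeyLen, int fields, int totalValueLen) {
        int perKv = 55 + rowKeyLen;  // rough per-KeyValue overhead plus row key copy
        return (double) (perKv * fields + totalValueLen) / (perKv + totalValueLen);
    }

    public static void main(String[] args) {
        System.out.println(ratio(64, 53, 751));    // this article's test: ~8.1x
        System.out.println(ratio(16, 10, 10000));  // long value, short key, few fields: ~1.06x
        System.out.println(ratio(64, 100, 200));   // short value, many fields: ~38x
    }
}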
Author: Great Circle Those Things
URL: http://www.cnblogs.com/panfeng412/archive/2013/11/28/Hbase-batch-put-performance-analysis-of-single-column-and-multiple-columns.html