This article summarizes common questions that come up when developing with and using HBase, in the hope of clearing up some doubts for newcomers to HBase.
1. Is HTable thread-safe?
HTable is not thread-safe. It is recommended that each HTable object be used within a single thread, and that HTable objects not be shared among different threads.
At the same time, to improve client efficiency, different HTable objects should share the same Configuration object so that they share the cached HBase server-side metadata; see the HTable section of HBase: The Definitive Guide.
2. How should HTablePool be used?
Before HBase 0.90, when using HTablePool you obtained an HTable object with the getTable method and returned it to the pool with the putTable method once you were done.
In versions after HBase 0.90, putTable is no longer called after use; instead, calling the HTable close method returns the HTable object to the pool.
3. What is the difference between scanner caching and batch?
Caching is the number of records (Result objects) fetched from the server to the client at a time, while batch is the maximum number of KeyValues returned to the client in a single Result object.
For example: if each row has 5 KeyValues and only caching=10 is set, then, assuming the table being scanned holds more than 10 rows, one fetch from the server returns 10 Results, each containing 5 KeyValues:
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
* * * * *
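If you also set batch=2, one fetch still returns 10 Results, but each Result holds at most 2 KeyValues, so every row of 5 KeyValues is split into Results of 2, 2, and 1. The 10 Results therefore carry 17 KeyValues in total: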
* *
* *
*
* *
* *
*
* *
* *
*
* *
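The counting in the example above can be checked with a small model. This is only a sketch that does not talk to HBase: the class `ScanModel` and its methods are hypothetical names, and it assumes every row has the same number of KeyValues and that the table never runs out of rows.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of scanner caching and batch; hypothetical, not an HBase class. */
class ScanModel {

    /**
     * Sizes (in KeyValues) of the Results delivered by one fetch.
     * kvPerRow: KeyValues in every row; caching: Results per fetch;
     * batch: maximum KeyValues per Result (0 or less means "whole row").
     */
    static List<Integer> resultSizesForOneFetch(int kvPerRow, int caching, int batch) {
        List<Integer> sizes = new ArrayList<>();
        int leftInRow = kvPerRow; // KeyValues of the current row not yet returned
        while (sizes.size() < caching) {
            int take = (batch <= 0) ? leftInRow : Math.min(batch, leftInRow);
            sizes.add(take);
            leftInRow -= take;
            if (leftInRow == 0) {
                leftInRow = kvPerRow; // current row exhausted, move to the next row
            }
        }
        return sizes;
    }

    /** Total number of KeyValues carried by the given Results. */
    static int totalKeyValues(List<Integer> sizes) {
        int total = 0;
        for (int size : sizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> cachingOnly = resultSizesForOneFetch(5, 10, 0);
        List<Integer> withBatch = resultSizesForOneFetch(5, 10, 2);
        System.out.println(cachingOnly + " total=" + totalKeyValues(cachingOnly));
        System.out.println(withBatch + " total=" + totalKeyValues(withBatch));
    }
}
```

With caching=10 alone the model yields ten Results of 5 KeyValues; adding batch=2 yields the 2, 2, 1 pattern shown above, 17 KeyValues in all.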
4. Is put(List<Put> puts) always more efficient?
put(List) is meant to increase throughput. Writes run in parallel across different region servers; on one region server, the puts destined for the same region are written together, while different regions are processed serially.
This is because, on a given region server, the puts for one region are packaged into a single log write, which is equivalent to writing them in parallel, whereas the different regions are handled by a for loop rather than a thread pool and are therefore serial.
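The grouping described above can be pictured with a toy model. It is a sketch only: `PutListModel`, `PendingPut`, and the server and region names are hypothetical and are not HBase classes, and the model merely mirrors the description in this article (parallel across servers, one batch per region, regions on one server handled one after another).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of how a put(List) call is split up; hypothetical, not HBase code. */
class PutListModel {

    /** A pending write tagged with the (hypothetical) server and region owning its row. */
    static final class PendingPut {
        final String server;
        final String region;
        final String row;

        PendingPut(String server, String region, String row) {
            this.server = server;
            this.region = region;
            this.row = row;
        }
    }

    /** Groups rows by server, then by region, preserving arrival order. */
    static Map<String, Map<String, List<String>>> group(List<PendingPut> puts) {
        Map<String, Map<String, List<String>>> byServer = new LinkedHashMap<>();
        for (PendingPut p : puts) {
            byServer.computeIfAbsent(p.server, k -> new LinkedHashMap<>())
                    .computeIfAbsent(p.region, k -> new ArrayList<>())
                    .add(p.row);
        }
        return byServer;
    }

    /** Largest number of region batches that any single server must process serially. */
    static int longestSerialChain(Map<String, Map<String, List<String>>> grouped) {
        int longest = 0;
        for (Map<String, List<String>> regions : grouped.values()) {
            longest = Math.max(longest, regions.size());
        }
        return longest;
    }

    public static void main(String[] args) {
        List<PendingPut> puts = new ArrayList<>();
        puts.add(new PendingPut("rs1", "regionA", "row1"));
        puts.add(new PendingPut("rs1", "regionA", "row2"));
        puts.add(new PendingPut("rs1", "regionB", "row3"));
        puts.add(new PendingPut("rs2", "regionC", "row4"));
        Map<String, Map<String, List<String>>> grouped = group(puts);
        System.out.println(grouped);
        System.out.println("longest serial chain: " + longestSerialChain(grouped));
    }
}
```

Under this model, a list whose puts land in a few regions spread over many servers benefits most, while a list scattered over many regions of one server degenerates into a long serial chain.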
5. Are the correctness and performance of increment reliable?
At present, increment still cannot fully guarantee correctness: the result may be wrong when a network timeout occurs or when a region server fails over, so whether to use it depends on the application scenario.
As for performance, increment performs well; in HBase 0.94 in particular it differs little from an ordinary put write, reaching 7000+ per region server.
6. Is it true that the more regions, the better?
Too many regions hurt flush efficiency and region server failover efficiency, and put greater pressure on HDFS. Too few regions affect reads, because each region then holds more StoreFiles; if the number of StoreFiles per region is small (online clusters are generally kept at 1.x to 2.x StoreFiles), there is no noticeable effect on reads.
For writes, the number of regions has little impact, unless there are very few regions and the write pressure is very high, such as the special case of a region server hosting only one region.
Currently, the recommendation for HBase 0.90 is at most 3,000 regions per region server, while for HBase 0.94 it is at most 1,000; in fact, even a few hundred is already a lot.
7. How expensive is the WAL (write-ahead log)?
In general, when you put data to a region server with the WAL enabled, the server side needs roughly 1 ms to 2 ms of overhead. With the WAL disabled, the server needs only about 0.x ms (a fraction of a millisecond), because the write is basically a MemStore in-memory operation.
In addition, in either case a put operation also incurs the network round-trip time (RTT), typically 0.5 ms to 1 ms.
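As a back-of-envelope illustration of what these latencies imply, the sketch below estimates how many puts one synchronous client thread could complete per second if each put costs the server-side overhead plus one round trip. The class name `PutLatencyEstimate` and the sample values (picked from the ranges quoted above) are assumptions; real throughput also depends on client-side buffering, batching, and concurrency.

```java
/** Rough latency arithmetic for a single synchronous client thread; hypothetical helper. */
class PutLatencyEstimate {

    /**
     * Estimated puts per second when every put waits for
     * serverMillis of server-side work plus rttMillis of network round trip.
     */
    static double putsPerSecond(double serverMillis, double rttMillis) {
        if (serverMillis < 0 || rttMillis < 0 || serverMillis + rttMillis == 0) {
            throw new IllegalArgumentException("latencies must be non-negative and not both zero");
        }
        return 1000.0 / (serverMillis + rttMillis);
    }

    public static void main(String[] args) {
        // Sample values chosen from the ranges in the text; they are illustrative only.
        System.out.println("WAL on  (1.5 ms + 0.7 ms RTT): " + putsPerSecond(1.5, 0.7));
        System.out.println("WAL off (0.3 ms + 0.7 ms RTT): " + putsPerSecond(0.3, 0.7));
    }
}
```

The estimate is an upper bound for one thread issuing one put at a time; it says nothing about the aggregate capacity of a region server.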
8. How high is the write performance of a table with a single region?
Without any optimization, the Java API reaches about 800 TPS with the WAL enabled and 1000+ TPS with the WAL disabled, which is basically the upper limit set by the network. The Thrift interface loses a further 10% to 20% in performance: actual tests show that it reaches about 550 TPS when the network RTT is 0.7 ms, and only about 350 TPS when the RTT is 2.0 ms.
These figures are far below the actual write performance of HBase, because no optimization was applied. In a real application, pre-splitting regions, using multiple clients, multithreading, batch writes, and similar techniques can push each region server to its performance limit.
Note: with a table that has a single region, low-concurrency writes over a short period cause no problem, but under highly concurrent writes a region split may cause requests to be blocked.
9. How high is the write performance of a single region server?
To stress-test the write capacity of a single region server, create a new test table pre-split into multiple regions (for example, 10 to 50), then write randomly generated rowkeys and values to the HBase cluster. Dividing the final test result by the number of region servers in the cluster gives a rough write performance figure for a single region server.
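A test plan along these lines could be prepared as sketched below. The class `StressTestPlan`, the hexadecimal rowkey layout, and the sample numbers in `main` are assumptions made for illustration; the sketch only computes split keys and random rowkeys and does not create a table or contact a cluster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Helpers for planning a write stress test; hypothetical, does not call HBase. */
class StressTestPlan {

    /**
     * Evenly spaced split keys over the 4-hex-digit prefix space 0000..ffff.
     * A table with n regions needs n - 1 split keys.
     */
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > 0x10000) {
            throw new IllegalArgumentException("regions must be between 1 and 65536");
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            int boundary = (int) ((long) i * 0x10000 / regions);
            keys.add(String.format("%04x", boundary));
        }
        return keys;
    }

    /** Random rowkey of 16 hex digits, so writes spread evenly over the split ranges. */
    static String randomRowKey(Random rnd) {
        return String.format("%016x", rnd.nextLong());
    }

    /** Rough per-server figure: cluster-wide result divided by the number of region servers. */
    static double perServerTps(double clusterTps, int regionServers) {
        if (regionServers < 1) {
            throw new IllegalArgumentException("need at least one region server");
        }
        return clusterTps / regionServers;
    }

    public static void main(String[] args) {
        System.out.println("split keys for 4 regions: " + splitKeys(4));
        System.out.println("sample rowkey: " + randomRowKey(new Random()));
        // 60000 TPS over 4 servers is a made-up input, not a measured result.
        System.out.println("per-server TPS: " + perServerTps(60000, 4));
    }
}
```

The split keys would be passed to whatever table-creation call the HBase version in use provides, and the random rowkeys keep the load from concentrating on one region.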
According to the HBase cluster test team's results, on a region server machine with a 16-core CPU and 24 GB of memory (1 GB allocated to the DataNode process and 16 GB to the HBase RegionServer process), a single region server can exceed 30,000 TPS with the WAL disabled, and typically reaches 9,000 to 20,000 TPS with the WAL enabled. The HBase Thrift interface achieves 80% to 90% of the performance of the native Java API.
10. What problems do the empty regions caused by a rowkey containing a timestamp bring?
Suppose the rowkey contains a timestamp and a TTL is set so that expired data is deleted automatically. As the timestamp value keeps increasing, the expired data is removed but the emptied regions remain. Too many regions hurt flush efficiency and region server failover efficiency and put greater pressure on HDFS, and the many empty regions also take up metadata.
There are generally two ways to solve this problem:
1) Create separate tables per time period: for example, create one table per day and switch tables at the day boundary for reads and writes. The drawback is that a table has to be created every day; because DDL operations go through the master, any master problem can easily affect the application. To guard against this, you generally need to create the tables a few days in advance.
2) Transform the timestamp field: the rowkey does not store the current timestamp directly, but instead stores the time elapsed (in seconds or milliseconds) since Monday 00:00:00 of the current week, so the table holds at most the number of regions needed for one week of data. When the next week starts, the earlier regions are reused, which avoids region growth. The query side must transform its timestamps in the same way. For extensibility, if data needs to be kept for longer than one week, the cycle can be extended from one week to one month.
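The second approach can be sketched as follows. The class `WeekOffsetKey`, the rowkey layout (zero-padded offset, underscore, application id), and the choice of seconds as the unit are assumptions for illustration; the time zone is passed in explicitly because "Monday 00:00:00" depends on it.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;
import java.util.Locale;

/** Week-relative timestamp for rowkeys; hypothetical helper, not part of HBase. */
class WeekOffsetKey {

    /** Seconds elapsed since Monday 00:00:00 of the week containing the given instant. */
    static long secondsSinceMonday(Instant when, ZoneId zone) {
        ZonedDateTime t = when.atZone(zone);
        ZonedDateTime monday = t.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY))
                                .truncatedTo(ChronoUnit.DAYS);
        return ChronoUnit.SECONDS.between(monday, t);
    }

    /** Hypothetical rowkey layout: 6-digit zero-padded offset, then an application id. */
    static String rowKey(Instant when, ZoneId zone, String id) {
        return String.format(Locale.ROOT, "%06d_%s", secondsSinceMonday(when, zone), id);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(secondsSinceMonday(now, ZoneOffset.UTC));
        System.out.println(rowKey(now, ZoneOffset.UTC, "sensor42"));
    }
}
```

Writers and readers would both call the same transformation, so a query for a given wall-clock range is translated into the matching offset range. Because the offset wraps every week, this only works if the TTL removes old data before its key range is reused.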