HBase performance optimization notes

Source: Internet
Author: User

1 hbase.hregion.max.filesize

2 autoflush = false

3 Concerning the settings of family and qualifier in a table from the perspective of performance

4 hbase.regionserver.handler.count

1 hbase.hregion.max.filesize
Default value: 256 MB
Description: Maximum HStoreFile size. If any one of a column family's HStoreFiles has grown to exceed this value, the hosting HRegion is split in two.

In other words, once the store files of any column family (that is, any HStore) exceed this value, the region they belong to is split into two.

Optimization:

The default maximum HFile size in HBase (hbase.hregion.max.filesize) is 256 MB, and Google's Bigtable paper likewise recommends a tablet size of 100-200 MB. What is the reasoning behind this size?
Data in HBase is first written to the memstore. When the memstore reaches 64 MB, it is flushed to disk and becomes a storefile. When the number of storefiles exceeds 3, a compaction is started to merge them into a single storefile. During this process, stale data, such as expired timestamps and superseded updates, is deleted. When the merged storefile grows beyond the maximum HFile size, a split is triggered and the region is divided into two regions.
The author ran a sustained-insert stress test with different values of hbase.hregion.max.filesize. The smaller the value, the higher the average throughput, but the less stable the throughput. The larger the value, the lower the average throughput, and the shorter the periods of instability.

Why? The inference is as follows:

A. When hbase.hregion.max.filesize is relatively small, splits are triggered more often, and a region is taken offline while it splits. Requests to that region are therefore blocked until the split finishes; the client's default blocking time is 1 s. When a large number of regions split at the same time, the overall service of the system is significantly affected, so throughput and response time tend to be unstable.
B. When hbase.hregion.max.filesize is relatively large, a single region is less likely to trigger a split, and many regions splitting at the same time is also less likely, so throughput is more stable than with a small HFile size. However, because a region goes a long time without splitting, it is more likely to be compacted several times. Compaction reads the original data, rewrites it to HDFS, and then deletes the original data. In a system where I/O is the bottleneck, this inevitably slows things down, so average throughput drops.
Based on these two cases, hbase.hregion.max.filesize should be neither too large nor too small. [...] MB may be a better empirical value. For offline applications, adjusting to 256 MB is more appropriate. For online applications, the value should not go below [...] MB unless the split mechanism itself is modified.
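
The flush, compaction, and split sequence described above, and the trade-off in cases A and B, can be sketched with a toy model. This is not HBase code: the function simulate and its counters are hypothetical names, every compaction is assumed to merge all storefiles of the region, and after a split the model simply follows one daughter region holding half the data. Only the 64 MB flush size and the "more than 3 storefiles" trigger come from the text.

```python
# Toy model of one region with a single store; not HBase code.
MEMSTORE_FLUSH_MB = 64     # memstore size at which a flush happens (from the text)
COMPACTION_MIN_FILES = 3   # compaction starts when the storefile count exceeds this


def simulate(total_write_mb, max_filesize_mb):
    """Feed total_write_mb into one region and count the resulting events."""
    storefiles = []  # sizes in MB of the storefiles of the region being followed
    stats = {"flushes": 0, "compactions": 0, "splits": 0, "rewritten_mb": 0}
    written = 0
    while written + MEMSTORE_FLUSH_MB <= total_write_mb:
        # A full memstore is flushed to disk as a new storefile.
        written += MEMSTORE_FLUSH_MB
        storefiles.append(MEMSTORE_FLUSH_MB)
        stats["flushes"] += 1
        if len(storefiles) > COMPACTION_MIN_FILES:
            # Compaction reads every storefile and rewrites them as one file.
            merged = sum(storefiles)
            stats["compactions"] += 1
            stats["rewritten_mb"] += merged
            storefiles = [merged]
            if merged > max_filesize_mb:
                # Split: keep following one daughter with half of the data.
                stats["splits"] += 1
                storefiles = [merged // 2]
    return stats


if __name__ == "__main__":
    for limit in (256, 1024):
        print(limit, simulate(2560, limit))
```

In this model, the smaller limit produces more splits for the same amount of written data, while the larger limit makes compaction rewrite more data, which matches the direction of cases A and B. The absolute numbers mean nothing outside the model.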

2 autoflush = false

Both the official website and many blogs advocate setting autoflush = false in application code to improve HBase write speed. The author, however, believes this setting should be used with caution in online applications. The reasons are as follows:

A. With autoflush = false, when the client submits a delete or put request, the request is cached on the client until the buffered data exceeds 2 MB (determined by hbase.client.write.buffer) or the user calls HTable.flushCommits(); only then is it sent to the regionserver. Therefore, even if HTable.put() returns successfully, the request has not actually succeeded yet. If the client crashes before the buffer reaches the flush threshold, this data is lost because it was never sent to the regionserver. This is unacceptable for online services with zero tolerance for data loss.

B. Although autoflush = true reduces write speed by a factor of 2-3, it must be enabled for many online applications, which is why HBase sets its default to true. When the value is true, each request is sent to the regionserver immediately. The first thing the regionserver does after receiving a request is write the HLog, so the I/O demand is very high. To improve HBase write speed, increase I/O throughput as much as possible, for example by adding disks, using RAID cards, or reducing the replication factor.
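
The data-loss window described in point A can be illustrated with a small simulation. This is a sketch of the buffering behavior only, not the HBase client: ToyClient, ToyRegionServer, flush_commits, and run are hypothetical names, and the only figure taken from the text is the 2 MB write buffer.

```python
WRITE_BUFFER_BYTES = 2 * 1024 * 1024  # the 2 MB quoted for hbase.client.write.buffer


class ToyRegionServer:
    """Stand-in for the regionserver: whatever arrives here counts as delivered."""

    def __init__(self):
        self.received = []
        self.rpc_calls = 0

    def receive(self, batch):
        self.rpc_calls += 1
        self.received.extend(batch)


class ToyClient:
    """Hypothetical client that mimics autoflush on/off; not the HBase API."""

    def __init__(self, server, autoflush):
        self.server = server
        self.autoflush = autoflush
        self.buffer = []
        self.buffered_bytes = 0

    def put(self, row, value):
        # put() returns normally even if nothing has been sent yet.
        self.buffer.append((row, value))
        self.buffered_bytes += len(value)
        if self.autoflush or self.buffered_bytes > WRITE_BUFFER_BYTES:
            self.flush_commits()

    def flush_commits(self):
        if self.buffer:
            self.server.receive(self.buffer)
        self.buffer = []
        self.buffered_bytes = 0


def run(autoflush, n_puts=1000, value_size=1024):
    """Issue n_puts, then 'crash' without flushing; return (delivered, lost, rpcs)."""
    server = ToyRegionServer()
    client = ToyClient(server, autoflush)
    for i in range(n_puts):
        client.put(f"row-{i}", b"x" * value_size)
    lost = len(client.buffer)  # still on the client when it crashes
    return len(server.received), lost, server.rpc_calls
```

With autoflush off, 1000 puts of 1 KB each never fill the 2 MB buffer, so a crash at that point loses all of them; with autoflush on, every put costs one round trip but nothing is left on the client.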

3 Concerning the settings of family and qualifier in a table from the perspective of performance
When a table from a traditional relational database is remodeled in HBase, how should family and qualifier be set from a performance point of view?
Consider the two extremes: ① each column is its own family; ② the table has only one family, and every column is a qualifier in it. What is the difference?

Read considerations:
The more families there are, the greater the advantage when fetching the data of a single cell, because I/O and network traffic are reduced.

If there is only one family, every read fetches all the data of the current rowkey, which costs some extra network traffic and I/O.

Of course, if you always fetch a fixed set of columns together, it is better to put those columns in one family than to give each its own family, because a single request can then retrieve all the data.
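
The read-side reasoning above can be made concrete with a toy cost model. It takes the article's premise at face value, namely that reading from a family loads all of that family's data for the row; the function touched, the layouts, and the column sizes are hypothetical illustration values, not measurements.

```python
def touched(layout, wanted_columns):
    """Return (bytes_read, stores_touched) for one row.

    layout maps family -> {qualifier: value size in bytes}. Premise taken
    from the text: reading a family reads all of its data for the row.
    """
    bytes_read = 0
    stores = 0
    for qualifiers in layout.values():
        if any(col in qualifiers for col in wanted_columns):
            stores += 1
            bytes_read += sum(qualifiers.values())
    return bytes_read, stores


# Hypothetical row with two small columns and one large one.
one_family = {"f": {"name": 20, "email": 30, "bio": 5000}}
per_column = {"name": {"name": 20}, "email": {"email": 30}, "bio": {"bio": 5000}}
grouped = {"small": {"name": 20, "email": 30}, "big": {"bio": 5000}}

if __name__ == "__main__":
    for label, layout in (("one family", one_family),
                          ("family per column", per_column),
                          ("grouped", grouped)):
        print(label, touched(layout, ["name"]), touched(layout, ["name", "email"]))
```

Under this premise, a single-cell read is cheapest with one family per column, while a fixed pair of columns read together is cheapest when both live in the same small family, since only one store is touched.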

From the write perspective:

First, memory. For each region, a store is allocated for every family in the table, and a memstore is allocated for every store. More families therefore consume more memory.
Second, flush and compaction. In the current version of HBase, both flush and compaction operate on whole regions. That is, when one family reaches the flush condition, the memstores of all families in the region are flushed, even those holding only a small amount of data, which generates small files. This increases the probability of compaction. Compaction is also region-based, so compaction storms become likely and reduce the overall throughput of the system.
Third, split. Because HFiles are per family, with multiple families the data is spread across more HFiles, which reduces the probability of a split. This is a double-edged sword. On the downside, fewer splits lead to larger regions, and because balancing is based on the number of regions rather than their size, balancing may become ineffective. On the upside, fewer splits allow the system to provide more stable online service. The downside can be avoided by manually splitting and balancing during periods of low request volume.
Therefore, for write-heavy systems: if the application is offline, try to use only one family; if it is an online application, allocate families sensibly according to the application's needs.
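
The small-file effect of region-wide flushes can be sketched as follows. The model follows the article's description (one family reaching the flush condition flushes every memstore in the region); storefiles_after and the per-step write sizes are hypothetical, and the 64 MB threshold is the figure quoted earlier in the text.

```python
FLUSH_MB = 64  # flush threshold quoted earlier in the text


def storefiles_after(writes_per_step_mb, steps):
    """Return the sizes (MB) of the storefiles produced by region-wide flushes.

    writes_per_step_mb[i] is how much family i receives per step. As the
    text describes, when any memstore reaches the threshold, all memstores
    of the region are flushed together.
    """
    memstores = [0] * len(writes_per_step_mb)
    files = []
    for _ in range(steps):
        for i, mb in enumerate(writes_per_step_mb):
            memstores[i] += mb
        if max(memstores) >= FLUSH_MB:
            # Every non-empty memstore becomes its own storefile.
            files.extend(size for size in memstores if size > 0)
            memstores = [0] * len(memstores)
    return files


if __name__ == "__main__":
    # Same 16 MB per step in both cases, spread differently across families.
    print(storefiles_after([16], 20))
    print(storefiles_after([13, 1, 1, 1], 20))
```

For the same 320 MB of writes, the single-family region produces a handful of full-size files, while the four-family region produces several times as many files, most of them only a few MB, which is the input that drives extra compactions.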

4 hbase.regionserver.handler.count

The number of RPC listener instances started on the regionserver, that is, the number of I/O request threads the regionserver can process. The default value is 10.

This parameter is closely tied to memory. When setting it, use memory monitoring as the main reference.

For scenarios where a single request consumes a lot of memory (large puts, or scans with a large cache), or where the regionserver is short on memory, set it relatively low.

For scenarios with low memory consumption per request and high TPS (transactions per second) requirements, it can be set relatively high.
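
The memory reasoning above amounts to a rough bound: if every handler may hold one request in memory at the same time, the handler count times the per-request footprint should fit in the memory set aside for requests. The helper below is a back-of-the-envelope sketch under that assumption; max_handlers and both of its parameters are hypothetical, not an HBase formula, and real values should come from monitoring as the text says.

```python
def max_handlers(memory_budget_mb, per_request_mb):
    """Upper bound on handler threads if each may hold one request in memory.

    Assumption (not from HBase): worst case, all handlers are busy at once
    and each holds per_request_mb of request data.
    """
    if per_request_mb <= 0:
        raise ValueError("per_request_mb must be positive")
    return max(1, memory_budget_mb // per_request_mb)


if __name__ == "__main__":
    print(max_handlers(1000, 100))  # large puts or big scan caches
    print(max_handlers(1000, 2))    # small requests, high TPS
```

With the same memory budget, heavy requests justify only a small handler count, while light requests leave room for a much larger one.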
