Some thoughts on improving HBase write performance

Source: CSDN. Published: 2012-04-16 15:12.

The following are three thoughts from using HBase for a period of time. Write performance is the focus here because, given sufficient memory, HBase can already provide quite satisfactory read performance. Comments and differing opinions are welcome.

1 The effect of autoFlush=false

Both the official documentation and many blogs advocate setting autoFlush=false in application code to improve HBase write speed, but the author believes this setting should be applied cautiously in online applications, for the following reasons:

a. The principle of autoFlush=false is that when the client submits a Delete or Put request, the request is cached on the client side until the buffered data exceeds 2 MB (controlled by hbase.client.write.buffer) or the user explicitly calls HTable.flushCommits(); only then is the request actually submitted to the RegionServer. So even if HTable.put() returns successfully, it does not mean the request has really succeeded: if the client crashes before the buffer is flushed, that portion of the data is lost because it was never sent to the RegionServer. This is unacceptable for online services with zero tolerance for data loss.
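The buffering semantics described above can be sketched as follows. This is a minimal toy simulation of the client-side behavior, not the real HBase client API; the class and all names in it are invented for illustration.

```python
class BufferedHBaseClient:
    """Toy model of an HBase client with autoFlush=false (names invented)."""

    def __init__(self, write_buffer_bytes=2 * 1024 * 1024):
        self.write_buffer_bytes = write_buffer_bytes  # hbase.client.write.buffer
        self.buffer = []          # puts cached on the client
        self.buffered_size = 0
        self.server = []          # puts actually received by the RegionServer

    def put(self, row, value):
        # put() returns immediately: the request is only cached locally
        self.buffer.append((row, value))
        self.buffered_size += len(row) + len(value)
        if self.buffered_size >= self.write_buffer_bytes:
            self.flush_commits()  # sent only once the buffer fills up

    def flush_commits(self):
        # only now does the data reach the RegionServer
        self.server.extend(self.buffer)
        self.buffer.clear()
        self.buffered_size = 0


client = BufferedHBaseClient(write_buffer_bytes=64)
client.put("row1", "a" * 16)   # returns "successfully"...
assert client.server == []     # ...but nothing reached the server yet
# if the client crashed here, row1 would be silently lost
client.put("row2", "b" * 64)   # crosses the buffer limit -> automatic flush
assert len(client.server) == 2
```

With autoFlush=true, by contrast, every put() would go straight to the server, trading throughput for durability.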

b. autoFlush=true reduces write speed by a factor of 2-3, but for many online applications it must be enabled, which is why HBase makes it the default. When this value is true, each request is sent to the RegionServer individually, and the first thing the RegionServer does on receiving a request is write the HLog, so the demands on IO are very high. To improve HBase write speed in this mode, IO throughput should be increased as much as possible, for example by adding disks, using RAID cards, or reducing the replication factor.

2 How large should hbase.hregion.max.filesize be set?

The default maximum HFile size in HBase (hbase.hregion.max.filesize) is 256 MB, and Google's Bigtable paper likewise recommends a maximum tablet size of 100-200 MB. What is the rationale behind this size?

As is well known, data in HBase is first written to the MemStore; when the MemStore reaches 64 MB it is flushed to disk and becomes a StoreFile. When the number of StoreFiles exceeds 3, a compaction is triggered to merge them into a single StoreFile; this process also deletes data whose timestamps have expired, such as superseded updates. When the merged StoreFile is larger than the HFile maximum size, a split is triggered, dividing the region into two.
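The flush → compaction → split chain above can be sketched numerically. This is a deliberately simplified toy model using the thresholds named in the text (64 MB MemStore flush, compaction above 3 StoreFiles, split when the merged file exceeds hbase.hregion.max.filesize); it follows one region and, after each split, one daughter region. Real HBase is far more nuanced.

```python
MEMSTORE_FLUSH_MB = 64      # memstore flushes to a storefile at 64 MB
COMPACTION_THRESHOLD = 3    # compact once storefile count exceeds 3

def write_to_region(total_mb, max_filesize_mb=256):
    """Follow one region (and, after each split, one daughter region)
    while total_mb of data is written. Returns (storefiles, split_count)."""
    storefiles = []  # storefile sizes in MB
    splits = 0
    flushed = 0
    while flushed < total_mb:
        storefiles.append(MEMSTORE_FLUSH_MB)        # memstore fills, flushes
        flushed += MEMSTORE_FLUSH_MB
        if len(storefiles) > COMPACTION_THRESHOLD:  # too many files: compact
            merged = sum(storefiles)                # rewrite all data once
            storefiles = [merged]
            if merged > max_filesize_mb:            # merged file too big: split
                splits += 1
                storefiles = [merged // 2]          # follow one daughter region
    return storefiles, splits

# With the default 256 MB limit, 512 MB of writes cause a single split:
assert write_to_region(512) == ([224, 64], 1)
```

Lowering max_filesize_mb in this model makes splits markedly more frequent, which matches the stress-test observations described next.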

The author ran continuous-insert stress tests with different values of hbase.hregion.max.filesize and drew the following conclusions from the results: the smaller the value, the higher the average throughput but the less stable it is; the larger the value, the lower the average throughput but the more stable the throughput over time.

Why is that? The inference is as follows:

a. When hbase.hregion.max.filesize is relatively small, the probability of triggering a split is higher. A split takes the region offline, so requests to that region are blocked until the split finishes; the client-side blocking time defaults to 1 s. When many regions split at the same time, the system's overall service availability is greatly affected. Hence the instability in throughput and response time.

b. When hbase.hregion.max.filesize is relatively large, the probability of a single region triggering a split is lower, and the chance of many regions splitting at once is lower still, so throughput is more stable than with a small HFile size. However, because the region goes for long stretches without splitting, the same region undergoes more compactions. A compaction reads the original data, rewrites it to HDFS, and then deletes the original data; this inevitably slows down an IO-bound system, so average throughput suffers somewhat.

Weighing the two cases above, hbase.hregion.max.filesize should be neither too large nor too small; 256 MB is probably a reasonable empirical value. For offline applications, adjusting it to 128 MB may be more appropriate, while online applications should not go below 256 MB unless the split mechanism is reworked.
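For reference, this parameter is set in hbase-site.xml, with the value given in bytes (268435456 bytes = 256 MB):

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- 256 MB, expressed in bytes -->
  <value>268435456</value>
</property>
```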

3 Setting column families and qualifiers from a performance perspective

For a table in a traditional relational database, how should column families and qualifiers be set from a performance perspective when migrating the business model to HBase?

At the two extremes, every column can be given its own column family, or there can be a single column family with all columns as qualifiers under it. What is the difference?

The advantage of more column families shows when fetching the data of a single cell, because IO and network transfer are reduced; with only one column family, every read retrieves all the data under the current rowkey, incurring some extra network and IO cost.

Of course, if you want to fetch a fixed set of columns together, putting those columns in one column family is better than giving each its own column family, because a single request can retrieve all of the data.
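The read-side tradeoff above can be put in rough numbers. Below is a toy model in which a read must touch every byte stored under each column family that holds a wanted column (a simplification of how the read path scans a family's files; the function, schema, and sizes are invented for illustration).

```python
def bytes_touched(schema, wanted_columns):
    """schema: dict column -> (family, size_bytes).

    A read touches every byte of each family that holds at least
    one wanted column; returns the total bytes touched."""
    family_size = {}
    for col, (fam, size) in schema.items():
        family_size[fam] = family_size.get(fam, 0) + size
    touched = {schema[c][0] for c in wanted_columns}
    return sum(family_size[f] for f in touched)

one_cf  = {"name": ("cf", 10),   "blob": ("cf", 10_000)}
two_cfs = {"name": ("meta", 10), "blob": ("data", 10_000)}

assert bytes_touched(one_cf, ["name"]) == 10_010   # drags the blob along
assert bytes_touched(two_cfs, ["name"]) == 10      # separate family: cheap
assert bytes_touched(two_cfs, ["name", "blob"]) == 10_010
```

The last assertion shows the flip side: once you always want both columns, splitting them into two families buys nothing on reads and costs an extra family on writes.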

The above considers reads; what about writes? Refer to this article:

http://hbase.apache.org/book/number.of.cfs.html

First, the column families of a table all belong to the same region, and each column family is allocated its own MemStore, so more column families consume more memory.

Second, in the current version of HBase, flushes and compactions operate at region granularity: when one column family reaches the flush condition, the MemStores of all column families in that region are flushed together, so small files are generated even for MemStores holding only a tiny amount of data. This increases the probability of compactions, and since compactions are also region-wide, compaction storms can easily occur, reducing the overall throughput of the system.
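The region-wide flush described above can be sketched as follows: a toy region with several column families, where one MemStore reaching the threshold forces all of them to flush, producing tiny files for the nearly empty families. Names and sizes are invented for illustration; this is not the real HBase flush code.

```python
FLUSH_THRESHOLD_MB = 64

def flush_region(memstores):
    """Flush every column family's memstore once any one of them is full.

    memstores: dict of column family name -> MB currently buffered.
    Returns the list of storefile sizes written (one per family)."""
    if max(memstores.values()) < FLUSH_THRESHOLD_MB:
        return []                       # nobody reached the threshold yet
    written = list(memstores.values())  # ALL families flush together
    for cf in memstores:
        memstores[cf] = 0
    return written

# 'hot' reaches 64 MB, but the region flushes 'cold' and 'tiny' too,
# leaving 2 MB and 0.1 MB storefiles for compaction to clean up later
files = flush_region({"hot": 64, "cold": 2, "tiny": 0.1})
assert sorted(files) == [0.1, 2, 64]
```

Those undersized files are exactly what drives the extra compactions, and hence the compaction storms, mentioned above.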

Third, because each HFile belongs to a single column family, with multiple column families the data is spread across more HFiles, reducing the chance of splits. This is a double-edged sword. Fewer splits lead to larger regions, and since balancing is based on the number of regions rather than their size, balancing may become ineffective. On the positive side, fewer splits allow the system to provide more stable online service.

The benefit of the third point is obvious for online applications, and its downside can be avoided by manually splitting and balancing during request troughs.

So for write-heavy systems: if the application is offline, it is best to use only a single column family; if it is an online application, column families should still be allocated sensibly according to the application's access patterns.
