Summary of HBase database performance optimization

Source: Internet
Author: User
Tags: manual flush, garbage collection, time interval, CPU usage

Garbage collection optimization

Garbage collection problems mainly affect the region servers; the master node does not handle the actual data and rarely runs into them.
Because memstore flushes happen at irregular intervals and data stays in memory for varying lengths of time, the heap of the Java virtual machine becomes fragmented with holes.
Data that is flushed to disk quickly is usually still in the young generation when it becomes garbage, so its space is reclaimed cheaply.
Data that stays in memory for too long is promoted to the old generation or even the permanent generation. In addition, the old and permanent generations usually occupy several GB, while the young generation is generally only a few hundred MB.

New generation space
The young generation is typically sized as follows:

-XX:MaxNewSize=128m -XX:NewSize=128m

which can be abbreviated as:


-Xmn128m

After setting it, watch whether the value is reasonable. If it is not, you will see the server's CPU usage rise sharply, because young-generation collections then consume a lot of CPU.
The benefit of a larger young generation is that longer-lived objects are not promoted to the old generation too quickly.
If it is set too large, however, each collection produces a long pause.

GC log
If the heap has too many holes and a large enough contiguous block cannot be found, the heap fragments have to be compacted; if that compaction fails, a failure message appears in the log. Use the following parameters to enable the JVM GC log:


-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-${hostname}-hbase.log

"Concurrent mode failure" or "promotion failed" information will appear in the log. Note: However, this log will not be automatically rolled and will become larger and larger. You can manually use linux's daily rolling to perform manual cleaning.

Collector policy
The garbage collectors can be changed. The following combination is recommended.


-XX:+UseParNewGC -XX:+UseConcMarkSweepGC

The first option sets the Parallel New Collector for the young generation: it stops the JVM while the young generation is cleared. Because the young generation is very small, this process is fast, generally well under a second, so the pause is acceptable.

CMS policy
The old generation cannot use this policy, however: the old generation is large, so a stop-the-world collection would pause for a long time, and if the pause exceeds the ZooKeeper session timeout it causes the so-called Juliet pause problem (the region server is declared dead while it is merely paused). The Concurrent Mark-Sweep Collector (CMS) is used for the old generation instead. This collector tries to do the garbage collection work concurrently with the application, at the cost of higher CPU usage; if a concurrent collection fails, though, the JVM still pauses to compact the heap. When CMS is used there is one additional parameter that sets when concurrent marking starts.


-XX:CMSInitiatingOccupancyFraction=70

This value is a percentage. 70% is a good value because it is slightly higher than the region server's typical heap usage of 60% (20% block cache + 40% memstore).
This way the concurrent collection starts before the heap space is exhausted.
It should not be too small either, or collections will run too frequently.

Optimization principles
Block cache plus memstore must not exceed 100% of the heap.
It is reasonable to leave room for other operations, so block cache + memstore = 60% is a sensible default.

Local memstore allocation buffer (MSLAB)
MSLAB stands for Memstore-Local Allocation Buffers. The JVM heap develops holes (fragmentation); when fragmentation becomes bad enough, a stop-the-world garbage collection is triggered and the whole JVM pauses. MSLAB tries to reduce fragmentation by always allocating objects of a fixed size: when those objects are reclaimed they leave fixed-size holes, and new objects of the same size can reuse the holes directly, so promotion failures (and the stop-the-world collections they cause) become less likely. Depending on the version, MSLAB may not be enabled by default; if it is not, set hbase.hregion.memstore.mslab.enabled to true to enable it.
hbase.hregion.memstore.mslab.chunksize sets the fixed chunk size. The default is 2 MB; if the cells you store are consistently large, increase it.
Cells larger than the upper bound hbase.hregion.memstore.mslab.max.allocation (default 256 KB) do not use the MSLAB feature at all and are allocated directly from the JVM.
The cost of MSLAB is wasted space: even if the last bytes of a buffer are never used, the buffer still occupies its full size, so you have to weigh the trade-off (personally I would rather waste some memory than risk long JVM pauses).
Using the buffers also requires an extra memory copy, so it is slightly slower than using the KeyValue instances directly.

Compression
Snappy is recommended. Choose the compression algorithm from the start, though, because switching algorithms later is difficult.

Optimize split and merge

Manage split
Split/merge storm
When a user's regions all grow at roughly the same constant rate, they also split at roughly the same time, and the store files of the freshly split regions then all need to be compacted (rewritten) at the same time. This causes a surge in disk I/O, a split/compaction storm. Suggestion: disable automatic splitting and invoke the split and major_compact commands manually.
How do I disable automatic splitting?
Set hbase.hregion.max.filesize to a very large value, but not as large as Long.MAX_VALUE (9223372036854775807); a value in the range of 100 GB is often suggested. Splits and major compactions can then be run manually, on different regions in different time windows, to spread the load.
You can create a cron job to regularly perform these operations.
Manual splitting has another advantage: while you are troubleshooting, automatic splitting cannot suddenly take away the region you are examining.
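
As a sketch of what manual splitting looks like from code, the following uses the classic HBaseAdmin client API; the table name testtable is only a placeholder. A cron job could run something equivalent, or you can simply issue split and major_compact from the HBase shell.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Hypothetical maintenance job: split and major-compact a table manually
// during an off-peak window instead of relying on automatic splits.
public class ManualSplitJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        try {
            // Ask the servers to split the regions of the table (a region name
            // can be passed instead to split just one region).
            admin.split("testtable");
            // Rewrite the store files so the results of the split are cleaned up.
            admin.majorCompact("testtable");
        } finally {
            admin.close();
        }
    }
}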

Region hotspot
Do not use monotonically increasing values such as timestamps as the row key; otherwise all new writes land in one region and create a hotspot.

Pre-split region
When creating a table, you can use the SPLITS attribute to define the boundaries of each region directly and thereby pre-split the table.
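
For illustration, here is a minimal sketch of pre-splitting with the Java administrative API instead of the shell's SPLITS attribute; the table name, family name, and split keys are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Create a table that starts life with four regions instead of one.
public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor desc = new HTableDescriptor("testtable");
        desc.addFamily(new HColumnDescriptor("cf"));

        // Three split keys give four regions: (-inf, row-250), [row-250, row-500),
        // [row-500, row-750), [row-750, +inf).
        byte[][] splitKeys = {
            Bytes.toBytes("row-250"),
            Bytes.toBytes("row-500"),
            Bytes.toBytes("row-750")
        };
        admin.createTable(desc, splitKeys);
        admin.close();
    }
}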

Server load balancer
The master has a built-in balancer. By default it runs every five minutes, controlled by the hbase.balancer.period property, and it tries to spread regions evenly over all region servers. When the balancer starts, it first computes a region assignment plan that describes how regions should move; it then moves them one by one by iteratively calling the unassign() method of the administrative API. The balancer also has an upper limit on how long it may run, configured through the hbase.balancer.max.balancing property; by default this is half of the balancer's run interval, that is, two and a half minutes.

Merge region
Multiple regions can be merged with the offline Merge tool: hbase org.apache.hadoop.hbase.util.Merge testtable
After deleting a large amount of data, you can merge regions so that fewer of them remain.
Client API: Best practices

Disable automatic write flushing
If you perform a large number of write operations, call setAutoFlush(false); otherwise each Put instance is sent to the region server on its own. With automatic flushing disabled, the data is sent in batches once the client-side write buffer fills up.
You can also push the buffered data explicitly with the flushCommits() method.
Calling HTable's close() method implicitly calls flushCommits() as well.
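
A minimal sketch of this pattern with the classic HTable API (newer clients use BufferedMutator instead); the table, family, and qualifier names are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Buffer Puts on the client and send them to the region servers in batches.
public class BufferedWrites {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");
        table.setAutoFlush(false);                  // do not send each Put immediately
        table.setWriteBufferSize(2 * 1024 * 1024);  // 2 MB client-side write buffer

        for (int i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
            table.put(put);                         // buffered until the buffer fills
        }

        table.flushCommits();                       // push whatever is still buffered
        table.close();                              // close() flushes implicitly as well
    }
}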

Use scan cache
If HBase is used as the input source of a MapReduce job, call setCaching() on the Scan with a value greater than 1 to enable scanner caching. That way a batch of rows (for example 500) is fetched from the region server per round trip and processed on the client.
However, larger batches also cost more in data transfer and client memory, so a bigger value is not automatically better.

Limit the scan scope
When a Scan processes a large number of rows (for example as a MapReduce input source), restrict it to the columns you actually need. addFamily() loads every column of the whole family. (This is the same advice as not using SELECT * in traditional SQL.)

Close ResultScanners
Remember to close the ResultScanner promptly (just as you would remember to close connections to a traditional database).
Close the ResultScanner in a finally block.

Block cache usage
Scan has a setCacheBlocks() method that controls whether the scan uses the block cache on the region server.
For MapReduce full-table scans it should be set to false.
If some rows are accessed frequently, it should be set to true.
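
The following sketch ties the scan-related points above together: scanner caching, restricting the columns, disabling block caching for a full scan, and closing the scanner in a finally block. The table, family, and column names are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// A tuned full-table scan, as used for example from a MapReduce job.
public class TunedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");

        Scan scan = new Scan();
        scan.setCaching(500);                                       // 500 rows per RPC round trip
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q1"));   // only the column we need
        scan.setCacheBlocks(false);                                 // don't churn the block cache

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                // process the row here
            }
        } finally {
            scanner.close();                                        // always release the scanner
            table.close();
        }
    }
}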

Optimize how row keys are retrieved
If you only need the row keys, for example for a simple row count that does not retrieve any column values, add a FirstKeyOnlyFilter (or a KeyOnlyFilter) to the FilterList. Then only the first KeyValue of each row is returned, which greatly reduces network transfer.
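
A rough sketch of such a key-only row count; the table name and the caching value are placeholders, while FirstKeyOnlyFilter and KeyOnlyFilter are the standard filters mentioned above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

// Count rows while shipping little more than the row keys over the network.
public class KeyOnlyRowCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");

        FilterList filters = new FilterList();
        filters.addFilter(new FirstKeyOnlyFilter()); // only the first KeyValue per row
        filters.addFilter(new KeyOnlyFilter());      // strip the value, keep the key

        Scan scan = new Scan();
        scan.setCaching(1000);
        scan.setFilter(filters);

        long rows = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result ignored : scanner) {
                rows++;
            }
        } finally {
            scanner.close();
            table.close();
        }
        System.out.println("rows: " + rows);
    }
}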

Disable WAL on Put
Calling setWriteToWAL(false) on a Put disables the write-ahead log, which can greatly improve throughput; the side effect is that the region loses data if a failure occurs.
In practice, if the data is evenly distributed across the cluster, disabling the log does not improve performance much.
So it is best not to disable the WAL. If you really need more write throughput, use the bulk load technique instead (covered in Section 12.2.3).
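
For completeness, a sketch of what disabling the WAL looks like with the classic client API discussed here (newer clients express the same thing as setDurability(Durability.SKIP_WAL)); the table and column names are placeholders, and as the text says this trades durability for speed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// A Put that skips the write-ahead log: faster, but lost if the region
// server dies before its memstore is flushed to disk.
public class SkipWalPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");

        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
        put.setWriteToWAL(false);   // skip the WAL for this single edit
        table.put(put);
        table.close();
    }
}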

Configuration
Reduce ZooKeeper timeout
The default ZooKeeper session timeout for a region server is 3 minutes. Setting it to 1 minute is recommended so that a failed region server is detected more quickly.
The default is that long to avoid problems during large data imports; if you do not run such imports, you can make the timeout somewhat shorter.
Section 12.5.3, "Stability problems", describes a method for detecting such pauses.
Personally I do not think this matters much: if the server is really going down, detecting it a minute or two earlier does not help a great deal.

Increase the number of handler threads
hbase.regionserver.handler.count defines the number of threads that answer external requests to the data tables. The default of 10 is rather low; it is chosen deliberately so the server is not overloaded when clients use large write buffers under high concurrency.
When the cost of a single request is small, however, you can set it higher.
Setting it too high can put pressure on the region server's memory and even cause an OutOfMemoryError; when available memory runs very low, garbage collection is triggered and eventually leads to a full pause.

Increase heap size
Adjust HBASE_HEAPSIZE in hbase-env.sh, for example increase it to 8 GB.
It is better, though, to use HBASE_REGIONSERVER_OPTS instead of HBASE_HEAPSIZE so that only the region servers' heap is increased. The master does not need a large heap; 1 GB is enough.

Enable data compression
Snappy is recommended; if Snappy is not available, use LZO for compression.

Increase region size
Larger regions reduce the total number of regions.
Fewer regions make the cluster run more smoothly.
If a region becomes a hotspot, you can still split it manually.
The default region size is 256 MB; it can be raised to 1 GB or more.
If regions are too large, however, compactions under high load cause long pauses.
Use hbase.hregion.max.filesize to set the region size.

Adjust the block cache size
The default block cache size is 20% of the heap (that is, 0.2).
You can change this percentage through the hfile.block.cache.size property.
If the evicted metric mentioned in Section 10.2.3 shows that many blocks are being evicted, you should increase the block cache so it can hold more blocks.
If the workload consists mostly of read requests, increasing the block cache also helps.
Block cache plus memstore must not exceed 100% of the heap. By default their sum is 60%; only raise it when you are sure it is necessary and will not cause side effects.

Modify memstore restrictions
Use hbase.regionserver.global.memstore.upperLimit to set the upper limit; the default is 0.4.
hbase.regionserver.global.memstore.lowerLimit sets the lower limit; the default is 0.35.
Keeping the upper and lower limits close together helps avoid excessive flushing.
If you mainly handle read requests, you can lower both memstore limits to give more space to the block cache.
If the amount of data flushed is small, for example only 5 MB, you can raise the memstore limits to reduce I/O.
Increase the blocking store file count
hbase.hstore.blockingStoreFiles determines the number of store files at which all update operations (put, delete) on the store are blocked until a compaction has run. The default is 7.
If the file count stays high anyway, do not keep raising this setting; that only postpones the problem rather than avoiding it.

Increase the blocking multiplier
hbase.hregion.memstore.block.multiplier defaults to 2. When a memstore reaches the multiplier times the flush size, further updates to it are blocked; with the default 64 MB flush size and a multiplier of 2, for example, updates are blocked once the memstore reaches 128 MB.
When there is enough memory, you can raise this value so that write bursts are absorbed more smoothly.

Reduce the maximum number of WAL files
The hbase.regionserver.maxlogs property sets the number of WAL files kept on disk and thereby controls the flush frequency. The default is 32.
For applications under heavy write pressure, lower this value so that data is flushed to disk more frequently and logs that have already been flushed can be discarded sooner.

Load testing

PE
HBase has its own stress testing tool named PE (Performance Evaluation)
YCSB
Cloud service benchmarking tool launched by Yahoo. It is easier to use than PE and can perform stress testing on hbase.
YCSB provides more options and can mix read/write loads together


HBase performance optimization notes

1 hbase.hregion.max.filesize

Default value: 256 MB
Description: Maximum HStoreFile size. If any one of a column family's HStoreFiles has grown to exceed this value, the hosting HRegion is split in two.

In other words: if the HStoreFiles of any column family (HStore) exceed this size, the HRegion they belong to is split into two.

Optimization:

The default maximum hfile size in HBase (hbase.hregion.max.filesize) is 256 MB, and Google's Bigtable paper likewise recommends a maximum tablet size of 100-200 MB. What is the secret behind this size range?
As we all know, data in HBase is first written to the memstore. When a memstore reaches 64 MB it is flushed to disk and becomes a store file. When the number of store files exceeds 3, a compaction is started to merge them into one store file; in this process expired data, such as superseded versions of updated cells, is removed. When the merged store file grows beyond the maximum hfile size, a split is triggered and the region is split in two.
The author ran continuous-insert stress tests with different values of hbase.hregion.max.filesize: the smaller the value, the higher the average throughput but the less stable it is; the larger the value, the lower the average throughput, but the periods of throughput instability are shorter.

Why? The inference is as follows:

A. When hbase.hregion.max.filesize is relatively small, the probability of triggering a split is higher. A region is taken offline during a split, so requests to that region are blocked until the split finishes; by default the client blocks and retries after 1 second. When a large number of regions split at the same time, the availability of the whole system suffers badly, so throughput and response times easily become unstable.
B. When hbase.hregion.max.filesize is large, the probability of a single region triggering a split is small, and the probability of many regions splitting at the same time is even smaller, so throughput is more stable than with a small hfile size. However, because splits do not happen for a long time, the chance of the same region undergoing multiple compactions grows. A compaction works by reading the existing data, rewriting it to HDFS, and then deleting the original data; this inevitably slows down a system whose bottleneck is I/O, so the average throughput is affected and drops.
Based on these two cases, hbase.hregion.max.filesize should be neither too large nor too small; a moderate value is the more ideal empirical choice. For offline applications, adjusting it to 256 MB is appropriate; for online applications, it should not be set below that unless the split mechanism itself is modified.

2 autoflush = false

Both the official documentation and many blog posts advocate setting autoflush = false in application code to speed up HBase writes, but the author believes this setting should be applied with great care in online applications, for the following reasons:

A. The principle of autoflush = false is that when the client issues a delete or put request, the request is cached on the client side until the buffered data exceeds 2 MB (hbase.client.write.buffer) or the user explicitly calls htable.flushCommits(); only then is it submitted to the region server. Therefore, even if htable.put() returns successfully, the request may not actually have reached the region server yet. If the client crashes before the buffer is flushed, that data is lost because it was never sent. This is unacceptable for online services with zero tolerance for data loss.

B. Although autoflush = true can slow writes down by a factor of 2-3, it is what many online applications need, which is exactly why HBase makes it the default. When this value is true, every request is sent to the region server, and the first thing the region server does on receiving it is write the HLog, so the I/O requirements are high. To improve HBase write speed, increase I/O throughput as much as possible, for example by adding disks, using RAID cards, or reducing the replication factor.

3. Setting family and qualifier for a table, from a performance perspective
Given a table in a traditional relational database, how should we lay out family and qualifier, performance-wise, when the business is remodeled on HBase?
Consider the two extreme cases: ① every column gets its own family; ② the table has only one family, and every column is a qualifier under it. What is the difference?

Read considerations:
The more families there are, the cheaper it is to fetch the data of a single cell, because less data has to be read and I/O and network traffic are reduced.

If there is only one family, every read fetches all of the data stored under the current row key, which costs something in network and I/O.

Of course, if you always need a fixed set of columns together, writing those columns into one family is better than giving each its own family, because they can then all be retrieved with a single request, as the sketch below illustrates.
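
As a small illustration of the single-request case, the sketch below fetches two columns that live in the same family with one Get; the table, family, and column names are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Read two columns of one row; because both sit in the same family,
// a single request touches a single store on the region server.
public class TwoColumnGet {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");

        Get get = new Get(Bytes.toBytes("row-1"));
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("price"));
        get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("stock"));

        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("price"))));
        table.close();
    }
}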

From the write perspective:

First, regarding memory: for each region, every family of the table gets its own Store, and every Store gets its own MemStore, so more families consume more memory.
Second, regarding flush and compaction: in the current version of HBase both operate per region. That means that when one family reaches the flush condition, the memstores of all families in that region are flushed, even those that hold only a little data, which produces many small files. This raises the probability of compactions, and since compactions are also per region, it easily leads to compaction storms and lowers the overall throughput of the system.
Third, regarding splits: because hfiles are kept per family, with more families the data is spread across more hfiles, which lowers the probability of splits. This is a double-edged sword. Fewer splits mean regions grow large, and because the balancer works on the number of regions rather than their size, balancing may become ineffective; on the positive side, fewer splits let the system provide more stable online service. The disadvantages can be avoided by splitting and balancing manually during periods of low traffic.
Therefore, for write-heavy systems, if the application is offline, try to use only one family; for online applications, allocate families sensibly according to how the data is actually accessed.

4 hbase.regionserver.handler.count

The number of RPC listener instances enabled on the RegionServer, that is, the number of I/O request threads that the RegionServer can process. The default value is 10.

This parameter is closely related to memory. When setting this value, the main reference is monitoring memory.

For scenarios where a single request consumes a lot of memory (large Puts, or scans with a big cache) or where the RegionServer is short on memory, keep it relatively low.

For scenarios where each request consumes little memory and a high TPS (transactions per second) is required, it can be set relatively high.
