HBase Performance Optimization Encyclopedia: A Comprehensive Summary

Summary of HBase Performance Optimization Methods (I): Table Design

This article summarizes several commonly used performance optimization methods from the perspective of HBase application design and development. For optimization at the HBase system-configuration level, refer to Taobao engineer Ken Wu's blog.

Here is the first part of this article: optimization methods related to table design.

1. Table Design

1.1 Pre-Creating Regions

By default, HBase automatically creates a single region when a table is created, and all clients write data to that one region until it grows large enough to split. One way to speed up batch writes is to pre-create several empty regions, so that when data is written to HBase, the writes are load-balanced across the cluster according to the region partitioning.

For more information on pre-splitting, see Table Creation: Pre-Creating Regions. Here is an example:

public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
        throws IOException {
    try {
        admin.createTable(table, splits);
        return true;
    } catch (TableExistsException e) {
        logger.info("table " + table.getNameAsString() + " already exists");
        // the table already exists ...
        return false;
    }
}

public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
    byte[][] splits = new byte[numRegions - 1][];
    BigInteger lowestKey = new BigInteger(startKey, 16);
    BigInteger highestKey = new BigInteger(endKey, 16);
    BigInteger range = highestKey.subtract(lowestKey);
    BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
    lowestKey = lowestKey.add(regionIncrement);
    for (int i = 0; i < numRegions - 1; i++) {
        BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
        splits[i] = String.format("%016x", key).getBytes();
    }
    return splits;
}
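Since getHexSplits uses only standard-library classes, it can be sanity-checked standalone. The following self-contained copy of the same logic verifies the number and ordering of the generated split keys for a hypothetical hex key range (the range "0".."100" and region count 10 are illustrative):

```java
import java.math.BigInteger;

public class SplitCheck {
    // Same logic as getHexSplits above: evenly divide the [startKey, endKey]
    // hex keyspace into numRegions regions, returning numRegions - 1 split points.
    static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        BigInteger range = highestKey.subtract(lowestKey);
        BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
        lowestKey = lowestKey.add(regionIncrement);
        for (int i = 0; i < numRegions - 1; i++) {
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            splits[i] = String.format("%016x", key).getBytes();
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[][] splits = getHexSplits("0", "100", 10);
        System.out.println(splits.length);         // 9 split points for 10 regions
        System.out.println(new String(splits[0])); // 0000000000000019 (0x19 == 25)
    }
}
```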

1.2 Row Key

A row key in HBase is used to retrieve records in a table. Three access patterns are supported:

· Access through a single row key: perform a get operation on one specific row key value;

· Scan over a row key range: set startRowKey and endRowKey and scan within that range;

· Full table scan: directly scan all row records in the entire table.

In HBase, the row key can be any string with a maximum length of 64KB; in practice it is generally 10 to 100 bytes. It is stored as a byte[] array and is usually designed to be fixed-length.

Row keys are stored in lexicographic (dictionary) order. When designing a row key, take full advantage of this sorting property: store rows that are often read together adjacently, and keep data that is likely to be accessed soon close together.
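One common way to exploit this lexicographic ordering is a reversed timestamp, so that the newest rows sort first. A minimal standalone sketch; the uid prefix and zero-padded decimal encoding are illustrative assumptions, not the only possible key format:

```java
public class ReverseTimestampKey {
    // Row key = "<uid>_<Long.MAX_VALUE - timestamp>", zero-padded to 19 digits so
    // that lexicographic (byte) order matches reverse numeric order: newest first.
    static String rowKey(String uid, long timestampMillis) {
        return uid + "_" + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        String older = rowKey("u1", 1000L);
        String newer = rowKey("u1", 2000L);
        // The more recent write sorts strictly before the older one.
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```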

For example: if the most recently written data in an HBase table is the most likely to be accessed, consider making the timestamp part of the row key. Because rows are sorted lexicographically, you can use Long.MAX_VALUE - timestamp as the row key, which ensures that newly written data sorts first and can be located quickly on read.

1.3 Column Family

Do not define too many column families in one table. HBase currently does not handle tables with more than two or three column families well. When one column family is flushed, its neighboring column families are triggered to flush as well through an association effect, which ultimately causes the system to generate more I/O. Interested readers can test this on their own HBase cluster and verify it from the resulting measurements.

1.4 In Memory

When creating a table, you can keep the table in the RegionServer's cache via HColumnDescriptor.setInMemory(true), so that reads are more likely to be served from cache.

1.5 Max Version

When creating a table, you can set the maximum number of versions kept for the data via HColumnDescriptor.setMaxVersions(int maxVersions). If you only need to keep the latest version, set setMaxVersions(1).

1.6 Time To Live

When creating a table, you can set the storage lifetime of the data via HColumnDescriptor.setTimeToLive(int timeToLive); expired data is deleted automatically. For example, if you only need to keep the last two days of data, set setTimeToLive(2 * 24 * 60 * 60).
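The settings from sections 1.4, 1.5, and 1.6 can all be applied to a column family at table-creation time. Below is a minimal sketch against the classic HBaseAdmin/HColumnDescriptor API; the table name "access_log" and family name "f" are illustrative, and the sketch requires a running cluster, so it is a configuration example rather than a standalone-runnable one:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTuned {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("access_log"); // illustrative name
        HColumnDescriptor family = new HColumnDescriptor("f");      // illustrative name
        family.setInMemory(true);               // 1.4: keep blocks in the RegionServer cache
        family.setMaxVersions(1);               // 1.5: keep only the latest version
        family.setTimeToLive(2 * 24 * 60 * 60); // 1.6: expire data after two days
        desc.addFamily(family);
        admin.createTable(desc);
    }
}
```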

1.7 Compact & Split

In HBase, an update is first written to the WAL log (HLog) and to memory (the MemStore). Data in the MemStore is kept sorted; when the MemStore accumulates to a certain threshold, a new MemStore is created and the old one is added to a flush queue, from which a separate thread flushes it to disk, where it becomes a StoreFile. At the same time, the system records a redo point in ZooKeeper, indicating that changes before this point have been persisted (minor compact).

A StoreFile is read-only and can no longer be modified once created, so updates in HBase are really a matter of continually appending. When the StoreFiles in a Store reach a certain number, they are merged (major compact): modifications to the same key are merged together to form one large StoreFile. When a StoreFile's size reaches a certain threshold, the StoreFile is in turn split into two StoreFiles.

Because updates to a table are continually appended, a read request has to access all the StoreFiles and the MemStore in a Store and merge them by row key. Since the StoreFiles and the MemStore are each sorted, and StoreFiles carry in-memory indexes, this merge is usually relatively fast.

In practice, you can consider manually triggering a major compact when necessary, merging modifications to the same row key to form one large StoreFile. You can also set the StoreFile size threshold larger to reduce the occurrence of splits.

Summary of HBase Performance Optimization Methods (II): Write Table Operations


The following is the second part of this article: optimization methods related to write-table operations.

2. Write Table Operations

2.1 Multiple HTable Concurrent Writes

Create multiple HTable clients for write operations to increase write throughput. One example:

static final Configuration conf = HBaseConfiguration.create();
static final String table_log_name = "user_log";
wTableLog = new HTable[tableN];
for (int i = 0; i < tableN; i++) {
    wTableLog[i] = new HTable(conf, table_log_name);
    wTableLog[i].setWriteBufferSize(5 * 1024 * 1024); // 5 MB
    wTableLog[i].setAutoFlush(false);
}

2.2 HTable Parameter Settings

2.2.1 Auto Flush

By calling HTable.setAutoFlush(false), you can turn off automatic flushing on the HTable write client so that data is written to HBase in batches: rather than each Put triggering an update, a write request is only actually sent to the HBase server when the accumulated Puts fill the client write buffer. Auto flush is turned on by default.

2.2.2 Write Buffer

By calling HTable.setWriteBufferSize(writeBufferSize), you can set the size of the HTable client's write buffer; if the new buffer size is smaller than the data currently in the write buffer, the buffer is flushed to the server. writeBufferSize is in bytes, and its value can be set according to how much data is actually written per batch.

2.2.3 WAL Flag

In HBase, when a client submits data to a RegionServer in the cluster (a Put/Delete operation), the WAL (Write-Ahead Log, i.e. the HLog; all regions on a RegionServer share one HLog) is written first. Only after the WAL write succeeds is the MemStore written and the client notified that the submission succeeded; if the WAL write fails, the client is notified that the submission failed. The benefit is that data can be recovered after a RegionServer goes down.

Therefore, for relatively unimportant data, you can skip writing the WAL by calling Put.setWriteToWAL(false) or Delete.setWriteToWAL(false) at Put/Delete time, thereby improving write performance.
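Pulling together the client-side knobs from 2.2.1 to 2.2.3, here is a hedged sketch against the classic HTable API. The table, family, qualifier, and row key names are illustrative, and a running cluster is required, so this is a configuration sketch rather than a testable example:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteTuned {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "user_log");
        table.setAutoFlush(false);                 // 2.2.1: batch puts on the client
        table.setWriteBufferSize(5 * 1024 * 1024); // 2.2.2: 5 MB write buffer

        Put put = new Put(Bytes.toBytes("some_row_key"));
        put.setWriteToWAL(false);                  // 2.2.3: skip WAL for low-value data
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
        table.put(put);        // buffered client-side until the buffer fills...
        table.flushCommits();  // ...or until an explicit flush
    }
}
```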

It is worth noting that disabling the WAL should be chosen carefully, because once a RegionServer goes down, Put/Delete data written without the WAL cannot be recovered.

2.3 Batch Writing

By calling HTable.put(Put), you can write a single record with a specified row key to HBase. HBase also provides another method: by calling HTable.put(List<Put>), you can write multiple records for a list of specified row keys in one batch. The benefit is bulk execution, which requires only a single network I/O round trip; in scenarios where data has high real-time requirements and the network round-trip time (RTT) is high, this can bring significant performance gains.
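Before calling HTable.put(List<Put>), records are typically grouped into fixed-size batches. The following generic helper sketches that grouping; the helper and its batch size are illustrative, not part of the HBase API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchHelper {
    // Group records into fixed-size batches so that each batch can be submitted
    // with a single HTable.put(List<Put>) call, i.e. one network round trip per batch.
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        List<List<T>> batches = new ArrayList<List<T>>();
        for (int i = 0; i < items.size(); i += batchSize) {
            batches.add(new ArrayList<T>(items.subList(i, Math.min(i + batchSize, items.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = partition(Arrays.asList(1, 2, 3, 4, 5), 2);
        System.out.println(batches); // [[1, 2], [3, 4], [5]]
    }
}
```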

2.4 Multiple Threads Concurrent Write

Open multiple HTable write threads on the client, each responsible for flushing one HTable object. Combining this timed flush with the write buffer (writeBufferSize) ensures that when the data volume is small, data is still flushed within a relatively short period (such as 1 second), while when the data volume is large, the buffer is flushed as soon as it fills. A specific example is given below:

for (int i = 0; i < threadN; i++) {
    final int index = i;
    Thread th = new Thread() {
        public void run() {
            while (true) {
                try {
                    sleep(1000); // 1 second
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                synchronized (wTableLog[index]) {
                    try {
                        wTableLog[index].flushCommits();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    };
    th.setDaemon(true);
    th.start();
}

Original article: http://www.cnblogs.com/panfeng412/archive/2012/03/08/2385335.html

Summary of HBase Performance Optimization Methods (III): Read Table Operations


The following is the third part of this article: optimization methods related to read-table operations.

3. Read Table Operations

3.1 Multiple HTable Concurrent Reads

Create multiple HTable clients for read operations to improve read throughput. One example:

static final Configuration conf = HBaseConfiguration.create();
static final String table_log_name = "user_log";
rTableLog = new HTable[tableN];
for (int i = 0; i < tableN; i++) {
    rTableLog[i] = new HTable(conf, table_log_name);
    rTableLog[i].setScannerCaching(50);
}

3.2 HTable Parameter Settings

3.2.1 Scanner Caching

The hbase.client.scanner.caching configuration item sets the number of rows an HBase scanner fetches from the server at a time; by default it fetches one row per round trip. Setting it to a reasonable value reduces the time cost of next() calls during a scan, at the cost of the scanner holding these cached rows in client memory.

There are three places to configure it: 1) in the HBase conf configuration file; 2) by calling HTable.setScannerCaching(int scannerCaching); 3) by calling Scan.setCaching(int caching). The priority increases in that order, i.e. Scan.setCaching overrides the other two.

3.2.2 Scan Attribute Selection

Specify the required column families in the scan to reduce the amount of data transferred over the network; otherwise, a scan operation by default returns the data of all column families for each row.

3.2.3 Close ResultScanner

After fetching data with a scan, remember to close the ResultScanner; otherwise the RegionServer may run into problems (the corresponding server-side resources cannot be freed).

3.3 Batch Reads

By calling HTable.get(Get), you can fetch a single record by a specified row key. HBase likewise provides another method: by calling HTable.get(List<Get>), you can fetch multiple records in one batch for a list of specified row keys. The benefit is bulk execution, which requires only a single network I/O round trip; when data has high real-time requirements and the network RTT is high, this can bring significant performance gains.

3.4 Multi-Thread Concurrent Read

Open multiple HTable read threads on the client, each performing get operations through an HTable object. Below is a multithreaded concurrent-read example that fetches a store's page view (PV) count for each minute of a day:

public class DataReaderServer {
    // Entry function: get the store's PV value for each minute of the day
    public static ConcurrentHashMap<String, String> getUnitMinutePV(long uid, long startStamp, long endStamp) {
        long min = startStamp;
        int count = (int) ((endStamp - startStamp) / (60 * 1000));
        List<String> lst = new ArrayList<String>();
        for (int i = 0; i <= count; i++) {
            min = startStamp + i * 60 * 1000;
            lst.add(uid + "_" + min);
        }
        return parallelBatchMinutePV(lst);
    }

    // Multithreaded concurrent query: get the per-minute PV values
    private static ConcurrentHashMap<String, String> parallelBatchMinutePV(List<String> lstKeys) {
        ConcurrentHashMap<String, String> hashRet = new ConcurrentHashMap<String, String>();
        int parallel = 3;
        List<List<String>> lstBatchKeys = null;
        // ... (the remainder of this example is truncated in the source)
    }
}
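The key-building loop in getUnitMinutePV above uses only standard-library classes and can be exercised standalone. A minimal self-contained re-statement that returns the generated keys instead of querying HBase (the uid value and timestamps are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class MinuteKeys {
    // Mirrors the key-building loop in getUnitMinutePV above: one "uid_timestamp"
    // row key per minute in [startStamp, endStamp], inclusive of both ends.
    static List<String> minuteKeys(long uid, long startStamp, long endStamp) {
        int count = (int) ((endStamp - startStamp) / (60 * 1000));
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i <= count; i++) {
            long min = startStamp + i * 60L * 1000L;
            keys.add(uid + "_" + min);
        }
        return keys;
    }

    public static void main(String[] args) {
        // Two minutes of data starting at epoch 0 -> three per-minute keys.
        System.out.println(minuteKeys(7L, 0L, 120000L)); // [7_0, 7_60000, 7_120000]
    }
}
```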
