HBase Basics (8): Caching and Batch Processing of Scan Operations


Each call to next() sends a separate RPC request for every row of data, even when the next(int nbRows) method is used, because that method simply calls next() in a loop on the client side. Obviously, this performs poorly when the cells are small, so it makes more sense for a single RPC request to fetch more than one row of data. This is what the scanner cache provides; it is disabled by default.
The cache can be enabled at two levels: at the table level, it takes effect for all scan instances of that table; at the scan level, it affects only the current scan instance. Users can set the table-level scanner cache with the following HTable methods:

void setScannerCaching(int scannerCaching)
int getScannerCaching()

Users can also change the default value of 1 for the entire HBase cluster by adding the following configuration entry to hbase-site.xml:

<property>
 <name>hbase.client.scanner.caching</name>
 <value>10</value>
</property>

This sets the scanner cache size of all scan instances to 10. Users can still override this default at the table or scan level, but they need to do so explicitly.
setScannerCaching() sets the cache size, and getScannerCaching() returns the currently configured value. Each time the user calls getScanner(scan), the API applies the table-level setting to the scan instance, unless a scan-level value has already been set; the scan-level configuration has the highest priority. Scan-level caching is set with the following Scan class methods:

void setCaching(int caching)
int getCaching()

These two methods, like the table-level methods above, control the number of rows retrieved per RPC call. Both versions of next() are affected by this configuration.
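As a minimal sketch of how the two levels interact (assuming a table named "testtable" exists, and using the classic HTable-based client API shown throughout this article):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachingLevels {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    table.setScannerCaching(20); // table level: default for every scan on this table

    Scan scan = new Scan();
    scan.setCaching(50);         // scan level: overrides the table-level value for this scan
    ResultScanner scanner = table.getScanner(scan);
    scanner.close();
  }
}

Here the scan uses a caching value of 50; any scan created without calling setCaching() would fall back to the table-level value of 20.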

Users need to find a balance between the number of RPC requests and the memory consumed on the client and the server. In many cases, raising the scanner cache improves scan performance, but setting it too high has negative effects: each next() call takes longer, because more data must be fetched and transferred to the client, and if the data returned to the client exceeds its heap size, the program terminates with an OutOfMemoryError.
Furthermore, when transferring and processing the data takes longer than the configured scanner lease timeout, the user receives a lease-expiration error in the form of a ScannerTimeoutException.
The following sample code demonstrates such a timeout while using the scanner:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class HBaseResultScanner {

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    // read the currently configured lease period
    int scannerTimeout = (int) conf.getLong(
        HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
    try {
      // sleep longer than the lease period so the lease expires
      Thread.sleep(scannerTimeout + 5000);
    } catch (InterruptedException e) {
      // ignored for this demonstration
    }
    while (true) {
      try {
        Result result = scanner.next();
        if (result == null) break;
        System.out.println(result);
      } catch (Exception e) {
        e.printStackTrace();
        break;
      }
    }
    scanner.close();
  }
}

This code retrieves the currently configured lease period, sleeps longer than that period, and the server side then detects the expired lease and triggers the lease-recovery operation.

Users may try to add the following to the configuration:

Configuration conf = HBaseConfiguration.create();
conf.setLong(HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, 12000);

This modification does not extend the timeout: the value is only set inside the client application and is never passed to the remote region servers.
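Since the lease is enforced by the region servers, the timeout can only be extended there. As a sketch: add the property behind HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY to the hbase-site.xml of every region server and restart them, for example with a two-minute lease:

<property>
 <name>hbase.regionserver.lease.period</name>
 <value>120000</value>
</property>

The 120000 ms value is only an illustration; pick a value that matches the actual per-row processing time of the client.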
For rows containing very large amounts of data, a single row may exceed the memory capacity of the client process. HBase and its client API have a solution to this problem: batching. Users can control batching with the following methods:

void setBatch(int batch)
int getBatch()

Caching operates at the row level, while batching operates at the column level. Batching lets the user choose how many columns each next() call on a ResultScanner instance retrieves.
If a row contains more columns than the value set for the batch, the row is split into fragments, and each next() call returns one fragment.
When the number of columns in a row is not evenly divisible by the batch value, the last Result instance returned contains fewer columns. For example, if a row has 17 columns and the batch is set to 5, four Result instances are returned, containing 5, 5, 5, and 2 columns respectively.
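To make this fragmentation visible, here is a minimal sketch (the "testtable" name and a wide 17-column row are assumptions for illustration) that prints the number of columns carried by each Result:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchFragments {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Scan scan = new Scan();
    scan.setBatch(5); // each Result carries at most five columns
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      // a 17-column row would arrive as four Results of 5, 5, 5, and 2 columns
      System.out.println(Bytes.toString(result.getRow()) +
          ": " + result.size() + " columns");
    }
    scanner.close();
  }
}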
Combining scanner caching and batch size allows users to precisely control the number of RPC calls required to scan the row keys in a given range.
The following example shows the effect of both parameters:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.log4j.Appender;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

public class HBaseResultScanner {

  private static void scan(int caching, int batch) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Logger log = Logger.getLogger("org.apache.hadoop");
    final int[] counters = {0, 0};
    // count the client's "next" RPC calls by watching the debug log
    Appender appender = new AppenderSkeleton() {
      @Override
      protected void append(LoggingEvent event) {
        String msg = event.getMessage().toString();
        if (msg != null && msg.contains("Call: next")) {
          counters[0]++;
        }
      }
      @Override
      public boolean requiresLayout() {
        return false;
      }
      @Override
      public void close() {}
    };
    log.removeAllAppenders();
    log.setAdditivity(false);
    log.addAppender(appender);
    log.setLevel(Level.DEBUG);
    Scan scan = new Scan();
    scan.setCaching(caching);
    scan.setBatch(batch);
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      counters[1]++; // count the Result instances handed to the client
    }
    scanner.close();
    System.out.println("Caching: " + caching + ", Batch: " + batch +
        ", Results: " + counters[1] + ", RPCs: " + counters[0]);
  }

  public static void main(String[] args) throws IOException {
    scan(1, 1);
    scan(200, 1);
    scan(2000, 100);
    scan(2, 100);
    scan(2, 10);
    scan(5, 100);
    scan(5, 20);
    scan(10, 10);
  }
}

Running the code produces the following results:

Caching: 1, Batch: 1, Results: 200, RPCs: 201
Caching: 200, Batch: 1, Results: 200, RPCs: 2
Caching: 2000, Batch: 100, Results: 10, RPCs: 1
Caching: 2, Batch: 100, Results: 10, RPCs: 6
Caching: 2, Batch: 10, Results: 20, RPCs: 11
Caching: 5, Batch: 100, Results: 10, RPCs: 3
Caching: 5, Batch: 20, Results: 10, RPCs: 3
Caching: 10, Batch: 10, Results: 20, RPCs: 3

Users can modify the two parameters to see how they affect the output, as summarized in the following table:

Caching  Batch  Results  RPCs  Description
1        1      200      201   Each column is returned as a separate Result instance. One additional RPC confirms that the scan is complete.
200      1      200      2     Each Result instance contains only one column, but all 200 of them are retrieved by a single RPC request.
2        10     20       11    The batch is half the number of columns per row, so 200 columns divided by 10 yields 20 Result instances; with a cache of 2 they take 10 RPC requests, plus one to detect the end of the scan.
5        100    10       3     The batch is larger than the number of columns per row, so all 20 columns of a row are placed into one Result instance. With a cache of 5, the 10 Result instances are retrieved by two RPC requests, plus the completion check.
5        20     10       3     Same as above, but this time the batch value exactly matches the number of columns per row, so the output is identical.
10       10     20       3     The rows are again split into smaller Result instances, but the larger cache value means only two RPC requests carry data, plus the completion check.

To calculate the number of RPC requests a single scan performs, the user first multiplies the number of rows by the number of columns per row, divides the result by the smaller of the batch size and the number of columns per row, and finally divides that by the scanner caching value. Expressed as a formula:

Number of RPC requests = (rows × columns per row)
  / min(columns per row, batch size) / scanner caching
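As a worked example, the caching-2/batch-10 run above scans 10 rows of 20 columns each: (10 × 20) / min(20, 10) / 2 = 10 RPC requests carrying data, which matches the reported 11 once the extra call that detects the end of the scan is included. The same arithmetic as a hypothetical helper method, using ceiling division so partial batches and partially filled caches are counted:

public class ScanRpcEstimate {
  // Hypothetical helper: estimates the data-carrying RPCs of a scan
  // (the open/close requests mentioned below are not included).
  static long scanRpcs(long rows, long colsPerRow, long batch, long caching) {
    long perResult = Math.min(colsPerRow, batch);
    long resultsPerRow = (colsPerRow + perResult - 1) / perResult; // ceiling
    long totalResults = rows * resultsPerRow;
    return (totalResults + caching - 1) / caching;                 // ceiling
  }

  public static void main(String[] args) {
    // 10 rows x 20 columns, batch 10, caching 2 -> prints 10
    System.out.println(scanRpcs(10, 20, 10, 2));
  }
}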

In addition, RPC requests are needed to open and close the scanner; users may want to take these two requests into account as well.
The following figure shows how the caching and batching parameters work together.

[Figure: scanner caching and batching combined]

A small batch value lets the server pack three columns into each Result instance, while a scanner cache of 6 lets each RPC request transfer six of these batched Result instances. If no batch size is specified but scanner caching is, each call result contains complete rows, because every row is packed into a single Result instance. The intra-row scan feature is enabled only when batching is used.
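Expressed in code, the combination the figure describes would be configured like this (a sketch; the values 6 and 3 come from the figure description above):

Scan scan = new Scan();
scan.setCaching(6); // up to six Result instances per RPC
scan.setBatch(3);   // up to three columns per Result instance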
