HBase Basics (8): Caching and Batch Processing of Scan Operations


Each call to next() sends a separate RPC request for every row of data, even when the next(int nbRows) method is used, because that method simply calls next() in a loop on the client side. Obviously, this performs poorly when the cells are small, so it makes more sense for a single RPC request to fetch more than one row of data. This is what the scanner cache provides; it is disabled by default.
The cache can be enabled at two levels: at the table level, it takes effect for all scan instances of that table; at the scan level, it affects only the current scan instance. Users can set the table-level scanner cache with the following HTable methods:

void setScannerCaching(int scannerCaching)
int getScannerCaching()

Users can also change the default value of 1 for the entire HBase cluster by adding the following configuration entry to hbase-site.xml:

<property>
 <name>hbase.client.scanner.caching</name>
 <value>10</value>
</property>

This sets the scanner cache size of all scan instances to 10. Users can still override this default at the table or scan level, but they need to do so explicitly.
setScannerCaching() sets the cache size, and getScannerCaching() returns the currently configured value. Each time the user calls getScanner(scan), the API applies the table-level setting to the scan instance, unless a scan-level value has already been set; the scan-level configuration has the highest priority. Scan-level caching is set with the following Scan class methods:

void setCaching(int caching)
int getCaching()

These two methods, like the table-level methods above, control the number of rows retrieved per RPC call. Both versions of next() are affected by this configuration.
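As a minimal sketch of how the two levels interact (assuming a table named "testtable" exists, and using the classic HTable-based client API shown throughout this article):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachingLevels {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    table.setScannerCaching(20); // table level: default for every scan on this table

    Scan scan = new Scan();
    scan.setCaching(50);         // scan level: overrides the table-level value for this scan
    ResultScanner scanner = table.getScanner(scan);
    scanner.close();
  }
}

Here the scan uses a caching value of 50; any scan created without calling setCaching() would fall back to the table-level value of 20.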

Users need to find a balance between the number of RPC requests and the memory consumed on the client and the server. In many cases, raising the scanner cache improves scan performance, but setting it too high has negative effects: each next() call takes longer, because more data must be fetched and transferred to the client, and if the data returned to the client exceeds its heap size, the program terminates with an OutOfMemoryError.
Furthermore, when transferring and processing the data takes longer than the configured scanner lease timeout, the user receives a lease-expiration error in the form of a ScannerTimeoutException.
The following sample code demonstrates such a timeout while using the scanner:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class HBaseResultScanner {

  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    // read the currently configured lease period
    int scannerTimeout = (int) conf.getLong(
        HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, -1);
    try {
      // sleep longer than the lease period so the lease expires
      Thread.sleep(scannerTimeout + 5000);
    } catch (InterruptedException e) {
      // ignored for this demonstration
    }
    while (true) {
      try {
        Result result = scanner.next();
        if (result == null) break;
        System.out.println(result);
      } catch (Exception e) {
        e.printStackTrace();
        break;
      }
    }
    scanner.close();
  }
}

This code retrieves the currently configured lease period, sleeps longer than that period, and the server side then detects the expired lease and triggers the lease-recovery operation.

Users may try to add the following to the configuration:

Configuration conf = HBaseConfiguration.create();
conf.setLong(HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, 12000);

This modification does not extend the timeout: the value is only set inside the client application and is never passed to the remote region servers.
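Since the lease is enforced by the region servers, the timeout can only be extended there. As a sketch: add the property behind HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY to the hbase-site.xml of every region server and restart them, for example with a two-minute lease:

<property>
 <name>hbase.regionserver.lease.period</name>
 <value>120000</value>
</property>

The 120000 ms value is only an illustration; pick a value that matches the actual per-row processing time of the client.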
For rows containing very large amounts of data, a single row may exceed the memory capacity of the client process. HBase and its client API have a solution to this problem: batching. Users can control batching with the following methods:

void setBatch(int batch)
int getBatch()

Caching operates at the row level, while batching operates at the column level. Batching lets the user choose how many columns each next() call on a ResultScanner instance retrieves.
If a row contains more columns than the value set for the batch, the row is split into fragments, and each next() call returns one fragment.
When the number of columns in a row is not evenly divisible by the batch value, the last Result instance returned contains fewer columns. For example, if a row has 17 columns and the batch is set to 5, four Result instances are returned, containing 5, 5, 5, and 2 columns respectively.
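To make this fragmentation visible, here is a minimal sketch (the "testtable" name and a wide 17-column row are assumptions for illustration) that prints the number of columns carried by each Result:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchFragments {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Scan scan = new Scan();
    scan.setBatch(5); // each Result carries at most five columns
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      // a 17-column row would arrive as four Results of 5, 5, 5, and 2 columns
      System.out.println(Bytes.toString(result.getRow()) +
          ": " + result.size() + " columns");
    }
    scanner.close();
  }
}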
Combining scanner caching and batch size allows users to precisely control the number of RPC calls required to scan the row keys in a given range.
The following example shows the effect of both parameters:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.log4j.Appender;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

public class HBaseResultScanner {

  private static void scan(int caching, int batch) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "testtable");
    Logger log = Logger.getLogger("org.apache.hadoop");
    final int[] counters = {0, 0};
    // count the client's "next" RPC calls by watching the debug log
    Appender appender = new AppenderSkeleton() {
      @Override
      protected void append(LoggingEvent event) {
        String msg = event.getMessage().toString();
        if (msg != null && msg.contains("Call: next")) {
          counters[0]++;
        }
      }
      @Override
      public boolean requiresLayout() {
        return false;
      }
      @Override
      public void close() {}
    };
    log.removeAllAppenders();
    log.setAdditivity(false);
    log.addAppender(appender);
    log.setLevel(Level.DEBUG);
    Scan scan = new Scan();
    scan.setCaching(caching);
    scan.setBatch(batch);
    ResultScanner scanner = table.getScanner(scan);
    for (Result result : scanner) {
      counters[1]++; // count the Result instances handed to the client
    }
    scanner.close();
    System.out.println("Caching: " + caching + ", Batch: " + batch +
        ", Results: " + counters[1] + ", RPCs: " + counters[0]);
  }

  public static void main(String[] args) throws IOException {
    scan(1, 1);
    scan(200, 1);
    scan(2000, 100);
    scan(2, 100);
    scan(2, 10);
    scan(5, 100);
    scan(5, 20);
    scan(10, 10);
  }
}

Running the code produces the following results:

Caching: 1, Batch: 1, Results: 200, RPCs: 201
Caching: 200, Batch: 1, Results: 200, RPCs: 2
Caching: 2000, Batch: 100, Results: 10, RPCs: 1
Caching: 2, Batch: 100, Results: 10, RPCs: 6
Caching: 2, Batch: 10, Results: 20, RPCs: 11
Caching: 5, Batch: 100, Results: 10, RPCs: 3
Caching: 5, Batch: 20, Results: 10, RPCs: 3
Caching: 10, Batch: 10, Results: 20, RPCs: 3

Users can modify the two parameters to see how they affect the output, as summarized in the following table:

Caching  Batch  Results  RPCs  Description
1        1      200      201   Each column is returned as a separate Result instance. One additional RPC confirms that the scan is complete.
200      1      200      2     Each Result instance contains only one column, but all 200 of them are retrieved by a single RPC request.
2        10     20       11    The batch is half the number of columns per row, so 200 columns divided by 10 yields 20 Result instances; with a cache of 2 they take 10 RPC requests, plus one to detect the end of the scan.
5        100    10       3     The batch is larger than the number of columns per row, so all 20 columns of a row are placed into one Result instance. With a cache of 5, the 10 Result instances are retrieved by two RPC requests, plus the completion check.
5        20     10       3     Same as above, but this time the batch value exactly matches the number of columns per row, so the output is identical.
10       10     20       3     The rows are again split into smaller Result instances, but the larger cache value means only two RPC requests carry data, plus the completion check.

To calculate the number of RPC requests a single scan performs, the user first multiplies the number of rows by the number of columns per row, divides the result by the smaller of the batch size and the number of columns per row, and finally divides that by the scanner caching value. Expressed as a formula:

Number of RPC requests = (rows × columns per row)
  / min(columns per row, batch size) / scanner caching
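As a worked example, the caching-2/batch-10 run above scans 10 rows of 20 columns each: (10 × 20) / min(20, 10) / 2 = 10 RPC requests carrying data, which matches the reported 11 once the extra call that detects the end of the scan is included. The same arithmetic as a hypothetical helper method, using ceiling division so partial batches and partially filled caches are counted:

public class ScanRpcEstimate {
  // Hypothetical helper: estimates the data-carrying RPCs of a scan
  // (the open/close requests mentioned below are not included).
  static long scanRpcs(long rows, long colsPerRow, long batch, long caching) {
    long perResult = Math.min(colsPerRow, batch);
    long resultsPerRow = (colsPerRow + perResult - 1) / perResult; // ceiling
    long totalResults = rows * resultsPerRow;
    return (totalResults + caching - 1) / caching;                 // ceiling
  }

  public static void main(String[] args) {
    // 10 rows x 20 columns, batch 10, caching 2 -> prints 10
    System.out.println(scanRpcs(10, 20, 10, 2));
  }
}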

In addition, RPC requests are needed to open and close the scanner; users may want to take these two requests into account as well.
The following figure shows how the caching and batching parameters work together.

[Figure: scanner caching and batching combined]

A small batch value lets the server pack three columns into each Result instance, while a scanner cache of 6 lets each RPC request transfer six of these batched Result instances. If no batch size is specified but scanner caching is, each call result contains complete rows, because every row is packed into a single Result instance. The intra-row scan feature is enabled only when batching is used.
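Expressed in code, the combination the figure describes would be configured like this (a sketch; the values 6 and 3 come from the figure description above):

Scan scan = new Scan();
scan.setCaching(6); // up to six Result instances per RPC
scan.setBatch(3);   // up to three columns per Result instance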
