A summary of HBase built-in filters


HBase provides a set of built-in filters that let you filter data along several dimensions (rows, columns, and data versions). A filter can ultimately narrow the returned data down to a specific cell, located by row key, column, and timestamp. In general, the most common scenarios are filtering by row key and by value.
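Before going through the filters one by one, here is a minimal sketch of how any of them is used: you attach the filter to a Scan (or a Get) and iterate over the results. The snippet assumes the "testtable" table populated by the demo program at the end of this post, and uses the same older HTable/CompareOp client API as the rest of the code here; the imports are the same as in the test program below:

		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "testtable"); // assumes this table exists
		Scan scan = new Scan();
		// any filter from this post can be plugged in here
		scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
				new BinaryComparator(Bytes.toBytes("Row1"))));
		ResultScanner scanner = table.getScanner(scan);
		for (Result res : scanner) {
			System.out.println(res); // only data accepted by the filter reaches the client
		}
		scanner.close();
		table.close();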


1. RowFilter: filters rows by row key. Its application scenario is very intuitive: with a BinaryComparator you can select the row with an exact row key, or, by changing the compare operator (CompareFilter.CompareOp.EQUAL in the example below) or the comparator, match multiple rows that satisfy some condition. The following selects the row whose key is Row1:


Filter rf = new RowFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("Row1"))); // keeps only the row whose key equals Row1

2. PrefixFilter: keeps the rows whose row key starts with a specific prefix. The same functionality can actually be achieved with a RowFilter combined with a RegexStringComparator, but this filter is the more convenient way. The following keeps every row whose key starts with "Row":

Filter pf = new PrefixFilter(Bytes.toBytes("Row")); // keeps rows whose key starts with "Row"
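For comparison, here is a sketch of the RowFilter equivalent mentioned above, using RegexStringComparator with an anchored pattern (this comparator lives in org.apache.hadoop.hbase.filter but is not in the import list of the demo program below, so add that import if you try it):

		Filter prefixViaRegex = new RowFilter(CompareFilter.CompareOp.EQUAL,
				new RegexStringComparator("^Row.*")); // same rows as the PrefixFilter above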

3. KeyOnlyFilter: the only thing this filter does is return the key of each row, with all values left empty. It is well suited to applications that care only about row keys: by not shipping the values, it reduces the amount of data sent to the client, which can serve as a small optimization:

Filter kof = new KeyOnlyFilter(); // returns all rows, but with empty values

4. RandomRowFilter: the name gives away its usage. This filter includes each row with a given probability (a chance <= 0 filters out all rows, >= 1 includes all rows). For the same data set, running the same RandomRowFilter multiple times returns a different result set each time, so it suits scenarios where you need to randomly sample part of the data:

Filter rrf = new RandomRowFilter((float) 0.8); // randomly selects roughly 80% of the rows

5. InclusiveStopFilter: when scanning, we can set a start row key and a stop row key; by default that range is half-open, i.e. it includes the start row but excludes the stop row. If we want both the start and the stop row included, we can use this filter:

Filter isf = new InclusiveStopFilter(Bytes.toBytes("Row1")); // the scan's upper bound is included in the result
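As a sketch of how it combines with a scan: set only the start row on the Scan and let the filter supply the (now inclusive) upper bound. The row keys here match the demo data written at the end of this post:

		Scan scan = new Scan();
		scan.setStartRow(Bytes.toBytes("Row1")); // inclusive lower bound
		scan.setFilter(new InclusiveStopFilter(Bytes.toBytes("Row2"))); // inclusive upper bound
		// equivalent to the closed interval [Row1, Row2]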

6. FirstKeyOnlyFilter: if you only want a result set containing the first column of each row, this filter meets that requirement. It stops scanning a row as soon as it finds the row's first column, which gives the scan a certain performance boost:

Filter fkof = new FirstKeyOnlyFilter(); // keeps only the first cell of each row

7. ColumnPrefixFilter: as the name implies, it filters cells by the prefix of the column qualifier. If we want to restrict the returned columns to those with a certain prefix, we can use this filter:

Filter cpf = new ColumnPrefixFilter(Bytes.toBytes("Qual1")); // keeps only columns whose qualifier starts with "Qual1"

8. ValueFilter: filters cells by their values, dropping every cell whose value fails the condition. For example, with the constructor below, any cell whose value does not contain the substring Row2_qual1 will not be returned to the client:

Filter vf = new ValueFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("Row2_qual1")); // keeps only cells whose value contains "Row2_qual1"

9. ColumnCountGetFilter: this filter limits the number of columns returned per row, and it ends the whole scan as soon as some row's column count exceeds the limit we set:

Filter ccf = new ColumnCountGetFilter(2); // once a row exceeds the column limit, the entire scan stops
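As the name suggests, this filter is primarily intended for Get operations; here is a sketch of capping a wide row with a Get (org.apache.hadoop.hbase.client.Get is not imported in the demo program below, so add that import if you try this):

		Get get = new Get(Bytes.toBytes("Row1"));
		get.setFilter(new ColumnCountGetFilter(1)); // return at most one column of Row1
		Result res = table.get(get);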

10. SingleColumnValueFilter: decides whether a whole row is filtered based on the value of one particular column. On the filter object you can call setFilterIfMissing(true) or setFilterIfMissing(false); the default is false. The setting controls what happens to rows that lack the condition column entirely: with true such rows are filtered out, with false they are included in the result set.


		SingleColumnValueFilter scvf = new SingleColumnValueFilter(
				Bytes.toBytes("colfam1"),
				Bytes.toBytes("Qual2"),
				CompareFilter.CompareOp.NOT_EQUAL,
				new SubstringComparator("BOGUS"));
		scvf.setFilterIfMissing(false);  // rows without colfam1:Qual2 are still included
		scvf.setLatestVersionOnly(true); // test only the newest version of the cell

11. SingleColumnValueExcludeFilter: the only difference from filter 10 is that the column used as the filter condition is itself excluded from the returned results.
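A minimal sketch, mirroring the SingleColumnValueFilter example above: the condition is still evaluated against colfam1:Qual2, but that column is dropped from the rows that come back (add the org.apache.hadoop.hbase.filter.SingleColumnValueExcludeFilter import if you try this in the test program below):

		SingleColumnValueExcludeFilter scvef = new SingleColumnValueExcludeFilter(
				Bytes.toBytes("colfam1"),
				Bytes.toBytes("Qual2"),
				CompareFilter.CompareOp.NOT_EQUAL,
				new SubstringComparator("BOGUS"));
		scvef.setFilterIfMissing(false); // rows lacking colfam1:Qual2 are still included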

12. SkipFilter: this is a wrapping filter, used together with a filter such as the ValueFilter above. If any column in a row fails the wrapped filter's condition, the entire row is filtered out:

Filter skf = new SkipFilter(vf); // once one cell in a row fails the wrapped filter, the whole row is skipped

13. WhileMatchFilter: this filter is also very simple to apply. If you want all the data up to the first piece of data that fails some condition, you can use this filter; as soon as one piece of data fails the wrapped filter's condition, the entire scan ends:

Filter wmf = new WhileMatchFilter(rf); // similar to takewhile in Python's itertools

14. FilterList: combines multiple filters. There are two combining modes, FilterList.Operator.MUST_PASS_ONE and FilterList.Operator.MUST_PASS_ALL, with MUST_PASS_ALL as the default; as the names imply, these are the and and or relationships. A FilterList can also nest other FilterLists, which lets us express more complex requirements (see the sketch after the snippet below):

		List<Filter> filters = new ArrayList<Filter>();
		filters.add(rf);
		filters.add(vf);
		FilterList fl = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters); // combines multiple filters with and / or semantics
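As mentioned above, a FilterList can nest other FilterLists, since FilterList itself implements Filter. Here is a sketch, reusing the rf, vf, and cpf filters defined in this post, that expresses "the row key condition must hold, and at least one of the two cell conditions must hold":

		List<Filter> anyOf = new ArrayList<Filter>();
		anyOf.add(vf);  // value contains "Row2_qual1" ...
		anyOf.add(cpf); // ... or the qualifier starts with "Qual1"
		FilterList inner = new FilterList(FilterList.Operator.MUST_PASS_ONE, anyOf);

		List<Filter> allOf = new ArrayList<Filter>();
		allOf.add(rf);    // the row key condition
		allOf.add(inner); // and the or-combination above
		FilterList nested = new FilterList(FilterList.Operator.MUST_PASS_ALL, allOf);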
	

The above is a partial summary of the filters built into HBase. The following program writes the test data used in the examples:

package com.reyun.hbase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataFeeding {
	private final static byte[] ROW1 = Bytes.toBytes("Row1");
	private final static byte[] ROW2 = Bytes.toBytes("Row2");
	private final static byte[] COLFAM1 = Bytes.toBytes("colfam1");
	private final static byte[] COLFAM2 = Bytes.toBytes("colfam2");
	private final static byte[] QUAL1 = Bytes.toBytes("Qual1");
	private final static byte[] QUAL2 = Bytes.toBytes("Qual2");

	public static void main(String[] args) throws IOException {
		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "testtable");
		table.setAutoFlushTo(false);
		Put put_row1 = new Put(ROW1);
		put_row1.add(COLFAM1, QUAL1, Bytes.toBytes("Row1_qual1_val"));
		put_row1.add(COLFAM1, QUAL2, Bytes.toBytes("Row1_qual2_val"));
		Put put_row2 = new Put(ROW2);
		put_row2.add(COLFAM1, QUAL1, Bytes.toBytes("Row2_qual1_val"));
		put_row2.add(COLFAM1, QUAL2, Bytes.toBytes("Row2_qual2_val"));
		try {
			table.put(put_row1);
			table.put(put_row2);
		} finally {
			table.close();
		}
	}
}

The following is the filter test code; you can modify it, swapping in different filters, to see their concrete effect:

package com.reyun.hbase;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.ColumnCountGetFilter;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.InclusiveStopFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.RandomRowFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.filter.SkipFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.filter.WhileMatchFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScannerTest {
	public static void main(String[] args) throws IOException, IllegalAccessException {
		Configuration conf = HBaseConfiguration.create();
		HTable table = new HTable(conf, "testtable");
		table.setAutoFlushTo(false);
		Scan scan1 = new Scan();

		SingleColumnValueFilter scvf = new SingleColumnValueFilter(
				Bytes.toBytes("colfam1"), Bytes.toBytes("Qual2"),
				CompareFilter.CompareOp.NOT_EQUAL, new SubstringComparator("BOGUS"));
		scvf.setFilterIfMissing(false);
		scvf.setLatestVersionOnly(true);
		Filter ccf = new ColumnCountGetFilter(2); // once a row exceeds the column limit, the entire scan stops
		Filter vf = new ValueFilter(CompareFilter.CompareOp.EQUAL,
				new SubstringComparator("Row2_qual1")); // keeps cells whose value contains "Row2_qual1"
		Filter cpf = new ColumnPrefixFilter(Bytes.toBytes("Qual2")); // keeps columns whose qualifier starts with "Qual2"
		Filter fkof = new FirstKeyOnlyFilter(); // keeps only the first cell of each row
		Filter isf = new InclusiveStopFilter(Bytes.toBytes("Row1")); // includes the stop row in the result
		Filter rrf = new RandomRowFilter((float) 0.8); // randomly selects a portion of the rows
		Filter kof = new KeyOnlyFilter(); // returns all rows, but with empty values
		Filter pf = new PrefixFilter(Bytes.toBytes("Row")); // keeps rows whose key starts with "Row"
		Filter rf = new RowFilter(CompareFilter.CompareOp.NOT_EQUAL,
				new BinaryComparator(Bytes.toBytes("Row1"))); // keeps rows whose key is not Row1
		Filter wmf = new WhileMatchFilter(rf); // similar to takewhile in Python's itertools
		Filter skf = new SkipFilter(vf); // once one cell in a row fails the wrapped filter, the whole row is skipped

		List<Filter> filters = new ArrayList<Filter>();
		filters.add(rf);
		filters.add(vf);
		FilterList fl = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters); // combines multiple filters with and / or semantics

		scan1.setStartRow(Bytes.toBytes("Row1"))
				.setStopRow(Bytes.toBytes("row3"))
				.setFilter(scvf);

		ResultScanner scanner1 = table.getScanner(scan1);
		for (Result res : scanner1) {
			for (Cell cell : res.rawCells()) {
				System.out.println("KV: " + cell + ", Value: "
						+ Bytes.toString(CellUtil.cloneValue(cell)));
			}
			System.out.println("------------------------------------------------------------");
		}
		scanner1.close();
		table.close();
	}
}

