In HBase, reads are very frequent. Many operations go through the meta table: the client uses it to locate the specific RegionServer and then queries the data in that region.
The problem is that a region consists of a memstore and multiple store files (HFiles). The memstore is a write cache in server memory that improves insertion efficiency; when it reaches a certain size (set by hbase.hregion.memstore.flush.size) or when the user flushes it manually, it is persisted to a distributed file system such as HDFS. In other words, a region can correspond to many files containing valid data, and although the data within each file is sorted by rowkey, the rowkeys across files are not ordered relative to one another (unless a major_compact merges them into one file).
Suppose a client now requests a single column (cf1:col1) of some rowkey (row1),
for example with the shell command get 'tab', 'row1', 'cf1:col1'.
row1 may fall between the startKey and endKey of every file, so the RegionServer has to scan the relevant blocks of each file, incurring multiple physical I/Os. But there is no guarantee that row1 actually exists in every file, so many of those physical I/Os are wasted, which hurts performance greatly. Hence the Bloom filter, which can determine, to a certain extent, whether a file contains a given rowkey.
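In HBase the filter is enabled per column family when the table is created, via the BLOOMFILTER attribute (NONE, ROW, or ROWCOL). A minimal HBase shell example, with table and family names purely illustrative:

```
create 'tab', {NAME => 'cf1', BLOOMFILTER => 'ROWCOL'}
```
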
Bloom filters come in two types, ROW and ROWCOL; the principle is similar, so take the ROWCOL type as an example:
When a memstore is flushed to HDFS to form a file, part of that file is a section called meta, and the write process follows this algorithm:
1. First initialize a long bit array, say bit arr[n] = {0};
2. Using k hash functions (k < n), hash each (row:cf:col) entry k times, guaranteeing that every result falls in [0, n-1];
3. If the result of one hash function is r, set arr[r] = 1; thus each (row:cf:col) yields k results, and the corresponding positions in arr are set to 1;
4. Repeat until all data has been written to the file, then write arr into the meta section of the file.
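The steps above, together with the read-time check described later, can be sketched in Python. This is a toy model, not HBase's actual implementation (HBase uses its own hashing and a byte-packed bit array); the double-hashing trick and the parameters n and k are illustrative choices:

```python
import hashlib


class BloomFilter:
    """Toy ROWCOL-style Bloom filter: a bit array plus k hash functions."""

    def __init__(self, n=1024, k=5):
        self.n, self.k = n, k
        self.arr = [0] * n  # step 1: bit array initialized to all zeros

    def _hashes(self, key):
        # Derive k values in [0, n-1] from two base digests (double
        # hashing), standing in for k independent hash functions.
        h1 = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], 'big')
        h2 = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], 'big')
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, row, cf, col):
        # steps 2-3: hash (row:cf:col) k times and set those positions to 1
        for r in self._hashes(f"{row}:{cf}:{col}"):
            self.arr[r] = 1

    def might_contain(self, row, cf, col):
        # read path: if any of the k positions is 0, the key is
        # definitely absent; if all are 1, it is only probably present
        return all(self.arr[r] == 1
                   for r in self._hashes(f"{row}:{cf}:{col}"))


# Usage: add one entry, then probe for it.
bf = BloomFilter()
bf.add('row1', 'cf1', 'col1')
print(bf.might_contain('row1', 'cf1', 'col1'))  # an added key always passes
```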
Because of the structural characteristics of the bitmap index itself, arr[n] is guaranteed not to be large, so even when cached in memory (not in the memstore) it does not take up much space. Bitmap indexes can cause heavy lock contention in relational databases, especially OLTP systems, but in HBase a file that has already been written is almost never modified (except by compaction), so this is not a problem.
Now look at get 'tab', 'row1', 'cf1:col1' again. To decide whether a file contains (row1:cf1:col1), the RegionServer only needs to hash row1:cf1:col1 with the k functions and check whether every corresponding position in arr is 1. If any one of them is not, the column data is definitely not in that file (although all 1s does not guarantee that it is), which avoids reading unnecessary files and improves query efficiency.
In short, a Bloom filter can to some extent avoid reading unnecessary files, but because it is based on hash functions it is not completely accurate; and for operations such as large-range scans, the filter is of no help.
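How inaccurate is it? The standard estimate of a Bloom filter's false-positive probability, for an n-bit array, k hash functions, and m inserted keys, is (1 - e^(-km/n))^k. A quick calculation with illustrative sizes (a 1 MiB bit array, 5 hash functions, one million entries):

```python
import math


def false_positive_prob(n, k, m):
    """Approximate false-positive rate of a Bloom filter with an
    n-bit array, k hash functions, and m inserted keys."""
    return (1 - math.exp(-k * m / n)) ** k


# Illustrative sizes: 1 MiB of bits, 5 hashes, 1M (row:cf:col) entries.
n = 8 * 1024 * 1024
print(f"{false_positive_prob(n, 5, 1_000_000):.4%}")  # around 1-2%
```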
2017.1.15
Bitmap index in HBase--the Bloom filter