There are two types of compaction:
(1) Minor compaction: lightweight. It rewrites several small StoreFiles into a smaller number of larger StoreFiles, reducing the file count; in effect it is a multi-way merge. It does not remove data marked as "deleted" or data that has already expired, and several StoreFiles may still remain after a single minor compaction. Because the data inside each HFile is already sorted, the merge is fast and is limited mainly by disk I/O performance.
(2) Major compaction: heavyweight. It rewrites the whole list of StoreFiles under a region into a single StoreFile. It scans every <key,value> pair and rewrites all the data sequentially; during the rewrite it skips data marked as deleted, so this is the point at which deletions actually take effect. The operation blocks client requests to the region it belongs to until the merge completes, and finally the old StoreFiles that were merged are deleted.
RegionServer memory is generally configured as follows:
(1) MemStore, approximately 40% of memory (used mainly for writes):
Write requests go to the MemStore first. The RegionServer provides a MemStore for each region, and a MemStore is flushed to disk once it fills up. When the total size of all MemStores exceeds the limit, a flush is forced, starting from the largest MemStore and continuing until the total falls below the limit.
(2) BlockCache, approximately 40% of memory (used mainly for reads):
A read request first checks the MemStore; if the data is not found there it checks the BlockCache, and only then reads from disk, placing the result in the BlockCache. The BlockCache uses an LRU algorithm: when it reaches its upper limit, it evicts the batch of data that was least recently used. Each RegionServer has only one BlockCache.
(3) Other, about 20% of the memory space.
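The forced flush described under (1) can be sketched as a small loop. This is a minimal illustration, assuming MemStore sizes are tracked in a plain dictionary; the function name `forced_flush`, the region names, and the sizes (in MB) are hypothetical and do not correspond to HBase internals.

```python
def forced_flush(memstores, global_limit):
    """Flush the largest memstores first until the total is within the limit.

    memstores: dict mapping region name -> memstore size (any unit).
    Returns the regions flushed, in order. The dict is updated in place.
    """
    flushed = []
    while sum(memstores.values()) > global_limit:
        region = max(memstores, key=memstores.get)  # largest memstore
        if memstores[region] == 0:
            break  # nothing left to flush
        memstores[region] = 0  # its contents now live in a new StoreFile
        flushed.append(region)
    return flushed

# Hypothetical sizes in MB, with a global limit of 100 MB.
sizes = {"region-a": 64, "region-b": 128, "region-c": 32}
order = forced_flush(sizes, global_limit=100)
```

Here the total starts at 224 MB, so `region-b` is flushed first; the remaining 96 MB is within the limit and the loop stops without touching the smaller MemStores.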
In application scenarios that focus on read response time, you can make the BlockCache larger and the MemStore smaller to increase the cache hit ratio.
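To see why a larger cache helps read-heavy workloads, the read path (MemStore, then BlockCache, then disk) can be modelled with a toy LRU cache. This is a sketch under simplifying assumptions: the cache holds whole values rather than HFile blocks, capacity is counted in entries rather than bytes, and `ToyLruCache`, `read`, and `hit_ratio` are names invented for this example.

```python
from collections import OrderedDict

class ToyLruCache:
    """Tiny LRU cache; capacity is a number of entries, not bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)  # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        self.blocks[key] = block
        self.blocks.move_to_end(key)
        while len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

def read(key, memstore, cache, disk, stats):
    """Check the memstore, then the cache, then disk (filling the cache)."""
    if key in memstore:
        stats["memstore"] += 1
        return memstore[key]
    block = cache.get(key)
    if block is not None:
        stats["cache_hit"] += 1
        return block
    stats["disk"] += 1
    block = disk[key]
    cache.put(key, block)
    return block

def hit_ratio(capacity, keys):
    """Replay a read workload and return cache hits / cache lookups."""
    disk = {k: f"value-{k}" for k in set(keys)}
    cache = ToyLruCache(capacity)
    stats = {"memstore": 0, "cache_hit": 0, "disk": 0}
    for k in keys:
        read(k, {}, cache, disk, stats)
    return stats["cache_hit"] / (stats["cache_hit"] + stats["disk"])

workload = [0, 1, 2, 3] * 3        # four hot keys read in a cycle
small = hit_ratio(2, workload)     # cache smaller than the working set
large = hit_ratio(4, workload)     # cache holds the whole working set
```

With this cyclic workload the two-entry cache never hits, because each key is evicted before it is read again, while the four-entry cache misses only on the first pass. Real workloads are less extreme, but the direction of the effect is the same.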
BlockCache tiering:
(1) An in-memory tier lets you selectively keep column families flagged as in-memory in RegionServer memory, for example the meta table's metadata;
(2) Separating the cache into single-access and multi-access tiers prevents the cache from being churned by scan operations, because the least-used blocks are the first candidates for eviction.
By default, the BlockCache memory is divided among the single, multi, and in-memory tiers in the proportions 0.25, 0.50, and 0.25.
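The tiering idea can be illustrated with a toy cache that uses the 0.25 / 0.50 / 0.25 split above. This is not how HBase implements eviction; it is a simplified sketch in which a block enters the single tier on first load, moves to the multi tier when it is read again, and eviction always takes the least recently used block from whichever tier is furthest over its share. `TieredCache` and the block names are hypothetical.

```python
from collections import OrderedDict

SHARES = {"single": 0.25, "multi": 0.50, "in_memory": 0.25}

class TieredCache:
    """Toy three-tier cache; capacity is a number of blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tiers = {name: OrderedDict() for name in SHARES}

    def _size(self):
        return sum(len(tier) for tier in self.tiers.values())

    def get(self, key):
        for name, tier in self.tiers.items():
            if key in tier:
                value = tier.pop(key)
                # A second access promotes a block from single to multi.
                target = "multi" if name == "single" else name
                self.tiers[target][key] = value
                return value
        return None

    def put(self, key, value, in_memory=False):
        for tier in self.tiers.values():
            tier.pop(key, None)  # drop any stale copy
        self.tiers["in_memory" if in_memory else "single"][key] = value
        while self._size() > self.capacity:
            self._evict_one()

    def _evict_one(self):
        # Evict from the tier that exceeds its share by the most blocks.
        def overflow(name):
            return len(self.tiers[name]) - SHARES[name] * self.capacity
        victim = max(self.tiers, key=overflow)
        self.tiers[victim].popitem(last=False)  # least recently used

cache = TieredCache(capacity=8)
cache.put("meta", "meta-block", in_memory=True)
for key in ("hot1", "hot2"):
    cache.put(key, key + "-block")
    cache.get(key)                       # second access: promoted to multi
for i in range(20):                      # a long scan touches each block once
    cache.put(f"scan-{i}", f"scan-block-{i}")

survivors = sorted(k for tier in cache.tiers.values() for k in tier)
```

After the simulated scan, `meta`, `hot1`, and `hot2` are still cached and only the most recent scan blocks remain in the single tier. With one flat LRU list of the same capacity, the 20 scan blocks would have pushed all three out.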
Understanding compaction in HBase, RegionServer memory usage, and the BlockCache mechanism