ExploringCompactionPolicy, an HBase compaction algorithm

Source: Internet
Author: User

In version 0.98, the default compaction algorithm is ExploringCompactionPolicy; it was previously RatioBasedCompactionPolicy.

ExploringCompactionPolicy inherits from RatioBasedCompactionPolicy and overrides the applyCompactionPolicy method, which implements the strategy for selecting files for a minor compaction.

The applyCompactionPolicy method:

    public List<StoreFile> applyCompactionPolicy(final List<StoreFile> candidates,
        boolean mightBeStuck, boolean mayUseOffPeak, int minFiles, int maxFiles) {
      // This ratio is used by the algorithm below. A separate ratio (default: 5.0)
      // can be set for off-peak hours so that more data is merged then.
      final double currentRatio = mayUseOffPeak
          ? comConf.getCompactionRatioOffPeak() : comConf.getCompactionRatio();

      // Start off choosing nothing.
      List<StoreFile> bestSelection = new ArrayList<StoreFile>(0);
      List<StoreFile> smallest = mightBeStuck ? new ArrayList<StoreFile>(0) : null;
      long bestSize = 0;
      long smallestSize = Long.MAX_VALUE;

      int opts = 0, optsInRatio = 0, bestStart = -1; // for debug logging
      // Consider every starting place.
      for (int start = 0; start < candidates.size(); start++) {
        // Consider every different sub list permutation in between start and end with min files.
        for (int currentEnd = start + minFiles - 1;
            currentEnd < candidates.size(); currentEnd++) {
          List<StoreFile> potentialMatchFiles = candidates.subList(start, currentEnd + 1);

          // Sanity checks
          if (potentialMatchFiles.size() < minFiles) {
            continue;
          }
          if (potentialMatchFiles.size() > maxFiles) {
            continue;
          }

          // Compute the total size of files that will
          // have to be read if this set of files is compacted.
          long size = getTotalStoreSize(potentialMatchFiles);

          // Store the smallest set of files. This stored set of files will be used
          // if it looks like the algorithm is stuck.
          if (mightBeStuck && size < smallestSize) {
            smallest = potentialMatchFiles;
            smallestSize = size;
          }

          if (size > comConf.getMaxCompactSize()) {
            continue;
          }

          ++opts;
          if (size >= comConf.getMinCompactSize()
              && !filesInRatio(potentialMatchFiles, currentRatio)) {
            continue;
          }

          ++optsInRatio;
          if (isBetterSelection(bestSelection, bestSize, potentialMatchFiles, size, mightBeStuck)) {
            bestSelection = potentialMatchFiles;
            bestSize = size;
            bestStart = start;
          }
        }
      }
      if (bestSelection.size() == 0 && mightBeStuck) {
        LOG.debug("Exploring compaction algorithm has selected " + smallest.size()
            + " files of size " + smallestSize + " because the store might be stuck");
        return new ArrayList<StoreFile>(smallest);
      }
      LOG.debug("Exploring compaction algorithm has selected " + bestSelection.size()
          + " files of size " + bestSize + " starting at candidate #" + bestStart
          + " after considering " + opts + " permutations with " + optsInRatio + " in ratio");
      return new ArrayList<StoreFile>(bestSelection);
    }

As the code shows, the main steps of the algorithm are:

  1. Traverse the files from beginning to end, evaluating every eligible combination (each contiguous sub-list of the candidates).
  2. The number of files in a combination must be >= minFiles (default: 3).
  3. The number of files in a combination must be <= maxFiles (default: 10).
  4. Compute the total file size of the combination, size. size must be <= maxCompactSize (configured via hbase.hstore.compaction.max.size; the default is Long.MAX_VALUE, which effectively disables the check. The official documentation suggests changing this value only if you think compactions happen too often without much benefit).
  5. If size < minCompactSize, the combination qualifies directly; if size >= minCompactSize, it must also pass the filesInRatio check.
  6. The filesInRatio algorithm: FileSize(i) <= (Sum(0, N, FileSize(_)) - FileSize(i)) * ratio. That is, every individual file in a combination must satisfy singleFileSize <= (totalFileSize - singleFileSize) * currentRatio. The purpose is to avoid oversized compactions: no selected file may be much larger than the rest, so files of similar, small sizes are merged first. The code is as follows:
    private boolean filesInRatio(final List<StoreFile> files, final double currentRatio) {
      if (files.size() < 2) {
        return true;
      }

      long totalFileSize = getTotalStoreSize(files);

      for (StoreFile file : files) {
        long singleFileSize = file.getReader().length();
        long sumAllOtherFileSizes = totalFileSize - singleFileSize;

        if (singleFileSize > sumAllOtherFileSizes * currentRatio) {
          return false;
        }
      }
      return true;
    }
  7. Find the optimal solution: prefer the combination with more files, and when two combinations have the same number of files, prefer the one with the smaller total size. The aim is to merge as many files as possible while generating as little IO as possible.
    private boolean isBetterSelection(List<StoreFile> bestSelection, long bestSize,
        List<StoreFile> selection, long size, boolean mightBeStuck) {
      if (mightBeStuck && bestSize > 0 && size > 0) {
        // Keep the selection that removes most files for least size. That penalizes adding
        // large files to compaction, but not small files, so we don't become totally inefficient
        // (might want to tweak that in future). Also, given the current order of looking at
        // permutations, prefer earlier files and smaller selection if the difference is small.
        final double REPLACE_IF_BETTER_BY = 1.05;
        double thresholdQuality = ((double) bestSelection.size() / bestSize) * REPLACE_IF_BETTER_BY;
        return thresholdQuality < ((double) selection.size() / size);
      }
      // Keep if this gets rid of more files. Or the same number of files for less IO.
      return selection.size() > bestSelection.size()
          || (selection.size() == bestSelection.size() && size < bestSize);
    }
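
The steps above can be tried out on plain numbers. The following is a minimal sketch, not the HBase implementation: it replaces StoreFile objects with a list of file sizes, leaves out the mightBeStuck branch, and uses the hypothetical names ExploringSelectionSketch and select, which do not exist in HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified stand-in for the selection loop; it works on plain file sizes. */
class ExploringSelectionSketch {

    /** Step 6: every file must be <= ratio * (sum of the other files in the combination). */
    static boolean filesInRatio(List<Long> sizes, double ratio) {
        if (sizes.size() < 2) {
            return true;
        }
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        for (long s : sizes) {
            if (s > (total - s) * ratio) {
                return false;
            }
        }
        return true;
    }

    /** Steps 1 to 7 without the mightBeStuck branch. Returns the chosen contiguous window. */
    static List<Long> select(List<Long> candidates, int minFiles, int maxFiles,
                             long minCompactSize, long maxCompactSize, double ratio) {
        List<Long> best = new ArrayList<>();
        long bestSize = 0;
        for (int start = 0; start < candidates.size(); start++) {            // step 1
            for (int end = start + minFiles - 1; end < candidates.size(); end++) {
                List<Long> window = candidates.subList(start, end + 1);
                if (window.size() < minFiles || window.size() > maxFiles) {  // steps 2 and 3
                    continue;
                }
                long size = 0;
                for (long s : window) {
                    size += s;
                }
                if (size > maxCompactSize) {                                  // step 4
                    continue;
                }
                if (size >= minCompactSize && !filesInRatio(window, ratio)) { // steps 5 and 6
                    continue;
                }
                // Step 7: more files wins; on a tie, the smaller total size wins.
                if (window.size() > best.size()
                        || (window.size() == best.size() && size < bestSize)) {
                    best = window;
                    bestSize = size;
                }
            }
        }
        return new ArrayList<>(best);
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(1000L, 40L, 30L, 20L, 10L);
        // Every window containing the 1000-unit file fails the ratio check.
        System.out.println(select(sizes, 3, 10, 0L, Long.MAX_VALUE, 1.2)); // [40, 30, 20, 10]
    }
}
```

In this run the large file is left alone and the four small files are merged together, which is the behaviour steps 6 and 7 aim for.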

That covers the main algorithm. Some further details and optimizations follow:

The default ratio used in step 6 is 1.2, but a different value can apply when the off-peak optimization is enabled. The default off-peak ratio is 5.0, which allows more data to be merged while business load is low. This optimization can only be configured as a fixed time window within the day, so it is not yet very flexible.
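
To see what the two ratios mean in practice, here is a small sketch under stated assumptions: file sizes are plain numbers, and the class OffPeakRatioSketch and its helpers isOffPeak and currentRatio are hypothetical names used only for illustration. The window check assumes an hour range that may wrap past midnight.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of peak vs. off-peak ratios; not HBase code. */
class OffPeakRatioSketch {
    static final double PEAK_RATIO = 1.2;     // default ratio named in the text
    static final double OFF_PEAK_RATIO = 5.0; // default off-peak ratio named in the text

    /** Assumed helper: true if hour (0-23) lies in [startHour, endHour), wrapping midnight. */
    static boolean isOffPeak(int hour, int startHour, int endHour) {
        if (startHour <= endHour) {
            return startHour <= hour && hour < endHour;
        }
        return hour >= startHour || hour < endHour;
    }

    /** Mirrors the first statement of applyCompactionPolicy. */
    static double currentRatio(boolean mayUseOffPeak) {
        return mayUseOffPeak ? OFF_PEAK_RATIO : PEAK_RATIO;
    }

    /** The filesInRatio check from step 6, on plain sizes. */
    static boolean filesInRatio(List<Long> sizes, double ratio) {
        if (sizes.size() < 2) {
            return true;
        }
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        for (long s : sizes) {
            if (s > (total - s) * ratio) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(100L, 30L, 20L);
        boolean atNoon = isOffPeak(12, 22, 6); // false: outside the 22:00-06:00 window
        boolean atTwo = isOffPeak(2, 22, 6);   // true: inside the window
        // 100 > (30 + 20) * 1.2, so the set is rejected at the peak ratio.
        System.out.println(filesInRatio(sizes, currentRatio(atNoon))); // false
        // 100 <= (30 + 20) * 5.0, so the same set is accepted off-peak.
        System.out.println(filesInRatio(sizes, currentRatio(atTwo)));  // true
    }
}
```

The same three files are rejected during the day and accepted at night, which is how the larger ratio lets more data be merged in one compaction.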

The algorithm also contains logic around mightBeStuck. This parameter indicates whether compaction might get stuck. It is true when: (number of candidate files) - (number of files currently being compacted) + futureFiles (0 by default, 1 if any file is being compacted) >= hbase.hstore.blockingStoreFiles (default: 10; this setting is also used by flush, which a later analysis of flush will cover). If it is true:

    1. The file selection also tracks a minimal solution: the combination with the smallest total file size is recorded just before step 4 above.
    2. In isBetterSelection, the comparison changes to (bestSelection.size() / bestSize) * 1.05 < selection.size() / size, so a solution is chosen by the ratio of file count to total file size.
    3. When the result is returned, if no suitable optimal solution was found, the minimal solution is returned instead.

The mightBeStuck optimization ensures that when there are too many files, a minimal solution can still be chosen for compaction, instead of letting the number of files keep growing until a suitable combination appears.
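
The mightBeStuck condition and the changed comparison can be checked with a few numbers. This is a minimal sketch that follows the formulas quoted above; the class StuckSelectionSketch and its method signatures are hypothetical, and isBetterWhenStuck assumes both sizes are positive (the original code guards this with bestSize > 0 && size > 0).

```java
/** Hypothetical illustration of the mightBeStuck logic; not HBase code. */
class StuckSelectionSketch {

    /** candidates - compacting + futureFiles >= blockingStoreFiles, as described in the text. */
    static boolean mightBeStuck(int candidateFiles, int filesCompacting, int blockingStoreFiles) {
        int futureFiles = filesCompacting == 0 ? 0 : 1;
        return candidateFiles - filesCompacting + futureFiles >= blockingStoreFiles;
    }

    /** Stuck-mode comparison: files removed per unit of size must beat the best by more than 5%. */
    static boolean isBetterWhenStuck(int bestCount, long bestSize, int count, long size) {
        final double replaceIfBetterBy = 1.05;
        double thresholdQuality = ((double) bestCount / bestSize) * replaceIfBetterBy;
        return thresholdQuality < ((double) count / size);
    }

    public static void main(String[] args) {
        // 12 candidates, 3 being compacted: 12 - 3 + 1 = 10 >= 10.
        System.out.println(mightBeStuck(12, 3, 10)); // true
        // 8 candidates, none being compacted: 8 - 0 + 0 = 8 < 10.
        System.out.println(mightBeStuck(8, 0, 10));  // false

        // Current best: 3 files for 300 units, so the threshold is 0.01 * 1.05 = 0.0105.
        System.out.println(isBetterWhenStuck(3, 300, 4, 300)); // true: 4/300 is about 0.0133
        System.out.println(isBetterWhenStuck(3, 300, 4, 390)); // false: 4/390 is about 0.0103
    }
}
```

The last case shows the effect of the 1.05 factor: a candidate that is only slightly better than the current best does not replace it.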

How does this algorithm differ from RatioBasedCompactionPolicy? Put simply, RatioBasedCompactionPolicy traverses the StoreFile list from beginning to end and selects the first sequence it encounters that satisfies the ratio condition for compaction. ExploringCompactionPolicy instead keeps track of the current best candidate while looping from beginning to end, and finally selects the globally optimal list.
