Research on LevelDB Principles and Code Analysis (I)

1. Overview

LevelDB (http://code.google.com/p/leveldb/) is an open-source key/value storage system from Google. Its committer lineup is very strong, consisting largely of the original Bigtable team, including Jeff Dean. Its code is very well designed: it is a typical key/value engine built on an LSM tree, its data structures are essentially an open-source implementation of Bigtable's SSTable, it has been ported to a variety of platforms, and it is currently used in Chrome and other projects.


2. LSM Tree

LevelDB is a typical implementation of the log-structured merge (LSM) tree, which accelerates data writing and guarantees data safety through deferred writes and a write-ahead log. The records in each LevelDB data file (SSTable) are sorted by key, but keys arrive in random order at write time, which makes it expensive to insert each record directly at its sorted position on disk. LevelDB therefore defers writes: it accumulates a batch of data, keeps it sorted in memory, and writes it to disk in one pass. During this window, however, a power failure or other fault would lose the buffered data, so each update is first appended to a log file. This converts random writes into append-only writes, which perform far better on disk, and if the process crashes, the data already written to the log can be recovered after restart.


2.1 Write Batch
LevelDB supports only two kinds of update operations: 1. inserting a record, and 2. deleting a record. The code is as follows:

std::string key1, key2, value;
leveldb::Status s;
s = db->Put(leveldb::WriteOptions(), key1, value);
s = db->Delete(leveldb::WriteOptions(), key2);

It also supports writing data as a batch:

std::string key1, key2, value;
leveldb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);

In fact, inside LevelDB the individual-update calls and batch-update calls go through the same interface: an individual update is packaged into a batch containing a single record and then written to the database. A write batch is organized as follows:
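As a standalone sketch of the serialized layout (matching the format of LevelDB's WriteBatch internal representation in db/write_batch.cc: a 12-byte header of an 8-byte little-endian sequence number plus a 4-byte record count, followed by one type-tagged record per operation), the following builds a batch by hand. The helper names here are illustrative for this sketch, not LevelDB's own coding API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append v in varint32 encoding: 7 bits per byte, high bit means "more".
static void PutVarint32(std::string* dst, uint32_t v) {
  while (v >= 128) {
    dst->push_back(static_cast<char>(v | 128));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Append v as a little-endian fixed-width integer.
static void PutFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++)
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}
static void PutFixed32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; i++)
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
}

// Build the serialized form of a batch holding one Put and one Delete:
// 8-byte sequence + 4-byte count, then one tagged record per operation.
std::string BuildBatch(const std::string& put_key, const std::string& put_value,
                       const std::string& del_key) {
  const char kTypeDeletion = 0x0;
  const char kTypeValue = 0x1;
  std::string rep;
  PutFixed64(&rep, 0);  // sequence number; filled in by the DB at write time
  PutFixed32(&rep, 2);  // number of records in the batch
  rep.push_back(kTypeValue);  // Put record: type, key, value
  PutVarint32(&rep, static_cast<uint32_t>(put_key.size()));
  rep.append(put_key);
  PutVarint32(&rep, static_cast<uint32_t>(put_value.size()));
  rep.append(put_value);
  rep.push_back(kTypeDeletion);  // Delete record: type, key only
  PutVarint32(&rep, static_cast<uint32_t>(del_key.size()));
  rep.append(del_key);
  return rep;
}
```

This whole byte string is what gets appended to the log as one record, which is why a crash can never leave half a batch applied.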
2.2 Log Format
Each update operation is serialized into such a packet and appended to the log file as a record; it is also parsed into in-memory entries, each inserted at its sorted position in the memtable. LevelDB accesses log data through memory-mapped files: when the currently mapped region is full, the file is extended (the extension size doubles each time: 64KB, 128KB, ..., up to a maximum of 1MB) and the new region is mapped into memory. Log data is written in 32KB blocks; after enough data accumulates it is synced to disk (WriteOptions can also be set to sync after every log write, but that is inefficient). The code of the memory-mapped file is as follows:
class PosixMmapFile : public WritableFile {
 private:
  std::string filename_;  // file name
  int fd_;                // file descriptor
  size_t page_size_;      // system page size
  size_t map_size_;       // size of the memory-mapped region
  char* base_;            // start address of the mapped region
  char* limit_;           // end address of the mapped region
  char* dst_;             // end of the last occupied byte
  char* last_sync_;       // end of the range last synced to disk
  uint64_t file_offset_;  // current offset within the file
  bool pending_sync_;     // flag for a delayed sync

  // Round x up to a multiple of y
  static size_t Roundup(size_t x, size_t y) {
    return ((x + y - 1) / y) * y;
  }

  // Round s down to a page_size_ boundary
  size_t TruncateToPageBoundary(size_t s) {
    s -= (s & (page_size_ - 1));
    assert((s % page_size_) == 0);
    return s;
  }

  // Unmap the currently mapped region
  bool UnmapCurrentRegion() {
    bool result = true;
    if (base_ != NULL) {
      if (last_sync_ < limit_) {
        // The current page is not fully synced; record that this file
        // still needs a sync, so the next call to Sync() flushes the
        // unsynced data on this page to disk
        pending_sync_ = true;
      }
      if (munmap(base_, limit_ - base_) != 0) {
        result = false;
      }
      file_offset_ += limit_ - base_;
      base_ = NULL;
      limit_ = NULL;
      last_sync_ = NULL;
      dst_ = NULL;
      // Use a doubling strategy to grow the next mapped region, up to 1MB
      if (map_size_ < (1 << 20)) {
        map_size_ *= 2;
      }
    }
    return result;
  }

  bool MapNewRegion() {
    assert(base_ == NULL);  // the previous region must already be unmapped
    // Extend the file first
    if (ftruncate(fd_, file_offset_ + map_size_) < 0) {
      return false;
    }
    // Map the new region of the file into memory
    void* ptr = mmap(NULL, map_size_, PROT_READ | PROT_WRITE, MAP_SHARED,
                     fd_, file_offset_);
    if (ptr == MAP_FAILED) {
      return false;
    }
    base_ = reinterpret_cast<char*>(ptr);
    limit_ = base_ + map_size_;
    dst_ = base_;
    last_sync_ = base_;
    return true;
  }

 public:
  PosixMmapFile(const std::string& fname, int fd, size_t page_size)
      : filename_(fname),
        fd_(fd),
        page_size_(page_size),
        map_size_(Roundup(65536, page_size)),
        base_(NULL),
        limit_(NULL),
        dst_(NULL),
        last_sync_(NULL),
        file_offset_(0),
        pending_sync_(false) {
    assert((page_size & (page_size - 1)) == 0);
  }

  ~PosixMmapFile() {
    if (fd_ >= 0) {
      PosixMmapFile::Close();
    }
  }

  virtual Status Append(const Slice& data) {
    const char* src = data.data();
    size_t left = data.size();
    while (left > 0) {
      // Compute the remaining capacity of the current region; if it is
      // completely exhausted, unmap it and map a new region
      size_t avail = limit_ - dst_;
      if (avail == 0) {
        if (!UnmapCurrentRegion() || !MapNewRegion()) {
          return IOError(filename_, errno);
        }
        avail = limit_ - dst_;
      }
      // Fill the remaining capacity of the current region
      size_t n = (left <= avail) ? left : avail;
      memcpy(dst_, src, n);
      dst_ += n;
      src += n;
      left -= n;
    }
    return Status::OK();
  }

  virtual Status Close() {
    Status s;
    size_t unused = limit_ - dst_;
    if (!UnmapCurrentRegion()) {
      s = IOError(filename_, errno);
    } else if (unused > 0) {
      // Truncate the unused space off the end of the file
      if (ftruncate(fd_, file_offset_ - unused) < 0) {
        s = IOError(filename_, errno);
      }
    }
    if (close(fd_) < 0) {
      if (s.ok()) {
        s = IOError(filename_, errno);
      }
    }
    fd_ = -1;
    base_ = NULL;
    limit_ = NULL;
    return s;
  }

  virtual Status Sync() {
    Status s;
    if (pending_sync_) {
      // Data in a previously unmapped region was never synced; flush it first
      pending_sync_ = false;
      if (fdatasync(fd_) < 0) {
        s = IOError(filename_, errno);
      }
    }
    if (dst_ > last_sync_) {
      // Compute the start and end addresses of the unsynced data. The start
      // is rounded down to a page_size_ boundary and the end rounded up, so
      // each sync covers one or more whole pages
      size_t p1 = TruncateToPageBoundary(last_sync_ - base_);
      // Subtract 1 because one page_size_ is added below; this handles the
      // case where dst_ falls exactly on a page boundary
      size_t p2 = TruncateToPageBoundary(dst_ - base_ - 1);
      last_sync_ = dst_;
      if (msync(base_ + p1, p2 - p1 + page_size_, MS_SYNC) < 0) {
        s = IOError(filename_, errno);
      }
    }
    return s;
  }
};
However, when a batch is organized as above and written to the log as a single record, it may need to span two or more 32KB blocks. To manage the log better and keep the data safe, LevelDB splits log records at a finer granularity: if the data of one batch must span blocks, it is cut into multiple entries, each written to a different block; an entry never crosses a block boundary, and the original batch can be recovered by decoding its entries in sequence. The resulting LevelDB log file is organized into the following form:
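Before reading the writer code, the fragmentation rule can be sketched in isolation. The function below is an illustration, not LevelDB code, though the kBlockSize and kHeaderSize constants match db/log_format.h; it computes the payload sizes of the fragments a record is split into, mirroring the loop in Writer::AddRecord without doing any I/O:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Constants from LevelDB's log format (db/log_format.h).
static const size_t kBlockSize = 32768;
static const size_t kHeaderSize = 7;  // 4-byte CRC + 2-byte length + 1-byte type

// Return the payload sizes of the fragments that a record of n bytes is
// split into when writing starts at block_offset within a 32KB block.
std::vector<size_t> Fragments(size_t n, size_t block_offset) {
  std::vector<size_t> out;
  size_t left = n;
  do {
    size_t leftover = kBlockSize - block_offset;
    if (leftover < kHeaderSize) {
      // Not enough room left for even a header: the writer pads the rest
      // of the block with zeros and switches to a new block.
      block_offset = 0;
    }
    size_t avail = kBlockSize - block_offset - kHeaderSize;
    size_t frag = (left < avail) ? left : avail;
    out.push_back(frag);  // one physical entry of this payload size
    left -= frag;
    block_offset += kHeaderSize + frag;
  } while (left > 0);
  return out;
}
```

For example, a 40000-byte batch written at the start of a block becomes two entries: a first fragment of 32761 bytes (kFirstType) and a last fragment of 7239 bytes (kLastType).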

Here we can look at the log writer code (log_writer.cc):
Status Writer::AddRecord(const Slice& slice) {
  const char* ptr = slice.data();
  size_t left = slice.size();
  Status s;
  bool begin = true;
  do {
    const int leftover = kBlockSize - block_offset_;
    assert(leftover >= 0);
    if (leftover < kHeaderSize) {
      // If the remaining space of the current block is smaller than the
      // 7-byte header and greater than 0, pad it with zeros and switch
      // to a new block
      if (leftover > 0) {
        assert(kHeaderSize == 7);
        dest_->Append(Slice("\x00\x00\x00\x00\x00\x00", leftover));
      }
      block_offset_ = 0;
    }

    // Compute whether this block can hold the whole record; if not, cut
    // the record into multiple entries written to different blocks, with
    // the type marking whether an entry is the beginning, a middle part,
    // or the end of the record
    const size_t avail = kBlockSize - block_offset_ - kHeaderSize;
    const size_t fragment_length = (left < avail) ? left : avail;

    RecordType type;
    const bool end = (left == fragment_length);
    if (begin && end) {
      type = kFullType;    // this entry holds the complete batch
    } else if (begin) {
      type = kFirstType;   // this entry holds only the beginning
    } else if (end) {
      type = kLastType;    // this entry holds only the end
    } else {
      type = kMiddleType;  // this entry holds a middle part, containing
                           // neither beginning nor end; a record may need
                           // several middle entries
    }

    s = EmitPhysicalRecord(type, ptr, fragment_length);
    ptr += fragment_length;
    left -= fragment_length;
    begin = false;
  } while (s.ok() && left > 0);
  return s;
}

Status Writer::EmitPhysicalRecord(RecordType t, const char* ptr, size_t n) {
  assert(n <= 0xffff);  // the length must fit in two bytes
  assert(block_offset_ + kHeaderSize + n <= kBlockSize);

  // Fill in the record header
  char buf[kHeaderSize];
  buf[4] = static_cast<char>(n & 0xff);
  buf[5] = static_cast<char>(n >> 8);
  buf[6] = static_cast<char>(t);

  // Compute the CRC
  uint32_t crc = crc32c::Extend(type_crc_[t], ptr, n);
  crc = crc32c::Mask(crc);
  EncodeFixed32(buf, crc);

  // Write the header and the entry payload
  Status s = dest_->Append(Slice(buf, kHeaderSize));
  if (s.ok()) {
    s = dest_->Append(Slice(ptr, n));
    if (s.ok()) {
      s = dest_->Flush();
    }
  }
  block_offset_ += kHeaderSize + n;
  return s;
}

2.3 Write Log ahead
When LevelDB performs an update it writes the log first and then updates the memtable. Each memtable has a maximum capacity; when the threshold is exceeded, a double-buffering scheme takes over: the current log file is closed and the current memtable is switched to be the immutable memtable, then a new log file and a new memtable are created, new data is written into them, and the background thread is notified to process the immutable memtable, dumping it to disk in time or starting the compaction process. The analysis of the write code is as follows:
Status DBImpl::Write(const WriteOptions& options, WriteBatch* updates) {
  Status status;
  // Acquire the mutex; only one thread at a time may update the data
  MutexLock l(&mutex_);
  LoggerId self;
  // Acquire the right to use the logger; if another thread owns it, wait
  // here until that thread releases ownership
  AcquireLoggingResponsibility(&self);
  status = MakeRoomForWrite(false);  // may temporarily release the lock and wait
  uint64_t last_sequence = versions_->LastSequence();  // current sequence number
  if (status.ok()) {
    // Use the current sequence number plus 1 as the version of this update.
    // A batch update may contain several operations; giving them all one
    // version has a benefit: all operations of this update are either
    // visible or invisible together, never partly visible
    WriteBatchInternal::SetSequence(updates, last_sequence + 1);
    // The update may contain several operations, so skip as many sequence
    // numbers as there are operations, guaranteeing they are not reused
    last_sequence += WriteBatchInternal::Count(updates);

    // Write the batch to the log, then apply it to the memtable
    {
      assert(logger_ == &self);
      mutex_.Unlock();
      // The mutex can be released here because this thread acquired logger
      // ownership in AcquireLoggingResponsibility(); other threads may take
      // the lock, but since &self != logger_ they block inside
      // AcquireLoggingResponsibility(). Writing the update to the log file
      // (and, when sync is set, syncing every write to disk) may take a
      // long time, and mutex_, which also guards other shared state,
      // should not be held for that long
      status = log_->AddRecord(WriteBatchInternal::Contents(updates));
      if (status.ok() && options.sync) {
        status = logfile_->Sync();
      }
      if (status.ok()) {
        // The log was written successfully; apply the batch to the memtable
        status = WriteBatchInternal::InsertInto(updates, mem_);
      }
      // Re-acquire mutex_
      mutex_.Lock();
      assert(logger_ == &self);
    }

    // Update the sequence number
    versions_->SetLastSequence(last_sequence);
  }
  // Release ownership of the logger and notify waiting threads; the
  // MutexLock destructor then releases the mutex
  ReleaseLoggingResponsibility(&self);
  return status;
}

// The force parameter forces the switch to a new memtable
Status DBImpl::MakeRoomForWrite(bool force) {
  mutex_.AssertHeld();
  assert(logger_ != NULL);
  bool allow_delay = !force;
  Status s;
  while (true) {
    if (!bg_error_.ok()) {
      // A background thread hit a problem; return the error and accept no
      // more updates
      s = bg_error_;
      break;
    } else if (allow_delay &&
               versions_->NumLevelFiles(0) >=
                   config::kL0_SlowdownWritesTrigger) {
      // If this is not a forced switch and level 0 holds 8 or more
      // sstables, block this update for 1 millisecond. LevelDB divides
      // sstables into multiple levels, and the keys of different tables
      // in level 0 may overlap; too many level-0 sstables degrade query
      // performance, so the update rate must be reduced to let the
      // background thread run compactions. The designers did not want a
      // single write to wait several seconds, so instead each individual
      // update is delayed by 1 millisecond, balancing the read and write
      // rates; in theory this also gives the compaction thread more CPU
      // time (which matters when compaction shares a CPU with updates)
      mutex_.Unlock();
      env_->SleepForMicroseconds(1000);
      allow_delay = false;  // delay at most once; do not delay next time
      mutex_.Lock();
    } else if (!force &&
               (mem_->ApproximateMemoryUsage() <= options_.write_buffer_size)) {
      // The space used by the current memtable is below write_buffer_size,
      // so break out and apply the update to it. When force is true, the
      // first pass falls through to the else branch below, switches the
      // memtable and sets force to false, so the second pass can break
      // out here
      break;
    } else if (imm_ != NULL) {
      // The current memtable has exceeded write_buffer_size and the
      // standby memtable is still in use; block the update and wait
      bg_cv_.Wait();
    } else if (versions_->NumLevelFiles(0) >= config::kL0_StopWritesTrigger) {
      // The current memtable is full and the standby memtable is free,
      // but level 0 holds too many sstables (12 or more); block the
      // update and wait
      Log(options_.info_log, "waiting...\n");
      bg_cv_.Wait();
    } else {
      // Otherwise create a new log file with a new number, switch the
      // current memtable to be the standby memtable, create a new
      // memtable, direct new writes into the new log file and memtable
      // (i.e. switch both the log file and the memtable), and tell the
      // background thread that it may start a compaction
      assert(versions_->PrevLogNumber() == 0);
      uint64_t new_log_number = versions_->NewFileNumber();
      WritableFile* lfile = NULL;
      s = env_->NewWritableFile(LogFileName(dbname_, new_log_number), &lfile);
      if (!s.ok()) {
        break;
      }
      delete log_;
      delete logfile_;
      logfile_ = lfile;
      logfile_number_ = new_log_number;
      log_ = new log::Writer(lfile);
      imm_ = mem_;
      has_imm_.Release_Store(imm_);
      mem_ = new MemTable(internal_comparator_);
      mem_->Ref();
      force = false;  // the next pass need not create a new memtable
      MaybeScheduleCompaction();
    }
  }
  return s;
}

void DBImpl::AcquireLoggingResponsibility(LoggerId* self) {
  while (logger_ != NULL) {
    logger_cv_.Wait();
  }
  logger_ = self;
}

void DBImpl::ReleaseLoggingResponsibility(LoggerId* self) {
  assert(logger_ == self);
  logger_ = NULL;
  logger_cv_.SignalAll();
}

2.4 Skip List
Internally, LevelDB organizes the memtable as a skip list. For each inserted record, a series of key comparisons along the skip list locates the position where the record should be inserted, and the number of index levels to build for the node is then decided probabilistically. The skip list structure is as follows:


LevelDB's skip list has at most 12 levels. The bottom level (level 0) is a full chain: every record is inserted at its position in the level-0 chain. From level 1 to level 11, whether a node also gets an index entry is decided probabilistically, with a branching factor of 1/4. An example illustrates the process:
1. Looking at the figure, assume Record3 does not yet exist in the chain: at level 0 the record after Record2 is Record4, and at level 1 the record after Record2 is Record5.
2. Now we insert a record, Record3; by comparing keys we locate it between Record2 and Record4.
3. We then use the following code to decide how many index levels the record gets in the skip list:
template<typename Key, class Comparator>
int SkipList<Key,Comparator>::RandomHeight() {
  // Increase height with probability 1 in kBranching
  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
    height++;
  }
  return height;
}

From the code above, the probability of building an x-level index is 0.25^(x-1) * 0.75: a node gets exactly 1 level with probability 75%, exactly 2 levels with probability 25% * 75% = 18.75%, and so on. (Personally I think Google's branching factor of 4 is a bit high; with it, most nodes end up with a height below 3.)
4. Insert Record3 at the appropriate position of the linked lists in level 0 through level x-1. Suppose the formula above gives Record3 a height of x = 2; then Record3 must be inserted into the chains of level 0 and level 1: in the level-0 chain it goes between Record2 and Record4, and in the level-1 chain between Record2 and Record5, forming the index structure shown. When querying a record, the search proceeds from the highest index level downward, saving comparisons.
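The height distribution is easy to check empirically. The sketch below reimplements the same policy with a stand-in random generator (the xorshift generator and the free-function form are assumptions of this sketch; LevelDB uses its own util::Random class):

```cpp
#include <cassert>
#include <cstdint>

static const int kMaxHeight = 12;

// A small xorshift32 generator stands in for leveldb's util::Random here
// (an assumption of this sketch; any uniform generator works).
struct Rng {
  uint32_t state;  // must be nonzero
  uint32_t Next() {
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
  }
};

// Same policy as SkipList::RandomHeight(): keep growing the height with
// probability 1 in kBranching (= 1/4), capped at kMaxHeight.
int RandomHeight(Rng* rnd) {
  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxHeight && (rnd->Next() % kBranching) == 0) {
    height++;
  }
  return height;
}
```

Sampling this function many times shows roughly 75% of nodes at height 1, matching the 0.25^(x-1) * 0.75 formula above.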

2.5 record Format

LevelDB encodes each user update or delete operation into a record of the following format:



As the diagram shows, each record extends the original user key with a version number (sequence) and a key type (update or delete) to form the internal key. When inserting into the skip list, records are sorted by internal key, not by user key. This way, nodes are only ever added to the skip list; no node ever needs to be deleted or replaced in place.

Internal keys are compared with the following algorithm:

int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const {
  int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
  if (r == 0) {
    // Compare the integers built from the trailing 8 bytes; the type
    // occupies the least significant byte
    const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
    const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
    if (anum > bnum) {
      // NOTE: the larger integer yields the smaller key
      r = -1;
    } else if (anum < bnum) {
      r = +1;
    }
  }
  return r;
}

From the algorithm above we can derive the internal key ordering:

1. If the user keys are not equal, the record with the smaller user key also has the smaller internal key. User keys are compared in lexicographic order by default; a custom comparator can be supplied in the table options.
2. If the user keys are equal, compare the sequence numbers: the record with the larger sequence number has the smaller internal key.
3. If the sequence numbers are also equal, compare the types: an update record (key type = 1) has a smaller internal key than a deletion record (key type = 0).
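This ordering can be illustrated with a self-contained sketch of the internal key encoding. The helper names below are illustrative for this sketch; LevelDB builds the same 8-byte trailer with PackSequenceAndType in db/dbformat.h:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Key types, matching the values used in the text.
static const uint8_t kTypeDeletion = 0x0;
static const uint8_t kTypeValue = 0x1;

// Build an internal key: the user key followed by an 8-byte little-endian
// trailer packing (sequence << 8) | type, so the type occupies the least
// significant byte.
std::string MakeInternalKey(const std::string& user_key,
                            uint64_t sequence, uint8_t type) {
  std::string result = user_key;
  uint64_t packed = (sequence << 8) | type;
  for (int i = 0; i < 8; i++) {
    result.push_back(static_cast<char>((packed >> (8 * i)) & 0xff));
  }
  return result;
}

// Decode the trailer back into an integer, as the comparator does with
// DecodeFixed64 on the last 8 bytes.
uint64_t DecodeTrailer(const std::string& ikey) {
  uint64_t v = 0;
  for (int i = 0; i < 8; i++) {
    v |= static_cast<uint64_t>(
             static_cast<unsigned char>(ikey[ikey.size() - 8 + i])) << (8 * i);
  }
  return v;
}
```

For the same user key, a later delete (larger sequence) produces a larger trailer integer, so per the comparator above it sorts as the smaller internal key and is encountered first during a lookup.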

When inserting into the skip list, internal keys are in general never equal (unless the same record is operated on twice within one batch; a bug surfaces here: within one write batch, insert a record, then delete that same record, then write the batch to the DB, and you will find the record still exists in the DB. Operating on the same key several times within one batch is therefore not recommended). Among records with the same user key, the one with the larger sequence number sits earlier in the skip list.
The internal key design serves the following purposes:
1. LevelDB supports snapshot queries: a query can specify a snapshot's version number and read the value the user key had when the snapshot was created. An internal key is formed with sequence = the snapshot's version number, type = 1, and the user's key, and the data files and memory are searched for the first record that is greater than or equal to this internal key and whose user key matches (that is, the first record whose sequence number is less than or equal to the snapshot's version number).
2. When querying the latest record, sequence num is set to 0xFFFFFFFFFFFFFF. Because the latest records are queried far more often, letting records with larger sequence numbers sort first means the search can return at the first match, reducing further traversal.





