DIY Database (ix)--DIYDB data persistence and storage format

Last Update:2016-07-10 Source: Internet

Author: User



I. Data persistence

DIYDB is a document database that stores its data on disk rather than purely in memory, so it has to persist data and read and write it on disk. How can disk reads and writes be made efficient? A common approach on Linux is mmap, the memory-mapping mechanism.

Why is mmap efficient? When a process reads a file in the usual way, the kernel first copies the relevant blocks from disk into a buffer in kernel space, and then copies the requested data from kernel space into user space. The data is therefore copied twice, and the intermediate copy is unnecessary overhead for the application. mmap removes that intermediate copy: it maps the file on disk directly into the virtual address space of the process, which lies in user space. When we access data mapped with mmap, the corresponding blocks are loaded from disk and become visible at the mapped addresses, without a separate copy from a kernel buffer into a user buffer.

[Figure: process address space after the mmap mapping]
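As a rough illustration of reading a file through a mapping instead of read(), here is a minimal sketch using the POSIX mmap call. The helper name read_prefix_via_mmap is hypothetical and is not part of DIYDB, and error handling is reduced to returning an empty string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Returns the first n bytes of a file, read through a read-only mapping.
// Returns an empty string on any failure. Illustrative helper only.
std::string read_prefix_via_mmap(const char *path, std::size_t n)
{
   int fd = ::open(path, O_RDONLY);
   if (fd < 0) return std::string();

   struct stat st;
   if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
      ::close(fd);
      return std::string();
   }
   std::size_t file_size = static_cast<std::size_t>(st.st_size);
   if (n > file_size) n = file_size;

   // Map the whole file; pages are loaded lazily on first access.
   void *addr = ::mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
   ::close(fd); // the mapping stays valid after the descriptor is closed
   if (addr == MAP_FAILED) return std::string();

   // Plain memory access: no read() call is involved here.
   std::string out(static_cast<const char *>(addr), n);
   ::munmap(addr, file_size);
   return out;
}
```

The only copy the caller makes is the one into the returned std::string; a database would normally work on the mapped bytes in place.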

Next, let's look at the class that manages memory mappings in DIYDB through its source code.

#ifndef OSSMMAPFILE_HPP_
#define OSSMMAPFILE_HPP_

#include "core.hpp"
#include "osslatch.hpp"
#include "ossprimitivefileop.hpp"

class _ossmmapfile
{
protected:
   // One mapped segment of the file. The file is mapped piece by piece
   // because a single contiguous region of virtual memory that is large
   // enough for the whole file may not be available.
   class _ossmmapsegment
   {
   public:
      void *_ptr;                 // address of the mapping in memory
      unsigned int _length;       // length of the segment
      unsigned long long _offset; // offset of the segment
      _ossmmapsegment(void *ptr, unsigned int length,
                      unsigned long long offset)
      {
         _ptr = ptr;
         _length = length;
         _offset = offset;
      }
   };
   typedef _ossmmapsegment ossmmapsegment;

   ossprimitivefileop _fileop;            // the file
   ossxlatch _mutex;                      // mutex: only one thread operates on the segments at a time
   bool _opened;                          // whether the file is already open
   std::vector<ossmmapsegment> _segments; // the segments this file is mapped into
   char _filename[OSS_MAX_PATHSIZE];      // file name

public:
   typedef std::vector<ossmmapsegment>::const_iterator const_itr; // iterator

   inline const_itr begin() { return _segments.begin(); }
   inline const_itr end() { return _segments.end(); }
   inline unsigned int segmentsize() { return _segments.size(); }

public:
   _ossmmapfile()
   {
      _opened = false;
      memset(_filename, 0, sizeof(_filename));
   }
   ~_ossmmapfile()
   {
      // Reclaims all mapped memory: walks _segments and unmaps each segment.
      close();
   }
   int open(const char *pfilename, unsigned int options);
   void close();
   int map(unsigned long long offset, unsigned int length, void **paddress);
};
typedef class _ossmmapfile ossmmapfile;

#endif

As the code shows, the class _ossmmapfile manages the memory mappings of one file. Database files can be very large, and mapping a whole file into one contiguous range of virtual addresses may fail, so _ossmmapfile maps the file as multiple memory segments, each represented by an _ossmmapsegment object. All mapped segments are kept in one collection (std::vector<ossmmapsegment> _segments). The code above is ossmmapfile.hpp; ossmmapfile.cpp is not discussed in detail here, and the annotated code will be posted at the end.
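The idea of keeping one bookkeeping entry per mapped segment, and unmapping all of them on destruction, can be sketched as follows. This is not the DIYDB implementation: the names mapped_segment and segmented_file are hypothetical, the class takes an already open file descriptor, and it assumes every offset passed to map() is a multiple of the system page size, as mmap requires.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for the per-segment bookkeeping.
struct mapped_segment {
   void *ptr;          // address of the mapping
   std::size_t length; // length of the mapping
   off_t offset;       // offset of the segment in the file
};

class segmented_file {
public:
   explicit segmented_file(int fd) : _fd(fd) {}
   ~segmented_file()
   {
      // Unmap every segment that was mapped through this object.
      for (std::size_t i = 0; i < _segments.size(); ++i)
         ::munmap(_segments[i].ptr, _segments[i].length);
   }
   // Maps [offset, offset + length) as a separate segment.
   // Returns NULL if the mapping fails.
   void *map(off_t offset, std::size_t length)
   {
      void *p = ::mmap(NULL, length, PROT_READ | PROT_WRITE,
                       MAP_SHARED, _fd, offset);
      if (p == MAP_FAILED) return NULL;
      mapped_segment s = { p, length, offset };
      _segments.push_back(s);
      return p;
   }
   std::size_t segment_count() const { return _segments.size(); }

private:
   segmented_file(const segmented_file &);            // non-copyable
   segmented_file &operator=(const segmented_file &);
   int _fd;
   std::vector<mapped_segment> _segments;
};
```

Because each call to map() asks for only one segment's worth of address space, the segments do not have to be adjacent in virtual memory.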

II. Data storage

The previous section explained that DIYDB uses memory mapping to read and write its disk file efficiently when persisting data. In what format is the data stored on disk, and how do we operate on that formatted data?

1. Overview of database file structure

Header: holds the metadata of the database file. It includes a string identifier (a magic number indicating that this is a DIYDB database file), the number of data pages, the state of the database, and version information.

Data pages: the database file is divided into data pages of equal size, and free space is managed inside each page. A record cannot span pages, and each page is 4 MB, so a single record in DIYDB cannot exceed 4 MB (more precisely, 4 MB minus the page metadata).

Data segments: a data segment consists of multiple consecutive data pages and represents a contiguous part of the database file that is mapped into virtual memory with mmap. The data segment is therefore a unit that exists only in memory; the database file itself contains no segment structure.

Note: the DIYDB database file is simplified: it contains only one database, and indexes are not persisted.

2. Data page structure

Length: the length of the data page, stored so that the page length can be extended later.

Flag: identifies the state of the data page, for example whether it is available.

Number of slots: the number of slots contained in the data page.

Offset of the last slot: the body of a data page starts with the slot area and ends with the data blocks, with the free space in between. This field is the offset of the last slot in the data page.

Free space size: the amount of unused space in the data page, that is, the portion between the slot area and the data region.

Free space offset: data blocks are allocated from the end of the page toward the front, so they occupy the tail of the data page. This field is the offset at which that tail area starts.

Note: the data page size is fixed at 4 MB and each slot is 4 bytes.

3. Structure of data record

Data record length: The overall length of a data record.

Data record flag: indicates whether the data record is available (that is, whether it has been deleted).

Data: the actual data (here, a BSON object).

4. External operation

(1) Data insertion (insert)

(2) Data removal (remove)

(3) Data lookup (find)

(4) initialization (initialize)

5. Internal operation

(1) Add data segment (_extendsegment)

1. Extend the file.
2. Map the extended part of the file into memory.

(2) Initializing empty files (_initnew)

When no data file exists, create a new database file, extend it, fill in the database file header, and then map the file into memory.

(3) Extend the file (_extendfile)

Extends the file on disk by 128 MB, the length of one segment.

(4) Load data (_loaddata)

When the database starts and a database file already exists, the file has to be loaded:
1. Load the header of the database file.
2. Map each segment of the database file into memory.
3. Compute the free space of each data page and store the results in a std::multimap (the _freespacemap member). This map is the free-space management container.

(5) Search slots (_searchslot)

Given a data page and a record ID (RID), this function reads the slot of that record. The slot holds the offset of the record within the page.
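A slot lookup of this kind might look like the sketch below. It assumes the slot list starts immediately after the page header, which is what the slot arithmetic in the insert code suggests, and that a deleted record is marked by the value 0xFFFFFFFF in its slot. The names page_header, SLOT_EMPTY and search_slot are illustrative stand-ins, not the DIYDB declarations.

```cpp
#include <cassert>
#include <cstring>

typedef unsigned int SLOTID;
typedef unsigned int SLOTOFF;
const SLOTOFF SLOT_EMPTY = 0xFFFFFFFF; // slot value of a deleted record

// Fixed part of the page header, same fields as described in the text.
struct page_header {
   char eyecatcher[4];
   unsigned int size;
   unsigned int flag;
   unsigned int numslots;
   unsigned int slotoffset;
   unsigned int freespace;
   unsigned int freeoffset;
};

// Looks up slot number slotid in the page. On success stores the record
// offset in slot and returns true; returns false if the slot number is
// out of range or the record was deleted.
bool search_slot(const char *page, SLOTID slotid, SLOTOFF &slot)
{
   page_header header;
   std::memcpy(&header, page, sizeof(header));
   if (slotid >= header.numslots) return false;

   // Slots are fixed-size entries stored right after the header.
   std::memcpy(&slot,
               page + sizeof(page_header) + slotid * sizeof(SLOTOFF),
               sizeof(SLOTOFF));
   return slot != SLOT_EMPTY;
}
```

Because every slot has the same size, the lookup is a single address computation, independent of how long the records are.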

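(6) Reclaim space (_recoverspace)

That is, reorganization within a page: the remaining records are compacted so that the free space in the page becomes contiguous again.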
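One way to do such an in-page reorganization is sketched below. It rests on assumptions that the article does not spell out: records are laid out from the end of the page in slot order, each record starts with a header whose size field covers header plus payload, and a deleted record is marked by 0xFFFFFFFF in its slot. All names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

typedef unsigned int SLOTOFF;
const SLOTOFF SLOT_EMPTY = 0xFFFFFFFF;

struct page_header {
   char eyecatcher[4];
   unsigned int size;       // total page length in bytes
   unsigned int flag;
   unsigned int numslots;
   unsigned int slotoffset;
   unsigned int freespace;
   unsigned int freeoffset; // start of the data area at the page tail
};
struct record_header {
   unsigned int size;       // header + payload
   unsigned int flag;
};

// Slides live records toward the end of the page so that the holes left
// by deleted records are merged into the free area. The page buffer must
// be suitably aligned for unsigned int access.
void recover_space(char *page)
{
   page_header *header = reinterpret_cast<page_header *>(page);
   SLOTOFF *slots = reinterpret_cast<SLOTOFF *>(page + sizeof(page_header));
   char *right = page + header->size; // records are packed below this point
   bool hole_seen = false;

   for (unsigned int i = 0; i < header->numslots; ++i) {
      if (slots[i] == SLOT_EMPTY) {
         hole_seen = true;            // everything after this must move
         continue;
      }
      record_header rec;
      std::memcpy(&rec, page + slots[i], sizeof(rec));
      right -= rec.size;
      if (hole_seen) {
         // Regions may overlap, so memmove is required.
         std::memmove(right, page + slots[i], rec.size);
         slots[i] = static_cast<SLOTOFF>(right - page);
      }
   }
   header->freeoffset = static_cast<unsigned int>(right - page);
}
```

The total free space of the page does not change here; only its layout does, so the freespace field is left untouched.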

(7) Update free space (_updatefreespace)

When data is inserted into a page, the page's free space shrinks, so the free-space management container has to be updated.

(8) Find data page (_findpage)

Given the length of a record, this function finds a page with enough free space.
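Both the page search and the free-space update can be expressed against a std::multimap keyed by free space, which is the container type the DIYDB class declares. The sketch below is one plausible way to use it, not the original code: free_space_index and its methods are hypothetical names, and INVALID_PAGEID is an assumed sentinel value.

```cpp
#include <cassert>
#include <map>
#include <utility>

typedef unsigned int PAGEID;
const PAGEID INVALID_PAGEID = 0xFFFFFFFF; // assumed sentinel value

class free_space_index {
public:
   void add_page(PAGEID id, unsigned int freespace)
   {
      _map.insert(std::make_pair(freespace, id));
   }

   // Returns a page whose free space is at least required bytes,
   // or INVALID_PAGEID if no such page exists.
   PAGEID find_page(unsigned int required) const
   {
      std::multimap<unsigned int, PAGEID>::const_iterator it =
         _map.lower_bound(required);
      return it == _map.end() ? INVALID_PAGEID : it->second;
   }

   // Re-keys page id from oldfree to oldfree + change.
   // Returns false if the (oldfree, id) entry is not present.
   bool update(PAGEID id, unsigned int oldfree, int change)
   {
      long long updated = static_cast<long long>(oldfree) + change;
      assert(updated >= 0);
      typedef std::multimap<unsigned int, PAGEID>::iterator iter;
      std::pair<iter, iter> range = _map.equal_range(oldfree);
      for (iter it = range.first; it != range.second; ++it) {
         if (it->second == id) {
            _map.erase(it);
            _map.insert(std::make_pair(static_cast<unsigned int>(updated), id));
            return true;
         }
      }
      return false;
   }

private:
   std::multimap<unsigned int, PAGEID> _map; // free bytes -> page
};
```

A multimap is needed rather than a map because many pages can have exactly the same amount of free space. Using lower_bound picks the page with the smallest sufficient free space, which is a design choice of this sketch.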

6. With the above introduction, let's look at the code implementation

(1) The ID of each data record consists of a page ID and a slot ID. To find a record, we first find the page it is on, then its slot, and then use the slot to find the record itself.

typedef unsigned int PAGEID; // page number
typedef unsigned int SLOTID; // slot number

// A record ID consists of a page ID and a slot ID.
struct dmsrecordid
{
   PAGEID _pageid;
   SLOTID _slotid;
};

(2) Structure of each record

// data record
struct dmsrecord
{
   unsigned int _size;
   unsigned int _flag;
   char         _data[0];
};
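To show how a record with this shape sits in memory, here is a small sketch that writes a header followed by the payload and reads it back. It avoids the zero-length array (a compiler extension) by copying the header separately. The flag values and the function names are assumptions made for the example; the article does not list them.

```cpp
#include <cassert>
#include <cstring>
#include <string>

struct record_header {
   unsigned int size; // header + payload, as computed in the insert code
   unsigned int flag;
};
const unsigned int RECORD_FLAG_NORMAL  = 0; // assumed value
const unsigned int RECORD_FLAG_DROPPED = 1; // assumed value

// Writes header + payload at dest and returns the total record size.
unsigned int write_record(char *dest, const char *payload,
                          unsigned int payload_size)
{
   record_header h;
   h.size = static_cast<unsigned int>(payload_size + sizeof(record_header));
   h.flag = RECORD_FLAG_NORMAL;
   std::memcpy(dest, &h, sizeof(h));
   std::memcpy(dest + sizeof(h), payload, payload_size);
   return h.size;
}

// Copies the payload of the record at src into payload.
// Returns false if the record is marked as deleted or is malformed.
bool read_record(const char *src, std::string &payload)
{
   record_header h;
   std::memcpy(&h, src, sizeof(h));
   if (h.flag != RECORD_FLAG_NORMAL || h.size < sizeof(h)) return false;
   payload.assign(src + sizeof(h), h.size - sizeof(h));
   return true;
}
```

Storing the total length in the header is what lets variable-length records be skipped or moved without parsing their payload.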

(3) header of the database file

// database file header
struct dmsheader
{
   char         _eyecatcher[DMS_HEADER_EYECATCHER_LEN]; // magic number of the database file
   unsigned int _size;
   unsigned int _flag;
   unsigned int _version;
};

(4) Structure of the data page

/*********************************************************
PAGE STRUCTURE
-------------------------
| PAGE HEADER           |
-------------------------
| Slot List             |
-------------------------
| Free Space            |
-------------------------
| Data                  |
-------------------------
**********************************************************/
#define DMS_PAGE_EYECATCHER     "PAGH" // magic number of a data page
#define DMS_PAGE_EYECATCHER_LEN 4
#define DMS_PAGE_FLAG_NORMAL    0
#define DMS_PAGE_FLAG_UNALLOC   1
// When the record of a slot is deleted, the slot is set to -1.
#define DMS_SLOT_EMPTY          0xFFFFFFFF

struct dmspageheader
{
   char         _eyecatcher[DMS_PAGE_EYECATCHER_LEN];
   unsigned int _size;
   unsigned int _flag;
   unsigned int _numslots;
   unsigned int _slotoffset;
   unsigned int _freespace;
   unsigned int _freeoffset;
   char         _data[0];
};
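How the header fields relate to each other is easiest to see on an empty page. The sketch below initializes one under the assumption that a fresh page has no slots, that its slot list begins right after the header, and that its data area begins at the very end of the page. The article does not show this initialization, so the function names and the initial values are a guess at a consistent starting state.

```cpp
#include <cassert>
#include <cstring>

struct page_header {
   char eyecatcher[4];
   unsigned int size;
   unsigned int flag;
   unsigned int numslots;
   unsigned int slotoffset; // end of the slot list
   unsigned int freespace;  // total unused bytes
   unsigned int freeoffset; // start of the data area
};

// Writes the header of an empty page of pagesize bytes.
// eyecatcher must point to at least 4 bytes.
void init_page(char *page, unsigned int pagesize, const char *eyecatcher)
{
   assert(pagesize >= sizeof(page_header));
   page_header h;
   std::memcpy(h.eyecatcher, eyecatcher, 4);
   h.size = pagesize;
   h.flag = 0;                                 // normal page
   h.numslots = 0;
   h.slotoffset = sizeof(page_header);         // no slots yet
   h.freespace = static_cast<unsigned int>(pagesize - sizeof(page_header));
   h.freeoffset = pagesize;                    // no records yet
   std::memcpy(page, &h, sizeof(h));
}

// Size of the contiguous gap between the slot list and the data area.
unsigned int contiguous_free(const page_header &h)
{
   return h.freeoffset - h.slotoffset;
}
```

On an empty page the contiguous gap equals the total free space. After deletions the two can differ, which is the situation the in-page reorganization deals with.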

(5) The size of each unit in the database file

// A Linux data block is 4096 bytes; a DIYDB page is set to 4 MB.
#define DMS_PAGESIZE          4194304
// A database file has at most 256K data pages, so it is at most 1 TB.
#define DMS_MAX_PAGES         262144
// Length of one segment: 128 MB.
#define DMS_FILE_SEGMENT_SIZE 134217728
// Length of the database file header: 64 KB.
#define DMS_FILE_HEADER_SIZE  65536
// Size of one disk extension step. The original comment calls this the
// length of a segment, but the value is 64 KB, not 128 MB.
#define DMS_EXTEND_SIZE       65536
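The arithmetic behind these constants can be checked with a few helper functions. The sketch assumes that pages are stored back to back directly after the file header and that each segment holds a whole number of pages; the constant and function names are hypothetical.

```cpp
#include <cassert>

typedef unsigned int PAGEID;

const unsigned long long kPageSize    = 4194304ULL;   // 4 MB
const unsigned long long kMaxPages    = 262144ULL;    // 256K pages
const unsigned long long kSegmentSize = 134217728ULL; // 128 MB
const unsigned long long kHeaderSize  = 65536ULL;     // 64 KB
const unsigned long long kPagesPerSegment = kSegmentSize / kPageSize;

// Byte offset of a page in the database file (assumed layout:
// header first, then pages in order).
unsigned long long page_file_offset(PAGEID id)
{
   return kHeaderSize + id * kPageSize;
}

// Index of the mapped segment that contains the page.
unsigned int segment_of(PAGEID id)
{
   return static_cast<unsigned int>(id / kPagesPerSegment);
}

// Byte offset of the page inside its segment.
unsigned long long offset_in_segment(PAGEID id)
{
   return (id % kPagesPerSegment) * kPageSize;
}
```

With these values a segment holds 32 pages, and 256K pages of 4 MB each add up to exactly 1 TB of page data.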

(6) The declaration of the class that implements the DMS data management module is as follows

class dmsfile : public ossmmapfile
{
private:
   dmsheader *_header;        // header of the database file
   std::vector<char *> _body; // start address of each segment in virtual memory
   // Free-space management: searched by record size on every insert.
   std::multimap<unsigned int, PAGEID> _freespacemap;
   ossslatch _mutex;          // read-write lock
   // Mutex for extending the database file; prevents two threads from
   // extending the file at the same time.
   ossxlatch _extendmutex;
   char *_pfilename;          // file name
   ixmbucketmanager *_ixmbucketmgr; // data index

public:
   dmsfile(ixmbucketmanager *ixmbucketmgr);
   ~dmsfile();
   // Initializes the DMS file.
   int initialize(const char *pfilename);
   // Inserts a record. On return, rid identifies the page and slot of the
   // new record, and outrecord refers to the record inside the memory map.
   int insert(bson::BSONObj &record, bson::BSONObj &outrecord, dmsrecordid &rid);
   // Deletes the record with the given record ID.
   int remove(dmsrecordid &rid);
   // Finds the record with the given record ID.
   int find(dmsrecordid &rid, bson::BSONObj &result);

private:
   int _extendsegment();      // extends the database file by one segment
   int _initnew();            // initializes an empty database file (header only)
   int _extendfile(int size); // extends the file by the given size
   int _loaddata();           // loads an existing database file
   // Searches the slot of a record in the given data page.
   int _searchslot(char *page, dmsrecordid &recordid, SLOTOFF &slot);
   void _recoverspace(char *page); // reorganizes a page
   // Updates the free-space information of a page.
   void _updatefreespace(dmspageheader *header, int changesize, PAGEID pageid);
   // Finds a page with at least requiredsize bytes in the free-space list.
   PAGEID _findpage(size_t requiredsize);
};

Note: from the declaration above, the in-memory metadata of the database consists of: the database file header (mainly the size of the database file), the start address of each segment mapped into memory, the free-space list, the database file name, and the database index.

(7) Implementation of data insertion (insert)

int dmsfile::insert(BSONObj &record, BSONObj &outrecord, dmsrecordid &rid)
{
   int rc                     = DIY_OK;
   PAGEID pageid              = 0;
   char *page                 = NULL;
   dmspageheader *pageheader  = NULL;
   int recordsize             = 0;
   SLOTOFF offsettemp         = 0;
   const char *pgkeyfieldname = NULL;
   dmsrecord recordheader;

   recordsize = record.objsize(); // size of the record
   // A record can be at most 4 MB minus the page header.
   if ((unsigned int)recordsize > DMS_MAX_RECORD)
   {
      rc = DIY_INVALIDARG;
      PD_LOG(PDERROR, "record cannot bigger than 4MB");
      goto error;
   }
   pgkeyfieldname = gkeyfieldname;
   // Make sure the record has an _id field.
   if (record.getFieldDottedOrArray(pgkeyfieldname).eoo())
   {
      rc = DIY_INVALIDARG;
      PD_LOG(PDERROR, "record must be with _id");
      goto error;
   }
retry:
   // Take the global lock.
   _mutex.get();
   // Find a page with enough space for the record and its header.
   pageid = _findpage(recordsize + sizeof(dmsrecord));
   if (DMS_INVALID_PAGEID == pageid)
   {
      // No existing page has enough space: release the global lock and
      // try to allocate a new segment by calling _extendsegment.
      _mutex.release();
      if (_extendmutex.try_get())
      {
         // Only one thread can extend at a time. _extendsegment extends the
         // database file, maps the new segment into memory, initializes the
         // metadata of each new page, and updates the database metadata
         // (the free-space list and the list of segment start addresses).
         rc = _extendsegment();
         if (rc)
         {
            PD_LOG(PDERROR, "Failed to extend segment, rc = %d", rc);
            _extendmutex.release();
            goto error;
         }
      }
      else
      {
         // If we cannot get the extend mutex, someone else is extending the
         // file, so wait until we get the mutex, release it, and try again.
         _extendmutex.get();
      }
      _extendmutex.release();
      goto retry; // search again for a page with enough space
   }
   // Find the in-memory address of the page from its page ID.
   page = pagetooffset(pageid);
   // If the page cannot be located in memory, release the global lock
   // and return an error.
   if (!page)
   {
      rc = DIY_SYS;
      PD_LOG(PDERROR, "Failed to find the page");
      goto error_releasemutex;
   }
   // Read the metadata of the page.
   pageheader = (dmspageheader *)page;
   // Check the eyecatcher to make sure this is a database page.
   if (memcmp(pageheader->_eyecatcher, DMS_PAGE_EYECATCHER,
              DMS_PAGE_EYECATCHER_LEN) != 0)
   {
      rc = DIY_SYS;
      PD_LOG(PDERROR, "Invalid page header");
      goto error_releasemutex;
   }
   // The page we found has enough free space in total, but that space is
   // not necessarily contiguous. If the contiguous part is too small for
   // the record, reorganize the page so the free space becomes contiguous.
   if (pageheader->_slotoffset + recordsize + sizeof(dmsrecord) +
       sizeof(SLOTID) > pageheader->_freeoffset)
   {
      _recoverspace(page); // in-page reorganization
   }
   offsettemp = pageheader->_freeoffset - recordsize - sizeof(dmsrecord);
   recordheader._size = recordsize + sizeof(dmsrecord);
   recordheader._flag = DMS_RECORD_FLAG_NORMAL;
   // Fill in the slot allocated to the new record.
   *(SLOTOFF *)(page + sizeof(dmspageheader) +
                pageheader->_numslots * sizeof(SLOTOFF)) = offsettemp;
   // Fill in the record header.
   memcpy(page + offsettemp, (char *)&recordheader, sizeof(dmsrecord));
   // Fill in the record body.
   memcpy(page + offsettemp + sizeof(dmsrecord), record.objdata(), recordsize);
   outrecord = BSONObj(page + offsettemp + sizeof(dmsrecord));
   rid._pageid = pageid;
   rid._slotid = pageheader->_numslots;
   // Update the metadata of the data page.
   pageheader->_numslots++;
   pageheader->_slotoffset += sizeof(SLOTID);
   pageheader->_freeoffset = offsettemp;
   // Update the metadata of the database (the free-space list).
   _updatefreespace(pageheader,
                    -(recordsize + sizeof(SLOTID) + sizeof(dmsrecord)),
                    pageid);
   // Release the global lock.
   _mutex.release();
done:
   return rc;
error_releasemutex:
   _mutex.release();
error:
   goto done;
}

Note: as the code shows, a thread takes the database-wide global lock whenever it operates on the database. The granularity of this lock is very coarse, which is not recommended, and it is one of the reasons this database is not suitable for commercial use.

The insert operation consists of the following steps:

• Validate the input data
• Lock the database
• Find a data page with enough space
• If no data page has enough space, release the lock, add a new data segment, take the lock again, and search again
• If the page found does not have enough contiguous free space, reorganize the page
• Write the record to the data page
• Update the metadata of the data page
• Update the free-space information
• Unlock
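The locking part of these steps (search under the global lock, release it before extending, let only one thread extend, then retry) can be isolated in a small model. The sketch below replaces pages with plain free-byte counters and uses std::mutex in place of the DIYDB latch classes; page_store and its members are hypothetical names, and the model assumes the extension step takes the global lock itself while it registers the new pages.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

class page_store {
public:
   page_store(unsigned int pages_per_segment, unsigned int page_capacity)
      : _pages_per_segment(pages_per_segment), _page_capacity(page_capacity)
   {
      assert(pages_per_segment > 0);
   }

   // Reserves size bytes in some page and returns the page number,
   // or -1 if the request can never fit into a single page.
   int reserve(unsigned int size)
   {
      if (size > _page_capacity) return -1;
      for (;;) {                         // plays the role of "goto retry"
         _mutex.lock();                  // global lock
         int page = find_page(size);
         if (page >= 0) {
            _free[page] -= size;
            _mutex.unlock();
            return page;
         }
         _mutex.unlock();                // no room: release before extending
         if (_extend_mutex.try_lock()) {
            extend_segment();            // this thread does the extension
         } else {
            _extend_mutex.lock();        // another thread is extending: wait
         }
         _extend_mutex.unlock();
      }
   }

   std::size_t page_count()
   {
      std::lock_guard<std::mutex> guard(_mutex);
      return _free.size();
   }

private:
   int find_page(unsigned int size) const // caller holds _mutex
   {
      for (std::size_t i = 0; i < _free.size(); ++i)
         if (_free[i] >= size) return static_cast<int>(i);
      return -1;
   }
   void extend_segment()                  // caller holds _extend_mutex
   {
      std::lock_guard<std::mutex> guard(_mutex);
      _free.resize(_free.size() + _pages_per_segment, _page_capacity);
   }

   unsigned int _pages_per_segment;
   unsigned int _page_capacity;
   std::mutex _mutex;         // global lock
   std::mutex _extend_mutex;  // serializes extension
   std::vector<unsigned int> _free;
};
```

Releasing the global lock before extending keeps other threads from being blocked for the duration of the slow file extension, while the second mutex ensures that a burst of inserts triggers one extension instead of several.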

III. Summary

1. This chapter covered the DMS module of DIYDB, the data management module. When the runtime module executes a request, it relies on this module to operate on the data in the database (and on the index module to locate records).

2. DIYDB stores its data on disk. Reading disk data in the traditional way copies the data twice, which is time-consuming, so DIYDB maps the data in the database file into the (virtual) user address space of the process to make disk reads and writes more efficient. To avoid depending on a large contiguous range of addresses in user space, it maps the file one segment at a time.

3. The DIYDB data management module locks the entire database whenever it operates on data, so its efficiency is poor.

4. The storage format inside the DIYDB database file is rather clever: each data record is assigned a slot. Record lengths vary, but slots have a fixed size, so locating a slot is easy. For this reason, each record ID consists of a page ID and a slot ID.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
