Rereading the storage implementation of SimpleDB


A bit of background: some time ago, at my mentor's suggestion, I studied SimpleDB, a teaching RDBMS written by Professor Edward Sciore of Boston College, and then worked with classmates to implement a C# version, simply to understand how an RDBMS is implemented. A few years later, after being asked about this small project several times, I found I had forgotten many of the details, so I had no choice but to read it again. RDBMSs may no longer be as fashionable as they once were now that NoSQL is popular, but I am reminded of a paper in Readings in Database Systems called "What Goes Around Comes Around", and I expect that reviewing this material will still pay off. Besides, SimpleDB is small and rough, which makes it quite helpful for understanding the underlying implementation of an RDBMS. The core code is a little over 6,000 lines (this article is based on my own C# reimplementation).

Now to the main topic.

SimpleDB has 15 packages: buffer, file, index, log, materialize, metadata, multibuffer, opt, parse, planner, query, record, remote, server, and tx.

Figure 1 SimpleDB file structure

---------- Added on 2012.8.29 -----------

By chance, while going through some old documents, I came across a logic diagram of the SimpleDB modules. I find it helpful for understanding the structure of the whole RDBMS, so I am adding it here.

-------------------------------------------

This article starts with the lowest storage layer: buffer and file. The log package also belongs to this layer: it only provides reading and writing of int and string values at the disk-block level and does not care what the log contents mean. However, because of its special data structure, which is read and written in reverse order, it will be covered separately in the next article.

The class structure of the file package is as follows:

Figure 2 class diagram of the file package

 

> Block

Is the most basic unit of disk storage and the smallest unit of data access in the system. All data is persisted by flushing it to a block. A block has two members: the file name filename and the block number blknum within that file. The relationship between blocks and files is shown below:

Figure 3 Relationship Between Blocks and files

A file consists of a whole number of blocks, numbered from 0. A newly created file is empty. As the system appends blocks to the file, the file grows.
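As a rough illustration of the block described above, here is a minimal Java sketch (SimpleDB's original language, not the C# code this article is based on). The method names fileName, number, and byteOffset, and the file name used in the usage note, are assumptions made for this example.

```java
import java.util.Objects;

// Hypothetical block reference: a file name plus a zero-based block number.
// It only identifies a block; it performs no I/O itself.
final class Block {
    private final String filename;
    private final int blknum;

    Block(String filename, int blknum) {
        this.filename = filename;
        this.blknum = blknum;
    }

    String fileName() { return filename; }
    int number() { return blknum; }

    // Byte position of this block inside its file for a given block size.
    long byteOffset(int blockSize) {
        return (long) blknum * blockSize;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Block)) return false;
        Block other = (Block) o;
        return filename.equals(other.filename) && blknum == other.blknum;
    }

    @Override
    public int hashCode() {
        return Objects.hash(filename, blknum);
    }

    @Override
    public String toString() {
        return "[file " + filename + ", block " + blknum + "]";
    }
}
```

With 4096-byte blocks, new Block("student.tbl", 2).byteOffset(4096) is 8192, because blocks 0 and 1 come before it in the file.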

 

> Page

Is where data actually lives while the system is running. A page contains a byte array, contents. When the system reads or writes data, the data first goes into this contents array, and from there it is either persisted to a block or passed to another in-memory object. Typically a block is 4 KB, so the byte array has length 4096. Note, however, that a page itself is not bound to any particular block.

Figure 4 Relationship between PAGE and block

Note the following two points:

1. When reading and writing data, multiple threads may access the same file concurrently. To maintain consistency, a lock is used to make access mutually exclusive. threadlock is a dummy object (a "stake") that exists only to be the target of lock. The first time I wrote this, I used "lock (this);". That locked only the current object itself and did not keep other threads out. Later, while debugging, I added the dedicated lock object.

2. The read, write, and append methods of page do not perform the underlying I/O themselves. Instead, they call the read, write, and append methods of filemgr through a filemgr object that the page wraps. These three methods are really operations between a block and a page: read loads the data of a block into the page, write writes the data in the page to a block, and append appends the data in the page to a specified file (that is, writes it into a new block of that file). The page itself only has to care about reading and writing data in the contents array: setint, setstring, getint, and getstring.
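The two points above can be sketched in Java as follows. This is not the author's C# code: the shared static lock object mirrors the threadlock "stake" described in point 1, and the string layout (a 4-byte length followed by UTF-8 bytes) is an assumption made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical page: typed reads and writes on an in-memory byte array.
final class Page {
    static final int BLOCK_SIZE = 4096;   // block size stated in the text

    // Shared lock object ("stake"): every page locks the same object,
    // which is stricter than locking only the current instance.
    private static final Object threadLock = new Object();

    private final ByteBuffer contents = ByteBuffer.allocate(BLOCK_SIZE);

    int getInt(int offset) {
        synchronized (threadLock) {
            return contents.getInt(offset);
        }
    }

    void setInt(int offset, int value) {
        synchronized (threadLock) {
            contents.putInt(offset, value);
        }
    }

    // Assumed layout: 4-byte length, then the UTF-8 bytes of the string.
    String getString(int offset) {
        synchronized (threadLock) {
            int len = contents.getInt(offset);
            byte[] bytes = new byte[len];
            for (int i = 0; i < len; i++) {
                bytes[i] = contents.get(offset + 4 + i);
            }
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    void setString(int offset, String value) {
        synchronized (threadLock) {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            contents.putInt(offset, bytes.length);
            for (int i = 0; i < bytes.length; i++) {
                contents.put(offset + 4 + i, bytes[i]);
            }
        }
    }
}
```

A single shared lock serializes access to all pages, which is simple but coarse; a real system might prefer finer-grained locking.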

 

> Filemgr

Is a utility class that moves data between memory and disk.

This is where the storage layout of the database comes in.

dbdirectory is the name of the database directory. After you create a database named testdb, a directory named testdb is created under the specified parent directory, and the tables of the database are saved in it.

openfiles is a dictionary that maps file names to file streams, a bit like a handle table. It tracks the files currently open in the database.

GetFile is the only way to access openfiles. It takes a file name and returns a file stream. read, write, and append all obtain their file stream through GetFile. If the file exists, it is opened and its stream is returned; if it does not exist, a new file is created. Because GetFile can create files, it is private.

There are two commonly used methods in filemgr:

size() returns the number of blocks in the specified file. Because block numbers start from 0, the returned size is also the number that the next block appended to the file will receive. size() is usually called by various utility classes to find the end of the file.

isnew() returns whether the database directory was newly created. It is called during system initialization.

The read, write, and append methods handle all transfers between in-memory data and on-disk data in the system. Each first takes the lock, then obtains the file stream fs for the specified file through GetFile, and then uses fs to move data between the byte[] and the block.
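The following Java sketch puts the pieces of filemgr described above together. It is a sketch under assumptions, not the author's C# code: blocks are addressed by a file name and block number instead of a block object, RandomAccessFile stands in for the file stream, and I/O errors are rethrown as unchecked exceptions to keep the example short.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class FileMgr {
    static final int BLOCK_SIZE = 4096;
    private final File dbDirectory;
    private final boolean isNew;
    // File name -> open file, similar to a handle table.
    private final Map<String, RandomAccessFile> openFiles = new HashMap<>();

    FileMgr(File dbDirectory) {
        this.dbDirectory = dbDirectory;
        this.isNew = !dbDirectory.exists();
        if (isNew && !dbDirectory.mkdirs()) {
            throw new UncheckedIOException(new IOException("cannot create " + dbDirectory));
        }
    }

    boolean isNew() { return isNew; }

    // Number of blocks in the file; also the number of the next appended block.
    synchronized int size(String filename) {
        try {
            return (int) (getFile(filename).length() / BLOCK_SIZE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void read(String filename, int blknum, byte[] page) {
        try {
            RandomAccessFile f = getFile(filename);
            f.seek((long) blknum * BLOCK_SIZE);
            f.readFully(page, 0, BLOCK_SIZE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    synchronized void write(String filename, int blknum, byte[] page) {
        try {
            RandomAccessFile f = getFile(filename);
            f.seek((long) blknum * BLOCK_SIZE);
            f.write(page, 0, BLOCK_SIZE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the page as a new last block and returns that block's number.
    synchronized int append(String filename, byte[] page) {
        int newBlknum = size(filename);
        write(filename, newBlknum, page);
        return newBlknum;
    }

    // Private because it may create the file as a side effect.
    private RandomAccessFile getFile(String filename) throws IOException {
        RandomAccessFile f = openFiles.get(filename);
        if (f == null) {
            f = new RandomAccessFile(new File(dbDirectory, filename), "rws");
            openFiles.put(filename, f);
        }
        return f;
    }
}
```

Because Java's synchronized is reentrant, append can call size and write while already holding the lock.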

 

The class structure of the buffer package is as follows:

Figure 5 class diagram of the buffer package

The buffer package gives SimpleDB its own buffering mechanism, with its own buffer pool and buffer scheduling algorithm.

> Buffer

Is the object that the upper-layer classes of the system use directly. That is, whenever data such as log or transaction data has to be read or written, the caller must first have a buffer object and then read and write through it.

The class diagram shows that a buffer holds both a page object, contents, and a block object, blk. Recall from the Page section that the core of a page is the byte array contents, which is the in-memory image of a particular disk block, while blk identifies the disk block. In other words, a buffer maintains the mapping between in-memory data and on-disk data.

Among the member fields, logsequencenumber starts at -1; when data is written to the buffer, it records the LSN of the log record for that write. modifiedby also starts at -1; when a write is performed, it records the ID of the transaction doing the write, that is, the transaction that modified the buffer. These two fields show that only writes are logged; reads are not recorded and simply read the data directly. This is partly a performance consideration.

The getint() and getstring() methods simply wrap the corresponding contents methods, so I will not go over them again. setint() and setstring() also wrap the corresponding contents methods, but additionally record the transaction ID and LSN. Because the corresponding methods of the page class already take care of concurrency and locking, the buffer level can call them directly without further precautions.

flush() first checks whether the current buffer has been modified. If so, it calls the write() method of contents to write the data to blk and resets modifiedby to -1. Note that this is where WAL (write-ahead logging) shows up: logmgr's flush() is called first to write the log to its file, and only then is write() called to write the data to its file.
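A minimal Java sketch of this write-ahead ordering follows. It is an illustration under assumptions, not the author's C# code: the two callbacks logFlush and blockWrite are hypothetical stand-ins for logmgr's flush() and the page's write(), so that the order of the two calls can be observed.

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;
import java.util.function.IntConsumer;

final class WalBuffer {
    private final ByteBuffer contents = ByteBuffer.allocate(4096);
    private final IntConsumer logFlush;          // stand-in for logmgr.flush(lsn)
    private final Consumer<byte[]> blockWrite;   // stand-in for contents.write(blk)
    private int modifiedBy = -1;                 // -1 means "not modified"
    private int logSequenceNumber = -1;          // -1 means "no log record yet"

    WalBuffer(IntConsumer logFlush, Consumer<byte[]> blockWrite) {
        this.logFlush = logFlush;
        this.blockWrite = blockWrite;
    }

    int getInt(int offset) {
        return contents.getInt(offset);
    }

    // A write also records which transaction wrote and the matching LSN.
    void setInt(int offset, int value, int txnum, int lsn) {
        modifiedBy = txnum;
        if (lsn >= 0) {
            logSequenceNumber = lsn;
        }
        contents.putInt(offset, value);
    }

    // Write-ahead logging: the log reaches disk before the data does.
    void flush() {
        if (modifiedBy >= 0) {
            logFlush.accept(logSequenceNumber);           // 1. log first
            blockWrite.accept(contents.array().clone());  // 2. then the data
            modifiedBy = -1;
        }
    }
}
```

Flushing an unmodified buffer does nothing, and a second flush() right after the first is also a no-op because modifiedBy has been reset.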

assigntoblock(Block b) reads the specified block into the current buffer. It first calls flush(), so that any dirty data in the buffer is written to disk. Then it points the blk reference at the target block and reads that block's data into contents.

assigntonew(string filename, pageformatter fmtr) formats the contents object according to the specified page format and appends the page to the file filename. This is a little hard to follow at first. The format method simply writes some extra information, such as flag bits, into the page. After formatting, contents actually holds data; the append method is then called to append the data in contents to the file filename, and the reference to the disk block that was written is stored in blk. Of course, before any of this, the dirty data in the buffer has to be flushed, as described above.

To sum up the two assign* methods: buffer does not define its own constructor, and blk is initialized to null where it is declared. In the whole buffer class, only the two assign* methods assign a value to blk. Only assigntonew has a filename parameter, because the block bound to a buffer must belong to some file. The conclusion is this: for a newly created file, assigntonew is called first, which binds the file and gives the buffer a block of that file. After that, when the blocks of the file are read into a buffer one by one, assigntoblock is used.
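The two assign* methods can be sketched in Java roughly as follows. Everything here is an assumption made for illustration, not the author's C# code: the "disk" is an in-memory map, the block reference blk is split into a file name and a block number, the block size is tiny, and a Consumer<ByteBuffer> stands in for pageformatter.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

final class AssignSketch {
    static final int BLOCK_SIZE = 16;   // tiny block size, for illustration only
    // In-memory stand-in for the disk: file name -> list of blocks.
    private final Map<String, List<byte[]>> disk = new HashMap<>();
    private final ByteBuffer contents = ByteBuffer.allocate(BLOCK_SIZE);
    private String blkFile = null;      // the blk reference: file name ...
    private int blkNum = -1;            // ... and block number
    private boolean dirty = false;

    int getInt(int offset) { return contents.getInt(offset); }
    int blockNumber() { return blkNum; }

    void setInt(int offset, int value) {
        contents.putInt(offset, value);
        dirty = true;
    }

    // Writes the buffer back to its block if it was modified.
    void flush() {
        if (dirty && blkFile != null) {
            disk.get(blkFile).set(blkNum, contents.array().clone());
        }
        dirty = false;
    }

    // Bind the buffer to an existing block and load its data.
    void assignToBlock(String filename, int blknum) {
        flush();
        blkFile = filename;
        blkNum = blknum;
        byte[] stored = disk.get(filename).get(blknum);
        System.arraycopy(stored, 0, contents.array(), 0, BLOCK_SIZE);
    }

    // Format the page, append it to the file, and bind to the new block.
    void assignToNew(String filename, Consumer<ByteBuffer> formatter) {
        flush();
        formatter.accept(contents);
        List<byte[]> blocks = disk.computeIfAbsent(filename, k -> new ArrayList<>());
        blocks.add(contents.array().clone());
        blkFile = filename;
        blkNum = blocks.size() - 1;
    }
}
```

Both methods start with flush(), so switching a buffer to another block never loses changes made to the block it held before.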

Finally, let's look at the buffer's pins counter and its pin() and unpin() methods.

The pin() and unpin() methods of buffer modify the pins value, and checking pins tells whether the buffer is in use. Together with basicbuffermgr, pin and unpin are how the buffers in the pool are managed, allocated, and reclaimed. The details are left for the basicbuffermgr section below.

> Basicbuffermgr

Is the basic buffer scheduler, which manages pinning and unpinning of the buffers in the buffer pool. It does not deal with waiting or with any scheduling policy.

basicbuffermgr maintains an array of buffers as the buffer pool and uses numavailable to track the number of available buffers.

As a critical resource, the buffer pool must be accessed under mutual exclusion. When several threads request a buffer at the same time, the number of available buffers has to be decremented one request at a time, and concurrent access could leave it inconsistent, so a threadlock object is set up to serve as the lock "stake". All methods that modify bufferpool must take this lock to guarantee mutually exclusive access.

First, the constructor. Its parameter is numbuffs, the number of buffers in the buffer pool. The constructor initializes the bufferpool array.

Next, the methods that traverse bufferpool:

findexistingbuffer takes a reference to a block, traverses the buffer pool, and checks whether a buffer has already been allocated to that block.

chooseunpinnedbuffer traverses the entire buffer pool and returns a buffer that is not in use.

These two methods only read bufferpool and do not take the lock. Why can the reads go unlocked? Because both methods are only called from inside pin and pinnew, which already hold the lock, so only one thread at a time is reading bufferpool inside the critical section and no further locking is needed.

Then there are the methods that operate on bufferpool:

pin() assigns a buffer to the block given as its parameter. It first takes the lock, then calls findexistingbuffer to check whether the block already has a buffer. If not, it calls chooseunpinnedbuffer() to look for an unused buffer. If none is found, null is returned; if one is found, it is assigned to the block. Then numavailable is updated, the buffer's own pin is called to increase its pins counter, and finally the buffer is returned.

pinnew() differs from pin() in that pinnew takes a new block of the file filename, while pin is told which block to use through its parameter. pinnew therefore uses the buffer's assigntonew method. The buffer is allocated in much the same way as in pin, but because the block is brand new, there is no step that checks whether the block already has a buffer.

unpin() releases the specified buffer. After taking the lock, it only has to update the pins value of the buffer object and the numavailable value.

flushall() traverses the whole bufferpool and writes the data of each buffer to the corresponding disk block.
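The pin and unpin bookkeeping described above can be sketched in Java as follows. This is a simplified illustration, not the author's C# code: the inner class Buf and the string block key (for example "t.tbl:0") are assumptions, and assigning a block is reduced to storing that key.

```java
final class BasicBufferMgr {
    // Minimal buffer: remembers which block it holds and how often it is pinned.
    static final class Buf {
        String block = null;   // hypothetical key such as "t.tbl:0"; null if unassigned
        int pins = 0;
        boolean isPinned() { return pins > 0; }
    }

    private final Object threadLock = new Object();   // the lock "stake"
    private final Buf[] bufferPool;
    private int numAvailable;

    BasicBufferMgr(int numBuffs) {
        bufferPool = new Buf[numBuffs];
        for (int i = 0; i < numBuffs; i++) {
            bufferPool[i] = new Buf();
        }
        numAvailable = numBuffs;
    }

    // Returns the buffer pinned to the block, or null if every buffer is in use.
    Buf pin(String block) {
        synchronized (threadLock) {
            Buf buff = findExistingBuffer(block);
            if (buff == null) {
                buff = chooseUnpinnedBuffer();
                if (buff == null) return null;
                buff.block = block;            // stands in for assigntoblock
            }
            if (!buff.isPinned()) numAvailable--;
            buff.pins++;
            return buff;
        }
    }

    void unpin(Buf buff) {
        synchronized (threadLock) {
            buff.pins--;
            if (!buff.isPinned()) numAvailable++;
        }
    }

    int available() {
        synchronized (threadLock) {
            return numAvailable;
        }
    }

    // The two helpers take no lock: they are only called while it is held.
    private Buf findExistingBuffer(String block) {
        for (Buf b : bufferPool) {
            if (block.equals(b.block)) return b;
        }
        return null;
    }

    private Buf chooseUnpinnedBuffer() {
        for (Buf b : bufferPool) {
            if (!b.isPinned()) return b;
        }
        return null;
    }
}
```

Note that numAvailable only changes when a buffer goes from unpinned to pinned or back; pinning the same block twice leaves it unchanged.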

> Buffermgr

Is the buffer manager that the other modules of the system can access publicly. It offers the same methods as basicbuffermgr; the difference is that a wait is added when a buffer is requested, so that pin and pinnew never return null.

The waiting mechanism: if no buffer is available, the requesting thread joins a waiting queue. When a buffer becomes available, the thread is removed from the queue. The waiting time has a threshold; when it is exceeded, a bufferabortexception is thrown.

The threshold is set to 10 s. A thread that has waited longer than 10 s gives up automatically.

pin() uses try {} catch () {} to handle the exception. It starts by recording a timestamp: long timestamp = DateTime.Now.Ticks; then it calls basicbuffermgr's pin() method to request a buffer. If that succeeds, it returns. Otherwise the thread starts waiting. Waiting continues while two conditions hold: (a) no buffer has been obtained yet, and (b) the wait has not timed out. The waittoolong() method keeps checking for the timeout. Monitor is used while waiting; it blocks the current thread for at most max_time.

pinnew() is similar to pin(). For details, see basicbuffermgr.pinnew().

unpin() releases the buffer and at the same time wakes up the blocked threads.
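The waiting scheme above can be sketched in Java with wait and notifyAll, which play the role that Monitor plays in the C# version. This is a sketch under assumptions, not the author's code: the pool is reduced to a counter, tryPin is a hypothetical stand-in for basicbuffermgr's pin(), and the timeout is a constructor parameter (it must be greater than 0; the article uses 10 s).

```java
class BufferAbortException extends RuntimeException {
}

final class WaitingBufferMgr {
    private final long maxTimeMillis;   // waiting threshold, must be > 0
    private int available;              // stand-in for the real buffer pool

    WaitingBufferMgr(int numBuffs, long maxTimeMillis) {
        this.available = numBuffs;
        this.maxTimeMillis = maxTimeMillis;
    }

    // Stand-in for basicbuffermgr.pin(): false plays the role of null.
    private boolean tryPin() {
        if (available == 0) return false;
        available--;
        return true;
    }

    synchronized void pin() {
        try {
            long timestamp = System.currentTimeMillis();
            boolean pinned = tryPin();
            // Keep waiting while no buffer was obtained and no timeout occurred.
            while (!pinned && !waitingTooLong(timestamp)) {
                wait(maxTimeMillis);     // blocks for at most maxTimeMillis
                pinned = tryPin();
            }
            if (!pinned) {
                throw new BufferAbortException();
            }
        } catch (InterruptedException e) {
            throw new BufferAbortException();
        }
    }

    // Releasing a buffer wakes up every waiting thread so it can retry.
    synchronized void unpin() {
        available++;
        notifyAll();
    }

    private boolean waitingTooLong(long startTime) {
        return System.currentTimeMillis() - startTime > maxTimeMillis;
    }
}
```

Strictly speaking this is a blocking wait rather than a busy wait: a waiting thread sleeps inside wait() and consumes no CPU until it is notified or the timeout expires.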

The other methods are essentially thin wrappers around basicbuffermgr and are not repeated here.

> Pageformatter

Is an interface used to initialize a data block. It has a single method, format. As mentioned above, format writes some auxiliary information to the block and thereby puts the disk block into the specified format. The two pageformatter implementations are btpageformatter and recordpageformatter, the B+ tree page formatter and the data record page formatter respectively.
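As an illustration of the interface described above, here is a minimal Java sketch. The ByteBuffer parameter and the EmptySlotFormatter class are assumptions made for this example; the latter is a hypothetical formatter, not btpageformatter or recordpageformatter.

```java
import java.nio.ByteBuffer;

// A formatter has one job: put a fresh page into a known initial state.
interface PageFormatter {
    void format(ByteBuffer page);
}

// Hypothetical formatter: divides the page into fixed-size slots and
// writes an "empty" flag at the start of each slot.
final class EmptySlotFormatter implements PageFormatter {
    static final int EMPTY = 0;
    private final int slotSize;   // assumed to be at least 4 bytes

    EmptySlotFormatter(int slotSize) {
        this.slotSize = slotSize;
    }

    @Override
    public void format(ByteBuffer page) {
        for (int pos = 0; pos + slotSize <= page.capacity(); pos += slotSize) {
            page.putInt(pos, EMPTY);   // the flag occupies the first 4 bytes of the slot
        }
    }
}
```

Keeping the format behind an interface lets the buffer append a new block without knowing whether it will hold index entries or data records.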

> Bufferabortexception

Is an exception class; there is nothing more to say about it.

--

The end.
