The last big assignment in this semester's database course was to implement a MiniSQL database. The requirements were simple: create and drop tables, create and drop single-value indexes, support primary key and unique definitions, and support the simplest select, with only 3 types required: int, float, and char (length 0~255). At the very beginning, since a database only learns its types at run time, I chose C# for its powerful runtime and the chance to integrate with LINQ. A week later I found that C#'s ability to manipulate binary structures is almost zero. While writing the BufferManager I also found that I could not control object lifetimes at all, and implementing IDisposable was too baffling. Set against the powerful runtime and LINQ, these weaknesses were unacceptable to me, so I moved decisively to C++.
C++'s RAII gives precise control over object lifetime and ownership, which made it very enjoyable to write a BufferManager that accurately controls every chunk of memory. Smart pointers, especially unique_ptr, played a huge role here. Along the way I came to understand C++11's finer-grained concepts around values, and became more familiar with defining and controlling object ownership and its transfer. C++ is also close enough to raw pointers and memory that I could reinterpret_cast an object pointer to byte* and write it straight to a file in binary mode (of course, I had to make sure by hand that such objects contain no pointers and own no heap memory). In the other direction, I could read binary content from a file, reinterpret_cast its first address to an object pointer, and use it directly. Templates helped a lot here as well.
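As a rough illustration of the binary round trip described above, here is a minimal sketch. The `Record` type, its fields, and the two helper functions are hypothetical names invented for this example, not the project's actual code. It reads the bytes into an existing object rather than casting a raw buffer, which is the more conservative variant of the same idea.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <type_traits>

// Hypothetical fixed-size record: no pointers, no heap-owning members.
struct Record {
    std::int32_t id;
    float        score;
    char         name[16];
};

// Raw byte copies are only safe for trivially copyable types.
static_assert(std::is_trivially_copyable<Record>::value,
              "Record must be safe to copy as raw bytes");

// Write the object's bytes straight to a binary file.
bool save_record(const std::string& path, const Record& r) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&r), sizeof r);
    out.flush();
    return static_cast<bool>(out);
}

// Read the same number of bytes back into an existing object.
bool load_record(const std::string& path, Record& r) {
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(&r), sizeof r);
    return in.gcount() == static_cast<std::streamsize>(sizeof r);
}
```

A file written this way is only readable by a build with the same struct layout and byte order, which is acceptable for a course project but not for a portable file format.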
For the index, I first wrote the B+ tree as a template, then realized that a database needs types decided at run time, which a compile-time static mechanism like templates cannot provide. So I made the template inherit from a common base class to erase the template type, and had the base class's virtual functions take byte* parameters to erase the parameter types. The virtual call finds the correct derived class, and the derived class's type then takes part in the operation. This was an emergency workaround: there was no time to turn the template container into a runtime dynamically typed one, nor to build a new dynamically typed wheel. Converting a runtime value into a static template type took a lot of effort, mainly to implement something like:
    template <typename T> TreeBase* make_tree();

    switch (type) {
    case INT:   return make_tree<int>();
    case FLOAT: return make_tree<float>();
    case CHAR:
        switch (size) {
        case 1:   return make_tree<array<char, 1>>();
        case 2:   return make_tree<array<char, 2>>();
        // .....
        case 255: return make_tree<array<char, 255>>();
        }
        break;
    }
This is what mapping a dynamic value to a static type looks like. The most disgusting part is the switch (size), which needs cases 1~255. Writing them by hand would be awful, so I wrote a template that does a binary search and finds the correct value within at most 6 levels of function calls. All this patching was just to fill the hole I dug by writing a statically templated B+ tree in the first place. While debugging the B+ tree I used the Natvis visualizers provided by the VS debugger to customize how my own classes are displayed in the Watch window, which was really cool and made things easy to inspect.
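The type erasure and the binary-search dispatch described above could be sketched roughly as follows. This is an assumption-laden illustration, not the original code: `Tree`, `CharTreeFactory`, `ColumnType` and `make_tree_for` are made-up names, `std::set` stands in for the real B+ tree, and the recursion depth here is simply logarithmic in the range.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <set>

using byte = unsigned char;

// Type-erased interface: callers pass keys as raw bytes.
struct TreeBase {
    virtual ~TreeBase() = default;
    virtual void insert(const byte* key) = 0;
    virtual bool contains(const byte* key) const = 0;
    virtual std::size_t key_size() const = 0;
};

// Stand-in for the templated B+ tree (std::set keeps the sketch short).
template <typename T>
class Tree : public TreeBase {
public:
    void insert(const byte* key) override { keys_.insert(decode(key)); }
    bool contains(const byte* key) const override { return keys_.count(decode(key)) != 0; }
    std::size_t key_size() const override { return sizeof(T); }
private:
    // Recover the statically typed key from the erased bytes.
    static T decode(const byte* raw) {
        T value;
        std::memcpy(&value, raw, sizeof(T));
        return value;
    }
    std::set<T> keys_;
};

template <typename T>
std::unique_ptr<TreeBase> make_tree() { return std::unique_ptr<TreeBase>(new Tree<T>()); }

// Binary search over [Lo, Hi]: turns a runtime size into a template argument.
template <std::size_t Lo, std::size_t Hi>
struct CharTreeFactory {
    static std::unique_ptr<TreeBase> make(std::size_t size) {
        constexpr std::size_t Mid = Lo + (Hi - Lo) / 2;
        if (size <= Mid) return CharTreeFactory<Lo, Mid>::make(size);
        return CharTreeFactory<Mid + 1, Hi>::make(size);
    }
};

// Base case: the range has narrowed to a single size.
template <std::size_t N>
struct CharTreeFactory<N, N> {
    static std::unique_ptr<TreeBase> make(std::size_t size) {
        if (size != N) return nullptr;
        return make_tree<std::array<char, N>>();
    }
};

enum class ColumnType { Int, Float, Char };

// Runtime entry point; returns nullptr for an unsupported char size.
std::unique_ptr<TreeBase> make_tree_for(ColumnType type, std::size_t size) {
    switch (type) {
    case ColumnType::Int:   return make_tree<int>();
    case ColumnType::Float: return make_tree<float>();
    case ColumnType::Char:
        if (size < 1 || size > 255) return nullptr;
        return CharTreeFactory<1, 255>::make(size);
    }
    return nullptr;
}
```

The compiler still instantiates all 255 `array<char, N>` trees, so this saves typing rather than code size.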
The templated B+ tree had even more pitfalls inside. The original idea was that, since all types are static, the degree of a tree node can be determined at compile time so that a node just fills 4KB. The implementation went wrong immediately: alignment padding takes up space in mysterious ways, so computing the node degree as (4KB - sizeof(non-pointer members)) / sizeof(pointer) always produced a node larger than 4KB. Fortunately I had guessed that something like this might happen and added a static_assert in advance; otherwise who knows where the debugging would have led. In the end I used the nasty #pragma pack(1) to turn off alignment padding and barely got around the problem.
The next pitfall came from the tree node itself. Because I wanted to write a whole node to the file as raw binary, I needed at least to guarantee that the node is a POD type (I later found this was actually unnecessary and only added complexity), which, as I understood it then, meant the node could not carry any member functions. So I wrote a wrapper class that holds a tree node and put the required member functions there. Tree operations need to read several nodes at once, and each node is a block, so to keep an earlier node from being evicted while a later one is being read, I made the wrapper automatically lock the node it wraps, somewhat like shared_ptr's reference count.
Then the wrapper's lock counting went wrong, which taught me that a move constructor you do not write is generated automatically, rather than the move falling back to a copy (per the standard, this happens only when the class declares no copy operations, move assignment, or destructor of its own). This makes me a little suspicious: if old C++ code implements a mechanism like reference counting, can it run unchanged in a C++11 environment?
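The node-sizing idea from the first pitfall above, choosing the degree at compile time and letting a static_assert catch any overflow, might look like the sketch below. All names (`NodeHeader`, `Node`, `kMaxKeys`, `node_fits_block`) are hypothetical, and child links are assumed to be 32-bit block numbers rather than pointers, so the arithmetic differs slightly from the formula quoted in the text.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

constexpr std::size_t kBlockSize = 4096;  // one file block

#pragma pack(push, 1)  // no padding: sizeof is the plain sum of the members
struct NodeHeader {
    std::uint8_t  is_leaf;
    std::uint16_t key_count;
    std::uint32_t parent_block;
};

template <typename Key>
struct Node {
    using BlockId = std::uint32_t;  // assumption: children are block numbers

    // Largest key count such that header + keys + (keys + 1) links fit.
    static constexpr std::size_t kMaxKeys =
        (kBlockSize - sizeof(NodeHeader) - sizeof(BlockId)) /
        (sizeof(Key) + sizeof(BlockId));

    NodeHeader header;
    Key        keys[kMaxKeys];
    BlockId    children[kMaxKeys + 1];
};
#pragma pack(pop)

// Compile-time guard: fails the build instead of corrupting a block at run time.
template <typename Key>
constexpr bool node_fits_block() {
    static_assert(sizeof(Node<Key>) <= kBlockSize, "node must fit in one block");
    static_assert(std::is_trivially_copyable<Node<Key>>::value,
                  "node must be writable as raw bytes");
    return true;
}

static_assert(node_fits_block<int>(), "int node layout");
static_assert(node_fits_block<std::array<char, 255>>(), "char(255) node layout");
```

With packing enabled, taking a pointer or reference to a member such as `keys[i]` can yield a misaligned address on some platforms, which is one reason `#pragma pack(1)` feels like a workaround rather than a fix.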
For the interpreter, I hand-wrote the tokenizer using the hand-written state machine approach that vczh (Uncle Wheel) described on his blog when writing the tokenizer for Tinymoe. Parsing is a simple recursive descent that interprets and executes as it goes.
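For readers unfamiliar with the technique, a hand-written state-machine tokenizer can be sketched in a few dozen lines. This is a generic illustration with invented names (`TokenKind`, `Token`, `tokenize`), not vczh's code or the tokenizer used in this project, and it only handles identifiers, numbers, and single-character symbols.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenKind { Identifier, Number, Symbol };

struct Token {
    TokenKind   kind;
    std::string text;
};

std::vector<Token> tokenize(const std::string& input) {
    enum class State { Start, InIdentifier, InNumber };
    std::vector<Token> tokens;
    std::string current;
    State state = State::Start;

    // One extra iteration with '\0' flushes whatever token is still open.
    for (std::size_t i = 0; i <= input.size(); ++i) {
        const char c = i < input.size() ? input[i] : '\0';
        const unsigned char uc = static_cast<unsigned char>(c);
        switch (state) {
        case State::Start:
            if (std::isalpha(uc) || c == '_') {
                current = c;
                state = State::InIdentifier;
            } else if (std::isdigit(uc)) {
                current = c;
                state = State::InNumber;
            } else if (c != '\0' && !std::isspace(uc)) {
                tokens.push_back({TokenKind::Symbol, std::string(1, c)});
            }
            break;
        case State::InIdentifier:
            if (std::isalnum(uc) || c == '_') {
                current += c;
            } else {
                tokens.push_back({TokenKind::Identifier, current});
                state = State::Start;
                --i;  // re-examine this character from the Start state
            }
            break;
        case State::InNumber:
            if (std::isdigit(uc) || c == '.') {
                current += c;
            } else {
                tokens.push_back({TokenKind::Number, current});
                state = State::Start;
                --i;  // re-examine this character from the Start state
            }
            break;
        }
    }
    return tokens;
}
```

A recursive-descent parser can then consume this token list one statement at a time and execute each statement as soon as it has been recognized.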
The rest does not have much technical content, just simple business logic. The only pitfall is that I wrote three separate pieces of code in this database for managing string-to-number mappings, and it never occurred to me to use a hash for them. Failure, failure.
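In hindsight, a single hash-based table would probably have covered all three mappings. A minimal sketch, with the hypothetical name `NameTable`, assuming ids are handed out densely in first-seen order:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Assigns a dense integer id to each distinct name, in first-seen order.
class NameTable {
public:
    // Return the existing id for a name, or register it and return a new id.
    std::size_t id_of(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        const std::size_t id = names_.size();
        names_.push_back(name);
        ids_.emplace(name, id);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an unknown id.
    const std::string& name_of(std::size_t id) const { return names_.at(id); }

private:
    std::unordered_map<std::string, std::size_t> ids_;    // name -> id
    std::vector<std::string>                     names_;  // id -> name
};
```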
In the final testing phase I ran the performance profiler that ships with VS and really did find the bottleneck. It was in the part where the LRU algorithm replaces file blocks: the selection of the block to evict was flawed, so blocks were swapped far too often and performance was dragged down. After the fix, inserting 10,000 rows, while checking 3 unique keys and updating the indexes for each row, took only a little over 20 seconds, which I find fairly satisfying.
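For reference, LRU victim selection is commonly built from a linked list plus a hash map. The sketch below shows that textbook structure under the assumption that blocks are identified by integer ids; `LruTracker` is a made-up name, and the original BufferManager's actual data structures are not known from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

// Tracks which cached block was used least recently.
class LruTracker {
public:
    // Mark a block as just used, inserting it if it is new.
    void touch(int block_id) {
        auto it = pos_.find(block_id);
        if (it != pos_.end()) order_.erase(it->second);
        order_.push_front(block_id);
        pos_[block_id] = order_.begin();
    }

    // Remove and return the least recently used block, or -1 if empty.
    int evict() {
        if (order_.empty()) return -1;
        const int victim = order_.back();
        order_.pop_back();
        pos_.erase(victim);
        return victim;
    }

    std::size_t size() const { return order_.size(); }

private:
    std::list<int> order_;  // front = most recently used
    std::unordered_map<int, std::list<int>::iterator> pos_;  // id -> list node
};
```

A real buffer manager would also have to skip blocks that are currently locked, which this sketch leaves out.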
While writing this I also found that the C++ standard library is missing quite a few things: serialization (this should improve once static reflection arrives, and of course there are many open-source serialization libraries); fancy string concatenation (sstream is too slow); binary file streams (boost has a buffer_stream, but it is too old); two-dimensional (multi-dimensional) arrays whose size is determined at run time and whose storage is contiguous (you cannot write new int[m][n] unless n is a compile-time constant); the hassle of converting between char* and string; and so on.
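The runtime-sized contiguous 2D array, at least, is easy to approximate with a thin wrapper over std::vector. A minimal sketch, where `Grid` is a hypothetical name and row-major layout is assumed:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Runtime-sized 2D array stored in one contiguous, row-major buffer.
template <typename T>
class Grid {
public:
    Grid(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Element access by (row, column); no bounds checking.
    T& operator()(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Pointer to the first element of the underlying buffer.
    T* data() { return data_.data(); }

private:
    std::size_t    rows_;
    std::size_t    cols_;
    std::vector<T> data_;
};
```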
Hand-writing something of more than 5,000 lines from start to finish on my own, I used all sorts of C++ techniques I had learned, tried tools I had never used before such as Natvis and the profiler, and pulled all-nighters debugging the B+ tree. I unlocked a lot of achievements. Although my final grade for the database course was not very high (I guess mostly because my midterm went badly; my estimate is that the score for the big assignment itself was not low), I still feel it was very rewarding. Having said so much, let me end with something vczh (Uncle Wheel) once said.
What is worth spending your time on depends on whether the problem is hard enough: one you can just barely solve, where anything a little harder would be beyond you. Keep up this kind of training for 10 years and it would be hard not to become great.
Writing a database by hand, on my own