Teach you to write a database in just over 100 lines

This article introduces a simple database written by a veteran Chinese developer. It is not as powerful as the databases we normally use, but it is worth learning from: in the right environment it can be more flexible and convenient.

The database is named WawaDB and is implemented in Python, which once again shows how powerful Python can be!

Brief introduction

Requirements for logging are generally the same:

Append only, never modify, written in chronological order;

Heavy writes and light reads, with queries usually asking for the data within a given time range.

MongoDB's capped collections satisfy this requirement, but MongoDB uses a large amount of memory; using it for this feels like swatting a mosquito with a cannon, a lot of fuss over something small.

WawaDB's idea: for every 1000 log records written, record the current time and the log file offset in an index file.

Then, when querying logs by time, the index is loaded into memory, a binary search finds the offset for the requested time point, and the log file is opened and seek()'d to that position. This way the user can quickly locate and read the needed data without traversing the entire log file.
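As a rough illustration of this idea (a minimal sketch, not the author's code; the real implementation appears in full at the end of the article, and the names used here are made up for the example):

import bisect

index_times = []    # one timestamp recorded per 1000 records written
index_offsets = []  # byte offset in the log file for each indexed timestamp

def maybe_record_index(timestamp, offset, write_count, interval=1000):
    # every `interval` writes, remember where this timestamp starts in the file
    if write_count % interval == 0:
        index_times.append(timestamp)
        index_offsets.append(offset)

def find_start_offset(query_time):
    # binary search for the indexed point just before query_time;
    # reading starts from there instead of from the beginning of the file
    if not index_offsets:
        return 0
    pos = bisect.bisect_left(index_times, query_time) - 1
    return index_offsets[max(pos, 0)]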

Performance

Core 2 P8400, 2.26 GHz, 2 GB memory, 32-bit Windows 7

Write test:

Simulated writing 10,000 records per minute for 5 hours' worth of data: 3 million records inserted, 54 characters each, taking 2 minutes 51 seconds.


Read test: read the log lines containing a given substring within a specified time period.

Data range    Records traversed    Elapsed time (seconds)
5 hours       3,000,604            6.6
2 hours       1,200,225            2.7
1 hour        600,096              1.3
30 minutes    300,044              0.6

Index

Only the log's timestamp is indexed. The introduction gave a rough idea of how the index is implemented: binary search is certainly not as efficient as a B-tree, but in most cases the gap is less than an order of magnitude, and the implementation is especially simple.

Because it is a sparse index, not every log record has an index entry recording its offset, so reading has to start a little early and read some extra data to avoid missing anything; only when the data actually requested is reached is it returned to the user.

For example, say the user wants to read logs 25 through 43. Binary searching for 25 finds the index point 30:

Index: 0        10       20       30       40       50
Log:   |........|........|........|........|........|

>>> a = [0, 10, 20, 30, 40, 50]
>>> bisect.bisect_left(a, 25)
3
>>> a[3]
30
>>> bisect.bisect_left(a, 43)
5
>>> a[5]
50

So we step back one tick and start reading from 20 (the index point before 30). Records 21, 22, 23, 24 are read but thrown away because they are smaller than 25; once 25, 26, 27, ... are reached, they are returned to the user.

After reading past 40 (the index point before 50), each record is checked against 43; once a record greater than 43 is found (i.e. we are past the range of data to return), reading stops.

Overall, we only operate on a small portion of the large file to get the data the user wants.
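A simplified sketch of that read loop (not the author's exact code, which also tracks how many bytes of the indexed window have been consumed before checking the end key; here whole lines are compared as strings, which works because every line starts with its timestamp and the file is written in time order):

def read_range(fp_data, start_offset, begin_key, end_key):
    # start one index tick early, skip lines before the requested range,
    # stop at the first line past the end of the range
    fp_data.seek(start_offset)
    for line in fp_data:
        if line < begin_key:   # still before the range the user asked for
            continue
        if line > end_key:     # past the range: nothing more to return
            break
        yield line.rstrip('\r\n')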

Buffer

To reduce the number of disk writes when appending log data, the write buffer is set to 10 KB; the system default is around 4 KB.

Similarly, to improve the efficiency of reading logs, the read buffer is also set to 10 KB; both should be adjusted according to the size of your logs.

Reads and writes of the index file use line buffering, flushing to disk on every complete line, to avoid reading half-written index lines (in practice, even with line buffering set, half lines can occasionally still be read).
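In Python this is just the third argument to open(); a minimal sketch with the buffer sizes mentioned above (the file names are made up):

WRITE_BUFFER = 10 * 1024   # 10 KB buffer for appending log data
READ_BUFFER = 10 * 1024    # 10 KB buffer for sequential reads

fp_append = open('test.db', 'a', WRITE_BUFFER)   # buffered appends; creates the file if missing
fp_read = open('test.db', 'r', READ_BUFFER)      # buffered reads over the same file
fp_index = open('test.index', 'a+', 1)           # line buffered: flushed on every newline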

Query

What? You want SQL support? Stop fussing, how could 100 lines of code support SQL?

Queries are now passed in directly as a lambda expression: the system traverses the data rows within the specified time range, and only rows that satisfy the user's lambda condition are returned to the user.

Of course, this reads a lot of data the user does not need, and the lambda expression is evaluated on every row, but there is no way around it: simple is beautiful.
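A usage example, assuming the WawaDB class from the full code at the end of this article has already been defined; the substring '1024' is just an arbitrary filter borrowed from the author's own test:

from datetime import datetime, timedelta

db = WawaDB('test')  # WawaDB comes from the full listing below

begin = datetime.now() - timedelta(hours=3)
end = begin + timedelta(minutes=120)

# the lambda is evaluated on every line in the time range;
# only lines containing '1024' are yielded back
results = list(db.get_data(begin, end, lambda line: line.find('1024') != -1))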

Previously, I recorded the queried condition together with the log time and the log file offset in the index, looked up qualifying offsets through the index, and then did a seek and read in the log file for each matching record. That had only one benefit, reading less data, but two drawbacks:

The index file was extremely large and inconvenient to load into memory;

Every read required a seek first, which defeats the buffer and is particularly slow, four or five times slower than reading a contiguous segment of data and filtering it with a lambda.

Write

As mentioned earlier, writes are append only, never modifying data, and each log line is prefixed with its timestamp.
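Concretely, each stored line is a Unix timestamp, a space, then the payload, mirroring append_data in the full code below (the payload string here is made up):

import os
import time
from datetime import datetime

record_time = time.mktime(datetime.now().timetuple())   # seconds since the epoch
payload = 'user 42 logged in'                            # must not contain \r or \n
line = '%s %s %s' % (record_time, payload, os.linesep)   # timestamp first, then the data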

Multithreading


Data can be queried from multiple threads simultaneously; each query opens a new log file descriptor, so parallel reads do not conflict with each other.

For writes, although they are only append operations, I have not confirmed that appending to a file from multiple threads is safe, so it is recommended to use a queue and a dedicated writer thread.
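A minimal sketch of that suggestion (Python 3 here, unlike the original Python 2 listing; the db object is assumed to be a WawaDB instance as defined below):

import threading
import queue

write_queue = queue.Queue()

def writer_loop(db):
    # one dedicated thread owns the append file handle; other threads only enqueue
    while True:
        item = write_queue.get()
        if item is None:        # sentinel: shut the writer down
            break
        data, record_time = item
        db.append_data(data, record_time)
        write_queue.task_done()

# writer = threading.Thread(target=writer_loop, args=(db,), daemon=True)
# writer.start()
# write_queue.put(('some log line', datetime.now()))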

Lock

There are no locks.

Sort

By default, query results come back ordered by time. If another ordering is needed, load them into memory and use Python's sorted function; sort them however you like.
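For example, reusing the db, begin and end from the query example above, to re-order results by the payload that follows the timestamp (splitting on the first space is an assumption about the line layout):

rows = list(db.get_data(begin, end))
# sort by the text after the timestamp instead of by time
rows_by_payload = sorted(rows, key=lambda line: line.split(' ', 1)[1])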


The 100-odd lines of database code (Python 2):

# -*- coding: utf-8 -*-
import os
import time
import bisect
import itertools
from datetime import datetime
import logging

default_data_dir = './data/'
default_write_buffer_size = 1024 * 10
default_read_buffer_size = 1024 * 10
default_index_interval = 1000


def ensure_data_dir():
    if not os.path.exists(default_data_dir):
        os.makedirs(default_data_dir)


def init():
    ensure_data_dir()


class WawaIndex:
    def __init__(self, index_name):
        self.fp_index = open(os.path.join(default_data_dir, index_name + '.index'), 'a+', 1)
        self.indexes, self.offsets, self.index_count = [], [], 0
        self.__load_index()

    def __update_index(self, key, offset):
        self.indexes.append(key)
        self.offsets.append(offset)

    def __load_index(self):
        self.fp_index.seek(0)
        for line in self.fp_index:
            try:
                key, offset = line.split()
                self.__update_index(key, offset)
            except ValueError:  # if the index was not flushed, a half-written line may be read
                pass

    def append_index(self, key, offset):
        self.index_count += 1
        if self.index_count % default_index_interval == 0:
            self.__update_index(key, offset)
            self.fp_index.write('%s %s %s' % (key, offset, os.linesep))

    def get_offsets(self, begin_key, end_key):
        left = bisect.bisect_left(self.indexes, str(begin_key))
        right = bisect.bisect_left(self.indexes, str(end_key))
        left, right = left - 1, right - 1
        if left < 0:
            left = 0
        if right < 0:
            right = 0
        if right > len(self.indexes) - 1:
            right = len(self.indexes) - 1
        logging.debug('get_index_range: %s %s %s %s %s %s',
                      self.indexes[0], self.indexes[-1], begin_key, end_key, left, right)
        return self.offsets[left], self.offsets[right]


class WawaDB:
    def __init__(self, db_name):
        self.db_name = db_name
        self.fp_data_for_append = open(os.path.join(default_data_dir, db_name + '.db'), 'a', default_write_buffer_size)
        self.index = WawaIndex(db_name)

    def __get_data_by_offsets(self, begin_key, end_key, begin_offset, end_offset):
        fp_data = open(os.path.join(default_data_dir, self.db_name + '.db'), 'r', default_read_buffer_size)
        fp_data.seek(int(begin_offset))

        line = fp_data.readline()
        find_real_begin_offset = False
        will_read_len, read_len = int(end_offset) - int(begin_offset), 0
        while line:
            read_len += len(line)
            if (not find_real_begin_offset) and (line < str(begin_key)):
                line = fp_data.readline()
                continue
            find_real_begin_offset = True
            if (read_len >= will_read_len) and (line > str(end_key)):
                break
            yield line.rstrip('\r\n')
            line = fp_data.readline()

    def append_data(self, data, record_time=datetime.now()):
        def check_args():
            if not data:
                raise ValueError('data is null')
            if not isinstance(data, basestring):
                raise ValueError('data is not string')
            if data.find('\r') != -1 or data.find('\n') != -1:
                raise ValueError('data contains linesep')

        check_args()

        record_time = time.mktime(record_time.timetuple())
        data = '%s %s %s' % (record_time, data, os.linesep)
        offset = self.fp_data_for_append.tell()
        self.fp_data_for_append.write(data)
        self.index.append_index(record_time, offset)

    def get_data(self, begin_time, end_time, data_filter=None):
        def check_args():
            if not (isinstance(begin_time, datetime) and isinstance(end_time, datetime)):
                raise ValueError('begin_time or end_time is not datetime')

        check_args()

        begin_time, end_time = time.mktime(begin_time.timetuple()), time.mktime(end_time.timetuple())
        begin_offset, end_offset = self.index.get_offsets(begin_time, end_time)

        for data in self.__get_data_by_offsets(begin_time, end_time, begin_offset, end_offset):
            if data_filter:
                if data_filter(data):
                    yield data
            else:
                yield data


def test():
    from datetime import datetime, timedelta
    import uuid, random
    logging.getLogger().setLevel(logging.NOTSET)

    def time_test(test_name):
        def inner(f):
            def inner2(*args, **kargs):
                start_time = datetime.now()
                result = f(*args, **kargs)
                print '%s take time:%s' % (test_name, (datetime.now() - start_time))
                return result
            return inner2
        return inner

    @time_test('gen_test_data')
    def gen_test_data(db):
        now = datetime.now()
        begin_time = now - timedelta(hours=5)
        while begin_time < now:
            print begin_time
            for i in range(10000):
                db.append_data(str(random.randint(1, 10000)) + ' ' + str(uuid.uuid1()), begin_time)
            begin_time += timedelta(minutes=1)

    @time_test('test_get_data')
    def test_get_data(db):
        begin_time = datetime.now() - timedelta(hours=3)
        end_time = begin_time + timedelta(minutes=120)
        results = list(db.get_data(begin_time, end_time, lambda x: x.find('1024') != -1))
        print 'test_get_data get %s results' % len(results)

    @time_test('get_db')
    def get_db():
        return WawaDB('test')

    if not os.path.exists('./data/test.db'):
        db = get_db()
        gen_test_data(db)
        # db.index.fp_index.flush()

    db = get_db()
    test_get_data(db)


init()
if __name__ == '__main__':
    test()