Simple analysis of a column-type database

Source: Internet
Author: User

Turn from: Simple analysis of a column-type database

These days look at the contents of the Data Warehouse and discover a new content-Columnstore. Once thought about the database row and column to be indexed, but did not think deeply, did not expect the database has started to develop.
First look at the interpretation of the column database on the wiki:

The

Column database is a database of data stored in a column-dependent storage schema and is primarily suitable for bulk data processing and ad hoc queries. Corresponding to the row database, the data in the row-related storage architecture for spatial allocation, mainly suitable for small batches of data processing, often used for online transaction-type data processing. The
database stores data as a two-dimensional table of rows and columns, but is stored as a one-dimensional string, such as one of the following tables:
EmpId Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000
This simple table includes the employee code (EMPID), the Name field (Lastname and Firstname), and the salary (Salary).
This table is stored in the computer's memory (RAM) and storage (hard disk). Although the memory and the hard disk are different in mechanism, the computer's operating system is stored in the same way. The database must store the two-dimensional table in a series of one-dimensional "bytes", and the operating system is written to memory or hard disk. The
Row database stores data values in a row together, then stores the next row of data, and so on.
1,smith,joe,40000;2,jones,mary,50000;3,johnson,cathy,44000; The
Column database stores the data values in a column together, then stores the next column of data, and so on.
A. Smith,jones,johnson; joe,mary,cathy;40000,50000,44000;
This is a simplified statement.

Yesterday installed the next two MySQL-based data Warehouse, INFINDB and Infobright, read the document found they are a column database, more than 40 m of data import infobright, did not expect data files only 1M, compression ratio surprised me!
Then test the next selection of a column, in the column to do the calculation, are faster than MyISAM and InnoDB, read some of the principles of the document, I simulated a bit, wrote a program test.
Reading from memory is very efficient, but reading from disk (assuming that the index of the row database is in memory) the Bill database is slow (starting on Twitter says that the program is wrong on the line), but I think it's still my design problem, at least infobright than myisam/ InnoDB fast, the column should also have its own special indexing mechanism and caching mechanisms, such as separate files for each column, so that the file pointer transfer is faster.
2010-02-04 supplement: With multiple file pointers, column storage is significantly faster, and if you give each column a pointer to a file, it can be very efficient, and also be sure that if each column stores a single file, the efficiency will increase. The efficiency of the column table reading in the file is now reduced by 4/5, which is close to the row table. Continued optimization can also improve.

Code, expand:

#include <iostream> #include <fstream> #include <cstdlib> #include <memory> #include <string > #include <cstring> #include <time.h> #include <map>  #define MAXINT rand_max#define MAXROWS 1000000#define minval 1#define maxval 150000000 using namespace std;  /* Timer */class timer {public://Construction Letter    Number of Timer ();    destructor ~timer ();    Start time void begin ();    Chronograph ends void end ();    Get time, Ms Double Get_time ();p rivate:clock_t start, finish; Double time;};     timer::timer () {start = 0; finish = 0;}     timer::~timer () {start = 0; finish = 0;}  void Timer::begin () {start = Clock ();}  void Timer::end () {finish = Clock ();}     double Timer::get_time () {time = (double) (Finish-start)/clocks_per_sec*1000; return time;}             //Timer timer; /* Record the space occupancy of various structural tables */struct Size {struct {struct {int _static;        int _dynamic;    } Col,row; } mem, file;}            size; /* records the file pointers of the various structural tables */struct files {struct {struct {fstream _static;        FStream _dynamic;    } Table,index; } Col,row;}    file; /* Static Row-table structure */struct staticrowtable {int id;    Char name[255];    int num;    Double score; BOOL Flag;}    * static_row_table; /* static Row table index */struct staticrowtableindex {multimap<int,int> id;    Multimap<char*,int> name;    Multimap<int,int> num;    Multimap<double,int> score; Multimap<bool,int> Flag;}    static_row_table_index; /* static column table structure */struct staticcoltable {int* id;    char (*name) [255];    int* num;    Double* score; bool* Flag;}    static_col_table; /* Dynamic Row-table structure */struct dynamicrowtable {int id;    int Char_len;    Char *name;    int num;    Double score; BOOL Flag;}    * dynamic_row_table; /* Dynamic Row table index */struct dynamicrowtableindex {multimap<int,int> id; Multimap<char*,int> Name    Multimap<int,int> num;    Multimap<double,int> score; Multimap<bool,int> Flag;}    dynamic_row_table_index; /* Dynamic column Table structure */struct dynamiccoltable {int* id;    int* Char_len;    char** name;    int* num;    Double* score; bool* Flag;} * dynamic_col_table; /* Random character */char Randchar () {return rand ()%26+ ' A ';}     /* Random string */void randstring (char col[], int len) {for (int i=0; i<len; ++i) {Col[i] = Randchar ();     }} /* Initializing table Data */void init_statictable () {double time;  cout << "+-----static Data-----+" << endl;     Allocate space cout << "Allocate space ..." << Endl;    Timer.begin ();    static_row_table = new Staticrowtable[maxrows];    Static_col_table.id = new Int[maxrows];    Static_col_table.name = new char[maxrows][255];    Static_col_table.num = new Int[maxrows];    Static_col_table.score = new Double[maxrows];    Static_col_table.flag = new Bool[maxrows];    Timer.end (); Time = timer.get_tIME (); cout << "Space allocation complete!" << Endl << "time to allocate Space:" << duration << "MS" << endl;&nbsp    ;    Generate random data and index cout << "Generate Data ..." << Endl;    Timer.begin ();        for (int i=0; i<maxrows; ++i) {Static_col_table.id[i] = Static_row_table[i].id = i; Static_row_table_index.id.insert (pair<int,int> (static_row_table[i].id,i));  randstring (static_row_        Table[i].name, Rand ()%20+1);        strcpy (Static_col_table.name[i],static_row_table[i].name); Static_row_table_index.name.insert (pair<char*,int> (static_col_table.name[i],i));  Static_col_        Table.num[i] = Static_row_table[i].num = rand (); Static_row_table_index.num.insert (pair<int,int> (static_row_table[i].num,i));  Static_col_table.score        [i] = Static_row_table[i].score = rand ()/rand ();      Static_row_table_index.score.insert (pair<double,int> (static_row_table[i].score,i));   Static_col_table.flag[i] = Static_row_table[i].flag = rand ()%2;    Static_row_table_index.flag.insert (pair<bool,int> (static_row_table[i].flag,i));    } timer.end ();    Time = Timer.get_time ();    cout << "Data generation complete!" << Endl;    cout << "Time to generate data:" << timing << "MS" << endl;  //Initialize file pointer Timer.begin ();    File.row.table._static.open ("Row_table_static.dat", Ios::binary | ios::out);    File.row.index._static.open ("Row_index_static.dat", Ios::binary | ios::out);        File.col.table._static.open ("Col_table_static.dat", Ios::binary | ios::out);  if (!file.row.table._static | |        !file.row.index._static | |    !file.col.table._static) {cout << "failed to open file" << Endl;    }  cout << "writing data to File ..." << Endl; for (int i=0; i<maxrows; ++i) {file.row.table._static.write (Reinterpret_cast<char *> (&static_row_table                                   [i]), sizeof (staticrowtable));    } file.row.table._static.close (); for (int i=0; i<maxrows; ++i) {file.row.index._static.write (Reinterpret_cast<char *> (&static_row_table    _index), sizeof (Staticrowtableindex)); } file.row.index._static.close ();  for (int i=0; i<maxrows; ++i) {file.col.table._static.write (reinte    Rpret_cast<char *> (&static_col_table.id[i]), sizeof (int)); } for (int i=0; i<maxrows; ++i) {file.col.table._static.write (Reinterpret_cast<char *> (static_col_table    . Name[i]), sizeof (char[255)); } for (int i=0; i<maxrows; ++i) {file.col.table._static.write (Reinterpret_cast<char *> (&static_col_    Table.num[i]), sizeof (int)); } for (int i=0; i<maxrows; ++i) {file.col.table._static.write (Reinterpret_cast<char *> (&static_col_ TAble.score[i]), sizeof (double)); } for (int i=0; i<maxrows; ++i) {file.col.table._static.write (Reinterpret_cast<char *> (&static_col_    Table.flag[i]), sizeof (BOOL));    } file.col.table._static.close ();    Timer.end ();    Time = Timer.get_time ();    cout << "Data write complete!" << Endl; cout << "Write Data time:" << timing << "MS" << endl; //Calculates total footprint size.mem.row._static =    sizeof (*static_row_table) *maxrows +sizeof (static_row_table_index) *maxrows; Size.mem.col._static = (sizeof (int) *2+sizeof (double) +sizeof (bool) +sizeof (char) *255 ) *maxrows;  cout << "Static row storage consumes space:" << size.mem.row._static/1024/1024 << "M" << en    dl cout << "Static Columnstore Cost:" << size.mem.col._static/1024/1024 << "M" << Endl;}     void init_dynamictable () {Double time;  cout << "+-----Dynamic Data-----+" << Endl;}  void init () {double time1, time2;  Srand (Time (0));  cout << "====== Generate Data ======" << en    dl;  init_statictable (); Init_dynamictable ();}     /*select namefrom table WHERE num between minval and maxval;*/  /* test in-memory static row storage */int Mem_static_testrow () {    Double time;    int count = 0;    int id;    Multimap<int,int>::iterator it,itlow,itup;  cout << "Testing in-memory read row static table ..." << Endl;    Timer.begin ();    Itlow = Static_row_table_index.num.lower_bound (minval); Itup = Static_row_table_index.num.upper_bound (maxval);  for (it=itlow; it!=itup; ++it) {id = (*it). Second        ;        staticrowtable row = Static_row_table[id];        Results//cout << row.id;        /*cout << ' t ' << */row.name;        cout << ' t ' << row.num;        cout << Endl;    Count ++count;  } timer.end ();  Time = Timer.get_time ();    cout << "Memory row static table read Test finished!" << Endl; cout << "Read time:" << timing << "MS" << endl;  return count;}     /* static row storage in test disk */int File_static_testrow () {double time;    int count = 0;    int id;    Char *name;    int num;    int POS;    Staticrowtable Row;     Multimap<int,int>::iterator it,itlow,itup; //initialization file pointer cout << "Reading rows static table in test disk ..." << Endl;    Timer.begin ();    File.row.table._static.open ("Row_table_static.dat", Ios::binary | ios::in);        File.row.index._static.open ("Row_index_static.dat", Ios::binary | ios::in);  if (!file.row.table._static) {    cout << "Failed to open file" << Endl;    }//Assuming that the index is in memory Itlow = Static_row_table_index.num.lower_bound (minval); Itup = Static_row_table_index.num.upper_bound (maxval);  for (it=itlow; it!=itup; ++it) {id = (*it). Second        ;        pos = sizeof (staticrowtable) *id; FILE.ROW.TABLE._STATIC.SEEKG (POS);       File.row.table._static.read (Reinterpret_cast<char *> (&row), sizeof (Sta        ticrowtable));        Results//cout << row.id;        /*cout << ' t ' << */row.name;        cout << ' t ' << row.num;        cout << Endl;    Count ++count;    } file.row.table._static.close ();    File.row.index._static.close ();    Timer.end ();    Time = Timer.get_time ();    cout << "On-disk static table read Test complete!" << Endl; cout << "Read time:" << timing << "MS" << endl;  return count;}     /* static column storage in test disk */int Mem_static_testcol () {double time;    int count = 0;    int id;    int num;    Char *name;  cout << "Testing in-memory column static table reads ..." << Endl;    Timer.begin ();        for (int i=0; i<maxrows; ++i) {int num = static_col_table.num[i];            if (Num>minval and Num<maxval) {//result//cout << i; /*cout << ' t ' <&Lt            */static_col_table.name[i];            cout << ' t ' << static_col_table.num[i];            cout << Endl;        Count ++count;    }} timer.end ();    Time = Timer.get_time ();    cout << "In-memory column static storage table read Test finished!" << Endl; cout << "Read time:" << timing << "MS" << endl;  return count;}     /* static column storage in test disk */int File_static_testcol () {double time;    int count = 0;    int id;    int num;    Char *name = new char[255];    int pos_num;    int pos_name;    int pos;  cout << "Testing a column static table read in Disk ..." << Endl;    Timer.begin ();    File.col.table._static.open ("Col_table_static.dat", Ios::binary | ios::in);        FStream tmpfile ("Col_table_static.dat", Ios::binary | ios::in);  if (!file.col.table._static | |!tmpfile) {    cout << "Failed to open file" << Endl;    }  pos_name = sizeof (int) *maxrows;    Pos_num = (sizeof (int) +sizeof (char[255])) *maxrows; File.col.table. _STATIC.SEEKG (Pos_num);                                    for (int i=0; i<maxrows; ++i) {file.col.table._static.read (Reinterpret_cast<char *>),        sizeof (int));            if (Num>minval and Num<maxval) {//result id = i;            cout << ID;            pos = pos_name+sizeof (char[255]) *id;            TMPFILE.SEEKG (POS);            Tmpfile.read (Reinterpret_cast<char *> (name), sizeof (char[255]));            /*cout << ' t ' << */name;            cout << ' t ' << num;            cout << Endl;        Count ++count;    }} file.col.table._static.close ();    Timer.end ();    Time = Timer.get_time ();    cout << "On-disk static Columnstore table read Test finished!" << Endl; cout << "Read time:" << timing << "MS" << endl;  return count;}  void Test () {int count1, count2, Count3, count4;  cout << "===== Memory Access Test =====" << Endl;   cout << "+----static table test----+" << Endl;    cout << "* Row type Storage *" << Endl;    In-memory static row storage Table Count1 = Mem_static_testrow ();    In-memory static columnstore Table Count2 = Mem_static_testcol ();    cout << "* Column storage *" << Endl;    Static row-on-disk storage table Count3 = File_static_testrow ();        Static row-on-disk storage table Count4 = File_static_testcol ();  if (Count1==count2 and Count2==count3 and Count3==count4) {    cout << "co-matching:" << count1 << "lines" << Endl;    } else {cout << "error: Different rows per match" << Endl;    } } int Main () {init ();    Test ();    cout << "All ok!" << Endl; return 0;}

2010-02-04 Test Results:
====== Generating Data ======
+-– static Data-–+
Allocate space ...
Space allocation complete!
Allotted space time: 0ms
Generate Data in ...
Data Generation Complete!
Time to generate data: 4180ms
Writing data to File ...
Data Write Complete!
Write data time consuming: 2480ms
Static row-type storage consumes space: 495M
Static Columnstore space: 259M
+-– Dynamic Data-–+
===== Memory Access Test =====
+--static table test in--+
* Row-Type Storage *
Testing in-memory read-row static table ...
In-memory row static table read Test finished!
Read time: Ten MS
Testing in-memory column static table reads ...
In-memory column static storage table read Test finished!
Read Time: 0 ms
* In-row storage *
Reading row static table in test disk ...
On-disk static table read Test finished!
Read time: MS
Testing the disk for column static table reads ...
On-disk column static storage table read Test finished!
Read time: MS
Total matches: 69650 rows
All ok!

Simple analysis of a column-type database

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.