To understand the fundamentals of database storage technology, we first need to understand the data characteristics of enterprise applications and the bottlenecks they face.
1. Data characteristics of enterprise applications
Take courier tracking as an example: each reading records the location where it was taken, a timestamp, the current business step (such as pickup, packaging, or shipping), and other details. The analysis of online game activity data is, of course, more complex. Each individual event is small (bytes to kilobytes), and each entity can produce many events.
- Combination of structured/unstructured data
Structured data is stored in a standard format and can be processed automatically by a computer. Unstructured data has no predefined data model and cannot readily be processed automatically; examples are video, images, and free-form text. In patient data, for instance, gender and age are structured, while medical history and diagnostic notes are unstructured. Enterprises need to process unstructured data as well in order to search it efficiently.
2. Bottlenecks in the database
Modern businesses tend to be "data-driven". Enterprises need to process the large volumes of data generated by people and machines to support decision-making, to integrate data from different sources, and to analyze data interactively in real time. The rate of data transfer is limited by the CPU bus, while parallel processing can demand more data than the bus can deliver. In an in-memory database, disks are used only for backup and archiving and no longer matter for the performance of online services. Access to main memory therefore becomes the new bottleneck for the database.
3. How to relieve the bottleneck
Splitting the data across several databases (sharding) can, of course, ease the memory bottleneck, but the essential remedy is to reduce memory accesses. We should keep the number of bits used to represent the data as small as possible, which reduces both memory consumption and memory traffic. At the same time, a query should access only the columns it actually uses.
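The idea of touching only the columns a query needs can be sketched as follows. This is a minimal illustration, not a real storage engine: the table, its attribute names, and both helper functions are made up for the example, and Python lists only mimic the two layouts without reproducing their memory-level behavior.

```python
# Hypothetical table with three attributes; names and values are invented.
rows = [
    {"name": "Alice", "country": "DE", "age": 34},
    {"name": "Bob", "country": "US", "age": 51},
    {"name": "Carol", "country": "DE", "age": 29},
]

def average_age_row_layout(table):
    # Row layout: the scan walks record by record, although it needs one field.
    return sum(record["age"] for record in table) / len(table)

# Column layout: each attribute is kept as its own list.
columns = {key: [record[key] for record in rows] for key in rows[0]}

def average_age_column_layout(cols):
    # Only the "age" column is read; "name" and "country" are never touched.
    ages = cols["age"]
    return sum(ages) / len(ages)

print(average_age_row_layout(rows), average_age_column_layout(columns))
```

Both functions return the same result; the difference lies in how much data a scan has to pass over to produce it.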
The most basic way to shrink the data representation is dictionary encoding, which is simple to use and is also the basis of other column compression techniques.
4. Dictionary encoding
Dictionary encoding works column by column. A simple mapping replaces each distinct value with a distinct integer (preferably a short one), so long text values are compressed into short integers while the number of rows in the table stays the same. Enterprise data generally has low entropy, that is, values repeat often, so the compression works well. Take a gender column as an example: the column contains only two values, and storing them as "M" and "F" takes 1 byte per row. Assuming 7 billion people worldwide, the uncompressed column needs 7 billion × 1 byte, or about 6.52 GB (counting 1 GB as 2^30 bytes). With dictionary encoding, 1 bit per row is enough to express the same information, which requires 7 billion × 1 bit, or about 0.81 GB, plus 2 × 1 byte = 2 bytes for the dictionary. Compression ratio = uncompressed size / compressed size ≈ 8.
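The gender example can be reproduced with a small sketch. The function names (dictionary_encode, dictionary_decode, bits_per_code) are hypothetical helpers written for this illustration, and the size estimate assumes, as the text does, 1 byte per raw value and 1 byte per dictionary entry.

```python
def dictionary_encode(column):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # position in this list is the code of the value
    code_of = {}      # reverse mapping: value -> code
    codes = []
    for value in column:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Recover the original column from the dictionary and the codes."""
    return [dictionary[code] for code in codes]

def bits_per_code(distinct_values):
    """Minimum number of bits needed to tell `distinct_values` codes apart."""
    return max(1, (distinct_values - 1).bit_length())

sex_column = ["M", "F", "F", "M", "F"]
dictionary, codes = dictionary_encode(sex_column)
# dictionary is ["M", "F"] and codes is [0, 1, 1, 0, 1]

# Size estimate for 7 billion rows with a two-entry dictionary.
row_count = 7_000_000_000
raw_bytes = row_count * 1
encoded_bytes = row_count * bits_per_code(len(dictionary)) / 8 + len(dictionary) * 1
compression_ratio = raw_bytes / encoded_bytes

print(dictionary, codes, round(compression_ratio, 2))
```

The dictionary itself is negligible here (2 bytes), so the ratio is governed almost entirely by the drop from 8 bits to 1 bit per row.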
Typically, dictionary encoding achieves a compression ratio of 10 to 20 for text data such as names, countries, and birthdays.
5. Sorted dictionaries
With the dictionary encoding described above, finding a value in the dictionary requires a full scan, which takes O(n) time. If the dictionary is sorted, binary search finds the value in only O(log n). This optimization comes at a price: adding a new value forces the dictionary to be re-sorted, and if the new value does not land at the end of the dictionary, the table data must be updated as well, because the codes of all values after the new one shift up by one. In general, we therefore want the dictionary to be compact yet already cover the values that will occur. Columns whose values can be enumerated in advance, such as countries and birthdays, are good candidates: a sorted dictionary gives them faster lookups with almost no risk of the dictionary changing.
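A minimal sketch of both effects, the fast lookup and the cost of an insert, might look like this. It uses Python's standard bisect module for the binary search; lookup_code, insert_value, and the sample country data are hypothetical and serve only as illustration.

```python
from bisect import bisect_left

def lookup_code(sorted_dictionary, value):
    """Binary search: return the code (index) of `value`, or None if absent."""
    i = bisect_left(sorted_dictionary, value)
    if i < len(sorted_dictionary) and sorted_dictionary[i] == value:
        return i
    return None

def insert_value(sorted_dictionary, codes, value):
    """Add `value` while keeping the dictionary sorted.

    Returns the new dictionary and the re-encoded column. Codes at or after
    the insert position shift up by one, so the column changes unless the
    new value lands at the end of the dictionary.
    """
    i = bisect_left(sorted_dictionary, value)
    if i < len(sorted_dictionary) and sorted_dictionary[i] == value:
        return sorted_dictionary, codes  # already present, nothing changes
    new_dictionary = sorted_dictionary[:i] + [value] + sorted_dictionary[i:]
    new_codes = [code + 1 if code >= i else code for code in codes]
    return new_dictionary, new_codes

countries = ["China", "Germany", "USA"]  # sorted dictionary
column = [2, 0, 1, 2]                    # USA, China, Germany, USA

germany_code = lookup_code(countries, "Germany")
countries_2, column_2 = insert_value(countries, column, "France")
# "France" sorts between "China" and "Germany", so later codes shift by one.

print(germany_code, countries_2, column_2)
```

Inserting a value that sorts last (for example "Zambia" here) leaves every existing code untouched, which is why appends are cheap and inserts in the middle are not.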
Reference books
[1] Hasso Plattner. A Course in In-Memory Data Management: The Inner Mechanics of In-Memory Databases. 2012.
Database Storage Technology Fundamentals (i) Dictionary encoding