How relational databases work - data structures (translated from the Coding Geek article)


This article is translated from the Coding Geek article "How does a relational database work".
Original link: http://coding-geek.com/how-databases-work/#Buffer-replacement_strategies
This article translates the following chapter:

1. Array, Tree and Hash Table

In the previous chapters we covered time complexity and merge sort. In this chapter I will introduce three data structures. They are very important because they are the foundation of modern database systems. I will also introduce the concept of a database index.

2. Array

A two-dimensional array is the simplest data structure, and a database table can be seen as a two-dimensional array. For example:


This two-dimensional array is a table structure with rows and columns:

    1. Each row represents an object
    2. The columns of a row describe the properties of that object
    3. Each column stores a single type of data (integer, string, date, ...)

Although a two-dimensional array is fine for storing data, performance is very poor when you need to query it based on a condition.

For example, if you want to find out who works in the UK, you have to scan every row and check whether that person works in the UK. This costs N operations (where N is the number of rows in the array). That doesn't sound too bad, but is there a faster way?
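To make that cost concrete, here is a minimal sketch of such a full scan in Python; the table contents are invented for illustration.

    table = [
        ("Alice", "UK"),
        ("Bob",   "France"),
        ("Carol", "UK"),
        ("Dave",  "Spain"),
    ]

    # Every one of the N rows is checked, so the cost grows linearly with N.
    uk_workers = [name for name, country in table if country == "UK"]
    print(uk_workers)   # ['Alice', 'Carol']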
This is where trees come in.

Note: Modern databases use more advanced structures to store table data, such as heap-organized tables or index-organized tables. But none of them solves the problem of quickly filtering rows by a condition on some columns.

3. Tree and Database Index

A binary search tree is a binary tree with a special property: the key of each node is greater than all the keys in its left subtree and smaller than all the keys in its right subtree. Consider such a tree with 15 nodes, and let's see how to find the value 208 in it:

    1. Start at the root node, whose key is 136. Since 136 < 208, continue in the right subtree of 136.
    2. 398 > 208, so continue in the left subtree of 398.
    3. 250 > 208, so continue in the left subtree of 250.
    4. 200 < 208, so look in the right subtree of 200. But 200 has no right subtree: 208 is not in the tree (if it existed, it would have to be in the right subtree of 200).

Now let's see how to find the value 40:

    1. Again, start at the root node 136. Since 136 > 40, continue in the left subtree of 136.
    2. 80 > 40, so continue in the left subtree of 80.
    3. 40 = 40: the node is found! Retrieve the row ID stored in this node (the position of the matching row in the table).
    4. With that row ID you can fetch the corresponding row immediately.

In both cases, the number of steps is at most the height of the tree. If you read the merge sort chapter carefully, you know that the height of a balanced tree is log(N), so the cost of a lookup is log(N). Not bad!
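Here is a toy Python sketch of the lookup just described; the node layout and the small example tree are illustrative, not the exact 15-node tree from the original figure.

    class Node:
        def __init__(self, key, row_id, left=None, right=None):
            self.key = key        # the indexed value
            self.row_id = row_id  # where the matching row sits in the table
            self.left = left
            self.right = right

    def search(node, key):
        """Walk down the tree; each comparison discards one subtree."""
        steps = 0
        while node is not None:
            steps += 1
            if key == node.key:
                return node.row_id, steps   # found: hand back the row ID
            node = node.left if key < node.key else node.right
        return None, steps                  # the key is not in the tree

    # A small tree containing the values used in the walkthrough above.
    root = Node(136, 0,
                left=Node(80, 1, left=Node(40, 2)),
                right=Node(398, 3, left=Node(250, 4, left=Node(200, 5))))

    print(search(root, 40))    # (2, 3)    -> row ID 2, found in 3 steps
    print(search(root, 208))   # (None, 4) -> not found, after 4 steps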

3.1 Back to Our Problem

That was quite abstract, so let's go back to our problem. Instead of simple integers, consider the strings that represent the country of each person in the earlier table, and suppose you have built a tree that contains the "country" column of that table:

    1. You want to know who works in the UK.
    2. You search the tree and reach the UK node.
    3. Inside the UK node you find the row IDs of all the people who work in the UK.

This search costs only log(N) operations instead of the N operations needed when scanning the array directly. Can you now guess what a database index is?
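As an illustration, here is a small sketch of such a single-column index in Python: each country maps to the row IDs where it appears, and a sorted key list with binary search stands in for the tree. The table contents are invented.

    import bisect

    # Illustrative table: (name, country); row IDs are the list positions.
    table = [("Alice", "UK"), ("Bob", "France"), ("Carol", "UK"), ("Dave", "Spain")]

    # Build the index: each distinct country maps to the rows where it appears.
    entries = {}
    for row_id, (_, country) in enumerate(table):
        entries.setdefault(country, []).append(row_id)
    keys = sorted(entries)                      # ["France", "Spain", "UK"]

    def rows_for(country):
        """Binary search over the sorted keys (log N steps), then return row IDs."""
        i = bisect.bisect_left(keys, country)
        if i < len(keys) and keys[i] == country:
            return entries[country]
        return []

    print(rows_for("UK"))                         # [0, 2]
    print([table[r][0] for r in rows_for("UK")])  # ['Alice', 'Carol']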

You can build an index on any group of columns: one string column, one integer column, two string columns, an integer column plus a string column, a date column, and so on. As long as you have a function to compare the keys, you can keep the keys ordered inside the tree (the database already provides comparison functions for its basic types).
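For example, in Python a tuple of the indexed columns already compares column by column, which is exactly the ordering a composite index needs (a toy illustration, not how a database really represents its keys).

    keys = [("UK", "Smith"), ("France", "Durand"), ("UK", "Adams")]
    print(sorted(keys))   # [('France', 'Durand'), ('UK', 'Adams'), ('UK', 'Smith')]
    # The same comparison decides where each key would go in the index tree.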

3.2 B+ Tree Index

Although the binary search tree above works well when you look up one specific value, it performs very badly when you need all the values in a given range. That would cost N operations, because every node of the tree would have to be compared against the range, and it is also heavy on I/O because the whole index tree would be read. We need an efficient way to run range queries. To solve this problem, modern databases use a modified tree called a B+ tree, which is an optimization of the binary search tree above. In a B+ tree:

    1. Only the leaf nodes store the row IDs of the associated table.
    2. The other nodes are only there to route the search to the right leaf node.


As you can see, a B+ tree stores roughly twice as many nodes. The extra nodes are "decision nodes" whose only job is to route you to the right leaf node (the node that stores the row IDs). The search complexity is still log(N); the tree simply has one more level. The big difference is that each leaf node stores a link to its successor.

In this B+ tree, if you look for the values between 40 and 100:

    1. You only need to search for 40 (or the smallest value greater than 40 if 40 does not exist), exactly as in the binary search tree above.
    2. Then you collect the successors of 40 by following the links between leaves, until you reach 100 (or the largest value below 100).

Say you find M successors and the tree has N nodes. Locating the starting node (40) costs log(N), just like the binary search tree, but once you have it you need M more steps to collect the M successors. So the range query costs O(M + log(N)), compared with O(N) for the plain binary search tree above. The bigger the data, the bigger the gap. And you do not need to read the full tree, which also means fewer disk reads.
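Here is a small sketch of that range scan in Python: the tree descent is replaced by a binary search over the leaf keys, and each leaf holds a single key for simplicity. The structure and values are invented for illustration.

    import bisect

    class Leaf:
        """One leaf entry: an indexed key, the matching row ID, and a link
        to the next leaf in key order (simplified to one key per leaf)."""
        def __init__(self, key, row_id):
            self.key = key
            self.row_id = row_id
            self.next = None

    def build_leaf_chain(entries):
        """entries: (key, row_id) pairs sorted by key -> linked leaves."""
        leaves = [Leaf(k, r) for k, r in entries]
        for a, b in zip(leaves, leaves[1:]):
            a.next = b
        return leaves

    def range_scan(leaves, low, high):
        # Stand-in for the tree descent: binary search for the first key >= low.
        keys = [leaf.key for leaf in leaves]
        i = bisect.bisect_left(keys, low)              # the log(N) part
        node = leaves[i] if i < len(leaves) else None
        row_ids = []
        while node is not None and node.key <= high:   # the M leaf hops
            row_ids.append(node.row_id)
            node = node.next
        return row_ids

    leaves = build_leaf_chain([(25, 0), (40, 1), (77, 2), (100, 3), (130, 4)])
    print(range_scan(leaves, 40, 100))   # [1, 2, 3]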

However, this brings new problems (yet again!). If you add or delete a row in the database, you also have to update its B+ tree index:

    1. You have to keep the nodes of the B+ tree in order, otherwise you will never find anything inside a scrambled tree.
    2. You have to keep the leaf nodes sorted from smallest to largest, otherwise the complexity of a range query degrades from O(log(N)) to O(N).

In other words, the B+ tree has to keep itself balanced and ordered. Thankfully, smart deletion and insertion operations let a B+ tree maintain these properties, but that comes at a cost: insertions and deletions are O(log(N)). This is why you often hear that having too many indexes is not a good idea: it slows down INSERT/UPDATE/DELETE operations, because the database has to update every index of the table as well, and each index update costs O(log(N)).

Translator's note: everything has two sides, and every advantage comes with a drawback; which data structure to choose depends on your application scenario.

Moreover, indexes increase the workload of the transaction manager (the transaction manager is covered in a later chapter).

For more details, you can look up B+ trees on Wikipedia. If you want an example of a B+ tree implementation, you can read this article (https://blog.jcole.us/2013/01/07/the-physical-structure-of-innodb-index-pages/) by a core MySQL developer; it explains in detail how InnoDB (a MySQL storage engine) implements its indexes.

4. Hash Table

Our last important data structure is the hash table. Hash tables are very useful when you need to look up values quickly. Understanding them will also help us later with a common database join technique called the hash join. Hash tables are also used to store internal database structures such as the lock table and the buffer pool; both concepts will be covered later.

A hash table is a data structure that quickly finds an element by its key. To build one, you need to define the following (a small sketch follows the list):

1) a key for your elements
2) a hash function for the keys. The hash of a key gives the location where the element is stored (called a bucket).
3) a comparison function for the keys. Once the right bucket is located, you use this function to find the element you are looking for inside the bucket.
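As a rough illustration of these three ingredients, here is a tiny bucket-based hash table in Python; the class and its names are invented for this article, not a real database structure. It is reused in the example of the next subsection.

    class HashTable:
        """A toy bucket-based hash table built from the three ingredients above."""

        def __init__(self, n_buckets, hash_fn, equals_fn):
            self.buckets = [[] for _ in range(n_buckets)]
            self.hash_fn = hash_fn        # 2) maps a key to a bucket number
            self.equals_fn = equals_fn    # 3) compares two keys inside a bucket

        def _bucket(self, key):
            return self.buckets[self.hash_fn(key) % len(self.buckets)]

        def put(self, key):
            self._bucket(key).append(key)

        def find(self, key):
            """Return (found, steps): one step to hash, one per comparison."""
            steps = 1                     # computing the hash / locating the bucket
            for element in self._bucket(key):
                steps += 1
                if self.equals_fn(element, key):
                    return True, steps
            return False, steps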

4.1 A Simple Example


This hash table has 10 buckets. I only drew 5 of them in the picture; please imagine the other 5 yourself. The hash function I use is the key modulo 10 (the remainder after dividing by 10). In other words, the last digit of a key determines its bucket:

    1. If the last digit of the number is 0, the number goes into bucket 0
    2. If the last digit of the number is 1, the number goes into bucket 1
    3. If the last digit of the number is 2, the number goes into bucket 2
    4. ...

The comparison function I use simply checks whether two integers are equal.

Let's see how to find the element 78 in the hash table:

    1. Compute the hash of 78, which is 8.
    2. Look inside bucket 8: the first element there is 78.
    3. Return the element 78.
    4. The search takes only 2 steps: one to compute the hash and locate the bucket, and one to check the element inside the bucket.

Now let's see how to find the element 59:

    1. Compute the hash of 59, which is 9.
    2. Look inside bucket 9: the first element is 99. Since 99 != 59, it is not the element we are looking for.
    3. Using the same logic, check the second element (9), the third (79), ... and the last one (29).
    4. The element 59 does not exist in the table.
    5. This search costs 7 operations (see the sketch below).
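Reusing the HashTable sketch from above, here is the 10-bucket, modulo-10 setup of this example. The contents of bucket 9 other than 99, 9, 79 and 29 are invented so that the step counts match the walkthrough.

    table = HashTable(n_buckets=10,
                      hash_fn=lambda k: k % 10,        # the last digit picks the bucket
                      equals_fn=lambda a, b: a == b)   # integer equality

    for value in (78, 99, 9, 79, 109, 119, 29):        # bucket 9 receives six elements
        table.put(value)

    print(table.find(78))   # (True, 2)  -> 1 hash step + 1 comparison
    print(table.find(59))   # (False, 7) -> 1 hash step + 6 comparisons, not found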
4.2 A Good Hash Function

As you can see, the cost of a search depends on the value you are looking for.

If I change the hash function to the key modulo 1,000,000 (in other words, the last 6 digits choose the bucket), the second search above costs only 1 operation, because there is no element in bucket 000059. The real challenge is to find a good hash function that keeps the number of elements in each bucket very small.

In my example, finding a good hash function is easy. But this was a simple example; finding a good hash function is much harder when the key is:
1. a string (for example, a person's last name)
2. two strings (for example, a person's first and last name)
3. two strings plus a date (for example, a person's first name, last name and date of birth)

With a good hash function, a lookup in a hash table costs O(1).
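As a rough illustration, a composite key such as (first name, last name, date of birth) can be hashed by combining its fields; the sketch below simply leans on Python's built-in hash() over a tuple, whereas a real database uses its own carefully tuned hash functions.

    from datetime import date

    N_BUCKETS = 1_000_000

    def bucket_for(first_name, last_name, birthday):
        # hash() combines the tuple's fields into one integer;
        # the modulo maps that integer onto the available buckets.
        return hash((first_name, last_name, birthday)) % N_BUCKETS

    print(bucket_for("John", "Smith", date(1990, 5, 17)))
    print(bucket_for("John", "Smith", date(1991, 5, 17)))  # almost surely a different bucket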

5. Array vs. Hash Table

Why not use arrays? Good question.

    1. A hash table can be partially loaded into memory while the rest stays on disk, which saves memory.
    2. An array has to use a contiguous block of memory. If you load a large table, it is very hard to find a large enough contiguous block, and the allocation is likely to fail.
    3. A hash table lets you choose any combination of fields as the key (for example, a person's country, or a person's country plus last name), as sketched below.
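As a small illustration of the last point, here is a hash table (a plain Python dict) keyed by the combination (country, name); the data is invented.

    table = [("Alice", "UK"), ("Bob", "France"), ("Carol", "UK")]

    index = {}
    for row_id, (name, country) in enumerate(table):
        index[(country, name)] = row_id     # composite key -> row ID

    print(index[("UK", "Carol")])           # 2 -> direct lookup, no full scan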

For more information, you can read the article on the Java HashMap, which is an example of an efficient hash table implementation. You do not need to understand Java to understand the concepts in that article.

Translated "How does a relational" other chapters link:
1. How relational databases work-time complexity: http://blog.csdn.net/ylforever/article/details/51205332
2. How relational databases work-merge sort: http://blog.csdn.net/ylforever/article/details/51216916
3. Relational database working principle-data structure: http://blog.csdn.net/ylforever/article/details/51278954
4. How relational databases Work-cache: http://blog.csdn.net/ylforever/article/details/50990121
5. How relational databases work-transaction management (i): http://blog.csdn.net/ylforever/article/details/51048945
6. How relational databases work-transaction management (b): http://blog.csdn.net/ylforever/article/details/51082294

