What is database sharding technology?

Last Update:2013-11-25 Source: Internet

Author: User

Tags database sharding


1. Overview

Table sharding is a popular concept. Under heavy load in particular, it is a good way to spread pressure on the database.

First, it helps to understand why table sharding is needed and what it buys you. Consider roughly how a database executes SQL: when a statement arrives, it is placed in the execution queue, the analyzer parses it, data is extracted or modified according to the parse result, and the result is returned. This description may not be exactly right; it is only my own understanding of the process.

What is the most common problem in this process? If the previous statement has not finished, the statements behind it cannot run, because the data table file must be locked to guarantee data integrity. There are two kinds of locks: shared locks and exclusive locks. Under a shared lock, other threads can still read the data file during the lock period but cannot modify it. Under an exclusive lock, the whole file belongs to one thread and no other thread can access it.

MyISAM is generally regarded as the fastest storage engine in MySQL, but it locks at the table level: once a table is locked, the entire data file is inaccessible to everyone else, and the next operation is accepted only after the current one completes. Statements queued behind an unfinished operation simply wait. This blocking is usually called "table locking".

What are the direct consequences of table locking? A large number of SQL statements cannot execute immediately and must wait until everything ahead of them in the queue has finished. Those statements either return no result or return it after a serious delay, which hurts the user experience.
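The queueing effect described above can be made concrete with a toy model. The sketch below is only an illustration, not a model of MySQL internals: it assumes every statement holds an exclusive table-level lock for its whole duration, and the function name queue_wait_times and the durations are made up for the example.

```php
<?php
// Illustrative model only: statements against one table run strictly one
// after another, as if each held an exclusive table-level lock.

/**
 * Given execution times (in ms) in queue order, return how long each
 * statement waits before it can start.
 */
function queue_wait_times(array $durationsMs): array
{
    $waits = [];
    $elapsed = 0;
    foreach ($durationsMs as $duration) {
        $waits[] = $elapsed;    // time spent blocked behind earlier statements
        $elapsed += $duration;  // the lock is released only when this one ends
    }
    return $waits;
}

// Hypothetical queue: one slow statement followed by three quick ones.
$waits = queue_wait_times([500, 10, 10, 10]);
print_r($waits); // the quick statements wait 500, 510 and 520 ms
```

Even statements that need only 10 ms are delayed by half a second here, which is the kind of delay the user ends up seeing.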
This matters most for frequently used tables, such as the user information table in an SNS system or the post table in a forum system. These tables see a very high access volume, and to extract data quickly and return it to users you need some way to deal with the problem. That is the table sharding technique discussed here.

As the name suggests, table sharding splits a table that stores one type of data into several sub-tables. When data is extracted, different users hit different tables and do not conflict with one another, which reduces the chance of table locking. For example, suppose the user table is split into two sub-tables, user_1 and user_2, which store different users: user_1 holds the first 100,000 users and user_2 holds the second 100,000. If the users heiyeluren1 and heiyeluren2 are queried at the same time, the rows come from different tables, so the queries are less likely to block each other.

I have not tested the two sharding methods described below, so I cannot guarantee that they work exactly as described; they are offered as design ideas only. The examples are based on a post bar (forum) system. (If you have never used a post bar service, please Google it.)

2. Table sharding based on a basic table

In this method there is a main table, the basic table, that stores all the basic information. To find the table in which an item is stored, you look up the corresponding table name and related fields in the basic table and then access that table directly. If the basic table is not fast enough, you can keep the whole table in a cache or in memory for quick lookups.

For the post bar we construct the following three tables:

1. Board table: stores the information about each board in the post bar.
2. Topic table: stores the topic information within each board, for browsing.
3. Reply table: stores the original content of each topic and its replies.

The board table contains the following fields:

Field | Column | Type
Board ID | board_id | int(10)
Board name | board_name | char(50)
Sub-table ID | table_id | smallint(5)
Creation time | created | datetime

The topic table contains the following fields:

Field | Column | Type
Topic ID | topic_id | int(10)
Topic name | topic_name | char(255)
Board ID | board_id | int(10)
Creation time | created | datetime

The reply table contains the following fields:

Field | Column | Type
Reply ID | reply_id | int(10)
Reply content | reply_text | text
Topic ID | topic_id | int(10)
Board ID | board_id | int(10)
Creation time | created | datetime
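As a rough illustration of how one pair of topic and reply sub-tables could be created from these field lists, the sketch below generates the DDL as strings. The function name build_shard_ddl, the numeric suffix (following the user_1/user_2 pattern from the overview), and ENGINE=MyISAM are assumptions for the example; keys and indexes are left out because the field lists do not specify them.

```php
<?php
// Sketch only: builds MySQL DDL strings for one pair of sub-tables.
// build_shard_ddl() and ENGINE=MyISAM are assumptions for illustration;
// the column names and types are the ones listed in the field tables.

function build_shard_ddl(int $tableId): array
{
    $topic = "CREATE TABLE topic_{$tableId} (\n"
        . "  topic_id INT(10),\n"
        . "  topic_name CHAR(255),\n"
        . "  board_id INT(10),\n"
        . "  created DATETIME\n"
        . ") ENGINE=MyISAM";

    $reply = "CREATE TABLE reply_{$tableId} (\n"
        . "  reply_id INT(10),\n"
        . "  reply_text TEXT,\n"
        . "  topic_id INT(10),\n"
        . "  board_id INT(10),\n"
        . "  created DATETIME\n"
        . ") ENGINE=MyISAM";

    return ['topic' => $topic, 'reply' => $reply];
}

// Print the statements for sub-table 1 (topic_1 and reply_1).
foreach (build_shard_ddl(1) as $sql) {
    echo $sql, ";\n\n";
}
```

Every sub-table shares the same structure; only the suffix in the table name changes.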

That is the complete table structure of the post bar. The relationship between the three tables is: board --> many topics --> many replies. In terms of file size, this means: board table file < topic table file < reply table file. It is therefore almost certain that the topic table and the reply table need to be split, which improves the speed and performance of data retrieval, queries, and updates.

Looking at the structure above, the board table has a "table_id" field. It records which sub-tables hold the topics and replies of a given board. For example, suppose there is a board called "PHP" whose board_id is 1 and whose sub-table ID is also 1. Its record is:

board_id | board_name | table_id | created
1 | PHP | 1 | 00:30:12

To extract all topics in the "PHP" board, you build the name of the topic table from the table_id saved in this record. If the topic table prefix is "topic_", the topic table for "PHP" is "topic_1", and the query is then run against that table.

3. Hash-based table sharding

A hash table uses a value computed by a hash algorithm as a key; ideally the value is unique, so that it can be used to find the entry you want. Hash-based table sharding borrows this idea: apply a hash algorithm to the ID or name of the original target to compute the name of the table that stores its data, then access that table.

Continuing with the post bar example: each board has a board name and a board ID. Both values are fixed and unique, so we can apply some operation to one of them to obtain the name of the target table.
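Before moving on, the basic-table lookup from section 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory array stands in for the board table (or a cached copy of it), the function names are hypothetical, and the SELECT statement is an example query rather than the one from the original article.

```php
<?php
// Sketch of the basic-table lookup. The in-memory array stands in for the
// board table (or a cached copy of it); all names here are illustrative.

$boardTable = [
    1 => ['board_name' => 'PHP', 'table_id' => 1],
];

function shard_tables_for_board(array $boardTable, int $boardId): array
{
    if (!isset($boardTable[$boardId])) {
        throw new InvalidArgumentException("Unknown board_id: {$boardId}");
    }
    $tableId = (int) $boardTable[$boardId]['table_id'];
    return [
        'topic' => 'topic_' . $tableId,
        'reply' => 'reply_' . $tableId,
    ];
}

function build_topic_query(string $topicTable): string
{
    // Table names cannot be bound as parameters, so the name is checked
    // against the expected pattern; board_id is left as a placeholder.
    if (!preg_match('/^topic_\d+$/', $topicTable)) {
        throw new InvalidArgumentException("Unexpected table name: {$topicTable}");
    }
    return "SELECT topic_id, topic_name, created FROM {$topicTable} WHERE board_id = ?";
}

$tables = shard_tables_for_board($boardTable, 1);
echo build_topic_query($tables['topic']), "\n";
```

The extra lookup is the price of this method: every request first consults the basic table, which is why keeping it in a cache or in memory is attractive.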
Now suppose the system is expected to hold at most 100 million records in total and each table stores 1 million records, so the system needs no more than 100 tables. Under this assumption, we hash the board ID to obtain a key value, use that value as part of the table name, and then access the corresponding table. We construct a simple hash algorithm:

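function get_hash($id) {
    // Hex-encode the ID's string form, e.g. "1" becomes "31".
    $str = bin2hex((string) $id);
    // Keep the first four hex characters as the table suffix.
    $hash = substr($str, 0, 4);
    // Right-pad short results with "0" so the suffix is always four characters.
    if (strlen($hash) < 4) {
        $hash = str_pad($hash, 4, "0");
    }
    return $hash;
}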
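One way to see how this function spreads board IDs over tables is to count the distinct suffixes it produces. The sketch below is an illustration added for that purpose: count_distinct_suffixes is a hypothetical helper, the ID range is arbitrary, and get_hash is repeated in condensed form so that the block runs on its own.

```php
<?php
// get_hash() as in the article, repeated so this sketch is self-contained.
function get_hash($id)
{
    $hash = substr(bin2hex((string) $id), 0, 4);
    return str_pad($hash, 4, "0");
}

// Hypothetical helper: count how many distinct table suffixes the IDs
// from 1 to $maxId map to.
function count_distinct_suffixes(int $maxId): int
{
    $seen = [];
    for ($id = 1; $id <= $maxId; $id++) {
        $seen[get_hash($id)] = true;
    }
    return count($seen);
}

echo count_distinct_suffixes(100000), "\n"; // prints 99
var_dump(get_hash(10) === get_hash(100));   // bool(true): both map to "3130"
```

Only the first two decimal digits of the ID influence the result, so IDs such as 10 and 100 share a table and the number of tables stays just under 100.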

The algorithm takes a board ID and returns a four-character string, padding with 0 if the string is too short. For example, get_hash(1) returns "3100" and get_hash(23819) returns "3233". Combining the result with the table prefix gives the table to access: for the board with ID 1, the tables are topic_3100 and reply_3100, so we can go straight to the target tables.

Of course, with this hash algorithm the data of several boards may end up in the same table. This differs from a hash table, which tries hard to resolve collisions; here we do not need to. We do, however, need to predict and analyze which table names the data may be stored under.

If more data needs to be stored, you can hash the board name instead, for example with the same binary-to-hexadecimal conversion. Because there are far more Chinese characters than digits and letters, repeats are less likely, but the number of possible tables is larger, so other problems have to be considered.

In the end, the hash method depends on choosing a good hash algorithm, one that spreads the data across enough tables while keeping queries fast.

Advantage: the hash algorithm yields the name of the target table directly, which is very efficient.

Disadvantage: poor scalability. The chosen hash algorithm fixes the data volume the system is designed for; the system can only run within that volume afterwards and cannot grow beyond it.

4. Other problems

1. Now that the tables are split, you cannot search them directly, because searching across the dozens or hundreds of tables that may exist in the system is impractical. Search therefore has to be handled by a third-party component; Lucene, for example, is a good choice for an on-site search engine.

2. Table file problems: MySQL's MyISAM engine generates three files for each table, *.frm, *.MYD, and *.MYI, which store the table structure, the table data, and the table indexes respectively. On Linux it is advisable to keep the number of files in a directory below about 1,000, otherwise file lookup becomes slower. Since each table produces three files, more than about 300 sub-tables already means roughly 1,000 files, and retrieval becomes very slow. At that point you need to split again, for example across multiple databases.

With the basic-table method, you can add a field to the basic table that records which database each table is stored in. With the hash method, you take a fixed position (the n-th character) of the hash value as the database name. Either way the problem is handled well.
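The idea of taking one character of the hash as the database name can be sketched as below. This is only an illustration under assumptions: route_board, the "forum_" database prefix, and the default character position are invented for the example, and get_hash is repeated in condensed form so that the block runs on its own.

```php
<?php
// get_hash() as in the article, repeated so this sketch is self-contained.
function get_hash($id)
{
    $hash = substr(bin2hex((string) $id), 0, 4);
    return str_pad($hash, 4, "0");
}

/**
 * Hypothetical router: picks the database from one character of the hash
 * and the tables from the full hash. The "forum_" database prefix and the
 * default position are assumptions, not values from the article.
 */
function route_board(int $boardId, int $dbCharIndex = 1): array
{
    $hash = get_hash($boardId);
    if ($dbCharIndex < 0 || $dbCharIndex >= strlen($hash)) {
        throw new OutOfRangeException("dbCharIndex must be between 0 and 3");
    }
    return [
        'database' => 'forum_' . $hash[$dbCharIndex],
        'topic'    => 'topic_' . $hash,
        'reply'    => 'reply_' . $hash,
    ];
}

print_r(route_board(1));     // forum_1, topic_3100, reply_3100
print_r(route_board(23819)); // forum_2, topic_3233, reply_3233
```

The choice of position matters with this particular hash: for positive integer IDs the first character is always "3", so using it would put every table in the same database, whereas position 1 follows the first decimal digit of the ID.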

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
