Tens of millions of data records: how to design and optimize data tables

Data Table Design Principles

(1) Do not design the database for the system as a whole; design it around the services handled by each component in the system architecture, and minimize the associations between tables that belong to different components. If tables from different components must be related, avoid creating foreign key constraints; instead, simply record the primary key of the associated table. This keeps the tables of each component independent and leaves room to restructure the system or the table schemas later.
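A minimal sketch of this idea (hypothetical table and column names, MySQL-style syntax):

-- Component A (user service) owns this table
CREATE TABLE user_account (
    user_id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_name VARCHAR(50) NOT NULL
);

-- Component B (order service) records only the primary key of the associated user;
-- no FOREIGN KEY constraint is declared, so the two components stay independent
CREATE TABLE order_record (
    order_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id  BIGINT NOT NULL,           -- logical reference only, checked by the service layer
    amount   DECIMAL(10, 2) NOT NULL,
    created  DATETIME NOT NULL
);

Consistency between the two tables is then enforced by the service layer rather than by the database.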

(2) Use a domain-model-driven, top-down approach to design the database. First analyze the system's business and define objects by responsibility. Each object must be properly encapsulated so that all data items related to a responsibility are defined within the same object, and those data items describe the responsibility completely, with nothing missing. In addition, an object should have exactly one responsibility; if an object carries two or more responsibilities, it should be split.

(3) Map database tables from the established domain model. Here we should follow the second normal form: every non-key attribute in a table must depend on the whole key. The key can be a single attribute or a set of attributes, and it must be unique. When choosing the key, make sure it is not involved in the business itself, so that it never needs to be updated; the best choice is usually an auto-incrementing numeric column or a random string as the table's key.
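For example, a hedged sketch (hypothetical names) of a surrogate key that stays out of the business:

CREATE TABLE employee (
    employee_id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- surrogate key, never updated
    id_card_no    CHAR(18) NOT NULL,                           -- business identifier, kept out of the key
    employee_name VARCHAR(50) NOT NULL,
    UNIQUE KEY uk_id_card_no (id_card_no)
);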

(4) Because the domain-model-driven approach described above designs the table structure from objects that each have a single responsibility, the data items in an object have no transitive dependencies. As a result, the table design satisfies the third normal form from the very beginning: a table must satisfy the second normal form and have no transitive dependencies between its attributes.
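A small illustration of removing a transitive dependency (hypothetical names):

-- Before: staff(staff_id, staff_name, dept_id, dept_name) has the transitive
-- dependency staff_id -> dept_id -> dept_name, so dept_name is moved out
CREATE TABLE department (
    dept_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE staff (
    staff_id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    staff_name VARCHAR(50) NOT NULL,
    dept_id    INT NOT NULL              -- only the department key is stored here
);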

(5) Likewise, because each object has a single responsibility and the relationships between objects reflect the relationships in the business logic, the objects in the domain model can be divided into master objects and slave objects; a slave object supplements the master object's business logic from a 1:N or N:N perspective. When these objects and their relationships are mapped to tables and table associations, no deletion or insertion anomalies arise.

(6) The table structure obtained from this mapping should then be refined against the fourth normal form to make sure there are no multi-valued dependencies. At this point the findings should be fed back into the domain model, in the spirit of reverse engineering: if a table has a multi-valued dependency, it proves that the corresponding object in the domain model carries at least two responsibilities and should be redesigned according to the single-responsibility rule above. Fourth normal form: a table that satisfies BCNF must contain no multi-valued dependencies.
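A brief illustration of the multi-valued dependency case (hypothetical names): a staff member's skills and spoken languages are two independent multi-valued facts, so they go into separate tables rather than one combined table:

CREATE TABLE staff_skill (
    staff_id BIGINT NOT NULL,
    skill    VARCHAR(50) NOT NULL,
    PRIMARY KEY (staff_id, skill)
);

CREATE TABLE staff_language (
    staff_id        BIGINT NOT NULL,
    spoken_language VARCHAR(50) NOT NULL,
    PRIMARY KEY (staff_id, spoken_language)
);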

(7) When analysis confirms that all tables satisfy the second, third, and fourth normal forms, keep the associations between tables as weak as possible so that table fields and table structures can be adjusted and restructured easily. In my view, database tables exist to persist the state of object instances at a particular moment and under particular conditions; they are only a storage medium. There is therefore no need for strong associations between tables to express business rules (data consistency); that responsibility should be carried by the system's logic layer. This approach also gives the system some tolerance for incorrect data (dirty data). Of course, from the perspective of the whole system we should still do our best to prevent dirty data, but since some dirty data is unavoidable to a certain extent, we also need to make sure the system can tolerate it. This is a compromise.

(8) Create indexes on the primary keys and foreign keys of all tables, and create composite indexes on the attributes used in common queries over large tables, to improve retrieval efficiency. Indexes consume some system resources, but compared with the cost of scanning an entire table during retrieval, especially when the table is large, and the cost of sorting without an index, this approach is still well worth advocating.
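For instance, building on the hypothetical order_record table sketched under point (1):

-- order_id is already indexed through its PRIMARY KEY;
-- add an index on the logical foreign key, and a composite index for a common query pattern
CREATE INDEX idx_order_user         ON order_record (user_id);
CREATE INDEX idx_order_user_created ON order_record (user_id, created);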

(9) Use stored procedures as little as possible. Many technologies can now replace them, such as object/relational mapping, for keeping data consistent, and avoiding stored procedures also simplifies version control, development, deployment, and database migration. It is undeniable, however, that stored procedures have a performance advantage, so when the available hardware cannot be upgraded and performance is a critical quality attribute, they can still be used in a measured way.

(10) When the cost of handling the association constraints between tables (often a usability cost) exceeds the cost of guaranteeing that no update, deletion, or insertion anomalies occur, and data redundancy is not a major problem, the table design does not have to conform to the four normal forms. The four normal forms guarantee the absence of anomalies, but they can also lead to an overly pure design that makes the table structure hard to use, so a comprehensive judgment is needed. Still, the best approach when entering database design is to conform to the four normal forms first and then make refined corrections.

(11) The designed tables should be easy to use, which mainly comes down to whether queries require joining many tables or writing complex SQL.

(12) Table design should minimize data redundancy, ensure data accuracy, and control redundancy effectively to improve database performance.


Character types
Varchar: variable-length; the storage size equals the number of characters actually stored. For example, 'ABC123' occupies 6 bytes (sizes here refer to disk storage, the same below).
Nvarchar: variable-length; the storage size is twice the number of characters, so 'ABC123' occupies 12 bytes. Characters are stored as Unicode, which supports multiple languages.
Char: fixed-length; if the column is defined as char(10) and 'ABC123' is stored, it is padded with four trailing spaces.
Integer types
Bigint: occupies 8 bytes; integer data from -2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807).
Int: occupies 4 bytes; integer data from -2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647).
Smallint: occupies 2 bytes; integer data from -2^15 (-32,768) to 2^15-1 (32,767).
Tinyint: occupies 1 byte; integer data from 0 to 255.

Summary:
1. If the site is Chinese only and does not involve multiple languages, varchar is recommended; using nvarchar wastes space and hurts efficiency.
2. If the character length is fairly fixed, use char, for example for a column such as user_ip. Char is fixed-length, so moving to the next record only requires a pointer offset of a fixed length, whereas with varchar the offset of the next record must be computed from the length of the current one.
3. If the characters are purely numeric, a numeric type is usually better: it occupies fewer bytes and sorts faster. For example, an 11-digit mobile phone number takes 11 bytes as varchar but only 8 bytes as bigint.
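A hedged illustration of these choices (hypothetical table, SQL Server-style types):

CREATE TABLE member (
    member_id BIGINT IDENTITY(1,1) PRIMARY KEY,
    user_name VARCHAR(50) NOT NULL,   -- Chinese-only site: varchar rather than nvarchar
    user_ip   CHAR(15) NOT NULL,      -- fairly fixed length: char
    mobile_no BIGINT NOT NULL         -- 11-digit number: 8 bytes as bigint vs 11 bytes as varchar(11)
);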

Several questions:
1. Is char(n) faster than varchar(n)?
Not necessarily.
Computing the varchar offset does cost some CPU time, but the performance bottleneck is not there; it is in I/O.
The I/O unit of the database is the data page (8192 bytes). One page holds multiple data rows, and a data row cannot span pages (with exceptions such as image and text columns).
So the more rows that fit on a page, the better the performance.
2. The numbers stored may be small, but does defining every numeric column as int hurt efficiency?
Yes.
Int and tinyint occupy different numbers of bytes, which means different sizes on disk; smaller types avoid unnecessary memory consumption and make the related pointer movement cheaper.
3. Varchar/char do not support multiple languages, so why are both English and Chinese displayed correctly?
Because both the server's operating system and the installed MSSQL instance are Chinese, so the default code page handles Chinese characters.
Reference: http://topic.csdn.net/t/20060324/18/4638303.html
4. Other ways to improve database query efficiency
Creating reasonable indexes is essential.
See http://topic.csdn.net/t/20040415/10/2967554.html
Code for testing the byte length of a string under different encodings:
System.Text.Encoding.Unicode.GetByteCount(strTest);
System.Text.Encoding.UTF8.GetByteCount(strTest);
System.Text.Encoding.GetEncoding("gb2312").GetByteCount(strTest);
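For a purely ASCII string such as "ABC123", these three calls return 12, 6, and 6 bytes respectively, matching the varchar/nvarchar sizes listed earlier; for common Chinese characters, Unicode and gb2312 use 2 bytes per character while UTF-8 uses 3.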


Database sharding

I. Overview
Table sharding is a popular technique, especially under heavy load: it is a good way to spread out database pressure.
First, we need to understand why table sharding is needed and what benefits it brings. Let's look at how a database executes an SQL statement:
Receive SQL --> put it into the SQL execution queue --> parse it with the analyzer --> extract or modify data according to the analysis result --> return the result
This flow is not necessarily exact; it is just my own rough mental model. So where does the most common problem appear in this flow? If an earlier SQL statement has not finished, the statements behind it cannot run, because to guarantee data integrity the data table file has to be locked, with either a shared lock or an exclusive lock. While a shared lock is held, other threads can still read the data file but are not allowed to modify it; an exclusive lock means the whole file is owned by a single thread and other threads cannot access the data file at all. MyISAM, generally the fastest storage engine in MySQL, locks at the table level: once a table is locked, the whole data file is inaccessible from outside, and the next operation in the queue is not executed until the previous one completes. This kind of blocking is commonly called a "lock table".
What is the consequence of a locked table? A large number of SQL statements cannot execute immediately; they must wait until everything ahead of them in the queue has finished. These delayed statements either return nothing or return results so late that the user experience suffers.
This is especially true for heavily used tables, such as the user table in an SNS system or the post table in a forum system. These tables carry a large share of the traffic, and to guarantee that data can be fetched and returned to users quickly, some technique is needed to solve the problem. That technique is the table sharding discussed here.
As the name suggests, table sharding splits data of the same kind across several sub-tables. Different users then read from different tables, so they do not conflict with each other and the chance of a table lock is reduced. For example, suppose the user table is split into two sub-tables, user_1 and user_2, which hold different users: user_1 stores the first 100,000 users and user_2 the next 100,000. If the users heiyeluren1 and heiyeluren2 are queried at the same time, the data is read from two different sub-tables, lowering the possibility of a table lock.
I have not tried either of the two sharding approaches described below myself, so I cannot guarantee they are ready to use as-is; they are design ideas. The examples below are based on a "post bar" (Tieba-style) forum system. (If you have never used one, search for it.)
II. Table sharding based on a basic table
This approach uses a basic (lookup) table to store the essential information. Whenever the project needs to know which sub-table holds its data, it first finds the corresponding sub-table name and related fields in the basic table, and then accesses that sub-table directly. If the basic table itself is not fast enough, it can be kept entirely in cache or in memory for convenient and efficient queries.
Based on the post bar, we construct the following three tables:
1. Post bar forum table: stores the forum information.
2. Post bar topic table: stores the topics in each forum, for browsing.
3. Post bar reply table: stores the original content of each topic and its replies.
The "Post Bar forum table" contains the following fields:
Forum ID board_id int (10)
Forum name board_name char (50)
Sub-Table ID table_id smallint (5)
Generation time created datetime
The "Post Bar topic table" contains the following fields:
Topic ID topic_id int (10)
Topic name topic_name char (255)
Forum ID board_id int (10)
Creation Time created datetime
The "post a reply table" field is as follows:
Reply ID reply_id int (10)
Reply content reply_text text
Topic ID topic_id int (10)
Forum ID board_id int (10)
Creation Time created datetime
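The original text does not give DDL, but as a minimal sketch (hypothetical table name board, MySQL/MyISAM style, following the field list above) the basic forum table might look like this:

CREATE TABLE board (
    board_id   INT(10) NOT NULL AUTO_INCREMENT PRIMARY KEY,
    board_name CHAR(50) NOT NULL,
    table_id   SMALLINT(5) NOT NULL,   -- which topic_N / reply_N sub-table this forum uses
    created    DATETIME NOT NULL
) ENGINE = MyISAM;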
The structure above stores the table information for the whole post bar. The relationship between the three tables is:
Forum --> many topics
Topic --> many replies
In other words, the table file sizes relate as:
Forum table file < topic table file < reply table file
So we can basically conclude that it is the topic table and the reply table that need to be sharded in order to improve the speed and performance of data retrieval and queries.
Looking again at the structure above, notice that the forum table contains a "table_id" field, which records which sub-tables store the topics and replies belonging to that forum.
For example, suppose there is a post bar called "php" whose board_id is 1 and whose sub-table ID is also 1; its record is:
Board_id | board_name | table_id | created
1 | PHP | 1 | 00:30:12
Correspondingly, if we need to fetch all topics in the "php" bar, we build the topic table name from the table_id stored in the forum table. For example, if the topic table prefix is "topic_", the topic table for the "php" bar is "topic_1", and we execute a query of the form shown below.
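The original text does not show the query itself; a minimal sketch of what it would look like, using the field names above, is:

SELECT * FROM topic_1 WHERE board_id = 1 ORDER BY created DESC;

In this way only the sub-table that belongs to the "php" bar is touched, and topic tables belonging to other forums are not affected.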
III. Table sharding based on a hash algorithm
We know that a hash table stores values at positions computed by a special hash algorithm, and that computed value is used to locate the entry we want.
Hash-based table sharding follows a similar idea: a hash algorithm is applied to the ID or name of the original target to compute the name of the table where the data is stored, and that table is then accessed.
Continuing with the post bar example: each bar has a forum name and a forum ID. Both values are fixed and unique, so we can consider applying some operation to one of them to derive the name of a target table.
Now suppose the system needs to handle at most 100 million records and we want each table to hold about 1 million records, so there will be no more than 100 tables. Under this rule, we hash the forum ID of the bar to obtain a key value; that value becomes part of the table name, and we access the corresponding table.
We construct a simple hash algorithm:
function get_hash($id) {
    $str = bin2hex($id);
    $hash = substr($str, 0, 4);
    if (strlen($hash) < 4) {
        $hash = str_pad($hash, 4, "0");
    }
    return $hash;
}
The function takes a forum ID and returns a four-character string; if the string is shorter than four characters, it is padded with "0".
For example, get_hash(1) returns "3100" and get_hash(23819) returns "3233". We then simply prepend the table prefix to this value to get the table name: when we need to access the content of forum ID 1, the tables become topic_3100 and reply_3100, and we access those target tables directly.
Of course, after hashing, some data will inevitably land in the same table. This is different from a classic hash table, which tries hard to resolve collisions; here we do not need to, although it may be worth predicting and analysing in advance which tables the data will be spread across.
If even more data needs to be stored, we can hash the forum name instead, for example with the same binary-to-hex conversion as above. Chinese characters are far less likely to repeat than digits and letters, so this can yield more distinct tables, but it also brings other issues that must be considered.
In the end, if you use the hash approach, you must choose a good hash algorithm so that enough tables are generated; data queries will then be faster.
[Advantage] The hash algorithm yields the target table name directly, so it is very efficient.
[Disadvantage] Poor scalability: choosing a hash algorithm fixes the amount of data the scheme can handle, and the system can only run within that volume; it cannot grow beyond it.
IV. Other problems
1. Searching
Now that the tables are sharded, we can no longer search them directly, because a search would have to scan the dozens or hundreds of tables that may exist in the system. Searching therefore has to rely on a third-party component; for example, Lucene is a good choice as a site search engine.
2. Table files
We know that for each table the MySQL MyISAM engine generates three files, *.frm, *.MYD, and *.MYI, which store the table structure, the table data, and the table indexes respectively. Under Linux it is best to keep the number of files in one directory below 1000, otherwise file lookups slow down; since every table produces three files, once the number of sub-tables in a database exceeds roughly 300, retrieval becomes very slow, and a further split is needed, for example splitting across multiple databases.
With the basic-table approach, we can simply add a new field that records which database the data is stored in. With the hash approach, we have to take certain digits of the hash value as the database name. Either way, the problem is solved reasonably well.
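As a minimal sketch of the basic-table variant (hypothetical db_id column, building on the board table sketched earlier):

ALTER TABLE board ADD COLUMN db_id SMALLINT(5) NOT NULL DEFAULT 1;  -- which database holds this forum's topic_N / reply_N sub-tables

The application then reads db_id together with table_id from the basic table and connects to the corresponding database before querying the sub-table.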
