MySQL Big Data Processing

Source: Internet
Author: User
I. Overview

Table sharding is a popular technique, especially under heavy load, and it is a good way to spread database pressure. First we need to understand why table sharding is needed and what benefits it brings.

Let's look at roughly how a database executes SQL: after receiving an SQL statement, it puts the statement into the SQL execution queue, uses the parser to break the statement down, extracts or modifies data according to the analysis result, and returns the processing result. (This description may not be entirely accurate; it is only my own understanding.)

So what is the most common problem in this process? If an earlier SQL statement has not finished, a later statement that needs the same table may have to wait, because the data table file must be locked to ensure data integrity. There are two types of locks: shared locks and exclusive locks. With a shared lock, other threads can still read the data file during the lock period but cannot modify it. With an exclusive lock, the whole file belongs to one thread and other threads cannot access it at all.

MyISAM, generally regarded as the fastest storage engine in MySQL, locks at table level. That is, once a table is locked, the whole data file cannot be accessed by anyone else. The next operation is accepted only after the previous one completes, and until then it waits in the queue. This blocking is usually called "table locking".

What are the direct consequences of locking a table? A large number of SQL statements cannot be executed immediately and must wait until the statements ahead of them in the queue have finished. Statements that cannot execute return no result or are badly delayed, which hurts the user experience.
This is especially true for frequently used tables, such as the user information table in an SNS system or the post table in a forum system. These tables see heavy traffic, and to extract data quickly and return it to users we need some way to deal with the locking problem. That is the table sharding technique I want to discuss today.

As the name implies, table sharding splits one table that stores a single type of data into several sub-tables. When data is extracted, different users access different tables and do not interfere with each other, which reduces the chance of table locking. For example, suppose there are two user sub-tables, user_1 and user_2. They store different users: user_1 holds the first 100,000 users and user_2 holds the next 100,000. If the users heiyeluren1 and heiyeluren2 are queried at the same time, the rows are read from different sub-tables, so table locking is less likely.

I have not tested the two table sharding methods discussed below, so I cannot guarantee that they work exactly as described. They are offered as design ideas. The examples are based on a post bar (forum) system. (If you have never used a post bar, look it up.)

II. Table sharding based on a base table

The base-table approach works roughly like this: one main table stores all the basic information. To find the table in which an item is stored, you first look up the corresponding table name in the base table and then access that table directly. If the base table itself is not fast enough, the whole base table can be kept in a cache or in memory for efficient lookups.

For the post bar, we construct the following three tables:

1. Post bar forum table: stores information about each forum (board) in the post bar.
2. Post bar topic table: stores the topic information of each forum, used for browsing.
3. Post bar reply table: stores the original content of each topic and its replies.

The "post bar forum table" contains the following fields:

    Field           Name          Type
    Forum ID        board_id      int(10)
    Forum name      board_name    char(50)
    Sub-table ID    table_id      smallint(5)
    Creation time   created       datetime

The "post bar topic table" contains the following fields:

    Field           Name          Type
    Topic ID        topic_id      int(10)
    Topic name      topic_name    char(255)
    Forum ID        board_id      int(10)
    Creation time   created       datetime

The "post bar reply table" contains the following fields:

    Field           Name          Type
    Reply ID        reply_id      int(10)
    Reply content   reply_text    text
    Topic ID        topic_id      int(10)
    Forum ID        board_id      int(10)
    Creation time   created       datetime

These are the table structures for the whole post bar. The relationship between the three tables is:

    Forum --> many topics
    Topic --> many replies

In other words, the table file sizes relate as follows:

    forum table file < topic table file < reply table file

So we can be fairly sure that the topic table and the reply table are the ones to split, which should improve the speed and performance of data retrieval, queries, and updates.

Looking at the table structures above, the forum table has a "table_id" field. This field records which sub-tables hold the topics and replies of a given forum.
For example, if we have a forum named "PHP" whose board_id is 1 and whose sub-table ID is also 1, the record is:

    board_id | board_name | table_id | created
    1        | PHP        | 1        | 00:30:12

Accordingly, to extract all topics in the "PHP" forum, we build the name of the topic table from the table_id stored in the forum table. For example, if the topic table prefix is "topic_", the topic table for the "PHP" forum is "topic_1", and we execute:

    SELECT * FROM topic_1 WHERE board_id = 1 ORDER BY topic_id DESC LIMIT 10

This returns the topic list of this forum for us to browse. To view the replies under a topic, we again use the "table_id" stored in the forum table. For example, if the reply table prefix is "reply_", the replies to the topic with ID 1 in the "PHP" forum are fetched with:

    SELECT * FROM reply_1 WHERE topic_id = 1 ORDER BY reply_id DESC LIMIT 10

Here we can clearly see that the forum table is acting as our base table. The obvious question is: how do we keep the base table fast and efficient when the data volume is large? The base table must of course be kept at its best speed and performance. For example, it can be stored in a MySQL memory table, or kept in an in-memory cache such as memcache, adjusted to the actual situation.

In general, the base-table sharding mechanism is a good solution for SNS sites, dating sites, forums, and other Web 2.0 websites. On such sites, a single table can hold the relationship between a basic identifier and its target table. The advantage of keeping this mapping in a table is that later expansion is very convenient: you only need to add a record.
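The lookup flow described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the author's implementation: the dict BOARD_TABLE_IDS is a hypothetical in-memory stand-in for the forum (base) table, the function names are invented for illustration, and the "%s" placeholders assume a MySQL driver that uses the "format" parameter style.

```python
# Hypothetical in-memory copy of the forum (base) table: board_id -> table_id.
# In a real system this mapping would be loaded from the forum table or a cache.
BOARD_TABLE_IDS = {1: 1}  # the "PHP" forum from the example record


def shard_table(prefix, board_id):
    """Build a sub-table name such as 'topic_1' from the base-table mapping."""
    table_id = BOARD_TABLE_IDS[board_id]  # raises KeyError for an unknown forum
    return f"{prefix}{int(table_id)}"     # int() keeps the name purely numeric


def latest_topics_query(board_id, limit=10):
    """Return (sql, params) for the newest topics of one forum."""
    table = shard_table("topic_", board_id)
    sql = (f"SELECT * FROM {table} WHERE board_id = %s "
           "ORDER BY topic_id DESC LIMIT %s")
    return sql, (board_id, limit)


def latest_replies_query(board_id, topic_id, limit=10):
    """Return (sql, params) for the newest replies of one topic."""
    table = shard_table("reply_", board_id)
    sql = (f"SELECT * FROM {table} WHERE topic_id = %s "
           "ORDER BY reply_id DESC LIMIT %s")
    return sql, (topic_id, limit)
```

Only the table name is interpolated into the SQL text, and it is derived from the trusted mapping. The user-supplied values are passed as parameters for the driver to bind.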
[Advantages] Adding and removing nodes is very convenient, which makes later upgrades and maintenance much easier.

[Disadvantages] Every lookup or table operation still has to go through the database (the base table), which may become a bottleneck.

III. Table sharding based on a hash algorithm

We know that a hash table locates a desired entry through a value computed by a hash algorithm, and that this value should ideally be unique. Hash-based table sharding follows a similar idea: a hash algorithm is applied to the ID or name of the original target to compute the name of the table that stores its data, and that table is then accessed.

Let's continue with the post bar example. Each forum has a forum name and a forum ID. Both values are fixed and unique, so we can apply some operation to one of them to obtain the name of the target table.

Suppose the system holds at most 100 million records in total and we want each table to hold 1 million records. The system then needs no more than 100 tables. On that basis, we hash the forum ID to obtain a key, use the key in the table name, and access the corresponding table.

We construct a simple hash algorithm:

    function get_hash($id) {
        // Hex-encode the string form of the ID, e.g. "1" becomes "31"
        $str = bin2hex((string) $id);
        // Keep the first four hex characters
        $hash = substr($str, 0, 4);
        // Right-pad with "0" if the result is shorter than four characters
        if (strlen($hash) < 4) {
            $hash = str_pad($hash, 4, "0");
        }
        return $hash;
    }

The function takes a forum ID and returns a four-character string. If the string is too short, it is padded with "0". For example, get_hash(1) returns "3100", and get_hash(23819) returns "3233". After combining the result with a table prefix, we can access the table.
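For readers who want to experiment, the same hash can be mirrored in Python. This is a minimal sketch, not part of the original article: shard_tables is a hypothetical helper, the prefixes "topic_" and "reply_" are taken from the earlier example, and UTF-8 is assumed as the encoding for non-ASCII input such as forum names.

```python
def get_hash(value):
    """Mirror of the PHP get_hash: hex-encode the value's string form,
    keep the first four hex characters, and right-pad with "0"."""
    hex_str = str(value).encode("utf-8").hex()  # "1" -> "31"
    return hex_str[:4].ljust(4, "0")


def shard_tables(board_id):
    """Hypothetical helper: topic and reply table names for one forum ID."""
    suffix = get_hash(board_id)
    return f"topic_{suffix}", f"reply_{suffix}"


print(get_hash(1))        # 3100
print(get_hash(23819))    # 3233
print(shard_tables(1))    # ('topic_3100', 'reply_3100')
```

Because only the first four hex characters are kept, the result depends on just the first two characters of the ID. IDs 10 and 100, for instance, both map to "3130", so they share a table.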
When we need to access the content with ID 1, the combined table names are topic_3100 and reply_3100, so we can go straight to the target tables. Of course, with this hash algorithm some data will land in the same table. That differs from a hash table, which tries hard to resolve collisions. We do not need to avoid them here, but we do need to predict and analyze which tables the data may end up in.

If more data needs to be stored, the hash can be applied to the forum name instead, using the same conversion to hexadecimal shown above. Because Chinese characters cover a far larger range than digits and letters, collisions are less likely, but many more tables may be produced, so other problems then have to be considered.

In the end, the hash approach requires choosing a good hash algorithm. A good one produces more tables but makes data queries faster.

[Advantages] The target table name is obtained directly through the hash algorithm, which is highly efficient.

[Disadvantages] Poor scalability. Choosing a hash algorithm fixes the data capacity: the system can only run within that capacity afterwards and cannot grow beyond it.

IV. Other problems

1. Search problems

Once the tables are sharded, we can no longer search them directly, because it is impractical to search the dozens or hundreds of tables that may exist in the system. Search must therefore be handled by a third-party component. For example, Lucene is a good choice for an in-site search engine.

2. Table file problems

The MySQL MyISAM engine generates three files for each table, *.frm, *.MYD, and *.MYI, which store the table structure, the table data, and the table indexes respectively. On Linux, it is best to keep the number of files in each directory below about 1,000, otherwise file lookups become slower, and each table produces three files.
Once the number of sub-tables exceeds about 300, lookups therefore become very slow, so the data must be split once more, for example across several databases. With the base-table approach, we can add a field that records which database each table is stored in. With the hash approach, we can take the nth character of the hash value as the database name. Either way, the problem is handled cleanly.

V. Summary

In high-load applications, the database has always been a major bottleneck that must be broken through. This article described two table sharding methods in the hope that they give readers some ideas. The code and ideas in this article have not been tested in any way, so the designs are not guaranteed to be fully accurate or practical. Readers should analyze them carefully before implementing them.

This article was written in a hurry and its quality may suffer for it. If you find any errors, please do not be surprised. Criticism and corrections are welcome. Thank you!