I. OverviewThe sub-table is a relatively popular concept at present, especially in the case of large load, the sub-table is a good way to disperse the database pressure. the first thing to know is why the tables are divided and what the benefits of the tables are. Let's start with a general understanding of one of the following database execution SQL procedures:receive SQL--put into SQL execution queue--use parser to decompose SQL--Extract data by analysis results or modify--return processing resultsof course, this flowchart is not necessarily correct, it is just my own subjective consciousness so I think. So what is the most likely problem in this process? That is, if the previous SQL is not completed, then SQL will not be executed, because in order to guarantee the integrity of the data, the data table file must be locked, including the shared lock and the exclusive lock two locks. A shared lock is a lock during which other threads can access the data file, but no modification is allowed, and the corresponding, exclusive lock is that the entire file is owned by one thread, and other threads cannot access the data file. General MySQL is the fastest storage engine MyISAM, it is based on table locking, that is, if a lock, then the entire data file can not be accessed outside, must wait until the previous operation is completed before receiving the next operation, then the previous operation did not complete, The latter operation waits in the queue to be unable to execute the situation is called the blockage, generally we are called "the lock Table" in the popular sense. What are the immediate consequences of the lock table? Is that a large amount of SQL cannot be executed immediately, and must wait until the SQL in front of the queue has been fully executed to continue execution. This non-executable SQL can result in no results, or severe delays that affect the user experience. especially for some use more frequent tables, such as the User Information table in the SNS system, the post table in the forum system, and so on, is a large number of access to the table, in order to ensure the rapid extraction of data back to the user, must use some processing methods to solve the problem, this is what I want to talk about the sub-table technology. Sub-table technology as the name implies, is to store several tables of the same type of data into several table sub-table storage, when extracting data, different users access to different tables, non-conflict, reduce the probability of locking the table. For example, the current save User sub-table has two tables, one is the user_1 table, there is a user_2 table, two tables save different user information, user_1 saved the first 100,000 of the user information, user_2 saved the last 100,000 users of information, now if you query the user Heiyeluren1 and Heiyeluren2 This two users, then is the table from the different tables extracted, reduce the possibility of locking table. I want to tell the following two kinds of sub-table method I have not experimented with, do not guarantee the accurate use, but provide a design ideas. The following is an example of the sub-table I assume is a post-paste system based on the processing and construction. (If you have not used the post bar users hurriedly Google a bit)second, the table processing based on the basic tableThe general idea of this table-based table- processing approach is that a primary table holds all the basic information, and if a project needs to find the table it stores, it must look for items such as table names from the underlying table, so that you can access the table directly. If you feel that the base table is not fast enough, you can completely save the entire base table in cache or in memory for efficient query. based on the situation, we build 3 tables that assume the following:1. Paste the Section table: Save the information of the section2. Stick to the theme table: Save the topic information in the section, for browsing3. Post-paste Reply table: Save the original content and reply content of the themethe "Bar table" contains the following fields:section ID board_id Int (ten) section name board_name char (+) child table ID table_id smallint (5) generation time created datetime the "Stick theme table" contains the following fields:subject ID topic_id Int (ten) Subject name topic_name char (255) section ID board_id Int (ten) creation time created datetime The fields of "post-paste reply form" are as follows:reply ID reply_id Int (ten) reply content reply_text text subject ID topic_id Int (ten) section ID board_id Int (ten) creation time created datetime then the above holds the table structure information of our entire bar, the three tables corresponding to the relationship is:sections-- multiple themes topics-- multiple replies So, the relationship between the table file size is:section Table File < topic table file < reply Table file so it's almost certain that the topic table and the reply table need to be divided into tables, which has increased the speed and performance of our data retrieval query changes. looking at the table structure above, it is obvious that a "table_id" field is saved in the "section table", this field is used to save a section corresponding to the topic and the reply is the table is saved in the list. For example, we have a bar called "PHP", board_id is 1, the sub-table ID is also 1, then this record is:board_id | board_name | table_id | created1 | PHP | 1 | 2007-01-19 00:30:12Accordingly, if I need to extract the "PHP" bar in all the topics, then you must follow the table to save the table_id to combine a stored theme tables name, such as our topic table prefix is "topic_", then the combination of "PHP" bar corresponding theme table should be: " Topic_1 ", then we do:SELECT * from topic_1 WHERE board_id = 1 ORDER by topic_id DESC LIMITThis will allow us to get a list of replies to this topic, so that we can view them, and if we need to see the responses below a topic, we may continue to use the "table_id" saved in the section table to query. For example, our reply table prefix is "reply_", then we can assemble the "PHP" bar ID 1 of the theme of the reply:SELECT * from reply_1 WHERE topic_id = 1 ORDER by reply_id DESC LIMIThere, we can see clearly, actually we use the basic table here, the base table is our section table. The corresponding, will certainly say: the basic table of the amount of data in the future how to ensure its speed and efficiency? of course, we have to keep the base table at its best speed and performance, for example, it can be stored in MySQL's memory table, or in memory, such as memcache memory cache, etc., can be adjusted according to the actual situation. generally based on the table of the table mechanism in SNS, friends, forums and other Web2.0 site is a relatively good solution, in these sites, you can use a single table to save the basic identity and the relationship between the target table. The advantage of saving correspondence with tables is that it is convenient to expand later, and only one table record is added. " advantage" to increase the deletion of nodes is very convenient, for late upgrade maintenance brings great convenience " disadvantage" requires adding a table or working on a table or not leaving the database, creating a bottleneck third, the hash algorithm based on the sub-table processingwe know that a hash table is a value computed by a particular hash algorithm that must be unique and can use the computed value to find the desired value, which is called the hash table. our hash algorithm in the table is similar to this idea: the ID or name of a primitive object is used to calculate the table name of the data store table by a certain hash algorithm, and then access the corresponding table. continue to take the above bar, each bar has a section name and the section ID, then the two values are fixed, and is unique, then we can consider the two values by some of the operation of a number of operations to derive the name of a target table. now, if we are aiming at our post-paste system, assuming that the system allows a maximum of 100 million data, consider saving 1 million records per table, then the entire system will be able to accommodate no more than 100 tables. According to this standard, we assume that we hash the section ID of the bar, get a key value, this value is our table name, and then access the corresponding table. we construct a simple hash algorithm:function Get_hash ($id) {$str = Bin2Hex ($id);$hash = substr ($str, 0, 4);if (strlen ($hash) <4) {$hash = Str_pad ($hash, 4, "0"); }return $hash;}The algorithm is basically passing in a section ID value, and then the function returns a 4-bit string, if the string length is not enough, use 0 to complete the completion. For example:get_hash (1), the result of the output is "3100", Input:get_hash (23819), the result is: 3233, then we have a simple combination of table prefix, we can access the table. Then we need to access the content of the ID 1, OH, the combined table will be: topic_3100, reply_3100, then you can directly access to the target table. of course, after using the hash algorithm, there is part of the data may be in the same table, this is different from the hash table, hash table is to resolve the conflict as far as possible, we do not need, of course, also need to predict and analyze the table data may be saved table name. If you need to store more data, the same, you can hash the name of the section, such as the above binary conversion to 16, because the Chinese characters more than the number and letter, then the repetition probability is smaller, but may be combined into more tables, the corresponding must consider some other problems. In the final analysis, the use of hash method must choose a good hash algorithm, in order to generate more tables, but the data query faster. "The Advantage hash algorithm directly obtains the target table name, the efficiency is very high" through " disadvantage" extensibility is poor, choose a hash algorithm, define how much data, the future can only be in this amount of data to run, can not exceed the amount of data, scalability slightly worse Iv. Other issues1. Search QuestionsNow that we have a table, we can't search directly on the table, because you can't retrieve dozens of or hundreds of tables that already exist in the system, so the search has to be done with third-party components, such as Lucene, which is a good choice for the site search engine. 2. Table File Issueswe know that MySQL's MyISAM engine will generate three files per table, *.frm, *. MYD, *. MYI three files, tables are used to save table structure, table data, and table indexes. The number of files under each directory of Linux is best not more than 1000, otherwise the retrieval data will be slower, then each table will generate three files, the corresponding if the table more than 300 tables, then the retrieval is very slow, so this time must be divided, such as in the database separation. using the underlying table, we can add a new field to hold what data the table holds. Using hash, we must intercept the hash value of the first few to be the name of the database. In this way, the problem is solved in good condition. v. Summaryin the heavy load application, the database has been a very important bottleneck, must break through, this article explained two kinds of table's way, hoped for many people to have the inspiration function. Of course, the code and the idea of this article has not been tested by any code, so the design is not guaranteed to be completely accurate and practical, or need the reader in the use of the process of careful analysis and implementation. article written in a hurry, the quality may not be guaranteed, encountered errors, do not take offense, welcome to criticize, thank you ~~~~!
How MySQL database solves big data volume storage problems