Original: http://blog.sina.com.cn/s/blog_4d8a2c970100fe4l.html
I. Overview

Splitting a table into sub-tables is a fairly popular idea at the moment. Under heavy load in particular, sub-tables are a good way to spread the pressure on the database.

First we should understand why tables are split and what the benefits are. Consider roughly how a database executes a SQL statement:

Receive SQL --> put it into the SQL execution queue --> break the SQL down with the parser --> extract or modify data according to the parse result --> return the result

Of course, this flow is not necessarily correct; it is just my own subjective understanding.

So what is most likely to go wrong in this process? If an earlier SQL statement has not finished, later statements cannot execute, because the data table file must be locked to guarantee the integrity of the data. There are two kinds of lock: shared and exclusive. While a shared lock is held, other threads can still read the data file but cannot modify it. While an exclusive lock is held, the whole file belongs to one thread and other threads cannot access it at all.

MyISAM, generally regarded as the fastest MySQL storage engine, uses table-level locking: once a write takes the lock, no other operation can access the table's data file until that operation completes. When an earlier operation has not finished and later operations wait in the queue unable to run, this is called blocking; it is what we popularly call "locking the table".

What are the immediate consequences of a locked table? A large number of SQL statements cannot run immediately and must wait until the statements ahead of them in the queue have finished. These stalled statements may return no result at all, or cause serious delays that hurt the user experience.
This matters most for frequently used tables, such as the user information table in an SNS system or the posts table in a forum system. These tables receive a very large number of accesses, and to get data back to the user quickly we need some way to deal with the problem. That is the sub-table technique I want to talk about today.

Sub-table technology, as the name implies, means splitting data of the same type across several tables. When data is extracted, different users access different tables, so they do not conflict with each other and the probability of locking a table is reduced.

For example, suppose the user data is currently split into two tables, user_1 and user_2, which hold different users' information. user_1 holds the first 100,000 users and user_2 holds the next 100,000. If we now query the two users Heiyeluren1 and Heiyeluren2 at the same time, the data is extracted from two different tables, which reduces the chance of a locked table.

The two sub-table methods I describe below are ones I have not tested myself, so I cannot guarantee that they work exactly as described; they are offered as design ideas. The examples assume a post bar system as the basis for the processing and construction. (If you have never used a post bar, a quick Google search will explain it.)
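The user_1 / user_2 split can be sketched as a small routing function. This is only an illustration under assumptions the article does not state: the function name user_table_for() is hypothetical, and it assumes every user has a sequential numeric ID (the article's example looks users up by name, which would need an extra name-to-ID step).

```php
<?php
// Hypothetical helper: map a numeric user ID to its sub-table by ID range.
// Assumes sequential IDs starting at 1 and 100,000 users per table.
function user_table_for(int $userId, int $usersPerTable = 100000): string
{
    if ($userId < 1) {
        throw new InvalidArgumentException("user ID must be positive");
    }
    // IDs 1..100000 -> user_1, IDs 100001..200000 -> user_2, and so on.
    $index = intdiv($userId - 1, $usersPerTable) + 1;
    return "user_" . $index;
}

echo user_table_for(42), "\n";      // user_1
echo user_table_for(150000), "\n";  // user_2
```

Two queries for users in different ranges then touch different tables, so a lock on one table does not block the other query.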
II. Sub-table processing based on a base table

The main idea of the base-table approach is this: one primary table holds all the basic information. When an item needs to find the table in which its data is stored, it first looks up the corresponding table name (and related fields) in the base table, and can then access that table directly. If the base table itself is not fast enough, the whole base table can be kept in a cache or in memory for efficient lookup.

For the post bar case, we construct the following three tables:

1. Post bar section table: holds information about each section of the post bar
2. Post bar topic table: holds the topic information of each section, used for browsing
3. Post bar reply table: holds the original content of each topic and the content of its replies

The "post bar section table" contains the following fields:

    Section ID      board_id     int(10)
    Section name    board_name   char(50)
    Sub-table ID    table_id     smallint(5)
    Creation time   created      datetime

The "post bar topic table" contains the following fields:

    Topic ID        topic_id     int(10)
    Topic name      topic_name   char(255)
    Section ID      board_id     int(10)
    Creation time   created      datetime

The "post bar reply table" contains the following fields:

    Reply ID        reply_id     int(10)
    Reply content   reply_text   text
    Topic ID        topic_id     int(10)
    Section ID      board_id     int(10)
    Creation time   created      datetime

These three tables hold the structure of the whole post bar. The relationship between them is:

    section --> multiple topics --> multiple replies

So the relationship between the table file sizes is:

    section table file < topic table file < reply table file

It is therefore almost certain that the topic table and the reply table need to be split into sub-tables, to improve the speed and performance of data retrieval, queries, and updates.
Looking at the table structure above, you will notice that the "section table" has a "table_id" field. This field records which sub-tables hold the topics and replies that belong to a section.

For example, suppose we have a post bar called "PHP" whose board_id is 1 and whose sub-table ID is also 1. Its record is:

    board_id | board_name | table_id | created
    1        | PHP        | 1        | 2007-01-19 00:30:12

Accordingly, to extract all the topics in the "PHP" bar, we combine the table_id stored in the section table with a prefix to form the name of the table that stores the topics. If the topic table prefix is "topic_", then the topic table for the "PHP" bar is "topic_1", and we run:

    SELECT * FROM topic_1 WHERE board_id = 1 ORDER BY topic_id DESC LIMIT 10

This returns the topic list for that section for us to view. To see the replies under a topic, we again use the "table_id" stored in the section table. If the reply table prefix is "reply_", then the replies to the topic with ID 1 in the "PHP" bar are fetched with:

    SELECT * FROM reply_1 WHERE topic_id = 1 ORDER BY reply_id DESC LIMIT 10

Here we can see clearly that we are using a base table, and that the base table is our section table. The obvious question is: as the amount of data in the base table grows, how do we keep it fast and efficient? We have to give the base table the best possible speed and performance. For example, it can be stored in a MySQL memory table, or kept in memory in a cache such as memcache, adjusted to the actual situation.

In general, the base-table mechanism is a fairly good solution for SNS, friend-network, forum, and other Web 2.0 sites. On these sites, a single table can hold the basic identifiers and their mapping to the target tables.
The advantage of keeping this mapping in a table is that later expansion is easy: you only need to add one record to the table.

Advantages: adding and removing nodes is very convenient, which makes later upgrades and maintenance much easier.

Disadvantages: adding a table, or operating on a table, still cannot be done without going through the database, which can create a bottleneck.
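The base-table lookup described in this section might be sketched as follows. The $boards array is a stand-in for the section table held in a cache or in memory, and sub_table_names() is a hypothetical helper; neither comes from the original article, and no real database connection is made.

```php
<?php
// Stand-in for the section (base) table, e.g. loaded once and kept in a cache.
// Keys are board_id values; the structure is an assumption for illustration.
$boards = [
    1 => ['board_name' => 'PHP', 'table_id' => 1],
];

// Hypothetical helper: resolve the topic and reply sub-tables for a section.
function sub_table_names(array $boards, int $boardId): array
{
    if (!isset($boards[$boardId])) {
        throw new OutOfBoundsException("unknown board_id $boardId");
    }
    // Cast to int so that only digits ever reach the table name.
    $tableId = (int) $boards[$boardId]['table_id'];
    return ['topic' => "topic_$tableId", 'reply' => "reply_$tableId"];
}

$tables = sub_table_names($boards, 1);

// Table names cannot be bound as statement parameters, so the name is
// interpolated; the values stay as placeholders for a prepared statement.
$sql = "SELECT * FROM {$tables['topic']} WHERE board_id = ? ORDER BY topic_id DESC LIMIT 10";
echo $sql, "\n";
```

Moving a section to a different sub-table then only means changing its table_id record, which is the ease of expansion this section describes.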
III. Sub-table processing based on a hash algorithm

A hash table, as we know, takes a key, computes a value from it with a particular hash algorithm, and uses that computed value to find the desired entry; ideally the computed value is unique to the key. The hash approach to sub-tables follows a similar idea: take an original target ID or name, run it through some hash algorithm to compute the name of the table that stores the data, and then access that table.

Continuing with the post bar example, every bar has a section name and a section ID. Both values are fixed and unique, so we can consider performing some operation on them to obtain the name of the target table.

Suppose our post bar system allows at most 100 million records and we plan to keep 1 million records per table. The whole system then needs no more than 100 tables. On that basis, we hash the section ID of a bar to get a key, use that key as the table name, and access the corresponding table.

We construct a simple hash algorithm:

    function get_hash($id)
    {
        // Hex-encode the ID's string form, e.g. "1" becomes "31"
        $str = bin2hex((string) $id);
        // Keep the first 4 hexadecimal characters
        $hash = substr($str, 0, 4);
        // Right-pad with "0" if the result is shorter than 4 characters
        if (strlen($hash) < 4) {
            $hash = str_pad($hash, 4, "0");
        }
        return $hash;
    }

The algorithm takes a section ID and returns a 4-character string; if the string is not long enough, it is padded with 0. For example, get_hash(1) returns "3100" and get_hash(23819) returns "3233". Combining the result with a table prefix gives the table to access. So to access the content of the section with ID 1, the combined table names are topic_3100 and reply_3100, and we can go straight to the target tables.
Of course, with this hash algorithm some data will probably end up in the same table. This differs from a hash table, which tries as far as possible to resolve collisions; here we do not need to. We do, however, need to predict and analyze which table names the data may be stored under.

If more data needs to be stored, you can hash the section name in the same way, for example with the same binary-to-hexadecimal conversion as above. Because there are far more Chinese characters than digits and letters, the probability of repetition is smaller, but more tables may be produced, so some other problems have to be considered accordingly.

In the final analysis, the hash approach depends on choosing a good hash algorithm, so that although more tables are generated, queries are faster.
Advantages: the hash algorithm yields the target table name directly, so it is very efficient.

Disadvantages: poor extensibility. Choosing a hash algorithm fixes how much data the system can hold; from then on the system can only run within that data volume and cannot exceed it.
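The advice to predict which table names can occur is easy to check with a short script. The sketch below restates get_hash() from this section in compact form and counts the distinct table suffixes produced for section IDs 1 to 100000; the ID range is an arbitrary assumption chosen for illustration.

```php
<?php
// Compact restatement of the article's get_hash(): hex-encode the ID's
// string form, keep the first 4 characters, right-pad with "0".
function get_hash($id): string
{
    $hash = substr(bin2hex((string) $id), 0, 4);
    return str_pad($hash, 4, "0");
}

// Collect every distinct table suffix for IDs 1..100000 (assumed range).
$suffixes = [];
for ($id = 1; $id <= 100000; $id++) {
    $suffixes[get_hash($id)] = true;
}

echo count($suffixes), "\n";             // 99
echo "topic_" . get_hash(23819), "\n";   // topic_3233
```

Only the first two characters of the ID influence the result, so there are 99 possible tables for positive IDs (9 for single-digit IDs plus 90 for the leading pairs 10 to 99). That fits the target of no more than 100 tables, but it also means that rows are grouped by leading digits rather than spread evenly, which is one reason the choice of hash algorithm deserves care.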
IV. Other issues

1. Search

Now that the data is split into sub-tables, we can no longer search the tables directly, because it is not practical to run a search across the dozens or hundreds of tables that exist in the system. Search therefore has to be handled by a third-party component, such as Lucene, which is a good choice for a site search engine.

2. Table files

MySQL's MyISAM engine generates three files per table, *.frm, *.MYD, and *.MYI, which hold the table structure, the table data, and the table indexes respectively. On Linux, the number of files in a single directory is best kept under about 1000, otherwise looking up data becomes slower. Since every table generates three files, once there are more than 300 tables retrieval becomes very slow, so at that point the tables must be divided further, for example by separating them into different databases.

With the base-table approach, we can add a new field that records which database each table is stored in. With the hash approach, we have to take the first few characters of the hash value as the name of the database. That solves the problem neatly.
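For the hash approach, taking the first few characters of the hash as the database name could look like the sketch below. The "bar_" database prefix, the choice of two characters, and the locate_topic_table() name are all assumptions made for illustration; the article does not specify them.

```php
<?php
// Compact restatement of the article's get_hash() so the sketch is self-contained.
function get_hash($id): string
{
    $hash = substr(bin2hex((string) $id), 0, 4);
    return str_pad($hash, 4, "0");
}

// Hypothetical layout: the first 2 characters of the hash select the
// database, and the full hash selects the table inside that database.
function locate_topic_table($boardId): string
{
    $hash = get_hash($boardId);
    $database = "bar_" . substr($hash, 0, 2);
    return $database . ".topic_" . $hash;
}

echo locate_topic_table(1), "\n";      // bar_31.topic_3100
echo locate_topic_table(23819), "\n";  // bar_32.topic_3233
```

Each database directory then holds only the tables that share a hash prefix, which keeps the number of MyISAM files per directory well below the limit discussed above.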
V. Summary

In heavily loaded applications the database has always been an important bottleneck that must be broken through. This article described two ways of splitting tables, and I hope it gives many people some inspiration. Of course, neither the code nor the ideas in this article have been tested, so the design is not guaranteed to be fully accurate or practical; readers should analyze and implement it carefully when using it. The article was written in a hurry, so its quality is not guaranteed. If you find errors, please do not take offense; criticism is welcome. Thank you!

The design idea and realization of the sub-table processing