MySQL Performance Tuning and Architecture Design, Chapter 9: Performance Optimization of MySQL Database Schema Design
Many people think that performance is optimized while writing code (program code or database code). This is a serious misunderstanding. The part that affects performance the most is already determined at design time; later optimization can often fix only a small share of the problems left behind by the earlier design, and the improvement it brings is usually limited. This chapter describes how to design a MySQL database schema that is as efficient as possible and that minimizes trouble later on.
9.1 Efficient Model Design
Is the most normalized design necessarily the most reasonable?
In database schema design theory, normalization (the theory of normal forms) has long been treated as the definitive manual. A schema designed according to the normal forms is clear, its relationships are explicit, it is easy to extend, and it stores as little data as possible; at the higher normal forms, almost no redundant data remains. In many people's eyes, the higher the normal form a schema satisfies, the better it is designed. However, many people overlook when the theory appeared and what it set out to do. Normalization theory for relational databases was born in the early 1970s, and its fundamental goals were to eliminate redundant data, keep data consistent, and make data modification simple.
Eliminating redundancy as far as possible does let a query return fewer records for the same amount of information, but another very important reason was that storage was expensive at the time and the capacity of storage devices was very small. Hardware storage has since developed so rapidly that space is no longer much of a problem. Meanwhile, the data consistency and simple modification promised by normalization depend mainly on adding various constraints to the database, and checking those constraints consumes considerable database resources. Therefore, for performance-oriented schema design, we cannot treat normalization theory as the only guide. The design should start from actual requirements and take performance as its fundamental goal, and in many cases, to get the most performance, we have to denormalize.
Moderate Redundancy: Let Queries Do Fewer Joins
Readers who are familiar with the MySQL optimizer may know that, although it is claimed to use a new generation of optimizer technology, the statistics MySQL currently collects are limited, so its results are not especially satisfying and it is not as intelligent as the official MySQL material suggests. It generally finds an efficient execution plan for ordinary joins, but for subqueries or more complex joins it can easily produce a poor plan; often the order in which tables are accessed is badly chosen, and the complex query as a whole runs inefficiently. So the most direct and effective way to keep execution plans as close to optimal as possible is to reduce joins, and reducing joins inevitably requires redundant table fields.
Here we continue with the example from the section "The impact of schema design on performance" in the chapter "Factors affecting MySQL Server performance" and analyze the difference in resource consumption further. In scenario one, the group_message table holds only the ID of the user who published the message; in the optimized, redundant design, the group_message table also stores that user's nick_name in an author column.
Query and execution plan for the list function before optimization (group_message_bad is the table before optimization; group_message is the optimized table):
mysql> EXPLAIN
    -> SELECT t.id, t.subject, user.id, user.nick_name
    -> FROM (
    -> SELECT id, user_id, subject
    -> FROM group_message_bad
    -> WHERE group_id = 1
    -> ORDER BY gmt_modified DESC LIMIT 1,10
    -> ) t, user
    -> WHERE t.user_id = user.id\G
*************************** 1. row ***************************
*************************** 2. row ***************************
*************************** 3. row ***************************
Query and execution plan for the list function after optimization:
mysql> EXPLAIN
    -> SELECT t.id, t.subject, t.user_id, t.author
    -> FROM group_message t
    -> WHERE group_id = 1
    -> ORDER BY gmt_modified DESC LIMIT 1,10\G
*************************** 1. row ***************************
        Extra: Using where; Using filesort
Comparing the two execution plans shows a very large difference. Before optimization, the query must access two tables (the message table and user) to get its result; after optimization, it only needs to access the single group_message table, because the author information is stored redundantly in group_message.
From the point of view of normalization, such a design is unreasonable, because the user's nickname may become inconsistent between the user table and the group_message table. Every nickname update has to touch two tables, and the application needs extra logic to keep the data as consistent as possible. From a performance point of view, however, this redundancy is very valuable. The update logic does become more complex, but when weighing that extra cost we should also ask how often users actually change their nicknames. What matters is the overall performance of the system, not the performance of a single operation. In this example the cost of updates rises, but queries become more efficient, and queries are far more frequent than updates. Trading a higher cost on a small share of operations for a larger overall gain is a strategy we use frequently in performance optimization.
In most applications, queries like the one in this example are very numerous while updates to the underlying data are far less frequent, so blindly pursuing a fully normalized schema is often inappropriate in a high-performance system. In my view, the essence of normalization is conceptual singleness: normalized tables are generally small and contain as few related columns as possible. This may make the database easier to maintain in some cases, but when the system has to retrieve data, it may need complex joins to do so, which inevitably hurts query performance. And if we split the join into several simple queries and implement the join logic in the application, the network overhead becomes considerable.
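The trade-off described above can be tried out in a small, runnable sketch. It is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for MySQL, the table and column names follow the chapter, and the sample rows and helper functions (list_with_join, list_without_join, rename_user) are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL here. Table and column names follow the chapter;
# the sample rows and helper names are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, nick_name TEXT NOT NULL);
    CREATE TABLE group_message (
        id INTEGER PRIMARY KEY,
        group_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        author TEXT NOT NULL,          -- redundant copy of user.nick_name
        subject TEXT NOT NULL,
        gmt_modified INTEGER NOT NULL
    );
    INSERT INTO user VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO group_message VALUES
        (1, 1, 1, 'alice', 'first post', 100),
        (2, 1, 2, 'bob', 'second post', 200);
""")


def list_with_join(group_id):
    # Normalized read path: the nickname comes from the user table.
    return conn.execute(
        "SELECT m.id, m.subject, u.nick_name FROM group_message m "
        "JOIN user u ON u.id = m.user_id "
        "WHERE m.group_id = ? ORDER BY m.gmt_modified DESC",
        (group_id,),
    ).fetchall()


def list_without_join(group_id):
    # Denormalized read path: a single table is enough.
    return conn.execute(
        "SELECT id, subject, author FROM group_message "
        "WHERE group_id = ? ORDER BY gmt_modified DESC",
        (group_id,),
    ).fetchall()


def rename_user(user_id, new_nick):
    # The price of redundancy: both copies are updated in one transaction.
    with conn:
        conn.execute("UPDATE user SET nick_name = ? WHERE id = ?",
                     (new_nick, user_id))
        conn.execute("UPDATE group_message SET author = ? WHERE user_id = ?",
                     (new_nick, user_id))


rename_user(2, "bobby")
```

Both read paths return the same rows as long as every nickname change goes through rename_user; the frequent list query gets cheaper and the rare rename gets more expensive.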
Vertical Splitting of Large Fields: Summary Table Optimization
In fact, the example above also uses another optimization strategy: vertical splitting of large fields. At first glance this strategy seems to be the exact opposite of the moderate redundancy strategy described earlier. Moderate redundancy copies fields from other tables into our own table, while vertical splitting of large fields takes fields out of our own table and places them in a separate table. Many readers may wonder: we have just argued for pulling other tables' fields into our own table, so why split our own fields out now? Isn't that contradictory?
There is no contradiction. We bring in another table's field because many of our queries need it, and having it locally avoids the cost of a join. We move a large field out because most of our queries do not need it. And before moving anything, we make the decision only after a thorough evaluation and comparison.
So what kind of field is right for splitting out of a table?
First of all, it must be large. The reason is simple: precisely because it is large. Large fields usually hold longer detail information, such as the body of an article, the content of a post, or a product description.
Second, it must be accessed significantly less often than the other fields in the table. A large field often accounts for more than 80% of the entire record, and a database generally stores data in its data files record by record. That is, when we query a few fields of some records, the database cannot read only those fields; it has to read all the other fields as well (unless the whole query can be answered from an index). We therefore end up reading a lot of irrelevant data, including the large fields, and because they take up so much space, the wasted IO is substantial.
In this scenario, we split the large field out of the original table into a separate table, which significantly reduces IO when accessing the other data and thus improves performance.
One might object that although access to the other fields becomes more efficient after the split, whenever we do need the large field we have to use a join, and the join may hurt processing efficiency considerably. This concern is reasonable, and it is exactly why access frequency is the second factor to consider before splitting out a large field. As noted above, the field must not only be large but also be accessed at low frequency, where low frequency is of course relative. Moreover, after the split the two tables are in a strict one-to-one relationship, so the performance impact of the join is not particularly large.
When we move a large field out, should other fields move with it? If we have decided to split a large field out of the main table, then any remaining field whose access frequency is similar to that of the large field, and much lower than that of the other fields in the table, can be moved out together with it. In fact, we do not even need a large field to split vertically. In some tables, most fields are rarely accessed while a few are accessed very frequently. Such tables are also good candidates for vertical splitting.
The example in the section "The impact of schema design on performance" actually applies vertical splitting twice. First, the content field of the group_message_bad table is split out into the group_message_content table. Second, the fields of the original user_bad table that are not large but are rarely used are split out into the new user_profile table.
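As a rough illustration of the vertical split, the sketch below keeps the frequently read columns in group_message and moves the large content column into group_message_content, linked one-to-one. It assumes SQLite (via Python's sqlite3) as a stand-in for MySQL; the column group_msg_id and the helpers add_message, list_subjects, and read_message are hypothetical names.

```python
import sqlite3

# SQLite stands in for MySQL. The column group_msg_id and the helper
# functions are hypothetical; the two table names come from the chapter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE group_message (
        id INTEGER PRIMARY KEY,
        group_id INTEGER NOT NULL,
        author TEXT NOT NULL,
        subject TEXT NOT NULL,
        gmt_modified INTEGER NOT NULL
    );
    -- One row per message: a strict one-to-one relationship.
    CREATE TABLE group_message_content (
        group_msg_id INTEGER PRIMARY KEY REFERENCES group_message(id),
        content TEXT NOT NULL
    );
""")


def add_message(msg_id, group_id, author, subject, content, modified):
    # Writing a message touches both tables in one transaction.
    with conn:
        conn.execute("INSERT INTO group_message VALUES (?, ?, ?, ?, ?)",
                     (msg_id, group_id, author, subject, modified))
        conn.execute("INSERT INTO group_message_content VALUES (?, ?)",
                     (msg_id, content))


def list_subjects(group_id):
    # Frequent list query: never reads the large content column.
    return conn.execute(
        "SELECT id, subject FROM group_message "
        "WHERE group_id = ? ORDER BY gmt_modified DESC",
        (group_id,),
    ).fetchall()


def read_message(msg_id):
    # Infrequent detail query: joins on the primary key of both tables.
    return conn.execute(
        "SELECT m.subject, c.content FROM group_message m "
        "JOIN group_message_content c ON c.group_msg_id = m.id "
        "WHERE m.id = ?",
        (msg_id,),
    ).fetchone()


add_message(1, 1, "alice", "hello", "a very long post body", 100)
```

The list query stays narrow, while the detail view pays for a primary-key join only when the full content is actually needed.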
Horizontal Splitting of Large Tables: Type-Based Split Optimization
The strategy of horizontally splitting a large table may not be used very often in performance optimization, but used properly it can bring pleasant surprises.
Let's illustrate with an example. Suppose we extend the requirements of the previous example slightly: the group system administrator should be able to publish system messages, which are pinned to the top of every group's list of discussion posts.
Faced with this requirement, our first reaction is probably to add a type column to the group_message table that marks a post as either an ordinary member's discussion post or an administrator's pinned message. Each list page would then run two queries against group_message (one for pinned messages, one for ordinary discussion posts) and merge the results in the application. With this approach, because the group_message table holds a large amount of data, querying the pinned messages is relatively expensive.
Let's take a different approach to consider this question:
First, pinned messages and ordinary discussion posts have no interaction with each other at all;
Second, pinned messages change rarely compared with ordinary discussion posts;
Third, pinned messages are accessed very frequently;
Finally, the number of pinned messages is tiny compared with the number of ordinary discussion posts.
Given this analysis, storing the pinned messages in a table separate from the ordinary discussion posts adds no extra performance overhead and lowers the cost of each retrieval of pinned messages. Because they are accessed so frequently, the savings per retrieval add up to a large reduction overall. Data that is small in volume and rarely changed is also well suited to MySQL's query cache; if it were stored together with the ordinary discussion posts, the frequent changes to those posts would keep invalidating the cached queries on group_message, and the query cache would be of no use.
From this analysis it is easy to arrive at a better scheme for storing pinned messages: create a new table, similar to group_message, dedicated to pinned messages, which we name top_message.
Because pinned messages are global, the group_id column is omitted, and the content can likewise be stored in the group_message_content table.
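A minimal sketch of this type-based split might look as follows. The columns of top_message shown here are assumptions (the text only says the table resembles group_message without group_id), SQLite again stands in for MySQL, and list_page is a hypothetical helper that merges the two result sets in the application.

```python
import sqlite3

# SQLite stands in for MySQL. The column lists are assumptions; only the
# table names top_message and group_message come from the chapter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE group_message (
        id INTEGER PRIMARY KEY,
        group_id INTEGER NOT NULL,
        author TEXT NOT NULL,
        subject TEXT NOT NULL,
        gmt_modified INTEGER NOT NULL
    );
    -- Pinned messages are global, so there is no group_id column.
    CREATE TABLE top_message (
        id INTEGER PRIMARY KEY,
        author TEXT NOT NULL,
        subject TEXT NOT NULL,
        gmt_modified INTEGER NOT NULL
    );
    INSERT INTO group_message VALUES
        (1, 1, 'alice', 'first post', 100),
        (2, 1, 'bob', 'second post', 200),
        (3, 2, 'carol', 'other group', 300);
    INSERT INTO top_message VALUES (1, 'admin', 'site rules', 50);
""")


def list_page(group_id, limit=10):
    # Small, rarely changing table: cheap to read on every page view.
    pinned = conn.execute(
        "SELECT id, subject FROM top_message ORDER BY gmt_modified DESC"
    ).fetchall()
    # Large, frequently changing table: filtered by group as before.
    posts = conn.execute(
        "SELECT id, subject FROM group_message "
        "WHERE group_id = ? ORDER BY gmt_modified DESC LIMIT ?",
        (group_id, limit),
    ).fetchall()
    # Merge in the application: pinned messages first, then regular posts.
    return ([("pinned",) + row for row in pinned]
            + [("post",) + row for row in posts])
```

Every group's page shows the same pinned rows, which is why they do not need a group_id and why their query result is a natural candidate for caching.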
The above is only an example, and real applications may not be this simple, but it shows a way of thinking about how to improve overall system performance by horizontally splitting a large table at the schema level. In many large-scale applications, data volume and concurrency are so high that a single host can no longer support access to a single table; the large table is then split horizontally and stored in multiple databases on multiple hosts to improve overall scalability. This is covered in more detail in the section "Data Segmentation for Scalable Design" in the "Architecture Design" part.
Quasi-Real-Time Statistics Tables
We already raised the quasi-real-time optimization strategy for statistics tables in the section "The impact of business requirements on performance" in the chapter "Factors affecting MySQL Server performance". In short, it replaces real-time statistical queries with statistics computed on a schedule.
Why quasi-real-time?
After seeing this strategy, many people may ask why statistics that were required to be real-time should be changed to quasi-real-time. The reason is simple: real-time statistics cost too much. Every display (that is, every page refresh) triggers the statistical calculation again, wasting resources on a great deal of repeated work. With quasi-real-time statistics, each access reads only a small amount of precomputed data and no longer repeats the calculation.
Of course, not every statistic is suited to quasi-real-time computation. Even if we wanted it, product managers would not allow it, and even if product managers agreed, our users certainly would not.
What kinds of statistics are suitable for a quasi-real-time implementation?
First, the accuracy requirements for the statistic are not particularly strict;
Second, the statistic is not very time-sensitive;
Third, the statistic is accessed very frequently, with many repeated executions;
Finally, the amount of data involved in computing it is large.
That looks like a long list of requirements. Yet the systems you maintain very likely display statistics of exactly this kind: the number of users currently online, the total number of posts and replies in a forum, the total result count and page count for paged multi-criteria queries over large result sets, top-N rankings by virtual points, and so on.
Computing these statistics involves a large amount of data and a lot of computing resources, and they are accessed very frequently. Computed in real time, they would incur a heavy hardware cost as soon as the data volume grows a little. Yet being slightly inaccurate for a short period does not noticeably hurt the user experience. So a scheduled task can compute the statistics at regular intervals and store the results in a specially designed statistics table. When the statistics need to be displayed, we simply read the precomputed results. This can improve the performance of each display by orders of magnitude, which in turn improves the overall user experience.
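The scheduled-statistics idea can be sketched as below. This is an illustration only: SQLite stands in for MySQL, the stats table layout and the functions refresh_stats and total_posts are hypothetical, and in a real deployment refresh_stats would be invoked by a scheduler (for example cron) rather than called inline.

```python
import sqlite3
import time

# SQLite stands in for MySQL. The stats table and both function names are
# hypothetical; only the idea (timed refresh, cheap reads) is the chapter's.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE group_message (id INTEGER PRIMARY KEY, group_id INTEGER);
    CREATE TABLE stats (
        name TEXT PRIMARY KEY,
        value INTEGER NOT NULL,
        refreshed_at REAL NOT NULL
    );
""")


def refresh_stats(now=None):
    # Run by a scheduled job: the only place the expensive COUNT executes.
    now = time.time() if now is None else now
    with conn:
        total = conn.execute("SELECT COUNT(*) FROM group_message").fetchone()[0]
        # SQLite syntax; in MySQL this would be REPLACE INTO or
        # INSERT ... ON DUPLICATE KEY UPDATE.
        conn.execute(
            "INSERT OR REPLACE INTO stats VALUES ('total_posts', ?, ?)",
            (total, now),
        )


def total_posts():
    # Run on every page view: a single-row primary-key lookup.
    row = conn.execute(
        "SELECT value FROM stats WHERE name = 'total_posts'"
    ).fetchone()
    return row[0] if row else 0


with conn:
    conn.executemany("INSERT INTO group_message VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 2)])
refresh_stats()
```

Between two refreshes total_posts returns a slightly stale number, which is exactly the accuracy that is being traded for cheaper page views.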
9.2 Appropriate Data Types
Many database design optimization documents include advice on optimizing data types, and in MySQL, too, we can improve the overall schema design by optimizing data types.
The main principles of optimizing data types for high performance are:
1. Choose smaller data types to reduce storage space, and thereby reduce the IO needed to query the same data;
2. Choose data types that speed up data comparison.
Let's look at the storage format and length of some common data types to see which of them can help with optimization.
Numeric and Date Types
Let's start with the storage length and value range of data types that have an essentially fixed length.
For numeric types, the types considered here are the integer types and the fractional (floating-point) types. There is one more class of numeric types, DECIMAL[(M[,D])] (synonym DEC) and NUMERIC[(M[,D])], which store the digits of the value in a binary format, so the storage length is determined mainly by the M in the definition: the larger M is, the more space each value takes. M is the total number of digits and D is the number of digits after the decimal point; the defaults are M = 10 and D = 0. These types are mainly used where fixed precision is required. Because of their storage length, and because such data can be converted and stored as integers instead, I personally do not particularly recommend them.
Floating-point types should rarely be needed for storing numbers, for two reasons. First, a floating-point value is inherently inexact, only an approximation. Second, the value can be multiplied by a fixed factor and stored as an integer. This not only avoids the inexactness but also makes processing more efficient.
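The fixed-factor approach can be illustrated with a short sketch. The factor of 100 (storing amounts in hundredths) and the helper names to_stored and from_stored are assumptions chosen for the example, and the formatter handles non-negative values only.

```python
from decimal import Decimal

SCALE = 100  # hypothetical fixed factor: store amounts in hundredths


def to_stored(amount_text: str) -> int:
    # Parse the decimal text exactly, then scale to an integer for storage.
    return int(Decimal(amount_text) * SCALE)


def from_stored(value: int) -> str:
    # Format a non-negative stored integer back into decimal text.
    return f"{value // SCALE}.{value % SCALE:02d}"


# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# The same sum with scaled integers is exact.
int_total = to_stored("0.10") + to_stored("0.20")
```

Here float_total is not equal to 0.3, while int_total is exactly 30, which from_stored renders as "0.30". An integer column such as INT or BIGINT can then hold the scaled value.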
There are not many date and time types; the ones commonly used are DATETIME, DATE, and TIMESTAMP. In terms of storage, TIMESTAMP is the smallest of the types that carry a time of day, at four bytes, half the eight bytes of DATETIME; DATE, which stores no time of day, takes three bytes. The drawback of TIMESTAMP is that it can only store times from 1970 onward (and, being a 32-bit count of seconds, only up to January 2038), while the other two types can store dates as far back as the year 1000. If we need to store times before 1970, we must give up TIMESTAMP, but as long as we do not, TIMESTAMP is the better choice for saving storage space.
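The range limit follows directly from the four-byte storage: TIMESTAMP holds a count of seconds since the start of 1970 (UTC), and the MySQL manual gives the upper end of the supported range as 2038-01-19 03:14:07 UTC. The sketch below reproduces that arithmetic; the names TIMESTAMP_MAX_SECONDS, timestamp_limit, and fits_timestamp are hypothetical.

```python
from datetime import datetime, timezone

# Largest value of a signed 32-bit count of seconds since 1970-01-01 UTC.
TIMESTAMP_MAX_SECONDS = 2**31 - 1


def timestamp_limit() -> datetime:
    # The latest moment a four-byte seconds counter can represent.
    return datetime.fromtimestamp(TIMESTAMP_MAX_SECONDS, tz=timezone.utc)


def fits_timestamp(moment: datetime) -> bool:
    # Expects a timezone-aware datetime. Second 0 is excluded because
    # the documented range starts at 1970-01-01 00:00:01 UTC.
    return 1 <= moment.timestamp() <= TIMESTAMP_MAX_SECONDS
```

A value such as a date of birth in 1969 fails fits_timestamp and needs DATE or DATETIME, while ordinary creation or modification times fit comfortably.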
The types discussed above are the main fixed-length types we are likely to use. Comparing them shows at a glance which types take a lot of storage and which take little. When choosing a data type, we can then weigh each type's storage size against the values the business data may contain, and pick the type that takes the least space.
Character Storage Types
Now let's look at the data types that store character data.
CHAR[(M)] is a fixed-length type whose length is counted in characters, so the final storage length in bytes depends on the character set: with latin1 the maximum is 255 bytes, but with GBK it is 510 bytes. Whatever the length of the data actually stored, a CHAR column always occupies M characters in the database, padded with spaces where the data is shorter; M defaults to 1. Although CHAR pads with spaces, MySQL strips all trailing spaces when the data is read, so if the real data needs trailing spaces, CHAR cannot be used to store it. In versions before MySQL 5.0.3, if a CHAR was defined with M greater than 255, MySQL automatically converted it to a TEXT type large enough to hold that amount of data: CHAR(1000) became TEXT, and CHAR(100000) became MEDIUMTEXT. From MySQL 5.0.3 onward, any definition above 255 is rejected with an error and is no longer converted automatically.
VARCHAR[(M)] is a variable-length type that occupies only the length of the data actually stored. Its maximum length depends on the MySQL version. Before 5.0.3 the limit was expressed in characters: a VARCHAR could hold at most 255 characters, and the space actually used depended on the character set. Starting with 5.0.3 the limit is expressed in bytes and was raised to 65,535 bytes, so the maximum number of characters differs from one character set to another. Note that M itself still counts characters; it is the storage ceiling that is measured in bytes, so the largest usable M depends on the character set (with utf8, for example, at most 21,844 characters).
The storage characteristic of VARCHAR is that however large we set M, the space actually occupied is only the size of the data stored. Unlike CHAR, VARCHAR also keeps trailing spaces: MySQL returns exactly what we stored. In addition to the data, each VARCHAR value carries a 1-byte or 2-byte length prefix recording its actual length: one byte when values of the column need no more than 255 bytes, two bytes when they may need more.
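To make the difference concrete, the following sketch estimates the bytes used by one value under a simplified model. It assumes the rule stated in the MySQL manual that the VARCHAR length prefix is one byte when the column's values need at most 255 bytes and two bytes otherwise, treats CHAR as always padded to its full width, and ignores engine-specific row overhead. The function names and parameters are hypothetical.

```python
def varchar_bytes(value: str, max_chars: int, encoding: str,
                  max_bytes_per_char: int) -> int:
    # Data bytes actually stored for this value in the given character set.
    data = value.encode(encoding)
    # Length prefix: 1 byte if the column can never need more than 255
    # bytes, otherwise 2 bytes (simplified from the MySQL manual).
    prefix = 1 if max_chars * max_bytes_per_char <= 255 else 2
    return prefix + len(data)


def char_bytes(max_chars: int, bytes_per_char: int) -> int:
    # CHAR(M) is padded to M characters regardless of the value stored.
    # Simplified: assumes every character takes bytes_per_char bytes.
    return max_chars * bytes_per_char


short_name_varchar = varchar_bytes("abc", 50, "latin-1", 1)  # VARCHAR(50)
short_name_char = char_bytes(50, 1)                          # CHAR(50)
```

Under this model a three-character latin1 value costs 4 bytes in VARCHAR(50) but 50 bytes in CHAR(50), which is why short, variable data usually belongs in VARCHAR.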
TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT share one storage method; all four are variable-length types and differ only in their maximum length. Their limits are nominally stated as a maximum number of characters, but they are really limits in bytes: with a multibyte character set, the number of characters that can actually be stored is less than the nominal maximum, which is reached only with single-byte characters. As with VARCHAR, each value is stored with a length prefix: TINYTEXT needs 1 byte, TEXT 2 bytes, MEDIUMTEXT 3 bytes, and LONGTEXT 4 bytes. Besides the maximum lengths built into MySQL, these types are also limited by the maximum network packet size (max_allowed_packet) on both client and server.
In practice, the four TEXT types differ from CHAR and VARCHAR in several ways:
A default value cannot be set;
Only TEXT accepts a length M, written TEXT[(M)];
An index on any of these four types must specify a prefix length.
Other Common Types
Besides the field types described above, the other data types we mainly use are the following.
For the BIT type, M is the number of bits per value; it defaults to 1 and can be at most 64. BIT is a fairly new type in MySQL: a real BIT type appeared only in MySQL 5.0.3 (before that, BIT was simply TINYINT(1)), at first supported only by the MyISAM storage engine; from MySQL 5.0.5 the MEMORY, InnoDB, and NDB Cluster storage engines "support" it as well. In MyISAM, BIT takes very little space because values really are stored bit by bit. Some other storage engines instead convert BIT to the smallest sufficient INT type, so no space is saved, and it is more straightforward to use an integer type directly.
The SET and ENUM types are mainly suitable for fields whose possible values are few and rarely change. Although both take little storage space, they are not used much in practice, because they are somewhat more complicated to work with than other data types.
Everyone knows that as data volume (mainly the number of records) grows, database queries become less efficient, so people often try to improve performance by reducing the number of records in key tables. Besides controlling the total amount of data through the record count, we can also use smaller data types so that the same data fits in less space, which reduces the IO needed to retrieve it and thus improves performance.
In addition, because of the way CPUs process data, different data types differ in how efficiently they can be compared, sorted, and otherwise operated on. For fields that are frequently compared, computed on, or sorted, choose the types that are processed fastest wherever possible, for example integer types instead of floating-point or character types.
9.3 Object Naming Conventions
Naming conventions have no effect on performance. They get a section of their own because they are undervalued yet matter a great deal for later database maintenance. They are like the unwritten basic coding conventions of a programming language: at first we see few benefits in them and may even regard them as a restriction, but anyone who has had to maintain badly irregular code will probably have been quite depressed by it, and may even have silently cursed the original author.
The same is true of any system: with no standard to follow and a completely unrestrained style, we only leave our successors (and perhaps ourselves) a baffling, hard-to-maintain mess.
Naming conventions for database objects can be very simple. The industry has no strict, uniform rules; it is enough for the conventions to be consistent within a company.
In general, I personally recommend that you pay attention to the following aspects:
1. Database and table names should match the names of the business modules they serve as closely as possible;
This makes it easy for the DBA maintaining the database objects, for developers writing new code, and for other technical (or non-technical) staff to understand the relationship between the business logic and the data.
2. Tables that serve the same submodule should carry the submodule name (or part of it) as a prefix or suffix;
Giving tables with the same kind of function a common prefix or suffix likewise lets anyone who uses them quickly associate the object names with the related functions and business. This is convenient both for maintenance and for everyday use.
3. Table names should contain words that describe the data they store;
This is especially helpful for new employees, who can get familiar with the data sooner, understand the related business faster, and quickly work out what the data in each table means.
4. Field names should likewise match the actual data they hold;
The reason should be clear to every reader: each table has many fields corresponding to various attributes of the data, and besides complete documentation, well-chosen field names are a useful and more direct way to convey what they mean.
5. Index names should include the names (or abbreviations) of all index key fields, in the same order as in the index key, together with a prefix or suffix such as idx or ind that identifies the object as an index; they may also include the name of the table they belong to;
The greatest benefit is that during maintenance the DBA can read most of an index's information directly from its name.
6. Other objects, such as constraints, should as far as possible include the name of the table or other object they belong to, to make the relationship clear.
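As one possible way to apply point 5 consistently, a team could generate index names with a small helper instead of typing them by hand. The sketch below is only an example convention (prefix, table name, then key columns in index order); the function name index_name is hypothetical, and the 64-character check reflects MySQL's maximum identifier length.

```python
MAX_IDENTIFIER_LENGTH = 64  # MySQL's limit for index (and table) names


def index_name(table: str, columns: list[str], prefix: str = "idx") -> str:
    # Hypothetical convention: prefix, table name, then the key columns
    # in exactly the order they appear in the index key.
    name = "_".join([prefix, table, *columns])
    if len(name) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(f"index name too long, abbreviate columns: {name}")
    return name
```

For example, index_name("group_message", ["group_id", "gmt_modified"]) gives idx_group_message_group_id_gmt_modified, from which the table, the key columns, and their order can all be read off the name.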
These are only some fairly basic suggestions. Readers are free to establish their own naming conventions according to their own habits, as long as they work. A convention does not need many rules; it needs to be practical. And once a convention has been set, it must be followed strictly, otherwise it becomes a showpiece with no practical value.
The point of this chapter is to convey one idea: database performance is not so much optimized in as designed in. Database schema design is not, as many people think, a simple mapping of objects to tables, but a piece of systems engineering. To design a schema that is both efficient and sufficient for the business requirements, we need not only enough knowledge of the database system but also a good understanding of the application's business logic.
Excerpt from: "MySQL Performance Tuning and Architecture Design" by Jian Chaoyang