MySQL High Performance Learning Notes 02
Liu Yan
Suhuanzheng7784877@163.com
Blog: suhuanzheng7784877.iteye.com
1. Preface
In High Performance MySQL, Chapter 2 (benchmarking) and Chapter 3 (server performance profiling) call for a global view: the right benchmark method and input data differ from one application scenario to the next, so we will return to those two topics later. Let's move straight to the topic of optimizing the schema and data types.
2. Optimizing Data Types
Optimizing data types mostly happens when creating or altering tables, and the whole topic boils down to one goal: storing and querying the database's data efficiently. Native MySQL data types fall roughly into these groups: integer types, real number types, string types, date and time types, bit types, and a few special types.
To optimize data types, follow these principles:
1) Smaller is usually better. Use INT instead of BIGINT when the range allows. Small data types consume fewer resources and less CPU time, and smaller fields mean less disk addressing, so iowait does not balloon because of oversized columns.
2) Prefer simple types. An analogy from the application layer helps: between Java's primitive int and boxed Integer, the Integer consumes more resources, which is why I advised using primitive types for attribute values in my notes on Java code optimization. The same holds in MySQL: represent a field with the simplest type you can, because simple types are closer to what the CPU supports natively. For example, store timestamps as integers, store IP addresses as integers, and store currency as scaled integers, converting the decimal point back at the application layer with multiplication and division (see the sketch after this list).
3) Avoid NULL unless you truly need it; make fields NOT NULL, with default values where appropriate. Nullable columns make indexing, index statistics, and comparisons more complicated for MySQL, although InnoDB does store NULLs compactly with a special bit.
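As a minimal sketch of the scaled-integer currency idea, assuming a hypothetical order_item table:

CREATE TABLE order_item (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  price_cents INT UNSIGNED NOT NULL,  -- 12.34 yuan is stored as 1234
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- The application multiplies/divides by 100 at the boundary:
SELECT price_cents / 100 AS price_yuan FROM order_item;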
3. Integer Types
TINYINT (8-bit)
Range: unsigned (0 ~ 255), signed (-128 ~ 127)
Scenario: typically used for numeric dictionaries — the id of a small constant table. The data size is very limited and the table rarely changes, so TINYINT is enough.
SMALLINT (16-bit)
Range: unsigned (0 ~ 65535), signed (-32768 ~ 32767)
Scenario: replaces TINYINT when the constant table holds more data, such as China's provinces, cities, districts, and townships. Are there really more than 65,535 townships and counties in China? It is enough.
MEDIUMINT (24-bit)
Range: unsigned (0 ~ 16777215), signed (-8388608 ~ 8388607)
Scenario: moderate data volumes — for instance a log table whose size is kept in check by periodic cleaning. MEDIUMINT works as a lightweight INT id.
INT (32-bit). Note that an unsigned INT exceeds the range of a Java int, so the application layer may need a long!
Range: unsigned (0 ~ 4294967295), signed (-2147483648 ~ 2147483647)
Scenario: sufficient for most auto-increment id columns. Unsigned gives more than 4 billion values, which is basically enough for small and medium Internet systems.
BIGINT (64-bit): an astronomical range; unsigned values can overflow even a Java long and need special handling (java.math.BigInteger) at the application layer.
Range: unsigned (0 ~ 18446744073709551615), signed (-9223372036854775808 ~ 9223372036854775807)
Scenario: storing ids for truly massive data in a relational database. With a data volume anywhere near this range, it is hard to imagine still managing it with an RDBMS.
The biggest difference between signed and unsigned is whether negative numbers are allowed. Once you choose UNSIGNED, negative values are rejected and the full range goes to non-negative numbers. An unsigned INT generally meets a system's needs, even for auto-increment ids: more than 4 billion rows is no small MySQL data volume. At 10 million transactions recorded per day, a month comes to roughly 300 million records. If real-time data grows beyond this order of magnitude, consider table sharding and database sharding, or use NoSQL to split up the volume. That is the practical ceiling of the numeric field ranges.
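A minimal sketch putting these widths together (table and column names are hypothetical):

CREATE TABLE app_log (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- headroom for a long-lived log
  status  TINYINT UNSIGNED NOT NULL,                -- small dictionary value, 0-255
  user_id INT UNSIGNED NOT NULL,                    -- up to about 4.29 billion users
  PRIMARY KEY (id)
) ENGINE=InnoDB;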
4. Real Number Types
Real types hold numbers with a fractional part: the floating-point types plus the fixed-point DECIMAL.
FLOAT: 4-byte storage
DOUBLE: 8-byte storage
DECIMAL: up to 65 digits
Here is a good article on floating-point versus fixed-point arithmetic: http://www.163ns.com/zixun/post/5226.html.
Roughly: FLOAT is fine for things like percentages; DOUBLE is more precise than FLOAT; DECIMAL is for exact, full-amount calculation. In business scenarios that are not precision-sensitive I have even seen people use DOUBLE for money. (Where do the rounded-away fractions go? Across 100 million users, even a tiny per-person difference adds up to enough to buy a house?)
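Where precision matters, a minimal sketch of exact money with DECIMAL (account is a hypothetical table):

CREATE TABLE account (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  balance DECIMAL(12,2) NOT NULL DEFAULT 0.00,  -- exact to the cent, unlike DOUBLE
  PRIMARY KEY (id)
) ENGINE=InnoDB;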
Storage and computation cost: FLOAT < DOUBLE < DECIMAL.
Native floating-point types let the CPU participate in the computation directly; after all, CPU performance is often quoted as the number of floating-point operations per second. As for which vendor — AMD or Intel — has stronger floating-point throughput, each product generation changes with the times; personally I prefer Intel Xeon processors.
5. String Types
The string types mainly come down to a contest between VARCHAR, CHAR, BLOB, and TEXT.
String fields deserve to be optimized to the extreme, because strings are the most common type in a database, and text processing is the most common scenario we face every day.
VARCHAR: stores variable-length strings and (in general) saves space compared with the fixed-length CHAR type. The exception is ROW_FORMAT=FIXED, under which every row gets fixed-length storage. VARCHAR needs an extra 1 ~ 2 bytes to record the string length: 1 byte if the column's maximum length is <= 255 bytes, otherwise 2 bytes. Also, since MySQL 5.0, VARCHAR no longer strips trailing spaces.
CHAR: a fixed-length type; MySQL allocates the full declared width on storage. Fixed-length columns are not prone to disk fragmentation, and for genuinely fixed-length values CHAR is more efficient than VARCHAR — for example, storing MD5 or SHA1 digests.
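A minimal sketch of the MD5 case (file_hash is a hypothetical table):

CREATE TABLE file_hash (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  md5  CHAR(32) NOT NULL,      -- an MD5 hex digest is always exactly 32 characters
  name VARCHAR(255) NOT NULL,  -- file names vary in length, so VARCHAR fits
  PRIMARY KEY (id)
) ENGINE=InnoDB;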
BLOB types:
Store large binary field data, with no collation or character set.
The family includes TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB.
Typically used for images and document files. When a BLOB value is large, the storage engine handles it specially with external storage.
TEXT (the counterpart of Oracle's CLOB):
Stores large field data as characters, with a collation and character set.
The family includes TINYTEXT, TEXT, MEDIUMTEXT, and LONGTEXT.
Typically used for articles and HTML page content. As with BLOB, large TEXT values are handled specially with external storage.
Experience:
1) When retrieving BLOB or TEXT columns, truncate the value with the SUBSTRING(column, length) function so it is converted to an ordinary string. That way an implicit temporary table can stay in MySQL's memory instead of spilling I/O to disk (see the sketch after this list).
2) When a temporary table grows past the configured max_heap_table_size or tmp_table_size, the in-memory temporary table is converted to an on-disk one, turning a memory-bound operation into an IO-bound one.
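A minimal sketch of the SUBSTRING trick, assuming a hypothetical article table with a TEXT column named body:

-- Selecting the raw TEXT column in a query that needs a temporary table
-- can force an on-disk temp table; a SUBSTRING slice can stay in memory.
SELECT id, SUBSTRING(body, 1, 200) AS preview
FROM article
ORDER BY created_at DESC;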
6. Enumeration Type
MySQL stores ENUM compactly, and it can generally replace commonly used string constants. MySQL compresses the enumeration list into 1 ~ 2 bytes: each enumerated value is saved as an integer, with a lookup table mapping the integer to its string value. In other words, an ENUM column actually stores numbers.
A quick experiment confirms this.
CREATE TABLE `user2` (
  `id`   INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
  `type` ENUM('Wei','Shu','Wu') CHARACTER SET utf8 DEFAULT 'Wei',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Query the type column in a numeric context and you get a number back, which shows that the underlying enumeration is stored and accessed as a numeric type. In addition, if an enumerated column must join another table, benchmarks show that ENUM-to-ENUM joins give the highest QPS. Note too that MySQL sorts ENUM values by their internal numbers, not by the strings.
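A minimal sketch of the experiment, using the user2 table above:

-- Adding 0 exposes the internal number behind each enumerated value.
SELECT type + 0 AS type_number, type FROM user2;
-- Returns 1 for 'Wei', 2 for 'Shu', 3 for 'Wu'.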
Scenario: wherever a column holds a fixed set of constants, prefer ENUM over a string type.
7. Date and Time Types
The date and time types are: DATE, TIME, YEAR, DATETIME, and TIMESTAMP.
DATE: effectively the date portion of DATETIME, ranging from the year 1000 up to December 31, 9999 AD.
TIME: effectively the time portion of DATETIME, within the 24 hours of a day.
YEAR: rather awkward; the critical values are 69 and 70. A two-digit 69 means 2069, while 70 means 1970: the two-digit range 0 ~ 69 maps to 2000 ~ 2069 and 70 ~ 99 maps to 1970 ~ 1999. Outside special cases it is basically useless.
The most commonly used types are DATETIME and TIMESTAMP.
DATETIME: uses 8 bytes to store date and time, is accurate to the second, and is independent of the time zone. The range runs from the year 1000 to the year 9999.
TIMESTAMP: uses 4 bytes to store date and time, but the range only covers 1970 through early 2038. If nothing goes wrong, most comrades reading this article will still be around that year; whether a "timestamp2" appears to extend the range depends on whether a larger integer type backs it. At the application layer you can insert this field from a long value and the correct date and time are stored. Crucially, the field depends on the time zone — pay special attention in internationalized products!
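A minimal sketch of the time-zone difference (event is a hypothetical table):

CREATE TABLE event (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  happened_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- stored as UTC seconds
  noted_at    DATETIME NULL,                                 -- stored exactly as given
  PRIMARY KEY (id)
) ENGINE=InnoDB;

SET time_zone = '+08:00';  -- changing the session zone shifts how happened_at displays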
8. SET Type
SET stores a collection of constant flags in a single column; a permissions column for access control is a fitting example, representing the aggregate of rights one person holds. In practice, permissions can also be expressed as an integer bitmask, much like Unix permission numbers: 777, for instance, grants the owner, the group, and all other users read, write, and execute rights.
For a SET column, the Java application layer receives the value as a single comma-separated string, which must be processed separately by splitting it into a string array (String.split(",")).
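A minimal sketch of a SET permission column (table and element names are hypothetical):

CREATE TABLE account_perm (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
  perms SET('read','write','delete','admin') NOT NULL DEFAULT 'read',
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- FIND_IN_SET tests membership; the value round-trips as e.g. 'read,write'.
SELECT id FROM account_perm WHERE FIND_IN_SET('write', perms) > 0;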
9. Special Field: Storing IPv4 Addresses
MySQL offers two functions to convert an IPv4 dotted-quad string to an integer and back; storing and comparing integers is faster than strings. The two functions are:
IP address to number: SELECT INET_ATON('192.168.1.1');
Result:
+--------------------------+
| INET_ATON('192.168.1.1') |
+--------------------------+
|               3232235777 |
+--------------------------+
Number to IP address:
SELECT INET_NTOA(3232235778);
Result:
+-----------------------+
| INET_NTOA(3232235778) |
+-----------------------+
| 192.168.1.2           |
+-----------------------+
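A minimal sketch tying the two functions to a column (login_log is a hypothetical table):

CREATE TABLE login_log (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ip INT UNSIGNED NOT NULL,  -- INET_ATON output fits a 32-bit unsigned integer
  PRIMARY KEY (id)
) ENGINE=InnoDB;

INSERT INTO login_log (ip) VALUES (INET_ATON('10.0.0.8'));
SELECT INET_NTOA(ip) FROM login_log;  -- back to '10.0.0.8'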
10. Primary Key Type Selection
This is an old topic. We — especially some traditional IT enterprises living on taxpayers' money, alas — are always anxious about these dreadful details!
On a single node, an unsigned auto-increment integer is basically the best choice. Using strings — especially randomly generated UUID strings — causes a steep drop in read/write performance.
In a cluster, decide between integers and UUIDs by scenario. Integers carry one step that UUIDs do not: the cluster must know the current global primary key, which means locking (exclusive or read/write locks); UUIDs avoid that lock overhead and carry almost no other redundant logic. At the storage level, as data volume grows, the integer type keeps a clear advantage; when concurrency is heavy and raw throughput is the goal, UUIDs gain a slight edge. With large data volumes and heavy concurrency but relatively few reads and writes, integers are still the default. If reads and writes are about equally frequent and you are willing to trade some data consistency and accuracy for throughput, UUID and integer come out roughly even in that scenario (see the sketch below).
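A minimal sketch contrasting the two key styles (both tables are hypothetical):

CREATE TABLE t_int (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- monotonic, append-friendly
  v  INT
) ENGINE=InnoDB;

CREATE TABLE t_uuid (
  id CHAR(36) NOT NULL PRIMARY KEY,  -- random order scatters InnoDB page inserts
  v  INT
) ENGINE=InnoDB;

INSERT INTO t_uuid (id, v) VALUES (UUID(), 1);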
11. Some Principles for Designing Table Structures
1) As few fields as possible:
Long, long ago, our teachers told us the same thing: if a table has too many fields, split it into two tables; if the split table is still too wide, keep splitting. So what counts as a big table — how many fields is too many? I don't think there is a standard answer; readers work in different industries and business scenarios. What follows is only the author's experience and the author's claim — take it as one opinion, and critics are welcome to throw bricks at any time!
Industry | Scenario | Average fields in the master table | Remarks
-------------- | ---------------------------------- | ------- | -------
Traditional IT | MIS system | 16 ~ 25 | Most representative: OA
Traditional IT | Financial system | 20 ~ 25 | Billing, payroll
Traditional IT | Asset management system | 8 ~ 14 | A resource topology may need many tables, but each table should keep its fields few.
Traditional IT | Organization management | 6 ~ 12 | Needs tree-like self-joins.
Traditional IT | Function menu | 4 ~ 8 | Function tree or tabs.
Traditional IT | BOSS integrated business system | 12 ~ 40 | BOSS business is complex: the subscriber is the main table, and the value-added services around the subscriber amount to main tables too, hence the wide range.
Traditional IT | Workflow system | 16 ~ 24 | Open-source workflow engines such as jBPM keep base tables around 12 fields.
Traditional IT | Banking core customer information | 16 ~ 25 | Account opening collects a lot of information, stored across separate tables; each bank's schema differs, though it is roughly BOSS-like.
Internet | SNS | 10 ~ 15 | The master table should also count the user profile; see the Apache Shindig framework specification, though its JPA-generated tables carry many foreign keys.
Internet | Weibo | 12 ~ 16 | The post table is the master table; most of the extra fields are foreign-key associations.
Internet | Content publishing (news) | 8 ~ 14 | See the main table in OpenCMS.
Internet | BBS | 10 ~ 16 | Outside this range, consider whether there is room for table sharding.
…… | …… | …… | Additions and corrections welcome: one person's energy is limited, the field is broad, and blind spots are inevitable!
To return data to the MySQL server layer, the storage engine must first buffer the rows, and the server layer then decodes the buffered content into fixed columns. When the column count is large, the cost of this row conversion is high — especially with variable-length row structures, of which the most representative and most used in MyISAM and InnoDB is VARCHAR.
2) As few joins as possible:
Here, follow your own business characteristics. MySQL caps a single query at 61 joined tables (remember, this is depth, a bit like Java stack depth, not breadth). I keep wondering what application could be convoluted enough to join that deep. Unless it's a scenario like this: you discover that Li Moumou's identity and age are fake, and once there is conclusive evidence, the investigators dig up his father and mother first, then keep mining associations in depth, unearthing a string of corrupt officials and digging down from each of them in turn — that might run past 61 levels. Such is the character of a "great power": the water runs deep; don't ask, don't think, don't listen, for there are plenty more unfair things in the world.
The authors of High Performance MySQL suggest keeping joins within about 12 tables.
3) Enumeration
This needs little explanation: after a few real projects you will know when to use ENUM and when to use a foreign key to a constant table (data dictionary). Yes/no; male, female, gender unknown — options like these will not change for decades and suit ENUM.
Which country a person belongs to, however, is a dictionary table! If one day, because of a war, some great power (you know which...) no longer exists, an ENUM would force an ALTER TABLE — blocking the application — to correct the database.
One more point: if you use SET, remember its elements are not mutually exclusive; if your values are mutually exclusive, ENUM is the better fit.
4) To denormalize or not?
Benefits of a normalized schema:
1: A query touches only the local fields of the table you care about, with no extra I/O for joins.
2: An update only touches the data that changed, instead of locking a row of one huge wide table. This already hints at bucketing — the read/write splitting idea applied to storage.
3: Normalized tables are mostly small tables; unless they hold a global dataset, they can largely fit in the query cache, i.e., in memory.
4: Normalized tables are essentially grouped by a foreign key, as if GROUP BY had been done at design time; many queries reduce to a simple WHERE foreign_key = value.
Drawbacks of a normalized schema:
1: The flip side of benefit 1. When a query needs the full set of fields, it must perform several random-I/O joins to reassemble the whole entity — for example, a user's profile in an SNS. In a denormalized design, one table holds basically all the data (a small part excepted) in sequential I/O, and the disk reads it fast.
2: The more tables a join touches, the harder it is to keep indexes lined up; the join may fail to use the relevant index, or the clustered index loses its benefit.
The pros and cons of a denormalized schema are exactly the inverse of the above, so I won't repeat them.
In real development, use normalized + denormalized = a hybrid, according to the project's business scenario. Different stages of a product, different data volumes, user volumes, and concurrency levels call for different mixes (a sketch follows).
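A minimal sketch of a common hybrid move — duplicating a hot column to avoid a join (tables and columns are hypothetical):

-- Denormalize: cache the author's name on the message table.
ALTER TABLE message ADD COLUMN author_name VARCHAR(64) NOT NULL DEFAULT '';

-- Keep it in sync at write time so list pages can skip joining `user`:
UPDATE message m JOIN user u ON u.id = m.author_id
SET m.author_name = u.name;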
12. Sacrificing Data Timeliness for Performance
If you have read my article "Single-point stress optimization for Web sites," you may remember one of its optimizations.
To recap: a grid list displayed pages of records from a table, so each request essentially ran two query transactions — one SELECT COUNT for the pager and one ordinary SELECT * to fetch the page of records. As the records grew, the cost of the two queries grew too; and although they belong to one business operation, they are two transactions, so the data could already be slightly inconsistent between them. So I simply moved the count into a MEMORY cache table. The table, tableinfo, records only a table name and its total record count; a scheduled task periodically runs the SELECT COUNT statement and assigns the result to that table's row.
CREATE TABLE `tableinfo` (
  `tablename` VARCHAR(40) NOT NULL,
  `datacount` INT(10) DEFAULT NULL,
  PRIMARY KEY (`tablename`),
  UNIQUE KEY `tablename` (`tablename`) USING HASH
) ENGINE=MEMORY DEFAULT CHARSET=utf8;
The engine is MEMORY and the key is the table name with a HASH index. Each time the application starts, it also scans the business tables and writes the latest totals into tableinfo. From then on, every "count" request reads the summary table instead, and the summary stays valid as long as the database service is not restarted. Be clear about the trade-off: summary information sacrifices the timeliness of the data (see the refresh sketch below).
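A minimal sketch of the periodic refresh (the scheduling itself lives outside MySQL; order_table is a hypothetical business table):

-- Refresh the cached total for one business table.
REPLACE INTO tableinfo (tablename, datacount)
SELECT 'order_table', COUNT(*) FROM order_table;

-- The pager then reads the cached total instead of re-counting:
SELECT datacount FROM tableinfo WHERE tablename = 'order_table';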
13. Bucket-Based Thinking
The idea of bucketing is everywhere in the programming world — ConcurrentHashMap in Java's java.util.concurrent package, for example. Applied to database theory, the same idea underlies read/write splitting.
The demo in High Performance MySQL is a classic; the original post used a diagram to convey the authors' intent.
It is an idea worth absorbing: both code and databases can port it and borrow from it.
14. Summary
I have been a bit busy lately, so this second installment is slightly hasty; fortunately the following chapters will keep the summaries coming.
In my experience, the payoff of schema optimization is greater than that of most other optimizations. The drawback is just as obvious: when the table structure changes, your SQL may have to change with it. This is also why, at many Internet companies, ordinary developers have no permission to design or modify table structures — only the DBA does, and structural changes go through a strict approval process. DBAs guard the table structure tightly; if you disagree, argue the table design out with them up front, because once data volume has grown — especially with many clustered databases — modifying the schema becomes very expensive.
PS: Can anyone recommend a good diagramming tool? Drawing always takes me forever and still looks ugly! Friends with an eye for aesthetics, please teach me how to make good-looking diagrams. Next time we discuss the key topic: indexes!