(Architecture and history)
MySQL logical architecture
Having a mental picture of how MySQL's components work together helps in understanding the MySQL server.
The top layer handles connections and threads: connection handling, authentication, and security.
The second layer contains the parser, the query cache, and the optimizer. It handles query parsing, analysis, optimization, and caching, all built-in functions, and all functionality that spans storage engines (stored procedures, triggers, views).
The third layer is the storage engines. The server talks to each storage engine through an API, and these interfaces hide the differences between engines.
MySQL also has a range of other features, including query execution and optimization, concurrency control, read/write locks, lock granularity (row-level and table-level locks), explicit locks (SELECT ... FOR UPDATE), and transactions.
History
The book also covers the history of MySQL, which I will not repeat here.
Both MySQL and the InnoDB engine now belong to Oracle.
MySQL storage engines
The two main engines are InnoDB and MyISAM. Other built-in engines include Archive, Blackhole, CSV, Federated, and Memory. Third-party engines are not listed here.
Get table information
Run the SHOW TABLE STATUS LIKE 'tablename'\G command to obtain information about a table.
It displays the following fields.
1. Name
Table name
2. Engine
The table's storage engine
3. Version
Version
4. Row_format
The row format. For MyISAM this may be Dynamic, Fixed, or Compressed. Dynamic rows vary in length because they contain variable-length fields such as VARCHAR or BLOB. Fixed rows always have the same length and consist only of fixed-length fields such as CHAR and INTEGER.
5. Rows
The number of rows in the table. For non-transactional engines such as MyISAM this value is exact; for transactional engines such as InnoDB it is usually an estimate.
6. Avg_row_length
Average number of bytes per row
7. Data_length
Amount of data in the entire table (in bytes)
8. Max_data_length
Maximum amount of data the table can hold
9. Index_length
Disk space used by indexes
10. Data_free
For a MyISAM table, the space that is allocated but currently unused. It includes the space left behind by deleted rows.
11. Auto_increment
The next AUTO_INCREMENT value
12. Create_time
Table creation time
13. Update_time
Time the table data was last modified
14. Check_time
The last time the table was checked with CHECK TABLE or the myisamchk tool
15. Collation
Default character set and collation of the table
16. Checksum
If enabled, a live checksum of the entire table's contents
17. Create_options
Any other options specified when the table was created
18. Comment
Miscellaneous extra information. For a MyISAM table, it holds the comment set when the table was created. For an InnoDB table, it shows the free space remaining in the tablespace. For a view, the comment contains the text VIEW.
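Much of the same information can also be read with an ordinary query, which is convenient when filtering or comparing several tables. The following is only a sketch: the schema name mydb and the table name tablename are placeholders, not names from the book.

```sql
-- Vertical output for one table (the \G terminator works in the mysql command-line client).
SHOW TABLE STATUS LIKE 'tablename'\G

-- Roughly the same fields from information_schema.
-- 'mydb' and 'tablename' are placeholder names.
SELECT TABLE_NAME, ENGINE, ROW_FORMAT, TABLE_ROWS,
       AVG_ROW_LENGTH, DATA_LENGTH, INDEX_LENGTH, DATA_FREE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_NAME = 'tablename';
```

As with SHOW TABLE STATUS, TABLE_ROWS should be read as an estimate for InnoDB tables.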
(Index 1)
Introduction
An index (called a KEY in MySQL) is a data structure that the storage engine uses to find records quickly.
Indexes are very useful for queries over large amounts of data. By the same token, poorly chosen indexes hurt performance badly once the data volume grows.
Index type
B-Tree (actually B+Tree)
MySQL indexes are normally implemented as B+Trees. Each non-leaf node holds pointers to its child nodes, and the keys in the node define the upper and lower bounds of the values in each child. Leaf nodes are linked to the next leaf. A B+Tree stores the indexed columns in order, so it is well suited to range lookups.
Hash index
A hash index is built on a hash table. It is useful only for lookups that exactly match every column in the index.
Others
Spatial indexes (R-Tree), full-text indexes, and other third-party index types
B+Tree index example
Advantages of indexes
Besides quickly locating the rows a query needs, a B+Tree index stores values in order, so it can also serve ORDER BY and GROUP BY operations. The index also stores the actual column values, so a query that needs only indexed columns can be answered from the index alone.
Indexes greatly reduce the amount of data to be retrieved by the database.
Indexes can help the server avoid sorting and temporary tables.
The index can change random I/O to sequential I/O.
Index optimization
Independent columns
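MySQL can use an index on a column only if the column stands alone in the comparison. It cannot be part of an expression or an argument to a function.
For example, WHERE id + 1 = 5 cannot use an index on id.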
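The usual fix is to move the arithmetic to the other side of the comparison so that the column is isolated. A minimal sketch, assuming a hypothetical table t whose primary key is id:

```sql
-- Hypothetical table t with PRIMARY KEY (id).

-- The column is wrapped in an expression, so the index on id
-- cannot be used to look up the row.
SELECT * FROM t WHERE id + 1 = 5;

-- Logically equivalent, with the column isolated: the index can be used.
SELECT * FROM t WHERE id = 4;
```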
Prefix index and Index selectivity
Prefix index
A prefix index indexes only the leading part of a column such as a VARCHAR, which saves index space. The drawback is that the index cannot serve a query on its own, because it holds only part of each value.
Index selectivity
Index selectivity is the ratio of the number of distinct indexed values (the cardinality) to the total number of rows in the table (#T). It ranges from 1/#T to 1, and higher is better. A unique index such as a primary key has a selectivity of 1.
MySQL does not allow the full length of BLOB or TEXT columns to be indexed. For such columns, choose a prefix long enough to keep selectivity high, but not so long that the index takes up too much space.
The approach in the book is to list the 10 most frequent values with their counts (COUNT(*)), then do the same for prefixes (LEFT(xxx, n)), increasing n step by step. Once the counts for the prefixes are close to the counts for the full values, that length can be used as the prefix.
Another approach is to compare COUNT(DISTINCT xxx)/COUNT(*) with COUNT(DISTINCT LEFT(xxx, n))/COUNT(*), increasing n step by step. Once the two ratios are close, that length can be used.
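The ratio comparison can be run for several prefix lengths in a single query. This is a sketch under assumptions: the table city_demo, its VARCHAR column city, and the final length of 7 are illustrative, and the right length depends on the actual data.

```sql
-- Hypothetical table city_demo with a VARCHAR column named city.
-- Compare the selectivity of the full column with several prefix lengths.
SELECT COUNT(DISTINCT city) / COUNT(*)          AS sel_full,
       COUNT(DISTINCT LEFT(city, 3)) / COUNT(*) AS sel_3,
       COUNT(DISTINCT LEFT(city, 5)) / COUNT(*) AS sel_5,
       COUNT(DISTINCT LEFT(city, 7)) / COUNT(*) AS sel_7
FROM city_demo;

-- If sel_7 turns out to be close to sel_full, index the first 7 characters.
ALTER TABLE city_demo ADD KEY idx_city_prefix (city(7));
```

Average selectivity can hide skew, so it is worth also checking the most frequent prefixes with the COUNT(*) approach.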
Multi-column index
Column order matters in a multi-column index. Some people think it is best to create a separate index on each column, but this is a mistake. MySQL later gained the index merge strategy, which combines several single-column indexes, but its performance is often poor, and seeing it in a query plan is a sign that the table's indexes are not optimal.
Index column order
A B+Tree index stores its entries in order. In a multi-column index, entries are sorted by the first column, then by the second column within equal values of the first, and so on.
There is the concept of a three-star index: the index earns one star if it places relevant rows next to each other, a second star if its rows are in the order the query needs, and a third star if it contains all the columns the query needs.
In general, when choosing the column order, take into account the columns used for sorting, the columns used for grouping, and the selectivity of each column.
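Because entries are sorted by the first column first, a multi-column index helps only queries that constrain a leftmost prefix of its columns. The sketch below uses a hypothetical people table to illustrate this; the names are assumptions made for the example.

```sql
-- Hypothetical table for illustration.
CREATE TABLE people (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  last_name  VARCHAR(50)  NOT NULL,
  first_name VARCHAR(50)  NOT NULL,
  dob        DATE         NOT NULL,
  PRIMARY KEY (id),
  KEY idx_name_dob (last_name, first_name, dob)
) ENGINE=InnoDB;

-- These match a leftmost prefix of idx_name_dob and can seek in it.
SELECT id FROM people WHERE last_name = 'Allen';
SELECT id FROM people WHERE last_name = 'Allen' AND first_name = 'Kim';

-- This skips the first column, so MySQL cannot seek in idx_name_dob;
-- at best it scans the whole index.
SELECT id FROM people WHERE first_name = 'Kim';
```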
Clustered index
"Clustered" means that rows with adjacent key values are stored close to each other.
In InnoDB, the primary key we define is normally the clustered index. If no primary key is defined, InnoDB uses a unique non-null index instead. If there is no such index either, it implicitly defines a hidden primary key and clusters on that.
Advantages
Related data is stored together.
Data access is fast. A clustered index holds both the index and the row data in the same B-Tree.
Queries that use covering indexes can use the primary key values stored in the leaf nodes.
Disadvantages
If all the data fits in memory, clustering brings little benefit.
Insert speed depends on insertion order. Inserting in primary key order is fastest (an auto-increment key achieves this), while random keys such as UUIDs are much slower.
Updating clustered index columns is expensive.
Full table scans can be slow if rows are sparsely packed.
Secondary indexes contain the primary key columns, and a lookup through a secondary index takes two steps (first find the primary key, then find the row through the primary key).
Non-clustered index (secondary index)
A secondary index is any index other than the primary key.
Because the primary key and the row data are stored in the same B+Tree, lookups by primary key are very fast. The leaf nodes of a secondary index store primary key values rather than the rows themselves, so a second lookup is required.
Covering index
Index lookups are fast, and MySQL can also read data directly from the index. If every column a query needs is in the index, MySQL does not have to read the rows at all.
An index that contains all the columns a query needs is called a covering index.
Secondary indexes also contain the primary key values, so a query that needs only the secondary index columns and the primary key can be covered as well.
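EXPLAIN shows whether a query is answered from the index alone. The sketch below assumes a hypothetical InnoDB table orders with PRIMARY KEY (id), a secondary index on customer_id, and an unindexed column total. For the first query one would expect Extra to contain Using index. The second needs a column outside the index, so the rows themselves have to be read.

```sql
-- Hypothetical table: orders(id PRIMARY KEY, customer_id, total, ...)
-- with KEY idx_customer (customer_id).

-- Needs only customer_id and the primary key, both of which are
-- stored in the secondary index.
EXPLAIN SELECT customer_id, id
FROM orders
WHERE customer_id = 42;

-- Also needs total, which is not in the index.
EXPLAIN SELECT customer_id, id, total
FROM orders
WHERE customer_id = 42;
```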
Sort by index scan
MySQL can produce ordered results in two ways: sorting the result set, or scanning an index in order.
Sorting by index works only when the index order matches the ORDER BY clause and all columns are sorted in the same direction (if the directions differ, one workaround is to store a negated number or a reversed string for one of the columns).
The ORDER BY clause must also reference a leftmost prefix of the index. Otherwise the index cannot be used for sorting.
One exception: if the leading index columns are fixed to constants in the WHERE clause, the remaining columns can be used.
For example, with an index on (date, id1, id2) and the following statement:
SELECT ... FROM xxx WHERE date = '2014-04-30' ORDER BY id1, id2
Here the first index column, date, is a constant, so the second and third columns of the index can be used for sorting.
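A few contrasting cases may make the rule clearer. This sketch assumes a hypothetical table log_entries with an ascending index on (log_date, id1, id2), where log_date plays the role of the date column above.

```sql
-- Hypothetical table log_entries with KEY (log_date, id1, id2).

-- Can sort by index: ORDER BY is a leftmost prefix of the index.
SELECT id1, id2 FROM log_entries
WHERE log_date > '2014-04-30' ORDER BY log_date, id1;

-- Cannot sort by index: mixed sort directions.
SELECT id1, id2 FROM log_entries
WHERE log_date = '2014-04-30' ORDER BY id1 DESC, id2 ASC;

-- Cannot sort by index: id1 is skipped, so this is not a leftmost prefix.
SELECT id1, id2 FROM log_entries
WHERE log_date = '2014-04-30' ORDER BY id2;

-- Cannot sort by index: the first column has a range condition,
-- so the remaining columns are not in a single sorted run.
SELECT id1, id2 FROM log_entries
WHERE log_date > '2014-04-30' ORDER BY id1, id2;
```

In the cases that cannot use the index for ordering, EXPLAIN would typically show Using filesort in the Extra column.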
Redundant index and duplicate index
Creating several indexes of the same type on the same columns in the same order produces duplicate indexes, which should be avoided.
Redundant indexes are different from duplicate indexes. If you create an index (A, B) and then an index (A), the second is redundant: (A) is a prefix of (A, B), so (A, B) can be used in its place (for B+Tree indexes). An index (B), however, is not redundant.
Unused index
Some indexes are created along with the table and then never used. Indexes slow down updates to table data, so such indexes are dead weight and should be dropped.
The pt-index-usage tool in Percona Toolkit can read a query log, EXPLAIN each query in it, and print a report on indexes and queries. This shows which indexes are almost never used and can be dropped.
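The difference is easiest to see in a table definition. The following sketch uses a made-up table and made-up index names, and newer MySQL versions may emit a warning about the duplicate definitions.

```sql
-- Hypothetical table showing duplicate and redundant indexes.
CREATE TABLE test (
  id INT NOT NULL PRIMARY KEY,
  a  INT NOT NULL,
  b  INT NOT NULL,
  UNIQUE KEY uq_id (id),  -- duplicate: the primary key already indexes id
  KEY idx_id (id),        -- duplicate again
  KEY idx_a_b (a, b),
  KEY idx_a (a),          -- redundant: (a) is a prefix of (a, b)
  KEY idx_b (b)           -- not redundant: (b) is not a prefix of (a, b)
) ENGINE=InnoDB;
```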
(Index 2)
Index optimization case
Here is the case study from the book.
Suppose you are designing an online dating site. The user profile table has columns such as country, region, city, sex, and eye color. The site must support searching on various combinations of these attributes, and sorting by the user's last online time and by ratings from other members.
Use Index sorting whenever possible
Sorting by index places strict constraints on index and query design. If an index is used to sort by user rating, it cannot also be used for a range condition on age. If MySQL uses an index for a range condition, it cannot use another index (or the later columns of the same index) for sorting. If that range condition appears in the most common WHERE clauses, accept that many queries will have to sort the matching rows without an index.
Support multiple filter conditions
Country and sex are not very selective, but almost every query filters on them, so they are worth placing at the front of the index as (sex, country). When a query does not restrict sex, it can still use the index by listing every value with IN, for example sex IN ('m', 'f'), so that the leftmost prefix of the index is still matched.
However, IN lists should not get too long. Age could also be expressed as an IN list instead of a range condition, but most age ranges cover too many values for that to be practical.
So age should go at the end of the index: it is usually queried as age >= xx AND age <= xx, and once a range condition uses an index column, the columns after it can no longer be used. Queries of the form age = xx are rare.
Eye color and hair color can be handled with IN as well, but the number of combinations grows quickly: the optimizer has to consider the product of the sizes of all the IN lists.
Avoid multiple range conditions
Suppose we want users who logged in within the last week (7 days) and who fall within an age range. Both are range conditions, and MySQL can use the index for only one of them, because no index column after the first range condition can be used. We could turn age into an IN list, but the age range is rather large.
An alternative is to stop users from entering arbitrary time periods and offer fixed options instead: logged in within the last day, the last three days, the last seven days, the last half month, or the last month.
Add a column that is set to 0 each time the user logs in, and let a scheduled task assign 1, 2, 3, and so on to users who fall into the buckets above. The query then needs only an equality (=) condition instead of a range condition.
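Put together, the index and a typical search might look like the sketch below. All names here are assumptions for illustration: the table profiles, the column login_bucket, and the bucket numbering, where 0, 1, and 2 are assumed to cover the last seven days. For index purposes MySQL treats an IN list as a set of equality conditions, not a range, so the age column at the end of the index remains usable.

```sql
-- Hypothetical table and column names. login_bucket is the column
-- maintained by the scheduled task.
ALTER TABLE profiles
  ADD KEY idx_search (sex, country, login_bucket, age);

SELECT id
FROM profiles
WHERE sex IN ('m', 'f')           -- no restriction on sex, but keeps the prefix usable
  AND country = 'US'
  AND login_bucket IN (0, 1, 2)   -- assumed to mean "within the last seven days"
  AND age BETWEEN 18 AND 25;      -- the only range condition, placed last
```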
Optimize sorting
When we sort with a query such as SELECT * FROM xxx ORDER BY xxx LIMIT xx, xx, performance degrades badly for later pages even when an index exists. To reach a large offset, MySQL has to read and discard every row before it, which costs a lot of I/O.
In this case, a deferred join makes better use of the index.
SELECT * FROM xxx INNER JOIN (
    SELECT pk FROM xxx ORDER BY xxx LIMIT xx, xx
) AS x USING (pk)
The subquery can be answered from the secondary index alone, because a secondary index also contains the primary key. Only the rows that are actually returned are then fetched from the table.
Index and table maintenance
Table maintenance has three main goals: finding and repairing corruption, keeping index statistics accurate, and reducing fragmentation.
Locate and repair the damaged table
Table corruption is usually caused by system or hardware problems. If you run into inexplicable behavior, use the CHECK TABLE command to find out whether the table is corrupt. An InnoDB table can usually be rebuilt with the following command:
ALTER TABLE xxx ENGINE=INNODB;
Corruption of an InnoDB table usually indicates a serious problem, because InnoDB is designed not to become corrupt. When it does, the cause is likely a hardware fault or an administrator error, such as manipulating the database files from outside MySQL.
Update statistics
You can run the ANALYZE TABLE command to regenerate the statistics (for example the row count, which is usually inexact for InnoDB).
You can use SHOW INDEX FROM xx to view index cardinality. The output contains a lot of information. Note that the Cardinality column is an estimate of how many distinct values the index contains.
Reduce index and data fragmentation
B-Tree indexes can become fragmented, which hurts query performance. A fragmented index may be poorly filled or stored non-sequentially on disk.
There are three types of data fragmentation:
Row fragmentation: a row is stored in several pieces in several locations. It can hurt performance even when a query needs only a single row from an index.
Inter-row fragmentation: logically sequential pages or rows are not stored sequentially on disk.
Free space fragmentation: data pages contain a lot of empty space, so the server reads a lot of data it does not need, which is wasteful.
MyISAM tables can suffer from all three types.
InnoDB does not fragment short rows. It moves them and rewrites them in a single piece.
You can defragment a table with OPTIMIZE TABLE. If the command is not supported, you can rebuild the table with ALTER TABLE xxx ENGINE=INNODB;.
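The three maintenance tasks map onto a handful of statements. A minimal sketch, using mytable as a placeholder name:

```sql
-- mytable is a placeholder table name.

-- 1. Look for corruption.
CHECK TABLE mytable;

-- 2. Refresh index statistics, then inspect the Cardinality column.
ANALYZE TABLE mytable;
SHOW INDEX FROM mytable;

-- 3. Reduce fragmentation.
OPTIMIZE TABLE mytable;

-- Alternative rebuild for an InnoDB table: a "no-op" ALTER to the same engine.
ALTER TABLE mytable ENGINE=InnoDB;
```

Rebuilding a large table can take a long time and may block or slow other work, so these statements are best tried on a copy first.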
Index Summary
When creating indexes and writing queries that use them, keep three points in mind:
Single-row access is slow. If the server reads a block from storage and then uses only one row from it, most of the work is wasted. Ideally each block that is read contains as many of the needed rows as possible.
Accessing ranges of rows in order is fast, and ORDER BY and GROUP BY on data that is read in order need little extra work for sorting or aggregation.
Index-only (covering) access is fast. If the result can be obtained directly from the index, the second lookup for the row data is skipped.
We would like a perfect three-star index for every query, but that is rarely achievable and trade-offs are needed. Common queries must have good indexes. For rarely used queries, it is acceptable to sort and filter the retrieved rows.
(Query performance optimization)
Optimizing MySQL takes more than a good schema and good indexes: the queries themselves must be well written. A badly written query may not use the indexes at all, and then even the best table and index design is of no use.
The following describes how to write efficient queries and how to understand why a query is efficient or inefficient.
Why is the query speed slow?
A query is fast when its response time is short.
A query's time is spent on the network, on CPU work, on generating statistics and the execution plan, on lock waits (mutex waits), and above all on calls into the underlying storage engine to retrieve data. Those calls spend time on memory operations, CPU operations, and, when memory is insufficient, I/O operations.
Some of these operations are unnecessary or repeated. Optimizing a query means eliminating those operations or reducing the time they take.
Slow queries: optimizing data access
The most common reason for poor query performance is accessing too much data. Often a query accesses far more data than it actually needs.
Analyzing such a query in two steps is usually effective:
1. Check whether the application is retrieving more data than it needs. This usually means too many rows, but sometimes too many columns.
2. Check whether the MySQL server is analyzing more rows than it needs.
Is unnecessary data requested from the database?
Some queries request more data than they need, and the application then discards the excess. This puts an extra burden on the MySQL server, increases network overhead, and consumes CPU and memory on the application server.
The following are some common mistakes:
Fetching rows that are not needed
An application may open a result set (for example a ResultSet), read only the first 10 rows, and close it, on the naive assumption that the query stops there. In fact MySQL generates the complete result set, and the client library fetches all of it and holds it in memory. It is better to add a LIMIT clause.
Returning all columns from a multi-table join
Take only the required columns from a join, and keep subqueries small (sometimes it is better to rewrite the subquery).
Always retrieving all columns
Treat every SELECT * with suspicion: are all the columns really needed? Retrieving all columns prevents optimizations such as covering indexes. On the other hand, many people feel it simplifies development, and such queries can sometimes benefit from the query cache.
Fetching the same data repeatedly
Sometimes a page or interface needs the same data several times. Fetch it with one query and reuse it, not with repeated queries, which waste resources.
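Two of these mistakes have simple fixes in the query itself. The sketch below assumes that the Sakila sample database is installed for the join example. The table articles and its columns are hypothetical.

```sql
-- Fetch only the rows needed; do not read 10 rows from a full result set.
-- articles, id, title and published_at are hypothetical names.
SELECT id, title
FROM articles
ORDER BY published_at DESC
LIMIT 10;

-- Multi-table join: this returns every column of all three tables.
SELECT *
FROM sakila.actor
  INNER JOIN sakila.film_actor USING (actor_id)
  INNER JOIN sakila.film USING (film_id)
WHERE sakila.film.title = 'ACADEMY DINOSAUR';

-- Better: return only the columns that are actually needed.
SELECT sakila.actor.*
FROM sakila.actor
  INNER JOIN sakila.film_actor USING (actor_id)
  INNER JOIN sakila.film USING (film_id)
WHERE sakila.film.title = 'ACADEMY DINOSAUR';
```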
Is MySQL scanning additional records?
After confirming that only the required records are returned, we need to check whether only the required records are scanned. The three simplest metrics for query cost in MySQL are:
Response time
Number of scanned rows
Number of returned rows
All three metrics are recorded in the MySQL slow query log. Checking for queries that scan too many rows matters most.
Response time
Response time is only a surface-level value.
Response time = service time + queue time.
Queue time is usually spent waiting for I/O or for locks. It is hard to tell analytically whether a response time is reasonable. In practice this is mostly judged from experience.
Number of scanned rows and number of returned rows
Ideally the number of rows scanned equals the number of rows returned, but this is rarely the case, especially for joins. We should still try to reduce the number of rows scanned.
Number of scanned rows and access types
MySQL can find a row in several ways. Some access methods scan many rows to return one, while others return the row without scanning at all.
The type column in EXPLAIN output shows the access type. Access types range from full table scan to index scan, range scan, unique index lookup, and constant. They are listed here from slowest to fastest, and the number of rows scanned goes from most to fewest.
If a query does not get a good access type, the best fix is usually to add a suitable index.
For example, SELECT * FROM xxx WHERE id = 1
When id is the primary key, EXPLAIN shows type const and rows 1, meaning the query needs to access only one row. Without an index, the type is ALL and rows is usually large, roughly the number of rows in the table. The Extra column then shows Using where, meaning MySQL applies the WHERE condition to discard rows after the storage engine has read them.
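The two cases can be compared side by side with EXPLAIN. This sketch assumes a hypothetical, populated table t with PRIMARY KEY (id) and an unindexed column note. The first statement should report a single-row lookup (rows = 1), while the second should report type ALL with Using where in Extra.

```sql
-- Hypothetical table t: PRIMARY KEY (id), plus an unindexed column note.

-- Lookup through the primary key.
EXPLAIN SELECT * FROM t WHERE id = 1;

-- No usable index: full table scan, with rows filtered by the server.
EXPLAIN SELECT * FROM t WHERE note = 'x';
```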
MySQL generally applies a WHERE condition in one of three ways:
Apply the condition to the index lookup to eliminate non-matching rows. This happens in the storage engine layer.
Use a covering index (Extra shows Using index) to filter on the index and return results without reading the rows.
Read rows from the table and then filter them (Extra shows Using where).
A good index can therefore improve query performance greatly, but not always. Sometimes a query simply has to access a lot of data and no index can help.
If a query scans many more rows than it returns, the following techniques can help:
Use a covering index.
Change the table structure, for example by using a separate summary table.
Rewrite the query so that MySQL can optimize it better (discussed later).