Database performance is critical for a data-centric application, since it directly affects how the program performs. In general, ensuring database efficiency involves four areas of work: database design, SQL statement optimization, database parameter configuration, and appropriate hardware resources and operating system. This order also reflects how much each area affects performance. Let's go through each of them:
1. Database Design
Denormalize moderately, with the emphasis on moderation
We all know the three normal forms. A design based on them is the most space-efficient way to store data and the easiest to extend. When developing an application, the database design should adhere to the three normal forms as far as possible; for OLTP systems in particular, this is a rule that must be followed. Of course, the biggest problem with full normalization is that queries often have to join many tables, which makes them inefficient. So sometimes, for performance reasons, we deliberately violate the normal forms and add moderate redundancy to speed up queries. Note the word moderate: you must be able to give a convincing justification for each such redundancy. Here is a bad example:
Here, to speed up retrieval of student activity records, the unit name was copied redundantly into the student activity record table. The unit table has about 500 records, while the student activity table accumulates about 2 million records a year. Without the redundant unit name, the activity record contains only three int fields and one timestamp field, 16 bytes in all, a very narrow row. Adding a varchar(32) field makes the row three times its original size, with correspondingly more I/O on every scan. Worse, the record counts differ enormously: updating one unit name means updating about 4,000 redundant copies. This redundancy is simply counterproductive.
The following redundancy, by contrast, works well:
As you can see, the student's total score is redundant: it could be computed entirely from the per-question score records. In the student exam score table, each student has one record per exam; in the score detail table, each student has one record per sub-question per quiz, so the ratio is roughly 1:100. Moreover, a score rarely changes once recorded, so updates are infrequent. This redundancy pays off.
Proper index creation
When it comes to improving database performance, indexes are the cheapest option. No added memory, no program changes, no SQL tuning: just a correct CREATE INDEX, and query speed may improve a hundred or even a thousand times. That is genuinely tempting, but there is no free lunch. The gain in query speed comes at the cost of inserts, updates, and deletes, because every write must also maintain the index, increasing I/O. Since an index has a different storage structure from the table, a table's indexes often occupy even more space than its data. This means we do a lot of extra work on every write just to make reads faster, so every index we create must earn its keep. In general, follow these rules (see the sketch after this list):
The indexed field must be a field that is frequently used as a query condition;
When indexing multiple fields, the first field must be the one frequently used as a query condition; the index will not be used if only the second field appears in the condition;
The indexed fields must have sufficient selectivity;
MySQL supports prefix indexes on long fields.
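For illustration, a minimal sketch with hypothetical table and column names (student_activity, unit_info, and so on are assumptions, not tables from the system described above):

-- Composite index: used when student_id alone, or student_id plus
-- activity_date, appears in the WHERE clause; NOT used when only
-- activity_date appears.
CREATE INDEX idx_student_date ON student_activity (student_id, activity_date);

-- Prefix index on a long varchar field: indexes only the first 10
-- characters, keeping the index small at a slight cost in selectivity.
CREATE INDEX idx_unit_name ON unit_info (unit_name(10));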
Horizontal partitioning of a table
If a table has too many records, say ten million, and needs frequent retrieval, we need to split it. Splitting it into 100 tables leaves only 100,000 records per table. Of course, this requires a logical basis for dividing the data. A good partitioning key keeps the program simple and gets the full benefit of the split. For example, if the system only ever offers monthly queries, splitting the table by month into 12 tables means each query touches just one of them. If instead you split by region, every query still has to union all the tables, and you would be better off not splitting at all. A good partitioning key is the most important thing.
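A minimal sketch of the month-based split, with hypothetical table names:

-- One table per month, cloned from a common template.
CREATE TABLE student_activity_01 LIKE student_activity_template;
CREATE TABLE student_activity_02 LIKE student_activity_template;
-- ... and so on, one table for each of the 12 months.

-- A monthly query then touches exactly one small table.
SELECT * FROM student_activity_03
WHERE activity_date BETWEEN '2009-03-01' AND '2009-03-31';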
Here is a better example:
Every question a student answers is recorded in this table, right and wrong alike. Each question maps to one or more knowledge points, and we need to analyze the wrong answers to find which knowledge points a student has not mastered. This table easily reaches tens of millions of rows and urgently needs splitting. On what basis? Looking at the requirements, both teachers and students ultimately focus on one student at a time: students look at themselves, teachers look at their own classes, and the knowledge points differ per subject. So it is natural to split the table by the combination of the subject and knowledge-point fields. Each resulting table then holds roughly 20,000 rows, and retrieval is very fast.
Vertical partitioning of a table
Some tables do not have many records, perhaps only twenty or thirty thousand, but their rows are very long, so the table occupies a lot of space and every scan performs a large amount of I/O, badly hurting performance. In this case you should split the large fields out into a separate table that has a one-to-one relationship with the original.
"Question content", "answer information" two tables, initially as a number of fields added to the "question information", you can see the question content and answer the two fields are very long, in the table has 30,000 records, the table has accounted for 1G of space, in the column list of questions very slow. After analysis, it is found that the system is often based on "book", "unit", type, category, difficulty degree and other query conditions, the page shows the details of the question. And each search is these tables do join, every time to scan the table 1G, very depressed ah. We can completely split the content and the answer into another table, only to show the details of the time to read the big table, resulting in the "question content", "answer information" two tables.
Select the appropriate field type, especially the primary key
The general rule for choosing field types is: smaller is better, so never use a large type where a small one will do. Take the primary key: we strongly recommend an auto-increment integer rather than a GUID. Why? Because space is efficiency. Locating a record by 4 bytes versus 32 bytes makes an obvious difference, and when several tables are joined the effect is even more pronounced. DATETIME versus TIMESTAMP is also worth mentioning: DATETIME occupies 8 bytes while TIMESTAMP occupies only 4, half as much, and TIMESTAMP's range of 1970 through 2037 is more than enough for most applications, especially for recording things like test time or login time.
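A hypothetical table definition along these lines:

CREATE TABLE exam_record (
    record_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, -- 4 bytes, vs. 32 for a GUID string
    student_id INT UNSIGNED NOT NULL,
    score      SMALLINT UNSIGNED,  -- a small type is enough for a score
    test_time  TIMESTAMP NOT NULL  -- 4 bytes, vs. 8 for DATETIME
);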
Store files, pictures, and other large objects in the file system, not the database
Needless to say, this is an iron law! The database stores only the path.
Declare foreign keys explicitly for easy indexing
As we all know, when you create a relationship between two entities in PowerDesigner, generating the physical model automatically indexes the foreign keys. So don't be afraid to create relationships: wherever one belongs, create it.
Mastering the timing of table writes
With the same schema, how you use the database also matters a great deal for performance. For writes to the same table, writing earlier or later in the workflow can have a significant impact on subsequent operations. Take the moderate-redundancy example above:
Our original purpose in recording the candidates' total score was to speed up retrieval, so the natural moment to write the table is when scores are entered. But then a requirement appeared: list the results of all students in this exam, showing the name of every student even if no score has been recorded, with the total score shown as empty. That query needs "student information" LEFT OUTER JOIN "student exam score", and we all know an outer join is slower than an inner join. To avoid this, we write the table when the exam is scheduled instead, inserting a row with a NULL score for every student, so a plain JOIN achieves the same effect (see the sketch below). There is a second benefit: suppose every student in a class takes an exam and all scores are entered, and then a new student transfers into the class. Querying results at that point would list the new student with no score, which is plainly wrong. Writing at scheduling time records who actually took the exam, so the table is no longer merely redundant.
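A minimal sketch of the idea, with hypothetical table and column names:

-- When the exam is scheduled, insert one row per assigned student,
-- with the total score left NULL.
INSERT INTO student_exam_score (exam_id, student_id, total_score)
SELECT 42, student_id, NULL
FROM class_student
WHERE class_id = 7;

-- Listing results now works with a plain inner join; students without
-- an entered score still appear, with total_score shown as NULL.
SELECT s.name, ses.total_score
FROM student s
JOIN student_exam_score ses ON ses.student_id = s.student_id
WHERE ses.exam_id = 42;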
Prefer centralized batch operations to frequent reads and writes
The system includes a points feature: students and teachers earn points for operations they perform in the system, and the rules are very complex. Each type of operation earns a different number of points and has a daily cap. For example, a login earns 1 point, but no matter how many times you log in, you can accumulate only one login point per day. That one is still simple; some rules are downright perverse. Take teacher points for grading homework: the teacher grades an assignment and finds mistakes; the student corrects them; the teacher grades again. If the work is now correct the teacher earns points; if it is still wrong, the cycle of correction and re-grading continues, and only when the student finally gets it right and the teacher finishes grading does the teacher earn the points. Handling this in the application would mean bolting an extra pile of point-handling code onto every feature, burdening the programmers and putting heavy pressure on the database. After discussing it with the requirements people, we agreed that points did not need to accumulate in real time, so we moved them to a background batch script: in the dead of night, let the machine grind away on its own.
Here is the batch query for that perverse grading rule:
select person_id, @semester_id, 301003, 0, @one_marks, assign_date, @one_marks
from hom_assignmentinfo ha, hom_assign_class hac
where ha.assignment_id = hac.assignment_id
  and ha.assign_date between @time_begin and @time_end
  and ha.assignment_id not in
      (select haa.assignment_id
         from hom_assignment_appraise haa, hom_check_assignment hca
        where haa.appraise_id = hca.appraise_id
          and haa.if_submit = 1
          and ((hca.recheck_state = 3004001
                and hca.check_result in (3003002, 3003003))
               or (hca.recheck_state = 3004002
                   and hca.recheck_result in (3003002, 3003003))))
  and ha.assignment_id not in
      (select assignment_id
         from hom_assignment_appraise
        where if_submit = 0
          and result_type = 0)
  and ha.assignment_id in
      (select haa.assignment_id
         from hom_assignment_appraise haa, hom_check_assignment hca
        where haa.appraise_id = hca.appraise_id
          and haa.if_submit = 1
          and hca.check_result in (3003002, 3003003));
And this is only one intermediate step. If the program had to process all of this in real time, the database would go on strike even if the programmers didn't.
Choose the right engine
MySQL offers many storage engines; the three we use most are MyISAM, InnoDB, and MEMORY. The official manual says MyISAM reads faster than InnoDB, about three times faster. But the manual cannot always be trusted: "High Performance MySQL" (O'Reilly) compares MyISAM and InnoDB, and in its tests MyISAM came out slower than InnoDB. As for MEMORY, it is genuinely useful: it is a good choice for temporary tables in batch processing (if you have enough memory). In one of my batch jobs it made the step almost ten times faster.
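A sketch of the temporary-table trick (table and column names are hypothetical; assumes the intermediate result fits in memory):

-- Build an in-memory working table for the batch, then join against it.
CREATE TEMPORARY TABLE tmp_points ENGINE=MEMORY
SELECT person_id, SUM(marks) AS total_marks
FROM points_detail
GROUP BY person_id;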
2. SQL Statement Optimization
SQL Statement Optimization Tools
The slow query log
If you find the system slow but can't tell where, this is the tool to reach for. Just set a couple of MySQL parameters and MySQL will record slow SQL statements by itself. Configuration is simple; in the parameter file:
slow_query_log = 1
slow_query_log_file = d:/slow.txt
long_query_time = 2
d:/slow.txt will then contain every statement that took more than 2 seconds, and you can locate problems from that file.
mysqldumpslow.pl
Slow log files can grow very large and are hard for a human to read. At this point we can analyze them with the tool that ships with MySQL. It normalizes the slow log: statements that differ only in their parameters are grouped together, so "select * from a where id=1" and "select * from a where id=2" are collapsed into "select * from a where id=N", which is much easier on the eyes. The tool can also sort the output so you can focus on the worst offenders.
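For example, assuming the log path configured above, something like this prints the ten statement patterns with the highest total query time:

mysqldumpslow -s t -t 10 d:/slow.txt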
Explain
Now that we know which statement is slow, why is it slow? Look at how MySQL executes it: EXPLAIN shows MySQL's execution plan. The following usage notes come from the manual.
EXPLAIN syntax (get information about a SELECT)
EXPLAIN [EXTENDED] SELECT select_options
The EXPLAIN statement can be used either as a synonym for DESCRIBE or to obtain information about how MySQL executes a SELECT statement:
· EXPLAIN tbl_name is a synonym for DESCRIBE tbl_name or SHOW COLUMNS FROM tbl_name.
· If the keyword EXPLAIN is placed before a SELECT statement, MySQL explains how it would process the SELECT, including information about how tables are joined and in which order.
This section explains the second usage of EXPLAIN.
With the help of EXPLAIN, you can see where you should add indexes to tables so that a SELECT finds rows faster by using indexes.
If you have a problem with indexes not being used when you believe they should be, run ANALYZE TABLE to update table statistics, such as the cardinality of keys, which affect the choices the optimizer makes.
You can also see whether the optimizer joins the tables in the best order. To force the optimizer to join the tables in the order they are named in the SELECT statement, begin the statement with SELECT STRAIGHT_JOIN rather than just SELECT.
EXPLAIN returns one row of information for each table used in the SELECT statement. The tables are listed in the order MySQL will read them while processing the query. MySQL resolves all joins using a single-sweep multi-join method: it reads a row from the first table, finds a matching row in the second table, then in the third, and so on. When all tables have been processed, it outputs the selected columns and backtracks through the table list until it finds a table with more matching rows; the next row is read from that table, and the process continues with the next table.
When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed with SHOW WARNINGS. This information shows how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after rewriting and optimization rules have been applied, and possibly other notes about the optimization process.
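A quick illustration against a hypothetical score table:

EXPLAIN SELECT * FROM score WHERE student_id = 12345;

-- In the output, the columns to watch are roughly:
--   type: 'ref' or 'range' is good; 'ALL' means a full table scan
--   key:  the index actually chosen (NULL means no index was used)
--   rows: the optimizer's estimate of how many rows must be examined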
If all else fails, try a covering index
If a statement really cannot be optimized any further, there is one more approach to try: a covering index.
If a statement can obtain all the data it needs from an index, it does not have to go through the index to the table rows, which saves a great deal of I/O. For example, given a score table:
If we want to total each student's score on each question, we index the primary and foreign keys of the tables involved and additionally index the score field itself, so that the whole query can be answered from the indexes.
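A sketch with a hypothetical score_detail table: once the index below exists, the query can be answered entirely from the index, without touching the table rows.

CREATE INDEX idx_score_cover ON score_detail (student_id, question_id, score);

SELECT student_id, question_id, SUM(score)
FROM score_detail
GROUP BY student_id, question_id;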
3. Database Parameter Configuration
The most important parameters govern memory. We mainly use the InnoDB engine, so the following two parameters are tuned very large:
# Additional memory pool that is used by InnoDB to store metadata
# information. If InnoDB requires more memory for this purpose it will
# start to allocate it from the OS. As this is fast enough on most
# recent operating systems, you normally do not need to change this
# value. SHOW INNODB STATUS will display the current amount used.
innodb_additional_mem_pool_size = 64M
# InnoDB, unlike MyISAM, uses a buffer pool to cache both indexes and
# row data. The bigger you set this the less disk I/O is needed to
# access data in tables. On a dedicated database server you may set this
# parameter up to 80% of the machine physical memory size. Do not set it
# too large, though, because competition for the physical memory may
# cause paging in the operating system. Note that on 32bit systems you
# might be limited to 2-3.5G of user level memory per process, so do not
# set it too high.
innodb_buffer_pool_size = 5G
For MyISAM, the key parameter to adjust is key_buffer_size.
Of course, tune parameters according to observed status: use the SHOW STATUS statement to see the current state and decide which parameters need adjusting (see the example after this list).
Created_tmp_disk_tables: if growing, increase tmp_table_size.
Handler_read_key: a high value means indexes are being used correctly; a high Handler_read_rnd means they are not.
Key_reads / Key_read_requests: the key cache miss rate; it should be below 0.01, otherwise increase key_buffer_size.
Opened_tables / Open_tables: if high, increase table_cache.
Select_full_join: the number of joins without a usable index; if not 0, check your indexes.
Select_range_check: if not 0, check the tables' indexes.
Sort_merge_passes: the number of merge passes the sort algorithm has performed; if large, increase sort_buffer_size.
Table_locks_waited: the number of times a table lock could not be acquired immediately; if high, optimize your queries.
Threads_created: the number of threads created to handle connections; if large, increase thread_cache_size. The thread cache miss rate is computed as Threads_created / Connections.
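For example, on MySQL 5.0 and later the counters above can be read like this:

SHOW GLOBAL STATUS LIKE 'Key_read%';
SHOW GLOBAL STATUS LIKE 'Created_tmp%';
SHOW GLOBAL STATUS LIKE 'Threads_created';
SHOW GLOBAL STATUS LIKE 'Connections';
-- Key_reads / Key_read_requests gives the key cache miss rate;
-- Threads_created / Connections gives the thread cache miss rate.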
4. Appropriate Hardware Resources and Operating System
If your machine has more than 4 GB of memory, you should definitely use a 64-bit operating system and 64-bit MySQL.
Read/write separation
If the database load is too heavy for one machine to handle, use MySQL replication to synchronize multiple machines and spread the load.
[Diagram: one Master replicating to Slave1, Slave2, and Slave3]
The master is used for writes; slave1 through slave3 handle SELECTs. Each database then carries much less load.
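A minimal sketch of the my.cnf settings involved (values are illustrative):

# On the master:
server-id = 1
log-bin   = mysql-bin

# On each slave (every server needs a unique server-id):
server-id = 2
read_only = 1   # optional: reject ordinary writes on the slave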
To achieve this, the program needs special design: all writes go to the master and all reads go to the slaves, which is an extra burden on development. Of course, there is now middleware that implements this proxying, making it transparent to the program which database it reads from or writes to. MySQL officially offers mysql-proxy, but it is still an alpha version. Sina has amoeba for MySQL, which can also achieve this; its structure is as follows:
[Diagram: the amoeba for MySQL proxy architecture]