Recently I solved an insert-throughput problem in our business with a very simple, easy-to-apply technique. Here is a summary of it.
First, let's be clear that insert throughput here does not mean IPS (inserts per second) but RPS (affected rows per second).
Next, batch insert. As the name suggests, it means inserting multiple rows in one statement. The idea is very basic; the best-known application of it inside MySQL is group commit.
Simply put, SQL A is converted into SQL B:

SQL A: insert into table values ($values);
SQL B: insert into table values ($values),($values)...($values);
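For illustration, a minimal Python sketch of the rewrite. The table name t and column id are assumptions, not from the original test; each row gets its own "(%s)" placeholder group so the driver still escapes every value:

# Hypothetical helper: turn n buffered values into one multi-row
# INSERT (SQL B), with one "(%s)" placeholder group per row.
def build_batch_insert(values):
    placeholders = ",".join(["(%s)"] * len(values))
    return "INSERT INTO t (id) VALUES " + placeholders, list(values)

sql, params = build_batch_insert([1, 2, 3])
# sql    -> "INSERT INTO t (id) VALUES (%s),(%s),(%s)"
# params -> [1, 2, 3]
# cursor.execute(sql, params)   # any DB-API driver, e.g. mysql.connector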
Now let's look at what such a remarkably simple change can do.
Test environment: a table with a single int id column, 100,000 int values to insert, a local socket connection to the MySQL server, and a single-process shell test.
First, inserting the 100,000 int values into the test table with SQL A took 1777 seconds:
real    29m37.090s
user    9m11.705s
sys     5m0.762s
Next, inserting the same 100,000 int values with SQL B (10 values per statement) took 53 seconds:
real    0m53.871s
user    0m19.455s
sys     0m6.285s
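If you want to reproduce the comparison, here is a rough Python re-creation of the benchmark. The original test used a shell loop; the table t(id INT), database name, socket path, and credentials below are all assumptions:

import time
import mysql.connector  # pip install mysql-connector-python

# Assumed setup: table t(id INT) in database "test", local socket.
conn = mysql.connector.connect(unix_socket="/var/run/mysqld/mysqld.sock",
                               user="root", database="test")
cur = conn.cursor()
N, BATCH = 100_000, 10

start = time.time()                 # SQL A: one statement per value
for i in range(N):
    cur.execute("INSERT INTO t (id) VALUES (%s)", (i,))
conn.commit()
print("single-value:", time.time() - start)

start = time.time()                 # SQL B: BATCH values per statement
for i in range(0, N, BATCH):
    rows = list(range(i, i + BATCH))
    sql = "INSERT INTO t (id) VALUES " + ",".join(["(%s)"] * len(rows))
    cur.execute(sql, rows)
conn.commit()
print("batched:", time.time() - start)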
That is nearly 33 times faster (1777 s vs 53 s). The performance gain comes from two things:
1. Each connection to the MySQL server requires various initialization, permission authentication, syntax parsing, and other steps, all of which consume resources; batching amortizes that fixed cost.
2. Inserting a single value costs almost the same as inserting n values in one statement. (Compare the profiles below: a single-value insert versus a 10-value insert.)
Single value:

+------------------------------+----------+
| Status                       | Duration |
+------------------------------+----------+
| starting                     | 0.000056 |
| checking permissions         | 0.000010 |
| Opening tables               | 0.000034 |
| System lock                  | 0.000010 |
| init                         | 0.000011 |
| update                       | 0.000061 |
| Waiting for query cache lock | 0.000003 |
| update                       | 0.000015 |
| end                          | 0.000003 |
| query end                    | 0.000053 |
| closing tables               | 0.000009 |
| freeing items                | 0.000021 |
| logging slow query           | 0.000002 |
| cleaning up                  | 0.000003 |
+------------------------------+----------+
10 values:

+------------------------------+----------+
| Status                       | Duration |
+------------------------------+----------+
| starting                     | 0.000061 |
| checking permissions         | 0.000008 |
| Opening tables               | 0.000027 |
| System lock                  | 0.000008 |
| init                         | 0.000012 |
| update                       | 0.000073 |
| Waiting for query cache lock | 0.000003 |
| update                       | 0.000010 |
| end                          | 0.000008 |
| query end                    | 0.000053 |
| closing tables               | 0.000010 |
| freeing items                | 0.000021 |
| logging slow query           | 0.000002 |
| cleaning up                  | 0.000003 |
+------------------------------+----------+
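For reference, profiles like these come from MySQL's SHOW PROFILE (a MySQL 5.x feature, deprecated in later versions). A minimal sketch of capturing one, reusing the cursor from the snippet above:

# Enable per-statement profiling on this session, run a statement,
# then print the Status / Duration rows for it.
cur.execute("SET profiling = 1")
cur.execute("INSERT INTO t (id) VALUES (1),(2),(3),(4),(5)")
cur.execute("SHOW PROFILE")
for status, duration in cur.fetchall():
    print(f"{status:<30} {duration}")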
But does efficiency keep climbing as more values are packed into one statement? Naturally not. No optimization is purely linear; an inflection point always appears under some conditions.
We ran the test at different batch sizes: 1, 10, 50, 100, 200, 500, 1000, 5000, and 10000 values per statement.
As the batch size grows, total time first drops sharply, from 1777 s to 53 s, then barely changes until the batch size exceeds 200, after which it climbs again; the 10000-value run took over two minutes (133 s).
Plotting QPS and RPS against batch size (raw numbers in the table at the end): QPS rises at first, then falls, eventually dropping below 1 query/s. RPS shoots up to its peak early on, then declines gradually, and drops sharply once the batch size passes 5000. (Note that RPS is simply QPS multiplied by the batch size.)
Most importantly, the QPS peak and the RPS peak do not occur at the same batch size. In other words, the highest QPS does not mean the highest insert throughput.
In my simple test scenario, the best batch size is 50: compared with single-value inserts, total time drops by 97% (1777 s to 49 s) and insert throughput rises 36-fold (56 to 2040 rows/s).
This value depends on the table structure and on field types and sizes, so it can only be pinned down by testing your own scenario; that said, a batch size around 50 is generally a reasonable starting point.
As for implementation: have the frontend push writes into a queue, then merge them according to two conditions.
- Flush by count: write to the database once n values have accumulated in the queue. Advantage: the best performance. Disadvantage: latency is unbounded; under low traffic the nth value may take a long time to arrive, and until it does the buffered data sits unwritten.
- Flush by time: write whatever has accumulated in the queue every 1 s. Advantage: latency is bounded. Disadvantage: the batch size is uncontrolled; under high concurrency, thousands of values can pile up within that second.
- The best approach applies both conditions at once: check the count and the clock, and trigger the database write as soon as either threshold is met (see the sketch below).
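A minimal Python sketch of that combined trigger. The class name, thresholds, and flush_fn hook are illustrative, not the author's code; flush_fn would typically run one multi-row INSERT like SQL B above:

import threading
import time

class BatchWriter:
    """Buffers values and flushes when EITHER the count threshold
    or the time threshold is reached, whichever comes first."""

    def __init__(self, flush_fn, max_values=50, max_wait=1.0):
        self.flush_fn = flush_fn        # called with a list of buffered values
        self.max_values = max_values    # count trigger (50 worked best above)
        self.max_wait = max_wait        # time trigger, in seconds
        self.buf = []
        self.lock = threading.Lock()
        threading.Thread(target=self._timer, daemon=True).start()

    def add(self, value):
        with self.lock:
            self.buf.append(value)
            if len(self.buf) >= self.max_values:   # count condition met
                self._flush_locked()

    def _timer(self):
        # Simplification: flushes on a fixed cadence rather than resetting
        # the clock after each count-triggered flush.
        while True:
            time.sleep(self.max_wait)              # time condition met
            with self.lock:
                self._flush_locked()

    def _flush_locked(self):
        if self.buf:
            batch, self.buf = self.buf, []
            self.flush_fn(batch)                   # e.g. one multi-row INSERT

# Usage:
# writer = BatchWriter(lambda rows: print("flushing", len(rows), "rows"))
# writer.add(42)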
Summary:
1. Using batch insert can increase the insert throughput.
2. Determine the right batch size by testing against your actual workload.
3. Gate flushes on both a count threshold and a time threshold.
Raw numbers from the simple test:
ValuesNum | Time (s) | QPS  | RPS (rows/s)
1         | 1777     | 56   | 56
10        | 53       | 188  | 1886
50        | 49       | 40   | 2040
100       | 50       | 19   | 2000
200       | 51       | 10   | 1960
500       | 57       | 3    | 1754
1000      | 60       | 2    | 1666
5000      | 69       | 0.3  | 1449
10000     | 133      | 0.07 | 751