Problem summary:
A week ago, a MySQL server suffered a hardware failure and went down. We filed a repair request with the colleagues responsible for hardware, and they took over the repair. Today, once the repair was finished, they powered the server on. The 4 MySQL instances on the server started automatically at boot and began pulling binlog from their masters. Because the server had been down for a long time, the slaves had a large backlog to catch up on and pulled binlog as fast as they could, which caused network problems on the masters.
Symptoms:
At first we did not realize that the cause was a repaired server restarting and pulling binlog from the masters, because we knew nothing about the state of that server. We only knew that 1 week earlier we had sent 1 server for repair. Whether it had been repaired, or whether it had been powered on, we had no idea.
In that situation, we suddenly heard from the network team that one MySQL machine was generating too much network traffic, which made the business feel very slow for a total of 17 minutes. On its own, this did not give us many clues.
We checked the processlist, the full query log, and the slow log, and found no problems.
We then looked at monitoring and found that the server's read IO had increased dramatically during that period.
The processlist history showed that, for a stretch of time, the threads of the master-slave replication user were in a state of waiting on the network, and their IP address identified the client as the slave server that had broken down 1 week earlier.
This server hosts 4 instances. After the server booted, the MySQL instances started automatically and began pulling binlog from their masters. Each master generates roughly 6 GB of binlog per day, so 4 instances over 1 week comes to more than 160 GB of binlog.
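The backlog figure can be checked with a quick calculation. The sketch below uses the numbers from the text (6 GB per master per day, 4 instances, 7 days). The 1 Gbit/s link speed is purely an assumption for illustration, since the post does not state the network capacity, and the function names are hypothetical.

```python
def backlog_gb(daily_gb: float, instances: int, days: int) -> float:
    """Total binlog the slaves must fetch after the outage, in GB."""
    return daily_gb * instances * days


def link_seconds(total_gb: float, link_gbit_per_s: float) -> float:
    """Seconds to move total_gb over a fully saturated link (1 GB = 8 Gbit)."""
    return total_gb * 8 / link_gbit_per_s


# Figures from the text: ~6 GB/day per master, 4 instances, ~7 days down.
total = backlog_gb(6, 4, 7)
print(f"backlog: {total} GB")

# Assumed 1 Gbit/s link (not stated in the post), with no other traffic.
print(f"minutes at 1 Gbit/s: {link_seconds(total, 1.0) / 60:.1f}")
```

Even under this optimistic assumption, a backlog of this size can occupy a link for many minutes, so it is plausible that an unthrottled catch-up squeezes out business traffic.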
Problems:
1. We had no control over when the broken server was repaired or powered on; we did not know about it and did not pay attention to it.
2. This is a simple and very typical scenario that can cause impact or an outage. We were not alert to it in advance: we knew it could easily happen, but had no habit of watching for it, and so the incident occurred.
3. We lacked effective monitoring of network traffic.
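As an illustration of the kind of network traffic monitoring that was missing, here is a minimal sketch that computes per-interface throughput from two samples of Linux's `/proc/net/dev`. The function names and the alert threshold are hypothetical, and a real deployment would more likely rely on an existing monitoring system than on a script like this.

```python
import time


def parse_net_dev(text: str) -> dict:
    """Map interface name -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates_mb_per_s(before: dict, after: dict, seconds: float) -> dict:
    """Per-interface (rx, tx) throughput in MB/s between two samples."""
    rates = {}
    for name, (rx1, tx1) in after.items():
        if name in before:
            rx0, tx0 = before[name]
            rates[name] = ((rx1 - rx0) / seconds / 1e6,
                           (tx1 - tx0) / seconds / 1e6)
    return rates


def over_threshold(rates: dict, limit_mb_per_s: float) -> list:
    """Names of interfaces whose rx or tx rate exceeds the limit."""
    return sorted(n for n, (rx, tx) in rates.items()
                  if max(rx, tx) > limit_mb_per_s)


def sample(interval: float = 5.0) -> dict:
    """Take two live samples on a Linux host and return the rates."""
    with open("/proc/net/dev") as f:
        before = parse_net_dev(f.read())
    time.sleep(interval)
    with open("/proc/net/dev") as f:
        after = parse_net_dev(f.read())
    return rates_mb_per_s(before, after, interval)
```

On a master, calling `over_threshold(sample(), 100)` periodically would flag interfaces moving more than 100 MB/s; the limit is an arbitrary example value and should be set from the host's normal traffic.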
Solutions:
1. On all servers, disable automatic MySQL startup at boot. After a server boots, start the instances manually and run stop slave. (With many servers this may be too much trouble; we record it here for now, since it is still better than causing an impact.)
2. Build awareness of this problem and add it to the knowledge base or operations manual so that it does not happen again.
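A related safeguard, not mentioned in the original write-up, is MySQL's `skip-slave-start` option (named `skip-replica-start` in newer releases), which keeps the replication threads from starting when the instance starts. The sketch below is a hypothetical audit helper that checks whether the text of a configuration file sets this option under `[mysqld]`. It is deliberately simplistic: it looks at a single file and does not follow `!include` directives.

```python
def replication_autostart_disabled(cnf_text: str) -> bool:
    """True if the [mysqld] section enables skip-slave-start or skip-replica-start."""
    section = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "mysqld":
            continue
        key, _, value = line.partition("=")
        # MySQL accepts both dashes and underscores in option names.
        key = key.strip().lower().replace("_", "-")
        if key in ("skip-slave-start", "skip-replica-start"):
            # A bare option (no value) counts as enabled.
            return value.strip().lower() not in ("0", "off", "false")
    return False


example = """
[client]
port = 3306

[mysqld]
port = 3306
skip-slave-start
"""
print(replication_autostart_disabled(example))
```

Running a check like this across every instance's configuration could be one item in the work manual, so that a repaired server does not begin pulling binlog until someone starts replication deliberately.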