I was writing the slides for a talk when I heard a text message arrive and assumed it was an advertisement. I had barely relaxed when the messages kept coming, one after another. No need to guess: some server was raising an alarm.
When I opened the Nagios monitoring page, I found that three servers (all in the same cluster, serving a forum with about 40 thousand users online) had excessive load and were in the WARNING state.
1. Check the access traffic first. It was no different from before.
2. Check the number of processes and the CPU usage on each server. These were also no different from before.
3. View the system logs. Each server had "TCP: Treason uncloaked! Peer 113.247.241.146:21345/80 shrinks window 2128147967:2128149427. Repaired."
4. View the PHP logs. There were a large number of "[WARNING] fpm_request_check_timed_out(), line 158: child 25379, script '/mnt/html/bbs/forum.php' (pool default) execution timed out (120.306361 sec), terminating". Opening the forum home page was taking more than 120 seconds. The execution timeout set in the PHP configuration file is 120 seconds; when a request exceeds it, the child process is terminated. This looked like the place to start.
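To see quickly which scripts are timing out and how often, the PHP-FPM warnings can be tallied with a few lines of Python. This is only a minimal sketch: it assumes the log lines follow the layout of the warning quoted in step 4, and the names `parse_timeout` and `slow_scripts` are made up for this illustration.

```python
import re

# Matches the PHP-FPM "execution timed out" warning quoted in step 4.
# Assumption: the log keeps this exact layout.
TIMEOUT_RE = re.compile(
    r"child (?P<pid>\d+), script '(?P<script>[^']+)' \(pool (?P<pool>[^)]+)\) "
    r"execution timed out \((?P<seconds>[\d.]+) sec\)"
)


def parse_timeout(line):
    """Return (pid, script, pool, seconds) for a timeout warning, else None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return int(m["pid"]), m["script"], m["pool"], float(m["seconds"])


def slow_scripts(lines):
    """Count timeout warnings per script path."""
    counts = {}
    for line in lines:
        parsed = parse_timeout(line)
        if parsed is not None:
            script = parsed[1]
            counts[script] = counts.get(script, 0) + 1
    return counts


sample = (
    "[WARNING] fpm_request_check_timed_out(), line 158: child 25379, "
    "script '/mnt/html/bbs/forum.php' (pool default) "
    "execution timed out (120.306361 sec), terminating"
)
print(parse_timeout(sample))
print(slow_scripts([sample, "some unrelated line", sample]))
```

If one script dominates the count, as forum.php did here, the slowness is probably in something that script depends on rather than in the web servers themselves.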
First, I asked my colleagues whether the program had been changed recently or any plug-in had been added. The answer: "No." Then I checked the system carefully:
1) Check whether the file system is damaged and cannot be written to.
2) Check whether any partition is full. (In fact, a full partition would already have triggered an SMS alarm.)
3) Check the TCP connection states. It did not look like a system problem.
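Checks 2) and 3) can be sketched in Python as follows. This is a rough illustration, not the commands I actually ran: `partition_usage_percent` and `count_tcp_states` are hypothetical helper names, the sample text imitates `netstat -ant` output with made-up addresses, and the percentage only approximates what `df` reports because `df` accounts for reserved blocks.

```python
import shutil
from collections import Counter


def partition_usage_percent(path="/"):
    """Approximate how full the partition holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def count_tcp_states(netstat_text):
    """Count connections per TCP state in `netstat -ant`-style output."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data lines start with "tcp" or "tcp6"; the state is the last column.
        if fields and fields[0].startswith("tcp"):
            states[fields[-1]] += 1
    return states


# Made-up sample data in the usual `netstat -ant` layout.
sample = """\
Proto Recv-Q Send-Q Local Address   Foreign Address      State
tcp        0      0 0.0.0.0:80      0.0.0.0:*            LISTEN
tcp        0      0 10.0.0.5:80     198.51.100.7:21345   ESTABLISHED
tcp        0      0 10.0.0.5:80     198.51.100.8:40112   TIME_WAIT
tcp        0      0 10.0.0.5:80     198.51.100.9:40113   TIME_WAIT
"""

print(f"/ is {partition_usage_percent('/'):.1f}% full")
print(count_tcp_states(sample))
```

An unusually large number of connections stuck in one state, or a partition close to 100%, would have pointed at the system itself; neither was the case here.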
That left the associated services: the database, the NFS file system, and memcached. Checking them one by one is easy enough. I checked NFS first, then memcached, and both looked fine. So the problem seemed to be in the database.
Log on to the database server and check the database error log. I ran tail -f to follow the output as it scrolled, and it looked like I had found the problem. The output consisted mainly of the following lines:
[ERROR] Got error 134 when reading table './uc_mumayi/cdb_uc_members'
[ERROR] Got error 134 when reading table './uc_mumayi_net/cdb_uc_members'
[ERROR] /usr/local/mysql/libexec/mysqld: The table 'pre_common_session' is full
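When the error log scrolls this fast, a small summary makes it easier to see which tables are involved. The following Python sketch groups the two kinds of messages shown above; it assumes the log lines have exactly this wording, and `summarize_errors` is a name invented for the example.

```python
import re
from collections import Counter

# Patterns for the two kinds of [ERROR] lines seen in the log above.
READ_ERROR_RE = re.compile(r"Got error (\d+) when reading table '([^']+)'")
TABLE_FULL_RE = re.compile(r"The table '([^']+)' is full")


def summarize_errors(lines):
    """Return (read_errors, full_tables) counters for MySQL error-log lines."""
    read_errors = Counter()  # (table path, error code) -> occurrences
    full_tables = Counter()  # table name -> occurrences
    for line in lines:
        m = READ_ERROR_RE.search(line)
        if m:
            read_errors[(m.group(2), int(m.group(1)))] += 1
            continue
        m = TABLE_FULL_RE.search(line)
        if m:
            full_tables[m.group(1)] += 1
    return read_errors, full_tables


sample_log = [
    "[ERROR] Got error 134 when reading table './uc_mumayi/cdb_uc_members'",
    "[ERROR] Got error 134 when reading table './uc_mumayi_net/cdb_uc_members'",
    "[ERROR] /usr/local/mysql/libexec/mysqld: The table 'pre_common_session' is full",
]

read_errors, full_tables = summarize_errors(sample_log)
print("tables with read errors:", dict(read_errors))
print("tables reported full:", dict(full_tables))
```

The two groups call for different fixes: a full table needs more room, while a table with read errors needs to be checked and possibly repaired.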
Next, I dealt with the full table first by raising its maximum row count. I set the value to 10 million with the command:
mysql> ALTER TABLE pre_common_session MAX_ROWS = 10000000;
The load on the three web servers dropped immediately. The error messages also indicated that two tables might be damaged, so I checked them and repaired whatever was broken.
1) Check the first table:
mysql> check table cdb_uc_notelist;
The output is:
+---------------------------+-------+----------+-----------------------------------------------------------+
| Table                     | Op    | Msg_type | Msg_text                                                  |
+---------------------------+-------+----------+-----------------------------------------------------------+
| uc_mumayi.cdb_uc_notelist | check | warning  | 11 clients are using or haven't closed the table properly |
| uc_mumayi.cdb_uc_notelist | check | warning  | Size of datafile is: 260372 Should be: 259760             |
| uc_mumayi.cdb_uc_notelist | check | error    | Wrong bytesec: 101-114-110 at linkstart: 258412           |
| uc_mumayi.cdb_uc_notelist | check | error    | Corrupt                                                   |
+---------------------------+-------+----------+-----------------------------------------------------------+
4 rows in set (0.04 sec)
The table is corrupt, so repair it:
mysql> repair table cdb_uc_notelist;
The output is:
+---------------------------+--------+----------+-----------------------------------------------+
| Table                     | Op     | Msg_type | Msg_text                                      |
+---------------------------+--------+----------+-----------------------------------------------+
| uc_mumayi.cdb_uc_notelist | repair | info     | Wrong bytesec: 101-114-110 at 258412; Skipped |
| uc_mumayi.cdb_uc_notelist | repair | warning  | Number of rows changed from 5715 to 5742      |
| uc_mumayi.cdb_uc_notelist | repair | status   | OK                                            |
+---------------------------+--------+----------+-----------------------------------------------+
2) Repair the second table. The method is the same as above.
3) Check the status again.
4) Ask the forum administrator to log in to the admin back end and check whether everything works normally.
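With many tables to check, reading the boxed output of check table and repair table by eye gets tedious. The sketch below parses that output in Python and decides whether a table still needs repair. It assumes the mysql client's usual boxed layout, and it treats any row with Msg_type "error" as a sign of corruption, which is my own rule of thumb for this example. The helper names are invented.

```python
def parse_result_rows(text):
    """Parse the mysql client's boxed result output into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ separators and the "N rows in set" footer
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # first boxed line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def needs_repair(rows):
    """Assumed rule: any row with Msg_type 'error' means the table is damaged."""
    return any(row["Msg_type"] == "error" for row in rows)


check_output = """\
+---------------------------+-------+----------+-----------------------------------------------------------+
| Table                     | Op    | Msg_type | Msg_text                                                  |
+---------------------------+-------+----------+-----------------------------------------------------------+
| uc_mumayi.cdb_uc_notelist | check | warning  | 11 clients are using or haven't closed the table properly |
| uc_mumayi.cdb_uc_notelist | check | error    | Wrong bytesec: 101-114-110 at linkstart: 258412           |
| uc_mumayi.cdb_uc_notelist | check | error    | Corrupt                                                   |
+---------------------------+-------+----------+-----------------------------------------------------------+
"""

repair_output = """\
+---------------------------+--------+----------+------------------------------------------+
| Table                     | Op     | Msg_type | Msg_text                                 |
+---------------------------+--------+----------+------------------------------------------+
| uc_mumayi.cdb_uc_notelist | repair | warning  | Number of rows changed from 5715 to 5742 |
| uc_mumayi.cdb_uc_notelist | repair | status   | OK                                       |
+---------------------------+--------+----------+------------------------------------------+
"""

print("needs repair after check: ", needs_repair(parse_result_rows(check_output)))
print("needs repair after repair:", needs_repair(parse_result_rows(repair_output)))
```

Running the same test on the output of repair table gives a quick confirmation that no error rows remain before the administrator is asked to verify the forum.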
Original article: http:// B .formyz.org/2011/1124/53.html