Things that happen when a system goes live: notes on one system failure


The project was a prize-wheel lottery marketing campaign. Because operations needed it for a time-critical marketing window, development, testing, deployment and launch were squeezed into 10 days, and some of the preparation was not done properly. For example:

1. Development as a whole was finished only 2 days before launch, and the testers first learned the project requirements in the second week of development, so there was not enough time to finish functional testing, UI adaptation across device models, or a system stress test.

2. On the technical side, the partner could not hand over its official account key directly, so the partner wrapped the functions we needed behind its own interface. That wrapped interface was delivered late, three days before the scheduled start.

The web authorization callback allows only one domain name, and that domain was already in use by other applications, so it could not simply be changed to the domain where we deployed our application. The partner had to set up Nginx HTTP forwarding on its intranet so that callbacks would reach our server, and testing of the wrapped API also had to wait until that forwarding was configured.

This network setup also caused a later problem in which pages failed to load for some users. It was hard to troubleshoot and could not be fixed from within our own server room.

3. The production application machines were only ready on the last day, and testing of the Tomcat and database deployment environment was not fully completed, which left hidden risks. For example, MariaDB's binlog was not being generated after my.cnf was configured, and the indexes on some core tables were incomplete.

The campaign was also planned to run for only seven days. We estimated that most of the pressure from the prize draw would fall on the application side and that the database would see little load, so we configured 10 Tomcat instances and a Redis cache but did not set up a master-slave structure for MariaDB as a backup. The database became a single point of failure.

4. While the machines were being prepared, operations was also migrating the overall monitoring and log-viewing infrastructure and was short-staffed. In half a day they could do only one thing, so server monitoring took priority. The hidden risk here was the alerting system. The company's internal server alerting is handled by the support department, and we assumed it was in place, so we launched without testing the alert function. That buried another mine.
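As an illustration of the kind of pre-launch check that item 3 above was missing, here is a minimal sketch that reads the text of a my.cnf file and reports whether binary logging is set under the `[mysqld]` section. The function name `binlog_enabled` and the sample file contents are hypothetical, and the post does not say why the binlog was not generated; a setting placed under the wrong section is only one thing such a check can catch.

```python
import configparser


def binlog_enabled(my_cnf_text: str) -> bool:
    """Return True if the [mysqld] section sets log_bin (or log-bin)."""
    # my.cnf allows bare keys such as "skip-name-resolve", hence allow_no_value.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    # Drop "!include" / "!includedir" directives, which configparser cannot parse.
    # Assumption: included files are checked separately.
    lines = [
        line for line in my_cnf_text.splitlines()
        if not line.lstrip().startswith("!")
    ]
    parser.read_string("\n".join(lines))
    if not parser.has_section("mysqld"):
        return False
    # MySQL and MariaDB accept both dashes and underscores in option names.
    keys = {key.replace("-", "_") for key in parser["mysqld"]}
    return "log_bin" in keys


# Hypothetical sample configurations.
good_cnf = "[mysqld]\nlog-bin=mysql-bin\nserver_id=1\n"
misplaced_cnf = "[client]\nlog_bin=mysql-bin\n[mysqld]\nskip-name-resolve\n"
print(binlog_enabled(good_cnf), binlog_enabled(misplaced_cnf))
```

A file check like this only confirms the intent of the configuration. Confirming that binlog files actually appear on the running server is still a separate step.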

Before the event

10:00 AM: Testing from a mobile phone through a proxy showed that the requests forwarded by the partner's Nginx carried the partner's address in the HTTP Host header, so every request for the game went through the partner's server again before being forwarded back. The first pass through this forwarding also performs verification, and the added delay had a noticeable impact on the game's home page.

10:30 AM: With half an hour to go, we wanted to try having the partner's proxy redirect the requests instead. But the official WeChat account had already pushed the announcement, and some users were already visiting the campaign page ahead of the start and waiting on the "not yet started" page, so a last-minute change to the program carried a fairly large risk. On top of that, we had been up until 1 AM the previous night stress-testing the interface and our heads were foggy. A small change was tested without success, so we gave up for the time being.

Event Start

11:00 AM: The system officially opened and users could enter the prize-wheel lottery page. System monitoring was normal; neither the system load nor the network showed anything abnormal.

2:00 PM: We noticed that a commonly used field, a, in table A of the database had no index. Logically, the query runs only once per user, when the user is not yet logged in. Running an ALTER on the production database to add the index might also have affected database operations at that moment, so we did not add it.
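To make the cost of a missing index concrete, here is a small sketch of how a query plan changes once an index exists. It uses Python's built-in SQLite rather than MariaDB, purely so that it is self-contained, and the table name `lottery_user` and column `openid` are hypothetical stand-ins for the post's table A and field a. On MariaDB the equivalent check would be an `EXPLAIN` on the real query.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lottery_user (id INTEGER PRIMARY KEY, openid TEXT, prize TEXT)"
)
conn.executemany(
    "INSERT INTO lottery_user (openid, prize) VALUES (?, ?)",
    [(f"user{i}", "none") for i in range(1000)],
)

lookup = "SELECT * FROM lottery_user WHERE openid = ?"

# Without an index, the lookup has to scan the whole table.
plan_before = query_plan(conn, lookup, ("user42",))

conn.execute("CREATE INDEX idx_lottery_user_openid ON lottery_user (openid)")

# With the index, the same lookup becomes an index search.
plan_after = query_plan(conn, lookup, ("user42",))

print(plan_before)
print(plan_after)
```

Even a query that runs "only once per user" becomes a full table scan per new visitor without the index, which matters when a mass push brings many new users at the same time.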

4:10 PM: The company VPN dropped. Unable to connect and unable to work, several developers went to the pantry for a drink to relax. A little later we were suddenly told that the campaign page was showing a white screen and could not be reached. The operations colleague reported that the server room's China Mobile ingress line had been cut, and hurriedly asked the support department to find the cause. At the same time, the domain name was urgently switched to the server room's China Telecom IP; waiting for the DNS change to take effect took 10 minutes.

4:50 PM: The interrupted server room ingress line recovered. To be safe we waited a while longer, then switched the domain's DNS record back to the China Mobile line's IP.

5:00 PM: The official WeChat subscription account began pushing another round of articles guiding users into the game.

Around 7:00 PM: The program had been adjusted and needed to be re-released to production. The operations colleague was on the highway heading home and had to find a place to pull over in order to push all the WAR packages to the servers, so we waited. At the same time we were told that the subscription account was about to push the next round of game articles, so traffic might rise at any moment.

7:30 PM: A few of us finally had time to look for food, only to discover that both the canteen and the FamilyMart had run out of rice meals. We made do with instant noodles and bread...

Around 7:50 PM: Operations reported that all the WAR packages had been pushed.

We then found that the game page was starting to slow down, and that users who already followed the official account could no longer enter the game page; they were sent back to the "please follow" guide screen.

8:00 PM: We began troubleshooting the error. Tomcat on the production machines showed nothing unusual. When we logged in to the database machine, however, the system was responding abnormally at the command line: typed commands took 2 to 3 seconds or more to respond. top showed that the CPU was overloaded, and the system load was equally abnormal.

8:30 PM: Having found the database abnormal, we decided to restart it.

8:45 PM: We ran systemctl stop on the database. Then came the tragedy: the system load dropped, but the database would not start again. The MySQL error log showed the following at startup:

InnoDB: Error: log file ./ib_logfile0 is of different size 0 5256780 bytes
InnoDB: than specified in the .cnf file 0 1077645824 bytes!
[ERROR] Plugin 'InnoDB' init function returned error.
[ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
[ERROR] Unknown/unsupported storage engine: InnoDB
[ERROR] Aborting

Looking it up, we read that after a normal shutdown, deleting ib_logfile0 allows the server to start successfully. To be safe we moved the file to another directory with mv instead of deleting it, then tried to start again. It still would not come up, and we were completely stumped:

150703 23:44:27 InnoDB: Could not open or create data files.
150703 23:44:27 InnoDB: If you tried to add new data files, and it failed here,
150703 23:44:27 InnoDB: you should now edit innodb_data_file_path in my.cnf back
150703 23:44:27 InnoDB: to what it was, and remove the new ibdata files InnoDB created
150703 23:44:27 InnoDB: in this failed attempt. InnoDB only wrote those files full of
150703 23:44:27 InnoDB: zeros, but did not yet use them in any way. But be careful: do not
150703 23:44:27 InnoDB: remove old data files which contain your precious data!
150703 23:44:27 [ERROR] Plugin 'InnoDB' init function returned error.
150703 23:44:27 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
150703 23:44:27 [Note] Plugin 'FEEDBACK' is disabled.
150703 23:44:27 [ERROR] Unknown/unsupported storage engine: InnoDB
150703 23:44:27 [ERROR] Aborting
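The first error above says that the ib_logfile0 on disk has a different size from the one configured in the .cnf file, which in MariaDB and MySQL is the `innodb_log_file_size` option. A check like the following sketch, run before any restart, would surface such a mismatch while the database is still up. The helper names and the demo file are hypothetical; the only assumption carried over from the log is that the two sizes are compared in bytes.

```python
import os
import re
import tempfile

# Multipliers for the size suffixes handled by this sketch.
_UNITS = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(value: str) -> int:
    """Convert a my.cnf style size such as '5M' or '1G' to bytes."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


def log_file_size_matches(path: str, configured: str) -> bool:
    """Compare a redo log file's size on disk with the configured size."""
    return os.path.getsize(path) == parse_size(configured)


# Demo with a throwaway 2 KiB file standing in for ib_logfile0.
with tempfile.TemporaryDirectory() as tmp:
    fake_log = os.path.join(tmp, "ib_logfile0")
    with open(fake_log, "wb") as handle:
        handle.write(b"\0" * 2048)
    matches_2k = log_file_size_matches(fake_log, "2K")
    matches_1m = log_file_size_matches(fake_log, "1M")

print(matches_2k, matches_1m)
```

A mismatch reported by such a check means the configured value was changed after the log files were created, so the next restart needs to be planned rather than improvised.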

By now things had become tricky. It was Friday night, the first day of the campaign. From the gradual loss of service after 7 PM to the crash that followed the database shutdown after 8 PM, a long time had already passed, and large numbers of users arriving from the official WeChat account's mass push were stuck on a 404 page and could not enter the lottery game. And because the campaign had been planned for only 7 days, the database had been treated as a temporary one, with no replica attached and no dump backups. Now, on the very first day, the database had crashed.
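Even for a "temporary" seven-day database, a scheduled logical dump is cheap insurance. The sketch below only builds a `mysqldump` command line with a timestamped output file. The backup directory and file name pattern are hypothetical, credentials are assumed to come from an option file, and `--single-transaction` gives a consistent snapshot only for transactional tables such as InnoDB.

```python
import subprocess
from datetime import datetime


def dump_command(backup_dir: str, now: datetime) -> list:
    """Build a mysqldump argument list that writes a timestamped SQL file."""
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return [
        "mysqldump",
        "--single-transaction",  # consistent InnoDB snapshot without locking tables
        "--all-databases",
        f"--result-file={backup_dir}/lottery-{stamp}.sql",
    ]


def run_dump(backup_dir: str) -> None:
    """Run the dump; raises CalledProcessError if mysqldump fails."""
    subprocess.run(dump_command(backup_dir, datetime.now()), check=True)


# Show the command that would be run, using a hypothetical directory.
example = dump_command("/data/backup", datetime(2020, 1, 2, 3, 4, 5))
print(" ".join(example))
```

Calling `run_dump` from cron every few hours would have left a recent copy to restore from, instead of falling back to an old test-environment database.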

Our department has no DBA, so there was no one to consult directly about a database that would not start. Our manager phoned a DBA in another department, but the conversation went slowly and the problem could not be explained clearly over the phone. There was no time left to recover this database.

We then decided to initialize a new database virtual machine, quickly dump an old database from the earlier test environment into it, and redeploy the application against the new database so that users could play the game again.

9:55 PM: The database virtual machine was initialized and the application was back online. The developers breathed a small sigh of relief.

10:00 PM to 1:20 AM: By this point our brains were sluggish. We tried once to restore the old database, failed, and gave up. Reviewing the problem with colleagues, we saw in the monitoring data that at 7 PM the number of database connections had spiked suddenly and the CPU load had soared, while the application servers had seen no abnormal surge in traffic. If the application logic was not at fault, the only provisional suspect was the connection pool (we used Druid).

[Figure: database load monitoring at the time]
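On the connection pool suspicion: with many application instances in front of a single database, the worst-case number of connections is the instance count multiplied by each pool's maximum. The arithmetic can be sketched as follows. The 10 instances come from the post; the pool size of 50, the database limit of 151 and the reserve of 11 are hypothetical numbers chosen only for illustration.

```python
def worst_case_connections(app_instances: int, pool_max_active: int) -> int:
    """Connections the database sees if every pool fills at the same time."""
    return app_instances * pool_max_active


def pool_budget(app_instances: int, db_max_connections: int, reserved: int = 10) -> int:
    """Largest per-instance pool size that keeps the total under the DB limit.

    `reserved` leaves headroom for admin sessions and monitoring.
    """
    usable = db_max_connections - reserved
    if usable < app_instances:
        raise ValueError("database limit too low for this many instances")
    return usable // app_instances


# 10 Tomcat instances, each with a hypothetical pool maximum of 50.
total = worst_case_connections(10, 50)

# Per-instance cap under a hypothetical database limit of 151 connections.
cap = pool_budget(10, 151, reserved=11)

print(total, cap)
```

If the pools can collectively open far more connections than the database can serve, a brief slowdown can turn into a connection spike on the database without any matching traffic spike on the application servers, which is consistent with what the monitoring showed.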

Postscript

After resting over the weekend, on Monday a little tinkering got the old database to start. Evidently, fighting a problem while exhausted does not work. Once it was up, the table data that needed to be synchronized was imported into the production database, and the matter was more or less closed.

We ran into many unexpected situations this time, and all sorts of small problems piled up until they became the straw that broke the camel's back. We recorded the experience and summarized the lessons following Scrum's retrospective rules.

This article comes from the "malt bread" platform. Please credit the source when reprinting.

Copyright notice: This is the blogger's original article and may not be reproduced without the blogger's permission.
