Going live: a record of a production system failure

Source: Internet
Author: User

This project is a prize-wheel lottery game built for a marketing campaign. Because operations and marketing were on a tight deadline, development, testing, and deployment had less than 10 days in total before going live. Several preparations were not in place, for example:

1. Overall development was finished only two days before launch, and the testers only came to understand the project's requirements in the second week of development, so there was not enough time to complete functional testing, UI adaptation across phone models, and system stress testing.

2. Technically, because the partner's official-account key was not suitable to hand over directly, we had to obtain the required functions through an interface wrapped by the partner. That interface was delivered late, three days before the scheduled start time.

There is only one authorization callback domain for the web-page interface. It is already used by other applications, so it could not simply be changed to the domain where our application is deployed. The partner had to configure nginx on its intranet to forward HTTP traffic so that the callback could reach our server. The wrapped API could only be tested once that forwarding configuration was complete.

This network setup also made it harder to troubleshoot when pages failed to load for some users.

3. The production machines were only ready on the last day. The checks of the tomcat and database deployment environments were not fully completed, which left hidden risks. For example, mariadb still did not generate a binlog after my.cnf was set, and the indexes on some core tables had not all been created.

In addition, the activity ran for only seven days. We estimated that most of the lottery load would fall on the application tier and that the database would not be under pressure. So more than 10 tomcat instances and redis caches were configured, but mariadb got neither a master-slave structure nor a backup, which made it a single point of failure.

4. Once the machines were ready, O&M was also busy migrating the monitoring and log-viewing infrastructure and was short-handed, so only one thing could be done in the half day available. Server monitoring was given priority, which left the alerting system as another risk. The company's internal server alerting is run centrally by the Support Department, so the alerting function was not tested at launch, and another mine was laid.
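The environment checks mentioned in item 3 could be scripted so that they are not skipped under time pressure. The following is a minimal Python sketch, not the team's actual tooling. It assumes the rows have already been fetched with `SHOW VARIABLES LIKE 'log_bin'` and `SHOW INDEX FROM <table>` through whatever database client is in use; the table and column names (`lottery_user`, `openid`) are hypothetical.

```python
from typing import Iterable, List, Mapping, Tuple


def binlog_enabled(variable_rows: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the log_bin variable is reported as ON.

    variable_rows: (Variable_name, Value) pairs, as returned by
    SHOW VARIABLES LIKE 'log_bin'.
    """
    for name, value in variable_rows:
        if name == "log_bin":
            return str(value).upper() == "ON"
    return False  # variable not reported at all: treat as disabled


def missing_indexes(index_rows: Iterable[Mapping[str, object]],
                    required_columns: Iterable[str]) -> List[str]:
    """Return the required columns that do not lead any index.

    index_rows: dict rows from SHOW INDEX FROM <table>, keyed by the
    column names of that result set (Column_name, Seq_in_index, ...).
    """
    leading = {
        row["Column_name"]
        for row in index_rows
        if int(row["Seq_in_index"]) == 1
    }
    return sorted(set(required_columns) - leading)


# Hypothetical results for a table named lottery_user.
index_rows = [{"Key_name": "PRIMARY", "Seq_in_index": 1, "Column_name": "id"}]
print(binlog_enabled([("log_bin", "OFF")]))          # False
print(missing_indexes(index_rows, ["id", "openid"]))  # ['openid']
```

Keeping the checks as pure functions over query results means they can be run against any environment's output before launch, without tying the sketch to one database driver.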

Before the activity

AM: A test through a mobile-phone proxy found that, because the Host in the HTTP header forwarded by the partner's nginx was the partner's address, every request for the game activity was sent to the partner's server first and then forwarded back. The first verification showed that this greatly increased latency on the game homepage.
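One way this kind of detour can arise is when an application builds its links and redirects from the incoming `Host` header: behind a forwarding proxy that header names the proxy, so every follow-up request goes there first. The Python sketch below illustrates that mechanism and a common workaround, a configured public base URL. It is an assumption about the cause, not the project's actual code, and `external_url`, `partner.example`, and `game.example` are hypothetical names.

```python
from typing import Mapping, Optional


def external_url(headers: Mapping[str, str], path: str,
                 public_base: Optional[str] = None) -> str:
    """Build the absolute URL that is handed back to the browser."""
    clean_path = "/" + path.lstrip("/")
    if public_base:
        # Explicit configuration wins: links point straight at our own domain.
        return public_base.rstrip("/") + clean_path
    # Fallback: trust the Host header, which behind a forwarding proxy
    # is the proxy's address rather than ours.
    return "http://" + headers.get("Host", "localhost") + clean_path


forwarded = {"Host": "partner.example"}  # header as seen behind the proxy
print(external_url(forwarded, "/wheel"))
# http://partner.example/wheel
print(external_url(forwarded, "/wheel", "https://game.example"))
# https://game.example/wheel
```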

AM: With half an hour left before the start, I tried to test redirecting requests away from the partner's proxy. However, because of notifications pushed earlier through WeChat, a few scattered users had already begun to access the activity page before the activity started, and they were being held at the activity page. A last-minute change to the program would have had a large impact, and after the previous day's stress testing of the test interface my head was a mess. The test did not succeed, and I gave up for the time being.

Activity starts

AM: The system officially opened. Users could enter the prize-wheel lottery page. System monitoring looked normal, as did system load and the network.

PM: We noticed that a commonly used field in one database table had no index. By design it is queried only once, when the user is not logged in, and running an ALTER to add the index on the live database might have affected database operation at that moment, so the index was not added.
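What a missing index costs per query can be illustrated with a query plan. The sketch below uses Python's built-in sqlite3 rather than mariadb, purely so that it runs without a server; the table `lottery_user`, the column `openid`, and the index name are hypothetical. It shows the same lookup changing from a full table scan to an index search once the index exists. It says nothing about how long building the index would affect a live mariadb table, which was the concern here.

```python
import sqlite3


def plan_for(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lottery_user (id INTEGER PRIMARY KEY, openid TEXT, nick TEXT)"
)
query = "SELECT * FROM lottery_user WHERE openid = ?"

before = plan_for(conn, query, ("abc",))   # no index yet: full scan
conn.execute("CREATE INDEX idx_lottery_user_openid ON lottery_user (openid)")
after = plan_for(conn, query, ("abc",))    # index available: search via index

print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than asserting a fixed string.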

PM: The company VPN dropped. Unable to work without the connection, a few developers went to the tea room to relax and drink some water. After a while I was suddenly told that the activity page was showing a white screen and could not be accessed. My O&M colleagues told me that the server room's inbound line from the Mobile carrier had been interrupted, and they quickly notified the Support Department to troubleshoot the problem; meanwhile, it took 10 minutes to re-point the domain name to the data center's Telecom-carrier IP address.

PM: The data center's interrupted inbound line was restored. To be safe, the domain name resolution would only be switched back to the Mobile line after waiting a while.

5:00 PM: Another wave of pushes from the official WeChat subscription account began, sending image-and-text messages to guide users to the activity.

The program had been adjusted and needed to be re-released to production. My O&M colleagues were on the highway on their way home and had to find a place along the road to stop before pushing all the war packages to the servers. So we waited. Meanwhile, I was told that the next wave of subscription-account pushes of the game's image-and-text messages had begun, and traffic could rise immediately.

PM: A few of us finally had time to go to dinner, only to find that the canteen had closed and everyone had left, so we had to eat instant noodles...

O&M reported that all war packages had been pushed.

Then we found that the game page had begun to load slowly, and users who followed the official account began to be unable to enter the game page; they were sent back to the guide page that asks users to follow the account.

PM: We began troubleshooting. Checking tomcat on the production machines showed no exceptions. On logging in to the database machine, however, the command line itself responded abnormally slowly: the system took more than 2 or 3 seconds to respond after a command was entered.
In top, the CPU was abnormally overloaded, and so was the system load.
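The reading from top can be put into numbers by comparing the load average with the core count. This is a minimal Python sketch, assuming a Unix-like host (`os.getloadavg` is not available on Windows); the threshold of 1.0 runnable task per core is an illustrative assumption, not a value from the incident.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Normalize the 1-minute load average by the number of CPU cores."""
    return load_1min / max(cores, 1)


def is_overloaded(load_1min: float, cores: int, threshold: float = 1.0) -> bool:
    """True when more tasks are waiting per core than the threshold allows."""
    return load_per_core(load_1min, cores) > threshold


def check_local_host(threshold: float = 1.0) -> bool:
    """Apply the same rule to the machine this script runs on."""
    load_1min, _, _ = os.getloadavg()
    return is_overloaded(load_1min, os.cpu_count() or 1, threshold)


print(is_overloaded(12.0, 4))  # True: 3.0 per core
print(is_overloaded(2.0, 4))   # False: 0.5 per core
```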

Having found that the database was behaving abnormally, we decided to restart it.

8:45 PM: systemctl stop db. Then tragedy struck. The system load came down, but the restart failed. The mysql error log showed the startup problem:

InnoDB: Err... [the rest of the error log and the image from the original post were lost to encoding damage]

Postscript

After resting over the weekend, I got the old database to start on Monday. It seems that fatigue really cannot be fought. Once it was up, the table data that needed to be synchronized was imported into the production database.

There were many unexpected situations, and a pile of small problems added up until the last straw broke the camel's back. This record of the experience and lessons follows the retrospective practice of Scrum.
