When the system goes online, the system goes online.

Source: Internet
Author: User

It has been several days since the database failure on Friday, the first day of the release. After the last emergency recovery there was no time to reproduce the production problem. The code logic of the user portal is not complex, and there are no major vulnerabilities.

The office space at Haichuang Park is already as crowded as a swimming pool in summer. Every side and corner is full of desks and people. The people in charge of arranging seats for newcomers in each department must have a headache; finding places for people to sit in such a tight space is an art. Overwhelmed, the company finally opened a new office. On Monday, the department moved from the "countryside" to the "city" on Xixi Road and became city folk.

On Monday, I kept a second screen open on system monitoring all day. Traffic and load were stable, and no exceptions had appeared by the end of the workday, so I could go home happily.

On Tuesday I also watched the monitoring for half a day. Everything was normal, and the campaign was due to end on Thursday evening, so it looked as if there would be no big problem. We were so tired that we took a rest on Wednesday. But Murphy's Law took effect again. It seems the lesson from Nolan's Interstellar applies here.

At five o'clock on Wednesday, the partner informed us that another public account was scheduled to push an article to bring users into the game. Although I was resting at home, I had nothing else to do, so I opened the monitoring system. The CPU load on the application servers and the database had indeed increased, but it did not look like a problem. So I opened iQiYi, ready to watch the film about the Taoist going down the mountain, the one with Lin Zhiling, which is said to be a tragedy.

A few minutes later, the O&M engineer called and said, "The live scene you said you wanted last time is here: the number of connections has gone up again." I immediately logged in to the database machine to check CPU usage; the load average was exactly the same as in the previous failure. This time, however, things were different. First, we had gained experience from the last time. Second, our O&M engineers were not on the highway this time, so we could handle it together promptly.

The database runs on an 8-core VM. Last time I learned that when the load is high, the Tomcat applications try to establish more JDBC connections. The maximum number of database connections was configured as 3000, but the operating system's file handle limit had not been raised, so the database could only open up to 1024 connections.
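The cap described above can be illustrated with a small sketch. It assumes that each client connection costs the database process one file handle and ignores the handles used for data files and logs, so the real ceiling would sit somewhat below the limit. The function name is hypothetical.

```python
def effective_max_connections(configured_max: int, file_handle_limit: int) -> int:
    """Return the number of connections the server can really accept.

    Simplifying assumption: one file handle per connection, so the
    lower of the two limits wins.
    """
    return min(configured_max, file_handle_limit)


# Values from the incident: 3000 configured, file handle limit left at 1024.
print(effective_max_connections(3000, 1024))  # 1024
```

Raising only the database setting has no effect while the operating system limit is the smaller of the two numbers.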

Phase 1: Degrade the application

Since the 14 Tomcat instances were all trying to open more connections, we needed to reduce the database pressure first, starting by shutting Tomcat down. After shutting down half of the Tomcat machines, the database connection count was still 1024. Considering that TCP connections may take some time to leave the CLOSE_WAIT state, we waited a while; after confirming that the count had not dropped, we decided to continue shutting down Tomcat instances. While doing this, I had a question: the database connection pool setting for a single Tomcat should have been adjusted to 60, so the maximum number of connections at the database end should be only 840. Why was it more than 1000 this time?

When Tomcat was cut down to only two instances, the number of database connections started to drop. At 300 connections, CPU usage was still 99%, but the load average had fallen from 500 to a basically safe level. Because of some configuration problems in the production database on this machine, the SQL query log was not being written. Judging from the experience with another machine last time, however, the slow queries were all lookups against a single user table. This lookup runs whenever a user enters the page and the session does not contain the user information.

Phase 2: Add databases

At this point it seemed that a single database could not sustain the queries issued by the business logic. The second step was therefore to prepare multiple databases and spread the front-end traffic across them. While the O&M engineer was preparing the databases, I checked the connection pool configuration and found that the maximum number of connections was 150, not the 60 I had assumed. It may have been changed temporarily during the last incident, which would explain why the applications had opened 1000+ connections.
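The arithmetic behind these numbers can be checked with a short sketch. It assumes that every Tomcat instance fills its pool under load and that the database stops at the 1024-connection ceiling mentioned earlier. The function name is hypothetical.

```python
DB_CONNECTION_CEILING = 1024  # ceiling imposed by the file handle limit


def observed_connections(tomcat_count: int, pool_max: int) -> int:
    """Connections seen at the database when every pool is full."""
    demand = tomcat_count * pool_max
    return min(demand, DB_CONNECTION_CEILING)


# (Tomcat instances, pool size per instance)
for tomcats, pool in [(14, 60), (14, 150), (7, 150), (2, 150)]:
    demand = tomcats * pool
    print(tomcats, pool, demand, observed_connections(tomcats, pool))
# 14 60 840 840
# 14 150 2100 1024
# 7 150 1050 1024
# 2 150 300 300
```

With a pool of 150, halving the fleet to 7 instances still demands 1050 connections, which is consistent with the count staying at 1024 after the first shutdown and dropping to 300 only when two instances remained.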

Two databases were now ready. We restored the master database backup into the new databases, modified the database IP address in the WAR package, started three Tomcat instances against one database and released them, then started three more Tomcat instances against the other database. Looking at the monitoring again, splitting database access by traffic did not seem to relieve the problem. CPU usage on the newly started databases hit 99% immediately, and users still had difficulty loading pages.

Phase 3: Migrate to the cache

Since the databases could not hold up, the only option was to temporarily modify the program to put all user data in the Redis cache. Because it is a cluster deployment, the original Jedis connection pool setting was changed from 1000 to 100. Just as the build was about to be uploaded and released, several colleagues on the phone shouted: was the new office on fire? Not being on site, I could only hear the background noise over the speakerphone. It sounded as if property management was trying to make them evacuate, but I heard the O&M colleague say calmly, "Wait, let me deploy this program first..."

The fire was only an episode. After a while, the colleagues seemed not to have moved and were still dealing with the problem, so presumably it was not a serious fire and nobody was in danger. The change was first tested on a single Tomcat instance, with the front-end nginx changed to point to that one Tomcat. The cache needs to be warmed up: a user's first lookup goes to the database, and later lookups do not. After confirming that the function was correct, we deployed the new WAR package to all Tomcat servers and started them all. Over the next 10 minutes, the number of cache connections peaked at 1000 at first and then stabilized at around 600. The application service log looked normal.
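The warm-up behavior described above is the usual cache-aside pattern, sketched below. This is not the author's code: an in-memory dictionary stands in for Redis, and `UserCache`, `fake_db_lookup`, and the `db_hits` counter are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, Optional


class UserCache:
    """Cache-aside lookup: try the cache first, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]]) -> None:
        self._store: Dict[str, dict] = {}  # stand-in for the Redis cache
        self._load_from_db = load_from_db
        self.db_hits = 0  # counts how often the database was queried

    def get_user(self, user_id: str) -> Optional[dict]:
        cached = self._store.get(user_id)
        if cached is not None:
            return cached  # warm path: no database access
        self.db_hits += 1
        user = self._load_from_db(user_id)  # cold path: first lookup only
        if user is not None:
            self._store[user_id] = user
        return user


def fake_db_lookup(user_id: str) -> Optional[dict]:
    """Hypothetical stand-in for the query against the user table."""
    return {"id": user_id, "name": f"user-{user_id}"}


cache = UserCache(fake_db_lookup)
cache.get_user("42")  # goes to the database and fills the cache
cache.get_user("42")  # served from the cache
print(cache.db_hits)  # 1
```

Each user costs the database one query at most while the entry stays cached, which is why the load shifts to the cache once the warm-up period has passed.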

Once the application was back to normal, the colleagues on the phone started ordering a family bucket for takeout, and could finally get something to eat.

When facing an online fault, the first thing is for the developers themselves to stay calm and withstand pressure from all sides. While you are busy trying to recover the service, your superiors will call one after another to ask about the situation. As if that were not enough, the partners will also call when they see that the service is unavailable. In such a situation, developers need to know their priorities, not lose focus, and concentrate on what they should be doing.

The last time I attended the Hangzhou Oracle user group, a DBA said that people in this line of work must be able to withstand a lot of pressure, because every day they touch the core databases of the business. When a database crashes and you are recovering it, you must never panic. There may be many people (leaders, colleagues, and anxious customers) standing behind you and watching, and your hands must not shake on the keyboard. This is equally true for developers who handle online faults. Presumably, those involved in handling Ctrip's incident share the same feeling.

Finally, here is the explanation of Murphy's Law from Baidu Encyclopedia. I wish you a pleasant weekend.

"Murphy's Law" mainly involves four aspects: 1. Nothing is as simple as it looks; 2. Everything takes longer than you expect; 3. Anything that can go wrong will go wrong; 4. If you worry that something will happen, it is more likely to happen.

The fundamental content of Murphy's Law is that "anything that can go wrong has a high probability of going wrong." In other words, as long as an event has a probability greater than zero, it cannot be assumed that it will not happen.

In science and algorithms, it is akin to the so-called worst-case scenario, which is expressed with big O notation in mathematics. For example, for insertion sort, the worst case is that the array to be sorted is in completely reversed order, and the sort completes only after n * (n-1) / 2 swaps. Experiments showing that the worst case does not occur do not mean that it can be casually ignored, unless one can confidently infer that the probability distribution of events is linear.
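A quick way to check the worst-case count is to count adjacent swaps in a plain insertion sort. The sketch below is illustrative only and the function name is hypothetical. For a reversed array of n elements every pair is out of order, so the count comes out to n * (n-1) / 2.

```python
def insertion_sort_swaps(values):
    """Sort a copy of values with insertion sort; return (sorted list, swap count)."""
    items = list(values)
    swaps = 0
    for i in range(1, len(items)):
        j = i
        # Move items[j] left until the element before it is not larger.
        while j > 0 and items[j - 1] > items[j]:
            items[j - 1], items[j] = items[j], items[j - 1]
            swaps += 1
            j -= 1
    return items, swaps


n = 6
sorted_items, swaps = insertion_sort_swaps(range(n, 0, -1))  # 6, 5, 4, 3, 2, 1
print(sorted_items)             # [1, 2, 3, 4, 5, 6]
print(swaps, n * (n - 1) // 2)  # 15 15
```

An already sorted input needs zero swaps, so the gap between the best and worst case grows quadratically with n.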

The article is from the platform "malt bread" (darkjune_think). For more information, see.

Copyright Disclaimer: This article is an original article by the blogger and cannot be reproduced without the permission of the blogger.
