Analysis and summary of an incident in the Suning store app

Source: Internet
Author: User

I switched to Java three years ago. This is a record of a pitfall I ran into recently.

Close to the 818 shopping festival, the Suning store app ran a flash-sale ("second kill") promotion, and the backend services started to misbehave.

The promotion started at 10 o'clock on Friday. We received a system alert by SMS, and colleagues began reporting on Bean Sprouts (Suning's internal communication tool) that some users could not place orders.

The monitoring dashboard showed traffic soaring while response times kept growing. The price system, which sits in the middle of the call chain, began to return errors, and those errors propagated to the interface systems that depend on it.

On the log platform we looked at the calls to the price system interface. The response time grew steadily until errors started to appear, and the error messages were timeout errors.

Looking at the call details of the slow requests, we could see that calls to the price system interface had started to back up: each call had to wait for earlier calls to finish before it was processed, so latency kept growing. Yet the code logic just reads data from Redis. Why would something that should take milliseconds be slow enough to cause a backlog? Was there a problem with Redis?
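
The backlog follows from simple arithmetic: once calls arrive faster than the interface can finish them, the queue grows every second and each new call waits longer. Below is a minimal sketch in Java, assuming a single queue. The arrival and service rates are hypothetical illustration values, not measurements from the incident.

```java
/** Toy model of a request backlog; all rates are hypothetical, not incident data. */
final class BacklogModel {

    /** Requests still queued after `seconds`, given arrivals and completions per second. */
    static long backlogAfter(long seconds, long arrivalsPerSec, long servedPerSec) {
        long backlog = 0;
        for (long s = 0; s < seconds; s++) {
            // The queue grows by the surplus each second and never drops below zero.
            backlog = Math.max(0, backlog + arrivalsPerSec - servedPerSec);
        }
        return backlog;
    }

    /** Approximate wait in milliseconds for a request that joins behind `backlog` others. */
    static long waitMillis(long backlog, long servedPerSec) {
        return backlog * 1000 / servedPerSec;
    }

    public static void main(String[] args) {
        long arrivals = 1200; // assumed incoming calls per second
        long served = 1000;   // assumed calls the slowed-down dependency completes per second
        for (long t : new long[] {1, 5, 10, 30}) {
            long backlog = backlogAfter(t, arrivals, served);
            System.out.printf("t=%ds backlog=%d wait~%dms%n", t, backlog, waitMillis(backlog, served));
        }
    }
}
```

With these assumed rates the estimated wait reaches about two seconds after ten seconds of overload. If the caller's timeout is shorter than the wait, calls start failing with timeout errors, which matches the pattern seen in the logs.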

Indeed there was. The price system had been sharing one Redis deployment (two groups, each with one master and two slaves) with the master data system.

The monitoring screenshots showed the two masters under a load of about 720,000 and 620,000 respectively. Memory usage was also high.

We requested two dedicated Redis groups for the price system and released the change that same night (a painful overnight shift).

After the upgrade that night we ran a stress test: the price interface could sustain a peak of 700,000 calls per minute. During this Friday's promotion the dashboard showed the system under no real pressure. (The 0 ms readings appear because the dashboard rounds to whole milliseconds.)
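
To compare the stress-test figure with expected traffic, it helps to convert it to calls per second: 700,000 calls per minute is roughly 11,667 calls per second. The sketch below does that conversion and checks a forecast against it. The forecast inputs (number of users, calls per user, peak window) and the 70% utilization limit are hypothetical placeholders, not values from the incident.

```java
/** Rough capacity check; only the 700,000 calls/minute figure comes from the stress test. */
final class CapacityCheck {

    /** Converts a per-minute rate to a per-second rate. */
    static double perSecond(long callsPerMinute) {
        return callsPerMinute / 60.0;
    }

    /** Expected peak calls per second if `users` each make `callsPerUser` calls within `windowSeconds`. */
    static double expectedPeakPerSecond(long users, int callsPerUser, int windowSeconds) {
        return (double) users * callsPerUser / windowSeconds;
    }

    /** True when the expected peak stays within the given fraction of measured capacity. */
    static boolean hasHeadroom(double expectedPerSec, double capacityPerSec, double maxUtilization) {
        return expectedPerSec <= capacityPerSec * maxUtilization;
    }

    public static void main(String[] args) {
        double capacity = perSecond(700_000);                    // measured in the stress test
        double expected = expectedPeakPerSecond(200_000, 3, 60); // hypothetical forecast
        System.out.printf("capacity=%.0f/s expected=%.0f/s ok=%b%n",
                capacity, expected, hasHeadroom(expected, capacity, 0.7));
    }
}
```

In this made-up forecast the expected peak fits within raw capacity but not within the 70% limit, which is the kind of result that should prompt a conversation before the promotion rather than after it.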

Summary

1. System load was not estimated from the expected number of participants, and that led to the incident. Before a promotion, product and operations should forecast the number of participants and communicate it to development. Development should know whether the current system can support that traffic rather than taking it for granted, and if unsure should ask the test team to run a stress test. Clearly none of this was done before the incident.

2. There was no fallback plan to circuit-break or degrade the service, so the errors surfaced directly in the front-end system.
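
The second point can be sketched as a small circuit breaker with a fallback, written here with only the JDK. This is an illustration under assumptions, not the mechanism Suning uses: the class name, the thresholds, and the "cached-price" fallback are all hypothetical, and the sketch omits the limited "half-open" probing that production libraries usually add.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Minimal circuit breaker with a fallback; a sketch, not the Suning implementation. */
final class SimpleCircuitBreaker<T> {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long to skip the primary call once open
    private final LongSupplier clock;   // injectable time source, in milliseconds
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Runs `primary` unless the circuit is open; returns `fallback` on failure or while open. */
    T call(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();      // open: degrade without touching the dependency
        }
        try {
            T result = primary.get();   // the lock is not held during the real call
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && clock.getAsLong() - openedAt < openMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    private synchronized void onFailure() {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = clock.getAsLong(); // open (or reopen) the circuit
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker<String> breaker =
                new SimpleCircuitBreaker<>(3, 5_000, System::currentTimeMillis);
        // Hypothetical price lookup that always times out.
        Supplier<String> failingPriceLookup = () -> { throw new IllegalStateException("timeout"); };
        for (int i = 0; i < 5; i++) {
            String price = breaker.call(failingPriceLookup, () -> "cached-price");
            System.out.println(price + " open=" + breaker.isOpen());
        }
    }
}
```

Once the circuit is open, callers get the degraded answer immediately instead of queuing behind a dependency that is already overloaded, so a failure in the price system does not have to turn into an error page for the user.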
