Here is another story worth sharing with readers today. It should interest anyone who owns an application or is responsible for running web applications.
We recently changed the authentication service for most of our live websites, including the Compuware APM Community website, which I am responsible for. Changing the authentication service is a major event, so we tested the change in the test environment before deploying it to production. Everything looked good in the test environment. After deploying to production, however, we found that users in one specific user group were affected and could not access some of the content on the website.
It took me five minutes to confirm the problem, assess its impact, and give our operations department enough information to fix it.
Editor's note: This article is from the Compuware dynaTrace team blog. The steps below rely mainly on the dynaTrace website monitoring tool, but the general approach to tracking down problems in a production environment is still a useful reference.
The first question: Is there a problem that we did not catch in the test environment?
The application overview graph shows that a transaction on our Community Portal has a very high failure rate:
The application overview chart shows that one of our transactions has a high failure rate.
So the answer to the first question is yes: we do have a problem!
The second question: what is the problem?
The next step is to look at the automatically detected errors. They show that the issue is related to HTTP 4xx responses, which means that many users are requesting certain pages and being denied:
Denied requests account for the high failure rate.
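For readers without a monitoring tool at hand, the same check can be approximated from raw request data. The following is a minimal sketch, not what dynaTrace does internally: the paths and status codes in `REQUESTS` are made-up sample data, and `client_error_rate_by_path` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical sample of (path, HTTP status) pairs. Real data would come from
# an access log or an export from a monitoring tool.
REQUESTS = [
    ("/community/docs", 200),
    ("/community/docs", 403),
    ("/community/docs", 403),
    ("/community/forum", 200),
    ("/community/forum", 200),
    ("/community/downloads", 403),
    ("/community/downloads", 404),
    ("/community/downloads", 200),
]


def client_error_rate_by_path(requests):
    """Return {path: fraction of requests answered with an HTTP 4xx status}."""
    totals = Counter()
    client_errors = Counter()
    for path, status in requests:
        totals[path] += 1
        if 400 <= status < 500:
            client_errors[path] += 1
    return {path: client_errors[path] / totals[path] for path in totals}


rates = client_error_rate_by_path(REQUESTS)
# List the pages with the highest share of 4xx responses first.
for path, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{path}: {rate:.0%} of requests returned 4xx")
```

A page whose 4xx share jumps right after a deployment is a good candidate for a closer look.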
We now know that access to these pages is being denied. What is not yet clear is whether this is an actual problem or whether users are simply trying to open content they are not permitted to see.
The third question: Is this a real problem? If so, what information can I give the operations department to solve it?
As mentioned above, these errors may simply come from users trying to access restricted content. In that case they would not be a big deal, because denying those requests is the expected behavior. After checking the underlying error details, such as exceptions, we found that the problem is in fact related to our authentication service. It appears that we did not migrate all the security groups when we switched to the new authentication system:
The exception details point to a problem with our security groups.
This information is enough for the operations department to work out why these security groups were not migrated.
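One simple way to follow up on such a finding is to compare the groups defined in the old system with those in the new one. The sketch below assumes both lists can be exported as plain names. The group names and the `missing_groups` helper are hypothetical and are not taken from the original article.

```python
def missing_groups(old_groups, new_groups):
    """Return the groups present in the old system but absent from the new one."""
    return sorted(set(old_groups) - set(new_groups))


# Hypothetical exports of security group names from each system.
OLD_SYSTEM_GROUPS = {"customers", "partners", "employees", "beta-testers"}
NEW_SYSTEM_GROUPS = {"customers", "employees"}

for group in missing_groups(OLD_SYSTEM_GROUPS, NEW_SYSTEM_GROUPS):
    print(f"not migrated: {group}")
```

Running a comparison like this as part of the migration checklist would flag unmigrated groups before users run into them.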
The fourth question: Which users are affected? Can we proactively contact these users and apologize?
Now that we know the problem is of our own making, we want to know which users are affected. As the application owner, I would like to contact these users proactively, explain what they ran into (even though they have not reported the issue), and let them know that we are actively working on a fix. With our user experience solution, we can see the details of every visitor who encountered these exceptions:
Visitors affected by the authentication issue
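Without a user experience monitoring tool, a rough list of affected visitors can be built from request records that include a user identifier. This is a minimal sketch under stated assumptions: the visitor names and paths are invented, `affected_visitors` is a hypothetical helper, and treating 401 and 403 as the "access denied" statuses is an assumption (the article only says HTTP 4xx).

```python
from collections import defaultdict

# Hypothetical records of (visitor id, path, HTTP status).
VISITS = [
    ("alice", "/community/docs", 403),
    ("bob", "/community/forum", 200),
    ("alice", "/community/downloads", 403),
    ("carol", "/community/docs", 403),
    ("carol", "/community/docs", 200),
]


def affected_visitors(visits, denied_statuses=(401, 403)):
    """Map each visitor who was denied access to the sorted pages they were denied."""
    denied = defaultdict(set)
    for visitor, path, status in visits:
        if status in denied_statuses:
            denied[visitor].add(path)
    return {visitor: sorted(paths) for visitor, paths in denied.items()}


for visitor, paths in sorted(affected_visitors(VISITS).items()):
    print(f"{visitor} was denied access to: {', '.join(paths)}")
```

The resulting list gives the application owner a starting point for contacting the affected users.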
Conclusion
Fortunately, we had tested the system in the test environment, and we were able to solve this problem. Still, it is even better to have real visibility into problems in the production environment, because it is not always possible to test every scenario.
Original article: Field Report: 5 Minutes to Identify a Production Problem and its Impact (about:performance)