Problem description
User volume grew rapidly and the number of visits doubled in a short time. Thanks to the earlier capacity planning, the hardware resources could cope, but the software system had a serious problem:
40% of requests returned HTTP 500: Internal Server Error
Checking the logs showed that the errors came from the PHP <-> Redis connection handling
Debugging process
1st time
At first the root cause could not be found, so a variety of error-related tweaks were attempted, such as:
Increase the number of PHP connections and raise the timeout from 500 ms to 2.5 s
Disable default_socket_timeout in the PHP settings
Disable SYN cookies on the host system
Check the file descriptor limits of the Redis and web servers
Increase the host system's mbuffer
Adjust the TCP backlog size
......
Many approaches were tried (the first two are sketched below), but none of them worked
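For illustration, the first two tweaks above might look roughly like this in PHP (the host name is hypothetical; the values come from the list):

<?php
// Raise the phpredis connect timeout from 500 ms to 2.5 s.
$redis = new Redis();
$redis->connect('redis.internal', 6379, 2.5);   // hypothetical host, 2.5 s timeout

// Disable PHP's default socket timeout (-1 means "no limit").
ini_set('default_socket_timeout', '-1');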
2nd time
Tried to reproduce the problem in the pre-release environment; unfortunately it did not work, most likely because the traffic there was too low to trigger it
3rd time
Could it be that the Redis connections were not being closed in the code?
Normally PHP automatically closes resource connections at the end of execution, but old versions had memory-leak problems there, so to be safe the code was changed to close the connections explicitly.
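A minimal sketch of that explicit cleanup with phpredis (connection details and key are hypothetical):

<?php
$redis = new Redis();
$redis->connect('redis.internal', 6379);
$value = $redis->get('some:key');
// ... business logic ...
$redis->close();   // close explicitly instead of relying on PHP's end-of-request cleanup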
The result: still no improvement.
4th time
Suspect: the Phpredis client library
Run an A/B test: replace the library with Predis and deploy it to 20% of the users in the data center
Thanks to a well-structured codebase, the replacement was done quickly
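The Predis side of the swap looks roughly like this (a sketch; connection details and keys are hypothetical):

<?php
require 'vendor/autoload.php';   // Predis installed via Composer

$client = new Predis\Client([
    'scheme' => 'tcp',
    'host'   => 'redis.internal',
    'port'   => 6379,
]);
$client->set('some:key', 'value');
echo $client->get('some:key');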
The result was still the same, but there was an upside: it proved that Phpredis was not the problem
5th time
Looked at the Redis version: v2.6, while the latest release at the time was v2.8.9
So try upgrading Redis.
It did not fix the problem, but looking on the bright side, at least the Redis version is now up to date.
6th time
After digging through a lot of documentation, a debugging method, the Redis Software Watchdog, was found in the official Redis docs; it was turned on, and then this was run:
$ redis-cli --latency -p 6380 -h 1.2.3.4
min: 0, max: 463, avg: 2.03 (19443 samples)
Then check the Redis logs:
...
[20398] 09:20:55.351 * 10000 changes in 60 seconds. Saving...
[20398] 09:20:55.759 * Background saving started by PID 41941
[41941] 09:22:48.197 * DB saved on disk
[20398] 09:22:49.321 * Background saving terminated with success
[20398] 09:25:23.299 * 10000 changes in 60 seconds. Saving...
[20398] 09:25:23.644 * Background saving started by PID 42027
...
Found the problem:
Every few minutes Redis saves its data to disk, and forking the background save process takes about 400 ms (this can be seen from the 1st and 2nd timestamps in the log above)
Here we finally find the source of the problem: the Redis instance holds a lot of data, so forking the background process for each persistence run is slow; and because keys are modified frequently in their business, persistence is triggered often and regularly blocks Redis
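One quick way to confirm the fork cost (a phpredis sketch; the same latest_fork_usec figure is also reported by redis-cli INFO stats):

<?php
$redis = new Redis();
$redis->connect('1.2.3.4', 6380);          // the instance measured above
$info = $redis->info();
// Duration of the most recent background-save fork, in microseconds.
echo $info['latest_fork_usec'], PHP_EOL;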
Workaround: Use a separate slave for persistence
This slave does not serve real traffic; its only job is persistence, and the persistence work that previously ran on the main Redis instance is moved onto it
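A sketch of that setup via phpredis (host names and save settings are hypothetical; the same can be done in redis.conf or with redis-cli):

<?php
// Stop RDB snapshots on the master that serves live traffic.
$master = new Redis();
$master->connect('redis-master.internal', 6379);
$master->config('SET', 'save', '');

// Point a dedicated slave at the master and let it do the persistence.
$slave = new Redis();
$slave->connect('redis-persist.internal', 6379);
$slave->slaveof('redis-master.internal', 6379);
$slave->config('SET', 'save', '900 1 300 10 60 10000');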
The effect was very obvious and the problem was mostly solved, but errors still occurred occasionally
7th time
While troubleshooting slow queries that could block Redis, a place in the code was found that used KEYS *
As the amount of data in Redis keeps growing, this command naturally causes serious blocking
It can be replaced with SCAN, for example:
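A sketch of the KEYS-to-SCAN change with phpredis (the key pattern is hypothetical):

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SCAN, Redis::SCAN_RETRY);   // keep iterating until SCAN returns keys or finishes

$it = null;
while ($keys = $redis->scan($it, 'user:*', 100)) {       // walk the keyspace 100 keys at a time
    foreach ($keys as $key) {
        // process $key; unlike KEYS *, this never blocks Redis for long
    }
}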
8th time
After the previous adjustments the problem was resolved, and over the following months the system held up even as traffic kept growing
But they noticed a new problem:
Each request creates a Redis connection, executes a few commands, and then disconnects. With a large request volume this wastes a lot of performance: more than half of the commands are spent handling connections rather than business logic, and it slows Redis down
Workaround: introduce a proxy. They chose Twitter's twemproxy: an agent only needs to be installed on each webserver, and twemproxy maintains persistent connections to the Redis instances, which greatly reduces the connection overhead.
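On the PHP side the change is small; a sketch assuming twemproxy listens locally on port 22121 as in its sample config:

<?php
// Connect to the local twemproxy agent instead of a remote Redis instance;
// twemproxy keeps the persistent connections to the Redis servers behind it.
$redis = new Redis();
$redis->connect('127.0.0.1', 22121);
$redis->set('greeting', 'hello');
echo $redis->get('greeting');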
Twemproxy also has two other handy features:
It supports memcached
It can block very time-consuming or dangerous commands such as KEYS and FLUSHALL
The effect was excellent; the earlier connection errors were no longer a worry
9th time
Continue optimizing with data sharding:
Split and isolate data belonging to different contexts
Within the same context, shard the data with consistent hashing (a sketch follows below)
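A minimal client-side sketch of consistent-hash sharding (the class, node addresses, and key are hypothetical; twemproxy can also do this with its ketama distribution):

<?php
class ConsistentHashRing
{
    private $ring = [];        // hash position => node

    public function __construct(array $nodes, $replicas = 100)
    {
        // Place each node on the ring at several positions ("virtual nodes").
        foreach ($nodes as $node) {
            for ($i = 0; $i < $replicas; $i++) {
                $this->ring[crc32($node . '#' . $i)] = $node;
            }
        }
        ksort($this->ring);
    }

    public function getNode($key)
    {
        $hash = crc32($key);
        // Walk clockwise to the first node position at or after the key's hash.
        foreach ($this->ring as $position => $node) {
            if ($hash <= $position) {
                return $node;
            }
        }
        // Wrap around to the first node on the ring.
        return reset($this->ring);
    }
}

$ring  = new ConsistentHashRing(['10.0.0.1:6379', '10.0.0.2:6379', '10.0.0.3:6379']);
$shard = $ring->getNode('user:12345');   // the Redis node responsible for this key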
Effect:
Requests and load on each machine are reduced
Cache reliability improves, and individual node failures are no longer a worry
That is the whole content of this article; I hope it has been helpful.