Drawing on the operation and maintenance problems I have encountered in my own work, this article summarizes the common problems of Linux servers and how to locate them. "Server" here mainly refers to self-developed logic servers; web servers usually run on a common, well-tested architecture, so they cause relatively few problems.
A logic server typically handles between 3K and 10K requests per second, depending on the characteristics of the business. Logic servers are generally developed in house, and although most go through functional and stress testing before launch, problems still inevitably show up after deployment to the production environment. Some of them can be caught during a grayscale (canary) release, while others only surface after long exposure. The following sections classify the common problems and describe how to locate each.
1. Program bugs such as FD leaks or memory leaks
Before a service goes live it must be stress tested. During the test, watch the memory and the number of FDs consumed by the process, and judge, based on the characteristics of the business, whether the FD usage is reasonable. Also observe whether memory usage eventually settles at a stable value; if it keeps growing, there is certainly a leak.
To confirm an FD leak: run ls -al /proc/<pid>/fd | wc -l to see how many FDs a single process is using, and watch it over time; if it keeps growing and never settles at a stable value, you can confirm a leak. You can also run cat /proc/net/sockstat to see whether the machine-wide socket/FD usage keeps rising. On a typical 32-bit machine, the system hits a bottleneck once the FD count exceeds 100K (10w).
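If you want the server to report this figure itself rather than checking it by hand, a process can count its own open FDs by listing /proc/self/fd. The following is a minimal C++ sketch of that idea (illustrative only, not part of the original tooling):

    // Illustrative sketch: count this process's open FDs by listing /proc/self/fd.
    #include <dirent.h>
    #include <cstdio>

    static int CountOpenFds() {
        DIR* dir = opendir("/proc/self/fd");
        if (dir == NULL) return -1;
        int count = 0;
        while (readdir(dir) != NULL) {
            ++count;
        }
        closedir(dir);
        // subtract ".", ".." and the FD opened by opendir() itself
        return count - 3;
    }

    int main() {
        printf("open fds: %d\n", CountOpenFds());
        return 0;
    }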
To confirm a memory leak: in top, look at the RES and SHR columns of the process and watch whether they keep rising; if they never settle at a stable value, you can confirm a leak. You can also watch whether overall memory usage keeps increasing. The end result of a memory leak is that the swap partition gets used; once that happens, the wa field of the CPU will be much larger than 0, meaning the CPU is blocked waiting on I/O.
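Similarly, a process can read its own resident memory (the value top shows as RES) from /proc/self/status and log it periodically. A minimal C++ sketch, assuming the standard /proc layout:

    // Illustrative sketch: read this process's resident memory (VmRSS) from
    // /proc/self/status, roughly the value top shows in the RES column.
    #include <cstdio>
    #include <cstring>

    static long ReadVmRssKb() {
        FILE* fp = fopen("/proc/self/status", "r");
        if (fp == NULL) return -1;
        char line[256];
        long rss_kb = -1;
        while (fgets(line, sizeof(line), fp) != NULL) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                sscanf(line + 6, "%ld", &rss_kb);  // value is reported in kB
                break;
            }
        }
        fclose(fp);
        return rss_kb;
    }

    int main() {
        printf("VmRSS: %ld kB\n", ReadVmRssKb());
        return 0;
    }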
2. Natural growth in business
This can be judged from the request-count statistics: comparing the numbers from the days before and after makes it easy to confirm whether the traffic growth is natural. When the per-machine request count pushes the system to its bottleneck, the problem is easy to solve by adding capacity, but the best practice is to put monitoring on the system's capacity and on key metrics such as CPU, memory, and network (eth), so that you get an early warning instead of having to scale out in an emergency after the problem has already hit.
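As a rough illustration of "add monitoring to the key metrics", the sketch below reads the 1-minute load average from /proc/loadavg and warns above a threshold. The threshold value is an assumption, and a real setup would report into a monitoring/alerting system rather than to stderr:

    // Illustrative sketch: warn when the 1-minute load average crosses a threshold.
    #include <cstdio>

    int main() {
        FILE* fp = fopen("/proc/loadavg", "r");
        if (fp == NULL) return 1;
        double load1 = 0.0;
        fscanf(fp, "%lf", &load1);  // first field is the 1-minute load average
        fclose(fp);

        const double kThreshold = 8.0;  // assumed threshold; tune per machine
        if (load1 > kThreshold) {
            fprintf(stderr, "WARN: 1-min load %.2f exceeds %.2f\n", load1, kThreshold);
        }
        return 0;
    }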
3. Attribute (field) changes that cause abnormal user behavior
For example, I was once upgrading a server and, for performance reasons, stopped returning a field I considered useless. After the grayscale upgrade of one machine, its load tripled. My first reaction was that a bug was burning CPU, but vmstat showed that the usr and sys CPU utilization, roughly 14 and 7 before the upgrade, had become roughly 42 and 21 afterwards, about a threefold increase across the board. Looking at the request count next, I found it had grown as well, so evidently something was causing users to retry.
Since the functionality had been tested before release, ordinary users' functionality should not have been affected. Comparing the two versions, the only difference was the missing field, which caused clients using a certain plug-in to fail to parse the response and retry. After adding the field back and re-releasing, the problem was solved. The lesson I took away: whenever a field in the data returned to the client side changes, verify during the grayscale stage whether external users are affected; if they are, the issue can be worked around by returning dummy (placeholder) data for that field.
4. Improper configuration of system parameters
For example, for a while we found that during peak hours the process would fail to obtain an FD. Some interfaces used short-lived connections, so every request had to acquire an FD and release it when processing finished. Checking the system parameters, we found that /proc/sys/net/ipv4/tcp_tw_recycle and /proc/sys/net/ipv4/tcp_tw_reuse were both set to 0, i.e. fast recycling and reuse of FDs in the TIME_WAIT state were not enabled. As a result, after the short connections were closed, large numbers of FDs sat in TIME_WAIT, and some new requests could not obtain an FD. Adjusting these system parameters solved the problem.
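For reference, the corresponding entries in /etc/sysctl.conf would look roughly as follows (a sketch of the change described above, applied with sysctl -p). Note that tcp_tw_recycle is known to cause problems for clients behind NAT and was removed in Linux 4.12, so on modern kernels tcp_tw_reuse is the safer of the two:

    # sketch of the TIME_WAIT tuning described above
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1   # removed in Linux 4.12+; unsafe behind NAT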
5. Coding problems result in poor system processing power
Strictly speaking this category is not an operations problem, but a system with poor processing capacity hits its bottleneck easily. When coding, take care to avoid unnecessary overhead, especially system calls. A few points I have summarized for reference (the timeout point is illustrated in the sketch after this list):
- Parse configuration only once, then keep it resident in memory or in shared memory.
- Use static or singleton patterns for common utility classes such as metrics reporting and log writing, so they are initialized only once.
- Reduce the cost of acquiring and releasing FDs and connections as much as possible.
- For non-critical notifications where messages may be lost, use UDP and fire-and-forget.
- Do not print unnecessary logs, and rotate log files to prevent errors caused by logs growing too large.
- Keep timeouts on external interfaces as short as possible, so the process is not hung by problems in an external dependency; set a maximum processing time per request so that worst-case processing capacity is still guaranteed.
- Minimize time(), stat(), and other system calls.
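To make the timeout point concrete, here is a minimal C++ sketch (the 200 ms budget is an assumed value) that puts send/receive timeouts on the socket used to call an external interface, so a stalled dependency cannot hang the worker forever:

    // Illustrative sketch: short send/receive timeouts on a socket used to call
    // an external interface.
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) return 1;

        struct timeval tv;
        tv.tv_sec = 0;
        tv.tv_usec = 200 * 1000;  // assumed 200 ms budget for the external call
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
        setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));

        // ... connect(), send the request, then recv(); recv() now returns an
        // error (EAGAIN/EWOULDBLOCK) instead of blocking forever if the peer stalls.
        printf("socket %d configured with 200 ms send/recv timeouts\n", fd);
        close(fd);
        return 0;
    }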
For system calls, you can use strace -c -p <pid> to count the number of calls and the time spent in each system call made by the process, and use those statistics to optimize the business logic.
As an example, I once ran strace -c on a worker process and found that the stat call accounted for a surprisingly large share of CPU time. Tracing the process's system calls with strace then showed that it was using a statistics-reporting class. The class itself was initialized as a static, but its reporting interface constructed a new object on every call to sample, analyze, and report, and that object parsed the sampling configuration file and then the reporting configuration file each time. So even though the class was static, it did not help: the object was re-initialized on every call. The fix was to replace the object with a pointer that is initialized on the first call to the interface and reused directly on subsequent calls once it is non-NULL. After this optimization the system's CPU usage dropped by nearly 20%; cutting out unnecessary system calls greatly improved the system's processing capacity.
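A rough reconstruction of the before/after in C++ (the class and function names are invented for illustration; the real reporting class is not shown here):

    // Illustrative reconstruction of the fix; names are invented.
    #include <cstdio>

    // Stand-in for the reporting helper: in the real code its constructor parsed
    // the sampling config file and the reporting config file (lots of stat/open).
    class StatReporter {
    public:
        StatReporter() { printf("parsing config files (expensive)\n"); }
        void Send(const char* key, int value) { printf("report %s=%d\n", key, value); }
    };

    // Before the fix: a fresh object, and a fresh config parse, on every call.
    void ReportSlow(const char* key, int value) {
        StatReporter reporter;
        reporter.Send(key, value);
    }

    // After the fix: the object is created on the first call and reused once the
    // pointer is non-NULL, so the config files are parsed only once per process.
    void ReportFast(const char* key, int value) {
        static StatReporter* reporter = NULL;
        if (reporter == NULL) {
            reporter = new StatReporter();
        }
        reporter->Send(key, value);
    }

    int main() {
        ReportSlow("req", 1);  // parses config again
        ReportFast("req", 1);  // parses config once
        ReportFast("req", 2);  // reuses the pointer, no extra parse
        return 0;
    }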
The above summarizes the common operation and maintenance problems and how to locate them. Everyone no doubt has their own troubleshooting routine; here is the basic process I follow, for reference:
1. View Logs
By reading the logs you can first determine whether the problem lies in the business logic or in an external interface, and then verify the hypothesis against the code flow.
2. Check whether the FD count exceeds 100K (10w)
Run cat /proc/net/sockstat and watch the inuse value on the TCP line. If it keeps growing and never stabilizes, there is either an FD leak or an excessive number of connections; above 100K the system will misbehave.
3. Load Analysis
First watch the cpu, swap, and r fields in vmstat 1; the results fall broadly into the following situations:
If the cpu wa field is far greater than 0 and the swap si field is far greater than 0, the swap partition is being used. In that case look at the RES and SHR columns of the process in top; if RES is large and keeps growing, a memory leak is confirmed.
If the cpu usr and sys values are high, the r field is also relatively high, and swap usage is 0, the cause is probably a change in request volume. Check the request-count statistics to see whether they grew in the same proportion; if so, the increase in requests is confirmed as the cause, and comparing several days of request data tells you whether it is a sudden spike or natural growth.
Operations is a matter of many small things. All kinds of problems can appear while a system is running, but the key indicators related to access volume and processing capacity are few; as long as you keep an eye on those key points, locating problems is not difficult. More methods and experience are welcome; feel free to discuss.