Linux Web Server Faults and Distributed Systems

Last Update:2017-09-14 Source: Internet

Author: User

Tags: high CPU usage


Our company's website recently suffered two incidents. The Apache processes running PHP on the front end deadlocked in large numbers, so the servers showed no load yet could not serve requests. Restarting the service did not help, but the site recovered on its own after a few hours. These systems had been running stably for several months, and nothing like this had happened before even though they were upgraded continuously.

When the same situation occurred a second time one week later, the company organized staff to troubleshoot but found nothing. Because I do not have login permission on the web machines, I asked my O&M colleagues to perform the following steps to find the cause:

1. Attach to any httpd process with strace -p xxxx and look at what the process is doing. If it is not blocked on futex, pick another process and check again.

This step alone can reveal many causes, such as an endless loop. If strace shows no system calls being made, the process is usually stuck in an endless loop with no exit condition, which is generally accompanied by high CPU usage. If it shows the same system calls repeating continuously, the cause of the loop is easier to find. Either way, it is a logic error.

After checking, most processes were blocked on futex. Write down the file descriptor involved in the blocked call, for example, 10.

2. List the files opened by the process with lsof -p xxxx and find the resource corresponding to descriptor 10. In our case it was a session file, so we could infer that a session lock was causing the service processes to hang.

Write down the session file path, for example, /tmp/abcdefg.

3. Run lsof /tmp/abcdefg to see the IDs of all processes that have the file open.

4. Attach to each process ID found in step 3 with strace -p xxxx. Generally one of them is not blocked on futex. Write down that process ID, for example 1111, and its current blocking operation, which usually involves a file descriptor, for example 11. Alternatively, this process may be in an endless loop, which you can spot by watching the output.

5. List the files opened by that process with lsof -p 1111 and find the resource corresponding to file descriptor 11, which may be a socket connection or some other resource. The real cause is usually found here.
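When lsof is not installed, the two lookups used in steps 2, 3, and 5 can be approximated by reading /proc directly. The following is only a minimal Python sketch, assuming a Linux host with /proc mounted and enough permission to read the target processes; the function names open_files and holders are made up for this illustration and are not part of any tool mentioned above.

```python
import os


def open_files(pid):
    """Map each file descriptor of `pid` to its target, similar to `lsof -p pid`."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to a path, "socket:[inode]", "pipe:[inode]", etc.
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were reading
    return targets


def holders(path):
    """Return the PIDs that have `path` open, similar to `lsof path`."""
    real = os.path.realpath(path)
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            if real in open_files(int(entry)).values():
                pids.append(int(entry))
        except OSError:
            continue  # process exited, or we lack permission to inspect it
    return sorted(pids)
```

With these helpers, open_files(1111).get(11) would show what descriptor 11 of process 1111 points at, and holders("/tmp/abcdefg") would list the processes sharing the session file. Processes owned by other users are silently skipped unless the script runs as root.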

Troubleshooting showed that the company's servers had poor network quality at a certain point in time. While a connection to a remote socket server was being closed automatically, the packets closing it were lost, and several retransmissions also failed. As a result, the remote end no longer had the connection, but the web server still showed it as ESTABLISHED. The PHP code keeps reading data from the socket until the connection is closed, so it never returned. In many cases, security policy does not allow you to obtain server access, so it is very important to be able to identify problems through simple methods like these.
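A read loop of the "receive until the peer disconnects" kind can be protected against such half-open connections by giving every read an idle timeout. The sketch below is in Python rather than the PHP of the original system, and the function names and the default timing values are assumptions chosen for illustration. The keepalive helper uses Linux-specific TCP options and is guarded accordingly.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Ask the kernel to probe an idle TCP connection so a dead peer is noticed."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained knobs below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def read_until_closed(sock, idle_timeout):
    """Read until the peer closes, but give up if no data arrives for idle_timeout seconds."""
    sock.settimeout(idle_timeout)
    chunks = []
    while True:
        try:
            chunk = sock.recv(4096)
        except socket.timeout as exc:
            # The connection may still look ESTABLISHED locally although the peer is gone.
            raise TimeoutError(f"no data for {idle_timeout}s; peer may be gone") from exc
        if not chunk:
            return b"".join(chunks)  # orderly close by the peer
        chunks.append(chunk)
```

The idle timeout bounds how long a worker process can be stuck on one connection, which in turn bounds how long it can hold a session lock that other processes are waiting for.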

Another example is the PHP upload program. Because distributed storage is used, data is sent to the storage server over a socket. The PHP program used stream_set_timeout to set the timeout. However, analysis after an incident found that it only affects reads and has no effect on writes. The code was therefore changed to use socket_send, with setsockopt used to set the timeout.
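The same idea, a send timeout set through setsockopt, can be sketched in Python as follows. This is not the original PHP code. The helper names set_send_timeout and send_all are hypothetical, and the struct format assumes the native Linux layout of struct timeval (two C longs), which may differ on other platforms.

```python
import socket
import struct


def set_send_timeout(sock, seconds):
    """Set SO_SNDTIMEO so a blocking send gives up after `seconds`."""
    sec = int(seconds)
    usec = int((seconds - sec) * 1_000_000)
    # Assumption: struct timeval is two native longs, as on 64-bit Linux.
    timeval = struct.pack("ll", sec, usec)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDTIMEO, timeval)


def send_all(sock, data):
    """Send all of `data`, raising TimeoutError if the peer stops draining it."""
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        try:
            sent += sock.send(view[sent:])
        except BlockingIOError as exc:
            # The kernel reports an expired SO_SNDTIMEO as EAGAIN.
            raise TimeoutError(f"send stalled after {sent} bytes") from exc
    return sent
```

Without the timeout, a write to a storage server that has stopped reading blocks as soon as the socket send buffer fills, and the upload process hangs in the same way as the read loop described earlier.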

This kind of situation probably arises often, but most of the time the network is fine, so people do not want to write complicated code to handle the exception, or they treat it as a minor incident. For large websites, however, the severity of an incident is generally determined by how many people were affected, for how long, and how many features were unavailable.

No incident is too small. We should learn from each one to keep it from happening again. The company has more than 10 operational incidents every week, and I have had three myself since I joined. They were mainly caused by problems with the old system architecture or database stability that I inherited when I took over, but some were code bugs. Bugs can only be reduced; guaranteeing no bugs at all is feasible only in theory, and the cost may be too high. Adding fault tolerance and alarm monitoring costs relatively little. But once a problem is detected, the system must be fixed and upgraded promptly, and unit testing is very important at that point.

In short, a distributed system is a group of servers working together. Its main goals are load balancing, data balancing, and redundancy. The main difficulties are synchronizing data, preventing chain reactions caused by downtime, reducing communication and dependencies between servers, and handling tasks that pile up or fail because some node is slow. Another point is not to trust the network, which is the most unstable part, even within the same data center. Over the past few months we have improved quality by improving the development process, standardizing unit tests, adding fault tolerance mechanisms, eliminating single points of failure, improving the monitoring systems, and refining statistical analysis, but there is still a long way to go. The system design stage focuses on the general direction, while the development stage also needs to pay close attention to small details such as logging: once the system is live, the log may be the richest and most reliable source of information.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
