TIME_WAIT surge and MySQL connection failures caused by a traffic increase

Source: Internet
Author: User

An application queries an interface on every request. The interface returns user information that the application uses to render different page effects. The general flow is as follows:

Application app (Telecom) -> memcache -> Telecom custom interface -> master-DB

Application app (Netcom) -> Netcom custom interface -> slave-DB

The interface runs on PHP (CGI) + nginx. It had been running for a long time without any exception.

 

The application calls the M interface, which then queries the database. The database uses master-slave replication: each data center reads from its own local database, and all writes go to the master-DB.

Note that the telecom data center has a memcache layer, but the Netcom data center never had one (traffic there is low and the caches of the two data centers are not synchronized, so no cache has been used in the Netcom data center since launch).

A recent release removed the memcache layer in the telecom data center, and traffic to the interface there increased sharply.

 

View PV statistics:

[xxx@xxxxxx ~]$ xxx.sh "find /path -name 'access*' | xargs wc -l | awk 'END {print $1}'" fe

CMD: find /path 'access*' | xargs wc -l | awk 'END {print }'

Type: fe

 

Server1

2 x A total (A is the total on the 28th)

-----------

Server2

2 x B total (B is the total on the 28th)

-----------

Server3

C Total

-----------

Server4

D Total

....

Other servers ....
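
The PV count above reads the last line of the wc -l output. When find returns more files than fit into one xargs invocation, wc runs several times and only the last batch's total is printed. The following is a minimal sketch of a count that does not depend on the number of files; the function name count_log_lines and the demo file names are hypothetical, not taken from the original script.

```shell
#!/usr/bin/env bash
# count_log_lines DIR PATTERN
# Prints the total number of lines across all regular files under DIR
# whose name matches PATTERN. The files are concatenated before being
# counted, so the result does not depend on how many files there are.
count_log_lines() {
    local dir=$1 pattern=$2
    find "$dir" -type f -name "$pattern" -exec cat {} + | wc -l | tr -d ' '
}

# Self-contained demo on throwaway data (hypothetical file names).
demo_dir=$(mktemp -d)
printf 'GET /a\nGET /b\n' > "$demo_dir/access.log"
printf 'GET /c\n' > "$demo_dir/access.log.1"
count_log_lines "$demo_dir" 'access*'   # prints 3
rm -rf "$demo_dir"
```

On a real server the call would be something like count_log_lines /path 'access*', with /path standing in for the log directory as in the command above.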


The traffic of the Netcom data center has been relatively stable, and no problems have occurred.

After traffic to the China Telecom M interface surged yesterday, problems appeared. Machine load in the telecom data center rose more than 40-fold and QPS rose 15-fold, and the load never fell below 1.

The application also raised brief timeout alarms, but PHP and nginx were still running well, restarts were fast, and the terminal was not noticeably sluggish.

 

Traffic was 9 times that of the previous day!

 

 

After the launch, error.log grew to 3 GB!

All of the errors were Can't connect to MySQL server on '1.1.1.1' (99)

Connecting manually with mysql -hx.x -u -p failed as well.

According to the DBA, however, database monitoring showed nothing abnormal, and other departments' applications using the same database had no problems.

I suspected that the high server load was producing a large number of TIME_WAIT connections, which in turn caused the MySQL connections to time out or fail.

 

The following was observed in monitoring at 00:13 pm:
TIME_WAIT connections increased 300-fold (I suspected this was related)

ESTABLISHED connections increased 10-fold
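
The TIME_WAIT and ESTABLISHED figures above come from monitoring, but the same counts can be taken by hand on the interface machine. The following is a minimal sketch that tallies TCP states from netstat-style output; the function name count_tcp_states and the sample addresses are hypothetical.

```shell
#!/usr/bin/env bash
# count_tcp_states
# Reads `netstat -ant`-style output on stdin and prints one
# "STATE COUNT" line per TCP state, sorted by state name.
count_tcp_states() {
    awk '/^tcp/ { states[$NF]++ }
         END { for (s in states) print s, states[s] }' | sort
}

# Demo on a canned sample with made-up addresses.
count_tcp_states <<'EOF'
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.5:41234          1.1.1.1:3306            TIME_WAIT
tcp        0      0 10.0.0.5:41235          1.1.1.1:3306            TIME_WAIT
tcp        0      0 10.0.0.5:41236          1.1.1.1:3306            ESTABLISHED
EOF
```

On a live host the input would be piped in, for example netstat -ant | count_tcp_states. Comparing the output before and after a traffic change shows whether TIME_WAIT is growing faster than ESTABLISHED.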

 

By rights this should not have happened: traffic on the Netcom side of the M interface is very stable and has never caused an exception, yet the interface in the telecom data center could not cope with a 2x increase and its load kept climbing.

To stop the traffic that the cache used to absorb from breaking the interface, two files were re-uploaded and memcache in the telecom data center was re-enabled.

The load slowly came down afterwards, but the MySQL errors did not decrease much.

There were still a large number of errors on the machine. A log excerpt:

FastCGI sent in stderr: "PHP Warning: mysql_connect(): Can't connect to MySQL server on '1.1.1.1' (99) in xxxxxx on line xxx
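
To confirm that every failure in error.log carries the same OS error code, the number in parentheses can be tallied. The following is a minimal sketch; the function name count_mysql_connect_errnos is hypothetical and the sample lines are made up in the shape of the excerpt above.

```shell
#!/usr/bin/env bash
# count_mysql_connect_errnos
# Reads error-log lines on stdin and prints "ERRNO COUNT" for every
# "Can't connect to MySQL server on '<host>' (<errno>)" message found,
# sorted by error number.
count_mysql_connect_errnos() {
    sed -n "s/.*Can't connect to MySQL server on '[^']*' (\([0-9]*\)).*/\1/p" |
        sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Demo on made-up log lines.
count_mysql_connect_errnos <<'EOF'
FastCGI sent in stderr: "PHP Warning: mysql_connect(): Can't connect to MySQL server on '1.1.1.1' (99) in xxxxxx on line xxx
FastCGI sent in stderr: "PHP Warning: mysql_connect(): Can't connect to MySQL server on '1.1.1.1' (99) in xxxxxx on line xxx
some unrelated line
EOF
```

A single dominant code, such as 99 here, points at one cause on the client side. A mix of codes would suggest looking at the network or the database server as well.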

 

Later, after talking with the DBA, I found that /etc/sysctl.conf differed between the Telecom and Netcom data centers.

The Netcom data center had the following additional lines:

net.ipv4.tcp_syncookies = 1

net.ipv4.tcp_tw_reuse = 1

net.ipv4.tcp_tw_recycle = 1

net.ipv4.tcp_fin_timeout = 5
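
A configuration difference like this is easy to miss by eye. The following is a minimal sketch of how two copies of sysctl.conf could be compared once both have been fetched to one machine; the function names and the demo file contents (including the shared net.ipv4.ip_forward line) are hypothetical.

```shell
#!/usr/bin/env bash
# normalize_sysctl FILE
# Prints the settings in FILE as sorted "key=value" lines, dropping
# comment lines, blank lines and whitespace around "=".
normalize_sysctl() {
    sed -e '/^[[:space:]]*[#;]/d' \
        -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        -e 's/[[:space:]]*=[[:space:]]*/=/' \
        -e '/^$/d' "$1" | sort -u
}

# sysctl_only_in FILE_A FILE_B
# Prints the settings that appear in FILE_A but not in FILE_B.
sysctl_only_in() {
    local tmp_a tmp_b
    tmp_a=$(mktemp); tmp_b=$(mktemp)
    normalize_sysctl "$1" > "$tmp_a"
    normalize_sysctl "$2" > "$tmp_b"
    comm -23 "$tmp_a" "$tmp_b"
    rm -f "$tmp_a" "$tmp_b"
}

# Demo with made-up file contents for the two data centers.
telecom=$(mktemp); netcom=$(mktemp)
printf '%s\n' 'net.ipv4.ip_forward = 0' > "$telecom"
printf '%s\n' 'net.ipv4.ip_forward = 0' \
    'net.ipv4.tcp_syncookies = 1' 'net.ipv4.tcp_tw_reuse = 1' \
    'net.ipv4.tcp_tw_recycle = 1' 'net.ipv4.tcp_fin_timeout = 5' > "$netcom"
sysctl_only_in "$netcom" "$telecom"
rm -f "$telecom" "$netcom"
```

The demo prints the four tcp_* settings, which exist only in the second file. Running the comparison in both directions shows settings missing on either side.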

 

That was the cause. After the same configuration was applied to the Hangzhou telecom data center, the problem was solved. To summarize:


  1. Problem description

    • Because of a faulty launch, QPS rose more than 5-fold and load more than 40-fold. nginx + PHP stayed up, but error.log grew to 3 GB/day, all of it Can't connect to MySQL server on '*.*.*.*' (99)
    • After the faulty launch was fixed, error.log shrank, but TIME_WAIT still did not come down and database connections still failed.
  2. Troubleshooting
    • MySQL config? (No problem)

      • max_connect_errors = 50000 (no problem)
      • max_connections = 1000 (no problem)
      • max_user_connections = 950 (no problem)
    • OS config? (This was the problem; fixed with the following changes)
      • vi /etc/sysctl.conf
        // Edit
        net.ipv4.tcp_syncookies = 1
        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_tw_recycle = 1
        net.ipv4.tcp_fin_timeout = 5
        // Make the parameters take effect
        /sbin/sysctl -p
  3. Cause
    • The error "Can't connect to MySQL server on '*.*' (99)" is documented among the MySQL client errors. The number in parentheses, 99, is an OS error code, which perror decodes: $ perror 99
      OS error code 99: Cannot assign requested address. This error is reported by the local OS: it could not allocate a local address resource (most likely a port), so the socket could not be created.
    • Searching for "Cannot assign requested address" shows that the usual cause is a client opening connections too frequently: after each connection is closed, its local port stays in TIME_WAIT for a while and cannot be reused, so the client temporarily runs out of ports. This can be addressed by tuning the OS parameters.
  4. Reflection
    • Was this problem urgent? Yes, urgent!

      • Following the article "nginx + PHP generates a large number of time_wait" solved it.
    • Why not use MySQL persistent connections (pconnect)?
      • Persistent connections hold a large amount of MySQL resources. Under high concurrency, for example during personalization and promotion campaigns, too many connections lead to numerous connection failures and errors, and the number of connections is constrained by the Apache (nginx) ThreadsPerChild parameter.
    • Best practice for high concurrency?
      • Use short connections for Apache, nginx and MySQL. This produces more TIME_WAIT connections, but OS kernel parameters can be tuned to speed up TIME_WAIT reuse.
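
The port-exhaustion explanation in item 3 can be made concrete with a rough calculation: every short connection to the same database address ties up one local port until its TIME_WAIT expires, so the size of the local port range divided by the TIME_WAIT duration bounds the sustainable rate of new connections. The following is a minimal sketch; the function name max_new_conn_rate is hypothetical, and the 32768-60999 range and 60-second TIME_WAIT are assumed Linux values, not figures from the incident.

```shell
#!/usr/bin/env bash
# max_new_conn_rate LOW HIGH TIME_WAIT_SECONDS
# Prints the approximate number of new outbound connections per second
# that one client IP can sustain to a single server IP:port before
# every local port in [LOW, HIGH] is held in TIME_WAIT.
max_new_conn_rate() {
    local low=$1 high=$2 tw=$3
    echo $(( (high - low + 1) / tw ))
}

# Assumed values: a 32768-60999 port range and a 60-second TIME_WAIT.
# On a Linux host the actual range can be read from
# /proc/sys/net/ipv4/ip_local_port_range.
max_new_conn_rate 32768 60999 60   # prints 470
```

Under these assumptions a single interface machine runs out of ports at a few hundred new MySQL connections per second, which is consistent with errors appearing only after the cache was removed. Note that net.ipv4.tcp_tw_recycle was removed in Linux 4.12 and is known to break clients behind NAT, so on current kernels only net.ipv4.tcp_tw_reuse remains available for this purpose.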

