Linux TCP/IP Tuning for Scalability


Hi there! I'm Philip (@bluesmoon), the CTO of Lognormal. We're a performance company, and performance and scalability go hand in hand. Better scalability results in more consistent performance, and at Lognormal we like pushing our hardware as far as it'll go. Today's post is about some of the infrastructure we use and how we tune it to handle a large number of requests.

Our software stack has separate components to handle different tasks. In this post I'll only cover the parts that make up our beacon collection component and how we tune it. Only a few of the tuning points are specific to this component.

The Stack

(Side note: someone needs to start a coffee shop + co-working space called The Stack.) The beacon collector runs Linux at its base. We use a combination of Ubuntu 11.10 and 12.04, which for most purposes are the same. If you're going with a new implementation though, I'd suggest 12.04 (or at least the 3.x kernels). Slightly higher up is iptables, which we use to restrict inbound connections. This is mainly because we're hosted on shared infrastructure and need to restrict internal communication to hosts we trust. iptables is the cheapest way to do this, but it brings in a few caveats that we'll get to in the tuning section later. We then have nginx set up to serve HTTP traffic on ports 80 and 443 and do some amount of filtering (more on this later). Behind nginx is our custom node.js server that handles and processes beacons as they come in. It reads some configuration data from CouchDB and then sends these processed beacons out into the ether. Nginx and node talk to each other over a UNIX domain socket. That's about all that's relevant for this discussion, but at the heart of it, you'll see that there are lots of file handles and sockets in use at any point in time. A large part of this is due to the fact that nginx only uses HTTP/1.0 when it proxies requests to a back end server, which means it opens a new connection on every request rather than using a persistent connection.
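To make that concrete, here's a rough way to see how many file descriptors the front-end processes are holding on a running box. This is a sketch only; the process names and the need for sudo are assumptions about your particular setup:

# count open file descriptors held by each nginx and node process
for pid in $(pgrep 'nginx|node'); do
    echo "pid $pid ($(cat /proc/$pid/comm)): $(sudo ls /proc/$pid/fd | wc -l) open fds"
done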

What should we tune?

For the purposes of this post, I'll talk only about the tuning of two parts of our stack: Linux and iptables.

Open Files

Since we deal with a lot of file handles (each TCP socket requires a file handle), we need to keep our open file limit high. The current value can be seen using ulimit -a (look for open files). We set this value to 999999 and hope that we never need a million or more files open. In practice we never do. We set this limit by putting a file into /etc/security/limits.d/ that contains the following two lines:

*	soft	nofile	999999
*	hard	nofile	999999

(Side note: it took me a few minutes of trying to convince Markdown that those asterisks were meant to be printed as asterisks.) If you don't do this, you'll run out of open file handles and could see one or more parts of your stack die.
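To confirm that the new limit is actually in effect, a quick sanity check looks something like this (the nginx process name here is just whatever your front end happens to be):

# the limit for your current shell (log out and back in after editing limits.d)
ulimit -n

# the limit a running daemon actually got, straight from the kernel
sudo grep 'Max open files' /proc/$(pgrep -o nginx)/limits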

Ephemeral Ports

The second thing to do is increase the number of ephemeral ports available to your application. By default this is all ports from 32768 to 61000. We changed this to use all ports from 18000 to 65535. Ports below 18000 are reserved for current and future use by the application itself. This may change in the future, but it is sufficient for what we need right now, largely because of what we do next.
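Before changing anything, you can check what you're starting from (nothing here is specific to our stack):

# the current ephemeral port range (the stock default is 32768 through 61000)
cat /proc/sys/net/ipv4/ip_local_port_range

# a quick summary of sockets in use, to see how close you are to that ceiling
ss -s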

TIME_WAIT State

TCP connections go through various states during their lifetime. There's the handshake that goes through multiple states, then the ESTABLISHED state, then a whole bunch of states for either end to terminate the connection, and finally a TIME_WAIT state that lasts a really long time. If you're interested in all the states, read through the netstat man page, but right now the only one we care about is the TIME_WAIT state, and we care about it mainly because it's so long.

By default, a connection is supposed to stay in the TIME_WAIT state for twice the MSL. Its purpose is to make sure any lost packets that arrive after a connection is closed do not confuse the TCP subsystem (the full details of this are beyond the scope of this article, but ask me if you'd like details). The default MSL is 60 seconds, which puts the default TIME_WAIT timeout value at 2 minutes. This means you'll run out of available ports if you receive more than about 400 requests a second, or, if we look back at how nginx does proxies, this actually translates to about 200 requests per second. Not good for scaling.

We fixed this by setting the timeout value to 1 second. I'll let that sink in a bit. Essentially we reduced the timeout value by 99.16%. This is a huge reduction, and not to be taken lightly. Any documentation you read will recommend against it, but here's why we did it.

Again, remember that the point of the TIME_WAIT state is to avoid confusing the transport layer. The transport layer would get confused if it received an out of order packet on a currently established socket, and would send a reset packet in response. The key here is the term established socket. A socket is a tuple of 4 terms: the source and destination IPs and ports. For our purposes, our server IP is constant, so that leaves 3 variables. Our port numbers are recycled, and we have 47535 of them. That leaves the other end of the connection. For a collision to take place, we'd have to get a new connection from an existing client, that client would have to use the same port number it used for the earlier connection, and our server would have to assign the same port number to this connection as it did before. Given that we use persistent HTTP connections between clients and nginx, the probability of this happening is low enough that we can ignore it. 1 second is a long enough TIME_WAIT timeout.

The two TCP tuning parameters were set using sysctl by putting a file into /etc/sysctl.d/ with the following lines:

net.ipv4.ip_local_port_range = 18000 65535
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 1
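To see the effect, you can count the sockets sitting in TIME_WAIT before and after, and load the new settings without waiting for a reboot. The drop-in file name below is just an example of what you might call it:

# rough count of sockets currently in TIME_WAIT (the header line adds one)
ss -tan state time-wait | wc -l

# apply the settings now; the file under /etc/sysctl.d/ keeps them across reboots
sudo sysctl -p /etc/sysctl.d/10-tcp-tuning.conf
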
Connection Tracking

The next parameter we looked at was connection tracking. This is a side effect of using iptables. Since iptables needs to allow two-way communication between established HTTP and SSH connections, it needs to keep track of which connections are established, and it puts these into a connection tracking table. This table grows. And grows. And grows.

You can see the current size of this table using sysctl net.netfilter.nf_conntrack_count and its limit using sysctl net.nf_conntrack_max. If count crosses max, your Linux system will stop accepting new TCP connections and you'll never know about this. The only indication that this has happened is a single line hidden somewhere in /var/log/syslog saying that you're out of connection tracking entries. One line, once, when it first happens.

A better indication is if count is always very close to max. You might think, "Hey, we've set max exactly right," but you'd be wrong. What you need to do (or at least, that's what you need to do first) is increase max. Keep in mind though, that the larger this value, the more RAM the kernel will use to keep track of these entries. RAM that could be used by your application. We started down this path, increasing net.nf_conntrack_max, but soon we were just pushing it up every day. It turned out that connections that were getting in there were never getting out.
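A check like this belongs in a monitoring script; the 80% threshold is just a rule of thumb I picked, not something magical:

# current conntrack usage vs. the ceiling
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

# a one-liner for cron: only speak up when the table is getting full
count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
[ $((count * 100 / max)) -ge 80 ] && echo "conntrack table at ${count}/${max}"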

nf_conntrack_tcp_timeout_established

It turns out that there's another timeout value you need to be concerned with: the established connection timeout. Technically this should only apply to connections that are in the ESTABLISHED state, and a connection should get out of this state when a FIN packet goes through in either direction. This doesn't appear to happen, and I'm not entirely sure why. So how long do connections stay in this table then? It turns out that the default value for nf_conntrack_tcp_timeout_established is 432000 seconds. I'll wait for you to do the long division... Fun times.

I changed the timeout value to ten minutes (600 seconds) and in a few days' time I noticed conntrack_count go down steadily until it sat at a very manageable level of a few thousand. We did this by adding another line to the sysctl file:

net.netfilter.nf_conntrack_tcp_timeout_established=600
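If you're curious what's actually sitting in that table before and after the change, something like this shows the breakdown by TCP state (it assumes the conntrack-tools package is installed; reading /proc/net/nf_conntrack directly gives you the same data):

# count conntrack entries by TCP state (ESTABLISHED, TIME_WAIT, and so on)
sudo conntrack -L 2>/dev/null | awk '$1 == "tcp" {print $4}' | sort | uniq -c | sort -rn
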
Speed Bump

At this point we were in a pretty good state. Our beacon collectors ran for months without a problem (not counting scheduled reboots), until a few days ago one of them just stopped responding to any kind of network traffic. No ping responses, no ACK packets to a SYN, nothing. All established SSH and HTTP connections were terminated and the box was doing nothing. I still had console access, but couldn't tell what was wrong. The system was using less than 1% CPU and hardly any RAM. All processes that were supposed to be running were running, but nothing was coming in or going out. I looked through syslog, and found one obscure message repeated several times.

ipv4: dst cache overflow

So there were other messages, but this was the one that mattered. I did a bit of searching online, and found something about an rt_cache leak in 2.6.18. We're on 3.5.2, so it shouldn't have been a problem, but I investigated anyway. The details of the post above related to 2.6, and 3.5 is different, with no ip_dst_cache entry in /proc/slabinfo, so I started searching for its equivalent on 3.5, which is when I came across Vincent's post on the IPv4 route cache. This is an excellent resource for understanding the route cache in Linux, and that's where I found out about the lnstat command. This is something that needs to be added to any monitoring and stats gathering scripts. The post also suggests that the dst cache gc routines are complicated, and a bug anywhere could result in a leak, one which could take several weeks to become apparent. From what I can tell, there doesn't appear to be an rt_cache leak. The number of cache entries increases and decreases with traffic, but I'll keep monitoring it to see if that changes over time.
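For the record, this is roughly what I mean by adding lnstat to a stats script; the field names come from /proc/net/stat/rt_cache, and the interval and count are arbitrary:

# sample route cache entry counts and gc activity every 10 seconds, six times
lnstat -f rt_cache -k entries,gc_total,gc_dst_overflow -i 10 -c 6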

Other things to tune

There are a few other things you might want to tune, but they're becoming less of an issue as base system configs evolve.

TCP Window Sizes

This is related to TCP slow start, and I'd love to go into the details, but our friends Sajal and Aaron over at CDN Planet have already done an awesome job explaining how to tune TCP initcwnd for optimum performance. This isn't an issue for us because the 3.5 kernel's default window size is already set to 10.
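If you're on an older kernel where the initial congestion window is still 3, the usual knob is on the route itself. Here's a sketch; the gateway and interface below are placeholders, so check your own with ip route show:

# raise initcwnd on the default route; 10.0.0.1 and eth0 are placeholders
sudo ip route change default via 10.0.0.1 dev eth0 initcwnd 10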

Window size after idle

Related to the above is the sysctl setting net.ipv4.tcp_slow_start_after_idle. This tells the system whether it should start at the default window size only for new TCP connections or also for existing TCP connections that have been idle for too long (on 3.5, "too long" is 1 second, but see net.sctp.rto_initial for its value on your system). If you're using persistent HTTP connections, you're likely to end up in this state, so set net.ipv4.tcp_slow_start_after_idle=0 (just put it into the sysctl config file mentioned above).
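As with the other settings, you can flip it at runtime first and then let the sysctl.d file make it permanent:

# check the current value, then turn off slow start after idle right away
sysctl net.ipv4.tcp_slow_start_after_idle
sudo sysctl -w net.ipv4.tcp_slow_start_after_idle=0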

Endgame

After changing all these settings, a single quad core VM (though using only one core) with 1 Gig of RAM has been able to handle all of the load that's been thrown at it. We never run out of open file handles, never run out of ports, never run out of connection tracking entries, and never run out of RAM. We have several weeks before another one of our beacon collectors runs into the dst cache issue, and I'll be ready with the numbers when that happens. Thanks for reading, and let us know how these settings work out for you if you try them out. If you'd like to measure the real user impact of your changes, have a look at our real user measurement tool at LogNormal.
