Network Performance Tuning Documentation

Tags: systemtap, valgrind, nginx, server, dmesg

Preface

Tuning the network through sysctl is a common source of pain: do not change settings blindly if you do not understand them.

Sysctl/sysfs network parameter adjustment and calculation

1 Setting the maximum receive/send socket buffer size

net.core.rmem_max and net.core.wmem_max are related to the Bandwidth-Delay Product (BDP).

The network BDP values are calculated as follows:

BDP (bytes) = Bandwidth (bit/s) / 8 * RTT (s)

If the server has a bandwidth of 2 Gbit/s, an RTT of 10 ms and tcp_adv_win_scale = 2, then the BDP is 2G/8 * 0.01 = 2.5 MB, and the maximum read buffer should be set to 4/3 * 2.5 MB ≈ 3.3 MB. For the exact formulas, see "TCP Performance Calculation" below.
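
As a quick sanity check, those numbers can be reproduced with a small calculation; the bandwidth, RTT and tcp_adv_win_scale values below are just the figures from this example, not recommendations:

# Sketch: BDP and buffer size for the example above (2 Gbit/s, 10 ms RTT, tcp_adv_win_scale = 2)
awk 'BEGIN {
    bw  = 2000000000               # link bandwidth in bit/s (assumed)
    rtt = 0.010                    # round-trip time in seconds (assumed)
    s   = 2                        # net.ipv4.tcp_adv_win_scale
    bdp = bw / 8 * rtt             # bytes in flight needed to fill the pipe
    buf = bdp * 2^s / (2^s - 1)    # buffer that still leaves BDP bytes of usable window
    printf "BDP = %.0f bytes, suggested rmem_max = %.0f bytes\n", bdp, buf
}'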

Common configurations:

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728

net.ipv4.tcp_mem = 8388608 12582912 16777216 — these values are counted in pages of TCP memory; a page is generally 4 KB or 8 KB. Don't adjust this unless you know what you're doing.

tcp_mem can be calculated according to the following formula: default_window * connection_count / page_size

The first value is the threshold below which TCP is under no memory pressure, the second is the point at which TCP considers itself under memory pressure, and the third is the maximum. In general, once TCP memory use exceeds the second value, new TCP windows no longer grow; if it exceeds the third value, new TCP connections are rejected.

For example, for a production Nginx server, tcp_mem should be set to roughly:

nginx_active_connections * default_window / page_size
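
A sketch of that sizing, with purely illustrative numbers (10,000 active connections, the 87380-byte default window, 4 KB pages; the real page size comes from getconf PAGESIZE):

awk 'BEGIN {
    conns  = 10000     # assumed number of active Nginx connections
    window = 87380     # default tcp_rmem window in bytes
    page   = 4096      # page size in bytes
    printf "tcp_mem (in pages) = %.0f\n", window * conns / page
}'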

2 Nf_conntrack Settings

The relationship between net.ipv4.netfilter.ip_conntrack_max and /sys/module/nf_conntrack/parameters/hashsize:

ARCH = [32|64]

CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (ARCH/32)
HASHSIZE = CONNTRACK_MAX / 8 = RAMSIZE (in bytes) / 131072 / (ARCH/32)

For example, on a 64-bit server with 32 GB of RAM: conntrack_max = 1048576, hashsize = 131072.
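
A sketch of how those values could be derived from installed RAM and applied, assuming a 64-bit kernel (run as root; on newer kernels the sysctl is net.netfilter.nf_conntrack_max):

RAM_BYTES=$(awk '/MemTotal/ {printf "%.0f", $2 * 1024}' /proc/meminfo)   # total RAM in bytes
CONNTRACK_MAX=$((RAM_BYTES / 16384 / 2))                                 # RAMSIZE / 16384 / (ARCH/32), ARCH = 64
HASHSIZE=$((CONNTRACK_MAX / 8))
echo "conntrack_max=$CONNTRACK_MAX hashsize=$HASHSIZE"
sysctl -w net.netfilter.nf_conntrack_max=$CONNTRACK_MAX
echo $HASHSIZE > /sys/module/nf_conntrack/parameters/hashsize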

3 TCP Congestion Control

net.ipv4.tcp_congestion_control defaults to cubic.

It can be changed to another congestion-control algorithm, such as htcp, according to actual needs.
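
For reference, the available algorithms can be listed and the default switched at runtime (htcp must be compiled in or loadable as a module):

sysctl net.ipv4.tcp_available_congestion_control    # algorithms currently available
sysctl -w net.ipv4.tcp_congestion_control=htcp      # example: switch the default to htcp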

4 TCP port range

net.ipv4.ip_local_port_range — this mainly matters on the connection-initiating (sender) side of TCP.
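
On hosts that open many outbound connections, widening the ephemeral range is the usual change; the range below is only an example:

sysctl net.ipv4.ip_local_port_range                     # current range
sysctl -w net.ipv4.ip_local_port_range="1024 65000"     # example: widen the ephemeral port range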

5 TCP port Reuse

net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse are fine left unset, unless used as a temporary workaround. Do not enable both.
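
Before touching either switch, it is worth checking whether TIME_WAIT sockets are actually a problem:

ss -s                               # socket summary, including the timewait count
ss -tan state time-wait | wc -l     # count of TIME_WAIT sockets (includes one header line)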

6 TCP Timeout Settings

net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2 = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_thin_linear_timeouts = 0
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300

The defaults are usually sufficient and nothing needs adjusting here. Needing to adjust them generally indicates that the application or the network has problems.

7 TCP Processing Queue

net.core.netdev_max_backlog — there is no exact formula for this parameter; the larger the bandwidth and the average RTT, the larger it should be. For a 10G NIC with RTT = 100 ms, net.core.netdev_max_backlog = 30000; for a 10G NIC with RTT = 200 ms, or a 40G NIC with RTT = 50 ms, net.core.netdev_max_backlog = 250000.

Only when you see "TCP: drop open request" in the kernel log do you need to increase tcp_max_syn_backlog.

net.ipv4.tcp_max_orphans limits the number of sockets not attached to any file descriptor. Do not make it smaller; increasing it is fine. But remember that each orphan can eat up to ~64 KB of unswappable memory.
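
Before raising these, it may help to confirm that packets really are being dropped; a rough check (the second column of softnet_stat is the per-CPU count of packets dropped because the backlog was full, in hex):

cat /proc/net/softnet_stat                     # 2nd column per CPU = backlog drops (hex)
dmesg | grep -i "drop open request"            # sign that tcp_max_syn_backlog is too small
sysctl -w net.core.netdev_max_backlog=30000    # example value for a 10G NIC with ~100 ms RTT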

8 TCP Properties

Some people recommend setting net.ipv4.tcp_timestamps and net.ipv4.tcp_sack to 0 to reduce CPU overhead, but in the real world the defaults (both enabled) are useful.

TCP Performance Calculation

For example:

net.ipv4.tcp_rmem = 4096        87380   2067968

The receive window (the default value in tcp_rmem) is 87380.

The buffer-overhead calculation formula is:

if tcp_adv_win_scale > 0 {
    buf = window / 2^tcp_adv_win_scale
} else {
    buf = window - window / 2^(-tcp_adv_win_scale)
}

The part of tcp_rmem actually usable as the receive window for network transmission is window - buf.

The TCP performance Calculation formula is:

if tcp_adv_win_scale > 0 {
    speed = (window - window / 2^tcp_adv_win_scale) / RTT
} else {
    speed = (window / 2^(-tcp_adv_win_scale)) / RTT
}

If RTT = 150 ms and tcp_adv_win_scale = 2, using the default value of tcp_rmem, the maximum throughput is

(87380 - 87380/2^2) / 0.150 = 436900 bytes/s

Throughput is also capped by the link: speed <= Bandwidth/8, i.e. the usable window must not exceed the BDP (Bandwidth/8 * RTT).

So, working backwards from the BDP to derive the buffer:

(window - window/2^tcp_adv_win_scale) / RTT = Bandwidth/8
window = Bandwidth/8 * RTT * 2^tcp_adv_win_scale / (2^tcp_adv_win_scale - 1)
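
Both formulas can be checked numerically; the sketch below reproduces the 87380-byte example and then inverts the formula for a hypothetical 100 Mbit/s, 150 ms path:

awk 'BEGIN {
    s = 2; rtt = 0.150
    window = 87380                               # default tcp_rmem window
    printf "max speed  = %.0f bytes/s\n", (window - window / 2^s) / rtt
    bw = 100000000                               # assumed 100 Mbit/s link
    printf "needed win = %.0f bytes\n", bw / 8 * rtt * 2^s / (2^s - 1)
}'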

In fact, the initial window size does not follow the value of tcp_rmem; it is related to the MSS.

The role of TCP initial win
int init_cwnd = 4;
if (mss > 1460*3)
    init_cwnd = 2;
else if (mss > 1460)
    init_cwnd = 3;
if (*rcv_wnd > init_cwnd*mss)
    *rcv_wnd = init_cwnd*mss;

From kernel 3.x onward, the initial window was enlarged to 10 MSS.

Many tuning articles recommend modifying the default window with the ip route command.

# send window
ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
# receive window
ip route change default via 192.168.1.1 dev eth0 proto static initrwnd 10
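
The change can be verified on the route afterwards:

ip route show default    # the default route should now list "initcwnd 10 initrwnd 10"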

A bigger window means a bigger TCP buffer, which means more packets are flushed out from the buffer in one burst.

About the network card ring buffer

ethtool -g eth1

Ring parameters for eth1:
Pre-set maximums:
RX:             2040
RX Mini:        0
RX Jumbo:       8160
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       0
TX:             255

You can modify the ring buffer according to actual needs, but do not change it blindly: a large ring buffer adds extra network latency.

The ring buffer stores descriptors pointing to SKBs (socket kernel buffers). For example, suppose a NIC transmits at 5 Mbit/s and its MTU is 1500, so each SKB is up to 1500 bytes (12000 bits). The ring buffer can be viewed as a FIFO queue, so with 255 slots the worst-case queueing delay is (254 * 12000)/5000000 = 0.6096 s.

Changing the ring buffer is a trade-off between network latency and throughput: a large ring buffer reduces packet loss, but network latency can become significant.
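
If measurements do justify a change, the ring can be resized with ethtool within the pre-set maximums shown above; the RX value here is only an example:

ethtool -g eth1                  # current vs. pre-set maximum ring sizes
ethtool -G eth1 rx 512 tx 255    # example: enlarge the RX ring within the reported maximum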

In most cases, the default factory settings are appropriate.

The Driver queue is actually the ring buffer of the NIC.

Queuing disciplines are the qdiscs configured with tc. Different qdisc settings have different effects (http://www.tldp.org/HOWTO/Traffic-Control-HOWTO/classless-qdiscs.html). The length of this queue is determined by the NIC's txqueuelen parameter.

$ ip addr

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

/sbin/ifconfig ethN txqueuelen 10000 — this modifies the length of the NIC's transmit queue, allowing up to 10,000 packets to be held in it.

Setting /sbin/ifconfig ethN txqueuelen 10000 on a 10G NIC is an empirical value cited in production references. In practice, only change it if you see a lot of overruns or drops.
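
On modern systems the same change can be made with ip, and the overruns/drops worth watching are in the interface statistics (eth0 below is just a placeholder):

ip link set dev eth0 txqueuelen 10000    # equivalent to the ifconfig command above
ip -s link show eth0                     # watch the "dropped" and "overrun" counters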

Linux Performance Checklist

From: http://www.brendangregg.com/usemethod/use-linux.html

Physical Resources

| Component | Type | Metric |
|-----------|------|--------|
| CPU | utilization | system-wide: vmstat 1, "us" + "sy" + "st"; sar -u, sum fields except "%idle" and "%iowait"; dstat -c, sum fields except "idl" and "wai"; per-cpu: mpstat -P ALL 1, sum fields except "%idle" and "%iowait"; sar -P ALL, same as mpstat; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1] |
| CPU | saturation | system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, e.g. SystemTap schedtimes.stp "queued(us)" [3] |
| CPU | errors | perf (LPE) if processor-specific error events (CPC) are available; e.g. AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4] |
| Memory capacity | utilization | system-wide: free -m, "Mem:" (main memory), "Swap:" (virtual memory); vmstat 1, "free" (main memory), "swap" (virtual memory); sar -r, "%memused"; dstat -m, "free"; slabtop -s c for kmem slab usage; per-process: top/htop, "RES" (resident main memory), "VIRT" (virtual memory), "Mem" for system-wide summary |
| Memory capacity | saturation | system-wide: vmstat 1, "si"/"so" (swapping); sar -B, "pgscank" + "pgscand" (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor faults, or dynamic tracing [5]; OOM killer: dmesg \| grep killed |
| Memory capacity | errors | dmesg for physical failures; dynamic tracing, e.g. SystemTap uprobes for failed malloc()s |
| Network interfaces | utilization | sar -n DEV 1, "rxKB/s"/max, "txKB/s"/max; ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, "bytes" RX/TX tput/max; nicstat "%Util" [6] |
| Network interfaces | saturation | ifconfig, "overruns", "dropped"; netstat -s, "segments retransmited"; sar -n EDEV, *drop and *fifo metrics; /proc/net/dev, RX/TX "drop"; nicstat "Sat" [6]; dynamic tracing for other TCP/IP stack queueing [7] |
| Network interfaces | errors | ifconfig, "errors", "dropped"; netstat -i, "RX-ERR"/"TX-ERR"; ip -s link, "errors"; sar -n EDEV, "rxerr/s", "txerr/s"; /proc/net/dev, "errs", "drop"; extra counters may be under /sys/class/net/...; dynamic tracing of driver function returns |
| Storage device I/O | utilization | system-wide: iostat -xz 1, "%util"; sar -d, "%util"; per-process: iotop; pidstat -d; /proc/PID/sched "se.statistics.iowait_sum" |
| Storage device I/O | saturation | iostat -xnz 1, "avgqu-sz" > 1, or high "await"; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes) |
| Storage device I/O | errors | /sys/devices/.../ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [8] |
| Storage capacity | utilization | swap: swapon -s; free; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: "df -h" |
| Storage capacity | saturation | not sure this one makes sense - once it's full, ENOSPC |
| Storage capacity | errors | strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS |
| Storage controller | utilization | iostat -xz 1, sum devices and compare to known IOPS/tput limits per card |
| Storage controller | saturation | see storage device saturation, ... |
| Storage controller | errors | see storage device errors, ... |
| Network controller | utilization | infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces |
| Network controller | saturation | see network interface saturation, ... |
| Network controller | errors | see network interface errors, ... |
| CPU interconnect | utilization | LPE (CPC) for CPU interconnect ports, tput / max |
| CPU interconnect | saturation | LPE (CPC) for stall cycles |
| CPU interconnect | errors | LPE (CPC) for whatever is available |
| Memory interconnect | utilization | LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs. remote counters |
| Memory interconnect | saturation | LPE (CPC) for stall cycles |
| Memory interconnect | errors | LPE (CPC) for whatever is available |
| I/O interconnect | utilization | LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/... |
| I/O interconnect | saturation | LPE (CPC) for stall cycles |
| I/O interconnect | errors | LPE (CPC) for whatever is available |

Notes

[1] There can be some oddities with %CPU from top/htop in virtualized environments; I'll update with details later when I can. CPU utilization: a single hot CPU can be caused by a single hot thread, or a mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel. The uptime "load average" (or /proc/loadavg) wasn't included for CPU metrics since Linux load averages include tasks in the uninterruptible state (usually I/O).

[2] The man page for vmstat describes "r" as "the number of processes waiting for run time", which is either incorrect or misleading (on recent Linux distributions it reports threads that are waiting plus threads that are running on-CPU; it is just the waiting threads in other OSes).

[3] There may be a way to measure per-process scheduling latency with perf's sched:sched_process_wait event, otherwise perf probe to dynamically trace the scheduler functions, although the overhead under high load to gather and post-process many (hundreds of) thousands of events per second may make this prohibitive. SystemTap can aggregate per-thread latency in-kernel to reduce overhead, although, last I tried schedtimes.stp (on FC16) it produced thousands of "unknown transition:" warnings. LPE == Linux Performance Events, aka perf_events. This is a powerful observability toolkit that reads CPC and can also use static and dynamic tracing. Its interface is the perf command. CPC == CPU Performance Counters (aka "Performance Instrumentation Counters" (PICs) or "Performance Monitoring Unit events" (PMUs) or "Hardware Events"), read via programmable registers on each CPU by perf (which it was originally designed to do). These have traditionally been hard to work with due to differences between CPUs. LPE perf makes life easier by providing aliases for commonly used counters. Be aware that there are usually many more made available by the processor, accessible by providing their hex values to perf stat -e. Expect to spend some quality time (days) with the processor vendor manuals when trying to use these.

[4] There aren't many error-related events in the recent Intel and AMD processor manuals; be aware that public manuals may not show a complete list of events.

[5] The goal is a measure of memory capacity saturation - the degree to which a process is driving the system beyond its ability (and causing paging/swapping). High fault latency works well, but there isn't a standard LPE probe or existing SystemTap example of this (roll your own using dynamic tracing). Another metric that may serve a similar goal is the minor-fault rate by process, which can be watched from /proc/PID/stat. This should be available in htop as MINFLT.

[6] Tim Cook ported nicstat to Linux; it can be found on SourceForge or his blog.

[7] Dropped packets are included as both saturation and error indicators, since they can occur due to both types of events.

[8] This includes tracing functions from different layers of the I/O subsystem: block device, SCSI, SATA, IDE, ... Some static probes are available (LPE "scsi" and "block" tracepoint events), else use dynamic tracing. CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle). I/O interconnect: this includes the CPU-to-I/O-controller busses, the I/O controller(s), and device busses (e.g. PCIe). Dynamic tracing allows custom metrics to be developed, live in production. Options on Linux include: LPE's "perf probe", which has some basic functionality (function entry and variable tracing), although in a trace-and-dump style that can cost performance; SystemTap (in my experience, almost unusable on CentOS/Ubuntu, but much more stable on Fedora); dtrace4linux, either the Paul Fox port (which I've tried) or the OEL port (which Adam has tried), both projects very much in beta.

Software Resources

| Component | Type | Metric |
|-----------|------|--------|
| Kernel mutex | utilization | with CONFIG_LOCK_STATS=y, /proc/lock_stat "holdtime-total" / "acquisitions" (also see "holdtime-min", "holdtime-max") [8]; dynamic tracing of lock functions or instructions (maybe) |
| Kernel mutex | saturation | with CONFIG_LOCK_STATS=y, /proc/lock_stat "waittime-total" / "contentions" (also see "waittime-min", "waittime-max"); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing) |
| Kernel mutex | errors | dynamic tracing (e.g. recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash |
| User mutex | utilization | valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock-to-unlock function time |
| User mutex | saturation | valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, ...) user stacks for spins |
| User mutex | errors | valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ... |
| Task capacity | utilization | top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max) |
| Task capacity | saturation | threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing |
| Task capacity | errors | "can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM |
| File descriptors | utilization | system-wide: sar -v, "file-nr" vs. /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd \| wc -l vs. ulimit -n |
| File descriptors | saturation | does this one make sense? I don't think there is any queueing or blocking, other than on memory allocation |
| File descriptors | errors | strace errno == EMFILE on syscalls returning fds (e.g. open(), accept(), ...) |

Notes

[8] Kernel lock analysis used to be done via lockmeter, which had an interface called "lockstat".

What's Next

See the USE Method for follow-up strategies after identifying a possible bottleneck. If you have completed this checklist but still have a performance issue, move on to other strategies: drill-down analysis and latency analysis.

