Go TCP performance test, optimization results


In the previous test, without any optimization, C++ (16Gbps) was 4 times the performance of Go (4Gbps); see http://blog.csdn.net/win_lin/article/details/40744175

This time the Go TCP path was optimized, and the results are satisfying: a single Go process (7Gbps) is on par with C++ (8Gbps) and about half of C++ using writev (16Gbps), while Go on multiple CPUs (59Gbps) wins outright, several times faster than C++.

Test code: https://github.com/winlinvip/srs.go/tree/master/research/tcp

Note: the previous test ran on a virtual machine, while this one ran on a physical machine, so the results may differ slightly.

Why TCP

TCP is the foundation of network communication: the web is built on HTTP frameworks, HTTP is built on TCP, and RTMP is also built on TCP.

If a language can deliver sufficient TCP throughput, it has the foundation needed for building high-performance servers.

Before SRS 1.0 I researched Go and wrote a Go version of SRS, but its performance was about the same as Red5, so it was deleted.

Now a single SRS 2.0 process reaches 4Gbps of one-way network throughput (6,000 clients at a 522Kbps bitrate); if Go cannot reach that level, SRS cannot be rewritten in Go.

Platform

The test runs on a 24-CPU server, so CPU is not the bottleneck.

The test uses the loopback (lo) interface, which is essentially a memory copy, so the network is not the bottleneck either.

With CPU and network both plentiful, the execution speed of the server code itself is what matters.

The OS is CentOS 6, 64-bit.

The client is written in C++, and the same client is used for every test.

Write

The following is the C++ server, single-process and single-threaded:

g++ tcp.server.cpp -g -O0 -o tcp.server && ./tcp.server 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   6   0   0   1    |   0    11k|1073M 1073M|   0     0 |2790    81k
  0   6   0   0   1    |   0  7782B|1049M 1049M|   0     0 |2536    76k

  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32573 winlin      0 11744  892  756 R 99.9  0.0  17:56.88 ./tcp.server 1990 4096
 2880 winlin      0 11740  764      S 85.3  0.0   0:32.53 ./tcp.client 127.0.0.1 1990 4096

A single C++ process is very efficient at 1049MBps; by comparison, SRS 1 runs at 168MBps and SRS 2 at 391MBps, so SRS has in fact not reached the performance limit.

The following is Go as the server with no_delay set to 1 (TCP_NODELAY enabled), which is Go's default TCP option; this option degrades throughput in this test:

go build ./tcp.server.go && ./tcp.server 1 1 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   2    |   0  7509B| 587M  587M|   0     0 |2544   141k
  0   5   0   0   2    |   0    10k| 524M  524M|   0     0 |2629   123k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 5496 winlin      0 98248 1968 1360 S 100.5  0.0   4:40.54 ./tcp.server 1 1 1990 4096
 5517 winlin      0 11740  896  764 S  72.3  0.0

As you can see, Go with TCP no-delay enabled reaches only about half the throughput of C++, likely because disabling Nagle's algorithm flushes each 4KB write immediately instead of letting the kernel coalesce writes into larger segments.

Below is Go as the server with the TCP no-delay option turned off, still single-process:

go build ./tcp.server.go && ./tcp.server 1 0 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   1    |   0    10k| 868M  868M|   0     0 |2674    79k
  1   5   0   0   1    |   0    16k| 957M  957M|   0     0 |2660    85k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 3004 winlin      0 98248 1968 1360 R 100.2  0.0   2:27.32 ./tcp.server 1 0 1990 4096
 3030 winlin      0 11740  764      R  81.0  0.0   1:59.42 ./tcp.client 127.0.0.1 1990 4096

In fact, with TCP no-delay turned off, the gap between Go and C++ is not large.
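
For reference, the no-delay switch in Go is a single call on the accepted connection. Below is a minimal sketch of a server that turns it off, assuming a fixed port and packet size and a made-up writeLoop helper; it is not the actual tcp.server.go, only an illustration of (*net.TCPConn).SetNoDelay:

package main

import (
    "log"
    "net"
)

func main() {
    ln, err := net.Listen("tcp", ":1990")
    if err != nil {
        log.Fatal(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            log.Fatal(err)
        }
        tc := c.(*net.TCPConn)
        // no_delay=1 keeps Go's default (TCP_NODELAY on);
        // no_delay=0 turns Nagle back on, which roughly doubled throughput in this test.
        if err := tc.SetNoDelay(false); err != nil {
            log.Println("SetNoDelay:", err)
        }
        go writeLoop(tc, 4096)
    }
}

// writeLoop is a stand-in for the test's send loop: it writes fixed-size
// packets until the peer disconnects.
func writeLoop(c *net.TCPConn, packetBytes int) {
    defer c.Close()
    b := make([]byte, packetBytes)
    for {
        if _, err := c.Write(b); err != nil {
            return
        }
    }
}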

Multiple CPUs

Go's greatest strength is its built-in support for multi-CPU parallelism. C++ can also fork multiple processes, but that has a large impact on the business code; Go scales across CPUs without touching the business code.
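
For context, spreading the same server over several CPUs in Go only requires telling the runtime how many CPUs to use; the per-connection logic is untouched. A minimal sketch under that assumption (not the actual test code):

package main

import (
    "net"
    "runtime"
)

func main() {
    // Let goroutines run on up to 10 CPUs (the default was a single CPU in
    // Go 1.3/1.4; since Go 1.5 it defaults to all CPUs).
    runtime.GOMAXPROCS(10)

    ln, err := net.Listen("tcp", ":1990")
    if err != nil {
        panic(err)
    }
    for {
        c, err := ln.Accept()
        if err != nil {
            continue
        }
        // The business code is still "one goroutine per connection"; the
        // runtime schedules the goroutines across the CPUs enabled above.
        go func(c net.Conn) {
            defer c.Close()
            b := make([]byte, 4096)
            for {
                if _, err := c.Write(b); err != nil {
                    return
                }
            }
        }(c)
    }
}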

Performance of Go with 10 CPUs and 8 clients, with no-delay turned on (the default):

go build ./tcp.server.go && ./tcp.server 1 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  4  37  47   0   0  12|   0   105k|3972M 3972M|   0     0 | 14k   995k
  4  37  46   0   0  13|   0  8055B|3761M 3761M|   0     0 | 14k   949k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6353 winlin  20  0  517m 6896 1372 R 789.6  0.0  13:24.49 ./tcp.server 1 1990 4096
 6384 winlin      0 11740             68.4   0.0   1:11.57 ./tcp.client 127.0.0.1 1990 4096
 6386 winlin      0 11740  896  764 R  67.4  0.0   1:09.53 ./tcp.client 127.0.0.1 1990 4096
 6390 winlin      0 11740       764 R  66.7  0.0   1:11.24 ./tcp.client 127.0.0.1 1990 4096
 6382 winlin      0 11740  896  764 R  64.8  0.0   1:11.3  ./tcp.client 127.0.0.1 1990 4096
 6388 winlin      0 11740  896  764 R  64.4  0.0   1:11.80 ./tcp.client 127.0.0.1 1990 4096
 6380 winlin      0 11740  896  764 S  63.4  0.0   1:08.78 ./tcp.client 127.0.0.1 1990 4096
 6396 winlin  20  0 11740  896  764 R  62.8  0.0   1:09.47 ./tcp.client 127.0.0.1 1990 4096
 6393 winlin      0 11740       764 R  61.4  0.0   1:11.90 ./tcp.client 127.0.0.1 1990 4096

This is quite powerful: it reaches about 30Gbps.

Performance of Go with 10 CPUs and 8 clients, with no-delay turned off:

go build ./tcp.server.go && ./tcp.server 0 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  5  42  41   0   0  12|   0  8602B|7132M 7132M|   0     0 | 15k   602k
  5  41  41   0   0  12|   0    13k|7426M 7426M|   0     0 | 15k   651k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 4148 winlin  20  0  528m 9.8m 1376 R 795.5  0.1  81:48.12 ./tcp.server 0 1990 4096
 4167 winlin      0 11740  896  764 S  89.8  0.0   8:16.52 ./tcp.client 127.0.0.1 1990 4096
 4161 winlin      0 11740       764 R  87.8  0.0   8:14.63 ./tcp.client 127.0.0.1 1990 4096
 4174 winlin      0 11740  896  764 S  83.2  0.0   8:09.40 ./tcp.client 127.0.0.1 1990 4096
 4163 winlin      0 11740  896  764 R  82.6  0.0   8:07.80 ./tcp.client 127.0.0.1 1990 4096
 4171 winlin      0 11740       764 R  82.2  0.0   8:08.75 ./tcp.client 127.0.0.1 1990 4096
 4169 winlin      0 11740       764 S  81.9  0.0   8:15.37 ./tcp.client 127.0.0.1 1990 4096
 4165 winlin  20  0 11740       764 R  78.9  0.0   8:09.98 ./tcp.client 127.0.0.1 1990 4096
 4177 winlin      0 11740       764 R  74.0  0.0   8:07.63 ./tcp.client 127.0.0.1 1990 4096

This is even more impressive: it reaches about 59Gbps!

Writev

Go has no writev, so we compare C++ with writev against Go on multiple CPUs.

Since SRS 2 currently uses writev to boost performance, a Go rewrite of SRS would reasonably rely on multiple CPUs instead.

At the same time, the client uses readv to read multiple buffers per call, so the single-process client is no longer the bottleneck.

g++ tcp.server.writev.cpp -g -O0 -o tcp.server && ./tcp.server 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   6   0   0   1    |   0    15k|1742M 1742M|   0     0 |2578    30k
  0   6   0   0   1    |   0    13k|1779M 1779M|   0     0 |2412    30k

  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 9468 winlin      0 12008 1192      R 99.8  0.0   1:17.63 ./tcp.server 64 1990 4096
 9487 winlin      0 12008 1192      R 80.3  0.0   1:02.49 ./tcp.client 127.0.0.1 1990 64 4096

Using writev does improve performance substantially, because it cuts down the number of system calls and therefore the time spent in them.
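
Go (in the version tested here) exposes no writev, but the same syscall-saving effect can be approximated by copying several packets into one buffer and issuing a single Write, at the cost of an extra memory copy that writev avoids. A rough sketch of that idea, with an illustrative batch size (this is not code from the test repo):

package tcputil

import "net"

// WriteBatched copies up to batch packets into one buffer and sends them
// with a single Write call, so N packets cost roughly N/batch syscalls
// instead of N. writev in C++ gets the same syscall reduction without
// the extra copy.
func WriteBatched(c *net.TCPConn, packets [][]byte, batch int) error {
    buf := make([]byte, 0, batch*4096)
    for i := 0; i < len(packets); i += batch {
        end := i + batch
        if end > len(packets) {
            end = len(packets)
        }
        buf = buf[:0]
        for _, p := range packets[i:end] {
            buf = append(buf, p...)
        }
        if _, err := c.Write(buf); err != nil {
            return err
        }
    }
    return nil
}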

For comparison, single-process Go with TCP no-delay disabled, against the readv client:

go build ./tcp.server.go && ./tcp.server 1 0 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   1    |   0  5734B| 891M  891M|   0     0 |2601   101k
  0   5   0   0   2    |   0  9830B| 897M  897M|   0     0 |2518   103k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 9690 winlin      0 98248 3984 1360 R 100.2  0.0   2:46.84 ./tcp.server 1 0 1990 4096
 9698 winlin      0 12008 1192      R  79.3  0.0   2:13.23 ./tcp.client 127.0.0.1 1990 64 4096

This time single-process Go does not improve: the bottleneck is not the client, so switching the client to readv changes nothing.

And Go with 10 CPUs, the readv client, and TCP no-delay disabled:

go build ./tcp.server.go && ./tcp.server 0 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- --net/lo--- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  5  41  42   0   0  12|   0  7236B|6872M 6872M|   0     0 | 15k   780k
  4  42  41   0   0  12|   0  9557B|6677M 6677M|   0     0 | 15k   723k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
10169 winlin  20  0  655m 7072 1388 R 799.9  0.0  51:39.13 ./tcp.server 0 1990 4096
10253 winlin      0 12008 1192      R  84.5  0.0   5:05.05 ./tcp.client 127.0.0.1 1990 4096
10261 winlin      0 12008 1192         80.6  0.0           ./tcp.client 127.0.0.1 1990 4096
10255 winlin      0 12008 1192      R  79.9  0.0   5:05.32 ./tcp.client 127.0.0.1 1990 64 4096
10271 winlin      0 12008 1192         79.3  0.0   5:05.15 ./tcp.client 127.0.0.1 1990 4096
10258 winlin      0 12008 1192         78.3  0.0   5:05.45 ./tcp.client 127.0.0.1 1990 4096
10268 winlin      0 12008 1192      R  77.6  0.0   5:06.54 ./tcp.client 127.0.0.1 1990 64 4096
10251 winlin      0 12008 1188      R  76.6  0.0   5:03.68 ./tcp.client 127.0.0.1 1990 20 4096
10265 winlin      0 12008 1192      R  74.6  0.0   5:03.35 ./tcp.client 127.0.0.1 1990 64 4096

The test results are about the same as before.

Go Write Analysis

Debugging Go's TCPConn.Write method, the call stack is as follows:

at /home/winlin/go/src/github.com/winlinvip/srs.go/research/tcp/tcp.server.go:203
203         n, err := conn.Write(b)
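
(The post does not show how the debugger was invoked; a gdb session along the following lines could produce such a trace. The exact commands and arguments are an assumption, not taken from the original: break on the conn.Write(b) line shown above, then step repeatedly to descend through net.(*conn).Write, netFD.Write, and the syscall package.)

gdb --args ./tcp.server 1 0 1990 4096
(gdb) break tcp.server.go:203
(gdb) run
(gdb) step
(gdb) backtrace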

The code that is called is:

func handleConnection(conn *net.TCPConn, no_delay int, packet_bytes int) {
    for {
        n, err := conn.Write(b)
        if err != nil {
            fmt.Println("write data error, n is", n, "and err is", err)
            break
        }
    }
}

Stepping in with the debugger, the next call is:

net.(*conn).Write (c=0xc20805bf18, b=..., ~r1=0, ~r2=...)
    at /usr/local/go/src/pkg/net/net.go:130
130         return c.fd.Write(b)

The code for this section is:

func (c *conn) Write(b []byte) (int, error) {
    if !c.ok() {
        return 0, syscall.EINVAL
    }
    return c.fd.Write(b)
}

Next is:

Breakpoint 2, net.(*netFD).Write (fd=0x0, p=..., nn=0, err=...)
    at /usr/local/go/src/pkg/net/fd_unix.go:327
327         n, err = syscall.Write(int(fd.sysfd), p[nn:])

This part of the code is:

func (fd *netFD) Write(p []byte) (nn int, err error) {
    if err := fd.writeLock(); err != nil {
        return 0, err
    }
    defer fd.writeUnlock()
    if err := fd.pd.PrepareWrite(); err != nil {
        return 0, &OpError{"write", fd.net, fd.raddr, err}
    }
    for {
        var n int
        n, err = syscall.Write(int(fd.sysfd), p[nn:])
        if n > 0 {
            nn += n
        }
        if nn == len(p) {
            break
        }
        if err == syscall.EAGAIN {
            if err = fd.pd.WaitWrite(); err == nil {
                continue
            }
        }
        if err != nil {
            n = 0
            break
        }
        if n == 0 {
            err = io.ErrUnexpectedEOF
            break
        }
    }
    if err != nil {
        err = &OpError{"write", fd.net, fd.raddr, err}
    }
    return nn, err
}

The system call layer is:

syscall.Write (fd=0, p=..., n=0, err=...)
    at /usr/local/go/src/pkg/syscall/syscall_unix.go:152
152         n, err = write(fd, p)

The code is:

func Write(fd int, p []byte) (n int, err error) {
    if raceenabled {
        raceReleaseMerge(unsafe.Pointer(&ioSync))
    }
    n, err = write(fd, p)
    if raceenabled && n > 0 {
        raceReadRange(unsafe.Pointer(&p[0]), n)
    }
    return
}

The last step is:

#0  syscall.write (fd=12, p=..., n=4354699, err=...)
    at /usr/local/go/src/pkg/syscall/zsyscall_linux_amd64.go:1228
1228        r0, _, e1 := Syscall(SYS_WRITE, uintptr(fd), uintptr(_p0), uintptr(len(p)))

The code is:

func write(fd int, p []byte) (n int, err error) {
    var _p0 unsafe.Pointer
    if len(p) > 0 {
        _p0 = unsafe.Pointer(&p[0])
    } else {
        _p0 = unsafe.Pointer(&_zero)
    }
    r0, _, e1 := Syscall(SYS_WRITE, uintptr(fd), uintptr(_p0), uintptr(len(p)))
    n = int(r0)
    if e1 != 0 {
        err = e1
    }
    return
}

The call uses SYS_WRITE, i.e. the write(2) syscall. Searching with find shows that SYS_WRITEV is at least defined:

[winlin@dev6 src]$ find . -name "*.go" | xargs grep -in "SYS_WRITEV"
./pkg/syscall/zsysnum_linux_amd64.go:27:SYS_WRITEV                 = 20

Unfortunately, no code was found that uses this constant, so Go (as of this version) clearly does not support writev.

Finally, Syscall itself is written in assembly:

syscall.Syscall () at /usr/local/go/src/pkg/syscall/asm_linux_amd64.s:20
20          CALL    runtime·entersyscall(SB)

The code is:

TEXT ·Syscall(SB),NOSPLIT,$0-56
    CALL    runtime·entersyscall(SB)
    MOVQ    16(SP), DI
    MOVQ    24(SP), SI
    MOVQ    32(SP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    8(SP), AX   // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS     ok
    MOVQ    $-1, 40(SP) // r1
    MOVQ    $0, 48(SP)  // r2
    NEGQ    AX
    MOVQ    AX, 56(SP)  // errno
    CALL    runtime·exitsyscall(SB)
    RET
ok:
    MOVQ    AX, 40(SP)  // r1
    MOVQ    DX, 48(SP)  // r2
    MOVQ    $0, 56(SP)  // errno
    CALL    runtime·exitsyscall(SB)
    RET

That is the whole write call path.

Summary

A single Go process with TCP no-delay turned off is about the same as C++ using write, but only half of C++ using writev.

Go with multiple CPUs improves performance almost linearly without changing the business code, reaching several times the throughput of a single C++ process.


Winlin 2014.11.22
