In an earlier test, without any optimization, C++ (16Gbps) was 4 times the performance of Go (4Gbps); see http://blog.csdn.net/win_lin/article/details/40744175
This time the TCP path in Go was optimized, and the results are satisfying. A single Go process (7Gbps) does not lose to C++ (8Gbps), reaches half of C++ using writev (16Gbps), and multi-process Go (59Gbps) is an outright triumph, several times the throughput of C++.
Test code: https://github.com/winlinvip/srs.go/tree/master/research/tcp
Note: the previous test ran on a virtual machine, and this one on a physical machine, so the results may differ slightly.
Why TCP
TCP is the foundation of network communication. The web is built on the HTTP framework, and HTTP is built on TCP; RTMP is also based on TCP.
If TCP throughput meets the requirement, it lays the foundation for writing a high-performance server in the language.
Before SRS 1.0 I researched Go and wrote a Go version of SRS, but its performance was roughly on par with Red5, so I deleted it.
Now SRS 2.0 single-process one-way network throughput reaches 4Gbps (6,000 clients at a 522Kbps bitrate); if Go cannot reach that goal, SRS cannot be rewritten in Go.
Platform
The test ran on a 24-CPU server, so the CPU is not the bottleneck.
The test uses the lo (loopback) interface, which is a direct memory copy, so the network is not the bottleneck.
With both CPU and network in ample supply, the execution speed of the server itself is what matters.
The OS is CentOS 6, 64-bit.
The client is written in C++, and the same client is used for every test.
Write
First, the C++ server, running as a single-process, single-threaded server:
g++ tcp.server.cpp -g -O0 -o tcp.server && ./tcp.server 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   6   0   0   1    |   0   11k |1073M 1073M|   0     0 |2790   81k
  0   6   0   0   1    |   0  7782B|1049M 1049M|   0     0 |2536   76k

  PID USER    PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+ COMMAND
32573 winlin        11744  892  756 R 99.9  0.0 17:56.88 ./tcp.server 1990 4096
 2880 winlin        11740  764      S 85.3  0.0  0:32.53 ./tcp.client 127.0.0.1 1990 4096
Single-process C++ is very efficient at 1049MBps. For comparison, SRS1 currently runs at 168MBps and SRS2 at 391MBps, so SRS has not actually reached the performance ceiling.
Next, Go as the server with no_delay set to 1, which is Go's default TCP option; this option degrades TCP throughput:
go build ./tcp.server.go && ./tcp.server 1 1 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   2    |   0  7509B| 587M  587M|   0     0 |2544  141k
  0   5   0   0   2    |   0   10k | 524M  524M|   0     0 |2629  123k

 PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
5496 winlin        98248 1968 1360 S 100.5  0.0 4:40.54 ./tcp.server 1 1 1990 4096
5517 winlin        11740  896  764 S  72.3  0.0
As you can see, Go with TCP no-delay on reaches only half the performance of C++.
Next, Go as the server with the TCP no-delay option turned off, single process:
go build ./tcp.server.go && ./tcp.server 1 0 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   1    |   0   10k | 868M  868M|   0     0 |2674   79k
  1   5   0   0   1    |   0   16k | 957M  957M|   0     0 |2660   85k

 PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
3004 winlin        98248 1968 1360 R 100.2  0.0 2:27.32 ./tcp.server 1 0 1990 4096
3030 winlin        11740  764      R  81.0  0.0 1:59.42 ./tcp.client 127.0.0.1 1990 4096
In fact, with TCP no-delay turned off, the gap between Go and C++ is not large.
Multiple CPUs
Go's greatest strength is its support for multi-CPU parallelism. C++ can also fork multiple processes, but that has a big impact on the business code; Go scales across multiple CPUs directly, without touching the business code.
Go with 10 CPUs and 8 clients, no-delay on (the default):
go build ./tcp.server.go && ./tcp.server 1 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  4  37  47   0   0  12|   0  105k |3972M 3972M|   0     0 | 14k  995k
  4  37  46   0   0  13|   0  8055B|3761M 3761M|   0     0 | 14k  949k

 PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
6353 winlin  20  0  517m 6896 1372 R 789.6  0.0 13:24.49 ./tcp.server 1 1990 4096
6384 winlin        11740           R  68.4  0.0  1:11.57 ./tcp.client 127.0.0.1 1990 4096
6386 winlin        11740  896  764 R  67.4  0.0  1:09.53 ./tcp.client 127.0.0.1 1990 4096
6390 winlin        11740       764 R  66.7  0.0  1:11.24 ./tcp.client 127.0.0.1 1990 4096
6382 winlin        11740  896  764 R  64.8  0.0  1:11.30 ./tcp.client 127.0.0.1 1990 4096
6388 winlin        11740  896  764 R  64.4  0.0  1:11.80 ./tcp.client 127.0.0.1 1990 4096
6380 winlin        11740  896  764 S  63.4  0.0  1:08.78 ./tcp.client 127.0.0.1 1990 4096
6396 winlin  20  0 11740  896  764 R  62.8  0.0  1:09.47 ./tcp.client 127.0.0.1 1990 4096
6393 winlin        11740       764 R  61.4  0.0  1:11.90 ./tcp.client 127.0.0.1 1990 4096
Quite powerful: it runs up to 30Gbps.
Go with 10 CPUs and 8 clients, no-delay off:
go build ./tcp.server.go && ./tcp.server 0 1990 4096
g++ tcp.client.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  5  42  41   0   0  12|   0  8602B|7132M 7132M|   0     0 | 15k  602k
  5  41  41   0   0  12|   0   13k |7426M 7426M|   0     0 | 15k  651k

 PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
4148 winlin  20  0  528m 9.8m 1376 R 795.5  0.1 81:48.12 ./tcp.server 0 1990 4096
4167 winlin        11740  896  764 S  89.8  0.0  8:16.52 ./tcp.client 127.0.0.1 1990 4096
4161 winlin        11740       764 R  87.8  0.0  8:14.63 ./tcp.client 127.0.0.1 1990 4096
4174 winlin        11740  896  764 S  83.2  0.0  8:09.40 ./tcp.client 127.0.0.1 1990 4096
4163 winlin        11740  896  764 R  82.6  0.0  8:07.80 ./tcp.client 127.0.0.1 1990 4096
4171 winlin        11740       764 R  82.2  0.0  8:08.75 ./tcp.client 127.0.0.1 1990 4096
4169 winlin        11740       764 S  81.9  0.0  8:15.37 ./tcp.client 127.0.0.1 1990 4096
4165 winlin  20  0 11740       764 R  78.9  0.0  8:09.98 ./tcp.client 127.0.0.1 1990 4096
4177 winlin        11740       764 R  74.0  0.0  8:07.63 ./tcp.client 127.0.0.1 1990 4096
This is even more impressive: it runs up to 59Gbps. Formidable!
Writev
Go has no writev, so we compare C++ using writev against multi-process Go.
Considering that SRS2 currently uses writev to boost performance, it makes sense to go multi-process if SRS were rewritten in Go.
Meanwhile the client uses readv to read multiple buffers at once, removing the client's single-process bottleneck.
g++ tcp.server.writev.cpp -g -O0 -o tcp.server && ./tcp.server 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   6   0   0   1    |   0   15k |1742M 1742M|   0     0 |2578   30k
  0   6   0   0   1    |   0   13k |1779M 1779M|   0     0 |2412   30k

 PID USER    PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
9468 winlin        12008 1192     R 99.8  0.0 1:17.63 ./tcp.server 64 1990 4096
9487 winlin        12008 1192     R 80.3  0.0 1:02.49 ./tcp.client 127.0.0.1 1990 64 4096
Using writev does improve performance by a large factor, because it reduces the number of system calls.
Compare single-process Go (no-delay off) against the readv client:
go build ./tcp.server.go && ./tcp.server 1 0 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && ./tcp.client 127.0.0.1 1990 4096

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   5   0   0   1    |   0  5734B| 891M  891M|   0     0 |2601  101k
  0   5   0   0   2    |   0  9830B| 897M  897M|   0     0 |2518  103k

 PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM   TIME+ COMMAND
9690 winlin        98248 3984 1360 R 100.2  0.0 2:46.84 ./tcp.server 1 0 1990 4096
9698 winlin        12008 1192      R  79.3  0.0 2:13.23 ./tcp.client 127.0.0.1 1990 64 4096
This time single-process Go does not improve: the bottleneck was not the client, so switching the client to readv changes nothing.
Compare Go with 10 CPUs (no-delay off) against the readv client:
go build ./tcp.server.go && ./tcp.server 0 1990 4096
g++ tcp.client.readv.cpp -g -O0 -o tcp.client && for ((i=0;i<8;i++)); do (./tcp.client 127.0.0.1 1990 4096 &); done

----total-cpu-usage---- -dsk/total- ---net/lo-- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  5  41  42   0   0  12|   0  7236B|6872M 6872M|   0     0 | 15k  780k
  4  42  41   0   0  12|   0  9557B|6677M 6677M|   0     0 | 15k  723k

  PID USER    PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+ COMMAND
10169 winlin  20  0  655m 7072 1388 R 799.9  0.0 51:39.13 ./tcp.server 0 1990 4096
10253 winlin        12008 1192      R  84.5  0.0  5:05.05 ./tcp.client 127.0.0.1 1990 4096
10261 winlin        12008 1192         80.6  0.0          ./tcp.client 127.0.0.1 1990 4096
10255 winlin        12008 1192      R  79.9  0.0  5:05.32 ./tcp.client 127.0.0.1 1990 64 4096
10271 winlin        12008 1192         79.3  0.0  5:05.15 ./tcp.client 127.0.0.1 1990 4096
10258 winlin        12008 1192         78.3  0.0  5:05.45 ./tcp.client 127.0.0.1 1990 4096
10268 winlin        12008 1192      R  77.6  0.0  5:06.54 ./tcp.client 127.0.0.1 1990 64 4096
10251 winlin        12008 1188      R  76.6  0.0  5:03.68 ./tcp.client 127.0.0.1 1990 4096
10265 winlin        12008 1192      R  74.6  0.0  5:03.35 ./tcp.client 127.0.0.1 1990 64 4096
The result is about the same as before.
Go Write Analysis
Debugging Go's TCPConn.Write method, the call stack is:
at /home/winlin/go/src/github.com/winlinvip/srs.go/research/tcp/tcp.server.go:203
203        n, err := conn.Write(b)
The code that is called is:
func handleConnection(conn *net.TCPConn, no_delay int, packet_bytes int) {
    for {
        n, err := conn.Write(b)
        if err != nil {
            fmt.Println("write data error, n is", n, "and err is", err)
            break
        }
    }
}
Stepping in with gdb, the call is:
net.(*conn).Write (c=0xc20805bf18, b=..., ~r1=0, ~r2=...)
    at /usr/local/go/src/pkg/net/net.go:130
130        return c.fd.Write(b)
The code for this section is:
func (c *conn) Write(b []byte) (int, error) {
    if !c.ok() {
        return 0, syscall.EINVAL
    }
    return c.fd.Write(b)
}
Next is:
Breakpoint 2, net.(*netFD).Write (fd=0x0, p=..., nn=0, err=...)
    at /usr/local/go/src/pkg/net/fd_unix.go:327
327        n, err = syscall.Write(int(fd.sysfd), p[nn:])
This part of the code is:
func (fd *netFD) Write(p []byte) (nn int, err error) {
    if err := fd.writeLock(); err != nil {
        return 0, err
    }
    defer fd.writeUnlock()
    if err := fd.pd.PrepareWrite(); err != nil {
        return 0, &OpError{"write", fd.net, fd.raddr, err}
    }
    for {
        var n int
        n, err = syscall.Write(int(fd.sysfd), p[nn:])
        if n > 0 {
            nn += n
        }
        if nn == len(p) {
            break
        }
        if err == syscall.EAGAIN {
            if err = fd.pd.WaitWrite(); err == nil {
                continue
            }
        }
        if err != nil {
            n = 0
            break
        }
        if n == 0 {
            err = io.ErrUnexpectedEOF
            break
        }
    }
    if err != nil {
        err = &OpError{"write", fd.net, fd.raddr, err}
    }
    return nn, err
}
The system call layer is:
syscall.Write (fd=0, p=..., n=0, err=...)
    at /usr/local/go/src/pkg/syscall/syscall_unix.go:152
152        n, err = write(fd, p)
The code is:
func Write(fd int, p []byte) (n int, err error) {
    if raceenabled {
        raceReleaseMerge(unsafe.Pointer(&ioSync))
    }
    n, err = write(fd, p)
    if raceenabled && n > 0 {
        raceReadRange(unsafe.Pointer(&p[0]), n)
    }
    return
}
The last is:
#0  syscall.write (fd=12, p=..., n=4354699, err=...)
    at /usr/local/go/src/pkg/syscall/zsyscall_linux_amd64.go:1228
1228       r0, _, e1 := Syscall(SYS_WRITE, uintptr(fd), uintptr(_p0), uintptr(len(p)))
The code is:
func write(fd int, p []byte) (n int, err error) {
    var _p0 unsafe.Pointer
    if len(p) > 0 {
        _p0 = unsafe.Pointer(&p[0])
    } else {
        _p0 = unsafe.Pointer(&_zero)
    }
    r0, _, e1 := Syscall(SYS_WRITE, uintptr(fd), uintptr(_p0), uintptr(len(p)))
    n = int(r0)
    if e1 != 0 {
        err = e1
    }
    return
}
The call is SYS_WRITE, i.e. write(2). Searching for SYS_WRITEV finds:
[winlin@dev6 src]$ find . -name "*.go" | xargs grep -in "SYS_WRITEV"
./pkg/syscall/zsysnum_linux_amd64.go:27: SYS_WRITEV = 20
Unfortunately, no code uses this constant, so this version of Go certainly does not support writev.
Finally, every syscall goes through assembly:
syscall.Syscall () at /usr/local/go/src/pkg/syscall/asm_linux_amd64.s:20
20         CALL    runtime·entersyscall(SB)
The code is:
TEXT ·Syscall(SB),NOSPLIT,$0-56
    CALL    runtime·entersyscall(SB)
    MOVQ    16(SP), DI
    MOVQ    24(SP), SI
    MOVQ    32(SP), DX
    MOVQ    $0, R10
    MOVQ    $0, R8
    MOVQ    $0, R9
    MOVQ    8(SP), AX    // syscall entry
    SYSCALL
    CMPQ    AX, $0xfffffffffffff001
    JLS     ok
    MOVQ    $-1, 40(SP)  // r1
    MOVQ    $0, 48(SP)   // r2
    NEGQ    AX
    MOVQ    AX, 56(SP)   // errno
    CALL    runtime·exitsyscall(SB)
    RET
ok:
    MOVQ    AX, 40(SP)   // r1
    MOVQ    DX, 48(SP)   // r2
    MOVQ    $0, 56(SP)   // errno
    CALL    runtime·exitsyscall(SB)
    RET
That is the end of the write path.
Summary
With TCP no-delay off, single-process Go is about on par with C++ using write, but only half of C++ using writev.
Multi-process Go improves performance linearly without changing the business code, reaching several times the throughput of single-process C++.
Winlin 2014.11.22