Hi everybody! My name is Sergey Kamardin, and I'm an engineer at Mail.Ru.
Introduction
To introduce the context of our story, I should first say a few words about why we need this server at all.
Mail.Ru has a lot of stateful systems. User e-mail storage is one of them. There are several ways to track state changes in such a system and to learn about system events: mainly either through periodic system polling or through system notifications about state changes.
Both ways have their pros and cons. But when it comes to mail, the faster a user receives a new message, the better.
Mail polling involves about 50,000 HTTP queries per second, 60% of which return the 304 status, meaning there are no changes in the mailbox.
Therefore, in order to reduce the load on the servers and to speed up message delivery to users, it was decided to write a publish-subscriber server (also known as a bus) that would, on the one hand, receive notifications about state changes and, on the other hand, handle subscriptions to such notifications.
[Diagram: the previous scheme, with the browser periodically polling the API]
[Diagram: the current scheme, with a WebSocket connection to the notification API]
The first scheme shows what it looked like before: the browser periodically polled the API and asked about changes in Storage (the mailbox service).
The second scheme describes the new architecture: the browser establishes a WebSocket connection with the notification API, which is a client of the Bus server. Upon receipt of a new e-mail, Storage sends a notification about it to Bus (1), and Bus sends it on to its subscribers (2). The API determines the connection to send the received notification to, and sends it to the user's browser (3).
So today we are going to talk about the API, or the WebSocket server. Looking ahead, I'll tell you that the server will have about 3 million online connections.
Implementation method
Let's look at how we could implement certain parts of our server using plain Go features, without any optimizations.
Before proceeding with net/http, let's talk about how we will send and receive data. The data which stands above the WebSocket protocol (for example, a JSON object) will hereinafter be referred to as a packet.
Let's begin implementing the Channel structure that will contain the logic of sending and receiving such packets over a WebSocket connection.
Channel structure
// Packet represents application level data.
type Packet struct {
    ...
}

// Channel wraps user connection.
type Channel struct {
    conn net.Conn    // WebSocket connection.
    send chan Packet // Outgoing packets queue.
}

func NewChannel(conn net.Conn) *Channel {
    c := &Channel{
        conn: conn,
        send: make(chan Packet, N),
    }
    go c.reader()
    go c.writer()
    return c
}
Note that there are two goroutines here, the reader and the writer. Each of them requires its own memory stack, which may have an initial size of 2 to 8 KB depending on the operating system and the Go version.
With the 3 million online connections mentioned above, we will need 24 GB of memory (with a 4 KB stack) just for the goroutines of all connections. And that is without the memory allocated for the Channel structure, the outgoing packets queue ch.send, and other internal fields.
I/O goroutines
Let's take a look at the implementation of the reader:
func (c *Channel) reader() {
    // We make a buffered read to reduce read syscalls.
    buf := bufio.NewReader(c.conn)

    for {
        pkt, _ := readPacket(buf)
        c.handle(pkt)
    }
}
Here we use bufio.Reader to reduce the number of read() syscalls and to read as much as the buf buffer size allows at a time. Within the infinite loop, we expect new data to come. Please remember these words: expect new data to come. We will return to them later.
We will leave aside the parsing and processing of incoming packets, as it is not important for the optimizations we will be discussing. However, buf deserves our attention now: by default, it is 4 KB, which means another 12 GB of memory for our connections. The writer has a similar situation:
func (c *Channel) writer() {
    // We make a buffered write to reduce write syscalls.
    buf := bufio.NewWriter(c.conn)

    for pkt := range c.send {
        _ = writePacket(buf, pkt)
        buf.Flush()
    }
}
We iterate over c.send and write the packets to the buffer. As the attentive reader has already guessed, this is another 4 KB buffer, and thus up to 12 GB more memory for our 3 million connections.
HTTP
We already have a simple Channel implementation; now we need to get a WebSocket connection to work with.
Note: if you don't know how WebSocket works: the client switches to the WebSocket protocol by means of a special HTTP mechanism called Upgrade. After the successful processing of an Upgrade request, the server and the client use the TCP connection to exchange binary WebSocket frames. Here is a description of the frame structure inside a connection.
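Roughly, each frame carries the following information, per RFC 6455 (the struct below is only an illustration of the frame fields, not a wire-format parser):

// Frame is an illustrative view of a WebSocket frame (RFC 6455).
type Frame struct {
    Fin     bool    // FIN bit: this is the final fragment of a message.
    OpCode  byte    // 4-bit opcode: 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong.
    Masked  bool    // Client-to-server frames must be masked.
    Mask    [4]byte // XOR mask applied to the payload when Masked is set.
    Length  int64   // Payload length: 7 bits, or a 16/64-bit extended form.
    Payload []byte  // Application data.
}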
import (
    "net/http"
    "some/websocket"
)

http.HandleFunc("/v1/ws", func(w http.ResponseWriter, r *http.Request) {
    conn, _ := websocket.Upgrade(r, w)
    ch := NewChannel(conn)
    //...
})
Please note that http.ResponseWriter makes memory allocations for bufio.Reader and bufio.Writer (both with 4 KB buffers) for the *http.Request initialization and further response writing.
Regardless of the WebSocket library used, after a successful response to the Upgrade request, the server receives the I/O buffers together with the TCP connection after the responseWriter.Hijack() call.
Hint: in some cases, go:linkname can be used to return the buffers to the sync.Pool inside net/http by calling net/http.putBufioReader and net/http.putBufioWriter.
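A sketch of that trick, assuming Go's unexported net/http.putBufioReader and net/http.putBufioWriter helpers. It leans on Go internals, may break between Go versions, and needs an empty .s file in the package so the compiler accepts the body-less declarations:

package hijack

import (
    "bufio"

    _ "net/http" // The package whose internals we link against.
    _ "unsafe"   // Required for go:linkname.
)

// Return hijacked buffers to the net/http internal pools.
//
//go:linkname putBufioReader net/http.putBufioReader
func putBufioReader(br *bufio.Reader)

//go:linkname putBufioWriter net/http.putBufioWriter
func putBufioWriter(bw *bufio.Writer)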
Thus, we need yet another 24 GB of memory for 3 million connections.
In total: 24 GB for the goroutine stacks, 12 GB for the reader buffers, 12 GB for the writer buffers, and another 24 GB for the net/http buffers. That is up to 72 GB of memory for an application that does not do anything yet!
Optimization
Let's review what we talked about in the introduction and remember how a user connection behaves. After switching to WebSocket, the client sends a packet with the relevant events; in other words, it subscribes to events. Then (leaving aside technical messages such as ping/pong), the client may send nothing else for the whole lifetime of the connection.
The connection lifetime may be from a few seconds to a few days.
So most of the time, our Channel.reader() and Channel.writer() are just waiting to receive or send data, and each of them holds a 4 KB I/O buffer.
Now it's clear that something can be done better, isn't it?
Netpoll
Remember that Channel.reader() expected new data to come by getting locked on the conn.Read() call inside bufio.Reader.Read()? If there was data in the connection, the Go runtime "woke up" our goroutine and allowed it to read the next packet; after that, the goroutine got locked again while expecting new data. Let's see how the Go runtime understands that a goroutine must be "woken up". If we look at the conn.Read() implementation, we'll see the net.netFD.Read() call inside it:
net/fd_unix.go:
func (fd *netFD) Read(p []byte) (n int, err error) {
    //...
    for {
        n, err = syscall.Read(fd.sysfd, p)
        if err != nil {
            n = 0
            if err == syscall.EAGAIN {
                if err = fd.pd.waitRead(); err == nil {
                    continue
                }
            }
        }
        //...
        break
    }
    //...
}
Go uses sockets in non-blocking mode. EAGAIN says there is no data in the socket; in order not to get locked on reading from an empty socket, the OS returns control to us.
We see a read() syscall on the connection's file descriptor. If the read returns the EAGAIN error, the runtime makes the pollDesc.waitRead() call:
net/fd_poll_runtime.go:
func (pd *pollDesc) waitRead() error {
    return pd.wait('r')
}

func (pd *pollDesc) wait(mode int) error {
    res := runtime_pollWait(pd.runtimeCtx, mode)
    //...
}
If we dig deeper, we'll see that netpoll is implemented using epoll on Linux and kqueue on BSD. Why not use the same approach for our connections? We could allocate a read buffer and start the reading goroutine only when it is really necessary: when there is really readable data in the socket.
On github.com/golang/go, there is an issue about exporting the netpoll functions.
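Mail.Ru has since open-sourced such an epoll/kqueue wrapper as github.com/mailru/easygo/netpoll. A minimal sketch of watching a connection with it could look like this (the helper name watchReadable and the exact API details are my assumptions based on that library, not code from the article):

import (
    "net"

    "github.com/mailru/easygo/netpoll"
)

func watchReadable(poller netpoll.Poller, conn net.Conn, onReadable func()) error {
    // Register the connection descriptor for "readable" events.
    desc, err := netpoll.HandleRead(conn)
    if err != nil {
        return err
    }
    // The callback fires from the poller loop whenever data arrives.
    return poller.Start(desc, func(netpoll.Event) {
        onReadable()
    })
}

The poller itself would come from something like netpoll.New(nil). In the snippets below, the article uses a simplified notation, poller.Start(conn, netpoll.EventRead, callback), for the same idea.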
Getting rid of goroutines
Suppose we had a netpoll implementation for Go. Then we could avoid starting the Channel.reader() goroutine with its internal buffer, and instead subscribe to the event of readable data appearing in the connection:
ch := NewChannel(conn)

// Make conn to be observed by netpoll instance.
poller.Start(conn, netpoll.EventRead, func() {
    // We spawn goroutine here to prevent poller wait loop
    // to become locked during receiving packet from ch.
    go ch.Receive()
})

// Receive reads a packet from conn and handles it somehow.
func (ch *Channel) Receive() {
    buf := bufio.NewReader(ch.conn)
    pkt, _ := readPacket(buf)
    ch.handle(pkt)
}
It is easier with Channel.writer(): we can run the goroutine and allocate the buffer only when we are actually going to send a packet:
func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        go ch.writer()
    }
    ch.send <- p
}
Note that here we do not handle the case when the operating system returns EAGAIN on write() syscalls. In such cases, we lean on the Go runtime, because it is actually rare for this kind of server. Nevertheless, it could be handled in the same way as reading, if needed.
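If we ever did need to handle it ourselves, one option, written in the same simplified poller notation used above (so, a sketch rather than real library code), would be to subscribe to writability and resume the writer from the callback:

// Hypothetical: after write() returns EAGAIN, park the writer and
// resume it once netpoll reports the socket as writable again.
poller.Start(conn, netpoll.EventWrite, func() {
    go ch.writer()
})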
After the outgoing packets are read from ch.send (one or several of them), the writer finishes its operation and frees the goroutine stack and the send buffer.
Perfect! We have saved up to 48 GB by getting rid of the stacks and I/O buffers inside the two continuously running goroutines.
Resource control
A large number of connections is not just about high memory consumption. When developing the server, we experienced repeated race conditions and deadlocks, often followed by the so-called self-DDoS: a situation where the application clients rampantly try to reconnect to the server, breaking it even more.
For example, if for some reason we suddenly could not handle ping/pong messages, but the idle-connection handler kept closing such connections (assuming the connections were broken and therefore provided no data), the clients would keep trying to reconnect instead of waiting for events.
It would be far better if a locked or overloaded server simply stopped accepting new connections, letting the load balancer before it (for example, nginx) pass the requests to the next server instance.
In addition, regardless of the server load, if all clients suddenly want to send us a packet for any reason (presumably because of a bug), the previously saved 48 GB will be of use again, as we will actually get back to the initial state with a goroutine and a buffer per connection.
Goroutine pool
We can limit the number of packets handled simultaneously using a goroutine pool. This is what a naive implementation of such a pool looks like:
package gopool

type Pool struct {
    work chan func()
    sem  chan struct{}
}

func New(size int) *Pool {
    return &Pool{
        work: make(chan func()),
        sem:  make(chan struct{}, size),
    }
}

func (p *Pool) Schedule(task func()) {
    select {
    case p.work <- task:
    case p.sem <- struct{}{}:
        go p.worker(task)
    }
}

func (p *Pool) worker(task func()) {
    defer func() { <-p.sem }()
    for {
        task()
        task = <-p.work
    }
}
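The final server example at the end of the article also calls a pool.ScheduleTimeout() method that is not shown here. A plausible sketch of it, assuming the Pool above (the error name and exact semantics are my assumptions):

import (
    "errors"
    "time"
)

// ErrScheduleTimeout is returned when no worker frees up in time.
var ErrScheduleTimeout = errors.New("gopool: schedule timed out")

// ScheduleTimeout behaves like Schedule, but gives up if the task
// cannot be handed to a worker within the given timeout.
func (p *Pool) ScheduleTimeout(timeout time.Duration, task func()) error {
    t := time.NewTimer(timeout)
    defer t.Stop()
    select {
    case p.work <- task:
        return nil
    case p.sem <- struct{}{}:
        go p.worker(task)
        return nil
    case <-t.C:
        return ErrScheduleTimeout
    }
}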
Now our code with netpoll looks as follows:
pool := gopool.New(128)

poller.Start(conn, netpoll.EventRead, func() {
    // We will block poller wait loop when
    // all pool workers are busy.
    pool.Schedule(func() {
        ch.Receive()
    })
})
So now we read packets not simply when there is readable data in the socket, but also at the first opportunity to take up a free goroutine in the pool.
Similarly, we'll change Send():
func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        pool.Schedule(ch.writer)
    }
    ch.send <- p
}
Instead of go ch.writer(), we want to write in one of the reused goroutines. Thus, for a pool of N goroutines, we can guarantee that N requests will be handled simultaneously, and when request N + 1 arrives, we will not allocate an N + 1 read buffer. The goroutine pool also allows us to limit Accept() and Upgrade() of new connections, and to avoid most situations with DDoS.
Zero-copy upgrade
Let's deviate a little from the WebSocket protocol. As already mentioned, the client switches to the WebSocket protocol by means of an HTTP Upgrade request. This is what it looks like:
GET /ws HTTP/1.1
Host: mail.ru
Connection: Upgrade
Sec-Websocket-Key: A3xNe7sEB9HixkmBhVrYaA==
Sec-Websocket-Version: 13
Upgrade: websocket

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Sec-Websocket-Accept: ksu0wXWG+YmkVx+KQR2agP0cQn4=
Upgrade: websocket
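For reference, the Sec-Websocket-Accept value above is derived from the client's Sec-Websocket-Key as defined by RFC 6455: a SHA-1 over the key concatenated with a fixed GUID, base64-encoded. A minimal sketch (the acceptKey helper name is mine):

import (
    "crypto/sha1"
    "encoding/base64"
)

// RFC 6455 fixed GUID appended to the client key.
const wsGUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

// acceptKey computes the Sec-Websocket-Accept header value.
func acceptKey(key string) string {
    h := sha1.New()
    h.Write([]byte(key))
    h.Write([]byte(wsGUID))
    return base64.StdEncoding.EncodeToString(h.Sum(nil))
}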
In other words, in our case the HTTP request and its headers are needed only for the switch to the WebSocket protocol. This knowledge, together with what is stored inside http.Request, suggests that, for the sake of optimization, we could refuse unnecessary allocations and copyings when processing HTTP requests and abandon the standard net/http server.
For example, http.Request contains a field of the Header type, which is unconditionally filled with all the request headers by copying data from the connection into value strings. Imagine how much extra data can be kept inside this field, for example a large-size Cookie header.
But what could we take in exchange?
WebSocket implementation
Unfortunately, all the libraries that existed at the time of our server's optimization only allowed the upgrade to be done on the standard net/http server. Moreover, none of them made it possible to use all of the read and write optimizations described above. For these optimizations to work, we need a rather low-level API for working with WebSocket. To reuse the buffers, we need the protocol functions to look like this:
func ReadFrame(io.Reader) (Frame, error)
func WriteFrame(io.Writer, Frame) error
If we had a library with such an API, we could read packets from the connection as follows (packet writing would look much the same):
// getReadBuf, putReadBuf are intended to
// reuse *bufio.Reader (with sync.Pool for example).
func getReadBuf() *bufio.Reader
func putReadBuf(*bufio.Reader)

// readPacket must be called when data could be read from conn.
func readPacket(conn io.Reader) error {
    buf := getReadBuf()
    defer putReadBuf(buf)

    buf.Reset(conn)
    frame, _ := ReadFrame(buf)
    parsePacket(frame.Payload)
    //...
}
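As the comment above suggests, the buffer helpers could be backed by a sync.Pool. A minimal sketch under that assumption (the helper names follow the pseudocode above):

import (
    "bufio"
    "sync"
)

var readBufPool = sync.Pool{
    New: func() interface{} {
        // The buffer size could be tuned; the source is set via Reset before use.
        return bufio.NewReader(nil)
    },
}

func getReadBuf() *bufio.Reader {
    return readBufPool.Get().(*bufio.Reader)
}

func putReadBuf(buf *bufio.Reader) {
    buf.Reset(nil) // Drop the reference to the connection.
    readBufPool.Put(buf)
}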
In short, it's time to make our own library.
github.com/gobwas/ws
To avoid imposing our protocol operation logic on users, we wrote the ws library so that all reading and writing methods accept the standard io.Reader and io.Writer interfaces, which makes it possible to use (or not use) buffering and any other I/O wrappers.
Besides upgrade requests from the standard net/http, ws supports zero-copy upgrades: the handling of upgrade requests and switching to WebSocket without any memory allocations or copyings. ws.Upgrade() accepts io.ReadWriter (net.Conn implements this interface), so we can use the standard net.Listen() and pass the connection received from ln.Accept() straight to ws.Upgrade(). The library also makes it possible to copy any request data for future use in the application (for example, a Cookie to verify the session).
Below are the benchmark results of Upgrade request processing: the standard net/http server versus net.Listen() with the zero-copy upgrade:
BenchmarkUpgradeHTTP    5156 ns/op    8576 B/op    9 allocs/op
BenchmarkUpgradeTCP      973 ns/op       0 B/op    0 allocs/op
Switching to ws and zero-copy upgrades saved us another 24 GB of memory: the space that had been allocated for the I/O buffers upon request processing by the net/http handler.
Summary
Let's recap and structure the optimizations I told you about:
A reader goroutine with a buffer inside is expensive. Solution: netpoll (epoll, kqueue); reuse the buffers.
A writer goroutine with a buffer inside is expensive. Solution: start the goroutine only when necessary; reuse the buffers.
With a storm of connections, netpoll alone won't work. Solution: reuse goroutines with a limit on their number.
net/http is not the fastest way to handle an Upgrade to WebSocket. Solution: use the zero-copy upgrade on the bare TCP connection.
This is what the server code could look like:
import (
    "net"
    "time"

    "github.com/gobwas/ws"
)

ln, _ := net.Listen("tcp", ":8080")

for {
    // Try to accept incoming connection inside free pool worker.
    // If there are no free workers for 1ms, do not accept anything and try later.
    // This will help us to prevent many self-ddos or out of resource limit cases.
    err := pool.ScheduleTimeout(time.Millisecond, func() {
        conn, _ := ln.Accept()
        _, _ = ws.Upgrade(conn)

        // Wrap WebSocket connection with our Channel struct.
        // This will help us to handle/send our app's packets.
        ch := NewChannel(conn)

        // Wait for incoming bytes from connection.
        poller.Start(conn, netpoll.EventRead, func() {
            // Do not cross the resource limits.
            pool.Schedule(func() {
                // Read and handle incoming packet(s).
                ch.Receive()
            })
        })
    })
    if err != nil {
        time.Sleep(time.Millisecond)
    }
}
Conclusion
Premature optimization is the root of all evil. (c) Donald Knuth
Certainly, the above optimizations are relevant, but not in all cases. For example, if the ratio between free resources (memory, CPU) and the number of online connections is rather high (the server is half-idle), there is probably no sense in optimizing. However, knowing where and what to improve can benefit you a lot.