Performance Optimization in Practice: A Million WebSockets and Go

Hello everyone! My name is Sergey Kamardin, and I am an engineer at Mail.Ru. This article explains how we developed a high-load WebSocket service in Go. Even if you are familiar with WebSockets but know little about Go, I hope you will still find the performance optimization ideas and techniques described here inspiring.

1. Introduction

To set the stage for the rest of the article, I would like to start with why we developed this service in the first place.

Mail.Ru has many stateful systems. The user's e-mail storage is one of them. There are several ways to track changes of state: periodic polling, or system notifications about state changes. Both approaches have their pros and cons. For a product like mail, letting users receive new messages as quickly as possible is a key metric. Polling for mail generates about 50,000 HTTP requests per second, 60% of which return a 304 status (meaning the mailbox has not changed). So, to reduce the load on the servers and speed up mail delivery, we decided to rewrite a publisher-subscriber service (also known as a bus, message broker, or event channel). This service receives notifications about state updates and also handles subscriptions to those updates.

Before the rewrite, the publisher-subscriber service worked like this:

And now:

The first diagram above shows the old architecture. The browser periodically polls the API service for updates from the mail Storage service.

The second diagram shows the new architecture. The browser establishes a WebSocket connection with the Notification API service, and the Notification API service sends the relevant subscriptions to the Bus service. When a new e-mail arrives, the Storage service sends a notification to the Bus (1), and the Bus forwards it to the appropriate subscribers (2). The API service finds the connection corresponding to the received notification and pushes it to the user's browser (3).

Today we will talk about this API service (also called the WebSocket service). Before I start, I would like to mention that this service handles nearly 3 million online connections.

2. The Idiomatic Way

First, let's look at how we might implement parts of this service in Go without any optimizations. Before getting into net/http, let's discuss how we will send and receive data. This data is defined on top of the WebSocket protocol (for example, as JSON objects); we will refer to it as packets below.

Let's start by implementing the Channel structure, which contains the logic for sending and receiving packets over a WebSocket connection.

2.1. Channel structure

// Packet represents application level data.
type Packet struct {
    ...
}

// Channel wraps user connection.
type Channel struct {
    conn net.Conn    // WebSocket connection.
    send chan Packet // Outgoing packets queue.
}

func NewChannel(conn net.Conn) *Channel {
    c := &Channel{
        conn: conn,
        send: make(chan Packet, N),
    }
    go c.reader()
    go c.writer()
    return c
}

What I want to emphasize here are the two goroutines for reading and writing. Each goroutine requires its own stack, whose initial size is between 2 KB and 8 KB depending on the operating system and Go version. We mentioned earlier that there are 3 million online connections; if each goroutine stack needs 4 KB, the two goroutines per connection require 24 GB of memory across all connections. And that is without counting the memory allocated for the Channel structure itself, the outgoing packet queue ch.send, and other internal fields.

2.2. I/O goroutines

Next, let's look at the implementation of the "reader":

func (c *Channel) reader() {
    // We make a buffered read to reduce read syscalls.
    buf := bufio.NewReader(c.conn)
    for {
        pkt, _ := readPacket(buf)
        c.handle(pkt)
    }
}

Here we use bufio.Reader to read as many bytes as the buf size allows at a time, reducing the number of read() system calls. In an infinite loop, we expect new data to arrive. Remember that phrase: expect new data to arrive. We will come back to it later.

We have omitted the packet parsing and handling logic because it is not relevant to the optimizations we are discussing. But buf deserves our attention: its default size is 4 KB, which means another 12 GB of memory across all connections. The "writer" is in a similar situation:

func (c *Channel) writer() {
    // We make buffered write to reduce write syscalls.
    buf := bufio.NewWriter(c.conn)
    for pkt := range c.send {
        _ = writePacket(buf, pkt)
        buf.Flush()
    }
}

We iterate over the c.send channel and write the packets to the buffer. The attentive reader has already noticed that this is yet another 4 KB of memory: about 12 GB for 3 million connections.

2.3. HTTP

Now we have a simple Channel implementation, and we need a WebSocket connection to go with it. Since we are still under the heading of the idiomatic way, let's look at how this is usually done.

Note: if you do not know how WebSocket works, it is worth mentioning that the client switches to the WebSocket protocol through a special HTTP mechanism called an upgrade request. After an upgrade request is processed successfully, the server and client use the TCP connection to exchange binary WebSocket frames. The frame structure is described in RFC 6455.

import (
    "net/http"
    "some/websocket"
)

http.HandleFunc("/v1/ws", func(w http.ResponseWriter, r *http.Request) {
    conn, _ := websocket.Upgrade(r, w)
    ch := NewChannel(conn)
    //...
})

Note that http.ResponseWriter here holds bufio.Reader and bufio.Writer instances (each with a 4 KB buffer), allocated for the *http.Request initialization and for writing the response.

Regardless of which WebSocket library is used, after a successful response to the upgrade request, the server obtains the I/O buffers together with the TCP connection by calling responseWriter.Hijack().

Note: in some cases, the net/http.putBufio{Reader,Writer} calls can release the buffers back to net/http's sync.Pool.

In this way, 3 million connections require another 24 GB of memory.

So, a program that does not do anything yet already takes up to 72 GB of memory!

3. Optimization

Let's review the workflow of a user connection described earlier. After establishing the WebSocket, the client sends a request to subscribe to the relevant events (we ignore requests like ping/pong here). After that, the client may not send anything else for the entire lifetime of the connection.

A connection's lifetime may last anywhere from a few seconds to a few days.

So, most of the time, Channel.reader() and Channel.writer() are just waiting to receive or send data, and alongside them sit the 4 KB I/O buffers allocated for each.

Now we can see that some things could be done better, right?

3.1. Netpoll

Do you remember that the Channel.reader() implementation used bufio.Reader.Read()? bufio.Reader.Read() in turn calls conn.Read(). This call blocks, waiting for new data on the connection. When new data arrives, the Go runtime wakes the corresponding goroutine and lets it read the next packet. After that, the goroutine blocks again, waiting for new data. Let's see how the Go runtime knows that a goroutine needs to be woken up.

If we look at the implementation of conn.Read(), we will see that it calls net.netFD.Read():

// net/fd_unix.go

func (fd *netFD) Read(p []byte) (n int, err error) {
    //...
    for {
        n, err = syscall.Read(fd.sysfd, p)
        if err != nil {
            n = 0
            if err == syscall.EAGAIN {
                if err = fd.pd.waitRead(); err == nil {
                    continue
                }
            }
        }
        //...
        break
    }
    //...
}

Go uses sockets in non-blocking mode. EAGAIN means that there is no data in the socket; instead of blocking on the empty socket, the OS returns control to the user process.

The first read() system call is made on the connection's file descriptor. If read() returns an EAGAIN error, the runtime calls pollDesc.waitRead():

// net/fd_poll_runtime.go

func (pd *pollDesc) waitRead() error {
    return pd.wait('r')
}

func (pd *pollDesc) wait(mode int) error {
    res := runtime_pollWait(pd.runtimeCtx, mode)
    //...
}

If we dig deeper, we will see that netpoll is implemented using epoll on Linux and kqueue on BSD. Why not use the same approach for our connections? We could allocate a read buffer and start the reading goroutine only when there is actually readable data on the socket.

On github.com/golang/go, there is an issue about exporting the netpoll functions.

3.2. Kill Goroutines

Suppose we have a netpoll implementation for Go. Now we can avoid starting the Channel.reader() goroutine up front, and instead subscribe to the event of readable data on the connection:

ch := NewChannel(conn)

// Make conn to be observed by netpoll instance.
poller.Start(conn, netpoll.EventRead, func() {
    // We spawn goroutine here to prevent poller wait loop
    // to become locked during receiving packet from ch.
    go ch.Receive()
})

// Receive reads a packet from conn and handles it somehow.
func (ch *Channel) Receive() {
    buf := bufio.NewReader(ch.conn)
    pkt, _ := readPacket(buf)
    ch.handle(pkt)
}

Channel.writer() is easier, because we can start the goroutine and allocate the buffer only when we are about to send a packet:

func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        go ch.writer()
    }
    ch.send <- p
}

Note that we do not handle the EAGAIN returned by the write() system call here; we rely on the Go runtime to handle it. This case is very rare. We could still handle it ourselves, as before, if we needed to.

After reading the outgoing packets from ch.send, ch.writer() finishes its work, finally freeing the goroutine stack and the buffer used for sending.

Very good! By avoiding the stacks and I/O buffers consumed by two continuously running goroutines, we have saved up to 48 GB.

3.3. Controlling resources

A large number of connections does not only mean high memory consumption. While developing the service, we kept running into race conditions and deadlocks, often followed by so-called self-DDoS: a situation where clients try to reconnect to the server and make things even worse.

For example, if for some reason we suddenly cannot handle ping/pong messages, these idle connections will keep getting closed (the connections will be assumed broken because no data arrives on them). Then the clients, instead of waiting for events from the server, lose the connection every N seconds and try to reconnect.

In such cases, it is better for the overloaded server to simply stop accepting new connections, so that the load balancer (nginx, for example) can pass the requests to another server instance.

Moreover, regardless of the server-side load, if all clients suddenly (most likely because of a bug) send us a packet, the memory we previously saved will be consumed again, because we would be back to creating a goroutine and allocating a buffer for each connection.

Goroutine pool

We can limit the number of packets processed simultaneously with a goroutine pool. Here is a simple implementation:

package gopool

func New(size int) *Pool {
    return &Pool{
        work: make(chan func()),
        sem:  make(chan struct{}, size),
    }
}

func (p *Pool) Schedule(task func()) {
    select {
    case p.work <- task:
    case p.sem <- struct{}{}:
        go p.worker(task)
    }
}

func (p *Pool) worker(task func()) {
    defer func() { <-p.sem }()
    for {
        task()
        task = <-p.work
    }
}

With this pool, our netpoll code becomes the following:

pool := gopool.New(128)

poller.Start(conn, netpoll.EventRead, func() {
    // We will block poller wait loop when
    // all pool workers are busy.
    pool.Schedule(func() {
        ch.Receive()
    })
})

Now we not only wait for readable data to appear on the socket before reading a packet, but also wait for a free goroutine from the pool.

Similarly, we modify the Send() code:

pool := gopool.New(128)

func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        pool.Schedule(ch.writer)
    }
    ch.send <- p
}

Here, instead of calling go ch.writer(), we reuse one of the pool's goroutines to send the data. So, with a pool of N goroutines, we guarantee that at most N requests are processed at the same time, and an (N + 1)-th arriving request will not allocate an (N + 1)-th read buffer. The goroutine pool also lets us limit Accept() and Upgrade() for new connections, avoiding most DDoS scenarios.

3.4. Zero-copy upgrade

As mentioned earlier, the client switches to the WebSocket protocol via an HTTP upgrade request. The upgrade exchange looks like this:

GET /ws HTTP/1.1
Host: mail.ru
Connection: Upgrade
Sec-Websocket-Key: A3xNe7sEB9HixkmBhVrYaA==
Sec-Websocket-Version: 13
Upgrade: websocket

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Sec-Websocket-Accept: ksu0wXWG+YmkVx+KQR2agP0cQn4=
Upgrade: websocket

We receive the HTTP request and its headers only to switch to the WebSocket protocol, yet http.Request holds all of that header data. The lesson here is that, for the sake of optimization, we could abandon the standard net/http server and avoid the useless allocations and copies made while processing HTTP upgrade requests.

For example, http.Request contains a field called Header. The standard net/http server unconditionally copies all header data from the request into this field. You can imagine how much redundant data this can hold, for example a header with a very long Cookie.

How are we going to optimize it?

WebSocket implementation

Unfortunately, at the time we were optimizing our server, all the libraries we could find only supported upgrades on top of the standard net/http server, and not one of them allowed us to implement the read and write optimizations described above. For these optimizations to be possible, we need a fairly low-level API for working with WebSocket. To reuse the buffers, we need protocol functions that look like this:

func ReadFrame(io.Reader) (Frame, error)
func WriteFrame(io.Writer, Frame) error

If we had a library with such an API, we could read packets from the connection as follows:

// getReadBuf, putReadBuf are intended to
// reuse *bufio.Reader (with sync.Pool for example).
func getReadBuf() *bufio.Reader
func putReadBuf(*bufio.Reader)

// readPacket must be called when data could be read from conn.
func readPacket(conn io.Reader) error {
    buf := getReadBuf()
    defer putReadBuf(buf)

    buf.Reset(conn)
    frame, _ := ReadFrame(buf)
    parsePacket(frame.Payload)
    //...
}

In short, we needed to write our own library.

github.com/gobwas/ws

The main design idea of the ws library is not to expose the protocol operation logic to the user. All read and write functions accept the generic io.Reader and io.Writer interfaces, so it can freely be combined with buffered or any other kind of I/O.

Besides upgrades from the standard net/http server, ws supports zero-copy upgrades: it can handle the upgrade request and switch to WebSocket mode without any memory allocations or copies. ws.Upgrade() accepts an io.ReadWriter (net.Conn implements this interface). In other words, we can use the standard net.Listen() and immediately pass the connection received from ln.Accept() to ws.Upgrade(). The library also makes it possible to copy any request data for future use in the application (for example, a Cookie to verify the session).

Below are benchmark results for handling upgrade requests: a standard net/http server versus net.Listen() with a zero-copy upgrade:

BenchmarkUpgradeHTTP    5156 ns/op    8576 B/op    9 allocs/op
BenchmarkUpgradeTCP      973 ns/op       0 B/op    0 allocs/op

Switching to ws with zero-copy upgrades saved us another 24 GB: the memory that had been allocated for I/O buffers when processing requests in net/http.

3.5. Review

Let's review the optimizations we mentioned earlier:

    • A read goroutine with a buffer inside consumes a lot of memory. Solution: netpoll (epoll, kqueue); reuse the buffers.
    • A write goroutine with a buffer inside consumes a lot of memory. Solution: start the goroutine only when needed; reuse the buffers.
    • With a storm of connections, netpoll by itself does not limit how many are handled at once. Solution: reuse goroutines and limit their number.
    • net/http is not the fastest way to handle upgrades to WebSocket. Solution: implement a zero-copy upgrade on the bare TCP connection.

Here is an approximate implementation of the server-side code:

import (
    "net"
    "github.com/gobwas/ws"
)

ln, _ := net.Listen("tcp", ":8080")

for {
    // Try to accept incoming connection inside free pool worker.
    // If there are no free workers for 1ms, do not accept anything and try later.
    // This will help us to prevent many self-ddos or out of resource limit cases.
    err := pool.ScheduleTimeout(time.Millisecond, func() {
        conn, _ := ln.Accept()
        _ = ws.Upgrade(conn)

        // Wrap WebSocket connection with our Channel struct.
        // This will help us to handle/send our app's packets.
        ch := NewChannel(conn)

        // Wait for incoming bytes from connection.
        poller.Start(conn, netpoll.EventRead, func() {
            // Do not cross the resource limits.
            pool.Schedule(func() {
                // Read and handle incoming packet(s).
                ch.Receive()
            })
        })
    })
    if err != nil {
        time.Sleep(time.Millisecond)
    }
}

4. Conclusion

"In programming, premature optimization is the root of all evil." — Donald Knuth

The optimizations above are meaningful, but not applicable everywhere. For example, if the ratio of free resources (memory, CPU) to the number of online connections is high, optimization probably makes little sense. Still, knowing where and what can be optimized is always helpful.

Thank you for your attention!

5. References

    • https://github.com/mailru/easygo
    • https://github.com/gobwas/ws
    • https://github.com/gobwas/ws-examples
    • https://github.com/gobwas/httphead

Originally published on medium.com on 2017/08/03.
