Performance optimization in practice: A million WebSockets and Go


Originally published on medium.com on 2017/08/03.

Hello everyone! My name is Sergey Kamardin and I am an engineer at Mail.ru. This article explains how we developed a high-load WebSocket service in Go. Even if you are familiar with WebSockets but know little about Go, I hope you will find the performance-optimization ideas and techniques described here inspiring.

1. Introduction

By way of introduction to the rest of the article, I would like to explain why we developed this service.

Mail.ru has many stateful systems, and the user's e-mail storage is one of them. There are several ways to track changes to such state: regular polling, or system notifications about state changes. Both approaches have their pros and cons. For a mail product, how quickly the user receives new mail is a key metric. Polling for mail generates about 50,000 HTTP requests per second, 60% of which return a 304 status (meaning the mailbox has not changed). Therefore, to reduce server load and speed up mail delivery, we decided to rewrite the system as a publisher-subscriber service (also known as a bus, message broker, or event channel). This service receives notifications about state updates and handles subscriptions to those updates.

Before the rewrite to a publisher-subscriber service:


[Figure: 0_pull.png, the old polling architecture]

Right now:


[Figure: 1_push.png, the new push architecture]

The first diagram above shows the old architecture: the browser periodically polls the API service about updates to the mail storage service (Storage).

The second diagram shows the new architecture. The browser establishes a WebSocket connection with the Notification API service, and the Notification API service registers the relevant subscriptions with the bus service. When a new e-mail arrives, the storage service (Storage) sends a notification to the bus (1), and the bus forwards it to the appropriate subscribers (2). The API service finds the connection matching the received notification and pushes the notification to the user's browser (3).

Today we will talk about this API service (also called the WebSocket service). Before I start, I should mention that this service handles nearly 3 million online connections.

2. The idiomatic way

First, let's look at how we would implement parts of this service in Go without any optimizations. Before turning to the net/http implementation, let's discuss how we will send and receive data. The data sits above the WebSocket protocol (for example, JSON objects); we will refer to it as packets below.

Let's implement the Channel structure first. It contains the logic for sending and receiving packets over the WebSocket connection.

2.1. Channel structure

```go
// Packet represents application level data.
type Packet struct {
    ...
}

// Channel wraps user connection.
type Channel struct {
    conn net.Conn    // WebSocket connection.
    send chan Packet // Outgoing packets queue.
}

func NewChannel(conn net.Conn) *Channel {
    c := &Channel{
        conn: conn,
        send: make(chan Packet, N),
    }
    go c.reader()
    go c.writer()
    return c
}
```

What I want to emphasize here are the two goroutines for reading and writing. Each goroutine requires its own stack, whose initial size, depending on the operating system and the Go version, is usually between 2 KB and 8 KB. We mentioned that there are 3 million online connections; if each goroutine stack needs 4 KB, all connections together require 24 GB of memory. And that is without counting the memory allocated for the Channel structure itself, the outgoing packet queue ch.send, and other internal fields.

2.2. I/O goroutines

Next look at the implementation of "reader":

```go
func (c *Channel) reader() {
    // We make a buffered read to reduce read syscalls.
    buf := bufio.NewReader(c.conn)
    for {
        pkt, _ := readPacket(buf)
        c.handle(pkt)
    }
}
```

Here we use bufio.Reader. On each iteration, buf reads as many bytes as it can into its buffer, reducing the number of read() system calls. In an infinite loop we wait for new data to arrive on the connection. Remember that phrase: wait for new data to arrive. We will come back to it later.

We omit the parsing and processing logic of packets, since it is not relevant to the optimizations we are discussing. But buf deserves attention: its default size is 4 KB, which means another 12 GB of memory across our connections. The "writer" has the same issue:

```go
func (c *Channel) writer() {
    // We make buffered write to reduce write syscalls.
    buf := bufio.NewWriter(c.conn)
    for pkt := range c.send {
        _ = writePacket(buf, pkt)
        buf.Flush()
    }
}
```

We iterate over the channel c.send, taking packets and writing them into the buffer. This is, as the attentive reader has already guessed, another 4 KB per connection, or about 12 GB of memory for 3 million connections.

2.3. HTTP

We have a simple Channel implementation; now we need a WebSocket connection to work with. Since we are still under the heading of the idiomatic way, let's look at how it is usually done.

Note: if you don't know how WebSocket works, the key point is that the client switches to the WebSocket protocol through a special HTTP mechanism called an Upgrade request. After the upgrade request is successfully processed, the server and the client use the TCP connection to exchange binary WebSocket frames. (The WebSocket RFC describes the frame structure in detail.)

```go
import (
    "net/http"
    "some/websocket"
)

http.HandleFunc("/v1/ws", func(w http.ResponseWriter, r *http.Request) {
    conn, _ := websocket.Upgrade(r, w)
    ch := NewChannel(conn)
    //...
})
```

Note that the http.ResponseWriter here holds a bufio.Reader and a bufio.Writer (each with its own 4 KB buffer), used for initializing the *http.Request and writing the response.

Regardless of which WebSocket library is used, after a successful response to the upgrade request, the server obtains the I/O buffers together with the TCP connection by calling responseWriter.Hijack().

Note: in some cases, net/http.putBufio{Reader,Writer} can release the buffers back into net/http's sync.Pool.

In this way, 3 million connections require another 24 GB of memory.

So, for a program that does not do anything yet, we already need up to 72 GB of memory!

3. Optimization

Let's review the workflow of a user connection described earlier. After the WebSocket is established, the client sends a request subscribing to the events it is interested in (we ignore ping/pong and similar requests here). After that, the client may not send anything else for the entire lifetime of the connection.

A connection's lifetime may last anywhere from a few seconds to a few days.

So most of the time, Channel.reader() and Channel.writer() are simply waiting to receive or send data, and with them sit the 4 KB I/O buffers allocated for each.

Now, we find that some areas can be further optimized, right?

3.1. Netpoll

Do you remember the Channel.reader() implementation? It used bufio.Reader.Read(), which in turn calls conn.Read(). This call blocks, waiting for new data on the connection. When new data arrives, the Go runtime wakes the corresponding goroutine and lets it read the next packet, after which the goroutine blocks again waiting for more data. Let's see how the Go runtime knows that a goroutine needs to be woken up.

If we look at the implementation of conn.Read(), we see that it calls net.netFD.Read():

```go
// net/fd_unix.go
func (fd *netFD) Read(p []byte) (n int, err error) {
    //...
    for {
        n, err = syscall.Read(fd.sysfd, p)
        if err != nil {
            n = 0
            if err == syscall.EAGAIN {
                if err = fd.pd.waitRead(); err == nil {
                    continue
                }
            }
        }
        //...
        break
    }
    //...
}
```

Go uses sockets in non-blocking mode. EAGAIN means there is no data in the socket yet: instead of blocking on the empty socket, the OS returns control to the user process.

First, a read() system call is made on the connection's file descriptor. If read() returns the EAGAIN error, the runtime calls pollDesc.waitRead():

```go
// net/fd_poll_runtime.go
func (pd *pollDesc) waitRead() error {
    return pd.wait('r')
}

func (pd *pollDesc) wait(mode int) error {
    res := runtime_pollWait(pd.runtimeCtx, mode)
    //...
}
```

If we keep digging, we see that netpoll is implemented with epoll on Linux and kqueue on BSD. Why not take the same approach with our connections? We could allocate the read buffer and start the reading goroutine only when readable data actually appears on the socket.

On github.com/golang/go, there is an open issue about exporting the netpoll functionality: https://github.com/golang/go/issues/15735.

3.2. Kill Goroutines

Suppose we have a netpoll implementation for Go. Instead of creating a Channel.reader() goroutine per connection, we can subscribe the connection for readable-data events:

```go
ch := NewChannel(conn)

// Make conn to be observed by netpoll instance.
poller.Start(conn, netpoll.EventRead, func() {
    // We spawn goroutine here to prevent poller wait loop
    // to become locked during receiving packet from ch.
    go ch.Receive()
})

// Receive reads a packet from conn and handles it somehow.
func (ch *Channel) Receive() {
    buf := bufio.NewReader(ch.conn)
    pkt := readPacket(buf)
    ch.handle(pkt)
}
```

Channel.writer() is easier: we only create the goroutine and allocate the buffer when we are about to send a packet.

```go
func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        go ch.writer()
    }
    ch.send <- p
}
```

Note that here we do not handle the case where the write() system call returns EAGAIN; we lean on the Go runtime to handle it, because this case is actually very rare. We could still handle it ourselves, as before, if needed.

After ch.writer() has read all pending packets from ch.send, it finishes its work, freeing the goroutine stack and the send buffer.

Very good! By avoiding the stacks and I/O buffers of two continuously running goroutines per connection, we have saved up to 48 GB of memory.

3.3. Controlling resources

A huge number of connections does not only mean high memory consumption. While developing the service, we repeatedly ran into race conditions and deadlocks, often followed by a so-called self-DDoS: a situation where clients keep trying to reconnect to the server and make things even worse.

For example, if for some reason we suddenly could not handle ping/pong messages, the idle connections would keep getting closed (they would look dead because no data arrives on them). The clients would then notice the lost connection every N seconds and try to reconnect, instead of waiting for events from the server.

In such cases, the better approach is to have the overloaded server stop accepting new connections, so that the load balancer (for example, nginx) can pass requests to another server.

Regardless of the server-side load, if all clients suddenly (most likely because of a bug) send us a packet at once, the gigabytes of memory we saved earlier will be consumed again, since we would be back to creating a goroutine and allocating a buffer for every connection.

Goroutine pool

We can limit the number of packets handled simultaneously using a goroutine pool. Here is a simple implementation of one:

```go
package gopool

func New(size int) *Pool {
    return &Pool{
        work: make(chan func()),
        sem:  make(chan struct{}, size),
    }
}

func (p *Pool) Schedule(task func()) {
    select {
    case p.work <- task:
    case p.sem <- struct{}{}:
        go p.worker(task)
    }
}

func (p *Pool) worker(task func()) {
    defer func() { <-p.sem }()
    for {
        task()
        task = <-p.work
    }
}
```

The code we use with Netpoll becomes the following:

```go
pool := gopool.New(128)

poller.Start(conn, netpoll.EventRead, func() {
    // We will block poller wait loop when
    // all pool workers are busy.
    pool.Schedule(func() {
        ch.Receive()
    })
})
```

Now we read a packet not only when readable data appears on the socket, but also only once a free goroutine in the pool becomes available.

Similarly, we modify Send():

```go
pool := gopool.New(128)

func (ch *Channel) Send(p Packet) {
    if ch.noWriterYet() {
        pool.Schedule(ch.writer)
    }
    ch.send <- p
}
```

Here we do not call go ch.writer() but instead reuse a pool goroutine for sending. So, with a pool of N goroutines, we guarantee that at most N requests are handled simultaneously, and request N + 1 will not allocate yet another buffer. The goroutine pool also lets us limit Accept() and Upgrade() for new connections, which avoids most DDoS situations.

3.4. Zero-copy upgrade

As previously mentioned, the client switches to the WebSocket protocol via an HTTP Upgrade request. This is what an upgrade request and its response look like:

```
GET /ws HTTP/1.1
Host: mail.ru
Connection: Upgrade
Sec-Websocket-Key: A3xNe7sEB9HixkmBhVrYaA==
Sec-Websocket-Version: 13
Upgrade: websocket

HTTP/1.1 101 Switching Protocols
Connection: Upgrade
Sec-Websocket-Accept: ksu0wXWG+YmkVx+KQR2agP0cQn4=
Upgrade: websocket
```

We receive the HTTP request and its headers only to switch to the WebSocket protocol, yet http.Request keeps all that header data around. The takeaway: when optimizing, we can abandon the standard net/http server and avoid the useless allocations and copies made while processing HTTP requests.

For example, http.Request contains a field named Header. The standard net/http server unconditionally copies all header data from the request into this field. You can imagine how much redundant data ends up stored there, for example a header with a very long Cookie.

How are we going to optimize it?

WebSocket implementation

Unfortunately, at the time we were optimizing our server, all the libraries we could find only supported upgrades on top of the standard net/http server, and none of them allowed the read and write optimizations described above. For these optimizations to be possible, we need a low-level API for working with WebSocket. To reuse buffers, we need protocol functions like these:

```go
func ReadFrame(io.Reader) (Frame, error)
func WriteFrame(io.Writer, Frame) error
```

If we had a library with such an API, we would read the packets from the connection as follows:

```go
// getReadBuf, putReadBuf are intended to
// reuse *bufio.Reader (with sync.Pool for example).
func getReadBuf() *bufio.Reader
func putReadBuf(*bufio.Reader)

// readPacket must be called when data could be read from conn.
func readPacket(conn io.Reader) error {
    buf := getReadBuf()
    defer putReadBuf(buf)

    buf.Reset(conn)
    frame, _ := ReadFrame(buf)
    parsePacket(frame.Payload)
    //...
}
```

In short, we need to write a library ourselves.

github.com/gobwas/ws

The main design idea of the ws library is to keep the protocol's operation logic decoupled from the I/O: all read and write functions accept the generic io.Reader and io.Writer interfaces, so the library can be used freely with buffered or any other kind of I/O.

Besides upgrades through the standard net/http server, ws supports a zero-copy upgrade: it processes the upgrade request and switches to WebSocket mode without any memory allocations or copies. ws.Upgrade() accepts an io.ReadWriter (net.Conn implements this interface). In other words, we can use the standard net.Listen() and immediately hand the connection received from ln.Accept() to ws.Upgrade(). The library also makes it possible to copy any request data for future use by the application (for example, copying a Cookie to validate a session).

Below are benchmark results for upgrade-request handling: the standard net/http server versus net.Listen() with a zero-copy upgrade:

```
BenchmarkUpgradeHTTP    5156 ns/op    8576 B/op    9 allocs/op
BenchmarkUpgradeTCP      973 ns/op       0 B/op    0 allocs/op
```

Switching to ws with the zero-copy upgrade saved us another 24 GB: the memory previously used as I/O buffers by net/http request processing.

3.5. Review

Let's review the optimizations we mentioned earlier:

    • A read goroutine with a buffer inside consumes a lot of memory. Solution: netpoll (epoll, kqueue); reuse the buffers.
    • A write goroutine with a buffer inside consumes a lot of memory. Solution: start the goroutine only when needed; reuse the buffers.
    • With a huge number of connections, netpoll alone will not limit them. Solution: reuse goroutines and limit their number.
    • net/http is not the fastest way to handle an upgrade to WebSocket. Solution: implement a zero-copy upgrade on the raw TCP connection.

The following is the approximate implementation code for the service side:

```go
import (
    "net"
    "time"

    "github.com/gobwas/ws"
)

ln, _ := net.Listen("tcp", ":8080")

for {
    // Try to accept incoming connection inside free pool worker.
    // If there are no free workers for 1ms, do not accept anything and try later.
    // This will help us to prevent many self-ddos or out of resource limit cases.
    err := pool.ScheduleTimeout(time.Millisecond, func() {
        conn, _ := ln.Accept()
        _ = ws.Upgrade(conn)

        // Wrap WebSocket connection with our Channel struct.
        // This will help us to handle/send our app's packets.
        ch := NewChannel(conn)

        // Wait for incoming bytes from connection.
        poller.Start(conn, netpoll.EventRead, func() {
            // Do not cross the resource limits.
            pool.Schedule(func() {
                // Read and handle incoming packet(s).
                ch.Receive()
            })
        })
    })
    if err != nil {
        time.Sleep(time.Millisecond)
    }
}
```

4. Conclusion

"Premature optimization is the root of all evil." (Donald Knuth)

The above optimizations are meaningful, but not all of them apply. For example, if the ratio of idle resources (memory, CPU) to the number of online connections is high, then optimization doesn't make much sense. Of course, knowing where to optimize and how to optimize is always helpful.

Thank you for your attention!

5. References

    • https://github.com/mailru/easygo
    • https://github.com/gobwas/ws
    • https://github.com/gobwas/ws-examples
    • https://github.com/gobwas/httphead