At one point, our push service's socket usage looked very abnormal: our own statistics showed about 100,000 users online at the same time, yet the number of occupied sockets reached 300,000, and checking the goroutine count showed more than 600,000.
Each user occupies one socket, and each socket has two goroutines, one for reading and one for writing. Simplified, the code looks like this:
c, _ := listener.Accept()
go c.run()

func (c *conn) run() {
    go c.onWrite()
    c.onRead()
}

func (c *conn) onRead() {
    stat.AddConnCount(1)
    // do something
    stat.AddConnCount(-1) // clear
    // notify onWrite to quit
}
At the time, I suspected that our concurrent-online statistics were actually correct, and that something went wrong after the clear stage, leaving the two goroutines unable to exit properly. After examining the code we found a suspicious place: besides our own statistics, we also send some statistical information to the company's statistics platform. The code looks like this:
ch = make(chan []byte, 100000)

func send(msg []byte) {
    ch <- msg
}

// in another goroutine:
msg := <-ch
httpsend(msg)
Our channel buffer is allocated at 100,000 entries, so if the company's statistics platform has a problem, the channel can fill up and every send blocks. But was that really the reason?
Fortunately, we had built pprof support into the code. From the pprof goroutine dump we found a large number of goroutines whose current function was httpsend; in other words, the company's statistics platform was unavailable under high concurrency. Although we had HTTP timeout handling, the data was sent so frequently that everything blocked anyway.
The temporary fix was to turn off sending these statistics. We will consider sending them to our MQ instead; the MQ service could also become unavailable, but frankly I trust the company's statistics platform even less than our MQ.
This also taught me a lesson: when accessing an external service, you must handle the case where it is unavailable, and even when it is available, you must still consider the pressure it puts back on you.
As for how to inspect goroutines with pprof, a simple example illustrates it:
package main

import (
    "net/http"
    "runtime/pprof"
)

var quit chan struct{} = make(chan struct{})

func f() {
    <-quit
}

func handler(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "text/plain")
    p := pprof.Lookup("goroutine")
    p.WriteTo(w, 1)
}

func main() {
    for i := 0; i < 10000; i++ {
        go f()
    }

    http.HandleFunc("/", handler)
    http.ListenAndServe(":11181", nil)
}
In the example above, we start 10,000 goroutines that all block, then by visiting http://localhost:11181/ we get the full goroutine profile. Only the key part is listed here:
goroutine profile: total 10004
10000 @ 0x186f6 0x616b 0x6298 0x2033 0x188c0
#	0x2033	main.f+0x33	/Users/siddontang/test/pprof.go:11
As you can see, 10,000 goroutines are executing in main.f, which matches our expectation.
Go provides many runtime introspection mechanisms like this that make it very convenient to locate problems in a program, which deserves some praise.