Smart you: you write your back-end services in Golang, using channels and goroutines everywhere; the things you used to do with thread pools in Java, you now do with goroutines. The service runs perfectly in your offline environment; load tests, unit tests, and integration tests all pass. You smile with pride, press the button to release to production, and then happily discover that the service has crashed.
Why does the service crash?
Channel deadlock
Deadlock is one of the most common problems in Golang. Those of us who came to Go from Java and C++ pay special attention when using mutexes, semaphores, and atomics, double-checking that they cannot deadlock. But we may overlook another major deadlock culprit: the channel.

The channel is practically made for deadlocks.

Reads and writes on an unbuffered channel are blocking: reading from a channel that nobody writes to, or writing to a channel that nobody reads from, will deadlock. A goroutine stuck this way waits forever for the channel operation to return. Pile up enough of these goroutines and you have a memory leak, and the server crashes.
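A minimal sketch of the unbuffered case:

```go
package main

func main() {
	ch := make(chan int) // unbuffered: a send blocks until another goroutine receives

	ch <- 1 // nobody ever reads from ch, so this send blocks forever
	// Run on its own, the runtime detects this and aborts with
	// "fatal error: all goroutines are asleep - deadlock!".
	// Inside a busy server, the stuck goroutine just leaks silently instead.
}
```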
Does a buffered channel save you?

It does not. Once a buffered channel is full, a write blocks just the same. Once it is empty and nobody writes, a read blocks and waits forever too.
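The same failure in a buffered-channel sketch:

```go
package main

func main() {
	ch := make(chan int, 1) // buffered channel with capacity 1

	ch <- 1 // fine: the buffer has room
	ch <- 2 // the buffer is full and nobody reads: this send blocks forever
}
```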
How to avoid it?
```go
select {
case <-ch:
	// a read from ch has occurred
case <-time.After(1 * time.Second):
	// moving on after 1 second
}
```
This pattern is introduced in the official Go documentation.
Third-party calls
We cannot naively assume that the upstream services we call will never go down, nor that our network is always reliable. A small hiccup is often enough to leave a large number of our goroutines temporarily stuck on third-party API calls. For example, under normal circumstances we expect every goroutine to finish within 20ms, but a bit of network jitter, or a small stability problem on the third-party side, and our goroutines suddenly need 2 seconds to complete.

Some people may think: so it's a bit slower, what's the big deal?

It is a big deal. When goroutines exited after 20ms, we might already have had thousands of them alive at any moment. If each goroutine's lifetime grows 100-fold, we instantly accumulate 100 × thousands of goroutines, and if we do nothing about it, the service crashes on the spot.

So someone else's problem can take down our own service.
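Before the fuller fix below, here is a minimal sketch of the most basic safeguard: bounding every outbound call with a deadline so that no single goroutine can be stuck longer than a fixed budget. The URL and the 100ms budget are placeholders I chose, not values from this article.

```go
package main

import (
	"context"
	"io"
	"log"
	"net/http"
	"time"
)

// fetchWithDeadline is a hypothetical helper: it caps how long one
// goroutine can be stuck waiting on a third-party call.
func fetchWithDeadline(url string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Deadline exceeded also lands here: the goroutine exits
		// quickly instead of piling up behind a slow dependency.
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	if _, err := fetchWithDeadline("https://example.com/api"); err != nil {
		log.Println("call failed:", err)
	}
}
```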
What to do?
Use a circuit breaker.
We want to protect our services in this way:
First, set an SLA (Service Level Agreement):
- Average response time
- Error tolerance rate
- Max QPS
We need to monitor the error rate of our third-party calls. When the error rate rises above a certain threshold (a timeout also counts as an error), we temporarily fail those calls immediately, so that the calling goroutines can exit quickly.

We also need to probe from time to time to see whether the dependency has returned to a healthy, usable state; if it has, we go back to the normal calling path and start measuring the error rate from scratch.

This is exactly the circuit breaker pattern that people often talk about.
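A minimal, illustrative sketch of that state machine, with names and thresholds of my own choosing (closed until the error rate trips it, open while failing fast, then a single half-open probe after a cool-down); it is not concurrency-safe and not production code:

```go
package breaker

import (
	"errors"
	"time"
)

// ErrOpen is returned while the breaker is failing fast.
var ErrOpen = errors.New("circuit breaker open: failing fast")

type Breaker struct {
	failures, total int
	threshold       float64       // e.g. 0.25 trips the breaker above 25% errors
	minRequests     int           // don't trip on a tiny sample
	cooldown        time.Duration // how long to stay open before probing again
	openedAt        time.Time
	open            bool
}

func New(threshold float64, minRequests int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, minRequests: minRequests, cooldown: cooldown}
}

// Call runs fn through the breaker; timeouts should surface as errors from fn.
func (b *Breaker) Call(fn func() error) error {
	now := time.Now()
	if b.open {
		if now.Sub(b.openedAt) < b.cooldown {
			return ErrOpen // fail fast so calling goroutines exit quickly
		}
		// Half-open: let a single probe through; its outcome decides.
		if err := fn(); err != nil {
			b.openedAt = now // still unhealthy: stay open for another cooldown
			return err
		}
		b.open, b.failures, b.total = false, 0, 0 // recovered: count from scratch
		return nil
	}
	err := fn()
	b.total++
	if err != nil {
		b.failures++
	}
	if b.total >= b.minRequests && float64(b.failures)/float64(b.total) > b.threshold {
		b.open, b.openedAt = true, now
	}
	return err
}
```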
A Golang implementation of the circuit breaker: afex/hystrix-go
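A sketch of how this might look with afex/hystrix-go; the command name, URL, and thresholds here are placeholders to be tuned against your own SLA:

```go
package main

import (
	"errors"
	"log"
	"net/http"

	"github.com/afex/hystrix-go/hystrix"
)

func main() {
	// Hypothetical thresholds; map them back to your SLA.
	hystrix.ConfigureCommand("third_party_api", hystrix.CommandConfig{
		Timeout:                1000, // ms: slower than this counts as an error
		MaxConcurrentRequests:  100,
		RequestVolumeThreshold: 20,   // minimum samples before the circuit can trip
		ErrorPercentThreshold:  25,   // trip when more than 25% of calls fail
		SleepWindow:            5000, // ms to wait before probing the dependency again
	})

	err := hystrix.Do("third_party_api", func() error {
		resp, err := http.Get("https://example.com/api") // the protected call
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return errors.New("unexpected status from third-party API")
		}
		return nil
	}, func(err error) error {
		// Fallback runs when the call fails or the circuit is open,
		// so the calling goroutine returns immediately.
		return err
	})
	if err != nil {
		log.Println("call failed or breaker open:", err)
	}
}
```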
How do we observe the load on our service?
- Prometheus
- Datadog: Modern Monitoring & Analytics
- New Relic: Digital Performance Monitoring and Management
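As one example, a minimal sketch of exposing metrics to Prometheus with the official Go client (the :2112 port is just a conventional placeholder); the default registry already includes the Go runtime collector, so the number of live goroutines shows up as the go_goroutines gauge:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The default registry ships with a Go collector, so /metrics
	// exposes go_goroutines and other runtime metrics out of the box.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```

Graphing go_goroutines over time makes the kind of pile-up described above easy to spot.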
How to debug exactly where a goroutine is stuck?
Pprof: https://golang.org/pkg/net/http/pprof/
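A minimal sketch of wiring it up (the localhost:6060 port is just the conventional choice from the package documentation):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// In a real service this runs alongside your normal listeners,
	// ideally on an internal-only port.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Visiting /debug/pprof/goroutine?debug=2 then dumps the stack of every live goroutine, which shows exactly which channel operation or third-party call each one is stuck on.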
This article is only a shallow look at how to prevent, detect, and troubleshoot online failures caused by runaway goroutines in Golang services. It is meant as a conversation starter. Smart reader, let's study this together and dig deeper.

I hope that the next time we press the release button on a Golang service, we can smile with real, well-founded confidence.