Golang pprof
When your Golang program consumes more memory at run time than you expect, you need to figure out which code in the program is causing it. At that point the compiled program is a black box: how do you see into its memory use? Fortunately, Golang has built-in mechanisms to help us analyze and track it down. Usually, we can use Golang's pprof to analyze the memory usage of a Golang process.
A pprof example
We usually expose pprof information over an HTTP API for analysis, using the package net/http/pprof. The following is a simple example:
```go
// pprof's init function registers its handlers on http.DefaultServeMux.
// If you do not use http.DefaultServeMux to serve your HTTP API, look at
// that init function and register the handlers yourself.
import _ "net/http/pprof"

go func() {
	http.ListenAndServe("0.0.0.0:8080", nil)
}()
```
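Put together as a complete program, a minimal sketch (the layout here is assumed, not from the original fragment) looks like this:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // blank import: registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	// In a real service this would run alongside the business logic,
	// often on a separate goroutine or a dedicated admin port.
	log.Fatal(http.ListenAndServe("0.0.0.0:8080", nil))
}
```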
Now we can start the process and visit http://localhost:8080/debug/pprof/ to see a simple page listing:

```
/debug/pprof/

profiles:
0         block
136840    goroutine
902       heap
0         mutex
40        threadcreate

full goroutine stack dump
```

Note: all of the data below, including what go tool pprof collects, depends on the pprof sampling rate inside the process; by default one sample is taken per 512KB allocated. When the data is not detailed enough, the rate can be adjusted via runtime.MemProfileRate, but the smaller its value (i.e. the more frequent the sampling), the slower the process runs.
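A minimal sketch of adjusting the rate (the 64KB value is only an illustrative assumption; set it early, before the allocations you care about happen):

```go
import "runtime"

func init() {
	// Default is 512 * 1024 (one sample per 512KB allocated).
	// Smaller values record more allocations but slow the program down;
	// 64KB here is an arbitrary example value.
	runtime.MemProfileRate = 64 * 1024
}
```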
The page simply lists several built-in profiles and their counts. For example, there are 136,840 goroutines running; clicking the corresponding link shows the details.
When analyzing memory-related issues, we can click the heap item, i.e. visit http://127.0.0.1:8080/debug/pprof/heap?debug=1, and see output like this:
```
heap profile: 3190: 77516056 [54762: 612664248] @ heap/1048576
1: 29081600 [1: 29081600] @ 0x89368e 0x894cd9 0x8a5a9d 0x8a9b7c 0x8af578 0x8b4441 0x8b4c6d 0x8b8504 0x8b2bc3 0x45b1c1
#	0x89368d	github.com/syndtr/goleveldb/leveldb/memdb.(*DB).Put+0x59d
#	0x894cd8	xxxxx/storage/internal/memtable.(*memTable).Set+0x88
#	0x8a5a9c	xxxxx/storage.(*snapshotter).appendCommitLog+0x1cc
#	0x8a9b7b	xxxxx/storage.(*store).update+0x26b
#	0x8af577	xxxxx/config.(*config).update+0xa7
#	0x8b4440	xxxxx/naming.(*naming).update+0x120
#	0x8b4c6c	xxxxx/naming.(*naming).instanceTimeout+0x27c
#	0x8b8503	xxxxx/naming.(*naming).(xxxxx/naming.instanceTimeout)-fm+0x63

......

# runtime.MemStats
# Alloc = 2463648064
# TotalAlloc = 31707239480
# Sys = 4831318840
# Lookups = 2690464
# Mallocs = 274619648
# Frees = 262711312
# HeapAlloc = 2463648064
# HeapSys = 3877830656
# HeapIdle = 854990848
# HeapInuse = 3022839808
# HeapReleased = 0
# HeapObjects = 11908336
# Stack = 655949824 / 655949824
# MSpan = 63329432 / 72040448
# MCache = 38400 / 49152
# BuckHashSys = 1706593
# GCSys = 170819584
# OtherSys = 52922583
# NextGC = 3570699312
# PauseNs = [1052815 217503 208124 233034 1146462 456882 1098525 530706 551702 419372 768322 596273 387826 455807 563621 587849 416204 599143 572823 488681 701731 656358 2476770 12141392 5827253 3508261 1715582 1295487 908563 788435 718700 0 0 0 ...]
# NumGC = 31
# DebugGC = false
```
There is a lot of output, but it divides into two parts. The first part prints the runtime.MemProfileRecord records obtained via runtime.MemProfile(). Their meaning is:
```
heap profile: 3190(inuse objects): 77516056(inuse bytes) [54762(alloc objects): 612664248(alloc bytes)] @ heap/1048576(2*MemProfileRate)
1: 29081600 [1: 29081600]    <- the four numbers mean the same as on the first line, but for a single record; one such line follows per record
@ 0x89368e 0x894cd9 0x8a5a9d 0x8a9b7c 0x8af578 0x8b4441 0x8b4c6d 0x8b8504 0x8b2bc3 0x45b1c1    <- the stack pointers of the record
# 0x89368d github.com/syndtr/goleveldb/leveldb/memdb.(*DB).Put+0x59d    <- the resolved stack frames
```
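If you ever need the same records programmatically rather than over HTTP, a small sketch using runtime.MemProfile might look like this:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// First call with nil to learn how many records exist.
	n, _ := runtime.MemProfile(nil, true)
	records := make([]runtime.MemProfileRecord, n+50) // headroom for records added in between
	n, ok := runtime.MemProfile(records, true)
	if !ok {
		return // the profile grew too fast; a real tool would retry
	}
	for _, r := range records[:n] {
		fmt.Printf("inuse: %d objects / %d bytes, alloc: %d objects / %d bytes\n",
			r.InUseObjects(), r.InUseBytes(), r.AllocObjects, r.AllocBytes)
	}
}
```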
The second part is easier to understand: it is the runtime.MemStats information read via runtime.ReadMemStats(). The fields worth focusing on are:

- Sys: the total virtual address space the process has obtained from the system.
- HeapAlloc: the amount of heap memory currently allocated, i.e. heap objects the user created with new and friends, including objects that are no longer reachable but not yet garbage-collected.
- HeapSys: the heap memory the process has obtained from the system; because the Golang runtime uses a tcmalloc-like allocator underneath that caches part of the heap memory, this too is virtual address space.
- PauseNs: the duration of each GC pause in nanoseconds; only the most recent 256 records are kept.
- NumGC: the number of GCs that have occurred.
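A small sketch that reads these same fields directly via runtime.ReadMemStats:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Println("Sys       =", ms.Sys)
	fmt.Println("HeapAlloc =", ms.HeapAlloc)
	fmt.Println("HeapSys   =", ms.HeapSys)
	fmt.Println("NumGC     =", ms.NumGC)
	// PauseNs is a circular buffer of the last 256 GC pause durations;
	// the most recent one sits at index (NumGC+255)%256.
	fmt.Println("last GC pause (ns) =", ms.PauseNs[(ms.NumGC+255)%256])
}
```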
For users unfamiliar with pprof, it is hard to extract much useful information from this raw output, so we need to bring in more tools to interpret pprof content more easily.
go tool pprof
We can run go tool pprof -inuse_space http://127.0.0.1:8080/debug/pprof/heap to attach to the process and look at the memory currently in use; this drops us into an interactive command line. Typing top10 shows the 10 functions holding the most live memory, which is typically used to detect memory being retained beyond expectations.
```
(pprof) top10
1355.47MB of 1436.26MB total (94.38%)
Dropped 371 nodes (cum <= 7.18MB)
Showing top 10 nodes out of ... (cum >= 23.50MB)
      flat  flat%   sum%        cum   cum%
  512.96MB 35.71% 35.71%   512.96MB 35.71%  net/http.newBufioWriterSize
  503.93MB 35.09% 70.80%   503.93MB 35.09%  net/http.newBufioReader
  113.04MB  7.87% 78.67%   113.04MB  7.87%  runtime.rawstringtmp
   55.02MB  3.83% 82.50%    55.02MB  3.83%  runtime.malg
   45.01MB  3.13% 85.64%    45.01MB  3.13%  xxxxx/storage.(*node).Clone
   26.50MB  1.85% 87.48%    52.50MB  3.66%  context.WithCancel
   25.50MB  1.78% 89.26%    83.58MB  5.82%  runtime.systemstack
   25.01MB  1.74% 91.00%    58.51MB  4.07%  net/http.readRequest
      25MB  1.74% 92.74%    29.03MB  2.02%  runtime.mapassign
   23.50MB  1.64% 94.38%    23.50MB  1.64%  net/http.(*Server).newConn
```
Then we can use go tool pprof -alloc_space http://127.0.0.1:8080/debug/pprof/heap to connect and look at cumulative memory allocations instead, and type top to see the function calls that have cumulatively allocated the most memory:
```
(pprof) top
523.38GB of 650.90GB total (80.41%)
Dropped 342 nodes (cum <= 3.25GB)
Showing top 10 nodes out of 106 (cum >= 28.02GB)
      flat  flat%   sum%        cum   cum%
  147.59GB 22.68% 22.68%   147.59GB 22.68%  runtime.rawstringtmp
  129.23GB 19.85% 42.53%   129.24GB 19.86%  runtime.mapassign
   48.23GB  7.41% 49.94%    48.23GB  7.41%  bytes.makeSlice
   46.25GB  7.11% 57.05%    71.06GB 10.92%  encoding/json.Unmarshal
   31.41GB  4.83% 61.87%   113.86GB 17.49%  net/http.readRequest
   30.55GB  4.69% 66.57%   171.20GB 26.30%  net/http.(*conn).readRequest
   22.95GB  3.53% 70.09%    22.95GB  3.53%  net/url.parse
   22.70GB  3.49% 73.58%    22.70GB  3.49%  runtime.stringtoslicebyte
   22.70GB  3.49% 77.07%    22.70GB  3.49%  runtime.makemap
   21.75GB  3.34% 80.41%    28.02GB  4.31%  context.WithCancel
```
We can see that string/[]byte conversions, map assignments, bytes.makeSlice, encoding/json.Unmarshal, and similar calls accumulate the most allocations. Now we can review the code to reduce these calls or optimize the related logic.
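As an illustration (these helpers are hypothetical, not code from the profiled project), two common fixes for hot spots like runtime.mapassign and bytes.makeSlice are to pre-size maps and buffers so they do not grow repeatedly:

```go
// buildIndex pre-sizes the map, avoiding the repeated growth and rehashing
// that shows up as runtime.mapassign in the profile.
func buildIndex(keys []string) map[string]int {
	m := make(map[string]int, len(keys))
	for i, k := range keys {
		m[k] = i
	}
	return m
}

// join pre-sizes the byte slice, avoiding the repeated grow-and-copy that
// shows up as bytes.makeSlice. total is the caller-computed final size.
func join(parts []string, total int) []byte {
	buf := make([]byte, 0, total)
	for _, p := range parts {
		buf = append(buf, p...)
	}
	return buf
}
```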
When we do not know which code paths trigger these calls, we can type top -cum to find out. -cum accumulates data up the call chain: if function A calls function B, the memory allocated in B is also attributed to A's cumulative total, which makes the call chain easy to trace.
```
(pprof) top20 -cum
322890.40MB of 666518.53MB total (48.44%)
Dropped 342 nodes (cum <= 3332.59MB)
Showing top 20 nodes out of 106 (cum >= 122316.23MB)
       flat  flat%   sum%          cum   cum%
          0     0%     0%  643525.16MB 96.55%  runtime.goexit
  2184.63MB  0.33%  0.33%  620745.26MB 93.13%  net/http.(*conn).serve
          0     0%  0.33%  435300.50MB 65.31%  xxxxx/api/server.(*httpServer).ServeHTTP
  5865.22MB  0.88%  1.21%  435300.50MB 65.31%  xxxxx/api/server/router.(*httpRouter).ServeHTTP
          0     0%  1.21%  433121.39MB 64.98%  net/http.serverHandler.ServeHTTP
          0     0%  1.21%  430456.29MB 64.58%  xxxxx/api/server/filter.(*chain).Next
    43.50MB 0.0065%  1.21%  429469.71MB 64.43%  xxxxx/api/server/filter.TransURLTov1
          0     0%  1.21%  346440.39MB 51.98%  xxxxx/api/server/filter.Role30x
 31283.56MB  4.69%  5.91%  175309.48MB 26.30%  net/http.(*conn).readRequest
          0     0%  5.91%  153589.85MB 23.04%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP
          0     0%  5.91%  153589.85MB 23.04%  github.com/julienschmidt/httprouter.(*Router).ServeHTTP-fm
          0     0%  5.91%  153540.85MB 23.04%  xxxxx/api/server/router.(*httpRouter).Register.func1
        2MB 0.0003%  5.91%  153117.78MB 22.97%  xxxxx/api/server/filter.Validate
151134.52MB 22.68% 28.58%  151135.02MB 22.68%  runtime.rawstringtmp
          0     0% 28.58%  150714.90MB 22.61%  xxxxx/api/server/router/naming/v1.(*serviceRouter).(git.intra.weibo.com/platform/vintage/api/server/router/naming/v1.service)-fm
          0     0% 28.58%  150714.90MB 22.61%  xxxxx/api/server/router/naming/v1.(*serviceRouter).service
          0     0% 28.58%  141200.76MB 21.18%  net/http.Redirect
132334.96MB 19.85% 48.44%  132342.95MB 19.86%  runtime.mapassign
       42MB 0.0063% 48.44%  125834.16MB 18.88%  xxxxx/api/server/router/naming/v1.heartbeat
          0     0% 48.44%  122316.23MB 18.35%  xxxxxx/config.(*config).Lookup
```
As shown above, it is easy to find the call chains behind these allocations. From the calling relationship in the code, filter.TransURLTov1 calls filter.Role30x, but the difference between their cum% values is 12.45%, so we can conclude that the memory allocated inside filter.TransURLTov1 itself accounts for 12.45% of the total memory allocated by the entire process, which makes it a worthwhile optimization target.
Then we can type the web command, which draws these cumulative relationships as a topological graph in an .svg file and opens it in the browser. Alternatively, run go tool pprof -alloc_space -cum -svg http://127.0.0.1:8080/debug/pprof/heap > heap.svg to generate the heap.svg picture directly.
Let's take a fragment of such a graph and analyze it:
Each box is a function in the call stacks pprof recorded. The number on the arrow pointing into a box is the memory cumulatively allocated through that call chain; the numbers on the arrows leaving the box are the amounts allocated by the functions it calls. The difference between them can be understood simply as the memory the function allocates itself, beyond what its callees allocate. The numbers inside the box reflect the same thing, reading: (memory allocated by the function itself) of (memory cumulatively allocated by the function).
The difference between --inuse/alloc_space and --inuse/alloc_objects

Typically:

- Use --inuse_space to analyze the program's resident memory footprint;
- Use --alloc_objects to analyze where memory is allocated temporarily; reducing such allocations can speed up the program.
go-torch
Besides using go tool pprof directly, we can also use flame graphs, which are more intuitive. go-torch generates flame graphs for Golang programs and relies on go tool pprof under the hood. For installation, see the project's introduction; note that only versions at or after commit a4daa2b support memory profiling.
We can use:

```sh
go-torch -alloc_space http://127.0.0.1:8080/debug/pprof/heap --colors=mem
go-torch -inuse_space http://127.0.0.1:8080/debug/pprof/heap --colors=mem
```
Note: the -alloc_space/-inuse_space flags conflict with flags such as -u/-b. When using -alloc_space/-inuse_space, append the pprof resource URL directly after the flag rather than specifying it via -u/-b; this comes from how go-torch parses its arguments, which becomes clear after reading its source. Also note that the URL for analyzing memory must end with heap, because the default path is profile, which is used to analyze CPU-related issues.
With the two commands above we get two flame graphs with the corresponding meanings, e.g. alloc_space.svg and inuse_space.svg. We can view them in the browser: each looks like a mountain's cross-section, where the bottom-to-top direction is the call stack of each function, so the height correlates with the call depth, and the width is proportional to the amount of memory used/allocated. We just need to watch for wide, flat peaks, which are usually where we need to optimize.
Testing
go test
We can use a similar approach when we need to profile tests or benchmarks. For example, we can use the built-in go test flags to generate pprof data and then analyze it with go tool pprof / go-torch.
- Generate pprof files for CPU and memory:

```sh
go test -bench=BenchmarkStorageXXX -cpuprofile cpu.out -memprofile mem.out
```

- This produces a test binary and two pprof data files:

```
storage.test cpu.out mem.out
```

- Then parse them with go-torch, putting the binary first:

```sh
# analyze CPU
go-torch storage.test cpu.out
# analyze memory
go-torch --colors=mem -alloc_space storage.test mem.out
go-torch --colors=mem -inuse_space storage.test mem.out
```
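For reference, a benchmark of the kind these commands profile might look like the following; BenchmarkStorageXXX and its body are only a hypothetical stand-in here:

```go
package storage

import "testing"

// BenchmarkStorageXXX is a stand-in; replace the body with the real operation.
func BenchmarkStorageXXX(b *testing.B) {
	b.ReportAllocs() // also report allocations per operation
	for i := 0; i < b.N; i++ {
		buf := make([]byte, 1024) // placeholder workload that allocates
		_ = buf
	}
}
```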
Optimization recommendations
"Debugging performance issues in Go programs" offers some common tuning recommendations:
1. Combine multiple small objects into one larger object.
2. Reduce unnecessary pointer indirection; prefer storing values over pointers. For example, use bytes.Buffer rather than *bytes.Buffer as a field, because with the pointer two objects have to be allocated to complete the reference.
3. When local variables escape, aggregate them. The idea is the same as in 1: reduce the number of allocated objects and thereby the GC pressure. For example, the following code:
```go
for k, v := range m {
	k, v := k, v // copy for capturing by the goroutine
	go func() {
		// use k and v
	}()
}
```
can be modified to:
```go
for k, v := range m {
	x := struct{ k, v string }{k, v} // copy for capturing by the goroutine
	go func() {
		// use x.k and x.v
	}()
}
```
After the change, the escaping variable becomes x alone, and the two objects (k and v) are reduced to one.
4. Pre-allocate the backing array for []byte. When we know exactly how many bytes a []byte will use, we can pre-allocate that memory with an array. For example:
```go
type X struct {
	buf      []byte
	bufArray [16]byte // buf usually does not grow beyond 16 bytes.
}

func MakeX() *X {
	x := &X{}
	// Preinitialize buf with the backing array.
	x.buf = x.bufArray[:0]
	return x
}
```
5. Use the smallest numeric type that fits. When a const or a count field does not need many bytes, we can usually declare it as int8.
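For example (a hypothetical enum):

```go
// level never leaves a small range, so int8 is enough.
type level int8

const (
	levelDebug level = iota
	levelInfo
	levelWarn
	levelError
)
```

The saving matters most in large slices or arrays of such values; a lone struct field is often padded anyway.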
6. Reduce objects that contain pointers. When an object contains no pointers (note: strings, slices, maps, and chans contain hidden pointers), the GC skips scanning it. For example, a 1GB []byte is effectively a single object with nothing to trace and barely affects garbage collection time. So we should cut pointer references wherever we can.
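A small sketch of the idea:

```go
func allocExamples() {
	// The GC never scans the contents of raw: []byte elements contain no
	// pointers, so this 1GB buffer is a single object with nothing to trace.
	raw := make([]byte, 1<<30)

	// Every element of ptrs is a pointer the GC must follow on each cycle.
	ptrs := make([]*string, 1<<20)

	_, _ = raw, ptrs
}
```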
7. Use sync.Pool to cache commonly used objects.
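A minimal sketch caching bytes.Buffer objects with sync.Pool:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

var bufPool = sync.Pool{
	// New is called when the pool is empty.
	New: func() interface{} { return new(bytes.Buffer) },
}

func main() {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // Get may return a previously used buffer: always reset first
	buf.WriteString("hello")
	fmt.Println(buf.String())
	bufPool.Put(buf) // return it so later Gets can reuse the allocation
}
```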
Attention
The go tool pprof shipped with versions go1.9 through go1.9.2 has a bug that makes the memory analysis commands above fail. Here's a fix:
```sh
cd $GOROOT/src/cmd/vendor/github.com/google
rm -rf pprof
git clone https://github.com/google/pprof.git  # make sure it is at or after commit e82ee9addc1b6c8e1d667ed6de0194241e1e03b5
rm -rf $GOROOT/pkg/darwin_amd64/cmd/vendor/github.com/google/pprof
cd $GOROOT/src/cmd/pprof
go build
mv pprof $GOROOT/pkg/tool/darwin_amd64/pprof
```