Golang core developer Dmitry Vyukov (author of the Go 1.1 scheduler) on profiling


Let's say you have a Go program whose performance you want to improve. Several tools can help with this task. They can identify hotspots in the program (CPU, IO, memory), the places where you need to focus to significantly improve performance. Another outcome is also possible: the tools can help you identify various performance defects in the program. For example, you prepare a SQL statement on every database query, when you could prepare it just once at program startup. Or an O(n^2) algorithm has somehow slipped in where an O(n) algorithm exists. To recognize such situations, you need to properly interpret the profiling results. For example, in the first case, a significant amount of time spent in the SQL statement preparation phase is a warning sign.

It is also important to understand the various bounding factors on performance. For example, if your program communicates over a 100 Mbps network link and already uses more than 90 Mbps of that bandwidth, there is not much room left to improve performance. Similar limits exist for disk IO, memory consumption, and computational tasks. With that in mind, let's look at the available tools.

Note: the tools can interfere with one another. For example, precise memory profiling skews CPU profiles, and goroutine blocking profiling affects scheduler tracing, so use the tools in isolation to get more precise information. The following description is based on Go 1.3.

CPU Profiler

The Go runtime has a built-in CPU profiler that shows what percentage of CPU time is spent in which functions. There are three ways to access it:

1. The simplest is the -cpuprofile flag of the 'go test' command. For example, the following command:

$ go test -run=none -bench=ClientServerParallel4 -cpuprofile=cprof net/http

profiles the given benchmark and writes the CPU profile to the 'cprof' file.

Then:

$ go tool pprof --text http.test cprof

prints a list of the hottest functions. Several output formats are available; the most useful are --text, --web, and --list. Run 'go tool pprof' for the complete list.

2. The net/http/pprof package. This is the ideal solution for network server applications: you just import net/http/pprof and then collect a profile with the following command:

go tool pprof --text mybin http://myserver:6060/debug/pprof/profile

3. Manual profile collection. You need to import runtime/pprof and add the following code to the main function:

if *flagCpuprofile != "" {
    f, err := os.Create(*flagCpuprofile)
    if err != nil {
        log.Fatal(err)
    }
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile()
}

The profile is written to the specified file and can be visualized in the same way as with the first option, for example with the --web option.

You can use the --list=funcname option to inspect a single function. For example, the following profile fragment shows that the time in WriteRune was spent in append:

.      .   93: func (bp *buffer) WriteRune(r rune) error {
.      .   94:     if r < utf8.RuneSelf {
5      5   95:         *bp = append(*bp, byte(r))
.      .   96:         return nil
.      .   97:     }
.      .   98:
.      .   99:     b := *bp
.      .  100:     n := len(b)
.      .  101:     for n+utf8.UTFMax > cap(b) {
.      .  102:         b = append(b, 0)
.      .  103:     }
.      .  104:     w := utf8.EncodeRune(b[n:n+utf8.UTFMax], r)
.      .  105:     *bp = b[:n+w]
.      .  106:     return nil
.      .  107: }

When the profiler cannot unwind the call stack, it uses three special entries: GC, System, and ExternalCode. GC represents time spent in garbage collection; System represents time spent in the goroutine scheduler, stack management, and other auxiliary runtime code; ExternalCode represents time spent in calls to native dynamic libraries. Here are some hints on how to interpret profile results:

If you see a lot of time spent in the runtime.mallocgc function, the program is probably making too many small heap allocations. The profile will tell you where these allocations come from.

If a lot of time is spent on channels, sync.Mutex, other synchronization primitives, or in the System component, the program probably suffers from contention. Consider restructuring the code to eliminate the most frequently accessed shared resources. Common techniques include sharding/partitioning, local buffering/batching, and copy-on-write.

If a lot of time is spent in syscall.Read/Write, the program probably issues too many small reads and writes. Consider wrapping the os.File or net.Conn in bufio.Reader/bufio.Writer.
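For illustration, here is a minimal sketch (not from the original article) of wrapping a file in a bufio.Writer so that many small writes are coalesced into far fewer system calls; the file name is invented:

package main

import (
    "bufio"
    "log"
    "os"
)

func main() {
    f, err := os.Create("out.log") // hypothetical output file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    w := bufio.NewWriter(f)
    defer w.Flush() // push any buffered bytes to the file before exit

    for i := 0; i < 10000; i++ {
        // Each WriteString goes to the in-memory buffer; syscall.Write
        // happens only when the buffer fills, not on every call.
        w.WriteString("a small record\n")
    }
}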

If a large amount of time is spent in GC components, either the program allocates too many transient objects or the heap is very small, so garbage collections run too frequently.

Note: the CPU profiler currently does not work correctly on Darwin. On Windows, you need to install Cygwin, Perl, and Graphviz to generate svg/web profiles. On Linux, you can also use the perf system profiler, which cannot unwind Go stacks but can profile cgo/SWIG code and kernel code.

Memory Profiler

The memory profiler shows which functions allocate heap memory. You can collect it with 'go test --memprofile', via net/http/pprof at http://myserver:6060/debug/pprof/heap, or by calling runtime/pprof.WriteHeapProfile.
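For example, a minimal sketch of the manual route (the profile file name is an arbitrary choice):

package main

import (
    "log"
    "os"
    "runtime/pprof"
)

func main() {
    // ... run the workload you want to measure ...

    f, err := os.Create("mem.prof") // hypothetical profile file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    // Write a snapshot of the current heap profile.
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}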

You can visualize either the allocations live at the time of profile collection (the --inuse_space flag, the default) or all allocations made since program start (--alloc_space).

You can show either the number of bytes allocated or the number of objects (--inuse_space/--alloc_space vs. --inuse_objects/--alloc_objects). The profiler tends to sample larger objects more often during profiling. Understand that large objects affect memory consumption and GC time, while a large number of tiny allocations affects execution speed and, to some degree, GC time as well. Objects can be persistent or transient. If you have several large persistent objects allocated at program start, they will most likely be captured by the profiler. Such objects affect memory consumption and GC time but not normal execution speed. On the other hand, a large number of short-lived objects may be barely represented in the profile, yet they significantly affect execution speed, because they are allocated very frequently.

As a general rule, if you want to reduce memory consumption, look at a --inuse_space profile; if you want to improve execution speed, look at an --alloc_objects profile. There are several options controlling report granularity: --functions (the default, function level), --lines, --files, and --addresses, for line, file, and instruction-address level respectively.
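For example, a hypothetical invocation that shows allocation counts at line granularity (assuming a memory profile saved as 'mprof' for the http.test binary) might look like this:

$ go tool pprof --alloc_objects --lines http.test mprof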

Optimizations are typically application-specific, but here are some common recommendations:

1. Combine objects into larger objects. For example, embed bytes.Buffer instead of *bytes.Buffer as a struct member; this reduces the number of memory allocations and reduces GC pressure.
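A minimal sketch of this idea (the Message type is invented for illustration):

package main

import (
    "bytes"
    "fmt"
)

// Message embeds bytes.Buffer by value, so creating a *Message is a
// single heap allocation; a *bytes.Buffer member would add a second one.
type Message struct {
    header string
    body   bytes.Buffer
}

func main() {
    m := &Message{header: "greeting"}
    m.body.WriteString("hello")
    fmt.Println(m.header, m.body.String())
}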

2. Local variables that escape their declaration scope are promoted to heap allocations. The compiler generally cannot prove that several such variables have the same lifetime, so it allocates each of them separately. You can replace code like this:

for k, v := range m {
    k, v := k, v // copy for capturing by the goroutine
    go func() {
        // use k and v
    }()
}

with:

for k, v := range m {
    x := struct{ k, v string }{k, v} // copy for capturing by the goroutine
    go func() {
        // use x.k and x.v
    }()
}

This replaces two allocations with one, at the cost of code readability, so use this technique sparingly.

3. A special case of combining allocations: if you know the typical size of a slice, you can preallocate its backing array inside the struct:

type X struct {
    buf      []byte
    bufArray [16]byte // buf usually does not grow beyond 16 bytes.
}

func MakeX() *X {
    x := &X{}
    // Preinitialize buf with the backing array.
    x.buf = x.bufArray[:0]
    return x
}

4. Use data types with a smaller footprint, for example, int8 instead of int.

5. Objects that do not contain any pointers (note that strings, slices, maps, and channels contain implicit pointers) are not scanned by the garbage collector. For example, a 1 GB byte slice has virtually no effect on GC time, so removing pointers from actively used objects can have a positive effect on GC time. Some possibilities: use an index instead of a pointer, or split the object into two parts, one of which contains no pointers. A sketch of the index approach follows.
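Here is a minimal sketch of the index-instead-of-pointer idea (the node type is invented for illustration): structs stored in a slice link to each other by index, so the slice contains no pointers for the GC to trace.

package main

import "fmt"

// node contains no pointers: links are slice indices, so a []node
// is not scanned by the garbage collector.
type node struct {
    value int32
    next  int32 // index of the next node; -1 means end of list
}

func main() {
    nodes := []node{{value: 1, next: 1}, {value: 2, next: -1}}
    for i := int32(0); i != -1; i = nodes[i].next {
        fmt.Println(nodes[i].value)
    }
}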

6. Use free lists to reuse transient objects and reduce the number of allocations. The standard library contains the sync.Pool type, which allows the same object to be reused several times between garbage collections. Be aware, however, that, like any manual memory management scheme, incorrect use of sync.Pool can lead to use-after-free bugs.
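A minimal sketch of a sync.Pool-based free list (the format helper is invented for illustration):

package main

import (
    "bytes"
    "fmt"
    "sync"
)

// bufPool recycles *bytes.Buffer values between garbage collections.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func format(x int) string {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset() // a recycled buffer still holds its old contents
    fmt.Fprintf(buf, "value=%d", x)
    s := buf.String()
    bufPool.Put(buf) // after Put, never touch buf again (use-after-free)
    return s
}

func main() {
    fmt.Println(format(42))
}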

Blocking Profiler

The goroutine blocking profiler shows where in the code goroutines block waiting on synchronization primitives (including timer channels). You can collect it with 'go test --blockprofile', via net/http/pprof at http://myserver:6060/debug/pprof/block, or by calling runtime/pprof.Lookup("block").WriteTo.

The blocking profiler is not enabled by default. 'go test --blockprofile' enables it for you automatically, but when using net/http/pprof or runtime/pprof you need to enable it manually by calling runtime.SetBlockProfileRate, which also controls the granularity of blocking events reported in the profile.
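A minimal sketch of enabling it in a server that already exposes net/http/pprof (the port is an arbitrary choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers
    "runtime"
)

func main() {
    // Rate 1 records every blocking event; larger values sample
    // roughly one event per that many nanoseconds spent blocked.
    runtime.SetBlockProfileRate(1)

    log.Fatal(http.ListenAndServe(":6060", nil))
}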

If a function contains several blocking operations and it is not clear which one caused the blocking, use the --lines flag to distinguish between them.

Note that not all blocking is bad. When a goroutine blocks, the underlying worker thread simply switches to another goroutine. Blocking in the cooperative Go environment is thus very different from blocking on a mutex in a non-cooperative system (for example, typical C++ or Java thread libraries, where blocking leads to thread idling and expensive thread context switches).

Blocking on a time.Ticker is usually fine: if a goroutine blocks on a ticker for 10 seconds, you will see those 10 seconds of blocking in the profile, and that is OK. Blocking on sync.WaitGroup is mostly fine too: if a task takes 10 seconds, the goroutine waiting on the WaitGroup accounts for those 10 seconds. Blocking on sync.Cond may or may not be a problem, depending on the situation. Consumers blocking on a channel suggest slow producers or a lack of work. Producers blocking on a channel suggest that consumers are slower, which is usually not a problem. Blocking on a channel-based semaphore shows how many goroutines are gated on the semaphore. Blocking on sync.Mutex or sync.RWMutex is usually bad. You can use --ignore to exclude blocking events that are not of interest.

Goroutine blocking has two negative consequences:

1. The program does not scale with the number of processors.

2. Excessive blocking and unblocking of goroutines consumes CPU time.

Here are some tips to help reduce goroutine blocking:

1. In producer-consumer code, use buffered channels with sufficient capacity; unbuffered channels substantially limit the available parallelism of the program (a minimal sketch appears after this list).

2. In read-mostly workloads with rare modifications, use sync.RWMutex instead of sync.Mutex; readers then never block each other.

3. In some cases it is even possible to remove the mutex entirely by using copy-on-write. If the protected data structure is modified rarely, you can make a copy of it on modification:

type Config struct {
    Routes   map[string]net.Addr
    Backends []net.Addr
}

var config unsafe.Pointer // actual type is *Config

// Worker goroutines use this function to obtain the current config.
func CurrentConfig() *Config {
    return (*Config)(atomic.LoadPointer(&config))
}

// Background goroutine periodically creates a new Config object
// and sets it as current using this function.
func UpdateConfig(cfg *Config) {
    atomic.StorePointer(&config, unsafe.Pointer(cfg))
}

This pattern prevents the writer from blocking readers during updates.

4. Partitioning is another common technique for reducing contention/blocking on mutable data structures. Below is an example of how to partition a hashmap:

type Partition struct {
    sync.RWMutex
    m map[string]string
}

const partCount = 64

var m [partCount]Partition

func find(k string) string {
    idx := hash(k) % partCount
    part := &m[idx]
    part.RLock()
    v := part.m[k]
    part.RUnlock()
    return v
}

5. Local buffering and batching of updates can help reduce contention on data structures that cannot be partitioned:

const cacheSize = 16

type Cache struct {
    buf [cacheSize]int
    pos int
}

func Send(c chan [cacheSize]int, cache *Cache, value int) {
    cache.buf[cache.pos] = value
    cache.pos++
    if cache.pos == cacheSize {
        c <- cache.buf
        cache.pos = 0
    }
}

This technique is not limited to channels; it can be used to batch updates to a map, batch allocations, and so on.

6. Use sync.Pool for free lists instead of channel-based or mutex-protected free lists; sync.Pool uses some clever tricks internally to reduce blocking.
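As promised in item 1, here is a minimal producer/consumer sketch (the buffer size and workload are arbitrary): with a buffered channel, the producer can run ahead of the consumer instead of handing off on every single send.

package main

import "fmt"

func main() {
    tasks := make(chan int, 128) // buffered: sends rarely block

    go func() {
        for i := 0; i < 1000; i++ {
            tasks <- i // blocks only when the buffer is full
        }
        close(tasks)
    }()

    sum := 0
    for t := range tasks {
        sum += t // consume the task
    }
    fmt.Println(sum)
}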

Goroutine Profiler

The goroutine profiler simply gives you the current stacks of all live goroutines in the process. It is very handy for debugging load-balancing problems and deadlocks.

The goroutine profile only makes sense for a running application, so the go test command does not expose it. You can collect it via net/http/pprof at http://myserver:6060/debug/pprof/goroutine, or by calling runtime/pprof.Lookup("goroutine").WriteTo. But the most useful way is to open http://myserver:6060/debug/pprof/goroutine?debug=2 in a browser: you will see stack traces similar to those printed when a program crashes. Note that goroutines in the "syscall" state consume an OS thread, while other goroutines do not; goroutines in the "IO wait" state also do not consume OS threads, since they are parked on the non-blocking network poller.
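For example, a minimal sketch of the manual route, dumping crash-style stacks (debug=2) for all goroutines to stderr:

package main

import (
    "os"
    "runtime/pprof"
)

func main() {
    // debug=2 prints goroutines in the same format as an unrecovered panic.
    pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}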
