Go GC Optimization: Sharing My Experience


If you don't want to read the long version, here is the conclusion up front: Go's GC is not perfect, but it isn't bad either. The key is how you use it: try not to create large numbers of objects, and try not to create objects frequently. This principle holds in every garbage-collected language.


If you want to know how to anticipate and solve these problems early, please read on patiently.

Our project's server side is written entirely in Go, and the game data is kept in memory and managed by Go.

After the game went into its online test, I did a lot of tuning work. Stability came first, so I started by fixing memory leaks, mainly relying on memprof to locate them. Then I moved on to performance, mainly using cpuprof plus some statistics of our own to locate problems.

While tuning performance, cpuprof showed that the GC's scanblock function accounted for more than 40% of CPU time, so I started reusing objects and avoiding unnecessary object creation wherever I could. The effect was significant: its CPU share dropped to around 10%.
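The article does not show the author's reuse code. As one hedged illustration of object reuse in current Go, the sketch below uses `sync.Pool` (added in Go 1.3, likely after this article was written) to recycle scratch buffers on a hot path. `handleRequest` and its output format are hypothetical. Objects in a pool may be dropped at any GC, so this suits short-lived scratch objects rather than long-lived game data.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so a hot request path
// does not allocate a fresh buffer on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// handleRequest is a hypothetical handler that needs scratch space.
func handleRequest(playerID int) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold the previous caller's data
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "player=%d", playerID)
	return buf.String() // String copies the bytes, so the result survives Put
}

func main() {
	fmt.Println(handleRequest(42))
}
```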

But I still wasn't satisfied and wanted to see how much further I could go. While searching online, I learned that the GOGCTRACE environment variable turns on GC debug output, so I enabled it on our intranet test server. Every time Go runs a GC it prints one line showing the GC execution time and how the object count changed before and after collection.
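The trace line described above came from the GOGCTRACE variable of Go at the time; in current Go releases the equivalent switch is `GODEBUG=gctrace=1`. A program can also read similar numbers itself. The sketch below, an assumption rather than the author's tooling, uses `runtime.ReadMemStats` to report the number of completed GCs, the cumulative pause time, and the number of live heap objects. `gcSnapshot` and its helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcSnapshot holds roughly what the old trace line showed: how many
// collections have run, their cumulative pause, and live heap objects.
type gcSnapshot struct {
	NumGC       uint32
	PauseTotal  time.Duration
	HeapObjects uint64
}

// readGCSnapshot reads the current runtime statistics without forcing a GC.
func readGCSnapshot() gcSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return gcSnapshot{
		NumGC:       m.NumGC,
		PauseTotal:  time.Duration(m.PauseTotalNs),
		HeapObjects: m.HeapObjects,
	}
}

// snapshotAfterGC forces a full collection first, so HeapObjects is close
// to the number of objects that are actually still reachable.
func snapshotAfterGC() gcSnapshot {
	runtime.GC()
	return readGCSnapshot()
}

// retained keeps some objects alive so the numbers are not trivial.
var retained [][]int

func main() {
	for i := 0; i < 100000; i++ {
		retained = append(retained, make([]int, 4))
	}
	s := snapshotAfterGC()
	fmt.Printf("gc #%d, total pause %v, %d live heap objects\n",
		s.NumGC, s.PauseTotal, s.HeapObjects)
}
```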

To my surprise, a single GC took more than 20 milliseconds, while our server's average request processing time was only 33 microseconds: the GC pause was roughly 600 times longer.

So I started paying attention to GC execution time. Is it a constant, or does it grow with the amount of data?

With that question in mind, I also turned on GC tracing on the external test server that players were using, and the results made me sweat even more: GC execution time exceeded 300 milliseconds. Go runs a GC at least once every two minutes, each run pauses the entire program, and a pause of more than 300 milliseconds is enough to cause noticeable response lag.

So shortening GC execution time became necessary. Where to start? First, GC execution time is evidently correlated with the amount of data, since the intranet server holds far less data than the external one. Second, the GC trace output reports the object count as a key figure, which suggests that the GC scans object by object: the more objects, the longer the scan; the fewer objects, the shorter the scan.
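The hypothesis that GC cost tracks the number of pointer-bearing objects can be checked with a small experiment. The sketch below (hypothetical names; run on a modern Go runtime, whose GC is mostly concurrent, so absolute times will differ from the article's) keeps either a linked list of separate nodes or the same values in one pointer-free slice alive across a forced `runtime.GC()`, and prints the live object count and how long the call took.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// node is a pointer-bearing object, like one record in a linked list.
type node struct {
	value int64
	next  *node
}

// buildList allocates n separate heap objects chained by pointers.
func buildList(n int) *node {
	var head *node
	for i := 0; i < n; i++ {
		head = &node{value: int64(i), next: head}
	}
	return head
}

// buildFlat stores the same n values in a single pointer-free slice:
// one heap object whose contents the GC does not need to scan.
func buildFlat(n int) []int64 {
	s := make([]int64, n)
	for i := range s {
		s[i] = int64(i)
	}
	return s
}

// liveObjectsAndGCTime forces a collection while keep is reachable and
// reports how many heap objects are live and how long runtime.GC took.
func liveObjectsAndGCTime(keep any) (uint64, time.Duration) {
	start := time.Now()
	runtime.GC()
	elapsed := time.Since(start)
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	runtime.KeepAlive(keep) // keep the data reachable through the measurement
	return m.HeapObjects, elapsed
}

func main() {
	const n = 1_000_000
	listObjs, listTime := liveObjectsAndGCTime(buildList(n))
	flatObjs, flatTime := liveObjectsAndGCTime(buildFlat(n))
	fmt.Printf("linked list: %d live objects, GC took %v\n", listObjs, listTime)
	fmt.Printf("flat slice:  %d live objects, GC took %v\n", flatObjs, flatTime)
}
```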

So I started reducing the number of objects. At first I tried to solve the problem with cgo: allocate and free memory in C, since objects created in C are not scanned by the GC.

In practice, however, cgo caused some strange problems with the existing in-memory data: for example, an object that had been initialized still read back unexpected data. It also caused deadlock errors when the Go runtime allocated memory. I read Go's memory allocation code over and over, and it has nothing to do with my direct calls to C's malloc, so this is very strange.

I had to shelve the cgo plan for the time being and find another way. A player has a lot of data; if an inactive player's data is serialized into a byte array, many objects are effectively compressed into one, which reduces the object count.
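A minimal sketch of the "freeze inactive players" idea, assuming JSON as the encoding (the article does not name a format) and a hypothetical `PlayerData` type: the player's object graph, with one heap object per item, is collapsed into a single `[]byte`. Because a byte slice contains no pointers, the GC does not need to scan its contents. The price is the `thaw` call whenever the player becomes active again.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Item and PlayerData are hypothetical stand-ins for a player's records;
// every *Item is a separate heap object the GC must scan.
type Item struct {
	ID    int
	Count int
}

type PlayerData struct {
	Name  string
	Level int
	Items []*Item
}

// freeze collapses an inactive player's object graph into one []byte.
// A byte slice holds no pointers, so the GC does not scan its contents.
func freeze(p *PlayerData) ([]byte, error) {
	return json.Marshal(p)
}

// thaw rebuilds the objects when the player becomes active again;
// this is the deserialization cost the text complains about.
func thaw(b []byte) (*PlayerData, error) {
	var p PlayerData
	if err := json.Unmarshal(b, &p); err != nil {
		return nil, err
	}
	return &p, nil
}

func main() {
	p := &PlayerData{Name: "alice", Level: 7, Items: []*Item{{ID: 1, Count: 3}, {ID: 2, Count: 5}}}
	frozen, err := freeze(p)
	if err != nil {
		panic(err)
	}
	back, err := thaw(frozen)
	if err != nil {
		panic(err)
	}
	fmt.Println(back.Name, back.Level, back.Items[1].Count) // prints "alice 7 5"
}
```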

Following this idea, I quickly changed the code and tested it on the external server: the object count fell from millions to hundreds of thousands, and GC scan time dropped to a little over 20 milliseconds.

The effect was good, but a player's data has to be deserialized before it can be used, which is too expensive, so I still needed a better approach.

So I simply changed the in-memory data to use structs and slices instead of the previous separate objects and singly linked lists. Now a piece of data corresponds to one struct, or to one slice of structs, so what used to be many objects becomes one.

As expected, memory consumption went up somewhat, but the number of objects dropped by more than an order of magnitude.

In fact, I had worried about exactly this at the start of the project and asked around everywhere whether a large number of objects would increase the GC burden and make GC take too long, but I never got an answer.

Now that I've fallen into this pit myself, I can say it: yes, it does. Don't fall into the same pit.

If Go's GC were smart enough to tell old objects from new ones, at least in my scenario it could skip a lot of unnecessary scanning; and if it could run asynchronously without pausing the program, I wouldn't care about execution times of a few hundred milliseconds.

Still, I can't put all the blame on Go's imperfections. Had I known to watch GOGCTRACE from the beginning, I could have found the problem much earlier and solved it more fundamentally. But the design was already in use and the project already live, so there was no way to change it wholesale; I could only deal with each problem as it came.

To sum up, a few points for anyone planning to develop a project in Go or already doing so:
1. Use memprof, cpuprof, and gctrace early to observe the program.
2. Keep an eye on request processing time, especially when developing new features; it helps reveal design problems.
3. Avoid creating objects frequently (&abc{}, new(abc), make()); reuse objects in code that is called frequently.
4. Try not to have Go manage a very large number of objects; an in-memory database, for example, can be implemented entirely in C and called through cgo.
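For point 1, cpuprof and memprof correspond to the CPU and heap profiles of the standard `runtime/pprof` package. The sketch below is one way to capture both from inside a program; `writeProfiles`, `profileToTempDir`, `fileSize`, and `busyWork` are hypothetical helpers, and the output files can be opened with `go tool pprof`. Long-running servers often expose the same profiles over HTTP with `net/http/pprof` instead.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
)

// writeProfiles collects a CPU profile while work runs, then a heap profile,
// into dir/cpu.prof and dir/mem.prof.
func writeProfiles(dir string, work func()) error {
	cpuFile, err := os.Create(filepath.Join(dir, "cpu.prof"))
	if err != nil {
		return err
	}
	defer cpuFile.Close()
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		return err
	}
	work()
	pprof.StopCPUProfile()

	memFile, err := os.Create(filepath.Join(dir, "mem.prof"))
	if err != nil {
		return err
	}
	defer memFile.Close()
	return pprof.WriteHeapProfile(memFile)
}

// profileToTempDir writes both profiles into a new temporary directory
// and returns its path.
func profileToTempDir(work func()) (string, error) {
	dir, err := os.MkdirTemp("", "gcprof")
	if err != nil {
		return "", err
	}
	return dir, writeProfiles(dir, work)
}

// fileSize returns the size of dir/name in bytes, or -1 if it cannot be read.
func fileSize(dir, name string) int64 {
	fi, err := os.Stat(filepath.Join(dir, name))
	if err != nil {
		return -1
	}
	return fi.Size()
}

// busySink keeps busyWork's allocations reachable.
var busySink [][]byte

// busyWork is a stand-in for the code under investigation; it allocates
// many small objects so the profiles have something to show.
func busyWork() {
	busySink = busySink[:0]
	for i := 0; i < 200000; i++ {
		busySink = append(busySink, make([]byte, 64))
	}
}

func main() {
	dir, err := profileToTempDir(busyWork)
	if err != nil {
		panic(err)
	}
	fmt.Println("profiles written to", dir)
}
```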

Typing replies on my phone is tiring, so I'll stop here for now and add the case data later.

Supplementary data:

Figure 1: a cpuprof observation on July 22 with 3,000 samples; scanblock consumed 43.3% of the CPU.

Figure 2: cpuprof of the modified program on July 23 with more than 10,000 samples; scanblock's CPU share dropped to 9.8%.

Data 1: the first GC trace from the external server. The GC took more than 400 ms (308+92+1), and 1,659,922 objects remained after collection:

gc13(1): 308+92+1 ms , 156 -> 107 MB 3339834 -> 1659922 (12850245-11190323) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields


Data 2: the GC trace from the external server after optimization. The GC took a little over 30 ms (16+15+1), and 126,097 objects remained after collection:

gc14(6): 16+15+1 ms, 75 -> 37 MB 1409074 -> 126097 (10335326-10209229) objects, 45(1913) handoff, 34(4823) steal, 455/283/52 yields
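The headline numbers above can be recomputed from the raw trace lines. The parser below is a sketch that assumes the Go 1.1-era GOGCTRACE format shown here; it sums the three `+`-separated durations (308+92+1 = 401 ms and 16+15+1 = 32 ms) and extracts the heap sizes and object counts, which show roughly 13 times fewer live objects after the change. The pair in parentheses appears to be cumulative malloc and free counts, since in both lines their difference equals the after-collection object count (12850245 - 11190323 = 1659922). The type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// traceRE matches the leading fields of an old GOGCTRACE line:
// "gcN(P): A+B+C ms, H0 -> H1 MB O0 -> O1 (MALLOC-FREE) objects, ..."
var traceRE = regexp.MustCompile(`^gc\d+\(\d+\): (\d+)\+(\d+)\+(\d+) ms\s*, (\d+) -> (\d+) MB (\d+) -> (\d+) \((\d+)-(\d+)\) objects`)

type gcLine struct {
	TotalMs       int64 // sum of the three "+"-separated durations
	HeapBeforeMB  int64
	HeapAfterMB   int64
	ObjectsBefore int64
	ObjectsAfter  int64
	Mallocs       int64 // cumulative counters printed in parentheses (assumed)
	Frees         int64
}

// parseGCLine extracts the numeric fields from one trace line.
func parseGCLine(s string) (gcLine, error) {
	m := traceRE.FindStringSubmatch(s)
	if m == nil {
		return gcLine{}, fmt.Errorf("not a gctrace line: %q", s)
	}
	var n [9]int64
	for i := range n {
		v, err := strconv.ParseInt(m[i+1], 10, 64)
		if err != nil {
			return gcLine{}, err
		}
		n[i] = v
	}
	return gcLine{
		TotalMs:       n[0] + n[1] + n[2],
		HeapBeforeMB:  n[3],
		HeapAfterMB:   n[4],
		ObjectsBefore: n[5],
		ObjectsAfter:  n[6],
		Mallocs:       n[7],
		Frees:         n[8],
	}, nil
}

func main() {
	g, err := parseGCLine("gc13(1): 308+92+1 ms , 156 -> 107 MB 3339834 -> 1659922 (12850245-11190323) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d ms, objects %d -> %d\n", g.TotalMs, g.ObjectsBefore, g.ObjectsAfter) // prints "401 ms, objects 3339834 -> 1659922"
}
```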


Example 1: refactoring the data structures

The initial data structures looked roughly like this:

```go
// Collection of a player's data tables
type tables struct {
	tableA *tableA
	tableB *tableB
	tableC *tableC
	// ... many more tables omitted
}

// Each player has only one tableA record
type tableA struct {
	fieldA int
	fieldB string
}

// Each player has multiple tableB records
type tableB struct {
	xxoo int
	ooxx int
	next *tableB // points to the next record
}

// Each player has only one tableC record
type tableC struct {
	id    int
	value int64
}
```


The initial design gives each player a tables object, and each tables object holds a number of one-to-one records like tableA and tableC, plus a number of one-to-many records like tableB.

Assuming there are 10,000 players, each with one tableA record, one tableC record, and 10 tableB records, the total is 10,000 (tables) + 10,000 (tableA) + 10,000 (tableC) + 100,000 (tableB) = 130,000 objects.

In the real project there are dozens of tables, of both the one-to-many and the one-to-one kind, and the number of objects grows with the number of players.

Why was it designed this way at first?

1. Some tables may have no record; with separate objects, == nil can be used to check whether a record exists.
2. One-to-many tables need records to be added and deleted dynamically, so they were designed as linked lists.
3. To save memory: no data means no object, and an object is created only when there is data.

The redesigned version:

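```go
// Collection of a player's data tables
type tables struct {
	tableA tableA
	tableB []tableB
	tableC tableC
	// ... many more tables omitted
}

// Each player has only one tableA record
type tableA struct {
	_is_nil bool // true means "no record" (replaces the old nil check)
	fieldA  int
	fieldB  string
}

// Each player has multiple tableB records
type tableB struct {
	_is_nil bool
	xxoo    int
	ooxx    int
}

// Each player has only one tableC record
type tableC struct {
	_is_nil bool
	id      int
	value   int64
}
```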


One-to-one tables are stored as a struct and one-to-many tables as a slice, and each table gets an _is_nil field indicating whether the current data is a valid record.

The result is that for 10,000 players the total number of objects is 10,000 (tables) + 10,000 ([]tableB backing arrays) = 20,000, a clear difference from the previous design.

However, slices do not shrink, and the embedded structs occupy memory up front whether or not a record exists, so this change increases memory consumption.
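One plausible way to handle deletions in the slice design, given that the slice never shrinks, is to treat `_is_nil` as a free-slot marker and reuse those slots before appending. The article does not show its add/delete code, so `addB`, `removeB`, and `countB` below are hypothetical.

```go
package main

import "fmt"

// tableB mirrors the redesigned one-to-many record; _is_nil marks a free slot.
type tableB struct {
	_is_nil bool
	xxoo    int
	ooxx    int
}

// addB stores rec in the first free slot and appends only when none is free,
// so a slice that never shrinks at least gets its old slots reused.
// It returns the (possibly grown) slice and the index used.
func addB(rows []tableB, rec tableB) ([]tableB, int) {
	rec._is_nil = false
	for i := range rows {
		if rows[i]._is_nil {
			rows[i] = rec
			return rows, i
		}
	}
	return append(rows, rec), len(rows)
}

// removeB marks a slot as empty instead of shrinking the slice.
func removeB(rows []tableB, i int) {
	rows[i] = tableB{_is_nil: true}
}

// countB returns the number of live records.
func countB(rows []tableB) int {
	n := 0
	for _, r := range rows {
		if !r._is_nil {
			n++
		}
	}
	return n
}

func main() {
	var rows []tableB
	rows, _ = addB(rows, tableB{xxoo: 1})
	rows, _ = addB(rows, tableB{xxoo: 2})
	removeB(rows, 0)
	rows, idx := addB(rows, tableB{xxoo: 3})
	fmt.Println(len(rows), countB(rows), idx) // prints "2 2 0"
}
```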

Reference links:

The Go GC code, including scanblock and related functions:
http://golang.org/src/pkg/runtime/mgc0.c

The Go runtime package documentation describes several key environment variables, such as GOGCTRACE:
http://golang.org/pkg/runtime/

For how to use cpuprof and memprof, see "Profiling Go Programs":
http://blog.golang.org/profiling-go-programs

Some small experiments of mine, whose data these optimizations are based on, for reference:
https://github.com/realint/labs/tree/master/src

