A case of Go code tuning


With online traffic replayed by tcpcopy, and the in-memory campaign database holding real production data, I grabbed a pprof profile:

genius@geniuss-Macbook-air:~/project/src/dsp_masky $ go tool pprof ls.prof
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 14413 samples
    1342   9.3%   9.3%     1342   9.3% strconv.ParseUint
    1243   8.6%  17.9%     1243   8.6% runtime.duffcopy
    1082   7.5%  25.4%     5067  35.2% dsp_masky/model/bid.(*BidServer).FilterByCampaign
    1073   7.4%  32.9%     1824  12.7% runtime.mallocgc
     638   4.4%  37.3%      638   4.4% runtime.MSpan_Sweep
     555   3.9%  41.2%      555   3.9% settype
     547   3.8%  45.0%      547   3.8% runtime.mapaccess2_fast64
     513   3.6%  48.5%     1855  12.9% strconv.ParseInt
     376   2.6%  51.1%      452   3.1% regexp.(*machine).add
     337   2.3%  53.5%     2418  16.8% dsp_masky/model.regionQuery

The top three CPU consumers are strconv.ParseUint, runtime.duffcopy, and FilterByCampaign. They are analyzed one by one below.

Optimize duffcopy

I looked at duffcopy first. It is a memcpy-like function that copies a block of memory from one place to another. No runtime library function calls it; it is emitted directly by compiler-generated code. Why is duffcopy so expensive? I found the reason while reviewing my mentor's coding habits:

var RCampaign map[int64]Campaign

RCampaign is the full in-memory campaign database. Note that it stores Campaign values rather than *Campaign pointers, and Campaign is defined as a rather large struct:

type Campaign struct {
    Id               int64
    Plan             CPlan
    PublisherId      int
    CreateUserId     int64
    WardenStatus     int
    Zone             []Zone
    IpSections       []IpSection
    ChannelId        int64
    AdxId            int64
    CategoryId       []int
    TopPrice         float64
    SubPrice         SubCharge
    PositionPrice    map[int64]ChargePrice
    ChargePrice      float64
    ChargeType       int
    Gaga             FilterGaga
    MediaFilter      MediaFilter
    OType            int
    FrequencyFilters []model.FrequencyFilter
    AMonit           []string
    Character        []model.ActionTerm
    ...
}

Mentor Alter writes code like this:

func ConvertCampaign(campaignInfo model.Campaign) (campaign Campaign, err error) {  // returns this huge struct directly by value

campaigns[Id] = campaign  // assigns the entire struct value into the map

There are many more examples like this, all of which copy the entire struct, and it is exactly this habit that makes duffcopy burn 8.6% of the CPU!

A good habit worth adopting: for any even moderately large type, store pointers in the map rather than values.
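For illustration, a minimal sketch of the change, reusing the ConvertCampaign helper shown above (the AddCampaign wrapper and its error handling are my assumptions, not the original code):

var RCampaign map[int64]*Campaign // pointers instead of values; assumed to be initialized with make elsewhere

func AddCampaign(info model.Campaign) error {
    c, err := ConvertCampaign(info) // the conversion still copies once here
    if err != nil {
        return err
    }
    // The map stores only an 8-byte pointer, so later reads such as
    // RCampaign[id] no longer duffcopy the whole struct.
    RCampaign[c.Id] = &c
    return nil
}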

After cutting out the duffcopy overhead, pprof looks like this:

Total: 12507 samples
    1739  13.9%  13.9%     1739  13.9% strconv.ParseUint
    1290  10.3%  24.2%     8282  66.2% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     813   6.5%  30.7%      906   7.2% runtime.mapaccess2_fast64
     705   5.6%  36.4%     2444  19.5% strconv.ParseInt
     440   3.5%  39.9%      523   4.2% regexp.(*machine).add
     402   3.2%  43.1%     1202   9.6% hash_insert
     394   3.2%  46.2%     3136  25.1% dsp_masky/model.regionQuery
     305   2.4%  48.7%     2737  21.9% strconv.Atoi
     305   2.4%  51.1%      316   2.5% syscall.Syscall
     288   2.3%  53.4%     1027   8.2% runtime.mallocgc

Optimize ParseUint

This is a real-time bidding service that receives a large number of requests every second. For each request we have to walk through all campaigns and pick out the ones eligible to bid on this exposure opportunity. One of the targeting dimensions is geo: the request carries a source IP, and each campaign carries geo targeting, so we resolve the user's region from the IP and compare it with the campaign's regions.

The problem is that the campaign side stores its geo data as strings:

type Zone struct {
    CountryId string
    RegionId  string
    CityId    string
}

The region resolved from the IP, however, is an int, so each comparison converts with strconv.Atoi, which in turn calls strconv.ParseUint.

There are about 2000 campaigns at the moment, of which roughly 400 actually enter the comparison. Each comparison performs three string conversions, so at around 1500 QPS that is 400 * 3 * 1500 = 1,800,000 ParseUint calls per second. That is why ParseUint eats so much CPU.
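Roughly, the fix is to keep the campaign-side region IDs as ints so the hot path compares integers directly. A hedged sketch, where matchZone and the request-side parameter names are my own placeholders:

type Zone struct {
    CountryId int
    RegionId  int
    CityId    int
}

// matchZone compares the region resolved from the request IP against a
// campaign zone with plain integer comparisons: no strconv on the hot path.
func matchZone(z Zone, country, region, city int) bool {
    return z.CountryId == country && z.RegionId == region && z.CityId == city
}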

So I changed Zone to store ints, avoiding ParseUint altogether. After this optimization, pprof gives:

(pprof) top
Total: 4554 samples
     426   9.4%   9.4%      480  10.5% runtime.mapaccess2_fast64
     232   5.1%  14.4%     1768  38.8% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     216   4.7%  19.2%      216   4.7% syscall.Syscall
     183   4.0%  23.2%      213   4.7% hash_insert
     167   3.7%  26.9%      167   3.7% ExternalCode
     141   3.1%  30.0%      222   4.9% runtime.mallocgc
     121   2.7%  32.6%      121   2.7% runtime.futex
     115   2.5%  35.2%      136   3.0% evacuate
     112   2.5%  37.6%      112   2.5% runtime.MSpan_Sweep
     103   2.3%  39.9%      103   2.3% runtime.aeshash64

When I first got this code, the service topped out at a little over 2000 QPS, and under peak traffic CPU usage went above 700%. After these two optimizations, I never saw CPU usage exceed 500%.

Kill defer

mapaccess2_fast64 is what the compiler generates for map accesses of the form v, ok := m[key]. The code uses the map data structure very heavily, and after reading the relevant call sites most of that usage looks reasonable, so there is little room for optimization in this direction. Rolling our own hash table might bring some improvement, but that would be a thankless job, so I set it aside for now.

As for FilterByCampaign: every exposure opportunity has to be matched against the settings of every campaign and filtered, and that big for range loop is where the time goes. Short of an algorithmic improvement the loop cannot be avoided, and I have not come up with a better algorithm yet.

With progress stalling, I went to discuss it with my mentor. He mentioned that after a price-adjustment update was added, the system's CPU usage rose from about 200% to 400%, so I should look at the code on that path. Then I saw the defer, and my eyes lit up immediately:

func (this *Campaign) GetPrice(positionId, size, host int64, b *BidServer) (price float64, err error) {
    ...
    defer func() {
        ...
    }()
    ...
    if err != nil {
        return
    }
    return
}

There is an obvious misuse here: when err is non-nil we can simply return early, and there is no need for the deferred function to run at all, so the code should not be written this way. Defer also carries real overhead. Briefly: defer has to wrap the closure together with its captured arguments into a record and push it onto the goroutine's defer chain; then, when the function returns, the runtime walks that linked list and runs the closures in LIFO order; and there is extra special handling for things like garbage collection of defer records. In short, the impact is considerable.
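As a hedged sketch of the shape of the fix (the deferred body is elided in the post, so finish here is just a hypothetical stand-in for that work):

func (this *Campaign) GetPrice(positionId, size, host int64, b *BidServer) (price float64, err error) {
    // ... price calculation ...
    if err != nil {
        return // error path: bail out early, no defer record is ever created
    }
    finish(price, b) // run the former defer body explicitly, only on the success path
    return
}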

GetPrice is on the mandatory code path for every request; there is no loop inside it, but once QPS climbs the impact is far from negligible. After this change, at roughly 1000 QPS, CPU usage dropped from about 200% before the optimization to about 120%.

Defer is a performance killer, and my principle is to avoid it as much as possible.
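To put a number on that claim for your own Go version, here is a quick micro-benchmark sketch (mine, not from the post), runnable with go test -bench=.:

package defercost_test

import "testing"

var sink int

func withDefer() {
    defer func() { sink++ }() // the deferred closure is pushed onto the defer chain every call
    sink++
}

func withoutDefer() {
    sink++
    sink++ // same work, done inline
}

func BenchmarkWithDefer(b *testing.B) {
    for i := 0; i < b.N; i++ {
        withDefer()
    }
}

func BenchmarkWithoutDefer(b *testing.B) {
    for i := 0; i < b.N; i++ {
        withoutDefer()
    }
}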

time.Date

Now look at the pprof:

(pprof) top
Total: 3348 samples
     254   7.6%   7.6%      266   7.9% syscall.Syscall
     177   5.3%  12.9%      221   6.6% hash_next
     168   5.0%  17.9%     1320  39.4% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     153   4.6%  22.5%      369  11.0% runtime.mallocgc
     153   4.6%  27.0%      194   5.8% runtime.mapaccess2_fast64
     113   3.4%  30.4%      113   3.4% time.absDate
     102   3.0%  33.5%      102   3.0% runtime.futex
      90   2.7%  36.1%      106   3.2% runtime.MSpan_Sweep
      83   2.5%  38.6%       83   2.5% runtime.aeshash64
      74   2.2%  40.8%       86   2.6% syscall.Syscall6

What now? time.absDate manages to take 3.4% of the CPU? A quick look at the code:

for _, campaignId := range campaignIds {
    // get the campaign
    campaign, ok := RCampaign[campaignId]
    // schedule check
    now := time.Now()
    year, month, day := now.Date()
    day = year*10000 + int(month)*100 + day
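A hedged sketch of the obvious fix: compute today's yyyymmdd once, outside the loop, instead of once per campaign (the schedule-check body is elided in the post, so only the structure is shown):

now := time.Now() // hoisted: evaluated once per request
year, month, d := now.Date()
today := year*10000 + int(month)*100 + d

for _, campaignId := range campaignIds {
    // get the campaign
    campaign, ok := RCampaign[campaignId]
    // schedule check now compares against the precomputed today value
}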

Hoisting it out of the for loop did the trick; finding a problem like this is quite satisfying. Pprof after the optimization:

(pprof) top
Total: 3310 samples
     272   8.2%   8.2%      278   8.4% syscall.Syscall
     213   6.4%  14.7%      264   8.0% hash_next
     172   5.2%  19.8%      385  11.6% runtime.mallocgc
     163   4.9%  24.8%      215   6.5% runtime.mapaccess2_fast64
     155   4.7%  29.5%     1045  31.6% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     143   4.3%  33.8%      143   4.3% runtime.futex
      90   2.7%  36.5%       90   2.7% runtime.aeshash64
      87   2.6%  39.1%      116   3.5% runtime.MSpan_Sweep
      67   2.0%  41.1%       82   2.5% settype
      66   2.0%  43.1%       66   2.0% runtime.findfunc

With this done, CPU usage at around 1000 QPS sits at or below 100%, roughly another 20% improvement over the previous step.

Optimize mallocgc

mallocgc accounts for 5.2%, the result of frequent memory allocation. Following the pprof SVG call graph, I traced the mallocgc calls back to this function:

func (this *Campaign) GetPrice(positionId, size, host int64, b *BidServer) (price float64, err error) {
    ...
        offer = &Offer{
            CampaignId: this.Id,
            UserId:     this.CreateUserId,
            Ctype:      this.ChargeType,
        }
        malloced = true
    }
    ...
}

With many requests arriving every second, and every request calling GetPrice for each candidate campaign to work out its price, it is this Offer bid object that generates all the mallocgc traffic.

Having found the problem, the next step is to fix it. The idea is to allocate Offer objects from a dedicated pool.

Go 1.3 provides sync.Pool, so I tried that first: a global var OfferPool *sync.Pool, with every allocation fetched from the pool. But... it was a failed attempt. After the change, CPU usage nearly doubled, and pprof showed a lot of CPU time being spent in Pool.Get.
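A hedged sketch of what that first attempt presumably looked like (getOffer and the exact wiring are my assumptions; only the global *sync.Pool is described in the post):

import "sync"

var OfferPool = &sync.Pool{
    New: func() interface{} { return new(Offer) },
}

func getOffer(c *Campaign) *Offer {
    offer := OfferPool.Get().(*Offer) // one shared pool, contended by every goroutine
    offer.CampaignId = c.Id
    offer.UserId = c.CreateUserId
    offer.Ctype = c.ChargeType
    return offer
}

// ...and OfferPool.Put(offer) somewhere after the response is written.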

How can allocation from an object pool possibly perform worse than just allocating a new object every time?

After some thought I found the reason. The CPU is burned inside Pool.Get, which internally does CAS/locking operations, and contention for that lock is severe. I had made one global pool, while every request runs in its own goroutine, so fetching an object from the pool wastes lots of CPU fighting over the lock; hence the CPU explosion after the "optimization". Why is mallocgc fine, then? Because its locking is much finer-grained: when allocating, a goroutine first takes memory from the mcache local to the M it is running on, and only when that local cache runs out does it go to the global allocation pool, so there is no serious contention.

To avoid the global lock, I switched to a small per-request Offer cache, and the allocation now looks like this:

var offer *Offer
if b.OfferPoolIndex < len(b.OfferPool) {
    // take the next pre-allocated Offer from the per-request pool
    offer = &b.OfferPool[b.OfferPoolIndex]
    b.OfferPoolIndex++
} else {
    // pool exhausted: fall back to a fresh allocation
    offer = &Offer{
        CampaignId: this.Id,
        UserId:     this.CreateUserId,
        Ctype:      this.ChargeType,
    }
}
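For this to work, the pool presumably lives on the BidServer handling the request and is reset before each request; a hedged sketch of the supporting pieces (the field names follow the snippet above, the reset helper is my assumption):

type BidServer struct {
    // ...other per-request state...
    OfferPool      []Offer // pre-allocated once, reused across campaigns within one request
    OfferPoolIndex int
}

// resetOfferPool would be called at the start of each request: the backing
// array is kept, so no new allocations happen on the hot path.
func (b *BidServer) resetOfferPool() {
    b.OfferPoolIndex = 0
}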

With so much already optimized, this change is not a big win, roughly 10%. Observed CPU usage at around 1000 QPS is now about 70-80%. Further gains would not be worth the effort, so the optimization work stops here!

Update:

Subsequent testing found that the code change made while optimizing ParseUint introduced a bug that caused a large amount of the computation to be filtered out. That invalidates the measurements for the optimizations that followed, so the pprof numbers no longer point in the right direction. Depressing. ~_~

Comparison before and after optimization

The test environment runs two processes, one with the pre-optimization code and one with the optimized code, and uses tcpcopy to feed the same real-time online traffic into both for observation.

CPU utilization

CPU utilization was observed over a period of time; here is a small sample of the data:

Before optimization: 473 473 450 476 469 579 495

After optimization: 258 264 246 261 258 267 274

On average, the optimized version uses 53.5% of the original's CPU (roughly 261% versus 488% across these samples).

Memory

Memory usage is slightly lower than before.

Before optimization: {"mem":"508.8MB","ver":"20140327~git-xxx","up_time":"2014-08-04 18:02:55","total_count":50208025,"last_10_second_qps":1119.800000}
After optimization: {"mem":"431.6MB","ver":"20140327~git-xxx","up_time":"2014-08-05 12:12:43","total_count":9974904,"last_10_second_qps":1056.700000}

Response time

Ideally, response times should fall within 10ms. As a comparison, I took the last 100,000 log records with tail and counted the records whose response time falls into the 10/20/30ms buckets.

Before optimization: 7309/1924/779

After optimization: 4220/1041/220
