With live online traffic replayed via tcpcopy, and the campaign store loaded with real production data, I captured a pprof profile:
genius@geniuss-Macbook-air:~/project/src/dsp_masky $ go tool pprof ls.prof
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 14413 samples
    1342   9.3%   9.3%     1342   9.3% strconv.ParseUint
    1243   8.6%  17.9%     1243   8.6% runtime.duffcopy
    1082   7.5%  25.4%     5067  35.2% dsp_masky/model/bid.(*BidServer).FilterByCampaign
    1073   7.4%  32.9%     1824  12.7% runtime.mallocgc
     638   4.4%  37.3%      638   4.4% runtime.MSpan_Sweep
     555   3.9%  41.2%      555   3.9% settype
     547   3.8%  45.0%      547   3.8% runtime.mapaccess2_fast64
     513   3.6%  48.5%     1855  12.9% strconv.ParseInt
     376   2.6%  51.1%      452   3.1% regexp.(*machine).add
     337   2.3%  53.5%     2418  16.8% dsp_masky/model.regionQuery
The top three CPU consumers are ParseUint, duffcopy, and FilterByCampaign. The rest of this article analyzes each in turn.
Optimize duffcopy
I looked into duffcopy first. It is a memcpy-like routine that copies a block of memory from one place to another. No runtime library function calls it; it is invoked directly by compiler-generated code. Why was duffcopy so hot? I found the reason while reviewing my mentor's coding habits:
var RCampaign map[int64]Campaign
RCampaign is the in-memory campaign store. Note that it stores Campaign values rather than *Campaign pointers, and Campaign is defined as a rather large struct:
type Campaign struct {
	Id               int64
	Plan             CPlan
	PublisherId      int
	CreateUserId     int64
	WardenStatus     int
	Zone             []Zone
	IpSections       []IpSection
	ChannelId        int64
	AdxId            int64
	CategoryId       []int
	TopPrice         float64
	SubPrice         SubCharge
	PositionPrice    map[int64]ChargePrice
	ChargePrice      float64
	ChargeType       int
	Gaga             FilterGaga
	MediaFilter      MediaFilter
	OType            int
	FrequencyFilters []model.FrequencyFilter
	AMonit           []string
	Character        []model.ActionTerm
	...
}
My mentor Alter wrote code like this:
func ConvertCampaign(campaignInfo model.Campaign) (campaign Campaign, err error) { // returns this huge struct by value
	...
}

campaigns[Id] = campaign // assigns the entire struct into the map
There were quite a few similar examples, each of which copies the entire struct, and it is this programming habit that made duffcopy consume 8.6% of the CPU!
A good habit to adopt: when a type is at all large, store pointers to it in maps rather than values.
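As a minimal sketch (not the project's actual code; the struct is abridged), switching the map to pointers makes every insert and lookup move an 8-byte pointer instead of the whole struct:

package main

// Campaign stands in for the large struct above; most fields are omitted.
type Campaign struct {
	Id           int64
	CreateUserId int64
	// ... dozens more fields in the real struct
}

// Store *Campaign, not Campaign: map operations now copy a pointer,
// so the compiler emits no duffcopy for the struct body.
var RCampaign = map[int64]*Campaign{}

func main() {
	c := &Campaign{Id: 42}
	RCampaign[c.Id] = c  // copies a pointer, not the whole struct
	got := RCampaign[42] // lookup likewise copies no struct
	_ = got
}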
Here is the pprof captured after eliminating the duffcopy:
Total: 12507 samples
    1739  13.9%  13.9%     1739  13.9% strconv.ParseUint
    1290  10.3%  24.2%     8282  66.2% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     813   6.5%  30.7%      906   7.2% runtime.mapaccess2_fast64
     705   5.6%  36.4%     2444  19.5% strconv.ParseInt
     440   3.5%  39.9%      523   4.2% regexp.(*machine).add
     402   3.2%  43.1%     1202   9.6% hash_insert
     394   3.2%  46.2%     3136  25.1% dsp_masky/model.regionQuery
     305   2.4%  48.7%     2737  21.9% strconv.Atoi
     305   2.4%  51.1%      316   2.5% syscall.Syscall
     288   2.3%  53.4%     1027   8.2% runtime.mallocgc
Optimize ParseUint
This is our real-time bidding service, which receives a large number of requests every second. For each request, we iterate over all campaigns and pick out the ones eligible to bid on this impression. One of the filters is geographic: the request carries a source IP, while each campaign carries geo-targeting data. We look up the user's region from the IP and compare it against the campaign's regions.
The problem is that the campaign side stores its geo data as strings:
type Zone struct {
	CountryId string
	RegionId  string
	CityId    string
}
The region obtained from the IP lookup is an int, so every comparison has to call strconv.Atoi to convert, and Atoi in turn calls ParseUint.
We currently have about 2,000 campaigns, of which roughly 400 actually reach this comparison. Each comparison performs three string conversions, so at 1,500 QPS that is 400 * 3 * 1500 = 1,800,000 ParseUint calls per second. That is why ParseUint consumed so much CPU.
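The fix described next, sketched under the assumption that the IP lookup side already produces ints (the Match helper is invented for illustration):

package bid

// Zone now stores numeric region IDs, so the hot path is a plain integer
// comparison with no strconv.Atoi / strconv.ParseUint calls at all.
type Zone struct {
	CountryId int
	RegionId  int
	CityId    int
}

// Match reports whether this zone targets the region resolved from the
// request IP.
func (z Zone) Match(countryId, regionId, cityId int) bool {
	return z.CountryId == countryId &&
		z.RegionId == regionId &&
		z.CityId == cityId
}

With this layout, the 1.8 million string conversions per second simply disappear.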
So I changed Zone to store ints, eliminating the ParseUint calls. After this optimization, the pprof looks like this:
(pprof) top
Total: 4554 samples
     426   9.4%   9.4%      480  10.5% runtime.mapaccess2_fast64
     232   5.1%  14.4%     1768  38.8% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     216   4.7%  19.2%      216   4.7% syscall.Syscall
     183   4.0%  23.2%      213   4.7% hash_insert
     167   3.7%  26.9%      167   3.7% ExternalCode
     141   3.1%  30.0%      222   4.9% runtime.mallocgc
     121   2.7%  32.6%      121   2.7% runtime.futex
     115   2.5%  35.2%      136   3.0% evacuate
     112   2.5%  37.6%      112   2.5% runtime.MSpan_Sweep
     103   2.3%  39.9%      103   2.3% runtime.aeshash64
When I first took over the code, the observed limit was a little over 2,000 QPS, and under peak traffic CPU usage exceeded 700%. After these two optimizations, I never saw the CPU go above 500%.
Kill defer
mapaccess2_fast64 is what the compiler generates for map accesses of the form v, ok := m[key]. The current code uses maps heavily, and when I examined the call sites, most of the uses were reasonable, so there is little to gain in this direction. Implementing our own hash table might bring some improvement, but that would be a thankless task, so I set it aside for now.
As for FilterByCampaign: every impression has to be matched against the targeting settings of every campaign, and the big for range loop doing that filtering is what burns the time. Short of an algorithmic improvement, this loop cannot be removed, and I have not yet thought of a better algorithm.
With progress stalled, I went to discuss it with my mentor. He mentioned that after a price-adjustment update went in, the system's CPU rose from 200% to 400%, and suggested I inspect that code path. Then I saw the defer, and my eyes lit up immediately:
func (this *Campaign) GetPrice(positionId, size, host int64, b *BidServer) (price float64, err error) {
	...
	defer func() {
		...
	}()
	...
	if err != nil {
		return
	}
	return
}
There is an obvious misuse here: when err is non-nil we could return directly, with no need for the deferred function to run at all, so the code should not be written this way. Beyond that, defer itself carries real overhead. Briefly: the deferred function and its arguments are packaged up, closure-like, and pushed onto the defer chain; when the function returns, the runtime walks that chain and executes the closures in LIFO order; and the garbage collector gives defers some special treatment as well. In short, the impact is considerable.
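A minimal sketch (all names invented, not the project's code) of what removing such a defer from a hot path looks like:

package main

type Campaign struct{ ChargePrice float64 }

func validate(c *Campaign) error         { return nil } // stand-in for the real checks
func recordStats(price float64, e error) {}             // stand-in bookkeeping

// With defer: a closure is registered on every call, and it runs even on
// the early error return that has no need for it.
func getPriceWithDefer(c *Campaign) (price float64, err error) {
	defer func() { recordStats(price, err) }()
	if err = validate(c); err != nil {
		return
	}
	price = c.ChargePrice
	return
}

// Without defer: the early error return pays no bookkeeping at all, and
// the stats call runs only on the path that actually needs it.
func getPriceNoDefer(c *Campaign) (price float64, err error) {
	if err = validate(c); err != nil {
		return
	}
	price = c.ChargePrice
	recordStats(price, err)
	return
}

func main() {
	c := &Campaign{ChargePrice: 1.5}
	getPriceWithDefer(c)
	getPriceNoDefer(c)
}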
GetPrice is on a code path that is executed for every request. It contains no loop, but once QPS climbs, its cost is anything but small. After this optimization, at roughly 1,000 QPS, CPU utilization dropped from about 200% before to about 120%.
Defer is a performance killer, and my principle is to avoid it as much as possible.
time.Date
Now look at the pprof:
(pprof) top
Total: 3348 samples
     254   7.6%   7.6%      266   7.9% syscall.Syscall
     177   5.3%  12.9%      221   6.6% hash_next
     168   5.0%  17.9%     1320  39.4% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     153   4.6%  22.5%      369  11.0% runtime.mallocgc
     153   4.6%  27.0%      194   5.8% runtime.mapaccess2_fast64
     113   3.4%  30.4%      113   3.4% time.absDate
     102   3.0%  33.5%      102   3.0% runtime.futex
      90   2.7%  36.1%      106   3.2% runtime.MSpan_Sweep
      83   2.5%  38.6%       83   2.5% runtime.aeshash64
      74   2.2%  40.8%       86   2.6% syscall.Syscall6
What now? time.absDate? Can this one function really account for 3.4% of the samples? A quick look at the code:
for _, campaignId := range campaignIds {
	// fetch the campaign
	campaign, ok := RCampaign[campaignId]

	// scheduling check
	now := time.Now()
	year, month, day := now.Date()
	day = year*10000 + int(month)*100 + day
	...
}
The fix: hoist the date computation out of the for loop. Happy to run into such an easy problem (a sketch of the hoisted loop below).
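Roughly like this (a sketch; the loop body is abridged and the filter function is invented to make it self-contained):

package main

import "time"

type Campaign struct{ Id int64 }

var RCampaign = map[int64]*Campaign{}

func filter(campaignIds []int64) {
	// Compute the date once per request, not once per campaign.
	now := time.Now()
	year, month, dayOfMonth := now.Date()
	day := year*10000 + int(month)*100 + dayOfMonth

	for _, campaignId := range campaignIds {
		campaign, ok := RCampaign[campaignId]
		if !ok {
			continue
		}
		// ... scheduling check against day, then the rest of the filtering
		_, _ = campaign, day
	}
}

func main() { filter([]int64{1, 2, 3}) }

The pprof after this optimization: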
(pprof) top
Total: 3310 samples
     272   8.2%   8.2%      278   8.4% syscall.Syscall
     213   6.4%  14.7%      264   8.0% hash_next
     172   5.2%  19.8%      385  11.6% runtime.mallocgc
     163   4.9%  24.8%      215   6.5% runtime.mapaccess2_fast64
     155   4.7%  29.5%     1045  31.6% dsp_masky/model/bid.(*BidServer).FilterByCampaign
     143   4.3%  33.8%      143   4.3% runtime.futex
      90   2.7%  36.5%       90   2.7% runtime.aeshash64
      87   2.6%  39.1%      116   3.5% runtime.MSpan_Sweep
      67   2.0%  41.1%       82   2.5% settype
      66   2.0%  43.1%       66   2.0% runtime.findfunc
With this done, at around 1,000 QPS, CPU usage sits at about 100% or below, roughly another 20% improvement over the previous step.
Optimize mallocgc
mallocgc accounted for 5.2%, the result of frequent memory allocation. Following the SVG call graph, I traced mallocgc to its source: it comes from this function:
func (this *Campaign) GetPrice(positionId, size, host int64, b *BidServer) (price float64, err error) {
	...
	offer = &Offer{
		CampaignId: this.Id,
		UserId:     this.CreateUserId,
		Ctype:      this.ChargeType,
	}
	malloced = true
	...
}
With many requests per second, and each request calling GetPrice on every candidate campaign to price it, the allocation of this Offer bid object is what generates all the mallocgc.
Having found the problem, I set out to fix it. The idea: allocate Offer objects from a dedicated pool.
Go 1.3 provides sync.Pool, so I tried that first: a global var OfferPool *sync.Pool, with every allocation fetched from the pool. But... it was an unsuccessful attempt. After the change, CPU utilization nearly doubled, and pprof showed a great deal of time spent in Pool.Get.
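The failed attempt looked roughly like this (a sketch, not the exact code; the Offer fields are taken from the snippet above):

package main

import "sync"

type Offer struct {
	CampaignId, UserId int64
	Ctype              int
}

// One global pool shared by every request-handling goroutine.
var OfferPool = &sync.Pool{
	New: func() interface{} { return new(Offer) },
}

func getOffer() *Offer { return OfferPool.Get().(*Offer) }

func putOffer(o *Offer) {
	*o = Offer{} // reset before handing the object back
	OfferPool.Put(o)
}

func main() {
	o := getOffer()
	o.CampaignId = 1
	putOffer(o)
}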
How can allocating from an object pool perform worse than simply allocating a fresh object every time?!
After some thought, I found the reason. The CPU burned inside Pool.Get comes from its CAS operations: contention for the pool was severe. I had built a single global pool, while every request runs in its own goroutine, so goroutines fighting over the pool wasted a great deal of CPU, and usage exploded after my "optimization". Why doesn't mallocgc suffer the same fate? Because its locking is much finer-grained: when allocating, a goroutine first takes memory from its local mcache rather than a shared pool, and only when the mcache is exhausted does it go to the global allocator, so there is no serious contention.
To avoid the global lock, I switched to a small Offer cache attached to each request, and the allocation now looks like this:
var offer *Offer
if b.OfferPoolIndex < len(b.OfferPool) {
	offer = &b.OfferPool[b.OfferPoolIndex]
	b.OfferPoolIndex++
} else {
	offer = &Offer{
		CampaignId: this.Id,
		UserId:     this.CreateUserId,
		Ctype:      this.ChargeType,
	}
}
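For this to work, the per-request BidServer presumably carries a preallocated slice plus an index that is reset between requests; a sketch with assumed sizes and helper names:

package bid

type Offer struct {
	CampaignId, UserId int64
	Ctype              int
}

// Per-request pool: one goroutine owns one BidServer, so no locks are needed.
type BidServer struct {
	OfferPool      []Offer // reused backing storage for this request's offers
	OfferPoolIndex int
}

func newBidServer() *BidServer {
	// Size assumed near the typical number of candidate campaigns (~400).
	return &BidServer{OfferPool: make([]Offer, 400)}
}

// reset lets the next request reuse the same backing array.
func (b *BidServer) reset() { b.OfferPoolIndex = 0 }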
With so many optimizations already in, this one was not a big boost, roughly 10%. Observed CPU usage at around 1,000 QPS is now about 70-80%. Further gains would no longer be worth the effort, so the optimization work ends here!
Update:
Subsequent testing found that the code change made during the ParseUint optimization introduced a bug, causing much of the computation to be filtered out. That invalidated the data for the optimizations that followed it, so those pprof numbers no longer point in the right direction. Melancholy. ~_~
Comparison before and after optimization
For the comparison, I ran two processes, one with the pre-optimization code and one with the optimized code, and used tcpcopy to feed both with the same live online traffic.
CPU utilization
Observing CPU utilization over a stretch of time, here is a small sample of the readings:
Before optimization: 473 473 450 476 469 579 495
After optimization: 258 264 246 261 258 267 274
On average, the optimized version uses 53.5% of the original's CPU.
Memory
Memory usage is slightly lower than before.
Before optimization: {"mem":"508.8MB","ver":"20140327~git-xxx","up_time":"2014-08-04 18:02:55","total_count":50208025,"last_10_second_qps":1119.800000}
After optimization: {"mem":"431.6MB","ver":"20140327~git-xxx","up_time":"2014-08-05 12:12:43","total_count":9974904,"last_10_second_qps":1056.700000}
Response time
Ideally, response times should fall within 10ms. As a comparison, I tailed 100,000 log records from each process and counted the records whose response times fell in the 10/20/30ms buckets.
Before optimization: 7309/1924/779
After optimization: 4220/1041/220