Summary:

1. A ring buffer performs better than a channel when tasks are distributed through a queue under high concurrency.
2. defer is not resolved at compile time; it is implemented by the runtime, so using it incurs some extra overhead (worth knowing, but still use defer where it is appropriate).
3. encoding/json serializes through reflection, which is slow. ffjson can generate encode/decode code to improve performance. If possible, use MessagePack instead of JSON, because it performs better.
4. Creating objects on the stack is much cheaper than creating them on the heap, so create fewer objects with new. Code that needs many temporary objects can use sync.Pool to reduce the pressure on the GC.
5. When concurrent access to the same object has to be extremely fast, add padding to avoid false sharing and improve the CPU cache hit rate.
6. Consider lock-free data structures to reduce the impact of locks on concurrent performance.
7. Optimize only the critical points, and weigh the gain against the development cost.

The body of the article follows.

High-Performance Programming Tips for Go

I have forgotten what this post was originally going to be about, but I am fairly sure it is about Go. It is mainly a post about speed of execution, not speed of delivery; the two are different.
 
I have worked with a lot of smart people. Many of us were obsessed with performance, and much of what we did was push against the limits of what the platform could give us. App Engine imposes some very strict constraints, so we made a change. Since adopting Go, we have learned a lot about how to improve performance and how Go works for systems programming.
 
 
Go's simplicity and concurrency model make it an attractive choice for building backend systems, but the bigger question is how it fares in latency-sensitive scenarios. Is it worth sacrificing the simplicity of the language to make it faster? Let's look at a few aspects of performance optimization in Go (language features, memory management, and concurrency) and at how to make sensible optimization decisions based on them. All the benchmark code presented here is available on GitHub (https://github.com/tylertreat/go-benchmarks).
 
Channels
 
Channels get a lot of attention in Go because they are a convenient concurrency tool, but it is important to understand their impact on performance. In most scenarios their performance is "good enough", but in some latency-sensitive scenarios they can become a bottleneck. Channels are not magic: the underlying implementation uses a lock. In a single-threaded application with no lock contention this works well, but in a multithreaded scenario performance can drop sharply. A lock-free ring buffer can easily stand in for a channel as a queue.
 
 
The first benchmark compares a buffered channel with a ring buffer in the simplest setup: one producer and one consumer. First, the single-core case (GOMAXPROCS=1):
 
 
  
    BenchmarkChannel       3000000             ns/op
    BenchmarkRingBuffer   20000000    80.9 ns/op
 
 
 
 
As you can see, the ring buffer is about six times faster. (If you are not familiar with Go's benchmarking tool, the middle number is the number of iterations and the last number is the time taken per iteration.) Next, let's look at the GOMAXPROCS=8 case:
 
 
  
    BenchmarkChannel-8        3000000    542 ns/op
    BenchmarkRingBuffer-8    10000000    182 ns/op
 
 
 
 
The ring buffer is nearly three times faster.
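
For readers who want to reproduce numbers of this kind, here is a minimal sketch of a single-producer, single-consumer channel benchmark. It is not the code from the linked repository; the buffer size of 1024 and the helper name runChannelBenchmark are assumptions made for illustration. It uses testing.Benchmark so that it can run as an ordinary program, and it prints the same two numbers as the result lines above: the iteration count and the time per iteration.

```go
package main

import (
	"fmt"
	"testing"
)

// benchmarkChannel measures sending b.N integers through a buffered channel
// from one producer (this goroutine) to one consumer goroutine.
func benchmarkChannel(b *testing.B) {
	ch := make(chan int, 1024) // buffer size is an arbitrary choice
	done := make(chan struct{})

	go func() { // consumer: drain until the channel is closed
		for range ch {
		}
		close(done)
	}()

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		ch <- i
	}
	close(ch)
	<-done
}

// runChannelBenchmark (hypothetical helper) runs the benchmark outside of
// "go test" and returns the iteration count and the time per iteration.
func runChannelBenchmark() (iterations int, nsPerOp int64) {
	r := testing.Benchmark(benchmarkChannel)
	return r.N, r.NsPerOp()
}

func main() {
	n, ns := runChannelBenchmark()
	fmt.Printf("BenchmarkChannel\t%d\t%d ns/op\n", n, ns)
}
```

Running the program with the GOMAXPROCS environment variable set to 1 and then to 8 gives the two configurations compared above; the absolute numbers will differ by machine and Go version.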
 
Channels are often used to distribute work to workers. In the next test we compare multiple readers reading from the same channel or ring buffer. With GOMAXPROCS=1, the results show that the channel performs decidedly better in a single-core application.
 
 
  
    BenchmarkChannelReadContention       10000000       148 ns/op
    BenchmarkRingBufferReadContention       10000    390195 ns/op
 
 
 
 
However, the ring buffer is faster in the multi-core case:
 
 
  
    BenchmarkChannelReadContention-8       1000000    3105 ns/op
    BenchmarkRingBufferReadContention-8    3000000     411 ns/op
 
 
 
 
Finally, let's look at the case of multiple readers and multiple writers. Again, the comparison below shows that the ring buffer does better on multiple cores.
 
 
  
    BenchmarkChannelContention           10000       160892 ns/op
    BenchmarkRingBufferContention            2    806834344 ns/op
    BenchmarkChannelContention-8                     314428 ns/op
    BenchmarkRingBufferContention-8      10000       182557 ns/op
 
 
 
 
The ring buffer achieves thread safety using only CAS (compare-and-swap) operations. As we can see, the choice between a channel and a ring buffer depends largely on the number of cores in the system. For most systems GOMAXPROCS > 1, so the lock-free ring buffer is often the better choice. Channels are a comparatively poor choice on multi-core systems.
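
To make "only CAS operations" concrete, here is a minimal sketch of a bounded multi-producer, multi-consumer ring buffer. It is not the implementation benchmarked above; it follows the common design in which every slot carries a sequence number, and the names slot, Put, Get, and transfer are assumptions made for this illustration. Producers and consumers claim a position with a compare-and-swap and spin (yielding to the scheduler) when the buffer is full or empty.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// slot pairs a value with a sequence number that tells producers and
// consumers whose turn it is to use the slot.
type slot struct {
	seq  uint64
	data int
}

// RingBuffer is a bounded multi-producer, multi-consumer queue.
type RingBuffer struct {
	head  uint64 // next position to write
	tail  uint64 // next position to read
	mask  uint64 // size-1
	slots []slot
}

// NewRingBuffer creates a buffer; size must be a power of two.
func NewRingBuffer(size uint64) *RingBuffer {
	rb := &RingBuffer{mask: size - 1, slots: make([]slot, size)}
	for i := range rb.slots {
		rb.slots[i].seq = uint64(i)
	}
	return rb
}

// Put spins until a slot is free, claims it with CAS, then publishes the value.
func (rb *RingBuffer) Put(v int) {
	for {
		pos := atomic.LoadUint64(&rb.head)
		s := &rb.slots[pos&rb.mask]
		if atomic.LoadUint64(&s.seq) == pos &&
			atomic.CompareAndSwapUint64(&rb.head, pos, pos+1) {
			s.data = v
			atomic.StoreUint64(&s.seq, pos+1) // mark the slot readable
			return
		}
		runtime.Gosched() // buffer full or lost the race: yield and retry
	}
}

// Get spins until a value is available, claims it with CAS, then frees the slot.
func (rb *RingBuffer) Get() int {
	for {
		pos := atomic.LoadUint64(&rb.tail)
		s := &rb.slots[pos&rb.mask]
		if atomic.LoadUint64(&s.seq) == pos+1 &&
			atomic.CompareAndSwapUint64(&rb.tail, pos, pos+1) {
			v := s.data
			atomic.StoreUint64(&s.seq, pos+rb.mask+1) // free the slot for the next lap
			return v
		}
		runtime.Gosched() // buffer empty or lost the race: yield and retry
	}
}

// transfer pushes 1..perProducer from each producer through the buffer and
// returns the sum of everything the consumers received.
func transfer(producers, perProducer int) int64 {
	rb := NewRingBuffer(64)
	var sum int64
	var wg sync.WaitGroup
	for p := 0; p < producers; p++ {
		wg.Add(2)
		go func() { // producer
			defer wg.Done()
			for i := 1; i <= perProducer; i++ {
				rb.Put(i)
			}
		}()
		go func() { // matching consumer
			defer wg.Done()
			for i := 0; i < perProducer; i++ {
				atomic.AddInt64(&sum, int64(rb.Get()))
			}
		}()
	}
	wg.Wait()
	return sum
}

func main() {
	fmt.Println(transfer(4, 1000)) // 4 * (1000*1001/2) = 2002000
}
```

The sequence number is what lets a claimed slot be handed over safely: a producer only writes once the slot's sequence matches its position, and a consumer only reads once the producer has advanced it. A production implementation would also need a way to close the queue and would probably pad the head and tail fields, as discussed in the false sharing section.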
 
Defer
 
defer is a very useful keyword for improving readability and avoiding resource leaks. For example, when we open a file for reading, we need to close it once we are done. Without defer, we have to make sure the file is closed before every return point of the function.
 
    // Needs the bufio, errors, and os packages.
    func findHelloWorld(filename string) error {
        file, err := os.Open(filename)
        if err != nil {
            return err
        }

        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            if scanner.Text() == "hello, world!" {
                file.Close() // close before the early return
                return nil
            }
        }

        file.Close() // close before the remaining returns
        if err := scanner.Err(); err != nil {
            return err
        }

        return errors.New("didn't find hello world")
    }
 
 
This is error-prone, because it is easy to forget to close the file before one of the return statements. defer solves the problem with a single line of code.
 
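    // Needs the bufio, errors, and os packages.
    func findHelloWorld(filename string) error {
        file, err := os.Open(filename)
        if err != nil {
            return err
        }
        defer file.Close() // runs on every return path

        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            if scanner.Text() == "hello, world!" {
                return nil
            }
        }

        if err := scanner.Err(); err != nil {
            return err
        }

        return errors.New("didn't find hello world")
    }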
 
 
At first glance, one might think that defer could be optimized away entirely by the compiler. If defer is used only at the top of a function, the compiler could simply insert the deferred call before every return statement. But reality is more complicated. For example, defer can appear inside a conditional statement or a loop. The first case would require the compiler to track which branch applied the defer. The compiler would also have to account for panics, since they are another way for a function to exit. Implementing defer purely through static compilation therefore seems (at least superficially) infeasible.
 
defer is not a zero-cost keyword, as a benchmark shows. In the following test we lock a mutex in the loop body and compare unlocking it directly with unlocking it through a defer statement.
 
 
  
    BenchmarkMutexDeferUnlock-8     20000000    96.6 ns/op
    BenchmarkMutexUnlock-8         100000000    19.5 ns/op
 
 
 
 
Using defer is almost five times slower. In all fairness, 77 ns may not matter much, but inside a loop it does affect performance. It is usually up to the developer to weigh performance against readability. Optimization always comes at a cost.
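
The comparison described above can be sketched as follows. The counter type and its method names are hypothetical, introduced only for this illustration, and the benchmark bodies are an assumption about how such a test might be written rather than the code from the repository. Note that Go 1.14 and later compile most defer statements much more efficiently, so the gap measured on a current toolchain may be far smaller than the figures above; it is worth re-measuring.

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// counter is a hypothetical mutex-protected counter used only for this comparison.
type counter struct {
	mu sync.Mutex
	n  int
}

// IncDirect unlocks explicitly.
func (c *counter) IncDirect() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// IncDefer unlocks through defer.
func (c *counter) IncDefer() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

func main() {
	var c counter

	direct := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			c.IncDirect()
		}
	})
	deferred := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			c.IncDefer()
		}
	})

	// Each result prints as "<iterations> <time> ns/op".
	fmt.Println("direct unlock:", direct)
	fmt.Println("defer unlock: ", deferred)
}
```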
 
Reflection and JSON

Reflection is usually slow and should be avoided in latency-sensitive services. JSON is a common data interchange format, but Go's encoding/json package relies on reflection to serialize and deserialize JSON. With ffjson we can avoid reflection by using code generation. Here is how the two compare:
 
 
  
    BenchmarkJSONReflectionMarshal-8      200000    7063 ns/op
    BenchmarkJSONMarshal-8                500000    3981 ns/op

    BenchmarkJSONReflectionUnmarshal-8    200000    9362 ns/op
    BenchmarkJSONUnmarshal-8              300000    5839 ns/op
 
 
 
 
The JSON serialization and deserialization code generated by ffjson is about 38% faster than the reflection-based standard library. Of course, if we really care about codec performance, we should avoid JSON altogether. MessagePack is a better choice for serialization. In this test we use the msgp library and compare it with JSON.
 
 
  
   
 
    BenchmarkMsgpackMarshal-8              3000000     555 ns/op
    BenchmarkJSONReflectionMarshal-8        200000    7063 ns/op
    BenchmarkJSONMarshal-8                  500000    3981 ns/op

    BenchmarkMsgpackUnmarshal-8           20000000    94.6 ns/op
    BenchmarkJSONReflectionUnmarshal-8      200000    9362 ns/op
    BenchmarkJSONUnmarshal-8                300000    5839 ns/op
 
   
 
 
 
 
The difference here is dramatic. MessagePack is still much faster, even compared with the code generated by ffjson.

If we really care about micro-optimizations, we should also avoid interface types, which require extra processing during serialization and deserialization. Dynamic method calls through an interface also add some run-time overhead, and the compiler is unable to inline them.
 
 
  
    BenchmarkJSONReflectionUnmarshal-8         200000     9362 ns/op
    BenchmarkJSONReflectionUnmarshalIface-8    200000    10099 ns/op
 
 
 
 
Let's also look at method lookup, that is, resolving an interface variable to its underlying concrete type. This test calls the same method on the same struct type; the difference is that the second benchmark makes the call through an interface value that holds the struct.
 
 
  
    BenchmarkStructMethodCall-8    2000000000    0.44 ns/op
    BenchmarkIfaceMethodCall-8     1000000000    2.97 ns/op
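
The two kinds of call being compared can be sketched like this. The point type, the summer interface, and both helper functions are hypothetical names chosen for illustration; the benchmark in the repository may be structured differently.

```go
package main

import "fmt"

type point struct{ x, y int }

func (p *point) sum() int { return p.x + p.y }

// summer is the interface that *point implements.
type summer interface{ sum() int }

// callDirect calls the method on the concrete type: the target is known at
// compile time, so the call is static and can be inlined.
func callDirect(p *point) int { return p.sum() }

// callIface calls the method through an interface value: in general the
// target is looked up at run time through the interface's method table.
func callIface(s summer) int { return s.sum() }

func main() {
	p := &point{x: 1, y: 2}
	fmt.Println(callDirect(p), callIface(p)) // prints "3 3"
}
```

Both calls return the same result; only the dispatch mechanism differs, which is where the extra time per call in the benchmark comes from.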
 
 
 
 
Sorting is a more practical example that shows the performance difference well. In this test we compare sorting a slice of 1,000,000 structs with sorting a slice of 1,000,000 interfaces backed by the same struct type. Going by the numbers below, sorting the structs takes about 63% less time than sorting the interfaces (roughly 2.7 times faster).
 
 
  
    BenchmarkSortStruct-8         105276994 ns/op
    BenchmarkSortIface-8     5    286123558 ns/op
 
 
 
 
In summary, avoid JSON if possible. If you do need JSON, generate the serialization and deserialization code. In general, it is best to avoid relying on reflection and interfaces and to write code against concrete types instead. Unfortunately, this often leads to a lot of duplicated code, so it is best to abstract it away with code generation. Again, weigh the gains against the costs.
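
To show the idea behind generated encoders, here is a hand-written sketch of a type-specific encoder next to the reflection-based json.Marshal. This is not ffjson output; the User type and the AppendJSON method are hypothetical, and the string quoting via strconv.AppendQuote is an assumption that only holds for plain printable text (a real generator handles every JSON escaping rule).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// User is a hypothetical message type for this illustration.
type User struct {
	ID   int64  `json:"id"`
	Name string `json:"name"`
}

// AppendJSON encodes the struct field by field, with no reflection:
// the field names and types are fixed in the code itself.
func (u *User) AppendJSON(buf []byte) []byte {
	buf = append(buf, `{"id":`...)
	buf = strconv.AppendInt(buf, u.ID, 10)
	buf = append(buf, `,"name":`...)
	buf = strconv.AppendQuote(buf, u.Name) // assumes plain printable text
	return append(buf, '}')
}

func main() {
	u := &User{ID: 42, Name: "gopher"}

	reflected, err := json.Marshal(u) // inspects the struct through reflection
	if err != nil {
		panic(err)
	}
	generated := u.AppendJSON(nil) // type-specific code path

	fmt.Println(string(reflected)) // {"id":42,"name":"gopher"}
	fmt.Println(string(generated)) // {"id":42,"name":"gopher"}
}
```

The trade-off mentioned above is visible even in this small example: the fast path has to be written (or generated) again for every type.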
 
Memory management
 
Go does not actually expose heap or stack allocation directly to the user. In fact, the words "heap" and "stack" do not appear anywhere in the Go language specification, which means that anything to do with the stack and the heap is technically an implementation detail. In practice, each goroutine does have its own stack, and there is a heap. The compiler performs escape analysis to determine whether an object can live on the stack or must be allocated on the heap.

Unsurprisingly, avoiding heap allocation can be a major area of optimization. By allocating on the stack (for example, using a plain value A{} that does not escape rather than a pointer from new(A) that does), we avoid expensive malloc calls, as the test below shows.
 
 
  
    BenchmarkAllocateHeap-8      20000000    62.3 ns/op      B/op    1 allocs/op
    BenchmarkAllocateStack-8    100000000    11.6 ns/op    0 B/op    0 allocs/op
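
A small sketch of how to observe this, using testing.AllocsPerRun to count allocations. The type A and the functions onHeap, onStack, and measure are hypothetical names for this illustration. One detail worth keeping in mind: it is escape analysis, not the choice of syntax, that decides where an object lives, so the new(A) below still stays on the stack because the pointer never leaves the function.

```go
package main

import (
	"fmt"
	"testing"
)

type A struct{ x, y int64 }

// sink is a package-level variable: anything stored here escapes to the heap.
var sink *A

// onHeap stores a pointer where it outlives the call, forcing a heap allocation.
//
//go:noinline
func onHeap(i int64) { sink = &A{x: i} }

// onStack uses new(A), but the pointer never leaves the function, so escape
// analysis keeps the object on the stack.
//
//go:noinline
func onStack(i int64) int64 {
	a := new(A)
	a.x, a.y = i, 1
	return a.x + a.y
}

// measure reports the average number of heap allocations per call.
func measure() (heapAllocs, stackAllocs float64) {
	heapAllocs = testing.AllocsPerRun(1000, func() { onHeap(1) })
	stackAllocs = testing.AllocsPerRun(1000, func() { _ = onStack(1) })
	return heapAllocs, stackAllocs
}

func main() {
	h, s := measure()
	fmt.Printf("heap: %.0f allocs/op, stack: %.0f allocs/op\n", h, s)
}
```

Building with go build -gcflags=-m prints the compiler's escape analysis decisions, which is a quick way to check why a particular value ends up on the heap.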
 
 
 
 
Naturally, passing by pointer is faster than passing by value, because the former copies only a single pointer while the latter copies the entire object. The difference in the results below is almost negligible, and in general it depends heavily on the type of object being copied. Note that some compiler optimizations may be at play in this test.
 
 
  
    BenchmarkPassByReference-8    1000000000    2.35 ns/op
    BenchmarkPassByValue-8         200000000    6.36 ns/op
 
 
 
 
However, the biggest problem with heap allocation is garbage collection (GC). If we create many short-lived objects, we trigger GC work. In this scenario an object pool comes in handy. In the following test we compare plain heap allocation with sync.Pool. The object pool gives a fivefold improvement.
 
 
  
    BenchmarkConcurrentStructAllocate-8     5000000     337 ns/op
    BenchmarkConcurrentStructPool-8        20000000    65.5 ns/op
 
 
 
 
Note that the contents of Go's sync.Pool are themselves reclaimed during garbage collection. The purpose of sync.Pool is to reuse memory between garbage collections. We can also maintain our own free list so that objects are never reclaimed, but that arguably defeats the purpose of having a garbage collector. Go's pprof tool is very useful for profiling memory usage. Be sure to profile with it before blindly optimizing memory.
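
A minimal sketch of the sync.Pool pattern, assuming the temporary objects are byte buffers; bufPool and greet are hypothetical names for this illustration, and the benchmark above pools structs rather than buffers.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers; New is called only when the pool is empty.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// greet builds a string using a pooled temporary buffer instead of
// allocating a fresh one on every call.
func greet(name string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()            // a recycled buffer may still hold old data
	defer bufPool.Put(buf) // return the buffer for reuse

	buf.WriteString("hello, ")
	buf.WriteString(name)
	return buf.String() // String copies, so the result is safe after Put
}

func main() {
	fmt.Println(greet("gopher")) // prints "hello, gopher"
}
```

The important habits are resetting an object after Get and never touching it after Put, since another goroutine may pick it up immediately.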
 
False sharing
 
When performance really matters, you have to start thinking at the hardware level. The famous Formula One driver Jackie Stewart once said, "You don't have to be an engineer to be a racing driver, but you do have to have mechanical sympathy." A deep understanding of the inner workings of a car makes you a better driver. Likewise, understanding how a computer works makes you a better programmer. For example, how is memory laid out? How do CPU caches work? How do hard disks work?
 
Memory bandwidth is still a limited resource for modern CPUs, so caching is extremely important for avoiding performance bottlenecks. Modern multi-core processors cache data in cache lines, typically 64 bytes in size, to reduce the cost of main-memory accesses. To keep caches coherent, even a small write to memory invalidates the whole cache line, so subsequent reads of adjacent addresses miss the cache. This phenomenon is called false sharing. It becomes a problem when multiple threads access different data in the same cache line.

Picture how a Go struct is laid out in memory. Taking the earlier ring buffer as an example, the struct might look like this:
 
    type RingBuffer struct {
        queue          uint64
        dequeue        uint64
        mask, disposed uint64
        nodes          nodes
    }
 
 
The queue and dequeue fields track the positions of producers and consumers, respectively. Each field is 8 bytes in size and is accessed and modified concurrently by multiple threads to implement insertion into and removal from the queue. Because the two fields are stored contiguously in memory and occupy only 16 bytes, they are very likely to sit in the same cache line. Modifying either field therefore invalidates the cached copy of the other, which means the next read will be slower. In other words, adding and removing elements in the ring buffer causes a lot of CPU cache invalidation.

We can add padding directly to the fields of the struct. Each padding block is as large as a CPU cache line, which guarantees that the ring buffer's fields land in different cache lines. Here is the modified struct:
 
    type RingBuffer struct {
        _padding0      [8]uint64
        queue          uint64
        _padding1      [8]uint64
        dequeue        uint64
        _padding2      [8]uint64
        mask, disposed uint64
        _padding3      [8]uint64
        nodes          nodes
    }
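
One way to check that the padding does what is intended is to compare field offsets with unsafe.Offsetof. In this sketch the nodes type is a placeholder (the article does not show its definition), and the type and function names are assumptions for illustration. With 64-byte cache lines, two fields can only share a line if they are less than 64 bytes apart.

```go
package main

import (
	"fmt"
	"unsafe"
)

// nodes is a placeholder; the real element type is not shown in the article.
type nodes []uint64

type ringBuffer struct {
	queue          uint64
	dequeue        uint64
	mask, disposed uint64
	nodes          nodes
}

type paddedRingBuffer struct {
	_padding0      [8]uint64
	queue          uint64
	_padding1      [8]uint64
	dequeue        uint64
	_padding2      [8]uint64
	mask, disposed uint64
	_padding3      [8]uint64
	nodes          nodes
}

// unpaddedGap returns the distance in bytes between queue and dequeue.
func unpaddedGap() uintptr {
	var rb ringBuffer
	return unsafe.Offsetof(rb.dequeue) - unsafe.Offsetof(rb.queue)
}

// paddedGap returns the same distance for the padded layout.
func paddedGap() uintptr {
	var rb paddedRingBuffer
	return unsafe.Offsetof(rb.dequeue) - unsafe.Offsetof(rb.queue)
}

func main() {
	fmt.Println(unpaddedGap(), paddedGap()) // prints "8 72"
}
```

A gap of 8 bytes means the two counters almost certainly share a cache line; a gap of 72 bytes means they cannot, at the price of a noticeably larger struct.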
 
 
How much difference does this make in practice? As with any optimization, it depends on the situation: the number of CPU cores, the amount of resource contention, and the memory layout. Although there are many factors to consider, we should still let the data speak. We can compare the ring buffer with and without padding.

First, we test one producer and one consumer, each running in its own goroutine. In this test the difference between the two is small, with a performance gain of less than 15%:
 
 
  
    BenchmarkRingBufferSPSC-8          10000000    156 ns/op
    BenchmarkRingBufferPaddedSPSC-8    10000000        ns/op
 
 
 
 
However, with multiple producers and multiple consumers, say 100 of each, the difference becomes more pronounced. In this case the padded version is about 36% faster.
 
 
  
    BenchmarkRingBufferMPMC-8          100000    27763 ns/op
    BenchmarkRingBufferPaddedMPMC-8    100000    17860 ns/op
 
 
 
 
False sharing is a very real problem. Depending on the level of concurrency and memory contention, it can be worth adding padding to mitigate its effects. These numbers may look trivial, but they add up, especially when every clock cycle counts.

Lock-free

Lock-free data structures are very important for making full use of multiple cores. Given that Go targets highly concurrent scenarios, it discourages the use of locks and encourages channels over mutexes.

That said, the standard library does provide the usual memory-level atomic operations in the sync/atomic package, including atomic compare-and-swap and atomic pointer access. However, use of the atomic package is largely discouraged:
 
 
  
    We generally don't want sync/atomic to be used at all... Experience has shown us again and again that very very few people are capable of writing correct code that uses atomic operations... If we had thought of internal packages when we added the sync/atomic package, perhaps we would have used that. Now we can't remove the package because of the Go 1 guarantee.
 
 
 
 
How hard is it to go lock-free? Is it enough to sprinkle in a few CAS operations? Having learned enough about it, I have come to realize that it is definitely a double-edged sword. Lock-free code can be very complex to implement. Packages such as atomic and unsafe are not easy to use. On top of that, writing thread-safe, lock-free code is tricky and error-prone. Simple lock-free data structures like the ring buffer are relatively easy to maintain, but anything beyond that quickly becomes problematic.
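
As a small illustration of the compare-and-swap pattern, here is a sketch of a lock-free "maximum seen so far" value shared by many goroutines. The function names storeMax and concurrentMax are hypothetical. The retry loop is the essence of CAS-based code: read the current value, compute the update, and try again if another goroutine changed the value in the meantime.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// storeMax raises *addr to v if v is larger, retrying whenever another
// goroutine modifies *addr between the load and the compare-and-swap.
func storeMax(addr *int64, v int64) {
	for {
		cur := atomic.LoadInt64(addr)
		if v <= cur {
			return // someone already stored a value at least as large
		}
		if atomic.CompareAndSwapInt64(addr, cur, v) {
			return // our update won
		}
		// CAS failed: *addr changed under us, so loop and re-check.
	}
}

// concurrentMax starts n goroutines that each offer one of the values 1..n
// and returns the maximum that ends up stored.
func concurrentMax(n int) int64 {
	var highest int64
	var wg sync.WaitGroup
	for i := 1; i <= n; i++ {
		wg.Add(1)
		go func(v int64) {
			defer wg.Done()
			storeMax(&highest, v)
		}(int64(i))
	}
	wg.Wait()
	return atomic.LoadInt64(&highest)
}

func main() {
	fmt.Println(concurrentMax(100)) // prints 100
}
```

Even in this tiny case the subtle part is easy to miss: a plain "if v > *addr { *addr = v }" would be a data race, and dropping the retry loop would silently lose updates.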
 
The Ctrie was my introduction to implementing lock-free data structures. Although the theory is easy enough to understand, the implementation is actually very complex. Debugging lock-free code in a highly concurrent environment is very painful. If you do not get it right the first time, you will spend a lot of time fixing the mistakes.

But writing complicated lock-free algorithms clearly pays off, otherwise why would anyone do it? With a Ctrie, inserts are more expensive than with a synchronized map or a skip list, because there are more addressing operations. The real advantages of the Ctrie are its memory consumption, which, unlike most hash tables, always depends on the number of keys currently in the tree, and its ability to take a linearizable snapshot in constant time. We compared taking snapshots of a synchronized map and of a Ctrie under a concurrency of 100:
 
 
  
    BenchmarkConcurrentSnapshotMap-8               9941784 ns/op
    BenchmarkConcurrentSnapshotCtrie-8    20000      90412 ns/op
 
 
 
 
For certain access patterns, lock-free data structures provide better performance in multithreaded systems. For example, the NATS message queue uses a data structure based on a synchronized map for subscription matching. With a lock-free Ctrie, throughput can increase considerably.

[Figure: latency comparison. The blue line is the lock-based implementation; the red line is the lock-free implementation.]

Avoiding locks in the right scenarios can give a good performance boost. The contrast between the ring buffer and the channel shows the clear advantage of a lock-free structure. However, we need to weigh the complexity of the code against the benefit gained. In fact, sometimes a lock-free structure provides no tangible benefit at all.
 
Considerations for Optimization
 
As the discussion above shows, performance optimization always comes at a cost. Knowing and understanding optimization techniques is only the first step; it is more important to understand when and where to apply them. A famous quote attributed to C. A. R. Hoare has become a classic maxim for programmers:
 
 
  
    The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
 
 
 
 
But the point is not to oppose optimization; it is to learn how to weigh different kinds of speed: the speed of the algorithm, the speed of delivery, the speed of maintenance, and the speed of the system. This is a very subjective topic, and there is no simple rule. Is premature optimization the root of all evil? Should I make it work first and then make it fast? Or does it need to be fast at all? There is no standard answer. Sometimes "make it work, then make it fast" is the right order, and sometimes speed has to come first.

My advice, however, is to optimize only the critical path. The further you stray from the critical path, the lower the return on your optimization and the more likely you are to waste your time. It is important to identify what your performance requirements actually are, and not to spend time going beyond them. Be data-driven: rely on measurement, not impulse. Be practical, too. It makes no sense to shave tens of nanoseconds off code that is not time-sensitive. There are bigger optimizations to be made elsewhere.

Summary

If you have read this far, congratulations, though there may be something wrong with you. We have learned that in software there are really two kinds of speed: speed of delivery and speed of execution. Customers want the first, developers chase the second, and CTOs want both. The first is by far the most important if you want people to use your product. The second is something you plan for and iterate on. The two are often at odds with each other.
 
Perhaps more interestingly, we have discussed some ways to improve the performance of Go and make it more viable for low-latency systems. Go was designed for simplicity, but that simplicity sometimes comes at a price. As with the trade-off between the two kinds of speed, there is a trade-off between the maintainability of code and its performance. Speed often means sacrificing code simplicity, spending more development time, and paying higher maintenance costs later. Choose wisely.
 
 
Original article: So You Wanna Go Fast?