First, the conclusions up front:
- Go performance can be very good, but some standard library packages perform poorly, such as regexp and encoding/json. When performance requirements are high, they should be replaced or optimized according to the actual situation.
- On-CPU and off-CPU flame graphs are sharp weapons for program performance analysis. Generating them is tedious (especially off-CPU), but they are definitely worth having in your toolbox.
Previously, Logstash was used as the client-side program for collecting log files. Logstash is powerful, with rich data-processing plug-ins and good extensibility, but because it is implemented in JRuby its performance is worrying. Filebeat is a later, more lightweight log file collection client implemented in Go: good performance and low resource consumption, but almost no parsing or processing capability. The usual usage scenario is to collect logs with Filebeat, parse and process them with Logstash, and then upload them to Kafka or Elasticsearch. It is worth noting that both Logstash and Filebeat are excellent open source products from Elastic.
To improve client-side log collection performance, reduce data transfer and deployment complexity, and make fuller use of Go's performance advantage for log parsing, we decided to develop a plug-in for Filebeat that parses logs conforming to the company's log format specification, so that Filebeat can directly replace Logstash.
With the background covered, the following describes the implementation and optimization process.
Version 1
The simplest implementation is to parse the logs with regexp, the regular expression package that comes with Go. Its performance already turned out to be 30% higher than Logstash (which also parses the standard log format through a purpose-built plug-in).
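Below is a minimal sketch of this first approach, using Go's regexp package with named capture groups. The pattern and field names are purely illustrative; the real company log format is not shown in this article.

```go
// Version 1 sketch: parse each log line with Go's built-in regexp package.
// The pattern and field names are illustrative, not the real log format.
package parser

import "regexp"

var logPattern = regexp.MustCompile(
	`^(?P<time>\S+) (?P<level>\S+) (?P<msg>.*)$`)

// ParseLine extracts named fields from one log line; ok is false if the
// line does not match the expected format.
func ParseLine(line string) (fields map[string]string, ok bool) {
	m := logPattern.FindStringSubmatch(line)
	if m == nil {
		return nil, false
	}
	fields = make(map[string]string)
	for i, name := range logPattern.SubexpNames() {
		if i > 0 && name != "" {
			fields[name] = m[i]
		}
	}
	return fields, true
}
```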
The performance test focuses on the bottleneck of the collection-plus-parsing process: the time it takes to collect and parse 1 million logs of a specified format and length while restricted to a single CPU core (to minimize the resource footprint on the business server). The test environment is a PC with a 3.2 GHz CPU. To avoid the impact of disk IO and the page cache, both the input and output files are placed in /dev/shm. The CPU limit for Filebeat is implemented by setting the environment variable GOMAXPROCS=1 at startup.
This version takes 122 seconds to process 1 million logs, or 8,200 logs per second.
Version 2
Next, I tried some optimizations to see whether the Go plug-in's performance could be improved further. The first thought was to replace the regexp package. Linux has the C-implemented PCRE library, and the third-party package github.com/glenn-brown/golang-pkg-pcre/src/pkg/pcre wraps it for Go. On CentOS, pcre-devel needs to be installed before this package can be used.
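A hedged sketch of the PCRE-based variant follows. It assumes the MustCompile / MatcherString / GroupString API commonly shown for the golang-pkg-pcre wrapper; verify the exact signatures against the package documentation, and note that libpcre (pcre-devel on CentOS) must be installed first.

```go
// Version 2 sketch: same parsing, but through the C PCRE library via the
// golang-pkg-pcre wrapper. The calls below follow the wrapper's commonly
// documented usage; check the package before relying on these signatures.
package parser

import "github.com/glenn-brown/golang-pkg-pcre/src/pkg/pcre"

var pcrePattern = pcre.MustCompile(`^(\S+) (\S+) (.*)$`, 0)

// ParseLinePCRE extracts three illustrative fields using PCRE.
func ParseLinePCRE(line string) (tm, level, msg string, ok bool) {
	m := pcrePattern.MatcherString(line, 0)
	if !m.Matches() {
		return "", "", "", false
	}
	return m.GroupString(1), m.GroupString(2), m.GroupString(3), true
}
```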
This version's processing time is 97 seconds, a roughly 25% improvement in processing performance over the first version.
Version 3
The third version does not use regular expressions at all: since the log format is fixed, it uses strings.Index() to split the line and extract the fields. The processing time of this version is 70 seconds, a further improvement of nearly 40%.
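A minimal sketch of the strings.Index() idea, assuming a simple "|"-delimited three-field layout purely for illustration:

```go
// Version 3 sketch: for a fixed log layout, cut fields out with strings.Index
// instead of a regular expression. The "|" delimiter and the three-field
// layout are illustrative assumptions.
package parser

import "strings"

// ParseFixed splits a "time|level|message" style line without regexp.
func ParseFixed(line string) (tm, level, msg string, ok bool) {
	i := strings.Index(line, "|")
	if i < 0 {
		return "", "", "", false
	}
	j := strings.Index(line[i+1:], "|")
	if j < 0 {
		return "", "", "", false
	}
	return line[:i], line[i+1 : i+1+j], line[i+2+j:], true
}
```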
Version 4
Is there any more room for improvement? Yes: Go's own JSON package. Our logs are uploaded in JSON format, and Go's encoding/json is based on reflection, whose performance has been widely criticized; optimizing the JSON handling should therefore yield a significant gain. Since our log format is fixed and the parsed fields are fixed, JSON serialization can be done against a fixed log structure without resorting to inefficient reflection. Go has several third-party packages that generate JSON serialization/deserialization code for a given struct; we used easyjson. After installing the easyjson package, run the easyjson command on the source file that contains the log structure definition; it generates a xxx_easyjson.go file containing Marshal/Unmarshal methods for that struct. This reduced the processing time to 61 seconds, a further 15% improvement.
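A minimal sketch of the easyjson approach. The struct and its field names are hypothetical stand-ins; the real structure follows the company log format specification.

```go
// Version 4 sketch: a hypothetical log structure prepared for easyjson.
// Field names are illustrative only.
package parser

//easyjson:json
type LogEntry struct {
	Timestamp string `json:"timestamp"`
	Level     string `json:"level"`
	Message   string `json:"message"`
}

// Running `easyjson logentry.go` (after installing github.com/mailru/easyjson)
// generates logentry_easyjson.go with reflection-free MarshalJSON/UnmarshalJSON
// methods for LogEntry, which can then replace the encoding/json calls.
```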
At this point, staring at the code, I could not think of any more major optimizations. Time for the other protagonist of this article, the flame graph, to make its entrance.
The flame graph is an effective tool for performance analysis, and detailed introductions to it are available online. The most commonly seen kind is the on-CPU flame graph, used to analyze which function calls consume the CPU time.
After installing the FlameGraph tool and running one round of the test on the current version of the program, I followed its instructions to generate the flame graph below.
[Figure: perf_on_cpu_orig.png, on-CPU flame graph of the current version]
In the graph, the vertical direction represents the function call stack, and the horizontal width of each block represents its share of CPU time; what deserves attention is the wide blocks near the top. The colors of the blocks are random and carry no meaning.
Two main blocks of CPU time stand out. One is the output processing part, of which a small piece is the JSON handling; that has already been optimized and there is nothing more to do there. The other is stranger: the common.MapStr.Clone() method actually accounts for 40% of the CPU time, and looking further up the stack, most of that time goes to Errorf. One look at the code makes the reason immediately clear.
```go
func (m MapStr) Clone() MapStr {
	result := MapStr{}
	for k, v := range m {
		innerMap, err := toMapStr(v)
		if err == nil {
			result[k] = innerMap.Clone()
		} else {
			result[k] = v
		}
	}
	return result
}
```
common.MapStr is the structure that carries the log content through the pipeline. Its Clone() method decides whether a sub-value is a nested MapStr by checking whether toMapStr() returns an error; as the flame graph shows, the cost of generating those error objects is considerable. The obvious fix is to move the type check inside toMapStr() into Clone() itself and avoid creating the error, as sketched below.
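A sketch of the idea behind the fix (the actual change in the beats code base may differ): detect nested maps with a type switch inside Clone() itself, so no error object is allocated on the common path.

```go
// Sketch of the fix: check the value's type directly instead of relying on
// toMapStr() returning an error for non-map values. Illustrative only.
func (m MapStr) Clone() MapStr {
	result := MapStr{}
	for k, v := range m {
		switch inner := v.(type) {
		case MapStr:
			result[k] = inner.Clone()
		case map[string]interface{}:
			result[k] = MapStr(inner).Clone()
		default:
			result[k] = v
		}
	}
	return result
}
```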
Version 5
Regenerating the on-CPU flame graph for the modified code gives the following.
[Figure: perf_on_cpu_opt.png, on-CPU flame graph after the Clone() fix]
This time common.MapStr.Clone() has almost disappeared from the graph, which shows that the CPU time it consumes is now negligible.
The test time was shortened to 46 seconds, another improvement of roughly 33%. A big win!
Up to this point, one issue remained unresolved: after restricting Filebeat to a single core, CPU utilization during the test only reached around 82%. Could improper use of locks be hurting performance?
It is time for the off-CPU flame graph to take the stage. The off-CPU flame graph is used to analyze where time goes when the program is not making efficient use of the CPU; detailed introductions are available online. Data collection is more complex than for the on-CPU flame graph; the openresty-systemtap-toolkit provided by the well-known agentzh (Yichun Zhang) can be used. The installation of the kernel-devel and debuginfo packages is not described in detail on that project's page, so it is recorded here.
```sh
# kernel-devel is straightforward; install it directly with yum
sudo yum install -y kernel-devel

# debuginfo needs to be installed like this on CentOS 7
sudo vim /etc/yum.repos.d/CentOS-Debuginfo.repo   # set enabled=1
sudo debuginfo-install kernel

# The installation may still fail with:
#   Invalid GPG Key from file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Debug-7: No key found in given key data
# Download the key from https://www.centos.org/keys/RPM-GPG-KEY-CentOS-Debug-7
# and write it to /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-Debug-7
```
After the installation, follow the instructions to generate the off-CPU flame graph, as follows:
[Figure: perf_off_cpu_orig.png, off-CPU flame graph]
I cannot fully interpret this graph yet, but it is already clear that writes to the registry file (which Filebeat uses to record the list of harvested files and their offsets) take up a noticeable share. So I tried setting Filebeat's spool_size (the registry file is updated once for every spool_size log events) to 10240, five times the default; with that change the CPU could run above 95% during the test. Placing the registry file in /dev/shm/ also solves the problem of the CPU not being fully used during the test.
This rules out the earlier guess that improper locking was hurting performance. In a real deployment, the spool_size setting should be determined from test data combined with the characteristics of the output end (for example, writing to Kafka).
At this point the optimization work is done, reaching up to three times the performance of the original version!
The detailed performance figures for each version are shown below.
[Figure: performance_compare.png, performance comparison across versions]
A little clarification is required:
- Filebeat development is based on version 5.3.1
- The Logstash test uses one worker process (the -w 1 parameter) and is not restricted to a single core
- The execution time includes the program's startup time (Logstash takes nearly 20 seconds to start)
The final result: for logs of the specific format and length, parsing throughput reaches 25,000 logs per second on a PC, and even on production servers with lower CPU frequencies it reaches 20,000 per second.
Go's high performance is no empty boast, provided, of course, that you optimize enough :)
Appendix: there is an interesting discussion about Go's performance here:
http://stackoverflow.com/questions/20875341/why-golang-is-slower-than-scala