Abstract: Since adopting microservices we have run into many problems, and one of the biggest is troubleshooting. A service interface often depends on several other services, and a slow dependency directly degrades the quality of the interface itself. Slow dependencies are common in production, yet they are surprisingly hard to investigate ...
We often see experts online discussing how to optimize and improve performance, but they rarely mention one important step: how did they find those low-probability failures in the first place? Distributed tracing systems are common at large Internet companies, but small and medium-sized companies lack the engineering resources to build one. In our view, even a low-traffic system can be critical to a company and deserves this kind of hardening; finding problems in order to fix them is the goal we have been pursuing all along.
Implementing a distributed tracing system involves real technical difficulty. It must handle performance capture, log writing, log collection and collation, log transfer, log storage, log indexing, real-time log analysis, and finally merging everything for display, and it must stand up to the load of a high-traffic system. For example, if each interface in each request generates 1 KB of log, a server handling 2000 QPS produces 2 MB of log per second; if each request depends on five interfaces, that becomes 10 MB of log per second, and the figure keeps growing as online business and traffic become more complex.
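The log-volume arithmetic above can be sketched as a small calculation (the parameter names are illustrative, not part of fiery's API):

```java
public class LogVolume {
    // Bytes of log produced per second, given the size of one log line,
    // the request rate, and how many log lines each request writes
    // (one line per interface it touches).
    static long bytesPerSecond(long bytesPerLog, long qps, long logsPerRequest) {
        return bytesPerLog * qps * logsPerRequest;
    }

    public static void main(String[] args) {
        // ~1 KB per log line at 2000 QPS: one interface alone writes ~2 MB/s
        System.out.println(bytesPerSecond(1000, 2000, 1) + " B/s");
        // a request that logs 5 interfaces writes ~10 MB/s
        System.out.println(bytesPerSecond(1000, 2000, 5) + " B/s");
    }
}
```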
Large Internet companies run many distributed tracing systems able to withstand billions of requests, but for a small company that architecture is a heavy burden: most of these systems depend on distributed messaging, distributed storage, and distributed computing, and even a lightweight deployment needs at least six servers, which is not cost-effective for a typical small company.
This time we are open-sourcing two distributed tracing systems. One is a stand-alone version aimed at small and medium-sized Internet companies; it can support a business system of up to 20 million PV (such as a payment system). The other is a distributed version that supports billions of PV. For now only the stand-alone version, fiery (https://github.com/weiboad/fiery), is open. It is designed for small and medium-sized businesses: the whole project ships as a single jar and works out of the box on any Java 8 runtime, though of course your system still needs some simple instrumentation. The C++ distributed version has more dependencies and places some demands on operations staff; watch the stand-alone project for its subsequent release. Because everything is fully open source and self-hosted, even core trading systems with sensitive data can use it.
At present there are many approaches to distributed tracing: some are used internally by particular companies, and some are offered as small-scale free services. The common approach is statistical, recording the performance of each block. Ours is not quite the same as what is on the market: through continuous experimentation we have simplified heavily, keeping only the features we believe are genuinely practical, and we position the system as distributed monitoring for critical systems such as payment and trading systems.
We record the details of each request: its return value, concrete timings, and other information. Tabular analysis then quickly reveals the performance of the interfaces a service depends on (including third-party and non-instrumented interfaces), and instrumented interfaces additionally get their own performance ranking. By scanning these ranking tables we can quickly find the slowest interfaces, then replay those requests to analyze why they are slow in production. In practice we found that poor PHP interface performance is in many cases caused by slow dependent data resources, so the instrumentation focuses on dependencies. Users can record additional information according to their own needs; keeping useless logs out of the system makes it more economical.
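The ranking analysis can be sketched as follows. This is a minimal Java 8 example, not fiery's actual implementation; the interface names are made up, and the input is assumed to be timings already grouped by interface:

```java
import java.util.*;
import java.util.stream.*;

public class InterfaceRanking {
    // Rank interfaces by average latency, slowest first.
    static List<Map.Entry<String, Double>> rankBySlowest(Map<String, List<Long>> timingsMs) {
        return timingsMs.entrySet().stream()
                // collapse each interface's samples to an average
                .collect(Collectors.toMap(Map.Entry::getKey,
                        e -> e.getValue().stream()
                              .mapToLong(Long::longValue).average().orElse(0)))
                .entrySet().stream()
                // slowest average first
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<Long>> sample = new HashMap<>();
        sample.put("curl:pay_gateway", Arrays.asList(480L, 520L)); // avg 500 ms
        sample.put("mysql:user_db", Arrays.asList(120L, 80L));     // avg 100 ms
        sample.put("local:checkout", Arrays.asList(30L));          // avg 30 ms
        rankBySlowest(sample).forEach(e ->
                System.out.println(e.getKey() + " avg " + e.getValue() + " ms"));
    }
}
```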
At the entry point, the instrumentation library generates the TraceID, a UUID that encodes the entry server's IP address and the request time; every subsequent log line is tagged with this UUID, and after collection all related logs are stored under it. At run time the instrumentation library is also responsible for receiving and forwarding the TraceID and RpcID of other requests. The RpcID is a hierarchical counter through which we can show developers the order and hierarchy of the call relationships directly. In addition, if PHP throws an exception during execution, the library captures and records it so the server can include it in its statistics. Finally, these logs are written to the server's local disk; because multiple PHP processes writing to one file at the same time occasionally interleave, we now name each log file with the process ID plus the project name.
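The TraceID/RpcID scheme above can be sketched like this. This is a minimal illustration of a hierarchical call counter, not fiery's real instrumentation API (which also encodes the entry IP and request time into the TraceID):

```java
import java.util.UUID;

public class TraceContext {
    final String traceId;            // one UUID per request, minted at the entry point
    final String rpcId;              // hierarchical position, e.g. "1", "1.2", "1.2.1"
    private int childCounter = 0;    // next sequence number for downstream calls

    private TraceContext(String traceId, String rpcId) {
        this.traceId = traceId;
        this.rpcId = rpcId;
    }

    // Called once when the request first enters the system.
    static TraceContext newTrace() {
        return new TraceContext(UUID.randomUUID().toString(), "1");
    }

    // Called before each downstream RPC: same TraceID, one level deeper,
    // siblings numbered in call order.
    TraceContext child() {
        childCounter++;
        return new TraceContext(traceId, rpcId + "." + childCounter);
    }
}
```

The entry point gets RpcID `1`, its first two dependencies get `1.1` and `1.2`, and a call made from inside the second dependency gets `1.2.1`, which lets the server reconstruct both call order and depth.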
For fiery's log fetching and transfer we implemented a simplified version designed to keep operations simple; plenty of open-source tools provide similar functionality, but they depend on additional environments, which is a burden on operations. We also have an experimental PHP log fetch-and-transfer service, but it is still experimental and likely has defects; users are welcome to help debug and improve it.
On the fiery server side we have done a lot of work: Lucene and RocksDB are built in to index and store requests respectively, and some statistics are kept in memory. The statistical dimensions are currently fixed, covering response statistics for local interfaces, dependent interfaces, MySQL, and curl; on top of that we provide call-relationship playback and error-log alerting with aggregated statistics. These features let you quickly discover performance faults and system anomalies at critical points in production. It is only a stand-alone version for now, and it can later be extended into a simpler distributed design.
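The store-and-replay idea can be sketched as follows. In fiery the raw logs actually go to RocksDB and are indexed with Lucene; this illustration substitutes an in-memory map just to show how logs grouped under one TraceID are sorted by RpcID for playback:

```java
import java.util.*;

public class TracePlayback {
    // traceId -> list of [rpcId, log line] pairs (stand-in for RocksDB storage)
    private final Map<String, List<String[]>> byTrace = new HashMap<>();

    void store(String traceId, String rpcId, String logLine) {
        byTrace.computeIfAbsent(traceId, k -> new ArrayList<>())
               .add(new String[]{rpcId, logLine});
    }

    // Replay one request: sort by RpcID so the call hierarchy reads in order.
    List<String> replay(String traceId) {
        List<String[]> entries = new ArrayList<>(
                byTrace.getOrDefault(traceId, Collections.emptyList()));
        entries.sort((a, b) -> compareRpcId(a[0], b[0]));
        List<String> out = new ArrayList<>();
        for (String[] e : entries) out.add(e[0] + " " + e[1]);
        return out;
    }

    // Compare RpcIDs segment by segment numerically, so "1.10" sorts after "1.2".
    static int compareRpcId(String a, String b) {
        String[] as = a.split("\\."), bs = b.split("\\.");
        for (int i = 0; i < Math.min(as.length, bs.length); i++) {
            int c = Integer.compare(Integer.parseInt(as[i]), Integer.parseInt(bs[i]));
            if (c != 0) return c;
        }
        return Integer.compare(as.length, bs.length); // parent before its children
    }
}
```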