Log data is among the most common forms of massive data. Take an e-commerce platform with a huge user base as an example: during the Double 11 promotion, log volume can reach tens of billions of entries per hour. This explosive growth of log data poses severe challenges to the technical team.
This article describes how a massive log system can be optimized, deployed, and monitored to keep up with business needs. It focuses on comparing different log system architecture designs and walks through the tuning process: scale-out and vertical scaling, splitting clusters, dividing data, rewriting the data link, and other practical phenomena and problems.
Log System Architecture Benchmarks
Anyone with project development experience knows that from a platform's initial construction to the delivery of its core business, a log platform is needed to safeguard all kinds of business.
As shown in the figure, for a simple log scenario, you typically prepare a master/slave pair of applications. We only need to run a shell script to check whether there is an error message.
As business complexity increases, the application scenarios become more complex, and the monitoring system can show that an error occurred on a certain machine or in a certain application.
However, in a real production environment, because of isolation, once an application in the lower red box has a bug, we cannot access its corresponding log and thus have no way to retrieve it.
In addition, some applications that depend heavily on the log platform may collect logs as soon as they are generated and then delete the original log files. These scenarios make the log system hard to maintain.
Taking Logstash as a reference, there are generally two types of log business flow:
The simple flow: the application generates logs → log rotation refreshes new files according to a predefined file size or time interval → periodic review → periodic deletion.
The flow for complex scenarios: the application generates logs → collection → transport → on-demand filtering and transformation → storage → analysis and viewing.
We can distinguish log data scenarios along two dimensions: real-time requirements and error analysis:
Real-time: generally applies to what we call tier-1 applications, such as applications directly facing users. We can define all kinds of keywords so that the relevant business people are notified the moment an error or exception occurs.
Quasi-real-time: generally applies to project management platforms, for example the one used to fill in working hours. If it is down when hours need to be entered, wages are still paid on time.
The platform can be restarted within a few minutes and we can sign in again, so there is no fundamental impact. Therefore, we rank it as quasi-real-time.
Besides directly capturing errors and exceptions, we also need analysis. For example, knowing only a person's weight is meaningless, but if we add two more indicators, gender and height, we can judge whether that weight is a healthy one.
In other words, given more indicators, we can de-noise the big data and then apply regression analysis, making the collected data more meaningful.
In addition, we must constantly restore the truth behind the numbers. Especially for real-time tier-1 applications, we need to let users quickly understand the true meaning of what they are encountering.
For example, a merchant listing a product priced at 100 yuan mistakenly tags it at 10 yuan, which causes the product to sell out immediately.
But since this is not a business logic problem, it is hard to discover directly, so we can only analyze the log data and give timely feedback, making sure the inventory is set to zero within tens of seconds and thereby containing the problem. Real-time analysis is clearly useful in this scenario.
Finally, we need to compare and summarize across time dimensions while retrieving historical information, so traceability shows its value in a variety of applications.
The elements mentioned above are the benchmarks by which we manage logs. As shown in the figure, our log system uses the open source ELK stack:
Elasticsearch (hereinafter referred to as ES) is responsible for centralized back-end storage and queries.
A separate Beats component is responsible for log collection. Filebeat improves on Logstash's resource utilization; Topbeat collects monitoring metrics, similar to how the system command top reports CPU usage.
Because the log service only plays a supporting and assurance role for the business, and we need fast, lightweight data collection and transport, it should not occupy too many server resources.
In terms of approach, we use a plug-in model: input plug-ins, output plug-ins, and filter plug-ins in the middle responsible for transformation. These plug-ins have different rules and their own formats, and support various forms of secure transport. A rough sketch of this model follows.
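To make the plug-in model concrete, here is a minimal Python sketch of an input → filter → output chain. It is only an illustration under assumed interfaces (the function names, the "ts level message" line format, and the file name app.log are made up); it is not the actual Beats or Logstash code.

```python
import json

# Minimal sketch of an input -> filter -> output plug-in chain; the plug-in
# interfaces and the "ts level message" line format are illustrative,
# not the real Beats/Logstash API.

def file_input(path):
    """Input plug-in: yield raw log lines from a file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def kv_filter(lines):
    """Filter plug-in: turn 'ts level message' lines into structured records."""
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) == 3:
            ts, level, message = parts
            yield {"ts": ts, "level": level, "message": message}

def stdout_output(records):
    """Output plug-in: print JSON; a real output would ship to Kafka or ES."""
    for record in records:
        print(json.dumps(record, ensure_ascii=False))

if __name__ == "__main__":
    stdout_output(kv_filter(file_input("app.log")))  # hypothetical log file
```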
Ideas for Optimizing the Log System
With the above log architecture in place, we propose four optimization ideas for various practical application scenarios:
Basic optimization
Memory: how to allocate memory, garbage collection, adding caches, and locking.
Network: serialization for network transport, adding compression, policies, hashing, and different protocols and formats.
CPU: use multithreading to improve utilization and load.
Here, utilization and load are two different concepts:
Utilization: the next core is used only after a single core is fully used, so utilization rises gradually.
Load: all eight cores are used at once, so the load is full, but utilization is very low. That is, every core is occupied, yet the resources actually consumed are few and the computation rate is relatively low.
Disk: try to merge files, reduce the generation of fragmented files, and reduce seek operations. At the same time, at the system level, turn off useless services by modifying settings.
Platform Extensions
Adding, removing, or substituting components: whether it is an Internet application or an everyday application, adding a distributed cache in front of queries effectively improves query efficiency. In addition, parts the platform does not use are simply shut down or removed.
Vertical scaling: for example, adding disks and memory.
Scale-out: adding/removing nodes and extending in parallel, using distributed clusters.
Data divide and conquer
Classify and rank data according to its different dimensions. For example, we distinguish error, info, and debug in the logs, and may even filter out info- and debug-level logs directly (a minimal sketch follows after this list).
Data hotspots: for example, some log data trends upward during certain periods of the day and is produced at a steady rate at night. We can take the hotspot data out and handle it separately to break the hotspot.
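The sketch below shows one way to classify records by level and drop the low-value ones before transport. The channel names ("realtime", "batch") and the decision to drop debug/info are illustrative assumptions, not the team's actual rules.

```python
# Sketch of classifying log records by level and dropping low-value levels.
# The channel names and the choice to drop debug/info are illustrative.

REALTIME_LEVELS = {"ERROR", "FATAL"}   # notify the business side immediately
DROP_LEVELS = {"DEBUG", "INFO"}        # filtered out before transport

def route(record):
    """Return the channel a record should go to, or None to drop it."""
    level = record.get("level", "").upper()
    if level in DROP_LEVELS:
        return None
    if level in REALTIME_LEVELS:
        return "realtime"
    return "batch"

records = [
    {"level": "DEBUG", "message": "cache hit"},
    {"level": "ERROR", "message": "payment timeout"},
    {"level": "WARN",  "message": "slow query"},
]
for r in records:
    print(route(r), r["message"])
```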
System degradation
On the basis of an effective division of the overall business, we set up degradation switches to stop some unimportant functions so that the core business is served first.
Log System Optimization Practice
In the face of continued growth in data volume, we added a lot of resources, but that could not fundamentally solve the problem.
The problems showed up in three areas in particular:
The log production volume is large: tens of billions of entries per day.
Because of production environment isolation, we cannot view the data directly.
Agent resources are limited: all of our log collection and system metric gathering must not use more than one core of the business servers' resources.
First-Level Business Architecture
The layers of our log system are relatively clear and can be simply divided into three parts: data access, data storage, and data visualization.
Details include:
Rsyslog is the most resource-efficient of the collection tools we have worked with so far.
Kafka provides persistence. Of course, bugs can surface once the data it holds reaches a certain volume.
Fluentd is similar to Rsyslog; it is also a log transport tool, but it leans more toward the transport service itself.
ES and Kibana.
Implementing this architecture involves Golang, Ruby, Java, JS, and other languages. In the later transformation, we quickly import data that fits the key-value model into HBase.
Thanks to HBase's own design, writes first land in its in-memory store and are then persisted to disk, which gives the ideal fast insert speed. This is why we were willing to choose HBase for the log scheme.
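As a rough idea of what importing key-value log records into HBase can look like, here is a minimal sketch using the third-party happybase client. The Thrift host, the table name app_logs, the column family log, and the row-key layout are all assumptions, not the team's actual schema.

```python
import happybase  # third-party HBase client over Thrift

# Sketch of writing key-value log records into HBase. The Thrift host, the
# table name 'app_logs', the column family 'log', and the row-key layout are
# all assumptions.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("app_logs")

record = {"host": "web-01", "level": "ERROR", "message": "payment timeout"}
# Row key groups an application's logs together and sorts them by time.
row_key = b"order-service|20240101120000|web-01"
table.put(row_key, {f"log:{k}".encode(): v.encode() for k, v in record.items()})
connection.close()
```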
Second-Level Business Architecture
Let us look directly at the functional diagram of the second-level business architecture, which consists of the following processes:
After collection was in place, many applications came to rely entirely on our log system in order to save space on their own disks. So after the data is collected, we add a layer of persistent cache.
After caching, the system performs the transport. The transport process includes filtering and transformation, which can thin out the data. It is worth emphasizing that if the business side cooperates early and agrees on conventions with us, we can produce structured data through formatting.
Next comes the diversion, which has two main parts: one routes A-source data through the A channel and B-source data through the B channel; the other lets the data flow into our storage device while triggering a protection mechanism. To keep the storage system safe, we added an extra queue.
For example: the queue length is 100, each chunk in it is 256 MB, and we set the high watermark to 0.7 and the low watermark to 0.3.
When write operations pile up to the 0.7 mark, i.e. 70 chunks of 256 MB each have accumulated, it means the write speed to the storage platform cannot keep up.
At this point the high watermark is triggered and further writes are refused until the pipeline has digested enough chunks to drop to 30, after which writing can continue. We use this protection mechanism to protect the backend and the storage devices.
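The protection mechanism described above can be sketched roughly as follows. The queue length, chunk size, and watermark values follow the numbers in the text, while the class and method names are made up; this is not the actual buffer implementation.

```python
# Rough sketch of the high/low watermark protection on the transport queue.
# Queue length, chunk size and watermark values follow the text; the class
# itself is illustrative.

class WatermarkQueue:
    def __init__(self, capacity=100, high=0.7, low=0.3):
        self.capacity = capacity          # max number of 256 MB chunks
        self.high = int(capacity * high)  # stop accepting writes at 70 chunks
        self.low = int(capacity * low)    # resume once drained to 30 chunks
        self.chunks = []
        self.blocked = False

    def put(self, chunk):
        if self.blocked or len(self.chunks) >= self.high:
            self.blocked = True
            return False                  # back-pressure: caller must retry later
        self.chunks.append(chunk)
        return True

    def drain(self, n):
        """Simulate the storage side consuming n chunks."""
        del self.chunks[:n]
        if self.blocked and len(self.chunks) <= self.low:
            self.blocked = False          # low watermark reached, writes resume

q = WatermarkQueue()
accepted = sum(q.put(f"chunk-{i}") for i in range(80))  # only 70 get in
q.drain(45)                                             # backlog drops below 30
print(accepted, q.blocked)                              # 70 False
```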
Next is storage. Since the overall data volume is relatively large, the storage link mainly handles indexing, compression, and queries.
Finally come the UI and analysis algorithms, which use some SQL query statements for simple, fast queries.
Usually, the path from collection (Logstash/Rsyslog/Heka/Filebeat) to the Kafka cache is a typical wide dependency.
A wide dependency means every app can be associated with every broker: within Kafka, every transfer is hashed and the data is written across all brokers.
A narrow dependency is one in which each Fluentd process corresponds to only one broker. The data is eventually written to ES through a wide-dependency step.
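The difference between the two dependency styles can be illustrated with a short kafka-python sketch. The broker address and topic name are assumptions; the point is only how a producer can spread records across all brokers versus pinning itself to a single partition.

```python
from kafka import KafkaProducer  # kafka-python client

# Illustration of the two dependency styles; broker address and topic name
# are assumptions.
producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Wide dependency: no fixed partition, so records are spread across all
# partitions and therefore across all brokers.
producer.send("app-logs", value=b"error: payment timeout")

# Narrow dependency: pin this sender to a single partition, so it only ever
# talks to the broker that leads that partition.
producer.send("app-logs", value=b"error: payment timeout", partition=0)

producer.flush()
```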
Acquisition
Rsyslog, for example, not only uses the fewest resources and supports adding all kinds of rules, it also supports security protocols such as TLS and SSL.
Filebeat is lightweight. Since version 5.x, Elasticsearch has had parsing capability of its own (like a Logstash filter): ingest nodes.
This means that data can be pushed directly to Elasticsearch with Filebeat, letting Elasticsearch do both the parsing and the storing.
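A hedged sketch of that setup: define an ingest pipeline on the ES side with the Python client, then point Filebeat's Elasticsearch output at it. The pipeline id, grok pattern, and host address are assumptions for illustration.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-host:9200")  # hypothetical address

# Ingest pipeline that parses "LEVEL message" lines on the ES side so Filebeat
# can ship raw lines directly; the pipeline id and grok pattern are made up.
es.ingest.put_pipeline(
    id="app-log-parse",
    body={
        "description": "split level and message out of raw log lines",
        "processors": [
            {"grok": {"field": "message",
                      "patterns": ["%{LOGLEVEL:level} %{GREEDYDATA:msg}"]}}
        ],
    },
)
# In filebeat.yml, set the Elasticsearch output's pipeline to "app-log-parse".
```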
Kafka
Next is Kafka. Kafka mainly implements sequential storage; through its topic and message queue mechanisms, it achieves fast data storage.
Its drawback: because all data is written to Kafka, an excessive number of topics causes disk contention, which seriously drags down Kafka's performance.
In addition, if all data carries one unified label, we cannot tell which categories the collected data belongs to, which makes it hard to separate the data.
Therefore, in the later optimization of the transport mechanism, we transformed the sequential storage process and thereby met the need for persistence as a safety guarantee.
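One common way to make the data separable, sketched here under assumed names, is to route each record to a topic derived from its log category instead of pushing everything into one topic. The "logs.&lt;category&gt;" naming and the broker address are illustrative, not the team's actual convention.

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="kafka:9092")  # hypothetical address

# Route each record to a topic named after its log category, instead of one
# unified topic; the "logs.<category>" naming is an assumption.
def publish(record):
    topic = "logs." + record.get("category", "unknown")
    producer.send(topic, value=record["message"].encode("utf-8"))

publish({"category": "nginx-access", "message": "GET /item/42 200"})
publish({"category": "order-error", "message": "payment timeout"})
producer.flush()
```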
Fluentd
Fluentd is a bit like Logstash; its documentation and plug-ins are very complete, and many of its plug-ins guarantee direct connection to Hadoop or ES.
As far as access is concerned, we can adopt a Fluentd-to-Fluentd approach: on top of the original data-access layer, we add another layer of Fluentd. It also supports secure transport. Of course, we also focused on optimizing it later.
ES + Kibana
In the end we used ES and Kibana. The advantage of ES is that it enables fast retrieval through Lucene's inverted index.
Since a large number of logs are unstructured, we use ES's wrapping of Lucene to satisfy ordinary users' searches over unstructured logs, and Kibana provides the visualization tools on top of it.
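For a sense of the kind of unstructured search this enables, here is a minimal query sketch with the Python ES client. The index pattern, field names, and host address are assumptions for illustration only.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-host:9200")  # hypothetical address

# Full-text search over the inverted index; index pattern and field names
# are made up.
resp = es.search(
    index="app-logs-*",
    body={
        "query": {"match": {"message": "payment timeout"}},
        "size": 10,
        "sort": [{"@timestamp": {"order": "desc"}}],
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```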
Problem Location and Solutions
Here are some of the problems and phenomena we encountered; they were the starting points of our optimization:
The transport servers' CPU utilization is low, and no core's load is full.
The transport servers' full-GC frequency is too high. Because we implemented the process in Ruby, the default memory setting sometimes holds too much data.
The storage servers show single spikes: the storage servers' disk performance sometimes suddenly shoots up or plunges.
High watermarks are triggered frequently. As mentioned above, with the high-watermark protection mechanism, once the storage disk triggers the high watermark it no longer provides service and can only wait for a manual disk "cleanup".
If one of the ES machines "hangs", the cluster gets stuck. That is, when a machine is found to be unreachable, the cluster assumes it is down and quickly starts data recovery. If the system is busy, such recovery drags the overall performance down even further.
Since all data is written to Kafka and we use only one topic, every kind of data has to pass through a rule chain that is not necessarily relevant to it and be checked against rules that do not necessarily apply, so overall transport efficiency drops.
Fluentd's host polling mechanism causes frequent high watermarks. When Fluentd connects to ES, its default policy prefers writing to the first five hosts, i.e. it interacts with the interfaces of the first five entries.
In our production environment, Fluentd is written in CRuby. Each worker is a Fluentd process, and each process corresponds to one host file.
The first five default entries in the host file are the write entry points for ES, so all the machines look for those five entry points.
If one machine goes down, the next one is polled. This directly leads to frequent high watermarks and a drop in write speed.
As we all know, log queries are low-frequency; logs are only looked at when something goes wrong. But in practice we tended to retrieve everything, which is not very meaningful.
In addition, for better performance ES stores data on RAID 0, and the storage time span is often more than 7 days, so its cost is relatively high.
Through real-time analysis of the data, we found that reads and writes are not balanced.
To improve Fluentd's utilization, we use Kafka and increase the data volume per batch: the original chunk size was 5 MB, and we changed it to 6 MB.
If it were pure transport, the value could actually be raised even higher without regard to computation. Because some computation also happens here, we only raised it to 6 MB.
Our Fluentd is based on JRuby, because JRuby supports real multithreading, whereas multithreading under our CRuby brings no real benefit.
To improve memory usage, I studied Ruby's memory mechanisms, in particular the hashing over the host file: since each of our processes selects the first five entries, I opened a few more entry points. As for ES, someone had already optimized it once before.
As I just said, the log volume is sometimes very high and sometimes very low, so we considered doing dynamic configuration.
Because ES supports dynamic configuration, dynamically configuring it can improve write speed in some scenarios and support query efficiency in others. We can try some dynamic configuration for the load.
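As one possible reading of "dynamic configuration", the sketch below adjusts index settings at runtime: relaxing refresh and replicas during the write peak and restoring them afterwards. The index pattern, host address, and chosen values are assumptions, not the team's documented settings.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://es-host:9200")  # hypothetical address

# During the daytime write peak: trade query freshness for ingest speed.
es.indices.put_settings(
    index="app-logs-*",
    body={"index": {"refresh_interval": "30s", "number_of_replicas": 0}},
)

# When writes quiet down: restore freshness and redundancy for queries.
es.indices.put_settings(
    index="app-logs-*",
    body={"index": {"refresh_interval": "1s", "number_of_replicas": 1}},
)
```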
Retrofit One: Storage Reduction
There is not much change in the overall storage architecture; we simply reduce the retention period to one day at the stage where data is transferred to Fluentd.
At the same time, we split part of the data directly into Hadoop and put the data suited to Kibana directly into ES.
As mentioned above, log queries are low-frequency, and the need to query data older than two days is very small, so reducing storage is very meaningful.
Retrofit Two: Data Divide and Conquer
In cases where the number of log file nodes is small (fewer than five machines), we removed the Kafka layer. Since Fluentd can buffer data and large files, data can be persisted to disk.
We give each application its own tag, so each application follows its own tag and its own fixed rules and is finally written to ES, which makes it easy for each side to locate its problems.
In addition, through lazy computation and file segmentation we can quickly find the root cause of a problem. As a result, we save Kafka and ES all kinds of computing resources.
In practice, because HBase does not need RAID and is fully capable of controlling disk writes itself, we compress the data. In terms of effect, ES's storage overhead dropped significantly.
Later, we also tried a more extreme scheme: letting users query the data directly from a client-side shell, with a local cache retention mechanism.
Optimization effect
The results of the optimization are as follows:
Server resources are used efficiently. After implementing the new scheme, we saved a lot of servers, and the storage resources of a single server were reduced by 15%.
A single core used to transfer 3,000 entries per second; after the changes it reaches 15,000 to 18,000. Moreover, when a server does pure transport, i.e. without any computation, a single core can transfer nearly 30,000 entries per second.
The ES protection mechanism is rarely triggered, because we have diverted the data away.
Historical data used to be kept for only 7 days; now that we have saved servers, we can store data for longer. Also, for logs that someone has queried, we selectively retain them for traceability according to the initial strategy.
Summary of Log System Optimization
Regarding log platform optimization, I have summarized the following points:
Because log queries are low-frequency, we put the historical data into low-cost storage; when ordinary users need it, we re-import it into ES and it can be queried quickly through the Kibana front end. Programmers, on the other hand, can simply query ES directly.
The longer data exists, the less meaningful it becomes. Based on the actual situation, we developed effective strategies to retain only the meaningful data.
Sequential disk writes replace memory. For example, when we read and write streaming files, we use a sequential-write pattern, which differs from the usual random disk writes.
When the storage volume is large, consider SSDs. In particular, when ES hits throttling, SSDs can improve its performance.
Define the log specification in advance, which effectively simplifies the later analysis work.
Log format
As shown in the figure, common fields in a log format include: UUID, timestamp, host, and so on.
The host field is particularly important: since the logs involve hundreds of nodes, the host field tells us which machine a log came from. The other environment-variable fields in the diagram let us effectively trace back historical information.
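For illustration, here is what one such structured record could look like as JSON. Only uuid, timestamp, and host come from the text above; the extra fields (env, app, level, message) are illustrative assumptions.

```python
import json
import socket
import uuid
from datetime import datetime, timezone

# Example of one structured log record carrying the fields mentioned above;
# the extra fields (env, app, level, message) are illustrative.
record = {
    "uuid": str(uuid.uuid4()),                          # unique id for tracing
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "host": socket.gethostname(),                       # which of the hundreds of nodes
    "env": "production",
    "app": "order-service",
    "level": "ERROR",
    "message": "payment timeout",
}
print(json.dumps(record, ensure_ascii=False))
```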
Log scheme
As shown in the figure, we can write data from the collection side directly into a file or database via Rsyslog.
Of course, for logs that are temporarily not used, we do not necessarily have to implement filtering and transport rules.
For example, Fluentd also has transport rules: Fluentd can connect directly to Fluentd, and also directly to MongoDB, MySQL, and so on.
In addition, we have components that can quickly connect plug-ins and systems, for example letting Fluentd and Rsyslog connect directly to ES.
This is the most basic baseline I have laid out for everyone. I divide log schemes into three baselines, covering collection, caching, transport, storage, and final visualization.
Collection straight to storage is the simplest one, like Rsyslog to HDFS or another file system; we do have this situation.
The more common situation goes from collection and transport to storage and visualization, which then forms the most complex system we have today. We can choose according to the actual situation.
Finally, consider a case where we want to occupy as few servers as possible, the transport needs filtering and transformation, and the logs are relatively simple and fit the key-value (KV) format.
We can simply go Rsyslog → Fluentd → HBase → ECharts, and such a scheme is enough.
I think Rsyslog, Fluentd, and Heka can all handle collection. For transport, Fluentd works, because the Fluentd and Kafka plug-ins are very flexible and can connect directly to many of our storage devices, and to many file targets, even ES.
For visualization you can use Kibana, mainly because it integrates tightly with ES; combining them takes only a little learning cost.
Author: Yanagitsu Ping
Editor: Chen Jun, Tao Jiarong, Sun Shujuan