"Editor's words" business platform generates a lot of log data every day, in order to achieve data analysis, it is necessary to collect all the logs on the production server after the big Data analysis processing, Docker provides log-driven, but does not meet the needs of different scenarios, this time will be combined with instance sharing log collection, Practical experience in storage and alarm.
Docker has taken off rapidly since 2013, and its ideas bring great convenience, but real-world use quickly surfaces problems such as monitoring, logging, and networking that still have to be solved. This article shares, with concrete examples, the experience of DataMan Cloud (数人云) in building a container log system.
Architecture of an ELK-Based Log Management System
Log collection is the foundation of big-data work. The business platform generates large amounts of log data every day, and to support analysis, all logs on the production servers need to be collected and processed. High availability, high reliability, and scalability are essential properties of a log collection system.
ELK is currently the most popular integrated logging solution, providing log collection, processing, storage, search, and display. A container's standard output log is usually inspected with the Docker command docker logs <container-id>, but container isolation makes the logs inside containers hard to reach, so managing logs one command at a time is not feasible for a large system; a scheme for unified retrieval and management of container logs is needed. We built a container log management system on top of ELK, with the following architecture:
Log Collection
Traditional log collection has fairly mature solutions, such as Flume and Logstash, but those schemes do not apply directly to container logs. Docker itself provides log-driver functionality: logs can be sent to different destinations by choosing a different driver. The available log drivers are:
- none (disables log output)
- json-file (Docker's default log driver; stores logs locally as JSON files)
- syslog (standard output logs can be shipped this way)
- journald
- gelf
- fluentd
- awslogs
- splunk
- etwlogs
- gcplogs
I will not describe each of these drivers in detail here; interested readers can look them up on the Docker website. Docker offers a fairly rich set of logging options, and there are excellent open-source projects such as logspout to choose from, but none of them covers every usage scenario.
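For illustration, here is roughly what routing a container's standard output through the syslog driver looks like; the syslog address is a placeholder:

```
docker run \
  --log-driver=syslog \
  --log-opt syslog-address=tcp://192.168.0.42:514 \
  alpine echo hello
```

Note that once a container uses a driver other than json-file, docker logs can generally no longer read its output.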
The container's standard output can be handled by any of the drivers above. Because most users stick to standard output logging, Docker provides no facility for collecting log files written inside the container. If those files are simply mounted out to the host, the identically named logs of multiple instances of the same application cannot be told apart, and problems such as processing in-container file logs and merging multi-line error logs come up constantly. If you want both the standard output logs and the file logs inside the container, you have to roll up your sleeves and build it yourself. What follows is DataMan Cloud's log collection practice.
1. Standard Output Logs
For the Marathon + Mesos environment we developed a set of log collection tools. Docker's standard output log is persisted locally by the json-file driver, and Mesos keeps another copy of the standard output in the task sandbox.
As a result, standard output logs can also be collected from the Mesos sandbox files.
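For reference, the json-file driver writes one JSON object per line under /var/lib/docker/containers/<container-id>/<container-id>-json.log. A minimal Go sketch of decoding such a line (the sample line is illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jsonFileEntry mirrors the three fields the json-file driver writes per line.
type jsonFileEntry struct {
	Log    string `json:"log"`    // the raw log line, trailing newline included
	Stream string `json:"stream"` // "stdout" or "stderr"
	Time   string `json:"time"`   // RFC3339Nano timestamp
}

func main() {
	line := `{"log":"hello\n","stream":"stdout","time":"2017-01-05T12:00:00.000000000Z"}`
	var e jsonFileEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		panic(err)
	}
	fmt.Printf("%s [%s] %s", e.Time, e.Stream, e.Log)
}
```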
2. In-Container File Logs
The file storage driver supported by the platform is overlay, which avoids dealing with many messy environment-specific cases. Borrowing a diagram to describe overlay:
Containers use a copy-on-write storage driver. overlay consists mainly of a lower and an upper layer: when a file needs to be modified, copy-on-write copies it from the read-only lower layer into the writable upper layer, where the modification happens. In Docker, the read-only layer at the bottom is the image and the writable layer is the container, so a container's internal logs can be found on the host through the upper layer of the file system. For example, write a test string to /var/log/test.log inside a container, and it appears in that container's upper directory on the host.
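A quick way to verify this on a host using the overlay driver (the container name is a placeholder, and the exact GraphDriver fields can vary by storage-driver version):

```
# Inside the container: append a test line to the file log.
docker exec mycontainer sh -c 'echo test >> /var/log/test.log'

# On the host: locate the container's writable (upper) layer...
docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' mycontainer

# ...and read the file straight out of that directory.
cat "$(docker inspect -f '{{ .GraphDriver.Data.UpperDir }}' mycontainer)/var/log/test.log"
```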
In the same way, both the standard output log and the file logs inside the container can be processed as files, and json-file can be switched off at the same time to take pressure off the Docker daemon itself.
3. A Self-Developed Log Collection Tool
Based on the approach above, we developed a log collection tool that collects and manages logs in a unified way. Logs are shipped over TCP as JSON-formatted events to Logstash, carrying the application ID, container name, container ID, and task ID. Development ran into plenty of problems, such as resuming from a breakpoint and handling multi-line error logs; for these we referred to how Filebeat (written in Go) processes logs. Personally, I think Filebeat is a good choice for handling traditional file logs. The first release of the collection tool supports:
- Container standard output log collection
- In-container file log collection, with support for collecting multiple files at once
- Breakpoint resume (if the agent crashes, collection restarts from the last recorded offset; see the sketch after this list)
- Multi-line log merging (e.g., merging the lines of a multi-line error log)
- Log file exception handling (e.g., collection continues correctly across log rotation)
- TCP transport
- --add-env / --add-label options: a container's env values or labels can be attached to the log data via command-line flags, e.g. --add-env hostname=host --add-env test=env_name1 --add-label tlabel=label_name
- Prometheus metrics
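A minimal sketch (not the actual DataMan Cloud agent) combining two of the features above, JSON events over TCP and breakpoint resume from a saved offset; the JSON field names and the Logstash address are assumptions:

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"log"
	"net"
	"os"
	"strconv"
)

// logEvent carries the metadata the article says is attached to each line.
type logEvent struct {
	AppID         string `json:"appid"`
	ContainerName string `json:"container_name"`
	ContainerID   string `json:"container_id"`
	TaskID        string `json:"taskid"`
	Message       string `json:"message"`
}

// loadOffset restores the last persisted read position for a log file.
func loadOffset(path string) int64 {
	b, err := os.ReadFile(path + ".offset")
	if err != nil {
		return 0
	}
	n, _ := strconv.ParseInt(string(b), 10, 64)
	return n
}

// saveOffset persists the read position so a restarted agent resumes here.
func saveOffset(path string, off int64) {
	os.WriteFile(path+".offset", []byte(strconv.FormatInt(off, 10)), 0o644)
}

func main() {
	path := "/var/log/test.log"                   // file log from the earlier example
	conn, err := net.Dial("tcp", "logstash:5000") // assumed Logstash tcp input
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	off := loadOffset(path)
	f.Seek(off, io.SeekStart) // breakpoint resume

	enc := json.NewEncoder(conn)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		ev := logEvent{AppID: "demo-app", Message: sc.Text()} // other metadata omitted
		if err := enc.Encode(&ev); err != nil {
			log.Fatal(err)
		}
		off += int64(len(sc.Bytes())) + 1 // +1 for the stripped "\n" (assumes LF endings)
		saveOffset(path, off)
	}
}
```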
Log processing has to be fast, and during development we hit a performance problem: CPU usage was very high. We tuned the program with Go's built-in net/http/pprof package, which is extremely useful for profiling Go programs; rendering the profile as an SVG shows at a glance what share of CPU and memory each function in the program consumes.
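Wiring pprof into a long-running agent takes only a blank import and an HTTP listener. A minimal sketch (the listen address is an arbitrary choice):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// In a real agent this listener would run alongside the collection loops.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

A CPU profile can then be rendered as an SVG with, e.g., go tool pprof -svg http://localhost:6060/debug/pprof/profile > cpu.svg (Graphviz must be installed for SVG output).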
Go's built-in encoding/json serialization, regular expressions, reflection, and []byte-to-string conversions are all comparatively expensive, so we made adjustments in those areas as well as in the program's own logic.
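A minimal sketch of the kind of hot-path adjustments this implies: compile regular expressions once, encode through a concrete struct, and reuse a single json.Encoder instead of allocating per event. All names here are illustrative:

```go
package main

import (
	"encoding/json"
	"os"
	"regexp"
)

// Compiled once at startup; compiling inside the per-line loop would burn CPU.
var stackLine = regexp.MustCompile(`^\s+at\s`)

// A concrete struct avoids per-event map allocations during JSON encoding.
type event struct {
	Message      string `json:"message"`
	Continuation bool   `json:"continuation"`
}

func main() {
	enc := json.NewEncoder(os.Stdout) // one encoder reused for every event
	for _, line := range []string{
		"java.lang.NullPointerException",
		"    at com.example.Main(Main.java:42)",
	} {
		enc.Encode(event{Message: line, Continuation: stackLine.MatchString(line)})
	}
}
```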
Log Storage Back-End Architecture
For log processing and storage there are several options, including Logstash, Heka, and Fluentd. Logstash is based on Ruby and has rich features, but its performance attracts plenty of complaints; Heka is written in Go and performs much better than Logstash, but it appears to be no longer maintained. Weighing community activity, iteration speed, and stability, we chose Logstash. Its most important parameters in practice are listed below; a rough configuration sketch follows the list:
- --pipeline-workers (command-line argument)
- --pipeline-batch-size (command-line argument)
- LS_HEAP_SIZE=${LS_HEAP_SIZE} (size according to your actual situation; it can be set as an environment variable or on the command line)
- workers => 8 (set according to your environment, generally equal to the number of CPU cores; a configuration-file parameter)
- flush_size => 3000 (test against your actual workload; a configuration-file parameter)
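As a rough sketch of where these settings live (Logstash 2.x-era syntax; ports, hosts, and values are illustrative, not our production configuration):

```
# Launch: LS_HEAP_SIZE=4g bin/logstash -f logstash.conf \
#           --pipeline-workers 8 --pipeline-batch-size 3000

input {
  tcp {
    port  => 5000       # the log agent ships JSON-formatted events over TCP
    codec => json
  }
}

output {
  elasticsearch {
    hosts      => ["es:9200"]
    workers    => 8     # generally the number of CPU cores
    flush_size => 3000  # tune against your own traffic
  }
}
```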
The parameters above are for reference only and should be tuned against your actual environment. If log volume is large, a message-queue layer can be added in the middle to keep the architecture stable; Kafka and Redis are the common choices. I expect most readers have used these, so I will not repeat the details.
Elasticsearch should be the best choice for index storage. The whole architecture, ES included, is deployed with Docker, and during load testing we monitored ES indexing with Marvel. There is plenty of tuning material online to experiment with. The display layer is custom-built: Kibana itself is quite powerful but carries some learning cost, while what end customers want is something very simple.
For load testing we chose Tsung, a distributed load-testing tool. A test application generates logs, log-agent collects them, and the setup simulates log collection in a real environment.
Log Alerting
In log processing, keyword alerting is an important function. For monitoring and alerting we mainly use Prometheus + Alertmanager. While applications run, log streams for the keyword-alert scenarios are split off at the Logstash stage (see the alerting portion of the architecture diagram above). A self-developed grok_export filters and analyzes the logs to produce Prometheus-format data, after which Prometheus evaluates the configured alert rules and fires alerts through Alertmanager. log-agent itself also exposes Prometheus metrics, so Prometheus can track log collection statistics through dedicated rules.
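As an illustration of the rule side, a hypothetical keyword alert in Prometheus 1.x rule syntax (the metric name log_error_keyword_total is an assumption about what grok_export might expose):

```
ALERT ErrorKeywordSpike
  IF rate(log_error_keyword_total[5m]) > 10
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS {
    summary = "Error-keyword rate in application logs is abnormally high"
  }
```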
Prometheus:
Prometheus is an open-source monitoring and alerting system. It collects time series over HTTP using a pull model, stores the data locally, and supports a rich query language and simple dashboard display.
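A minimal prometheus.yml sketch for pulling metrics from grok_export and log-agent; the job names, hostnames, and ports are assumptions:

```yaml
scrape_configs:
  - job_name: 'grok_export'
    static_configs:
      - targets: ['grok-export:9144']
  - job_name: 'log-agent'
    static_configs:
      - targets: ['log-agent:9300']
```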
Alertmanager:
As the alerting component of Prometheus, Alertmanager handles every alert whose threshold is reached. It supports very powerful alerting features, including HTTP and email notification, silencing, and suppression of repeated alerts.
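A minimal alertmanager.yml sketch showing email notification and repeat suppression; the receiver details and SMTP settings are placeholders:

```yaml
global:
  smtp_smarthost: 'smtp.example.com:25'
  smtp_from: 'alertmanager@example.com'

route:
  receiver: ops-email
  repeat_interval: 4h   # avoid re-sending the same firing alert too often

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops@example.com'
```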
These are the problems DataMan Cloud has run into while building a container log system in practice. Higher-level applications, such as container log analysis, remain to be explored, with pits still to dig and fill; suggestions and exchanges are welcome.
Q&A
Q: overlay does not implement the inotify interface; how do you get incremental data from the file logs?
A: We poll the files in a loop and record each file's offset.
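A minimal sketch of that polling loop, holding one offset per file in memory (persistence works as in the earlier agent sketch):

```go
package main

import (
	"fmt"
	"io"
	"os"
	"time"
)

func main() {
	path := "/var/log/test.log" // same example path used earlier in the article
	var offset int64
	for {
		if fi, err := os.Stat(path); err == nil && fi.Size() > offset {
			f, err := os.Open(path)
			if err == nil {
				f.Seek(offset, io.SeekStart)
				n, _ := io.Copy(os.Stdout, f) // forward only the new bytes
				offset += n
				f.Close()
			}
		} else if err == nil && fi.Size() < offset {
			offset = 0 // file shrank: assume rotation and start over
			fmt.Println("-- rotation detected --")
		}
		time.Sleep(time.Second) // poll interval
	}
}
```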
Q: Since the main framework is ELK, why not use Filebeat directly on the collection side? Does Filebeat have limitations?
A: Filebeat does not meet the Docker-related requirements of our product; our agent is roughly Filebeat plus the Docker logic described above.
Q: For the self-developed log system, is the format of each emitted log line fixed by a convention? Does everyone in development have to follow the specification, whatever the log level?
A: Not at present. But for internal use, a well-defined specification certainly makes processing easier and enables more fine-grained analysis.
Q: Does log collection include analysis and presentation? What do you use for that?
A: We have not yet analyzed log content itself, although logs such as Nginx request logs would be worth analyzing.
Q: Did you consider directly using the system's syslog and logrotate on the collection side?
A: We used syslog at first, and later developed our own tool because of the requirement to collect file logs inside containers.
The content above is organized from the group sharing session on the evening of January 5, 2017.
About the speaker: Guo Chin, development engineer at DataMan Cloud. He started out in Java and JavaScript development and in crawler/big-data work, then moved to Go, taking part in the development and maintenance of the open-source container management tool Crane; he currently works on DataMan Cloud's cloud platform. DockOne organizes a technology share every week; interested readers can add WeChat ID liyingjiesz to join the group, and you are welcome to leave a message with topics you would like to hear about or share.