1 Background and Questions
With the advent of cloud computing, PaaS platforms, and technologies such as virtualization and Docker, more and more services are being deployed in the cloud. We usually need the logs for monitoring, analysis, forecasting, statistics, and other work, but cloud services do not run on fixed physical resources, so getting at the logs has become harder: in the past we could SSH in or fetch them over FTP, which is no longer so easy, yet engineers urgently need them. The most typical scenario is a release: everything is done with a few mouse clicks in the PaaS platform's GUI, but we still need to combine tail -f, grep, and similar commands on the logs to judge whether the release succeeded. A perfect PaaS platform would of course do this work for us, but there are many ad-hoc requirements the platform does not cover, and for those we need the logs ourselves. This article presents a method for centrally collecting logs that are scattered across a distributed environment.
2 Design Constraints and Requirements
Before you do any design, you need to be clear about scenarios, functional requirements, and non-functional requirements.
2.1 Application Scenarios
The distributed environment contains logs generated by up to hundreds of servers; a single log entry is typically under 1 KB and never larger than 50 KB, and the total log volume is less than 500 GB per day.
2.2 Functional Requirements
1) Collect the logs of all services centrally.
2) Distinguish the log source, and split the collected logs by service, module, and day.
2.3 Non-functional Requirements
1) Do not intrude into the service process: the log-collection function must be deployed independently, and its system-resource usage must be controllable.
2) Near real time with low latency: from the moment a log entry is produced to its arrival in centralized storage, latency should be under 4 s.
3) Persistence: keep the last N days of logs.
4) Best-effort delivery: logs are not required to be lossless, but the loss rate must not exceed a threshold (for example, one in 10,000).
5) Ordering need not be strict; out-of-order delivery is tolerable.
6) The collection service is an offline, non-critical function; the availability requirement is not high, and three nines (99.9%) over the year is sufficient.
3 Implementation Architecture
An implementation architecture for this scenario is shown in the following illustration:
3.1 Producer Layer Analysis
Services inside the PaaS platform are assumed to be deployed in Docker containers, so to satisfy the non-functional requirements a separate process is responsible for collecting logs, which avoids intruding into the service framework and process. Flume NG is used for log collection; this open-source component is very powerful and can be viewed as a model of monitoring, incremental production, publishing, and consumption: the source is the incremental data source, the channel is a buffering channel (a memory queue buffer is used here), and the sink is the slot where the data is consumed. Inside the container, the source executes a tail -f command to read the incremental log from the service's standard output on Linux; the sink is a Kafka implementation that pushes the messages to the distributed message middleware.
3.2 Broker Layer Analysis
With multiple containers in the PaaS platform, there are multiple Flume NG clients pushing messages to the Kafka message middleware. Kafka is a high-throughput, high-performance message middleware; it relies on sequential writes within a single partition and supports random reads at a given offset, so it fits the topic publish-subscribe model very well. There are several Kafka nodes in the diagram because Kafka supports clustering: the Flume NG client inside a container can publish logs to several Kafka brokers, or, put another way, to several partitions under a topic, which enables high throughput. First, logs can be packaged into batches inside Flume NG before being sent, reducing QPS pressure; second, writes can be spread across multiple partitions. Kafka also lets us specify the number of replica backups, guaranteeing that a master write is followed by n backups; this is set to 2 here rather than the 3 common in distributed systems, because we want to preserve high concurrency as much as possible and thereby satisfy non-functional requirement #4.
3.3 Consumer Layer Analysis
What consumes the Kafka increment is also Flume NG; you can see how powerful it is in that it can connect to almost any data source through pluggable implementations with only a small amount of configuration. Here a Kafka source subscribes to the topic; the collected logs again go into a memory buffer first, and then a file sink writes them to files. To satisfy the functional requirement of distinguishing the source and splitting by service, module, and day, I implemented a sink called RollingByTypeAndDayFileSink. The source code is on GitHub; you can download the jar from that page and put it directly into Flume's lib directory.
4 Practice
4.1 In-container Configuration
Dockerfile
The Dockerfile is the script for running the program inside the container and contains a series of Docker instructions (FROM, ADD, and so on). Below is a typical Dockerfile. base_image is an image that contains both the program to run and the Flume bin; the more important part is the ENTRYPOINT, which mainly uses Supervisord to keep the process inside the container highly available.
FROM ${base_image}
MAINTAINER ${maintainer}
ENV REFRESH_AT ${refresh_at}
RUN mkdir -p /opt/${module_name}
ADD ${package_name} /opt/${module_name}/
COPY service.supervisord.conf /etc/supervisord.conf
COPY supervisor-msoa-wrapper.sh /opt/${module_name}/supervisor-msoa-wrapper.sh
RUN chmod +x /opt/${module_name}/supervisor-msoa-wrapper.sh
RUN chmod +x /opt/${module_name}/*.sh
EXPOSE
ENTRYPOINT ["/usr/bin/supervisord", "-c", "/etc/supervisord.conf"]
The following is the Supervisord configuration file; it tells Supervisord to execute the supervisor-msoa-wrapper.sh script.
[program:${module_name}]
command=/opt/${module_name}/supervisor-msoa-wrapper.sh
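To actually obtain the automatic-restart behavior mentioned above, the program section usually enables a few more options. The sketch below shows the same section with commonly used supervisord settings added; the log path is a placeholder, not a value from the original setup.

[program:${module_name}]
command=/opt/${module_name}/supervisor-msoa-wrapper.sh
autostart=true                                      ; start the wrapper when supervisord starts
autorestart=true                                    ; restart the wrapper if it exits, keeping the service up
redirect_stderr=true                                ; merge stderr into stdout
stdout_logfile=/opt/${module_name}/supervisor.log   ; placeholder path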
Below is supervisor-msoa-wrapper.sh. The start.sh and stop.sh invoked in this script are the application's own startup and stop scripts. The background here is that our start and stop scripts run the application in the background, so the current process is not blocked; Docker would therefore think the main process had finished and end the container's lifecycle. The wait command is used to block, so that even though the application runs in the background it looks like a foreground process from the container's point of view.
The flume-ng command is also added here. The directory given after the --conf option is where Flume looks for flume-env.sh, which can define JAVA_HOME and JAVA_OPTS; --conf-file specifies the configuration file that defines the actual source, channel, sink, and so on. A minimal flume-env.sh sketch follows the wrapper script below.
#!/bin/bash

function shutdown()
{
    date
    echo "Shutting down Service"
    unset SERVICE_PID  # necessary in some cases
    cd /opt/${module_name}
    source stop.sh
}

## Stop the process
cd /opt/${module_name}
echo "Stopping Service"
source stop.sh

## Start the process
echo "Starting Service"
source start.sh
export SERVICE_PID=$!

## Start the Flume NG agent; wait 4s for start.sh to begin producing the log
sleep 4
nohup /opt/apache-flume-1.6.0-bin/bin/flume-ng agent --conf /opt/apache-flume-1.6.0-bin/conf --conf-file /opt/apache-flume-1.6.0-bin/conf/logback-to-kafka.conf --name a1 -Dflume.root.logger=INFO,console &

# Allow any signal which would kill a process to stop the Service
trap shutdown HUP INT QUIT ABRT KILL ALRM TERM TSTP

echo "Waiting for $SERVICE_PID"
wait $SERVICE_PID
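As mentioned above, flume-env.sh under the --conf directory can define JAVA_HOME and JAVA_OPTS. A minimal sketch follows; the JDK path and heap settings are illustrative assumptions, not values from the original setup.

# /opt/apache-flume-1.6.0-bin/conf/flume-env.sh (sketch; values are placeholders)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JAVA_OPTS="-Xms64m -Xmx256m"   # keep the agent's resource usage controllable (non-functional requirement 1)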
Flume Configuration
The source should be an exec source that executes tail -f on the log file. However, a self-developed StaticLinePrefixExecSource is used instead; the source code can be found on GitHub. The customization exists because some fixed information needs to be passed along, such as the service/module name and the hostname of the container the distributed service runs on, so that the collector can separate the logs based on this tag. You may ask why the Flume interceptor mechanism is not used for this work, i.e. adding some key-value pairs to the event header. That is a small pit, which I will explain later.
For example, one line of the original log:
[INFO] 2016-03-18 12:59:31,080 [main] fountain.runner.CustomConsumerFactoryPostProcessor (CustomConsumerFactoryPostProcessor.java:91) - start to init IoC container by loading XML bean definitions from classpath:fountain-consumer-stdout.xml
After the custom source adds its prefix, the log actually passed to the channel becomes:
service1##$$##m1-ocean-1004.cp [INFO] 2016-03-18 12:59:31,080 [main] fountain.runner.CustomConsumerFactoryPostProcessor (CustomConsumerFactoryPostProcessor.java:91) - start to init IoC container by loading XML bean definitions from classpath:fountain-consumer-stdout.xml
The channel uses a memory buffer queue. Its size is measured in number of log entries (event count), and the transaction capacity controls how many events are taken from the source and pushed to the sink in one batch. There is also an internal timeout, which can be set via the keepAlive parameter (default 3 s); when it expires, whatever has accumulated is pushed on anyway.
The sink uses the Kafka sink. Configure the broker list and the topic name, whether an ack is required, and how many logs are batched into one send (5 per package by default here); if concurrency is high, the batch size can be increased to raise throughput.
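Putting source, channel, and sink together, the producer-side logback-to-kafka.conf referenced in the wrapper script might look roughly like the sketch below. It shows the stock exec source for readability (the author's StaticLinePrefixExecSource would take its place), and the log path, broker addresses, and tuning values are placeholder assumptions.

# logback-to-kafka.conf (sketch; path, brokers and tuning values are illustrative)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# exec source tailing the service's log (the custom StaticLinePrefixExecSource replaces this type in practice)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/${module_name}/logs/stdout.log
a1.sources.r1.channels = c1

# memory channel: capacity in events, per-transaction batch size, keepAlive timeout in seconds
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100
a1.channels.c1.keepAlive = 3

# Kafka sink: broker list, topic, ack requirement, per-send batch size
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.brokerList = kafka1:9092,kafka2:9092
a1.sinks.k1.topic = keplerlog
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 5
a1.sinks.k1.channel = c1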
4.2 Broker Configuration
Referring to the official Kafka tutorial, create a new topic named keplerlog with a replication factor of 2 and 4 partitions:
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 4 --topic keplerlog
Produce some incremental data, for example by typing a few strings into the terminal with the console producer:
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic keplerlog
Open another terminal, subscribe to the topic, and confirm that the strings typed into the producer can be seen, which means the link works:
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic keplerlog --from-beginning
4.3 Flume Configuration on the Central Log-receiving Side
First, the source uses Flume's official KafkaSource; configure the ZooKeeper address, and it will discover the available broker list, subscribe to the topic, and receive the logs. The channel uses a memory buffer queue. For the sink, our requirement is to split the logs by service name and date, but the official file roll sink can only roll by timestamp and time interval, so a custom sink is needed.
Custom RollingByTypeAndDayFileSink
The source code is on GitHub. RollingByTypeAndDayFileSink has two usage conditions:
1) The event header must contain a timestamp, otherwise the event is ignored and an InputNotSpecifiedException is thrown.
2) If the event body is separated by ##$$##, the string before the separator is treated as the module name; if not, the default file name is used.
To output to local files, first set the output directory via sink.directory. Then the module name separated out under condition 2 is used as the file-name prefix, and the day taken from the timestamp of condition 1 is used as the file-name suffix, producing file names such as portal.20150606 or default.20150703.
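As an illustration, the collector-side agent configuration might look roughly like the sketch below. The agent/component names, ZooKeeper address, group id, and output directory are placeholder assumptions, and the sink's fully qualified class name depends on the actual package in the GitHub project.

# collector-side agent (sketch; names, addresses and paths are illustrative)
a2.sources = r1
a2.channels = c1
a2.sinks = k1

# Kafka source subscribing to the log topic via ZooKeeper
a2.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a2.sources.r1.zookeeperConnect = localhost:2181
a2.sources.r1.topic = keplerlog
a2.sources.r1.groupId = log-collector
a2.sources.r1.channels = c1

# memory buffer channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 100

# custom sink that splits files by module name and day; the class name below is a guess at the FQCN
a2.sinks.k1.type = com.github.example.flume.sink.RollingByTypeAndDayFileSink
a2.sinks.k1.sink.directory = /data/kepler-log
a2.sinks.k1.channel = c1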
The resulting directory looks as follows; you can see logs collected from many services, differentiated by service name and date:
~/data/kepler-log$ ls
authorization.20160512 default.20160513 default.20160505 portal.20160512 portal.20160505 portal.20160514
Two Pits That Must Be Mentioned
Pit 1
Going back to the previous two sections: StaticLinePrefixExecSource was customized to do the prefixing work. Since the source's service/module name must be distinguished and the logs split by time, according to the official Flume documentation the source-interceptor configuration shown below should in principle be entirely sufficient: i1 is a timestamp interceptor, and i2 is a static interceptor carrying a fixed key-value pair, key=module, value=portal.
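A sketch of that interceptor configuration, assuming the source is named r1 under agent a1:

a1.sources.r1.interceptors = i1 i2
# i1: stamp each event header with a timestamp
a1.sources.r1.interceptors.i1.type = timestamp
# i2: add a fixed key-value pair to the event header
a1.sources.r1.interceptors.i2.type = static
a1.sources.r1.interceptors.i2.key = module
a1.sources.r1.interceptors.i2.value = portal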
However, in the official Flume KafkaSource (v1.6.0) implementation, the key-value pairs in the event header are rewritten when the event is rebuilt on the consumer side, so the header sent from the producer is discarded. Because of this pit, the module/service name has to be placed at the front of the event body on the tail -f side (hence the custom source), and RollingByTypeAndDayFileSink then splits on the separator; otherwise the key-value pairs would never reach the downstream.
Pit 2
The exec source needs to execute the tail -f command and read lines from standard output and standard error. But if tail -f is wrapped in a script and some pipe commands are executed in that script, for example tail -f logback.log | awk '{print "portal##$$##" $0}', then the exec source keeps withholding the most recent output: some lines appended to the end of the file are always "late" and are only "squeezed" out when newer lines are appended. This issue is rather bizarre and has not been investigated carefully yet; it is recorded here so that later readers do not fall into the same pit.
5 Conclusion
From this centralized collection of distributed service logs we can see that open-source components make it very convenient to solve problems encountered in daily work, and that the ability to find and solve problems is a basic quality required of engineers. For the requirements they do not meet, we need an inquisitive spirit, understanding not only the what but also the why, and doing some ad-hoc work ourselves; this way we can make better use of these components.
In addition, log collection is only a starting point. With this valuable data in hand, the subsequent use cases and room for imagination are huge, for example:
1) Use Spark Streaming to compute over the logs within a time window, for flow control and access throttling.
2) Use awk scripts or Scala's higher-order functions for single-machine access statistics, or Hadoop/Spark for big-data statistical analysis.
3) Beyond port-liveness and semantic monitoring, process the logs with real-time computation and filter out errors, anomalies, and other signals, achieving true service health checks and early-warning monitoring.
4) Import the collected logs into Elasticsearch via Logstash and query them the ELK way.