DockOne WeChat Share (124): The Qingsongchou monitoring system implementation plan

Tags: kibana, grafana, logstash, influxdb, webhook, filebeat, pagerduty
[Editor's note] A monitoring system is one of the most important components of service governance: it helps developers understand the health of a service and detect abnormal situations in time. Although Alibaba provides fee-based business monitoring services, there are many open source monitoring solutions, so you can build a monitoring system yourself to meet basic monitoring needs and then gradually improve and optimize it. This is more flexible for your own business monitoring needs, and it also builds up technical experience for a future self-built data center. The monitoring system is built through the following 7 aspects.


1. Log printing

Complete logs are the foundation of monitoring; how logs are printed determines how they can be filtered, stored and analyzed. Besides choosing a suitable log library, log printing must also meet some requirements (a small sketch follows the list):
    • Log style: output structured logs as key-value fields.
    • Output timing: error logs must always be printed; info logs are printed as the business requires. Only the business layer needs to log; packages such as model and util do not.
    • Output format: print logs in JSON format online so they are easy to parse; offline, print logs in a custom format so they are easy to read. The format, both online and offline, is controlled via etcd.
    • Output content: every log carries logid, method, host and level, plus business identity fields as the scenario requires, such as ProjectType, Platform, PayType and so on.
    • Use context to pass shared information between different goroutines.
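A minimal sketch of such a structured log line in Go, using only the standard library (the field names and values here are illustrative assumptions, not the actual Qingsongchou logging library):

package main

import (
    "encoding/json"
    "os"
    "time"
)

// logEntry is a hypothetical structured record following the requirements above:
// logid, method, host and level plus optional business identity fields.
type logEntry struct {
    Time        string `json:"time"`
    Level       string `json:"level"`
    LogID       string `json:"logid"`
    Method      string `json:"method"`
    Host        string `json:"host"`
    ProjectType string `json:"project_type,omitempty"`
    Message     string `json:"message"`
}

func main() {
    host, _ := os.Hostname()
    entry := logEntry{
        // microsecond precision also matters later when the log is written to InfluxDB
        Time:        time.Now().Format("2006-01-02T15:04:05.999999Z07:00"),
        Level:       "info",
        LogID:       "demo-logid-0001",
        Method:      "GET /api/project",
        Host:        host,
        ProjectType: "sample",
        Message:     "request finished",
    }
    // one JSON object per line is easy for Logstash's json filter to parse
    json.NewEncoder(os.Stdout).Encode(entry)
}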


2. Log slicing

Log slicing (rotation) is an operational concern and should not be done by the log library itself, because Linux already has very mature tools for log rotation and there is no need to reimplement them in code.

There are currently only 2 requirements for log slicing: split by day and delete old logs. logrotate meets these requirements very well. logrotate is driven by cron; its script is /etc/cron.daily/logrotate, placed under /etc/cron.daily by default and executed once a day.

Sometimes a program exception or a surge in requests causes a burst of log volume that can fill the entire disk in a short time. maxsize can be added to the logrotate configuration to limit the size of a log file, and the execution frequency of logrotate can be raised to hourly or even every minute, so that logs exceeding the rotate count are split and deleted in time and the disk does not fill up in abnormal situations.

The sample configuration is as follows:
# logrotate sample config
# rotate every day, and keep for 3 days
/var/log/sample.log {
    daily
    rotate 3
    maxsize 1G
    missingok
    sharedscripts
    postrotate
        # send a SIGHUP signal to the program after rotation
        killall -SIGHUP bin_sample
    endscript
}

The business program only needs to watch for the SIGHUP signal and reopen its log file when the signal is received.
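A minimal sketch of this pattern in Go is shown below; the log path matches the logrotate example above, and the rest is an assumption rather than the actual bin_sample code:

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
)

// openLog opens (or creates) the log file in append mode.
func openLog(path string) *os.File {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        log.Fatalf("open log: %v", err)
    }
    return f
}

func main() {
    path := "/var/log/sample.log" // assumed path, same as in the logrotate config above
    f := openLog(path)
    log.SetOutput(f)

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGHUP)
    go func() {
        for range sigs {
            // logrotate has moved the old file aside; reopen so new writes go to the fresh file
            newF := openLog(path)
            log.SetOutput(newF)
            f.Close()
            f = newF
        }
    }()

    select {} // the real service would do its work here
}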

3. Log capture

From the monitoring system's point of view, there are 2 ways to collect logs: active collection and passive reception; each has its pros and cons.

Active collection

Advantages: log collection and the business program are separate and do not affect each other.

Cons: log collection relies on an additional collection service, and filtering and storage may require extra configuration.

Passive reception

Advantage: the business program sends logs directly to storage, which is flexible and can be controlled in business code.

Disadvantage: instability of the log storage can affect the normal operation of the business program; conversely, a large log volume can also overwhelm the log storage.

However, in the early stage of building the monitoring system the log storage is not yet very stable, so active collection is safer and does not affect the stability of the main service.

collectd is genuinely powerful, and its tail plugin can also collect logs from files, but the tail plugin is complex to configure and its documentation is less detailed than Filebeat's.

collectd's other plugins can collect a lot of data, and there is plugin support for sending data to Logstash and InfluxDB, but we do not use most of those plugin features, while the Elastic Stack's Beats can also collect data such as system metrics very well and are naturally compatible with ELK.

Therefore, after experimenting with both Filebeat and collectd, we decided based on the analysis above to use Filebeat to collect logs from the log files. As shown below, the Filebeat configuration is easy to understand:
filebeat:
  spool_size: 1024                                  # spool up to 1024 events and send them together
  idle_timeout: "5s"                                # otherwise flush whatever is buffered every 5 seconds
  registry_file: "registry"                         # registry of file read positions, placed in the working directory by default
  config_dir: "path/to/configs/contains/many/yaml"  # if the configuration gets long, split it into files loaded from a directory
  prospectors:                                      # inputs sharing the same parameters can be grouped into one prospector
    -
      fields:
        log_source: "sample"                        # similar to Logstash add_field; "log_source" identifies which project the log comes from
      paths:
        - /var/log/system.log                       # location of the files to read
        - /var/log/wifi.log
      include_lines: ["^err", "^warn"]              # only send lines containing these words
      exclude_lines: ["^ok"]                        # do not send lines containing these words
    -
      document_type: "apache"                       # sets the _type value when writing to ES
      ignore_older: "24h"                           # stop watching files not updated for more than 24 hours
      scan_frequency: "10s"                         # rescan directories every 10 seconds to refresh the wildcard file list
      tail_files: false                             # whether to start reading from the end of the file
      harvester_buffer_size: 16384                  # read 16384 bytes at a time
      backoff: "1s"                                 # check every 1 second whether a file has new lines to read
      paths:
        - "/var/log/apache/*"                       # wildcards are supported
      exclude_files: ["/var/log/apache/error.log"]
    -
      input_type: "stdin"                           # either "log" or "stdin"
      multiline:                                    # multi-line merging
        pattern: '^[[:space:]]'
        negate: false
        match: after
output:
  logstash:
    hosts: ["localhost:5044"]                       # the Logstash hosts

The logs sent by Filebeat contain the following fields:
    • beat.hostname: host name of the machine running Beat
    • beat.name: the name set in the shipper configuration section; equals beat.hostname if not set
    • @timestamp: the time the line was read
    • type: the value set by document_type
    • input_type: "log" or "stdin"
    • source: the full path of the source file
    • offset: the offset of the log line from the beginning of the file
    • message: the log content
    • fields: all other fixed fields added via fields live inside this object


4. Log filtering


Logstash, born in 2009, has evolved over the years into a mature and popular log processing framework. Logstash collects and processes logs in a pipeline, a bit like the *nix pipeline of commands input | filter | output: input runs first, then filter, then output. In Logstash there are three stages: input → filter (optional) → output. Each stage is handled by a number of plugins, such as file, elasticsearch, redis and so on. Each stage can also be specified in several ways; for example, output can write to Elasticsearch or print to the console via stdout.

Codec is a concept Logstash introduced in version 1.3.0 (codec comes from coder/decoder). Before that, Logstash only supported plain text input and processed it with filters. Now, different types of data can be handled already at the input stage thanks to the codec setting. So one idea needs to be corrected: Logstash is not just an input | filter | output data flow, but an input | decode | filter | encode | output data flow! Codec is used to decode and encode events. The introduction of codec lets Logstash coexist with other operational products that have custom data formats, such as Graphite, Fluentd, NetFlow and collectd, as well as with common data formats such as MsgPack, JSON and EDN.

Logstash provides many plugins (input plugins, output plugins, filter plugins, codec plugins) that can be combined as needed. Among the filter plugins, grok is the most important one for Logstash. Grok matches the log content with regular expressions and structures the log, so in theory it can parse logs of any form: with enough regex skill you can parse the unstructured logs produced by third-party services. For logs written by our own services, however, there is no need to output unstructured logs and take on the burden of writing regexes; that is why the log printing section above specifies JSON output, which is convenient to parse with the json filter plugin that Logstash provides.

Logstash configuration files are placed in the /etc/logstash/conf.d directory by default. If logs from multiple projects need to be collected and each project's Logstash configuration differs, multiple configuration files will be stored under conf.d, named per project for easy management. This causes a problem, however, because Logstash merges all configuration files into one: when a log enters Logstash through an input, it passes through the filter and output plugins of every configuration file, which leads to wrong processing and output. The solution is to add fields that distinguish the projects in the Filebeat fields configuration item; if the log paths already distinguish the projects, the Filebeat source field can be used instead. Then an if conditional in each project's Logstash configuration file routes each project's logs into its own pipeline without interference.

The following configuration example takes the JSON logs generated by a sample service and collected by Filebeat, parses them with the json filter plugin, and writes the result to standard output.
input {
    beats {
        port => "5044"
    }
}
# The filter part of this file is commented out to indicate that it is optional.
filter {
    if [beat] and [source] =~ "sample" {
        json {
            source => "message"
        }
        ruby {
            code => "event.set('time', (Time.parse(event.get('time')).to_f * 1000000).to_i)"
        }
    }
}
output {
    if [beat] and [source] =~ "sample" {
        stdout { codec => rubydebug }
    }
}

5. Log storage

InfluxDB vs. Elasticsearch

According to DB-Engines' rankings, InfluxDB and Elasticsearch are each No. 1 in their own field: InfluxDB leads the time series DBMS category and Elasticsearch dominates search engines. Both have very detailed documentation on their principles and usage, which will not be repeated here.

For time series data, InfluxDB performs strongly and Elasticsearch falls clearly behind on the main metrics:

Data write: with 4 processes writing 14.4 million data points concurrently, Elasticsearch averages 115,422 points/second and InfluxDB averages 926,389 points/second, about 8 times the write speed of Elasticsearch. The gap stays roughly constant as the data volume grows.

Disk storage: storing the same 14.4 million data points, Elasticsearch with its default configuration needs 2.1GB, Elasticsearch tuned for time series data needs 517MB, and InfluxDB needs only 127MB, a compression advantage of roughly 16x and 4x over the former two respectively.

Data query: in a 24h dataset (14.4 million points), a random 1-hour slice is queried and aggregated at 1-minute intervals, with a single process executing the query 1,000 times and the average time recorded. Elasticsearch takes 4.98ms (201 queries/second) and InfluxDB takes 1.26ms (794 queries/second), about 4 times the query speed of Elasticsearch. As the dataset grows the gap widens, to more than 10 times at most. As the number of querying processes increases, InfluxDB's query speed rises significantly and remains basically the same across datasets, while Elasticsearch improves little and its query speed drops as the dataset grows.


For a detailed comparison see: InfluxDB Markedly Outperforms Elasticsearch in Time-Series Data & Metrics Benchmark.

Elasticsearch is strong at full-text search and InfluxDB excels at time series data, so analyze the specific need. If logs must be kept and queried frequently, Elasticsearch is more suitable; if the logs are only used for status display and queried occasionally, InfluxDB is more suitable.

Qingsongchou's businesses each have their own characteristics, and choosing only Elasticsearch or only InfluxDB would not serve both log queries and metric display well, so InfluxDB and Elasticsearch have to coexist. Two outputs are configured in Logstash and the same log is emitted twice: one copy keeps all fields and goes to Elasticsearch; the other drops the text fields, keeps the metric fields, and goes to InfluxDB.

When InfluxDB is used as a Logstash output there is a pitfall to note: the timestamp precision supported by the Logstash InfluxDB plugin is too coarse and cannot reach nanoseconds, which causes anomalies when records with identical timestamps are inserted into InfluxDB. InfluxDB uniquely identifies a record by measurement name, tag set and timestamp. If a record inserted into InfluxDB has the same measurement name, tag set and timestamp as an existing record, the field set becomes the union of the old and new records, and fields with the same name are overwritten by the new record. There are 2 ways to solve this: one is to add a tag that identifies each new record; the other is to raise the timestamp precision manually, up to microseconds, which in theory supports 86,400,000,000 non-repeating logs per day and largely avoids timestamp collisions. The configuration is as follows:
The business log timestamp is formatted to microsecond precision: 2006-01-02T15:04:05.999999Z07:00

Logstash filter for the timestamp conversion:
filter {
    ruby {
        code => "event.set('time', (Time.parse(event.get('time')).to_f * 1000000).to_i)"
    }
}
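For reference, the layout string above is a Go time layout, so the Go service can produce the microsecond timestamp directly (a trivial sketch, assuming the log field is called time as in the examples above):

package main

import (
    "fmt"
    "time"
)

func main() {
    // microsecond precision plus time zone, matching the business log format above
    ts := time.Now().Format("2006-01-02T15:04:05.999999Z07:00")
    fmt.Println(ts) // e.g. 2017-06-08T21:30:05.123456+08:00
}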

6. Data display

Grafana vs. Kibana

Comparing Kibana and Grafana, Kibana's charts are not as nice as Grafana's, and Grafana is simpler and more flexible to configure. Since InfluxDB and Elasticsearch coexist in the log storage, the display layer also needs Kibana and Grafana to work together: Kibana retrieves logs from Elasticsearch, and Grafana pulls display data from InfluxDB and Elasticsearch. The following 2 images show Grafana applied to Qingsongchou's business monitoring:


7. Anomaly alerting

Even if the 6 steps above are all in place, everything is meaningless without alarms, because no one can stare at the curves all the time. Abnormality thresholds need to be set so that the monitoring system checks periodically and sends an alarm notification immediately when something abnormal is found.

There are many alarm services, but Grafana, which we already use for display, comes with alerting built in; its features satisfy our alerting needs and it is simple to configure, so simple alarm rules can use Grafana alerting. However, Grafana alerting only supports a subset of data sources: Graphite, Prometheus, InfluxDB and OpenTSDB, so alerting on logs stored in Elasticsearch additionally needs the Elastic Stack's X-Pack.

Condition


As shown, you can set the frequency of the alarm check; the alarm condition here is that the average of the specified metric over the last 5 minutes is greater than 70, and if the condition holds, the alarm fires. Such conditions are fairly simple. Rules like "alert only if the error count in 10 minutes exceeds N", "alert if the current order count drops by more than a certain percentage compared with the same time yesterday", or controlling how often notifications are sent, cannot be expressed in Grafana, so for these rules we implemented our own alarm engine to satisfy the more complex alarm requirements (a minimal sketch of the idea follows).
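The alarm engine itself is not described in detail here; the following Go sketch only illustrates the general idea of periodically evaluating a rule and notifying when it fires. The query is stubbed out and would in practice hit InfluxDB or Elasticsearch; all names are hypothetical.

package main

import (
    "fmt"
    "time"
)

// Rule is a hypothetical alarm rule: run a query periodically, apply a
// predicate to the result, and notify when the predicate is true.
type Rule struct {
    Name     string
    Every    time.Duration
    Query    func() (float64, error) // e.g. error count over the last 10 minutes
    Triggers func(v float64) bool    // e.g. compare against the same time yesterday
    Notify   func(msg string)
}

func (r Rule) run() {
    for range time.Tick(r.Every) {
        v, err := r.Query()
        if err != nil {
            continue // a real engine would also alert on query failures
        }
        if r.Triggers(v) {
            r.Notify(fmt.Sprintf("[%s] value=%v", r.Name, v))
        }
    }
}

func main() {
    rule := Rule{
        Name:     "too-many-errors",
        Every:    time.Minute,
        Query:    func() (float64, error) { return 12, nil }, // stub; would query InfluxDB here
        Triggers: func(v float64) bool { return v > 10 },
        Notify:   func(msg string) { fmt.Println("send sms/email:", msg) },
    }
    rule.run()
}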

Notification

Grafana alarm notifications are only sent on state transitions: a notification is sent when the rule enters the alerting state, and if the condition keeps holding for a while Grafana does not keep sending notifications; only when the rule recovers is a recovery notification sent. When an alarm fires, Grafana supports 4 notification methods: Email, Slack, Webhook and PagerDuty. Slack is a foreign collaboration tool similar to DingTalk, and PagerDuty is a paid alerting platform, so only Email and Webhook remain as options. Below is a quick introduction to configuring Email and Webhook.

Email

Grafana's e-mail configuration is very simple; the SMTP service of the QQ enterprise mailbox can be used to send the alarm mails, whose content is the configured alarm. The configuration is straightforward:
[smtp]
enabled = true
host = smtp.exmail.qq.com:465
user = alert@qingsongchou.com
password = ********
from_address = alert@qingsongchou.com
from_user = Grafana

Webhook

Webhook means that when an alarm fires, Grafana actively calls the configured HTTP service, passing JSON data via POST or PUT. In an HTTP service we develop ourselves we can then add extra notification channels such as SMS or even phone calls.
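A minimal sketch of such a receiver in Go is shown below. The payload fields are a small subset commonly seen in Grafana's webhook JSON, and sendSMS is a hypothetical stand-in for a real SMS gateway:

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// grafanaAlert maps a few common fields of the Grafana webhook payload.
type grafanaAlert struct {
    Title    string `json:"title"`
    RuleName string `json:"ruleName"`
    State    string `json:"state"` // e.g. "alerting" or "ok"
    Message  string `json:"message"`
}

// sendSMS is hypothetical; a real implementation would call an SMS gateway.
func sendSMS(text string) {
    log.Printf("sms: %s", text)
}

func alertHandler(w http.ResponseWriter, r *http.Request) {
    var a grafanaAlert
    if err := json.NewDecoder(r.Body).Decode(&a); err != nil {
        http.Error(w, "bad payload", http.StatusBadRequest)
        return
    }
    if a.State == "alerting" {
        sendSMS(a.RuleName + ": " + a.Message)
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/grafana/webhook", alertHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}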

Reception

Configuring alarm notifications is useless if nobody receives or reads them. On one hand we try to provide multiple notification channels, such as e-mail and SMS; on the other hand, the project owner must respond to alarms promptly and look into the problem.

Q&A

Q: For alarms that Grafana does not support, is the alarm engine you implemented built by modifying Grafana directly, or is it independent of Grafana?

A: We implemented an alarm engine in Go; it is independent of Grafana.
Q: Has Logstash ever been slow to collect or lost logs in your experience? At what scale does your Logstash collect logs now?

A: We currently handle roughly 200 million logs per day, about 20 million in the peak hour. Logstash runs fine; if collection becomes slow later on, the simple approach is to add machines to solve the problem first and then look for better optimization strategies.
Q: For logs like those of Nginx or MySQL, does every newly added log type require changing the Logstash grok configuration to parse it?

A: For commonly used services, grok already provides regex patterns, for example for the Nginx and MySQL logs you mention. Currently every addition does require modifying the grok configuration; a UI could be implemented later to make such changes more efficient.
Q: How can I learn this kind of Logstash log format conversion?

A: Logstash has very good documentation; if you are interested, refer to https://www.elastic.co/guide/e .... HTML
Q: It is said that Logstash is quite memory-hungry, and Fluentd often appears as part of the EFK combination. Did you do a technology selection comparison?

A: We chose ELK at the time and did not compare much; so far Logstash's memory consumption has not been a prominent problem for us.
Q: How is the integrity of the logs guaranteed? How do you know whether logs were lost, or how many?

A: The output plugins of Filebeat and Logstash have retry policies, but they cannot completely prevent log loss. Log integrity, and guaranteeing that no logs are lost, is a problem we are still trying to solve.
Q: Does the monitoring system need to consider high availability?

A: High availability must definitely be considered. As more and more business depends on the monitoring system, it has to be ensured that the monitoring system does not go down, that queries are fast, that monitoring is real-time, that alarms are accurate, and so on.
Qingsongchou, the platform for everyone, trusted by 100 million users!

The above content is organized from the group share on the evening of June 8, 2017. The speaker, Shkada, is a senior Golang engineer at Qingsongchou and a BUPT graduate, with in-depth work in distributed computing and public opinion analysis. In 2015 he worked on the Didi Chuxing common platform, responsible for backend development of the Didi web app; at Qingsongchou he currently works on feed system architecture design and the construction of the monitoring system described above. DockOne organizes weekly technology shares; if you are interested, add WeChat: liyingjiesz to join the group, and leave us a message with topics you want to hear or share.