Nagios in Practice: NSCA-Mode Monitoring of Linux System Resource Usage--Nagios Configuration--Monitored Side

Source: Internet
Author: User

Nagios requires the monitored side to send data to the Nagios side in an agreed format. Two kinds of objects are monitored: nodes (hosts) and services. The agreed data format for node monitoring is as follows:
[<timestamp>] PROCESS_HOST_CHECK_RESULT;<host_name>;<host_status>;<plugin_output>
The format is easy to understand: the data submission timestamp, the monitored node name, the node state (up/down/unreachable), and plugin-defined additional data. Each field is explained as follows:
1. timestamp is the time in time_t format (seconds since the UNIX epoch) at which the host check was performed (or submitted). Please note the single space after the right bracket.
2. host_name is the short name of the host (as defined in the host definition)
3. host_status is the status of the host (0=up, 1=down, 2=unreachable)
4. plugin_output is the text output of the host check
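
As a minimal sketch of the host check format described above, the following builds one result line. The helper name format_host_check and the "PING OK" text are hypothetical and only for illustration; they are not part of Nagios or of the original script.

```python
import time

# Host states as listed above: 0=up, 1=down, 2=unreachable.
HOST_UP, HOST_DOWN, HOST_UNREACHABLE = 0, 1, 2


def format_host_check(host_name, host_status, plugin_output, timestamp=None):
    """Build one PROCESS_HOST_CHECK_RESULT line (hypothetical helper)."""
    if host_status not in (HOST_UP, HOST_DOWN, HOST_UNREACHABLE):
        raise ValueError("host_status must be 0, 1 or 2")
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the UNIX epoch
    # Note the single space after the right bracket.
    return "[%d] PROCESS_HOST_CHECK_RESULT;%s;%d;%s" % (
        timestamp, host_name, host_status, plugin_output)


print(format_host_check("192.168.0.6", HOST_UP, "PING OK", timestamp=1402018148))
```

Passing the timestamp explicitly keeps the output reproducible; in a real sender it would normally be left out so the current time is used.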

The agreed data format for service monitoring is as follows:
[<timestamp>] PROCESS_SERVICE_CHECK_RESULT;<host_name>;<svc_description>;<return_code>;<plugin_output>

That is: the data submission timestamp, the monitored node name, the monitored service name, the service status (ok/warning/critical/unknown), and plugin-defined additional data. Each field is explained as follows:

1. timestamp is the time in time_t format (seconds since the UNIX epoch) at which the service check was performed (or submitted). Please note the single space after the right bracket.
2. host_name is the short name of the host associated with the service in the service definition
3. svc_description is the description of the service as specified in the service definition
4. return_code is the return code of the check (0=ok, 1=warning, 2=critical, 3=unknown)
5. plugin_output is the text output of the service check (i.e. the plugin output)
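
To make the field layout concrete, here is a small sketch that splits a service check result line back into the five fields listed above. The function name parse_service_check is hypothetical; the sample line is one of the results shown later in this article.

```python
SERVICE_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def parse_service_check(line):
    """Split one PROCESS_SERVICE_CHECK_RESULT line into its fields."""
    stamp, _, command = line.partition("] ")
    if not stamp.startswith("[") or not command:
        raise ValueError("missing '[<timestamp>] ' prefix")
    # The plugin output may itself contain ';' (performance data does),
    # so only the first four separators are split.
    parts = command.split(";", 4)
    if len(parts) != 5 or parts[0] != "PROCESS_SERVICE_CHECK_RESULT":
        raise ValueError("not a service check result")
    _, host_name, svc_description, return_code, plugin_output = parts
    return {
        "timestamp": int(stamp[1:]),
        "host_name": host_name,
        "svc_description": svc_description,
        "return_code": int(return_code),
        "state": SERVICE_STATES[int(return_code)],
        "plugin_output": plugin_output,
    }


sample = ("[1399532518] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;iops_status;2;"
          "STATISTICS CRITICAL: iops=58.0 | iops=58.0;;;;")
result = parse_service_check(sample)
print(result["host_name"], result["svc_description"], result["state"])
```

Limiting the split to four separators matters because the performance data part legitimately contains semicolons.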

The plugin-defined additional data tells Nagios more about the monitored node, including status details or failure reasons, as well as performance data. It has the following format:
SERVICE STATUS: First line of output | First part of performance data
The output can be customized to show more detailed monitoring data; Nagios displays it in the Status Information column. Performance data is displayed in the Performance Data column and has special formatting requirements, structured as follows:
'label'=value[UOM];[warn];[crit];[min];[max] 'label'=value[UOM];[warn];[crit];[min];[max]
Each 'label'=value group is separated from the next by a space. Our system mostly leaves the optional warn, crit, min and max parts of the performance data empty. For their exact meaning, refer to: https://nagios-plugins.org/doc/guidelines.html#PLUGOUTPUT
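
The performance data layout can be sketched with two small helpers. The names perfdata_item and perfdata are hypothetical; empty optional parts are simply left blank between the semicolons, as in the samples in this article.

```python
def perfdata_item(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one 'label'=value[UOM];[warn];[crit];[min];[max] item."""
    # Labels containing spaces are wrapped in single quotes.
    if " " in label:
        label = "'%s'" % label
    return "%s=%s%s;%s;%s;%s;%s" % (label, value, uom, warn, crit, minimum, maximum)


def perfdata(items):
    """Join several items (tuples of perfdata_item arguments) with single spaces."""
    return " ".join(perfdata_item(*item) for item in items)


print(perfdata([("user", "52.02", "%"), ("idle", "35.02", "%")]))
print(perfdata_item("usedmemory", "37.3", "%", 80, 90))
```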

In our system the services are passive checks and the nodes are checked actively by ping. Next, let's look at how our system monitors the CPU, memory, IO and network usage of the monitored nodes, using CPU data collection as the main example.

CPU

First, see what the data being sent looks like:
[1402018148] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;cpu_status;0;STATISTICS OK: user=52.02% system=9.72% iowait=0.20% stealed=0.20% idle=35.02% | user=52.02%;;;; system=9.72%;;;; iowait=0.20%;;;; stealed=0.20%;;;; idle=35.02%;;;;
Combined with the data format above, we can tell that this is a service check result sent by the monitored side at 1402018148 (UNIX time) from node 192.168.0.6. The service description is cpu_status, the return code is 0 (OK), and the output details report that the statistics succeeded: in this sample user processes took 52.02% of the CPU, the kernel took 9.72%, and 35.02% was idle, so the warning threshold (70% non-idle CPU usage) was not reached. The performance data follows the | symbol in the performance data format. We only care about the individual CPU usage values and not the other parts of the performance data, so those are all left empty.
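
As a rough check of that reading, the sketch below splits the plugin output of the sample into its status text and performance values and derives the non-idle CPU usage. The function split_plugin_output is hypothetical, and it assumes labels without spaces (quoted labels are not handled).

```python
def split_plugin_output(plugin_output):
    """Return (status_text, {label: value_with_uom}) for one output line."""
    text, _, perf = plugin_output.partition("|")
    values = {}
    for item in perf.split():
        label, _, rest = item.partition("=")
        # Everything before the first ';' is value[UOM]; thresholds follow.
        values[label] = rest.split(";", 1)[0]
    return text.strip(), values


output = ("STATISTICS OK: user=52.02% system=9.72% iowait=0.20% stealed=0.20% "
          "idle=35.02% | user=52.02%;;;; system=9.72%;;;; iowait=0.20%;;;; "
          "stealed=0.20%;;;; idle=35.02%;;;;")
text, values = split_plugin_output(output)
non_idle = 100.0 - float(values["idle"].rstrip("%"))
print(values)
print("non-idle usage below warning threshold:", non_idle < 70)
```

With 35.02% idle, the non-idle usage is about 64.98%, which is below the 70% warning threshold and therefore consistent with the return code 0.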

Next, the implementation:
1. Data collection. The system collects data from /proc/stat. We won't explain /proc/stat in much detail here, only roughly describe its data structure:
nagios:/usr/local/nagios/var/rw # cat /proc/stat
cpu 3793272 14468 2247200 1691101331 60064 0 929 363857 0
cpu0 947929 2161 571867 422672541 57686 0 569 142613 0
cpu1 1012031 5207 579405 422725361 828 0 121 72264 0
cpu2 953097 43 557950 422803715 752 0 117 75309 0
cpu3 880213 2775 537976 422899713 797 0 73670 0
intr 395106792 0 0 0 0 0 0 0 0 0 0 000000000......
Parameter explanation (all values are accumulated from system start to the current moment, in jiffies; 1 jiffy is typically 0.01 second):
user: CPU time spent in user mode, not including niced (low-priority) processes
nice: CPU time spent in user mode by niced (low-priority) processes
system: CPU time spent in kernel mode
idle: idle time, not including time spent waiting for disk IO
iowait: time spent waiting for disk IO
irq: time spent servicing hard interrupts
softirq: time spent servicing soft interrupts
steal_time: time spent running other operating systems when in a virtualized environment
Our approach is simple: read the first (aggregate) cpu line of /proc/stat twice, 5 seconds apart, take the difference between the two readings, and calculate the percentage of total CPU time for the values we care about: user, system, iowait, steal_time and idle.
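
A minimal sketch of this two-sample calculation follows. It is not the original script: the function names are hypothetical, and it assumes a kernel that reports at least the eight counters from user through steal on the aggregate cpu line.

```python
import time

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal_time")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat content as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, (int(v) for v in parts[1:])))
    raise ValueError("no aggregate cpu line found")


def cpu_percentages(before, after):
    """Share of each counter in the total jiffies elapsed between two samples."""
    deltas = dict((name, after[name] - before[name]) for name in FIELDS)
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("counters did not advance")
    return dict((name, 100.0 * delta / total) for name, delta in deltas.items())


def sample_cpu(interval=5, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_percentages(before, after)
```

On a monitored Linux node, sample_cpu() would return a dict such as the cpu_stat_map used when assembling the output, and 100 minus its idle value gives the non-idle CPU usage that is compared against the thresholds.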
2. Assemble the data to send according to four status thresholds:
CPU usage of 90% or more: CRITICAL
CPU usage of 70% or more but less than 90%: WARNING
CPU usage of less than 70%: OK
CPU usage could not be obtained: UNKNOWN

import pynsca

# Check whether CPU usage is critical/warning/ok
if cpu_usage_percent >= 90:
    return_code = pynsca.CRITICAL
    plugin_output = 'STATISTICS CRITICAL: '
elif cpu_usage_percent >= 70:
    return_code = pynsca.WARNING
    plugin_output = 'STATISTICS WARNING: '
elif cpu_usage_percent >= 0:
    return_code = pynsca.OK
    plugin_output = 'STATISTICS OK: '
else:
    # A negative value means the usage could not be obtained
    return_code = pynsca.UNKNOWN
    plugin_output = 'STATISTICS UNKNOWN: '
3. Send data to the Nagios side
# Append the status details and the performance data (after the | symbol)
plugin_output += 'user=%(user).2f%% system=%(system).2f%% iowait=%(iowait).2f%% stealed=%(steal_time).2f%% idle=%(idle).2f%% | user=%(user).2f%%;;;; system=%(system).2f%%;;;; iowait=%(iowait).2f%%;;;; stealed=%(steal_time).2f%%;;;; idle=%(idle).2f%%;;;;' % cpu_stat_map
# print(plugin_output)
nscaclient = pynsca.NSCANotifier(nagios_address)
nscaclient.svc_result(cmd_options.local_address, service_description, return_code, plugin_output)
4. Add a crontab job to send the CPU data to the Nagios side every minute:
*/1 * * * * /home/nagios/check_cpu_status.py > /dev/null 2>&1

Note: a real script also needs to consider portability, so the monitored node name, the thresholds and so on must be parameterized. It should also allow for variety in what is monitored, for example monitoring only one of the cores. Appending > /dev/null 2>&1 suppresses crontab's e-mail notifications.
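
One possible way to parameterize such a script is sketched below with argparse. The option names are assumptions chosen for illustration (only local_address mirrors the cmd_options.local_address attribute used above), and the defaults reuse the 70%/90% thresholds and the 5-second interval from this article.

```python
import argparse


def parse_args(argv=None):
    """Parse hypothetical command-line options for a CPU check script."""
    parser = argparse.ArgumentParser(description="Send CPU status to Nagios via NSCA")
    parser.add_argument("--nagios-address", required=True, help="NSCA server host")
    parser.add_argument("--local-address", required=True,
                        help="host_name of this node as defined in Nagios")
    parser.add_argument("--service-description", default="cpu_status")
    parser.add_argument("--warning", type=float, default=70.0,
                        help="warning threshold in percent")
    parser.add_argument("--critical", type=float, default=90.0,
                        help="critical threshold in percent")
    parser.add_argument("--interval", type=int, default=5,
                        help="seconds between the two samples")
    args = parser.parse_args(argv)
    if args.warning >= args.critical:
        parser.error("--warning must be lower than --critical")
    return args


# 10.0.0.1 is a placeholder address for the Nagios server.
cmd_options = parse_args(["--nagios-address", "10.0.0.1",
                          "--local-address", "192.168.0.6"])
print(cmd_options.local_address, cmd_options.warning, cmd_options.critical)
```

In the crontab entry the same options would simply be appended to the script path.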

Memory

[1402017111] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;memory_status;0;OK: used memory percentage is 37.2979746597% (2935 MiB) | usedmemory=37.2979746597%;80;90;;

Memory is checked every minute by reading /proc/meminfo. The memory in use is calculated as total - free - buffers - cached, and its percentage of total RAM is reported: WARNING above 80%, CRITICAL above 90%.

nagios:/usr/local/nagios/var/rw # cat /proc/meminfo
MemTotal:        8401536 kB
MemFree:         6881104 kB
Buffers:          190732 kB
Cached:           497344 kB

*/1 * * * * /home/nagios/check_mem.py > /dev/null 2>&1
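
The used-memory calculation can be sketched as follows, using the /proc/meminfo values shown above as input. The function names are hypothetical; a real check would read the text from /proc/meminfo instead of a string.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def used_memory(info):
    """Return (used_kb, used_percent) as total - free - buffers - cached."""
    used = info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]
    return used, 100.0 * used / info["MemTotal"]


sample = """MemTotal:        8401536 kB
MemFree:         6881104 kB
Buffers:          190732 kB
Cached:           497344 kB"""
used_kb, used_pct = used_memory(parse_meminfo(sample))
print("used: %d kB (%.2f%%)" % (used_kb, used_pct))
```

For these values the result is 832356 kB in use, a little under 10% of total memory, which would be reported as OK.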
Disk

[1383817502] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;disk_status;0;STATISTICS OK: "mount /" usage=40% | "mount /" usage=40%;80;90;;

The disk check runs every 10 minutes: it executes "df -h" and looks at each disk's Use%, reporting WARNING above 80% and CRITICAL above 90%.
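
A sketch of extracting Use% from "df -h" output is shown below. The function names are hypothetical and the sample output is illustrative, not taken from the monitored node; it assumes the usual six-column layout with each filesystem on a single line.

```python
import subprocess


def parse_df(df_output):
    """Map mount point -> Use% (int) from `df -h` style output."""
    usage = {}
    for line in df_output.splitlines()[1:]:  # skip the header row
        parts = line.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(parts) >= 6 and parts[4].endswith("%"):
            mount_point = " ".join(parts[5:])  # tolerate spaces in the path
            usage[mount_point] = int(parts[4].rstrip("%"))
    return usage


def run_df():
    """Run `df -h` on the local machine and parse its output."""
    return parse_df(subprocess.check_output(["df", "-h"]).decode())


# Illustrative output only.
sample = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   20G   30G  40% /
tmpfs           3.9G     0  3.9G   0% /dev/shm"""
print(parse_df(sample))
```

Python's standard shutil.disk_usage would be an alternative that avoids parsing command output, at the cost of checking one path at a time.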

IO & IOPS

[1383817420] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;io_status;0;STATISTICS OK: await=0.00% util=0.00% | await=0.00%;;;; util=0.00%;;;;

The IO check runs once per minute: it executes "iostat -xkd" to get each disk's rkB/s (read data rate), wkB/s (write data rate), await (request response time) and %util (percentage of time during which IO requests were issued to the device).

[1399532518] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;iops_status;2;STATISTICS CRITICAL: iops=58.0 | iops=58.0;;;;

The IOPS check runs once per minute: it executes "iostat" to obtain the disk IO tps (transfers per second) information.
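
Reading the tps column from plain "iostat" output might look like the sketch below. The function name is hypothetical and the sample output is illustrative; because column layouts differ between sysstat versions, the tps column is located by its header name rather than by a fixed position.

```python
def parse_iostat_tps(iostat_output):
    """Return {device: tps} from the device table of `iostat` output."""
    tps = {}
    column = None
    for line in iostat_output.splitlines():
        parts = line.split()
        if not parts:
            column = None                # a blank line ends the device table
        elif parts[0].startswith("Device"):
            column = parts.index("tps")  # locate the tps column by name
        elif column is not None:
            tps[parts[0]] = float(parts[column])
    return tps


# Illustrative output only; real values come from running iostat.
sample = """avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.50    0.10    0.00   98.40

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda              58.00       120.00       340.00    1234567    7654321
"""
per_device = parse_iostat_tps(sample)
print(per_device, sum(per_device.values()))
```

Note that the first report printed by iostat covers the time since boot, so a per-minute check may prefer to request a second report over a short interval.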

Note: the sysstat package needs to be installed.

For specific usage of the iostat command, see its reference documentation.

Bandwidth

[1399532517] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;bandwidth_status_eth1;0;STATISTICS OK: - The traffic in was 0.8Kbps, out was 0.53Kbps, total is 1.33Kbps. | in=0.8Kbps;;;0;0 out=0.53Kbps;;;0;0 total=1.33Kbps;;;0;0

Bandwidth is handled much like CPU: take the difference between two readings of the RX, TX and total counters in /proc/net/dev and divide it by the time interval to get the throughput.
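
A sketch of that calculation is given below. The function names are hypothetical, the sample counters are illustrative, and it assumes that Kbps means 1000 bits per second; counter wrap-around is not handled.

```python
def parse_net_dev(text):
    """Map interface -> (rx_bytes, tx_bytes) from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Data rows carry 16 counters: 8 receive fields, then 8 transmit fields.
        if sep and len(fields) >= 16 and fields[0].isdigit():
            counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def throughput_kbps(before, after, seconds):
    """Return (in, out, total) in Kbps between two (rx_bytes, tx_bytes) samples."""
    rx = (after[0] - before[0]) * 8 / 1000.0 / seconds
    tx = (after[1] - before[1]) * 8 / 1000.0 / seconds
    return rx, tx, rx + tx


# Illustrative content only.
first = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth1: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0
"""
second = first.replace("1000 10", "6000 60").replace("2000 20", "4500 45")
before = parse_net_dev(first)["eth1"]
after = parse_net_dev(second)["eth1"]
print(throughput_kbps(before, after, 10))
```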

JVM Heap

[1399533015] PROCESS_SERVICE_CHECK_RESULT;192.168.0.6;heap_status;0;OK: used heap percentage is 50.3057759255% (1054988 MiB) | usedheap=50.3057759255%;90;98;;

Our system runs on WebLogic, and information about the JVM can be obtained through weblogic.Admin (weblogic.Admin GET -pretty -type JRockitRuntime | egrep 'FreeHeap|UsedHeap|TotalHeap'). For now the system is mainly concerned with heap usage.


Finally, look at how the data appears in Nagios (taking one node and the CPU service data for that node as the example):

Aside: many customers have their own monitoring systems. These systems not only monitor the usage of hardware resources such as CPU and memory, but also scan the system logs for errors and warnings, and scan the system's open ports and services for compliance with security standards, for example whether anonymous FTP login is allowed. So in the overall system design, think carefully about how many services the operating system should expose, and design the log output with care.


First article: Nagios Usage Introduction
Article Two: Nagios configuration--Nagios server side

Real-Life Nagios NSCA-mode monitoring of Linux system Resource Usage--nagios configuration--Monitored end
