Introduction to Oracle Cluster Health Monitor (CHM)
Overview
Cluster Health Monitor (CHM) is an Oracle-provided tool that automatically collects operating system resource statistics (CPU, memory, swap usage, processes, I/O, network, and so on). CHM collects data once per second.
This data is very helpful for diagnosing cluster node reboots, hangs, instance evictions, performance issues, and similar problems. In addition, users can use CHM to spot conditions such as high system load or abnormal memory usage early, and so avoid more serious problems.
CHM is installed automatically with the following software:
Oracle Grid Infrastructure 11.2.0.2 and later for Linux (not including Linux Itanium) and Solaris (SPARC and x86-64)
Oracle Grid Infrastructure 11.2.0.3 and later for AIX and Windows (not including Windows Itanium)
In a cluster, the following commands show the status of the resource that corresponds to CHM (ora.crf):
$ crsctl stat res -t -init
# ./crsctl stat res ora.crf -init
NAME=ora.crf
TYPE=ora.crf.type
TARGET=ONLINE
STATE=ONLINE on TESTRAC2
CHM consists mainly of two services:
1). System Monitor Service (osysmond): This service runs on every node. osysmond sends each node's resource usage to the Cluster Logger Service, which receives the information from all nodes and saves it in the CHM repository.
$ ps -ef | grep osysmond
root 7984 1 0 Jun05 ? 01:16:14 /u01/app/11.2.0/grid/bin/osysmond.bin
2). Cluster Logger Service (ologgerd): Within a cluster, ologgerd has one master node and one standby node. If ologgerd on the master node runs into a problem and cannot start, it is started on the standby node.
Master node:
$ ps -ef | grep ologgerd
root 8257 1 0 Jun05 ? 00:38:26 /u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac2
Standby node:
$ ps -ef | grep ologgerd
root 8353 1 0 Jun05 ? 00:18:47 /u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d /u01/app/11.2.0/grid/crf/db/rac1
CHM Repository: Stores the collected data. By default it resides under the Grid Infrastructure home and requires 1 GB of disk space; each node consumes about 0.5 GB. You can use oclumon to change its storage path and the amount of space it is allowed to use (up to 3 days of data can be kept).
View current settings
The following commands show the current settings:
$ oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0/grid/crf/db/rac2
Done
$ oclumon manage -get repsize
CHM Repository Size = 68082 <==== unit is seconds
Done
Modify settings
To modify the path:
$ oclumon manage -repos reploc /shared/oracle/chm
To modify the size:
$ oclumon manage -repos resize 68083 <== between 3600 (1 hour) and 259200 (3 days)
rac1 --> retention check successful
New retention is 68083 and would use 1073750609 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
Done
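Because the value passed to the resize command is a retention period in seconds, it can be convenient to convert from hours and check the range before running it. The following is a minimal Python sketch, assuming only the 3600 to 259200 second limits quoted above; the helper name retention_seconds is hypothetical and is not part of any Oracle tool.

```python
# Bounds quoted in the text above: 1 hour to 3 days, expressed in seconds.
MIN_RETENTION_S = 3600
MAX_RETENTION_S = 259200


def retention_seconds(hours: float) -> int:
    """Convert a retention period in hours to seconds and validate the range.

    Hypothetical helper for illustration only.
    """
    seconds = int(round(hours * 3600))
    if not MIN_RETENTION_S <= seconds <= MAX_RETENTION_S:
        raise ValueError(
            f"retention must be between {MIN_RETENTION_S} and "
            f"{MAX_RETENTION_S} seconds, got {seconds}"
        )
    return seconds


if __name__ == "__main__":
    # Print the command for a 24-hour retention instead of executing it.
    print(f"oclumon manage -repos resize {retention_seconds(24)}")
```

The script only prints the command, so it can be reviewed before being run on a cluster node.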
Ways to get the data generated by CHM
1. The first way is to use Grid_home/bin/diagcollection.pl:
1). First, determine the master node of the Cluster Logger Service:
$ oclumon manage -get master
Master = rac2
2). As the root user on the master node rac2, run the following command:
# Grid_home/bin/diagcollection.pl -collect -chmos -incidenttime inc_time -incidentduration duration
inc_time is the time from which to start collecting data, in the format MM/DD/YYYY24HH:MM:SS; duration is the length of time after the start time for which data is collected.
For example: # diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/201215:30:00 -incidentduration 00:05
3). After the command runs, the CHM data is written to the file chmosData_rac2_20120615_1537.tar.gz.
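The incident time format has no separator between the date and the time, which is easy to get wrong when typing by hand. The sketch below shows one way to generate the two arguments in Python. The function name incident_args is hypothetical, and the HH:MM layout of the duration is an assumption inferred from the 00:05 value in the example above.

```python
from datetime import datetime, timedelta
from typing import List


def incident_args(start: datetime, duration: timedelta) -> List[str]:
    """Build the -incidenttime / -incidentduration arguments.

    Hypothetical helper. The time format follows MM/DD/YYYY24HH:MM:SS from
    the text; the HH:MM duration format is assumed from the example 00:05.
    """
    # Date and 24-hour time are concatenated with no space between them.
    inc_time = start.strftime("%m/%d/%Y%H:%M:%S")
    total_minutes = int(duration.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return [
        "-incidenttime", inc_time,
        "-incidentduration", f"{hours:02d}:{minutes:02d}",
    ]


if __name__ == "__main__":
    args = incident_args(datetime(2012, 6, 15, 15, 30, 0), timedelta(minutes=5))
    print(" ".join(args))
```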
2. The other way to get the data generated by CHM is oclumon:
$ oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning]] [-h]
-s indicates the start time, and -e indicates the end time.
$ oclumon dumpnodeview -allnodes -v -s "2012-06-15 07:40:00" -e "2012-06-15 07:57:00" > /tmp/chm1.txt
$ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00" > /tmp/chm1.txt
$ oclumon dumpnodeview -allnodes -last "00:15:00" > /tmp/chm1.txt
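When the time window comes from a script (for example, the minutes around a node reboot), the start and end arguments can be assembled programmatically. This is a minimal sketch that uses only the options shown in the syntax line above and the "YYYY-MM-DD HH:MM:SS" timestamp layout seen in the examples; the function name dumpnodeview_args is hypothetical.

```python
import shlex
from datetime import datetime
from typing import List, Optional

# Timestamp layout taken from the examples above, e.g. "2012-06-15 07:40:00".
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def dumpnodeview_args(
    start: datetime,
    end: datetime,
    nodes: Optional[List[str]] = None,
    verbose: bool = True,
) -> List[str]:
    """Return an argument list for a time-window dumpnodeview call.

    Hypothetical helper; it only assembles arguments and runs nothing.
    """
    if end <= start:
        raise ValueError("end must be later than start")
    args = ["oclumon", "dumpnodeview"]
    if nodes:
        args += ["-n", *nodes]      # selected nodes only
    else:
        args.append("-allnodes")    # every node in the cluster
    if verbose:
        args.append("-v")
    args += ["-s", start.strftime(TS_FORMAT), "-e", end.strftime(TS_FORMAT)]
    return args


if __name__ == "__main__":
    cmd = dumpnodeview_args(datetime(2012, 6, 15, 7, 40), datetime(2012, 6, 15, 7, 57))
    # shlex.join quotes the timestamps so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the timestamps intact as single arguments if the list is later passed to subprocess.run.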
Below is part of the content of /tmp/chm1.txt:
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo:168880
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 mcache: 1064024 swapfree: 3988376 swaptotal: 4192956 ior: 57 iow: 59 ios: 10 swpin: 0 swpout: 0 pgin: 57 pgout: 59 netr: 65.767 netw: 34.871 procs: 183 rtprocs: 10 #fds: 4902 #sysfdlimit: 6815744
#disks: 4 #nics: 3 nicErrors: 0
TOP CONSUMERS:
topcpu: 'mrtg(32385) 64.70' topprivmem: 'ologgerd(8353) 84068' topshm: 'oracle(8760) 329452' topfd: 'ohasd.bin(6627) 720' topthread: 'crsd.bin(8235) 44'
PROCESSES:
name: 'mrtg' pid: 32385 #procfdlimit: 65536 cpuusage: 64.70 privmem: 1160 shm: 1584 #fd: 5 #threads: 1 priority: 20 nice: 0
name: 'oracle' pid: 32381 #procfdlimit: 65536 cpuusage: 0.29 privmem: 1456 shm: 12444 #fd: #threads: 1 priority: 15 nice: 0
...
name: 'oracle' pid: 8756 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2892 shm: 24356 #fd: #threads: 1 priority: 16 nice: 0
----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo:168878
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072 physmemtotal: 2065856 mcache: 1005636 swapfree: 3991808 swaptotal: 4192956 ior: 54 iow: 104 ios: 11 swpin: 0 swpout: 0 pgin: 54 pgout: 104 netr: 77.817 netw: 33.008 procs: 178 rtprocs: 10 #fds: 4948 #sysfdlimit: 68157
#disks: 4 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'orarootagent.bi(8490) 1.59' topprivmem: 'ologgerd(8257) 83108' topshm: 'oracle(8873) 324868' topfd: 'ohasd.bin(6744) 720' topthread: 'crsd.bin(8362) 47'
PROCESSES:
name: 'oracle' pid: 9040 #procfdlimit: 65536 cpuusage: 0.19 privmem: 6040 shm: 121712 #fd: #threads: 1 priority: 16 nice: 0
...
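Output like the above is easier to scan for problems once the SYSTEM metrics are pulled into a structure. The following Python sketch is one possible way to do that, assuming the "key: value" layout shown in the sample; it is not an Oracle-provided parser, and the names parse_system_metrics and SAMPLE are hypothetical.

```python
import re
from typing import Any, Dict, List

# Header line of each sample, e.g. "Node: rac1 Clock: '06-15-12 07.40.01' ..."
NODE_RE = re.compile(r"Node:\s*(\S+)\s+Clock:\s*'([^']*)'", re.IGNORECASE)
# Numeric "key: value" pairs such as "cpu: 17.96" or "#cpus: 1".
METRIC_RE = re.compile(r"(#?[A-Za-z]+):\s*(-?\d+(?:\.\d+)?)(?=\s|$)")


def parse_system_metrics(text: str) -> List[Dict[str, Any]]:
    """Extract the SYSTEM section of every node sample in the dump text."""
    samples: List[Dict[str, Any]] = []
    current = None
    in_system = False
    for raw in text.splitlines():
        line = raw.strip()
        header = NODE_RE.search(line)
        if header:
            current = {"node": header.group(1),
                       "clock": header.group(2).strip(),
                       "system": {}}
            samples.append(current)
            in_system = False
        elif line.upper().startswith("SYSTEM:"):
            in_system = True
        elif line.upper().startswith(("TOP CONSUMERS:", "PROCESSES:")):
            in_system = False
        elif in_system and current is not None:
            for key, value in METRIC_RE.findall(line):
                current["system"][key] = float(value)
    return samples


# Shortened excerpt of the sample above, used only to exercise the parser.
SAMPLE = """\
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo:168880
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 swapfree: 3988376 swaptotal: 4192956
TOP CONSUMERS:
topcpu: 'mrtg(32385) 64.70'
"""

if __name__ == "__main__":
    for sample in parse_system_metrics(SAMPLE):
        sysm = sample["system"]
        free_pct = 100.0 * sysm["physmemfree"] / sysm["physmemtotal"]
        print(f"{sample['node']} {sample['clock']}: "
              f"cpu={sysm['cpu']} free memory={free_pct:.1f}%")
```

The parser deliberately ignores the TOP CONSUMERS and PROCESSES sections, whose values are quoted strings rather than plain numbers.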
For more information about CHM, refer to the official Oracle documentation:
http://docs.oracle.com/cd/E11882_01/rac.112/e16794/troubleshoot.htm#CWADD92242
Oracle Clusterware Administration and Deployment Guide
11g Release 2 (11.2)
Part Number E16794-17
or to the My Oracle Support document:
Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)