New Feature of 11gR2: Introduction to Oracle Cluster Health Monitor (CHM)
Cluster Health Monitor (CHM) is a tool provided by Oracle that automatically collects operating system resource data (such as CPU, memory, swap, processes, I/O, and network). CHM collects data once per second.
This data is very helpful for diagnosing cluster problems such as node restarts, hangs, instance evictions, and performance issues. In addition, CHM can detect conditions such as high system load and memory anomalies early, before they develop into more serious problems.
CHM is installed automatically with the following software:
11.2.0.2 and later versions of Oracle Grid Infrastructure for Linux (excluding Linux Itanium) and Solaris (SPARC 64 and x86-64)
11.2.0.3 and later versions of Oracle Grid Infrastructure for AIX and Windows (excluding Windows Itanium).
In the cluster, you can run the following command to check the state of the resource corresponding to CHM (ora.crf):
$ crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER       STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.crf        ONLINE  ONLINE       rac1
CHM mainly consists of two services:
1). System Monitor Service (osysmond): this service runs on every node. osysmond sends each node's resource usage to the Cluster Logger Service, which receives the data from all the nodes and saves it to the CHM repository.
$ ps -ef | grep osysmond
root      7984     1  0 Jun05 ?        01:16:14 /u01/app/11.2.0/grid/bin/osysmond.bin
2). Cluster Logger Service (ologgerd): in a cluster, ologgerd runs on one master node, with another node acting as standby. If ologgerd cannot run on the current master node, it is started on the standby node.
Master node:
$ ps -ef | grep ologgerd
root      8257     1  0 Jun05 ?        00:38:26 /u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac2
Standby node:
$ ps -ef | grep ologgerd
root      8353     1  0 Jun05 ?        00:18:47 /u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d /u01/app/11.2.0/grid/crf/db/rac1
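As the ps output above shows, ologgerd's role is encoded in its arguments: the master runs with -M -d <dir>, while the standby runs with -m <master-node> -r -d <dir>. A minimal sketch (the helper function name is made up for illustration) that classifies a node from its ologgerd command line:

```shell
#!/bin/sh
# Classify an ologgerd command line as master or standby:
# the master runs with an uppercase -M flag, the standby with
# -m <master-node> -r pointing back at the master.
ologgerd_role() {
    case " $1 " in
        *" -M "*) echo master ;;
        *" -m "*) echo standby ;;
        *)        echo unknown ;;
    esac
}

# Command lines taken from the ps output above:
ologgerd_role "/u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac2"   # master
ologgerd_role "/u01/app/11.2.0/grid/bin/ologgerd -m rac2 -r -d /u01/app/11.2.0/grid/crf/db/rac1"   # standby
```

In practice you would feed it the output of ps -eo args | grep '[o]loggerd' on each node.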
CHM Repository: stores the collected data. By default it is located under the Grid Infrastructure home and requires 1 GB of disk space; each node adds a certain amount of data to it every day. You can use oclumon to adjust its storage location and the space it is allowed to use (data can be retained for at most three days).
You can view the current settings with the following commands:
$ oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0/grid/crf/db/rac2
Done
$ oclumon manage -get repsize
CHM Repository Size = 68082   <== unit: seconds
Done
To change the path:
$ oclumon manage -repos reploc /shared/oracle/chm
To change the size:
$ oclumon manage -repos resize 68083   <== must be between 3600 (1 hour) and 259200 (3 days) seconds
rac1 --> retention check successful
New retention is 68083 and will use 1073750609 bytes of disk space
CRS-9115-Cluster Health Monitor repository size change completed on all nodes.
Done
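Since the resize value is a retention period in seconds, it is easy to pass an out-of-range number. A minimal pre-check, assuming the documented bounds of 3600 to 259200 seconds (oclumon enforces the same range itself):

```shell
#!/bin/sh
# Validate a CHM repository retention value (in seconds) against
# the documented bounds: 3600 (1 hour) to 259200 (3 days).
valid_retention() {
    [ "$1" -ge 3600 ] && [ "$1" -le 259200 ]
}

if valid_retention 68083; then
    echo "68083 seconds is a valid retention"
    # on a real cluster you would now run:
    # oclumon manage -repos resize 68083
fi
```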
There are two methods to obtain the data generated by CHM:
1. Use Grid_home/bin/diagcollection.pl:
1) First, determine the master node of the Cluster Logger Service:
$ oclumon manage -get master
Master = rac2
2) Run the following command as the root user on the master node, rac2:
# Grid_home/bin/diagcollection.pl -collect -chmos -incidenttime inc_time -incidentduration duration
Here inc_time is the time from which to start collecting data, in the format MM/DD/YYYY24HH:MM:SS (the date followed immediately by the 24-hour time), and duration is the length of time, starting at inc_time, for which to collect data.
For example:
# diagcollection.pl -collect -crshome /u01/app/11.2.0/grid -chmoshome /u01/app/11.2.0/grid -chmos -incidenttime 06/15/2012 -incidentduration
After this command completes, the CHM data is packaged into the file chmosdata_rac2_20120615_1537.tar.gz.
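The inc_time format, with the 24-hour time appended directly to the date, is easy to mistype. A small helper, assuming GNU date (the -d option is not portable to AIX or Solaris), can build the string for "N minutes ago":

```shell
#!/bin/sh
# Build a diagcollection.pl incident time for "N minutes ago".
# Format: MM/DD/YYYY followed immediately by HH:MM:SS (24-hour clock).
# Assumes GNU date; adjust for other platforms.
inc_time() {
    date -d "$1 minutes ago" '+%m/%d/%Y%H:%M:%S'
}

inc_time 30    # e.g. 06/15/201215:07:00
```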
2. The other way to obtain the data collected by CHM is oclumon:
$ oclumon dumpnodeview [[-allnodes] | [-n node1 node2] [-last "duration"] | [-s "time_stamp" -e "time_stamp"] [-v] [-warning] [-h]
-s specifies the start time and -e the end time.
$ oclumon dumpnodeview -allnodes -v -s "07:40:00" -e "07:57:00" > /tmp/chm1.txt
$ oclumon dumpnodeview -n node1 node2 node3 -last "12:00:00" > /tmp/chm1.txt
$ oclumon dumpnodeview -allnodes -last "00:15:00" > /tmp/chm1.txt
Part of the content of /tmp/chm1.txt looks like this:
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo: 168880
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240 physmemtotal: 2065856 mcache: 1064024 swapfree: 3988376 swaptotal: 4192956 ior: 57 iow: 59 ios: 10 swpin: 0 swpout: 0 pgin: 57 pgout: 59 netr: 65.767 netw: 34.871 procs: 183 rtprocs: 10 #fds: 4902 #sysfdlimit: 6815744 #disks: 4 #nics: 3 nicErrors: 0
Top consumers:
topcpu: 'mrtg(32385) 100' topprivmem: 'ognames(64.70) 100' topshm: 'oracle(8353) 100' topfd: 'ohasd.bin(84068) 100' topthread: 'crsd.bin(8235) 44'
PROCESSES:
name: 'mrtg' pid: 32385 #procfdlimit: 65536 cpuusage: 64.70 privmem: 1160 shm: 1584 #fd: 5 #threads: 1 priority: 20 nice: 0
name: 'oracle' pid: 32381 #procfdlimit: 65536 cpuusage: 0.29 privmem: 1456 shm: 12444 #fd: 32 #threads: 1 priority: 15 nice: 0
...
name: 'oracle' pid: 8756 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2892 shm: 24356 #fd: 47 #threads: 1 priority: 16 nice: 0
----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo: 168878
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072 physmemtotal: 2065856 mcache: 1005636 swapfree: 3991808 swaptotal: 4192956 ior: 54 iow: 104 ios: 11 swpin: 0 swpout: 0 pgin: 54 pgout: 104 netr: 77.817 netw: 33.008 procs: 178 rtprocs: 10 #fds: 4948 #sysfdlimit: 6815744 #disks: 4 #nics: 4 nicErrors: 0
Top consumers:
topcpu: 'orarootagent.bi(8490) 1.59' topprivmem: 'ologgerd(8257) 100' topshm: 'oracle(83108) 100' topfd: 'ocssd.bin(6744) 720' topthread: 'crsd.bin(8362) 47'
PROCESSES:
name: 'oracle' pid: 9040 #procfdlimit: 65536 cpuusage: 0.19 privmem: 6040 shm: 121712 #fd: 33 #threads: 1 priority: 16 nice: 0
...
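Because each sample in the dumpnodeview output is a flat list of key: value pairs, it can be filtered with standard text tools. The sketch below (the 30% threshold and the sample file are illustrative) prints the node and timestamp of every sample whose cpu value exceeds the threshold:

```shell
#!/bin/sh
# Flag dumpnodeview samples whose "cpu:" value exceeds a threshold.
# The sample data mirrors the /tmp/chm1.txt excerpt above.
cat > /tmp/chm_sample.txt <<'EOF'
----------------------------------------
Node: rac1 Clock: '06-15-12 07.40.01' SerialNo: 168880
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 17.96 cpuq: 5 physmemfree: 32240
----------------------------------------
Node: rac2 Clock: '06-15-12 07.40.02' SerialNo: 168878
----------------------------------------
SYSTEM:
#cpus: 1 cpu: 40.72 cpuq: 8 physmemfree: 34072
EOF

awk -v limit=30 '
/^Node:/ { node = $2; clock = $4 " " $5 }     # remember the sample header
/ cpu: / {                                    # SYSTEM line carrying cpu:
    for (i = 1; i < NF; i++)
        if ($i == "cpu:" && $(i + 1) + 0 > limit)
            print node, clock, "cpu=" $(i + 1)
}' /tmp/chm_sample.txt
# -> rac2 '06-15-12 07.40.02' cpu=40.72
```

The same awk program can be pointed at the real /tmp/chm1.txt, or extended to other fields such as cpuq or swpin.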
For more information about CHM, see the Oracle official documentation:
Oracle Clusterware Administration and Deployment Guide
11g Release 2 (11.2)
Part Number E16794-17
Or the My Oracle Support note:
Cluster Health Monitor (CHM) FAQ (Doc ID 1328466.1)