Monitoring a MongoDB Cluster with Ganglia

Last updated: 2015-04-13 Source: Internet

Uploaded by: User

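A few days ago I posted an article on monitoring a Storm cluster with Ganglia. This article covers monitoring a MongoDB cluster with Ganglia, because we want Ganglia to be the single monitoring system for everything.
1. Ganglia's extension mechanism
To monitor a MongoDB cluster with Ganglia, you first need to understand how Ganglia can be extended. Ganglia plugins offer two ways to extend its monitoring:

1) Add an in-band plugin, implemented mainly with the gmetric command.

This is the commonly used approach: a cron job calls Ganglia's gmetric command to feed data into gmond, which brings the data into the unified monitoring view. It is simple and works well for a small number of metrics, but with large-scale custom monitoring the data becomes hard to manage in one place.

2) Add extra scripts that monitor the system, implemented mainly through the C or Python interface.

Since Ganglia 3.1.x there has been a C and a Python interface for writing custom data-collection modules. These modules plug directly into gmond to monitor user-defined applications.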
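As a rough illustration of the in-band approach in 1), the sketch below builds the gmetric command line that a cron job could run to push one value to gmond. The helper names `build_gmetric_command` and `send_metric`, the metric name, and the value are illustrative assumptions and not part of the original setup; the sketch relies only on gmetric's --name, --value, --type, and --units options.

```python
import subprocess


def build_gmetric_command(name, value, value_type, units, gmetric="gmetric"):
    """Return the argument list for one gmetric call (hypothetical helper)."""
    return [
        gmetric,
        "--name=%s" % name,
        "--value=%s" % value,
        "--type=%s" % value_type,
        "--units=%s" % units,
    ]


def send_metric(name, value, value_type, units):
    """Run gmetric once; assumes the gmetric binary is on PATH."""
    return subprocess.call(build_gmetric_command(name, value, value_type, units))


if __name__ == "__main__":
    # Example values only; a real cron job would first read the value from MongoDB.
    cmd = build_gmetric_command("mongodb_connections_current", 42, "uint32", "Connections")
    print(" ".join(cmd))
```

Every metric collected this way needs its own cron entry or script, which is the management burden described above.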
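2. Monitoring MongoDB with a Python script
We use a Python script to monitor the MongoDB cluster because extending Ganglia this way is convenient: to collect more data, you only add the metric to the corresponding .py script. The approach is easy to extend and easy to port.
2.1 Environment setup
To extend Ganglia monitoring with a Python script, first check whether the modpython.so file exists. This file is the shared library through which Ganglia calls Python, so developing Ganglia plugins through the Python interface requires that this module be compiled and installed. modpython.so lives in lib/ganglia/ (or lib64/ganglia/) under the Ganglia installation directory. If it exists, you can go on to write the scripts below; if it does not, you need to recompile and reinstall gmond with the "--with-python" option.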
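The check described above can be sketched in a few lines. The helper name `find_modpython` is hypothetical, and the prefix /usr/local/ganglia is taken from the configuration paths used later in this article; adjust it to your own installation.

```python
import os


def find_modpython(ganglia_prefix):
    """Return the path to modpython.so under the given prefix, or None if missing."""
    for libdir in ("lib64", "lib"):
        candidate = os.path.join(ganglia_prefix, libdir, "ganglia", "modpython.so")
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    path = find_modpython("/usr/local/ganglia")
    if path:
        print("found %s" % path)
    else:
        print("modpython.so not found; rebuild gmond with --with-python")
```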
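2.2 Writing the monitoring scripts
Open etc/gmond.conf under the Ganglia installation directory. It contains the line include ("/usr/local/ganglia/etc/conf.d/*.conf"), which means gmond scans that directory for configuration files. The monitoring configuration for MongoDB therefore has to go in etc/conf.d/. As the modpython.conf file examined next shows, configuration files for Python modules use the .pyconf extension, so ours will be named mongodb.pyconf.
1) Inspecting modpython.conf
modpython.conf is located in the etc/conf.d/ directory. Its contents are as follows: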

 
  modules {
    module {
      name = "python_module"  # main module name
      path = "modpython.so"  # shared library Ganglia needs to run Python extension scripts
      params = "/usr/local/ganglia/lib64/ganglia/python_modules"  # directory holding the Python scripts
    }
  }

  include ("/usr/local/ganglia/etc/conf.d/*.pyconf")  # location of the configuration files for Python extensions

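So, to extend Ganglia to monitor MongoDB with Python, place the configuration file and the .py script in those directories and restart the Ganglia service. The following steps show how to write both files.
2) Creating the mongodb.pyconf file
Note that you need root privileges to create and edit this file. Store it in the conf.d directory. For the MongoDB parameters that can be collected, see https://github.com/ganglia/gmond_python_modules/tree/master/mongodb and add or remove metrics to suit your needs.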

 
  modules {
    module {
      name = "mongodb"  # module name; must match the file name of the Python script in the python_modules directory set in modpython.conf
      language = "python"  # the module is written in Python
      # Parameter list: all params are passed as one dict (i.e. a map) to the script's metric_init(params) function.
      param server_status {
        value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(db.serverStatus())'"
      }
      param rs_status {
        value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(rs.status())'"
      }
    }
  }

  # List of metrics to collect; a module can expose any number of metrics.
  collection_group {
    collect_every = 30
    time_threshold = 90  # maximum interval between sends
    metric {
      name = "mongodb_opcounters_insert"  # name of the metric inside the module
      title = "Inserts"  # title shown in the web interface
    }

    metric {
      name = "mongodb_opcounters_query"
      title = "Queries"
    }
    metric {
      name = "mongodb_opcounters_update"
      title = "Updates"
    }
    metric {
      name = "mongodb_opcounters_delete"
      title = "Deletes"
    }
    metric {
      name = "mongodb_opcounters_getmore"
      title = "Getmores"
    }
    metric {
      name = "mongodb_opcounters_command"
      title = "Commands"
    }
    metric {
      name = "mongodb_backgroundFlushing_flushes"
      title = "Flushes"
    }
    metric {
      name = "mongodb_mem_mapped"
      title = "Memory-mapped Data"
    }
    metric {
      name = "mongodb_mem_virtual"
      title = "Process Virtual Size"
    }
    metric {
      name = "mongodb_mem_resident"
      title = "Process Resident Size"
    }
    metric {
      name = "mongodb_extra_info_page_faults"
      title = "Page Faults"
    }
    metric {
      name = "mongodb_globalLock_ratio"
      title = "Global Write Lock Ratio"
    }
    metric {
      name = "mongodb_indexCounters_btree_miss_ratio"
      title = "BTree Page Miss Ratio"
    }
    metric {
      name = "mongodb_globalLock_currentQueue_total"
      title = "Total Operations Waiting for Lock"
    }
    metric {
      name = "mongodb_globalLock_currentQueue_readers"
      title = "Readers Waiting for Lock"
    }
    metric {
      name = "mongodb_globalLock_currentQueue_writers"
      title = "Writers Waiting for Lock"
    }
    metric {
      name = "mongodb_globalLock_activeClients_total"
      title = "Total Active Clients"
    }
    metric {
      name = "mongodb_globalLock_activeClients_readers"
      title = "Active Readers"
    }
    metric {
      name = "mongodb_globalLock_activeClients_writers"
      title = "Active Writers"
    }
    metric {
      name = "mongodb_connections_current"
      title = "Open Connections"
    }
    metric {
      name = "mongodb_connections_current_ratio"
      title = "Percentage of Connections Used"
    }
    metric {
      name = "mongodb_slave_delay"
      title = "Replica Set Slave Delay"
    }
    metric {
      name = "mongodb_asserts_total"
      title = "Asserts per Second"
    }
  }

As you can see, this configuration file uses the same syntax as gmond.conf, so if anything is unclear, refer to how gmond.conf is written.
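Because every metric block has the same shape, the collection_group section can also be generated rather than typed by hand. The following is only a sketch under that assumption; `render_collection_group` is a hypothetical helper, and its defaults copy the collect_every and time_threshold values shown above.

```python
def render_collection_group(metrics, collect_every=30, time_threshold=90):
    """Render a collection_group block from (name, title) pairs."""
    lines = [
        "collection_group {",
        "  collect_every = %d" % collect_every,
        "  time_threshold = %d" % time_threshold,
    ]
    for name, title in metrics:
        lines.extend([
            "  metric {",
            '    name = "%s"' % name,
            '    title = "%s"' % title,
            "  }",
        ])
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_collection_group([
        ("mongodb_opcounters_insert", "Inserts"),
        ("mongodb_opcounters_query", "Queries"),
    ]))
```

The generated text can be pasted below the modules block of mongodb.pyconf; each name must still match a descriptor returned by the Python module.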
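3) Creating the mongodb.py script
Place mongodb.py in the lib64/ganglia/python_modules directory. That directory already contains many Python scripts, for example ones that monitor disk, memory, network, MySQL, and Redis, and they are a useful reference when writing mongodb.py. If you open a few of them, you will see that each defines a function metric_init(params); as mentioned above, the parameters from mongodb.pyconf are passed to this function.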

 
  #!/usr/bin/env python
  # Note: this script uses Python 2 syntax (StandardError, built-in reduce, print statement).

  import copy
  import json
  import os
  import re
  import socket
  import time

  NAME_PREFIX = 'mongodb_'
  # Default shell commands; metric_init() overrides them with the values from mongodb.pyconf.
  PARAMS = {
      'server_status': '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(db.serverStatus())"',
      'rs_status': '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(rs.status())"'
  }
  # Most recent sample and the one before it; rates are computed from their difference.
  METRICS = {
      'time': 0,
      'data': {}
  }
  LAST_METRICS = copy.deepcopy(METRICS)
  METRICS_CACHE_TTL = 3  # seconds before the cached sample is refreshed


  def flatten(d, pre='', sep='_'):
      """Flatten a dict (i.e. dict['a']['b']['c'] => dict['a_b_c'])"""
      new_d = {}
      for k, v in d.items():
          if isinstance(v, dict):
              new_d.update(flatten(v, '%s%s%s' % (pre, k, sep)))
          else:
              new_d['%s%s' % (pre, k)] = v
      return new_d
  

  def get_metrics():
      """Return all metrics"""
      global METRICS, LAST_METRICS
      if (time.time() - METRICS['time']) > METRICS_CACHE_TTL:
          metrics = {}
          for status_type in PARAMS.keys():
              # get raw metric data
              o = os.popen(PARAMS[status_type])
              # clean up
              metrics_str = ''.join(o.readlines()).strip()  # convert to string
              metrics_str = re.sub(r'\w+\((.*)\)', r"\1", metrics_str)  # remove functions
              # convert to flattened dict
              try:
                  if status_type == 'server_status':
                      metrics.update(flatten(json.loads(metrics_str)))
                  else:
                      metrics.update(flatten(json.loads(metrics_str), pre='%s_' % status_type))
              except ValueError:
                  metrics = {}

          # update cache
          LAST_METRICS = copy.deepcopy(METRICS)
          METRICS = {
              'time': time.time(),
              'data': metrics
          }
      return [METRICS, LAST_METRICS]
  

  def get_value(name):
      """Return a value for the requested metric"""
      # get metrics
      metrics = get_metrics()[0]
      # get value
      name = name[len(NAME_PREFIX):]  # remove prefix from name
      try:
          result = metrics['data'][name]
      except StandardError:
          result = 0
      return result


  def get_rate(name):
      """Return change over time for the requested metric"""
      # get metrics
      [curr_metrics, last_metrics] = get_metrics()
      # get rate
      name = name[len(NAME_PREFIX):]  # remove prefix from name
      try:
          rate = float(curr_metrics['data'][name] - last_metrics['data'][name]) / \
              float(curr_metrics['time'] - last_metrics['time'])
          if rate < 0:
              rate = float(0)
      except StandardError:
          rate = float(0)
      return rate
  

  def get_opcounter_rate(name):
      """Return change over time for an opcounter metric"""
      master_rate = get_rate(name)
      repl_rate = get_rate(name.replace('opcounters_', 'opcountersRepl_'))
      return master_rate + repl_rate


  def get_globalLock_ratio(name):
      """Return the global lock ratio"""
      try:
          result = get_rate(NAME_PREFIX + 'globalLock_lockTime') / \
              get_rate(NAME_PREFIX + 'globalLock_totalTime') * 100
      except ZeroDivisionError:
          result = 0
      return result


  def get_indexCounters_btree_miss_ratio(name):
      """Return the btree miss ratio"""
      try:
          result = get_rate(NAME_PREFIX + 'indexCounters_btree_misses') / \
              get_rate(NAME_PREFIX + 'indexCounters_btree_accesses') * 100
      except ZeroDivisionError:
          result = 0
      return result


  def get_connections_current_ratio(name):
      """Return the percentage of connections used"""
      try:
          result = float(get_value(NAME_PREFIX + 'connections_current')) / \
              float(get_value(NAME_PREFIX + 'connections_available')) * 100
      except ZeroDivisionError:
          result = 0
      return result
  

  def get_slave_delay(name):
      """Return the replica set slave delay"""
      # get metrics
      metrics = get_metrics()[0]
      # no point checking my optime if i'm not replicating
      if 'rs_status_myState' not in metrics['data'] or metrics['data']['rs_status_myState'] != 2:
          result = 0
      # compare my optime with the master's
      else:
          master = {}
          slave = {}
          try:
              for member in metrics['data']['rs_status_members']:
                  if member['state'] == 1:
                      master = member
                  if member['name'].split(':')[0] == socket.getfqdn():
                      slave = member
              result = max(0, master['optime']['t'] - slave['optime']['t']) / 1000
          except KeyError:
              result = 0
      return result


  def get_asserts_total_rate(name):
      """Return the total number of asserts per second"""
      return float(reduce(lambda memo, obj: memo + get_rate('%sasserts_%s' % (NAME_PREFIX, obj)),
                          ['regular', 'warning', 'msg', 'user', 'rollovers'], 0))
  

  def metric_init(lparams):
      """Initialize metric descriptors"""
      global PARAMS
      # set parameters
      for key in lparams:
          PARAMS[key] = lparams[key]
      # define descriptors
      time_max = 60
      groups = 'mongodb'
      descriptors = [
          {
              'name': NAME_PREFIX + 'opcounters_insert',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Inserts/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Inserts',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'opcounters_query',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Queries/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Queries',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'opcounters_update',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Updates/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Updates',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'opcounters_delete',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Deletes/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Deletes',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'opcounters_getmore',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Getmores/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Getmores',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'opcounters_command',
              'call_back': get_opcounter_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Commands/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Commands',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'backgroundFlushing_flushes',
              'call_back': get_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Flushes/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Flushes',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'mem_mapped',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'MB',
              'slope': 'both',
              'format': '%u',
              'description': 'Memory-mapped Data',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'mem_virtual',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'MB',
              'slope': 'both',
              'format': '%u',
              'description': 'Process Virtual Size',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'mem_resident',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'MB',
              'slope': 'both',
              'format': '%u',
              'description': 'Process Resident Size',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'extra_info_page_faults',
              'call_back': get_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Faults/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Page Faults',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_ratio',
              'call_back': get_globalLock_ratio,
              'time_max': time_max,
              'value_type': 'float',
              'units': '%',
              'slope': 'both',
              'format': '%f',
              'description': 'Global Write Lock Ratio',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'indexCounters_btree_miss_ratio',
              'call_back': get_indexCounters_btree_miss_ratio,
              'time_max': time_max,
              'value_type': 'float',
              'units': '%',
              'slope': 'both',
              'format': '%f',
              'description': 'BTree Page Miss Ratio',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_currentQueue_total',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Operations',
              'slope': 'both',
              'format': '%u',
              'description': 'Total Operations Waiting for Lock',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_currentQueue_readers',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Operations',
              'slope': 'both',
              'format': '%u',
              'description': 'Readers Waiting for Lock',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_currentQueue_writers',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Operations',
              'slope': 'both',
              'format': '%u',
              'description': 'Writers Waiting for Lock',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_activeClients_total',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Clients',
              'slope': 'both',
              'format': '%u',
              'description': 'Total Active Clients',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_activeClients_readers',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Clients',
              'slope': 'both',
              'format': '%u',
              'description': 'Active Readers',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'globalLock_activeClients_writers',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Clients',
              'slope': 'both',
              'format': '%u',
              'description': 'Active Writers',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'connections_current',
              'call_back': get_value,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Connections',
              'slope': 'both',
              'format': '%u',
              'description': 'Open Connections',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'connections_current_ratio',
              'call_back': get_connections_current_ratio,
              'time_max': time_max,
              'value_type': 'float',
              'units': '%',
              'slope': 'both',
              'format': '%f',
              'description': 'Percentage of Connections Used',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'slave_delay',
              'call_back': get_slave_delay,
              'time_max': time_max,
              'value_type': 'uint',
              'units': 'Seconds',
              'slope': 'both',
              'format': '%u',
              'description': 'Replica Set Slave Delay',
              'groups': groups
          },
          {
              'name': NAME_PREFIX + 'asserts_total',
              'call_back': get_asserts_total_rate,
              'time_max': time_max,
              'value_type': 'float',
              'units': 'Asserts/Sec',
              'slope': 'both',
              'format': '%f',
              'description': 'Asserts',
              'groups': groups
          }
      ]
      return descriptors
  

  def metric_cleanup():
      """Cleanup"""
      pass


  # the following code is for debugging and testing
  if __name__ == '__main__':
      descriptors = metric_init(PARAMS)
      while True:
          for d in descriptors:
              print (('%s = %s') % (d['name'], d['format'])) % (d['call_back'](d['name']))
          print ''
          time.sleep(METRICS_CACHE_TTL)
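
To make the data path of the script easier to follow, here is a small standalone sketch of its three core steps: stripping shell wrappers such as NumberLong(...) from the mongo output, flattening the nested document, and turning two samples into a per-second rate. The function names `clean_shell_json` and `rate` and the sample strings are made up for illustration; only the regular expression and the flatten logic mirror the script above.

```python
import json
import re


def clean_shell_json(text):
    """Strip wrappers such as NumberLong(5) so the text parses as plain JSON."""
    return re.sub(r'\w+\((.*)\)', r'\1', text)


def flatten(d, pre='', sep='_'):
    """Flatten nested dicts: {'a': {'b': 1}} becomes {'a_b': 1}."""
    out = {}
    for key, value in d.items():
        if isinstance(value, dict):
            out.update(flatten(value, '%s%s%s' % (pre, key, sep)))
        else:
            out['%s%s' % (pre, key)] = value
    return out


def rate(curr, last, name):
    """Per-second change of a counter between two samples, never negative."""
    delta = curr['data'][name] - last['data'][name]
    elapsed = curr['time'] - last['time']
    return max(0.0, float(delta) / float(elapsed))


if __name__ == "__main__":
    # Made-up shell output for two samples taken 10 seconds apart.
    first = '{"opcounters": {"insert": NumberLong(100)}}'
    second = '{"opcounters": {"insert": NumberLong(150)}}'
    last = {'time': 0, 'data': flatten(json.loads(clean_shell_json(first)))}
    curr = {'time': 10, 'data': flatten(json.loads(clean_shell_json(second)))}
    print(rate(curr, last, 'opcounters_insert'))
```

The flattened key opcounters_insert is what the descriptor name mongodb_opcounters_insert resolves to once the mongodb_ prefix is removed.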

A Python extension script must implement two functions: metric_init(params) and metric_cleanup().
metric_init() is called when the module is initialized and must return a metric descriptor dictionary or a list of such dictionaries; mongodb.py returns a list.

A metric dictionary is defined as follows:

  d = {
      'name': '<your_metric_name>',  # must match the name used in the .pyconf file
      'call_back': <call_back function>,
      'time_max': int(<your_time_max>),
      'value_type': '<string | uint | float | double>',
      'units': '<your_units>',
      'slope': '<zero | positive | negative | both>',
      'format': '<your_format>',
      'description': '<your_description>'
  }
metric_cleanup() is called when the module shuts down and returns nothing.
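Put together, the smallest module that satisfies this contract might look like the sketch below. The metric name `example_module_uptime`, the group `example`, and the callback are invented for illustration and are unrelated to MongoDB; the descriptor keys are the ones listed above, plus the groups key that mongodb.py also sets.

```python
import time

START_TIME = time.time()


def get_uptime(name):
    """Callback: receives the metric name and returns the current value."""
    return int(time.time() - START_TIME)


def metric_init(params):
    """Return the list of metric descriptors this module provides."""
    return [{
        'name': 'example_module_uptime',
        'call_back': get_uptime,
        'time_max': 60,
        'value_type': 'uint',
        'units': 'Seconds',
        'slope': 'both',
        'format': '%u',
        'description': 'Seconds since the module was loaded',
        'groups': 'example',
    }]


def metric_cleanup():
    """Called when the module is unloaded; nothing to release here."""
    pass


if __name__ == '__main__':
    # Same kind of debugging loop as in mongodb.py, run once.
    for d in metric_init({}):
        value = d['call_back'](d['name'])
        print(('%s = ' + d['format']) % (d['name'], value))
```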
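4) Viewing the monitoring statistics in the web front end
After finishing the scripts, restart the gmond service. The new MongoDB metrics can then be viewed in the Ganglia web front end.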
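If the graphs do not show up, one way to check whether gmond is publishing the new metrics is to read its XML dump directly. The sketch below assumes that gmond's TCP accept channel listens on port 8649 and that metrics appear as METRIC elements with NAME and VAL attributes; both should be verified against your own gmond.conf and output. The function names and the sample document are illustrative.

```python
import socket
import xml.etree.ElementTree as ET

# Made-up sample in the shape gmond is assumed to emit.
SAMPLE = b"""<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="example">
<HOST NAME="db1">
<METRIC NAME="mongodb_connections_current" VAL="12" TYPE="uint32" UNITS="Connections"/>
<METRIC NAME="load_one" VAL="0.10" TYPE="float" UNITS=""/>
</HOST>
</CLUSTER>
</GANGLIA_XML>"""


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump from gmond's TCP channel (port 8649 is an assumption)."""
    chunks = []
    sock = socket.create_connection((host, port), timeout)
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def mongodb_metrics(xml_bytes, prefix="mongodb_"):
    """Return {metric name: value string} for metrics whose name starts with prefix."""
    root = ET.fromstring(xml_bytes)
    return dict(
        (m.get("NAME"), m.get("VAL"))
        for m in root.iter("METRIC")
        if m.get("NAME", "").startswith(prefix)
    )


if __name__ == "__main__":
    for name, value in sorted(mongodb_metrics(SAMPLE).items()):
        print("%s = %s" % (name, value))
```

On a live node, pass the result of fetch_gmond_xml() to mongodb_metrics() instead of SAMPLE.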
