Monitoring a MongoDB Cluster with Ganglia

A few days ago I published a post on monitoring a Storm cluster with Ganglia; this article covers monitoring a MongoDB cluster with Ganglia, since we want Ganglia to be the one tool that monitors everything.
1. Ganglia's extension mechanism
To monitor a MongoDB cluster with Ganglia you first have to understand Ganglia's extension mechanism. Ganglia plug-ins offer two ways to extend its monitoring capabilities:

1) Add an in-band plug-in, implemented mainly through the gmetric command.

This is the commonly used approach: a cron job calls Ganglia's gmetric command to feed data into gmond, bringing it into the unified monitoring view. It is simple and works well for a small number of metrics, but with a large number of custom metrics the monitoring data becomes hard to manage centrally (a minimal sketch of this approach is shown at the end of this section).

2) Add extra scripts that monitor the system, implemented through the C or Python interface.

Since Ganglia 3.1.x, C and Python interfaces have been available; through them you can write custom data-collection modules that are plugged directly into gmond to monitor user-defined applications.
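For completeness, here is a minimal sketch of the in-band method from 1): a small script, run from cron, that asks mongod for its connection count and pushes it into gmond with gmetric. The gmetric and mongo paths, the metric name, and the crontab line are assumptions for illustration only; the rest of this article uses the module approach from 2).

#!/usr/bin/env python
# push_mongo_connections.py -- hypothetical cron job illustrating the in-band (gmetric) approach
import json
import os
import subprocess

GMETRIC = '/usr/local/ganglia/bin/gmetric'   # assumed install prefix; adjust to your environment
MONGO = '/path/to/mongo'                     # path to the mongo shell binary

def current_connections():
    """Ask mongod for serverStatus().connections and return the current connection count."""
    cmd = "%s --host host --port 27017 --quiet --eval 'printjson(db.serverStatus().connections)'" % MONGO
    return json.loads(os.popen(cmd).read())['current']

if __name__ == '__main__':
    # example crontab entry:  * * * * *  /usr/bin/env python /path/to/push_mongo_connections.py
    subprocess.call([GMETRIC,
                     '--name', 'mongodb_connections_current',
                     '--value', str(current_connections()),
                     '--type', 'uint32',
                     '--units', 'Connections'])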
2. Monitoring MongoDB with a Python script
We monitor the MongoDB cluster with Python scripts: extending Ganglia through Python is convenient, since adding a new piece of monitoring data only means adding it to the corresponding .py script. This keeps the setup easy to extend and straightforward to port.
2.1 Environment setup
To extend Ganglia monitoring with Python scripts, first check whether the modpython.so file exists. It is the dynamic library through which Ganglia calls Python, and it must be compiled and installed before any Ganglia plug-in can be developed against the Python interface. modpython.so lives in the lib (or lib64)/ganglia/ directory under the Ganglia install directory. If it is there, you can move on to writing the scripts below; if not, you will need to rebuild and reinstall gmond with the "--with-python" configure option.
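A quick way to check, assuming the install prefix used throughout this article (adjust the glob if yours differs):

#!/usr/bin/env python
# check_modpython.py -- confirm that gmond was built with Python support
# (the /usr/local/ganglia prefix is an assumption; adjust to your install)
import glob

candidates = glob.glob('/usr/local/ganglia/lib*/ganglia/modpython.so')
if candidates:
    print('modpython.so found: ' + candidates[0])
else:
    print('modpython.so missing -- rebuild gmond with --with-python')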
2.2 Writing the monitoring scripts
Open /etc/gmond.conf under the Ganglia install directory and, in the agent configuration, you will find the line include ("/usr/local/ganglia/etc/conf.d/*.conf"), which means the gmond service scans that directory for monitoring configuration files. Our monitoring configuration scripts therefore go under /etc/conf.d/ with a name of the form XX.conf; as modpython.conf below shows, configurations for Python modules additionally use the .pyconf suffix, so the MongoDB configuration script created in step 2 is named mongodb.pyconf.
1) Review the modpython.conf file
modpython.conf is located in the /etc/conf.d/ directory. Its contents are as follows:

 
modules {
  module {
    name = "python_module"                                        # the main Python module
    path = "modpython.so"                                         # dynamic library Ganglia needs for Python extension scripts
    params = "/usr/local/ganglia/lib64/ganglia/python_modules"    # directory where the Python scripts are stored
  }
}

include ("/usr/local/ganglia/etc/conf.d/*.pyconf")                # where the extension's configuration scripts live
So, to monitor MongoDB through Ganglia's Python extension, we place a configuration script and a Python script in the directories above and restart the Ganglia services, and the MongoDB monitoring is in place. The following steps describe how to write the two scripts.
2) Create the mongodb.pyconf script
Note that root privileges are required to create and edit this script, which is stored under the conf.d directory. For the MongoDB parameters worth collecting, see https://github.com/ganglia/gmond_python_modules/tree/master/mongodb and add or remove entries according to your needs.
 
modules {
  module {
    name = "mongodb"        # module name; it must match the name of the Python script stored under the path given by "/usr/lib64/ganglia/python_modules"
    language = "python"     # declares that the module is written in Python
    # Parameter list: all params are passed to the script's metric_init(params) function as a single dict (i.e. a map).
    param server_status {
      value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(db.serverStatus())'"
    }
    param rs_status {
      value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(rs.status())'"
    }
  }
}

# List of metrics to collect; a module can expose any number of metrics.
collection_group {
  collect_every = 30
  time_threshold = 90       # maximum send interval
  metric {
    name = "mongodb_opcounters_insert"    # the metric's name inside the module
    title = "Inserts"                     # title displayed in the web UI
  }
  metric {
    name = "mongodb_opcounters_query"
    title = "Queries"
  }
  metric {
    name = "mongodb_opcounters_update"
    title = "Updates"
  }
  metric {
    name = "mongodb_opcounters_delete"
    title = "Deletes"
  }
  metric {
    name = "mongodb_opcounters_getmore"
    title = "Getmores"
  }
  metric {
    name = "mongodb_opcounters_command"
    title = "Commands"
  }
  metric {
    name = "mongodb_backgroundFlushing_flushes"
    title = "Flushes"
  }
  metric {
    name = "mongodb_mem_mapped"
    title = "Memory-mapped Data"
  }
  metric {
    name = "mongodb_mem_virtual"
    title = "Process Virtual Size"
  }
  metric {
    name = "mongodb_mem_resident"
    title = "Process Resident Size"
  }
  metric {
    name = "mongodb_extra_info_page_faults"
    title = "Page Faults"
  }
  metric {
    name = "mongodb_globalLock_ratio"
    title = "Global Write Lock Ratio"
  }
  metric {
    name = "mongodb_indexCounters_btree_miss_ratio"
    title = "BTree Page Miss Ratio"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_total"
    title = "Total Operations Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_readers"
    title = "Readers Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_currentQueue_writers"
    title = "Writers Waiting for Lock"
  }
  metric {
    name = "mongodb_globalLock_activeClients_total"
    title = "Total Active Clients"
  }
  metric {
    name = "mongodb_globalLock_activeClients_readers"
    title = "Active Readers"
  }
  metric {
    name = "mongodb_globalLock_activeClients_writers"
    title = "Active Writers"
  }
  metric {
    name = "mongodb_connections_current"
    title = "Open Connections"
  }
  metric {
    name = "mongodb_connections_current_ratio"
    title = "Percentage of Connections Used"
  }
  metric {
    name = "mongodb_slave_delay"
    title = "Replica Set Slave Delay"
  }
  metric {
    name = "mongodb_asserts_total"
    title = "Asserts per Second"
  }
}
As you can see, this configuration file follows the same syntax as gmond.conf, so refer to gmond.conf if anything is unclear.
3) Create the mongodb.py script
Store the mongodb.py file in the lib64/ganglia/python_modules directory. You will find many Python scripts already there, for monitoring disks, memory, the network, MySQL, Redis and so on, and they make good references when writing mongodb.py. Opening a few of them shows that every script defines a metric_init(params) function; as mentioned above, the parameters from mongodb.pyconf are passed to metric_init.
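Concretely, the two param blocks from mongodb.pyconf arrive in metric_init() as a plain dict, roughly like this (the command strings are simply whatever you put in the .pyconf):

params = {
    'server_status': "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(db.serverStatus())'",
    'rs_status': "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(rs.status())'"
}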

 
#!/usr/bin/env python
import json
import os
import re
import socket
import string
import time
import copy

NAME_PREFIX = 'mongodb_'
PARAMS = {
    'server_status' : '/path/to/mongo --host host --port 27017 --quiet --eval "printjson(db.serverStatus())"',
    'rs_status'     : '/path/to/mongo --host host --port 27017 --quiet --eval "printjson(rs.status())"'
}
METRICS = {
    'time' : 0,
    'data' : {}
}
LAST_METRICS = copy.deepcopy(METRICS)
METRICS_CACHE_TTL = 3
def flatten(d, pre = '', sep = '_'):
    """Flatten a dict (i.e. dict['a']['b']['c'] => dict['a_b_c'])"""
    new_d = {}
    for k,v in d.items():
        if type(v) == dict:
            new_d.update(flatten(d[k], '%s%s%s' % (pre, k, sep)))
        else:
            new_d['%s%s' % (pre, k)] = v
    return new_d

def get_metrics():
    """Return all metrics"""
    global METRICS, LAST_METRICS
    if (time.time() - METRICS['time']) > METRICS_CACHE_TTL:
        metrics = {}
        for status_type in PARAMS.keys():
            # get raw metric data
            o = os.popen(PARAMS[status_type])
            # clean up
            metrics_str = ''.join(o.readlines()).strip()  # convert to string
            metrics_str = re.sub(r'\w+\((.*)\)', r"\1", metrics_str)  # remove functions
            # convert to flattened dict
            try:
                if status_type == 'server_status':
                    metrics.update(flatten(json.loads(metrics_str)))
                else:
                    metrics.update(flatten(json.loads(metrics_str), pre='%s_' % status_type))
            except ValueError:
                metrics = {}

        # update cache
        LAST_METRICS = copy.deepcopy(METRICS)
        METRICS = {
            'time': time.time(),
            'data': metrics
        }
    return [METRICS, LAST_METRICS]

def get_value(name):
    """Return a value for the requested metric"""
    # get metrics
    metrics = get_metrics()[0]
    # get value
    name = name[len(NAME_PREFIX):]  # remove prefix from name
    try:
        result = metrics['data'][name]
    except StandardError:
        result = 0
    return result

def get_rate(name):
    """Return change over time for the requested metric"""
    # get metrics
    [curr_metrics, last_metrics] = get_metrics()
    # get rate
    name = name[len(NAME_PREFIX):]  # remove prefix from name
    try:
        rate = float(curr_metrics['data'][name] - last_metrics['data'][name]) / \
               float(curr_metrics['time'] - last_metrics['time'])
        if rate < 0:
            rate = float(0)
    except StandardError:
        rate = float(0)
    return rate

def get_opcounter_rate(name):
    """Return change over time for an opcounter metric"""
    master_rate = get_rate(name)
    repl_rate = get_rate(name.replace('opcounters_', 'opcountersRepl_'))
    return master_rate + repl_rate

def get_globalLock_ratio(name):
    """Return the global lock ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'globalLock_lockTime') / \
                 get_rate(NAME_PREFIX + 'globalLock_totalTime') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_indexCounters_btree_miss_ratio(name):
    """Return the btree miss ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'indexCounters_btree_misses') / \
                 get_rate(NAME_PREFIX + 'indexCounters_btree_accesses') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_connections_current_ratio(name):
    """Return the percentage of connections used"""
    try:
        result = float(get_value(NAME_PREFIX + 'connections_current')) / \
                 float(get_value(NAME_PREFIX + 'connections_available')) * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_slave_delay(name):
    """Return the replica set slave delay"""
    # get metrics
    metrics = get_metrics()[0]
    # no point checking my optime if i'm not replicating
    if 'rs_status_myState' not in metrics['data'] or metrics['data']['rs_status_myState'] != 2:
        result = 0
    # compare my optime with the master's
    else:
        master = {}
        slave = {}
        try:
            for member in metrics['data']['rs_status_members']:
                if member['state'] == 1:
                    master = member
                if member['name'].split(':')[0] == socket.getfqdn():
                    slave = member
            result = max(0, master['optime']['t'] - slave['optime']['t']) / 1000
        except KeyError:
            result = 0
    return result

def get_asserts_total_rate(name):
    """Return the total number of asserts per second"""
    return float(reduce(lambda memo, obj: memo + get_rate('%sasserts_%s' % (NAME_PREFIX, obj)),
                        ['regular', 'warning', 'msg', 'user', 'rollovers'], 0))

def metric_init(lparams):
    """Initialize metric descriptors"""
    global PARAMS
    # set parameters
    for key in lparams:
        PARAMS[key] = lparams[key]
    # define descriptors
    time_max = 60
    groups = 'mongodb'
    descriptors = [
        {
            'name': NAME_PREFIX + 'opcounters_insert',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Inserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Inserts',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_query',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Queries/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Queries',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_update',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Updates/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Updates',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_delete',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Deletes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Deletes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_getmore',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Getmores/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Getmores',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_command',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Commands/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Commands',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'backgroundFlushing_flushes',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Flushes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Flushes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_mapped',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Memory-mapped Data',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_virtual',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Virtual Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_resident',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Resident Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'extra_info_page_faults',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Faults/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Page Faults',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_ratio',
            'call_back': get_globalLock_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Global Write Lock Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'indexCounters_btree_miss_ratio',
            'call_back': get_indexCounters_btree_miss_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'BTree Page Miss Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Operations Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Readers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Writers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Active Clients',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Readers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Writers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Connections',
            'slope': 'both',
            'format': '%u',
            'description': 'Open Connections',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current_ratio',
            'call_back': get_connections_current_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Percentage of Connections Used',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'slave_delay',
            'call_back': get_slave_delay,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Seconds',
            'slope': 'both',
            'format': '%u',
            'description': 'Replica Set Slave Delay',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'asserts_total',
            'call_back': get_asserts_total_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Asserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Asserts',
            'groups': groups
        }
    ]
    return descriptors

def metric_cleanup():
    """Cleanup"""
    pass

# the following code is for debugging and testing
if __name__ == '__main__':
    descriptors = metric_init(PARAMS)
    while True:
        for d in descriptors:
            print (('%s = %s') % (d['name'], d['format'])) % (d['call_back'](d['name']))
        print ''
        time.sleep(METRICS_CACHE_TTL)
Every Ganglia Python extension script must implement two functions: metric_init(params) and metric_cleanup().
metric_init() is called once when the module is initialized and must return a metric descriptor dict, or a list of such dicts; mongodb.py returns a list.

A metric descriptor dict is defined as follows:

d = {'name'        : '<your_metric_name>',   # must match the metric name used in the .pyconf file
     'call_back'   : <call_back function>,
     'time_max'    : int(<your_time_max>),
     'value_type'  : '<string | uint | float | double>',
     'units'       : '<your_units>',
     'slope'       : '<zero | positive | negative | both>',
     'format'      : '<your_format>',
     'description' : '<your_description>'
}
metric_cleanup() is called when the module shuts down; it returns no data.
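Because each metric is just one more descriptor plus a callback, extending the module is mechanical. As a hedged illustration (serverStatus() does report an uptime field, but this particular metric name and pyconf entry are not part of the module above), exposing the server's uptime would look roughly like this:

# Hypothetical extra descriptor appended to the `descriptors` list in metric_init():
# flatten() turns serverStatus()'s 'uptime' field into the key 'uptime', so the
# generic get_value() callback can serve it without any new code.
descriptors.append({
    'name': NAME_PREFIX + 'uptime',
    'call_back': get_value,
    'time_max': time_max,
    'value_type': 'uint',
    'units': 'Seconds',
    'slope': 'both',
    'format': '%u',
    'description': 'Uptime',
    'groups': groups
})
# ...plus a matching entry in mongodb.pyconf so gmond actually collects it:
#   metric {
#     name = "mongodb_uptime"
#     title = "Uptime"
#   }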
4) View the monitoring statistics on the web front end
Once the scripts are finished, restart the gmond service; the MongoDB graphs will then appear in the Ganglia web interface.
