Basic Monitoring: Year-on-Year Alarms
The year-on-year alarms in basic monitoring cover the metrics collected by server monitoring, including load (load1, load5, and load15), average CPU usage, memory usage, intranet/Internet traffic, and port count. For the collection methods, see Basic Monitoring: Server Monitoring.
I. Alarm Principle
Each metric produces one data point per minute. For each of the 10 minutes leading up to the current minute, the value is compared with the average of the values recorded at that same minute over the previous 7 days. If the change exceeds 100% and the absolute difference reaches a threshold M, the point counts as an exception, either an increase or a decrease. If an indicator is abnormal in the same direction for two consecutive rounds (two increases or two declines, not an increase followed by a decline or vice versa), an alarm is sent to the machine owner (O&M/development). For example, at minute 10, the values at minutes 10, 9, 8, ... are compared with the 7-day averages of those same minutes. In practice, the data for one minute is parsed during the following minute, so effectively the previous minute's value is compared against the 7-day averages of the preceding 10-minute window.
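The comparison rule above can be sketched in a few lines. This is a simplified, self-contained sketch, not the production code: the names `compare_point`, `should_alarm`, `RATIO`, and `abs_sub` are illustrative, and the real system reads the ratio and absolute difference M from the threshold configuration table.

```python
RATIO = 1.0  # 100% change threshold; configurable per IP in the real system

def compare_point(cur, avg7, abs_sub, ratio=RATIO):
    """Classify one minute's value against its 7-day average.

    Returns "HIGH", "LOW", or "NULL", mirroring the rule described above:
    an exception requires BOTH change ratio > ratio AND absolute
    difference > abs_sub (the threshold M).
    """
    if avg7 == 0:
        # No meaningful ratio when the average is zero; fall back to M alone.
        if cur > abs_sub:
            return "HIGH"
        if cur < -abs_sub:
            return "LOW"
        return "NULL"
    diff = abs(cur - avg7)
    if diff / avg7 > ratio and diff > abs_sub:
        return "HIGH" if cur > avg7 else "LOW"
    return "NULL"

def should_alarm(results):
    """Alarm only when every point deviates in the same direction."""
    return all(r == "HIGH" for r in results) or all(r == "LOW" for r in results)
```

For instance, a current value of 90 against a 7-day average of 40 (a 125% rise, difference 50) is "HIGH" with M = 10, while 70 against 40 (a 75% rise) is "NULL".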
II. Data Source
Basic monitoring (server monitoring) collects one copy of the data every minute and saves it to Redis. The key is the reportTime, and the value is a hash of the form {ip1: {item1: value1, item2: value2, ...}, ip2: {item1: value1, item2: value2, ...}, ip3: ...}. With one hash per minute, 7 days amount to 7 x 1440 = 10080 records; in practice an extra 10 minutes of data is retained, i.e. 7 days + 10 minutes. Because reportTime is based on each machine's own clock (so that the plotted curves are accurate), machines without an NTP server, or with otherwise inaccurate clocks, may report varied reportTime values, which increases the number of Redis hashes. This does not affect data retrieval, because the year-on-year comparison is aligned with the plots, and the plots also use reportTime. Storing a whole minute in one hash avoids a separate Redis read per IP, which reduces the number of network I/O round trips and greatly speeds up the program. Redis memory usage is large, about 10 GB, so maxmemory in the Redis configuration file must be raised accordingly; otherwise Redis will evict keys that are set to expire automatically.
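The storage layout can be sketched as below. This is a hypothetical, self-contained sketch: a plain dict stands in for Redis so it can run anywhere, and the key format and JSON encoding of per-IP items are assumptions; with redis-py the write would be `client.hset(report_time, ip, json.dumps(items))` and the read `client.hgetall(report_time)`.

```python
import json

store = {}  # stand-in for Redis: {reportTime: {ip: json_string}}

def save_minute(report_time, samples):
    """Store one minute's collection: samples is {ip: {item: value, ...}}."""
    h = store.setdefault(report_time, {})
    for ip, items in samples.items():
        h[ip] = json.dumps(items)

def load_minute(report_time):
    """Equivalent of HGETALL on one minute's hash, decoding each IP's items."""
    return {ip: json.loads(v) for ip, v in store.get(report_time, {}).items()}

save_minute("201703011200", {"10.0.0.1": {"load1": 0.7, "cpu": 12.5}})
minute = load_minute("201703011200")
# 7 days of one-minute hashes: 7 * 1440 == 10080 (plus the extra 10 minutes)
```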
III. Program Design
1. DB Design
The program needs a data source store (Redis), an exception display table (MySQL), a threshold configuration table (MySQL), and a last-status store (Redis). The exception display table stores the exception description and duration for every IP, for display on a page. The threshold configuration table stores the threshold configuration for every IP: the change ratio, the absolute difference for each indicator, and whether the IP is monitored. The last status is used to decide whether an alarm is required: an alarm fires only when the same kind of exception occurs twice in a row, so the previous status is kept in Redis.
mysql> show tables;
+-----------------------------------+
| Tables_in_machineMonitor_overLast |
+-----------------------------------+
| currentDisplay                    |
| monitorConf                       |
+-----------------------------------+
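The "same exception twice in a row" rule against the last-status store can be sketched as follows. This is an illustrative sketch only: a plain dict stands in for the Redis last-status store, and the constant names (`HIGH_ERR`, `LOW_ERR`, `OK`) are assumed, not taken from the real configuration.

```python
HIGH_ERR, LOW_ERR, OK = "HIGH_ERR", "LOW_ERR", "OK"

last_cache = {}  # stand-in for the Redis last-status store: {ip: {item: status}}

def update_and_check(ip, item, status):
    """Cache this round's status; return True when an alarm should fire.

    An alarm fires only when the new status is an exception AND it matches
    the previous round's cached status (two consecutive HIGHs or LOWs).
    """
    prev = last_cache.setdefault(ip, {}).get(item, OK)
    last_cache[ip][item] = status
    return status in (HIGH_ERR, LOW_ERR) and status == prev
```

The first exception of a kind only primes the cache; the second consecutive one triggers the alarm, and a direction change resets the sequence.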
2. Import Test Data
For testing, the data was imported from MySQL into Redis. Once the program was finished, the live data was fed in by modifying the reporting CGI of "server monitoring". For the pitfalls encountered during the test import, see the article on Python's pitfalls when handling JSON and Redis hashes.
def initRedis(client):
    if client not in CONF.redisInfo:
        raise RedisNotFound("can not found redis %s in config" % client)
    try:
        pool = redis.ConnectionPool(**CONF.redisInfo[client])  # thread-safe
        red = redis.Redis(connection_pool=pool)
        red.randomkey()  # connectivity check
    except Exception, e:
        raise RedisException("connect redis %s failed: %s" % (client, str(e)))
    return red


def initDb(client):
    if client not in CONF.dbInfo:
        raise MysqlNotFound("can not found mysql db %s in config" % client)
    try:
        db = opMysql.OP_MYSQL(**CONF.dbInfo[client])
    except Exception, e:
        raise MysqlException("connect mysql %s failed: %s" % (client, str(e)))
    code, errMsg = db.connect()
    if 0 != code:
        raise MysqlException("connect mysql %s failed: %s" % (client, errMsg))
    return db
3. Alarm Interface
The "alarm platform" interface is called. Once the project is configured there, it can send RTX, SMS, and WeChat messages; the alarm history can be viewed on a page, and the configuration can easily be modified or alarms temporarily muted. The interface is called directly with the urllib/urllib2 libraries.
postData = urllib.urlencode(values)
req = urllib2.Request(CONF.amcUrl, postData)
response = urllib2.urlopen(req, timeout=CONF.AMC_TIME_OUT)
amcDict = json.loads(response.read())
code = int(amcDict["returnCode"])
errMsg = amcDict["errorMsg"]
4. Parse the Source Data
Convert the raw data into a recognizable dictionary, checking for errors and rejecting invalid data.
5. Calculate the Average Value with numpy and pandas
The parsed source data is stored in a pandas.DataFrame; missing points are stored as numpy.nan. pandas.DataFrame().mean().round(2) computes the average, rounded to 2 decimal places. If none of the 7 days has data for a given minute, the current value is not evaluated. If only some of the 7 days are missing, those points should be excluded and the average taken over the remaining days; storing NaN achieves this automatically, because mean() skips NaN. A custom function then compares each column: if the change reaches 100% (configurable) and the absolute difference reaches M (configurable), it flags the point as high or low; otherwise it does not. If all the comparison results within the window agree, the exception condition is met.
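A small self-contained demonstration of the NaN behaviour described above. The data is made up; the point is that DataFrame.mean() skips NaN by default, so a minute with a missing day is averaged over the days that do exist.

```python
import numpy
import pandas

# Two indicators over three days at one minute; day2 of "cpu" is missing.
vd = {"cpu": {"day1": 10.0, "day2": numpy.nan, "day3": 14.0},
      "load1": {"day1": 1.0, "day2": 2.0, "day3": 3.0}}
vf = pandas.DataFrame(vd)
avg = vf.mean().round(2)  # NaN is excluded: cpu -> (10 + 14) / 2 = 12.0
```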
Initialize DataFrame
for item in vd:
    if value is None or value[item] is None:
        vd[item][lastDayKey] = numpy.nan
    else:
        vd[item][lastDayKey] = value[item]
vf = pandas.DataFrame(vd)
columns.append(vf.mean().round(2))
indexes.append(lastMinKey)
self.ipInfo[ip]["lastData"] = pandas.DataFrame(columns, index=indexes)
Compare with a pandas custom function and determine whether an alarm is required
for item in curValue:
    if curValue[item] is None:  # error
        continue
    else:
        curValue[item] = round(curValue[item], 2)

    def overLastCompute(v2, absSub):
        """
        :param v2: float
        :param absSub: absolute subtract
        :return: high/low/null
        """
        v1 = curValue[item]
        v2 = round(v2, 2)
        if 0 == v2:
            if v1 > absSub:
                return "HIGH"
            if v1 < -absSub:
                return "LOW"
            return "NULL"
        subVal = abs(v1 - v2)
        if subVal / v2 > CONF.RATIO and subVal > absSub:
            if v1 > v2:
                return "HIGH"
            return "LOW"
        return "NULL"

    self.ipInfo[ip]["result"][item] = self.ipInfo[ip]["lastData"][item].apply(
        overLastCompute, absSub=self.monitorConf[ip][item])
    res = self.ipInfo[ip]["result"][item] == "HIGH"  # Series
    if all(i for i in res):
        resErr[item] = CONF.HIGH_ERR
        if CONF.HIGH_ERR == self.lastCache[str(ip)][item]:
            # will alert if switch on
            pass
    else:
        res = self.ipInfo[ip]["result"][item] == "LOW"
        if all(i for i in res):
            resErr[item] = CONF.LOW_ERR
            if CONF.LOW_ERR == self.lastCache[str(ip)][item]:
                # will alert if switch on
                pass
6. Because there are many IP addresses and the main cost lies in parsing the data and the pandas computation, multiple processes are used to take advantage of the many CPU cores, with each process running a thread pool so that no process sits idle.
step = ipNums / multiprocessing.cpu_count()
ipList = list()
i = 0
j = 1
processList = list()
for ip in self.ipInfo:
    ipS = str(ip)
    if ipS not in self.lastCache:
        self.lastCache[ipS] = copy.deepcopy(self.value)
    ipList.append(ip)
    i += 1
    if i == step * j or i == ipNums:
        j += 1

        def innerRun():
            wm = Pool.ThreadPool(CONF.POOL_SIZE)
            for myIp in ipList:
                kw = dict(ip=myIp, handlerKey=myIp)
                wm.addJob(self.handleOne, **kw)
            wm.waitForComplete()
            ipListNums = len(ipList)
            for tmp in xrange(ipListNums):
                res = wm.getResult()
                if res:
                    handlerKey, code, handlerRet, errMsg = res
                    if 0 != code:
                        continue
                    self.lastCache[str(handlerKey)] = handlerRet

        process = multiprocessing.Process(target=innerRun)
        process.start()
        processList.append(process)
        ipList = list()
for process in processList:
    process.join()
IV. Optimization
Metric monitoring v2 is an upgraded version of the year-on-year alarm, and its data volume will be several times larger. The planned optimizations are as follows:
1. Replace Redis with HBase.
2. Make the program distributed, with disaster tolerance across Beijing, and scale horizontally by running multiple sets of the program in parallel.