Basic monitoring-year-on-year alarms and year-on-year alarms

Source: Internet
Author: User

The year-on-year alarms of basic monitoring are built on the data collected by server monitoring, including load (load1, load5, and load15), average CPU usage, memory usage, intranet/Internet traffic, and port count. For the specific collection methods, see "Basic monitoring-server monitoring".

I. Alarm Principle

One data point is recorded per minute for each metric. The value of the current minute is compared with the 7-day averages of each of the preceding 10 minutes. If the deviation exceeds 100% and the absolute difference reaches M, the value is treated as an exception, either an increase or a decrease. If a metric shows two consecutive increases or two consecutive decreases (an increase followed by a decrease, or a decrease followed by an increase, does not count), an alarm is sent to the machine owner (O&M/development).

For example, if the current value is on the 10th, it is compared with the averages for the preceding 10 minutes, each average being taken over the seven days 10, 9, 8 ... 4. In practice, the data for a given minute is parsed during the following minute, so the "current value" is effectively that of the previous minute, and it is compared with the 7-day averages of the 10 minutes before it.
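The rule can be sketched in a few lines of Python 3. This is only an illustration under assumptions, not the author's implementation: the names `classify`, `should_alarm`, `ratio` and `min_diff` (standing for M) are hypothetical, and a baseline of zero is judged by the absolute difference alone.

```python
def classify(current, baselines, ratio=1.0, min_diff=5.0):
    """Return "HIGH", "LOW" or None for one metric at one minute.

    baselines: the 7-day averages of the 10 preceding minutes.
    ratio=1.0 stands for the 100% threshold, min_diff for M.
    """
    def one(base):
        diff = abs(current - base)
        if diff <= min_diff:
            return None
        if base != 0 and diff / abs(base) <= ratio:
            return None
        return "HIGH" if current > base else "LOW"

    results = {one(b) for b in baselines}
    # An exception requires every baseline to agree on the direction.
    if len(results) == 1:
        return results.pop()
    return None


def should_alarm(previous, current):
    """Alarm only on two consecutive exceptions in the same direction."""
    return current is not None and current == previous
```

One consequence worth noting: for a non-negative metric and a positive baseline, the relative decrease can never exceed 100%, so with `ratio=1.0` only increases can be flagged; drops need a lower configured ratio.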

II. Data Source

Basic monitoring-server monitoring collects one set of data per minute and saves it to Redis. The storage format is reportTime-hash, where the hash has the form {ip1: {item1: value, item2: value2 ...}, ip2: {item1: value, item2: value2 ...}, ip3 ...}. There is one Redis hash per minute, for a total of 7 x 1440 = 10080 hashes in 7 days. In practice an extra 10 minutes of data is retained, that is, 7 days + 10 minutes.

reportTime is reported according to the machine's own clock (so that the charts are drawn accurately). Some machines have no NTP server configured, or have inaccurate clocks for other reasons, so reportTime values can vary and the number of Redis hashes grows as a result. This does not affect data retrieval, because the year-on-year alarm compares the same data that the charts use, and the charts are drawn by reportTime.

Storing the data in hashes means the program does not have to read Redis once per IP address, which reduces network I/O and greatly speeds up the program. Redis uses a lot of memory for this data, about 10 GB, so maxmemory in the Redis configuration file must be adjusted accordingly; otherwise Redis will randomly delete keys that have an expiry set.
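A minimal Python 3 sketch of this layout follows. It assumes a redis-py client (version 3.5 or later, for `hset(..., mapping=...)`); the key prefix `report:` and the function names are hypothetical. Because Redis hash fields hold flat strings, the per-IP dictionary is JSON-encoded here, which is an assumption about the encoding rather than something the article states.

```python
import json

RETENTION_SECONDS = (7 * 1440 + 10) * 60  # 7 days + 10 minutes


def _text(value):
    """redis-py returns bytes unless decode_responses=True is set."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def minute_key(report_time):
    """Hypothetical key format: one hash per reported minute."""
    return "report:%s" % report_time


def store_minute(client, report_time, per_ip_metrics):
    """Write all IPs of one minute into a single Redis hash."""
    key = minute_key(report_time)
    mapping = {ip: json.dumps(items) for ip, items in per_ip_metrics.items()}
    client.hset(key, mapping=mapping)
    client.expire(key, RETENTION_SECONDS)
    return key


def load_minute(client, report_time):
    """Read every IP of one minute with a single round trip."""
    raw = client.hgetall(minute_key(report_time))
    return {_text(ip): json.loads(_text(val)) for ip, val in raw.items()}
```

With this layout, one `HGETALL` per minute returns the metrics of every IP address, instead of one read per IP address.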

III. Program Design

1. DB Design

Four stores are required: data source storage (Redis), an exception display table (MySQL), a threshold configuration table (MySQL), and the last status (Redis). The exception display table stores the exception description and duration for every IP address, so that they can be shown on the page. The threshold configuration table stores the threshold configuration of every IP address: the exception ratio, the absolute difference for each metric, and whether the IP address is monitored at all. The last status is used to decide whether an alarm is needed: an alarm is sent only when the same kind of exception occurs twice in a row, so the previous result is kept in Redis.

mysql> show tables;
+-----------------------------------+
| Tables_in_machineMonitor_overLast |
+-----------------------------------+
| currentDisplay                    |
| monitorConf                       |
+-----------------------------------+

2. Import Test Data

The test data is imported from MySQL into Redis. Online data is written by modifying the reporting CGI of "server monitoring" once the program is complete. For the pitfalls encountered during the test import, refer to "python's pitfalls for processing json and redis hash".

def initRedis(client):
    if client not in CONF.redisInfo:
        raise RedisNotFound("can not found redis %s in config" % client)
    try:
        pool = redis.ConnectionPool(**CONF.redisInfo[client])  # thread safe
        red = redis.Redis(connection_pool=pool)
        red.randomkey()  # check
    except Exception, e:
        raise RedisException("connect redis %s failed: %s" % (client, str(e)))
    return red


def initDb(client):
    if client not in CONF.dbInfo:
        raise MysqlNotFound("can not found mysql db %s in config" % client)
    try:
        db = opMysql.OP_MYSQL(**CONF.dbInfo[client])
    except Exception, e:
        raise MysqlException("connect mysql %s failed: %s" % (client, str(e)))
    code, errMsg = db.connect()
    if 0 != code:
        raise MysqlException("connect mysql %s failed: %s" % (client, errMsg))
    return db

3. Alarm Interface

Alarms are sent by calling the "alarm platform" interface. Once the project is configured there, it can send RTX, SMS, and WeChat messages. Its page shows the alarm history and makes it easy to modify the configuration or temporarily mute alarms. The interface can be called directly with the urllib/urllib2 libraries.

postData = urllib.urlencode(values)
req = urllib2.Request(CONF.amcUrl, postData)
response = urllib2.urlopen(req, timeout=CONF.AMC_TIME_OUT)
amcDict = json.loads(response.read())
code = int(amcDict["returnCode"])
errMsg = amcDict["errorMsg"]

4. Parse Source Data

Convert the source data into a dictionary the program can work with, check it for errors, and discard bad records.
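As a rough Python 3 illustration of this step, the sketch below turns one minute's raw hash into per-IP dictionaries and rejects malformed records. The item names in `METRICS`, the function name `parse_minute`, and the JSON encoding of the hash values are all assumptions made for the example.

```python
import json

METRICS = ("load1", "load5", "load15", "cpu", "mem")  # hypothetical item names


def parse_minute(raw_hash):
    """Turn {ip: json_string} into {ip: {item: float or None}}."""
    parsed = {}
    for ip, payload in raw_hash.items():
        try:
            items = json.loads(payload)
        except (TypeError, ValueError):
            continue  # not valid JSON: reject the whole record
        if not isinstance(items, dict):
            continue  # valid JSON but not the expected shape
        clean = {}
        for name in METRICS:
            value = items.get(name)
            # bool is a subclass of int, so exclude it explicitly
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                clean[name] = float(value)
            else:
                clean[name] = None  # missing or malformed metric
        parsed[ip] = clean
    return parsed
```

Keeping a missing metric as `None` rather than dropping the IP address lets a later step decide how to treat the gap, for example by substituting NaN before averaging.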

5. Calculate the Average Value Using numpy and pandas

The parsed source data is stored in a pandas.DataFrame; where a data point does not exist, numpy.nan is used in its place. pandas.DataFrame().mean().round(2) then computes the average, rounded to 2 decimal places. If none of the seven days has data for a given minute, the current value is not evaluated. If only some of the seven days are missing, those points are excluded and the divisor of the average shrinks accordingly; substituting NaN for the missing points gives exactly this behavior in mean().

A custom function is then applied to each column to check whether the current value has risen or fallen by 100% (configurable) with an absolute difference reaching M (configurable). The function returns a result for each compared minute; if the results for all of them indicate the same exception, the exception condition is met.

Initialize DataFrame

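# Excerpt: the enclosing loops (over the 7 days and the 10 minutes) are not shown.
for item in vd:
    if value is None or value[item] is None:
        vd[item][lastDayKey] = numpy.nan
    else:
        vd[item][lastDayKey] = value[item]

# after all 7 days of one minute have been filled in
vf = pandas.DataFrame(vd)
columns.append(vf.mean().round(2))
indexes.append(lastMinKey)

# after all 10 minutes have been processed
self.ipInfo[ip]["lastData"] = pandas.DataFrame(columns, index=indexes)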
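The effect of substituting numpy.nan for missing points can be seen in a small standalone Python 3 example. The numbers and the column names are made up for illustration; only `DataFrame.mean()` and `round()` come from the text.

```python
import numpy as np
import pandas as pd

# Rows are the 7 days, columns are metrics, for one minute of the day.
# np.nan marks days on which no value was reported (sample numbers only).
one_minute = pd.DataFrame({
    "load1": [1.0, 2.0, np.nan, 3.0, np.nan, 2.0, 2.0],
    "cpu": [np.nan] * 7,
})

# mean() skips NaN by default, so load1 is averaged over 5 days, not 7.
averages = one_minute.mean().round(2)

# A metric with no data on any of the 7 days stays NaN and has no baseline.
has_baseline = averages.notna()
```

Here `load1` averages to 2.0 over the five days that reported a value, while `cpu` has no baseline and would be skipped.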

Compare with a pandas user-defined function and decide whether an alarm is required

for item in curValue:
    if curValue[item] is None:  # error
        continue
    else:
        curValue[item] = round(curValue[item], 2)

        def overLastCompute(v2, absSub):
            """
            :param v2: float
            :param absSub: absolute subtract
            :return: high/low/null
            """
            v1 = curValue[item]
            v2 = round(v2, 2)
            if 0 == v2:
                if v1 > absSub:
                    return "HIGH"
                if v1 < -absSub:
                    return "LOW"
                return "NULL"
            subVal = abs(v1 - v2)
            if subVal / v2 > CONF.RATIO and subVal > absSub:
                if v1 > v2:
                    return "HIGH"
                return "LOW"
            return "NULL"

        self.ipInfo[ip]["result"][item] = self.ipInfo[ip]["lastData"][item].apply(overLastCompute, absSub=self.monitorConf[ip][item])
        res = self.ipInfo[ip]["result"][item] == "HIGH"  # Series
        if all(i for i in res):
            resErr[item] = CONF.HIGH_ERR
            if CONF.HIGH_ERR == self.lastCache[str(ip)][item]:
                # will alert if switch on
                pass
        else:
            res = self.ipInfo[ip]["result"][item] == "LOW"
            if all(i for i in res):
                resErr[item] = CONF.LOW_ERR
                if CONF.LOW_ERR == self.lastCache[str(ip)][item]:
                    # will alert if switch on
                    pass

6. Because there are many IP addresses and the main work is parsing data and pandas computation, the program uses multiple processes to take advantage of all CPU cores, and each process runs a thread pool so that it stays fully loaded and no process resources are wasted.

step = ipNums / multiprocessing.cpu_count()
ipList = list()
i = 0
j = 1
processList = list()
for ip in self.ipInfo:
    ipS = str(ip)
    if ipS not in self.lastCache:
        self.lastCache[ipS] = copy.deepcopy(self.value)
    ipList.append(ip)
    i += 1
    if i == step * j or i == ipNums:
        j += 1

        def innerRun():
            wm = Pool.ThreadPool(CONF.POOL_SIZE)
            for myIp in ipList:
                kw = dict(ip=myIp, handlerKey=myIp)
                wm.addJob(self.handleOne, **kw)
            wm.waitForComplete()
            ipListNums = len(ipList)
            for tmp in xrange(ipListNums):
                res = wm.getResult()
                if res:
                    handlerKey, code, handlerRet, errMsg = res
                    if 0 != code:
                        continue
                    self.lastCache[str(handlerKey)] = handlerRet

        process = multiprocessing.Process(target=innerRun)
        process.start()
        processList.append(process)
        ipList = list()
for process in processList:
    process.join()
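The same process-plus-thread-pool pattern can be sketched in Python 3 with the standard library alone. This is a simplified illustration, not a port of the code above: `split_evenly`, `handle_chunk` and `run_all` are hypothetical names, `concurrent.futures.ThreadPoolExecutor` stands in for the article's own `Pool.ThreadPool`, and `handle_one` is assumed to be a top-level function that persists its own results (for example to Redis), since values computed in a child process are not visible to the parent.

```python
import multiprocessing
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into at most `parts` chunks whose sizes differ by <= 1."""
    items = list(items)
    if not items:
        return []
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for n in range(parts):
        end = start + size + (1 if n < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def handle_chunk(ips, handle_one, pool_size=4):
    """Process one chunk of IPs with a thread pool; return {ip: result}."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return dict(zip(ips, pool.map(handle_one, ips)))


def run_all(ips, handle_one):
    """Start one process per chunk and wait for all of them."""
    workers = [
        multiprocessing.Process(target=handle_chunk, args=(chunk, handle_one))
        for chunk in split_evenly(ips, multiprocessing.cpu_count())
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```

Splitting with `divmod` keeps the number of chunks at or below the CPU count and the chunk sizes within one of each other, so no process is left with a disproportionate share of the IP addresses.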

IV. Optimization

Metric monitoring v2 is an upgraded version of the year-on-year alarm, and its data volume will be several times larger. The current optimizations are as follows:

1. Replace Redis with HBase.

2. Run the program in a distributed manner with Beijing-wide disaster tolerance, and scale out horizontally by running multiple sets of the program in parallel.

 
