A mature automated operations system should include at least three subsystems:
Computer Room Equipment Data System (EMDB)
1. Records information about the servers and network equipment in the computer room, such as machine model, hard disk size, OS type, owning application, running state, and the room name, rack, and position. This is the most basic database. Its main purpose is to give each machine a unified set of labels along several dimensions so that other systems can use them.
2. Provides a variety of query APIs with permission control. The purpose is to let the various upper-layer systems call it, usually through a REST or XML interface. Wrapper libraries for the various languages are then built on top of these interfaces.
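The label-based lookup described for EMDB can be sketched as follows. This is a minimal in-memory illustration, not a real EMDB schema or API: the `Machine` fields, the `Emdb` class, and the `query` method are all hypothetical names chosen for the example, and permission control is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """One equipment record; the field names are illustrative assumptions."""
    hostname: str
    model: str
    os_type: str
    application: str
    state: str
    room: str
    rack: str


class Emdb:
    """Toy in-memory stand-in for the equipment database."""

    def __init__(self, machines):
        self._machines = list(machines)

    def query(self, **labels):
        """Return the machines whose fields equal every given label."""
        unknown = set(labels) - set(Machine.__dataclass_fields__)
        if unknown:
            raise KeyError(f"unknown label(s): {sorted(unknown)}")
        return [
            m for m in self._machines
            if all(getattr(m, key) == value for key, value in labels.items())
        ]


emdb = Emdb([
    Machine("web01", "R720", "CentOS 6", "shop", "running", "room-a", "r12"),
    Machine("web02", "R720", "CentOS 6", "shop", "maintenance", "room-a", "r13"),
    Machine("db01", "R910", "CentOS 6", "billing", "running", "room-b", "r02"),
])

# An upper-layer system asks: which running machines serve the "shop" application?
running_shop = emdb.query(application="shop", state="running")
print([m.hostname for m in running_shop])
```

In a real deployment the same query would sit behind a REST or XML endpoint, and the per-language wrapper libraries would expose something like `query` to callers.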
Application Monitoring System (Appmonitor)
1. A unified data acquisition module collects equipment operating information, including disk IO, network traffic, CPU utilization, and, for network equipment, session counts and PPS (packets per second). On network devices, acquisition can generally be done through SNMP; on servers, it is generally done through a custom agent. The agent's most basic ability is collecting server running data, but its most important ability is executing scripts in various scripting languages and using them to carry out operations on the server (such as changing configuration, analyzing application logs, and outputting the results).
2. Monitoring data storage and visualization. The acquisition module collects a large volume of data, but the transactional requirements are low, so storage can be implemented with NoSQL databases such as HBase or Cassandra. Data visualization is a deep, application-level topic. The monitoring system itself generally implements only basic graph display with time-range selection and comparison, and leaves more complex visualization to be built on its APIs.
3. Adding monitoring items and alarm notification. Monitoring items form a hierarchy, not a flat list; the configuration of an upper node can be overridden by the configuration of a lower node. For network devices, a monitoring item is an OID. For servers, with the help of the underlying data acquisition module, a monitoring item is essentially a script. Items are divided into standard and custom monitoring items. Standard items should be as general as possible and cover CPU, memory, disk, network, and similar information. Custom items can be implemented in any of the common system administration scripting languages (Shell, Python, Perl), as long as the script output conforms to an agreed specification, generally line-structured text or a JSON string. Each monitoring item has warn and crit alarm thresholds and several alarm contacts; thresholds are generally numeric but in special cases can be strings. When an item exceeds its threshold, an alarm is sent to the contacts by SMS, email, or IM software. Alarm sending must support merging alarms, frequency control, and turning alarms off. Otherwise, a small fault can trigger thousands of alarms, and the alarms lose their effect.
4. Monitoring APIs with permission control. The approach and purpose are the same as for EMDB. Interfaces are opened for retrieving monitoring data, sending alarm messages, and pushing configuration. The main purpose is to let the outside world use the data inside the monitoring system, to build more attractive and complex visualizations on it, or to implement more customized monitoring and alarms. The secondary purpose is to support unified operations on servers, such as upgrading the system software version on all of the company's machines at once. It is recommended that the APIs for unified operations be open to only a few people and that their permissions be tightly controlled.
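Two ideas from the monitoring description, hierarchical monitoring items whose lower nodes override upper ones and frequency control on alarms, can be sketched roughly as follows. The class and function names (`MonitorNode`, `severity`, `AlarmThrottle`), the threshold values, and the 300-second interval are assumptions made for illustration, not part of any particular monitoring product.

```python
class MonitorNode:
    """Node in the monitoring-item tree; a lower node's settings override its parent's."""

    def __init__(self, name, parent=None, **settings):
        self.name = name
        self.parent = parent
        self.settings = settings

    def effective(self):
        """Merge settings from the root down, so the lowest node wins."""
        merged = self.parent.effective() if self.parent else {}
        merged.update(self.settings)
        return merged


def severity(value, config):
    """Classify a numeric sample against the warn and crit thresholds."""
    if value >= config["crit"]:
        return "crit"
    if value >= config["warn"]:
        return "warn"
    return "ok"


class AlarmThrottle:
    """Allow at most one alarm per (item, severity) within `interval` seconds."""

    def __init__(self, interval):
        self.interval = interval
        self._last_sent = {}

    def should_send(self, item, level, now):
        key = (item, level)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.interval:
            return False  # still inside the quiet period, suppress the repeat
        self._last_sent[key] = now
        return True


# Hypothetical hierarchy: company-wide defaults, then an application, then one item.
root = MonitorNode("all-servers", warn=80, crit=95, contacts=["ops"])
shop = MonitorNode("shop", parent=root, warn=70)
web01_cpu = MonitorNode("web01:cpu", parent=shop, contacts=["ops", "shop-dev"])

config = web01_cpu.effective()
throttle = AlarmThrottle(interval=300)

sent = []
# (timestamp in seconds, CPU utilization in percent)
for now, cpu in [(0, 75), (60, 76), (120, 97), (400, 78)]:
    level = severity(cpu, config)
    if level != "ok" and throttle.should_send(web01_cpu.name, level, now):
        sent.append((now, level))

print(config)
print(sent)
```

Here the second warn sample at 60 seconds is suppressed because it falls inside the quiet period, while the crit sample is sent immediately because it is tracked under its own key. Merging alarms and switching them off would need additional state that this sketch leaves out.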
Release and Online Configuration Management System (Releasemanager)
1. Application release and dependency library version management. Application release is an important link between operations and development. The release system is generally tightly integrated with the SVN system: the SVN system holds the list of applications that are online, and EMDB holds the application running on each machine. The release system uses this data to publish the application packages and their dependencies generated within the SVN system, and to manage and control the versions of those application and dependency packages, so that a release can be rolled back to the previous version if there is a problem.
2. Online configuration management, similar in function to Puppet on Linux. It mainly handles version control, distribution, and consistency maintenance of key configuration files on application servers. Large applications generally run as a cluster of servers, which requires the application configuration to be consistent across them, but sometimes an application is undergoing a grayscale release, or someone changes a configuration by mistake. The online configuration management system therefore needs a unified entry point for configuration changes that supports grayscale release while correcting mistaken configuration changes. The operations themselves can be performed through the Appmonitor interface.
On the basis of these three systems, more automation becomes possible. For example, finance staff can use the data in EMDB to calculate CAPEX and OPEX accurately. Computer room managers can use EMDB together with OOB (out-of-band) management to remotely shut down machines, reinstall systems, and maintain network equipment, so machines can be managed without anyone on site and the on-site work can be outsourced. Application developers can call Releasemanager from the SVN system to package, publish, and roll back applications. Application maintainers can call the monitoring system to obtain data and alarm information and, by writing scripts, automate the handling of some simple alarms and improve efficiency.
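The consistency check that online configuration management performs across a cluster, while tolerating servers that are deliberately on a grayscale version, might look roughly like the sketch below. The `find_drift` function, the host names, and the sample configuration text are hypothetical; a real system would fetch the deployed files through the agent instead of holding them in a dictionary.

```python
import hashlib


def digest(text):
    """Fingerprint a configuration file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_drift(expected, deployed, grayscale=()):
    """Return the hosts whose config differs from `expected`, skipping grayscale hosts."""
    want = digest(expected)
    return sorted(
        host for host, text in deployed.items()
        if host not in grayscale and digest(text) != want
    )


expected = "max_connections=200\n"
deployed = {
    "web01": "max_connections=200\n",
    "web02": "max_connections=500\n",  # changed by hand, by mistake
    "web03": "max_connections=300\n",  # intentionally on the grayscale version
}

drifted = find_drift(expected, deployed, grayscale={"web03"})
print(drifted)
```

Hosts reported as drifted are candidates for having the expected version pushed back to them through the unified configuration entry point.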