ITIL-Based SCOM Monitoring Best Practices

Source: Internet
Author: User


1. Monitoring by system type

Many people who use SCOM for monitoring only import management packs and push agents. SCOM then monitors according to its default class objects. Under the Windows Computer class, for example, you can look at only one computer at a time: click a Windows computer and you drill down into the details of that machine, such as the state of its disks, CPU, memory, and databases.

This approach is too narrow and makes overall statistics difficult. If an enterprise has many business systems, each with many machines, how should it calculate the overall operational SLA of a business system? By reading the data machine by machine and aggregating it by hand? SCOM actually offers a better solution.

If you have studied SCOM in depth, you know that its monitoring is based on class objects. When you create a monitor manually or configure APM monitoring, you have to specify the class object to target. That target can in fact be a group, so that the group as a whole is monitored.

For example, if a business system has a middle tier, a web tier, and a database tier, we can create three child groups in SCOM, one per tier, and add each physical computer of the business system to the matching group. Then we create a parent group above them. If the business system is called the scientific research system, create a parent group named after it and add the middle-tier, web-tier, and database-tier groups to it as subgroups.

Once this is done, what you monitor is no longer just a computer or a disk but the system as a whole.

In the Monitoring view of the SCOM console you can then create a separate folder for each business system. The scientific research system folder, for example, can contain a performance view, a diagram view, an alert view, and an event view.

In the diagram view under the scientific research system folder, add the parent group of the scientific research system as the displayed object. The diagram then shows the parent group at the top, the three child groups below it, and the Windows computers below those. An administrator who needs to inspect the system can simply check the state of the top-level parent group, because the parent group rolls up the states of the child groups and of the objects inside them. From the parent group the administrator can expand a child group, then a computer in that group, and then the database or IIS state under that computer. The whole inspection path is intuitive, complete, and convenient.

Similarly, create an alert view in the scientific research system folder and scope it to the parent group. The alert view in that folder then shows every alert for the entire scientific research system and nothing from other systems.

The performance view works the same way. The administrator can create performance views in each business system's folder and scope them to the parent group, after which the performance counters of all child groups and of the objects in them appear in the view. The set of counters shown is customizable, and the selection is saved.

There are many other view types, but they are set up the same way, so I will not go through them one by one. The idea behind system-level monitoring is this: add objects to child groups so that their state and monitoring information roll up to the child group; add the child groups to a parent group so that everything rolls up again to the parent group; then, when the parent group is chosen as the displayed object, the child groups and objects under it are shown in their cascading hierarchy.
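The rollup idea can be illustrated with a small Python sketch. It is not SCOM code: the `Node` class, the three health states, and the "worst state wins" policy are assumptions made for illustration, and the computer names are invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum


class Health(IntEnum):
    """Health states ordered so that a larger value is worse (assumed model)."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


@dataclass
class Node:
    """A group or a monitored object in the hierarchy (hypothetical model)."""
    name: str
    own_health: Health = Health.HEALTHY
    children: list[Node] = field(default_factory=list)

    def rolled_up_health(self) -> Health:
        """Return the worst state among this node and everything below it."""
        states = [self.own_health]
        states += [child.rolled_up_health() for child in self.children]
        return max(states)


def build_example() -> Node:
    """Build the three-tier example system with invented computer names."""
    web = Node("Web tier", children=[Node("WEB01"), Node("WEB02", Health.WARNING)])
    mid = Node("Middle tier", children=[Node("APP01")])
    db = Node("Database tier", children=[Node("SQL01", Health.CRITICAL)])
    return Node("Scientific research system", children=[web, mid, db])
```

Calling `build_example().rolled_up_health()` reports the state of the whole system in one value, which is what the administrator reads off the top of the diagram view.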

This is the first step in optimizing SCOM monitoring along ITIL lines. With it, SCOM monitors at the system level, which is more intuitive, more integrated, and closer to what the business needs.

2. Classifying alerts and assigning them to staff

A good monitoring view is not enough, of course. When a monitored object has a performance problem or a fault, who should handle it, and how can SCOM's features be used to classify such events effectively?

SCOM has a very simple answer: notification subscriptions, which can be hooked up to the enterprise's existing SMTP server with a little configuration.

First, create a set of security groups in Active Directory, for example a monitoring group responsible for monitoring and inspection, an application group for development, a server group responsible for server O&M, and an information security group that concerns itself only with major issues. Then add the users with the corresponding responsibilities to each group.

That completes the first step. Back in SCOM, we create a different subscription for each group in order to classify alerts.

Take the server group. It may need to monitor and manage every server in the company, so its subscription can cover alerts from all system objects. The server group may not be familiar with development, however, so APM-related alerts can be left out of its subscription. When creating a subscription you can define the alert criteria, the times at which notifications are sent, and the recipients.

The application group is likely to need fewer alerts than the server group. Application staff may only need to make sure that their IIS, SQL, Oracle, and APM-monitored business systems have no performance problems or faults. Its subscription criteria can therefore be limited to IIS, SQL, Oracle, and APM. The application group then receives every alert for those roles and nothing else.

The information security group needs fewer alerts still. It usually sits at the management level and cares only about the availability of the main systems; as long as availability is not affected, it does not need to be told. Its subscription criteria therefore only need to match the main business systems with a severity of Critical. The information security group is then notified only when a major business system raises a critical alert, which generally means that the system is unavailable.

As these examples show, a SCOM subscription lets you customize which alerts, at which severity, go to which people. This separates duties and tasks to a certain extent, and each team only has to pay attention to its own area.
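The three subscriptions described above can be sketched as simple matching rules. This is an illustrative Python model, not the SCOM subscription API: the `Alert` fields, the class labels, and the list of "main" systems are assumptions chosen to mirror the examples in the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Alert:
    source_class: str  # e.g. "IIS", "SQL", "Windows", "APM" (assumed labels)
    system: str        # business system the object belongs to
    severity: str      # "Critical", "Warning" or "Information"


@dataclass(frozen=True)
class Subscription:
    name: str
    recipients: str                    # AD group that receives the notification
    criteria: Callable[[Alert], bool]  # which alerts this subscription covers


APP_CLASSES = {"IIS", "SQL", "Oracle", "APM"}
MAIN_SYSTEMS = {"Scientific research system"}  # hypothetical list of key systems

SUBSCRIPTIONS = [
    # Server group: everything except APM alerts.
    Subscription("Server alerts", "Server group",
                 lambda a: a.source_class != "APM"),
    # Application group: only the application-related classes.
    Subscription("Application alerts", "Application group",
                 lambda a: a.source_class in APP_CLASSES),
    # Information security group: critical alerts on main systems only.
    Subscription("Major outages", "Information security group",
                 lambda a: a.system in MAIN_SYSTEMS and a.severity == "Critical"),
]


def recipients_for(alert: Alert) -> list[str]:
    """Return the groups whose subscription criteria match the alert."""
    return [s.recipients for s in SUBSCRIPTIONS if s.criteria(alert)]
```

A critical SQL alert on the scientific research system reaches all three groups in this model, while an APM warning reaches only the application group.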

What we have defined so far is the alert criteria: which alerts, of which type and which severity or priority, are delivered to which subscribers. Next comes the channel, that is, how engineers are notified once an alert is raised. Without any subscription, engineers can see alerts only in the SCOM console or the SCOM web console. With subscriptions, the administrator can define different channels. SCOM's built-in channel types are email, instant message, text message (SMS), and command; the command channel runs a program or script, which is how voice calls or third-party notification systems can be attached. An enterprise can use several channels at once. The purpose of the channels is to make sure that the people responsible are notified promptly, by more than one route, when an alert occurs.

That covers alert classification in SCOM. What, then, is alert assignment?

The purpose of assigning alerts to people is to reproduce the incident dispatch function found in ITSM platforms. SCOM lets you define custom alert resolution states, for example "Assigned to first-line monitoring", "Assigned to server O&M", "Assigned to business applications", "Assigned to information security", "Assigned to senior engineers", and so on.

Once these states exist, the workflow looks like this. Someone in the first-line monitoring group opens the SCOM console or web console for the daily inspection and finds an alert that is outside the group's remit. They select the alert, right-click it, and set the resolution state that hands it to the right team. When the server O&M staff open the console, or receive the email or text message, they see the alerts assigned to them. They should first read the alert and the knowledge it carries. If they can deal with the alert, they set it to the acknowledged state; if they cannot, they reassign it to senior engineers or to the service provider. When the problem has been fully fixed and the alert does not go away by itself, they set it to the resolved state.

Some time after an alert disappears from the console, it is still stored in the SCOM data warehouse. By default the data warehouse keeps alert data for more than a year (400 days in a standard installation), so O&M staff can use it for analysis and forecasting reports.
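The assignment workflow can be pictured as an alert moving through resolution states while a history of who changed it is kept. The following Python sketch is only a model: 0 (New) and 255 (Closed) are the built-in SCOM state IDs, while every other ID and name in the table is a placeholder chosen for illustration.

```python
from dataclasses import dataclass, field

# Resolution state IDs. 0 (New) and 255 (Closed) are the SCOM built-ins;
# all other IDs and names here are placeholders for this sketch.
STATES = {
    0: "New",
    10: "Assigned to first-line monitoring",
    20: "Assigned to server O&M",
    30: "Assigned to business applications",
    40: "Assigned to senior engineers",
    200: "Acknowledged",
    254: "Resolved",
    255: "Closed",
}


@dataclass
class Alert:
    name: str
    state: int = 0
    history: list = field(default_factory=list)  # (state name, user) pairs

    def set_state(self, new_state: int, user: str) -> None:
        """Move the alert to a new resolution state and record who did it."""
        if new_state not in STATES:
            raise ValueError(f"unknown resolution state {new_state}")
        if self.state == 255:
            raise ValueError("alert is already closed")
        self.state = new_state
        self.history.append((STATES[new_state], user))
```

For example, a first-line monitor might call `alert.set_state(20, "monitor01")` to hand an alert to server O&M, after which an engineer sets 200 and then 254. The recorded history is what makes it possible to see who handled the alert at each step.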

The benefit of alert assignment is that alerts can be routed according to responsibility and level, their state is updated as they are handled, and they end up archived in the data warehouse. At this point the setup already resembles ITIL incident handling.

3. Fine-grained view authorization

If you log in with the account that installed SCOM, you can see every view and use every function; that account is the top-level administrator, so there is nothing more to say about it. But what about everyone else? When the first-line monitoring group needs to log on to the SCOM console, which account and which permissions should they use? The administrator account with full SCOM rights? Not so fast.

We do not recommend that everyone log on to the SCOM console as administrator or with one shared account. Security is only part of the problem: auditing becomes impossible, because when everyone uses the same account you cannot tell who handled a given alert.

How should authorization be implemented, then? SCOM provides fine-grained, view-level authorization settings. In the SCOM console, under Administration > Security > User Roles, you will find eight user roles provided by default:

Author

Read-Only Operator

Application Monitoring Operator

Report Security Administrator

Report Operator

Operator

Administrator

Advanced Operator

For the exact permissions of each role, see the SCOM console or the TechNet documentation. Here I only describe my own recommended practice.

First-line monitoring group: This group's job in the enterprise is mostly monitoring and inspection, so it does not need high management permissions in SCOM. Create a user role based on the Read-Only Operator profile and add the group's AD security group to it; as the name suggests, its members can then view everything in SCOM but change nothing. The group may also need to produce reports regularly for management, so add it to the Report Operator role as well. The result is that the first-line monitoring group has global read-only access to SCOM and can also generate and view reports.

Application group: This group differs from the first-line monitoring group. It may need to perform some management operations on IIS and SQL in the SCOM console, and it also needs to use the APM website and view reports. First create an Operator-level user role whose scope and views are limited to the SCOM monitoring views for IIS, SQL, and Oracle, and add the application group's AD group to it. A member who logs on to the SCOM console then sees only IIS, SQL, Oracle, and the business system folders that belong to the application group, and nothing outside the role's scope. Members may also need to log on to the APM website to check the running state and response times of .NET and Java applications, so grant the group the Application Monitoring Operator role as well. Finally, add the group to the Report Operator role so that it can view application-related reports.

Server O&M group: This group usually has the right to manage all servers in the enterprise and maintains them day to day, so it can be given a larger role in SCOM. Add the server O&M group to the Administrator, Report Operator, and Report Security Administrator roles with a global scope, so that it can see every monitoring metric in SCOM.

Information security group: This group often acts as the administrators of the IT department. Its members may occasionally need to go into SCOM to check the overall running state, application status, and reports. It can therefore be granted a global scope with the Advanced Operator, Application Monitoring Operator, and Report Security Administrator roles.

With view authorization defined, the use of SCOM is further improved. Without it, permissions cannot be controlled properly and operations cannot be audited. With user roles in place, each person sees only the views within their own scope of work, which reduces risks such as operator error, and what each person does after logging on to SCOM can be audited through the SCOM data warehouse.
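The mapping from AD groups to roles and view scopes can be written down as a small table. The Python sketch below is a simplified model, not SCOM's security API: the `RoleAssignment` type, the folder labels, and the use of "*" for a global scope are assumptions, and only two of the groups are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleAssignment:
    profile: str      # SCOM profile the user role is based on
    views: frozenset  # console view folders granted; "*" means all, empty means none


# Hypothetical assignments following the recommendations in the text.
ROLE_ASSIGNMENTS = {
    "First-line monitoring group": [
        RoleAssignment("Read-Only Operator", frozenset({"*"})),
        RoleAssignment("Report Operator", frozenset()),
    ],
    "Application group": [
        RoleAssignment("Operator", frozenset({"IIS", "SQL", "Oracle"})),
        RoleAssignment("Application Monitoring Operator", frozenset()),
        RoleAssignment("Report Operator", frozenset()),
    ],
}


def can_see(ad_group: str, folder: str) -> bool:
    """True if any role assigned to the AD group grants the view folder."""
    roles = ROLE_ASSIGNMENTS.get(ad_group, [])
    return any("*" in r.views or folder in r.views for r in roles)


def profiles_for(ad_group: str) -> set:
    """Return the set of profile names assigned to the AD group."""
    return {r.profile for r in ROLE_ASSIGNMENTS.get(ad_group, [])}
```

In this model `can_see("Application group", "SQL")` is true while `can_see("Application group", "Exchange")` is false, which is the scoping effect the section describes.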

4. What about service management?

After these three basic and simple optimizations, an incident handling process built on SCOM begins to take shape. An alert is raised. The first-line monitoring staff receive it through the web console, the console, email, voice, SMS, or instant message, and make an initial assessment. If the problem can be solved using the SCOM knowledge base or previous experience, the alert is closed at the first-line level. If it cannot be solved after a brief assessment, the monitor assigns it to the server group or the application group. The server group logs on to the web console or console and acknowledges the alert. If they can solve the problem, they do so and update the alert to resolved. If they cannot, the alert is escalated and reassigned. Once the problem is finally solved, the alert is set to resolved and archived in the SCOM data warehouse. That is what the process looks like at this point.

One point may be unclear: why is it the first-line monitoring staff who distribute alerts? It does not have to be. I describe it that way because once an enterprise runs an O&M monitoring system, first-line monitoring staff are usually the ones watching the platform, so they are likely to be the first to know which alerts exist.

If server O&M or application staff notice an alert before the first-line monitors do, so much the better. They can acknowledge and fix the alert directly, without going through first-line monitoring.

Comparing this process with the ITIL incident management process, one thing is missing: a portal for handling incidents. For now that role is played by the SCOM web console, where all events are handled, assigned, and updated. If you find the SCOM web console not professional or polished enough, System Center offers another option: SCSM (System Center Service Manager), which I regard as Microsoft's ITSM platform.

SCSM provides many connectors, two of them for SCOM. The first is the CI connector: once it is configured between SCSM and SCOM, the objects monitored in SCOM are synchronized into the SCSM CMDB. The second is the alert connector: once it is configured, alerts raised in SCOM are synchronized to SCSM automatically as incidents and are then handled in the SCSM portal.

With SCSM in place, alert classification no longer has to be defined in SCOM. Instead, incident handlers are defined in SCSM, each responsible for one type of incident. Like SCOM, SCSM can send notifications, and it can also escalate incidents directly to problems; after a problem is resolved, changes and releases can be initiated. In that respect I think SCSM implements the ITIL approach much more fully.

5. What about automation?

In SCOM, or in System Center more broadly, I can think of three ways to automate O&M work.

The first and simplest is to attach an automatic recovery task directly to a SCOM monitor. Suppose we have a unit monitor that watches the SQL Server Agent service on the SQL group. As soon as the service is found to be stopped, the recovery task runs a net start command to start it again. This is the most basic form of O&M automation.
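The logic of such a recovery task is small enough to sketch. The Python below is an illustration of the idea rather than a SCOM recovery task definition: the function names are hypothetical, `SQLSERVERAGENT` is used as an example service name, and the command runner is injectable so the sketch can be exercised without touching a real service.

```python
import subprocess
from typing import Callable, List, Optional


def recovery_command(service_name: str) -> List[str]:
    """Build the Windows command that starts a stopped service."""
    return ["net", "start", service_name]


def on_service_state(service_name: str, state: str,
                     run: Callable[[List[str]], object] = subprocess.run
                     ) -> Optional[List[str]]:
    """Run the recovery command only when the monitor reports 'stopped'.

    Returns the command that was run, or None if no recovery was needed.
    """
    if state.lower() != "stopped":
        return None
    cmd = recovery_command(service_name)
    run(cmd)  # on a real Windows host this executes `net start <service>`
    return cmd
```

Passing a list's `append` method as `run`, for example `on_service_state("SQLSERVERAGENT", "Stopped", run=calls.append)`, records the command instead of executing it, which is convenient for a dry run.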

The second method is more standardized and more formal. With SCOM integrated with SCO (System Center Orchestrator), SCOM alerts are passed to SCO, which handles them automatically through runbooks. For example, a basic runbook in SCO is connected to SCOM and picks up its alerts; an alert with event ID 1001 means that the service has stopped. SCO first sends an email telling the administrator that the service has stopped. If the administrator does not deal with it within a given time, SCO starts the service itself through subsequent standard activities and then tells the administrator that the service was started automatically. The SCO and SCOM integration is configurable in many respects, for example which alerts, matching which conditions, are passed to SCO. SCO ships with many standard activities that cover a wide range of automated maintenance operations, and you can also import integration packs or develop your own for further automation tasks.
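The runbook in this example follows a notify, wait, remediate, notify pattern. A minimal Python sketch of that control flow is shown below; it is not Orchestrator code, and the callback parameters (`admin_handled`, `start_service`, `notify`) are hypothetical stand-ins for runbook activities. Event ID 1001 is taken from the example in the text.

```python
from typing import Callable

SERVICE_STOPPED_EVENT_ID = 1001  # event ID used in the article's example


def run_runbook(event_id: int,
                admin_handled: Callable[[], bool],
                start_service: Callable[[], None],
                notify: Callable[[str], None]) -> str:
    """Handle one alert the way the example runbook does.

    admin_handled is expected to wait for the grace period and then report
    whether the administrator dealt with the alert in the meantime.
    """
    if event_id != SERVICE_STOPPED_EVENT_ID:
        return "ignored"
    notify("Service has stopped")
    if admin_handled():
        return "handled by administrator"
    start_service()
    notify("Service was started automatically")
    return "auto-remediated"
```

Keeping the waiting and the service start behind callbacks mirrors how a runbook chains separate activities, and it lets the flow be tested without a mail server or a real service.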

The last method is the most formal automatic handling mechanism: integrating SCOM, SCSM, and SCO. SCSM is connected to SCO through the SCO connector, which synchronizes the runbook folders in SCO to SCSM as runbook activity templates; each automation runbook is then associated with the corresponding type of incident. When SCOM sends an alert to SCSM and an incident is created, SCSM first checks whether the incident has a matching runbook activity. If it does, SCSM resolves the incident automatically, closes both the incident in SCSM and the alert in SCOM, and archives the record in the SCSM CMDB. If no runbook activity is bound to the incident, an administrator has to handle it manually in the SCSM portal; when handling is complete, the record is likewise archived in the SCSM CMDB and an SCSM knowledge base entry is created.

Looking at these three methods, it is easy to see that SCO, SCOM, and SCSM together implement process automation with some elements of system automation. Process automation helps O&M staff with basic, repetitive work. The important point is that process automation does not mean cutting the process short or abandoning standards; it means making the process more automated and more standardized. The goals are to reduce the workload of O&M staff and, by following standardized steps, to reduce operator error.

6. Process Review

  • An alert is generated. It can be received through the web console, the console, or any of the notification channels.

  • Decision: if SCOM is integrated with SCO, SCO can resolve the alert automatically.

  • Decision: if SCOM is integrated with SCSM and SCO, SCO can resolve both the incident and the alert automatically.

  • If SCOM is not integrated with other System Center components, the first-line monitoring staff first make a basic assessment of the alert. If it can be solved using the SCOM knowledge base or previous experience, the first-line staff close the alert.

  • If the first-line monitoring staff judge that the incident cannot be solved quickly, the alert is assigned to the next level of engineers.

  • The higher-level engineer receives the assigned alert, sets it to acknowledged, and works on it using the SCOM knowledge base and their own experience. Once the problem is fixed, the alert is updated to resolved.

  • If the higher-level engineer cannot resolve the alert, it is escalated further.

  • Eventually, when an engineer at some level or in some department resolves the alert, it is marked resolved and archived in the SCOM data warehouse for later analysis and forecasting.

  • With SCOM and SCO integrated, SCOM alerts are resolved automatically by SCO and archived in the SCOM data warehouse.

  • With SCOM, SCSM, and SCO integrated, incidents in SCSM are resolved automatically by SCO and archived in the SCSM CMDB, and the alerts are also archived in the SCOM data warehouse.

 

That is the end of the article. I sincerely hope that it is understandable and that you find something in it to apply to your existing SCOM, SCSM, and SCO environment. Integrated, SCOM, SCSM, and SCO form a platform for putting the basic ideas of ITIL into practice, with a relatively complete incident handling and response process. It may not be good enough in every respect, but Microsoft has at least made an effort here and provides ITIL-oriented tooling in System Center. However good a monitoring and O&M platform or ITSM platform may be, technology alone is not enough to establish and improve IT service management in an enterprise. It has to be combined with appropriate management policies to drive adoption; only then can ITIL truly take root in the enterprise and in the private cloud.


This article comes from the "a stubborn Island" blog. Please be sure to keep this source: http://wzde2012.blog.51cto.com/6474289/1616066
