If you want to be successful with SCOM, the Manifesto for SCOM Success is an excellent place to start. The manifesto was put together at the 2020 SCOMathon event to help everyone maximize the power of SCOM.
The five core principles are tool-agnostic, so they apply however you run SCOM. SCOM is a powerful infrastructure monitoring platform, and these principles will help you maximize its potential. But we wanted to take a deeper dive into what these principles look like in action in real environments, which is what the full webinar covers.
The first principle is to turn everything off in SCOM and then tune up, not down.
When you add management packs to SCOM and it starts discovering objects, you are automatically fed a huge number of alerts by default. At first glance, this looks like a helpful feature, as you are immediately alerted on all sorts of things. But are these alerts on objects that are vitally important to you or other SCOM end users?
We highly recommend you turn off all alerts to start with and only add alerts as necessary. Work with the domain experts for each area, e.g. the SQL DBA for SQL, and discover what alerts they need. Tune up on a group-by-group basis rather than with blanket rules.
SCOM alert tuning demo with Easy Tune
Jan Pettersson, System Specialist at UMEÅ University, demonstrated in the webinar how he uses Easy Tune from Cookdown to tune his SCOM alerts. You can download and run it for free with up to five tuning packs (sets of tuning). Before Easy Tune, Pettersson used manual overrides. That works but is time-consuming, as you have to right-click and select override for each server.
On a new download of the Microsoft SQL Server (version agnostic) management pack (MP), the first thing to do was to Tune Globally (All Objects), which means tuning all SQL servers in this instance.
First, turn everything off using the DiscoveryOnly level of each tuning pack to set overrides that will turn off all the monitoring and every alert included in the MP. Click Next and allow it to create a new MP, then select Finish. With everything off, you can then tune more specifically, for example by choosing which SQL servers in a cluster you want to monitor.
Next, once everything is turned off, select the MP again and right-click to Tune a Group or Specific Object.
Select the group or object you want to tune (here, Jan selected the SQL production server group) and tune it to the level you need, such as the performance, production, or test level.
This will disable some alerts and leave others enabled and you can see what changes are made in the overrides list that appears below your selection.
A great feature is that you can apply the tuning on a schedule to avoid false alerts when you are running backups and maintenance.
Create your own tuning pack
If you can’t find the tuning you want in the management pack, you can always create your own tuning pack. Jan’s team has created some custom tuning packs for databases, IIS, and Exchange Windows servers.
Once you select the management pack and specify the details for your tuning pack, it will pull out the workflows in the management pack that you can then tune and store them in a tuning pack (essentially a CSV file). You can choose to view the CSV file in a plain text editor (like Notepad++) or Excel, whichever you are more comfortable with.
Simply change the name ‘Custom’ to the name you want to give your management pack.
Then set the overrides by changing the values for each workflow parameter you want to tune (e.g. change an enabled parameter from true to false).
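If you would rather script that kind of edit than make it by hand, a minimal Python sketch might look like the one below. The column names (`Workflow`, `Parameter`, `Value`) and the sample rows are assumptions made up for illustration; the real Easy Tune CSV layout is not shown here and may well differ.

```python
import csv
import io

# Hypothetical tuning-pack layout; the real Easy Tune columns may differ.
SAMPLE = """Workflow,Parameter,Value
DB Free Space Monitor,Enabled,true
Blocking Sessions Monitor,Enabled,true
DB Free Space Monitor,Threshold,10
"""

FIELDS = ["Workflow", "Parameter", "Value"]


def set_override(csv_text, workflow, parameter, new_value):
    """Return the CSV text with one workflow parameter set to new_value."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        if row["Workflow"] == workflow and row["Parameter"] == parameter:
            row["Value"] = new_value
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Disable one monitor and leave every other row untouched.
tuned = set_override(SAMPLE, "Blocking Sessions Monitor", "Enabled", "false")
print(tuned)
```

Keeping the change in a small function like this makes it easy to review exactly which workflow parameters were touched before the tuning pack is applied.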
Although most companies will have similar system monitoring requirements, there are always use cases unique to your system. Custom monitoring allows you to monitor those.
There is a whole industry around providing custom monitoring. But there comes a point where, to be truly successful with SCOM, you need to be writing your own monitoring.
Although it seems daunting because SCOM can be extended to do almost anything, it doesn’t have to be overwhelming. Less is often more.
There is already a lot available to you out of the box in the authoring pane.
Let’s take the example of your application team having a known issue. They know that if a specific event occurs, it will cause them to run out of disk space. This needs custom monitoring. SCOM has everything you need to set that up.
You could create a simple Timer Reset monitor that raises an alert when that event is detected and resets itself after two hours. Or choose a Windows Event Reset, where a second event tells SCOM the problem is clear. The SCOM authoring capabilities are great and allow you to create alerts you can action, edit IDs, and more.
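To show the timer-reset idea in isolation, here is a toy Python model of it. This is not how SCOM implements the monitor internally; the class name and state labels are invented for illustration, and only the two-hour reset comes from the example above.

```python
from datetime import datetime, timedelta


class TimerResetMonitor:
    """Toy model: goes to 'warning' on an event, back to 'healthy' after a fixed interval."""

    def __init__(self, reset_after):
        self.reset_after = reset_after
        self.last_event = None  # time of the most recent triggering event

    def on_event(self, when):
        """Record that the triggering event was seen at time `when`."""
        self.last_event = when

    def state(self, now):
        """Return the health state as of `now`."""
        if self.last_event is None:
            return "healthy"
        if now - self.last_event >= self.reset_after:
            return "healthy"  # the timer has expired, so the monitor resets
        return "warning"


monitor = TimerResetMonitor(reset_after=timedelta(hours=2))
seen_at = datetime(2022, 1, 1, 9, 0)
monitor.on_event(seen_at)
print(monitor.state(seen_at + timedelta(minutes=30)))
print(monitor.state(seen_at + timedelta(hours=2)))
```

A Windows Event Reset differs only in what clears the state: a second, "all clear" event rather than the passage of time.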
Another customization could be to use Performance Monitoring.
Say you know that when your application has a processing queue of over 10 items, that’s too much of a backlog. You can set up alerting for that. Or, for your building management system, when the share of HVAC systems turned on in the building gets above ninety percent, you can monitor that performance counter and alert on it.
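The logic behind both of those examples is just a threshold comparison. A rough Python sketch follows; the function names are hypothetical, and in SCOM itself you would configure this through a performance monitor rather than write code.

```python
def queue_backlog_alert(queue_length, limit=10):
    """Alert when the processing queue holds more than `limit` items."""
    return queue_length > limit


def hvac_alert(units_on, units_total, limit_percent=90.0):
    """Alert when more than `limit_percent` of HVAC units are switched on."""
    if units_total <= 0:
        raise ValueError("units_total must be positive")
    percent_on = 100.0 * units_on / units_total
    return percent_on > limit_percent


print(queue_backlog_alert(12))  # queue of 12 against a limit of 10
print(hvac_alert(19, 20))       # 19 of 20 units switched on
```

Both checks alert only when the value goes strictly above the limit, which matches the "over 10" and "above ninety percent" wording.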
Custom monitoring can be that simple.
You can also use scripts to create custom monitors. There is a cost to launching PowerShell to run scripts, so think about this option before setting it up. But you can create a PowerShell script for almost anything. You can quickly create a script to detect a condition and set it to run every five minutes, or as frequently as you need, and return the values you need.
Finally, some monitors that work well but often get overlooked in the SCOM authoring templates are those for Web Application Monitoring. These let you check a page for specific text, so you could check for Copyright 2022 on your company’s site every five or ten minutes. URLGenie can do the same, but it could be worth setting it up with SCOM’s native functionality, as it would only take fifteen minutes.
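For a sense of what a text check like that does under the hood, here is a small Python sketch using only the standard library. It is an illustration of the idea, not SCOM's implementation; the function names are made up, and the `fetch` parameter exists only so the check can be tried without network access.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10.0):
    """Download a page and decode it as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_has_text(url, expected_text, fetch=fetch_page):
    """Return True if the page loads and contains expected_text."""
    try:
        return expected_text in fetch(url)
    except OSError:
        # URLError and socket timeouts are OSError subclasses:
        # an unreachable page counts as a failed check.
        return False
```

You would call it as `page_has_text("https://www.example.com", "Copyright 2022")`, with the URL swapped for your own site, and schedule it at whatever interval suits you.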
TCP Port monitoring is also useful. If you don’t have an in-depth knowledge of a destination system, you may want to check if the port is listening so you can see network connectivity end to end, see whether the system is up, and some latency data too. It’s basic, but useful to have.
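As a rough illustration of what a port check measures, the Python sketch below tries to open a TCP connection and times how long it takes. It is a standalone example rather than SCOM's own probe, and the function name is an assumption.

```python
import socket
import time


def check_tcp_port(host, port, timeout=3.0):
    """Return (is_open, latency_seconds); latency is None if the connection fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.perf_counter() - start
    except OSError:
        # Refused, timed out, or unresolvable: treat the port as not listening.
        return False, None
```

Calling `check_tcp_port("sql01.example.local", 1433)` (a hypothetical host) would tell you whether the port is listening and roughly how long the connection took to open.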
There’s a lot in the authoring pane that is ready to go. Being able to pick something up before it falls over and acting proactively will make you a hero and prove the value of SCOM to your organization.
SCOM’s bread and butter is infrastructure monitoring, but the world is changing. It’s becoming more focused on applications. SCOM has this covered with its Distributed Application functionality and APM if you run .NET apps.
First, how do you know if your application is running? There may be a web server, SQL server and another application server to make your application run. So, a check that an application is running has to cover each of those components.
At the university where Jan works, they have a printing application that lets users swipe their card at any printer, which sends their print queue to a dummy queue and prints from that device. To check this is running, they need to check that the driver is not spooling or the app will stop working. There are also a couple of ports that need to be open, and the website responds with a text message to show it’s up. There are also a couple of service monitors that run a discovery for services. All these things need to be working for the application to be up.
Jan uses the Wildcard Management Pack for managing the application as it’s easier to use than the built-in MP.
Similarly for running Skype for Business, the university checks in SCOM for the website, Lync edge ports, databases, disks, the SIP Trunk, and even the certificate. Adding the application information over these specific objects means that the alerts gain meaning. If a disk is running out of space, it’s not just a generic disk, but the one that runs Skype for Business. And these ports might not otherwise be monitored without the extra application layer of insight.
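One way to picture a distributed application is as a roll-up of its components' health. The Python sketch below uses a worst-of policy, which is only one possible roll-up choice, and the component names are illustrative labels based on the printing example rather than real SCOM objects.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def rollup(components):
    """Worst-of roll-up: the application is only as healthy as its worst component."""
    if not components:
        raise ValueError("an application needs at least one component")
    return max(components.values())


def failing(components):
    """Names of the components that are dragging the application down."""
    return sorted(name for name, health in components.items() if health != Health.HEALTHY)


# Illustrative component names for the printing application.
printing_app = {
    "print driver spooling": Health.HEALTHY,
    "application ports": Health.HEALTHY,
    "website text check": Health.CRITICAL,
    "services": Health.WARNING,
}

print(rollup(printing_app).name)
print(failing(printing_app))
```

The second function is where the application layer earns its keep: it tells you not just that the application is down, but which underlying piece is responsible.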
SCOM’s basic method of telling you when something is wrong is via email. This can quickly become overwhelming when you have hundreds and thousands of alerts from huge deployments, even if your SCOM environment is optimized.
You’d be much better served if you could raise the alerts in your ITSM solution as tasks or incidents depending on their nature. This way, they get put into a process and are dealt with. And on the flip side, if the team doesn’t want an incident raised for a particular event, you can question whether it should be monitored at all.
Managing SCOM alerts in ServiceNow with Connection Center
Connection Center lets you pull incident information back into SCOM so you don’t have to context switch between your ITSM tool and SCOM.
You can set up a custom resolution state and send only those to your ITSM tool.
From the administration pane, you can see all the items you can push into and out of SCOM. You can push SCOM alerts anywhere using webhooks.
To push a SCOM alert out to an ITSM tool, you simply need to go to ‘Create Connection’ and choose your tool. Then, complete all the details needed, including specifying what is going to be receiving the alerts you’re sending.
When connecting to ServiceNow, as in this example, you get a Subscriptions Criteria picker that lets you determine the specifics around the types of alerts you want to send, e.g. only high priority alerts. All the configuration is saved in the MP, which is saved in SCOM.
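Conceptually, those criteria boil down to a filter over alert properties. The Python sketch below is a simplified stand-in, not Connection Center's actual logic; the `Alert` fields and the resolution state ID of 15 are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical ID for a custom "Send to ITSM" resolution state.
SEND_TO_ITSM_STATE = 15


@dataclass
class Alert:
    name: str
    priority: str          # e.g. "Low", "Normal", "High"
    resolution_state: int  # numeric resolution state ID


def should_forward(alert, state=SEND_TO_ITSM_STATE, priorities=("High",)):
    """Forward only alerts in the chosen resolution state with a matching priority."""
    return alert.resolution_state == state and alert.priority in priorities


alerts = [
    Alert("SQL service stopped", "High", SEND_TO_ITSM_STATE),
    Alert("Disk latency high", "Normal", SEND_TO_ITSM_STATE),
    Alert("Agent heartbeat failure", "High", 0),
]
forwarded = [a.name for a in alerts if should_forward(a)]
print(forwarded)
```

Combining a custom resolution state with a priority filter keeps the decision about what leaves SCOM explicit and easy to audit.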
Then you need to configure inbound alerts to pull back information about the incidents raised for those alerts. This follows the same wizard format.
When you then log in to your ITSM tool, you’ll be able to see the SCOM alerts and act on them. In ServiceNow, for example, the certified store app receives the SCOM alerts.
Behind the scenes are several rules working in ServiceNow that determine when something is raised as an incident or not.
For example, a SQL Team rule looks at the properties of the incoming SCOM alert and matches against them to make decisions. All the properties of the SCOM alert are available to you and you can match those pieces of information. You can do simple things like string matching and more complicated things like pattern matching or regular expressions. And you can combine all of these.
If you are ITIL compliant, for example, you might want a ticket when disk space is low and make that a task, rather than an incident, as under ITIL “disk space is low” is not truly an Incident.
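To make the rule idea concrete, here is a small Python sketch that combines substring and regular-expression matching and decides between an incident and a task. It is not the ServiceNow app's rule engine; the property names (`Name`, `Path`), the rule definitions, and the host name are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    record_type: str                               # "incident" or "task"
    contains: dict = field(default_factory=dict)   # property -> substring (case-insensitive)
    matches: dict = field(default_factory=dict)    # property -> regular expression

    def applies_to(self, alert):
        """True when every substring and regex condition matches the alert."""
        text_ok = all(
            sub.lower() in alert.get(prop, "").lower()
            for prop, sub in self.contains.items()
        )
        regex_ok = all(
            re.search(pattern, alert.get(prop, "")) is not None
            for prop, pattern in self.matches.items()
        )
        return text_ok and regex_ok


def route(alert, rules):
    """Return (rule name, record type) for the first matching rule, else None."""
    for rule in rules:
        if rule.applies_to(alert):
            return rule.name, rule.record_type
    return None


rules = [
    Rule("Low disk space", "task", contains={"Name": "free space"}),
    Rule("SQL Team", "incident",
         contains={"Name": "sql server"},
         matches={"Path": r"^sql\d+\."}),
]

alert = {"Name": "SQL Server service stopped", "Path": "sql01.example.local"}
print(route(alert, rules))
```

Putting the low-disk-space rule first means those alerts become tasks before any team-specific incident rule gets a chance to claim them.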
By pulling all this into your ITSM tool, you get not only the information about the SCOM alert, but also the details of the incident raised.
Plus, with the power of bidirectional sync, you will also have the incident ID and information from ServiceNow pulled back into SCOM too.
SCOM is great at doing the difficult job of monitoring. But it’s not so great at displaying that data to make it easily digestible.
You can use the native dashboard or a more sophisticated third-party solution.
The first dashboard here was built in SquaredUp for the operating crew that take care of the main server hall at UMEÅ University where they keep all the servers. The dashboard gives instant visibility of the server room temperature, the load on the backup batteries, the run time, and the remaining current, as well as the number of tickets in ServiceDesk Plus. A fun addition is the campus lunch menu in the bottom left next to the local weather.
Within one dashboard there’s data from multiple tools. A SQL integration pulls in the ticketing system numbers, and the temperature and UPS are displayed through some custom monitoring. Now that they can visualize the data, the team can spot problems before they’re raised as alerts.
Then the dashboard that the services people see shows the distributed applications in the top left. The conditional colors make status changes immediately visible. In the top right, there are ticket stats for open, overdue, and how many have been handled today. At the bottom is the CAB calendar so the Service Desk can see what planned installations are happening. In the middle of the dashboard is the blog where alert information or notices about maintenance and when something goes down are posted. That uses an iframe.
The final dashboard shows servers, their load, the network usage, and connection latency for the Dev Team. This is particularly important because twice a year, students from all over Sweden log into the UMEÅ University website to see their SAT results, putting pressure on a specific set of infrastructure. With a dashboard to visualize this, the team can see what’s happening at any given moment.
Which of the 5 core principles will you use first?
So those are the core principles of the SCOM Manifesto and demos of how to use them in your own environment to get the most out of SCOM.
Which one will you be implementing first in your environment?
Head over to our SCOMathon Slack community to share what you’re doing and ask questions.