Coffee Break: The SCOM Clinic: Your Questions Answered (Part 1)

Last month’s Coffee Break was a hit. We enlisted the help of Bob Cornelissen, MVP and Managing Consultant at TopQore, and Nathan Foreman, Product Architect at Cookdown, alongside Bruce Cullen, our presenter and Director of Products at Cookdown, to answer all of your SCOM questions.  

Our “SCOM doctors” were expecting questions around maintainability, such as what to expect from the database and how much is stored there. They were also expecting some infrastructure questions, fixes, and how-tos.

Here’s what the team were asked and the answers they gave. 

What are the daily, weekly, and monthly checks that you do in your SCOM environment? 

There are a number of things you should regularly do as an Administrator.  

  1. Start with daily checks of the health of SCOM itself. Check the infrastructure and agents to see how they are performing.
  2. Also check that your management servers resource pool and other key resource pools are online and healthy, as they run many of SCOM’s key internal processes. If they go offline or keep failing back and forth between management servers, things are not going to work as you’re hoping.
  3. Weekly, assess the volume of data and the volume of alerts being created. These determine how fast SCOM will work and how many problems you will see, either in your environment or in your ticketing system.

Tips and tricks for tuning SCOM 

When it comes to tuning, Nathan recommends you first have a conversation with the person who will be receiving the SCOM alerts. Looking online will give you hundreds of different suggestions, but every company is slightly different. Plus, people love to have input and be heard. 

Ask the person who receives the alerts about what they actually need and care to see. Then tune the alerts to your shop and the individuals on the receiving end. This will give you a lot more traction than just copying what someone else has previously done.

Bob added that he usually uses some reporting to check out which workflows are generating the largest volume of data, and which are creating most of the ‘mess’ in SCOM – health changes, state changes, discoveries, alerts etc.  

Alerts are quite clear to most. However, there are also alerts that open and close during the night, which you might not see if you’re not automatically forwarding them into your ticketing system. In that case, there might be twice as many as you thought, so they are worth a look.

Then there’s the age-old question that’s often shop-specific: do you care about those alerts that have opened and closed in the night?  

Some companies do; some don’t; some should where they don’t. If all alerts are put into a notification channel, Bob recommends delaying notifications by a couple of minutes. That way, any alert that opens and closes within 2-3 minutes won’t trigger a notification, which keeps things a little clearer.
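The delay idea can be sketched in a few lines of Python. This is only an illustration of the logic, not how SCOM implements notification delays; the `Alert` class, the `alerts_to_notify` function and the 180-second default are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    # Hypothetical, simplified alert record (not the SCOM alert object).
    name: str
    opened_at: float                   # seconds since some reference time
    closed_at: Optional[float] = None  # None while the alert is still open


def alerts_to_notify(alerts: List[Alert], now: float,
                     delay_seconds: float = 180.0) -> List[Alert]:
    """Return alerts that are older than the delay and still open."""
    due = []
    for alert in alerts:
        if now - alert.opened_at < delay_seconds:
            continue  # too young: wait to see whether it closes by itself
        if alert.closed_at is not None and alert.closed_at <= now:
            continue  # already closed: no notification needed
        due.append(alert)
    return due
```

With a three-minute delay, an alert that opened and closed within two minutes never reaches the notification channel, while one that is still open after the delay does.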

General advice on upgrading from SCOM 2012R2 to SCOM 2019 

Remember that the SCOM in-place upgrade path only takes you one version up at a time. Because SCOM 2016 sits between 2012R2 and 2019, this requires multiple upgrades.

Make sure you also look at the supported Windows and SQL versions you are running your SCOM infrastructure on.  

Sometimes, the leap between versions might be too big with too many steps, so it might be worth doing a side-by-side migration. Spin up a completely new SCOM on the newest version and move everything across. 

This is the very short answer, but Bob presented webinars on this at SCOMathon 2021. He’ll also be running one at SCOMathon 2022 on upgrading to SCOM 2022, once that is released. So, keep an eye out for new SCOMathon webinars coming soon.

Can we do SNMP monitoring for the network devices in SCOM or is there a limitation I should know about? 

SCOM does support SNMP monitoring once you get it coded up. It picks up some things automatically from the start, but for others you will need to do some work to go deeper into the monitoring and figure out which Object Identifiers (OIDs) you want to know about.

You can walk the OIDs for discovery, and you can also monitor the value at a specific OID. SCOM’s SNMP monitoring also handles v2 and v3 authentication.

But it’s not intuitive to create an SNMP discovery or to work with OIDs in general. Figuring out which OIDs are where and what their values should be is difficult.

There are known limitations as SNMP monitoring always seems to come in the top five or ten requests for what people would like to see added to SCOM in the Big SCOM Survey. It is possible to monitor networks in SCOM, but it isn’t the best-of-breed tool for that. 

Can SCOM SSRS run on SQL Server 2019 when our SCOM DB runs on SQL 2016 SP3?

There’s a simple answer to this one – yes.  

Having trouble with subscriptions for new state rule alerts 

Tyler asked: 

“I’m having trouble with subscriptions. I would like to create one subscription to notify on multiple monitors and rules. However, the rules I would like to notify for are new state rule alerts, not closed state rule alerts. Any recommendations for the subscription query to make this work correctly?” 

Bob replied that the criteria builder allows you to filter on state. 

This filter lets you get notified on both – for example, new and closed. This would let you see that you had been sent an alert, but it has already been closed, so there’s no need to worry about it. Or you can send everything that was in a new state at the moment it was sent to you.  
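As a rough sketch of what filtering on state means, the Python below selects alerts by resolution state. The values 0 for New and 255 for Closed are SCOM’s default resolution state IDs; the `Alert` class and the `select_alerts` function are hypothetical names for this illustration and are not the subscription criteria syntax itself.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Default SCOM resolution state IDs.
NEW = 0
CLOSED = 255


@dataclass
class Alert:
    # Hypothetical, simplified alert record for illustration only.
    name: str
    resolution_state: int


def select_alerts(alerts: Iterable[Alert], wanted_states: Set[int]) -> List[Alert]:
    """Keep only the alerts whose resolution state is in wanted_states."""
    return [a for a in alerts if a.resolution_state in wanted_states]
```

Passing `{NEW}` mirrors a subscription that only notifies on new alerts, while `{NEW, CLOSED}` mirrors one that also tells you when an alert has already been closed.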

SNMP using MIBs: separate monitoring account needed?

The short answer is ‘yes’ because SNMP always has some authentication. It’s going to be either a v1 or v2 community string, or a v3 account and password. If it’s a read-only string, that’s sufficient.

You don’t need a separate Windows account to run the workflows under, but you will need a run-as account for a community string or a v3 account.  

Installing and removing SCOM management servers in SCOM 2019 

First, make sure there is no SCOM Agent installed on the machine before you start installing a SCOM management server on it.  

When removing a management server, make sure that once it is uninstalled, it is also deleted from SCOM through one of the management servers that are still there.

Another note is that Microsoft wants the management servers to be close to the database in terms of latency. If you install a management server in a remote site, it won’t work well because of the latency, so check that the connection to SQL is very quick. They’re supposed to be in the same data center. If you need to collect and compress data from a remote site, gateways are your option instead.

Fixing the ‘Data Warehouse failed to deploy reports’ in SQL 2016 

Bryan asked: 

“How do you fix the “Data Warehouse failed to deploy reports for a management pack to SQL Reporting Services Server” when you are running SQL 2016 and don’t have the same option as SQL 2019?” 

You see this occurring in SQL 2017 or 2019 because of the allowed resource extensions for upload setting (AllowedResourceExtensionsForUpload) in Reporting Services. There, you have to add *.* so that every extension the reports use is allowed.

However, it sometimes happens when certain management packs are being used. For example, the SharePoint 2013 management pack or the NLB 2012 or RDS 2012 MPs have some references in them that cannot be converted. Because those references are not there, it gives the error.

The error should state which pack is causing the issue. The first port of call would be to remove that management pack, wait a day, check that all the other reports are loading and working fine, then try adding it back in again. Make sure you have the latest version of the MP too.

How to modify the default alert descriptions to be more meaningful 

The default descriptions are sent with the management pack, and they are fixed. So, you can’t edit them directly. 

However, once an alert has been raised, you have a couple of options. A session at the last SCOMathon showed how to trigger a PowerShell script that extracts the alert details and builds your own description, which you can then send as HTML or by a different method. See the session in question here.

But there is nothing built into SCOM to let you directly override the description on an alert. You can only edit it after the alert is sent.  

The descriptions may usually be very short, but you can often find more information in the alert itself by opening the alert properties, the context, the health explorer, etc.  

There is also a paid-for add-on management pack for SCOM that will do this and even adds on the knowledge, which can be helpful.  

Smart way that health service can be automatically restarted? 

The full question asked was: 

“When we see “All management servers resource pool is unavailable”, most occurrences stopped due to health service. Is there a smart way to auto restart health service when this alert is raised by SCOM?” 

First, this should be a very rare occurrence. So if this has happened a few times, there’s an underlying issue that needs to be fixed.  

However, there are some ‘bandages’ you can apply as a quick fix. In the past, Bob has used automation, such as System Center Orchestrator, to check the services running on SCOM management servers. This basically monitors the monitor. You can then restart the services if they go down. Try up to three restart attempts, and if all three fail, send an email out.
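The “try three times, then email” pattern looks roughly like this in Python. It is a sketch of the control flow only: the `is_running`, `restart` and `notify` callables are hypothetical placeholders for whatever your automation tool provides, and nothing here is Orchestrator-specific.

```python
from typing import Callable


def restart_with_retries(is_running: Callable[[], bool],
                         restart: Callable[[], None],
                         notify: Callable[[str], None],
                         max_attempts: int = 3) -> bool:
    """Restart a stopped service up to max_attempts times.

    Returns True if the service is running at the end. If every attempt
    fails, notify is called once and False is returned.
    """
    if is_running():
        return True  # nothing to do
    for _ in range(max_attempts):
        restart()
        if is_running():
            return True
    notify(f"Service still down after {max_attempts} restart attempts")
    return False
```

In a real runbook you would normally pause between attempts so the service has time to start. The sketch leaves that out to keep the loop easy to follow.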

Remember there is also the Windows setting where you can have Windows Services automatically restart after a few minutes.  

But make sure that the All Management Servers Resource Pool is always, or nearly always, up. It runs so many things in the background for SCOM that you really need it to stay up.

Restarting agent services or notifying someone 

Mattias asked: “Now and then the agent service stops working. Is there any way to automatically restart the agent service or notify someone (where SCOM admins don’t have admin rights on the agent machines)?”  

The answer here may be to use alerts. When a health service goes down, the health service watcher, which runs on the management server, notices. If it misses three heartbeats from the agent, it raises a Health Service Heartbeat Failure alert, and you can create a notification based on that.
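The three-missed-heartbeats rule can be sketched as below. This illustrates the idea only and is not how the health service watcher is implemented; the function names and the 60-second interval are assumptions for the example, so check the heartbeat settings in your own management group.

```python
def missed_heartbeats(last_heartbeat: float, now: float,
                      interval_seconds: float = 60.0) -> int:
    """Approximate how many heartbeats should have arrived but did not."""
    elapsed = now - last_heartbeat
    if elapsed <= 0:
        return 0
    return int(elapsed // interval_seconds)


def heartbeat_failed(last_heartbeat: float, now: float,
                     interval_seconds: float = 60.0,
                     allowed_missed: int = 3) -> bool:
    """True once the agent has missed the allowed number of heartbeats."""
    return missed_heartbeats(last_heartbeat, now, interval_seconds) >= allowed_missed
```

With these assumed values, an agent that has been silent for just under three minutes is still considered healthy, and the alert condition is met at the three-minute mark.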

On the agent side, you could set the Windows service to restart after failure. But you cannot use SCOM itself directly if the agent is down.  

But it might be good to have a look at this article from Kevin Holman, which talks about health service restarts. It’s something that happens often, and it basically means that there are too many workflows running or something else is going on, and then the agent keeps restarting itself all day. You don’t want that happening.

We aren’t as concerned with a regular agent going down as with a management server. Since it’s a single agent, it’s less systemic than the management server service stopping.

Database performance issues with TempDB growing fast 

The next question came from Jamie. He asked: 

“We’ve been experiencing database performance issues with TempDB growing fast and becoming inaccessible. We think we fixed it by turning off a bunch of performance counters, but this was not easy to find. Are there any MPs or tools that can be used to find irregularities or bottlenecks in a SCOM DB? Is having a shared SQL instance running on a SQL availability group fit for purpose for SCOM?” 

To really get to the root of this, it will require some investigation because there are multiple possibilities and options that could cause this.  

Consider doing a health check. This is really important to uncover what’s going on at the SQL layer. Then there’s also tuning data churn and config churn, which will affect the SQL side of things.

Note that it’s best practice to keep SCOM databases on separate SQL instances where no other databases are located for other products. This is due to the performance and busy nature of these databases, and the use of the TempDB by these databases. Both the SCOM database and the SCOM data warehouse are hitting the TempDB hard. 

If your SCOM databases are on a shared SQL instance, the TempDB is shared with everything else on there. So, if another database does something like a Cartesian join, SQL is going to use the TempDB to hold part of that.

So, it could also be that whatever else is on there is using up a large section of that TempDB. You will need to dig into the TempDB to see what’s using it and investigate from there.

We recommend not hosting SCOM databases with anything else. But if the instance must be shared, maybe share it only with an Orchestrator database or similar, which is also a System Center product. Don’t put it with a busy database like a Service Manager database, as that will also hit the TempDB. And never put SCOM and SQL on the same machine.

Look out for Part 2 

That’s all we have time for today. But keep an eye out for part two of this blog post as we go through more of the questions asked in the February Coffee Break: The SCOM Clinic.  

Part 2 is now live here.