Coffee Break: The SCOM Clinic: Your Questions Answered (Part 2)

Snapshot synchronization failures on a management server with a 10-millisecond round trip to SQL

Now and then the SnapshotSynchronization fails on one of our furthest away Management Servers from SQL (10ms round trip). Is there a more direct way to resolve it on the MS in question than the following?

Bob and Nathan recommended that you first go back to manual snapshot syncs and adjusting the config file settings. But the issue may be due to differences between the management servers, because it looks like two management servers are doing what they should do and one isn’t, presumably the one with the 10ms round trip.

You should always keep your management servers close to the database, no matter what. Latency should be under 5ms, even if you place them in a different data center. This third management server may be used for recovery scenarios, but it may just be there to be closer to the agents. In that case, put a gateway there to create a contact point.

This workflow is run by the All Management Servers Resource Pool which, by default, includes all management servers. Not all of them need to run this workflow; only one of them runs it at any given time. What you could do is change the membership of that resource pool to manual and put only the two fast management servers in it. You still have failover between the two, and the third management server is left out, so it doesn’t try to run the snapshot synchronization job.

Also, be very careful when editing config files with these settings. Keep them the same on every management server, because if they deviate from each other, that’s a recipe for disaster.

There’s not a lot that’s local to the management server, so you need to look at what the differences are between them. It could also be that you need to rethink your design with the three management servers, especially if there are two in one data center and one in another. That design will never be able to survive a failure in the first data center, because then the third one will be down anyway.

How can I author Windows Service monitoring for specific computers? 

There is a template in the authoring pane for service monitoring, which works well. To get the specific computers, start with an unsealed management pack and create a group beforehand so that you can put those computers in. Then, while running the template, you’ll be asked for a group to target. Just pick your group with explicit membership and it will target just those computers. It’s all done through the SCOM UI with no XML or anything required.

Will SCOM ever have an out-of-the-box solution for custom alert descriptions instead of authoring?

This forum, a replacement for User Voice, is where the SCOM Product Team takes feedback. So, propose this one to Aakash and the SCOM Product Team.

As for custom alert authoring, we tackled that in Part 1 of this blog. The quick summary is that John Liss covered a method of doing this using PowerShell in his session at SCOMathon 2021.

How to decouple MPs and get custom 2008 perf monitors 

“SCOM 2012 R2, AD 2008 MP (that has dependencies on AD 2K and AD 2K3). Our former AD dude created custom MP that leverages these legacy 2K8 MPs then created some custom override MPs as well. I’m trying to decouple and get all the custom 2K8 easy perf monitors etc. I’ve been hacking the XML; it’s daunting. Override Creator and Explorer are tools that exist. They look like they should be good but honestly, they don’t seem to work. Am I missing some framework on the server? I don’t have SCOM reporting. I surmised hacking the XML is the only way to get all of this cleaned up. Are there any thoughts?” 

First, as a SCOM admin you should always have SCOM reporting, because there are lots of useful reports and features in there.

The Override Explorer works in most cases to move overrides from one pack to another. You can find out more about this in a webinar Bob ran for the SCOMathon Workshop.

As for the custom monitoring packs and override packs that you want to get rid of because they’re usually a problem, many references will turn up empty, so those can be manually removed. So yes, that’s hacking XML a little bit. The rest is finding out what monitoring got created and seeing whether it still applies to the new management packs, since the Active Directory management pack for 2016 and up, and the one for domain controllers in 2012, have been completely rewritten. That means all the classes are different, so the overrides will not work anymore and you need to retarget all of them. Also check whether the monitor or rule that you’re overriding is still there, because that may have changed too.

Otherwise, it’s reverse engineering what that custom monitoring part was and recreating it in a fresh management pack. Sometimes it’s easier to just start over, knowing that you’re going to need to refactor things. Although it’s daunting, you may need to use the old one simply as a reference and work from the new design down.

Dealing with IIS MP and web servers that have many websites and application pools 

“How to deal with IIS MP and web servers that have many websites and application pools. In an issue with the server, it creates lots of alerts for all of the above.” 

It sounds like it’s operating as designed because if there’s an issue with the server, and you have 50 websites, there are 50 websites down. This does unfortunately generate one server alert, plus 50 website alerts, plus a bunch more. But the monitoring is accurate.  

One possible solution, if you’re using VMs, is to split it into smaller web VMs. But that option requires a lot of work.

Alternatively, try to get it into Maintenance Mode quickly if the server issue is either expected or caused by a change. Maintenance Mode could be a good way to prevent this storm.

And finally, if you’re using PowerShell or a more advanced system to handle your alerts, you could potentially correlate the subsequent alerts to the parent alert and suppress them.
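The correlation idea can be sketched in a few lines. This is only an illustration in Python with a hypothetical `Alert` record and a hypothetical `correlate` helper; the field names are assumptions and do not reflect the SCOM alert schema or SDK. It treats a server-level alert as the parent and suppresses website and application pool alerts from the same machine.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical alert record; field names are illustrative, not the SCOM schema.
    name: str
    server: str
    kind: str  # "server", "website", or "app_pool"


def correlate(alerts):
    """Split alerts into those to raise and those suppressed under a parent.

    A child alert (anything that is not a "server" alert) is suppressed when a
    "server" alert exists for the same machine.
    """
    servers_down = {a.server for a in alerts if a.kind == "server"}
    raised, suppressed = [], []
    for alert in alerts:
        if alert.kind != "server" and alert.server in servers_down:
            suppressed.append(alert)
        else:
            raised.append(alert)
    return raised, suppressed


if __name__ == "__main__":
    # One failing server with 50 websites, plus one unrelated website alert.
    storm = [Alert("Server unavailable", "WEB01", "server")]
    storm += [Alert(f"Website {i} down", "WEB01", "website") for i in range(50)]
    storm.append(Alert("Website 0 down", "WEB02", "website"))

    raised, suppressed = correlate(storm)
    print(len(raised), "raised;", len(suppressed), "suppressed")
```

In this sketch the website alert on the second server is still raised, because no parent alert exists for that machine.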

And then also add dashboarding. If you have dashboarding, you can see the relevance of websites going down, and can ignore test websites, for example, that are on the same machine. Then you can focus on what is really important for your business. 

Is there an easy way to prevent flapping Alerts? 

There is some protection since it is a rule, and it should have suppression on it, so you would only see one alert per server. There may be different types of workflows coming in, which means you might see multiple alerts on a server, but there should not be too many. There should also be a repeat count on those, so you only get one ticket per server if it happens. And if the alert isn’t closed automatically, each new occurrence raises its repeat count.
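As a rough model of what suppression with a repeat count does, the Python sketch below collapses repeated events into one open alert per server and rule. The `apply_suppression` function and the event tuples are hypothetical, and starting the count at 0 is a modelling choice for this sketch, not a statement about SCOM internals.

```python
def apply_suppression(events):
    """Collapse repeated (server, rule) events into one open alert each.

    Returns a dict mapping (server, rule) to a repeat count. The first
    occurrence opens the alert with count 0; each later occurrence of the
    same pair increments the count instead of opening a new alert.
    """
    open_alerts = {}
    for server, rule in events:
        key = (server, rule)
        if key in open_alerts:
            open_alerts[key] += 1
        else:
            open_alerts[key] = 0
    return open_alerts


if __name__ == "__main__":
    events = [
        ("SRV01", "Script failed"),
        ("SRV01", "Script failed"),
        ("SRV02", "Script failed"),
        ("SRV01", "Script failed"),
    ]
    for (server, rule), repeats in apply_suppression(events).items():
        print(server, rule, "repeat count:", repeats)
```

Four events become two alerts here, which is why a ticketing integration that keys on the alert rather than the event only opens one ticket per server.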

Keep in mind that these are built-in rules for the agent itself and there is no repeated-events version of them, so you should look into what script is failing as well. The only other way to do this would be to create a new custom rule for repeated events detection.

Is there an easy way to move a group or multiple groups from one MP to another?

“Is there an easy way to move a group or multiple groups from one MP to another or do they need to be recreated? We’re working on cleaning up a left-behind giant cluster management pack.”

It’s going to vary largely depending on whether it’s unsealed or sealed. If your cluster management pack is an unsealed pack, everything those groups were used for, meaning every workflow and every override that targets them, will also be in that unsealed pack. Since it’s unsealed, if you move the group, the things that were in the pack won’t be able to target it.

If it was a sealed pack, say you had a major groups pack and wanted to split it into two or three group packs, or maybe group packs by theme, you should be able to move those. But then again, every unsealed management pack that targets that cluster pack will need to be rewritten to target the new context of your new group pack.

There isn’t an easy way to move the group and everything included. Just moving the group wouldn’t be so bad itself, but it’s the downstream effects that are really going to bite you here. Everything that pointed to it now needs to be updated as well. 

Data dropped due to maximum queue size increased. Modifying the registry settings doesn’t help much.

In the back end of SCOM, when you author a workflow you have your data sources, your probes, your write actions, and the data that gets passed between them, and that data goes into a queue at one point. So you’ve probably got an issue somewhere where the data isn’t being consumed out of the queue quickly enough, which could be a database write issue, an authentication issue, or a multitude of other things.

Increasing the maximum size of the queue will give you longer before it fills, but if you aren’t consuming data out of the queue at the same rate as, or faster than, you’re putting it in, you will always overrun it. So look for a Windows Error prior to this, something about a workflow that failed, and see what’s happening with that workflow.

If you have an extremely busy registry, upping that queue size can help. But if data is coming in faster than it is going out on a sustained basis, upping the queue is just going to delay the inevitable as it fills up.
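The arithmetic behind this is easy to check with a toy model. The Python sketch below is not SCOM code; `ticks_until_overflow` and its parameters are hypothetical, and it simply counts how many intervals pass before a queue with a fixed inflow and outflow exceeds its maximum size.

```python
def ticks_until_overflow(max_queue, inflow, outflow, max_ticks=100_000):
    """Return the tick at which the queue first exceeds max_queue, or None.

    Each tick adds `inflow` items and removes up to `outflow` items. None
    means the queue never overflowed within max_ticks.
    """
    depth = 0
    for tick in range(1, max_ticks + 1):
        depth = max(0, depth + inflow - outflow)
        if depth > max_queue:
            return tick
    return None


if __name__ == "__main__":
    # Sustained overload: 120 items in, 100 items out per tick.
    print(ticks_until_overflow(1000, inflow=120, outflow=100))  # 51
    # Doubling the queue size only roughly doubles the time to overflow.
    print(ticks_until_overflow(2000, inflow=120, outflow=100))  # 101
    # Once the consumer keeps up, the queue never fills.
    print(ticks_until_overflow(1000, inflow=100, outflow=100))  # None
```

Under these assumed rates, doubling the queue moves the overflow from tick 51 to tick 101, while fixing the consumer removes it entirely.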

What else could be daily, weekly, or monthly health checks for databases specifically? 

Look at the amount of config churn and data churn that you have, using a number of default management packs and default reports, like the ones built into SCOM itself.

Also look at the reports. So there are a lot of things that you can do on a daily, weekly, monthly, quarterly basis, to help make SCOM better. 

How can we automate the process of web URL monitoring, additions/removal from “Web application availability” Management Pack template? 

Adding it to and removing it from the template specifically will be difficult, because of the way the templates are modified from the UI. But Kevin Holman has a blog post on discovering dynamic data from either a CSV or another source, and you could potentially discover your URLs from there and target them.

So it would just be a discovery that runs every 12 hours or every day, for example, and pulls in the new URLs. Your workflows will then target that. Nathan says he doesn’t know if you can automate it very well through the UI, but it wouldn’t be too bad to write a custom management pack with a class type for the URLs being tested. You would then just modify the code the UI created to target your new class type.
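To illustrate only the logic of such a discovery, here is a minimal Python sketch that reads URLs from CSV text and works out what to add and remove compared with what is currently monitored. A real SCOM discovery would live in the management pack and is not shown here; the `url` column name and the `read_urls` and `plan_changes` helpers are assumptions made for this example.

```python
import csv
import io


def read_urls(csv_text):
    """Return the set of non-empty values in the 'url' column of CSV text."""
    urls = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = (row.get("url") or "").strip()
        if url:
            urls.add(url)
    return urls


def plan_changes(discovered, monitored):
    """Return (to_add, to_remove) as sorted lists of URLs."""
    return sorted(discovered - monitored), sorted(monitored - discovered)


if __name__ == "__main__":
    # Assumed CSV layout: a header row with a 'url' column.
    source = "url,owner\nhttps://a.example/,team1\nhttps://b.example/,team2\n"
    monitored = {"https://b.example/", "https://c.example/"}

    to_add, to_remove = plan_changes(read_urls(source), monitored)
    print("add:", to_add)
    print("remove:", to_remove)
```

Running the comparison on every discovery interval keeps the monitored set in step with the source file, which is the part that is hard to do through the template UI.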

If you want to truly automate it, create a new management pack, run the template once, and have a look at the XML it actually creates. It might create a little bit more than you would like, but you could try to replicate this for multiple objects.

Once monitored servers didn’t come out of maintenance mode automatically, even though I set up the timeframe. Why?  

There were actually a few bugs in the past, which got fixed with update rollups. So make sure you use the latest update rollup for whatever version of SCOM you have.

Sometimes it could also be due to the All Management Servers Resource Pool being down for a while in the middle of that maintenance mode. During that time, it doesn’t pick up that workflow, so the workflow that keeps an eye on the maintenance modes of other machines and takes them out of maintenance mode again is likely down at that moment. You then never know whether it picks things up correctly when it comes back up. So keep an eye on that resource pool.

There are also cases where the server is in double maintenance mode through a secondary schedule. So there are multiple possibilities.

Do you recommend Blake Drumm’s health check?

There are two options we want to share for health checks. Cookdown and TopQore have health checks. Cookdown’s is free and TopQore’s costs EUR 3000 but is much more comprehensive.

The Cookdown health check is script-driven. They send you a script that you can run in your environment, and it will give you pointers on what to look for. You then send the script’s output back to Cookdown, and they will have a look at it, perform some analysis, and book an explanation meeting with you in which they’ll run through the key points.

TopQore’s SCOM Health Check is quite extensive; it has a lot of checks into all SCOM components for their implementation and configuration. They also look for technical errors or improvements. TopQore also looks into the monitoring processes and procedures in general, including dashboarding, training, idle processes and documentation, lifecycle management, and much more.

All this results in a document of about 70 pages detailing the results and discussion. They also have a discussion with the SCOM admins to talk about what has been found and how to solve it. It’s a three-day engagement designed to deliver great value.

Neither Nathan nor Bob has tried Blake’s health check, but they said it looks sensible from a quick glance.

Conclusion 

And that’s a wrap for the SCOM clinic questions. If you missed part 1, find it here.  

We hope this helped you fix some of your SCOM problems so everything works smoothly now.