Automating Maintenance Mode for Computers Behind a Gateway: SCOM Management Pack by Kevin Holman

Kevin Holman was one of the titans of SCOM management pack (MP) authoring who took on the challenge of building a community-requested MP in just 24 hours during HackaSCOM.

He was given the unenviable task of creating a SCOM management pack that puts computers behind a gateway into maintenance mode when the gateway goes down and takes them out of maintenance mode when the gateway becomes available again.

The community has long asked for this, so a solution would be a real win for SCOM users.

The plan

As the competition started, Kevin spun up a few gateways with agents behind them and some certificates to get the test environment ready and build the MP.

The plan was to start by writing a monitor just for gateway availability. SCOM already has built-in heartbeat failure monitoring, but most people set its thresholds quite long so that short network outages don’t trip it. Kevin figured that creating his own monitor would give him more control over the process.

His plan was then to have the monitor’s recovery start maintenance mode when the monitor goes unhealthy. The question then is: what gets put into maintenance mode? His initial idea was to make a dynamic group containing all the machines assigned to a particular gateway and put everything in that group into maintenance mode when the monitor reads as unhealthy.

Let’s see what he actually built.

The final management pack

The SCOM community wanted an MP that does the following: when there’s a gateway or network outage, either because the gateway goes down or because the link goes down, it suppresses the alerts from the agents that report to the gateway. This would stop hundreds of heartbeat failure and computer unreachable alerts flooding in. Those alerts are misleading, because the real cause is a single problem such as a failed network link.

So, the community wanted a gateway outage to trigger maintenance mode for the agents behind the gateway, and for maintenance mode to end once the gateway is available again.

This is what Kevin built.

The management pack uses a simple ping of each gateway to monitor availability. If the gateway is unavailable, the monitor marks it ‘unhealthy’, which triggers a recovery that puts the group behind the gateway into maintenance mode. This group can be dynamically populated with the right agents, so maintenance mode is started for those agents only.

Once the ping recognises that the gateway is healthy again, another recovery is triggered to stop maintenance mode.
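The health-state logic described above can be illustrated with a small, language-neutral sketch. The Python below is not SCOM code and does not reflect how the MP is actually implemented: `GatewayWatcher`, `start_maintenance`, `stop_maintenance` and the gateway name `GW01` are all hypothetical, and the ping result is passed in rather than measured. It only shows the idea that each change of state runs exactly one recovery.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GatewayWatcher:
    """Hypothetical model of one gateway's ping monitor and its two recoveries."""
    gateway: str
    start_maintenance: Callable[[str], None]  # stands in for the recovery on the unhealthy state
    stop_maintenance: Callable[[str], None]   # stands in for the recovery on the healthy state
    healthy: bool = True

    def record_ping(self, reachable: bool) -> None:
        """Update the health state and run a recovery only when the state changes."""
        if reachable == self.healthy:
            return  # no state change, so no recovery runs
        self.healthy = reachable
        if reachable:
            self.stop_maintenance(self.gateway)
        else:
            self.start_maintenance(self.gateway)


def simulate(ping_results: List[bool]) -> List[str]:
    """Feed a sequence of ping results to a watcher and return the recoveries that ran."""
    actions: List[str] = []
    watcher = GatewayWatcher(
        gateway="GW01",  # made-up gateway name
        start_maintenance=lambda gw: actions.append(f"start:{gw}"),
        stop_maintenance=lambda gw: actions.append(f"stop:{gw}"),
    )
    for reachable in ping_results:
        watcher.record_ping(reachable)
    return actions
```

In this sketch, repeated failed pings during one outage do not start maintenance mode again; only the transition between healthy and unhealthy does anything.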

Authoring tip: It’s little known that it’s possible to run a recovery when a monitor goes back to healthy status. You have to write it in XML, but it is doable.

Within the management pack, you need to create a group for each gateway, and a PowerShell script populates each group with the agents assigned to that gateway. When maintenance mode is triggered, it’s triggered for that group.
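As a rough illustration of the grouping idea only, the Python sketch below builds one membership list per gateway from an agent-to-gateway mapping. It is an assumption-laden model, not the PowerShell used in the MP: the function name and the sample agent and gateway names are invented, and in SCOM the mapping would come from the management group rather than a dictionary.

```python
from collections import defaultdict
from typing import Dict, List


def group_agents_by_gateway(agent_to_gateway: Dict[str, str]) -> Dict[str, List[str]]:
    """Return, for each gateway, the sorted list of agents that report to it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for agent, gateway in agent_to_gateway.items():
        groups[gateway].append(agent)
    # Sort members so the result is stable regardless of input order.
    return {gateway: sorted(members) for gateway, members in groups.items()}


# Made-up sample data: two agents behind GW01, one behind GW02.
sample = {"srv-b": "GW01", "srv-a": "GW01", "srv-c": "GW02"}
groups = group_agents_by_gateway(sample)
```

Putting `groups["GW01"]` into maintenance mode then affects only the agents behind that gateway, which is the point of having one group per gateway.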

Demo

Kevin demoed his management pack at the end of the HackaSCOM. You can find the demo in the video from 34:30.

In the Gateway Maintenance Mode folder, you can see the gateway health state. It’s a watcher, so it gives you a perspective on whether the gateway is communicating right now or not. By drilling into the watcher, you can see the ping monitor and its health state.

There are also views of the computers assigned to a gateway and of the associated dynamic group.

When a gateway becomes unavailable because of a network link failure or similar, its ping monitor goes unhealthy and the first recovery runs to start Gateway Group Maintenance Mode.

You can see on the Group View that the group has now been put into maintenance mode. You can see the same for the individual computer objects in the Computers View.

Within the maintenance mode settings for the objects, you will see a special comment that shows the gateway outage maintenance mode was based on a group membership. Knowing this means that you don’t have someone trying to fix the wrong problem.

When the ping monitor is able to communicate with the gateway again, the monitor will be set back to healthy.

If you open Health Explorer on a gateway and drill into the ping monitor, you’ll see that a recovery ran on the healthy condition. That recovery stops maintenance mode, and the groups and computers are removed from it.

Want to know how Kevin built this in 24 hours? He built it almost entirely from scripts from his own fragment library and used Silect MP Author Professional and a whole lot of PowerShell.

What’s next?

Kevin was asked how he’d take the management pack to the next level and there are already several ideas on his to-do list for the next few weeks and months. Here are a few of the things to come:

Each gateway ping monitor currently runs independently, so if you had 50 gateways you’d have 50 scripts running every minute. This isn’t ideal, so the workflow needs to be reworked to cook down.
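One common way to think about cookdown is a single shared run that produces a result for every gateway, with each gateway’s monitor reading only its own entry. The Python below is a minimal sketch of that idea under stated assumptions; it is not SCOM code and not necessarily the design Kevin has in mind. `probe_all`, `gateway_state` and the `ping` callable are hypothetical names.

```python
from typing import Callable, Dict, Iterable


def probe_all(gateways: Iterable[str], ping: Callable[[str], bool]) -> Dict[str, bool]:
    """One shared run that pings every gateway and returns a result per gateway."""
    return {gateway: ping(gateway) for gateway in gateways}


def gateway_state(results: Dict[str, bool], gateway: str) -> str:
    """Each gateway's monitor reads only its own entry from the shared results."""
    # A gateway missing from the results is treated as unhealthy in this sketch.
    return "healthy" if results.get(gateway, False) else "unhealthy"


# Made-up example: GW02 is unreachable, everything else responds.
results = probe_all(["GW01", "GW02"], ping=lambda gateway: gateway != "GW02")
```

With this shape, 50 gateways mean one script run per interval and 50 cheap lookups, instead of 50 separate script runs.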

There also needs to be some error handling for when things don’t go as expected. Failures should be logged as events so that rules can alert on them.

It’d also be great to eventually automate group creation, so that when you import the MP it detects how many gateways you have, creates a dynamic group for each one and sets up the views automatically.

Judges’ comments

The judges were impressed, with Stoyan Chalakov even saying it was ‘flawless’. Most seemed to agree that it was new to them that you could execute a recovery on a healthy state (Kevin used ExecuteOnState="Success").

This was an incredible 24-hour build from Kevin Holman and the SCOM community have been the real winners here with this new management pack available.

Read Kevin’s write up and get the MP

Kevin kindly wrote up how the MP works on his own blog, and the write-up links to the MP itself. Read more and download here.