I’ve been very much interested in the High Availability feature since its inception.
I’ve followed its evolution in technical preview, kicking its tyres and discussing it with the product group and MVP peers. Seeing the feature leave the development dock of technical preview and sail out to sea in the form of current branch is exciting.
Exciting because the High Availability feature is a high-value design element for ConfigMgr Hierarchies, something architects have been waiting a very long time for.
We’re used to high availability for most roles now, with a few single-instance roles causing ripples, but the crown jewel is the site server role.
With all the redundancy in place, business can continue, but nothing new can be authored without that site server role, and critically, client registrations would cease to function, causing OSD failures.
High Availability in its first form will make the Site server role highly available, with a small lead-time between transitions. That lead-time is mostly due to failover being a manual exercise in Build 1806, and because there is a delay of up to 30 minutes before the transition is completed. This will ease out to become instant when automatic failover is introduced. And as this feature iterates, we’ll see a reduction in the server count.
There’s more that High Availability can be used for. In the future we’re going to be able to build multiple primaries to leverage elastic computing, scaling back as demand tails off, and right now we can use High Availability to completely remove the need to do an in-place OS upgrade on a primary ever again.
I put together a quick poll on Twitter to gauge interest in the feature. 427 votes is surely not representative, but it gives me an idea of interest nonetheless.
Over half of the participants do not have High Availability on the radar. The remainder breaks down almost equally between not interested and implementing soon, with a small percentage implementing right now.
I’ll do another poll later in the year. I’m sure by that time we’ll see far more adoption as the penny drops.
In this post, I’m going to focus on showing you how to recover from an unrecoverable primary site failure, using an already built High Availability ConfigMgr hierarchy.
At some point I’ll rattle out a HA build-out guide. It is pretty straightforward, but I see that others have done this already, and alongside my now-ancient guide on SQL AlwaysOn, I don’t see what more I could contribute beyond the good work they have already laid down. I may have a go at it nonetheless.
Right then. Let’s define the Lab I’m using to make tracking what is about to happen easier.
The server and role manifest for the lab hierarchy looks like this as of Build 1806:
This is a lab build-out and not for production, so not all site roles are installed or referenced. Their inclusion and placement is a matter of simple consideration and action. For example, the Reporting Point and Reporting Services elements can be located on the site systems instead of on the SQL servers, and the Asset Intelligence Synchronisation point can be put on the Site systems.
In the above mock-up I’ve placed the SMS Providers onto the Site systems instead of the Primaries, although placing the SMS Provider onto the primaries is a supported option in Build 1806.
When Build 1810 arrives, we’ll be able to condense this solution down to 4 nodes by placing the SQL AlwaysOn and Windows Cluster onto the two primary site servers and jettisoning remote SQL.
At some point, we may very well arrive at a 2 node solution, time will tell. There are a few interesting technical barriers that need to be overcome before such a design can be realised, making it unrealistic in the relatively short term. I figure that 4 nodes to make a rock-solid highly available hierarchy is a cost worth paying today.
So let’s have a play with my HA lab, literally switching off a passive site server to check out how unplanned events are handled.
We could have switched off the active primary instead, but that would have incurred a time penalty while we wait for the transition to take place and the passive to become the active. I might do that at the end of this guide, let’s see.
Handling an unrecoverable passive Site server
My two Primaries are called:
- L3CMN1 (Active)
- L3CMN3 (Passive)
Node 2 ‘took one for the team’ during testing.
I’ve turned off L3CMN3.Lab1.com, so as to simulate the complete and unrecoverable failure of one of the two nodes of a High Availability cluster.
As you can see in the shot below, L3CMN3.Lab1.com is listed as a Site system, which has the Site Server role installed:
* That lower-case server entry is annoying me too, sorry
If you try to remove the site system by right-clicking on it in the top panel, and selecting delete, it’ll complain that there is a role there:
So clearly the first step is to remove that role from the site system. The role is shown in the Site System Roles panel, so right-click the role there and select Delete:
Give it a few moments, then select the site system in the top panel and select Delete:
L3CMN3.Lab1.com is gone:
We’re currently running with one primary, so we’re going to need to build ourselves a new passive primary at some point. We’d obviously schedule this for out of hours, although there is little, if any, operational down-time while building a new passive primary.
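The removal order used above (role first, then the site system) can be sketched as a toy model. This is only an illustration of the dependency, not the ConfigMgr API: the `Hierarchy` and `SiteSystem` classes and the `retire_dead_passive` helper are hypothetical names made up for this sketch.

```python
class SiteSystem:
    """Toy stand-in for a site system and the roles installed on it."""

    def __init__(self, name, roles=()):
        self.name = name
        self.roles = set(roles)


class Hierarchy:
    """Toy model of the Servers and Site System Roles list (hypothetical, not a real API)."""

    def __init__(self):
        self.systems = {}

    def add_system(self, name, roles=()):
        # Keys are lower-cased so lookups ignore the case of the server name.
        self.systems[name.lower()] = SiteSystem(name, roles)

    def remove_role(self, name, role):
        self.systems[name.lower()].roles.discard(role)

    def delete_system(self, name):
        system = self.systems[name.lower()]
        if system.roles:
            # Mirrors the console complaint: a role is still installed.
            raise RuntimeError(f"{system.name} still has roles: {sorted(system.roles)}")
        del self.systems[name.lower()]


def retire_dead_passive(hierarchy, name):
    """Remove every role first, then delete the site system itself."""
    for role in list(hierarchy.systems[name.lower()].roles):
        hierarchy.remove_role(name, role)
    hierarchy.delete_system(name)
```

Calling `delete_system` on a system that still holds a role raises an error, while `retire_dead_passive` succeeds because it clears the roles first, which is the same ordering the console enforces in the walkthrough.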
Building a new passive Primary
Ready up a new OS: give it a name and a fixed IP, install IIS + ADK in accordance with a Primary site’s needs, and set permissions so that the active primary and everyone else can say hello. Then create a new Site system in the ConfigMgr console. Here I am creating L3CMN4.Lab1.com:
And I’m going to add the Site server in passive mode role:
Let it copy the files to the server, and install to C:\Program Files\Microsoft Configuration Manager
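Since the file copy depends on the active primary being able to reach the new server, a quick reachability check beforehand can save a failed attempt. The sketch below is a minimal example using only the Python standard library. The `port_open` and `check_ports` names are made up for this sketch, and which ports matter is an assumption to confirm against your own environment and firewall rules.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Return a dict mapping each port to whether it accepted a connection."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("L3CMN4.Lab1.com", [135, 445])` run from the active primary would test the RPC endpoint mapper and SMB ports. Those two ports are my assumption of a sensible starting point, not a complete prerequisite list.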
The new passive primary L3CMN4 now shows in the Servers and Site System Roles list:
Note that by default an SMS Provider is not installed with the passive primary. I recommend not placing them there if you have site systems to hand for client-facing roles, as it will complicate and extend recovery times. Maybe in the future when we’re down to 2 or 3 nodes, it may be more appropriate to house everything on the primaries, everything, maybe.
Keep an eye on the FailOverMgr log, and visit the new Primary and check out its setup log. I recommend using LogLauncher and punching in the servers that make up your HA setup. Use it to visit and open logs, it will make life easier.
While we’re waiting for the build to complete, we can button mash the refresh button in the console. If it fails, it is most likely due to permissions or prerequisites. The FailOverMgr log and site setup logs will help you shore the issues up, and a Retry can be performed (right-click the node in Sites > (Site server) > Nodes tab).
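If you would rather not eyeball the logs, a small script can pull out the lines worth reading. This is a minimal sketch that assumes the log is readable as plain text. The `find_problem_lines` name and the keyword list are my own choices, and the log path is something you would need to supply for your own install.

```python
from pathlib import Path


def find_problem_lines(log_path, keywords=("error", "failed")):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive. The default keywords are an assumption
    about what is worth flagging, not a list taken from ConfigMgr itself.
    """
    hits = []
    # errors="replace" keeps the scan going if the file has odd bytes in it.
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    for number, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        if any(word in lowered for word in keywords):
            hits.append((number, line.strip()))
    return hits
```

Point it at a copy of the FailOverMgr log or the site setup log and it returns the matching lines with their line numbers, which gives you somewhere to start before hitting Retry.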
L3CMN4 is in place as a passive Primary, and we’re back to high availability of the Site server role:
With minimal roles deployed to the Primary, the option to give up on classical DR and build out a new primary site server on failure becomes a reality, and a good course of action to take.
If you have Roles installed onto the Primaries, things get a little bit more complicated, especially for the SMS Provider, and when Build 1810 arrives and SQL is co-located with the Primaries, building a new passive node will require a lot more to achieve.
All very doable though.
I figure I could build a passive and get it back into a Windows Cluster, then install SQL using a pre-configured unattended file, get the availability group feature running and join it back in to the existing AG being held alive by the active primary node in time for tea. Well, over a few hours if things are well-connected and it’s all running on good equipment (not laggy servers). But there will be down-time, due to a site reset being necessary to complete the SMS Provider removal process, which has to take place before a new primary can be built.
In a moment I’m going to nuke the Active Primary by turning it off.
As you can guess from the serialisation, L3CMN1 is the very first primary built, so offing it is like cutting the cord in a big way. We’re not swinging back and forth between an artificial primary and a real primary. Each primary in a High Availability configuration is a real primary, which cares not if its originator still exists, just whether it can reach the database and site systems.
Let’s step back a moment and ponder things.
How many of you are fixated on keeping the primary alive at all costs, treating it like an irreplaceable treasure chest?
Isn’t it a strange feeling to begin treating a primary like a throw-away lighter, something that can easily be replaced?
That’s new, exciting as I said.
High Availability is a game-changer. It is literally forcing us to look at a primary, which we’ve treated with affection and protected at all costs in the past, as something easily replaced and ‘disposable’. Almost.
3rd Party product integration with primaries may require some tending, which would complicate a High Availability design or recovery procedure. I’m exploring that.
Ok let’s have some more fun.
Making life complicated, a dead primary with SMS Provider and Roles
I’ve now installed the SMS Provider and a Management Point onto L3CMN1, before I sacrifice it to the Hyper-V gods for this guide.