Roberts Blog

The House of SCCM and Intune on System Center Street

Tag: High Availability

ConfigMgr HA and losing the Content Library

You’d think completely losing your ConfigMgr Content Library (no backup) would be quite a dramatic event from a bumpy-road perspective. I found that it isn’t that traumatic at all. There are only two key activities: the first is some brief file-system jiggery-pokery, and the second is that the network is going to get a bit of a hammering, as all content will need to be resent (not redistributed) out to the DPs to get ConfigMgr to put it back into the Content Library.

If you have a backup, restore that puppy and get out of jail totally. But with no backup, well, read on.

A while back I lost an HDD that was one of a pair of 2TB disks in a striped (so no redundancy) RAID volume. I was presenting that volume as a shared disk to two Hyper-V VMs running SQL for my XL High Availability lab.

Losing that RAID volume was a bit of a problem. As I said, it was presented as a shared disk to two VMs on the same Hyper-V host, and they both used it to build a clustered SMB share, to which I had moved the content library as part of the prep work for switching to HA.

My content library was literally gone!

It’s only a lab, so when things like this happen, hmmm, interesting, time to investigate!

Only recently did I put another disk in to replace the failed one and bring the clustered SMB share back to life. Obviously there’s no content there, but the clustered SMB share was back online and writable.

While the Content Library is unavailable, no new content can be added to ConfigMgr; not the end of the world, but worth noting from an HA perspective.

I left the lab alone for a day, came back, and saw that ConfigMgr had attempted, without being prompted, to distribute the built-in Configuration Manager Client package to the Content Library (CL), but had failed to complete the task.

It failed because ConfigMgr hadn’t fully recreated the CL’s top-level folder structure: it had created the DataLib folder, but failed to create the FileLib folder, and stalled right there.

Here’s the transcript from distmgr.log:

  Started package processing thread for package 'HA200002', thread ID = 0x159C (5532)
  Sleep 60 seconds...
  STATMSG: ID=2304 SEV=I LEV=M SOURCE="SMS Server" COMP="SMS_DISTRIBUTION_MANAGER" SYS=L3CMN4.LAB1.COM SITE=HA2 PID=5364 TID=5532 GMTDATE=Sun Oct 06 20:10:46.881 2019 ISTR0="HA200002" ISTR1="" ISTR2="" ISTR3="" ISTR4="" ISTR5="" ISTR6="" ISTR7="" ISTR8="" ISTR9="" NUMATTRS=1 AID0=400 AVAL0="HA200002"
  Retrying package HA200002 (SourceVersion:17;StoredVersion:17)
  Start updating the package HA200002...
  CDistributionSrcSQL::UpdateAvailableVersion PackageID=HA200002, Version=18, Status=2300
  Taking package snapshot for package HA200002 from source \\L3CMN4.Lab1.com\SMS_HA2\Client
  GetDiskFreeSpaceEx failed for \\FSListener\SCCMContentLibrary\ContentLibrary\FileLib
  GetDriveSpace failed; 0x80070003
  Failed to find space for 104867131 bytes.
  CFileLibrary::FindAvailableLibraryPath failed; 0x8007050f
  CFileLibrary::AddFile failed; 0x8007050f
  CContentDefinition::AddFile failed; 0x8007050f
  Failed to add the file. Please check if this file exists. Error 0x8007050F
  SnapshotPackage() failed. Error = 0x8007050F
  STATMSG: ID=2361 SEV=E LEV=M SOURCE="SMS Server" COMP="SMS_DISTRIBUTION_MANAGER" SYS=L3CMN4.LAB1.COM SITE=HA2 PID=5364 TID=5532 GMTDATE=Sun Oct 06 20:10:47.362 2019 ISTR0="\\L3CMN4.Lab1.com\SMS_HA2\Client" ISTR1="Configuration Manager Client Package" ISTR2="HA200002" ISTR3="30" ISTR4="32" ISTR5="" ISTR6="" ISTR7="" ISTR8="" ISTR9="" NUMATTRS=1 AID0=400 AVAL0="HA200002"
  CDistributionSrcSQL::UpdateAvailableVersion PackageID=HA200002, Version=17, Status=2302
  STATMSG: ID=2302 SEV=E LEV=M SOURCE="SMS Server" COMP="SMS_DISTRIBUTION_MANAGER" SYS=L3CMN4.LAB1.COM SITE=HA2 PID=5364 TID=5532 GMTDATE=Sun Oct 06 20:10:47.415 2019 ISTR0="Configuration Manager Client Package" ISTR1="HA200002" ISTR2="" ISTR3="" ISTR4="" ISTR5="" ISTR6="" ISTR7="" ISTR8="" ISTR9="" NUMATTRS=1 AID0=400 AVAL0="HA200002"
  Failed to process package HA200002 after 68 retries, will retry 32 more times
  Exiting package processing thread for package HA200002.

I created the FileLib and PkgLib folders myself and kept an eye on DistMgr, and the magic began to happen: the client source was snapshotted into the CL, and then it went out to the DPs just fine. That GetDriveSpace error in the log, 0x80070003, is Win32 error 3 (ERROR_PATH_NOT_FOUND), which fits a missing FileLib folder exactly.
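For reference, recreating the missing folders is nothing more exotic than the below; a minimal PoSh sketch, using the \\FSListener path from the log above, so adjust the root to wherever your content library lives:

  # Root of the relocated content library (path as per the distmgr.log entries above)
  $clRoot = '\\FSListener\SCCMContentLibrary\ContentLibrary'

  # DistMgr had already recreated DataLib; create the two folders it didn't
  foreach ($folder in 'FileLib', 'PkgLib') {
      $path = Join-Path -Path $clRoot -ChildPath $folder
      if (-not (Test-Path -Path $path)) {
          New-Item -Path $path -ItemType Directory | Out-Null
      }
  }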

From there I needed to find all the content I wanted to put back in the CL, in my own time, and perform an Update Distribution Points action on it. Content flowed from the source to the CL and then onto the DPs as expected.
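If there’s a lot of content, driving that from PoSh saves some clicking. A rough sketch, assuming the ConfigurationManager module that ships with the console, my lab’s HA2 site code, and covering classic packages only (applications, boot images and the rest have their own cmdlets):

  # Load the console's PowerShell module and switch to the site drive (HA2 = my lab's site code)
  Import-Module "$($env:SMS_ADMIN_UI_PATH)\..\ConfigurationManager.psd1"
  Set-Location 'HA2:'

  # Trigger Update Distribution Points on every classic package, which snapshots
  # the source back into the content library and sends it out to the DPs
  Get-CMPackage | ForEach-Object {
      Update-CMDistributionPoint -PackageId $_.PackageID
  }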

In the ‘real world’ you’d just leave the content alone. It’s no doubt already on the DPs, and clients can still retrieve it; new content can be added now that the content library is back online. Content only really needs to be resent to the DPs when its source has changed, so the redistribution of content to the DPs to get it back into the CL doesn’t need to happen until that content has actually been iterated.

So, what can you take away from this? If you lose your content library, and for some reason find yourself without a backup, the outcome isn’t that bad beyond some controllable network utilisation. As for recreating the folders, I’ve asked the PG to look at the code, as it does recreate one folder (DataLib) but fails to create the other (FileLib); I didn’t check whether it would also fail to create the last remaining folder (PkgLib). If they can make that piece of code consistent in folder creation, then recovery becomes solely about network utilisation, as long as you haven’t lost your content source as well!

In an ideal world ConfigMgr would recover from this rare and sloppily-managed (no backup) event without redistributing all that content to the DPs, so that the problem pivots solely around the content source and the content library alone. Perhaps that can be made to happen, I’m not sure. And besides, in a lab this can be brushed off; in production, backups … are … everything.

ConfigMgr High Availability and the Content Library

Sub-titled “ConfigMgr High Availability feature and making the Content Library highly available”.

In this article I’m going to focus specifically on considerations around placement of the Content Library for High Availability purposes.

The Content Library, an under-the-radar-for-most ‘layer’ residing on the Primary in ConfigMgr, now has to be moved away from a Primary before High Availability can be enabled.

It cannot be moved back.

Most designs out there have a Distribution Point on the Primary, which is being fed by the Content Library, which is used to feed remote Distribution Points.

The Content Library itself is ‘fed’ by the Content Source locations you specify when introducing new content.

If a server that hosts a content source, assuming it is not the Primary, were to fail, it would be a simple exercise for an administrator to create a new share to act as a repository while the server hosting the original content source share is recovered. If you wanted, you wouldn’t even need to recover that server at all (again, if it isn’t the Primary, obviously): you could recover the data itself, put it onto the new share, then change the content source locations for the existing content using the tooling that is out there, or automation via PoSh; an energy-rich exercise, but doable.
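To give a flavour of the PoSh route, re-pointing classic packages at a replacement share might look something like the below; a sketch only, with hypothetical server names, assuming the ConfigurationManager module is loaded as in the earlier snippet:

  # Old and new content source roots; both server names are hypothetical
  $oldRoot = '\\FailedServer\Sources'
  $newRoot = '\\NewServer\Sources'

  # Re-point every classic package whose source lived under the failed share
  Get-CMPackage | Where-Object { $_.PkgSourcePath -like "$oldRoot*" } | ForEach-Object {
      $newPath = $_.PkgSourcePath.Replace($oldRoot, $newRoot)
      Set-CMPackage -Id $_.PackageID -Path $newPath
  }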

This is not the case with the Content Library.

You cannot simply create a new location for the content library and carry on, as you would need to ‘move’ the original content library to the new location first. Catch-22: the server hosting the content library share is most likely down, so all energy has to be focused on recovering said server to regain access to the share, and once the share is accessible again there’s no need to proceed with a ‘move’ at all.

Without the Content Library, and read/write access to it, no new content can be created. Many wheels can still turn, including the site server role now, but we’re not highly available until the Content Library is brought back online.
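A cheap way to keep an eye on that read/write access is to round-trip a file through the share; plain PoSh, with the FSListener path again being my lab’s rather than a fixture:

  # Content library share; substitute your own UNC path
  $cl = '\\FSListener\SCCMContentLibrary\ContentLibrary'

  # Write then delete a probe file to prove the share is online and writable
  $probe = Join-Path -Path $cl -ChildPath "probe_$([guid]::NewGuid()).tmp"
  try {
      Set-Content -Path $probe -Value 'probe' -ErrorAction Stop
      Remove-Item -Path $probe -ErrorAction Stop
      Write-Output 'Content library share is online and writable'
  } catch {
      Write-Warning "Content library share check failed: $_"
  }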

It can quite literally paralyze operations.

That Content Library share is now kind of very special.

So as of Build 1806 of ConfigMgr Current Branch, the options for managing the Content Library are quite narrow: move it onto an SMB share, which puts the onus on you to introduce the infrastructure, potentially complex, needed to support that share.

Let’s explore what this means.

In the diagram below, we can see a High Availability design mock-up showing the minimal amount of ‘moving parts’ necessary to implement High Availability in Build 1806, while keeping client communications and console access pinned to two site systems, with the Primary left to work on being a Primary and processing load as fast as possible. That last part is key: we’re operating an essentially queue-based product here, and we always try to tease out congestion \ chokepoints \ bottlenecks where we can.

* SIR’s = Single Instance Roles, Service Connection Point being a primary example

As I stated above, the roles can be dispersed differently, depending on personal preference really. The SMS Provider can be installed onto the Primaries, as that is technically possible in Build 1806, along with every other role barring the Distribution Point and the Content Library; or the SMS Providers could, I believe, be installed onto the SQL servers, so that their communications with SQL are not network-based. SQL, however, has to remain remote from the Primaries for this cut of the feature.

So how are we handling the Content Library in the above design?

We’re using an SMB share on a site system to host the content library, which now introduces a weakness, a vulnerability.

It is now a single point of failure, due to the dependence on the site system hosting the SMB share. Classic textbook problem.

On the site system itself, the data can be made highly available using physical disks in a RAID 10 configuration presented to the site system, or storage presented from a SAN; a disk lost here or there, with a spare kicking in, and the data is safe and remains accessible.

If the site system is installed onto a virtual machine, and a lot of sites are running on virtual machines, either on-premises or in Azure or other cloud services, it will benefit from the host’s underlying protections and performance (RAID, or LUN disks from a SAN, SSDs).

But if we lose the host, we’ve lost access to the share, which means we’re paralyzed until we do something about it. Even if we could switch the disk to another site system, no dice: the content library’s path is pinned to the hostname of the share it was moved to.

Time for a cup of tea and some Disaster Recovery.


So you’ve got this super-fancy Highly Available Hierarchy, you’ve touted it to management as the bee’s knees, but if you lose one site system, the one that hosts the SMB share for the content library, there is a lot of noise generated and energy required to bring things back onto the rails with some DR. It’s a look, just not a good look.

Another consideration is that the lost site system may contain the SIRs, the single instance roles such as the Service Connection Point, Endpoint Protection Point, and Asset Intelligence Synchronization Point. Their loss is easily recovered from: simply remove them from the missing site system and place them onto the alternate site system, then, in your own time, perform DR or rebuild the failed site system. Your LLD should plan for both site systems to access the internet for these services.
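Moving those roles across is scriptable too. A hedged sketch for the Service Connection Point alone, with hypothetical server names, assuming the ConfigurationManager module; the Endpoint Protection Point and Asset Intelligence Synchronization Point have their own equivalent Remove-*/Add-* cmdlets:

  # Remove the Service Connection Point from the failed site system
  # (both server names here are illustrative)
  Remove-CMServiceConnectionPoint -SiteSystemServerName 'FailedSS.Lab1.com' -SiteCode 'HA2' -Force

  # Stand it back up on the surviving site system, in online mode
  Add-CMServiceConnectionPoint -SiteSystemServerName 'AltSS.Lab1.com' -SiteCode 'HA2' -Mode Online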

Having the content library as a SPOF is still a better look than pre-High Availability, since the site server role, the old blocker, keeps ticking along nicely: critically, it continues servicing client registration requests, and thus doesn’t impede OSD, and carries on processing its work queues, while wheels within wheels continue to turn as the site systems provide the redundancy needed to do so, with clustered SQL and role redundancy.

Just no new content can be authored into the hierarchy.

There is another way. One that I and Conan are just getting into.

Make the share itself highly available by using a Cluster Share.

The hostname becomes an alias, and we can survive cluster node failure as a result.

Perfect.
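Standing one up on an existing failover cluster is only a few lines; a sketch assuming the FailoverClusters and SmbShare modules on a built cluster, with the FSListener name, cluster disk and paths all illustrative, and remembering the site server’s computer account needs Full Control over the share:

  # Create a clustered File Server role; its client access point ('FSListener')
  # is the hostname alias that survives the loss of a cluster node
  Add-ClusterFileServerRole -Name 'FSListener' -Storage 'Cluster Disk 1'

  # Carve out a folder on the clustered disk and share it under that scope,
  # granting the site server's computer account (name from my lab) Full Control
  New-Item -Path 'S:\SCCMContentLibrary' -ItemType Directory | Out-Null
  New-SmbShare -Name 'SCCMContentLibrary' -Path 'S:\SCCMContentLibrary' -ScopeName 'FSListener' -FullAccess 'LAB1\L3CMN4$'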

Now in our design, where would the best place to house this cluster share be?