How an issue with PowerShell DSC in WMF 5 cost us $5526.95

Short Version

  • PowerShell DSC in WMF 5.0 can lead to a significant increase in IO operations.
  • Operations will increase over time.
  • Fix is included in WMF 5.1, and a workaround is available.

With the move to the cloud, we often don’t consider how simple application bugs could lead to large infrastructure bills, however in this case, a minor issue with WMF 5.0 lead to a IO usage bill of $5526.95 over just under 6 months.

Background

In early December, as part of a regular review of Readify’s Azure usage, we noticed a spike in the usage for one of our Azure Subscriptions. The subscription accounts for a significant portion of our usage, and is used internally.

Looking closely at our usage break down, we started to see something that was very unusual, Data Management Geo Redundant Standard IO - Page Blob Write, had taken 2nd place, overtaking a D13v2 virtual machine that runs 24/7. These write operations are billed in units of 100000000 operations leading to the realisation that we had a significant IO consumer within one of our Azure Storage accounts. We quickly narrowed this down to two storage accounts housing the majority of our IaaS deployments.

Digging Deeper

In terms of IaaS, we don’t have anything deployed that would be considered particularly IO intensive, well, not to this level. We just have some AD DCs, ADFS servers, VPN, VSTS build and some Azure Automation Hybrid workers. Nothing was jumping out as an obvious culprit.

Looking at some metrics for the storage account, two stood out as being unusually high; Total Requests and Total Egress.

I started to look at the usage reports produced by the Azure EA portal. It appeared at first that the usage increased at around September. At first I thought it could have been something to do with some work around Azure Recovery Services, either some replication or backups, but after a few days, it was obvious this wasn't the cause of our issues.

Calling Support

At this point, I raised a support case with the Azure team.

I just want to say a massive thank you to the Azure Support team, they did an amazing job in resolving the issue. They started off just as perplexed as I was. We went through the workloads, and nothing presented as an obvious cause of our issues.

We went back through the usage reports in the EA portal. Looking back further, we determine the usage increased in July. Had we implemented anything in July? At this point, my mind drew a complete blank.

The engineer offered to see if the back-end engineering teams might be able to help us narrow down the usage, hopefully leading us to the VHD(s), and thus the virtual machine that was driving this usage.

The next day, they sent through an excel spreadsheet. The main culprits were our DCs and ADFS servers. These had been around for over a year, and had very little changes. I thanked support and promised I would spend some time studying the machines and get back to them if I found something.

Staring at Resource Monitor

I fired up Resource Monitor on one of the machines, and spent the next hour watching the disk usage. After almost going cross-eyed, I saw it, it was hard to believe, but I saw something that wasn’t quite right. I saw huge write cycles from the WMI Host process to C:\Windows\System32\config\systemprofile\AppData\Local\Microsoft\Windows\PowerShell\CommandAnalysis\PowerShellAnalysisCacheEntry*. Could that be the cause?

I quickly looked on other virtual machines, and saw similar patterns. Things were pointing to these files, but what were they? What was the WMI Host doing?

The cause is found!

I decided the first thing to do is find out what those files were, so I opened my favourite search engine, Duck Duck Go, and entered _wmi host PowerShellAnalysisCacheEntry, and hit search. The first two results, get-psdrive & abnormal I/O and Topic: Dsc v5 get-psdrive abnormal I/O leapt out.

That is when it hit me. We implemented Azure Automation DSC in July. Crap!

I started with the second result, Topic: Dsc v5 get-psdrive abnormal I/O, on PowerShell.org. In the post, from May, the user Brik Brac talks about seeing a major I/O performance issue using DSC v5.

Brik Brac also created the post on the Windows Server User voice, get-psdrive & abnormal I/O, in this post, Zachary Alexander a member of the PowerShell team acknowledged the issue, and posted up a workaround. He also posted that this issue would be fixed in WMF 5.1. WMF 5.1 will be released sometime in January 2017.

Simply put, the issue lies in how PowerShell discovers modules to import when performing certain PowerShell PS Drive Operations, and this heavily impacts DSC. Each time DSC runs, be it monitoring or applying configuration, the number of write operations on these cache files increases, left unchecked, the growth is dramatic. The files don't grow much in size, just the number of write operations to them. You can read more on auto discovery here, How Module Command Discovery Works in PSv3.

xRegistry is one of the resources impacted the most by this issue and as part of our DSC, we use xRegistry resources to harden the SSL configuration on our servers. In fact, this was one of the first reasons we moved to DSC.

I quickly emailed the support engineer I was working with, after a quick phone call, we felt confident that we had found the source of our issues. Now we just needed to prove it.

Resolving the issue

There were two ways we could resolve the issue. Firstly, I opted to install the WMF 5.1 preview onto systems where possible; secondly; we implemented the work around, much like outlined by Zachary in the User Voice post that clears the cache files.

Thankfully, implementing these steps was quite simple, but then we had a nervous wait to see if we had fixed the issue.

I started to see the Total Requests drop over the course of the day whilst I installed WMF 5.1 and ensured DSC was appropriately applied using Azure Automation DSC. By the end of the following day, the Total Requests had completely bottomed out, from over 35M to several thousand.

Ensuring this does not repeat

There are two ways to ensure that something like this doesn’t happen again.

First, implement threshold alerts for your Storage accounts. We didn’t have any defined, and since have. I strongly recommend that everyone ensures that they have alerts for the Total Requests metric.

Next, check your server monitoring. We had monitoring for CPU, memory usage, and free disk space, but I didn’t have any monitoring of disk IO. You can do this in the Azure Portal, or via OMS. If I had this in place, I would have picked up this issue much earlier. The other thing I will be doing is ensuring that our server and storage account deployment templates include basic performance alerting.

More Information