Presentation: Secure Azure Deployment Patterns

It was a wonderful experience to present at the Azure Global Bootcamp in Melbourne this year. I spoke about possible methods and patterns for securing Azure resources. This is an extremely broad topic, so I focused on App Services, Virtual Network Endpoints and ARM templates.

You can find the PowerPoint slides to download here or view them via SlideShare.

How an issue with PowerShell DSC in WMF 5 cost us $5526.95

Short Version

  • PowerShell DSC in WMF 5.0 can lead to a significant increase in IO operations.
  • Operations will increase over time.
  • Fix is included in WMF 5.1, and a workaround is available.

With the move to the cloud, we often don’t consider how a simple application bug can lead to a large infrastructure bill. In this case, a minor issue with WMF 5.0 led to an IO usage bill of $5526.95 over just under 6 months.

Background

In early December, as part of a regular review of Readify’s Azure usage, we noticed a spike in the usage for one of our Azure Subscriptions. The subscription accounts for a significant portion of our usage, and is used internally.

Looking closely at our usage breakdown, we saw something very unusual: Data Management Geo Redundant Standard IO - Page Blob Write had taken 2nd place, overtaking a D13v2 virtual machine that runs 24/7. These write operations are billed in units of 100,000,000 operations, leading to the realisation that we had a significant IO consumer within one of our Azure Storage accounts. We quickly narrowed this down to two storage accounts housing the majority of our IaaS deployments.

Digging Deeper

In terms of IaaS, we don’t have anything deployed that would be considered particularly IO intensive, certainly not to this level. We just have some AD DCs, ADFS servers, VPN, VSTS build and some Azure Automation Hybrid Workers. Nothing jumped out as an obvious culprit.

Looking at some metrics for the storage account, two stood out as being unusually high: Total Requests and Total Egress.

I started to look at the usage reports produced by the Azure EA portal. It appeared at first that the usage had increased around September. Initially, I thought it could have been related to some work with Azure Recovery Services, either replication or backups, but after a few days it was obvious this wasn't the cause of our issues.

Calling Support

At this point, I raised a support case with the Azure team.

I just want to say a massive thank you to the Azure Support team; they did an amazing job in resolving the issue. They started off just as perplexed as I was. We went through the workloads, and nothing presented as an obvious cause of our issues.

We went back through the usage reports in the EA portal. Looking back further, we determined the usage had increased in July. Had we implemented anything in July? At this point, my mind drew a complete blank.

The engineer offered to see if the back-end engineering teams might be able to help us narrow down the usage, hopefully leading us to the VHD(s), and thus the virtual machine that was driving this usage.

The next day, they sent through an Excel spreadsheet. The main culprits were our DCs and ADFS servers. These had been around for over a year, and had seen very few changes. I thanked support and promised I would spend some time studying the machines and get back to them if I found something.

Staring at Resource Monitor

I fired up Resource Monitor on one of the machines and spent the next hour watching the disk usage. After almost going cross-eyed, I saw it. It was hard to believe, but something wasn’t quite right: the WMI Host process was generating huge write cycles to C:\Windows\System32\config\systemprofile\AppData\Local\Microsoft\Windows\PowerShell\CommandAnalysis\PowerShellAnalysisCacheEntry*. Could that be the cause?

I quickly looked at other virtual machines and saw similar patterns. Things were pointing to these files, but what were they? What was the WMI Host doing?
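For anyone wanting to run the same spot check, something along these lines will do it; the server names here are placeholders and the path is the one Resource Monitor pointed me at:

    # A rough sketch of the spot check I ran against the other virtual machines.
    # Server names are placeholders; the path is the one shown in Resource Monitor.
    $cachePath = 'C:\Windows\System32\config\systemprofile\AppData\Local\Microsoft\Windows\PowerShell\CommandAnalysis'

    Invoke-Command -ComputerName 'DC01', 'DC02', 'ADFS01' -ScriptBlock {
        Get-ChildItem -Path $using:cachePath -Filter 'PowerShellAnalysisCacheEntry*' |
            Measure-Object -Property Length -Sum
    } | Select-Object PSComputerName, Count, Sum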

The cause is found!

I decided the first thing to do was find out what those files were, so I opened my favourite search engine, DuckDuckGo, entered wmi host PowerShellAnalysisCacheEntry, and hit search. The first two results, get-psdrive & abnormal I/O and Topic: Dsc v5 get-psdrive abnormal I/O, leapt out.

That is when it hit me. We implemented Azure Automation DSC in July. Crap!

I started with the second result, Topic: Dsc v5 get-psdrive abnormal I/O, on PowerShell.org. In the post, from May, the user Brik Brac talks about seeing a major I/O performance issue using DSC v5.

Brik Brac also created the post on the Windows Server UserVoice, get-psdrive & abnormal I/O. In this post, Zachary Alexander, a member of the PowerShell team, acknowledged the issue and posted a workaround. He also noted that the issue would be fixed in WMF 5.1, which will be released sometime in January 2017.

Simply put, the issue lies in how PowerShell discovers modules to import when performing certain PS Drive operations, and this heavily impacts DSC. Each time DSC runs, be it monitoring or applying configuration, the number of write operations on these cache files increases; left unchecked, the growth is dramatic. The files don't grow much in size, only the number of write operations to them does. You can read more on auto discovery here: How Module Command Discovery Works in PSv3.

xRegistry is one of the resources impacted the most by this issue, and as part of our DSC configurations, we use xRegistry resources to harden the SSL configuration on our servers. In fact, this was one of the first reasons we moved to DSC.
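To give a feel for the sort of configuration involved, here is an illustrative sketch rather than our actual configuration; it assumes the xRegistry resource from the xPSDesiredStateConfiguration module, and shows disabling SSL 3.0 for server connections:

    # Illustrative sketch only - the kind of SCHANNEL hardening pushed out with xRegistry.
    # Assumes the xRegistry resource from the xPSDesiredStateConfiguration module.
    Configuration SslHardening
    {
        Import-DscResource -ModuleName xPSDesiredStateConfiguration

        Node 'localhost'
        {
            xRegistry DisableSsl3Server
            {
                Key       = 'HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\SSL 3.0\Server'
                ValueName = 'Enabled'
                ValueData = '0'
                ValueType = 'Dword'
                Ensure    = 'Present'
            }
        }
    }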

I quickly emailed the support engineer I was working with, and after a quick phone call, we felt confident that we had found the source of our issues. Now we just needed to prove it.

Resolving the issue

There were two ways we could resolve the issue. Firstly, I opted to install the WMF 5.1 preview onto systems where possible; secondly, we implemented the workaround that clears the cache files, much like the one outlined by Zachary in the UserVoice post.
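The exact script is in the UserVoice post; the general idea is simply to clear the analysis cache files out on a regular basis, along these lines:

    # Rough sketch of the workaround - periodically clear the analysis cache files so the
    # write operations cannot keep climbing. See Zachary's UserVoice post for the original.
    $cachePath = 'C:\Windows\System32\config\systemprofile\AppData\Local\Microsoft\Windows\PowerShell\CommandAnalysis'

    Get-ChildItem -Path $cachePath -Filter 'PowerShellAnalysisCacheEntry*' -ErrorAction SilentlyContinue |
        Remove-Item -Force -ErrorAction SilentlyContinue

How you schedule it, via a scheduled task or through DSC itself, is up to you.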

Thankfully, implementing these steps was quite simple, but then we had a nervous wait to see if we had fixed the issue.

I started to see the Total Requests drop over the course of the day whilst I installed WMF 5.1 and ensured DSC was appropriately applied using Azure Automation DSC. By the end of the following day, the Total Requests had completely bottomed out, from over 35M to several thousand.

Ensuring this does not repeat

There are two ways to ensure that something like this doesn’t happen again.

First, implement threshold alerts for your Storage accounts. We didn’t have any defined, but we do now. I strongly recommend that everyone ensures they have alerts for the Total Requests metric.
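As a sketch of what such an alert can look like using the AzureRM.Insights module; the resource names, threshold and window size here are all placeholders, and the exact parameter and metric names may differ slightly between module versions:

    # Sketch only - a Total Requests alert on a storage account via AzureRM.Insights.
    # Resource names, threshold and window size are placeholders; tune them to your workloads.
    $storageAccount = Get-AzureRmStorageAccount -ResourceGroupName 'Core-IaaS' -Name 'coreiaasdisks'

    Add-AzureRmMetricAlertRule -Name 'StorageTotalRequests' `
        -Location $storageAccount.Location `
        -ResourceGroupName 'Core-IaaS' `
        -TargetResourceId $storageAccount.Id `
        -MetricName 'TotalRequests' `
        -Operator GreaterThan `
        -Threshold 1000000 `
        -WindowSize 01:00:00 `
        -TimeAggregationOperator Total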

Next, check your server monitoring. We had monitoring for CPU, memory usage, and free disk space, but we didn’t have any monitoring of disk IO. You can do this in the Azure Portal, or via OMS. If I had had this in place, I would have picked up the issue much earlier. The other thing I will be doing is ensuring that our server and storage account deployment templates include basic performance alerting.
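Even before alerting is in place, a quick spot check with the standard performance counters would have given the game away much sooner; something like this (the sampling values are arbitrary):

    # Spot check of write IO - overall disk writes plus per-process write operations.
    # In our case the WmiPrvSE (WMI Provider Host) instances were the standout.
    Get-Counter -Counter '\LogicalDisk(_Total)\Disk Writes/sec',
                         '\Process(WmiPrvSE*)\IO Write Operations/sec' `
                -SampleInterval 5 -MaxSamples 12 |
        ForEach-Object { $_.CounterSamples | Select-Object TimeStamp, Path, CookedValue }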

More Information

Why isn't Remoting Disabled by Default on Windows Server?

I was recently involved in a brief and quite lively Twitter discussion with Don Jones and Jeffrey Snover about PowerShell Remoting and why it is enabled by default. I have been involved in a number of discussions about this topic, but never with such a distinguished crowd as this one. My opinion, and my original comments, were along the lines of “I believe Remoting should be off by default”, “Well, RDP is disabled by default, why not Remoting”, and “SSH has been off by default for years”, whilst the counter arguments were of the form “Because Nano Server” or “You could always customise your environment to be off by default”.

Don Jones posted a follow up to this discussion on PowerShell.org titled “Why is Remoting Enabled by Default on Windows Server?” and asked me to put together a post on why I felt it should be off by default. This was difficult for me to put together, so here goes!

It has long been an industry practice to disable or stop services which are not in use on your clients and servers. The argument is quite simple: enabled services are vulnerable services; they expose your devices to potential risks. Simply having Remoting off, unless explicitly required, will reduce the attack surface and increase the security of our systems.

Even Microsoft has followed an off-by-default methodology for the past 10 to 15 years. Services like POP3 and IMAP are off by default in Exchange, SQL Server does not listen on IP addresses by default, and we need to install roles and features individually. Microsoft learnt from a number of major security blunders in the early days (Code Red, Slammer and even Blaster) and focused on a more secure development and deployment model. Why should there be an exception to this posture, which has worked extremely well since Windows Server 2003?

Linux administrators and developers of Linux distributions have been in a similar situation in the past. For a long time, SSHD has been off by default, and administrators have still been able to manage their server fleets. One of the early reasons for an off-by-default approach in Linux was that it ensured administrators were aware of the risks prior to enabling SSHD. Now it can be argued that this has been a failure, and I think most would agree. I do, however, believe that the failure is not in the off-by-default configuration, but in the lack of documentation covering the secure configuration of SSHD. People in glass houses shouldn’t throw stones, as Remoting can be just as poorly deployed.

Remote Desktop is a great example of Microsoft following these methodologies. RDP is off by default for a number of reasons, with security being only one of them. Ironically, one of the obvious reasons to have RDP off by default is to encourage the move from on-server management to remote management. Whilst adoption has not been as high as expected (due to issues with third-party vendors, administrators and, to a big extent, Microsoft), it is clearly a sign of how ahead of the curve Microsoft has been.

It has become increasingly dangerous to expose management services, be they SSH or RDP, on the Internet. If you have ever been responsible for auditing the log files of a server where SSH or RDP is exposed to the Internet, you will be well aware of the automated scan attempts that are performed. Brian Krebs has posted on Internet criminals selling access to Linux and Windows servers whose credentials they have brute forced. What happens when the criminals discover Remoting? Brute forcing credentials via Remoting should be even easier, and I have written about just such a thing on previous occasions. Should we be enabling these criminals and providing them with even more machines that they can take over?

Well, we are doing this to an extent right now. Users, administrators and developers have all been busy provisioning virtual machines on platforms like Azure and AWS, and whilst in many cases RDP endpoints are on random high ports, the same cannot be said for Remoting. Those who deployed and manage these systems may well be unaware of the risks that they have introduced to their networks. Moving to an off-by-default model could protect these environments from this sort of configuration error.
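If you manage virtual machines on these platforms, it is worth auditing your own estate; a quick sketch of that sort of self-check (the addresses are placeholders, and 5985/5986 are the default WinRM HTTP and HTTPS ports):

    # Quick self-audit sketch - check whether your own public VM addresses answer on the
    # default WinRM ports. The addresses below are placeholders.
    $publicAddresses = '203.0.113.10', '203.0.113.11'

    foreach ($address in $publicAddresses) {
        foreach ($port in 5985, 5986) {
            Test-NetConnection -ComputerName $address -Port $port -WarningAction SilentlyContinue |
                Select-Object ComputerName, RemotePort, TcpTestSucceeded
        }
    }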

As a side note, it is still interesting to me how Microsoft changed Remoting from off to on by default in Windows Server 2012, with very little fanfare. In 2014 when I presented on Lateral Movement with PowerShell, audiences typically responded with a significant amount of surprise, be they from an administration or security background.

In Don’s post, he talks about the fact that we could easily create an off-by-default environment if we so wanted. I really have to disagree with him and say that he has missed the point to a degree. Whilst it is true that we could use a customised gold/master image, Group Policy or some other tool to create an environment where Remoting is off by default, it must be highlighted that the inverse, an on-by-default environment, would be just as simple to create with these tools. If you want it on, then turn it on; it isn’t that hard.
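To put that in concrete terms, flipping the default in either direction is a one-liner per machine (or a single Group Policy setting):

    # Turning Remoting on where you want it is trivial...
    Enable-PSRemoting -Force

    # ...and turning it off where you don't is just as simple. (Note that Disable-PSRemoting
    # doesn't undo every change Enable-PSRemoting makes, but it does block remote access to
    # the session configurations.)
    Disable-PSRemoting -Force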

Don also talks about the fact that Remoting is an incredibly controllable, HTTP-based protocol. This introduces the other issue I have with Remoting. Unless you are deploying an Azure Virtual Machine, post-install you will be exposing Remoting over HTTP and not HTTPS. Is this 2015 or 2001? Do we really still need to talk about the virtues of HTTPS? It would be trivial for Microsoft to change the default from HTTP to HTTPS in a manner similar to RDP.
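To be fair, standing up an HTTPS listener yourself isn’t difficult once you have a certificate on the machine; a minimal sketch (the certificate lookup here is an assumption, source your thumbprint however suits your environment):

    # Minimal sketch - create a WinRM HTTPS listener using an existing machine certificate.
    # Certificate provisioning is up to you; this just grabs one matching the computer name.
    $certificate = Get-ChildItem -Path Cert:\LocalMachine\My |
        Where-Object { $_.Subject -like "CN=$env:COMPUTERNAME*" } |
        Select-Object -First 1

    New-Item -Path WSMan:\localhost\Listener `
             -Transport HTTPS `
             -Address * `
             -CertificateThumbPrint $certificate.Thumbprint `
             -Force

    # Remember to allow TCP 5986 through the firewall, and to remove the HTTP listener if
    # you no longer want it.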

Now let’s talk about the big elephant in the room, or should I say the Nano elephant in the room? What about Nano Server?!?!? Nano Server, whilst it is a new concept for some of us, isn’t a completely new idea in our industry. Whilst I agree it is probably easier to have Remoting (and WMI) enabled by default, it isn’t as though deploying a Nano Server is currently a simple process. Currently, Nano Server comes as a standalone WIM image; we need to manually add packages providing roles, and we currently need to join a domain during installation. How hard would it be to have a step enabling Remoting? It is trivial.

Having said all of that, perhaps the best middle ground would be to have Remoting enabled on Nano Server, and off for Core and Full installs? Administrators have more options on the latter two than the former. Perhaps a compromise is in order?

Another side note: why doesn’t Microsoft want to enable Remoting on clients? If Remoting is safe for Internet-exposed servers, shouldn’t it be OK for Windows clients?

So in summary, why should Remoting be off by default?

  • Off by default is an industry standard practice.
  • Off by default has been Microsoft practice for over 10 years.
  • Linux administrators deal with SSHD being off; so can we!
  • RDP has been off by default, and we lived with that.
  • RDP and SSH are actively brute forced, so why open up another attack vector?
  • Off by default reduces administrative misconfiguration and insecure configuration.
  • It is just as easy to switch it on as it is to switch it off.
  • Nano Server isn’t as much of a challenge as we think, but it could be the exception.

As Don said, whether you agree or not, it is entirely up to you and you are welcome to add your polite, professional comments to this post, or over at the PowerShell.org forums where I have cross-posted. Like Don, I wanted to explain and attempt to justify why I think Microsoft’s approach is not correct. I often believe the discussion is more important than the outcome, and I believe this is definitely the case here.

Kieran Jacobsen

Tips For Managing Microsoft SQL 2012 Always On Availability Groups

Recently, I have been spending a significant amount of time working with high availability and disaster recovery solutions. A large amount of this work has been around the deployment and use of SQL 2012 AlwaysOn Availability Groups.

Availability Groups should not be confused with SQL clusters; whilst both technologies make use of Windows Failover Clustering, they achieve their goals in some very different ways.

I am not going to go into specifics around the design and deployment of an AlwaysOn Availability Group solution; however, I have some design tips that should help you avoid any serious setbacks.

Firstly, confirm vendor application support. This might seem obvious, but it is extremely important. There are a number of applications that do not support AlwaysOn AGs; the list includes Microsoft SCCM 2012 R2 (and other products in the System Center family) as well as some products from Citrix (patches are on their way). If you are running any of the products that do not support these features, then you will need to revert to a normal failover cluster (or, in my case, a geo-cluster).

Secondly, find a reliable place for a file share based witness, preferably at a second site. This tip is of particular importance if you are running across multiple sites. You should also be aware that the server hosting the share must be in the same domain as the SQL servers.
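Configuring the witness itself is a single cmdlet from the FailoverClusters module; the share path below is a placeholder:

    # Point the cluster quorum at a file share witness, ideally hosted at a second site.
    # The share path is a placeholder; the hosting server must be in the same domain.
    Import-Module FailoverClusters

    Set-ClusterQuorum -NodeAndFileShareMajority '\\WITNESS01\SQLClusterWitness'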

Next, check the implications of using AlwaysOn AGs on your Microsoft licensing. All of the new features come at a cost; you need to ensure that you do not blow out your overall solution costs. Remember, with AlwaysOn AGs, SQL is actually running on all of your nodes all the time, and this could affect your licensing requirements.

One annoying thing: you cannot easily create your listeners if you do not have any databases. My tip to bypass this is to simply create an empty database, and then create the AlwaysOn AG and its associated listener based on that database. Once that is done, your application installers can be run against the listener. REMEMBER though, if an application creates databases via the listener name, they are not configured as part of the AlwaysOn group; you will need to go and do that manually! I was caught out by this one.
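If you do get caught out like I was, the databases can be added to the group after the fact, either through Management Studio or with the SQLPS cmdlets; a rough sketch (instance, group and database names are placeholders):

    # Rough sketch - add a database that an installer created via the listener into the
    # availability group after the fact. Server, group and database names are placeholders.
    Import-Module SQLPS -DisableNameChecking

    # On the primary replica, add the database to the availability group.
    Add-SqlAvailabilityDatabase -Path 'SQLSERVER:\SQL\SQLNODE01\DEFAULT\AvailabilityGroups\AppAG' -Database 'AppDatabase'

    # On each secondary replica (after restoring the database there with NORECOVERY), join it.
    Add-SqlAvailabilityDatabase -Path 'SQLSERVER:\SQL\SQLNODE02\DEFAULT\AvailabilityGroups\AppAG' -Database 'AppDatabase'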

I had some issues when creating the AlwaysOn AG and the associated listener. These two articles really helped me out: