I have been working with Azure for over a year now and this is the first time I have had a machine compromised. I will say that this entire incident has been a good learning experience. I think that anyone can learn from my mistakes here.
Thinking back, this event is interesting from many angles:
- a default machine provisioning behavior opens a door that you (as the customer) need to remember to close, or not open in the first place.
- the machine is really out there, in the wild, innocent, vulnerable. Not protected in your datacenter behind all the other protections.
- your IT Security folks might not yet be engaged in how to deal with deploying into public clouds.
- you are doing test, exploration, or other work on systems that are not going to production.
Let’s go back in time to the week of November 26.
My innocent Virtual Machine was compromised by an attack against the RDP protocol. The well known RDP port 3389 was the vector.
The attacker succeeded in obtaining access, changing the local administrator password and pwning my box. I can only imagine that it was then ‘sold’ on some botnet, as later it ended up hosting a PayPal phishing site.
Then it was put to work as capacity for finding other vulnerable folks. This is where I noticed what was happening.
(BTW – the MSFT folks that handle this type of incident were extremely helpful in working out what happened, in what order, and how.)
First, let’s look at the environment where I put my machine.
It was in Azure. Placing any machine in Azure is equivalent to placing it in the corporate DMZ. There is a load balancer between you and the outside world, and public endpoint definition(s) that define the public ingress port(s).
Beyond that, you can actually do bunches of different configurations of Virtual Networks, or Gateways. You can link Services. All kinds of stuff. But remember, all IaaS machine to machine network traffic within a Service is wide open. There are no private endpoints (as in a PaaS Service).
That said, any IT Pro would immediately say ‘duh’ or ‘you idiot’. Yep. Nothing I can say to that.
There are some default IaaS behaviors that enable this:
- automatically opening an RDP endpoint for management
- requiring complex passwords (false sense of security)
- local administrator is not disabled
- no real guidance around practices (not that anyone RTFM’s any longer anyway – they simply SearchTFM)
PaaS Roles don’t have these behaviors. And the RDP connection is secured with a Certificate and a secondary user account.
And, as I think about this: remember that you are administering these lockdowns at the same time that your machine could be getting probed or brute force attacked. So choose a logical order, and define a strong password at the point of machine creation (however you do it). At least it gives you a fighting chance.
So, this is where I mention a bit of guidance. Here are some options to pick and choose from.
Some initial mitigation suggestions to choose from:
- Use the PaaS roles of Web or Worker as the front end machines whenever possible. They are already hardened.
- Rename the local administrator account.
- Disable the local administrator account and create some uncommonly named user account for administrative access.
- Choose strong plus complex passwords, or passphrases. Not simply one or the other. The OS can enforce complexity but not strength.
- A dictionary attack is likely to hit “P@ssw0rd” but it is unlikely to hit “Just a city boy, born and raised in South Detroit”.
- My favorite example of password strength: http://xkcd.com/936/
- Deny user access after X failed logon attempts (lock the account). This is a Local security policy if not domain joined, or a Domain policy if joined. Consider an automatic (timed) unlock as well, or you could have no recourse but to destroy your machine.
- Do not allow the creation of the default RDP public endpoint. This is only possible through the API / PowerShell. Or delete the auto created endpoint after creating the machine in the Portal.
- Only create the RDP endpoint when remote administration is necessary, and remove it afterward. But remember that we are human, and unless you have some interface doing this for you, you will probably forget at some point.
- Remove the RDP endpoint and use the Virtual Network Gateway feature of the Azure Virtual Network for secured remote administration without public endpoints. This requires some ground based router, and the VPN is slow, but your ports are closed.
- Remove RDP endpoint & use Azure Connect. This is limited to IPv6 TCP traffic only, but that should cover anything required to manage the OS.
- Avoid 3389 as the public port (I noticed my compromised machine specifically scanning for this port to spread itself) by using a port in the ephemeral range.
- Use the Windows Advanced Firewall rules and define them appropriately.
- Use Windows IP Security Policies and tightly define the sources from which RDP traffic can be accepted. This is highly effective, but a pain to set up.
- Monitor the machine. Azure provides metrics through the portal and API. Discover a baseline. Use an agent within the machine. This only detects the compromise after it happens and is not preventative.
- Take a snapshot of the clean state. This is not a point and click thing in Azure today, but you can work this out using the Storage cmdlets by destroying your machine, making the diff disk, and reincarnating the machine.
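To make the password and lockout bullets above a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `meets_complexity` only approximates the kind of rule an OS can enforce (minimum length plus three of four character classes), `COMMON_PASSWORDS` is a tiny hypothetical wordlist, and `naive_entropy_bits` is a crude upper bound that overstates the strength of anything built from dictionary words. The point is only that a password can pass the complexity check and still be an obvious guess.

```python
import math
import string

# Hypothetical, tiny stand-in for an attacker's dictionary.
COMMON_PASSWORDS = {"password", "p@ssw0rd", "letmein", "admin123"}


def meets_complexity(password: str, min_length: int = 8) -> bool:
    """Approximate an OS-style complexity rule: length plus 3 of 4 character classes."""
    classes = [
        any(c.islower() for c in password),
        any(c.isupper() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ]
    return len(password) >= min_length and sum(classes) >= 3


def in_wordlist(password: str) -> bool:
    """True if a dictionary attack using COMMON_PASSWORDS would hit this password."""
    return password.lower() in COMMON_PASSWORDS


def naive_entropy_bits(password: str) -> float:
    """Crude upper bound: length * log2(size of the ASCII character pool in use)."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(c in string.punctuation or c == " " for c in password):
        pool += 33  # 32 punctuation characters plus the space
    return len(password) * math.log2(pool) if pool else 0.0


def guesses_per_day(lockout_threshold: int, lockout_minutes: int) -> int:
    """Rough cap on guesses against ONE account when it locks after
    `lockout_threshold` failures and unlocks itself after `lockout_minutes`."""
    return lockout_threshold * ((24 * 60) // lockout_minutes)


if __name__ == "__main__":
    for candidate in ("P@ssw0rd", "Just a city boy, born and raised in South Detroit"):
        print(
            candidate,
            "| complex:", meets_complexity(candidate),
            "| in wordlist:", in_wordlist(candidate),
            "| naive bits:", round(naive_entropy_bits(candidate), 1),
        )
    print("guesses/day with 5 attempts and a 30 minute unlock:", guesses_per_day(5, 30))
```

The lockout number is the interesting one for me: with a timed unlock the attacker is throttled to a few hundred guesses a day per account instead of being stopped outright, which is the trade you make for not locking yourself out forever.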
Some of these practices are simply security through obscurity but, in this case, I think it’s OK.
- Changing the name of the administrator account dramatically increases the search space for a username/password pair.
- Changing the RDP endpoint is going to increase the search space for a TCP port/username/password trio.
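Here is a small Python sketch of the arithmetic behind those two points. The wordlist size and the number of candidate usernames are made-up figures, and `search_space` is just a hypothetical helper name; the only real-world constant is the 49152 to 65535 dynamic port range.

```python
# IANA dynamic/private ("ephemeral") port range: 49152-65535 inclusive.
EPHEMERAL_PORTS = range(49152, 65536)


def search_space(ports: int, usernames: int, passwords: int) -> int:
    """Number of port/username/password trios a blind attacker may have to try."""
    return ports * usernames * passwords


# Assumed figures, for illustration only.
PASSWORD_WORDLIST_SIZE = 10_000
CANDIDATE_USERNAMES = 500

# Default setup: port 3389 and a well-known administrator account name.
baseline = search_space(1, 1, PASSWORD_WORDLIST_SIZE)

# Renamed admin account plus an RDP endpoint somewhere in the ephemeral range.
hardened = search_space(len(EPHEMERAL_PORTS), CANDIDATE_USERNAMES, PASSWORD_WORDLIST_SIZE)

growth_factor = hardened // baseline

if __name__ == "__main__":
    print("baseline trios:", baseline)
    print("hardened trios:", hardened)
    print("growth factor: ", growth_factor)
```

Keep in mind that the port factor is the weakest of the three: an attacker who port-scans first pays for finding the port once, not once per guess. It still gets you out of the way of anything that only looks at 3389.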
I didn’t mention antivirus. I won’t go into my theories there. My personal opinion is that if all the tightening measures are done and you still get hacked, most likely your AV would have failed as well.
Always be ready for the possibility of rebuilding a machine. Never assume you can simply move forward, or even restore from a backup. The best you can do is revert to a known clean state.