Tuesday, January 4, 2011

The Azure VM Role reboot reset – understanding the pale blue cloud

1/5/2011 update:
Only one day has gone by since I originally posted this – and I must say that this has been a very interesting adventure.  The detailed discussion is in the comments.  However, this is a real and valid scenario that a developer should plan for.

Here is a bit more insight into the behavior of VMs in Azure – and one more point that VM Role is NOT a solution for Infrastructure as a Service.
Let's begin with a very simple, one-VM scenario:
With a VM Role VM – you (the developer, the person that wants to run your VM on Azure) upload a VHD into what is now VHD-specific storage, in a specific datacenter.
You then create an application in the tool of your choice and define the VHD with the service settings – this links your VHD to a VM definition, firewall configuration, load balancer configuration, etc.
You deploy your VM Role centric service and sit back and wait – then test and voila! it works.
You do stuff with your VM, the VM's life changes, and all is happy – or is it?
Now, you – being a curious individual – click the “Reboot” button in the Azure portal.  You think, cool, I am rebooting my VM – but you aren't; you are actually resetting your service deployment.  You return to your VM Role VM to find changes missing.
This takes us into the behaviors of the Azure Fabric.  On a quick note – if you want some type of persistence, you need to use Azure storage for that.  Back to the issue at hand – your rolled-back VM.  Let's explore a possibility for why this is.
BTW – this behavior is the same for Web Roles and Worker Roles as well – but there it is the Azure base OS image, not yours.
Basically, what happened was no different from a revert to a previous snapshot using Hyper-V, or the old Virtual PC rollback mode.  When a VM is deployed there is a base VHD (this can be a base image – or your VM Role VHD) and a new differencing disk that is spawned off of it.
Selecting reboot actually tossed out the differencing disk, which contains your latest changes, and created a new one – thus reverting your VM Role VM.  This is all fine and dandy, however my biggest question is: what are the implications for authentication mechanisms such as Active Directory?  AD does not deal with rollbacks of itself, or of domain-joined machines, very well at all.
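To make the mechanics concrete, here is a minimal sketch (plain Python, not Azure or Hyper-V code) of how differencing-disk semantics produce exactly this revert: the base VHD is read-only, all writes land in the differencing layer, and discarding that layer returns the VM to its base state. The class and file paths are illustrative assumptions, not real APIs.

```python
class DiskChain:
    """Toy model of a base VHD plus a differencing disk."""

    def __init__(self, base):
        self.base = dict(base)   # immutable base VHD contents
        self.diff = {}           # differencing disk: receives all writes

    def write(self, path, data):
        self.diff[path] = data   # writes never touch the base VHD

    def read(self, path):
        # reads check the differencing layer first, then fall through
        return self.diff.get(path, self.base.get(path))

    def reimage(self):
        # what the fabric effectively did on "reboot":
        # toss the differencing disk and spawn a fresh, empty one
        self.diff = {}


vm = DiskChain({"C:/Windows/notepad.exe": "exe"})
vm.write("C:/Users/admin/Desktop/doc1.txt", "my notes")
assert vm.read("C:/Users/admin/Desktop/doc1.txt") == "my notes"

vm.reimage()  # the unexpected revert
assert vm.read("C:/Users/admin/Desktop/doc1.txt") is None   # changes gone
assert vm.read("C:/Windows/notepad.exe") == "exe"           # base survives
```

The documents created on the desktop live only in the differencing layer, which is why they vanish when that layer is discarded.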
My scenario is that you are using Azure Connect to connect back to a domain controller in your environment – you join the domain, someone clicks reboot, and your machine is no longer domain joined, or you have a mess of authentication errors.  Again, this is not the Azure model.
The Azure model in this case is that your VM resets at the fabric layer back to the base image (they recommend that you prepare it with sysprep) and it re-joins your domain as a new machine – with all of its pre-installed software.
This is all about persistence and where that persistence resides.  In the VMs of your service there is no persistence, the persistence resides within your application and its interaction with Azure storage or writing back to some element within the enterprise.
This is important to understand, especially if you think of Azure as IaaS – which you need to stop doing.  It is a platform.  It is similar to a hypervisor, but it is not a hypervisor in how you interact with it as a developer or ITPro.
In a nutshell, what happened is that during the reboot of my VM the Azure Fabric considered the VM unhealthy and thus provisioned a new one.  It could be that the differencing disk could not be written back to the root VHD, or it could be that “something” in my VM is not as the fabric wants it, so the VM was considered bad and a new one was provisioned.
Regardless – this is valid behavior – behavior to understand and plan for – and it again emphasizes that if you want persistence you must design it in by writing your application state out to Azure Storage (or enterprise storage using Azure Connect) in some way in order to guarantee it.
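The persistence pattern described above can be sketched as follows. `ExternalStore` is a hypothetical stand-in for durable storage outside the VM (Azure blob/table storage, or an enterprise share reached via Azure Connect) – the names are my own, not an Azure SDK. The point is that durable storage outlives any single VM instance, while the local disk does not.

```python
class ExternalStore:
    """Durable storage outside the VM; outlives any single instance."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class RoleInstance:
    """A role VM: the local disk is disposable, durable state is external."""

    def __init__(self, store):
        self.local_disk = {}   # lost whenever the fabric reprovisions
        self.store = store     # survives reprovisioning

    def do_work(self):
        self.local_disk["scratch"] = "temp files"    # acceptable to lose
        self.store.put("app-state", {"orders": 42})  # must survive


store = ExternalStore()
vm = RoleInstance(store)
vm.do_work()

# The fabric decides the VM is unhealthy and provisions a fresh instance:
vm = RoleInstance(store)
assert vm.local_disk == {}                           # scratch data is gone
assert vm.store.get("app-state") == {"orders": 42}   # durable state remains
```

Design your application so that everything on the left side of that split (the local disk) is reconstructible, and everything that matters lives on the right.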
All very interesting.

7 comments:

PowerToTheUsers said...

If every change you make is written to a differencing disk alongside the base VHD, can't you put that differencing disk on Azure storage like BLOB or Windows Azure Drive? That way you could make your changes persistent, or maybe use different snapshot branches...

BrianEh said...

You can create your own differencing disk and attach the VM to that (I am in the process of trying this). However, I still bet that the behavior will be the same.

MSFT emphasizes that Azure is NOT IaaS - and the behavior of the system simply emphasizes that point - which is my point.

Azure is a developer-oriented thing - it is definitely not built for ITPros - you definitely have to think like a developer to get creative and do things.

Azure Drive is an option for saving your data - however you have to configure everything (including your application) prior to uploading your VM Role vhd.
This is part of what I am referring to when I state that you have to adapt to using Azure storage - you can't simply rely on the local disks in the VM.

And there is no snapshotting as you have in Hyper-V - no branching, etc. You don't control the hypervisor - the Azure Fabric does - and that limits your options and your creativity.

David Pallmann said...

Brian, this doesn't match my understanding of (or personal experience with) the Reboot action. I don't lose my changes to a VM Role (or any other kind of role) when I reboot an instance. The behavior you are describing sounds like the Reimage action. Could you explain what you believe the difference between Reboot and Reimage to be?

BrianEh said...

My test was simple.

I prepare a VHD, I upload it, I open an administrative desktop session to the VM, I create two documents on the desktop. I then return to the console and select "Reboot" (not reimage).

I wait - what I consider a really long time - and I log in to the desktop as the same user. The documents I created are gone. Thus I conclude that my changes are not being written back to the root VHD.

The other artifact is that the VM behaves as if it crashed - I get the "I had a dirty shutdown" dialog asking me to enter something into the event log.

So, I may have found a bug - your response is now making me think that way.

If I reboot from within the OS of the VM - not using the management interface - my changes persist.

I would expect the fabric reboot to behave with a clean shut down and reboot and writing my changes to disk. I would expect a reimage to drop my differences and return my VM to the uploaded state.

At this time my reboot experience is what I would expect for a reimage.

I am not using any Azure storage here - only the VM Image VHD that has been uploaded.

David Pallmann said...

Interesting. For comparison, this is the experiment I tried:

1. Deployed a VM Role with a base image and 1 instance.
2. Connected to the instance by Remote Desktop and made 2 changes: created a disk file and configured IIS with a second web site.
3. Rebooted the instance in the portal and waited for it to become ready.
4. Again connected by Remote Desktop, and verified that the disk change and the IIS configuration change were still present.

I also repeated this experiment with a differencing VHD and had the same outcome. I'm not sure why our experiences are different.

David

BrianEh said...

I might be re-writing this post and eating crow by the time we are done - I am sure that there is some outlier that is being missed.

I am uploading new VHDs - so it will be a bit before I can try anything else.

I wonder if there might be something in the service definition or config files that is causing this behavior.

BrianEh said...

I received an interesting response back from MSFT folks about this behavior.

Yes, the definition is the way that I outlined the expected behavior - and the way that David Pallmann also mentioned:
Per the Windows Azure glossary:
reboot
To restart the VM on which a role is hosted.
reimage
To reset the VM to the initial state. The initial state is defined by the settings configured on the VHD that was uploaded to the VM.

At the same time it was affirmed that the reboot might be failing in some way due to something to do with my VM and therefore the fabric is provisioning a new VM - so other folks could see this result as well.

It was also mentioned that if I want to guarantee that my application data / state is saved, then (as I postulated) I need to write out to Azure storage (in some way), as this will guarantee the persistence of my data in cases such as the one I have run into, where something is causing my VM to be reset.

All in all, a very interesting exercise so far. And a big lesson learned.