I.T. Proctology: failover clustering

Monday, March 23, 2009

HA VMs and the Failover Clustering wizard

Here is a gotcha that just came out of the Hyper-V forums.

I have been assisting a person with a problem he was having: his VMs had to fail over as a group. We walked through the most common cause, the VMs sharing a single storage LUN, but the issue persisted. So we went backward.

In the end it was all about the process of making a VM Highly Available in the Failover Clustering manager.

Now, the process has been well documented and blogged about - so I am not going to show screen shots or outline the process. But I will mention the gotcha.

Here is a rule of thumb that we crusty clustering folks don't think about anymore. We just do it, and because of that I didn't even think about the root of the issue that this individual ran into.

Here is the rule of thumb: One run of the Failover Cluster "add resource" wizard equals one Highly Available resource.

You might think, well yeah. And for those of us who began with Clustering under NT4, this was a requirement. However, with the ease of use of the new wizard in Failover Cluster Manager I just don't consciously think about it anymore.

Why is this an issue?

Ah, here is the scenario.

I open the wizard to make a number of VMs Highly Available. To save some time, I add multiple VMs and complete the wizard. I think, great, I am done.

Now, on the backside, Failover Clustering has actually just grouped all of these VM resources together as a single Highly Available entity that is composed of many workloads.

This is where Failover Clustering takes over, and Hyper-V is just an engine.

Failover Clustering deals in 'workloads.'

In the workload world that can be a website, plus a COM service, plus a LUN, plus a database server. These are all distinct entities, but all dependent upon each other, and in Failover Clustering you would set them up as a single workload that comes online in a specific order.
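
As a rough illustration of "come online in a specific order", here is a minimal Python sketch. The dependency map is an assumption made up for this example (the post does not say which piece depends on which), and this is not how Failover Clustering is implemented; it only shows that an ordering falls out of the dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workload: each resource maps to the resources it depends on.
# These particular dependencies are assumed for illustration only.
workload = {
    "LUN": set(),
    "database server": {"LUN"},
    "COM service": {"database server"},
    "website": {"COM service", "database server"},
}


def online_order(dependencies):
    """Return an order in which every resource follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = online_order(workload)
print(order)
# ['LUN', 'database server', 'COM service', 'website']
```

The storage comes up first and the website last, which is the kind of ordering a single workload gives you.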

When we talk VMs - Failover Clustering only has a rule that says, 'Oh, a VM is composed of a configuration file, plus a VHD, plus the volume the VHD resides on' and it is the logic in the wizard that takes care of setting all three of these items as a single Highly Available workload.
(Go ahead; take a look at the details of a Highly Available VM).

The point is that each VM must be set up individually. One invocation of the wizard = one VM.
If you add multiple VMs at the same time, then Failover Clustering considers them multiple components of the same workload and will keep them together and fail them over between hosts as a single unit.
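
The rule of thumb can be pictured with a toy model in Python. The `Cluster` class, its method names, and the host and VM names are all hypothetical; this is a sketch of the behavior described above, not a real clustering API.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: one wizard run creates one group that moves as a unit."""
    groups: dict = field(default_factory=dict)  # group name -> VM names
    owner: dict = field(default_factory=dict)   # group name -> owning host

    def run_wizard(self, vms, host):
        """Every VM selected in a single run lands in the same group."""
        name = f"group{len(self.groups) + 1}"
        self.groups[name] = list(vms)
        self.owner[name] = host
        return name

    def fail_over(self, group, new_host):
        """The whole group changes host; members cannot move individually."""
        self.owner[group] = new_host

    def host_of(self, vm):
        """Look up the host of a VM through the group that contains it."""
        for group, members in self.groups.items():
            if vm in members:
                return self.owner[group]
        raise KeyError(vm)


cluster = Cluster()
shared = cluster.run_wizard(["vm1", "vm2"], host="host1")  # one run, two VMs
solo = cluster.run_wizard(["vm3"], host="host1")           # one run, one VM
cluster.fail_over(shared, "host2")
print(cluster.host_of("vm1"), cluster.host_of("vm2"), cluster.host_of("vm3"))
# host2 host2 host1
```

vm1 and vm2 travel together because they were added in the same run, while vm3 stays put.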

This can lead to all kinds of confusion. Such as: why can't I have each VM on its own host? Why can't I fail them over individually?

Multiple VMs sharing a single LUN is another cause of this behavior, but that is not the issue here.

When might this be something that you want to do?

You would want to do this if you had VM entities that were related or required each other or required an internal virtual network to communicate.

One example is an IIS server that must sit behind a separate VM firewall. In this case, make them fail together, as a single workload.

In the end, I hope this helps broaden the understanding of how Hyper-V extends and depends on other Windows services to provide features.

Sunday, May 11, 2008

How do I stop Failover Clustering long enough to perform some maintenance that involves a reboot without chasing my VM between my clustered hosts?

I have actually been going around with support about this issue myself. Maintenance mode does not exist for the VM workload.

In failover cluster manager I can select the Virtual Machine and take it offline - this causes my HA VM to be powered down - and also removes it from the Hyper-V manager.

If I select Manage Virtual Machine in failover cluster manager I get directed to the Hyper-V manager.

If I take my Virtual Machine configuration offline - then my VM is shut down but it is not removed from Hyper-V manager.

If I take a snapshot and try to revert, as soon as my VM gets stopped to perform the revert, it gets migrated to another host.

Here is a solution that works ONLY IF you need your VM off: (this works for changing hardware, etc.)
Open the properties of the Virtual Machine resource in Failover clustering.
Set the off-line action to Shut down (this is supposed to stop the VM from failing over and restarting on another host).
Take the resource off-line.
Return to Hyper-V and make your changes.

Here is a solution if you want failover clustering to leave your VM alone for a while (so you can reboot it without it failing over):
Open the Virtual Machine properties in Failover Clustering
Select the policies tab
Select "if resource fails, do not restart"
Save that setting.
Return to the Hyper-V Manager and you can open your Virtual Machine Connection console and do things within your VM that may involve a shutdown or reboot to your heart's content.

Just remember that when you want Failover Clustering to be back in charge, you need to return to Failover Clustering and undo what we did above.
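
The "change the policy, do the work, then undo it" pattern is easy to forget at the undo step. Here is a small Python sketch of the idea using a context manager. `VmResource` and `hands_off` are hypothetical names for a toy model; nothing here talks to a real cluster.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class VmResource:
    """Toy stand-in for a clustered VM resource; not a real cluster API."""
    name: str
    restart_on_failure: bool = True


@contextmanager
def hands_off(resource):
    """Disable restart-on-failure for the duration, then restore the old value."""
    previous = resource.restart_on_failure
    resource.restart_on_failure = False
    try:
        yield resource
    finally:
        # Runs even if the maintenance step raises, so the policy is never
        # left disabled by accident.
        resource.restart_on_failure = previous


vm = VmResource("vm1")
with hands_off(vm):
    during = vm.restart_on_failure  # policy while maintenance is in progress
after = vm.restart_on_failure       # policy once maintenance is finished
print(during, after)
# False True
```

Whatever tooling you actually use, the point is the same: restoring the original setting should be part of the procedure, not something left to memory.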

Thursday, May 8, 2008

Hyper-V plus failover clustering, an interesting marriage

Hyper-V is a really cool Windows add-on. By itself it is “just another hypervisor”, but with the addition of a bunch of other Windows Roles and Features it quickly becomes much more.
Take High Availability for example. Hyper-V, plus a VM workload, plus Failover Clustering.

For those of you not already familiar with Failover Clustering I am going to talk a bit about Windows clustering in general. First of all, I am speaking of Failover Clustering, not Network Load Balancing clustering, which is totally different.

In the generic sense, Failover Clustering is a way of taking a workload that runs on a clustered node and keeping that workload available. With Hyper-V it involves keeping a VM powered on.
The only big requirement is shared storage. This can be old fashioned SCSI shared storage, fiber SAN, or iSCSI. If Windows can see it as storage and you can present it to more than one server then you have shared storage.

The Failover Clustering setup and validation wizards in Windows Server 2008 make clustering really super simple (makes me cringe when I recall my first NT 4 cluster). You run the wizard, and if you listen to it, you have a fully MSFT supported cluster – you even get a recommended quorum configuration.

One limitation to consider is NTFS. By default only one node (clustering term for a member server in a cluster) can own a LUN at any one time. To be a bit more granular, only one server can write to an NTFS partition at any one time. It is possible to share a LUN with two Windows servers, but even with one server writing and the other only reading, your volume will begin to degrade very quickly.

This sets up a one Highly Available VM to one LUN model for Hyper-V.
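
To make the single-owner idea concrete, here is a toy Python model. The `Lun` class and the node names are hypothetical; the sketch only mirrors the rule stated above (one owning node at a time) and says nothing about how NTFS or the cluster service enforce it.

```python
class Lun:
    """Toy model of a clustered LUN that has exactly one owning node."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner
        self.writes = []

    def write(self, node, data):
        """Accept writes only from the current owner."""
        if node != self.owner:
            raise PermissionError(f"{node} does not own {self.name}")
        self.writes.append((node, data))

    def move_to(self, node):
        """Hand ownership of the whole LUN to another node."""
        self.owner = node


lun = Lun("lun1", owner="node1")
lun.write("node1", "vm1 data")
try:
    lun.write("node2", "vm2 data")  # node2 is not the owner
    rejected = False
except PermissionError:
    rejected = True
lun.move_to("node2")
lun.write("node2", "vm1 data")      # allowed once ownership has moved
print(rejected, lun.owner, len(lun.writes))
# True node2 2
```

Because ownership moves as a whole, everything stored on the LUN moves with it, which is what pushes you toward one HA VM per LUN.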

A highly available (HA) guest is made up of three parts. Part one – a configuration file. Part two – the workload. Part three – the LUN (that contains the VHD).

When a HA guest is failed over from one node to another all three parts must be moved between the nodes. The configuration is passed, the volume is passed, and the workload is passed.

The logistics behind this are that your HA guest is saved, its LUN is passed (assuming that all the bits of the VM reside in one folder), and then the guest is started (resumed).

The passing of the LUN prevents having more than one VM workload on a shared volume as the other VMs end up being ignored.
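
The save, pass, resume sequence and its side effect on neighbors can be sketched in a few lines of Python. The function name, its parameters, and the VM and node names are hypothetical, and "stranded" is just a label for the other VMs that the post says end up being ignored.

```python
def fail_over(vm, lun_vms, source, target):
    """Return the ordered steps for moving one HA guest, plus the VMs left behind.

    lun_vms lists every VM whose files live on the LUN that is being passed.
    """
    steps = [
        f"save {vm} on {source}",
        f"pass LUN from {source} to {target}",
        f"resume {vm} on {target}",
    ]
    # Any other VM on the same LUN loses its storage on the source node.
    stranded = [other for other in lun_vms if other != vm]
    return steps, stranded


steps, stranded = fail_over("vm1", ["vm1", "vm2"], "node1", "node2")
for step in steps:
    print(step)
print("stranded:", stranded)
# save vm1 on node1
# pass LUN from node1 to node2
# resume vm1 on node2
# stranded: ['vm2']
```

With one VM per LUN the stranded list is always empty.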

Why, you might ask? In a previous post I mentioned struggling with failover clustering for an hour or so, and above I mention that it is not Hyper-V that makes the guest highly available, it is Failover Clustering.

Failover Clustering is acting upon that HA VM workload and doing whatever it takes to keep that VM up and running. It is what is controlling the VM, not Hyper-V.

Hyper-V is still involved, but only from the standpoint of the VM guest heartbeat: if that heartbeat is lost for a moment, failover clustering is right there, ready to move that VM in a snap and keep that darn thing running. If there is collateral damage, that is not the fault of failover clustering, but of the admin.

Will this behavior change? Who knows. Windows clustering has worked this way for a long time now, and so has NTFS. I guess that if you could get past the NTFS limitation, then you could do it.

That is enough for now. More later.

Friday, April 25, 2008

Hyper-V + Failover Clustering .. the fight for the managed workload

Wow, I just had quite an experience with my Highly Available Hyper-V virtual machine.

First of all, the HA feature with Hyper-V using Failover Clustering works REALLY well.
Second, I have to spend a bit more time thinking about what the heck I am doing before I go nuts again!

Okay, here I am, playing with snapshotting - taking snapshots, reverting, etc. Trying to document how things work and change.
Taking snapshots was a no brainer, everything worked fine.

When I deleted a snapshot I diligently shut down the VM and started monitoring the volume, waiting for the merge to happen. I stared at the screen for 20 minutes, watching, switching back to the Hyper-V console, watching the volume.

Finally, I tried to refresh the volume - boom! The volume is gone. Oh, cr**! iSCSI must be having problems - poke, troubleshoot, poke some more.

Suddenly, in a fit of frustration - duh, I have failover clustering set up. Check failover clustering - my VM is now running on Host 2. GAAA!!!

Okay, move the workload back. Get the merge to happen properly.

Now, revert.. Do a revert - boom, communication to the VM is lost (since the Host is serving up the console over RDP). I check Failover Clustering again - there is my VM on the other Host again - not properly reverted either.

Wow. The things to think about now.

Microsoft has done a brilliant job at using other Windows Server 2008 features WITH Hyper-V (think about it, Hyper-V is not much at all without all of the WS08 and System Center add-ons), but the complexities of the interplay between these components are not for the admin with a weak constitution.

Keep those skills up to par and be the admin that thinks out of the box.