A few weeks ago, we suffered a power outage. Normally the UPS would solve this problem, but at that time the UPS was under maintenance, so we had to act quickly. Luckily only one of the two datacenters went offline, so we kicked vCenter back online in the other datacenter. Unfortunately, the VMware configuration at the customer wasn't yet prepared to sustain disasters and there was no SRM installed. That left us no other option than to do a manual failover.

At that point we had a cluster of disconnected VMs and not-responding hosts. To re-register the disconnected VMs in the other cluster, we first had to unregister them. But as the hosts were down and not responding, we couldn't unregister the VMs. The only way around this problem is to remove the not-responding hosts from the cluster. This removes the hosts from vCenter, including the registration of the VMs residing on them.
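
For reference, the same steps can be scripted against the vSphere API. The sketch below uses pyVmomi and is only a rough illustration of the flow, not what we ran at the time; the vCenter address, credentials and the .vmx path are placeholders.

# Rough pyVmomi sketch; VCENTER_HOST, the credentials and the .vmx path are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="VCENTER_HOST", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

# Collect the hosts that vCenter reports as not responding.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
dead_hosts = [h for h in view.view
              if h.runtime.connectionState == "notResponding"]
view.Destroy()

# Removing a host from the vCenter inventory also drops the registrations
# of the VMs that lived on it -- the step that took so long in our case.
for host in dead_hosts:
    host.Destroy_Task()

# The orphaned .vmx files can then be re-registered on the surviving cluster:
# datacenter.vmFolder.RegisterVM_Task(path="[shared-ds] myvm/myvm.vmx",
#                                     asTemplate=False, pool=surviving_pool)

Disconnect(si)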

So off we went and removed all the not-responding hosts from the cluster.
And we waited, and waited, and waited. Yes, still in progress. Nothing to see here. 🙁

After about half an hour, there finally was some action and the hosts were removed from the cluster. Now we could continue bringing VMs back online at the other datacenter. But what was the cause of this delay? What was vCenter doing while we sat and waited?

We had to know. So we opened up the vpxd log and found many lines about "IORM". A quick Google search brought us to Storage I/O Control. Knowing this, we started counting and saw that for every datastore with Storage I/O Control enabled, vCenter tried to remove the host from the Storage I/O Control group, with a pause of ±21 seconds between datastores.

2013-12-10T14:32:10.849+01:00 [05700 error 'vpxdvpxdMoStorageResourceManager' opID=F1AD88ED-0000148A-f] [VpxDrmStorageResourceManager::DisableIORMOnHostInt] Failed to disable IORM for datastore datastore-3431 on host host-1248: vmodl.fault.HostCommunication
2013-12-10T14:32:10.849+01:00 [04316 info 'commonvpxLro'] [VpxLRO] -- BEGIN task-internal-14600685 -- host-1248 -- VpxdInvtHostSyncHostLRO.Synchronize --
2013-12-10T14:32:10.849+01:00 [04316 warning 'vpxdvpxdInvtHostCnx'] [VpxdInvtHostSyncHostLRO] Connection not alive for host host-1248
2013-12-10T14:32:10.849+01:00 [04316 warning 'vpxdvpxdInvtHostCnx'] [VpxdInvtHostSyncHostLRO] Failed to fix not responding host host-1248
2013-12-10T14:32:10.849+01:00 [04316 warning 'vpxdvpxdInvtHostCnx'] [VpxdInvtHostSyncHostLRO] Connection not alive for host host-1248
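
To get a feel for the numbers, a quick parse of those log lines is enough. The snippet below is a minimal sketch, assuming the log sits in a local vpxd.log file and keeps the exact line format shown in the excerpts above.

import re
from datetime import datetime

pattern = re.compile(r"^(\S+) .*DisableIORMOnHostInt\] Failed to disable IORM")
stamps = []
with open("vpxd.log") as log:
    for line in log:
        m = pattern.match(line)
        if m:
            # e.g. 2013-12-10T14:32:10.849+01:00 -> keep the part before the UTC offset
            stamps.append(datetime.strptime(m.group(1)[:23], "%Y-%m-%dT%H:%M:%S.%f"))

gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
if gaps:
    print(f"{len(stamps)} IORM failures, average gap {sum(gaps) / len(gaps):.1f}s")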

Of course the host wasn't online, so no response was given to the request.

2013-12-10T14:39:14.202+01:00 [05700 error 'vpxdvpxdMoStorageResourceManager' opID=F1AD88ED-0000148A-f] [VpxDrmStorageResourceManager::DisableIORMOnHostInt] Failed to disable IORM for datastore datastore-3431 on host host-1248: vim.fault.NoHost

The hosts had 93 datastores, of which 76 had Storage I/O Control enabled. A quick calculation brought us to 26.6 minutes ((76 × 21) / 60), which was exactly the time the hosts had been in a "progress" state before being removed from the cluster and vCenter.
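
The same sanity check can be done against vCenter itself. Below is a hedged pyVmomi sketch, reusing the si connection from the first snippet, that counts the SIOC-enabled datastores and estimates the delay at roughly 21 seconds each.

from pyVmomi import vim

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.Datastore], True)
datastores = view.view
total = len(datastores)
# Datastore.iormConfiguration.enabled reflects the Storage I/O Control setting.
sioc_enabled = [ds.name for ds in datastores
                if ds.iormConfiguration and ds.iormConfiguration.enabled]
view.Destroy()

print(f"{len(sioc_enabled)} of {total} datastores have SIOC enabled")
print(f"Estimated removal delay: {len(sioc_enabled) * 21 / 60:.1f} minutes")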