Imagine, not entirely hypothetically, that you have a VMware ESXi 6.0 host that has disconnected from VMware vCenter due to an issue with the management agents on the host, but the virtual machines on the host are still running. Both the affected host and the VMs will show as "disconnected" in vCenter in this case. Attempts to reconnect from the vCenter side fail.

The usual next steps are to check the management network (eg, from the ESXi DCUI) -- it was fine -- and then to try restarting the management agents in ESXi from the DCUI or an SSH shell, which in this not entirely hypothetical case hung for (literally) hours. When you reach this point the usual advice is that the host has to be rebooted -- which is complicated when it has production VMs on it, and you cannot just vMotion those VMs somewhere else... because the connection to vCenter is broken :-(
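
For reference, on ESXi restarting the management agents from an SSH shell normally just means restarting the hostd and vpxa init scripts (roughly what the DCUI "Restart Management Agents" option does behind the scenes):

    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart

or, to restart the whole set of host services:

    services.sh restart

None of which helps when, as here, the restart itself is what hangs.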

If you are lucky enough to have:

  • ssh access to the affected ESXi host, so you can easily tell what is running there

  • your VMs hosted on shared storage

  • at least one other working ESXi host with capacity for the affected VMs connected to vCenter and the shared storage

  • ssh access to the working ESXi host

then there may be a relatively non-disruptive way out of this mess where you can cleanly shut down each VM and then start it up again on the working host even when the management agents are not working any more. (In our not entirely hypothetical case we got no response at all to any esxcli or vim-cmd or similar commands, including commands like df -- presumably because they all talk to the local management agents, which were wedged.)
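
To make that concrete, these are the usual commands for checking what is running on an ESXi host, and in our case anything of this sort just hung:

    esxcli vm process list
    vim-cmd vmsvc/getallvms
    df

(A ps-based substitute for the first of these is mentioned further down.)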

To be able to move the affected VMs with the least downtime like this you need:

  • to know the path on the shared storage to the VM's .vmx file (typically something like /vmfs/volumes/..../VM/VM.vmx)

  • to know the port group on the vDS (distributed switch) for each interface of the VM

  • to have a login (or a contact who can log in) to the VM, so you can shut it down from within the guest OS

Hopefully you can find the first two in your provisioning database (in a larger environment), or someone will remember where the VMs are stored (in a smaller environment); otherwise you will need to find them by manually browsing your storage and vDS in vCenter. Do find out both of these things before shutting down the VM, to minimise the downtime of the affected VMs.
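
If you do end up hunting for the .vmx path, an SSH shell on the working host (which can see the same shared storage) may be quicker than the datastore browser; something along these lines, where DATASTORE_NAME is just a placeholder for whichever datastore you want to search:

    # list the datastores this host can see
    ls /vmfs/volumes/
    # look for .vmx files on one datastore (drop -maxdepth if your find lacks it)
    find /vmfs/volumes/DATASTORE_NAME -maxdepth 2 -name '*.vmx'
    # the .vmx also records how each interface is wired to the vDS, although as
    # UUIDs and port numbers rather than friendly port group names
    grep -i ethernet /vmfs/volumes/PATH_TO_VM/VM/VM.vmx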

To move the VM in this manual way the approach is then:

  • Log into vCenter

  • Find a new, unused Port ID for the VM on the new working host, in the same port group. (Normally the host/vCenter allocates this for you, but because we bypass vCenter to register the VM it does not happen automatically.) To do this, go to the Networking page in vCenter, look in the "Ports" tab of the relevant port group for an empty Port ID line, and make a note of that Port ID number. If the VM has multiple interfaces you will need to do this for each interface. (If you do not do this, the VM will start up with its networking disconnected and you will get the error Invalid configuration for device '0' or similar, which will lead to unnecessary downtime. If you really cannot figure out the appropriate vDS port groups, you can leave this step until after you have shut down and re-registered the VM, but there will be more downtime.)

  • ssh into the new working host you plan to start the VM on, and prepare the command:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    ready to run as soon as it is time. This will register the VM on the new ESXi host, which will then tell vCenter "hey, I have this VM now", and the VM will no longer show as disconnected in vCenter. (I believe this works because it manually replicates what, eg, VMware HA does.)

  • ssh into the new working host in another window, and run:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    to check for the VM.vmx.lck file indicating the VM is running; it should be present at this point as the VM is still running on the affected host. Be ready to run this command again once the VM is shut down.

  • Now log into the guest OS (via ssh, RDP, etc) and ask the guest OS to shut down (or call your contact and ask them to do that). Monitor the shutdown progress by, eg, pinging the VM's external IP.

  • Once you see ping stop responding, wait a few seconds then re-run your:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    on the new working host. With luck the VM.vmx.lck file will be gone a few seconds after the VM stops responding to ping, indicating that the shutdown completed successfully.

  • Once the VM.vmx.lck file is gone, hit enter in the other window where you prepared the:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    command to register it on the new working host.

  • Then find the VM in vCenter -- it should no longer show as disconnected. Edit its settings, and for each network interface click on the "Advanced Settings" link, then change the Port ID of the vDS port it is connected to from the old one (tied to the broken host) to the free Port ID in the same port group that you found above. Save your changes.

  • Hit the Play button on the VM in vCenter (or power it on from the shell, as sketched after this list). All going well, the VM should start normally and connect to the network. Wait for the guest OS to boot and then check (or have your contact check) that it is working. (If it does not connect to the network, double check the Port ID that you set, and the guest OS -- by this point you should be able to open the VM's console again to look.)
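
For completeness, here is the shell side of those last few steps gathered into one sketch, all run on the new working host; vmsvc/getallvms and vmsvc/power.on are standard vim-cmd subcommands, and VMID stands in for whatever numeric ID getallvms reports for the freshly registered VM:

    # once the VM.vmx.lck file is gone, register the VM on this host
    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx
    # confirm it registered, and note the numeric VM ID it was given
    vim-cmd vmsvc/getallvms
    # after fixing the Port IDs in vCenter, this is the shell equivalent of the Play button
    vim-cmd vmsvc/power.on VMID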

All going well the downtime for each VM is about 30 seconds longer than the time it takes to shut down the guest OS in the VM, and start up the guest OS in the VM again -- so best case 1-2 minutes downtime.

Lather, rinse, and repeat to move the other VMs. I would suggest doing only one at a time to minimise the risk of getting confused about which step you are up to on which VM, and also minimise the downtime for each individual VM due to being distracted by working on a different VM.

If you are very lucky then after a while you may manage to shut down the VM that caused the host management agents to wedge/not start, and then the host management agents will start and the host will reconnect to vCenter. If so, you can then vMotion the remaining VMs off the affected host as normal. Otherwise keep going with the manual procedure until the host is empty. (You can tell it is empty because you no longer have disconnected VMs in vCenter; also "ps | grep vcpu | cut -f 2 -d : | sort | uniq" makes an acceptable substitute for "esxcli vm process list" -- the latter of which will just hang in this case.)
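
That one-liner works because each running VM shows up in ps as vcpu worlds named roughly vmx-vcpu-N:VMNAME, so cutting on the colon and de-duplicating leaves one line per running VM; and ps does not go through the management agents, which is presumably why it still answers when esxcli does not. So, on the affected host:

    # names of the running VMs (still works with hostd wedged)
    ps | grep vcpu | cut -f 2 -d : | sort | uniq
    # or just count how many are left to move
    ps | grep vcpu | cut -f 2 -d : | sort | uniq | wc -l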

Once the affected host is empty, either reboot it (or power cycle it if you cannot get in to reboot it), or, if it did reconnect to vCenter, put it into maintenance mode and then reboot it. That way all the management agents and the vmkernel get a fresh start. If the VMware host logs (eg, hostd.log, vmkernel.log, vpxa.log) do not show an obvious hardware cause of the problems -- so that it seems like a bug was triggered instead -- then it is probably safe to put the affected host back into production once it has been rebooted/power cycled.
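
Those logs live under /var/log on the host (normally symlinked into the scratch partition), so once the host is back -- or if the SSH shell is still responsive -- you can look at them directly, eg:

    tail -n 100 /var/log/hostd.log
    tail -n 100 /var/log/vpxa.log
    tail -n 100 /var/log/vmkernel.log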

Thanks to devnull4 and routereflector for the very useful hints to this process (and other useful information). In our not entirely hypothetical situation none of the esxcli or vim-cmd commands to be run on the non-working host worked -- they all just hung indefinitely -- so we skipped all of those, and just shut down the guests from within the guest OS. (As best we can tell from context, it seems like something confused CBT (Changed Block Tracking) on a specific guest on this host, which caused a pile-up of processes waiting on a lock, which caused all the symptoms. Moving a VM that we found out was being backed up via CBT at the time seemed to be the magic step that freed everything else up. The affected ESXi hosts were on a nearly-latest patch level, but we plan to patch them up to date in a maintenance window soon, in case the bug we seem to have stumbled across has already been fixed.)

The moral of this story is: if you find yourself in this situation, try to start the SSH shell before trying to restart the Management Agents. It will give you a second way to look at the affected host if the Management Agents do not just restart. In our case it took about 5-10 minutes before the SSH shell started, and during that time the DCUI did not respond to keyboard input. But by contrast, restarting the Management Agents through the DCUI took literally 3 hours, during which the DCUI was unusable -- so if we had not started the SSH shell first we would have had no visibility.