Xen to KVM

Background

About three years ago I managed to consolidate my pile of hosted machines down to a single HP DL380 G3 server (later with an external storage shelf acquired second hand), using Xen to run multiple Linux virtual machines. This all worked fairly well, give or take some minor Xen VM stability issues (occasionally they seemed to lose their network access, even when the recommended workarounds were applied, but restarting the virtual machine would cure it).

However the hardware is now getting fairly long in the tooth (it was second hand when I got it), and takes up a fair amount of space (7U including the storage shelf, plus a spare DL380 G3 offsite), so I decided to replace it with something newer and smaller. After some research I settled on a Dell R210 1U rackmount server, with the biggest disks possible (2 * 1TB) and a fair amount of RAM. The resulting server is probably twice as fast as the old one (alas the days of easily getting 3-4 times speed improvements with every hardware refresh are long gone).

Because the Xen hypervisor has never been fully merged into the Linux kernel, and especially in Debian Lenny the Xen support was relatively unstable, I decided to change virtualisation technology. And because a new Debian stable release is due out fairly soon, I also decided to change to Ubuntu LTS as the host operating system -- the "Long Term Support" version (10.04, in this case) should be supported for at least another 4 years, longer than I'm likely to want to keep the hardware (compared with another year or so for the current Debian stable). KVM is the virtualisation technology chosen by Ubuntu, and that has the advantage of being fully merged into Linux, and allowing the use of standard binaries thanks to the virtualisation technology built into modern CPUs.

Basic KVM setup

So I purchased a Dell R210 server, tested it, and installed it with Ubuntu Server 10.04 LTS and then proceeded to set up KVM. The base KVM installation on Ubuntu is very simple:

sudo aptitude install qemu-kvm libvirt-bin bridge-utils

(which will pull in lots of dependencies, including audio-related ones which seem of relatively little use on a server). It's also useful to install a couple of extra tools for maintaining disk images:

sudo aptitude install parted kpartx

By default KVM comes with a simple network bridge (virbr0), which suits simple use cases but not the more complicated "42 hosts in 1U" style setup that I needed to migrate. Fortunately it's possible to disable the default virbr0 with:

virsh net-list
virsh net-destroy default
virsh net-autostart default --disable

and instead use the Debian/Ubuntu network setup scripts to create multiple bridges of your own choice (I have a "DMZ" bridge, as well as an "internal servers" one, and then use a firewall to route between them). The bridge-utils-interfaces man page describes how you can automatically create a bridge on boot, in /etc/network/interfaces:

auto br-dmz
iface br-dmz inet manual
    bridge_stp off
    bridge_fd  0
    bridge_maxwait 0
    bridge_ports none

(which turns off Spanning Tree, sets the forwarding delay and startup wait to zero so that traffic is forwarded immediately, and doesn't bridge any existing interfaces -- ideal when you want to route into the bridge, rather than connect it at layer 2 with something else). (If you'd like to assign an IP address to the bridge for routing purposes, you can do that as normal: just use iface br-dmz inet static and set the normal fields for the IP address, etc.)
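With more than one bridge to define, the stanza above can be generated rather than retyped. This is a minimal sketch: bridge_stanza is a hypothetical helper name (it is not part of bridge-utils), and it only prints text, so the output still has to be added to /etc/network/interfaces by hand.

```shell
# bridge_stanza NAME: print an /etc/network/interfaces stanza for a
# standalone bridge, using the same settings as the br-dmz example.
# (Hypothetical helper for illustration only.)
bridge_stanza() {
    cat <<EOF
auto $1
iface $1 inet manual
    bridge_stp off
    bridge_fd  0
    bridge_maxwait 0
    bridge_ports none
EOF
}

# Example: print stanzas for two bridges, separated by a blank line
for br in br-dmz br-sv; do
    bridge_stanza "$br"
    echo
done
```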

After testing a simple virtual machine to make sure that the routing was working, I got started migrating virtual machines. Because I have about a dozen virtual machines, most of which are in production (some for paying clients) I needed a way to migrate from one virtual machine host to the other without extended service interruptions -- so taking everything down, copying it all over, and bringing it all back up again wasn't an option (it'd have taken at least the better part of a day of downtime).

Staged migration

The approach I settled on was to bridge together the virtual machines on the old HP DL380/Xen host with the virtual machines on the new Dell R210/KVM host, through a set of tunnels, so that the virtual machines on each host could talk directly with each other. If you use vtun (or OpenVPN) in Layer 2 mode, then the interfaces can be added into a bridge at each end to join everything together. I chose to use vtun, because I was already using OpenVPN for Layer 3 tunneling to remote workers (including my laptop), and because it's possible to use vtun without encryption, which is useful when transferring lots of data across a local switch.

For vtun Layer 2 bridging can be achieved by specifying:

default {
    type ether;
    proto udp;
    keepalive yes;
}

in the tunnel definitions (or the default for all tunnels, as seen above), and then using brctl to add the newly created interface into the existing bridges when the tunnel comes up, viz:

ifconfig "%% up";
program "brctl addif br-sv %%";

as an up action. Both ends are essentially the same, but one end needs to be configured to start as a server, and the other as a client connecting to it. The magic part is the tunnel type of ether, which means that it will pass layer 2 packets, so that, eg, ethernet broadcasts like ARP will work and it can be added into an ethernet bridge. (The default types, including tun can only be used for Layer 3 routing.)

With the tunnels and bridges set up, virtual machines on either host can talk directly to each other, and it is possible to migrate one virtual machine at a time without any of the others caring where it is running. Providing the MAC address of each ethernet interface remains constant (by specifying the same one in the virtual machine configuration), nothing will even notice where the virtual machines are located. (There are some caveats, including the fact that the tunnelling overhead will cause more packet fragmentation, but as a short term migration strategy this can be ignored. However do avoid setting up multiple tunnels for the same bridges in parallel -- for example by starting vtun again without making sure the previous instance has stopped -- as that will cause switching loops, which are bad. The most obvious symptom of this is lots of kernel messages about interfaces being disabled/enabled, especially if your bridges have spanning tree turned off. Stopping all the tunnels and then starting just the right ones again usually works to fix this up.)

Migrating a VM from Xen to KVM

I found several guides to migrating from Xen to KVM, but also needed to improvise a bit myself. In particular, for historical reasons related to the difficulty in accessing parts of disk images from the host machine, all my virtual machines had "one logical volume per disk partition", rather than the layout used on a physical host (and expected by KVM): a single disk (image) containing multiple partitions recognised by the virtual machine. Particularly in order to successfully boot the KVM virtual machines with pc-grub, they needed to have a more standard disk layout. In order to partly automate the disk partitioning portion of the work, I wrote a hacky perl script to read the Xen virtual machine configuration file, look at the disks presented to the virtual machine, and turn that into commands to make the new logical volumes for the KVM virtual machine. The script produces output like:

$ ./xen2singlevm /etc/xen/fileserver 
sudo lvcreate -n "fileserver_sda" -L 219251M /dev/r1
sudo parted "/dev/mapper/r1-fileserver_sda" mklabel msdos
sudo parted "/dev/mapper/r1-fileserver_sda" mkpart primary "ext2" 1 4296
sudo parted "/dev/mapper/r1-fileserver_sda" mkpart primary "ext2" 4296 84827
sudo parted "/dev/mapper/r1-fileserver_sda" mkpart primary "ext2" 84827 219249
sudo parted "/dev/mapper/r1-fileserver_sda" set 1 boot on
sudo kpartx -a "/dev/mapper/r1-fileserver_sda"

(where "r1" is the LVM volume group on the destination, and "fileserver" is the virtual machine to migrate). There are some complexities in the script particularly because at least historically the LVM tools output values in MiB (1024*1024) and GiB (1024*1024*1024) (even though they call them "MB" and "GB"), and parted expects the partitioning information in MB (1000*1000), even though the underlying disk sectors are 512 bytes. My hacky perl script kludges over this by calculating sizes in bytes, and rounding up a bit. It also allows 1MB at the start of the disk for boot information -- much more than is needed, but the easiest amount to permit, given parted's units of MB. (Sander van Vugt's article was most helpful in figuring out this approach.)
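The unit juggling can be illustrated in a few lines of shell. This is only a sketch of the arithmetic, not the perl script itself: the helper names are made up, and the sizes of 4096 MiB and 76800 MiB are assumed example volumes, chosen because they line up with the first two mkpart lines above.

```shell
# mib_to_mb_ceil MIB: convert a size in MiB (1024*1024 bytes) to MB
# (1000*1000 bytes), rounding up so the partition is never too small.
mib_to_mb_ceil() {
    echo $(( ($1 * 1048576 + 999999) / 1000000 ))
}

# partition_bounds MIB...: print "start end" (in MB, as parted expects)
# for each partition in turn, leaving the first 1MB of the disk free
# for boot information.
partition_bounds() {
    start=1
    for mib in "$@"; do
        end=$(( start + $(mib_to_mb_ceil "$mib") ))
        echo "$start $end"
        start=$end
    done
}

# Assumed example sizes (not taken from the real volumes):
partition_bounds 4096 76800
# prints:
#   1 4296
#   4296 84827
```

Rounding up in bytes before dividing means each partition is at most 1MB larger than the volume it replaces, which matters because a partition that is even slightly too small will truncate the copy.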

Having prepared the destination disk space, and made the partitions available, the next step is to prepare the virtual machine to be booted with a standard kernel rather than with a Xen-specific kernel via pygrub. To do this a few new packages need to be installed:

sudo sh -c 'echo "do_initrd = Yes" >>/etc/kernel-img.conf'
sudo aptitude install linux-image-2.6-686 udev grub acpid

and then the Xen virtual machine can be shut down from the console.

Once the Xen virtual machine is shut down, the disk images can be copied one at a time from the old machine to the new machine. I used netcat (nc) to copy the images, because it's fairly quick and they were just going over a local switch; ssh or similar could also be used. Something like:

(source machine)$ sudo dd if=/dev/mapper/r5-fileserver_root bs=32768 | nc -q 15 DESTMACHINE 5000

(destination machine)$ nc -l DESTMACHINE 5000 | sudo dd iflag=fullblock of=/dev/mapper/r1-fileserver_sda1 bs=32768

will copy the disk image across. (The "sda1" style partitions of the logical volumes are made available by kpartx -a and removed by kpartx -d; in addition parted makes its own maps for the partitions, as "sdap1" and similar. They appear to map to exactly the same bits of disk.) If for some reason the disk copy fails before the end of the source volume check (a) that you're really copying into the destination disk partition (and not, eg, creating a file under a typo name which then fills up), and (b) that the destination partition really is larger than the source and hasn't, eg, been rounded down to the nearest cylinder boundary -- if in doubt, add a bit of size margin to the partition and copy again (you can always resize the file system to fill the partition later).

The -q 15 makes the source netcat disconnect once it has copied everything (actually 15 seconds later), which should be the default behaviour but isn't (annoyingly it will stay connected forever by default). The blocksize (bs) helps improve the streaming across the network, and iflag=fullblock ensures that dd does proper rebuffering of data coming in from the network so that it is doing efficient writes to the disk. With those options I was able to sustain about 25 MB/s (around 200Mbps) copying between the two machines (over a gigabit network infrastructure -- the disks are slower than 100MB/s, and I suspect the HP DL380 G3 network interfaces are not optimally connected to stream at full 1Gbps anyway).

Also of note, the destination path name is the partition within the logical volume, created earlier; the kpartx -a call is what makes that visible, so that we can write into the middle of the logical volume/disk image.
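The "is the destination really big enough" check mentioned above is easy to do before starting a long copy. A minimal sketch, assuming the sizes are obtained with blockdev --getsize64 on each machine; the fits helper and the byte values below are made up for illustration.

```shell
# fits SRC_BYTES DST_BYTES: succeed only if the destination is at least
# as large as the source. (Hypothetical helper for illustration.)
fits() {
    [ "$2" -ge "$1" ]
}

# On the real machines the two sizes would come from something like:
#   sudo blockdev --getsize64 /dev/mapper/r5-fileserver_root    (source)
#   sudo blockdev --getsize64 /dev/mapper/r1-fileserver_sda1    (destination)
src_bytes=4294967296    # example value only (4096 MiB)
dst_bytes=4295000064    # example value only

if fits "$src_bytes" "$dst_bytes"; then
    echo "destination is large enough"
else
    echo "destination is too small, enlarge the partition first"
fi
```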

Once the disk image is copied, it's very useful to confirm that it was all copied correctly. I used md5sum as a quick check, viz:

(source machine)$ sudo md5sum /dev/mapper/r5-fileserver_root
(destination machine)$ sudo dd if=/dev/mapper/r1-fileserver_sda1 bs=1048576 count=1024 | md5sum

where the count is the number of MiB in the source volume (because that bs=1048576 value is 1024*1024), so that we read only the data that we've initialised and thus expect the checksums to match. Both results should match; if not, something has gone wrong somewhere and the disk image may have to be copied again.
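The same comparison can be wrapped up so that the count is derived from the source size rather than typed in. This is a sketch with invented helper names; it assumes the source volume is a whole number of MiB (which should hold for LVM volumes with the usual extent sizes) and that its size in bytes is obtained with blockdev --getsize64.

```shell
# size_in_mib BYTES: number of whole MiB in BYTES, for use as dd's count
size_in_mib() {
    echo $(( $1 / 1048576 ))
}

# prefix_md5 PATH MIB: checksum only the first MIB mebibytes of PATH,
# so a larger destination partition can be compared with its source.
prefix_md5() {
    dd if="$1" bs=1048576 count="$2" 2>/dev/null | md5sum | cut -d' ' -f1
}

# Possible use (run as root; device names as in the examples above):
#   count=$(size_in_mib "$(blockdev --getsize64 /dev/mapper/r5-fileserver_root)")
#   prefix_md5 /dev/mapper/r5-fileserver_root "$count"     # on the source
#   prefix_md5 /dev/mapper/r1-fileserver_sda1 "$count"     # on the destination
```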

After all the disk images are copied (and I skipped the ones for swap, and just used mkswap -f /dev/mapper/r1-fileserver_sdbp1 to make new swap partitions instead), the next step is to ensure that the partition table recognises the file systems:

kpartx -d /dev/mapper/r1-fileserver_sda
parted /dev/mapper/r1-fileserver_sda print

If it doesn't recognise "ext3" and "linux-swap(v1)" (or whatever your file system types are), it may help to run cfdisk against the disk volume, and see if that prompts it to recognise the file system (I had some issues where the file systems apparently weren't being recognised, which led to booting issues due to the disk partitions not mounting, hence this extra check; the Debian hints on Debugging Initramfs were helpful in tracking down that the boot was stalling due to not recognising the disk image).

Once the disk is ready, the virtual machine can be booted with KVM.

Initial KVM boot

To get the virtual machine to boot with KVM the first time it's necessary to boot using an external kernel and initramfs file, because grub hasn't been installed yet (due, in part, to the different disk layout the virtual machine had under Xen). The easiest way is to copy the vmlinuz and initrd.img files out of a working virtual machine to somewhere on the KVM host.

I'm using libvirt to manage my virtual machines since that is what the Ubuntu guide suggested. For the first boot, create a basic KVM configuration file:

[domain type='kvm']
  [name]fileserver[/name]
  [memory]131072[/memory]
  [currentMemory]131072[/currentMemory]
  [vcpu]1[/vcpu]
  [os]
    [type arch='i686' machine='pc-0.12']hvm[/type]
    [kernel]/home/ewen/vmlinuz-2.6.32-5-686[/kernel]
    [initrd]/home/ewen/initrd.img-2.6.32-5-686[/initrd]
    [cmdline]console=ttyS0,9600 root=/dev/sda1 init=/bin/sh[/cmdline]
    [boot dev='hd'/]
  [/os]
  [features]
    [acpi/]
  [/features]
  [clock offset='utc'/]
  [on_poweroff]destroy[/on_poweroff]
  [on_reboot]restart[/on_reboot]
  [on_crash]destroy[/on_crash]
  [devices]
    [emulator]/usr/bin/kvm[/emulator]
    [disk type='block' device='disk']
      [source dev='/dev/r1/fileserver_sda'/]
      [target dev='sda' bus='scsi'/]
    [/disk]
    [interface type='bridge']
      [mac address='00:16:3e:6c:76:52'/]
      [source bridge='br-sv'/]
      [target dev='vnet0'/]
    [/interface]
    [serial type='pty']
      [target port='0'/]
    [/serial]
    [console type='pty']
      [target port='0'/]
    [/console]
  [/devices]
[/domain]

(where the "[" should be replaced with an open angle bracket, and "]" should be replaced by a closed angle bracket; it's supposed to be XML, but I've had to change it to get it through the RSS feed.)
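If you copy the listing above as-is, the bracket substitution can be undone mechanically. A small sketch, assuming none of the values contain literal square brackets (true of the example); the function name and file names are placeholders.

```shell
# brackets_to_xml: filter that turns the [tag] notation used above back
# into real <tag> XML. Assumes no literal square brackets in the values.
brackets_to_xml() {
    sed -e 's/\[/</g' -e 's/]/>/g'
}

printf '%s\n' "[disk type='block' device='disk']" | brackets_to_xml
# prints: <disk type='block' device='disk'>

# Possible use (placeholder file names):
#   brackets_to_xml < fileserver.txt > fileserver.xml
```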

The important parts here are:

- kernel, initrd and cmdline values to specify booting from outside the virtual machine.
- the cmdline includes parameters for serial console (KVM and libvirt do have options for graphical console, but I've not bothered as all my virtual machines are text-only, and I've used serial console for over a decade)
- acpi is enabled for the virtual machine so that Linux knows how to power it off
- the disks are listed as "sda", "sdb", etc, based on the newly assembled disk images, which have partition tables within them.
- the network interface is connected to the appropriate bridge created earlier, and given the same MAC address as the previous Xen virtual machine (this is particularly important because udev loves to rename network interfaces any time they get a new MAC address, which stops all the network configuration from applying -- hence the need to lock the MAC addresses in place)
- the "serial" and "console" sections tell the virtual machine what to do with the serial information, in this case via a virtual tty/pty which can be picked up with the libvirt interface.

This configuration file can then be introduced to libvirt:

sudo virsh --connect qemu:///system
define /path/to/XML/file/above

dumpxml VMNAME can be used to check that it was read correctly, and edit VMNAME can be used to make subsequent changes.

Providing the configuration looks sane, the virtual machine can be started with:

start VMNAME

and then you can connect to the console:

console VMNAME

(I use a small console wrapper shell script to wait on the virtual machine appearing and connect soon after it starts, run from another terminal, in order to see the early boot messages.)
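A wrapper along those lines might look like the following. It is a sketch of the idea rather than the actual script: the function names are invented, and it assumes that virsh domstate reports "running" once the domain has started.

```shell
# wait_until TRIES DELAY CMD...: run CMD repeatedly until it succeeds,
# sleeping DELAY seconds between attempts; fail after TRIES attempts.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$(( tries - 1 ))
        sleep "$delay"
    done
    return 1
}

# vm_running VMNAME: succeed if libvirt reports the domain as running
vm_running() {
    [ "$(virsh --connect qemu:///system domstate "$1" 2>/dev/null)" = "running" ]
}

# Possible use, started in a second terminal before running "start VMNAME":
#   wait_until 120 1 vm_running fileserver &&
#       virsh --connect qemu:///system console fileserver
```

Polling once a second is coarse, so the very first lines of kernel output may still be missed; a shorter delay narrows that window.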

Assuming it all works you should end up at a shell in your new virtual machine, in single user mode. (If you allow the virtual machine to boot fully, and networking works, you could also ssh into the virtual machine to do the rest of the setup.)

Configuring KVM virtual machine to boot automatically

To configure the virtual machine to boot automatically it's necessary to set up grub inside the virtual machine:

mount -o remount,rw /
mv /boot/grub/menu.lst /boot/grub/menu.lst-pygrub    # if using pygrub
update-grub
vi /boot/grub/menu.lst

and then ensure "kopt" contains "console=ttyS0,9600", and that grub knows to use serial console early in the boot, with these configuration lines in /boot/grub/menu.lst:

serial --unit=0 --speed=9600 --word=8 --parity=no --stop=1
terminal --timeout=5 console serial

When that's all done, run:

update-grub
/usr/sbin/grub-install /dev/sda

Also ensure that you have a getty running on the serial console so that you can log in:

vi /etc/inittab
# uncomment entry for /dev/ttyS0
kill -HUP 1
vi /etc/securetty
# ensure ttyS0 is listed (entries are written without the /dev/ prefix)

It's also useful to get rid of the most recent Xen kernel, in order that the standard kernel will boot by default:

aptitude purge linux-image-2.6-xen-686 linux-image-2.6.26-2-xen-686 linux-modules-2.6.26-2-xen-686

(assuming that you have Debian Lenny virtual machines.)

Once that's all done, halt the virtual machine and go back to the virsh interface, so we can tell it to boot from within the virtual machine, using grub.

Use edit VMNAME to edit the virtual machine configuration and remove these three lines:

    [kernel]/home/ewen/vmlinuz-2.6.32-5-686[/kernel]
    [initrd]/home/ewen/initrd.img-2.6.32-5-686[/initrd]
    [cmdline]console=ttyS0,9600 root=/dev/sda1 init=/bin/sh[/cmdline]

which override the boot method.
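With a dozen virtual machines to convert, the same change can be made without an interactive editor. A minimal sketch, assuming each of the three elements sits on its own line (as in the configuration above); the function name and file names are placeholders.

```shell
# strip_external_boot: filter that removes the <kernel>, <initrd> and
# <cmdline> lines from a libvirt domain definition, so that the domain
# boots via grub from its own disk. Assumes one element per line.
strip_external_boot() {
    sed -e '/<kernel>/d' -e '/<initrd>/d' -e '/<cmdline>/d'
}

# Possible use (placeholder file names):
#   virsh --connect qemu:///system dumpxml VMNAME > vm.xml
#   strip_external_boot < vm.xml > vm-grub.xml
#   virsh --connect qemu:///system define vm-grub.xml
```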

Start up the virtual machine again (start VMNAME), and watch to see that it boots fully and is usable. If it comes up properly that virtual machine is done.

The last step is to enable autostart on the KVM host:

virsh autostart VMNAME

and disable automatic starting on the Xen host (eg, move it out of /etc/xen/auto).

Lather, rinse, and repeat. For some of the large disk images it can take multiple hours to copy them across, so it may be worth copying just the boot disks and starting the virtual machine up on the KVM host, and then copying the data disks over separately.