Ubuntu 10.04 ("Lucid Lynx" aka "Lucid") was a Long Term Support (LTS) release from 2010-04-29, with support in the server version for 5 years -- which ran out at the end of April 2015 (2015-04-29). I upgraded most of the Ubuntu 10.04 LTS systems that I support to Ubuntu 14.04 over the last year. Many of those upgrades were simple, and some relatively easy -- mostly depending on how much non-trivial hardware or non-trivial functionality the system had.
The last system to get upgraded, due to the complexity, was my own colo server running multiple KVM VMs, set up in 2010. I finally upgraded that system to Ubuntu 12.04 LTS ("Precise Pangolin") earlier this month and then to Ubuntu 14.04 LTS ("Trusty Tahr") today, thus gaining 2-4 more years of long term support (depending on the package). I had put off upgrading it -- until after the end of support of 10.04 LTS as it turned out -- until I had enough time to deal with the fallout, because I expected it not to go entirely smoothly. And it did not disappoint -- the 10.04 to 12.04 upgrade in particular both left all my VMs unable to boot until their configuration was manually updated, and caused a fairly regular stream of "high load average" reports from my server monitoring. The 14.04 upgrade seemed a bit smoother, in part because I did prepare the VM configurations with all the changes I could find in advance. (As a side note, I had intended to do these two upgrades back to back as with my previous upgrades from 10.04 to 14.04, but had to stop at 12.04 -- fortunately pre-reboot so the VMs were still running -- when a client experienced a "no power in data centre" complete failure of A and B side power; the cleanup from everything restarting took much of the day.)
I've listed some notes on issues experienced below, for future reference.
10.04 (Lucid) to 12.04 (Precise)
Qemu/KVM VM startup failure
The main issue on rebooting after the upgrade to 12.04 LTS was that all the VMs were shown as "running", but none of them had anything on the serial console, nor were they reachable on the network. That took a while to debug. (It also turned out not to be due to the KVM/libvirt serial console issue where duplicate "console" entries in the libvirt config file made the config invalid and the VM failed to even register; mine were visible in the "vmc" list, and shown as running -- just not doing anything useful.)
After struggling with the VMs for a while trying to get any signs of life out of them, I eventually realised that they were failing in the QEMU BIOS -- ie, not even getting far enough for grub to run, hence no serial console or network activity. To debug that I had to add a GUI console to the VM config (since the 1980s-style PC BIOS behaviour still follows us around :-( ). The easiest way, when remote, is to add a VNC network console, ideally on a fixed port so you can connect to it without lots of guessing.
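For reference, the VNC console I mean looks something like this in the libvirt domain XML (a sketch; the port and listen address here are just examples) -- add it inside the <devices> section with virsh edit:

<!-- Hypothetical example: fixed-port VNC console, bound to localhost so it
     can be reached via an SSH tunnel rather than being exposed directly -->
<graphics type='vnc' port='5901' autoport='no' listen='127.0.0.1'/>

Any VNC client pointed at that port (eg, through an SSH tunnel to the host) then shows the emulated BIOS output.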
The GUI console showed the Qemu/KVM PC BIOS complaining it had no boot devices:
Booting from Hard Disk...
Boot failed: could not read the boot disk
Booting from Floppy...
Boot failed: could not read the boot disk
Booting from CD-Rom...
Boot failed: could not read from CDROM (code 0003)
No bootable device.
which turned out to be because I had used SCSI devices for all my emulated disks (at one point in virtualisation land they were lower overhead than IDE -- plus they easily allowed more than 2/4 disks -- so many of my VMs on many VM platforms used them). With the Qemu/KVM on Ubuntu 10.04 LTS it appears you could boot from SCSI devices; but with the Qemu/KVM on Ubuntu 12.04 LTS you definitely cannot (as discussed on the KVM list). Apparently SeaBIOS 1.7.0 (as used by Qemu/KVM on Debian Wheezy and Ubuntu Linux 12.04 LTS) does not support SCSI boot; later versions seem to have various types of SCSI boot support.
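A quick way to confirm which SeaBIOS version a host is actually running (assuming SeaBIOS comes from the separate seabios package, as it does on Debian/Ubuntu of this era; some qemu builds bundle their own bios.bin instead):

# Report the installed SeaBIOS package version on the VM host
dpkg -s seabios | grep '^Version'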
The immediate work around when I recognised this problem was to change the first couple of disks in each VM to be "ide", and then deal with the out-of-order disk issues that resulted (because mixing IDE and SCSI disks on a system almost guarantees that the BIOS and the Linux device discovery will find them in different orders -- yet another 1980s PC hardware legacy). Fortunately, thanks to fighting these issues for 20+ years, most of my VMs mounted their disks by either LABEL= (file system label; you can use swaplabel to add one to swap partitions too!) or UUID= (unique ID, which is reasonably reliable except in the case of MD/RAID 1...). So the actual disk discovery order didn't matter too much -- and where it did I could boot with just the first two (IDE) disks enabled in the VM and sort out the issues there.
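Checking what a guest is actually using is straightforward (example commands run inside the guest; the swap device name below is made up):

# List the LABEL and UUID of every block device the guest can see
blkid

# Add a label to an existing swap partition so it can be referenced
# as LABEL=swap0 in /etc/fstab (example device name)
swaplabel -L swap0 /dev/sdb1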
Over the next few days I converted the VMs to use virtio disks exclusively, which either did not exist with Qemu/KVM on Ubuntu 10.04 LTS, or were very new then, but since about 2012 have become the best supported type for efficient disk access. At least in recent Linux OSes it's possible to boot from virtio disks (they appear as /dev/vda, /dev/vdb, etc). The main tricks to doing this:
1. Make sure that the VM is finding its root disk by UUID; if necessary edit /etc/default/grub and comment out:

   #GRUB_DISABLE_LINUX_UUID=true

   to enable mounting the root disk by UUID (double negatives for the win). Without this, on first reboot the initramfs will probably not find the root disk and you will be sad (revert the change to the VM config, boot up, fix this, and then try again).

2. Change references in /etc/fstab that are to /dev/sd... to be /dev/vd...

3. Run update-grub to ensure the grub config is current.

4. Shut down the VM (it has to be powered off).

5. Edit the VM configuration and change the disk names to "vda", "vdb", etc and the bus to be "virtio"; if there is an "address" entry for the IDE/SCSI drives, then remove it, to allow a new PCI address entry to be created.

6. Start the VM up again, and make sure it boots.

7. If desired, disable mounting the root disk by UUID again, by editing /etc/default/grub and uncommenting:

   GRUB_DISABLE_LINUX_UUID=true

   and then running update-grub again. This avoids having insanely long names (difficult to match to physical devices) in the output of df -m, which throw off the formatting and make it harder to follow (a problem up through Debian Wheezy, though it appears Ubuntu 14.04 LTS may avoid this issue).

8. Reboot the VM again to make sure it boots if you changed the grub config.
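Condensed into commands, the process looks roughly like this (a sketch rather than a script to run blindly; the VM name "examplevm" is made up, and the fstab edit should be eyeballed before rebooting):

## Inside the guest, before shutting it down
sudo sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
sudo sed -i 's|/dev/sd|/dev/vd|g' /etc/fstab     # check the result by hand
sudo update-grub

## On the host
virsh shutdown examplevm     # wait until it is actually powered off
virsh edit examplevm         # target dev -> vdX, bus -> virtio, drop old <address> entries
virsh start examplevm        # and confirm it boots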
In theory at that point one could get rid of both the IDE and SCSI controllers; but I haven't done that for most of my VMs yet.
Of note, the order of the "vda", "vdb", etc disks is detected by the Linux kernel based on their order on the PCI bus. If you are manually adding them later, make sure that the PCI bus slot specified is numerically increasing in the order you want them detected. The target dev value is only a label -- it does not determine how the Linux kernel will find those devices (but it is worth trying to keep them consistent for your own sanity).
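For example, two virtio disks that should be discovered as vda then vdb might be configured with increasing slot numbers like this (a sketch; the source paths are invented):

<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/vg0/examplevm-root'/>
  <target dev='vda' bus='virtio'/>   <!-- label only; order comes from the PCI slot -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/vg0/examplevm-data'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>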
Linux 3.2 Load Average issues
On the Linux 3.2 kernel used by Ubuntu 12.04 LTS the load average of the host machine seemed substantially higher than on the 2.6.32 kernels used in Ubuntu 10.04 LTS -- and substantially higher than made sense based on the process activity on the machine. Prior to the upgrade I would normally expect the load average to sit under 1.0 (the hosted VMs are mostly idle and there is little else running on the host machine), but with the Linux 3.2 kernel on Ubuntu 12.04 LTS the load average rarely went below 1.0 and often sat between 5.0 and 10.0. This caused a lot of "high CPU" alerts from my monitoring system.
It turns out that the kernel Load Average calculations changed significantly around the Linux 3.2 kernel, which led to a lot of bug reports. The main symptom is few if any processes taking CPU, but a load average of 1.x (or higher).
The underlying issue is that the Load Average calculation was substantially changed to avoid undercounting CPU usage, but had the side effect of no longer decaying reliably back below 1.x when the system was mostly idle. So the Load Average could end up stuck above 1.x forever. There seemed to be some patches to Ubuntu 12.04 LTS kernels (eg, 3.2.0-32.51 onwards) to try to improve it, but the only reliable solution reported was to run a later Linux kernel (Linux 3.5 or higher). I put up with the alerts for a couple of weeks until I could find time to upgrade to Ubuntu 14.04 LTS. (After upgrading to Linux 3.13 on Ubuntu 14.04 LTS the load average seems sensible -- even after upgrading to Ubuntu 14.04 LTS without rebooting the load average seemed better, but possibly only due to fewer running VMs.)
12.04 (Precise) to 14.04 (Trusty)
sudoers
The most obvious upgrade issue was that sudo appeared to have changed its hostname matching from hostname --fqdn to hostname (again?), which meant that the custom entries I had to support snmpd (running as snmp) using sudo to run certain commands as root, without needing a password, stopped working -- leading to lots of my needs-root monitoring checks failing and lots of alerts. The easy fix was to change the /etc/sudoers.d/snmp file to allow both the hostname (naosdell) and hostname --fqdn (dellr210.naos.co.nz) names to work:
# Special case for RAID monitoring via snmp
#
snmp naosdell,dellr210.naos.co.nz = NOPASSWD: /usr/local/sbin/raidstatus
snmp naosdell,dellr210.naos.co.nz = NOPASSWD: /usr/local/sbin/smartsummary
and then all was well again. (/usr/local/sbin/raidstatus and /usr/local/sbin/smartsummary are wrapper scripts that SNMP can use to get a useful status summary of a MD RAID set or hard drive SMART status into one line for reporting in the monitoring system.)
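For the curious, a minimal sketch of what such a raidstatus-style wrapper might look like -- not my actual script -- condensing /proc/mdstat into a single "md0:[UU] md1:[UU]" style line:

#!/bin/sh
# Hypothetical raidstatus-style wrapper: summarise every MD array's member
# state from /proc/mdstat onto one line, for snmpd to report via sudo.
awk '
/^md/ { array = $1 }                      # remember the md device name
array && /\[[U_]+\]/ {                    # the following line carries eg [UU]
    match($0, /\[[U_]+\]/)
    printf "%s:%s ", array, substr($0, RSTART, RLENGTH)
    array = ""
}
END { print "" }
' /proc/mdstat

The smartsummary script does the same sort of thing for SMART status: one short line the monitoring system can parse.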
Nom nom nom
Two new RAID-related options appeared in the kernel options with Ubuntu 14.04 -- nomdmonddf and nomdmonisw. They appear to be underdocumented, and basically only relate to Ubuntu. AFAICT they relate to the plan to change from using dmraid to mdadm to assemble RAID devices; but I've been using mdadm for years.
The "nom nom nom" duplication seems to be caused by:
- /etc/default/grub.d/dmraid2mdadm.cfg containing:

  GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT nomdmonddf nomdmonisw"

  which unilaterally appends them to the config if the mdadm package is installed, even if they are already there ( :-( );

- dpkg-reconfigure grub-pc, which picks up those arguments and puts them into /etc/default/grub, so that they're baked in for another round of adding them; and

- update-grub also adding them onto the command line written into /boot/grub/grub.cfg, even if they're already there in the text coming from /etc/default/grub.
The correct work around appears to be to ensure that they appear zero times in /etc/default/grub, and allow them to be added to /boot/grub/grub.cfg automatically. And tidy up /etc/default/grub each time after running dpkg-reconfigure grub-pc -- eg, after having to reinstall the boot records.
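A rough way to check and tidy this up (assuming the stock file locations; count how many copies ended up in the generated config, then strip any that leaked into /etc/default/grub and regenerate):

# How many times did the option end up in the generated config?
grep -o nomdmonddf /boot/grub/grub.cfg | wc -l

# Remove any copies from /etc/default/grub, then rebuild grub.cfg
sudo sed -i 's/ *nomdmonddf//g; s/ *nomdmonisw//g' /etc/default/grub
sudo update-grub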
There are some Ubuntu bugs about this (#1291434 and #1318351), so it might eventually be handled better. Making the "ensure these parameters are present" logic idempotent -- so it only adds them once -- would seem like a good start!
Grub: error: diskfilter writes are not supported
On rebooting an Ubuntu 14.04 LTS system with storage on MD RAID and LVM, especially with "quiet splash" on the kernel command line, one gets the rather scary message:
error: diskfilter writes are not supported
Press any key to continue...
and pressing keys doesn't seem to do much beyond moving the cursor around. This is precisely what one does not want to see on a colo system that is being rebooted -- even if pressing a key did work, having to do it on every reboot is a terrible inconvenience that can lead to downtime.
It turns out this is a known bug, related to being unable to save records of failed boots when using MD/LVM. Fortunately the system does actually boot normally after 5-10 seconds delay, even if no key is pressed. And pressing a key does appear to cause it to continue booting immediately.
For now I've just removed quiet from the boot options so it is more obvious that the boot is actually continuing. There are some work arounds for the grub configuration listed in the bug comments which apparently avoid this problem for MD RAID and LVM, which hopefully will eventually be incorporated into the grub package by Ubuntu. But for now, the occasional extra 10 second delay on rebooting should be okay -- as this colo VM host gets rebooted infrequently.
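Removing quiet is just an edit to the default kernel arguments followed by regenerating the grub config (a sketch, assuming the stock Ubuntu layout; splash could be dropped the same way if desired):

# Drop "quiet" from the default kernel command line and rebuild grub.cfg
sudo sed -i 's/\<quiet\>//' /etc/default/grub
grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub   # eyeball the result
sudo update-grub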