Ubuntu 10.04 ("Lucid Lynx" aka "Lucid") was a Long Term Support (LTS) release from 2010-04-29, with support in the server version for 5 years -- which ran out at the end of April 2015 (2015-04-29). I upgraded most of the Ubuntu 10.04 LTS systems that I support to Ubuntu 14.04 over the last year. Many of those upgrades were simple, and some relatively easy -- mostly depending on how much non-trivial hardware or non-trivial functionality the system had.
The last system to get upgraded, due to the complexity, was my own colo server running multiple KVM VMs, set up in 2010. I finally upgraded that system to Ubuntu 12.04 LTS ("Precise Pangolin") earlier this month and then to Ubuntu 14.04 LTS ("Trusty Tahr") today, thus gaining 2-4 more years of long term support (depending on the package). I had put off upgrading it -- until after the end of support of 10.04 LTS as it turned out -- until I had enough time to deal with the fallout, because I expected it not to go entirely smoothly. And it did not disappoint -- the 10.04 to 12.04 upgrade in particular both left all my VMs unable to boot until their configuration was manually updated, and caused a fairly regular stream of "high load average" reports from my server monitoring. The 14.04 upgrade seemed a bit smoother, in part because I did prepare the VM configurations with all the changes I could find in advance. (As a side note, I had intended to do these two upgrades back to back as with my previous upgrades from 10.04 to 14.04, but had to stop at 12.04 -- fortunately pre-reboot so the VMs were still running -- when a client experienced a "no power in data centre" complete failure of A and B side power; the cleanup from everything restarting took much of the day.)
I've listed some notes on issues experienced below, for future reference.
10.04 (Lucid) to 12.04 (Precise)
Qemu/KVM VM startup failure
The main issue on rebooting after the upgrade to 12.04 LTS was that all the VMs were shown as "running", but none of them had anything on the serial console, nor were they reachable on the network. That took a while to debug. (It also turned out not to be due to the KVM/libvirt serial console issue where duplicate "console" entries in the libvirt config file made the config invalid and the VM failed to even register; mine were visible in the "vmc" list, and shown as running -- just not doing anything useful.)
After struggling with the VMs for a while trying to get any signs of life out of them, I eventually realised that they were failing in the QEMU BIOS -- ie, not even getting far enough for grub to run, hence no serial console or network activity. To debug that I had to add a GUI console to the VM config (since the 1980s-style PC BIOS behaviour still follows us around :-( ). The easiest way, when remote, is to add a VNC network console, ideally on a fixed port so you can connect to it without lots of guessing.
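For reference, the VNC console I mean looks something like this in the libvirt domain XML (a sketch; the port and listen address here are just examples) -- add it inside the <devices> section with virsh edit:

<!-- Hypothetical example: fixed-port VNC console, bound to localhost so it
     can be reached via an SSH tunnel rather than being exposed directly -->
<graphics type='vnc' port='5901' autoport='no' listen='127.0.0.1'/>

Any VNC client pointed at that port (eg, through an SSH tunnel to the host) then shows the emulated BIOS output.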
The GUI console showed the Qemu/KVM PC BIOS complaining it had no boot devices:
Booting from Hard Disk...
Boot failed: could not read the boot disk
Booting from Floppy...
Boot failed: could not read the boot disk
Booting from CD-Rom...
Boot failed: could not read from CDROM (code 0003)
No bootable device.
which turned out to be because I had used SCSI devices for all my emulated disks (at one point in virtualisation land they were lower overhead than IDE -- plus they easily allowed more than 2/4 disks -- so many of my VMs on many VM platforms used them). With the Qemu/KVM on Ubuntu 10.04 LTS it appears you could boot from SCSI devices; but with the Qemu/KVM on Ubuntu 12.04 LTS you definitely cannot (as discussed on the KVM list). Apparently SeaBIOS 1.7.0 (as used by Qemu/KVM on Debian Wheezy and Ubuntu Linux 12.04 LTS) does not support SCSI boot; later versions seem to have various types of SCSI boot support.
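A quick way to confirm which SeaBIOS version a host is actually running (assuming SeaBIOS comes from the separate seabios package, as it does on Debian/Ubuntu of this era; some qemu builds bundle their own bios.bin instead):

# Report the installed SeaBIOS package version on the VM host
dpkg -s seabios | grep '^Version'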
The immediate work around when I recognised this problem was to change the first couple of disks in each VM to be "ide", and then deal with the out-of-order disk issues that resulted (because mixing IDE and SCSI disks on a system almost guarantees that the BIOS and the Linux device discovery will find them in different orders -- yet another 1980s PC hardware legacy). Fortunately, thanks to fighting these issues for 20+ years, most of my VMs mounted their disks by either LABEL= (file system label; you can use swaplabel to add one to swap partitions too!) or UUID= (unique ID, which is reasonably reliable except in the case of MD/RAID 1...). So the actual disk discovery order didn't matter too much -- and where it did I could boot with just the first two (IDE) disks enabled in the VM and sort out the issues there.
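Checking what a guest is actually using is straightforward (example commands run inside the guest; the swap device name below is made up):

# List the LABEL and UUID of every block device the guest can see
blkid

# Add a label to an existing swap partition so it can be referenced
# as LABEL=swap0 in /etc/fstab (example device name)
swaplabel -L swap0 /dev/sdb1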
Over the next few days I converted the VMs to use virtio disks exclusively, which either did not exist with Qemu/KVM on Ubuntu 10.04 LTS, or were very new then, but since about 2012 have become the best supported type for efficient disk access. At least in recent Linux OSes it's possible to boot from virtio disks (they appear as /dev/vda, /dev/vdb, etc). The main tricks to doing this:
1. Make sure that the VM is finding its root disk by UUID; if necessary edit /etc/default/grub and comment out:

   #GRUB_DISABLE_LINUX_UUID=true

   to enable mounting the root disk by UUID (double negatives for the win). Without this, on first reboot the initramfs will probably not find the root disk and you will be sad (revert the change to the VM config, boot up, fix this, and then try again).

2. Change references in /etc/fstab that are to /dev/sd... to be /dev/vd...

3. Run update-grub to ensure the grub config is current.

4. Shut down the VM (it has to be powered off).

5. Edit the VM configuration and change the disk names to "vda", "vdb", etc and the bus to be "virtio"; if there is an "address" entry for the IDE/SCSI drives, then remove it, to allow a new PCI address entry to be created.

6. Start the VM up again, and make sure it boots.

7. If desired, disable mounting the root disk by UUID again, by editing /etc/default/grub and uncommenting:

   GRUB_DISABLE_LINUX_UUID=true

   and then running update-grub again. This avoids having insanely long names (difficult to match to physical devices) in the output of df -m, which throw off the formatting and make it harder to follow (a problem up through Debian Wheezy, though it appears Ubuntu 14.04 LTS may avoid this issue).

8. Reboot the VM again to make sure it boots if you changed the grub config.
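Condensed into commands, the process looks roughly like this (a sketch rather than a script to run blindly; the VM name "examplevm" is made up, and the fstab edit should be eyeballed before rebooting):

## Inside the guest, before shutting it down
sudo sed -i 's/^GRUB_DISABLE_LINUX_UUID=true/#GRUB_DISABLE_LINUX_UUID=true/' /etc/default/grub
sudo sed -i 's|/dev/sd|/dev/vd|g' /etc/fstab     # check the result by hand
sudo update-grub

## On the host
virsh shutdown examplevm     # wait until it is actually powered off
virsh edit examplevm         # target dev -> vdX, bus -> virtio, drop old <address> entries
virsh start examplevm        # and confirm it boots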
In theory at that point one could get rid of both the IDE and SCSI controllers; but I haven't done that for most of my VMs yet.
Of note, the order of the "vda", "vdb", etc disks is detected by the Linux kernel based on their order on the PCI bus. If you are manually adding them later, make sure that the PCI bus slot specified is numerically increasing in the order you want them detected. The target dev value is only a label -- it does not determine how the Linux kernel will find those devices (but it is worth trying to keep them consistent for your own sanity).
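For example, two virtio disks that should be discovered as vda then vdb might be configured with increasing slot numbers like this (a sketch; the source paths are invented):

<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/vg0/examplevm-root'/>
  <target dev='vda' bus='virtio'/>   <!-- label only; order comes from the PCI slot -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</disk>
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source dev='/dev/vg0/examplevm-data'/>
  <target dev='vdb' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>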
Linux 3.2 Load Average issues
On the Linux 3.2 kernel used by Ubuntu 12.04 LTS the load average of the host machine seemed substantially higher than on the 2.6.32 kernels used in Ubuntu 10.04 LTS -- and substantially higher than made sense based on the process activity on the machine. Prior to the upgrade I would normally expect the load average to sit under 1.0 (the hosted VMs are mostly idle and there is little else running on the host machine), but with the Linux 3.2 kernel on Ubuntu 12.04 LTS the load average rarely went below 1.0 and often sat between 5.0 and 10.0. This caused a lot of "high CPU" alerts from my monitoring system.
It turns out that the kernel Load Average calculations changed significantly around the Linux 3.2 kernel, which led to a lot of bug reports. The main symptom is few if any processes taking CPU, but a load average of 1.x (or higher).
The underlying issue is that the Load Average calculation was substantially changed to avoid undercounting CPU usage, but had the side effect of no longer decaying reliably back below 1.x when the system was mostly idle. So the Load Average could end up stuck above 1.x forever. There seemed to be some patches to Ubuntu 12.04 LTS kernels (eg, 3.2.0-32.51 onwards) to try to improve it, but the only reliable solution reported was to run a later Linux kernel (Linux 3.5 or higher). I put up with the alerts for a couple of weeks until I could find time to upgrade to Ubuntu 14.04 LTS. (After upgrading to Linux 3.13 on Ubuntu 14.04 LTS the load average seems sensible -- even after upgrading to Ubuntu 14.04 LTS without rebooting the load average seemed better, but possibly only due to fewer running VMs.)
12.04 (Precise) to 14.04 (Trusty)
sudoers
The most obvious upgrade issue was that sudo appeared to have changed its hostname matching from hostname --fqdn to hostname (again?), which meant that the custom entries I had to support snmpd (running as snmp) using sudo to run certain commands as root, without needing a password, stopped working -- leading to lots of my needs-root monitoring checks failing and lots of alerts. The easy fix was to change the /etc/sudoers.d/snmp file to allow both the hostname (naosdell) and hostname --fqdn (dellr210.naos.co.nz) names to work:
# Special case for RAID monitoring via snmp
#
snmp naosdell,dellr210.naos.co.nz = NOPASSWD: /usr/local/sbin/raidstatus
snmp naosdell,dellr210.naos.co.nz = NOPASSWD: /usr/local/sbin/smartsummary
and then all was well again. (/usr/local/sbin/raidstatus and /usr/local/sbin/smartsummary are wrapper scripts that SNMP can use to get a useful status summary of a MD RAID set or hard drive SMART status into one line for reporting in the monitoring system.)
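For the curious, a minimal sketch of what such a raidstatus-style wrapper might look like -- not my actual script -- condensing /proc/mdstat into a single "md0:[UU] md1:[UU]" style line:

#!/bin/sh
# Hypothetical raidstatus-style wrapper: summarise every MD array's member
# state from /proc/mdstat onto one line, for snmpd to report via sudo.
awk '
/^md/ { array = $1 }                      # remember the md device name
array && /\[[U_]+\]/ {                    # the following line carries eg [UU]
    match($0, /\[[U_]+\]/)
    printf "%s:%s ", array, substr($0, RSTART, RLENGTH)
    array = ""
}
END { print "" }
' /proc/mdstat

The smartsummary script does the same sort of thing for SMART status: one short line the monitoring system can parse.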
Nom nom nom
Two new RAID-related options appeared in the kernel options with Ubuntu 14.04 -- nomdmonddf and nomdmonisw. They appear to be underdocumented, and basically only relate to Ubuntu. AFAICT they relate to the plan to change from using dmraid to mdadm to assemble RAID devices; but I've been using mdadm for years.
The "nom nom nom" duplication seems to be caused by:
- /etc/default/grub.d/dmraid2mdadm.cfg containing:

  GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT nomdmonddf nomdmonisw"

  which unilaterally appends them to the config if the mdadm package is installed, even if they are already there ( :-( );

- dpkg-reconfigure grub-pc, which picks up those arguments and puts them into /etc/default/grub, so that they're baked in for another round of adding them; and

- update-grub also adding them onto the command line written into /boot/grub/grub.cfg, even if they're already there in the text coming from /etc/default/grub.
The correct work around appears to be to ensure that they appear zero times in /etc/default/grub, and allow them to be added to /boot/grub/grub.cfg automatically. And tidy up /etc/default/grub each time after running dpkg-reconfigure grub-pc -- eg, after having to reinstall the boot records.
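A rough way to check and tidy this up (assuming the stock file locations; count how many copies ended up in the generated config, then strip any that leaked into /etc/default/grub and regenerate):

# How many times did the option end up in the generated config?
grep -o nomdmonddf /boot/grub/grub.cfg | wc -l

# Remove any copies from /etc/default/grub, then rebuild grub.cfg
sudo sed -i 's/ *nomdmonddf//g; s/ *nomdmonisw//g' /etc/default/grub
sudo update-grub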
There are some Ubuntu bugs about this (#1291434 and #1318351), so it might eventually be handled better. Making the "ensure these parameters are present" logic idempotent -- so it only adds them once -- would seem like a good start!
Grub: error: diskfilter writes are not supported
On rebooting an Ubuntu 14.04 LTS system with storage on MD RAID and LVM, especially with "quiet splash" on the kernel command line, one gets the rather scary message:
error: diskfilter writes are not supported
Press any key to continue...
and pressing keys doesn't seem to do much beyond moving the cursor around. This is precisely what one does not want to see on a colo system that is being rebooted -- even if pressing a key did work, having to do it on every reboot is a terrible inconvenience that can lead to downtime.
It turns out this is a known bug, related to being unable to save records of failed boots when using MD/LVM. Fortunately the system does actually boot normally after 5-10 seconds delay, even if no key is pressed. And pressing a key does appear to cause it to continue booting immediately.
For now I've just removed quiet from the boot options so it is more obvious that the boot is actually continuing. There are some work arounds for the grub configuration listed in the bug comments which apparently avoid this problem for MD RAID and LVM, which hopefully will eventually be incorporated into the grub package by Ubuntu. But for now, the occasional extra 10 second delay on rebooting should be okay -- as this colo VM host gets rebooted infrequently.
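Removing quiet is just an edit to the default kernel arguments followed by regenerating the grub config (a sketch, assuming the stock Ubuntu layout; splash could be dropped the same way if desired):

# Drop "quiet" from the default kernel command line and rebuild grub.cfg
sudo sed -i 's/\<quiet\>//' /etc/default/grub
grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub   # eyeball the result
sudo update-grub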