Fundamental Interconnectedness

This is the occasional blog of Ewen McNeill. It is also available on LiveJournal as ewen_mcneill, and Dreamwidth as ewen_mcneill_feed.

KeePassXC (source, wiki) is a password manager forked from KeePassX which is a Linux/Unix port of the Windows KeePass Password Safe. KeePassXC was started because of concern about the relatively slow integration of community code into KeePassX -- ie it is a "Community" fork with more maintainers. KeePassXC seems to have been making regular releases in 2017, with the most recent (KeePassXC 2.2.0) adding Yubikey 2FA support for unlocking databases. KeePassXC also provides builds for Linux, macOS, and Windows, including package builds for several Linux distributions (eg an unofficial Debian/Ubuntu community package build, built from the deb package source with full build instructions).

For macOS / OS X there is a KeePassXC 2.2.0 for macOS binary bundle, and KeePassXC 2.2.0 for macOS sha256 digest. They are GitHub "release" downloads, which are served off Amazon S3. KeePassXC provide instructions on verifying the SHA256 Digest and GPG signature. To verify the SHA256 digest:

  • wget

  • wget

  • Check the SHA256 digest matches:

    ewen@ashram:~/Desktop$ shasum -a 256 -c KeePassXC-2.2.0.dmg.digest
    KeePassXC-2.2.0.dmg: OK

To verify the GPG signature of the release:

  • wget

  • wget (which is stored inside the website repository)

  • gpg --import keepassxc_master_signing_key.asc

  • gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2 (alternatively or in addition; in theory it should report it is unchanged)

    ewen@ashram:~/Desktop$ gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    gpg: requesting key 6397D0D2 from hkps server
    gpg: key 6397D0D2: "KeePassXC Release <>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
  • Compare the fingerprint on the website with the output of "gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2":

    ewen@ashram:~/Desktop$ gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    pub   4096R/6397D0D2 2017-01-03
          Key fingerprint = BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
    uid                  KeePassXC Release <>
    sub   2048R/A26FD9C4 2017-01-03 [expires: 2019-01-03]
    sub   2048R/FB5A2517 2017-01-03 [expires: 2019-01-03]
    sub   2048R/B59076A8 2017-01-03 [expires: 2019-01-03]

    to check that the GPG key retrieved is the expected one.

  • Verify the GPG signature of the release:

    ewen@ashram:~/Desktop$ gpg --verify KeePassXC-2.2.0.dmg.sig
    gpg: assuming signed data in `KeePassXC-2.2.0.dmg'
    gpg: Signature made Mon 26 Jun 11:55:34 2017 NZST using RSA key ID B59076A8
    gpg: Good signature from "KeePassXC Release <>"
    gpg: WARNING: This key is not certified with a trusted signature!
    gpg:          There is no indication that the signature belongs to the owner.
    Primary key fingerprint: BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
         Subkey fingerprint: C1E4 CBA3 AD78 D3AF D894  F9E0 B7A6 6F03 B590 76A8

    at which point, if you trust that the key you downloaded is the one that is supposed to be signing the code you intend to run, the verification is complete. (There are some signatures on the signing key, but I did not try to track down a GPG signed path from my key to the signing key, as the fingerprint verification seemed sufficient.)
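Comparing 40 hex digits by eye is error prone, particularly as gpg prints the fingerprint in space-separated groups while the key ID form has a "0x" prefix and no spaces. A minimal sketch of normalising both forms before comparing them; `normalize_fpr` and `same_fpr` are hypothetical helper names, not gpg features:

```shell
# normalize_fpr STRING
# Hypothetical helper: strip whitespace and an optional leading "0x", then
# upper-case, so differently formatted fingerprints can be compared.
normalize_fpr() {
    printf '%s' "$1" | tr -d ' \t\n' | sed 's/^0[xX]//' | tr 'a-f' 'A-F'
}

# same_fpr A B
# Succeeds only if both arguments normalise to the same 40 hex digits.
same_fpr() {
    a=$(normalize_fpr "$1")
    b=$(normalize_fpr "$2")
    [ ${#a} -eq 40 ] || return 1
    case $a in
        *[!0-9A-F]*) return 1 ;;
    esac
    [ "$a" = "$b" ]
}

# Example:
#   same_fpr "BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2" \
#            "0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2" && echo "match"
```

The value being compared against should come from a separate channel (such as the project website), otherwise the comparison proves nothing.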

In addition, for Windows and OS X, KeePassXC raised funds for an AuthentiCode code signing certificate earlier this year. When signed, this results in a "known publisher", which avoids the Windows and OS X warnings about running "untrusted" code, and acts as a second verification of the intended code running. It is not clear that the .dmg or the .app on OS X is signed at present, as "codesign -dv ..." reports both the .dmg file and the .app as not signed (note that it is possible to use an Authenticode Code Signing Certificate with OS X's Signing Tools). My guess is that the KeePassXC developers focused on Windows executable signing first (and Apple executables normally need to be signed by a key signed by Apple anyway).

Having verified the downloaded binary package, on OS X it can be installed in the usual manner by mounting the .dmg file, and dragging the .app to somewhere in /Applications. There is a link to /Applications in the .dmg file, but without the clever folder background art that some .dmg files have, it is less obvious that you are intended to drag it into /Applications to install. (However there is no included installer, so the obvious alternative is "drag'n'drop" to install.)

Once installed, run KeePassXC to start. Create a new password database and give it at least a long master password, then save the database (with the updated master password). After the database is created it is possible to re-open the relevant database with the usual:


thanks to the application association with the .kdbx file extension. This makes it easier to manage multiple databases. When opened in this way the application will prompt for the master password of the specific database immediately (with the other known databases available as tabs).

KeePassXC YubiKey Support

KeePassXC YubiKey support is via the YubiKey HMAC-SHA1 Challenge-Response authentication, where the YubiKey mixes a shared secret with a challenge token to create a response token. This method was chosen for the KeePassXC YubiKey support because it provides a deterministic response without, eg, needing to reliably track counters or deal with gaps in monotonically increasing values, such as is needed with U2F -- Universal 2nd Factor. This trades a reduction in security (due to just relying on a shared secret) for robustness (eg, not getting permanently locked out of the password database due to the YubiKey's counter having moved on to a newer value than the password database), and ease of use (eg, not having to activate the YubiKey at both open and close of a database; the KeePassXC ticket #127 contains some useful discussion of the tradeoffs with authentication modes needing counters; pwsafe also uses YubiKey Challenge-Response mode, presumably for similar reasons).

The design chosen seems similar to KeeChallenge, a plugin for KeePass2 (source) to support YubiKey authentication for the Windows KeePass. There is a good setup guide to Securing KeePass with a Second Factor describing how to set up the YubiKey and KeeChallenge, which seems broadly transferable to using the similar KeePassXC YubiKey Challenge-Response feature. (A third party YubiKey Handbook contains an example of configuring the Challenge-Response mode from the command line for a slightly different purpose.)

By contrast, the Windows KeePass built in support is OATH-HOTP authentication (see also KeePass and YubiKey), which does not seem to be supported on KeePassXC -- some people also note OTP 2nd Factor provides authentication not encryption which may limit the extra protection in the case of a local database. HOTP also uses a shared key and a counter so suffers from similar shared secret risks as the Challenge Response mechanism, as well as robustness risks in needing to track the counter value -- one guide to the OATH-HOTP mode warns about keeping OTP recovery codes to get back in again after being locked out due to the counter getting out of sync. See also HOTP and TOTP details; HOTP hashes a secret key and a counter, whereas TOTP hashes a secret key and the time, which means it is easier to accidentally get out of sync with HOTP. TOTP seems to be more widely deployed in client-server situations, presumably because it is self-recovering given a reasonably accurate time source.

Configuring a YubiKey to support Challenge-Response HMAC-SHA1

To configure one or more YubiKeys to support Challenge-Response you need to:

  • Install the YubiKey Personalisation Tool from the Apple App Store; it is a zero cost App, but obviously will not be very useful without a YubiKey or two. (The YubiKey Personalisation Tool is also available for other platforms, and in a command line version.)

  • Run the YubiKey Personalization Tool.

  • Plug in a suitable YubiKey, eg, YubiKey 4; the cheaper YubiKey U2F security key does not have sufficient functionality. (Curiously the first time that I plugged a new YubiKey 4 in, the Keyboard Assistant in OS X 10.11 (El Capitan) wanted to identify it as a keyboard, which seems to be a known problem -- apparently one can just kill the dialog, but I ended up touching the YubiKey, then manually selecting an ANSI keyboard, which also seems to be a valid approach. See also the YubiKey User Guide examples for mac OS X.)

  • Having done that, the YubiKey Personalisation Tool should show "YubiKey is inserted", details of the programming status, serial number, and firmware version, and a list of the features supported.

  • Change to the "Challenge Response" tab, and click on the "HMAC-SHA1" button.

  • Select "Configuration Slot 2" (if you overwrite Configuration Slot 1 then YubiKey Cloud will not work, and that apparently is not recoverable, so using Slot 2 is best unless you are certain you will never need YubiKey Cloud; out of the factory only Configuration Slot 1 is programmed).

  • Assuming you have multiple YubiKeys (and you should, to allow recovery if you lose one or it stops functioning) tick "Program Multiple YubiKeys" at the top, and choose "Same Secret for all Keys" from the dropdown, so that all the keys share the same secret (ie, they are interchangeable for this Challenge-Response HMAC-SHA1 mode).

  • You probably want to tick "Require user input (button press)", to make it harder for a remote attacker to activate the Challenge-Response functionality.

  • Select "Fixed 64-byte input" for the HMAC-SHA1 mode (required by KeeChallenge for KeePass; unclear if it is required for KeePassXC but selecting it did work).

  • Click on the "Generate" button to generate a random 20-byte value in hex.

  • Record a copy of the 20-byte value somewhere safe, as it will be needed to program an additional/replacement YubiKey with the same secret later (unlike KeeChallenge it is not needed to set up KeePassXC; instead KeePassXC will simply ask the YubiKey to run through the Challenge-Response algorithm as part of the configuration process, not caring about the secret key used, only caring about getting repeatable results).

    Beware the dialog box seems to be only wide enough to display 19 of the bytes (not 20), and not resizeable, so you have to scroll in the input box to see all the bytes :-( Make sure you get all 20 bytes, or you will be left trying to guess the first or last byte later on. (And make sure you keep the copy of the shared secret secure, as anyone with that shared secret can program a working YubiKey that will be functionally identical to your own. Printing it out and storing it somewhere safe would be better than storing it in plain text on the computers you are using KeePassXC on... and storing it inside KeePassXC creates a catch-22 situation!)

  • Double check your settings, then click on "Write Configuration" to store the secret key out to the attached YubiKey.

  • The YubiKey Personalisation Tool will want to write a "log" file (actually a .csv file), which will also contain the secret key, so make sure you keep that log safe, or securely delete it.

  • Pull out the first YubiKey, and insert the next one. You should see a "YubiKey is removed" message then a "YubiKey is inserted" message. Click on "Write Configuration" for the next one. Repeat until you have programmed all the YubiKeys you want to be interchangeable for the Challenge-Response HMAC-SHA1 algorithm. (Two, kept separately, seems like the useful minimum, and three may well make sense.)
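Because a truncated copy of the secret is only discovered when you later try to program a replacement YubiKey, it may be worth sanity checking what you recorded straight away. A minimal sketch, where `is_valid_secret` is a hypothetical helper (not part of the YubiKey tools) that simply checks for exactly 20 bytes of hex, ignoring any spaces between the bytes:

```shell
# is_valid_secret HEX
# Hypothetical helper: succeeds only if HEX is exactly 40 hex digits
# (20 bytes) once any spaces between the bytes are removed.
is_valid_secret() {
    s=$(printf '%s' "$1" | tr -d ' ')
    [ ${#s} -eq 40 ] || return 1
    case $s in
        *[!0-9a-fA-F]*) return 1 ;;
    esac
    return 0
}

# Example (type the recorded value in by hand rather than leaving it in a file):
#   is_valid_secret "00 11 22 ... " || echo "secret is not 20 bytes of hex"
```

Remember that typing the secret at a shell prompt may leave it in the shell history, so clear that afterwards or run the check somewhere the history is not kept.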

Configuring KeePassXC to use password and YubiKey authentication

  • Insert one of the programmed YubiKeys

  • Open KeePassXC on an existing password database (or create a new one), and authenticate to it.

  • Go to Database -> Change Master Key.

  • Enter your Password twice (ie, so that the Password will be set back to the same password)

  • Tick "Challenge Response" as well (so that the Password and "Challenge Response" are both ticked)

  • An option like "YubiKey (nnnnnnn) Challenge Response - Slot 2 - Press" should appear in the drop down list

  • Click the "OK" button

  • Save the password database

  • When prompted press the button on your YubiKey (which will allow it to use the YubiKey Challenge Response secret to update the database).

Accessing the KeePassXC database with password and YubiKey authentication

To test that this has worked, close KeePassXC (or at least lock the database), then open KeePassXC again. You will get a prompt for access credentials as usual, without any options ticked.

Verify that you can open the database using both the password and the YubiKey Challenge-Response, by typing in the password and ticking "Challenge Response" (after checking it detected the YubiKey) and then clicking on "OK". When prompted, click the button on your YubiKey, and the database should open. (KeePassXC seems to recognise that the Challenge-Response is needed if you have opened the database with the YubiKey and the YubiKey is present; but you will need to remember to also enter the password each time you authenticate. At least it will auto-select the Password as soon as you type one in. The first time around opening a specific database is just one additional box to tick, which is fairly easy to remember particularly if you use the same combination -- password and YubiKey Challenge-Response -- on all your databases.)

You can confirm that both the password and the YubiKey Challenge Response are required, by trying to authenticate just using the Password (enter Password, untick "Challenge Response", press OK), and by trying to authenticate just using the YubiKey (tick "Challenge Response", untick Password, press OK). In both cases it should tell you "Unable to open database" (the "Wrong key or database file is corrupt" really means "insufficient authentication" to recover the database encryption key in this case; they could perhaps more accurately say "could not decrypt master key" here, perhaps with a suggestion to check the authentication details provided).

If you have programmed multiple YubiKeys with the same Challenge-Response shared secret (and hopefully you have programmed at least two), be sure to check opening the database with each YubiKey to verify that they are programmed identically and thus are interchangeable for opening the password database. It should open identically with each key (because they all share the same secret when you programmed the keys, and thus the Challenge Response values are identical).

If you have multiple databases that you want to protect with the YubiKey Challenge-Response method, you will need to go through the Database -> Change Master Key steps and verification steps for each one. It probably makes sense to change them all at the same time, to avoid having to try to remember which ones need the YubiKey and which ones do not.

Usability of KeePassXC with Password and YubiKey authentication

Once you have configured KeePassXC for Password and YubiKey authentication, and opened the database at least once using the YubiKey, the usability is fairly good. Use:


to open a specific KeePassXC password database directly, and KeePassXC will launch with a window to authenticate to that password database. So long as one of the appropriate YubiKeys is plugged in, after a short delay (less time than it takes to type in your password) the YubiKey will be detected, and Challenge-Response selected. Then you just type in your password as usual (which auto selects "Password" as well), hit enter (which auto-OKs the dialog), and touch your YubiKey when prompted.

One side effect of configuring your KeePassXC databases like this is that they are not able to be opened in other KeePass related tools, except maybe the Windows KeePass with the KeeChallenge plugin (which uses a similar method; I have not tested that). For desktop use, KeePassXC should work pretty much everywhere that is likely to be useful (modern Windows, modern macOS / OS X, modern Linux), as should the YubiKey, so desktop portability is fairly good. But, for instance, MiniKeePass (source, on the iOS App Store) will not be able to open the password database. Amongst other reasons, while the "camera connection kit" can be used to link a YubiKey to an iOS device, the YubiKey iOS HowTo points out that U2F, OATH-TOTP and Challenge-Response functionality will not work (and I found suggestions on the Internet this only worked with older iOS versions).

If access from a mobile device is important, then you may want to divide your passwords amongst multiple KeePass databases: a "more secure" one including the YubiKey Challenge-Response and a "less secure" one that only requires a password for compatibility. For instance it might make sense to store "low risk" website passwords in their own database protected only by a relatively short master password, and synchronise that database for use by MiniKeePass (using the DropBox app). But keep higher security/higher risk passwords protected by password and YubiKey Challenge-Response and only accessible from a desktop application (and not synchronised via DropBox to reduce exposure of the database itself).

It also looks like, in the UI, it should be possible to configure KeePassXC to require only the YubiKey Challenge-Response (no password), simply by changing the master key and only specifying YubiKey Challenge-Response. Since the Challenge-Response shared secret is fairly short (20 bytes, so 160 bits), secured only by that shared key, and the algorithm is known, that too would be a relatively low security form of authentication. Possibly again for "low value" passwords like random website logins with no real risk it might offer a more secure way to store per-website random passwords, rather than reusing the same password on each website. But the combination of password and YubiKey Challenge-Response would be preferable for most password databases over the YubiKey Challenge-Response alone, even if the password itself was fairly short (eg under 16 characters).

Posted Sun Jul 23 15:34:28 2017 Tags:

Apple's Time Machine software, included with macOS for about the last 10 years, is a service to automatically back up a computer to one or more external drives or machines. Once configured it pretty much looks after itself, usually keeping hourly/daily/weekly snapshots for sensible periods of time. It can even rotate the snapshots amongst multiple targets to give multiple backups -- although it really wants to see every drive around once a week, otherwise it starts to regularly complain about no backups to a given drive, even when there are several other working backups. (Which makes it a poor choice for offline, offsite, backups which are not brought back onsite again frequently; full disk clones are better for that use case.)

More recent versions of Time Machine include local snapshots, which are copies saved to the internal drive in between Time Machine snapshots to an external target -- for instance when that external target is not available. This is quite useful functionality on, eg, a laptop that is not always on its home network or connected to the external Time Machine drive. These local snapshots do take up some space on the internal drive, but Time Machine will try to ensure there is at least 10% free space on the internal drive and aim for 20% free space (below that Time Machine local snapshots are usually cycled out fairly quickly, particularly if you do something that needs more disk space).

On my older MacBook Pro, the internal SSD (large, but not gigantic, for the time when it was bought, years ago) has been "nearly full" for a long time, so I have been regularly looking for things taking up space that do not need to be on the internal drive. In one of these explorations I found that while Time Machine's main local snapshot directory was tiny:

ewen@ashram:~$ sudo du -sm /.MobileBackups
1       /.MobileBackups

as expected with an almost full drive causing the snapshots to be expired rapidly, there was another parallel directory which was surprisingly big:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
21448   /.MobileBackups.trash/

(21.5GB -- approximately 2-3 times the free space on the drive). When I looked in /.MobileBackups.trash/ I found a bunch of old snapshots from 2014 and 2016, some of which were many gigabytes each:

root@ashram:/.MobileBackups.trash# du -sm *
2468    Computer
412     MobileBackups_2016-10-22-214323
16824   MobileBackups_2016-10-24-163201
1746    MobileBackups_2016-10-26-084240
1       MobileBackups_2016-12-18-144553
1       MobileBackups_2017-02-05-125225
1       MobileBackups_2017-05-18-180448
root@ashram:/.MobileBackups.trash# du -sm Computer/*
1480    Computer/2014-06-08-213847
58      Computer/2014-06-15-122559
156     Computer/2014-06-15-162406
166     Computer/2014-06-29-183344
608     Computer/2014-07-06-151454
3       Computer/2016-10-22-174000

Some searching online indicated that this was a fairly common problem (there are many other similar reports). As best I can tell what is supposed to happen is:

  • /.MobileBackups is automatically managed by Time Machine to store local snapshots, and they are automatically expired as needed to try to keep the free disk space at least above 10%.

  • /.MobileBackups.trash appears if for some reason Time Machine cannot remove a particular local snapshot or needs to start again (eg a local snapshot was not able to complete); in that case Time Machine will move the snapshot out of the main /.MobileBackups directory into the /.MobileBackups.trash directory. The idea is that eventually whatever is locking the files in the snapshot to prevent them from being deleted will be cleared, eg, by a reboot, and then /.MobileBackups.trash will get cleaned up. This is part of the reason for reboots being suggested as part of the resolution for Time Machine issues.

However there appear to be some scenarios where it is impossible to remove /.MobileBackups.trash, which just leads to the discarded snapshots gradually accumulating over time. Some people report hundreds of gigabytes used there. Because /.MobileBackups.trash is not the main Time Machine Local Snapshots, it shows up as "Other" in the OS X Storage Report -- rather than "Backups". And of course if it cannot be deleted, it will not be automatically removed to make space when you need more space on the drive :-(

Searching for /.MobileBackups.trash in /var/log/system.log turned up the hint that Time Machine was trying to remove the directory, but being rejected:

Jul 18 16:31:36 ashram[852]: Failed to delete
/.MobileBackups.trash, error: Error Domain=NSCocoaErrorDomain
Code=513 "“.MobileBackups.trash” couldn’t be removed because you
don’t have permission to access it."
UserInfo={NSFilePath=/.MobileBackups.trash, NSUserStringVariant=(
), NSUnderlyingError=0x7feb82514860 {ErrorDomain=NSPOSIXErrorDomain
Code=1 "Operation not permitted"}}

(plus lots of "audit warning" messages about the drive being nearly full, which was the problem I first started with). There are some other references to that failure on OS X 10.11 (El Capitan), which I am running on the affected machine.

Based on various online hints I tried:

  • Forcing a full Time Machine backup to an external drive, which is supposed to cause it to clean up the drives (it did do a cleanup, but it was not able to remove /.MobileBackups.trash).

  • Disabling the Time Machine local snapshots:

    sudo tmutil disablelocal

    which is supposed to remove the /.MobileBackups and /.MobileBackups.trash directories; it did remove /.MobileBackups but could not remove /.MobileBackups.trash.

  • Emptying the Finder Trash (no difference to /.MobileBackups.trash)

  • Wait a while to see if it got automatically removed (nope!)

  • Forcing a full Time Machine backup to an external drive, now that the local Time Machine snapshots are turned off. That took ages to get through the prepare stage (the better part of an hour), suggesting it was rescanning everything... but it did not reduce the space usage in /.MobileBackups.trash in the slightest.

Since I had not affected /.MobileBackups.trash at all, I then did some more research into possible causes for why the directory might not be removable. I found a reference suggesting file flags might be an issue, but searching for the schg and uchg flags did not turn up anything:

sudo find /.MobileBackups.trash/ -flags +schg
sudo find /.MobileBackups.trash/ -flags +uchg

(uchg is the "user immutable" flag; schg is the "system immutable" flag). There are also xattr attributes (which I have used previously to avoid accidental movement of directories in my home directory), which should be visible as "@" (extended attributes) or "+" (ACL permissions) when doing "ls -l" -- but in some quick hunting around I was not seeing those either (eg sudo ls -leO@ CANDIDATE_DIR).

I did explicitly try removing the immutable flags recursively:

sudo chflags -f -R nouchg /.MobileBackups.trash
sudo chflags -f -R noschg /.MobileBackups.trash

but that made no obvious difference.

Next, after finding a helpful guide to reclaiming space from Time Machine Local snapshots I ensured that the Local Snapshots were off, then rebooted the system:

sudo tmutil disablelocal

followed by Apple -> Restart... In theory that is supposed to free up the /.MobileBackups.trash snapshots for deletion, and then delete them. At least when you do another Time Machine backup -- so I forced one of those after the system came back up again. No luck, /.MobileBackups.trash was the same as before.

After seeing reports that /.MobileBackups.trash could be safely removed manually, and having (a) two full recent Time Machine snapshots and (b) having just rebooted with the Time Machine Local Snapshots turned off, I decided it was worth trying to manually remove /.MobileBackups.trash. I did:

sudo rm -rf "/.MobileBackups.trash"

with the double quotes included to try to reduce the footgun potential of typos (rm -rf / is something you very rarely want to do, especially by accident!).
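Quoting helps, but it does not protect against an empty variable or a path that is just "/". One way to reduce the risk further is a small guard around rm; a minimal sketch, where `safe_rm_rf` is a hypothetical wrapper (not a standard tool) that refuses to run when the target collapses to nothing once trailing slashes are removed:

```shell
# safe_rm_rf PATH
# Hypothetical wrapper: refuse to run "rm -rf" when PATH is empty or
# consists only of slashes (ie, the root directory).
safe_rm_rf() {
    target=$1
    # Strip trailing slashes so "/" and "//" both collapse to the empty string.
    stripped=$(printf '%s' "$target" | sed 's:/*$::')
    if [ -z "$stripped" ]; then
        echo "refusing to remove '$target'" >&2
        return 1
    fi
    rm -rf -- "$stripped"
}

# Example (from a root shell, since sudo cannot call a shell function):
#   safe_rm_rf "/.MobileBackups.trash"
```

This only catches the most catastrophic mistakes; it will still happily remove any other directory you name, so it is an addition to checking the command carefully before pressing enter, not a replacement for doing so.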

That was able to remove most of the files, by continuing when it had errors, but still left hundreds of files and directories that it reported being unable to remove:

ewen@ashram:~$ sudo rm -rf "/.MobileBackups.trash"
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences/SystemConfiguration: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var/db: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume: Directory not empty

At least most of the disk space was reclaimed, with just 45MB left:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
45      /.MobileBackups.trash/

In order to get back to a useful state I then moved that directory out of the way:

sudo mv /.MobileBackups.trash /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

and rebooted my machine again to ensure everything was in a fresh start state.

When the system came back up again, I tried removing various parts of /var/tmp/mobilebackups-trash-undeleteable-2017-07-18 with no more success. Since the problem had followed the files rather than the location I figured there had to be something about the files which prevented them from being removed. So I did some more research.

The most obvious is the Time Machine Safety Net, which is a set of special protections around the Time Machine snapshots to deal with the fact that they create hard links to directories (to conserve inodes, I assume), which can confuse rm. The recommended approach is to use "tmutil delete", but while it will take a full path, doing something like:

tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323

will just fail with a report that it is an "Invalid deletion target":

ewen@ashram:/var/tmp$ sudo tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323
/private/var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323: Invalid deletion target (error 22)
Total deleted: 0B

and nothing will be deleted. My guess is that it at least tries to ensure that it is inside a Time Machine backup directory.

Another approach suggested is to use Finder to delete the directory, as that has hooks to the extra cleanup magic required, so I did:

open /var/tmp

and then highlighted mobilebackups-trash-undeleteable-2017-07-18 and tried to permanently delete it with Alt-Cmd-Delete. After a confirmation prompt, and some file counting, that failed with:

The operation can't be completed because an unexpected error occurred (error code -8072).

deleting nothing. Explicitly changing the problem directories to be owned by me:

sudo chown -R ewen:staff /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

also failed to change anything.

There is an even lower level technique to bypass the Time Machine Safety Net, using a helper bypass tool, which on OS X 10.11 (El Capitan) is in "/System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass". However running the rm with the bypass tool did not get me any further forward:

cd /var/tmp
sudo /System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass rm -rf mobilebackups-trash-undeleteable-2017-07-18

failed with the same errors, leaving the whole 45MB still present. (From what I can tell online using the bypass tool is fairly safe if you are removing all the Time Machine snapshots, but can leave very incomplete snapshots if you merely try to remove some snapshots -- due precisely to the directory hard links which is the reason that the Time Machine Safety Net exists in the first place. Proceed with caution if you are not trying to delete everything!)

More hunting for why root could not remove files, turned up the OS X 10.11+ (El Capitan onwards) System Integrity Protection which adds quite a few restrictions to what root can do. In particular the theory was that the file had a restricted flag on it which means that only restricted processes, signed by Apple, would be able to modify them.

That left me with the options of either trying to move the files back somewhere that "tmutil delete" might be willing to deal with, or trying to override System Integrity Protection for long enough to remove the files. Since Time Machine had failed to delete the files, apparently for months or years, I chose to go with the more brute force approach of overriding System Integrity Protection for a while so that I could clean up.

The only way to override System Integrity Protection is to boot into System Recovery mode, and run "csrutil disable", then reboot again to access the drive with System Integrity Protection disabled. To do this:

  • Apple -> Restart...

  • Hold down Cmd-R when the system chimes for restarting, and/or the Apple Logo appears; you have started a Recovery Boot if the background stays black rather than showing a color backdrop prompting for your password

  • When Recovery mode boots up, use Utilities -> Terminal to start a terminal.

  • In the Terminal window, run:

     csrutil disable

  • Reboot the system again from the menus

When the normal boot completes and you log in, you are running without System Integrity Protection enabled -- the foot gun is now on automatic!

Having done that, OS X was happy to let me delete the left over trash:

ewen@ashram:/var/tmp$ sudo du -sm mobilebackups-trash-undeleteable-2017-07-18/
45      mobilebackups-trash-undeleteable-2017-07-18/
ewen@ashram:/var/tmp$ sudo rm -rf mobilebackups-trash-undeleteable-2017-07-18
ewen@ashram:/var/tmp$ ls mob*
ls: mob*: No such file or directory

so I had finally solved the problem I started with, leaving no "undeleteable" files around for later. My guess is that those snapshots happened to run at a time that captured files with restricted flags on them, which then could not be removed (at least once Time Machine had thrown them out of /.MobileBackups and into /.MobileBackups.trash). But it seems unfortunate that the log messages could not have provided more useful instructions.

All that was left was to put the system back to normal:

  • Boot into recovery mode again (Apple -> Restart...; hold down Cmd-R at the chime/Apple logo)

  • Inside Recovery Mode, re-enable System Integrity Protection, with:

    csrutil enable

    inside Utilities -> Terminal.

  • Reboot the system again from the menus.

At this point System Integrity Protection is operating normally, which you can confirm with the "csrutil status" command that you can run at any time:

ewen@ashram:~$ csrutil status
System Integrity Protection status: enabled.

(changes to the status can be made only in Recovery Mode).
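If you want a script to refuse to run (or at least complain) while the foot gun is still loaded, the status line can be checked mechanically. This is only a minimal sketch, assuming the "csrutil status" output format shown above; the function name sip_is_enabled is my own invention and not part of csrutil:

```shell
# Hypothetical helper: succeeds (exit 0) if the given "csrutil status"
# output reports System Integrity Protection as enabled.
sip_is_enabled() {
    case "$1" in
        *"System Integrity Protection status: enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on OS X 10.11+ (assumes csrutil is in the PATH):
#   sip_is_enabled "$(csrutil status)" || echo "WARNING: SIP is disabled" >&2
```

Passing the status text in as an argument, rather than calling csrutil inside the function, keeps the check usable on saved output as well.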

Finally re-enable Time Machine local snapshots, because on a mobile device it is a useful feature:

sudo tmutil enablelocal

and then force the first local snapshot to be made now to get the process off to an immediate start:

sudo tmutil snapshot

At which point you should have /.MobileBackups with a snapshot or two inside it:

root@ashram:~# ls -l /.MobileBackups/Computer/
total 8
-rw-r--r--  1 root  wheel  263 18 Jul 17:37 .mtm.private.plist
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173719
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173758

and if you look in the Time Machine Preferences Window you should see the line that it will create "Local snapshots as space permits".

Quite the adventure! But my system now has about three times as much free disk space as it did previously, which was definitely worth the effort.

Posted Wed Jul 19 17:55:22 2017 Tags:

After upgrading to the "Windows 10 Creators Update", on my dual booted Dell XPS 9360, I installed the Windows Subsystem for Linux, because my fingers like being able to use Unix/Linux commands :-)

There are a few steps to enabling and installing "Bash on Ubuntu on Windows", on a 64-bit Windows 10 Creators Update install:

  • Turn on the Windows Subsystem for Linux, by starting an Administrator PowerShell (right click on Windows icon at bottom left, choose "Windows PowerShell (Admin)" from the menu), then run:

    Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux

    It will run for a little while with a text progress message, then ask to reboot and do a small amount of installation before restarting.

  • Turn on Developer Mode to enable installing extra features: In Settings -> Update and Security -> For developers move the radio selection to Developer Mode (default seems to be "Sideload Apps"; settings can be found by left clicking on the Windows icon at the bottom left, then click on the "cog wheel"). It will install a "Developer Mode package" (which I guess includes, eg, additional certificates).

  • Start a cmd prompt (eg, Windows -> Run... cmd), and inside that run "bash" to trigger the install of the Linux environment. (Note that without doing the above two steps this will fail completely, with a "not found" message, so if you get a "not found" message double check you have done the steps above.) You are prompted to accept the terms at a link which seems to just be a shortlink to the Ubuntu Licensing Page.

    The text also notes that this is a "beta feature" which is presumably why it is necessary to enable "Developer Mode"; the install itself seems to download from the Windows (application) Store. (They also warn against modifying Linux files from Windows applications which illustrates the complexity of making the two subsystems play nicely together. It seems like this behaves a little more like a "Linux container" running under Windows than parallel processes.)

  • It detected the locale needed for New Zealand (en_NZ) and offered to change it from the default (en_US); I said "yes". (Note there was quite a long delay after this answer before the next prompt, enough I wondered if it had read the answer -- give it another minute or two.)

  • Then it prompted for a Unix-style user name (see the WSL guide to Linux User Account and Permissions). It also prompts for a Unix-style password, which I assume is mostly used to run sudo.

  • Install the outstanding Ubuntu Linux 16.04 updates (ie, the ones released since the install snapshot was made):

    sudo apt-get update
    sudo apt-get dist-upgrade

After that, other than the default prompt/vim colours being terrible for a black background console (default on Windows), the environment works in a pretty similar manner to a native Linux environment.

The "Bash on Ubuntu on Windows" menu option added, which runs bash directly, suffers from the same "black background" readability issues. But fortunately if you go to Properties (click on top left of title bar, choose Properties) you can change the colours -- I simply changed the background to be 192/192/192 (default foreground grey), and the foreground to be 0/0/0 (default background black), and the default prompt/vim etc colours look more like they are intended.

There is some more documentation for Bash on Ubuntu on Windows which covers the whole Ubuntu on Windows feature. Of note, the "Windows 10 Creators Update" version of the feature is based on Ubuntu Linux 16.04, which means I now have Ubuntu Linux 16.04 functionality on both sides of the dual boot environment :-) The newer version also seems to have improved Linux/Windows interoperability, and 24-bit colour support in the console.

Posted Sun Jul 16 10:36:05 2017 Tags:

Tickets for the Wellington 2017 edition of the New Zealand International Film Festival went on sale this morning at 10:00. As with 2014 and 2015 the online ticketing worked... rather poorly for the first hour or so after tickets went on sale. Leading to various tweets calling for patience -- and an apology from the NZIFF Festival Director on Facebook. I had fairly high hopes at 10:00 this morning, after being told by other Festival regulars that 2016 had been better than previous years -- but they were quickly dashed. After trying for the first half hour and getting nowhere I gave up until about 11:15, and then eventually managed to buy the tickets I wanted gradually, mostly one at a time, over the next hour.

As I have said previously, ticketing is a hard problem. Given a popular event, and limited (good) seats, there will always be a rush for the (best) seats as soon as the sales start. The demand in the first day will always be hundreds of times higher than the demand two weeks later, and the demand in the first hour will be 75% of the demand in the first day. That is just part of the business, so what you need to do is plan for that to happen.

The way that NZIFF (and/or their providers) have set up their online ticketing appears, even four years in, to not properly plan for efficiently handling a large number of buyers all wanting to buy at once. Some of the obvious problems with their implementation include:

  • putting the tickets for about 500 (five hundred) events on sale at the exact same moment -- so instead of a moderate sized stampede for tickets to one event, there are many stampedes for many events, all competing for the same server/network resources.

  • only collecting information about the types of tickets required at ticket purchase time, rather than collecting it in the "wishlist" in advance.

  • only collecting details of the purchaser at ticket purchase time, rather than collecting them in advance (eg, "create account profile" as part of building the wishlist), requiring more round trips to the server, and storing more data in the database during the contentious ticket sale period.

  • relying on a "best available seat" algorithm that has no user control, and typically picks at best a mediocre seat, thus forcing many more users through the "choose my own seat" process which requires more intensive server interaction

  • not collecting money in advance (eg, selling "movie bucks" credits), which means that the period where the seat allocations are conditional waiting on payment is extended much longer, which both delays finalising seats free for the next buyer to choose from and requires more writes to the database

Less obviously, it appears as if there are some other technical problems:

  • Not designing to automatically "scale out" rapidly to more servers when the site is busy

  • Not pre-scaling to a large enough size, and pre-warming the servers (so they have everything they need in RAM) before opening up ticket sales

  • Breaking the web pages up into too many small requests and stages, increasing the client/server interaction (and thus both load on the server and points at which the process could go wrong) dramatically

  • Writing too much to disk during the processing

  • Reading too much from disk during the interactions

  • Not offloading enough to third party services (eg, CDNs)

and behind all of these is inadequate load testing to simulate the thundering herd of requests that come as an inevitable part of the ticket sales problem, leading to false confidence that "this year we will be okay", only to have those hopes crushed in the first 10 minutes.

So how do we make this ticket sales problem more manageable:

  • Stagger the event sales -- with 500+ events over 15+ days there is no good reason to put all of the events on sale at exactly the same time. It just makes the ticket sales problem two orders of magnitude worse than a single popular event. So break up the ticket releases into stages -- open up sales for the first few days of the festival at 10:00 on the first day, then open up sales for the next few days of the festival in the afternoon or the next day. Clearly having many many days with new tickets going on sale is impractical, but staggering the ticket sales opening over 2-5 days is fairly easily achieved, and instantly halves (or better) the "sales open" server load.

  • Collect every possible piece of information you can in advance of ticket sales, and write it to the database in advance. This would include all the name and contact details needed to complete the sale, a confirmation the user has read the terms and conditions, and details of how many tickets of which types the user wants. All of this can be part of the account profile and "wishlist". Ideally the only thing left for ticket sales time is seat allocation.

  • Preferably also collect the user's preferred seat, or a way to hint to the seat allocation policy where to pick. Many regular movie goers (and almost all of the early sales will be regulars) will know the venues like the back of their hand, and can probably name off the top of their head their favourite seat. Obviously you cannot guarantee the exact seat will still be available when they buy their ticket, but if your seat selection algorithm is choosing "seat nearest to the user desired one" rather than "arbitrary seat the average person might not hate", then there is a good chance the user will not have to interact with the "choose my seat" screen at all. (For about half the films I booked this morning pre-entering my preferred seat would have just worked to give me the perfect seat. But since I had no way to pre-enter it, I had to go through the "I want to choose a better seat than the automatic one" on every single movie session.)

  • Ideally, collect the user's money in advance. Many of the most eager purchasers will be literally spending hundreds of dollars, and going to dozens of sessions. Most of them would probably be willing to pre-purchase, say, a block of "10 tickets of type FOO" to be allocated to sessions later, if it sped up their ticket purchasing process. Having the money in advance both saves the users typing in their credit card details over and over again, and also means the server can go directly from "book my session" to "session confirmed" with no waiting -- avoiding writing details to the database at any intermediate step. (This also potentially decreases the festival expenses on credit card transaction fees by an order of magnitude.)

  • Maintain an in-RAM cache of "hot" information, such as the seats which are available/sold/in the process of being sold for each active session. Use memcached or other similar products. Make the initial decisions about which seats to offer the user from those tables, only accessing the database to store a permanent reservation once the seats are found.

  • Done completely you end up with a process that is:

    • User ticks one or more sessions for which they want to finalise their ticket purchase

    • The website returns a page saying "use your credit to buy these seats for these sessions", pre-populated with the nearest seat to the ones they pre-indicated they wanted. It saves a temporary seat reservation to the database with a short timeout, and marks it in the RAM cache as "sale in progress". These writes to the database can be very short (and thus quick) because they are just a 4-tuple of (userid, eventid, seatid, expiry time).

    • User clicks "yes, complete sale", their single interaction if the seat they wanted (or a "close enough" one) is available.

    • The website marks the temporary seat reservations as final (by writing "never" in the expiry time), and writes the new credit balance to the database, and returns a page saying "you have these seats, and this credit left, tickets to follow via email".

    Occasionally a user might need to dive into the seat selection page to try to find a better choice, but for users in that critical first hour there is a pretty good chance that they will get something close to the seat they wanted. And the users will rapidly decide the algorithm is doing as well as is possible when they dive into the seat selection page and find all the ones nearer their preferred seat are gone already.

  • Organise the website so as much as possible is static content -- all images, styling (CSS), descriptions of films, etc, is cache-friendly static content. That both allows the browsers not to even ask for it again, and for any checks for whether it has changed to be met with a very quickly answered "all good, you have the latest version".
    Redirect all that static content to an external CDN to keep it away from the sales transaction process.

  • For data that has to be dynamically loaded (eg, seats available) send it in the most compact form possible, and unpack it on the client. CPU time on the client is effectively free in this process as there are many many client CPUs, and relatively few server resources. Try to offload as much work as possible to the browser CPUs, and make them rely on as little as possible coming from the central server.

  • By getting the sales process down to "are you sure?" / "yes", very few server interactions are required, so users get through the process quicker and go away (reducing load) happy. It also means that there is very little to write to the database, so the database contention is dramatically reduced. Done properly almost nothing has to be read from the database.

  • The quick turn around then makes it possible to do things like, eg, keep a HTTPS connection open from the browser to the load balancer to the back end webserver for the 15-30 seconds it takes to complete the sale, avoiding a bunch of network congestion and setup time. This also dramatically reduces the risk of the sales process failing at step 7 of 10, and the user having to start again (which means the load generated by all previous steps was wasted load on the server and means the user is frustrated). By taking the payment out of line from the seat allocation/finalisation process, the web browser only needs to interact with a single server, maximising the chances of keeping the connection "hot", ready for when the user eagerly clicks the "yes, perfect, I want those" button, which completes the transaction as quickly as possible.

  • The quick turn around would also encourage users to purchase multiple sessions at once, rather than resorting to purchasing one ticket at a time just to have a chance of anything working. And users purchasing, eg, 10 sessions at a time will get all the tickets they were after much quicker, then leave the site -- leaving more server resources available for all other users.

  • Host the server as close as possible to the actual users, so that the web connection set up time is as small as possible, and the data transfers to the user happen as fast as possible. Having connections stall for long periods due to packet loss and long TCP timeouts (scaled to expect long delays due to distance) just ties up server resources and makes the users frustrated.

  • Pre-start lots of additional capacity in advance of the "on sale now" time, and pre-warm it by running a bunch of test transactions through, so the servers are warmed up and ready to go. A day or two later you can manually (or automatically) scale back to a more realistic "sale days 2-20" capacity. With most "cloud" providers you will pay a few hundred dollars extra on the first few hours, or days, in exchange for many happy customers and thus many sales. The extra sales possible as a result may well pay for the extra "first day" hosting costs. And most "cloud" providers will allow you to return that extra capacity on the second day at no extra cost -- so it is a single day cost.
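To make the "hold the nearest seat, then confirm it" steps above a bit more concrete, here is a minimal sketch in shell. It is an illustration under loud assumptions, not how the NZIFF ticketing actually works: a flat text file stands in for both the RAM cache and the database, seats are named "row-col" in a rectangular venue, and every function name (nearest_free_seat, seat_is_taken, hold_seat, confirm_seat) is invented for the example. It also does no locking, which a real implementation would need:

```shell
# Hypothetical sketch: a flat file stands in for the RAM cache and database.
# Each line is the 4-tuple "userid eventid seatid expiry"; seatid is
# "row-col" (eg "5-5"), expiry is epoch seconds, or "never" once sold.
RESERVATIONS=${RESERVATIONS:-/tmp/reservations.txt}

# Print the free seat nearest the preferred one in a rows x cols venue
# (squared distance; ties go to the first seat found in row-major order).
nearest_free_seat() {   # eventid preferred_seat rows cols now
    touch "$RESERVATIONS"
    awk -v e="$1" -v want="$2" -v rows="$3" -v cols="$4" -v now="$5" '
        $2 == e && ($4 == "never" || $4 > now) { taken[$3] = 1 }
        END {
            split(want, w, "-")
            best = ""
            for (r = 1; r <= rows; r++)
                for (c = 1; c <= cols; c++) {
                    seat = r "-" c
                    if (seat in taken) continue
                    d = (r - w[1]) * (r - w[1]) + (c - w[2]) * (c - w[2])
                    if (best == "" || d < bestd) { best = seat; bestd = d }
                }
            if (best == "") exit 1
            print best
        }' "$RESERVATIONS"
}

# Succeeds if the seat is sold, or held by a reservation not yet expired.
seat_is_taken() {   # eventid seatid now
    touch "$RESERVATIONS"
    awk -v e="$1" -v s="$2" -v now="$3" '
        $2 == e && $3 == s && ($4 == "never" || $4 > now) { found = 1 }
        END { exit !found }' "$RESERVATIONS"
}

# Hold a seat for ttl seconds by appending one short 4-tuple.
hold_seat() {   # userid eventid seatid now ttl
    seat_is_taken "$2" "$3" "$4" && return 1
    echo "$1 $2 $3 $(( $4 + $5 ))" >> "$RESERVATIONS"
}

# Finalise an unexpired hold by rewriting its expiry time to "never".
confirm_seat() {   # userid eventid seatid now
    tmp=$(mktemp) || return 1
    if awk -v u="$1" -v e="$2" -v s="$3" -v now="$4" '
            $1 == u && $2 == e && $3 == s && $4 != "never" && $4 > now {
                $4 = "never"; ok = 1
            }
            { print }
            END { exit !ok }' "$RESERVATIONS" > "$tmp"
    then mv "$tmp" "$RESERVATIONS"
    else rm -f "$tmp"; return 1
    fi
}

# Usage (the current time would normally come from "date +%s"):
#   seat=$(nearest_free_seat ev1 5-5 20 30 "$(date +%s)")
#   hold_seat alice ev1 "$seat" "$(date +%s)" 120
#   confirm_seat alice ev1 "$seat" "$(date +%s)"
```

The current time is passed in explicitly so that the expiry logic is easy to exercise; the point is just that each step reads or writes a single short record.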

Implementing any of this would help. Implementing all of it would make a dramatic difference to the first day experience of the most enthusiastic customers. I for one would be extremely grateful even just to avoid having to type in my name and contact details several dozen times (between failed and successful one-ticket-at-a-time attempts), or avoiding having to type my credit card details in a couple of dozen times in a rush to try to "complete the sale while the site is still talking to me".

Posted Thu Jul 6 20:52:32 2017 Tags:

git svn provides a way to check out Subversion repositories and interact with them using the git interface. It is useful to avoid the mental leap of working with another revision control system tool occasionally when, eg, dealing with RANCID repositories of switch and router configuration (historically RANCID only supported CVS, and then more recently CVS and Subversion; recent versions do support git directly, but not all my clients are using recent enough versions to have direct git support).

Unfortunately the git svn command interaction is still fairly different from "native git" repository interaction, which causes some confusion. But fortunately with a few kludges you can hide the worst of this from day-to-day interaction.

Starting at the beginning, you "clone" the repository with something like:

git svn clone svn+ssh://

(using a specific path within the Subversion repository as suggested by Janos Gyerik, rather than fighting with the git svn/Subversion branch impedance mismatch).

After that you're supposed to use:

git svn rebase

to update the repository pulling in changes made "upstream"; "git pull" simply will not work.

However my fingers, trained by years of git usage, really want to type "git pull" -- if I have to type something else to update, then I might as well just run svn directly. So I went looking for a solution to make "git pull" work on a git svn clone (which I never changed locally).

An obvious answer would be to define a git alias (see Git Aliases), but sadly it is not possible to define a git alias that shadows an internal command, and it appears this is considered a feature. I could call the alias something else, but then I am back at "have to type something different, so I might as well just run svn" :-(

A comment on the same Stack Overflow thread suggests the best answer is to define a bash "shell function" that intercepts calls to git and redirects commands as appropriate. In my case I want "git pull" to run "git svn rebase" if (and only if) I am in a git svn repository. Inspecting those repositories showed that one unique feature they have is that there is a .git/svn directory -- so that gave me a way to tell which repositories to run this command on. Some more searching turned up git rev-parse --show-toplevel as the way to find the .git directory, so my work around could work no matter how deep I am in the git svn repository.

Putting these bits together I came up with this shell function that intercepts "git pull", checks for a git svn repository, and if we are running "git pull" in a git svn repository runs "git svn rebase" instead -- which does a fetch and update, just like "git pull" would do on a native repository:

function git {
    local _GIT

    if test "$1" = "pull"; then
        _GIT=$(command git rev-parse --show-toplevel)
        if test -n "${_GIT}" -a -e "${_GIT}/.git/svn"; then
            command git svn rebase
        else
            command git "$@"
        fi
    else
        command git "$@"
    fi
}

(The "command git" bit forces bash to ignore the shell function, and run the git in the PATH instead, preventing infinite recursion -- without having to hard code the path of the git binary.)

Now "git pull" functions like normal in git repositories, and magically does the right thing on git svn repositories; and all other git commands run as normal.

It is definitely a kludge, but avoiding a daily "whoops, that is the repository that is special" confusion is well worth it. (git log and git diff seem to "just work" in git svn repositories -- which are the main two other commands I end up using on RANCID repositories.)

Posted Tue Jul 4 14:52:58 2017 Tags:

Debian 7.0 ("Wheezy") was originally released about four years ago, in May 2013; the last point release (7.11) was released a year ago, in June 2016. While Debian 7.0 ("Wheezy") has benefited from the Debian Long Term Support with a further two years of support -- until 2018-05-31 -- the software in the release is now pretty old, particularly software relating to TLS (Transport Layer Security) where the most recent version supported by Debian Wheezy is now the oldest still reasonably usable on the Internet. (The Long Term Support also covered only a few platforms -- but they were the most commonly used platforms including x86 and amd64.)

More recently Debian released Debian 8.0 ("Jessie"), originally a couple of years ago in May 2015 (with the latest update, Debian 8.8, released last month, in May 2017). Debian are also planning on releasing Debian Stretch (presumably as Debian 9.0) mid June 2017 -- in a couple of weeks. This means that Debian Stretch is still a "testing" distribution, which does not have security support, but all going according to plan later this month (June 2017) it will be released and will have security support after the release -- for several years (between the normal security support, and likely Debian Long Term Support).

Due to a combination of lack of spare time last year, and the Debian LTS providing some additional breathing room to schedule updates, I still have a few "legacy" Debian installations currently running Debian Wheezy (7.11). At this point it does not make much sense to upgrade them to Debian Jessie (itself likely to go into Long Term Support in about a year), so I have decided to upgrade these systems from Debian Wheezy (7.11) through Debian Jessie (8.8) and straight on to Debian Stretch (currently "testing", but hopefully soon 9.0). My plan is to start with the systems least reliant on immediate security support -- ie, those that are not exposed to the Internet directly. I have done this before, going from Ubuntu Lucid (10.04) to Ubuntu Trusty (14.04) in two larger steps, both of which were Ubuntu LTS distributions.

Most of these older "still Debian Wheezy" systems were originally much older Debian installs, that have already been incrementally upgraded several times. For the two hosts that I looked at this week, the oldest one was originally installed as Debian Sarge, and the newest one was originally installed as Debian Etch, as far as I can tell -- although both have been re-homed on new hardware since the original installs. From memory the Debian Sarge install ended up being a Debian Sarge install only due to the way that two older hosts were merged together some years ago -- some parts of that install date back to even older Debian versions, around Debian Slink first released in 1999. So there are 10-15 years of legacy install decisions there, as well as both systems having a number of additional packages installed for specific long-discarded tasks that create additional clutter (such is the disadvantage of the traditional Unix "one big system" approach, versus the modern approach of many VMs or containers). While I do have plans to gradually break the remaining needed services to separate, automatically built, VMs or containers, it is clearly not going to happen overnight :-)

The first step in planning such an update is to look at the release notes:

The upgrade instructions are relatively boilerplate (prepare for an upgrade, check system status, change apt sources, minimal package updates then full package updates) but do contain hints as to possible upgrade problems with specific packages and how to work around them.
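The "change apt sources" step is mostly a search and replace of the release codename. A minimal sketch, assuming your sources list refers to releases by codename (wheezy, jessie, stretch) rather than by "stable"/"oldstable"; the function name retarget_sources is my own, and this is not a substitute for reading the release notes:

```shell
# Hypothetical helper: rewrite an apt sources list read on stdin, replacing
# the old release codename with the new one on every non-comment line
# (so suffixes such as "wheezy/updates" are rewritten too).
retarget_sources() {   # old_codename new_codename
    awk -v old="$1" -v new="$2" '
        /^[ \t]*#/ { print; next }      # leave comment lines untouched
        { gsub(old, new); print }'
}

# Usage (as root, after taking a backup of /etc/apt/sources.list):
#   retarget_sources wheezy jessie < /etc/apt/sources.list > /etc/apt/sources.list.new
#   mv /etc/apt/sources.list.new /etc/apt/sources.list
#   apt-get update && apt-get upgrade && apt-get dist-upgrade
```

The trailing apt-get commands are the "minimal package updates then full package updates" sequence; any pins or third party repositories still need checking by hand.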

The "issues to be aware of" contain a lot of compatibility hints of things which may break as a result of the upgrade. In particular Debian 8 (Jessie) brings:

  • Apache 2.4 which both has significantly different configuration syntax and only includes files ending in .conf (breaking, eg, naming virtual servers after just the domain name); as does the Squid proxy configuration (see Squid 3.2, 3.3, and 3.4 release notes, particularly Helper Name Changes).

  • systemd (in the form of systemd-sysv) by default, which potentially breaks local init changes (or custom scripts) and halt no longer powering off by default -- that behaviour apparently being declared "a bug that was never fixed" in the old init scripts, after many many years of it working that way. It got documented, but that is about it. (IMHO the only use of "halt but do not power off" is in systems like Juniper JunOS where a key on the console can be used on the halted system to cause it to boot again in the case of accidental halts; it is not clear that actually works with systemd. systemd itself has of course been rather controversial, eventually leading to Devuan Jessie 1.0 which is basically Debian Jessie without systemd. While I am not really a fan of many of systemd's technical decisions, the adoption by most of the major Linux distributions makes interaction with it inevitable, so I am not going out of my way to avoid it on these machines.)

  • The "nobody" user (and others) will have their shell changed to /usr/sbin/nologin -- which mostly affects running commands like:

    sudo su -c /path/to/command nobody

    Those commands instead need to be run as:

    sudo su -s /bin/bash -c /path/to/command nobody

    Alternatively you can choose to decline the change for just the nobody user -- the upgrade tool asks per user change in an interactive upgrade if your debconf question priority is medium or lower. In my case nobody was the last user shell change mentioned.

  • systemd will start, fsck, and mount both / and /usr (if it is a separate device) during the initramfs. In particular this means that if they are RAID (md) or LVM volumes they need to be started by the time that initramfs runs, or startable by initramfs. There also seem to be some races around this startup, which may mean that not everything starts correctly; at least once I got dumped into the systemd rescue shell, and had to run "vgchange -a y" for systemd, wait for everything to be automatically mounted, and then tell it to continue booting (exit), but on one boot it booted correctly by itself so it is definitely a race. (See, eg, Debian bug #758808, Debian bug #774882, and Debian bug #782793. The latter reports a fix in lvm2 2.02.126-3 which is not in Debian Jessie, but is in Debian Stretch, so I did not try too hard to fix this in Debian Jessie before moving on. The main system I experienced this on booted correctly, first time, on Debian Stretch, and continued to reboot automatically, whereas on Debian Jessie it needed manual attention pretty much every boot.)
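For the "nobody" shell change mentioned above, it can be handy to see which accounts end up with a nologin shell, and so which "su -c" invocations will need the extra "-s /bin/bash". A small sketch; the function name and the idea of passing an alternative passwd file are my own assumptions for illustration:

```shell
# Hypothetical helper: list accounts whose login shell is a nologin binary,
# reading a passwd-format file (default /etc/passwd).
users_with_nologin() {   # [passwd_file]
    awk -F: '$7 ~ /\/nologin$/ { print $1 }' "${1:-/etc/passwd}"
}

# Usage:
#   users_with_nologin                     # check the live system
#   users_with_nologin /tmp/passwd.before  # check a saved copy
```

Comparing the output from before and after the upgrade shows exactly which users had their shell changed.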

Debian 9 (Stretch) seems to be bringing:

  • Restrictions around separate /usr (it must be mounted by initramfs if it is separate; but the default Debian Stretch initramfs will do this)

  • net-tools (arp, ifconfig, netstat, route, etc) are deprecated (and not installed by default) in favour of using iproute2 (ip ...) commands. Which is a problem for cross-platform finger-macros that have worked for 20-30 years... so I suspect net-tools will be a common optional package for quite a while yet :-)

  • A warning that a Debian 8.8 (Jessie) or Debian 9 (Stretch) kernel is needed for compatibility with the PIE (Position Independent Executable) compile mode for executables in Debian 9 (Stretch), and thus it is extra important to (a) install all Debian 8 (Jessie) updates and reboot before upgrading to Debian 9 (Stretch), and (b) to reboot very soon after upgrading to Debian 9 (Stretch). This also affects, eg, the output of file -- reporting shared object rather than executable (because the executables are now compiled more like shared libraries, for security reasons). (Position independent code (PIC) is also somewhat slower on register limited machines like 32-bit x86 -- but gcc 5.0+ contains some performance improvements for PIC which apparently help reduce the penalty. This is probably a good argument to prefer amd64 -- 64-bit mode -- for new installs. And even the x86 support is i686 or higher only; Debian Jessie is the last release to support i586 class CPUs.)

  • SSH v1, and older ciphers, are disabled in OpenSSH (although it appears Debian Stretch will have a version where they can still be turned back on; the next OpenSSH release is going to remove SSH v1 support entirely, and it is already removed from the development tree). Also ssh root password login is disabled on upgrade. These ssh changes are particularly an upgrade risk -- one would want to be extra sure of having an out of band console to reach any newly upgraded machines before rebooting them.

  • Changes around apt package pinning calculations (although it would be best to remove all pins and alternative package repositories during the upgrade anyway).

  • The Debian FTP Servers are going away which means that ftp URLs should be changed to http -- the names seem likely to remain for the foreseeable future for use with http.
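For retraining those net-tools finger-macros, a small lookup of the usual iproute2 counterparts can help. This is only a sketch: the function name is made up, and the mapping covers just the four commands named above, giving the common replacement for each rather than an exact option-for-option equivalent:

```shell
# Hypothetical lookup: print the usual iproute2 counterpart of a net-tools
# command (ss ships as part of the iproute2 package).
iproute2_equivalent() {   # net_tools_command
    case "$1" in
        ifconfig) echo "ip addr" ;;
        route)    echo "ip route" ;;
        arp)      echo "ip neigh" ;;
        netstat)  echo "ss" ;;
        *)        echo "no mapping known for: $1" >&2; return 1 ;;
    esac
}

# Usage:
#   iproute2_equivalent ifconfig    # prints "ip addr"
```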

I have listed some notes on issues experienced below, for future reference and will update this list with anything else I find as I upgrade more of the remaining legacy installs over the next few months.

Debian 7 (Wheezy) to Debian 8 (Jessie)

  • webkitgtk (libwebkitgtk-1.0-common) has limited security support. To track down why this is needed:

    apt-cache rdepends libwebkitgtk-1.0-common

    which turns up libwebkitgtk-1.0-0, which is used by a bunch of packages. To find the installed packages that need it:

    apt-cache rdepends --installed libwebkitgtk-1.0-0

    which gives libproxy0 and libcairo2, and repeating that pattern indicates many things installed depending on libcairo2. Ultimately iceweasel / firefox-esr are one of the key triggering packages (but not the only one). I chose to ignore this at this point until getting to Debian Stretch -- and once on Debian Stretch I will enable backports to keep firefox-esr relatively up to date.

  • console-tools has been removed, due to being unmaintained upstream, which is relatively unimportant for my systems which are mostly VMs (with only serial console) or okay with the default Linux kernel console. (The other packages removed on upgrade appear to just be, eg, old versions of gcc, perl, or other packages replaced by newer versions with a new name.)

  • /etc/default/snmpd changed, which removes custom options and also disables the mteTrigger and mteTriggerConf features. The main reason for the change seems to be to put the PID file into /run/ instead of /var/run/. /etc/snmp/snmpd.conf also changes by default, which will probably need to be merged by hand.

    On SNMP restart a bunch of errors appeared:

    Error: Line 278: Parse error in chip name
    Error: Line 283: Label statement before first chip statement
    Error: Line 284: Label statement before first chip statement
    Error: Line 285: Label statement before first chip statement
    Error: Line 286: Label statement before first chip statement
    Error: Line 287: Label statement before first chip statement
    Error: Line 288: Label statement before first chip statement
    Error: Line 289: Label statement before first chip statement
    Error: Line 322: Compute statement before first chip statement
    Error: Line 323: Compute statement before first chip statement
    Error: Line 324: Compute statement before first chip statement
    Error: Line 325: Compute statement before first chip statement
    Error: Line 1073: Parse error in chip name
    Error: Line 1094: Parse error in chip name
    Error: Line 1104: Parse error in chip name
    Error: Line 1114: Parse error in chip name
    Error: Line 1124: Parse error in chip name

    but snmpd apparently started again. The line numbers are too high to be /etc/snmp/snmpd.conf, and as bug report #722224 notes, the filename is not mentioned. An upstream mailing list message implies it relates to lm_sensors object, and the same issue happened on upgrade from SLES 11.2 to 11.3. The discussion in the SLES thread pointed at hyphens in chip names in /etc/sensors.conf being the root cause.

    As a first step, I removed libsensors3 which was no longer required:

    apt-get purge libsensors3

    That appeared to be sufficient to remove the problematic file, and then:

    service snmpd stop
    service snmpd start
    service snmpd restart

    all ran without producing that error. My assumption is that old /etc/sensors.conf was from a much older install, and no longer in the preferred location or format. (For the first upgrade where I encountered it, the machine was now a VM so lm-sensors reading "hardware" sensors was not particularly relevant.)

  • libsnmp15 was removed, but not purged. The only remaining file was /etc/snmp/snmp.conf (note not the daemon configuration, but the client configuration), which contained:

    # As the snmp packages come without MIB files due to license reasons, loading
    # of MIBs is disabled by default. If you added the MIBs you can reenable
    # loading them by commenting out the following line.
    mibs :

    on default systems, to stop the SNMP MIBs from being loaded. Typically one would want to enable SNMP MIB usage and thus get names of things rather than just long numeric OID strings. snmp-mibs-downloader appears to still exist in Debian 8 (Jessie), but it is in non-free.

    The snmp client package did not seem to be installed, so I installed it manually along with snmp-mibs-downloader:

    sudo apt-get install snmp snmp-mibs-downloader

    which caused that, rather than libsnmp15, to own the /etc/snmp/snmp.conf configuration file, which makes more sense. After that I could purge both libsnmp15 and console-tools:

    sudo apt-get purge libsnmp15 console-tools

    (console-tools was an easy choice to purge as I had not actively used its configuration previously, and thus could be pretty sure that none of it was necessary.)

    To actually use the MIBs one needs to comment out the "mibs :" line in /etc/snmp/snmp.conf manually, as per the instructions in the file.
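
    As a minimal sketch of that last step, assuming the file contains a line starting with "mibs :" exactly as shown above, the line can be commented out with sed rather than in an editor. The name enable_mibs is only an illustrative helper, not part of any package:

    ```shell
    # Hypothetical helper: comment out the "mibs :" line in an snmp.conf-style
    # file so that the MIB files are loaded again.  Prints the result on
    # stdout; the input file itself is left untouched.
    enable_mibs() {
        sed 's/^mibs[[:space:]]*:/#&/' "$1"
    }
    ```

    For example, "enable_mibs /etc/snmp/snmp.conf > /tmp/snmp.conf.new" writes a candidate file which can be compared with the original before copying it into place as root. Lines that are already commented out are not changed, so running it twice is harmless.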

  • Fortunately it appeared I did not have any locally modified init scripts which needed to be ported. The suggested check is:

    dpkg-query --show -f'${Conffiles}' | sed 's, /,\n/,g' | \
       grep /etc/init.d | awk 'NF,OFS="  " {print $2, $1}' | \
       md5sum --quiet -c

    and while the first system I upgraded had one custom written init script it was for an old tool which did not matter any longer, so I just left it to be ignored.

    I did have problems with the rsync daemon, as listed below.

  • Some "dummy" transitional packages were installed, which I removed:

    sudo apt-get purge module-init-tools iproute

    (replaced by udev/kmod and iproute2 respectively). The ttf-dejavu packages also showed up as "dummy" transitional packages but owned a lot of files so I left them alone for now.

  • Watching the system console revealed the errors:

    systemd-logind[4235]: Failed to enable subscription: Launch helper exited with unknown return code 1
    systemd-logind[4235]: Failed to fully start up daemon: Input/output error

    which some users have reported when being unable to boot their system, although in my case it happened before rebooting so possibly was caused by a mix of systemd and non-systemd things running.

    systemctl --failed reports:

    Failed to get D-Bus connection: Unknown error -1

    as in that error report, possibly due to the wrong dbus running; the running dbus in this system is from the Debian 7 (Wheezy) install, and the systemd/dbus interaction changed a lot after that. (For complicated design choice reasons, historically dbus could not be restarted, so changing it requires rebooting.)

    The system did reboot properly (although it appeared to force a check of the root disk), so I assume this was a transitional update issue.

  • There were quite a few old Debian 7 (Wheezy) libraries, which I found with:

    dpkg -l | grep deb7

    that seemed no longer to be required, so I removed them manually. (Technically that only finds packages with security updates within Debian Wheezy, but those seem the most likely to be problematic to leave lying around.)

    At one point after the upgrade apt-get offered a large selection of packages to autoremove, but after some other tidy up and rebooting it no longer showed any packages to autoremove; it is unclear what happened to cause that change in report. I eventually found the list in my scrollback and pasted the contents into /tmp/notrequired, then did:

    for PKG in $(cat /tmp/notrequired); do echo $PKG; done | tee /tmp/notrequired.list
    dpkg -l | grep -f /tmp/notrequired.list

    to list the ones that were still installed. Since this included the libwebkitgtk-1.0-common and libwebkitgtk-1.0-0 packages mentioned above, I did:

    sudo apt-get purge libwebkitgtk-1.0-common libwebkitgtk-1.0-0

    to remove those. Then I went through the remainder of the list, and removed anything marked "transitional" or otherwise apparently no longer necessary to this machine (eg, where there was a newer version of the same library installed). This was fairly boring rote cleanup, but given my plan to upgrade straight to Debian 9 (Stretch) it seemed worth starting with a system as tidy as possible.

    I left installed the ones that seemed like I might have installed them deliberately (eg, -perl modules) for some non-packaged tool, just to be on the safe side.
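
    Note that grep -f treats each line of the list as a pattern, so a name such as libfoo would also match libfoo-dev. A minimal sketch of a stricter check, matching whole package names only, might look like the following. The name still_installed is a hypothetical helper, and it assumes the usual dpkg -l column layout (status, name, version, ...):

    ```shell
    # Hypothetical helper: read "dpkg -l" style output on stdin and print the
    # names from the list file ($1, one package name per line) that are still
    # fully installed (status "ii"), comparing whole package names only.
    still_installed() {
        # With an empty list there is nothing to report.
        [ -s "$1" ] || return 0
        awk 'NR == FNR { want[$1] = 1; next }
             $1 == "ii" { name = $2; sub(/:.*/, "", name); if (name in want) print name }' "$1" -
    }
    ```

    Usage would be "dpkg -l | still_installed /tmp/notrequired.list". Any architecture suffix (eg, ":i386") is stripped from the dpkg name before comparing.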

  • I found yet more transitional packages to remove with:

    dpkg -l | grep -i transitional

    and removed them with:

    sudo apt-get purge iceweasel mailx mktemp netcat sysvinit

    after using "dpkg -L PACKAGE" to check that they contained only documentation; sysvinit contained a couple of helper tools (init and telinit) but their functionality has been replaced by separate systemd programs (eg systemctl) so I removed those too.

    Because netcat is useful, I manually installed the dependency it had brought in to ensure that was selected as an installed package:

    sudo apt-get install netcat-traditional

    While it appeared that multiarch-support should also be removable as a no-longer required transitional package, since it was listed as transitional and contained only manpages, in practice attempts to remove it resulted in libc6 wanting to be removed too, which would rapidly lead to a broken system. (On my system the first attempt failed on gnuplot, which was individually fixable by installing, eg, gnuplot-nox explicitly and removing the gnuplot meta package, but since removing multiarch-support led to removing libc6 I did not end up going down that path.)

    For consistency I also needed to run aptitude and interactively tell aptitude about these decisions.

  • After all this tidying up, I found nothing was listening on the rsync port (tcp/873) any longer. Historically I had run the rsync daemon using /etc/init.d/rsync, which still existed, and still belonged to the rsync package.

    sudo service rsync start

    did work, to start the rsync daemon, but it did not start at boot. Debian Bug #764616 provided the hint that:

    sudo systemctl enable rsync

    was needed to enable it starting at boot. As Tobias Frost noted on Debian Bug #764616 this appears to be a regression from Debian Wheezy. It appears the bug eventually got fixed in rsync package 3.1.2-1, but that did not get backported to Debian Jessie (which has 3.1.1-3) so I guess the regression remains for everyone to trip over :-( If I was not already planning on upgrading to Debian Stretch then I might have raised backporting the fix as a suggestion.

  • inn2 (for UseNet) is no longer supported on 32-bit (x86); only the LFS (Large File Support) package, inn2-lfs is supported, and it has a different on-disk database format (64-bit pointers rather than 32-bit pointers). The upgrade is not automatic (due to the incompatible database format) so you have to touch /etc/news/convert-inn-data and then install inn2-lfs to upgrade:

    You are trying to upgrade inn2 on a 32-bit system where an old inn2 package
    without Large File Support is currently installed.
    Since INN 2.5.4, Debian has stopped providing a 32-bit inn2 package and a
    LFS-enabled inn2-lfs package and now only this LFS-enabled inn2 package is
    This will require rebuilding the history index and the overview database,
    but the postinst script will attempt to do it for you.
    Please create an empty /etc/news/convert-inn-data file and then try again
    upgrading inn2 if you want to proceed.

    Because this fails out the package installation it causes apt-get dist-upgrade to fail, which leaves the system in a partially upgraded messy state. For systems with inn2 installed on 32-bit this is probably the biggest upgrade risk.

    To try moving forward:

    sudo touch /etc/news/convert-inn-data
    sudo apt-get -f install

    All going well the partly installed packages will be fixed up, then:

    [ ok ] Stopping news server: innd.
    Deleting the old overview database, please wait...
    Rebuilding the overview database, please wait...

    will run (which will probably take many minutes on most non-trivial inn2 installs; in my case these are old inn2 installs, which have been hardly used for years, but do have a lot of retained posts, as a historical archive). You can watch the progress of the intermediate files needed for the overview database being built with:

    watch ls -l /var/spool/news/incoming/tmp/
    watch ls -l /var/spool/news/overview/

    in other windows, but otherwise there is no real indication of progress or how close you are to completion. The "/usr/lib/news/bin/makehistory -F -O -x" process that is used in rebuilding the overview file is basically IO bound, but also moderately heavy on CPU. (The history file index itself, in /var/lib/news/history.* seems to rebuild fairly quickly; it appears to be the overview files that take a very long time, due to the need to re-read all the articles.)

    It may also help to know where makehistory is up to reading, eg:

    MKHISTPID=$(ps axuwww | awk '$11 ~ /makehistory/ && $12 ~ /-F/ { print $2; }')
    sudo watch ls -l "/proc/${MKHISTPID}/fd"

    which will at least give some idea which news articles are being scanned. (As far as I can tell one temporary file is created per UseNet group, which is then merged into the overview history; the merge phase is quick, but the article scan is pretty slow. Beware the articles are apparently scanned in inode order rather than strictly numerical order, which makes it harder to tell group progress -- but at least you can tell which group it is on.)

    In one of my older news servers, with pretty slow disk IO, rebuilding the overview file took a couple of hours of wall clock time. But it is slow even given the disk bandwidth, because it makes many small read transactions. This is for about 9 million articles, mostly in a few groups where a lot of history was retained, including single groups with 250k-350k articles retained -- and thus stored in a single directory by inn2. The file system is ext4 (but probably without directory indexes, due to being created on ext2/ext3).

    Note that all of this delay blocks the rest of the upgrade of the system, due to it being done in the post-install script -- and the updated package will bail out of the install if you do not let it do the update in the post-install script. Given the time required it seems like a less disruptive upgrade approach could have been chosen, particularly given the issue is not mentioned at all as far as I can see in the "Issues to be aware of for Jessie" page. My inclination for the next one would be to hold inn2, and upgrade everything else first, then come back to upgrading inn2 and anything held back because of it.

    Some searching turned up enabling ext4 dir_index handling to speed up access for larger directories:

    sudo service inn2 stop
    sudo umount /dev/r1/news
    sudo tune2fs -O dir_index,uninit_bg /dev/r1/news
    sudo tune2fs -l /dev/r1/news
    sudo e2fsck -fD /dev/r1/news
    sudo mount /dev/r1/news
    sudo service inn2 start

    I apparently did not do this on the previous OS upgrade to avoid locking myself out of using earlier OS kernels; but these ext4 features have been supported for many years now.

    In hindsight this turned out to be a bad choice, causing a lot more work. It is unclear if the file system was already broken, or if changing these options and doing partial fscks broke it :-( At minimum I would suggest doing an e2fsck -f /dev/r1/news before changing any options, to at least know whether the file system is good before the options are changed.

    In my case when I first tried this change I also set "-O uninit_bg" since it was mentioned in the online hints, and then after the first e2fsck, tried to do one more "e2fsck -f /dev/r1/news" to be sure the file system was okay before mounting it again. But apparently parts of the file system need to be initialised by a kernel thread when "uninit_bg" is set.

    I ended up with a number of reports like:

    Inode 8650758, i_size is 5254144, should be 6232064.  Fix? yes
    Inode 8650758, i_blocks is 10378, should be 10314.  Fix? yes

    followed by a huge number of reports like:

    Pass 2: Checking directory structure
    Directory inode 8650758 has an unallocated block #5098.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5099.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5100.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5101.  Allocate? yes

    which were too numerous to allocate by hand (although I tried saying "yes" to a few by hand), and they could not be fixed automatically (eg, not fixable by "sudo e2fsck -pf /dev/r1/news").

    It is unclear if this was caused by "-O uninit_bg", or some earlier issue on the file system (this older hardware has not been entirely stable), or whether there was some need for more background initialisation to happen which I interrupted by mounting the disk, then unmounting it, and then deciding to check it again.

    Since the file system could still be mounted, I tried making a new partition and using tar to copy everything off it first before trying to repair it. But the tar copy also reported many many kernel messages like:

    Jun 11 19:12:10 HOSTNAME kernel: [24027.265835] EXT4-fs error (device dm-3): __ext4_read_dirblock:874: 
    inode #9570798: block 6216: comm tar: Directory hole found

    and in general the copy proceeded extremely slowly (way way below the disk bandwidth). So I gave up on trying to make a tar copy first, as it seemed like it would take all night with no certainty of completing. I assume these holes are the same "unallocated blocks" that fsck complained about.

    Given that the news spool was mostly many year old articles which I also had not looked at in years, instead I used dd to make a bitwise copy of the partition:

    dd if=/dev/r1/news of=/dev/r1/news_backup bs=32768

    which ran at something approaching the underlying disk speed, and at least gives me a "broken" copy to try a second repair on if I find a better answer later.

    Running a non-interactive "no change" fsck:

    e2fsck -nf /dev/r1/news

    indicated the scope of the problem was pretty huge, with both many unallocated block reports as above, and also many errors like:

    Problem in HTREE directory inode 8650758: block #1060 has invalid depth (2)
    Problem in HTREE directory inode 8650758: block #1060 has bad max hash
    Problem in HTREE directory inode 8650758: block #1060 not referenced

    which I assume indicate dir_index directories that did not get properly indexed, as well as a whole bunch of files that would end up in lost+found. So the file system was pretty messed up.

    Figuring backing out might help, I turned dir_index off again:

    tune2fs -O ^dir_index /dev/r1/news
    tune2fs -l /dev/r1/news

    There were still a lot of errors when checking with e2fsck -nf /dev/r1/news, but at least some of them were that there were directories with the INDEX_FL flag set on filesystem without htree support, so it seemed like letting fsck fix that would avoid a bunch of the later errors.

    So as a last ditch attempt, no longer really caring about the old UseNet articles (and knowing they are probably on the previous version of this hosts disks anyway), I tried:

     e2fsck -yf /dev/r1/news

    and that did at least result in fewer errors/corrections, but it did throw a lot of things in lost+found :-(

    I ran e2fsck -f /dev/r1/news again to see if it had fixed everything there was to fix, and at least it did come up clean this time. On mounting the file system, there were 7000 articles in lost+found, out of several million on the file system. So I suppose it could have been worse. Grepping through them, they appear to have been from four Newsgroups (presumably the four inodes originally reported as having problems), and all are ones I do not really care about any longer. inn2 still started, so I declared success at this point.

    At some point perhaps I should have another go at enabling dir_index, but definitely not during a system upgrade!
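
    A more cautious sequence for such a later attempt, sketched here under assumptions rather than taken from the upgrade described above, would check the file system before changing anything and enable only one feature at a time. The names safe_dir_index and RUN are hypothetical; by default the function only prints the commands it would run (set RUN=sudo to execute them), and it assumes the file system is already unmounted and backed up:

    ```shell
    # RUN is a hypothetical switch: the default "echo" only prints each
    # command; set RUN=sudo to really execute them.
    RUN=${RUN:-echo}

    # Enable dir_index on an unmounted ext4 device ($1), checking it first.
    safe_dir_index() {
        dev=$1
        # 1. Full check before any change.  e2fsck exits non-zero if it found
        #    (or fixed) anything; in that case stop and investigate first.
        $RUN e2fsck -f "$dev" || return 1
        # 2. Turn on dir_index only (no uninit_bg), to change one thing at a time.
        $RUN tune2fs -O dir_index "$dev" || return 1
        # 3. Re-check, with -D to rebuild and index the existing directories.
        $RUN e2fsck -fD "$dev"
    }
    ```

    Running "safe_dir_index /dev/r1/news" with the default RUN=echo just lists the three commands in order, which makes it easy to review the plan before committing to it.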

  • python2.6 and related packages, and squid (2.x; replaced by squid3) needed to be removed before db5.1-util could be upgraded. They are apparently linked via libdb5.1, which is not provided in Debian Jessie, but is specified as broken by db5.1-util unless it is a newer version than was in Debian Wheezy. In Debian Jessie only the binary tools are provided, and it offers to uninstall them as an unneeded package.

    Also netatalk is in Debian Wheezy and depends on libdb5.1, but is not in Debian Jessie at all. This surprised other people too, and netatalk seems to be back in Debian Stretch. But it is still netatalk 2.x, rather than netatalk 3.x which has been released for years; someone has attempted to modify the netatalk package to netatalk 3.1, but that also seems to have been abandoned for the last couple of years. (Because I was upgrading through to Debian Stretch, I chose to leave the Debian Wheezy version of netatalk installed, and libdb5.1 from Debian Wheezy installed until after the upgrade to Debian Stretch.)

Debian 8 (Jessie) to Debian 9 (Stretch)

  • Purged the now removed packages:

    # dpkg -l | awk '/^rc/ { print $2 }'


    sudo apt-get purge $(dpkg -l | awk '/^rc/ { print $2 }')

    to clear out the old configuration files.

  • Checked changes in /etc/default/grub:

    diff /etc/default/grub.ucf-dist /etc/default/grub

    and updated grub using update-grub.

  • Checked changes in /etc/ssh/sshd_config:

    grep -v "^#" /etc/ssh/sshd_config.ucf-old | grep '[a-z]'
    grep -v "^#" /etc/ssh/sshd_config | grep '[a-z]'

    and checked that the now commented out lines are the defaults. Check that sshd stops/starts/restarts with the new configuration:

    sudo service ssh stop
    sudo service ssh start
    sudo service ssh restart

    and that ssh logins work after the upgrade.
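
    The two grep commands above print the active settings of each file, which then have to be compared by eye. As a sketch of automating that comparison: active_config and config_changes are hypothetical helper names, and sorting the lines is an assumption that the order of settings does not matter, which is not strictly true for sshd_config (eg, Match blocks):

    ```shell
    # Print only the active settings of a config file ($1): drop comment
    # lines and blank lines.
    active_config() {
        grep -v '^[[:space:]]*#' "$1" | grep -v '^[[:space:]]*$'
    }

    # Show the active settings that differ between an old ($1) and a new ($2)
    # version of a config file.  Returns the exit status of diff.
    config_changes() {
        old=$(mktemp) && new=$(mktemp) || return 1
        active_config "$1" | sort > "$old"
        active_config "$2" | sort > "$new"
        diff "$old" "$new"
        rc=$?
        rm -f "$old" "$new"
        return "$rc"
    }
    ```

    For example, "config_changes /etc/ssh/sshd_config.ucf-old /etc/ssh/sshd_config" lists settings only in the old file with "<" and settings only in the new file with ">".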

  • The isc-dhcp-server service failed to start because it wanted to start both IPv4 and IPv6 service, and the previous configuration (and indeed the network) only had IPv4 configuration:

    dhcpd[15518]: No subnet6 declaration for eth0

    Looking further back in the log I saw:

    isc-dhcp-server[15473]: Launching both IPv4 and IPv6 servers [...]

    with the hint "(please configure INTERFACES in /etc/default/isc-dhcp-server if you only want one or the other)".

    Setting INTERFACES in /etc/default/isc-dhcp-server currently works to avoid starting the IPv6 server, but it results in a warning:

    DHCPv4 interfaces are no longer set by the INTERFACES variable in
    /etc/default/isc-dhcp-server.  Please use INTERFACESv4 instead.
    Migrating automatically for now, but this will go away in the future.

    so I edited /etc/default/isc-dhcp-server and changed it to set INTERFACESv4 instead of INTERFACES.

    After that:

    sudo service isc-dhcp-server stop
    sudo service isc-dhcp-server start
    sudo service isc-dhcp-server restart

    worked without error, and syslog reported:

    isc-dhcp-server[15710]: Launching IPv4 server only.
    isc-dhcp-server[15710]: Starting ISC DHCPv4 server: dhcpd.
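
    The same edit can be sketched with sed, assuming the defaults file uses a plain INTERFACES="..." assignment at the start of a line. The name migrate_interfaces is a hypothetical helper:

    ```shell
    # Hypothetical helper: rename a legacy INTERFACES= assignment to
    # INTERFACESv4= in an isc-dhcp-server defaults file ($1).  Prints the
    # result on stdout; commented-out lines and INTERFACESv6= are unchanged.
    migrate_interfaces() {
        sed 's/^\([[:space:]]*\)INTERFACES=/\1INTERFACESv4=/' "$1"
    }
    ```

    The output can be reviewed and then copied over /etc/default/isc-dhcp-server as root before restarting the service.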

  • The /etc/rsyslog.conf has changed somewhat, particularly around the syntax for loading modules. Lines like:

    $ModLoad imuxsock # provides support for local system logging

    have changed to:

    module(load="imuxsock") # provides support for local system logging

    I used diff /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-dist to find these changes and merged them by hand. I also removed any old commented out sections no longer present in the new file, but kept my own custom changes (for centralised syslog).

    Then tested with:

    sudo service rsyslog stop
    sudo service rsyslog start
    sudo service rsyslog restart
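
    For the simple module loading lines, the syntax change shown above is mechanical enough to sketch with sed. The name modload_to_module is a hypothetical helper, and it only handles plain "$ModLoad name" lines; other legacy directives that configure a module still need converting by hand:

    ```shell
    # Hypothetical filter: rewrite legacy "$ModLoad name" lines on stdin into
    # the newer module(load="name") form, keeping any trailing comment.
    # Commented-out lines are left alone.
    modload_to_module() {
        sed 's/^\$ModLoad[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)/module(load="\1")/'
    }
    ```

    For example, "modload_to_module < /etc/rsyslog.conf | diff /etc/rsyslog.conf -" shows which lines would change, without modifying the file.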

  • This time, even after reboot, apt-get reported a whole bunch of unneeded packages, so I ran:

    sudo apt-get --purge autoremove

    to clean them up.

  • An aptitude search:

    aptitude search '~i(!~ODebian)'

    from the Debian Stretch Release Notes on Checking system status provided a hint on finding packages which used to be provided, but are no longer present in Debian. I went through the list by hand and manually purged anything which was clearly an older package that had been replaced (eg old cpp and gcc packages) or was no longer required. There were a few that I did still need, so I have left those installed -- but it would be better to find a newer Debian packaged replacement to ensure there are updates (eg, vncserver).

  • Removing the Debian 8 (Jessie) kernel:

    sudo apt-get purge linux-image-3.16.0-4-686-pae

    gave the information that the libc6-i686 library package was no longer needed, as in Debian 9 (Stretch) it is just a transitional package, so I did:

    sudo apt-get --purge autoremove

    to clean that up. (I tried removing the multiarch-support "transitional" package again at this point, but there were still a few packages with unmet dependencies without, including gnuplot, libinput10, libreadline7, etc, so it looks like this "transitional" package is going to be with us for a while yet.)

  • update-initramfs reported a wrong UUID for resuming (presumably due to the swap having been reinitialised at some point):

    update-initramfs: Generating /boot/initrd.img-4.9.0-3-686-pae
    W: initramfs-tools configuration sets RESUME=UUID=22dfb0a9-839a-4ed2-b20b-7cfafaa3713f
    W: but no matching swap device is available.
    I: The initramfs will attempt to resume from /dev/vdb1
    I: (UUID=717eb7a5-b49c-4409-9ad2-eb2383957e77)
    I: Set the RESUME variable to override this.

    which I tracked down to config in /etc/initramfs-tools/conf.d/resume, that contains only that one single line.

    To get rid of the warning I updated the UUID in /etc/initramfs-tools/conf.d/resume to match the new auto-detected one, and tested that worked by running:

    sudo update-initramfs -u

  • The log was being spammed with:

    console-kit-daemon[775]: missing action
    console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    console-kit-daemon[775]: console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it

    messages. Based on the hint that consolekit is not necessary since Debian Jessie in the majority of cases, and knowing almost all logins to this server are via ssh, I followed the instructions in that message to remove consolekit:

    sudo apt-get purge consolekit libck-connector0 libpam-ck-connector

    to silence those messages. (This may possibly be a Debian 8 (Jessie) related tidy up, but I did not discover it until after upgrading to Debian 9 (Stretch).)

  • A local internal (ancient, Debian Woody vintage) apt repository no longer works:

    W: The repository 'URL' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.

    since the one needed local package was already installed long ago, I just commented that repository out in /etc/apt/sources.list. The process for building apt repositories has been updated considerably in the last 10-15 years.

  • After upgrading and rebooting, on one old (upgraded many times) system systemd-journald and rsyslogd were running flat out after boot, and lpd was running regularly. Between them they were spamming the /var/log/syslog file with:

    lpd[847]: select: Bad file descriptor

    lines, many, many, many times a second. I stopped lpd with:

    sudo service lpd stop

    and the system load returned to normal, and the log lines stopped. The lpd in this case was provided by the lpr package:

    ewen@HOST:~$ dpkg -S /usr/sbin/lpd
    lpr: /usr/sbin/lpd

    and it did not seem to have changed much since the Debian Jessie lpr package -- Debian Wheezy had 1:2008.05.17+nmu1, Debian Jessie had 1:2008.05.17.1, and Debian Stretch has 1:2008.05.17.2. According to the Debian Changelog the only difference between Debian Jessie and Debian Stretch is that Debian Stretch's version was updated to later Debian packaging standards.

    Searching on the Internet did not turn up anyone else reporting the same issue in lpr.

    Running:

    sudo service lpd start

    again a while after boot did not produce the same symptoms, so for now I have left it running.

    However some investigation in /etc/printcap revealed that this system had not been used for printing for quite some time, as its only printer entries referred to printers that had been taken out of service a couple of years earlier. So if the problem reoccurs I may just remove the lpr package completely.

    ETA, 2017-07-14: This happened again after another (unplanned) reboot (caused by multiple brownouts getting through the inexpensive UPS). Because I did not notice in time, it then filled up / with a 4.5GB /var/log/lpr.log, with endless messages of:

    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor

    so, since I had not used the printing functionality on this machine for quite some time, I ended up just removing it completely:

    sudo cp /dev/null /var/log/lpr.log
    sudo cp -p /etc/printcap /etc/printcap-old-2017-07-14
    sudo apt-get purge lpr
    sudo logrotate -f /etc/logrotate.d/rsyslog
    sudo logrotate -f /etc/logrotate.d/rsyslog

    which seemed more time efficient than trying to debug the problem of which file descriptor it was talking about (my guess is maybe one which systemd closed for lpd, that the previous init system did not close, but I have no detailed investigation of that). I kept a copy of /etc/printcap in case I do want to try to restore the printing functionality (or debug it later), but most likely I would just set up printing from scratch.

    The two (forced) log rotates were to force compression of the other copies of the 4GB of log messages (in /var/log/syslog, which rotates daily by default, and /var/log/messages which rotates weekly by default), having removed /var/log/lpr.log which was another 4.5GB. Unsurprisingly they compress quite well given the logs were spammed with a single message -- but even compressed they are still about 13MB.
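
    To spot this kind of log spam before it fills the disk, one option is to summarise the most repeated messages in a log file. This is a minimal sketch assuming the traditional syslog line format ("Mon DD HH:MM:SS host tag[pid]: message"); top_log_messages is a hypothetical helper name:

    ```shell
    # Hypothetical helper: print the most repeated messages in a traditional
    # syslog-format file ($1), most frequent first.  The timestamp, host name
    # and PID are dropped so that identical messages group together.
    # $2 is the number of messages to show (default 5).
    top_log_messages() {
        awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); sub(/\[[0-9]+\]:/, ":"); print }' "$1" |
            sort | uniq -c | sort -rn | head -n "${2:-5}"
    }
    ```

    For example, "sudo sh" followed by "top_log_messages /var/log/syslog 3" in a shell where the function is defined lists the three most frequent messages with their counts.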

After fixing up those upgrade issues the first upgraded system seems to have been running properly on Debian 9 (Stretch) for the last few days, including helping publish this blog post :-)

ETA, 2017-06-11: Updates, particularly around inn2 upgrade issues.

ETA, 2017-06-17: Updates on boot issues in jessie, fixed by stretch.

Posted Wed Jun 7 10:50:46 2017 Tags:

I have Java installed for precisely one reason: to be able to access Dell iDRAC consoles on both my own server and various client servers. Since Java on the web has been a terrible idea for years, and since the Dell iDRAC relies on various binary modules which do not work on Mac OS X, I have restricted this Java install to a single VM on my desktop which I start up when I need to access the iDRAC consoles.

For the last few years, this "iDRAC console" VM has been an Ubuntu 14.04 LTS VM, with OpenJDK 7 installed. It was the latest available at the time I installed it, and since it was working I left it alone. Unfortunately after upgrading some client Dell hosts to the latest iDRAC firmware, as part of a redeployment exercise, those iDRACs stopped working with this Ubuntu 14.04/OpenJDK 7 environment. But I was able to work around that by using a newer Java environment on a client VM.

Today, when I went to use the Java console with my own older Dell server, the iDRAC console no longer started properly, failing with a Java error:

Fatal: Application Error: Cannot grant permissions to unsigned jars.

which was a surprise as it had previously worked as recently as a few weeks ago.

One StackExchange hint suggested this policy could be overridden by running:


and changing the Policy Settings to allow "Execute unowned code". But in my case that made no difference. I also tried setting the date in the VM back a year, in case maybe the signing certificate had now expired out -- but that too made no difference.

Given the hint that OpenJDK 8 actually worked, and finding some backports of OpenJDK 8 to Ubuntu 14.04 LTS (which was released shortly after OpenJDK 8 came out, so does not contain it), I decided to try installing the OpenJDK 8 versions on Ubuntu 14.04 LTS. Fortunately this did actually work.

To install OpenJDK 8 on Ubuntu 14.04 LTS ("trusty") you need to install from the OpenJDK builds PPA, which is not officially part of Ubuntu but this one is managed by someone linked with Ubuntu, so is a bit more trustworthy than "random software found on the Internet".

Installation of the OpenJDK 8 JRE:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

and it can be made the default by running:

sudo update-alternatives --config java

and choosing the OpenJDK 8 version.

Unfortunately that does not include javaws, which is the JNLP client that actually triggers the iDRAC console startup -- which meant that OpenJDK 7 was still running (and failing) trying to launch the iDRAC console. Some hunting turned up the need to install icedtea-8-plugin from another Ubuntu PPA to get a newer javaws that would work with OpenJDK 8. To install this one:

sudo add-apt-repository ppa:maarten-fonville/ppa
sudo apt-get update
sudo apt-get install icedtea-8-plugin

Amongst other things this updates the icedtea-netx package, which includes javaws, to also include a version for OpenJDK 8. Unfortunately the updated package did not make the new OpenJDK 8 javaws the default, nor did update-alternatives --config javaws offer the OpenJDK 8 javaws as an option, which meant the old, non-working, OpenJDK 7 version still launched.

To actually use the newer OpenJDK 8 javaws, I had to manually update the /etc/alternatives symlink:

cd /etc/alternatives
sudo rm javaws
sudo ln -s /usr/lib/jvm/java-8-openjdk-i386/jre/bin/javaws .
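
To confirm which binary a command name ends up at after following the whole symlink chain, something like the following minimal sketch may help. The function name resolve_command is hypothetical (not part of any package), and it assumes a readlink that supports -f, as the GNU coreutils version on Ubuntu does:

```shell
# Print the final target of a command after following all symlinks
# (eg /usr/bin/javaws -> /etc/alternatives/javaws -> the real binary).
# resolve_command is a hypothetical helper name, not part of any package.
resolve_command() {
    # Fail early if the command is not on the PATH at all
    cmd_path="$(command -v "$1")" || return 1
    # Assumes readlink supports -f (true for GNU coreutils)
    readlink -f "${cmd_path}"
}
```

Running "resolve_command javaws" after the symlink change should then print a path under the OpenJDK 8 directory rather than the OpenJDK 7 one.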

After which, finally, I could launch the iDRAC console again and carry on with what I originally planned to do. I hope this will have fixed the iDRAC console access on the newer iDRAC firmware on some of my client machines too; but I have not tested that so far.

Posted Mon May 29 11:29:04 2017 Tags:

After running into problems trying to get git-annex to run on an SMB share on my Synology DS216+, and prompted by the git-annex author and an example with an earlier Synology NAS, I decided to install the standalone version of git-annex directly on my Synology DS216+.

My approach was similar to the earlier "Synology NAS and git annex" tip, but the DS216+ uses an x86_64 CPU:

ewen@nas01:/$ grep "model name" /proc/cpuinfo | uniq
model name  : Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz

and I chose a slightly different approach to getting everything working, in part based on my experience setting up the standalone git-annex on a Mac OS X server. I am using Synology DSM "DSM 6.1.1-15101 Update 4", which is the latest release as I write (released 2017-05-25).

To install git-annex:

  • In the Synology web interface (DSM) enable the "SSH Service", in Control Panel -> Terminal, by ticking "Enable SSH Service", and verify that you can ssh to your Synology NAS. Only accounts in the administrators group can use the ssh service, so you will need to create an administrator account to use if you do not already have one. (If your Synology NAS is exposed to the Internet directly now would be a very good time to ensure you have a strong password on the account; mine is behind a separate firewall.)

  • In the Synology web interface (DSM) go to the Package Center and search for "Git Server" (git) from Synology and install that package. It should install in a few seconds, and currently appears to install git 2.8.0:

      ewen@nas01:/$ git --version
      git version 2.8.0

    which while not current (eg my laptop has git 2.13.0), is only about a year old. It is a symlink (in /usr/bin/git) into the Git package in /var/packages/Git/target/bin/git.

  • Verify that you can now reach the necessary parts of the git package:

    for FILE in git git-shell git-receive-pack git-upload-pack; do
        which "${FILE}"
    done

    should produce something like:

  • Download the latest git-annex standalone x86-64 tarball, and gpg signature

  • Verify the git-annex gpg signature (as with previous installs):

    gpg --verify git-annex-standalone-amd64.tar.gz.sig

    which should report a "Good signature" from the "git-annex distribution signing key" (DSA key ID 89C809CB, Primary key fingerprint: 4005 5C6A FD2D 526B 2961 E78F 5EE1 DBA7 89C8 09CB).

    If you have not already verified that key is the right signature key it can be verified against, eg, keys in the Debian keyring as Joey Hess is a former Debian Developer.

  • Once you are happy with git-annex tarball you downloaded, copy it onto the NAS somewhere suitable, eg on the NAS:

    sudo mkdir /volume1/thirdparty
    sudo mkdir /volume1/thirdparty/archives
    sudo chown "$(id -nu):$(id -ng)" /volume1/thirdparty/archives

    then from wherever you downloaded the git-annex archive:

    scp -p git-annex-standalone-amd64.tar.gz*
  • Extract the archive on the NAS:

    cd /volume1/thirdparty
    sudo tar -xzf archives/git-annex-standalone-amd64.tar.gz

    The extracted archive is about 160MB, because of bundling all the required tools:

    ewen@nas01:/volume1/thirdparty$ du -sm git-annex.linux/
    161 git-annex.linux/

    to make it a stand alone version (as well as static linking everything).

  • Symlink git-annex into /usr/local/bin so we have a common place to reference these binaries:

    cd /usr/local/bin
    sudo ln -s /volume1/thirdparty/git-annex.linux/git-annex .

    In a normal login shell /usr/local/bin will be on the PATH, and:

    which git-annex

    should print:


    and you should be able to run git-annex by itself and have it print out the basic help text.

    Unfortunately this does not work for non-interactive shells, because the Synology NAS uses the "/bin/sh" symlink to bash, which means that non-interactive shells do not process ~/.bashrc, and non-interactive shells also do not read /etc/profile (which is where /usr/local/bin/ is added to the PATH). So we have to add some more workarounds, with symlinks into /usr/bin/ later (see below).

    For reference, this is my /etc/passwd entry created by the Synology NAS web interface (DSM):

    ewen@nas01:~$ grep "$(id -un):" /etc/passwd
    ewen:x:1026:100:Ewen McNeill:/var/services/homes/ewen:/bin/sh
  • To fix the warning:

    warning: /bin/sh: setlocale: LC_ALL: cannot change locale (en_US.utf8)

    we have to pre-create the locales directory that the git-annex runshell script tries to write the locales into, with permissions that a regular user can write into, and then run git-annex once.

    sudo mkdir /volume1/thirdparty/git-annex.linux/locales
    sudo chown "$(id -un):$(id -gn)" /volume1/thirdparty/git-annex.linux/locales

    On the Synology NAS, with the default locale:

    ewen@nas01:~$ set | egrep "LANG|LC_ALL"

    this should create:

    ewen@nas01:~$ ls /volume1/thirdparty/git-annex.linux/locales/en_US.utf8/

    And then we can revert the file permissions to root owned:

    sudo chown -R root:root /volume1/thirdparty/git-annex.linux/locales

    Note that it is possible to change the interactive locale by setting LANG and LC_ALL in, eg, ~/.bash_profile (but this will not work for non-interactive shells). git-annex only supports utf8 locales, but that is probably the most useful modern choice anyway. (I chose not to bother as en_US.utf8 is close enough to my usual locale -- en_NZ.utf8 -- that it did not really matter at present; the main difference would be the date format, and I do not expect to use git-annex interactively on the Synology NAS often enough for that to be an issue. I just wanted the warning message gone, as it turns up repeatedly in interactive use.)

  • To make good use of git-annex you probably also want to enable the "User Home" feature, so that the home directory for your user is created and you can store things like ssh keys; this also enables a per-user share via CIFS, etc. To do this, in the Synology web interface (DSM) go to Control Panel -> User -> User Home and tick "Enable user home service", and hit Apply. This will create a /volume1/homes directory, a directory for each user, and a /var/services/homes symlink pointing at /volume1/homes so that the shell directories are reachable.

    Once that is done, when you ssh into the NAS, the message about your home directory being missing:

    Could not chdir to home directory /var/services/homes/ewen: No such file or directory

    should be gone, and you should arrive in your home directory at login:

      ewen@nas01:~$ pwd
  • If you do have a home directory, you might also want to do some common git setup:

    git config --global ...    # Insert your email address
    git config --global ...     # Insert your name

    which should run without any complaints, creating a ~/.gitconfig file with the values you supply.

  • Assuming you do have a user home directory you can usefully run the next step to have git-annex auto-generate a couple of necessary helper scripts in ${HOME}/.ssh/ -- which cannot be automatically created otherwise (but see the contents below if you want to try to create them by hand).

    To create the helper scripts automatically run:


    which will start a new shell, with /volume1/thirdparty/git-annex.linux/bin in the "${PATH}" so you can interactively use the git-annex versions of tools (eg, for testing).

    It also creates the two helper scripts that we need:

    $ ls -l ${HOME}/.ssh
    total 8
    -rwxrwxrwx 1 ewen users 241 May 28 11:20 git-annex-shell
    -rwxrwxrwx 1 ewen users  74 May 28 11:20 git-annex-wrapper
  • Since (a) these scripts are not user specific and (b) "${HOME}/.ssh" is not on the PATH by default, it is much more useful to move these scripts into, eg, /usr/local/bin/, so they are in a central location.

    To do this:

    cd /usr/local/bin
    sudo mv "${HOME}/.ssh/git-annex-shell" .
    sudo mv "${HOME}/.ssh/git-annex-wrapper" .
    sudo chown root:root git-annex-shell git-annex-wrapper
    sudo chmod 755 git-annex-shell git-annex-wrapper

    This should give you two trivial shell scripts, which hard code the path to where you unpacked git-annex:

    ewen@nas01:/usr/local/bin$ ls -l git-annex-*
    -rwxr-xr-x 1 root root 241 May 28 11:20 git-annex-shell
    -rwxr-xr-x 1 root root  74 May 28 11:20 git-annex-wrapper
    ewen@nas01:/usr/local/bin$ cat git-annex-shell
    #!/bin/sh
    set -e
    if [ "x$SSH_ORIGINAL_COMMAND" != "x" ]; then
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$SSH_ORIGINAL_COMMAND"
    else
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$@"
    fi
    ewen@nas01:/usr/local/bin$ cat git-annex-wrapper
    #!/bin/sh
    set -e
    exec '/volume1/thirdparty/git-annex.linux/runshell' "$@"

    (which gives you enough to create them by hand if you need to, substituting the path where you unpacked the git-annex standalone archive for /volume1/thirdparty/).

  • To be able to run these helper scripts, and git-annex itself, from a non-interactive shell -- such as when git-annex itself is trying to run the remote git-annex, we need to ensure that git-annex, git-annex-shell and git-annex-wrapper are reachable via a directory that is in the default PATH. That default PATH is very minimal, containing:

    ewen@ashram:~$ ssh 'set' | grep PATH

    Since /bin and /sbin are both symlinks anyway:

    ewen@nas01:~$ ls -l /bin
    lrwxrwxrwx 1 root root 7 May 21 18:57 /bin -> usr/bin
    ewen@nas01:~$ ls -l /sbin
    lrwxrwxrwx 1 root root 8 May 21 18:57 /sbin -> usr/sbin

    that gives us only two choices -- /usr/bin and /usr/sbin -- which are on the default PATH. Given that git-annex is not a system administration tool, only /usr/bin makes sense.

    To symlink them into /usr/bin:

    cd /usr/bin
    sudo ln -s /usr/local/bin/git-annex* .

    I am expecting that this step may need to be redone periodically, as various Synology updates update /usr/bin, which is why I have a "master" copy in /usr/local/bin and just symlink it into /usr/bin. For git-annex this is a chain of two symlinks:

    ewen@nas01:~$ ls -l /usr/bin/git-annex
    lrwxrwxrwx 1 root root 24 May 28 11:57 /usr/bin/git-annex -> /usr/local/bin/git-annex
    ewen@nas01:~$ ls -l /usr/local/bin/git-annex
    lrwxrwxrwx 1 root root 45 May 28 11:01 /usr/local/bin/git-annex -> /volume1/thirdparty/git-annex.linux/git-annex

    which is slightly inefficient, but still convenient for restoring later.

  • Now is a convenient time to set up ssh key access to the Synology NAS, by creating ${HOME}/.ssh/authorized_keys as usual. Since we do not need a special key to trigger a special hard coded path to git-annex-shell (because it is on the PATH) you can use your regular key if you want rather than a dedicated "git-annex on Synology NAS" key.

    Ensure that the permissions on the ${HOME}/.ssh directory and the authorized_keys file are appropriately locked down so that sshd will trust them, eg:

    chmod go-w .
    chmod 2700 .ssh
    chmod 400 .ssh/authorized_keys

    and then you should be able to ssh to the NAS with key authentication; if it does not work use "ssh -v ..." to figure out the error, which is most likely a permissions problem like:

    debug1: Remote: Ignored authorized keys: bad ownership or modes for directory /volume1/homes/ewen

    because the permissions on the default created directories are very permissive (and would allow anyone to create a ssh authorized key entry), so sshd will not trust the files until the permissions are corrected.

  • All going well at this point you should be able to verify that you can reach all the necessary programs from a non-interactive ssh session with something like:

    for FILE in git-annex git-annex-shell git-annex-wrapper git git-shell git-receive-pack git-upload-pack; do
        ssh NAS "which ${FILE}"
    done

    and get back answers like:


    if one or more of those is missing from the output you will want to figure out why before continuing.

  • To centralise my git-annex storage, I created an "annex" share through the Synology NAS web interface (DSM) in Control Panel -> Shared Folder. This created a /volume1/annex directory.

  • To make that easily accessible, I created a top level symlink to it:

    sudo ln -s /volume1/annex /


    ewen@nas01:~$ ls -l /annex
    lrwxrwxrwx+ 1 root root 14 May 28 12:10 /annex -> /volume1/annex

    This matches the pattern I use on some other machines.
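
The reachability check for non-interactive shells described above can also be run against an explicit search path, without needing to log in each time. This is only a sketch: missing_from_path is a hypothetical helper name, and the search path you pass it should be whatever PATH your own NAS reports for a non-interactive ssh session:

```shell
# Print the name of each listed command that is NOT found in SEARCH_PATH.
# missing_from_path is a hypothetical helper name used for illustration only.
# Usage: missing_from_path SEARCH_PATH command [command ...]
missing_from_path() {
    search_path="$1"
    shift
    for tool in "$@"; do
        # Use a subshell so the caller's own PATH is left untouched
        ( PATH="${search_path}"; command -v "${tool}" >/dev/null 2>&1 ) ||
            printf '%s\n' "${tool}"
    done
}
```

Run on the NAS with the minimal non-interactive PATH and the list of git and git-annex programs, it should print nothing once all the symlinks are in place; any name it does print is one that a non-interactive ssh session will not find.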

Once all this setup is done, git-annex can be used effectively like on any other Linux/Unix machine. For instance you can "push a clone" onto the NAS using "git bundle" and "git clone" from the bundle, and then add that as a "git remote" and use "git annex sync" and "git annex copy ..." to copy into it.
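
The "push a clone" idea can be sketched with plain git commands. This is a sketch under assumptions rather than the exact commands used for this setup: bundle_clone is a hypothetical helper name, both repositories are local paths here (in practice the bundle file would be copied to the NAS, eg with scp, and the clone run there), and the git-annex steps that follow are left out:

```shell
# Create a bundle of every ref in SRC_REPO, then clone a new repository from it.
# bundle_clone is a hypothetical helper name; BUNDLE_FILE must be an absolute
# path, because "git -C" changes directory before the bundle is written.
# Usage: bundle_clone SRC_REPO BUNDLE_FILE DEST_REPO
bundle_clone() {
    src_repo="$1"
    bundle_file="$2"
    dest_repo="$3"
    git -C "${src_repo}" bundle create "${bundle_file}" --all &&
        git clone "${bundle_file}" "${dest_repo}"
}
```

The new clone will have the bundle file as its "origin", so on a real setup you would normally point the remotes at each other afterwards with "git remote add" before running "git annex sync".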

The "standalone git-annex" will probably need updating periodically (for bug/security fixes, new features, etc), but it should be possible to do that simply by replacing the unpacked tarfile contents as required; everything else points back to that directory. (Possibly the locale generation step might need to be done by hand again.)

Finally for future reference, it is also possible to run a Debian chroot on the Synology NAS, which would open up even more possibilities for using the NAS as a more general purpose machine.

ETA 2017-06-25: Beware that (certain?) Synology updates will rebuild the root file system, and/or clean out unexpected symlinks. So after an update, or reboot, it is necessary to redo:

sudo ln -s /volume1/annex /
cd /usr/bin
sudo ln -s /usr/local/bin/git-annex* .

before git-annex-shell will be automatically found, and the nas01:/annex/... paths will work again.

I have worked around this by creating /usr/local/bin/activate-git-annex containing:

#! /bin/sh
# Relink git-annex paths onto the Synology root file system
if [ ! -e /annex ]; then
  sudo ln -s /volume1/annex /
fi
cd /usr/bin && for FILE in /usr/local/bin/git-annex*; do
                  if [ ! -e "$(basename "${FILE}")" ]; then
                    sudo ln -s "${FILE}" .
                  fi
                done

So that, when git-annex breaks after an upgrade, I can just run:

activate-git-annex

from an interactive shell and it will all work again. (/usr/local/bin seems to survive upgrades, but unfortunately, as noted above, is not in the default path for a non-interactive shell, so it is not a complete solution. However it is in the default path for an interactive shell hence the simple command above.)

ETA 2017-07-16: Fixed up logic of "check if already done" test in /usr/local/bin/activate-git-annex. It appears that will need to be run every time that the Synology is updated, at least in a way that will cause it to reboot.

Posted Sun May 28 13:09:13 2017 Tags:

Imagine, not entirely hypothetically, that you have a VMware ESXi 6.0 host that has disconnected from VMware vCenter due to an issue with the management agents on the host, but the virtual machines on the host are still running. Both the affected host and the VMs will show as "disconnected" in vCenter in this case. Attempts to reconnect from the vCenter side fail.

The usual next steps are to check the management network (eg, from the ESXi DCUI) -- it was fine -- and then to try restarting the management agents in ESXi from the DCUI or a ssh shell, which in this not entirely hypothetical case hung for (literally) hours attempting to restart them. When you reach this point the usual advice is that the host has to be rebooted -- which is complicated because it has production VMs on it, and you cannot just vMotion those VMs to somewhere else.... because the connection to vCenter is broken :-(

If you are lucky enough to have:

  • ssh access to the affected ESXi host, so you can easily tell what is running there

  • your VMs hosted on shared storage

  • at least one other working ESXi host with capacity for the affected VMs connected to vCenter and the shared storage

  • ssh access to the working ESXi host

then there may be a relatively non-disruptive way out of this mess where you can cleanly shut down each VM and then start it up again on the working host even when the management agents are not working any more. (In our not entirely hypothetical case we got no response at all to any esxcli or vim-cmd or similar commands, including commands like df -- presumably because they all talk to the local management agents, which were wedged.)

To be able to move the affected VMs with the least downtime like this you need:

  • to know the path on the shared storage to the VM's .vmx file (typically something like /vmfs/volumes/..../VM/VM.vmx)

  • to know the port group on the vDS (distributed switch) for each interface of the VM

  • have a login (or contact that can login) to the VM and shut it down from within the guest OS

Hopefully you can find the first two in your provisioning database (in a larger environment), or someone will remember where the VMs are stored (in a smaller environment), otherwise you will need to find them by manually browsing your storage and vDS in vCenter. Do find out both of these things before shutting down the VM to minimise downtime of the affected VMs.

To move the VM in this manual way the approach is then:

  • Log into vCenter

  • Find a new unused port ID to use for the VM on the new working host that is in the same port group (normally the host/vCenter will do this for you, but because we bypass vCenter to register the VM this does not happen automatically). To do this go to the Networking page in vCenter, and look in the "Ports" tab of the relevant port group for an empty Port ID line, then make a note of that Port ID number. If you have multiple interfaces you will need to do this for each interface of the affected VM. (If you do not do this, the VM will start up with its networking disconnected, and you will get the error Invalid configuration for device '0' or similar, which will lead to unnecessary downtime. If you really cannot figure out the appropriate vDS port groups, you can leave this step until after you have shut down/re-registered the VM, but there will be more downtime.)

  • ssh into the new working host where you plan to start the VM, and prepare the command:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    ready to run as soon as it is time. This will register the VM on the new ESXi host, which will then tell vCenter "hey, I have this VM now" and the VM will no longer show as disconnected in vCenter. (I believe this works because it manually replicates what, eg, VMware HA does.)

  • ssh into the new working host in another window, and run:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    to check for the VM.vmx.lck file indicating the VM is running; it should be present at this point as the VM is still running on the affected host. Be ready to run this command again once the VM is shut down.

  • Now log into the guest OS (via ssh, RDP, etc) and ask the guest OS to shut down (or call your contact and ask them to do that). Monitor the progress shutting down by, eg, pinging the external IP.

  • Once you see ping stop responding, wait a few seconds then re-run your:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    on the new working host. With luck you will see that there is no VM.vmx.lck file left a few seconds after it stops responding to ping, indicating that the shutdown completed successfully.

  • Once the VM.vmx.lck file is gone, hit enter in the other window where you prepared the:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    command to register it on the new working host.

  • Then find the VM in vCenter -- it should no longer show as disconnected. Edit its settings, and for each network interface click on the "Advanced Settings" link, and then change the Port ID of the vDS port it is connected to from the old one (tied to the broken host) to the free Port ID in the same port group that you found above. Save your changes.

  • Hit the Play button on the VM in vCenter. All going well, the VM should start normally, and connect to the network. Wait for the guest OS to boot and then check (or have your contact check) that it is working. (If it does not connect to the network double check the Port ID that you set, and the guest OS -- by this point you should be able to open the VM's console again to look.)
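
Watching for the lock file to disappear can be scripted rather than re-running ls by hand. A minimal sketch, assuming a POSIX-style shell is available on the working host; wait_until_gone is a hypothetical helper name, and the one second default polling interval is an arbitrary choice:

```shell
# Block until the given path no longer exists, polling every INTERVAL seconds.
# wait_until_gone is a hypothetical helper name; INTERVAL defaults to 1 second.
# Usage: wait_until_gone PATH [INTERVAL]
wait_until_gone() {
    watched_path="$1"
    interval="${2:-1}"
    while [ -e "${watched_path}" ]; do
        sleep "${interval}"
    done
}
```

On the new working host this could be chained with the prepared command, eg "wait_until_gone /vmfs/volumes/PATH_TO_VM/VM/VM.vmx.lck && vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx", so that the registration runs as soon as the lock clears. It does not replace checking that the guest really has shut down cleanly.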

All going well the downtime for each VM is about 30 seconds longer than the time it takes to shut down the guest OS in the VM, and start up the guest OS in the VM again -- so best case 1-2 minutes downtime.

Lather, rinse, and repeat to move the other VMs. I would suggest doing only one at a time to minimise the risk of getting confused about which step you are up to on which VM, and also minimise the downtime for each individual VM due to being distracted by working on a different VM.

If you are very lucky then after a while maybe you will manage to shutdown the VM that caused the host management agents to wedge/not start, and then the host management agents will start and the host will reconnect to vCenter. If so, you can then vMotion the remaining VMs off the affected host as normal. Otherwise keep going with the manual procedure until the host is empty. (You can tell it is empty because you no longer have disconnected VMs in vCenter; also "ps | grep vcpu | cut -f 2 -d : | sort | uniq" makes an acceptable substitute for "esxcli vm process list" -- the latter of which will just hang in this case.)

Once the affected host is empty either reboot it (or power cycle it if you cannot get in to reboot it), or if it did reconnect to vCenter, put it into maintenance mode and then reboot it. That way all the management agents and the vmkernel get a fresh start. If the VMware host logs (eg hostd.log, vmkernel.log, vpxa.log) do not show an obvious hardware cause of the problems -- so that it seems like a bug was triggered instead -- then it is probably safe to put the affected host back into production once it has been rebooted/power cycled.

Thanks to devnull4 and routereflector for the very useful hints to this process (other useful information). In our not entirely hypothetical situation none of the esxcli commands or vim-cmds to run on the non-working host worked -- they all just hung indefinitely -- so we skipped all of those, and just shut down the guests from within the guest OS. (As best we can tell from context it seems like maybe something confused CBT on a specific guest on this host, which caused a pile up of processes waiting on a lock, which caused all the symptoms. Moving a VM that we found out was being backed up via CBT tracking at the time seemed to be the magic step that freed everything else up. The ESXi hosts affected were on the nearly-latest patch level, but we plan to patch them up to date in a maintenance window soon in case the bug we seem to have stumbled across has been fixed.)

The moral of this story is: if you find yourself in this situation, try to start the SSH shell before trying to restart the Management Agents. It will give you a second way to look at the affected host if the Management Agents do not just restart. In our case it took about 5-10 minutes before the SSH shell started, and during that time the DCUI did not respond to keyboard input. But by contrast restarting the Management Agents through the DCUI took literally 3 hours, during which the DCUI was unusable -- so if we had not started the SSH shell first we would have had no visibility.

Posted Mon May 15 13:07:30 2017 Tags:

For many years, including on OS X 10.9 (Snow Leopard) and on OS X 10.11 (El Capitan), I have relied on to automatically import photos from my iPhone when it is connected to my Mac.

Unfortunately stopped working when I upgraded to iOS 10.3.1 -- it launches, but does not import anything, when the phone is connected while logged in (whether or not the phone is locked). It seems like this issue might have started with iOS 10.2 -- but I went straight from 9.x to 10.3.1, so do not know if 10.2 or 10.3 first introduced the problem.

In my case even logging in with the phone connected and unlocked does not seem to work for me on OS X 10.11 (El Capitan). From reading bug reports (and some hints in a Reddit thread) it appears the issue is that is starting correctly, but is unable to get a list of photos to import -- which is consistent with opening, but showing a 100% progress bar immediately then closing.

Manually importing into (with auto-launched) appears to be Apple's official answer, but I do not want a manual import -- or to use Supposedly the iCloud My Photo Stream feature can get in the way, but I do not have any iCloud features turned on, and other users report it made no difference for them. So my current pick is that something changed around enumerating the list of photos on the iPhone, and no one bothered to upgrade to match :-( Some users report upgrading to recent versions of macOS Sierra (but only recent versions) made it work again -- however I am not about to embark on a major OS upgrade at present, just in the hope of making this one thing work again.

For now I am trying to remember to periodically use "Image" to manually import new images off my iPhone as a backup (in addition to the automatic copy in the iTunes backup of my phone). Sadly this is rather less convenient than the automatically running import.

To use "Image" to do a manual import:

  • connect the iPhone to the computer

  • open the "Image"

  • select the phone from the list of devices, and if necessary enter the passcode on the phone to unlock it and grant access

  • highlight the photos since the last import (:-( ), eg determined by the file names or date

  • make sure the destination setting at the bottom is "Pictures"

  • make sure that the "Make subfolders per camera" setting is ticked (to replicate how I had set up)

  • make sure "Delete after import" is not ticked (in the bottom left area)

Then click on "Import" at the bottom right.

It appears that "Image" may recognise the photos which are already imported, once everything else is set up (eg, they have a green tick), so the step of selecting the new photos may not be necessary -- or may be able to be guided by those green ticks -- but for now I am still doing that step of determining what to import manually, in order to have more control over what is imported.

Hopefully eventually there is an OS X 10.11 (El Capitan) update that restores this useful automatic import functionality.

Posted Sun May 14 11:22:02 2017 Tags: