Fundamental Interconnectedness

This is the occasional blog of Ewen McNeill. It is also available on LiveJournal as ewen_mcneill, and Dreamwidth as ewen_mcneill_feed.

Earlier this week "Do Not Reply" let me know, in an email titled "Return Available", that it was time to file my company GST return:

You have GST returns for period ending 30 September 2017, due
30 October 2017, now available for filing for the following IRD numbers:

So this weekend I finished up the data entry and calculated the data needed to file the GST return, as usual. (Dear IRD, if you are listening, perhaps "Do Not Reply" is not the best sender name for official correspondence? Maybe you could consider, eg, "IRD Notification Service"? Also "Return Available" seems like a confusing way to say "please file your GST return this month". Just saying.)

Of note for understanding what transpires below, I was forced to register for "MyIR" a couple of years ago to request that IRD provide a Tax Residency Certificate; other countries publish information about these, but IRD only provide a guide to determining tax residency, and needed the concept of a Tax Residency Certificate explained to them, including the fields required by their Double Taxation Treaty partners.

Because of that "MyIR" registration, I am now forced to file GST returns online (once you have registered, filing on paper is no longer an option). Previously the online filing has been relatively simple, but this weekend while the filing went okay, trying to exit out of the "MyGST" part of the "MyIR" website of the Inland Revenue Department turned into a comedy of errors:

  1. The "Log Off" button in the "MyGST" site, something you would hope would be regularly tested, failed to work. It tries to access (via Javascript obscurity):

    https://services.ird.govt.nz/myirlogout.jsp
    

    which seems a plausible enough URL, but actually ends up with:

    Secure Connection Failed
    
    The connection to the server was reset while the page was loading.
    

    every time I tried. (The "Logout" link on the "MyIR" site, also loading via Javascript, went to a different ird.govt.nz site, but did actually work; it is unclear if logging out of "MyIR" also logs you out of "MyGST", as they are presented as separate websites.)

  2. Since a working "Log Off" function seemed important to a site that holds sensitive, potentially confidential, information, I tried to report the issue. Conveniently the "MyGST" site has a handy "Send us a message" link on its front page, so I attempted to use that. However I found:

    • It will not accept ".txt" attachments (to illustrate the problem): "File Type .txt is not allowed" with no indication of why it is not allowed. (I assume "not on the whitelist", but that raises the questions (a) why?! and (b) what "File Type"s are allowed. Experimentally I determined that PNG and PDF were allowed.)

    • There is no option to contact about the website, only "something else".

    • When you "Submit" the message you've written, the website simply returns to the "MyGST" home page with no indication whether or not the message was sent, where you might see the sent message, and no copy of the sent message emailed to you. (I tried twice; same result both times.)

    So that did not seem very promising.

    For the record, I eventually found -- much later -- that you can check if the message has been sent by:

    • Going to the "Activity Centre" tab of "MyGST"

    • Clicking on the "More..." button next to the "Messages" heading

    • Clicking on the "Outbox" tab of that mailbox

    and you will see your messages there, and can click on each one to view them. (Which showed that each of my two attempts had apparently been sent twice, despite the website not informing me it had done so; oops. It is unclear to me how they ended up each being sent twice; I did not, eg, click through a "resend POST data" dialogue.)

  3. When it was unclear if "Send us a message" in "MyGST" worked, I thought the next best option would be to go back to the "MyIR" site, and use "Secure mail" which is IRD's preferred means of contact (as I found out when, eg, trying to get a Tax Residency Certificate a couple of years ago). Unfortunately when I attempted to use that I found:

    • There is no option to choose "Website" or "GST" from the form at all, so I had to send an "All Other" / "All Other" message;

    • There was no option to add attachments to the message, so I could not include the screenshots/error output; and

    • When I submitted that message, I got a generic 404 error!

      https://www1.e-services.ird.govt.nz/error/error404.html
      

      which told me:

      Contact us
      Page not available
      
      The page you are trying to access is not available on our website.
      
      If you have reached this page by following a link on our website
      rather than using a bookmark, please take a moment to e-mail the
      General comments form with the details of the page you were trying
      to access.
      

    The "MyIR" "Secure Mail" feature does have an obvious "Sent" tab, so in another window I was quickly able to check that it had not in fact been sent. At this point I assumed I was 0 for 3 -- not a great batting average.

  4. Still, the 404 page did offer a link to the General Comments page:

    https://www.ird.govt.nz/cgi-bin/form.cgi?form=comments

    so it seemed worth reporting the accumulating list of problems. That "General Comments" page is (naturally) very general, but:

    • "Website" is not a category they have anticipated receiving comments about (so "Other" it is again); and

    • Your choices for response are:

      • No response required

      • In writing by mail

      • Over the phone

      • In writing by fax

      And that is it: no option to ask for a response by email. But if your 1990s fax machine is still hooked up and working then IRD is ready to respond to your online communication with your preferred option! (It appears that, based on your choice here, the second stage of the form asks you to enter different details; but "In writing by mail" does not even collect a postcode!)

      In fairness, the second stage of the form also allowed an optional email address to be entered -- which I did -- so possibly they might treat one of the above as "by email"; it is just not at all obvious to the user.

    • The box for entering comments was 40 characters wide by 4 lines deep -- there are programmable calculators with a larger display! (In fairness Firefox on OS X at least does allow resizing this; but nothing says "we hope you do not have much to say" like allowing an old-Tweet's worth of text to be visible on the screen at once.)

    Anyway, undeterred by all of this, I reported in brief the three problems I had encountered so far: (1) "MyGST" Log Off function broken; (2) "MyGST" "Send us a message" function apparently not working; (3) "MyIR" "Secure Mail" sending resulting in a 404.

    That one was successful, giving me a "comment sent" confirmation page, although without any tracking number or other identifier (the closest to an identifier is "Your request was sent on Sunday 8 October 2017 at 14:40"). Sadly my neatly laid out bullet point list of issues encountered was turned into a single line of terribly formatted run-on text; it appears they were serious about people keeping their comments to old-Tweet length!

  5. After this experience I was surprised to find that the only working thing -- the General Comments Form -- offered me a chance to:

    Send feedback about this form

    Since I seemed to be on a yak shaving mission to use every feedback form on the site, who could resist?! I (successfully!) offered them anonymous feedback that:

    • In 2017, offering "response by email" might be a useful update;

    • Perhaps "In writing by fax" could be retired;

    • 40x4 character comment forms are... rather small and difficult to use.

    Only I had to do so much more tersely because the "Online Form Feedback" comment field was itself 40x4 characters.

On the plus side:

  • I did manage to file my GST return;

  • Eventually if one is patient enough, one does get auto-logged out of the "MyIR" site, so maybe one does get auto-logged out of the "MyGST" site as well;

  • Apparently I did manage to report the original "MyGST" "Log Off" problem after all (and hopefully someone at IRD can merge those into a single ticket, rather than having four people investigating the problem).

Now to actually pay my GST amount due.

If IRD do respond with anything useful I will add an update to this post to record that, eg, some of the above issues have been fixed. At least two of them (the "MyGST" "Log Off" issue and the "MyIR" "Secure Mail" sending) seem likely to be encountered by other users and fixed.

ETA 2017-10-17: IRD responded to my second contact attempt (in MyGST) with:

"""Good Afternoon Ewen.

Thank you for your email on the 8th October 2017.

seeing you are having issues with the online service please
contact us on 0800 227 770.

As this service doesn't deal with these issues, This service
is for web message responses for GST accounts. We have forwarded
this message to be directed to our Technical Services as this
is a case for them."""

which, at one week to reply, is much better than their estimated reply time. I have assumed that "forwarded [...] to our Technical Services [...]" will be sufficient to get the original reports in front of someone who might be able to actually investigate/fix them, and not done anything further (calling an 0800 number for (frontline) "technical support" seems unlikely to end well over such a technical issue).

The "MyGST" "Log Off" functionality is still broken though. The "MyIR" logout functionality is slow, but does eventually work.

However going back to an earlier GST page after using the "MyIR" log out functionality, and reloading, still shows I am in my "MyGST" account, and can access pages in "MyGST" that I previously had not viewed in this session. So it appears the two logoff functions are separate -- even though they are controlled by a single "logon" screen. By contrast, trying to go to a "MyIR" page does correctly show the login screen again.

So we learn that logging off "MyGST" separately is important, and that it is still broken (at least in Firefox 52 LTS on OS X 10.11, and Safari 11 on OS X 10.11; both retested today, 2017-10-17).

Posted Sun Oct 8 16:47:59 2017 Tags:

I have a Huawei HG659 Home Gateway supplied as part of my Vodafone FibreX installation. Out of the box, since early 2017, the Vodafone FibreX / Huawei HG659 combination has natively provided IPv6 support. This automagically works on modern mac OS:

ewen@osx:~$ ping6 google.com
PING6(56=40+8+8 bytes) 2407:7000:9b0e:4856:b971:8973:3fe3:1a51 --> 2404:6800:4006:804::200e
16 bytes from 2404:6800:4006:804::200e, icmp_seq=0 hlim=57 time=46.574 ms
16 bytes from 2404:6800:4006:804::200e, icmp_seq=1 hlim=57 time=43.953 ms

and Linux:

ewen@linux:~$ ping6 google.com
PING google.com(syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e)) 56 data bytes
64 bytes from syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e): icmp_seq=1 ttl=57 time=44.8 ms
64 bytes from syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e): icmp_seq=2 ttl=57 time=43.8 ms

to provide global IPv6 connectivity, for a single internal VLAN, without having to do anything else.

Vodafone delegates a /56 prefix to each customer, which in theory means that it should be possible to further sub-delegate that within our own network for multiple subnets -- most IPv6 features will work down to /64 subnets. I think the /56 is being provided via DHCPv6 Prefix Delegation (see RFC3633 and RFC3769; see also OpenStack Prefix Delegation discussion).
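
For a sense of scale, a /56 leaves 64 - 56 = 8 bits of subnetting below the /64 boundary, ie 256 possible /64 subnets:

echo $(( 2 ** (64 - 56) ))    # 256 /64 subnets available within a /56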

Recently I started looking at whether I could configure an internal Mikrotik router to route dynamically-obtained IPv6 prefixes from the Huawei HG659's /56 pool, to create a separate -- more isolated -- internal subnet. A very useful Mikrotik IPv6 Home Example provided the Mikrotik configuration required, although I did have to update it slightly for later Mikrotik versions (tested with RouterOS 6.40.1).

Enable IPv6 features on the Mikrotik if they are not already enabled:

/system package enable ipv6
/system package print

If the "print" shows you an "X" with a note that it will be enabled after reboot, then also reboot the Mikrotik at this point:

/system reboot

After that, you should have an IPv6 Link Local Address on the active interface, which you can see with:

[admin@naos-rb951-2n] > /ipv6 addr print
Flags: X - disabled, I - invalid, D - dynamic, G - global, L - link-local
 #    ADDRESS                                     FROM-... INTERFACE        ADV
 0 DL fe80::d6ca:6dff:fe50:6c44/64                         ether1           no
[admin@naos-rb951-2n] >

(The IPv6 Link Local addresses are recognisable as being in fe80::/64, and on the Mikrotik will show as "DL" -- dynamically assigned, link local.)
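
These link-local addresses are only usable on the directly attached segment, and from another host they need an explicit interface ("zone") qualifier; for instance, pinging the Mikrotik's link-local address from an OS X machine on the same LAN looks something like this (a sketch -- the %en6 interface name will differ per machine):

ping6 -c 2 fe80::d6ca:6dff:fe50:6c44%en6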

Once that is working, configure the Mikrotik IPv6 DHCPv6 client to request a Prefix Delegation with:

/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=prefix

Unfortunately when I tried that, it never succeeded in getting an answer from the Huawei HG659. Instead the status was stuck in "searching":

[admin@naos-rb951-2n] > /ipv6 dhcp-client print detail
Flags: D - dynamic, X - disabled, I - invalid
 0    interface=ether1 status=searching... duid="0x00030001d4ca6d506c44"
      dhcp-server-v6=:: request=prefix add-default-route=yes use-peer-dns=yes
      pool-name="ipv6-local" pool-prefix-length=64 prefix-hint=::/0
[admin@naos-rb951-2n] >

which makes me think that while the Huawei HG659 appears to be able to request an IPv6 prefix delegation (with a DHCPv6 client) it does not appear to provide a DHCPv6 server that is capable of prefix delegation, which rather defeats the purpose of having a /56 delegated :-(

Specifying:

/ipv6 dhcp-client remove 0
/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=address,prefix

which appears to be the syntax to request both an interface address and a prefix delegation, did not work any better, still getting stuck with a status of "searching...":

[admin@naos-rb951-2n] > /ipv6 dhcp-client print  detail
Flags: D - dynamic, X - disabled, I - invalid
 0    interface=ether1 status=searching... duid="0x00030001d4ca6d506c44"
      dhcp-server-v6=:: request=address,prefix add-default-route=yes
      use-peer-dns=yes pool-name="ipv6-local" pool-prefix-length=64
      prefix-hint=::/0
[admin@naos-rb951-2n] >

If I delete that and just request an address:

/ipv6 dhcp-client remove 0
/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=address

then the DHCPv6 request does succeed very quickly:

[admin@naos-rb951-2n] > /ipv6 dhcp-client print
Flags: D - dynamic, X - disabled, I - invalid
 #    INTERFACE                     STATUS        REQUEST
 0    ether1                        bound         address
[admin@naos-rb951-2n] >

and there is an additional IPv6 address visible for that interface:

[admin@naos-rb951-2n] > /ipv6 addr print
Flags: X - disabled, I - invalid, D - dynamic, G - global, L - link-local
 #    ADDRESS                                     FROM-... INTERFACE        ADV
 0 DL fe80::d6ca:6dff:fe50:6c44/64                         ether1           no
 1 IDG ;;; duplicate address detected
      2407:xxxx:xxxx:4800::2/64                            ether1           no
[admin@naos-rb951-2n] >

Unfortunately the "I" flag and the "duplicate address detected" comment are both very bad signs -- that the address supplied by DHCPv6 is unusable. When I look around other devices on my network I find that they too have that address, including my main OS X 10.11 laptop:

ewen@ashram:~$ ifconfig -a | grep -B 10 ::2 | egrep "^en|::2"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 dynamic
ewen@ashram:~$

and another OS X 10.11 laptop:

ewen@mandir:~$ ifconfig -a | grep -B 9 ::2 | egrep "^en|::2"
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 duplicated dynamic
ewen@mandir:~$

which implies that the Huawei HG659 DHCPv6 server is handing out the same (::2) address to multiple clients (possibly all clients?!) and only the first client to make the request has a reasonable chance of working (in theory the others will discover via RFC7527 Duplicate Address Detection that the address is already in use, and invalidate it, to allow the first client to work).

From all of this I conclude that the Huawei HG659 DHCPv6 server will basically only work in a useful fashion for a single DHCPv6 client that wants a single address -- so it is almost useless. In particular the DHCPv6 server does not appear to be a way to get use of parts of the IPv6 /56 delegation provided by Vodafone.

Yet IPv6 global transit does work from multiple OS X and Linux devices on my home network -- so they are clearly not (solely) reliant on IPv6 DHCPv6 working properly.

The reason they have working IPv6 transit is that OS X and Linux will also do SLAAC -- Stateless Address Auto-Configuration (RFC4862) -- to obtain an IPv6 address and default route. SLAAC uses the IPv6 Neighbor Discovery Protocol (RFC4861) to determine the IPv6 address prefix (/64), and a Modified EUI-64 algorithm (described in RFC5342 section 2.2) to determine the IPv6 address suffix (64 bits).
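
As a concrete sketch of the Modified EUI-64 step (unpacked bit by bit further below), a hypothetical MAC address of 68:5b:35:12:34:56 (the last three octets are made up) turns into the 64-bit interface identifier 6a5b:35ff:fe12:3456 -- flip the Universal/Local bit of the first octet and insert ff:fe in the middle:

MAC="68:5b:35:12:34:56"                  # hypothetical example address
IFS=: read -r a b c d e f <<< "$MAC"
a=$(printf '%02x' $(( 0x$a ^ 0x02 )))    # flip the Universal/Local bit
printf '%s%s:%sff:fe%s:%s%s\n' "$a" "$b" "$c" "$d" "$e" "$f"
# prints 6a5b:35ff:fe12:3456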

Providing the Huawei HG659 is configured to send IPv6 RA ("Router Advertisement") messages (Home Interface -> LAN Interface -> RA Settings -> Enable RA is ticked), then SLAAC should work. There are two other settings:

  • "RA mode": automatic / manual. In automatic mode it appears to pick a prefix from the /56 that the IPv6 DHCPv6 Prefix Delegation client obtained from Vodafone -- apparently the "56" prefix (at least in my case), for no really obvious reason. In manual mode you can specify a prefix, but that does not seem very useful when the larger prefix you have is dynamically allocated....

  • "ULA mode": disable / automatic / manual. This controls the delegation of IPv6 Unique Local Addresses (RFC4193), which are site-local addresses in the fc00::/7 block. By default it is set to "automatic" which appears to result in the Huawei HG659 picking a prefix block at random (as indicated by a fd00::/8 address). "manual" allows manual specification of the block to use, and "disable" I assume turns off this feature.
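
If you ever want to use the "manual" ULA mode, generating your own RFC4193 prefix is simple enough; the RFC suggests a hash-based derivation, but any reasonably random 40 bits serves the same purpose. A quick-and-dirty sketch:

printf 'fd%02x:%02x%02x:%02x%02x::/48\n' $(od -An -N5 -tu1 /dev/urandom)
# eg fd50:01d9:5e3e::/48 (cf the fd50:1d9:5e3e:8300:... addresses shown below,
# with leading zeros suppressed when displayed)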

Together these four features (IPv6 Link Local Addresses, IPv6 DHCPv6, IPv6 SLAAC, RFC4193 Unique Local Addresses) explain most of the IPv6 addresses that I see on my OS X client machines. For instance (some of the globally unique /56 prefix replaced with xxxx:xxxx, and the last three octets of the SLAAC addresses replaced by yy:yyyy for privacy):

ewen@ashram:~$ ifconfig en6 | egrep "^en|inet6" 
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 fe80::6a5b:35ff:feyy:yyyy%en6 prefixlen 64 scopeid 0x4
        inet6 2407:xxxx:xxxx:4856:6a5b:35ff:feyy:yyyy prefixlen 64 autoconf
        inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary
        inet6 fd50:1d9:5e3e:8300:6a5b:35ff:feyy:yyyy prefixlen 64 autoconf
        inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 dynamic
ewen@ashram:~$

In this list:

  • The fe80::6a5b:35ff:feyy:yyyy%en6 address is the IPv6 Link Local Address, derived from the prefix fe80::/64 and an EUI-64 suffix derived from the interface MAC address (as described in RFC2373). It is approximately the first 3 octets of the MAC address, then ff:fe, then the last 3 octets of the MAC address -- but the Universal/Local bit of the MAC address is inverted in IPv6, so as to make ::1, ::2 style hand-created addresses end up automatically marked as "local". (While this seems clever, with perfect hindsight it would perhaps have been better if the IEEE MAC address Universal/Local flag was a Local/Universal flag with the bit values inverted, for the same reason... and perhaps better positioned in the bit pattern.) In this case 0x68 in the MAC address becomes 0x6a:

    ewen@ashram:~$ perl -le 'printf("%08b\n", 0x68);'
    01101000
    ewen@ashram:~$ perl -le 'printf("%08b\n", 0x6a);'
    01101010
    ewen@ashram:~$
    

    by setting this additional (7th from the left) bit.

  • The 2407:xxxx:xxxx:4856:6a5b:35ff:feyy:yyyy address is the globally routable IPv6 SLAAC address, derived from the SLAAC /64 prefix obtained from the IPv6 Router Advertisement packets and the EUI-64 suffix as described above (where the SLAAC /64 prefix provided by the Huawei HG659 itself came from an IPv6 DHCPv6 Prefix Delegation request made by the Huawei HG659). This address is recognisable by the "autoconf" flag indicating SLAAC, and the non-fd prefix.

  • The fd50:1d9:5e3e:8300:6a5b:35ff:feyy:yyyy address is the Unique Local Address (RFC4193), derived from a randomly generated prefix in fd00::/8 and the EUI-64 suffix as described above. This address is recognisable by the "autoconf" flag indicating SLAAC, and the fd prefix. (See also "3 Ways to Ruin Your Future Network with IPv6 Unique Local Addresses" Part 1 and Part 2 -- basically by re-introducing all the pain of NAT to IPv6, as well as all the pain of "everyone uses the same site-local prefixes".)

  • The 2407:xxxx:xxxx:4800::2 address is obtained from the Huawei HG659 DHCPv6 server, and consists of the first /64 in the /56 that the Huawei HG659 DHCPv6 client obtained via Prefix Delegation, and a DHCP assigned suffix, starting with ::2 (where I think the Huawei HG659 itself is ::1, but it does not respond to ICMP with that address). This address is recognisable by the "dynamic" flag indicating DHCPv6.

    Unfortunately as described above the Huawei HG659 DHCPv6 DHCP server is broken (at least in Huawei HG659 firmware version V100R001C206B020), and mistakenly hands out the same DHCP assigned suffix to multiple clients. This means that only the lucky first DHCPv6 client on the network will have a working DHCPv6 address. (It also appears, as described above, that it does not support DHCPv6 Prefix Delegation.)

That explains all but two of the IPv6 addresses listed. The remaining two have the "temporary" flag:

ewen@ashram:~$ ifconfig en6 | egrep "^en|inet6" | egrep "^en6|temporary"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary
         inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary
ewen@ashram:~$

and those are even more special. IPv6 Temporary Addresses are created to reduce the ability to track the same device across multiple locations through the SLAAC EUI-64 suffix -- which, being predictably derived from the MAC address, will stay the same across multiple SLAAC prefixes. Mac OS X (since OS X 10.7 -- Lion) and Microsoft Windows (since Windows Vista) will generate and use them by default.

The relevant RFC is RFC4941, which defines "Privacy Extensions for Stateless Address Autoconfiguration in IPv6". Basically it defines a method to create additional ("temporary") IPv6 addresses, following a method like IPv6 SLAAC, which are not derived from a permanent identifier like the ethernet MAC address -- a randomly generated IPv6 suffix is used instead of the EUI-64 suffix. Amusingly the suggested algorithm appears to be old enough to use the (now widely deprecated) MD5 hash algorithm as part of the derivation steps. (These temporary/"Privacy" addresses are supported on many modern OS.)
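
Whether a machine generates these temporary addresses at all is controlled by a sysctl; a quick way to check the current setting (a sketch -- the OIDs below are the usual OS X and Linux names):

sysctl net.inet6.ip6.use_tempaddr        # OS X: 1 = generate temporary addresses
sysctl net.ipv6.conf.all.use_tempaddr    # Linux: 0 = off, 1 = generate, 2 = generate and prefer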

These RFC4941 "temporary" addresses normally have a shorter lifetime, which can be seen on OS X with "ifconfig -L":

ewen@ashram:~$ ifconfig -L en6 | egrep "^en6|temporary"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary pltime 3440 vltime 7040
        inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary pltime 3440 vltime 7040
ewen@ashram:~$

but on the system I checked both the temporary and the permanent SLAAC addresses had the same pltime/vltime values, which I assume are derived from the SLAAC validity times. The "pltime" is the "preferred" lifetime, and the "vltime" is the "valid" lifetime; I think that after the preferred lifetime an attempt will be made to generate/renew the address, and after the valid lifetime the address will be expired (assuming it is not renewed/replaced before then).

It appears that in macOS 10.12 (Sierra) and later, even the non-temporary IPv6 addresses no longer use the EUI-64 approach to derive the address suffix from the MAC address -- which means the "permanent" addresses also changed between 10.11 and 10.12. I do not currently have a macOS 10.12 (Sierra) system to test this on. I found a claim these are RFC 3972 "Cryptographically Generated Addresses", but there does not seem to be much evidence for the exact algorithm used. (There are also suggestions that this is an implementation of RFC7217 "Semantically Opaque Interface Identifiers" which effectively make the IPv6 suffix also depend on the IPv6 prefix. Ie, the resulting address would be stable given the same Prefix, but different for each prefix. See also an IPv6 on OS X Hardening Guide -- from 2015, so probably somewhat out of date now.)

Returning to the problem I started with, configuring a Mikrotik for IPv6, I found that the Mikrotik could have an interface address configured with SLAAC, by setting:

/ipv6 settings set accept-router-advertisements=yes

or:

/ipv6 settings set accept-router-advertisements=yes-if-forwarding-disabled forward=no

(see Mikrotik IPv6 Settings), but at least on 6.40.1 this still does not result in the IPv6 SLAAC address being visible anywhere in the address list, even after a reboot. (Bouncing the interface, or rebooting -- /system reboot -- is required to initiate SLAAC.)

You can check the address did get assigned properly by pinging it from another SLAAC configured system, with the EUI-64 derived suffix:

ewen@ashram:~$ ping6 -c 2 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy
PING6(56=40+8+8 bytes) 2407:xxxx:xxxx:4856:6a5b:35ff:feyy:yyyy --> 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy
16 bytes from 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy, icmp_seq=0 hlim=255 time=0.440 ms
16 bytes from 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy, icmp_seq=1 hlim=255 time=0.504 ms

--- 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy ping6 statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 0.440/0.472/0.504/0.032 ms
ewen@ashram:~$

With the default IPv6 settings:

[admin@naos-rb951-2n] > /ipv6 settings print
                       forward: yes
              accept-redirects: yes-if-forwarding-disabled
  accept-router-advertisements: yes-if-forwarding-disabled
          max-neighbor-entries: 8192
[admin@naos-rb951-2n] >

IPv6 SLAAC will not be performed; but with either of the settings above (after a reboot: /system reboot) SLAAC will be performed.

Other than the UI/display issue, this is consistent with the idea that the WAN interface of a router can be assigned an address using SLAAC, but it is not entirely consistent with the documentation, which says SLAAC cannot be used on routers. It is just that, to be useful, routers typically need IP addresses on multiple interfaces, and the only way to meaningfully obtain those is either IPv6 DHCPv6 Prefix Delegation -- or static configuration.

Since the Huawei HG659 appears not to provide a usable IPv6 DHCPv6 server there is no way to get DHCPv6 Prefix Delegation working internally, which means my best option will be to replace the Huawei HG659 with something else as the "Home Gateway" connected to the Technicolor TC4400VDF modem. Generally people seem to be using a Mikrotik RB750Gr3 (a "hEX"), for which there is a general Mikrotik with Vodafone setup guide available. It is a small 5 * GigE router capable of up to 2 Gbps throughput in ideal conditions (by contrast the old Mikrotik RB951-2n that I had lying around to test with has only 5 * 10/100 interfaces, so is slower than my FibreX connection let alone my home networking).

In theory the Mikrotik IPv6 support includes DHCPv6 Prefix Delegation in both the client and server, including on-delegating smaller prefixes. Which should mean that if a Mikrotik RB750Gr3 were directly connected to the Technicolor TC4400VDF cable modem it could handle all my requirements, including creating isolated subnets in IPv4 and IPv6. (The Huawei HG659 supports the typical home "DMZ" device by NAT'ing all IPv4 traffic to a specific internal IP, but it is not very isolated unless you NAT to another firewall like the Mikrotik and then forward from there to the isolated subnet -- and I would really prefer to avoid double NAT. That Huawei HG659 DMZ support also appears to be IPv4 only, and it does not appear to support static IPv4 routes on the LAN interface either -- the static routing functions only allow you to choose a WAN interface.)

Since I seem to have hit up against the limits of the Huawei HG659, my "interim" use of the supplied Huawei HG659 appears to be coming to an end. In the meantime I have turned off the DHCPv6 server on the Huawei HG659 (Home Network -> Lan Interface -> IPv6 DHCP Server -> IPv6 DHCP Server should be unticked).

For the record, the Mikrotik MAC Telnet reimplementation appears to work quite well on OS X 10.11, providing you already know the MAC address you want to reach (eg, from the sticker on the outside of the Mikrotik). That helps a lot with reconfiguration of the Mikrotik for a new purpose, without relying on a Microsoft Windows system or WINE.

Posted Sun Aug 20 17:42:10 2017 Tags:

KeePassXC (source, wiki) is a password manager forked from KeePassX which is a Linux/Unix port of the Windows KeePass Password Safe. KeePassXC was started because of concern about the relatively slow integration of community code into KeePassX -- ie it is a "Community" fork with more maintainers. KeePassXC seems to have been making regular releases in 2017, with the most recent (KeePassXC 2.2.0) adding Yubikey 2FA support for unlocking databases. KeePassXC also provides builds for Linux, macOS, and Windows, including package builds for several Linux distributions (eg an unofficial Debian/Ubuntu community package build, built from the deb package source with full build instructions).

For macOS / OS X there is a KeePassXC 2.2.0 for macOS binary bundle, and KeePassXC 2.2.0 for macOS sha256 digest. They are GitHub "release" downloads, which are served off Amazon S3. KeePassXC provide instructions on verifying the SHA256 Digest and GPG signature. To verify the SHA256 digest:

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg.digest

  • Check the SHA256 digest matches:

    ewen@ashram:~/Desktop$ shasum -a 256 -c KeePassXC-2.2.0.dmg.digest
    KeePassXC-2.2.0.dmg: OK
    ewen@ashram:~/Desktop$
    

To verify the GPG signature of the release:

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg.sig

  • wget https://keepassxc.org/keepassxc_master_signing_key.asc (which is stored inside the website repository)

  • gpg --import keepassxc_master_signing_key.asc

  • gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2 (alternatively or in addition; in theory it should report it is unchanged)

    ewen@ashram:~/Desktop$ gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    gpg: requesting key 6397D0D2 from hkps server hkps.pool.sks-keyservers.net
    gpg: key 6397D0D2: "KeePassXC Release <release@keepassxc.org>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
    ewen@ashram:~/Desktop$
    
  • Compare the fingerprint on the website with the output of "gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2":

    ewen@ashram:~/Desktop$ gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    pub   4096R/6397D0D2 2017-01-03
          Key fingerprint = BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
    uid                  KeePassXC Release <release@keepassxc.org>
    sub   2048R/A26FD9C4 2017-01-03 [expires: 2019-01-03]
    sub   2048R/FB5A2517 2017-01-03 [expires: 2019-01-03]
    sub   2048R/B59076A8 2017-01-03 [expires: 2019-01-03]
    ewen@ashram:~/Desktop$
    

    to check that the GPG key retrieved is the expected one.

  • Compare the GPG signature of the release:

    ewen@ashram:~/Desktop$ gpg --verify KeePassXC-2.2.0.dmg.sig
    gpg: assuming signed data in `KeePassXC-2.2.0.dmg'
    gpg: Signature made Mon 26 Jun 11:55:34 2017 NZST using RSA key ID B59076A8
    gpg: Good signature from "KeePassXC Release <release@keepassxc.org>"
    gpg: WARNING: This key is not certified with a trusted signature!
    gpg:          There is no indication that the signature belongs to the owner.
    Primary key fingerprint: BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
         Subkey fingerprint: C1E4 CBA3 AD78 D3AF D894  F9E0 B7A6 6F03 B590 76A8
    ewen@ashram:~/Desktop$
    

    at which point if you trust the key you downloaded is supposed to be signing the code you intend to run, the verification is complete. (There are some signatures on the signing key, but I did not try to track down a GPG signed path from my key to the signing keys, as the fingerprint verification seemed sufficient.)

In addition, for Windows and OS X, KeePassXC raised funds for an Authenticode code signing certificate earlier this year. When signed, this results in a "known publisher", which avoids the Windows and OS X warnings about running "untrusted" code, and acts as a second verification of the intended code running. It is not clear that the .dmg or KeePassXC.app on OS X is signed at present, as "codesign -dv ..." reports both the .dmg file and the .app as not signed (note that it is possible to use an Authenticode Code Signing Certificate with OS X's Signing Tools). My guess is maybe the KeePassXC developers focused on Windows executable signing first (and Apple executables normally need to be signed by a key signed by Apple anyway).
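
For reference, the checks I mean are along these lines (a sketch -- the paths assume the .dmg is in the current directory and the .app has already been copied into /Applications):

codesign -dv KeePassXC-2.2.0.dmg
codesign -dv /Applications/KeePassXC.app
spctl --assess --verbose /Applications/KeePassXC.app    # Gatekeeper's view of the same question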

Having verified the downloaded binary package, on OS X it can be installed in the usual manner by mounting the .dmg file, and dragging the .app to somewhere in /Applications. There is a link to /Applications in the .dmg file, but without the clever folder background art that some .dmg files have, it is less obvious that you are intended to drag the .app into /Applications to install. (However there is no included installer, so the obvious alternative is "drag'n'drop" to install.)
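
If you prefer the command line, the same install can be sketched with hdiutil (the mounted volume name below is a guess based on the .dmg file name -- check the "hdiutil attach" output for the actual mount point):

hdiutil attach KeePassXC-2.2.0.dmg
cp -R "/Volumes/KeePassXC-2.2.0/KeePassXC.app" /Applications/
hdiutil detach "/Volumes/KeePassXC-2.2.0"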

Once installed, run KeePassXC.app to start. Create a new password database and give it at least a long master password, then save the database (with the updated master password). After the database is created it is possible to re-open KeePassXC.app with the relevant database with the usual:

open PATH/TO/DATABASE.kdbx

thanks to the application association with the .kdbx file extension. This makes it easier to manage multiple databases. When opened in this way the application will prompt for the master password of the specific database immediately (with the other known databases available as tabs).

KeePassXC YubiKey Support

KeePassXC YubiKey support is via the YubiKey HMAC-SHA1 Challenge-Response authentication, where the YubiKey mixes a shared secret with a challenge token to create a response token. This method was chosen for the KeePassXC YubiKey support because it provides a deterministic response without, eg, needing to reliably track counters or deal with gaps in monotonically increasing values, such as is needed with U2F -- Universal 2nd Factor. This trades a reduction in security (due to just relying on a shared secret) for robustness (eg, not getting permanently locked out of the password database due to the YubiKey's counter having moved on to a newer value than the password database), and ease of use (eg, not having to activate the YubiKey at both open and close of a database; the KeePassXC ticket #127 contains some useful discussion of the tradeoffs with authentication modes needing counters; pwsafe also uses YubiKey Challenge-Response mode, presumably for similar reasons).
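
The underlying primitive is easy to demonstrate outside the YubiKey: an HMAC-SHA1 of a challenge under a shared secret, which any party holding the secret can reproduce (a sketch with a made-up hex key -- this shows the primitive only, not KeePassXC's exact challenge format):

SECRET=303132333435363738393a3b3c3d3e3f40414243    # hypothetical 20-byte key in hex
printf 'example-challenge' | openssl dgst -sha1 -mac HMAC -macopt hexkey:$SECRET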

The design chosen seems similar to KeeChallenge, a plugin for KeePass2 (source) to support YubiKey authentication for the Windows KeePass. There is a good setup guide to Securing KeePass with a Second Factor describing how to set up the YubiKey and KeeChallenge, which seems broadly transferable to using the similar KeePassXC YubiKey Challenge-Response feature. (A third party YubiKey Handbook contains an example of configuring the Challenge-Response mode from the command line for a slightly different purpose.)

By contrast, the Windows KeePass built-in support is OATH-HOTP authentication (see also KeePass and YubiKey), which does not seem to be supported in KeePassXC -- some people also note an OTP 2nd Factor provides authentication, not encryption, which may limit the extra protection in the case of a local database. HOTP also uses a shared key and a counter, so suffers from similar shared secret risks as the Challenge-Response mechanism, as well as robustness risks in needing to track the counter value -- one guide to the OATH-HOTP mode warns about keeping OTP recovery codes to get back in again after being locked out due to the counter getting out of sync. See also HOTP and TOTP details; HOTP hashes a secret key and a counter, whereas TOTP hashes a secret key and the time, which means it is easier to accidentally get out of sync with HOTP. TOTP seems to be more widely deployed in client-server situations, presumably because it is self-recovering given a reasonably accurate time source.
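
The counter-versus-time distinction is easy to see with oath-toolkit's oathtool, if it is installed (a sketch using the RFC4226 test key; the HOTP value changes only when the counter does, while the TOTP value changes every 30 seconds):

KEY=3132333435363738393031323334353637383930    # RFC4226 test key ("12345678901234567890")
oathtool --hotp -c 5 $KEY    # 254676, the RFC4226 test vector for counter 5
oathtool --totp $KEY         # depends on the current time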

Configuring a YubiKey to support Challenge-Response HMAC-SHA1

To configure one or more YubiKeys to support Challenge-Response you need to:

  • Install the YubiKey Personalisation Tool from the Apple App Store; it is a zero-cost App, but obviously will not be very useful without a YubiKey or two. (The YubiKey Personalisation Tool is also available for other platforms, and in a command line version; a command-line sketch is shown after this list.)

  • Run the YubiKey Personalization Tool.app

  • Plug in a suitable YubiKey, eg, YubiKey 4; the cheaper YubiKey U2F security key does not have sufficient functionality. (Curiously the first time that I plugged a new YubiKey 4 in, the Keyboard Assistant in OS X 10.11 (El Capitan) wanted to identify it as a keyboard, which seems to be a known problem -- apparently one can just kill the dialog, but I ended up touching the YubiKey, then manually selecting an ANSI keyboard, which also seems to be a valid approach. See also the YubiKey User Guide examples for mac OS X.)

  • Having done that, the YubiKey Personalisation Tool should show "YubiKey is inserted", details of the programming status, serial number, and firmware version, and a list of the features supported.

  • Change to the "Challenge Response" tab, and click on the "HMAC-SHA1" button.

  • Select "Configuration Slot 2" (if you overwrite Configuration Slot 1 then YubiKey Cloud will not work, and that apparently is not recoverable, so using Slot 2 is best unless you are certain you will never need YubiKey Cloud; out of the factory only Configuration Slot 1 is programmed).

  • Assuming you have multiple YubiKeys (and you should, to allow recovery if you lose one or it stops functioning) tick "Program Multiple YubiKeys" at the top, and choose "Same Secret for all Keys" from the dropdown, so that all the keys share the same secret (ie, they are interchangeable for this Challenge-Response HMAC-SHA1 mode).

  • You probably want to tick "Require user input (button press)", to make it harder for a remote attacker to activate the Challenge-Response functionality.

  • Select "Fixed 64-byte input" for the HMAC-SHA1 mode (required by KeeChallenge for KeePass; unclear if it is required for KeePassXC but selecting it did work).

  • Click on the "Generate" button to generate a random 20-byte value in hex.

  • Record a copy of the 20-byte value somewhere safe, as it will be needed to program an additional/replacement YubiKey with the same secret later (unlike KeeChallenge it is not needed to set up KeePassXC; instead KeePassXC will simply ask the YubiKey to run through the Challenge-Response algorithm as part of the configuration process, not caring about the secret key used, only caring about getting repeatable results).

    Beware the dialog box seems to be only wide enough to display 19 of the bytes (not 20), and not resizeable, so you have to scroll in the input box to see all the bytes :-( Make sure you get all 20 bytes, or you will be left trying to guess the first or last byte later on. (And make sure you keep the copy of the shared secret secure, as anyone with that shared secret can program a working YubiKey that will be functionally identical to your own. Printing it out and storing it somewhere safe would be better than storing it in plain text on the computers you are using KeePassXC on... and storing it inside KeePassXC creates a catch-22 situation!)

  • Double check your settings, then click on "Write Configuration" to store the secret key out to the attached YubiKey.

  • The YubiKey Personalisation Tool will want to write a "log" file (actually a .csv file), which will also contain the secret key, so make sure you keep that log safe, or securely delete it.

  • Pull out the first YubiKey, and insert the next one. You should see a "YubiKey is removed" message then a "YubiKey is inserted" message. Click on "Write Configuration" for the next one. Repeat until you have programmed all the YubiKeys you want to be interchangeable for the Challenge-Response HMAC-SHA1 algorithm. (Two, kept separately, seems like the useful minimum, and three may well make sense.)
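
For scripted, repeatable programming, the command-line personalisation tool mentioned in the first step should be able to write the same configuration (a sketch based on its documented options -- verify against the man page for your version, and note the secret key then ends up in your shell history):

ykpersonalize -2 -ochal-resp -ochal-hmac -ohmac-lt64 -ochal-btn-trig \
    -a303132333435363738393a3b3c3d3e3f40414243    # hypothetical 20-byte secret in hex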

Configuring a KeePassXC database to use password and YubiKey authentication

  • Insert one of the programmed YubiKeys

  • Open KeePassXC on an existing password database (or create a new one), and authenticate to it.

  • Go to Database -> Change Master Key.

  • Enter your Password twice (ie, so that the Password will be set back to the same password)

  • Tick "Challenge Response" as well (so that the Password and "Challenge Response" are both ticked)

  • An option like "YubiKey (nnnnnnn) Challenge Response - Slot 2 - Press" should appear in the drop down list

  • Click the "OK" button

  • Save the password database

  • When prompted press the button on your YubiKey (which will allow it to use the YubiKey Challenge Response secret to update the database).

Accessing the KeePassXC database with password and YubiKey authentication

To test that this has worked, close KeePassXC (or at least lock the database), then open KeePassXC again. You will get a prompt for access credentials as usual, without any options ticked.

Verify that you can open the database using both the password and the YubiKey Challenge-Response, by typing in the password and ticking "Challenge Response" (after checking it detected the YubiKey) and then clicking on "OK". When prompted, click the button on your YubiKey, and the database should open. (KeePassXC seems to recognise that the Challenge-Response is needed if you have opened the database with the YubiKey and the YubiKey is present; but you will need to remember to also enter the password each time you authenticate. At least it will auto-select the Password as soon as you type one in. The first time around opening a specific database is just one additional box to tick, which is fairly easy to remember particularly if you use the same combination -- password and YubiKey Challenge-Response -- on all your databases.)

You can confirm that both the password and the YubiKey Challenge Response are required, by trying to authenticate just using the Password (enter Password, untick "Challenge Response", press OK), and by trying to authenticate just using the YubiKey (tick "Challenge Response", untick Password, press OK). In both cases it should tell you "Unable to open database" (the "Wrong key or database file is corrupt" really means "insufficient authentication" to recover the database encryption key in this case; they could perhaps more accurately say "could not decrypt master key" here, perhaps with a suggestion to check the authentication details provided).

If you have programmed multiple YubiKeys with the same Challenge-Response shared secret (and hopefully you have programmed at least two), be sure to check opening the database with each YubiKey to verify that they are programmed identically and thus are interchangeable for opening the password database. It should open identically with each key (because they were all given the same secret when you programmed the keys, and thus the Challenge-Response values are identical).

If you have multiple databases that you want to protect with the YubiKey Challenge-Response method, you will need to go through the Database -> Change Master Key steps and verification steps for each one. It probably makes sense to change them all at the same time, to avoid having to try to remember which ones need the YubiKey and which ones do not.

Usability of KeePassXC with Password and YubiKey authentication

Once you have configured KeePassXC for Password and YubiKey authentication, and opened the database at least once using the YubiKey, the usability is fairly good. Use:

open PATH/TO/DATABASE.kdbx

to open a specific KeePassXC password database directly, and KeePassXC will launch with a window to authenticate to that password database. So long as one of the appropriate YubiKeys is plugged in, after a short delay (less time than it takes to type in your password) the YubiKey will be detected, and Challenge-Response selected. Then you just type in your password as usual (which auto-selects "Password" as well), hit enter (which auto-OKs the dialog), and touch your YubiKey when prompted.

One side effect of configuring your KeePassXC databases like this is that they are not able to be opened in other KeePass related tools, except maybe the Windows KeePass with the KeeChallenge plugin (which uses a similar method; I have not tested that). For desktop use, KeePassXC should work pretty much everywhere that is likely to be useful (modern Windows, modern macOS / OS X, modern Linux), as should the YubiKey, so desktop portability is fairly good. But, for instance, MiniKeePass (source, on the iOS App Store) will not be able to open the password database. Amongst other reasons, while the "camera connection kit" can be used to link a YubiKey to an iOS device, the YubiKey iOS HowTo points out that U2F, OATH-TOTP and Challenge-Response functionality will not work (and I found suggestions on the Internet this only worked with older iOS versions).

If access from a mobile device is important, then you may want to divide your passwords amongst multiple KeePass databases: a "more secure" one including the YubiKey Challenge-Response and a "less secure" one that only requires a password for compatibility. For instance it might make sense to store "low risk" website passwords in their own database protected only by a relatively short master password, and synchronise that database for use by MiniKeePass (using the DropBox app). But keep higher security/higher risk passwords protected by password and YubiKey Challenge-Response and only accessible from a desktop application (and not synchronised via DropBox to reduce exposure of the database itself).

It also looks like, in the UI, it should be possible to configure KeePassXC to require only the YubiKey Challenge-Response (no password), simply by changing the master key and only specifying YubiKey Challenge-Response. Since the Challenge-Response shared secret is fairly short (20 bytes, so 160 bits), secured only by that shared key, and the algorithm is known, that too would be a relatively low security form of authentication. Possibly again for "low value" passwords, like random website logins with no real risk, it might offer a more secure way to store per-website random passwords, rather than reusing the same password on each website. But the combination of password and YubiKey Challenge-Response would be preferable for most password databases over the YubiKey Challenge-Response alone, even if the password itself was fairly short (eg under 16 characters).

Posted Sun Jul 23 15:34:28 2017 Tags:

Apple's Time Machine software, included with macOS for about the last 10 years, is a service to automatically back up a computer to one or more external drives or machines. Once configured it pretty much looks after itself, usually keeping hourly/daily/weekly snapshots for sensible periods of time. It can even rotate the snapshots amongst multiple targets to give multiple backups -- although it really wants to see every drive around once a week, otherwise it starts to regularly complain about no backups to a given drive, even when there are several other working backups. (Which makes it a poor choice for offline, offsite, backups which are not brought back onsite again frequently; full disk clones are better for that use case.)

More recent versions of Time Machine include local snapshots, which are copies saved to the internal drive in between Time Machine snapshots to an external target -- for instance when that external target is not available. This is quite useful functionality on, eg, a laptop that is not always on its home network or connected to the external Time Machine drive. These local snapshots do take up some space on the internal drive, but Time Machine will try to ensure there is at least 10% free space on the internal drive and aim for 20% free space (below that Time Machine local snapshots are usually cycled out fairly quickly, particularly if you do something that needs more disk space).

On my older MacBook Pro, the internal SSD (large, but not gigantic, for the time when it was bought, years ago) has been "nearly full" for a long time, so I have been regularly looking for things taking up space that do not need to be on the internal hard drive. In one of these explorations I found that while Time Machine's main local snapshot directory was tiny:

ewen@ashram:~$ sudo du -sm /.MobileBackups
1       /.MobileBackups
ewen@ashram:~$ 

as expected with an almost full drive causing the snapshots to be expired rapidly, there was another parallel directory which was surprisingly big:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
21448   /.MobileBackups.trash/
ewen@ashram:~$

(21.5GB -- approximately 2-3 times the free space on the drive). When I looked in /.MobileBackups.trash/ I found a bunch of old snapshots from 2014 and 2016, some of which were many gigabytes each:

root@ashram:/.MobileBackups.trash# du -sm *
2468    Computer
412     MobileBackups_2016-10-22-214323
16824   MobileBackups_2016-10-24-163201
1746    MobileBackups_2016-10-26-084240
1       MobileBackups_2016-12-18-144553
1       MobileBackups_2017-02-05-125225
1       MobileBackups_2017-05-18-180448
root@ashram:/.MobileBackups.trash# du -sm Computer/*
1480    Computer/2014-06-08-213847
58      Computer/2014-06-15-122559
156     Computer/2014-06-15-162406
166     Computer/2014-06-29-183344
608     Computer/2014-07-06-151454
3       Computer/2016-10-22-174000
root@ashram:/.MobileBackups.trash# 

Some searching online indicated that this was a fairly common problem (there are many other similar reports). As best I can tell what is supposed to happen is:

  • /.MobileBackups is automatically managed by Time Machine to store local snapshots, and they are automatically expired as needed to try to keep the free disk space at least above 10%.

  • /.MobileBackups.trash appears if for some reason Time Machine cannot remove a particular local snapshot or needs to start again (eg a local snapshot was not able to complete); in that case Time Machine will move the snapshot out of the main /.MobileBackups directory into the /.MobileBackups.trash directory. The idea is that eventually whatever is locking the files in the snapshot to prevent them from being deleted will be cleared, eg, by a reboot, and then /.MobileBackups.trash will get cleaned up. This is part of the reason for reboots being suggested as part of the resolution for Time Machine issues.

However there appear to be some scenarios where it is impossible to remove /.MobileBackups.trash, which just leads to the trashed snapshots gradually accumulating over time. Some people report hundreds of gigabytes used there. Because /.MobileBackups.trash is not the main Time Machine Local Snapshots directory, it shows up as "Other" in the OS X Storage Report -- rather than "Backups". And of course if it cannot be deleted, it will not be automatically removed to make space when you need more space on the drive :-(

Searching for /.MobileBackups.trash in /var/log/system.log turned up the hint that Time Machine was trying to remove the directory, but being rejected:

Jul 18 16:31:36 ashram com.apple.mtmd[852]: Failed to delete
/.MobileBackups.trash, error: Error Domain=NSCocoaErrorDomain
Code=513 "“.MobileBackups.trash” couldn’t be removed because you
don’t have permission to access it."
UserInfo={NSFilePath=/.MobileBackups.trash, NSUserStringVariant=(
    Remove
), NSUnderlyingError=0x7feb82514860 {ErrorDomain=NSPOSIXErrorDomain
Code=1 "Operation not permitted"}}

(plus lots of "audit warning" messages about the drive being nearly full, which was the problem I first started with). There are some other references to that failure on OS X 10.11 (El Capitan), which I am running on the affected machine.

Based on various online hints I tried:

  • Forcing a full Time Machine backup to an external drive, which is supposed to cause it to clean up the drives (it did do a cleanup, but it was not able to remove /.MobileBackups.trash).

  • Disabling the Time Machine local snapshots:

    sudo tmutil disablelocal
    

    which is supposed to remove the /.MobileBackups and /.MobileBackups.trash directories; it did remove /.MobileBackups but could not remove /.MobileBackups.trash.

  • Emptying the Finder Trash (no difference to /.MobileBackups.trash)

  • Wait a while to see if it got automatically removed (nope!)

  • Forcing a full Time Machine backup to an external drive, now that the local Time Machine snapshots are turned off. That took ages to get through the prepare stage (the better part of an hour), suggesting it was rescanning everything... but it did not reduce the space usage in /.MobileBackups.trash in the slightest.

Since I had not affected /.MobileBackups.trash at all, I then did some more research into possible causes for why the directory might not be removable. I found a reference suggesting file flags might be an issue, but searching for the schg and uchg flags did not turn up anything:

sudo find /.MobileBackups.trash/ -flags +schg
sudo find /.MobileBackups.trash/ -flags +uchg

(uchg is the "user immutable" flag; schg is the "system immutable" flag). There are also xattr attributes (which I have used previously to avoid accidental movement of directories in my home directory), which should be visible as "+" (attributes) or "@" (permissions) when doing "ls -l" -- but in some quick hunting around I was not seeing those either (eg sudo ls -leO@ CANDIDATE_DIR).
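
A more systematic way to hunt for those than spot-checking individual directories is with find, which on OS X also has primaries for extended attributes and ACLs (a sketch):

sudo find /.MobileBackups.trash -xattr -print | head
sudo find /.MobileBackups.trash -acl -print | head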

I did explicitly try removing the immutable flags recursively:

sudo chflags -f -R nouchg /.MobileBackups.trash
sudo chflags -f -R noschg /.MobileBackups.trash

but that made no obvious difference.

Next, after finding a helpful guide to reclaiming space from Time Machine Local snapshots I ensured that the Local Snapshots were off, then rebooted the system:

sudo tmutil disablelocal

followed by Apple -> Restart... In theory that is supposed to free up the /.MobileBackups.trash snapshots for deletion, and then delete them. At least when you do another Time Machine backup -- so I forced one of those after the system came back up again. No luck, /.MobileBackups.trash was the same as before.

After seeing reports that /.MobileBackups.trash could be safely removed manually, and (a) having two full recent Time Machine snapshots and (b) having just rebooted with the Time Machine Local Snapshots turned off, I decided it was worth trying to manually remove /.MobileBackups.trash. I did:

sudo rm -rf "/.MobileBackups.trash"

with the double quotes included to try to reduce the footgun potential of typos (rm -rf / is something you very rarely want to do, especially by accident!).

That was able to remove most of the files, by continuing when it had errors, but still left hundreds of files and directories that it reported being unable to remove:

ewen@ashram:~$ sudo rm -rf "/.MobileBackups.trash"
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences/SystemConfiguration: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var/db: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume: Directory not empty
[....]

At least most of the disk space was reclaimed, with just 45MB left:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
45      /.MobileBackups.trash/
ewen@ashram:~$ 

In order to get back to a useful state I then moved that directory out of the way:

sudo mv /.MobileBackups.trash /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

and rebooted my machine again to ensure everything was in a fresh start state.

When the system came back up again, I tried removing various parts of /var/tmp/mobilebackups-trash-undeleteable-2017-07-18 with no more success. Since the problem had followed the files rather than the location I figured there had to be something about the files which prevented them from being removed. So I did some more research.

The most obvious is the Time Machine Safety Net, which provides special protections around the Time Machine snapshots to deal with the fact that they create hard links to directories (to conserve inodes, I assume), which can confuse rm. The recommended approach is to use "tmutil delete", but while it will take a full path, doing something like:

tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323

will just fail with a report that it is an "Invalid deletion target":

ewen@ashram:/var/tmp$ sudo tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323
/private/var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323: Invalid deletion target (error 22)
Total deleted: 0B
ewen@ashram:/var/tmp$ 

and nothing will be deleted. My guess is that it at least tries to ensure that it is inside a Time Machine backup directory.

Another approach suggested is to use Finder to delete the directory, as that has hooks to the extra cleanup magic required, so I did:

open /var/tmp

and then highlighted mobilebackups-trash-undeleteable-2017-07-18 and tried to permanently delete it with Alt-Cmd-Delete. After a confirmation prompt, and some file counting, that failed with:

The operation can't be completed because an unexpected error occurred (error code -8072).

deleting nothing. Explicitly changing the problem directories to be owned by me:

sudo chown -R ewen:staff /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

also failed to change anything.

There is an even lower level technique to bypass the Time Machine Safety Net, using a helper bypass tool, which on OS X 10.11 (El Capitan) is in "/System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass". However running the rm with the bypass tool did not get me any further forward:

cd /var/tmp
sudo /System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass rm -rf mobilebackups-trash-undeleteable-2017-07-18

failed with the same errors, leaving the whole 45MB still present. (From what I can tell online, using the bypass tool is fairly safe if you are removing all the Time Machine snapshots, but can leave very incomplete snapshots if you merely try to remove some snapshots -- due precisely to the directory hard links which are the reason the Time Machine Safety Net exists in the first place. Proceed with caution if you are not trying to delete everything!)

More hunting for why root could not remove files turned up the OS X 10.11+ (El Capitan onwards) System Integrity Protection, which adds quite a few restrictions to what root can do. In particular the theory was that the files had a restricted flag on them, which means that only restricted processes, signed by Apple, would be able to modify them.
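
One way to check for that restricted flag (my assumption about the cause, rather than anything the logs said) is to include the file flags in the ls output:

ls -ldO /var/tmp/mobilebackups-trash-undeleteable-2017-07-18
sudo ls -lRO /var/tmp/mobilebackups-trash-undeleteable-2017-07-18 | grep -w restricted

which should show "restricted" against any files that System Integrity Protection will not let even root modify.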

That left me with the options of either trying to move the files back somewhere that "tmutil delete" might be willing to deal with, or trying to override System Integrity Protection for long enough to remove the files. Since Time Machine had failed to delete the files, apparently for months or years, I chose to go with the more brute force approach of overriding System Integrity Protection for a while so that I could clean up.

The only way to override System Integrity Protection is to boot into System Recovery mode, and run "csrutil disable", then reboot again to access the drive with System Integrity Protection disabled. To do this:

  • Apple -> Restart...

  • Hold down Cmd-R when the system chimes for restarting, and/or the Apple Logo appears; you have started a Recovery Boot if the background stays black rather than showing a color backdrop prompting for your password

  • When Recovery mode boots up, use Utilities -> Terminal to start a terminal.

  • In the Terminal window, run:

     csrutil disable
    
  • Reboot the system again from the menus

When the normal boot completes and you log in, you are running without System Integrity Protection enabled -- the foot gun is now on automatic!
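
You can verify that with the same "csrutil status" command shown further below, which should now report that System Integrity Protection is disabled:

csrutil status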

Having done that, OS X was happy to let me delete the leftover trash:

ewen@ashram:/var/tmp$ sudo du -sm mobilebackups-trash-undeleteable-2017-07-18/
Password:
45      mobilebackups-trash-undeleteable-2017-07-18/
ewen@ashram:/var/tmp$ sudo rm -rf mobilebackups-trash-undeleteable-2017-07-18
ewen@ashram:/var/tmp$ ls mob*
ls: mob*: No such file or directory
ewen@ashram:/var/tmp$ 

so I had finally solved the problem I started with, leaving no "undeleteable" files around for later. My guess is that those snapshots happened to run at a time that captured files with restricted flags on them, which then could not be removed (at least once Time Machine had thrown them out of /.MobileBackups and into /.MobileBackups.trash). But it seems unfortunate that the log messages could not have provided more useful instructions.

All that was left was to put the system back to normal:

  • Boot into recovery mode again (Apple -> Restart...; hold down Cmd-R at the chime/Apple logo)

  • Inside Recovery Mode, re-enable System Integrity Protection, with:

    csrutil enable
    

    inside Utilities -> Terminal.

  • Reboot the system again from the menus.

At this point System Integrity Protection is operating normally, which you can confirm with the "csrutil status" command that you can run at any time:

ewen@ashram:~$ csrutil status
System Integrity Protection status: enabled.
ewen@ashram:~$ 

(changes to the status can be made only in Recovery Mode).

Finally, re-enable Time Machine local snapshots, because it is a useful feature on a mobile device:

sudo tmutil enablelocal

and then force the first local snapshot to be made now to get the process off to an immediate start:

sudo tmutil snapshot

At which point you should have /.MobileBackups with a snapshot or two inside it:

root@ashram:~# ls -l /.MobileBackups/Computer/
total 8
-rw-r--r--  1 root  wheel  263 18 Jul 17:37 .mtm.private.plist
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173719
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173758
root@ashram:~# 

and if you look in the Time Machine Preferences Window you should see the line that it will create "Local snapshots as space permits".

Quite the adventure! But my system now has about three times as much free disk space as it did previously, which was definitely worth the effort.

Posted Wed Jul 19 17:55:22 2017 Tags:

After upgrading to the "Windows 10 Creators Update", on my dual booted Dell XPS 9360, I installed the Windows Subsystem for Linux, because my fingers like being able to use Unix/Linux commands :-)

There are a few steps to enabling and installing "Bash on Ubuntu on Windows", on a 64-bit Windows 10 Creators Update install:

  • Turn on the Windows Subsystem for Linux, by starting an Administrator PowerShell (right click on Windows icon at bottom left, choose "Windows PowerShell (Admin)" from the menu), then run:

    Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
    

    It will run for a little while with a text progress message, then ask to reboot and do a small amount of installation before restarting.

  • Turn on Developer Mode to enable installing extra features: In Settings -> Update and Security -> For developers move the radio selection to Developer Mode (default seems to be "Sideload Apps"; settings can be found by left clicking on the Windows icon at the bottom left, then click on the "cog wheel"). It will install a "Developer Mode package" (which I guess includes, eg, additional certificates).

  • Start a cmd prompt (eg, Windows -> Run... cmd), and inside that run "bash" to trigger the install of the Linux environment. (Note that without doing the above two steps this will fail with a "not found" message, so if you see that, double check you have completed them.) You are prompted to accept the terms at https://aka.ms/uowterms, which seems to just be a shortlink to the Ubuntu Licensing Page.

    The text also notes that this is a "beta feature", which is presumably why it is necessary to enable "Developer Mode"; the install itself seems to download from the Windows (application) Store. (They also warn against modifying Linux files from Windows applications, which illustrates the complexity of making the two subsystems play nicely together. It seems like this behaves a little more like a "Linux container" running under Windows than parallel processes.)

  • It detected the locale needed for New Zealand (en_NZ) and offered to change it from the default (en_US); I said "yes". (Note there was quite a long delay after this answer before the next prompt, enough I wondered if it had read the answer -- give it another minute or two.)

  • Then it prompted for a Unix-style user name (see the WSL guide to Linux User Account and Permissions, at http://aka.ms/wslusers). It also prompts for a Unix-style password, which I assume is mostly used to run sudo.

  • Install the outstanding Ubuntu Linux 16.04 updates (ie, the ones released since the install snapshot was made):

    sudo apt-get update
    sudo apt-get dist-upgrade
    

After that, other than the default prompt/vim colours being terrible for a black background console (default on Windows), the environment works in a pretty similar manner to a native Linux environment.
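
As a quick sanity check that you have ended up with an Ubuntu 16.04 userland, something like the following (run inside bash; lsb_release assumes the lsb-release package is present) should confirm the release and show the Windows-provided kernel version string:

lsb_release -a
uname -r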

The "Bash on Ubuntu on Windows" menu option added, which runs bash directly, suffers from the same "black background" readability issues. But fortunately if you go to Properties (click on top left of title bar, choose Properties) you can change the colours -- I simple changed the background to be 192/192/192 (default foreground grey), and the foreground to be 0/0/0 (default background black), and the default prompt/vim etc colours look more like they are intended.

There is some more documentation for Bash on Ubuntu on Windows which covers the whole Ubuntu on Windows feature. Of note, the "Windows 10 Creators Update" version of the feature is based on Ubuntu Linux 16.04, which means I now have Ubuntu Linux 16.04 functionality on both sides of the dual boot environment :-) The newer version also seems to have improved Linux/Windows interoperability, and 24-bit colour support in the console.

Posted Sun Jul 16 10:36:05 2017 Tags:

Tickets for the Wellington 2017 edition of the New Zealand International Film Festival went on sale this morning at 10:00. As with 2014 and 2015 the online ticketing worked... rather poorly for the first hour or so after tickets went on sale. Leading to various tweets calling for patience -- and an apology from the NZIFF Festival Director on Facebook. I had fairly high hopes at 10:00 this morning, after being told by other Festival regulars that 2016 had been better than previous years -- but they were quickly dashed. After trying for the first half hour and getting nowhere I gave up until about 11:15, and then eventually managed to buy the tickets I wanted gradually, mostly one at a time, over the next hour.

As I have said previously, ticketing is a hard problem. Given a popular event, and limited (good) seats, there will always be a rush for the (best) seats as soon as the sales start. The demand in the first day will always be hundreds of times higher than the demand two weeks later, and the demand in the first hour will be 75% of the demand in the first day. That is just part of the business, so what you need to do is plan for that to happen.

The way that NZIFF (and/or their providers) have set up their online ticketing appears, even four years in, to not properly plan for efficiently handling a large number of buyers all wanting to buy at once. Some of the obvious problems with their implementation include:

  • putting the tickets for about 500 (five hundred) events on sale at the exact same moment -- so instead of a moderate sized stampede for tickets to one event, there are many stampedes for many events, all competing for the same server/network resources.

  • only collecting information about the types of tickets required at ticket purchase time, rather than collecting it in the "wishlist" in advance.

  • only collecting details of the purchaser at ticket purchase time, rather than collecting them in advance (eg, "create account profile" as part of building the wishlist), requiring more round trips to the server, and storing more data in the database during the contentious ticket sale period.

  • relying on a "best available seat" algorithm that has no user control, and typically picks at best a mediocre seat, thus forcing many more users through the "choose my own seat" process which requires more intensive server interaction

  • not collecting money in advance (eg, selling "movie bucks" credits), which means that the period where the seat allocations are conditional waiting on payment is extended much longer, which both delays finalising seats free for the next buyer to choose from and requires more writes to the database

Less obviously, it appears as if there are some other technical problems:

  • Not designing to automatically "scale out" rapidly to more servers when the site is busy

  • Not pre-scaling to a large enough size, and pre-warming the servers (so they have everything they need in RAM) before opening up ticket sales

  • Breaking the web pages up into too many small requests and stages, increasing the client/server interaction (and thus both load on the server and points at which the process could go wrong) dramatically

  • Writing too much to disk during the processing

  • Reading too much from disk during the interactions

  • Not offloading enough to third party services (eg, CDNs)

and behind all of these is inadequate load testing to simulate the thundering herd of requests that come as an inevitable part of the ticket sales problem, leading to false confidence that "this year we will be okay", only to have those hopes crushed in the first 10 minutes.

So how do we make this ticket sales problem more manageable:

  • Stagger the event sales -- with 500+ events over 15+ days there is no good reason to put all of the events on sale at exactly the same time. It just makes the ticket sales problem two orders of magnitude worse than a single popular event. So break up the ticket releases into stages -- open up sales for the first few days of the festival at 10:00 on the first day, then open up sales for the next few days of the festival in the afternoon or the next day. Clearly having many many days with new tickets going on sale is impractical, but staggering the ticket sales opening over 2-5 days is fairly easily achieved, and instantly halves (or better) the "sales open" server load.

  • Collect every possible piece of information you can in advance of ticket sales, and write it to the database in advance. This would include all the name and contact details needed to complete the sale, a confirmation the user has read the terms and conditions, and details of how many tickets of which types the user wants. All of this can be part of the account profile and "wishlist". Ideally the only thing left for ticket sales time is seat allocation.

  • Preferably also collect the user's preferred seat, or a way to hint to the seat allocation policy where to pick. Many regular movie goers (and almost all of the early sales will be regulars) will know the venues like the back of their hand, and can probably name off the top of their head their favourite seat. Obviously you cannot guarantee the exact seat will still be available when they buy their ticket, but if your seat selection algorithm is choosing "seat nearest to the user's desired one" rather than "arbitrary seat the average person might not hate", then there is a good chance the user will not have to interact with the "choose my seat" screen at all. (For about half the films I booked this morning pre-entering my preferred seat would have just worked to give me the perfect seat. But since I had no way to pre-enter it, I had to go through the "I want to choose a better seat than the automatic one" process on every single movie session.)

  • Ideally, collect the users' money in advance. Many of the most eager purchasers will be literally spending hundreds of dollars, and going to dozens of sessions. Most of them would probably be willing to pre-purchase, say, a block of "10 tickets of type FOO" to be allocated to sessions later, if it sped up their ticket purchasing process. Having the money in advance both saves the users typing in their credit card details over and over again, and also means the server can go directly from "book my session" to "session confirmed" with no waiting -- avoiding writing details to the database at any intermediate step. (This also potentially decreases the festival expenses on credit card transaction fees by an order of magnitude.)

  • Maintain an in-RAM cache of "hot" information, such as the seats which are available/sold/in the process of being sold for each active session. Use memcached or other similar products. Make the initial decisions about which seats to offer the user from those tables, only accessing the database to store a permanent reservation once the seats are found.

  • Done completely you end up with a process that is:

    • User ticks one or more sessions for which they want to finalise their ticket purchase

    • The website returns a page saying "use your credit to buy these seats for these sessions", pre-populated with the nearest seat to the ones they pre-indicated they wanted. It saves a temporary seat reservation to the database with a short timeout, and marks it in the RAM cache as "sale in progress". These writes to the database can be very short (and thus quick) because they are just a 4-tuple of (userid, eventid, seatid, expiry time).

    • User clicks "yes, complete sale", their single interaction if the seat they wanted (or a "close enough" one) is available.

    • The website marks the temporary seat reservations as final (by writing "never" in the expiry time), writes the new credit balance to the database, and returns a page saying "you have these seats, and this credit left, tickets to follow via email".

    Occasionally a user might need to dive into the seat selection page to try to find a better choice, but for users in that critical first hour there is a pretty good chance that they will get something close to the seat they wanted. And the users will rapidly decide the algorithm is doing as well as is possible when they dive into the seat selection page and find all the ones nearer their preferred seat are gone already.

  • Organise the website so as much as possible is static content -- all images, styling (CSS), descriptions of films, etc, is cache-friendly static content. That both allows the browsers not to even ask for it again, and for any checks for whether it has changed to be met with a very quickly answered "all good, you have the latest version".
    Redirect all that static content to an external CDN to keep it away from the sales transaction process.

  • For data that has to be dynamically loaded (eg, seats available) send it in the most compact form possible, and unpack it on the client. CPU time on the client is effectively free in this process as there are many many client CPUs, and relatively few server resources. Try to offload as much work as possible to the browser CPUs, and make them rely on as little as possible coming from the central server.

  • By getting the sales process down to "are you sure?" / "yes", very few server interactions are required, so users get through the process quicker and go away (reducing load) happy. It also means that there is very little to write to the database, so the database contention is dramatically reduced. Done properly almost nothing has to be read from the database.

  • The quick turn around then makes it possible to do things like, eg, keep a HTTPS connection open from the browser to the load balancer to the back end webserver for the 15-30 seconds it takes to complete the sale, avoiding a bunch of network congestion and setup time. This also dramatically reduces the risk of the sales process failing at step 7 of 10, and the user having to start again (which means the load generated by all previous steps was wasted load on the server and means the user is frustrated). By taking the payment out of line from the seat allocation/finalisation process, the web browser only needs to interact with a single server maximising the chances of keeping the connection "hot", ready for when the user eagerly clicks "yes, perfect, I want those" button. Which completes the transaction as quickly as possible.

  • The quick turn around would also encourage users to purchase multiple sessions at once, rather than resorting to purchasing one ticket at a time just to have a chance of anything working. And users purchasing, eg, 10 sessions at a time will get all the tickets they were after much quicker, then leave the site -- and more server resources available for all other users.

  • Host the server as close as possible to the actual users, so that the web connection set up time is as small as possible, and the data transfers to the user happen as fast as possible. Having connections stall for long periods due to packet loss and long TCP timeouts (scaled to expect long delays due to distance) just ties up server resources and makes the users frustrated.

  • Pre-start lots of additional capacity in advance of the "on sale now" time, and pre-warm it by running a bunch of test transactions through, so the servers are warmed up and ready to go. A day or two later you can manually (or automatically) scale back to a more realistic "sale days 2-20" capacity. With most "cloud" providers you will pay a few hundred dollars extra on the first few hours, or days, in exchange for many happy customers and thus many sales. The extra sales possible as a result may well pay for the extra "first day" hosting costs. And most "cloud" providers will allow you to return that extra capacity on the second day at no extra cost -- so it is a single day cost.

Implementing any of this would help. Implementing all of it would make a dramatic difference to the first day experience of the most enthusiastic customers. I for one would be extremely grateful even just to avoid having to type in my name and contact details several dozen times (between failed and successful one-ticket-at-a-time attempts), or avoiding having to type my credit card details in a couple of dozen times in a rush to try to "complete the sale while the site is still talking to me".

Posted Thu Jul 6 20:52:32 2017 Tags:

git svn provides a way to check out Subversion repositories and interact with them using the git interface. It is useful for avoiding the mental leap of occasionally working with another revision control system tool when, eg, dealing with RANCID repositories of switch and router configuration (historically RANCID only supported CVS, and then more recently CVS and Subversion; recent versions do support git directly, but not all my clients are using recent enough versions to have direct git support).

Unfortunately the git svn command interaction is still fairly different from "native git" repository interaction, which causes some confusion. But fortunately with a few kludges you can hide the worst of this from day-to-day interaction.

Starting at the beginning, you "clone" the repository with something like:

git svn clone svn+ssh://svn.example.com/var/lib/rancid/SVN/switches

(using a specific path within the Subversion repository as suggested by Janos Gyerik, rather than fighting with the git svn/Subversion branch impedance mismatch).

After that you're supposed to use:

git svn rebase

to update the repository pulling in changes made "upstream"; "git pull" simply will not work.

However my fingers, trained by years of git usage, really want to type "git pull" -- if I have to type something else to update, then I might as well just run svn directly. So I went looking for a solution to make "git pull" work on a git svn clone (which I never change locally).

An obvious answer would be to define a git alias (see Git Aliases), but sadly it is not possible to define a git alias that shadows an internal command, and it appears this is considered a feature. I could call the alias something else, but then I am back at "have to type something different, so I might as well just run svn" :-(

A comment on the same Stack Overflow thread suggests the best answer is to define a bash "shell function" that intercepts calls to git and redirects commands as appropriate. In my case I want "git pull" to run "git svn rebase" if (and only if) I am in a git svn repository. Inspecting those repositories showed that one unique feature they have is that there is a .git/svn directory -- so that gave me a way to tell which repositories to run this command on. Some more searching turned up git rev-parse --show-toplevel as the way to find the .git directory, so my work around could work no matter how deep I am in the git svn repository.

Putting these bits together I came up with this shell function that intercepts "git pull", checks for a git svn repository, and if we are running "git pull" in a git svn repository runs "git svn rebase" instead -- which does a fetch and update, just like "git pull" would do on a native repository:

function git {
    local _GIT

    if test "$1" = "pull"; then
        _GIT=$(command git rev-parse --show-toplevel)
        if test -n "${_GIT}" -a -e "${_GIT}/.git/svn"; then
            command git svn rebase
        else
            command git "$@"
        fi
    else
        command git "$@"
    fi
}

(The "command git" bit forces bash to ignore the alias, and run the git in the PATH instead, preventing infinite recursion -- without having to hard code the path of the git binary.)

Now "git pull" functions like normal in git repositories, and magically does the right thing on git svn repositories; and all other git commands run as normal.

It is definitely a kludge, but avoiding a daily "whoops, that is the repository that is special" confusion is well worth it. (git log and git diff seem to "just work" in git svn repositories -- which are the main two other commands I end up using on RANCID repositories.)

Posted Tue Jul 4 14:52:58 2017 Tags:

Debian 7.0 ("Wheezy") was originally released about four years ago, in May 2013; the last point release (7.11) was released a year ago, in June 2016. While Debian 7.0 ("Wheezy") has benefited from the Debian Long Term Support with a further two years of support -- until 2018-05-31 -- the software in the release is now pretty old, particularly software relating to TLS (Transport Layer Security) where the most recent version supported by Debian Wheezy is now the oldest still reasonably usable on the Internet. (The Long Term Support also covered only a few platforms -- but they were the most commonly used platforms including x86 and amd64.)

More recently Debian released Debian 8.0 ("Jessie"), originally a couple of years ago in May 2015 (with the latest update, Debian 8.8, released last month, in May 2017). Debian are also planning on releasing Debian Stretch (presumably as Debian 9.0) mid June 2017 -- in a couple of weeks. This means that Debian Stretch is still a "testing" distribution, which does not have security support, but all going according to plan later this month (June 2017) it will be released and will have security support after the release -- for several years (between the normal security support, and likely Debian Long Term Support).

Due to a combination of lack of spare time last year, and the Debian LTS providing some additional breathing room to schedule updates, I still have a few "legacy" Debian installations currently running Debian Wheezy (7.11). At this point it does not make much sense to upgrade them to Debian Jessie (itself likely to go into Long Term Support in about a year), so I have decided to upgrade these systems from Debian Wheezy (7.11) through Debian Jessie (8.8) and straight on to Debian Stretch (currently "testing", but hopefully soon 9.0). My plan is to start with the systems least reliant on immediate security support -- ie, those that are not exposed to the Internet directly. I have done this before, going from Ubuntu Lucid (10.04) to Ubuntu Trusty (14.04) in two larger steps, both of which were Ubuntu LTS distributions.

Most of these older "still Debian Wheezy" systems were originally much older Debian installs that have already been incrementally upgraded several times. For the two hosts that I looked at this week, the oldest one was originally installed as Debian Sarge, and the newest one was originally installed as Debian Etch, as far as I can tell -- although both have been re-homed on new hardware since the original installs. From memory the Debian Sarge install ended up being a Debian Sarge install only due to the way that two older hosts were merged together some years ago -- some parts of that install date back to even older Debian versions, around Debian Slink first released in 1999. So there are 10-15 years of legacy install decisions there, as well as both systems having a number of additional packages installed for specific long-discarded tasks that create additional clutter (such is the disadvantage of the traditional Unix "one big system" approach, versus the modern approach of many VMs or containers). While I do have plans to gradually break the remaining needed services out into separate, automatically built, VMs or containers, it is clearly not going to happen overnight :-)

The first step in planning such an update is to look at the release notes for Debian 8 ("Jessie") and Debian 9 ("Stretch").

The upgrade instructions are relatively boilerplate (prepare for an upgrade, check system status, change apt sources, minimal package updates then full package updates) but do contain hints as to possible upgrade problems with specific packages and how to work around them.
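
As a rough sketch of that boilerplate for the Wheezy to Jessie step (assuming a simple /etc/apt/sources.list pointing at the standard Debian mirrors; security and other entries may need adjusting by hand), the apt source change and two-stage upgrade look something like:

sudo cp /etc/apt/sources.list /etc/apt/sources.list.wheezy
sudo sed -i 's/wheezy/jessie/g' /etc/apt/sources.list
sudo apt-get update
sudo apt-get upgrade          # minimal upgrade first
sudo apt-get dist-upgrade     # then the full upgrade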

The "issues to be aware of" contain a lot of compatibility hints of things which may break as a result of the upgrade. In particular Debian 8 (Jessie) brings:

  • Apache 2.4, which both has significantly different configuration syntax and only includes configuration files ending in .conf (breaking, eg, virtual servers named after just the domain name; a minimal rename sketch follows this list); the Squid proxy configuration also changes (see the Squid 3.2, 3.3, and 3.4 release notes, particularly Helper Name Changes).

  • systemd (in the form of systemd-sysv) by default, which potentially breaks local init changes (or custom scripts), and halt no longer powering off by default -- that behaviour apparently being declared "a bug that was never fixed" in the old init scripts, after many many years of it working that way. It got documented, but that is about it. (IMHO the only use of "halt but do not power off" is in systems like Juniper JunOS where a key on the console can be used on the halted system to cause it to boot again in the case of accidental halts; it is not clear that actually works with systemd. systemd itself has of course been rather controversial, eventually leading to Devuan Jessie 1.0 which is basically Debian Jessie without systemd. While I am not really a fan of many of systemd's technical decisions, the adoption by most of the major Linux distributions makes interaction with it inevitable, so I am not going out of my way to avoid it on these machines.)

  • The "nobody" user (and others) will have their shell changed to /usr/sbin/nologin -- which mostly affects running commands like:

    sudo su -c /path/to/command nobody
    

    Those commands instead need to be run as:

    sudo su -s /bin/bash -c /path/to/command nobody
    

    Alternatively you can choose to decline the change for just the nobody user -- the upgrade tool asks per user change in an interactive upgrade if your debconf question priority is medium or lower. In my case nobody was the last user shell change mentioned.

  • systemd will start, fsck, and mount both / and /usr (if it is a separate device) during the initramfs. In particular this means that if they are RAID (md) or LVM volumes they need to be started by the time that initramfs runs, or startable by initramfs. There also seem to be some races around this startup, which may mean that not everything starts correctly; at least once I got dumped into the systemd rescue shell, and had to run "vgchange -a y" for systemd, wait for everything to be automatically mounted, and then tell it to continue booting (exit), but on one boot it booted correctly by itself so it is definitely a race. (See, eg, Debian bug #758808, Debian bug #774882, and Debian bug #782793. The latter reports a fix in lvm2 2.02.126-3 which is not in Debian Jessie, but is in Debian Stretch, so I did not try too hard to fix this in Debian Jessie before moving on. The main system I experienced this on booted correctly, first time, on Debian Stretch, and continued to reboot automatically, whereas on Debian Jessie it needed manual attention pretty much every boot.)
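
As mentioned in the Apache 2.4 item above, a minimal sketch of the site-configuration rename (with a hypothetical site name, assuming the usual Debian sites-available/sites-enabled layout; the 2.2 to 2.4 syntax changes still need handling separately) would be:

cd /etc/apache2/sites-available
sudo mv example.com example.com.conf
sudo rm /etc/apache2/sites-enabled/example.com    # remove the now-stale symlink
sudo a2ensite example.com.conf
sudo service apache2 reload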

Debian 9 (Stretch) seems to be bringing:

  • Restrictions around separate /usr (it must be mounted by initramfs if it is separate; but the default Debian Stretch initramfs will do this)

  • net-tools (arp, ifconfig, netstat, route, etc) are deprecated (and not installed by default) in favour of the iproute2 (ip ...) commands. Which is a problem for cross-platform finger-macros that have worked for 20-30 years... so I suspect net-tools will be a common optional package for quite a while yet :-) (Some rough iproute2 equivalents are sketched after this list.)

  • A warning that a Debian 8.8 (Jessie) or Debian 9 (Stretch) kernel is needed for compatibility with the PIE (Position Independent Executable) compile mode for executables in Debian 9 (Stretch), and thus it is extra important to (a) install all Debian 8 (Jessie) updates and reboot before upgrading to Debian 9 (Stretch), and (b) to reboot very soon after upgrading to Debian 9 (Stretch). This also affects, eg, the output of file -- reporting shared object rather than executable (because the executables are now compiled more like shared libraries, for security reasons). (Position independent code (PIC) is also somewhat slower on register-limited machines like 32-bit x86 -- but gcc 5.0+ contains some performance improvements for PIC which apparently help reduce the penalty. This is probably a good argument to prefer amd64 -- 64-bit mode -- for new installs. And even the x86 support is i686 or higher only; Debian Jessie is the last release to support i586 class CPUs.)

  • SSH v1, and older ciphers, are disabled in OpenSSH (although it appears Debian Stretch will have a version where they can still be turned back on; the next OpenSSH release is going to remove SSH v1 support entirely, and it is already removed from the development tree). Also ssh root password login is disabled on upgrade. These ssh changes are particularly an upgrade risk -- one would want to be extra sure of having an out of band console to reach any newly upgraded machines before rebooting them.

  • Changes around apt package pinning calculations (although it would be best to remove all pins and alternative package repositories during the upgrade anyway).

  • The Debian FTP Servers are going away which means that ftp URLs should be changed to http -- the ftp.CC.debian.org names seem likely to remain for the foreseeable future for use with http.
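
For reference (and to help retrain those finger macros), some rough iproute2 equivalents of the common net-tools commands are:

ip addr show      # roughly: ifconfig -a
ip route show     # roughly: route -n / netstat -rn
ip neigh show     # roughly: arp -an
ss -tlnp          # roughly: netstat -tlnp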

I have listed some notes on issues experienced below, for future reference and will update this list with anything else I find as I upgrade more of the remaining legacy installs over the next few months.

Debian 7 (Wheezy) to Debian 8 (Jessie)

  • webkitgtk (libwebkitgtk-1.0-common) has limited security support. To track down why this is needed:

    apt-cache rdepends libwebkitgtk-1.0-common
    

    which turns up libwebkitgtk-1.0-0, which is used by a bunch of packages. To find the installed packages that need it:

    apt-cache rdepends --installed libwebkitgtk-1.0-0
    

    which gives libproxy0 and libcairo2, and repeating that pattern indicates many things installed depending on libcairo2. Ultimately iceweasel / firefox-esr are one of the key triggering packages (but not the only one). I chose to ignore this at this point until getting to Debian Stretch -- and once on Debian Stretch I will enable backports to keep firefox-esr relatively up to date.

  • console-tools has been removed, due to being unmaintained upstream, which is relatively unimportant for my systems which are mostly VMs (with only serial console) or okay with the default Linux kernel console. (The other packages removed on upgrade appear to just be, eg, old versions of gcc, perl, or other packages replaced by newer versions with a new name.)

  • /etc/default/snmpd changed, which removes custom options and also disables the mteTrigger and mteTriggerConf features. The main reason for the change seems to be to put the PID file into /run/snmpd.pid instead of /var/run/snmpd.pid. /etc/snmp/snmpd.conf also changes by default, which will probably need to be merged by hand.

    On SNMP restart a bunch of errors appeared:

    Error: Line 278: Parse error in chip name
    Error: Line 283: Label statement before first chip statement
    Error: Line 284: Label statement before first chip statement
    Error: Line 285: Label statement before first chip statement
    Error: Line 286: Label statement before first chip statement
    Error: Line 287: Label statement before first chip statement
    Error: Line 288: Label statement before first chip statement
    Error: Line 289: Label statement before first chip statement
    Error: Line 322: Compute statement before first chip statement
    Error: Line 323: Compute statement before first chip statement
    Error: Line 324: Compute statement before first chip statement
    Error: Line 325: Compute statement before first chip statement
    Error: Line 1073: Parse error in chip name
    Error: Line 1094: Parse error in chip name
    Error: Line 1104: Parse error in chip name
    Error: Line 1114: Parse error in chip name
    Error: Line 1124: Parse error in chip name
    

    but snmpd apparently started again. The line numbers are too high to be /etc/snmp/snmpd.conf, and as bug report #722224 notes, the filename is not mentioned. An upstream mailing list message implies it relates to lm_sensors object, and the same issue happened on upgrade from SLES 11.2 to 11.3. The discussion in the SLES thread pointed at hyphens in chip names in /etc/sensors.conf being the root cause.

    As a first step, I removed libsensors3 which was no longer required:

    apt-get purge libsensors3
    

    That appeared to be sufficient to remove the problematic file, and then:

    service snmpd stop
    service snmpd start
    service snmpd restart
    

    all ran without producing that error. My assumption is that old /etc/sensors.conf was from a much older install, and no longer in the preferred location or format. (For the first upgrade where I encountered it, the machine was now a VM so lm-sensors reading "hardware" sensors was not particularly relevant.)

  • libsnmp15 was removed, but not purged. The only remaining file was /etc/snmp/snmp.conf (note not the daemon configuration, but the client configuration), which contained:

    #
    # As the snmp packages come without MIB files due to license reasons, loading
    # of MIBs is disabled by default. If you added the MIBs you can reenable
    # loading them by commenting out the following line.
    mibs :
    

    on default systems to prevent the SNMP MIBs from being loaded. Typically one would want to enable SNMP MIB usage and thus to get names of things rather than just long numeric OID strings. snmp-mibs-downloader appears to still exist in Debian 8 (Jessie), but it is in non-free.

    The snmp client package did not seem to be installed, so I installed it manually along with snmp-mibs-downloader:

    sudo apt-get install snmp snmp-mibs-downloader
    

    which caused that package, rather than libsnmp15, to own the /etc/snmp/snmp.conf configuration file, which makes more sense. After that I could purge both libsnmp15 and console-tools:

    sudo apt-get purge libsnmp15 console-tools
    

    (console-tools was an easy choice to purge as I had not actively used its configuration previously, and thus could be pretty sure that none of it was necessary.)

    To actually use the MIBs one needs to comment out the "mibs :" line in /etc/snmp/snmp.conf manually, as per the instructions in the file.

  • Fortunately it appeared I did not have any locally modified init scripts which needed to be ported. The suggested check is:

    dpkg-query --show -f'${Conffiles}' | sed 's, /,\n/,g' | \
       grep /etc/init.d | awk 'NF,OFS="  " {print $2, $1}' | \
       md5sum --quiet -c
    

    and while the first system I upgraded had one custom written init script it was for an old tool which did not matter any longer, so I just left it to be ignored.

    I did have problems with the rsync daemon, as listed below.

  • Some "dummy" transitional packages were installed, which I removed:

    sudo apt-get purge module-init-tools iproute
    

    (replaced by udev/kmod and iproute2 respectively). The ttf-dejavu packages also showed up as "dummy" transitional packages but owned a lot of files so I left them alone for now.

  • Watching the system console revealed the errors:

    systemd-logind[4235]: Failed to enable subscription: Launch helper exited with unknown return code 1
    systemd-logind[4235]: Failed to fully start up daemon: Input/output error
    

    which some users have reported when being unable to boot their system, although in my case it happened before rebooting so possibly was caused by a mix of systemd and non-systemd things running.

    systemctl --failed reports:

    Failed to get D-Bus connection: Unknown error -1
    

    as in that error report, possibly due to the wrong dbus running; the running dbus in this system is from the Debian 7 (Wheezy) install, and the systemd/dbus interaction changed a lot after that. (For complicated design choice reasons, historically dbus could not be restarted, so changing it requires rebooting.)

    The system did reboot properly (although it appeared to force a check of the root disk), so I assume this was a transitional update issue.

  • There were quite a few old Debian 7 (Wheezy) libraries, which I found with:

    dpkg -l | grep deb7
    

    that seemed no longer to be required, so I removed them manually. (Technically that only finds packages with security updates within Debian Wheezy, but those seem the most likely to be problematic to leave lying around.)

    At one point after the upgrade apt-get offered a large selection of packages to autoremove, but after some other tidy up and rebooting it no longer showed any packages to autoremove; it is unclear what happened to cause that change in report. I eventually found the list in my scrollback and pasted the contents into /tmp/notrequired, then did:

    for PKG in $(cat /tmp/notrequired); do echo $PKG; done | tee /tmp/notrequired.list
    dpkg -l | grep -f /tmp/notrequired.list
    

    to list the ones that were still installed. Since this included the libwebkitgtk-1.0-common and libwebkitgtk-1.0-0 packages mentioned above, I did:

    sudo apt-get purge libwebkitgtk-1.0-common libwebkitgtk-1.0-0
    

    to remove those. Then I went through the remainder of the list, and removed anything marked "transitional" or otherwise apparently no longer necessary to this machine (eg, where there was a newer version of the same library installed). This was fairly boring rote cleanup, but given my plan to upgrade straight to Debian 9 (Stretch) it seemed worth starting with a system as tidy as possible.

    I left installed the ones that seemed like I might have installed them deliberately (eg, -perl modules) for some non-packaged tool, just to be on the safe side.

  • I found yet more transitional packages to remove with:

    dpkg -l | grep -i transitional
    

    and removed them with:

    sudo apt-get purge iceweasel mailx mktemp netcat sysvinit
    

    after using "dpkg -L PACKAGE" to check that they contained only documentation; sysvinit contained a couple of helper tools (init and telinit) but their functionality has been replaced by separate systemd programs (eg systemctl) so I removed those too.

    Because netcat is useful, I manually installed the dependency it had brought in to ensure that was selected as an installed package:

    sudo apt-get install netcat-traditional
    

    While it appeared that multiarch-support should also be removable as a no-longer required transitional package, since it was listed as transitional and contained only manpages, in practice attempts to remove it resulted in libc6 wanting to be removed too, which would rapidly lead to a broken system. (On my system the first attempt failed on gnuplot, which was individually fixable by installing, eg, gnuplot-nox explicitly and removing the gnuplot meta package, but since removing multiarch-support led to removing libc6 I did not end up going down that path.)

    For consistency I also needed to run aptitude and interactively tell aptitude about these decisions.

  • After all this tidying up, I found nothing was listening on the rsync port (tcp/873) any longer. Historically I had run the rsync daemon using /etc/init.d/rsync, which still existed, and still belonged to the rsync package.

    sudo service rsync start
    

    did work, to start the rsync daemon, but it did not start at boot. Debian Bug #764616 provided the hint that:

    sudo systemctl enable rsync
    

    was needed to enable it starting at boot. As Tobias Frost noted on Debian Bug #764616 this appears to be a regression from Debian Wheezy. It appears the bug eventually got fixed in rsync package 3.1.2-1, but that did not get backported to Debian Jessie (which has 3.1.1-3) so I guess the regression remains for everyone to trip over :-( If I was not already planning on upgrading to Debian Stretch then I might have raised backporting the fix as a suggestion.

  • inn2 (for UseNet) is no longer supported on 32-bit (x86); only the LFS (Large File Support) package, inn2-lfs is supported, and it has a different on-disk database format (64-bit pointers rather than 32-bit pointers). The upgrade is not automatic (due to the incompatible database format) so you have to touch /etc/news/convert-inn-data and then install inn2-lfs to upgrade:

    You are trying to upgrade inn2 on a 32-bit system where an old inn2 package
    without Large File Support is currently installed.
    
    Since INN 2.5.4, Debian has stopped providing a 32-bit inn2 package and a
    LFS-enabled inn2-lfs package and now only this LFS-enabled inn2 package is
    supported.
    
    This will require rebuilding the history index and the overview database,
    but the postinst script will attempt to do it for you.
    
    [...]
    
    Please create an empty /etc/news/convert-inn-data file and then try again
    upgrading inn2 if you want to proceed.
    

    Because this fails out the package installation it causes apt-get dist-upgrade to fail, which leaves the system in a partially upgraded messy state. For systems with inn2 installed on 32-bit this is probably the biggest upgrade risk.

    To try moving forward:

    sudo touch /etc/news/convert-inn-data
    sudo apt-get -f install
    

    All going well the partly installed packages will be fixed up, then:

    [ ok ] Stopping news server: innd.
    Deleting the old overview database, please wait...
    Rebuilding the overview database, please wait...
    

    will run (which will probably take many minutes on most non-trivial inn2 installs; in my case these are old inn2 installs, which have been hardly used for years, but do have a lot of retained posts, as a historical archive). You can watch the progress of the intermediate files needed for the overview database being built with:

    watch ls -l /var/spool/news/incoming/tmp/
    watch ls -l /var/spool/news/overview/
    

    in other windows, but otherwise there is no real indication of progress or how close you are to completion. The "/usr/lib/news/bin/makehistory -F -O -x" process that is used in rebuilding the overview file is basically IO bound, but also moderately heavy on CPU. (The history file index itself, in /var/lib/news/history.* seems to rebuild fairly quickly; it appears to be the overview files that take a very long time, due to the need to re-read all the articles.)

    It may also help to know where makehistory is up to reading, eg:

    MKHISTPID=$(ps axuwww | awk '$11 ~ /makehistory/ && $12 ~ /-F/ { print $2; }')
    sudo watch ls -l "/proc/${MKHISTPID}/fd"
    

    which will at least give some idea which news articles are being scanned. (As far as I can tell one temporary file is created per UseNet group, which is then merged into the overview history; the merge phase is quick, but the article scan is pretty slow. Beware the articles are apparently scanned in inode order rather than strictly numerical order, which makes it harder to tell group progress -- but at least you can tell which group it is on.)

    In one of my older news servers, with pretty slow disk IO, rebuilding the overview file took a couple of hours of wall clock time. But it is slow even given the disk bandwidth, because it makes many small read transactions. This is for about 9 million articles, mostly in a few groups where a lot of history was retained, including single groups with 250k-350k articles retained -- and thus stored in a single directory by inn2. On ext4 (but probably without directory indexes, due to being created on ext2/ext3).

    Note that all of this delay blocks the rest of the upgrade of the system, due to it being done in the post-install script -- and the updated package will bail out of the install if you do not let it do the update in the post-install script. Given the time required it seems like a less disruptive upgrade approach could have been chosen, particularly given the issue is not mentioned at all as far as I can see in the "Issues to be aware of for Jessie" page. My inclination for the next one would be to hold inn2, and upgrade everything else first, then come back to upgrading inn2 and anything held back because of it.
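
    That hold can be done with apt-mark (a sketch of the approach, rather than a record of what I actually ran this time):

    sudo apt-mark hold inn2
    sudo apt-get dist-upgrade     # upgrade everything else first
    sudo apt-mark unhold inn2
    sudo apt-get dist-upgrade     # then come back for inn2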

    Some searching turned up enabling ext4 dir_index handling to speed up access for larger directories:

    sudo service inn2 stop
    sudo umount /dev/r1/news
    sudo tune2fs -O dir_index,uninit_bg /dev/r1/news
    sudo tune2fs -l /dev/r1/news
    sudo e2fsck -fD /dev/r1/news
    sudo mount /dev/r1/news
    sudo service inn2 start
    

    I apparently did not do this on the previous OS upgrade to avoid locking myself out of using earlier OS kernels; but these ext4 features have been supported for many years now.

    In hindsight this turned out to be a bad choice, causing a lot more work. It is unclear if the file system was already broken, or if changing these options and doing partial fscks broke it :-( At minimum I would suggest doing an e2fsck -f /dev/r1/news before changing any options, to at least know whether the file system is good before the options are changed.

    In my case when I first tried this change I also set "-O uninit_bg" since it was mentioned in the online hints, and then after the first e2fsck, tried to do one more "e2fsck -f /dev/r1/news" to be sure the file system was okay before mounting it again. But apparently parts of the file system need to be initialised by a kernel thread when "uninit_bg" is set.

    I ended up with a number of reports like:

    Inode 8650758, i_size is 5254144, should be 6232064.  Fix? yes
    Inode 8650758, i_blocks is 10378, should be 10314.  Fix? yes
    

    followed by a huge number of reports like:

    Pass 2: Checking directory structure
    Directory inode 8650758 has an unallocated block #5098.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5099.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5100.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5101.  Allocate? yes
    

    which were too numerous to allocate by hand (although I tried saying "yes" to a few by hand), and they could not be fixed automatically (eg, not fixable by "sudo e2fsck -pf /dev/r1/news").

    It is unclear if this was caused by "-O uninit_bg", or some earlier issue on the file system (this older hardware has not been entirely stable), or whether there was some need for more background initialisation to happen which I interrupted by mounting the disk, then unmounting it, and then deciding to check it again.

    Since the file system could still be mounted, I tried making a new partition and using tar to copy everything off it first before trying to repair it. But the tar copy also reported many many kernel messages like:

    Jun 11 19:12:10 HOSTNAME kernel: [24027.265835] EXT4-fs error (device dm-3): __ext4_read_dirblock:874: 
    inode #9570798: block 6216: comm tar: Directory hole found
    

    and in general the copy proceeded extremely slowly (way way below the disk bandwidth). So I gave up on trying to make a tar copy first, as it seemed like it would take all night with no certainty of completing. I assume these holes are the same "unallocated blocks" that fsck complained about.

    Given that the news spool was mostly many year old articles which I also had not looked at in years, instead I used dd to make a bitwise copy of the partition:

    dd if=/dev/r1/news of=/dev/r1/news_backup bs=32768
    

    which ran at something approaching the underlying disk speed, and at least gives me a "broken" copy to try a second repair on if I find a better answer later.

    Running a non-interactive "no change" fsck:

    e2fsck -nf /dev/r1/news
    

    indicated the scope of the problem was pretty huge, with both many unallocated block reports as above, and also many errors like:

    Problem in HTREE directory inode 8650758: block #1060 has invalid depth (2)
    Problem in HTREE directory inode 8650758: block #1060 has bad max hash
    Problem in HTREE directory inode 8650758: block #1060 not referenced
    

    which I assume indicate dir_index directories that did not get properly indexed, as well as a whole bunch of files that would end up in lost+found. So the file system was pretty messed up.

    Figuring backing out might help, I turned dir_index off again:

    tune2fs -O ^dir_index /dev/r1/news
    tune2fs -l /dev/r1/news
    

    There were still a lot of errors when checking with e2fsck -nf /dev/r1/news, but at least some of them were that there were directories with the INDEX_FL flag set on filesystem without htree support, so it seemed like letting fsck fix that would avoid a bunch of the later errors.

    So as a last-ditch attempt, no longer really caring about the old UseNet articles (and knowing they are probably on the previous version of this host's disks anyway), I tried:

     e2fsck -yf /dev/r1/news
    

    and that did at least result in fewer errors/corrections, but it did throw a lot of things in lost+found :-(

    I ran e2fsck -f /dev/r1/news again to see if it had fixed everything there was to fix, and at least it did come up clean this time. On mounting the file system, there were 7000 articles in lost+found, out of several million on the file system. So I suppose it could have been worse. Grepping through them, they appear to have been from four Newsgroups (presumably the four inodes originally reported as having problems), and all are ones I do not really care about any longer. inn2 still started, so I declared success at this point.

    At some point perhaps I should have another go at enabling dir_index, but definitely not during a system upgrade!
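
    If I do, the minimal sequence would presumably be something like this, with the file system unmounted first (the /news mount point here is an assumption for illustration, not something shown above):

    sudo umount /news
    sudo tune2fs -O dir_index /dev/r1/news   # re-enable hashed directory indexes
    sudo e2fsck -fD /dev/r1/news             # -D rebuilds/optimises the directory indexes
    sudo e2fsck -f /dev/r1/news              # second pass to confirm it comes up clean
    sudo mount /news
    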

  • python2.6 and related packages, and squid (2.x; replaced by squid3), needed to be removed before db5.1-util could be upgraded. They are apparently linked against libdb5.1, which is not provided in Debian Jessie, but is declared broken by db5.1-util unless it is a newer version than the one in Debian Wheezy. In Debian Jessie only the binary tools are provided, and apt offers to uninstall them as an unneeded package. (A reverse-dependency check, sketched at the end of this item, helps find what is still tied to libdb5.1.)

    Also netatalk is in Debian Wheezy and depends on libdb5.1, but is not in Debian Jessie at all. This surprised other people too, and netatalk seems to be back in Debian Stretch. But it is still netatalk 2.x, rather than netatalk 3.x which has been released for years; someone has attempted to update the netatalk package to netatalk 3.1, but that also seems to have been abandoned for the last couple of years. (Because I was upgrading through to Debian Stretch, I chose to leave the Debian Wheezy version of netatalk, and libdb5.1 from Debian Wheezy, installed until after the upgrade to Debian Stretch.)
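
    For reference, a quick way to list which installed packages still depend on (eg, are linked against) libdb5.1 is something like:

    apt-cache rdepends --installed libdb5.1
    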

Debian 8 (Jessie) to Debian 9 (Stretch)

  • Purged the now removed packages:

    # dpkg -l | awk '/^rc/ { print $2 }'
    fonts-droid
    libcwidget3:i386
    libmagickcore-6.q16-2:i386
    libmagickwand-6.q16-2:i386
    libproxy1:i386
    libsigc++-2.0-0c2a:i386
    libtag1-vanilla:i386
    perl-modules
    #
    

    with:

    sudo apt-get purge $(dpkg -l | awk '/^rc/ { print $2 }')
    

    to clear out the old configuration files.

  • Checked changes in /etc/default/grub:

    diff /etc/default/grub.ucf-dist /etc/default/grub
    

    and updated grub using update-grub.

  • Checked changes in /etc/ssh/sshd_config:

    grep -v "^#" /etc/ssh/sshd_config.ucf-old | grep '[a-z]'
    grep -v "^#" /etc/ssh/sshd_config | grep '[a-z]'
    

    and checked that the now commented-out lines are the defaults. Then check that sshd stops/starts/restarts with the new configuration:

    sudo service ssh stop
    sudo service ssh start
    sudo service ssh restart
    

    and that ssh logins work after the upgrade.

  • The isc-dhcp-server service failed to start because it wanted to start both IPv4 and IPv6 service, and the previous configuration (and indeed the network) only had IPv4 configuration:

    dhcpd[15518]: No subnet6 declaration for eth0
    

    Looking further back in the log I saw:

    isc-dhcp-server[15473]: Launching both IPv4 and IPv6 servers [...]
    

    with the hint "(please configure INTERFACES in /etc/default/isc-dhcp-server if you only want one or the other)".

    Setting INTERFACES in /etc/default/isc-dhcp-server currently works to avoid starting the IPv6 server, but it results in a warning:

    DHCPv4 interfaces are no longer set by the INTERFACES variable in
    /etc/default/isc-dhcp-server.  Please use INTERFACESv4 instead.
    Migrating automatically for now, but this will go away in the future.
    

    so I edited /etc/default/isc-dhcp-server and changed it to set INTERFACESv4 instead of INTERFACES.
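
    For reference, the relevant part of /etc/default/isc-dhcp-server now looks something like this (eth0 being the only DHCP interface here, per the log above; INTERFACESv6 is left empty as there is no IPv6 DHCP configuration):

    INTERFACESv4="eth0"
    INTERFACESv6=""
    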

    After that:

    sudo service isc-dhcp-server stop
    sudo service isc-dhcp-server start
    sudo service isc-dhcp-server restart
    

    worked without error, and syslog reported:

    isc-dhcp-server[15710]: Launching IPv4 server only.
    isc-dhcp-server[15710]: Starting ISC DHCPv4 server: dhcpd.
    
  • The /etc/rsyslog.conf has changed somewhat, particularly around the syntax for loading modules. Lines like:

    $ModLoad imuxsock # provides support for local system logging
    

    have changed to:

    module(load="imuxsock") # provides support for local system logging
    

    I used diff /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-dist to find these changes and merged them by hand. I also removed any old commented out sections no longer present in the new file, but kept my own custom changes (for centralised syslog).
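
    For reference, old-style forwarding lines such as "*.* @loghost" can also be rewritten in the new action syntax; a minimal sketch (the hostname is a placeholder, not my actual centralised syslog configuration):

    *.* action(type="omfwd" target="loghost.example.com" port="514" protocol="udp")
    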

    Then tested with:

    sudo service rsyslog stop
    sudo service rsyslog start
    sudo service rsyslog restart
    
  • This time, even after reboot, apt-get reported a whole bunch of unneeded packages, so I ran:

    sudo apt-get --purge autoremove
    

    to clean them up.

  • An aptitude search:

    aptitude search '~i(!~ODebian)'
    

    from the Debian Stretch Release Notes on Checking system status provided a hint on finding packages which used to be provided, but are no longer present in Debian. I went through the list by hand and manually purged anything which was clearly an older package that had been replaced (eg old cpp and gcc packages) or was no longer required. There were a few that I did still need, so I have left those installed -- but it would be better to find a newer Debian packaged replacement to ensure there are updates (eg, vncserver).

  • Removing the Debian 8 (Jessie) kernel:

    sudo apt-get purge linux-image-3.16.0-4-686-pae
    

    gave the information that the libc6-i686 library package was no longer needed, as in Debian 9 (Stretch) it is just a transitional package, so I did:

    sudo apt-get --purge autoremove
    

    to clean that up. (I tried removing the multiarch-support "transitional" package again at this point, but there were still a few packages with unmet dependencies without it, including gnuplot, libinput10, libreadline7, etc, so it looks like this "transitional" package is going to be with us for a while yet.)
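
    A low-risk way to re-check that later is a simulated removal, which lists everything apt would remove along with it:

    sudo apt-get --simulate purge multiarch-support
    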

  • update-initramfs reported a wrong UUID for resuming (presumably due to the swap having been reinitialised at some point):

    update-initramfs: Generating /boot/initrd.img-4.9.0-3-686-pae
    W: initramfs-tools configuration sets RESUME=UUID=22dfb0a9-839a-4ed2-b20b-7cfafaa3713f
    W: but no matching swap device is available.
    I: The initramfs will attempt to resume from /dev/vdb1
    I: (UUID=717eb7a5-b49c-4409-9ad2-eb2383957e77)
    I: Set the RESUME variable to override this.
    

    which I tracked down to the config in /etc/initramfs-tools/conf.d/resume, which contains only that one single line.
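
    For reference, that file contained just the stale line:

    RESUME=UUID=22dfb0a9-839a-4ed2-b20b-7cfafaa3713f
    

    and the UUID of the current swap device can be confirmed with something like "sudo blkid /dev/vdb1".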

    To get rid of the warning I updated the UUID in /etc/initramfs-tools/conf.d/resume to match the new auto-detected one, and tested that it worked by running:

    sudo update-initramfs -u
    
  • The log was being spammed with:

    console-kit-daemon[775]: missing action
    console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    console-kit-daemon[775]: console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    

    messages. Based on the hint that consolekit has not been necessary since Debian Jessie in the majority of cases, and knowing almost all logins to this server are via ssh, I followed the instructions in that message to remove consolekit:

    sudo apt-get purge consolekit libck-connector0 libpam-ck-connector
    

    to silence those messages. (This may possibly be a Debian 8 (Jessie) related tidy up, but I did not discover it until after upgrading to Debian 9 (Stretch).)

  • A local internal (ancient, Debian Woody vintage) apt repository no longer works:

    W: The repository 'URL' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.
    

    Since the one needed local package was already installed long ago, I just commented that repository out in /etc/apt/sources.list. The process for building apt repositories has been updated considerably in the last 10-15 years.
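
    If I had still needed the repository, one workaround would have been to mark it as trusted in sources.list, explicitly accepting the lack of authentication (the URL and suite here are placeholders, not my actual repository):

    deb [trusted=yes] http://apt.internal.example/debian woody main
    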

  • After upgrading and rebooting, on one old (upgraded many times) system systemd-journald and rsyslogd were running flat out after boot, and lpd was running regularly. Between them they were spamming the /var/log/syslog file with:

    lpd[847]: select: Bad file descriptor
    

    lines, many, many, many times a second. I stopped lpd with:

    sudo service lpd stop
    

    and the system load returned to normal, and the log lines stopped. The lpd in this case was provided by the lpr package:

    ewen@HOST:~$ dpkg -S /usr/sbin/lpd
    lpr: /usr/sbin/lpd
    ewen@HOST:~$
    

    and it did not seem to have changed much since the Debian Jessie lpr package -- Debian Wheezy had 1:2008.05.17+nmu1, Debian Jessie had 1:2008.05.17.1, and Debian Stretch has 1:2008.05.17.2. According to the Debian Changelog the only difference between Debian Jessie and Debian Stretch is that Debian Stretch's version was updated to later Debian packaging standards.

    Searching on the Internet did not turn up anyone else reporting the same issue in lpr.

    Doing:

    sudo service lpd start
    

    again a while after boot did not produce the same symptoms, so for now I have left it running.

    However some investigation in /etc/printcap revealed that this system had not been used for printing for quite some time, as its only printer entries referred to printers that had been taken out of service a couple of years earlier. So if the problem reoccurs I may just remove the lpr package completely.

    ETA, 2017-07-14: This happened again after another (unplanned) reboot (caused by multiple brownouts getting through the inexpensive UPS). Because I did not notice in time, it then filled up / with a 4.5GB /var/log/lpr.log, with endless messages of:

    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    

    so, since I had not used the printing functionality on this machine for quite some time, I ended up just removing it completely:

    sudo cp /dev/null /var/log/lpr.log
    sudo cp -p /etc/printcap /etc/printcap-old-2017-07-14
    sudo apt-get purge lpr
    sudo logrotate -f /etc/logrotate.d/rsyslog
    sudo logrotate -f /etc/logrotate.d/rsyslog
    

    which seemed more time-efficient than trying to debug which file descriptor it was complaining about (my guess is maybe one which systemd closed for lpd, that the previous init system did not close, but I have not investigated that in detail). I kept a copy of /etc/printcap in case I do want to try to restore the printing functionality (or debug it later), but most likely I would just set up printing from scratch.

    The two (forced) log rotates were to force compression of the other copies of the 4GB of log messages (in /var/log/syslog, which rotates daily by default, and /var/log/messages which rotates weekly by default), having removed /var/log/lpr.log which was another 4.5GB. Unsurprisingly they compress quite well given the logs were spammed with a single message -- but even compressed they are still about 13MB.

After fixing up those upgrade issues the first upgraded system seems to have been running properly on Debian 9 (Stretch) for the last few days, including helping publish this blog post :-)

ETA, 2017-06-11: Updates, particularly around inn2 upgrade issues.

ETA, 2017-06-17: Updates on boot issues in jessie, fixed by stretch.

Posted Wed Jun 7 10:50:46 2017 Tags:

I have Java installed for precisely one reason: to be able to access Dell iDRAC consoles on both my own server and various client servers. Since Java on the web has been a terrible idea for years, and since the Dell iDRAC relies on various binary modules which do not work on Mac OS X, I have restricted this Java install to a single VM on my desktop which I start up when I need to access the iDRAC consoles.

For the last few years, this "iDRAC console" VM has been an Ubuntu 14.04 LTS VM, with OpenJDK 7 installed. It was the latest available at the time I installed it, and since it was working I left it alone. Unfortunately after upgrading some client Dell hosts to the latest iDRAC firmware, as part of a redeployment exercise, those iDRACs stopped working with this Ubuntu 14.04/OpenJDK 7 environment. But I was able to work around that by using a newer Java environment on a client VM.

Today, when I went to use the Java console with my own older Dell server, the iDRAC console no longer started properly, failing with a Java error:

Fatal: Application Error: Cannot grant permissions to unsigned jars.

which was a surprise as it had previously worked as recently as a few weeks ago.

One StackExchange hint suggested this policy could be overridden by running:

/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/itweb-settings

and changing the Policy Settings to allow "Execute unowned code". But in my case that made no difference. I also tried setting the date in the VM back a year, in case the signing certificate had recently expired -- but that too made no difference.

Given the hint that OpenJDK 8 actually worked, and finding some backports of OpenJDK 8 to Ubuntu 14.04 LTS (which was released shortly after OpenJDK 8 came out, so does not contain it), I decided to try installing the OpenJDK 8 versions on Ubuntu 14.04 LTS. Fortunately this did actually work.

To install OpenJDK 8 on Ubuntu 14.04 LTS ("trusty") you need to install from the OpenJDK builds PPA, which is not officially part of Ubuntu, but is managed by someone linked with Ubuntu, so is a bit more trustworthy than "random software found on the Internet".

Installation of OpenJDK 8:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

and it can be made the default by running:

sudo update-alternatives --config java

and choosing the OpenJDK 8 version.
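
A quick way to confirm the default has actually changed (the exact version string will vary):

java -version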

Unfortunately that does not include javaws, which is the JNLP client that actually triggers the iDRAC console startup -- which meant that the OpenJDK 7 javaws was still being used (and failing) to launch the iDRAC console. Some hunting turned up the need to install icedtea-8-plugin from another Ubuntu PPA to get a newer javaws that would work with OpenJDK 8. To install this one:

sudo add-apt-repository ppa:maarten-fonville/ppa
sudo apt-get update
sudo apt-get install icedtea-8-plugin

Amongst other things this updates the icedtea-netx package, which includes javaws, to also include a version for OpenJDK 8. Unfortunately the updated package did not make the new OpenJDK 8 javaws the default, nor did update-alternatives --config javaws offer the OpenJDK 8 javaws as an option -- which meant the old, non-working, OpenJDK 7 version still launched.

To actually use the newer OpenJDK 8 javaws, I had to manually update the /etc/alternatives symlink:

cd /etc/alternatives
sudo rm javaws
sudo ln -s /usr/lib/jvm/java-8-openjdk-i386/jre/bin/javaws .
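
A quick check that the alternatives symlink now points at the OpenJDK 8 javaws:

ls -l /etc/alternatives/javaws
which javaws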

After which, finally, I could launch the iDRAC console again and carry on with what I originally planned to do. I hope this will have fixed the iDRAC console access on the newer iDRAC firmware on some of my client machines too; but I have not tested that so far.

Posted Mon May 29 11:29:04 2017 Tags:

After running into problems trying to get git-annex to run on an SMB share on my Synology DS216+, and prompted by the git-annex author and an example with an earlier Synology NAS, I decided to install the standalone version of git-annex directly on my Synology DS216+.

My approach was similar to the earlier "Synology NAS and git annex" tip, but the DS216+ uses an x86_64 CPU:

ewen@nas01:/$ grep "model name" /proc/cpuinfo | uniq
model name  : Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz
ewen@nas01:/$

and I chose a slightly different approach to getting everything working, in part based on my experience setting up the standalone git-annex on a Mac OS X server. I am using Synology DSM "DSM 6.1.1-15101 Update 4", which is the latest release as I write (released 2017-05-25).

To install git-annex:

  • In the Synology web interface (DSM) enable the "SSH Service", in Control Panel -> Terminal, by ticking "Enable SSH Service", and verify that you can ssh to your Synology NAS. Only accounts in the administrators group can use the ssh service, so you will need to create an administrator account to use if you do not already have one. (If your Synology NAS is exposed to the Internet directly now would be a very good time to ensure you have a strong password on the account; mine is behind a separate firewall.)

  • In the Synology web interface (DSM) go to the Package Center and search for "Git Server" (git) from Synology and install that package. It should install in a few seconds, and currently appears to install git 2.8.0:

      ewen@nas01:/$ git --version
      git version 2.8.0
      ewen@nas01:/$
    

    which while not current (eg my laptop has git 2.13.0), is only about a year old. It is a symlink (in /usr/bin/git) into the Git package in /var/packages/Git/target/bin/git.

  • Verify that you can now reach the necessary parts of the git package:

    for FILE in git git-shell git-receive-pack git-upload-pack; do
        which "${FILE}"
    done
    

    should produce something like:

    /bin/git
    /bin/git-shell
    /bin/git-receive-pack
    /bin/git-upload-pack
    
  • Download the latest git-annex standalone x86-64 tarball, and its gpg signature.

  • Verify the git-annex gpg signature (as with previous installs):

    gpg --verify git-annex-standalone-amd64.tar.gz.sig
    

    which should report a "Good signature" from the "git-annex distribution signing key" (DSA key ID 89C809CB, Primary key fingerprint: 4005 5C6A FD2D 526B 2961 E78F 5EE1 DBA7 89C8 09CB).

    If you have not already verified that key is the right signature key it can be verified against, eg, keys in the Debian keyring as Joey Hess is a former Debian Developer.

  • Once you are happy with git-annex tarball you downloaded, copy it onto the NAS somewhere suitable, eg on the NAS:

    sudo mkdir /volume1/thirdparty
    sudo mkdir /volume1/thirdparty/archives
    sudo chown "$(id -nu):$(id -ng)" /volume1/thirdparty/archives
    

    then from wherever you downloaded the git-annex archive:

    scp -p git-annex-standalone-amd64.tar.gz* nas01.em.naos.co.nz:/volume1/thirdparty/archives/
    
  • Extract the archive on the NAS:

    cd /volume1/thirdparty
    sudo tar -xzf archives/git-annex-standalone-amd64.tar.gz
    

    The extracted archive is about 160MB, because of bundling all the required tools:

    ewen@nas01:/volume1/thirdparty$ du -sm git-annex.linux/
    161 git-annex.linux/
    ewen@nas01:/volume1/thirdparty$
    

    to make it a standalone version (as well as statically linking everything).

  • Symlink git-annex into /usr/local/bin so we have a common place to reference these binaries:

    cd /usr/local/bin
    sudo ln -s /volume1/thirdparty/git-annex.linux/git-annex .
    

    In a normal login shell /usr/local/bin will be on the PATH, and:

    which git-annex
    

    should print:

    /usr/local/bin/git-annex
    

    and you should be able to run git-annex by itself and have it print out the basic help text.

    Unfortunately this does not work for non-interactive shells, because the Synology NAS uses a "/bin/sh" symlink to bash, which means that non-interactive shells do not process ~/.bashrc, and non-interactive shells also do not read /etc/profile (which is where /usr/local/bin is added to the PATH). So we have to add some more workarounds, with symlinks into /usr/bin/ later (see below).

    For reference, this is my /etc/passwd entry created by the Synology NAS web interface (DSM):

    ewen@nas01:~$ grep "$(id -un):" /etc/passwd
    ewen:x:1026:100:Ewen McNeill:/var/services/homes/ewen:/bin/sh
    ewen@nas01:~$
    
  • To fix the warning:

    warning: /bin/sh: setlocale: LC_ALL: cannot change locale (en_US.utf8)
    

    we have to pre-create the locales directory that the git-annex runshell script tries to write into, give it permissions that a regular user can write to, and then run git-annex once.

    sudo mkdir /volume1/thirdparty/git-annex.linux/locales
    sudo chown "$(id -un):$(id -gn)" /volume1/thirdparty/git-annex.linux/locales
    

    On the Synology NAS, with the default locale:

    ewen@nas01:~$ set | egrep "LANG|LC_ALL"
    LANG=en_US.utf8
    LC_ALL=en_US.utf8
    ewen@nas01:~$
    

    this should create:

    ewen@nas01:~$ ls /volume1/thirdparty/git-annex.linux/locales/en_US.utf8/
    LC_ADDRESS  LC_IDENTIFICATION  LC_MONETARY  LC_PAPER
    LC_COLLATE  LC_MEASUREMENT     LC_NAME      LC_TELEPHONE
    LC_CTYPE    LC_MESSAGES        LC_NUMERIC   LC_TIME
    ewen@nas01:~$
    

    And then we can revert the file permissions to root owned:

    sudo chown -R root:root /volume1/thirdparty/git-annex.linux/locales
    

    Note that it is possible to change the interactive locale by setting LANG and LC_ALL in, eg, ~/.bash_profile (but this will not work for non-interactive shells). git-annex only supports utf8 locales, but that is probably the most useful modern choice anyway. (I chose not to bother as en_US.utf8 is close enough to my usual locale -- en_NZ.utf8 -- that it did not really matter at present; the main difference would be the date format, and I do not expect to use git-annex interactively on the Synology NAS often enough for that to be an issue. I just wanted the warning message gone, as it turns up repeatedly in interactive use.)
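
    If I did want my usual locale for interactive shells, it would just be a matter of adding something like this to ~/.bash_profile (and probably regenerating the git-annex locales directory for it, as above):

    export LANG=en_NZ.utf8
    export LC_ALL=en_NZ.utf8
    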

  • To usefully use git-annex you probably also want to enable the "User Home" feature, so that the home directory for your user is created and you can store things like ssh keys; this also enables a per-user share (via CIFS, etc). To do this, in the Synology web interface (DSM) go to Control Panel -> User -> User Home and tick "Enable user home service", and hit Apply. This will create a /volume1/homes directory, a directory for each user, and a /var/services/homes symlink pointing at /volume1/homes so that the shell home directories are reachable.

    Once that is done, when you ssh into the NAS, the message about your home directory being missing:

    Could not chdir to home directory /var/services/homes/ewen: No such
    file or directory
    

    should be gone, and you should arrive in your home directory at login:

      ewen@nas01:~$ pwd
      /var/services/homes/ewen
      ewen@nas01:~$
    
  • If you do have a home directory, you might also want to do some common git setup:

    git config --global user.email ...    # Insert your email address
    git config --global user.name ...     # Insert your name
    

    which should run without any complaints, creating a ~/.gitconfig file with the values you supply.

  • Assuming you do have a user home directory you can usefully run the next step to have git-annex auto-generate a couple of necessary helper scripts in ${HOME}/.ssh/ -- which cannot be automatically created otherwise (but see the contents below if you want to try to create them by hand).

    To create the helper scripts automatically run:

    /volume1/thirdparty/git-annex.linux/runshell
    

    which will start a new shell, with /volume1/thirdparty/git-annex.linux/bin in the "${PATH}" so you can interactively use the git-annex versions of tools (eg, for testing).

    It also creates the two helper scripts that we need:

    $ ls -l ${HOME}/.ssh
    total 8
    -rwxrwxrwx 1 ewen users 241 May 28 11:20 git-annex-shell
    -rwxrwxrwx 1 ewen users  74 May 28 11:20 git-annex-wrapper
    $
    
  • Since (a) these scripts are not user specific and (b) "${HOME}/.ssh" is not on the PATH by default, it is much more useful to move these scripts into, eg, /usr/local/bin/, so they are in a central location.

    To do this:

    cd /usr/local/bin
    sudo mv "${HOME}/.ssh/git-annex-shell" .
    sudo mv "${HOME}/.ssh/git-annex-wrapper" .
    sudo chown root:root git-annex-shell git-annex-wrapper
    sudo chmod 755 git-annex-shell git-annex-wrapper
    

    This should give you two trivial shell scripts, which hard code the path to where you unpacked git-annex:

    ewen@nas01:/usr/local/bin$ ls -l git-annex-*
    -rwxr-xr-x 1 root root 241 May 28 11:20 git-annex-shell
    -rwxr-xr-x 1 root root  74 May 28 11:20 git-annex-wrapper
    ewen@nas01:/usr/local/bin$ cat git-annex-shell
    #!/bin/sh
    set -e
    if [ "x$SSH_ORIGINAL_COMMAND" != "x" ]; then
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$SSH_ORIGINAL_COMMAND"
    else
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$@"
    fi
    ewen@nas01:/usr/local/bin$ cat git-annex-wrapper
    #!/bin/sh
    set -e
    exec '/volume1/thirdparty/git-annex.linux/runshell' "$@"
    ewen@nas01:/usr/local/bin$
    

    (which gives you enough to create them by hand if you need to, substituting the path where you unpacked the git-annex standalone archive for /volume1/thirdparty/).

  • To be able to run these helper scripts, and git-annex itself, from a non-interactive shell -- such as when git-annex itself is trying to run the remote git-annex -- we need to ensure that git-annex, git-annex-shell and git-annex-wrapper are reachable via a directory that is in the default PATH. That default PATH is very minimal, containing:

    ewen@ashram:~$ ssh nas01.em.naos.co.nz 'set' | grep PATH
    PATH=/usr/bin:/bin:/usr/sbin:/sbin
    ewen@ashram:~$
    

    Since /bin and /sbin are both symlinks anyway:

    ewen@nas01:~$ ls -l /bin
    lrwxrwxrwx 1 root root 7 May 21 18:57 /bin -> usr/bin
    ewen@nas01:~$ ls -l /sbin
    lrwxrwxrwx 1 root root 8 May 21 18:57 /sbin -> usr/sbin
    ewen@nas01:~$
    

    that gives us only two choices -- /usr/bin and /usr/sbin -- which are on the default PATH. Given that git-annex is not a system administration tool, only /usr/bin makes sense.

    To do that, symlink them into /usr/bin:

    cd /usr/bin
    sudo ln -s /usr/local/bin/git-annex* .
    

    I am expecting that this step may need to be redone periodically, as various Synology updates update /usr/bin, which is why I have a "master" copy in /usr/local/bin and just symlink it into /usr/bin. For git-annex this is a chain of two symlinks:

    ewen@nas01:~$ ls -l /usr/bin/git-annex
    lrwxrwxrwx 1 root root 24 May 28 11:57 /usr/bin/git-annex -> /usr/local/bin/git-annex
    ewen@nas01:~$ ls -l /usr/local/bin/git-annex
    lrwxrwxrwx 1 root root 45 May 28 11:01 /usr/local/bin/git-annex -> /volume1/thirdparty/git-annex.linux/git-annex
    ewen@nas01:~$
    

    which is slightly inefficient, but still convenient for restoring later.

  • Now is a convenient time to set up ssh key access to the Synology NAS, by creating ${HOME}/.ssh/authorized_keys as usual. Since we do not need a special key to trigger a special hard coded path to git-annex-shell (because it is on the PATH) you can use your regular key if you want rather than a dedicated "git-annex on Synology NAS" key.

    Ensure that the permissions on the ${HOME}/.ssh directory and the authorized_keys file are appropriately locked down so that sshd will trust them, eg:

    cd
    chmod go-w .
    chmod 2700 .ssh
    chmod 400 .ssh/authorized_keys
    

    and then you should be able to ssh to the NAS with key authentication; if it does not work use "ssh -v ..." to figure out the error, which is most likely a permissions problem like:

    debug1: Remote: Ignored authorized keys: bad ownership or modes for directory /volume1/homes/ewen
    

    because the permissions on the default created directories are very permissive (and would allow anyone to create a ssh authorized key entry), so sshd will not trust the files until the permissions are corrected.

  • All going well at this point you should be able to verify that you can reach all the necessary programs from a non-interactive ssh session with something like:

    for FILE in git-annex git-annex-shell git-annex-wrapper git git-shell git-receive-pack git-upload-pack; do
        ssh NAS "which ${FILE}"
    done
    

    and get back answers like:

    /usr/bin/git-annex
    /usr/bin/git-annex-shell
    /usr/bin/git-annex-wrapper
    /usr/bin/git
    /usr/bin/git-shell
    /usr/bin/git-receive-pack
    /usr/bin/git-upload-pack
    

    if one or more of those is missing from the output you will want to figure out why before continuing.

  • To centralise my git-annex storage, I created an "annex" share through the Synology NAS web interface (DSM) in Control Panel -> Shared Folder. This created a /volume1/annex directory.

  • To make that easily accessible, I created a top level symlink to it:

    sudo ln -s /volume1/annex /
    

    giving:

    ewen@nas01:~$ ls -l /annex
    lrwxrwxrwx+ 1 root root 14 May 28 12:10 /annex -> /volume1/annex
    ewen@nas01:~$
    

    This matches the pattern I use on some other machines.

Once all this setup is done, git-annex can be used effectively as on any other Linux/Unix machine. For instance you can "push a clone" onto the NAS using "git bundle" and "git clone" from the bundle, and then add that as a "git remote" and use "git annex sync" and "git annex copy ..." to copy into it.
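
For example, a minimal sketch of that workflow, assuming an existing annex on the local machine (the "myannex" repository name, bundle path, and "master" branch name are placeholders):

# On the local machine, inside the existing annex:
git bundle create /tmp/myannex.bundle --all
scp /tmp/myannex.bundle nas01:/annex/

# On the NAS (eg, in an interactive ssh session):
cd /annex
git clone -b master myannex.bundle myannex
cd myannex && git annex init nas01

# Back on the local machine, add the NAS clone as a remote and copy into it:
git remote add nas01 nas01:/annex/myannex
git annex sync nas01
git annex copy --to nas01 .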

The "standalone git-annex" will probably need updating periodically (for bug/security fixes, new features, etc), but it should be possible to do that simply by replacing the unpacked tarfile contents as required; everything else points back to that directory. (Possibly the locale generation step might need to be done by hand again.)

Finally for future reference, it is also possible to run a Debian chroot on the Synology NAS, which would open up even more possibilities for using the NAS as a more general purpose machine.

ETA 2017-06-25: Beware that (certain?) Synology updates will rebuild the root file system, and/or clean out unexpected symlinks. So after an update, or reboot, it is necessary to redo:

sudo ln -s /volume1/annex /
cd /usr/bin
sudo ln -s /usr/local/bin/git-annex* .

before git-annex-shell will be automatically found, and the nas01:/annex/... paths will work again.

I have worked around this by creating /usr/local/bin/activate-git-annex containing:

#! /bin/sh
# Relink git-annex paths onto the Synology root file system
#
if [ -e /annex ]; then
  :
else
  sudo ln -s /volume1/annex /
fi
cd /usr/bin && for FILE in /usr/local/bin/git-annex*; do
                  if [ -e "$(basename ${FILE})" ]; then
                    :
                  else
                    sudo ln -s "${FILE}" .;
                  fi
                done

So that, when git-annex breaks after an upgrade, I can just run:

activate-git-annex

from an interactive shell and it will all work again. (/usr/local/bin seems to survive upgrades, but unfortunately, as noted above, is not in the default PATH for a non-interactive shell, so it is not a complete solution. However it is in the default PATH for an interactive shell, hence the simple command above.)

ETA 2017-07-16: Fixed up the logic of the "check if already done" test in /usr/local/bin/activate-git-annex. It appears this will need to be run every time the Synology is updated, at least by any update that causes it to reboot.

Posted Sun May 28 13:09:13 2017 Tags: