Fundamental Interconnectedness

This is the occasional blog of Ewen McNeill. It is also available on LiveJournal as ewen_mcneill, and Dreamwidth as ewen_mcneill_feed.

Introduction

When I upgraded my Vodafone cable connection to Vodafone FibreX, the installation came with a Huawei HG659 Home Gateway supplied as part of the connection, including some IPv6 support and 802.11ac WiFi. While I was not particularly keen on using a "telco CPE" as my home edge device (amongst other things they have a general reputation for being poorly secured), it was faster than anything else I had at the time, so I planned to use it until I had a specific reason to need something else to guide the next purchase.

That specific reason came along when I wanted a proper Network "DMZ" for my home network to host some development systems (I often work from home). The Huawei HG659 supports NAT pinholes routed to an internal system on the (single) LAN, but otherwise does not provide any real DMZ network isolation.

I purchased a Mikrotik RB750Gr3 -- nicknamed the Mikrotik "hEX" -- to be my replacement home edge router. They are around NZ$120 from the main local Mikrotik reseller, making them a fairly cost-effective home router for someone needing extra flexibility. The Mikrotik RB750Gr3 has a dual-core 850MHz MIPS CPU, with 256MB of RAM, 16MB of flash, and 5 Gigabit Ethernet (copper 1GBase-T) interfaces (via a single Ethernet switch chip). It is packaged in an indoor case, and while it is capable of PoE via ether1, typically in a home environment it would be powered by a (supplied) 24V DC plug pack.

(The other common "advanced home user" replacement seems to be a Ubiquiti device like the UniFi Security Gateway, which can be configured for VLAN tagging, or the Ubiquiti EdgeRouter. I chose the Mikrotik because I am very familiar with them from previous jobs and client sites, and for single devices I much prefer direct configuration to forced management via a "Security Controller" as is required by the Ubiquiti UniFi line. Something like the Ubiquiti EdgeRouter X 5-port, also a similar price in New Zealand, and apparently configurable from the command line, may well work just as well as the Mikrotik RB750Gr3 for this purpose -- I have not tried it myself.)

For my home Vodafone FibreX configuration I wanted:

  • A "WAN" connection to the Vodafone supplied TechniColor TC4400VDF DOCSIS 3.1 cable modem

  • Multiple LAN ports switched together

  • A "DMZ" interface with its own routing firewall rules, that was separated from both the WAN and LAN

  • Equal support for both IPv4 and IPv6 (Vodafone FibreX has provided IPv6 by default since shortly before my connection was converted to FibreX)

  • The ability to use IPv6 to expose devices in the DMZ, at least to known external locations, without needing to use IPv4 NAT or a VPN

I also wanted to continue to use the Vodafone-supplied Huawei HG659 home gateway for its 802.11ac WiFi, since it is the only 802.11ac AP in my house at present. I achieved that by simply disconnecting the WAN interface of the Huawei HG659 but leaving the LAN interface connected to the Mikrotik RB750Gr3 -- and changing the LAN address of the Huawei HG659 away from my LAN default gateway address. This means the Huawei HG659 acts as an 802.11ac WiFi to Gigabit Ethernet bridge (which also allows using the Huawei HG659 LAN ports as an additional Ethernet switch, which is handy as it is on my "comms" UPS). Presumably the Huawei HG659 is vulnerable to the Wifi KRACK replay attacks, but Vodafone issued a Huawei HG659 firmware update in August 2017, available for download via the Huawei HG659 user guide page, so maybe there will be another update available later in the year to fix more issues. In my configuration (without the WAN port connected) Vodafone will not be able to upgrade my Huawei HG659 remotely, but they do give instructions on manually installing the upgrade.

Configuration

The configuration I chose is based on a GeekZone Forum example of Mikrotik configuration for Vodafone FibreX, a Mikrotik Wiki guide to "Securing Your Router", and a Mikrotik IPv6 Home Example, as well as a lot of experience configuring Mikrotik devices as routers over the years.

To do the initial configuration I used a reimplementation of the Mikrotik mac-telnet feature, which works directly over the Ethernet connection without requiring an IP address to be configured. Looking now, I see there is a later fork of the mac-telnet reimplementation with more features, as well as at least one other older independent implementation. (These mac-telnet reimplementations are much easier to use on Linux / OS X than trying to install WINE to get the Windows Mikrotik MAC-Telnet running -- although that does work too, and I have done it in the past.) Of note, there is also a Wireshark dissector for the MAC-Telnet protocol, and a reverse engineered packet description of the MAC-Telnet protocol -- it looks like it uses Ethernet broadcast frames with IPv4 UDP-like packets to/from port 20561.
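
For reference, basic usage of the Linux mactelnet client (this assumes the haakonnessjoen MAC-Telnet implementation, or one of its forks; the MAC address below is just a placeholder) looks roughly like:

mndp
mactelnet -l
mactelnet 4C:5E:0C:XX:XX:XX

where "mndp" and "mactelnet -l" discover Mikrotik devices on the local Ethernet segment, and the final command opens a MAC-Telnet session to the given device and prompts for RouterOS credentials.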

Interface layout

The Mikrotik RB750Gr3 is a 5-GigE (1GBase-T) interface device, with all ports connected to an internal Ethernet switch chip. For my use case I wanted:

  • ether1 as the WAN interface, connected to the Vodafone supplied TechniColor TC4400VDF cable modem

  • ether2, ether3, and ether4 as LAN interfaces, with fast switching amongst them; and

  • ether5 as the DMZ interface

so that is the configuration used below. The Ethernet switch chip functionality ("master-port") is used for the LAN interfaces, which then all appear as ether2 as far as the rest of the configuration is concerned; the other two (ether1 for WAN; ether5 for DMZ) are stand-alone interfaces.

The Vodafone FibreX configuration -- unlike earlier Vodafone cable modem gateways, but like many UFB connections -- uses VLAN tagging, with VLAN 10, on the WAN connection. I assume Vodafone do this to have a fairly consistent configuration on their Huawei HG659 devices, which they use across multiple different connection types. Unlike many UFB connections, which still use PPPoE, the Vodafone FibreX connection continues to use what is now called "IPoE" -- IP over Ethernet without additional layers like PPPoE (which itself is IP over PPP over Ethernet).

This means that there is an additional logical interface

  • VLAN 10 on ether1, which I have called "fibrex" in this configuration

to which all the IP level configuration is attached. Raw (untagged) ether1 is only used to reach the management IP of the TechniColor TC4400VDF cable modem (on 192.168.100.1) -- and then only to check that it is alive, as the default page provides basically no information, and is password protected with unspecified passwords. (Sadly there is no equivalent of the Motorola SB5100 modem light status page.)
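
To actually reach that management IP from the Mikrotik, untagged ether1 needs an address in the same subnet. A minimal sketch -- assuming 192.168.100.0/24 is not used elsewhere internally, and picking .2 arbitrarily -- is:

/ip address add interface=ether1 address=192.168.100.2/24 comment="TC4400VDF management"

after which "/ping 192.168.100.1" from the Mikrotik is enough to check that the modem is alive (ICMP echo from the router is permitted by the output firewall rules below).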

Interface configuration

To implement this interface layout, label the Ethernet interfaces, and join ether3 and ether4 to ether2 via the Ethernet switch:

/int ethernet set ether1 comment="WAN"
/int ethernet set ether2 comment="LAN"
/int ethernet set ether3 master-port=ether2 comment="LAN (switched)"
/int ethernet set ether4 master-port=ether2 comment="LAN (switched)"
/int ethernet set ether5 comment="DMZ"

then add the fibrex VLAN interface:

/int vlan add name=fibrex interface=ether1 vlan-id=10 comment="VLAN 10 on ether1"

to hold the IP configuration facing the Vodafone FibreX connection.

Mikrotik base configuration

  • Upgrade to a recent Mikrotik RouterOS; that was 6.40.3 when I did my install, but 6.40.4 or later is recommended now as 6.40.4 includes the Wifi KRACK improvements, as well as several IPv6 related fixes.

  • Set the system name to something to help you identify it, and enable IPv6 functionality:

    /system identity set name=MY-rb750gr3
    /system package enable ipv6
    /system reboot
    

    (a reboot is required to get the IPv6 modules running).

  • Set an admin password, and (optionally) add your own admin-level account:

    /password
    /user add copy-from=admin name=ME comment="MY FULL NAME"
    

    then log in with the new account, and set its password:

    /password
    
  • Disable unnecessary services:

    /ip service print
    /ip service set telnet disabled=yes
    /ip service set ftp disabled=yes
    /ip service set www disabled=yes
    /ip service set api disabled=yes
    /ip service set winbox disabled=yes
    /ip service set api-ssl disabled=yes
    

    then check what is left enabled:

    /ip service print where disabled=no
    
  • Restrict ssh access to known internal networks and trusted IPs:

    /ip service set ssh address=A.B.C.D/24,E.F.G.H/32
    /ip ssh set strong-crypto=yes
    /ip ssh print
    
  • Disable Mikrotik WinBox server (since I do not use it; the client only runs on Windows / WINE):

    /tool mac-server mac-winbox set [find] disabled=yes
    /tool mac-server mac-winbox print
    
  • Permit mac-telnet only from internal interfaces, by adding explicit entries for the internal interfaces and then disabling the default ("all" interfaces) entry:

    /tool mac-server add interface=ether2 disabled=no
    /tool mac-server add interface=ether3 disabled=no
    /tool mac-server add interface=ether4 disabled=no
    /tool mac-server add interface=ether5 disabled=no
    /tool mac-server print
    /tool mac-server set 0 disabled=yes
    /tool mac-server print
    

    and disable the MAC-based "ping" functionality completely:

    /tool mac-server ping set enabled=no
    /tool mac-server ping print
    
  • Turn off Mikrotik Neighbor Discovery on external interfaces:

    /ip neighbor discovery print
    /ip neighbor discovery set ether1 discover=no
    /ip neighbor discovery set fibrex discover=no
    /ip neighbor discovery print
    
  • Disable other extraneous services:

    /ip dns set allow-remote-requests=no
    /ip proxy set enabled=no
    /ip socks set enabled=no
    /ip upnp set enabled=no
    /ip cloud set ddns-enabled=no update-time=no
    /tool bandwidth-server set enabled=no
    

    (several of those default to off, but it is good to be sure they are turned off when unneeded).

  • Disable IPv6 Neighbor Discovery by default (we will enable it specifically on internal interfaces later):

    /ipv6 nd set [find interface=all] disabled=yes
    /ipv6 nd print
    

    (and Vodafone FibreX requires DHCPv6 to obtain IPv6 addresses for the WAN interface, and an IPv6 pool for the internal interfaces).

IPv4 interface configuration

The Vodafone FibreX configuration expects you to use DHCPv4 to obtain the IP address, and by default hands out short-life leases (about 10 minutes) from a dynamic pool. It is possible to request a static IP address, but that is delivered as a static DHCPv4 lease, so DHCPv4 is still required. (I have requested a static IPv4 address because I often work from home, and needed access added for my IPv4 address on several clients' firewalls.)

  • Configure the WAN interface, which needs to do DHCPv4 on the fibrex (VLAN 10 on ether1) interface:

    /ip dhcp-client add interface=fibrex add-default-route=yes use-peer-dns=no use-peer-ntp=no disabled=no
    

The internal addressing needs to use IPv4 "Site Local" (RFC1918) addresses, due to the practical exhaustion of the IPv4 global address pool about 10 years ago :-(. I would strongly recommend picking a less common RFC1918 address -- not 192.168.1.0/24, 192.168.88.0/24, or any other vendor default -- to avoid future confusion.

  • Configure the LAN interface with a chosen RFC1918 address:

    /ip addr add interface=ether2 address=A.B.C.D/24 comment="LAN"
    
  • Configure the DMZ interface with another RFC1918 address (since "home" connections come with only a single IPv4 address :-( ):

    /ip addr add interface=ether5 address=E.F.G.H/24 comment="DMZ"
    

At this point the Mikrotik should route IPv4 traffic properly between the LAN and the DMZ -- but connections out to the Internet will fail due to the aforementioned IPv4 address exhaustion meaning that NAT is required -- see below for NAT configuration in the firewall section.

IPv6 interface configuration

The Vodafone FibreX provision of IPv6 is a dynamic IPv6 /56 delivered via DHCPv6; there is no option for static DHCPv6 leases, and the leases appear to be tied to the router's MAC address (so changing routers will result in completely new addresses). However the DHCPv6 leases are quite long (about 2 weeks), and renewal does appear to work, so the DHCPv6 addresses should be fairly consistent.

For IPv6 we need to request both an IPv6 address for the router's WAN interface and a pool (/56) of IPv6 addresses to use for allocating internal IP addresses. This is because IPv6 address allocation is designed to provide connectivity-based IP addresses, to minimise the size of the routing table. (There is a range of IPv6 Unique Local Addresses which are roughly equivalent to IPv4 RFC1918 addresses as "Site Local" addresses -- but they are not intended for use with Global Internet Routing, nor is NAT expected to be used with IPv6; instead end devices are expected to have multiple IPv6 addresses.)

  • Request the IPv6 address and pool from the fibrex interface:

    /ipv6 dhcp-client add interface=fibrex pool-name=fibrex-pool add-default-route=yes use-peer-dns=no request=address,prefix
    

Once we have the fibrex-pool we can then assign addresses to the internal interfaces out of that pool:

  • The LAN first:

    /ipv6 address add interface=ether2 from-pool=fibrex-pool advertise=yes
    /ipv6 firewall address-list add list=ipv6_lan \
          address=[/ipv6 address get [/ipv6 address find interface=ether2 from-pool=fibrex-pool] address]
    
  • And then the DMZ:

    /ipv6 address add interface=ether5 from-pool=fibrex-pool advertise=yes
    /ipv6 firewall address-list add list=ipv6_dmz \
          address=[/ipv6 address get [/ipv6 address find interface=ether5 from-pool=fibrex-pool] address]
    

Because these addresses are dynamic (drawn from a pool, which could change), we add them into "/ipv6 firewall address-list" entries to make them easier to use in firewall rules. (We could arrange for scripts to be run each time the pool changes, and thus these IPs change, but in practice in the last few weeks they have been very stable, so I have not yet automated updating the firewall address-lists on pool change.)

The "advertise=yes" makes the address eligible for the Mikrotik to advertise it for SLACC, which I have previously found worked best on my network (due to the Huawei HG659 DHCPv6 handing out duplicate addresses :-( ). This also avoids the need to set up stateful DHCPv6 on the internal interfaces.

To actually enable Neighbor Discovery Router Announcements (for SLAAC) on these interfaces, because we disabled it globally above, we need to configure Neighbor Discovery for these internal interfaces:

  • On the LAN:

    /ipv6 nd add interface=ether2 disabled=no ra-interval=3m20s-10m \
          ra-delay=3s mtu=unspecified reachable-time=unspecified \
          retransmit-interval=unspecified ra-lifetime=30m hop-limit=unspecified \
          advertise-mac-address=yes advertise-dns=no \
          managed-address-configuration=no other-configuration=no comment="LAN (SLAAC)"
    
  • On the DMZ:

    /ipv6 nd add interface=ether5 disabled=no ra-interval=3m20s-10m \
        ra-delay=3s mtu=unspecified reachable-time=unspecified \
        retransmit-interval=unspecified ra-lifetime=30m hop-limit=unspecified \
        advertise-mac-address=yes advertise-dns=no \
        managed-address-configuration=no other-configuration=no comment="DMZ (SLAAC)"
    

Internet edge firewalling

The firewall configuration becomes fairly complex because we have three routing interfaces (WAN = Internet; LAN; DMZ), as well as the Mikrotik itself, and two network protocols (IPv4 and IPv6) which have completely separate addresses and firewall rules. This means that we need firewall rules to cover:

  • LAN to Internet

  • DMZ to Internet

  • LAN to DMZ

  • DMZ to LAN

  • Internet to LAN

  • Internet to DMZ

for both IPv4 and IPv6. Some of those can be very generic policies (eg, "Internet to LAN" should not allow any "unexpected" traffic; "LAN to Internet" may be okay allowing pretty much everything out), but others need a fair amount of detail.

In addition the IP addresses of the WAN interface are notionally dynamic for both IPv4 and IPv6, and the LAN and DMZ interface ranges are also dynamic for IPv6 (due to being auto-assigned out of DHCPv6 provided pools). And IPv4 Internet access requires NAT, due to home connections being provided with only a single IPv4 address to share amongst several internal devices.

Since IPv4 and IPv6 are essentially completely independent, they are covered separately below.

IPv4 Firewalling

IPv4 Address Lists

The easiest way to obtain flexibility in Mikrotik firewall rule sets is to make extensive use of the "/ip firewall address-list" facility to attach names to groups of IPv4 addresses -- and then use only those names (rather than literal IPv4 addresses) in the rule set as much as possible.

We start with definitions of the internal interfaces:

/ip firewall address-list add list=ipv4_lan address=A.B.C.D/24
/ip firewall address-list add list=ipv4_dmz address=E.F.G.H/24

which should match the IPv4 subnets used for the interface definitions above (a similar auto-define approach could be used to set the IPv4 address-lists as was used with the IPv6 address lists, but since the IPv4 internal addresses are fixed it does not seem necessary).

Another useful address list is a list of addresses which can externally manage the Mikrotik (eg, a work address for when you need to get into your home connection):

/ip firewall address-list add list=ipv4_ext_mgmt address=G.H.I.J comment="EXPLANATION"

repeat as needed to add multiple addresses; using, eg, a DNS name or site/company name in the comments helps with figuring out which one is which later on when they inevitably need to be updated.

It is also useful to define a "bogon" address list of addresses which should not appear on the Internet -- this IPv4 list is taken from RFC6890:

/ip firewall address-list
add address=0.0.0.0/8 comment=RFC6890 list=ipv4_bogons
add address=10.0.0.0/8 comment=RFC6890 list=ipv4_bogons
add address=172.16.0.0/12 comment=RFC6890 list=ipv4_bogons
add address=192.168.0.0/16 comment=RFC6890 list=ipv4_bogons
add address=169.254.0.0/16 comment=RFC6890 list=ipv4_bogons
add address=127.0.0.0/8 comment=RFC6890 list=ipv4_bogons
add address=224.0.0.0/4 comment=Multicast list=ipv4_bogons
add address=198.18.0.0/15 comment=RFC6890 list=ipv4_bogons
add address=192.0.0.0/24 comment=RFC6890 list=ipv4_bogons
add address=192.0.2.0/24 comment=RFC6890 list=ipv4_bogons
add address=198.51.100.0/24 comment=RFC6890 list=ipv4_bogons
add address=203.0.113.0/24 comment=RFC6890 list=ipv4_bogons
add address=100.64.0.0/10 comment=RFC6890 list=ipv4_bogons
add address=240.0.0.0/4 comment=RFC6890 list=ipv4_bogons
add address=192.88.99.0/24 comment="6to4 relay Anycast [RFC 3068]" list=ipv4_bogons
/

(Note the trailing "/" to return the Mikrotik context back to the top level.)

IPv4 NAT to the Internet

Once we have done that, we can define the NAT rules needed for Internet access:

/ip firewall nat add chain=srcnat action=masquerade out-interface=fibrex src-address-list=ipv4_lan
/ip firewall nat add chain=srcnat action=masquerade out-interface=fibrex src-address-list=ipv4_dmz

which specifies that any IPv4 traffic from the LAN or DMZ address ranges, allowed out by the firewall rules to the Internet, will be sent out using the current IP of the fibrex interface. The use of "action=masquerade" means it should automatically adapt if the IPv4 external address ever changes.
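
Once traffic is flowing, the masquerading can be sanity checked by listing the NAT rules (with packet counters) and the tracked connections:

/ip firewall nat print stats
/ip firewall connection print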

IPv4 ICMP filtering

Over the years IPv4 ICMP has acquired a number of special case uses which probably should not be used on the modern Internet, so it can be useful to be more specific about which ICMP types are required. We can do this by creating an ipv4_icmp filter chain that whitelists the expected types and blocks all other ICMP.

This list should not be considered exhaustive, but is probably the minimum required for a functioning IPv4 connection:

/ip firewall filter
add chain=ipv4_icmp protocol=icmp icmp-options=0:0 action=accept comment="echo reply"
add chain=ipv4_icmp protocol=icmp icmp-options=3:0 action=accept comment="net unreachable"
add chain=ipv4_icmp protocol=icmp icmp-options=3:1 action=accept comment="host unreachable"
add chain=ipv4_icmp protocol=icmp icmp-options=3:4 action=accept comment="host unreachable fragmentation required"
add chain=ipv4_icmp protocol=icmp icmp-options=4:0 action=accept comment="allow source quench"
add chain=ipv4_icmp protocol=icmp icmp-options=8:0 action=accept comment="allow echo request"
add chain=ipv4_icmp protocol=icmp icmp-options=11:0 action=accept comment="allow time exceed"
add chain=ipv4_icmp protocol=icmp icmp-options=12:0 action=accept comment="allow parameter bad"
add chain=ipv4_icmp protocol=icmp action=drop comment="deny all other types"
/

(Again note the trailing "/" to reset the Mikrotik context to the top level.)

IPv4 Input/Output from the Mikrotik

The Mikrotik firewalling has an "input" chain for traffic to the Mikrotik itself, an "output" chain for traffic from the Mikrotik itself, and a "forward" chain for traffic originating outside the Mikrotik destined for somewhere outside the Mikrotik.

Having defined all the above address lists and helper filters, we can now define the "input" and "output" chains. These need to allow DHCPv4 (RFC2131), as well as management traffic from known locations -- and block unexpected traffic from external locations.

The IPv4 input filter:

/ip firewall filter
add chain=input action=accept connection-state=established,related
add chain=input action=jump   protocol=icmp jump-target=ipv4_icmp
add chain=input action=accept protocol=udp in-interface=fibrex src-port=67 dst-port=68 comment="IPv4 DHCP"
add chain=input action=accept in-interface=fibrex src-address-list=ipv4_ext_mgmt
add chain=input action=accept in-interface=ether2 src-address-list=ipv4_lan
add chain=input action=accept in-interface=ether5 src-address-list=ipv4_dmz
add chain=input action=drop   in-interface=fibrex
add chain=input action=drop   in-interface=ether1
add chain=input action=reject
/

and the IPv4 output filter:

/ip firewall filter
add chain=output action=accept connection-state=established,related
add chain=output action=jump   protocol=icmp jump-target=ipv4_icmp
add chain=output action=accept protocol=udp out-interface=fibrex src-port=68 dst-port=67
add chain=output action=accept protocol=udp port=123 dst-address-list=ipv4_ntp_servers comment="NTP"
add chain=output action=reject out-interface=fibrex
add chain=output action=reject out-interface=ether1
add chain=output comment="Mikrotik Beacons to LAN" \
    dst-address=255.255.255.255 out-interface=ether2 port=5678 protocol=udp
add chain=output action=reject out-interface=ether2 log=yes log-prefix="To LAN"
add chain=output comment="Mikrotik Beacons to DMZ" \
    dst-address=255.255.255.255 out-interface=ether5 port=5678 protocol=udp
add chain=output action=reject out-interface=ether5 log=yes log-prefix="To DMZ"
add chain=output action=drop
/

Where 255.255.255.255 is the IPv4 broadcast address.

Of note, IPv4 ICMP is filtered via the "known good" ICMPv4 whitelist defined above in both directions; DHCPv4 is allowed on the fibrex VLAN tagged interface in both directions; and traffic from management sources to the Mikrotik is permitted (including, by choice, all the internal addresses -- that could be locked down further if desired).

Traffic from the external interfaces is simply dropped without logging (because the Internet is filled with constant scanning), but traffic to the Internet is logged to help debug missing rules. Traffic to internal interfaces is also logged, to help determine (a) if there are missing rules and (b) if anything is trying to reach into the internal network.
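
Those log prefixes make it easy to find the relevant entries later; for example, to review what has been rejected towards the internal interfaces (the match strings are just the log-prefix values chosen above):

/log print where message~"To LAN"
/log print where message~"To DMZ"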

IPv4 traffic through the Mikrotik

We can fairly easily define default policies for traffic between the three interfaces:

  • Anything from the Internet to an internal interface should be blocked unless it is part of a connection established outbound, or a specific rule allowing traffic to the DMZ

  • Anything to the Internet should be allowed by default from known IPs

  • LAN to DMZ traffic should be allowed by default, but may be more filtered later

  • DMZ to LAN traffic should be limited, initially just ICMPv4 (but maybe later, eg, DNS and logging)

This means that only two of these cases need special treatment, LAN to DMZ, and DMZ to LAN:

/ip firewall filter
add chain=ipv4_lan_to_dmz action=accept
/

/ip firewall filter
add chain=ipv4_dmz_to_lan action=jump jump-target=ipv4_icmp
add chain=ipv4_dmz_to_lan action=reject
/

And then we can define some policies for traffic arriving on the LAN:

/ip firewall filter
add chain=ipv4_lan_out action=jump   src-address-list=ipv4_lan dst-address-list=ipv4_dmz in-interface=ether2 out-interface=ether5 jump-target=ipv4_lan_to_dmz
add chain=ipv4_lan_out action=accept src-address-list=ipv4_lan dst-address-list=!ipv4_bogons in-interface=ether2 out-interface=fibrex
add chain=ipv4_lan_out action=reject
/

and for traffic arriving on the DMZ:

/ip firewall filter
add chain=ipv4_dmz_out action=jump   src-address-list=ipv4_dmz dst-address-list=ipv4_lan in-interface=ether5 out-interface=ether2 jump-target=ipv4_dmz_to_lan
add chain=ipv4_dmz_out action=accept src-address-list=ipv4_dmz dst-address-list=!ipv4_bogons in-interface=ether5 out-interface=fibrex
add chain=ipv4_dmz_out action=reject
/

which use those LAN/DMZ policies for traffic between the LAN and DMZ interfaces, and the bogon list to filter traffic out to the Internet. "Everything else" unexpected is rejected; but in practice there should not be anything else.

Once those are defined, we can define a general IPv4 forwarding policy which hooks all of these together, and adds blocks for unexpected inbound traffic:

/ip firewall filter
add chain=forward action=fasttrack-connection connection-state=established,related comment="FastTrack (if possible)"
add chain=forward action=accept               connection-state=established,related comment="Other Established, Related"
add chain=forward action=drop                 connection-state=invalid comment="Drop invalid" log=yes log-prefix=Invalid
add chain=forward action=jump in-interface=ether2 jump-target=ipv4_lan_out
add chain=forward action=jump in-interface=ether5 jump-target=ipv4_dmz_out
add chain=forward action=drop in-interface=fibrex connection-nat-state=!dstnat connection-state=new comment="Inbound non-NAT" log=yes log-prefix=!NAT
add chain=forward action=drop in-interface=fibrex
add chain=forward action=drop in-interface=ether1
add chain=forward action=reject
/

and then our IPv4 firewall policy is complete, if fairly minimal.

The main thing I anticipate adding over time is some DMZ to LAN pinholes, maybe some Internet to DMZ pinholes (using IPv4 Destination NAT) and perhaps some further lock down of the LAN to DMZ traffic. Since those are all in their own rule sets they should be fairly easy to modify.
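
As an illustrative sketch (the DMZ host address E.F.G.x and port 443 here are purely hypothetical), an Internet to DMZ pinhole needs both a destination NAT rule and a forward rule placed before the unconditional "drop in-interface=fibrex" rule:

/ip firewall nat add chain=dstnat in-interface=fibrex protocol=tcp dst-port=443 \
    action=dst-nat to-addresses=E.F.G.x comment="HTTPS to DMZ host"
/ip firewall filter print where chain=forward
/ip firewall filter add chain=forward in-interface=fibrex out-interface=ether5 \
    protocol=tcp dst-address=E.F.G.x dst-port=443 connection-nat-state=dstnat \
    action=accept comment="HTTPS to DMZ host" place-before=NN

with place-before=NN chosen from the "print" output, as with the other rule insertions described later.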

IPv6 Firewalling

IPv6 Firewalling is completely separate from IPv4 Firewalling on the Mikrotik (and many devices), due to using completely separate IP addresses, but by using "firewall address-lists" the shape of the firewall rules can (and arguably should) look very similar.

IPv6 address lists

We can define external management addresses:

/ipv6 firewall address-list add list=ipv6_ext_mgmt address=AAAA:BBBB:CCC:DDDD:EEEE:FFFF:GGGG:HHHH comment="DESCRIPTION"

to go along with the ipv6_lan and ipv6_dmz definitions that we calculated above (when defining the IPv6 IP addresses).

We can also add some helper address lists for known IPv6 types:

/ipv6 firewall address-list add list=ipv6_link_local address=fe80::/16
/ipv6 firewall address-list add list=ipv6_multicast  address=ff02::/16

and addresses which should not appear on the Internet:

/ipv6 firewall address-list add list=ipv6_bogons address=fc00::/7 comment="IPv6 Unique Local Addresses"

(in this case IPv6 ULA addresses mentioned above, which are site local).

IPv6 ICMP filtering

ICMPv6 is even more critical to IPv6 than ICMPv4 is to IPv4, so we need to be careful with filtering; there are also fewer "tried 20 years ago, do not use now" ICMPv6 types. However to match the pattern of firewall rules between IPv4 and IPv6, I also defined an ICMPv6 whitelist. This should definitely be considered the minimum and will almost certainly need expanding over time; hence the "accept but log" at the end before the "default drop" -- thus accepting everything, but tracking "unexpected" traffic.

/ipv6 firewall filter
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=1:0-255 comment="Destination Unreachable"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=2:0-255 comment="Packet Too Big"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=3:0-255 comment="Time Exceeded"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=4:0-255 comment="Parameter Problem"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=128:0-255 comment="Echo Request"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=129:0-255 comment="Echo Reply"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=132:0-255 comment="Multicast Listener Done"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=133:0-255 comment="Router Solicitation (NDP)"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=134:0-255 comment="Router Announcement (NDP)"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=135:0-255 comment="Neighbor Solicitation (NDP)"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=136:0-255 comment="Neighbor Announcement (NDP)"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=137:0-255 comment="Neighbor Redirect (NDP)"
add chain=ipv6_icmp action=accept protocol=icmpv6 icmp-options=143:0-255 comment="Version 2 Multicast Listener Report"
add chain=ipv6_icmp action=accept protocol=icmpv6 log=yes log-prefix=ICMPv6
add chain=ipv6_icmp action=drop
/

We can also add another filter for "just ping" to be used in more specific scenarios:

/ipv6 firewall filter
add chain=ipv6_ping action=accept protocol=icmpv6 icmp-options=128:0-255 comment="Echo Request"
add chain=ipv6_ping action=accept protocol=icmpv6 icmp-options=129:0-255 comment="Echo Reply"
/

IPv6 Input/Output from the Mikrotik

Having done that, we can define somewhat longer lists of traffic to/from the Mikrotik itself. Note that due to the extensive use of IPv6 Link Local addresses for key functions it is important that we allow those on each interface (the same addresses are used on each interface, with an interface specific route). We also need to allow IPv6 Link Local Multicast for the same reason. Like IPv4 we obviously need to allow DHCP, since that is how we get the addresses, but DHCPv6 is a different protocol on different UDP ports from DHCPv4 on IPv4.

Then we have an IPv6 input rule set looking similar to the IPv4 one:

/ipv6 firewall filter
add chain=input action=accept connection-state=established,related
add chain=input action=jump   protocol=icmpv6 jump-target=ipv6_icmp
add action=accept chain=input comment="DHCPv6 Replies" \
    dst-address-list=ipv6_link_local dst-port=546 in-interface=fibrex \
    protocol=udp src-address-list=ipv6_link_local src-port=547
add chain=input action=accept in-interface=fibrex src-address-list=ipv6_ext_mgmt
add chain=input action=accept in-interface=ether2 src-address-list=ipv6_lan
add chain=input action=accept in-interface=ether2 src-address-list=ipv6_link_local
add chain=input action=accept in-interface=ether2 src-address-list=ipv6_multicast
add chain=input action=accept in-interface=ether5 src-address-list=ipv6_dmz
add chain=input action=accept in-interface=ether5 src-address-list=ipv6_link_local
add chain=input action=accept in-interface=ether5 src-address-list=ipv6_multicast
add chain=input action=drop   in-interface=fibrex
add chain=input action=drop   in-interface=ether1
add chain=input action=reject
/

and a similar looking output rule set:

/ipv6 firewall filter
add chain=output action=accept connection-state=established,related
add chain=output action=jump   protocol=icmpv6 jump-target=ipv6_icmp
add action=accept chain=output comment=DHCPv6 dst-address=ff02::1:2/128 \
    dst-port=547 out-interface=fibrex protocol=udp \
    src-address-list=ipv6_link_local src-port=546
add chain=output action=accept out-interface=fibrex protocol=udp dst-port=68-69 comment="DHCP"
add chain=output action=reject out-interface=fibrex
add chain=output action=reject out-interface=ether1
add chain=output comment="Mikrotik Beacons to LAN" dst-address=ff02::1/128 \
    out-interface=ether2 port=5678 protocol=udp
add chain=output action=reject out-interface=ether2 log=yes log-prefix="To LAN"
add chain=output comment="Mikrotik Beacons to DMZ" dst-address=ff02::1/128 \
    out-interface=ether5 port=5678 protocol=udp
add chain=output action=reject out-interface=ether5 log=yes log-prefix="To DMZ"
add chain=output action=drop
/

The ff02::1/128 address is the link-local "all nodes" multicast address, which is basically the IPv6 equivalent of an IPv4 broadcast on a LAN segment; IPv6 does not have broadcast addresses as such.

IPv6 traffic through the Mikrotik

The IPv6 firewall for traffic through the Mikrotik is like the IPv4 firewall for traffic through the Mikrotik, but simpler because it does not require any NAT -- we have globally unique addresses everywhere (I have chosen not to use IPv6 Unique Local Addresses at this time).

The main difference is that we can also receive traffic from the Internet direct to those global addresses -- and for now I have chosen to allow ICMPv6 Ping and nothing else, to help with debugging routing and other issues.

So we have IPv6 firewall chains for traffic from LAN to DMZ and DMZ to LAN:

/ipv6 firewall filter
add chain=ipv6_lan_to_dmz action=accept
/

/ipv6 firewall filter
add chain=ipv6_dmz_to_lan action=jump protocol=icmpv6 jump-target=ipv6_ping
add chain=ipv6_dmz_to_lan action=reject
/

which are then used by IPv6 firewall chains for traffic originating on the LAN and DMZ:

/ipv6 firewall filter
add chain=ipv6_lan_out action=jump   src-address-list=ipv6_lan \
    dst-address-list=ipv6_dmz in-interface=ether2 out-interface=ether5 \
    jump-target=ipv6_lan_to_dmz
add chain=ipv6_lan_out action=accept src-address-list=ipv6_lan \
    dst-address-list=!ipv6_bogons in-interface=ether2 out-interface=fibrex
add chain=ipv6_lan_out action=reject
/

/ipv6 firewall filter
add chain=ipv6_dmz_out action=jump   src-address-list=ipv6_dmz \
    dst-address-list=ipv6_lan in-interface=ether5 out-interface=ether2 \
    jump-target=ipv6_dmz_to_lan
add chain=ipv6_dmz_out action=accept src-address-list=ipv6_dmz \
    dst-address-list=!ipv6_bogons in-interface=ether5 out-interface=fibrex
add chain=ipv6_dmz_out action=reject
/

to both handle LAN to DMZ and LAN to Internet -- and DMZ to LAN and DMZ to Internet -- traffic.

Then we have some IPv6 inbound firewall rules to handle Internet originated traffic:

/ipv6 firewall filter
add chain=ipv6_lan_in action=jump src-address-list=!ipv6_bogons \
    dst-address-list=ipv6_lan in-interface=fibrex out-interface=ether2 \
    protocol=icmpv6 jump-target=ipv6_ping
add chain=ipv6_lan_in action=reject
/

/ipv6 firewall filter
add chain=ipv6_dmz_in action=jump src-address-list=!ipv6_bogons \
    dst-address-list=ipv6_dmz in-interface=fibrex out-interface=ether5 \
    protocol=icmpv6 jump-target=ipv6_ping
add chain=ipv6_dmz_in action=reject
/

(If I were starting again I might have called these ipv6_to_lan and ipv6_to_dmz, and the "out" ones ipv6_from_lan and ipv6_from_dmz; but I wanted to be consistent with the already defined above IPv4 firewall chain names, and the Mikrotik makes it non-trivial to change firewall chain names.)

Once all of the above is defined, we can define a general IPv6 "forward" policy that hooks into all these other chains as required:

/ipv6 firewall filter
add chain=forward action=accept connection-state=established,related comment="Other Established, Related"
add chain=forward action=drop   connection-state=invalid comment="Drop invalid" log=yes log-prefix=Invalid
add chain=forward action=jump in-interface=ether2 jump-target=ipv6_lan_out
add chain=forward action=jump in-interface=ether5 jump-target=ipv6_dmz_out
add chain=forward action=jump in-interface=fibrex out-interface=ether2 jump-target=ipv6_lan_in
add chain=forward action=jump in-interface=fibrex out-interface=ether5 jump-target=ipv6_dmz_in
add chain=forward action=drop in-interface=fibrex
add chain=forward action=drop in-interface=ether1
add chain=forward action=reject
/

Then the basic IPv6 firewall should be complete, ready to be extended over time. If the IPv6 addresses change then the address-lists will need some tweaking, but in theory the rules themselves should be fairly static.

Other firewall related configuration

The IPv4 firewall state timeouts are relatively short by default in some cases so it can help to extend these:

/ip firewall connection tracking set tcp-fin-wait-timeout=5m \
    tcp-close-wait-timeout=5m tcp-last-ack-timeout=5m \
    tcp-time-wait-timeout=5m tcp-close-timeout=5m

and ideally we would do the same for IPv6, but there are no specific IPv6 firewall connection tracking options; it is unclear if the IPv4 settings also apply to IPv6, or if the IPv6 connection tracking times are simply not exposed.

Over time other firewall rules can be added to allow, eg,

  • NTP for time synchronisation via the WAN interface (to known NTP servers):

    /system ntp client set primary-ntp=A.B.C.D server-dns-names=foo.example.com enabled=yes
    /ip firewall address-list add list=ipv4_ntp_servers address=A.B.C.D
    

    which hooks into the IPv4 output firewall rule set above.
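
    The NTP client configuration and status can then be checked with:

    /system ntp client print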

  • Allow the DMZ to use the LAN DNS server:

    /ip firewall address-list add list=ipv4_lan_dns_server address=E.F.G.H
    /ip firewall filter print where chain=ipv4_dmz_to_lan
    /ip firewall filter add chain=ipv4_dmz_to_lan protocol=udp dst-port=53 \
        dst-address-list=ipv4_lan_dns_server comment="DNS (UDP)" place-before=NN
    /ip firewall filter add chain=ipv4_dmz_to_lan protocol=tcp dst-port=53 \
        dst-address-list=ipv4_lan_dns_server comment="DNS (TCP)" place-before=NN
    

    which will need appropriate place-before=NN entries to put it into the right location in the rules.

Conclusion

With all of this set up, it should be possible to plug ether1 of the Mikrotik into the Vodafone TechniColor TC4400VDF in place of the Huawei HG659. Then power cycle the Vodafone TechniColor TC4400VDF to force it to forget the internal MAC addresses, and let the network know to expect a new connection. Once the TechniColor TC4400VDF boots, the Mikrotik should be able to get IPv4 and IPv6 addresses via DHCPv4 and DHCPv6. You can inspect the DHCP state with:

/ip dhcp-client print
/ip dhcp-client print detail

/ipv6 dhcp-client print
/ipv6 dhcp-client print detail

and in both cases you are looking for a status of "bound", and some appropriate IP addresses, and an appropriate lease expiry time (minutes for the IPv4 address; weeks for the IPv6 addresses).
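
It can also be worth confirming that default routes were installed from those leases:

/ip route print where dst-address=0.0.0.0/0
/ipv6 route print where dst-address=::/0

with the IPv4 default route expected to point out the fibrex interface, and the IPv6 default route typically via a link-local gateway on fibrex.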

For IPv6 you can also inspect the IPv6 pool allocated, and what was assigned to the LAN and DMZ interfaces:

/ipv6 pool print
/ipv6 address print detail where global and interface=ether2
/ipv6 address print detail where global and interface=ether5

Providing the addresses used from the IPv6 pool are at the very start of the pool (ie, first allocations) they should be fairly stable over a reboot of the Mikrotik (on each boot it will revert to the start of the pool). The addresses on those interfaces should be compared with the address lists for the LAN / DMZ IPv6 addresses if there are issues with IPv6 reachability:

/ipv6 firewall address-list print where list=ipv6_lan
/ipv6 firewall address-list print where list=ipv6_dmz

and if they are out of sync, use the commands shown in the IPv6 address definition sections to update the address-lists with the current values.

This configuration has worked fairly reliably for me for the last month. The main issues have been:

  • Issues with DHCPv4 and DHCPv6, which eventually led to fairly broadly allowing DHCPv4 and DHCPv6 on the WAN interface (I think what was happening was that the state relating to the DHCP request was timing out and the reply was being ignored; DHCP uses different addresses at different stages depending on whether or not it already has an IP address); and

  • Multiple issues with the Vodafone FibreX headend going away (the Vodafone TechniColor TC4400VDF showing no uplink/downlink), particularly around three weeks ago (when it went away twice within 30 minutes on each of 2 days; I think Vodafone were doing some sort of maintenance, but it is unclear exactly what -- and it happened in the middle of the work day rather than overnight).

I have also changed the LAN address of the Vodafone supplied Huawei HG659 to a different IPv4 address and disabled unneeded functionality:

  • IPv6 RA (Router Announcements)

  • IPv6 DHCPv6

  • UPnP

so that it does not interfere, and then continued to use it as just an access point. It does complain it is "not connected to the Internet" (as the WAN interface is not connected, so its DHCP requests are failing) but otherwise it seems to work fine.

Handling IPv6 DHCPv6 client/pool address changes

ETA 2017-11-12: After a few more weeks, it has turned out that the Vodafone "Dynamic but Stable" IPv6 addresses do end up changing often enough to be annoying (the change breaks the IPv6 firewalling, which breaks IPv6 for LAN clients, which causes delays in connecting :-( ). It also appears that the previous 2-week leases might have been reduced to a somewhat more sensible "several hours" lease time.

To handle this I have put some effort into Mikrotik scripting to track the changing LAN/DMZ IPv6 address ranges. Ideally this would happen when the IPv6 addresses themselves changed, but I cannot find a scripting hook on "/ipv6 address" or "/ipv6 pool" to use. The next best thing is to hook into the "/ipv6 dhcp-client" scripting features, and run a script when the DHCPv6 addresses are acquired, applied or removed. But since the IPv6 pool updates and IPv6 address updates from those pools might happen asynchronously, we need a bit of a delay before trying to update the "/ipv6 firewall address-list" entries -- I've chosen around 30 seconds as likely to be sufficient. Sadly there is not an easy way to schedule a script to "run in 30 seconds" (cf. "at" on Unix systems); the best option seems to be to enable/disable a scheduler event that runs every 30 seconds, as a "one shot" run.

So the process is:

  • "/ipv6 dhcp-client ... script=..." which runs a "/system script" that enables the "update filters every 30 seconds" scheduler event.

  • In around 30 seconds, that script launches and (a) runs the script that will update the IPv6 address-lists, and (b) disables the "update filters every 30 seconds" scheduler event.

  • For some more robustness there is also another hourly scheduler event (which stays enabled) which also runs the same script to update the IPv6 address lists; hourly seemed often enough to minimise the "wrong IP" pain, while still keeping resource usage fairly low (amongst other things we are rewriting the config each time!).

The individual per-interface address list updates are simply the commands given earlier to set the "/ipv6 firewall address-list ..." entries, preceded by a command to clear the existing address-list entries (to avoid them accumulating months of old history!).

The basic per-interface scripts are:

/system script
add name=ipv6-update-lan-range owner=ewen policy=read,write \
    source="/ipv6 firewall address-list remove [/ipv6 firewall \
            address-list find list=ipv6_lan]; \
    /ipv6 firewall address-list add list=ipv6_lan address=[/ipv6 \
          address get [/ipv6 address find interface=ether2 \
          from-pool=fibrex-pool] address]"
/

/system script
add name=ipv6-update-dmz-range owner=ewen policy=read,write \
    source="/ipv6 firewall address-list remove [/ipv6 firewall \
           address-list find list=ipv6_dmz]; \
    /ipv6 firewall address-list add list=ipv6_dmz address=[/ipv6 \
          address get [/ipv6 address find interface=ether5 \
          from-pool=fibrex-pool] address]"
/

(Note the use of ";" in between the two commands to separate them; the alternative is to embed CR (\r) and NL (\n) characters into the script.)

Having done that, for convenience we combine them into one script which calls both:

/system script add name=ipv6-update-filters owner=ewen policy=read,write \
    source="/system script run ipv6-update-lan-range; \
            /system script run ipv6-update-dmz-range"

Then we can schedule that top level script hourly:

/system scheduler add name="ipv6-update-filters" \
        on-event="ipv6-update-filters" interval=1h

as a background precaution.

To make it run "on demand" for IPv6 DHCPv6 client changes, we need to create a delayed one-shot variant. To do this we make a place holder "once" script:

/system script add name="ipv6-update-filters-once" policy=read,write \
        source="/system script run ipv6-update-filters"

that initially just runs the top level script. Then we schedule that to run every 30 seconds, but leave it disabled:

/system scheduler add name="ipv6-update-filters-once" \
        on-event="ipv6-update-filters-once" disabled=yes interval=30s

and update the script that is run to disable the scheduler event that started it, so that enabling the scheduler event will result in a "once" run:

/system script set [/system script find name=ipv6-update-filters-once] \
        source="/system script run ipv6-update-filters; \
                /system scheduler set disabled=yes [/system scheduler \
                        find name=ipv6-update-filters-once]" \
        comment="Oneshot schedulable ipv6-update-filters"

and finally we create a script to enable that event when required:

/system script add name="ipv6-update-filters-in-30-seconds" \
        policy="read,write" source="/system scheduler set disabled=no \
                [/system scheduler find name=ipv6-update-filters-once]" \
        comment="Enable 'once' IPv6 Filter Update"

which we can test by hand with:

/system script run ipv6-update-filters-in-30-seconds

and then watch it with:

/system scheduler print
[...]
/system scheduler print
/ipv6 firewall address-list print

and we should see the "once" scheduler event get enabled, then after a while show that it has run once (run-count is 1), and be disabled again. Looking at the "/ipv6 firewall address-list print" should then show the updated addresses.

Once we are sure that it works, we can then hook it up to the IPv6 DHCPv6 client with:

/ipv6 dhcp-client set [/ipv6 dhcp-client find interface=fibrex] \
      script="/system script run ipv6-update-filters-in-30-seconds"

which in theory will run the "one-shot" filter update about 30 seconds after the DHCP change (and since it is idempotent, and via a scheduler event with its own event timing, it should not get run repeatedly very often -- and if it does, it should still work out okay). The hourly event remains as a backup.

Ideally there would be an easier way to express this "address-list contains the network of this interface" policy than the kludge described above, particularly with IPv6 address-lists where the underlying addresses are likely to change regularly for many users (IPv4 mostly avoids this problem because NAT masquerading just tracks the changing IP). But hopefully IPv6 address changes will now require less manual intervention.

Of note, both the freedommafia.net Mikrotik examples and the bluecrow.net Mikrotik examples are useful hints as to the range of things that can be done with Mikrotik scripting. It takes a bit of creativity to express what you want, but the scripting language is reasonably full featured.

Posted Mon Oct 23 14:47:36 2017 Tags:

Earlier this week "Do Not Reply" let me know, in an email titled "Return Available", that it was time to file my company GST return:

You have GST returns for period ending 30 September 2017, due
30 October 2017, now available for filing for the following IRD numbers:

So this weekend I finished up the data entry and calculated the data needed to file the GST return, as usual. (Dear IRD, if you are listening, perhaps "Do Not Reply" is not the most optimal sender for official correspondence? Maybe you could consider, eg, "IRD Notification Service"? Also "Return Available" seems like a confusing way to say "please file your GST return this month". Just saying.)

Of note for understanding what transpires below, I was forced to register for "MyIR" a couple of years ago to request IRD provide a Tax Residency Certificate; other countries have information, but IRD only provide a guide to determining tax residency, and needed the concept of a Tax Residency Certificate explained to them, including the fields required by their Double Taxation Treaty partners.

Because of that "MyIR" registration, I am now forced to file GST returns online (once you have registered, filing on paper is no longer an option). Previously the online filing has been relatively simple, but this weekend while the filing went okay, trying to exit out of the "MyGST" part of the "MyIR" website of the Inland Revenue Department turned into a comedy of errors:

  1. The "Log Off" button in the "MyGST" site, something you would hope would be regularly tested, failed to work. It tries to access (via Javascript obscurity):

    https://services.ird.govt.nz/myirlogout.jsp
    

    which seems a plausible enough URL, but actually ends up with:

    Secure Connection Failed
    
    The connection to the server was reset while the page was loading.
    

    every time I tried. (The "Logout" link on the "MyIR" site, also loading via Javascript, went to a different ird.govt.nz site, but did actually work; it is unclear if logging out of "MyIR" also logs you out of "MyGST", as they are presented as separate websites.)

  2. Since a working "Log Off" function seemed important to a site that holds sensitive, potentially confidential, information I tried to report the issue. Conveniently the "MyGST" site has a handy "Send us a message" link on its front page, so I attempted to use that. However I found:

    • It will not accept ".txt" attachments (to illustrate the problem): "File Type .txt is not allowed" with no indication of why it is not allowed. (I assume "not on the whitelist", but that raises the questions (a) why?! and (b) what "File Type"s are allowed. Experimentally I determined that PNG and PDF were allowed.)

    • There is no option to contact about the website, only "something else".

    • When you "Submit" the message you've written, the website simply returns to the "MyGST" home page with no indication whether or not the message was sent, where you might see the sent message, and no copy of the sent message emailed to you. (I tried twice; same result both times.)

    So that did not seem very promising.

    For the record, I eventually found -- much later -- that you can check if the message has been sent by:

    • Going to the "Activity Centre" tab of "MyGST"

    • Clicking on the "More..." button next to the "Messages" heading

    • Clicking on the "Outbox" tab of that mailbox

    and you will see your messages there, and can click on each one to view them. (Which showed that each of my two attempts had apparently been sent twice, despite the website not informing me it had done so; oops. It is unclear to me how they ended up each being sent twice; I did not, eg, click through a "resend POST data" dialogue.)

  3. When it was unclear if "Send us a message" in "MyGST" worked, I thought the next best option would be to go back to the "MyIR" site, and use "Secure mail" which is IRD's preferred means of contact (as I found out when, eg, trying to get a Tax Residency Certificate a couple of years ago). Unfortunately when I attempted to use that I found:

    • There is no option to choose "Website" or "GST" from the form at all, so I had to send an "All Other" / "All Other" message;

    • There was no option to add attachments to the message, so I could not include the screenshots/error output; and

    • When I submitted that message, I got a generic 404 error!

      https://www1.e-services.ird.govt.nz/error/error404.html
      

      which told me:

      Contact us
      Page not available
      
      The page you are trying to access is not available on our website.
      
      If you have reached this page by following a link on our website
      rather than using a bookmark, please take a moment to e-mail the
      General comments form with the details of the page you were trying
      to access.
      

    The "MyIR" "Secure Mail" feature does have an obvious "Sent" tab, so in another window I was quickly able to check that it had not in fact been sent. At this point I assumed I was 0 for 3 -- not a great batting average.

  4. Still, the 404 page did offer a link to the General Comments page:

    https://www.ird.govt.nz/cgi-bin/form.cgi?form=comments

    so it seemed worth reporting the accumulating list of problems. That "General Comments" page is (naturally) very general, but:

    • "Website" is not a category they have anticipated receiving comments about (so "Other" it is again); and

    • Your choices for response are:

      • No response required

      • In writing by mail

      • Over the phone

      • In writing by fax

      And that is it: no option to ask for a response by email. But if your 1990s fax machine is still hooked up and working then IRD is ready to respond to your online communication with your preferred option! (It appears based on your response here the second stage of the form requires you to enter different values; but the "In writing by mail" does not even collect a postcode!)

      In fairness, the second stage of the form also allowed an optional email address to be entered -- which I did -- so possibly they might treat one of the above as "by email"; it is just not at all obvious to the user.

    • The box for entering comments was 40 characters wide by 4 characters deep -- there are programmable calculators with a larger display! (In fairness Firefox on OS X at least does allow resizing this; but nothing says "we hope you do not have much to say" like allowing an old-Tweet length worth of text to be visible on the screen at once.)

    Anyway, undeterred by all of this, I reported in brief the three problems I had encountered so far: (1) "MyGST" Log Off function broken; (2) "MyGST" "Send us a message" function apparently not working; (3) "MyIR" "Secure Mail" sending resulting in a 404.

    That one was successful, giving me a "comment sent" confirmation page, although without any tracking number or other identifier (the closest to an identifier is "Your request was sent on Sunday 8 October 2017 at 14:40"). Sadly my neatly laid out bullet point list of issues encountered was turned into a single line of terribly formatted run on text; it appears they were serious about people keeping their comments to old-Tweet length!

  5. After this experience I was surprised to find that the only working thing -- the General Comments Form -- offered me a chance to:

    Send feedback about this form

    Since I seemed to be on a yak shaving mission to use every feedback form on the site, who could resist?! I (successfully!) offered them anonymous feedback that:

    • In 2017, offering "response by email" might be a useful update;

    • Perhaps "In writing by fax" could be retired;

    • 40x4 character comment forms are... rather small and difficult to use.

    Only I had to do so much more tersely because the "Online Form Feedback" comment field was itself 40x4 characters.

On the plus side:

  • I did manage to file my GST return

  • Eventually if one is patient enough, one does get auto-logged out of the "MyIR" site, so maybe one does get auto-logged out of the "MyGST" site as well;

  • Apparently I did manage to report the original "MyGST" "Log Off" problem after all (and hopefully someone at IRD can merge those into a single ticket, rather than having four people investigating the problem).

Now to actually pay my GST amount due.

If IRD do respond with anything useful I will add an update to this post to record that, eg, some of the above issues have been fixed. At least two of them (the "MyGST" "Log Off" issue and the "MyIR" "Secure Mail" sending failure) seem likely to be encountered by other users and fixed.

ETA 2017-10-17: IRD responded to my second contact attempt (in MyGST) with:

"""Good Afternoon Ewen.

Thank you for your email on the 8th October 2017.

seeing you are having issues with the online service please
contact us on 0800 227 770.

As this service doesn't deal with these issues, This service
is for web message responses for GST accounts. We have forwarded
this message to be directed to our Technical Services as this
is a case for them."""

which, at one week to reply, is much better than their estimated reply time. I have assumed that "forwarded [...] to our Technical Services [...]" will be sufficient to get the original reports in front of someone who might be able to actually investigate/fix them, and not done anything further (calling an 0800 number for (frontline) "technical support" seems unlikely to end well over such a technical issue).

The "MyGST" "Log Off" functionality is still broken though. The "MyIR" logout functionality is slow, but does eventually work.

However going back to an earlier GST page after using the "MyIR" log out functionality, and reloading, still shows I am in my "MyGST" account, and can access pages in "MyGST" that I previously had not viewed in this session. So it appears the two logoff functions are separate -- even though they are controlled by a single "logon" screen. By contrast, trying to go to a "MyIR" page does correctly show the login screen again.

So we learn that logging off "MyGST" separately is important, and that it is still broken (at least in Firefox 52 LTS on OS X 10.11, and Safari 11 on OS X 10.11; both retested today, 2017-10-17).

Posted Sun Oct 8 16:47:59 2017 Tags:

I have a Huawei HG659 Home Gateway supplied as part of my Vodafone FibreX installation. Out of the box, since early 2017, the Vodafone FibreX / Huawei HG659 combination has natively provided IPv6 support. This automagically works on modern mac OS:

ewen@osx:~$ ping6 google.com
PING6(56=40+8+8 bytes) 2407:7000:9b0e:4856:b971:8973:3fe3:1a51 --> 2404:6800:4006:804::200e
16 bytes from 2404:6800:4006:804::200e, icmp_seq=0 hlim=57 time=46.574 ms
16 bytes from 2404:6800:4006:804::200e, icmp_seq=1 hlim=57 time=43.953 ms

and Linux:

ewen@linux:~$ ping6 google.com
PING google.com(syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e)) 56 data bytes
64 bytes from syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e): icmp_seq=1 ttl=57 time=44.8 ms
64 bytes from syd15s03-in-x0e.1e100.net (2404:6800:4006:804::200e): icmp_seq=2 ttl=57 time=43.8 ms

to provide global IPv6 connectivity, for a single internal VLAN, without having to do anything else.

Vodafone delegates a /56 prefix to each customer, which in theory means that it should be possible to further sub-delegate that within our own network for multiple subnets -- most IPv6 features will work down to /64 subnets. I think the /56 is being provided via DHCPv6 Prefix Delegation (see RFC3633 and RFC3769; see also OpenStack Prefix Delegation discussion).
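
As a rough illustration of what a /56 provides (a sketch only, using the RFC3849 documentation prefix rather than my real Vodafone delegation): a /56 contains 256 possible /64 subnets, any of which could in principle be routed to a separate internal network:

import ipaddress

# Example /56 built from the RFC3849 documentation prefix (2001:db8::/32);
# not a real Vodafone delegation.
delegated = ipaddress.ip_network("2001:db8:abcd:4800::/56")
subnets = list(delegated.subnets(new_prefix=64))

print(len(subnets))   # 256
print(subnets[0])     # 2001:db8:abcd:4800::/64 -- eg, the main LAN /64
print(subnets[1])     # 2001:db8:abcd:4801::/64 -- eg, a more isolated DMZ /64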

Recently I started looking at whether I could configure an internal Mikrotik router to route dynamically-obtained IPv6 prefixes from the Huawei HG659's /56 pool, to create a separate -- more isolated -- internal subnet. A very useful Mikrotik IPv6 Home Example provided the Mikrotik configuration required, although I did have to update it slightly for later Mikrotik versions (tested with RouterOS 6.40.1).

Enable IPv6 features on the Mikrotik if they are not already enabled:

/system package enable ipv6
/system package print

If the "print" shows you an "X" with a note that it will be enabled after reboot, then also reboot the Mikrotik at this point:

/system reboot

After that, you should have an IPv6 Link Local Address on the active interface, which you can see with:

[admin@naos-rb951-2n] > /ipv6 addr print
Flags: X - disabled, I - invalid, D - dynamic, G - global, L - link-local
 #    ADDRESS                                     FROM-... INTERFACE        ADV
 0 DL fe80::d6ca:6dff:fe50:6c44/64                         ether1           no
[admin@naos-rb951-2n] >

(The IPv6 Link Local addresses are recognisable as being in fe80::/64, and on the Mikrotik will show as "DL" -- dynamically assigned, link local.)

Once that is working, configure the Mikrotik IPv6 DHCPv6 client to request a Prefix Delegation with:

/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=prefix

Unfortunately when I tried that, it never succeeded in getting an answer from the Huawei HG659. Instead the status was stuck in "searching":

[admin@naos-rb951-2n] > /ipv6 dhcp-client print detail
Flags: D - dynamic, X - disabled, I - invalid
 0    interface=ether1 status=searching... duid="0x00030001d4ca6d506c44"
      dhcp-server-v6=:: request=prefix add-default-route=yes use-peer-dns=yes
      pool-name="ipv6-local" pool-prefix-length=64 prefix-hint=::/0
[admin@naos-rb951-2n] >

which makes me think that while the Huawei HG659 appears to be able to request an IPv6 prefix delegation (with a DHCPv6 client) it does not appear to provide a DHCPv6 server that is capable of prefix delegation, which rather defeats the purpose of having a /56 delegated :-(

Specifying:

/ipv6 dhcp-client remove 0
/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=address,prefix

which appears to be the syntax to request both an interface address and a prefix delegation, did not work any better, still getting stuck with a status of "searching...":

[admin@naos-rb951-2n] > /ipv6 dhcp-client print  detail
Flags: D - dynamic, X - disabled, I - invalid
 0    interface=ether1 status=searching... duid="0x00030001d4ca6d506c44"
      dhcp-server-v6=:: request=address,prefix add-default-route=yes
      use-peer-dns=yes pool-name="ipv6-local" pool-prefix-length=64
      prefix-hint=::/0
[admin@naos-rb951-2n] >

If I delete that and just request an address:

/ipv6 dhcp-client remove 0
/ipv6 dhcp-client add interface=ether1 pool-name=ipv6-local \
      add-default-route=yes use-peer-dns=yes request=address

then the DHCPv6 request does succeed very quickly:

[admin@naos-rb951-2n] > /ipv6 dhcp-client print
Flags: D - dynamic, X - disabled, I - invalid
 #    INTERFACE                     STATUS        REQUEST
 0    ether1                        bound         address
[admin@naos-rb951-2n] >

and there is an additional IPv6 address visible for that interface:

[admin@naos-rb951-2n] > /ipv6 addr print
Flags: X - disabled, I - invalid, D - dynamic, G - global, L - link-local
 #    ADDRESS                                     FROM-... INTERFACE        ADV
 0 DL fe80::d6ca:6dff:fe50:6c44/64                         ether1           no
 1 IDG ;;; duplicate address detected
      2407:xxxx:xxxx:4800::2/64                            ether1           no
[admin@naos-rb951-2n] >

Unfortunately the "I" flag and the "duplicate address detected" comment are both very bad signs -- that the address supplied by DHCPv6 is unusable. When I look around other devices on my network I find that they too have that address, including my main OS X 10.11 laptop:

ewen@ashram:~$ ifconfig -a | grep -B 10 ::2 | egrep "^en|::2"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 dynamic
ewen@ashram:~$

and another OS X 10.11 laptop:

ewen@mandir:~$ ifconfig -a | grep -B 9 ::2 | egrep "^en|::2"
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 duplicated dynamic
ewen@mandir:~$

which implies that the Huawei HG659 DHCPv6 server is handing out the same (::2) address to multiple clients (possibly all clients?!) and only the first client to make the request has a reasonable chance of working (in theory the others will discover via Duplicate Address Detection (RFC4862) that the address is already in use, and invalidate it, to allow the first client to work).

From all of this I conclude that the Huawei HG659 DHCPv6 server will basically only work in a useful fashion for a single DHCPv6 client, that wants a single address -- so it is almost useless. In particular the DHCPv6 server does not appear to offer a way to make use of parts of the IPv6 /56 delegation provided by Vodafone.

Yet IPv6 global transit does work from multiple OS X and Linux devices on my home network -- so they are clearly not (solely) reliant on IPv6 DHCPv6 working properly.

The reason they have working IPv6 transit is that OS X and Linux will also do SLAAC -- Stateless Address Auto-Configuration (RFC4862) -- to obtain an IPv6 address and default route. SLAAC uses the IPv6 Neighbor Discovery Protocol (RFC4861) to determine the IPv6 address prefix (/64), and a Modified EUI-64 algorithm (described in RFC5342 section 2.2) to determine the IPv6 address suffix (64-bits).
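
As a rough sketch of that Modified EUI-64 derivation (using a made-up MAC address and the RFC3849 documentation prefix, not my real values), the SLAAC address is just the advertised /64 prefix combined with a reshaped copy of the MAC address:

import ipaddress

def slaac_address(prefix, mac):
    # Modified EUI-64: first 3 octets of the MAC (with the Universal/Local
    # bit inverted), then ff:fe, then the last 3 octets of the MAC.
    octets = [int(part, 16) for part in mac.split(":")]
    octets[0] ^= 0x02                  # invert the Universal/Local bit
    suffix = int.from_bytes(bytes(octets[0:3] + [0xFF, 0xFE] + octets[3:6]), "big")
    network = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(network.network_address) | suffix)

# Made-up MAC address and RFC3849 documentation prefix, for illustration only.
print(slaac_address("2001:db8:1:4856::/64", "68:5b:35:12:34:56"))
# 2001:db8:1:4856:6a5b:35ff:fe12:3456   (note 0x68 becomes 0x6a)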

Providing the Huawei HG659 is configured to send IPv6 RA ("Router Advertisement") messages (Home Interface -> LAN Interface -> RA Settings -> Enable RA is ticked), then SLAAC should work. There are two other settings:

  • "RA mode": automatic / manual. In automatic mode it appears to pick a prefix from the /56 that the IPv6 DHCPv6 Prefix Delegation client obtained from Vodafone -- apparently the "56" prefix (at least in my case), for no really obvious reason. In manual mode you can specify a prefix, but that does not seem very useful when the larger prefix you have is dynamically allocated....

  • "ULA mode": disable / automatic / manual. This controls the delegation of IPv6 Unique Local Addresses (RFC4193), which are site-local addresses in the fc00::/7 block. By default it is set to "automatic" which appears to result in the Huawei HG659 picking a prefix block at random (as indicated by a fd00::/8 address). "manual" allows manual specification of the block to use, and "disable" I assume turns off this feature.

Together these four features (IPv6 Link Local Addresses, IPv6 DHCPv6, IPv6 SLAAC, RFC4193 Unique Local Addresses) explain most of the IPv6 addresses that I see on my OS X client machines. For instance (some of the globally unique /56 prefix replaced with xxxx:xxxx, and the last three octets of the SLAAC addresses replaced by yy:yyyy for privacy):

ewen@ashram:~$ ifconfig en6 | egrep "^en|inet6" 
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 fe80::6a5b:35ff:feyy:yyyy%en6 prefixlen 64 scopeid 0x4
        inet6 2407:xxxx:xxxx:4856:6a5b:35ff:feyy:yyyy prefixlen 64 autoconf
        inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary
        inet6 fd50:1d9:5e3e:8300:6a5b:35ff:feyy:yyyy prefixlen 64 autoconf
        inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary
        inet6 2407:xxxx:xxxx:4800::2 prefixlen 128 dynamic
ewen@ashram:~$

In this list:

  • The fe80::6a5b:35ff:feyy:yyyy%en6 address is the IPv6 Link Local Address, derived from the prefix fe80::/64 and an EUI-64 suffix derived from the interface MAC address (as described in RFC2373). It is approximately the first 3 octets of the MAC address, then ff:fe, then the last 3 octets of the MAC address -- but the Universal/Local bit of the MAC address is inverted in IPv6, so as to make ::1, ::2 style hand-created addresses end up automatically marked as "local". (While this seems clever, with perfect hindsight it would perhaps have been better if the IEEE MAC address Universal/Local flag was a Local/Universal flag with the bit values inverted, for the same reason... and perhaps better positioned in the bit pattern.) In this case 0x68 in the MAC address becomes 0x6a:

    ewen@ashram:~$ perl -le 'printf("%08b\n", 0x68);'
    01101000
    ewen@ashram:~$ perl -le 'printf("%08b\n", 0x6a);'
    01101010
    ewen@ashram:~$
    

    by setting this additional (7th from the left) bit.

  • The 2407:xxxx:xxxx:4856:6a5b:35ff:feyy:yyyy address is the globally routable IPv6 SLAAC address, derived from the SLAAC /64 prefix obtained from the IPv6 Router Advertisement packets and the EUI-64 suffix as described above (where the SLAAC /64 prefix provided by the Huawei HG659 itself came from an IPv6 DHCPv6 Prefix Delegation request made by the Huawei HG659). This address is recognisable by the "autoconf" flag indicating SLAAC, and the non-fd prefix.

  • The fd50:1d9:5e3e:8300:6a5b:35ff:feyy:yyyy address is the Unique Local Address (RFC4193), derived from a randomly generated prefix in fd00::/8 and the EUI-64 suffix as described above. This address is recognisable by the "autoconf" flag indicating SLAAC, and the fd prefix. (See also "3 Ways to Ruin Your Future Network with IPv6 Unique Local Addresses" Part 1 and Part 2 -- basically by re-introducing all the pain of NAT to IPv6, as well as all the pain of "everyone uses the same site-local prefixes".)

  • The 2407:xxxx:xxxx:4800::2 address is obtained from the Huawei HG659 DHCPv6 server, and consists of the first /64 in the /56 that the Huawei HG659 DHCPv6 client obtained via Prefix Delegation, and a DHCP assigned suffix, starting with ::2 (where I think the Huawei HG659 itself is ::1, but it does not respond to ICMP with that address). This address is recognisable by the "dynamic" flag indicating DHCPv6.

    Unfortunately as described above the Huawei HG659 DHCPv6 server is broken (at least in Huawei HG659 firmware version V100R001C206B020), and mistakenly hands out the same DHCP assigned suffix to multiple clients. This means that only the lucky first DHCPv6 client on the network will have a working DHCPv6 address. (It also appears, as described above, that it does not support DHCPv6 Prefix Delegation.)

That explains all but two of the IPv6 addresses listed. The remaining two have the "temporary" flag:

ewen@ashram:~$ ifconfig en6 | egrep "^en|inet6" | egrep "^en6|temporary"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
         inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary
         inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary
ewen@ashram:~$

and those are even more special. IPv6 Temporary Addresses are created to reduce the ability to track the same device across multiple locations through the SLAAC EUI-64 suffix -- which, being predictably derived from the MAC address, will stay the same across multiple SLAAC prefixes. Mac OS X (since OS X 10.7 -- Lion) and Microsoft Windows (since Windows Vista) will generate and use them by default.

The relevant RFC is RFC4941 which defines "Privacy Extensions for Stateless Address Autoconfiguration in IPv6". Basically it defines a method to create additional ("temporary") IPv6 addresses, following a method like IPv6 SLAAC, which are not derived from a permanent identifier like the ethernet MAC address -- instead an IPv6 suffix is randomly generated and used in place of the EUI-64 suffix. Amusingly the suggested algorithm appears to be old enough to use the (now widely deprecated) MD5 hash algorithm as part of the derivation steps. (These temporary/"Privacy" addresses are supported on many modern OS.)
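
As a very loose illustration of the result (not the RFC4941 algorithm itself, which keeps per-prefix history state and originally used MD5 as noted above): a temporary address is the same SLAAC /64 prefix combined with a randomised 64-bit interface identifier, with the Universal/Local bit cleared to mark it as locally generated. The prefix below is the RFC3849 documentation prefix, not my real one:

import ipaddress
import secrets

prefix = ipaddress.IPv6Network("2001:db8:1:4856::/64")   # documentation prefix
iid = secrets.randbits(64) & ~(1 << 57)                  # random identifier, Universal/Local bit cleared
temporary = ipaddress.IPv6Address(int(prefix.network_address) | iid)
print(temporary)    # different every run, unlike the EUI-64 derived address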

These RFC4941 "temporary" addresses normally have a shorter lifetime, which can be seen on OS X with "ifconfig -L":

ewen@ashram:~$ ifconfig -L en6 | egrep "^en6|temporary"
en6: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 2407:xxxx:xxxx:4856:b971:8973:3fe3:1a51 prefixlen 64 autoconf temporary pltime 3440 vltime 7040
        inet6 fd50:1d9:5e3e:8300:e010:afec:6457:2850 prefixlen 64 autoconf temporary pltime 3440 vltime 7040
ewen@ashram:~$

but on the system I checked both the temporary and the permanent SLAAC addresses had the same pltime/vltime values, which I assume are derived from the SLAAC validity times. The "pltime" is the "preferred" lifetime, and the "vltime" is the "valid" lifetime; I think that after the preferred lifetime an attempt will be made to renew (or regenerate) the address, and after the valid lifetime the address will be expired (assuming it is not renewed/replaced before then).

It appears that in macOS 10.12 (Sierra) and later, even the non-temporary IPv6 addresses no longer use the EUI-64 approach to derive the address suffix from the MAC address -- which means the "permanent" addresses also changed between 10.11 and 10.12. I do not currently have a macOS 10.12 (Sierra) system to test this on. I found a claim these are RFC 3972 "Cryptographically Generated Addresses", but there does not seem to be much evidence for the exact algorithm used. (There are also suggestions that this is an implementation of RFC7217 "Semantically Opaque Interface Identifiers" which effectively make the IPv6 suffix also depend on the IPv6 prefix. Ie, the resulting address would be stable given the same Prefix, but different for each prefix. See also an IPv6 on OS X Hardening Guide -- from 2015, so probably somewhat out of date now.)

Returning to the problem I started with, configuring a Mikrotik for IPv6, I found that the Mikrotik could have an interface address configured with SLAAC, by setting:

/ipv6 settings set accept-router-advertisements=yes

or:

/ipv6 settings set accept-router-advertisements=yes-if-forwarding-disabled forward=no

(see Mikrotik IPv6 Settings), but at least on 6.40.1 this still does not result in an IPv6 SLAAC address being visible anywhere, even after a reboot -- although, as shown below, the address does get assigned and is reachable. (Bouncing the interface, or rebooting -- /system reboot -- is required to initiate SLAAC.)

You can check the address did get assigned properly by pinging it from another SLAAC configured system, with the EUI-64 derived suffix:

ewen@ashram:~$ ping6 -c 2 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy
PING6(56=40+8+8 bytes) 2407:xxxx:xxxx:4856:6a5b:35ff:fe88:8f6e --> 2407:7000:9b0e:4856:d6ca:6dff:feyy:yyyy
16 bytes from 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy, icmp_seq=0 hlim=255 time=0.440 ms
16 bytes from 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy, icmp_seq=1 hlim=255 time=0.504 ms

--- 2407:xxxx:xxxx:4856:d6ca:6dff:feyy:yyyy ping6 statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/std-dev = 0.440/0.472/0.504/0.032 ms
ewen@ashram:~$

In the default IPv6 settings:

[admin@naos-rb951-2n] > /ipv6 settings print
                       forward: yes
              accept-redirects: yes-if-forwarding-disabled
  accept-router-advertisements: yes-if-forwarding-disabled
          max-neighbor-entries: 8192
[admin@naos-rb951-2n] >

then IPv6 SLAAC will not be performed; but with either of the settings above (after a reboot: /system reboot) then SLAAC will be performed.

Other than the UI/display issue, this is consistent with the idea that the WAN interface of a router should be assignable using SLAAC, but not entirely consistent with the documentation which says SLAAC cannot be used on routers. It is just that to be useful routers typically need IP addresses for multiple interfaces, and the only way to meaningfully obtain those is either IPv6 DHCPv6 Prefix Delegation -- or static configuration.

Since the Huawei HG659 appears not to provide a usable IPv6 DHCPv6 server there is no way to get DHCPv6 Prefix Delegation working internally, which means my best option will be to replace the Huawei HG659 with something else as the "Home Gateway" connected to the Technicolor TC4400VDF modem. Generally people seem to be using a Mikrotik RB750Gr3 (a "hEX") for which there is a general Mikrotik with Vodafone setup guide available. It is a small 5 * GigE router capable of up to 2 Gbps throughput in ideal conditions (by contrast the old Mikrotik RB951-2n that I had lying around to test with has only 5 * 10/100 interfaces, so is slower than my FibreX connection let alone my home networking).

In theory the Mikrotik IPv6 support includes both DHCPv6 Prefix Delegation in the client and server, including on-delegating smaller prefixes. Which should mean that if a Mikrotik RB750Gr3 were directly connected to the Technicolor TC4400VDF cable modem it could handle all my requirements, including creating isolated subnets in IPv4 and IPv6. (The Huawei HG659 supports the typical home "DMZ" device by NAT'ing all IPv4 traffic to a specific internal IP, but it is not very isolated unless you NAT to another firewall like the Mikrotik and then forward from there to the isolated subnet -- and I would really prefer to avoid double NAT. That Huawei HG659 DMZ support also appears to be IPv4 only, and it does not appear to support static IPv4 routes on the LAN interface either -- the static routing functions only allow you to choose a WAN interface.)

Since I seem to have hit up against the limits of the Huawei HG659, my "interim" use of the supplied Huawei HG659 appears to be coming to an end. In the meantime I have turned off the DHCPv6 server on the Huawei HG659 (Home Network -> Lan Interface -> IPv6 DHCP Server -> IPv6 DHCP Server should be unticked).

For the record, the Mikrotik MAC Telnet reimplementation appears to work quite well on OS X 10.11, providing you already know the MAC address you want to reach (eg, from the sticker on the outside of the Mikrotik). That helps a lot with reconfiguration of the Mikrotik for a new purpose, without relying on a Microsoft Windows system or WINE.

Posted Sun Aug 20 17:42:10 2017 Tags:

KeePassXC (source, wiki) is a password manager forked from KeePassX which is a Linux/Unix port of the Windows KeePass Password Safe. KeePassXC was started because of concern about the relatively slow integration of community code into KeePassX -- ie it is a "Community" fork with more maintainers. KeePassXC seems to have been making regular releases in 2017, with the most recent (KeePassXC 2.2.0) adding Yubikey 2FA support for unlocking databases. KeePassXC also provides builds for Linux, macOS, and Windows, including package builds for several Linux distributions (eg an unofficial Debian/Ubuntu community package build, built from the deb package source with full build instructions).

For macOS / OS X there is a KeePassXC 2.2.0 for macOS binary bundle, and KeePassXC 2.2.0 for macOS sha256 digest. They are GitHub "release" downloads, which are served off Amazon S3. KeePassXC provide instructions on verifying the SHA256 Digest and GPG signature. To verify the SHA256 digest:

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg.digest

  • Check the SHA256 digest matches:

    ewen@ashram:~/Desktop$ shasum -a 256 -c KeePassXC-2.2.0.dmg.digest
    KeePassXC-2.2.0.dmg: OK
    ewen@ashram:~/Desktop$
    

To verify the GPG signature of the release:

  • wget https://github.com/keepassxreboot/keepassxc/releases/download/2.2.0/KeePassXC-2.2.0.dmg.sig

  • wget https://keepassxc.org/keepassxc_master_signing_key.asc (which is stored inside the website repository)

  • gpg --import keepassxc_master_signing_key.asc

  • gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2 (alternatively or in addition; in theory it should report it is unchanged)

    ewen@ashram:~/Desktop$ gpg --recv-keys 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    gpg: requesting key 6397D0D2 from hkps server hkps.pool.sks-keyservers.net
    gpg: key 6397D0D2: "KeePassXC Release <release@keepassxc.org>" not changed
    gpg: Total number processed: 1
    gpg:              unchanged: 1
    ewen@ashram:~/Desktop$
    
  • Compare the fingerprint on the website with the output of "gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2":

    ewen@ashram:~/Desktop$ gpg --fingerprint 0xBF5A669F2272CF4324C1FDA8CFB4C2166397D0D2
    pub   4096R/6397D0D2 2017-01-03
          Key fingerprint = BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
    uid                  KeePassXC Release <release@keepassxc.org>
    sub   2048R/A26FD9C4 2017-01-03 [expires: 2019-01-03]
    sub   2048R/FB5A2517 2017-01-03 [expires: 2019-01-03]
    sub   2048R/B59076A8 2017-01-03 [expires: 2019-01-03]
    ewen@ashram:~/Desktop$
    

    to check that the GPG key retrieved is the expected one.

  • Compare the GPG signature of the release:

    ewen@ashram:~/Desktop$ gpg --verify KeePassXC-2.2.0.dmg.sig
    gpg: assuming signed data in `KeePassXC-2.2.0.dmg'
    gpg: Signature made Mon 26 Jun 11:55:34 2017 NZST using RSA key ID B59076A8
    gpg: Good signature from "KeePassXC Release <release@keepassxc.org>"
    gpg: WARNING: This key is not certified with a trusted signature!
    gpg:          There is no indication that the signature belongs to the owner.
    Primary key fingerprint: BF5A 669F 2272 CF43 24C1  FDA8 CFB4 C216 6397 D0D2
         Subkey fingerprint: C1E4 CBA3 AD78 D3AF D894  F9E0 B7A6 6F03 B590 76A8
    ewen@ashram:~/Desktop$
    

    at which point if you trust the key you downloaded is supposed to be signing the code you intend to run, the verification is complete. (There are some signatures on the signing key, but I did not try to track down a GPG signed path from my key to the signing keys, as the fingerprint verification seemed sufficient.)

In addition for Windows and OS X, KeePassXC raised funds for an Authenticode code signing certificate earlier this year. When signed, this results in a "known publisher" which avoids the Windows and OS X warnings about running "untrusted" code, and acts as a second verification of the intended code running. It is not clear that the .dmg or KeePassXC.app on OS X is signed at present, as "codesign -dv ..." reports both the .dmg file and the .app as not signed (note that it is possible to use an Authenticode Code Signing Certificate with OS X's Signing Tools). My guess is maybe the KeePassXC developers focused on Windows executable signing first (and Apple executables normally need to be signed by a key signed by Apple anyway).

Having verified the downloaded binary package, on OS X it can be installed in the usual manner by mounting the .dmg file, and dragging the .app to somewhere in /Applications. There is a link to /Applications in the .dmg file, but without the clever folder background art that some .dmg files have, it is less obvious that you are intended to drag it into /Applications to install. (However there is no included installer, so the obvious alternative is "drag'n'drop" to install.)

Once installed, run KeePassXC.app to start. Create a new password database and give it at least a long master password, then save the database (with the updated master password). After the database is created it is possible to re-open KeePassXC.app with the relevant database with the usual:

open PATH/TO/DATABASE.kdbx

thanks to the application association with the .kdbx file extension. This makes it easier to manage multiple databases. When opened in this way the application will prompt for the master password of the specific database immediately (with the other known databases available as tabs).

KeePassXC YubiKey Support

KeePassXC YubiKey support is via the YubiKey HMAC-SHA1 Challenge-Response authentication, where the YubiKey mixes a shared secret with a challenge token to create a response token. This method was chosen for the KeePassXC YubiKey support because it provides a deterministic response without, eg, needing to reliably track counters or deal with gaps in monotonically increasing values, such as is needed with U2F -- Universal 2nd Factor. This trades a reduction in security (due to just relying on a shared secret) for robustness (eg, not getting permanently locked out of the password database due to the YubiKey's counter having moved on to a newer value than the password database), and ease of use (eg, not having to activate the YubiKey at both open and close of a database; the KeePassXC ticket #127 contains some useful discussion of the tradeoffs with authentication modes needing counters; pwsafe also uses YubiKey Challenge-Response mode, presumably for similar reasons).
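
As a minimal sketch of the underlying HMAC-SHA1 Challenge-Response idea (the secret and challenge below are made up; in the real setup the 20-byte secret lives on the YubiKey and the challenge comes from KeePassXC), anyone holding the shared secret can compute the same deterministic response to a given challenge:

import hashlib
import hmac
import os

secret = bytes.fromhex("00112233445566778899aabbccddeeff00112233")   # 20-byte shared secret (example only)
challenge = os.urandom(64)    # 64-byte challenge, matching the "Fixed 64-byte input" mode

response = hmac.new(secret, challenge, hashlib.sha1).digest()
print(response.hex())         # 20-byte response: same secret + same challenge => same response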

The design chosen seems similar to KeeChallenge, a plugin for KeePass2 (source) to support YubiKey authentication for the Windows KeePass. There is a good setup guide to Securing KeePass with a Second Factor describing how to set up the YubiKey and KeeChallenge, which seems broadly transferable to using the similar KeePassXC YubiKey Challenge-Response feature. (A third party YubiKey Handbook contains an example of configuring the Challenge-Response mode from the command line for a slightly different purpose.)

By contrast, the Windows KeePass built in support is OATH-HOTP authentication (see also KeePass and YubiKey), which does not seem to be supported on KeePassXC -- some people also note OTP 2nd Factor provides authentication not encryption which may limit the extra protection in the case of a local database. HOTP also uses a shared key and a counter so suffers from similar shared secret risks as the Challenge Response mechanism, as well as robustness risks in needing to track the counter value -- one guide to the OATH-HOTP mode warns about keeping OTP recovery codes to get back in again after being locked out due to the counter getting out of sync. See also HOTP and TOTP details; HOTP hashes a secret key and a counter, whereas TOTP hashes a secret key and the time, which means it is easier to accidentally get out of sync with HOTP. TOTP seems to be more widely deployed in client-server situations, presumably because it is self-recovering given a reasonably accurate time source.
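
As a sketch of the OATH-HOTP calculation (RFC4226) mentioned above, showing why the counter matters -- the same secret with a different counter produces a completely different code, so both sides have to stay in sync. TOTP is essentially the same calculation with the counter replaced by the current 30-second time step:

import hashlib
import hmac
import struct

def hotp(secret, counter, digits=6):
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                  # dynamic truncation (RFC4226 section 5.3)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

secret = b"12345678901234567890"    # the RFC4226 test vector secret
print(hotp(secret, 0))              # 755224 -- matches the RFC4226 test vectors
print(hotp(secret, 1))              # 287082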

Configuring a YubiKey to support Challenge-Response HMAC-SHA1

To configure one or more YubiKeys to support Challenge-Response you need to:

  • Install the YubiKey Personalisation Tool from the Apple App Store; it is a zero cost App, but obviously will not be very useful without a YubiKey or two. (The YubiKey Personalisation Tool is also available for other platforms, and in a command line version.)

  • Run the YubiKey Personalization Tool.app

  • Plug in a suitable YubiKey, eg, YubiKey 4; the cheaper YubiKey U2F security key does not have sufficient functionality. (Curiously the first time that I plugged a new YubiKey 4 in, the Keyboard Assistant in OS X 10.11 (El Capitan) wanted to identify it as a keyboard, which seems to be a known problem -- apparently one can just kill the dialog, but I ended up touching the YubiKey, then manually selecting an ANSI keyboard, which also seems to be a valid approach. See also the YubiKey User Guide examples for mac OS X.)

  • Having done that, the YubiKey Personalisation Tool should show "YubiKey is inserted", details of the programming status, serial number, and firmware version, and a list of the features supported.

  • Change to the "Challenge Response" tab, and click on the "HMAC-SHA1" button.

  • Select "Confguration Slot 2" (if you overwrite Configuration Slot 1 then YubiKey Cloud will not work, and that apparently is not recoverable, so using Slot 2 is best unless you are certain you will never need YubiKey Cloud; out of the factory only Configuration Slot 1 is programmed).

  • Assuming you have multiple YubiKeys (and you should, to allow recovery if you lose one or it stops functioning) tick "Program Multiple YubiKeys" at the top, and choose "Same Secret for all Keys" from the dropdown, so that all the keys share the same secret (ie, they are interchangeable for this Challenge-Response HMAC-SHA1 mode).

  • You probably want to tick "Require user input (button press)", to make it harder for a remote attacker to activate the Challenge-Response functionality.

  • Select "Fixed 64-byte input" for the HMAC-SHA1 mode (required by KeeChallenge for KeePass; unclear if it is required for KeePassXC but selecting it did work).

  • Click on the "Generate" button to generate a random 20-byte value in hex.

  • Record a copy of the 20-byte value somewhere safe, as it will be needed to program an additional/replacement YubiKey with the same secret later (unlike KeeChallenge it is not needed to set up KeePassXC; instead KeePassXC will simply ask the YubiKey to run through the Challenge-Response algorithm as part of the configuration process, not caring about the secret key used, only caring about getting repeatable results).

    Beware the dialog box seems to be only wide enough to display 19 of the bytes (not 20), and not resizeable, so you have to scroll in the input box to see all the bytes :-( Make sure you get all 20 bytes, or you will be left trying to guess the first or last byte later on. (And make sure you keep the copy of the shared secret secure, as anyone with that shared secret can program a working YubiKey that will be functionally identical to your own. Printing it out and storing it somewhere safe would be better than storing it in plain text on the computers you are using KeePassXC on... and storing it inside KeePassXC creates a catch-22 situation!)

  • Double check your settings, then click on "Write Configuration" to store the secret key out to the attached YubiKey.

  • The YubiKey Personalisation Tool will want to write a "log" file (actually a .csv file), which will also contain the secret key, so make sure you keep that log safe, or securely delete it.

  • Pull out the first YubiKey, and insert the next one. You should see a "YubiKey is removed" message then a "YubiKey is inserted" message. Click on "Write Configuration" for the next one. Repeat until you have programmed all the YubiKeys you want to be interchangeable for the Challenge-Response HMAC-SHA1 algorithm. (Two, kept separately, seems like the useful minimum, and three may well make sense.)

Configuring a KeePassXC database to use password and YubiKey authentication

  • Insert one of the programmed YubiKeys

  • Open KeePassXC on an existing password database (or create a new one), and authenticate to it.

  • Go to Database -> Change Master Key.

  • Enter your Password twice (ie, so that the Password will be set back to the same password)

  • Tick "Challenge Response" as well (so that the Password and "Challenge Respone" are both ticked)

  • An option like "YubiKey (nnnnnnn) Challenge Response - Slot 2 - Press" should appear in the drop down list

  • Click the "OK" button

  • Save the password database

  • When prompted press the button on your YubiKey (which will allow it to use the YubiKey Challenge Response secret to update the database).

Accessing the KeePassXC database with password and YubiKey authentication

To test that this has worked, close KeePassXC (or at least lock the database), then open KeePassXC again. You will get a prompt for access credentials as usual, without any options ticked.

Verify that you can open the database using both the password and the YubiKey Challenge-Response, by typing in the password and ticking "Challenge Response" (after checking it detected the YubiKey) and then clicking on "OK". When prompted, click the button on your YubiKey, and the database should open. (KeePassXC seems to recognise that the Challenge-Response is needed if you have opened the database with the YubiKey and the YubiKey is present; but you will need to remember to also enter the password each time you authenticate. At least it will auto-select the Password as soon as you type one in. The first time around opening a specific database is just one additional box to tick, which is fairly easy to remember particularly if you use the same combination -- password and YubiKey Challenge-Response -- on all your databases.)

You can confirm that both the password and the YubiKey Challenge Response are required, by trying to authenticate just using the Password (enter Password, untick "Challenge Response", press OK), and by trying to authenticate just using the YubiKey (tick "Challenge Response", untick Password, press OK). In both cases it should tell you "Unable to open database" (the "Wrong key or database file is corrupt" really means "insufficient authentication" to recover the database encryption key in this case; they could perhaps more accurately say "could not decrypt master key" here, perhaps with a suggestion to check the authentication details provided).

If you have programmed multiple YubiKeys with the same Challenge-Response shared secret (and hopefully you have programmed at least two), be sure to check opening the database with each YubiKey to verify that they are programmed identically and thus are interchangeable for opening the password database. It should open identically with each key (because they were all programmed with the same secret, and thus the Challenge Response values are identical).

If you have multiple databases that you want to protect with the YubiKey Challenge-Response method, you will need to go through the Database -> Change Master Key steps and verification steps for each one. It probably makes sense to change them all at the same time, to avoid having to try to remember which ones need the YubiKey and which ones do not.

Usability of KeePassXC with Password and YubiKey authentication

Once you have configured KeePassXC for Password and YubiKey authentication, and opened the database at least once using the YubiKey, the usability is fairly good. Use:

open PATH/TO/DATABASE.kdbx

to open a specific KeePassXC password database directly, and KeePassXC will launch with a window to authenticate to that password database. So long as one of the appropriate YubiKeys is plugged in, after a short delay (less time than it takes to type in your password) the YubiKey will be detected, and Challenge-Response selected. Then you just type in your password as usual (which auto selects "Password" as well), hit enter (which auto-OKs the dialog), and touch your YubiKey when prompted.

One side effect of configuring your KeePassXC databases like this is that they are not able to be opened in other KeePass related tools, except maybe the Windows KeePass with the KeeChallenge plugin (which uses a similar method; I have not tested that). For desktop use, KeePassXC should work pretty much everywhere that is likely to be useful (modern Windows, modern macOS / OS X, modern Linux), as should the YubiKey, so desktop portability is fairly good. But, for instance, MiniKeePass (source, on the iOS App Store) will not be able to open the password database. Amongst other reasons, while the "camera connection kit" can be used to link a YubiKey to an iOS device, the YubiKey iOS HowTo points out that U2F, OATH-TOTP and Challenge-Response functionality will not work (and I found suggestions on the Internet this only worked with older iOS versions).

If access from a mobile device is important, then you may want to divide your passwords amongst multiple KeePass databases: a "more secure" one including the YubiKey Challenge-Response and a "less secure" one that only requires a password for compatibility. For instance it might make sense to store "low risk" website passwords in their own database protected only by a relatively short master password, and synchronise that database for use by MiniKeePass (using the DropBox app). But keep higher security/higher risk passwords protected by password and YubiKey Challenge-Response and only accessible from a desktop application (and not synchronised via DropBox to reduce exposure of the database itself).

It also looks like, in the UI, it should be possible to configure KeePassXC to require only the YubiKey Challenge-Response (no password), simply by changing the master key and only specifying YubiKey Challenge-Response. Since the Challenge-Response shared secret is fairly short (20 bytes, so 160 bits), secured only by that shared key, and the algorithm is known, that too would be a relatively low security form of authentication. Possibly again for "low value" passwords like random website logins with no real risk it might offer a more secure way to store per-website random passwords, rather than reusing the same password on each website. But the combination of password and YubiKey Challenge-Response would be preferable for most password databases over the YubiKey Challenge-Response alone, even if the password itself was fairly short (eg under 16 characters).

Posted Sun Jul 23 15:34:28 2017 Tags:

Apple's Time Machine software, included with macOS for about the last 10 years, is a service to automatically back up a computer to one or more external drives or machines. Once configured it pretty much looks after itself, usually keeping hourly/daily/weekly snapshots for sensible periods of time. It can even rotate the snapshots amongst multiple targets to give multiple backups -- although it really wants to see every drive around once a week, otherwise it starts to regularly complain about no backups to a given drive, even when there are several other working backups. (Which makes it a poor choice for offline, offsite, backups which are not brought back onsite again frequently; full disk clones are better for that use case.)

More recent versions of Time Machine include local snapshots, which are copies saved to the internal drive in between Time Machine snapshots to an external target -- for instance when that external target is not available. This is quite useful functionality on, eg, a laptop that is not always on its home network or connected to the external Time Machine drive. These local snapshots do take up some space on the internal drive, but Time Machine will try to ensure there is at least 10% free space on the internal drive and aim for 20% free space (below that Time Machine local snapshots are usually cycled out fairly quickly, particularly if you do something that needs more disk space).

On my older MacBook Pro, the internal SSD (large, but not gigantic, for the time when it was bought, years ago) has been "nearly full" for a long time, so I have been regularly looking for things taking up space that do not need to be on the internal hard drive. In one of these explorations I found that while Time Machine's main local snapshot directory was tiny:

ewen@ashram:~$ sudo du -sm /.MobileBackups
1       /.MobileBackups
ewen@ashram:~$ 

as expected with an almost full drive causing the snapshots to be expired rapidly, there was another parallel directory which was surprisingly big:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
21448   /.MobileBackups.trash/
ewen@ashram:~$

(21.5GB -- approximately 2-3 times the free space on the drive). When I looked in /.MobileBackups.trash/ I found a bunch of old snapshots from 2014 and 2016, some of which were many gigabytes each:

root@ashram:/.MobileBackups.trash# du -sm *
2468    Computer
412     MobileBackups_2016-10-22-214323
16824   MobileBackups_2016-10-24-163201
1746    MobileBackups_2016-10-26-084240
1       MobileBackups_2016-12-18-144553
1       MobileBackups_2017-02-05-125225
1       MobileBackups_2017-05-18-180448
root@ashram:/.MobileBackups.trash# du -sm Computer/*
1480    Computer/2014-06-08-213847
58      Computer/2014-06-15-122559
156     Computer/2014-06-15-162406
166     Computer/2014-06-29-183344
608     Computer/2014-07-06-151454
3       Computer/2016-10-22-174000
root@ashram:/.MobileBackups.trash# 

Some searching online indicated that this was a fairly common problem (there are many other similar reports). As best I can tell what is supposed to happen is:

  • /.MobileBackups is automatically managed by Time Machine to store local snapshots, and they are automatically expired as needed to try to keep the free disk space at least above 10%.

  • /.MobileBackups.trash appears if for some reason Time Machine cannot remove a particular local snapshot or needs to start again (eg a local snapshot was not able to complete); in that case Time Machine will move the snapshot out of the main /.MobileBackups directory into the /.MobileBackups.trash directory. The idea is that eventually whatever is locking the files in the snapshot to prevent them from being deleted will be cleared, eg, by a reboot, and then /.MobileBackups.trash will get cleaned up. This is part of the reason for reboots being suggested as part of the resolution for Time Machine issues.

However there appear to be some scenarios where it is impossible to remove /.MobileBackups.trash, which just leads to it gradually accumulating over time. Some people report hundreds of gigabytes used there. Because /.MobileBackups.trash is not the main Time Machine Local Snapshots, it shows up as "Other" in the OS X Storage Report -- rather than "Backups". And of course if it cannot be deleted, it will not be automatically removed to make space when you need more space on the drive :-(

Searching for /.MobileBackups.trash in /var/log/system.log turned up the hint that Time Machine was trying to remove the directory, but being rejected:

Jul 18 16:31:36 ashram com.apple.mtmd[852]: Failed to delete
/.MobileBackups.trash, error: Error Domain=NSCocoaErrorDomain
Code=513 "“.MobileBackups.trash” couldn’t be removed because you
don’t have permission to access it."
UserInfo={NSFilePath=/.MobileBackups.trash, NSUserStringVariant=(
    Remove
), NSUnderlyingError=0x7feb82514860 {ErrorDomain=NSPOSIXErrorDomain
Code=1 "Operation not permitted"}}

(plus lots of "audit warning" messages about the drive being nearly full, which was the problem I first started with). There are some other references to that failure on OS X 10.11 (El Capitan), which I am running on the affected machine.

Based on various online hints I tried:

  • Forcing a full Time Machine backup to an external drive, which is supposed to cause it to clean up the drives (it did do a cleanup, but it was not able to remove /.MobileBackups.trash).

  • Disabling the Time Machine local snapshots:

    sudo tmutil disablelocal
    

    which is supposed to remove the /.MobileBackups and /.MobileBackups.trash directories; it did remove /.MobileBackups but could not remove /.MobileBackups.trash.

  • Emptying the Finder Trash (no difference to /.MobileBackups.trash)

  • Wait a while to see if it got automatically removed (nope!)

  • Forcing a full Time Machine backup to an external drive, now that the local Time Machine snapshots are turned off. That took ages to get through the prepare stage (the better part of an hour), suggesting it was rescanning everything... but it did not reduce the space usage in /.MobileBackups.trash in the slightest.

Since I had not affected /.MobileBackups.trash at all, I then did some more research into possible causes for why the directory might not be removable. I found a reference suggesting file flags might be an issue, but searching for the schg and uchg flags did not turn up anything:

sudo find /.MobileBackups.trash/ -flags +schg
sudo find /.MobileBackups.trash/ -flags +uchg

(uchg is the "user immutable" flag; schg is the "system immutable" flag). There are also xattr attributes (which I have used previously to avoid accidental movement of directories in my home directory), which should be visible as "+" (attributes) or "@" (permissions) when doing "ls -l" -- but in some quick hunting around I was not seeing those either (eg sudo ls -leO@ CANDIDATE_DIR).

I did explicitly try removing the immutable flags recursively:

sudo chflags -f -R nouchg /.MobileBackups.trash
sudo chflags -f -R noschg /.MobileBackups.trash

but that made no obvious difference.

Next, after finding a helpful guide to reclaiming space from Time Machine Local snapshots I ensured that the Local Snapshots were off, then rebooted the system:

sudo tmutil disablelocal

followed by Apple -> Restart... In theory that is supposed to free up the /.MobileBackups.trash snapshots for deletion, and then delete them. At least when you do another Time Machine backup -- so I forced one of those after the system came back up again. No luck, /.MobileBackups.trash was the same as before.

After seeing reports that /.MobileBackups.trash could be safely removed manually, and having (a) two full recent Time Machine snapshots and (b) having just rebooted with the Time Machine Local Snapshots turned off, I decided it was worth trying to manually remove /.MobileBackups.trash. I did:

sudo rm -rf "/.MobileBackups.trash"

with the double quotes included to try to reduce the footgun potential of typos (rm -rf / is something you very rarely want to do, especially by accident!).

That was able to remove most of the files, by continuing when it had errors, but still left hundreds of files and directories that it reported being unable to remove:

ewen@ashram:~$ sudo rm -rf "/.MobileBackups.trash"
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences/SystemConfiguration: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library/Preferences: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/Library: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var/db: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private/var: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume/private: Operation not permitted
rm: /.MobileBackups.trash/MobileBackups_2016-10-22-214323/Computer/2016-10-22-182406/Volume: Directory not empty
[....]

At least most of the disk space was reclaimed, with just 45MB left:

ewen@ashram:~$ sudo du -sm /.MobileBackups.trash/
45      /.MobileBackups.trash/
ewen@ashram:~$

In order to get back to a useful state I then moved that directory out of the way:

sudo mv /.MobileBackups.trash /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

and rebooted my machine again to ensure everything was in a fresh start state.

When the system came back up again, I tried removing various parts of /var/tmp/mobilebackups-trash-undeleteable-2017-07-18 with no more success. Since the problem had followed the files rather than the location I figured there had to be something about the files which prevented them from being removed. So I did some more research.

The most obvious is the Time Machine Safety Net, which is special protections around the Time Machine snapshots to deal with the fact that they create hard links to directories (to conserve inodes, I assume) which can confuse rm. The recommended approach is to use "tmutil delete", but while it will take a full path doing something like:

tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323

will just fail with a report that it is an "Invalid deletion target":

ewen@ashram:/var/tmp$ sudo tmutil delete /var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323
/private/var/tmp/mobilebackups-trash-undeleteable-2017-07-18/MobileBackups_2016-10-22-214323: Invalid deletion target (error 22)
Total deleted: 0B
ewen@ashram:/var/tmp$ 

and nothing will be deleted. My guess is that it at least tries to ensure that it is inside a Time Machine backup directory.

Another approach suggested is to use Finder to delete the directory, as that has hooks to the extra cleanup magic required, so I did:

open /var/tmp

and then highlighted mobilebackups-trash-undeleteable-2017-07-18 and tried to permanently delete it with Alt-Cmd-Delete. After a confirmation prompt, and some file counting, that failed with:

The operation can't be completed because an unexpected error occurred (error code -8072).

deleting nothing. Explicitly changing the problem directories to be owned by me:

sudo chown -R ewen:staff /var/tmp/mobilebackups-trash-undeleteable-2017-07-18

also failed to change anything.

There is an even lower level technique to bypass the Time Machine Safety Net, using a helper bypass tool, which on OS X 10.11 (El Capitan) is in "/System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass". However running the rm with the bypass tool did not get me any further forward:

cd /var/tmp
sudo /System/Library/Extensions/TMSafetyNet.kext/Contents/Helpers/bypass rm -rf mobilebackups-trash-undeleteable-2017-07-18

failed with the same errors, leaving the whole 45MB still present. (From what I can tell online using the bypass tool is fairly safe if you are removing all the Time Machine snapshots, but can leave very incomplete snapshots if you merely try to remove some snapshots -- due precisely to the directory hard links which is the reason that the Time Machine Safety Net exists in the first place. Proceed with caution if you are not trying to delete everything!)

More hunting for why root could not remove files turned up the OS X 10.11+ (El Capitan onwards) System Integrity Protection, which adds quite a few restrictions to what root can do. In particular the theory was that the files had a restricted flag on them, which means that only restricted processes, signed by Apple, would be able to modify them.

That left me with the options of either trying to move the files back somewhere that "tmutil delete" might be willing to deal with, or trying to override System Integrity Protection for long enough to remove the files. Since Time Machine had failed to delete the files, apparently for months or years, I chose to go with the more brute force approach of overriding System Integrity Protection for a while so that I could clean up.

The only way to override System Integrity Protection is to boot into System Recovery mode, and run "csrutil disable", then reboot again to access the drive with System Integrity Protection disabled. To do this:

  • Apple -> Restart...

  • Hold down Cmd-R when the system chimes for restarting, and/or the Apple Logo appears; you have started a Recovery Boot if the background stays black rather than showing a color backdrop prompting for your password

  • When Recovery mode boots up, use Utilities -> Terminal to start a terminal.

  • In the Terminal window, run:

     csrutil disable
    
  • Reboot the system again from the menus

When the normal boot completes and you log in, you are running without System Integrity Protection enabled -- the foot gun is now on automatic!

Having done that, OS X was happy to let me delete the left over trash:

ewen@ashram:/var/tmp$ sudo du -sm mobilebackups-trash-undeleteable-2017-07-18/
Password:
45      mobilebackups-trash-undeleteable-2017-07-18/
ewen@ashram:/var/tmp$ sudo rm -rf mobilebackups-trash-undeleteable-2017-07-18
ewen@ashram:/var/tmp$ ls mob*
ls: mob*: No such file or directory
ewen@ashram:/var/tmp$ 

so I had finally solved the problem I started with, leaving no "undeleteable" files around for later. My guess is that those snapshots happened to run at a time that captured files with restricted flags on them, which then could not be removed (at least once Time Machine had thrown them out of /.MobileBackups and into /.MobileBackups.trash). But it seems unfortunate that the log messages could not have provided more useful instructions.

All that was left was to put the system back to normal:

  • Boot into recovery mode again (Apple -> Restart...; hold down Cmd-R at the chime/Apple logo)

  • Inside Recovery Mode, re-enable System Integrity Protection, with:

    csrutil enable
    

    inside Utilities -> Terminal.

  • Reboot the system again from the menus.

At this point System Integrity Protection is operating normally, which you can confirm with the "csrutil status" command that you can run at any time:

ewen@ashram:~$ csrutil status
System Integrity Protection status: enabled.
ewen@ashram:~$ 

(changes to the status can be made only in Recovery Mode).

Finally re-enable Time Machine local snapshots, because on a mobile device it is a useful feature:

sudo tmutil enablelocal

and then force the first local snapshot to be made now to get the process off to an immediate start:

sudo tmutil snapshot

At which point you should have /.MobileBackups with a snapshot or two inside it:

root@ashram:~# ls -l /.MobileBackups/Computer/
total 8
-rw-r--r--  1 root  wheel  263 18 Jul 17:37 .mtm.private.plist
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173719
drwxr-xr-x@ 3 root  wheel  102 18 Jul 17:37 2017-07-18-173758
root@ashram:~# 

and if you look in the Time Machine Preferences Window you should see the line that it will create "Local snapshots as space permits".

Quite the adventure! But my system now has about three times as much free disk space as it did previously, which was definitely worth the effort.

Posted Wed Jul 19 17:55:22 2017 Tags:

After upgrading to the "Windows 10 Creators Update", on my dual booted Dell XPS 9360, I installed the Windows Subsystem for Linux, because my fingers like being able to use Unix/Linux commands :-)

There are a few steps to enabling and installing "Bash on Ubuntu on Windows", on a 64-bit Windows 10 Creators Update install:

  • Turn on the Windows Subsystem for Linux, by starting an Administrator PowerShell (right click on Windows icon at bottom left, choose "Windows PowerShell (Admin)" from the menu), then run:

    Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
    

    It will run for a little while with a text progress message, then ask to reboot and do a small amount of installation before restarting.

  • Turn on Developer Mode to enable installing extra features: In Settings -> Update and Security -> For developers move the radio selection to Developer Mode (default seems to be "Sideload Apps"; settings can be found by left clicking on the Windows icon at the bottom left, then click on the "cog wheel"). It will install a "Developer Mode package" (which I guess includes, eg, additional certificates).

  • Start a cmd prompt (eg, Windows -> Run... cmd), and inside that run "bash" to trigger the install of the Linux environment. (Note that without doing the above two steps this will fail completely, with a "not found" message, so if you get a "not found" message double check you have done the steps above.) You are prompted to accept the terms at https://aka.ms/uowterms, which seems to just be a shortlink to the Ubuntu Licensing Page.

    The text also notes that this is a "beta feature", which is presumably why it is necessary to enable "Developer Mode"; the install itself seems to download from the Windows (application) Store. (They also warn against modifying Linux files from Windows applications, which illustrates the complexity of making the two subsystems play nicely together. It seems like this behaves a little more like a "Linux container" running under Windows than parallel processes.)

  • It detected the locale needed for New Zealand (en_NZ) and offered to change it from the default (en_US); I said "yes". (Note there was quite a long delay after this answer before the next prompt, enough that I wondered if it had read the answer -- give it another minute or two.)

  • Then it prompted for a Unix-style user name (see the WSL guide to Linux User Account and Permissions, at http://aka.ms/wslusers). It also prompts for a Unix-style password, which I assume is mostly used to run sudo.

  • Install the outstanding Ubuntu Linux 16.04 updates (ie, the ones released since the install snapshot was made):

    sudo apt-get update
    sudo apt-get dist-upgrade
    

After that, other than the default prompt/vim colours being terrible for a black background console (default on Windows), the environment works in a pretty similar manner to a native Linux environment.
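
One interoperability detail worth knowing: the Windows drives are mounted inside the Linux environment under /mnt (C: appears as /mnt/c), so the Linux tools can at least read the Windows side of the disk:

ls /mnt/c/
ls /mnt/c/Users

(Going the other way -- modifying files in the Linux file system from Windows tools -- is the direction they warn against.)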

The "Bash on Ubuntu on Windows" menu option added, which runs bash directly, suffers from the same "black background" readability issues. But fortunately if you go to Properties (click on top left of title bar, choose Properties) you can change the colours -- I simple changed the background to be 192/192/192 (default foreground grey), and the foreground to be 0/0/0 (default background black), and the default prompt/vim etc colours look more like they are intended.

There is some more documentation for Bash on Ubuntu on Windows which covers the whole Ubuntu on Windows feature. Of note, the "Windows 10 Creators Update" version of the feature is based on Ubuntu Linux 16.04, which means I now have Ubuntu Linux 16.04 functionality on both sides of the dual boot environment :-) The newer version also seems to have improved Linux/Windows interoperability, and 24-bit colour support in the console.

Posted Sun Jul 16 10:36:05 2017 Tags:

Tickets for the Wellington 2017 edition of the New Zealand International Film Festival went on sale this morning at 10:00. As with 2014 and 2015 the online ticketing worked... rather poorly for the first hour or so after tickets went on sale. Leading to various tweets calling for patience -- and an apology from the NZIFF Festival Director on Facebook. I had fairly high hopes at 10:00 this morning, after being told by other Festival regulars that 2016 had been better than previous years -- but they were quickly dashed. After trying for the first half hour and getting nowhere I gave up until about 11:15, and then eventually managed to buy the tickets I wanted gradually, mostly one at a time, over the next hour.

As I have said previously, ticketing is a hard problem. Given a popular event, and limited (good) seats, there will always be a rush for the (best) seats as soon as the sales start. The demand in the first day will always be hundreds of times higher than the demand two weeks later, and the demand in the first hour will be 75% of the demand in the first day. That is just part of the business, so what you need to do is plan for that to happen.

The way that NZIFF (and/or their providers) have set up their online ticketing appears, even four years in, to not properly plan for efficiently handling a large number of buyers all wanting to buy at once. Some of the obvious problems with their implementation include:

  • putting the tickets for about 500 (five hundred) events on sale at the exact same moment -- so instead of a moderate sized stampede for tickets to one event, there are many stampedes for many events, all competing for the same server/network resources.

  • only collecting information about the types of tickets required at ticket purchase time, rather than collecting it in the "wishlist" in advance.

  • only collecting details of the purchaser at ticket purchase time, rather than collecting them in advance (eg, "create account profile" as part of building the wishlist), requiring more round trips to the server, and storing more data in the database during the contentious ticket sale period.

  • relying on a "best available seat" algorithm that has no user control, and typically picks at best a mediocre seat, thus forcing many more users through the "choose my own seat" process which requires more intensive server interaction

  • not collecting money in advance (eg, selling "movie bucks" credits), which means that the period where seat allocations are conditional, waiting on payment, is extended much longer -- which both delays freeing unconfirmed seats for the next buyer to choose from and requires more writes to the database

Less obviously, it appears as if there are some other technical problems:

  • Not designing to automatically "scale out" rapidly to more servers when the site is busy

  • Not pre-scaling to a large enough size, and pre-warming the servers (so they have everything they need in RAM) before opening up ticket sales

  • Breaking the web pages up into too many small requests and stages, increasing the client/server interaction (and thus both load on the server and points at which the process could go wrong) dramatically

  • Writing too much to disk during the processing

  • Reading too much from disk during the interactions

  • Not offloading enough to third party services (eg, CDNs)

and behind all of these is inadequate load testing to simulate the thundering herd of requests that come as an inevitable part of the ticket sales problem, leading to false confidence that "this year we will be okay", only to have those hopes crushed in the first 10 minutes.

So how do we make this ticket sales problem more manageable:

  • Stagger the event sales -- with 500+ events over 15+ days there is no good reason to put all of the events on sale at exactly the same time. It just makes the ticket sales problem two orders of magnitude worse than a single popular event. So break up the ticket releases into stages -- open up sales for the first few days of the festival at 10:00 on the first day, then open up sales for the next few days of the festival in the afternoon or the next day. Clearly having many many days with new tickets going on sale is impractical, but staggering the ticket sales opening over 2-5 days is fairly easily achieved, and instantly halves (or better) the "sales open" server load.

  • Collect every possible piece of information you can in advance of ticket sales, and write it to the database in advance. This would include all the name and contact details needed to complete the sale, a confirmation the user has read the terms and conditions, and details of how many tickets of which types the user wants. All of this can be part of the account profile and "wishlist". Ideally the only thing left for ticket sales time is seat allocation.

  • Preferably also collect the user's preferred seat, or a way to hint to the seat allocation policy where to pick. Many regular moviegoers (and almost all of the early sales will be regulars) will know the venues like the back of their hand, and can probably name off the top of their head their favourite seat. Obviously you cannot guarantee the exact seat will still be available when they buy their ticket, but if your seat selection algorithm is choosing "seat nearest to the user's desired one" rather than "arbitrary seat the average person might not hate", then there is a good chance the user will not have to interact with the "choose my seat" screen at all. (For about half the films I booked this morning pre-entering my preferred seat would have just worked to give me the perfect seat. But since I had no way to pre-enter it, I had to go through the "I want to choose a better seat than the automatic one" process on every single movie session.)

  • Ideally, collect the user's money in advance. Many of the most eager purchasers will be literally spending hundreds of dollars, and going to dozens of sessions. Most of them would probably be willing to pre-purchase, say, a block of "10 tickets of type FOO" to be allocated to sessions later, if it sped up their ticket purchasing process. Having the money in advance both saves the users typing in their credit card details over and over again, and also means the server can go directly from "book my session" to "session confirmed" with no waiting -- avoiding writing details to the database at any intermediate step. (This also potentially decreases the festival expenses on credit card transaction fees by an order of magnitude.)

  • Maintain an in-RAM cache of "hot" information, such as the seats which are available/sold/in the process of being sold for each active session. Use memcached or other similar products. Make the initial decisions about which seats to offer the user from those tables, only accessing the database to store a permanent reservation once the seats are found.

  • Done completely you end up with a process that is:

    • User ticks one or more sessions for which they want to finalise their ticket purchase

    • The website returns a page saying "use your credit to buy these seats for these sessions", pre-populated with the nearest seat to the ones they pre-indicated they wanted. It saves a temporary seat reservation to the database with a short timeout, and marks it in the RAM cache as "sale in progress". These writes to the database can be very short (and thus quick) because they are just a 4-tuple of (userid, eventid, seatid, expiry time).

    • User clicks "yes, complete sale", their single interaction if the seat they wanted (or a "close enough" one) is available.

    • The website marks the temporary seat reservations as final (by writing "never" in the expiry time), writes the new credit balance to the database, and returns a page saying "you have these seats, and this credit left, tickets to follow via email".

    Occasionally a user might need to dive into the seat selection page to try to find a better choice, but for users in that critical first hour there is a pretty good chance that they will get something close to the seat they wanted. And the users will rapidly decide the algorithm is doing as well as is possible when they dive into the seat selection page and find all the ones nearer their preferred seat are gone already.

  • Organise the website so as much as possible is static content -- all images, styling (CSS), descriptions of films, etc, is cache-friendly static content. That both allows the browsers not to even ask for it again, and for any checks for whether it has changed to be met with a very quickly answered "all good, you have the latest version".
    Redirect all that static content to an external CDN to keep it away from the sales transaction process.

  • For data that has to be dynamically loaded (eg, seats available) send it in the most compact form possible, and unpack it on the client. CPU time on the client is effectively free in this process as there are many many client CPUs, and relatively few server resources. Try to offload as much work as possible to the browser CPUs, and make them rely on as little as possible coming from the central server.

  • By getting the sales process down to "are you sure?" / "yes", very few server interactions are required, so users get through the process quicker and go away (reducing load) happy. It also means that there is very little to write to the database, so the database contention is dramatically reduced. Done properly almost nothing has to be read from the database.

  • The quick turn around then makes it possible to do things like, eg, keep an HTTPS connection open from the browser to the load balancer to the back end webserver for the 15-30 seconds it takes to complete the sale, avoiding a bunch of network congestion and setup time. This also dramatically reduces the risk of the sales process failing at step 7 of 10, and the user having to start again (which means the load generated by all previous steps was wasted load on the server and means the user is frustrated). By taking the payment out of line from the seat allocation/finalisation process, the web browser only needs to interact with a single server, maximising the chances of keeping the connection "hot", ready for when the user eagerly clicks the "yes, perfect, I want those" button -- which completes the transaction as quickly as possible.

  • The quick turn around would also encourage users to purchase multiple sessions at once, rather than resorting to purchasing one ticket at a time just to have a chance of anything working. And users purchasing, eg, 10 sessions at a time will get all the tickets they were after much quicker, then leave the site -- and more server resources available for all other users.

  • Host the server as close as possible to the actual users, so that the web connection set up time is as small as possible, and the data transfers to the user happen as fast as possible. Having connections stall for long periods due to packet loss and long TCP timeouts (scaled to expect long delays due to distance) just ties up server resources and makes the users frustrated.

  • Pre-start lots of additional capacity in advance of the "on sale now" time, and pre-warm it by running a bunch of test transactions through, so the servers are warmed up and ready to go. A day or two later you can manually (or automatically) scale back to a more realistic "sale days 2-20" capacity. With most "cloud" providers you will pay a few hundred dollars extra on the first few hours, or days, in exchange for many happy customers and thus many sales. The extra sales possible as a result may well pay for the extra "first day" hosting costs. And most "cloud" providers will allow you to return that extra capacity on the second day at no extra cost -- so it is a single day cost.
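
    Even a very crude warm-up pass helps here; a sketch of the idea, with made-up hostnames and URLs, using ApacheBench and curl to replay the key pages and seat-availability queries shortly before sales open:

    ab -n 5000 -c 50 https://sales.example.com/
    ab -n 5000 -c 50 https://sales.example.com/sessions/
    for i in $(seq 1 500); do
        curl -s -o /dev/null "https://sales.example.com/api/seats?session=$i"
    done

    Anything along these lines pulls the hot data into RAM and shakes out obvious capacity problems while there is still time to add servers.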

Implementing any of this would help. Implementing all of it would make a dramatic difference to the first day experience of the most enthusiastic customers. I for one would be extremely grateful even just to avoid having to type in my name and contact details several dozen times (between failed and successful one-ticket-at-a-time attempts), or avoiding having to type my credit card details in a couple of dozen times in a rush to try to "complete the sale while the site is still talking to me".

Posted Thu Jul 6 20:52:32 2017 Tags:

git svn provides a way to check out Subversion repositories and interact with them using the git interface. It is useful for avoiding the mental leap of occasionally working with another revision control tool when, eg, dealing with RANCID repositories of switch and router configuration (historically RANCID only supported CVS, and then more recently CVS and Subversion; recent versions do support git directly, but not all my clients are using recent enough versions to have direct git support).

Unfortunately the git svn command interaction is still fairly different from "native git" repository interaction, which causes some confusion. But fortunately with a few kludges you can hide the worst of this from day-to-day interaction.

Starting at the beginning, you "clone" the repository with something like:

git svn clone svn+ssh://svn.example.com/var/lib/rancid/SVN/switches

(using a specific path within the Subversion repository as suggested by Janos Gyerik, rather than fighting with the git svn/Subversion branch impedance mismatch).

After that you're supposed to use:

git svn rebase

to update the repository pulling in changes made "upstream"; "git pull" simply will not work.

However my fingers, trained by years of git usage, really want to type "git pull" -- if I have to type something else to update, then I might as well just run svn directly. So I went looking for a solution to make "git pull" work on git svn clone (which I never changed locally).

An obvious answer would be to define a git alias (see Git Aliases), but sadly it is not possible to define a git alias that shadows an internal command, and it appears this is considered a feature. I could call the alias something else, but then I am back at "have to type something different, so I might as well just run svn" :-(

A comment on the same Stack Overflow thread suggests the best answer is to define a bash "shell function" that intercepts calls to git and redirects commands as appropriate. In my case I want "git pull" to run "git svn rebase" if (and only if) I am in a git svn repository. Inspecting those repositories showed that one unique feature they have is that there is a .git/svn directory -- so that gave me a way to tell which repositories to run this command on. Some more searching turned up git rev-parse --show-toplevel as the way to find the .git directory, so my work around could work no matter how deep I am in the git svn repository.

Putting these bits together I came up with this shell function that intercepts "git pull", checks for a git svn repository, and if we are running "git pull" in a git svn repository runs "git svn rebase" instead -- which does a fetch and update, just like "git pull" would do on a native repository:

function git {
    local _GIT

    if test "$1" = "pull"; then
        _GIT=$(command git rev-parse --show-toplevel)
        if test -n "${_GIT}" -a -e "${_GIT}/.git/svn"; then
            command git svn rebase
        else
            command git "$@"
        fi
    else
        command git "$@"
    fi
}

(The "command git" bit forces bash to ignore the alias, and run the git in the PATH instead, preventing infinite recursion -- without having to hard code the path of the git binary.)

Now "git pull" functions like normal in git repositories, and magically does the right thing on git svn repositories; and all other git commands run as normal.

It is definitely a kludge, but avoiding a daily "whoops, that is the repository that is special" confusion is well worth it. (git log and git diff seem to "just work" in git svn repositories -- which are the main two other commands I end up using on RANCID repositories.)
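
(The same trick would extend to pushing changes back, should that ever be needed -- a sketch only, assuming you want "git push" to map onto "git svn dcommit" in git svn repositories; I have not needed this myself since I never change these repositories locally:

function git {
    local _GIT

    _GIT=$(command git rev-parse --show-toplevel 2>/dev/null)
    if test -n "${_GIT}" -a -e "${_GIT}/.git/svn"; then
        case "$1" in
            pull) command git svn rebase ;;
            push) command git svn dcommit ;;
            *)    command git "$@" ;;
        esac
    else
        command git "$@"
    fi
}

with the usual caveat that dcommit rewrites the local commits as it sends them to Subversion.)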

Posted Tue Jul 4 14:52:58 2017 Tags:

Debian 7.0 ("Wheezy") was originally released about four years ago, in May 2013; the last point release (7.11) was released a year ago, in June 2016. While Debian 7.0 ("Wheezy") has benefited from the Debian Long Term Support with a further two years of support -- until 2018-05-31 -- the software in the release is now pretty old, particularly software relating to TLS (Transport Layer Security) where the most recent version supported by Debian Wheezy is now the oldest still reasonably usable on the Internet. (The Long Term Support also covered only a few platforms -- but they were the most commonly used platforms including x86 and amd64.)

More recently Debian released Debian 8.0 ("Jessie"), originally a couple of years ago in May 2015 (with the latest update, Debian 8.8, released last month, in May 2017). Debian are also planning on releasing Debian Stretch (presumably as Debian 9.0) in mid June 2017 -- in a couple of weeks. This means that Debian Stretch is still a "testing" distribution, which does not have security support, but all going according to plan later this month (June 2017) it will be released and will have security support after the release -- for several years (between the normal security support and the likely Debian Long Term Support).

Due to a combination of lack of spare time last year, and the Debian LTS providing some additional breathing room to schedule updates, I still have a few "legacy" Debian installations currently running Debian Wheezy (7.11). At this point it does not make much sense to upgrade them to Debian Jessie (itself likely to go into Long Term Support in about a year), so I have decided to upgrade these systems from Debian Wheezy (7.11) through Debian Jessie (8.8) and straight on to Debian Stretch (currently "testing", but hopefully soon 9.0). My plan is to start with the systems least reliant on immediate security support -- ie, those that are not exposed to the Internet directly. I have done this before, going from Ubuntu Lucid (10.04) to Ubuntu Trusty (14.04) in two larger steps, both of which were Ubuntu LTS distributions.

Most of these older "still Debian Wheezy" systems were originally much older Debian installs, that have already been incrementally upgraded several times. For the two hosts that I looked at this week, the oldest one was originally installed as Debian Sarge, and the newest one was originally installed as Debian Etch, as far as I can tell -- although both have been re-homed on new hardware since the originally installs. From memory the Debian Sarge install ended up being a Debian Sarge install only due to the way that two older hosts were merged together some years ago -- some parts of that install date back to even older Debian versions, around Debian Slink first released in 1999. So there are 10-15 years of legacy install decisions there, as well as both systems having a number of additional packages installed for specific long-discarded tasks that create additional clutter (such is the disadvantage of the traditional Unix "one big system" approach, versus the modern approach of many VMs or containers). While I do have plans to gradually break the remaining needed services to separate, automatically built, VMs or containers, it is clearly not going to happen overnight :-)

The first step in planning such an upgrade is to look at the release notes for Debian 8 (Jessie) and Debian 9 (Stretch).

The upgrade instructions are relatively boilerplate (prepare for an upgrade, check system status, change apt sources, minimal package updates then full package updates) but do contain hints as to possible upgrade problems with specific packages and how to work around them.
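
In practice that boilerplate amounts to something like the following for the first hop (a sketch only, assuming a simple /etc/apt/sources.list pointing at the standard mirrors -- obviously read the release notes in full and take backups first):

# Point apt at the new release
sudo sed -i 's/wheezy/jessie/g' /etc/apt/sources.list
sudo apt-get update

# Minimal upgrade first, then the full distribution upgrade
sudo apt-get upgrade
sudo apt-get dist-upgrade

with the same pattern repeated (jessie to stretch) for the second hop.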

The "issues to be aware of" contain a lot of compatibility hints of things which may break as a result of the upgrade. In particular Debian 8 (Jessie) brings:

  • Apache 2.4, which both has significantly different configuration syntax and only includes files ending in .conf (breaking, eg, virtual server configuration files named after just the domain name); the Squid proxy configuration also changes significantly (see the Squid 3.2, 3.3, and 3.4 release notes, particularly Helper Name Changes).

  • systemd (in the form of systemd-sysv) by default, which potentially breaks local init changes (or custom scripts), and means halt no longer powers off by default -- that behaviour apparently being declared "a bug that was never fixed" in the old init scripts, after many many years of it working that way. It got documented, but that is about it. (IMHO the only use of "halt but do not power off" is in systems like Juniper JunOS where a key on the console can be used on the halted system to cause it to boot again in the case of accidental halts; it is not clear that actually works with systemd. systemd itself has of course been rather controversial, eventually leading to Devuan Jessie 1.0 which is basically Debian Jessie without systemd. While I am not really a fan of many of systemd's technical decisions, the adoption by most of the major Linux distributions makes interaction with it inevitable, so I am not going out of my way to avoid it on these machines.)

  • The "nobody" user (and others) will have their shell changed to /usr/sbin/nologin -- which mostly affects running commands like:

    sudo su -c /path/to/command nobody
    

    Those commands instead need to be run as:

    sudo su -s /bin/bash -c /path/to/command nobody
    

    Alternatively you can choose to decline the change for just the nobody user -- in an interactive upgrade the tool asks about each user's change, provided your debconf question priority is medium or lower. In my case nobody was the last user shell change mentioned.

  • systemd will start, fsck, and mount both / and /usr (if it is a separate device) during the initramfs. In particular, this means that if they are RAID (md) or LVM volumes they need to be started by the time that initramfs runs, or startable by initramfs. There also seem to be some races around this startup, which may mean that not everything starts correctly; at least once I got dumped into the systemd rescue shell, and had to run "vgchange -a y" for systemd, wait for everything to be automatically mounted, and then tell it to continue booting (exit), but on one boot it booted correctly by itself, so it is definitely a race. (See, eg, Debian bug #758808, Debian bug #774882, and Debian bug #782793. The latter reports a fix in lvm2 2.02.126-3 which is not in Debian Jessie, but is in Debian Stretch, so I did not try too hard to fix this in Debian Jessie before moving on. The main system I experienced this on booted correctly, first time, on Debian Stretch, and continued to reboot automatically, whereas on Debian Jessie it needed manual attention pretty much every boot.)
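
    (For the record, recovering from that rescue shell was just a matter of activating the volume groups by hand and then resuming the boot -- volume group names will vary:

    vgchange -a y
    exit

    after which systemd carried on mounting everything and finished booting.)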

Debian 9 (Stretch) seems to be bringing:

  • Restrictions around separate /usr (it must be mounted by initramfs if it is separate; but the default Debian Stretch initramfs will do this)

  • net-tools (arp, ifconfig, netstat, route, etc) are deprecated (and not installed by default) in favour of using iproute2 (ip ...) commands. Which is a problem for cross-platform finger-macros that have worked for 20-30 years... so I suspect net-tools will be a common optional package for quite a while yet :-)

  • A warning that a Debian 8.8 (Jessie) or Debian 9 (Stretch) kernel is needed for compatibility with the PIE (Position Independent Executable) compile mode for executables in Debian 9 (Stretch), and thus it is extra important to (a) install all Debian 8 (Jessie) updates and reboot before upgrading to Debian 9 (Stretch), and (b) to reboot very soon after upgrading to Debian 9 (Stretch). This also affects, eg, the output of file -- reporting shared object rather than executable (because the executables are now compiled more like shared libraries, for security reasons). (Position independent code (PIC) is also somewhat slower on register-limited machines like 32-bit x86 -- but gcc 5.0+ contains some performance improvements for PIC which apparently help reduce the penalty. This is probably a good argument to prefer amd64 -- 64-bit mode -- for new installs. And even the x86 support is i686 or higher only; Debian Jessie is the last release to support i586 class CPUs.)
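
    (One quick way to see the PIE change in practice is to run file against any system binary after the upgrade, eg:

    file /bin/ls

    which on Debian 9 (Stretch) reports "shared object" where earlier releases said "executable" -- cosmetic, but surprising the first time you notice it.)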

  • SSH v1, and older ciphers, are disabled in OpenSSH (although it appears Debian Stretch will have a version where they can still be turned back on; the next OpenSSH release is going to remove SSH v1 support entirely, and it is already removed from the development tree). Also ssh root password login is disabled on upgrade. These ssh changes are particularly an upgrade risk -- one would want to be extra sure of having an out of band console to reach any newly upgraded machines before rebooting them.
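
    A minimal pre-reboot sanity check, assuming OpenSSH, is to verify the merged configuration parses and confirm the effective root login policy:

    sudo sshd -t
    sudo sshd -T | grep -i permitrootlogin

    before relying on being able to ssh back in after the reboot.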

  • Changes around apt package pinning calculations (although it would be best to remove all pins and alternative package repositories during the upgrade anyway).

  • The Debian FTP Servers are going away which means that ftp URLs should be changed to http -- the ftp.CC.debian.org names seem likely to remain for the foreseeable future for use with http.
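
    The change itself is a one-liner -- a sketch assuming the New Zealand mirror; adjust for whichever mirror is actually in your sources.list:

    sudo sed -i 's,ftp://ftp.nz.debian.org,http://ftp.nz.debian.org,g' /etc/apt/sources.list
    sudo apt-get update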

I have listed some notes on issues experienced below for future reference, and will update this list with anything else I find as I upgrade more of the remaining legacy installs over the next few months.

Debian 7 (Wheezy) to Debian 8 (Jessie)

  • webkitgtk (libwebkitgtk-1.0-common) has limited security support. To track down why this is needed:

    apt-cache rdepends libwebkitgtk-1.0-common
    

    which turns up libwebkitgtk-1.0-0, which is used by a bunch of packages. To find the installed packages that need it:

    apt-cache rdepends --installed libwebkitgtk-1.0-0
    

    which gives libproxy0 and libcairo2, and repeating that pattern indicates many things installed depending on libcairo2. Ultimately iceweasel / firefox-esr are one of the key triggering packages (but not the only one). I chose to ignore this at this point until getting to Debian Stretch -- and once on Debian Stretch I will enable backports to keep firefox-esr relatively up to date.

  • console-tools has been removed, due to being unmaintained upstream, which is relatively unimportant for my systems which are mostly VMs (with only serial console) or okay with the default Linux kernel console. (The other packages removed on upgrade appear to just be, eg, old versions of gcc, perl, or other packages replaced by newer versions under a new name.)

  • /etc/default/snmpd changed, which removes custom options and also disables the mteTrigger and mteTriggerConf features. The main reason for the change seems to be to put the PID file into /run/snmpd.pid instead of /var/run/snmpd.pid. /etc/snmp/snmpd.conf also changes by default, which will probably need to be merged by hand.

    On SNMP restart a bunch of errors appeared:

    Error: Line 278: Parse error in chip name
    Error: Line 283: Label statement before first chip statement
    Error: Line 284: Label statement before first chip statement
    Error: Line 285: Label statement before first chip statement
    Error: Line 286: Label statement before first chip statement
    Error: Line 287: Label statement before first chip statement
    Error: Line 288: Label statement before first chip statement
    Error: Line 289: Label statement before first chip statement
    Error: Line 322: Compute statement before first chip statement
    Error: Line 323: Compute statement before first chip statement
    Error: Line 324: Compute statement before first chip statement
    Error: Line 325: Compute statement before first chip statement
    Error: Line 1073: Parse error in chip name
    Error: Line 1094: Parse error in chip name
    Error: Line 1104: Parse error in chip name
    Error: Line 1114: Parse error in chip name
    Error: Line 1124: Parse error in chip name
    

    but snmpd apparently started again. The line numbers are too high to be /etc/snmp/snmpd.conf, and as bug report #722224 notes, the filename is not mentioned. An upstream mailing list message implies it relates to lm_sensors object, and the same issue happened on upgrade from SLES 11.2 to 11.3. The discussion in the SLES thread pointed at hyphens in chip names in /etc/sensors.conf being the root cause.

    As a first step, I removed libsensors3 which was no longer required:

    apt-get purge libsensors3
    

    That appeared to be sufficient to remove the problematic file, and then:

    service snmpd stop
    service snmpd start
    service snmpd restart
    

    all ran without producing that error. My assumption is that the old /etc/sensors.conf was from a much older install, and no longer in the preferred location or format. (For the first upgrade where I encountered it, the machine was now a VM so lm-sensors reading "hardware" sensors was not particularly relevant.)

  • libsnmp15 was removed, but not purged. The only remaining file was /etc/snmp/snmp.conf (note not the daemon configuration, but the client configuration), which contained:

    #
    # As the snmp packages come without MIB files due to license reasons, loading
    # of MIBs is disabled by default. If you added the MIBs you can reenable
    # loading them by commenting out the following line.
    mibs :
    

    on default systems to prevent the SNMP MIBs from being loaded. Typically one would want to enable SNMP MIB usage, and thus get names of things rather than just long numeric OID strings. snmp-mibs-downloader appears to still exist in Debian 8 (Jessie), but it is in non-free.

    The snmp client package did not seem to be installed, so I installed it manually along with snmp-mibs-downloader:

    sudo apt-get install snmp snmp-mibs-downloader
    

    which caused snmp, rather than libsnmp15, to own the /etc/snmp/snmp.conf configuration file, which makes more sense. After that I could purge both libsnmp15 and console-tools:

    sudo apt-get purge libsnmp15 console-tools
    

    (console-tools was an easy choice to purge as I had not actively used its configuration previously, and thus could be pretty sure that none of it was necessary.)

    To actually use the MIBs one needs to comment out the "mibs :" line in /etc/snmp/snmp.conf manually, as per the instructions in the file.
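
    That edit can be scripted if it needs doing on several hosts; a one-line sketch:

    sudo sed -i 's/^mibs :/#mibs :/' /etc/snmp/snmp.conf

    after which the command line tools should display MIB names rather than bare numeric OIDs.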

  • Fortunately it appeared I did not have any locally modified init scripts which needed to be ported. The suggested check is:

    dpkg-query --show -f'${Conffiles}' | sed 's, /,\n/,g' | \
       grep /etc/init.d | awk 'NF,OFS="  " {print $2, $1}' | \
       md5sum --quiet -c
    

    and while the first system I upgraded had one custom written init script it was for an old tool which did not matter any longer, so I just left it to be ignored.

    I did have problems with the rsync daemon, as listed below.

  • Some "dummy" transitional packages were installed, which I removed:

    sudo apt-get purge module-init-tools iproute
    

    (replaced by udev/kmod and iproute2 respectively). The ttf-dejavu packages also showed up as "dummy" transitional packages but owned a lot of files so I left them alone for now.

  • Watching the system console revealed the errors:

    systemd-logind[4235]: Failed to enable subscription: Launch helper exited with unknown return code 1
    systemd-logind[4235]: Failed to fully start up daemon: Input/output error
    

    which some users have reported when being unable to boot their system, although in my case it happened before rebooting so possibly was caused by a mix of systemd and non-systemd things running.

    systemctl --failed reports:

    Failed to get D-Bus connection: Unknown error -1
    

    as in that error report, possibly due to the wrong dbus running; the running dbus in this system is from the Debian 7 (Wheezy) install, and the systemd/dbus interaction changed a lot after that. (For complicated design choice reasons, historically dbus could not be restarted, so changing it requires rebooting.)

    The system did reboot properly (although it appeared to force a check of the root disk), so I assume this was a transitional update issue.

  • There were quite a few old Debian 7 (Wheezy) libraries, which I found with:

    dpkg -l | grep deb7
    

    that seemed no longer to be required, so I removed them manually. (Technically that only finds packages with security updates within Debian Wheezy, but those seem the most likely to be problematic to leave lying around.)

    At one point after the upgrade apt-get offered a large selection of packages to autoremove, but after some other tidy up and rebooting it no longer showed any packages to autoremove; it is unclear what happened to cause that change in report. I eventually found the list in my scrollback and pasted the contents into /tmp/notrequired, then did:

    for PKG in $(cat /tmp/notrequired); do echo $PKG; done | tee /tmp/notrequired.list
    dpkg -l | grep -f /tmp/notrequired.list
    

    to list the ones that were still installed. Since this included the libwebkitgtk-1.0-common and libwebkitgtk-1.0-0 packages mentioned above, I did:

    sudo apt-get purge libwebkitgtk-1.0-common libwebkitgtk-1.0-0
    

    to remove those. Then I went through the remainder of the list, and removed anything marked "transitional" or otherwise apparently no longer necessary to this machine (eg, where there was a newer version of the same library installed). This was fairly boring rote cleanup, but given my plan to upgrade straight to Debian 9 (Stretch) it seemed worth starting with a system as tidy as possible.

    I left installed the ones that seemed like I might have installed them deliberately (eg, -perl modules) for some non-packaged tool, just to be on the safe side.

  • I found yet more transitional packages to remove with:

    dpkg -l | grep -i transitional
    

    and removed them with:

    sudo apt-get purge iceweasel mailx mktemp netcat sysvinit
    

    after using "dpkg -L PACKAGE" to check that they contained only documentation; sysvinit contained a couple of helper tools (init and telinit) but their functionality has been replaced by separate systemd programs (eg systemctl) so I removed those too.

    Because netcat is useful, I manually installed the dependency it had brought in to ensure that was selected as an installed package:

    sudo apt-get install netcat-traditional
    

    While it appeared that multiarch-support should also be removable as a no-longer required transitional package, since it was listed as transitional and contained only manpages, in practice attempts to remove it resulted in libc6 wanting to be removed too, which would rapidly lead to a broken system. (On my system the first attempt failed on gnuplot, which was individually fixable by installing, eg, gnuplot-nox explicitly and removing the gnuplot meta package, but since removing multiarch-support led to removing libc6 I did not end up going down that path.)

    For consistency I also needed to run aptitude and interactively tell aptitude about these decisions.

  • After all this tidying up, I found nothing was listening on the rsync port (tcp/873) any longer. Historically I had run the rsync daemon using /etc/init.d/rsync, which still existed, and still belonged to the rsync package.

    sudo service rsync start
    

    did work, to start the rsync daemon, but it did not start at boot. Debian Bug #764616 provided the hint that:

    sudo systemctl enable rsync
    

    was needed to enable it starting at boot. As Tobias Frost noted on Debian Bug #764616 this appears to be a regression from Debian Wheezy. It appears the bug eventually got fixed in rsync package 3.1.2-1, but that did not get backported to Debian Jessie (which has 3.1.1-3) so I guess the regression remains for everyone to trip over :-( If I was not already planning on upgrading to Debian Stretch then I might have raised backporting the fix as a suggestion.

  • inn2 (for UseNet) is no longer supported on 32-bit (x86); only the LFS (Large File Support) package, inn2-lfs is supported, and it has a different on-disk database format (64-bit pointers rather than 32-bit pointers). The upgrade is not automatic (due to the incompatible database format) so you have to touch /etc/news/convert-inn-data and then install inn2-lfs to upgrade:

    You are trying to upgrade inn2 on a 32-bit system where an old inn2 package
    without Large File Support is currently installed.
    
    Since INN 2.5.4, Debian has stopped providing a 32-bit inn2 package and a
    LFS-enabled inn2-lfs package and now only this LFS-enabled inn2 package is
    supported.
    
    This will require rebuilding the history index and the overview database,
    but the postinst script will attempt to do it for you.
    
    [...]
    
    Please create an empty /etc/news/convert-inn-data file and then try again
    upgrading inn2 if you want to proceed.
    

    Because this fails out the package installation it causes apt-get dist-upgrade to fail, which leaves the system in a partially upgraded messy state. For systems with inn2 installed on 32-bit this is probably the biggest upgrade risk.

    To try moving forward:

    sudo touch /etc/news/convert-inn-data
    sudo apt-get -f install
    

    All going well the partly installed packages will be fixed up, then:

    [ ok ] Stopping news server: innd.
    Deleting the old overview database, please wait...
    Rebuilding the overview database, please wait...
    

    will run (which will probably take many minutes on most non-trivial inn2 installs; in my case these are old inn2 installs, which have been hardly used for years, but do have a lot of retained posts, as a historical archive). You can watch the progress of the intermediate files needed for the overview database being built with:

    watch ls -l /var/spool/news/incoming/tmp/
    watch ls -l /var/spool/news/overview/
    

    in other windows, but otherwise there is no real indication of progress or how close you are to completion. The "/usr/lib/news/bin/makehistory -F -O -x" process that is used in rebuilding the overview file is basically IO bound, but also moderately heavy on CPU. (The history file index itself, in /var/lib/news/history.* seems to rebuild fairly quickly; it appears to be the overview files that take a very long time, due to the need to re-read all the articles.)

    It may also help to know where makehistory is up to reading, eg:

    MKHISTPID=$(ps axuwww | awk '$11 ~ /makehistory/ && $12 ~ /-F/ { print $2; }')
    sudo watch ls -l "/proc/${MKHISTPID}/fd"
    

    which will at least give some idea which news articles are being scanned. (As far as I can tell one temporary file is created per UseNet group, which is then merged into the overview history; the merge phase is quick, but the article scan is pretty slow. Beware the articles are apparently scanned in inode order rather than strictly numerical order, which makes it harder to tell group progress -- but at least you can tell which group it is on.)

    In one of my older news servers, with pretty slow disk IO, rebuilding the overview file took a couple of hours of wall clock time. But it is slow even given the disk bandwidth, because it makes many small read transactions. This is for about 9 million articles, mostly in a few groups where a lot of history was retained, including single groups with 250k-350k articles retained -- and thus stored in a single directory by inn2. On ext4 (but probably without directory indexes, due to being created on ext2/ext3).

    Note that all of this delay blocks the rest of the upgrade of the system, due to it being done in the post-install script -- and the updated package will bail out of the install if you do not let it do the update in the post-install script. Given the time required it seems like a less disruptive upgrade approach could have been chosen, particularly given the issue is not mentioned at all as far as I can see in the "Issues to be aware of for Jessie" page. My inclination for the next one would be to hold inn2, and upgrade everything else first, then come back to upgrading inn2 and anything held back because of it.
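
    A sketch of that hold-and-defer approach for next time (assuming apt-mark, which is available on these releases):

    sudo apt-mark hold inn2
    sudo apt-get dist-upgrade
    # later, at a convenient time
    sudo touch /etc/news/convert-inn-data
    sudo apt-mark unhold inn2
    sudo apt-get install inn2

    so the long overview rebuild happens on its own schedule rather than in the middle of the system upgrade.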

    Some searching turned up enabling ext4 dir_index handling to speed up access for larger directories:

    sudo service inn2 stop
    sudo umount /dev/r1/news
    sudo tune2fs -O dir_index,uninit_bg /dev/r1/news
    sudo tune2fs -l /dev/r1/news
    sudo e2fsck -fD /dev/r1/news
    sudo mount /dev/r1/news
    sudo service inn2 start
    

    I apparently did not do this on the previous OS upgrade to avoid locking myself out of using earlier OS kernels; but these ext4 features have been supported for many years now.

    In hindsight this turned out to be a bad choice, causing a lot more work. It is unclear if the file system was already broken, or if changing these options and doing partial fscks broke it :-( At minimum I would suggest running e2fsck -f /dev/r1/news before changing any options, so you at least know whether the file system was good beforehand.

    In my case when I first tried this change I also set "-O uninit_bg" since it was mentioned in the online hints, and then after the first e2fsck, tried to do one more "e2fsck -f /dev/r1/news" to be sure the file system was okay before mounting it again. But apparently parts of the file system need to be initialised by a kernel thread when "uninit_bg" is set.

    I ended up with a number of reports like:

    Inode 8650758, i_size is 5254144, should be 6232064.  Fix? yes
    Inode 8650758, i_blocks is 10378, should be 10314.  Fix? yes
    

    followed by a huge number of reports like:

    Pass 2: Checking directory structure
    Directory inode 8650758 has an unallocated block #5098.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5099.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5100.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5101.  Allocate? yes
    

    which were too numerous to allocate by hand (although I tried saying "yes" to a few), and they could not be fixed automatically (eg, not fixable by "sudo e2fsck -pf /dev/r1/news").

    It is unclear if this was caused by "-O uninit_bg", or some earlier issue on the file system (this older hardware has not been entirely stable), or whether there was some need for more background initialisation to happen which I interrupted by mounting the disk, then unmounting it, and then deciding to check it again.

    Since the file system could still be mounted, I tried making a new partition and using tar to copy everything off it before trying to repair it. But the tar copy also reported many many kernel messages like:

    Jun 11 19:12:10 HOSTNAME kernel: [24027.265835] EXT4-fs error (device dm-3): __ext4_read_dirblock:874: 
    inode #9570798: block 6216: comm tar: Directory hole found
    

    and in general the copy proceeded extremely slowly (way way below the disk bandwidth). So I gave up on trying to make a tar copy first, as it seemed like it would take all night with no certainty of completing. I assume these holes are the same "unallocated blocks" that fsck complained about.

    Given that the news spool was mostly many year old articles which I also had not looked at in years, instead I used dd to make a bitwise copy of the partition:

    dd if=/dev/r1/news of=/dev/r1/news_backup bs=32768
    

    which ran at something approaching the underlying disk speed, and at least gives me a "broken" copy to try a second repair on if I find a better answer later.

    Running a non-interactive "no change" fsck:

    e2fsck -nf /dev/r1/news
    

    indicated the scope of the problem was pretty huge, with both many unallocated block reports as above, and also many errors like:

    Problem in HTREE directory inode 8650758: block #1060 has invalid depth (2)
    Problem in HTREE directory inode 8650758: block #1060 has bad max hash
    Problem in HTREE directory inode 8650758: block #1060 not referenced
    

    which I assume indicate dir_index directories that did not get properly indexed, as well as a whole bunch of files that would end up in lost+found. So the file system was pretty messed up.

    Figuring backing out might help, I turned dir_index off again:

    tune2fs -O ^dir_index /dev/r1/news
    tune2fs -l /dev/r1/news
    

    There were still a lot of errors when checking with e2fsck -nf /dev/r1/news, but at least some of them were directories with the INDEX_FL flag set on a filesystem without htree support, so it seemed like letting fsck fix that would avoid a bunch of the later errors.

    So as a last ditch attempt, no longer really caring about the old UseNet articles (and knowing they are probably on the previous version of this hosts disks anyway), I tried:

     e2fsck -yf /dev/r1/news
    

    and that did at least result in fewer errors/corrections, but it did throw a lot of things in lost+found :-(

    I ran e2fsck -f /dev/r1/news again to see if it had fixed everything there was to fix, and at least it did come up clean this time. On mounting the file system, there were 7000 articles in lost+found, out of several million on the file system. So I suppose it could have been worse. Grepping through them, they appear to have been from four Newsgroups (presumably the four inodes originally reported as having problems), and all are ones I do not really care about any longer. inn2 still started, so I declared success at this point.

    At some point perhaps I should have another go at enabling dir_index, but definitely not during a system upgrade!

  • python2.6 and related packages, and squid (2.x; replaced by squid3) needed to be removed before db5.1-util could be upgraded. They are apparently linked via libdb5.1, which is not provided in Debian Jessie, but is specified as broken by db5.1-util unless it is a newer version than was in Debian Wheezy. In Debian Jessie only the binary tools are provided, and it offers to uninstall them as an unneeded package.

    Also netatalk is in Debian Wheezy and depends on libdb5.1, but is not in Debian Jessie at all. This surprised other people too, and netatalk seems to be back in Debian Stretch. But it is still netatalk 2.x, rather than netatalk 3.x which has been released for years; someone has attempted to modify the netatalk package to netatalk 3.1, but that also seems to have been abandoned for the last couple of years. (Because I was upgrading through to Debian Stretch, I chose to leave the Debian Wheezy version of netatalk installed, and libdb5.1 from Debian Wheezy installed until after the upgrade to Debian Stretch.)

Debian 8 (Jessie) to Debian 9 (Stretch)

  • Purged the now removed packages:

    # dpkg -l | awk '/^rc/ { print $2 }'
    fonts-droid
    libcwidget3:i386
    libmagickcore-6.q16-2:i386
    libmagickwand-6.q16-2:i386
    libproxy1:i386
    libsigc++-2.0-0c2a:i386
    libtag1-vanilla:i386
    perl-modules
    #
    

    with:

    sudo apt-get purge $(dpkg -l | awk '/^rc/ { print $2 }')
    

    to clear out the old configuration files.

  • Checked changes in /etc/default/grub:

    diff /etc/default/grub.ucf-dist /etc/default/grub
    

    and updated grub using update-grub.

  • Checked changes in /etc/ssh/sshd_config:

    grep -v "^#" /etc/ssh/sshd_config.ucf-old | grep '[a-z]'
    grep -v "^#" /etc/ssh/sshd_config | grep '[a-z]'
    

    and checked that the now commented-out lines are the defaults. Then check that sshd stops/starts/restarts with the new configuration:

    sudo service ssh stop
    sudo service ssh start
    sudo service ssh restart
    

    and that ssh logins work after the upgrade.

  • The isc-dhcp-server service failed to start because it wanted to start both IPv4 and IPv6 service, and the previous configuration (and indeed the network) only had IPv4 configuration:

    dhcpd[15518]: No subnet6 declaration for eth0
    

    Looking further back in the log I saw:

    isc-dhcp-server[15473]: Launching both IPv4 and IPv6 servers [...]
    

    with the hint "(please configure INTERFACES in /etc/default/isc-dhcp-server if you only want one or the other)".

    Setting INTERFACES in /etc/default/isc-dhcp-server currently works to avoid starting the IPv6 server, but it results in a warning:

    DHCPv4 interfaces are no longer set by the INTERFACES variable in
    /etc/default/isc-dhcp-server.  Please use INTERFACESv4 instead.
    Migrating automatically for now, but this will go away in the future.
    

    so I edited /etc/default/isc-dhcp-server and changed it to set INTERFACESv4 instead of INTERFACES.
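
    So the relevant line in /etc/default/isc-dhcp-server ends up looking something like (interface name taken from the log message above; adjust to suit):

    INTERFACESv4="eth0"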

    After that:

    sudo service isc-dhcp-server stop
    sudo service isc-dhcp-server start
    sudo service isc-dhcp-server restart
    

    worked without error, and syslog reported:

    isc-dhcp-server[15710]: Launching IPv4 server only.
    isc-dhcp-server[15710]: Starting ISC DHCPv4 server: dhcpd.
    
  • The /etc/rsyslog.conf has changed somewhat, particularly around the syntax for loading modules. Lines like:

    $ModLoad imuxsock # provides support for local system logging
    

    have changed to:

    module(load="imuxsock") # provides support for local system logging
    

    I used diff /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-dist to find these changes and merged them by hand. I also removed any old commented out sections no longer present in the new file, but kept my own custom changes (for centralised syslog).

    Then tested with:

    sudo service rsyslog stop
    sudo service rsyslog start
    sudo service rsyslog restart
    
  • This time, even after reboot, apt-get reported a whole bunch of unneeded packages, so I ran:

    sudo apt-get --purge autoremove
    

    to clean them up.

  • An aptitude search:

    aptitude search '~i(!~ODebian)'
    

    from the Debian Stretch Release Notes on Checking system status provided a hint on finding packages which used to be provided, but are no longer present in Debian. I went through the list by hand and manually purged anything which was clearly an older package that had been replaced (eg old cpp and gcc packages) or was no longer required. There were a few that I did still need, so I have left those installed -- but it would be better to find a newer Debian packaged replacement to ensure there are updates (eg, vncserver).

  • Removing the Debian 8 (Jessie) kernel:

    sudo apt-get purge linux-image-3.16.0-4-686-pae
    

    gave the information that the libc6-i686 library package was no longer needed, as in Debian 9 (Stretch) it is just a transitional package, so I did:

    sudo apt-get --purge autoremove
    

    to clean that up. (I tried removing the multiarch-support "transitional" package again at this point, but there were still a few packages with unmet dependencies without, including gnuplot, libinput10, libreadline7, etc, so it looks like this "transitional" package is going to be with us for a while yet.)

  • update-initramfs reported a wrong UUID for resuming (presumably due to the swap having been reinitialised at some point):

    update-initramfs: Generating /boot/initrd.img-4.9.0-3-686-pae
    W: initramfs-tools configuration sets RESUME=UUID=22dfb0a9-839a-4ed2-b20b-7cfafaa3713f
    W: but no matching swap device is available.
    I: The initramfs will attempt to resume from /dev/vdb1
    I: (UUID=717eb7a5-b49c-4409-9ad2-eb2383957e77)
    I: Set the RESUME variable to override this.
    

    which I tracked down to the configuration in /etc/initramfs-tools/conf.d/resume, which contains only that single RESUME line.

    To get rid of the warning I updated the UUID in /etc/initramfs-tools/conf.d/resume to match the new auto-detected one, and tested that worked by running:

    sudo update-initramfs -u
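
    (If the auto-detected value looks suspicious, the UUID of the current swap device can be confirmed with blkid, eg:

    sudo blkid /dev/vdb1

    and the reported UUID pasted into the RESUME= line in that file.)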
    
  • The log was being spammed with:

    console-kit-daemon[775]: missing action
    console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    console-kit-daemon[775]: console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    

    messages. Based on the hint that consolekit is not necessary since Debian Jessie in the majority of cases, and knowing almost all logins to this server are via ssh, I followed the instructions in that message to remove consolekit:

    sudo apt-get purge consolekit libck-connector0 libpam-ck-connector
    

    to silence those messages. (This may possibly be a Debian 8 (Jessie) related tidy up, but I did not discover it until after upgrading to Debian 9 (Stretch).)

  • A local internal (ancient, Debian Woody vintage) apt repository no longer works:

    W: The repository 'URL' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.
    

    since the one needed local package was already installed long ago, I just commented that repository out in /etc/apt/sources.list. The process for building apt repositories has been updated considerably in the last 10-15 years.
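
    (Should that local repository ever be needed again, generating the missing index and Release files is reasonably painless these days with apt-ftparchive -- a rough sketch, run from the repository directory, plus either signing the Release file or marking the source [trusted=yes]:

    apt-ftparchive packages . > Packages
    gzip -9c Packages > Packages.gz
    apt-ftparchive release . > Release

    though for one long-installed package it was not worth the effort.)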

  • After upgrading and rebooting, on one old (upgraded many times) system systemd-journald and rsyslogd were running flat out after boot, and lpd was running regularly. Between them they were spamming the /var/log/syslog file with:

    lpd[847]: select: Bad file descriptor
    

    lines, many, many, many times a second. I stopped lpd with:

    sudo service lpd stop
    

    and the system load returned to normal, and the log lines stopped. The lpd in this case was provided by the lpr package:

    ewen@HOST:~$ dpkg -S /usr/sbin/lpd
    lpr: /usr/sbin/lpd
    ewen@HOST:~$
    

    and it did not seem to have changed much since the Debian Jessie lpr package -- Debian Wheezy had 1:2008.05.17+nmu1, Debian Jessie had 1:2008.05.17.1, and Debian Stretch has 1:2008.05.17.2. According to the Debian Changelog the only difference between Debian Jessie and Debian Stretch is that Debian Stretch's version was updated to later Debian packaging standards.

    Searching on the Internet did not turn up anyone else reporting the same issue in lpr.

    Doing:

    sudo service lpd start
    

    again a while after boot did not produce the same symptoms, so for now I have left it running.

    However some investigation of /etc/printcap revealed that this system had not been used for printing for quite some time, as its only printer entries referred to printers that had been taken out of service a couple of years earlier. So if the problem recurs I may just remove the lpr package completely.

    ETA, 2017-07-14: This happened again after another (unplanned) reboot (caused by multiple brownouts getting through the inexpensive UPS). Because I did not notice in time, it then filled up / with a 4.5GB /var/log/lpr.log, with endless messages of:

    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    Jul 14 06:25:25 tv lpd[844]: select: Bad file descriptor
    

    so since I had not used the printing functionality on this machine in years, I ended up just removing it completely:

    sudo cp /dev/null /var/log/lpr.log
    sudo cp -p /etc/printcap /etc/printcap-old-2017-07-14
    sudo apt-get purge lpr
    sudo logrotate -f /etc/logrotate.d/rsyslog
    sudo logrotate -f /etc/logrotate.d/rsyslog
    

    which seemed more time-efficient than trying to debug which file descriptor it was complaining about (my guess is one which systemd closed for lpd that the previous init system did not close, but I have not investigated that in detail). I kept a copy of /etc/printcap in case I do want to try to restore the printing functionality (or debug it later), but most likely I would just set up printing from scratch.

    The two (forced) log rotates were to force compression of the other copies of the 4GB of log messages (in /var/log/syslog, which rotates daily by default, and /var/log/messages which rotates weekly by default), having removed /var/log/lpr.log which was another 4.5GB. Unsurprisingly they compress quite well given the logs were spammed with a single message -- but even compressed they are still about 13MB.
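
    (When chasing this kind of disk-full problem, a quick way to spot oversized logs is something along the lines of:

    sudo du -sh /var/log/* | sort -h | tail

    which lists the largest entries under /var/log last.)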

After fixing up those upgrade issues, the first upgraded system seems to have been running properly on Debian 9 (Stretch) for the last few days, including helping publish this blog post :-)

ETA, 2017-06-11: Updates, particularly around inn2 upgrade issues.

ETA, 2017-06-17: Updates on boot issues in jessie, fixed by stretch.

Posted Wed Jun 7 10:50:46 2017 Tags:

I have Java installed for precisely one reason: to be able to access Dell iDRAC consoles on both my own server and various client servers. Since Java on the web has been a terrible idea for years, and since the Dell iDRAC relies on various binary modules which do not work on Mac OS X, I have restricted this Java install to a single VM on my desktop which I start up when I need to access the iDRAC consoles.

For the last few years, this "iDRAC console" VM has been an Ubuntu 14.04 LTS VM, with OpenJDK 7 installed. It was the latest available at the time I installed it, and since it was working I left it alone. Unfortunately after upgrading some client Dell hosts to the latest iDRAC firmware, as part of a redeployment exercise, those iDRACs stopped working with this Ubuntu 14.04/OpenJDK 7 environment. But I was able to work around that by using a newer Java environment on a client VM.

Today, when I went to use the Java console with my own older Dell server, the iDRAC console no longer started properly, failing with a Java error:

Fatal: Application Error: Cannot grant permissions to unsigned jars.

which was a surprise, as it had worked as recently as a few weeks ago.

One StackExchange hint suggested this policy could be overridden by running:

/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/itweb-settings

and changing the Policy Settings to allow "Execute unowned code". But in my case that made no difference. I also tried setting the date in the VM back a year, in case the signing certificate had expired -- but that too made no difference.

Given the hint that OpenJDK 8 actually worked, and finding some backports of OpenJDK 8 to Ubuntu 14.04 LTS (which was released shortly after OpenJDK 8 came out, and so does not contain it), I decided to try installing the OpenJDK 8 packages on Ubuntu 14.04 LTS. Fortunately this did actually work.

To install OpenJDK 8 on Ubuntu 14.04 LTS ("trusty") you need to install it from the OpenJDK builds PPA, which is not officially part of Ubuntu, but is managed by someone linked with Ubuntu, so is a bit more trustworthy than "random software found on the Internet".

Installation of the OpenJDK 8 JDK:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

and it can be made the default by running:

sudo update-alternatives --config java

and choosing the OpenJDK 8 version.
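
The change can be confirmed with:

java -version

which should now report an OpenJDK 1.8.x runtime.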

Unfortunately that does not include javaws, which is the JNLP client that actually triggers the iDRAC console startup -- which meant that the OpenJDK 7 javaws was still being used (and failing) to launch the iDRAC console. Some hunting turned up the need to install icedtea-8-plugin from another Ubuntu PPA to get a newer javaws that would work with OpenJDK 8. To install this one:

sudo add-apt-repository ppa:maarten-fonville/ppa
sudo apt-get update
sudo apt-get install icedtea-8-plugin

Amongst other things this updates the icedtea-netx package, which includes javaws, to also include a version for OpenJDK 8. Unfortunately the updated package did not make the new OpenJDK 8 javaws the default, nor did update-alternatives --config javaws offer the OpenJDK 8 javaws as an option -- which meant the old, non-working OpenJDK 7 version still launched.

To actually use the newer OpenJDK 8 javaws, I had to manually update the /etc/alternatives symlink:

cd /etc/alternatives
sudo rm javaws
sudo ln -s /usr/lib/jvm/java-8-openjdk-i386/jre/bin/javaws .

After which, finally, I could launch the iDRAC console again and carry on with what I originally planned to do. I hope this will have fixed the iDRAC console access on the newer iDRAC firmware on some of my client machines too; but I have not tested that so far.
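
(An alternative to editing the symlink by hand -- untested here, and assuming the usual /usr/bin/javaws link name -- would presumably be to register the new javaws with the alternatives system, along the lines of:

sudo update-alternatives --install /usr/bin/javaws javaws \
    /usr/lib/jvm/java-8-openjdk-i386/jre/bin/javaws 2000
sudo update-alternatives --config javaws

which would at least keep the alternatives database consistent.)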

Posted Mon May 29 11:29:04 2017 Tags: