Fundamental Interconnectedness

This is the occasional blog of Ewen McNeill. It is also available on LiveJournal as ewen_mcneill, and Dreamwidth as ewen_mcneill_feed.

Debian 7.0 ("Wheezy") was originally released about four years ago, in May 2013; the last point release (7.11) was released a year ago, in June 2016. While Debian 7.0 ("Wheezy") has benefited from the Debian Long Term Support with a further two years of support -- until 2018-05-31 -- the software in the release is now pretty old, particularly software relating to TLS (Transport Layer Security) where the most recent version supported by Debian Wheezy is now the oldest still reasonably usable on the Internet. (The Long Term Support also covered only a few platforms -- but they were the most commonly used platforms including x86 and amd64.)

More recently Debian released Debian 8.0 ("Jessie"), originally a couple of years ago in May 2015 (with the latest update, Debian 8.8, released last month, in May 2017). Debian are also planning on releasing Debian Stretch (presumably as Debian 9.0) in mid June 2017 -- in a couple of weeks. This means that Debian Stretch is still a "testing" distribution, which does not have security support, but all going according to plan it will be released later this month (June 2017) and will then have security support for several years (between the normal security support and the likely Debian Long Term Support).

Due to a combination of lack of spare time last year, and the Debian LTS providing some additional breathing room to schedule updates, I still have a few "legacy" Debian installations currently running Debian Wheezy (7.11). At this point it does not make much sense to upgrade them to Debian Jessie (itself likely to go into Long Term Support in about a year), so I have decided to upgrade these systems from Debian Wheezy (7.11) through Debian Jessie (8.8) and straight on to Debian Stretch (currently "testing", but hopefully soon 9.0). My plan is to start with the systems least reliant on immediate security support -- ie, those that are not exposed to the Internet directly. I have done this before, going from Ubuntu Lucid (10.04) to Ubuntu Trusty (14.04) in two larger steps, both of which were Ubuntu LTS distributions.

Most of these older "still Debian Wheezy" systems were originally much older Debian installs, that have already been incrementally upgraded several times. For the two hosts that I looked at this week, the oldest one was originally installed as Debian Sarge, and the newest one was originally installed as Debian Etch, as far as I can tell -- although both have been re-homed on new hardware since the original installs. From memory the Debian Sarge install ended up being a Debian Sarge install only due to the way that two older hosts were merged together some years ago -- some parts of that install date back to even older Debian versions, around Debian Slink first released in 1999. So there are 10-15 years of legacy install decisions there, as well as both systems having a number of additional packages installed for specific long-discarded tasks that create additional clutter (such is the disadvantage of the traditional Unix "one big system" approach, versus the modern approach of many VMs or containers). While I do have plans to gradually break the remaining needed services out into separate, automatically built, VMs or containers, it is clearly not going to happen overnight :-)

The first step in planning such an update is to look at the release notes for Debian 8 (Jessie) and Debian 9 (Stretch).

The upgrade instructions are relatively boilerplate (prepare for an upgrade, check system status, change apt sources, minimal package updates then full package updates) but do contain hints as to possible upgrade problems with specific packages and how to work around them.
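
A condensed sketch of those boilerplate steps for the Wheezy to Jessie leg (repeated with "jessie" changed to "stretch" for the second leg) looks roughly like this, assuming a simple /etc/apt/sources.list with no extra repositories:

sudo cp /etc/apt/sources.list /etc/apt/sources.list.wheezy
sudo sed -i 's/wheezy/jessie/g' /etc/apt/sources.list
sudo apt-get update
sudo apt-get upgrade          # minimal package updates first
sudo apt-get dist-upgrade     # then the full upgrade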

The "issues to be aware of" contain a lot of compatibility hints of things which may break as a result of the upgrade. In particular Debian 8 (Jessie) brings:

  • Apache 2.4, which both has significantly different configuration syntax and only includes configuration files ending in .conf (breaking, eg, virtual host files named after just the domain name); the Squid proxy configuration also changes (see the Squid 3.2, 3.3, and 3.4 release notes, particularly Helper Name Changes).
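
    For a hypothetical virtual host file named after just the domain, the fix is roughly as follows (the site and file names here are examples, not from my systems):

    sudo a2dissite example.com
    sudo mv /etc/apache2/sites-available/example.com /etc/apache2/sites-available/example.com.conf
    sudo a2ensite example.com.conf
    sudo apache2ctl configtest && sudo service apache2 reload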

  • systemd (in the form of systemd-sysv) by default, which potentially breaks local init changes (or custom scripts), and halt no longer powering off by default -- that behaviour apparently being declared "a bug that was never fixed" in the old init scripts, after many many years of it working that way. It got documented, but that is about it. (IMHO the only use of "halt but do not power off" is in systems like Juniper JunOS where a key on the console can be used on the halted system to cause it to boot again in the case of accidental halts; it is not clear that actually works with systemd. systemd itself has of course been rather controversial, eventually leading to Devuan Jessie 1.0, which is basically Debian Jessie without systemd. While I am not really a fan of many of systemd's technical decisions, the adoption by most of the major Linux distributions makes interaction with it inevitable, so I am not going out of my way to avoid it on these machines.)
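
    Under systemd the distinction is at least explicit -- if you actually want the machine powered off, ask for poweroff:

    systemctl poweroff    # halt and power off (what "halt" used to do in practice)
    systemctl halt        # halt only, leaving the power on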

  • The "nobody" user (and others) will have their shell changed to /usr/sbin/nologin -- which mostly affects running commands like:

    sudo su -c /path/to/command nobody
    

    Those commands instead need to be run as:

    sudo su -s /bin/bash -c /path/to/command nobody
    

    Alternatively you can choose to decline the change for just the nobody user -- the upgrade tool asks about each user's shell change in an interactive upgrade if your debconf question priority is medium or lower. In my case nobody was the last user shell change mentioned.
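
    A quick way to check which shell a given user ended up with after the upgrade is, eg:

    getent passwd nobody | cut -d: -f7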

  • systemd will start, fsck, and mount both / and /usr (if it is a separate device) during the initramfs. In particular this means that if they are RAID (md) or LVM volumes they need to be started by the time that initramfs runs, or startable by initramfs. There also seem to be some races around this startup, which may mean that not everything starts correctly; at least once I got dumped into the systemd rescue shell, and had to run "vgchange -a y" for systemd, wait for everything to be automatically mounted, and then tell it to continue booting (exit), but on one boot it came up correctly by itself, so it is definitely a race. (See, eg, Debian bug #758808, Debian bug #774882, and Debian bug #782793. The latter reports a fix in lvm2 2.02.126-3 which is not in Debian Jessie, but is in Debian Stretch, so I did not try too hard to fix this in Debian Jessie before moving on.)
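
    For reference, the manual recovery from the systemd rescue shell was just:

    vgchange -a y    # activate the LVM volume groups by hand
    exit             # then let systemd carry on booting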

Debian 9 (Stretch) seems to be bringing:

  • Restrictions around separate /usr (it must be mounted by initramfs if it is separate; but the default Debian Stretch initramfs will do this)

  • net-tools (arp, ifconfig, netstat, route, etc) are deprecated (and not installed by default) in favour of using iproute2 (ip ...) commands. Which is a problem for cross-platform finger-macros that have worked for 20-30 years... so I suspect net-tools will be a common optional package for quite a while yet :-)
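
    The rough iproute2 equivalents of the old finger macros (and the way to get the old tools back) are:

    ip addr                           # ifconfig -a
    ip route                          # route -n
    ip neigh                          # arp -n
    ss -tnlp                          # netstat -tnlp
    sudo apt-get install net-tools    # or just reinstall the old tools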

  • A warning that a Debian 8.8 (Jessie) or Debian 9 (Stretch) kernel is needed for compatibility with the PIE (Position Independent Executable) compile mode for executables in Debian 9 (Stretch), and thus it is extra important to (a) install all Debian 8 (Jessie) updates and reboot before upgrading to Debian 9 (Stretch), and (b) to reboot very soon after upgrading to Debian 9 (Stretch). This also affects, eg, the output of file -- reporting shared object rather than executable (because the executables are now compiled more like shared libraries, for security reasons). (Position independent code (PIC) is also somewhat slower on register-limited machines like 32-bit x86 -- but gcc 5.0+ contains some performance improvements for PIC which apparently help reduce the penalty. This is probably a good argument to prefer amd64 -- 64-bit mode -- for new installs. And even the x86 support is i686 or higher only; Debian Jessie is the last release to support i586 class CPUs.)
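
    A quick illustration of the file change (the exact wording will vary by binary and architecture, but the key difference is "shared object" versus "executable"):

    file /bin/ls    # on Stretch reports "... shared object ..." rather than "... executable ..."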

  • SSH v1, and older ciphers, are disabled in OpenSSH (although it appears Debian Stretch will have a version where they can still be turned back on; the next OpenSSH release is going to remove SSH v1 support entirely, and it is already removed from the development tree). Also ssh root password login is disabled on upgrade. These ssh changes are particularly an upgrade risk -- one would want to be extra sure of having an out of band console to reach any newly upgraded machines before rebooting them.
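
    One quick sanity check before rebooting (assuming no Match blocks complicating the effective configuration) is to dump what sshd will actually use:

    sudo sshd -T | egrep -i 'permitrootlogin|passwordauthentication'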

  • Changes around apt package pinning calculations (although it would be best to remove all pins and alternative package repositories during the upgrade anyway).

  • The Debian FTP Servers are going away which means that ftp URLs should be changed to http -- the ftp.CC.debian.org names seem likely to remain for the foreseeable future for use with http.
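
    That is, sources.list entries change along these lines (using the New Zealand mirror purely as an example):

    # deb ftp://ftp.nz.debian.org/debian stretch main
    deb http://ftp.nz.debian.org/debian stretch main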

I have listed some notes on issues experienced below for future reference, and will update this list with anything else I find as I upgrade more of the remaining legacy installs over the next few months.

Debian 7 (Wheezy) to Debian 8 (Jessie)

  • webkitgtk (libwebkitgtk-1.0-common) has limited security support. To track down why this is needed:

    apt-cache rdepends libwebkitgtk-1.0-common
    

    which turns up libwebkitgtk-1.0-0, which is used by a bunch of packages. To find the installed packages that need it:

    apt-cache rdepends --installed libwebkitgtk-1.0-0
    

    which gives libproxy0 and libcairo2, and repeating that pattern indicates many things installed depending on libcairo2. Ultimately iceweasel / firefox-esr is one of the key triggering packages (but not the only one). I chose to ignore this at this point until getting to Debian Stretch -- and once on Debian Stretch I will enable backports to keep firefox-esr relatively up to date.

  • console-tools has been removed, due to being unmaintained upstream, which is relatively unimportant for my systems which are mostly VMs (with only serial console) or okay with the default Linux kernel console. (The other packages removed on upgrade appear to just be, eg, old versions of gcc, perl, or other packages replaced by newer versions with a new name.)

  • /etc/default/snmpd changed, which removes custom options and also disables the mteTrigger and mteTriggerConf features. The main reason for the change seems to be to put the PID file into /run/snmpd.pid instead of /var/run/snmpd.pid. /etc/snmp/snmpd.conf also changes by default, which will probably need to be merged by hand.

    On SNMP restart a bunch of errors appeared:

    Error: Line 278: Parse error in chip name
    Error: Line 283: Label statement before first chip statement
    Error: Line 284: Label statement before first chip statement
    Error: Line 285: Label statement before first chip statement
    Error: Line 286: Label statement before first chip statement
    Error: Line 287: Label statement before first chip statement
    Error: Line 288: Label statement before first chip statement
    Error: Line 289: Label statement before first chip statement
    Error: Line 322: Compute statement before first chip statement
    Error: Line 323: Compute statement before first chip statement
    Error: Line 324: Compute statement before first chip statement
    Error: Line 325: Compute statement before first chip statement
    Error: Line 1073: Parse error in chip name
    Error: Line 1094: Parse error in chip name
    Error: Line 1104: Parse error in chip name
    Error: Line 1114: Parse error in chip name
    Error: Line 1124: Parse error in chip name
    

    but snmpd apparently started again. The line numbers are too high to be /etc/snmp/snmpd.conf, and as bug report #722224 notes, the filename is not mentioned. An upstream mailing list message implies it relates to lm_sensors object, and the same issue happened on upgrade from SLES 11.2 to 11.3. The discussion in the SLES thread pointed at hyphens in chip names in /etc/sensors.conf being the root cause.

    As a first step, I removed libsensors3 which was no longer required:

    apt-get purge libsensors3
    

    That appeared to be sufficient to remove the problematic file, and then:

    service snmpd stop
    service snmpd start
    service snmpd restart
    

    all ran without producing that error. My assumption is that the old /etc/sensors.conf was from a much older install, and no longer in the preferred location or format. (For the first upgrade where I encountered it, the machine was now a VM so lm-sensors reading "hardware" sensors was not particularly relevant.)

  • libsnmp15 was removed, but not purged. The only remaining file was /etc/snmp/snmp.conf (note not the daemon configuration, but the client configuration), which contained:

    #
    # As the snmp packages come without MIB files due to license reasons, loading
    # of MIBs is disabled by default. If you added the MIBs you can reenable
    # loading them by commenting out the following line.
    mibs :
    

    on default systems to disable loading of the SNMP MIBs. Typically one would want to enable SNMP MIB usage, and thus get names of things rather than just long numeric OID strings. snmp-mibs-downloader appears to still exist in Debian 8 (Jessie), but it is in non-free.

    The snmp client package did not seem to be installed, so I installed it manually along with snmp-mibs-downloader:

    sudo apt-get install snmp snmp-mibs-downloader
    

    which caused that, rather than libsnmp15 to own the /etc/snmp/snmp.conf configuration file, which makes more sense. After that I could purge both libsnmp15 and console-tools:

    sudo apt-get purge libsnmp15 console-tools
    

    (console-tools was an easy choice to purge as I had not actively used its configuration previously, and thus could be pretty sure that none of it was necessary.)

    To actually use the MIBs one needs to comment out the "mibs :" line in /etc/snmp/snmp.conf manually, as per the instructions in the file.
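
    Something like this does it in one step (check the result afterwards, since it is a conffile):

    sudo sed -i.orig 's/^mibs :/#mibs :/' /etc/snmp/snmp.conf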

  • Fortunately it appeared I did not have any locally modified init scripts which needed to be ported. The suggested check is:

    dpkg-query --show -f'${Conffiles}' | sed 's, /,\n/,g' | \
       grep /etc/init.d | awk 'NF,OFS="  " {print $2, $1}' | \
       md5sum --quiet -c
    

    and while the first system I upgraded had one custom-written init script, it was for an old tool which did not matter any longer, so I just left it to be ignored.

    I did have problems with the rsync daemon, as listed below.

  • Some "dummy" transitional packages were installed, which I removed:

    sudo apt-get purge module-init-tools iproute
    

    (replaced by udev/kmod and iproute2 respectively). The ttf-dejavu packages also showed up as "dummy" transitional packages but owned a lot of files so I left them alone for now.

  • Watching the system console revealed the errors:

    systemd-logind[4235]: Failed to enable subscription: Launch helper exited with unknown return code 1
    systemd-logind[4235]: Failed to fully start up daemon: Input/output error
    

    which some users have reported when being unable to boot their system, although in my case it happened before rebooting so possibly was caused by a mix of systemd and non-systemd things running.

    systemctl --failed reports:

    Failed to get D-Bus connection: Unknown error -1
    

    as in that error report, possibly due to the wrong dbus running; the running dbus in this system is from the Debian 7 (Wheezy) install, and the systemd/dbus interaction changed a lot after that. (For complicated design choice reasons, historically dbus could not be restarted, so changing it requires rebooting.)

    The system did reboot properly (although it appeared to force a check of the root disk), so I assume this was a transitional update issue.

  • There were a quite a few old Debian 7 (Wheezy) libraries, which I found with:

    dpkg -l | grep deb7
    

    that seemed no longer to be required, so I removed them manually. (Technically that only finds packages with security updates within Debian Wheezy, but those seem the most likely to be problematic to leave lying around.)

    At one point after the upgrade apt-get offered a large selection of packages to autoremove, but after some other tidy up and rebooting it no longer showed any packages to autoremove; it is unclear what caused that change. I eventually found the list in my scrollback and pasted the contents into /tmp/notrequired, then did:

    for PKG in $(cat /tmp/notrequired); do echo $PKG; done | tee /tmp/notrequired.list
    dpkg -l | grep -f /tmp/notrequired.list
    

    to list the ones that were still installed. Since this included the libwebkitgtk-1.0-common and libwebkitgtk-1.0-0 packages mentioned above, I did:

    sudo apt-get purge libwebkitgtk-1.0-common libwebkitgtk-1.0-0
    

    to remove those. Then I went through the remainder of the list, and removed anything marked "transitional" or otherwise apparently no longer necessary to this machine (eg, where there was a newer version of the same library installed). This was fairly boring rote cleanup, but given my plan to upgrade straight to Debian 9 (Stretch) it seemed worth starting with a system as tidy as possible.

    I left installed the ones that seemed like I might have installed them deliberately (eg, -perl modules) for some non-packaged tool, just to be on the safe side.

  • I found yet more transitional packages to remove with:

    dpkg -l | grep -i transitional
    

    and removed them with:

    sudo apt-get purge iceweasel mailx mktemp netcat sysvinit
    

    after using "dpkg -L PACKAGE" to check that they contained only documentation; sysvinit contained a couple of helper tools (init and telinit) but their functionality has been replaced by separate systemd programs (eg systemctl) so I removed those too.

    Because netcat is useful, I manually installed the dependency it had brought in to ensure that was selected as an installed package:

    sudo apt-get install netcat-traditional
    

    While it appeared that multiarch-support should also be removable as a no-longer required transitional package, since it was listed as transitional and contained only manpages, in practice attempts to remove it resulted in libc6 wanting to be removed too, which would rapidly lead to a broken system. (On my system the first attempt failed on gnuplot, which was individually fixable by installing, eg, gnuplot-nox explicitly and removing the gnuplot meta package, but since removing multiarch-support led to removing libc6 I did not end up going down that path.)

    For consistency I also needed to run aptitude and interactively tell aptitude about these decisions.

  • After all this tidying up, I found nothing was listening on the rsync port (tcp/873) any longer. Historically I had run the rsync daemon using /etc/init.d/rsync, which still existed, and still belonged to the rsync package.

    sudo service rsync start
    

    did work, to start the rsync daemon, but it did not start at boot. Debian Bug #764616 provided the hint that:

    sudo systemctl enable rsync
    

    was needed to enable it starting at boot. As Tobias Frost noted on Debian Bug #764616 this appears to be a regression from Debian Wheezy. It appears the bug eventually got fixed in rsync package 3.1.2-1, but that did not get backported to Debian Jessie (which has 3.1.1-3) so I guess the regression remains for everyone to trip over :-( If I was not already planning on upgrading to Debian Stretch then I might have raised backporting the fix as a suggestion.

  • inn2 (for UseNet) is no longer supported on 32-bit (x86); only the LFS (Large File Support) package, inn2-lfs is supported, and it has a different on-disk database format (64-bit pointers rather than 32-bit pointers). The upgrade is not automatic (due to the incompatible database format) so you have to touch /etc/news/convert-inn-data and then install inn2-lfs to upgrade:

    You are trying to upgrade inn2 on a 32-bit system where an old inn2 package
    without Large File Support is currently installed.
    
    Since INN 2.5.4, Debian has stopped providing a 32-bit inn2 package and a
    LFS-enabled inn2-lfs package and now only this LFS-enabled inn2 package is
    supported.
    
    This will require rebuilding the history index and the overview database,
    but the postinst script will attempt to do it for you.
    
    [...]
    
    Please create an empty /etc/news/convert-inn-data file and then try again
    upgrading inn2 if you want to proceed.
    

    Because this fails out the package installation it causes apt-get dist-upgrade to fail, which leaves the system in a partially upgraded messy state. For systems with inn2 installed on 32-bit this is probably the biggest upgrade risk.

    To try moving forward:

    sudo touch /etc/news/convert-inn-data
    sudo apt-get -f install
    

    All going well the partly installed packages will be fixed up, then:

    [ ok ] Stopping news server: innd.
    Deleting the old overview database, please wait...
    Rebuilding the overview database, please wait...
    

    will run (which will probably take many minutes on most non-trivial inn2 installs; in my case these are old inn2 installs, which have been hardly used for years, but do have a lot of retained posts, as a historical archive). You can watch the progress of the intermediate files needed for the overview database being built with:

    watch ls -l /var/spool/news/incoming/tmp/
    watch ls -l /var/spool/news/overview/
    

    in other windows, but otherwise there is no real indication of progress or how close you are to completion. The "/usr/lib/news/bin/makehistory -F -O -x" process that is used in rebuilding the overview file is basically IO bound, but also moderately heavy on CPU. (The history file index itself, in /var/lib/news/history.* seems to rebuild fairly quickly; it appears to be the overview files that take a very long time, due to the need to re-read all the articles.)

    It may also help to know where makehistory is up to reading, eg:

    MKHISTPID=$(ps axuwww | awk '$11 ~ /makehistory/ && $12 ~ /-F/ { print $2; }')
    sudo watch ls -l "/proc/${MKHISTPID}/fd"
    

    which will at least give some idea which news articles are being scanned. (As far as I can tell one temporary file is created per UseNet group, which is then merged into the overview history; the merge phase is quick, but the article scan is pretty slow. Beware the articles are apparently scanned in inode order rather than strictly numerical order, which makes it harder to tell group progress -- but at least you can tell which group it is on.)

    In one of my older news servers, with pretty slow disk IO, rebuilding the overview file took a couple of hours of wall clock time. But it is slow even given the disk bandwidth, because it makes many small read transactions. This is for about 9 million articles, mostly in a few groups where a lot of history was retained, including single groups with 250k-350k articles retained -- and thus stored in a single directory by inn2, on ext4 (but probably without directory indexes, due to the file system originally being created as ext2/ext3).

    Note that all of this delay blocks the rest of the upgrade of the system, due to it being done in the post-install script -- and the updated package will bail out of the install if you do not let it do the update in the post-install script. Given the time required it seems like a less disruptive upgrade approach could have been chosen, particularly given the issue is not mentioned at all as far as I can see in the "Issues to be aware of for Jessie" page. My inclination for the next one would be to hold inn2, and upgrade everything else first, then come back to upgrading inn2 and anything held back because of it.
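
    That hold-and-come-back approach would look roughly like this (untested -- just my inclination for the next one):

    sudo apt-mark hold inn2
    sudo apt-get dist-upgrade             # everything except inn2 (and things depending on it)
    sudo apt-mark unhold inn2
    sudo touch /etc/news/convert-inn-data
    sudo apt-get dist-upgrade             # now let inn2 rebuild its databases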

    Some searching turned up enabling ext4 dir_index handling to speed up access for larger directories:

    sudo service inn2 stop
    sudo umount /dev/r1/news
    sudo tune2fs -O dir_index,uninit_bg /dev/r1/news
    sudo tune2fs -l /dev/r1/news
    sudo e2fsck -fD /dev/r1/news
    sudo mount /dev/r1/news
    sudo service inn2 start
    

    I apparently did not do this on the previous OS upgrade to avoid locking myself out of using earlier OS kernels; but these ext4 features have been supported for many years now.

    In hindsight this turned out to be a bad choice, causing a lot more work. It is unclear if the file system was already broken, or if changing these options and doing partial fscks broke it :-( At minimum I would suggest doing an e2fsck -f /dev/r1/news before changing any options, to at least know whether the file system is good before the options are changed.

    In my case when I first tried this change I also set "-O uninit_bg" since it was mentioned in the online hints, and then after the first e2fsck, tried to do one more "e2fsck -f /dev/r1/news" to be sure the file system was okay before mounting it again. But apparently parts of the file system need to be initialised by a kernel thread when "uninit_bg" is set.

    I ended up with a number of reports like:

    Inode 8650758, i_size is 5254144, should be 6232064.  Fix? yes
    Inode 8650758, i_blocks is 10378, should be 10314.  Fix? yes
    

    followed by a huge number of reports like:

    Pass 2: Checking directory structure
    Directory inode 8650758 has an unallocated block #5098.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5099.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5100.  Allocate? yes
    Directory inode 8650758 has an unallocated block #5101.  Allocate? yes
    

    which were too numerous to allocate by hand (although I tried saying "yes" to a few by hand), and they could not be fixed automatically (eg, not fixable by "sudo e2fsck -pf /dev/r1/news").

    It is unclear if this was caused by "-O uninit_bg", or some earlier issue on the file system (this older hardware has not been entirely stable), or whether there was some need for more background initialisation to happen which I interrupted by mounting the disk, then unmounting it, and then deciding to check it again.

    Since the file system could still be mounted, I tried making a new partition and using tar to copy everything off it first before trying to repair it. But the tar copy also reported many many kernel messages like:

    Jun 11 19:12:10 HOSTNAME kernel: [24027.265835] EXT4-fs error (device dm-3): __ext4_read_dirblock:874: 
    inode #9570798: block 6216: comm tar: Directory hole found
    

    and in general the copy proceeded extremely slowly (way way below the disk bandwidth). So I gave up on trying to make a tar copy first, as it seemed like it would take all night with no certainty of completing. I assume these holes are the same "unallocated blocks" that fsck complained about.

    Given that the news spool was mostly many year old articles which I also had not looked at in years, instead I used dd to make a bitwise copy of the partition:

    dd if=/dev/r1/news of=/dev/r1/news_backup bs=32768
    

    which ran at something approaching the underlying disk speed, and at least gives me a "broken" copy to try a second repair on if I find a better answer later.

    Running a non-interactive "no change" fsck:

    e2fsck -nf /dev/r1/news
    

    indicated the scope of the problem was pretty huge, with both many unallocated block reports as above, and also many errors like:

    Problem in HTREE directory inode 8650758: block #1060 has invalid depth (2)
    Problem in HTREE directory inode 8650758: block #1060 has bad max hash
    Problem in HTREE directory inode 8650758: block #1060 not referenced
    

    which I assume indicate dir_index directories that did not get properly indexed, as well as a whole bunch of files that would end up in lost+found. So the file system was pretty messed up.

    Figuring backing out might help, I turned dir_index off again:

    tune2fs -O ^dir_index /dev/r1/news
    tune2fs -l /dev/r1/news
    

    There were still a lot of errors when checking with e2fsck -nf /dev/r1/news, but at least some of them were reports of directories with the INDEX_FL flag set on a filesystem without htree support, so it seemed like letting fsck fix that would avoid a bunch of the later errors.

    So as a last ditch attempt, no longer really caring about the old UseNet articles (and knowing they are probably on the previous version of this hosts disks anyway), I tried:

     e2fsck -yf /dev/r1/news
    

    and that did at least result in fewer errors/corrections, but it did throw a lot of things in lost+found :-(

    I ran e2fsck -f /dev/r1/news again to see if it had fixed everything there was to fix, and at least it did come up clean this time. On mounting the file system, there were 7000 articles in lost+found, out of several million on the file system. So I suppose it could have been worse. Grepping through them, they appear to have been from four Newsgroups (presumably the four inodes originally reported as having problems), and all are ones I do not really care about any longer. inn2 still started, so I declared success at this point.

    At some point perhaps I should have another go at enabling dir_index, but definitely not during a system upgrade!

  • python2.6 and related packages, and squid (2.x; replaced by squid3) needed to be removed before db5.1-util could be upgraded. They are apparently linked via libdb5.1, which is not provided in Debian Jessie, and which db5.1-util declares broken unless it is a newer version than was in Debian Wheezy. In Debian Jessie only the binary tools are provided, and apt offers to uninstall them as an unneeded package.

    Also netatalk is in Debian Wheezy and depends on libdb5.1, but is not in Debian Jessie at all. This surprised other people too, and netatalk seems to be back in Debian Stretch. But it is still netatalk 2.x, rather than netatalk 3.x which has been released for years; someone has attempted to modify the netatalk package to netatalk 3.1, but that also seems to have been abandoned for the last couple of years. (Because I was upgrading through to Debian Stretch, I chose to leave the Debian Wheezy version of netatalk installed, and libdb5.1 from Debian Wheezy installed, until after the upgrade to Debian Stretch.)

Debian 8 (Jessie) to Debian 9 (Stretch)

  • Purged the now removed packages:

    # dpkg -l | awk '/^rc/ { print $2 }'
    fonts-droid
    libcwidget3:i386
    libmagickcore-6.q16-2:i386
    libmagickwand-6.q16-2:i386
    libproxy1:i386
    libsigc++-2.0-0c2a:i386
    libtag1-vanilla:i386
    perl-modules
    #
    

    with:

    sudo apt-get purge $(dpkg -l | awk '/^rc/ { print $2 }')
    

    to clear out the old configuration files.

  • Checked changes in /etc/default/grub:

    diff /etc/default/grub.ucf-dist /etc/default/grub
    

    and updated grub using update-grub.

  • Checked changes in /etc/ssh/sshd_config:

    grep -v "^#" /etc/ssh/sshd_config.ucf-old | grep '[a-z]'
    grep -v "^#" /etc/ssh/sshd_config | grep '[a-z]'
    

    and checked that the now commented out lines are the defaults. Check that sshd stops/starts/restarts with the new configuration:

    sudo service ssh stop
    sudo service ssh start
    sudo service ssh restart
    

    and that ssh logins work after the upgrade.

  • The isc-dhcp-server service failed to start because it wanted to start both IPv4 and IPv6 service, and the previous configuration (and indeed the network) only had IPv4 configuration:

    dhcpd[15518]: No subnet6 declaration for eth0
    

    Looking further back in the log I saw:

    isc-dhcp-server[15473]: Launching both IPv4 and IPv6 servers [...]
    

    with the hint "(please configure INTERFACES in /etc/default/isc-dhcp-server if you only want one or the other)".

    Setting INTERFACES in /etc/default/isc-dhcp-server currently works to avoid starting the IPv6 server, but it results in a warning:

    DHCPv4 interfaces are no longer set by the INTERFACES variable in
    /etc/default/isc-dhcp-server.  Please use INTERFACESv4 instead.
    Migrating automatically for now, but this will go away in the future.
    

    so I edited /etc/default/isc-dhcp-server and changed it to set INTERFACESv4 instead of INTERFACES.
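
    The relevant part of /etc/default/isc-dhcp-server then ends up looking something like this (eth0 being the interface serving DHCP here):

    INTERFACESv4="eth0"
    INTERFACESv6=""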

    After that:

    sudo service isc-dhcp-server stop
    sudo service isc-dhcp-server start
    sudo service isc-dhcp-server restart
    

    worked without error, and syslog reported:

    isc-dhcp-server[15710]: Launching IPv4 server only.
    isc-dhcp-server[15710]: Starting ISC DHCPv4 server: dhcpd.
    
  • The /etc/rsyslog.conf has changed somewhat, particularly around the syntax for loading modules. Lines like:

    $ModLoad imuxsock # provides support for local system logging
    

    have changed to:

    module(load="imuxsock") # provides support for local system logging
    

    I used diff /etc/rsyslog.conf /etc/rsyslog.conf.dpkg-dist to find these changes and merged them by hand. I also removed any old commented out sections no longer present in the new file, but kept my own custom changes (for centralised syslog).

    Then tested with:

    sudo service rsyslog stop
    sudo service rsyslog start
    sudo service rsyslog restart
    
  • This time, even after reboot, apt-get reported a whole bunch of unneeded packages, so I ran:

    sudo apt-get --purge autoremove
    

    to clean them up.

  • An aptitude search:

    aptitude search '~i(!~ODebian)'
    

    from the Debian Stretch Release Notes on Checking system status provided a hint on finding packages which used to be provided, but are no longer present in Debian. I went through the list by hand and manually purged anything which was clearly an older package that had been replaced (eg old cpp and gcc packages) or was no longer required. There were a few that I did still need, so I have left those installed -- but it would be better to find a newer Debian packaged replacement to ensure there are updates (eg, vncserver).

  • Removing the Debian 8 (Jessie) kernel:

    sudo apt-get purge linux-image-3.16.0-4-686-pae
    

    gave the information that the libc6-i686 library package was no longer needed, as in Debian 9 (Stretch) it is just a transitional package, so I did:

    sudo apt-get --purge autoremove
    

    to clean that up. (I tried removing the multiarch-support "transitional" package again at this point, but there were still a few packages with unmet dependencies without it, including gnuplot, libinput10, libreadline7, etc, so it looks like this "transitional" package is going to be with us for a while yet.)

  • update-initramfs reported a wrong UUID for resuming (presumably due to the swap having been reinitialised at some point):

    update-initramfs: Generating /boot/initrd.img-4.9.0-3-686-pae
    W: initramfs-tools configuration sets RESUME=UUID=22dfb0a9-839a-4ed2-b20b-7cfafaa3713f
    W: but no matching swap device is available.
    I: The initramfs will attempt to resume from /dev/vdb1
    I: (UUID=717eb7a5-b49c-4409-9ad2-eb2383957e77)
    I: Set the RESUME variable to override this.
    

    which I tracked down to the configuration in /etc/initramfs-tools/conf.d/resume, which contains only that one RESUME= line.
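
    The current swap UUID can be confirmed with blkid, and then the resume file updated to match, ie (using the UUID reported in the warning above):

    sudo blkid /dev/vdb1
    echo 'RESUME=UUID=717eb7a5-b49c-4409-9ad2-eb2383957e77' | sudo tee /etc/initramfs-tools/conf.d/resume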

    To get rid of the warning I updated the UUID in /etc/initramfs-tools/conf.d/resume to match the new auto-detected one, and tested that worked by running:

    sudo update-initramfs -u
    
  • The log was being spammed with:

    console-kit-daemon[775]: missing action
    console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    console-kit-daemon[775]: console-kit-daemon[775]: GLib-CRITICAL: Source ID 6214 was not found when attempting to remove it
    

    messages. Based on the hint that consolekit is not necessary since Debian Jessie in the majority of cases, and knowing almost all logins to this server are via ssh, I followed the instructions in that message to remove consolekit:

    sudo apt-get purge consolekit libck-connector0 libpam-ck-connector
    

    to silence those messages. (This may possibly be a Debian 8 (Jessie) related tidy up, but I did not discover it until after upgrading to Debian 9 (Stretch).)

  • A local internal (ancient, Debian Woody vintage) apt repository no longer works:

    W: The repository 'URL' does not have a Release file.
    N: Data from such a repository can't be authenticated and is therefore potentially dangerous to use.
    N: See apt-secure(8) manpage for repository creation and user configuration details.
    

    Since the one needed local package was already installed long ago, I just commented that repository out in /etc/apt/sources.list. The process for building apt repositories has been updated considerably in the last 10-15 years.

After fixing up those upgrade issues the first upgraded system seems to have been running properly on Debian 9 (Stretch) for the last few days, including helping publish this blog post :-)

ETA, 2017-06-11: Updates, particularly around inn2 upgrade issues.

Posted Wed Jun 7 10:50:46 2017

I have Java installed for precisely one reason: to be able to access Dell iDRAC consoles on both my own server and various client servers. Since Java on the web has been a terrible idea for years, and since the Dell iDRAC relies on various binary modules which do not work on Mac OS X, I have restricted this Java install to a single VM on my desktop which I start up when I need to access the iDRAC consoles.

For the last few years, this "iDRAC console" VM has been an Ubuntu 14.04 LTS VM, with OpenJDK 7 installed. It was the latest available at the time I installed it, and since it was working I left it alone. Unfortunately after upgrading some client Dell hosts to the latest iDRAC firmware, as part of a redeployment exercise, those iDRACs stopped working with this Ubuntu 14.04/OpenJDK 7 environment. But I was able to work around that by using a newer Java environment on a client VM.

Today, when I went to use the Java console with my own older Dell server, the iDRAC console no longer started properly, failing with a Java error:

Fatal: Application Error: Cannot grant permissions to unsigned jars.

which was a surprise as it had previously worked as recently as a few weeks ago.

One StackExchange hint suggested this policy could be overridden by running:

/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/itweb-settings

and changing the Policy Settings to allow "Execute unowned code". But in my case that made no difference. I also tried setting the date in the VM back a year, in case maybe the signing certificate had now expired out -- but that too made no difference.

Given the hint that OpenJDK 8 actually worked, and finding some backports of OpenJDK 8 to Ubuntu 14.04 LTS (which was released shortly after OpenJDK 8 came out, so does not contain it), I decided to try installing the OpenJDK 8 versions on Ubuntu 14.04 LTS. Fortunately this did actually work.

To install OpenJDK 8 on Ubuntu 14.04 LTS ("trusty") you need to install from the OpenJDK builds PPA, which is not officially part of Ubuntu, but is managed by someone linked with Ubuntu, so is a bit more trustworthy than "random software found on the Internet".

Installation of OpenJDK 8:

sudo add-apt-repository ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install openjdk-8-jdk

and it can be made the default by running:

sudo update-alternatives --config java

and choosing the OpenJDK 8 version.

Unfortunately that does not include javaws, which is the JNLP client that actually triggers the iDRAC console startup -- which meant that OpenJDK 7 was still running (and failing) trying to launch the iDRAC console. Some hunting turned up the need to install icedtea-8-plugin from another Ubuntu PPA to get a newer javaws that would work with OpenJDK 8. To install this one:

sudo add-apt-repository ppa:maarten-fonville/ppa
sudo apt-get update
sudo apt-get install icedtea-8-plugin

Amongst other things this updates the icedtea-netx package, which includes javaws, to also include a version for OpenJDK 8. Unfortunately the updated package did not make the new OpenJDK 8 javaws the default, nor did update-alternatives --config javaws offer the OpenJDK 8 javaws as an option. Which meant the old, non-working, OpenJDK 7 version still launched.

To actually use the newer OpenJDK 8 javaws, I had to manually update the /etc/alternatives symlink:

cd /etc/alternatives
sudo rm javaws
sudo ln -s /usr/lib/jvm/java-8-openjdk-i386/jre/bin/javaws .

After which, finally, I could launch the iDRAC console again and carry on with what I originally planned to do. I hope this will have fixed the iDRAC console access on the newer iDRAC firmware on some of my client machines too; but I have not tested that so far.

Posted Mon May 29 11:29:04 2017

After running into problems trying to get git-annex to run on an SMB share on my Synology DS216+, and prompted by the git-annex author and an example with an earlier Synology NAS, I decided to install the standalone version of git-annex directly on my Synology DS216+.

My approach was similar to the earlier "Synology NAS and git annex" tip, but the DS216+ uses an x86_64 CPU:

ewen@nas01:/$ grep "model name" /proc/cpuinfo | uniq
model name  : Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz
ewen@nas01:/$

and I chose a slightly different approach to getting everything working, in part based on my experience setting up the standalone git-annex on a Mac OS X server. I am using Synology DSM "DSM 6.1.1-15101 Update 4", which is the latest release as I write (released 2017-05-25).

To install git-annex:

  • In the Synology web interface (DSM) enable the "SSH Service", in Control Panel -> Terminal, by ticking "Enable SSH Service", and verify that you can ssh to your Synology NAS. Only accounts in the administrators group can use the ssh service, so you will need to create an administrator account to use if you do not already have one. (If your Synology NAS is exposed to the Internet directly now would be a very good time to ensure you have a strong password on the account; mine is behind a separate firewall.)

  • In the Synology web interface (DSM) go to the Package Center and search for "Git Server" (git) from Synology and install that package. It should install in a few seconds, and currently appears to install git 2.8.0:

      ewen@nas01:/$ git --version
      git version 2.8.0
      ewen@nas01:/$
    

    which while not current (eg my laptop has git 2.13.0), is only about a year old. It is a symlink (in /usr/bin/git) into the Git package in /var/packages/Git/target/bin/git.

  • Verify that you can now reach the necessary parts of the git package:

    for FILE in git git-shell git-receive-pack git-upload-pack; do
        which "${FILE}"
    done
    

    should produce something like:

    /bin/git
    /bin/git-shell
    /bin/git-receive-pack
    /bin/git-upload-pack
    
  • Download the latest git-annex standalone x86-64 tarball and its gpg signature.
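
    At the time of writing the standalone builds were published on downloads.kitenet.net (check the git-annex install pages for the current location), so something like:

    wget https://downloads.kitenet.net/git-annex/linux/current/git-annex-standalone-amd64.tar.gz
    wget https://downloads.kitenet.net/git-annex/linux/current/git-annex-standalone-amd64.tar.gz.sig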

  • Verify the git-annex gpg signature (as with previous installs):

    gpg --verify git-annex-standalone-amd64.tar.gz.sig
    

    which should report a "Good signature" from the "git-annex distribution signing key" (DSA key ID 89C809CB, Primary key fingerprint: 4005 5C6A FD2D 526B 2961 E78F 5EE1 DBA7 89C8 09CB).

    If you have not already verified that key is the right signature key it can be verified against, eg, keys in the Debian keyring as Joey Hess is a former Debian Developer.

  • Once you are happy with git-annex tarball you downloaded, copy it onto the NAS somewhere suitable, eg on the NAS:

    sudo mkdir /volume1/thirdparty
    sudo mkdir /volume1/thirdparty/archives
    sudo chown "$(id -nu):$(id -ng)" /volume1/thirdparty/archives
    

    then from wherever you downloaded the git-annex archive:

    scp -p git-annex-standalone-amd64.tar.gz* nas01.em.naos.co.nz:/volume1/thirdparty/archives/
    
  • Extract the archive on the NAS:

    cd /volume1/thirdparty
    sudo tar -xzf archives/git-annex-standalone-amd64.tar.gz
    

    The extracted archive is about 160MB, because of bundling all the required tools:

    ewen@nas01:/volume1/thirdparty$ du -sm git-annex.linux/
    161 git-annex.linux/
    ewen@nas01:/volume1/thirdparty$
    

    to make it a standalone version (as well as statically linking everything).

  • Symlink git-annex into /usr/local/bin so we have a common place to reference these binaries:

    cd /usr/local/bin
    sudo ln -s /volume1/thirdparty/git-annex.linux/git-annex .
    

    In a normal login shell /usr/local/bin will be on the PATH, and:

    which git-annex
    

    should print:

    /usr/local/bin/git-annex
    

    and you should be able to run git-annex by itself and have it print out the basic help text.

    Unfortunately this does not work for non-interactive shells, because the Synology NAS uses the "/bin/sh" symlink to bash, which means that non-interactive shells do not process ~/.bashrc, and non-interactive shells also do not read /etc/profile (which is where /usr/local/bin/ is added to the PATH). So we have to add some more work arounds, with symlinks into /usr/bin/ later (see below).

    For reference, this is my /etc/passwd entry created by the Synology NAS web interface (DSM):

    ewen@nas01:~$ grep "$(id -un):" /etc/passwd
    ewen:x:1026:100:Ewen McNeill:/var/services/homes/ewen:/bin/sh
    ewen@nas01:~$
    
  • To fix the warning:

    warning: /bin/sh: setlocale: LC_ALL: cannot change locale (en_US.utf8)
    

    we have to pre-create the locales directory that the git-annex runshell script tries to write the locales into, with permissions that a regular user can write into, and then run git-annex once.

    sudo mkdir /volume1/thirdparty/git-annex.linux/locales
    sudo chown "$(id -un):$(id -gn)" /volume1/thirdparty/git-annex.linux/locales
    

    On the Synology NAS, with the default locale:

    ewen@nas01:~$ set | egrep "LANG|LC_ALL"
    LANG=en_US.utf8
    LC_ALL=en_US.utf8
    ewen@nas01:~$
    

    this should create:

    ewen@nas01:~$ ls /volume1/thirdparty/git-annex.linux/locales/en_US.utf8/
    LC_ADDRESS  LC_IDENTIFICATION  LC_MONETARY  LC_PAPER
    LC_COLLATE  LC_MEASUREMENT     LC_NAME      LC_TELEPHONE
    LC_CTYPE    LC_MESSAGES        LC_NUMERIC   LC_TIME
    ewen@nas01:~$
    

    And then we can revert the file permissions to root owned:

    sudo chown -R root:root /volume1/thirdparty/git-annex.linux/locales
    

    Note that it is possible to change the interactive locale by setting LANG and LC_ALL in, eg, ~/.bash_profile (but this will not work for non-interactive shells). git-annex only supports utf8 locales, but that is probably the most useful modern choice anyway. (I chose not to bother as en_US.utf8 is close enough to my usual locale -- en_NZ.utf8 -- that it did not really matter at present; the main difference would be the date format, and I do not expect to use git-annex interactively on the Synology NAS often enough for that to be an issue. I just wanted the warning message gone, as it turns up repeatedly in interactive use.)

  • To usefully use git-annex you probably also want to enable the "User Home" feature, so that the home directory for your user is created and you can store things like ssh keys; this also enables a per-user share (via CIFS, etc). To do this, in the Synology web interface (DSM) go to Control Panel -> User -> User Home and tick "Enable user home service", and hit Apply. This will create a /volume1/homes directory, a directory for each user, and a /var/services/homes symlink pointing at /volume1/homes so that the shell directories are reachable.

    Once that is done, when you ssh into the NAS, the message about your home directory being missing:

    Could not chdir to home directory /var/services/homes/ewen: No such file or directory
    

    should be gone, and you should arrive in your home directory at login:

      ewen@nas01:~$ pwd
      /var/services/homes/ewen
      ewen@nas01:~$
    
  • If you do have a home directory, you might also want to do some common git setup:

    git config --global user.email ...    # Insert your email address
    git config --global user.name ...     # Insert your name
    

    which should run without any complaints, creating a ~/.gitconfig file with the values you supply.

  • Assuming you do have a user home directory you can usefully run the next step to have git-annex auto-generate a couple of necessary helper scripts in ${HOME}/.ssh/ -- which cannot be automatically created otherwise (but see the contents below if you want to try to create them by hand).

    To create the helper scripts automatically run:

    /volume1/thirdparty/git-annex.linux/runshell
    

    which will start a new shell, with /volume1/thirdparty/git-annex.linux/bin in the "${PATH}" so you can interactively use the git-annex versions of tools (eg, for testing).

    It also creates the two helper scripts that we need:

    $ ls -l ${HOME}/.ssh
    total 8
    -rwxrwxrwx 1 ewen users 241 May 28 11:20 git-annex-shell
    -rwxrwxrwx 1 ewen users  74 May 28 11:20 git-annex-wrapper
    $
    
  • Since (a) these scripts are not user specific and (b) "${HOME}/.ssh" is not on the PATH by default, it is much more useful to move these scripts into, eg, /usr/local/bin/, so they are in a central location.

    To do this:

    cd /usr/local/bin
    sudo mv "${HOME}/.ssh/git-annex-shell" .
    sudo mv "${HOME}/.ssh/git-annex-wrapper" .
    sudo chown root:root git-annex-shell git-annex-wrapper
    sudo chmod 755 git-annex-shell git-annex-wrapper
    

    This should give you two trivial shell scripts, which hard code the path to where you unpacked git-annex:

    ewen@nas01:/usr/local/bin$ ls -l git-annex-*
    -rwxr-xr-x 1 root root 241 May 28 11:20 git-annex-shell
    -rwxr-xr-x 1 root root  74 May 28 11:20 git-annex-wrapper
    ewen@nas01:/usr/local/bin$ cat git-annex-shell
    #!/bin/sh
    set -e
    if [ "x$SSH_ORIGINAL_COMMAND" != "x" ]; then
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$SSH_ORIGINAL_COMMAND"
    else
    exec '/volume1/thirdparty/git-annex.linux/runshell' git-annex-shell -c "$@"
    fi
    ewen@nas01:/usr/local/bin$ cat git-annex-wrapper
    #!/bin/sh
    set -e
    exec '/volume1/thirdparty/git-annex.linux/runshell' "$@"
    ewen@nas01:/usr/local/bin$
    

    (which gives you enough to create them by hand if you need to, substituting the path where you unpacked the git-annex standalone archive for /volume1/thirdparty/).

  • To be able to run these helper scripts, and git-annex itself, from a non-interactive shell -- such as when git-annex itself is trying to run the remote git-annex -- we need to ensure that git-annex, git-annex-shell and git-annex-wrapper are reachable via a directory that is in the default PATH. That default PATH is very minimal, containing:

    ewen@ashram:~$ ssh nas01.em.naos.co.nz 'set' | grep PATH
    PATH=/usr/bin:/bin:/usr/sbin:/sbin
    ewen@ashram:~$
    

    Since /bin and /sbin are both symlinks anyway:

    ewen@nas01:~$ ls -l /bin
    lrwxrwxrwx 1 root root 7 May 21 18:57 /bin -> usr/bin
    ewen@nas01:~$ ls -l /sbin
    lrwxrwxrwx 1 root root 8 May 21 18:57 /sbin -> usr/sbin
    ewen@nas01:~$
    

    that gives us only two choices -- /usr/bin and /usr/sbin -- which are on the default PATH. Given that git-annex is not a system administration tool, only /usr/bin makes sense.

    To do this, symlink them into /usr/bin:

    cd /usr/bin
    sudo ln -s /usr/local/bin/git-annex* .
    

    I am expecting that this step may need to be redone periodically, as various Synology updates update /usr/bin, which is why I have a "master" copy in /usr/local/bin and just symlink it into /usr/bin. For git-annex this is a chain of two symlinks:

    ewen@nas01:~$ ls -l /usr/bin/git-annex
    lrwxrwxrwx 1 root root 24 May 28 11:57 /usr/bin/git-annex -> /usr/local/bin/git-annex
    ewen@nas01:~$ ls -l /usr/local/bin/git-annex
    lrwxrwxrwx 1 root root 45 May 28 11:01 /usr/local/bin/git-annex -> /volume1/thirdparty/git-annex.linux/git-annex
    ewen@nas01:~$
    

    which is slightly inefficient, but still convenient for restoring later.

  • Now is a convenient time to set up ssh key access to the Synology NAS, by creating ${HOME}/.ssh/authorized_keys as usual. Since we do not need a special key to trigger a special hard coded path to git-annex-shell (because it is on the PATH) you can use your regular key if you want rather than a dedicated "git-annex on Synology NAS" key.

    Ensure that the permissions on the ${HOME}/.ssh directory and the authorized_keys file are appropriately locked down so that sshd will trust them, eg:

    cd
    chmod go-w .
    chmod 2700 .ssh
    chmod 400 .ssh/authorized_keys
    

    and then you should be able to ssh to the NAS with key authentication; if it does not work use "ssh -v ..." to figure out the error, which is most likely a permissions problem like:

    debug1: Remote: Ignored authorized keys: bad ownership or modes for directory /volume1/homes/ewen
    

    because the permissions on the default created directories are very permissive (and would allow anyone to create a ssh authorized key entry), so sshd will not trust the files until the permissions are corrected.

  • All going well at this point you should be able to verify that you can reach all the necessary programs from a non-interactive ssh session with something like:

    for FILE in git-annex git-annex-shell git-annex-wrapper git git-shell git-receive-pack git-upload-pack; do
        ssh NAS "which ${FILE}"
    done
    

    and get back answers like:

    /usr/bin/git-annex
    /usr/bin/git-annex-shell
    /usr/bin/git-annex-wrapper
    /usr/bin/git
    /usr/bin/git-shell
    /usr/bin/git-receive-pack
    /usr/bin/git-upload-pack
    

    if one or more of those is missing from the output you will want to figure out why before continuing.

  • To centralise my git-annex storage, I created an "annex" share through the Synology NAS web interface (DSM) in Control Panel -> Shared Folder. This created a /volume1/annex directory.

  • To make that easily accessible, I created a top level symlink to it:

    sudo ln -s /volume1/annex /
    

    giving:

    ewen@nas01:~$ ls -l /annex
    lrwxrwxrwx+ 1 root root 14 May 28 12:10 /annex -> /volume1/annex
    ewen@nas01:~$
    

    This matches the pattern I use on some other machines.

Once all this setup is done, git-annex can be used effectively like on any other Linux/Unix machine. For instance you can "push a clone" onto the NAS using "git bundle" and "git clone" from the bundle, and then add that as a "git remote" and use "git annex sync" and "git annex copy ..." to copy into it.
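
As a rough sketch of that workflow (the repository path and names here are just examples, run from an existing git-annex repository on another machine):

git bundle create /tmp/myrepo.bundle master git-annex
scp /tmp/myrepo.bundle nas01:/annex/
ssh nas01 'git clone -b master /annex/myrepo.bundle /annex/myrepo && cd /annex/myrepo && git annex init nas01'
git remote add nas01 nas01:/annex/myrepo
git annex sync nas01
git annex copy --to nas01 .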

The "standalone git-annex" will probably need updating periodically (for bug/security fixes, new features, etc), but it should be possible to do that simply by replacing the unpacked tarfile contents as required; everything else points back to that directory. (Possibly the locale generation step might need to be done by hand again.)

Finally for future reference, it is also possible to run a Debian chroot on the Synology NAS, which would open up even more possibilities for using the NAS as a more general purpose machine.

Posted Sun May 28 13:09:13 2017

Imagine, not entirely hypothetically, that you have a VMware ESXi 6.0 host that has disconnected from VMware vCenter due to an issue with the management agents on the host, but the virtual machines on the host are still running. Both the affected host and the VMs will show as "disconnected" in vCenter in this case. Attempts to reconnect from the vCenter side fail.

The usual next steps are to check the management network (eg, from the ESXi DCUI) -- it was fine -- and then to try restarting the management agents in ESXi from the DCUI or a ssh shell, which in this not entirely hypothetical case hung for (literally) hours attempting to restart them. When you reach this point the usual advice is that the host has to be rebooted -- which is complicated because it has production VMs on it, and you cannot just vmotion those VMs to somewhere else.... because the connection to vCenter is broken :-(

If you are lucky enough to have:

  • ssh access to the affected ESXi host, so you can easily tell what is running there

  • your VMs hosted on shared storage

  • at least one other working ESXi host with capacity for the affected VMs connected to vCenter and the shared storage

  • ssh access to the working ESXi host

then there may be a relatively non-disruptive way out of this mess where you can cleanly shut down each VM and then start it up again on the working host even when the management agents are not working any more. (In our not entirely hypothetical case we got no response at all to any esxcli or vim-cmd or similar commands, including commands like df -- presumably because they all talk to the local management agents, which were wedged.)

To be able to move the affected VMs with the least downtime like this you need:

  • to know the path on the shared storage to the VM's .vmx file (typically something like /vmfs/volumes/..../VM/VM.vmx)

  • to know the port group on the vDS (distributed switch) for each interface of the VM

  • have a login (or a contact who can log in) to the VM and shut it down from within the guest OS

Hopefully you can find the first two in your provisioning database (in a larger environment), or someone will remember where the VMs are stored (in a smaller environment), otherwise you will need to find them by manually browsing your storage and vDS in vCenter. Do find out both of these things before shutting down the VM to minimise downtime of the affected VMs.
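
If you do have to hunt manually, the ssh shell on a working host that sees the same shared storage can help narrow things down (a rough sketch; PATH_TO_VM is the same placeholder used below, and the grep just shows the network-related lines of the VM's configuration):

# Find candidate VM configuration files on the shared datastores
find /vmfs/volumes -name "*.vmx"

# Show the network configuration lines (including vDS port group IDs) for a VM
grep -i ethernet /vmfs/volumes/PATH_TO_VM/VM/VM.vmx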

To move the VM in this manual way the approach is then:

  • Log into vCenter

  • Find a new unused port ID to use for the VM on the new working host that is in the same port group (normally the host/vCenter will do this for you, but because we bypass vCenter to register the VM this does not happen automatically). To do this go to the Networking page in vCenter, and look in the "Ports" tab of the relevant port group for an empty Port ID line, then make a note of that Port ID number. If you have multiple interfaces you will need to do this for each interface of the affected VM. (If you do not do this, the VM will start up with its networking disconnected, and you will get the error Invalid configuration for device '0' or similar, which will lead to unnecessary downtime. If you really cannot figure out the appropriate vDS port groups, you can leave this step until after you have shut down/re-registered the VM, but there will be more downtime.)

  • ssh into the new working host you plan to start the VM on, and prepare the command:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    ready to run as soon as it is time. This will register the VM on the new ESXi host, which will then tell vCenter "hey, I have this VM now" and the VM will no longer show as disconnected in ESXi. (I believe this works because it manually replicates what, eg, VMware HA does.)

  • ssh into the new working host in another window, and run:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    to check for the VM.vmx.lck file indicating the VM is running; it should be present at this point as the VM is still running on the affected host. Be ready to run this command again once the VM is shut down.

  • Now log into the guest OS (via ssh, RDP, etc) and ask the guest OS to shut down (or call your contact and ask them to do that). Monitor the progress shutting down by, eg, pinging the external IP.

  • Once you see ping stop responding, wait a few seconds then re-run your:

    ls /vmfs/volumes/PATH_TO_VM/VM/

    on the new working host. With luck you will see that there is no VM.vmx.lck file left a few seconds after it stops responding to ping, indicating that the shutdown completed successfully.

  • Once the VM.vmx.lck file is gone, hit enter in the other window where you prepared the:

    vim-cmd solo/registervm /vmfs/volumes/PATH_TO_VM/VM/VM.vmx

    command to register it on the new working host.

  • Then find the VM in vCenter -- it should no longer show as disconnected. Edit its settings, and for each network interface click on the "Advanced Settings" link, and then change the Port ID of the vDS port it is connected to from the old one (tied to the broken host) to the free Port ID in the same port group that you found above. Save your changes.

  • Hit the Play button on the VM in vCenter. All going well, the VM should start normally, and connect to the network. Wait for the guest OS to boot and then check (or have your contact check) that it is working. (If it does not connect to the network, double check the Port ID that you set, and the guest OS -- by this point you should be able to open the VM's console again to look.)

All going well the downtime for each VM is about 30 seconds longer than the time it takes to shut down the guest OS in the VM, and start up the guest OS in the VM again -- so best case 1-2 minutes downtime.

Lather, rinse, and repeat to move the other VMs. I would suggest doing only one at a time to minimise the risk of getting confused about which step you are up to on which VM, and also minimise the downtime for each individual VM due to being distracted by working on a different VM.

If you are very lucky then after a while maybe you will manage to shutdown the VM that caused the host management agents to wedge/not start, and then the host management agents will start and the host will reconnect to vCenter. If so, you can then vMotion the remaining VMs off the affected host as normal. Otherwise keep going with the manual procedure until the host is empty. (You can tell it is empty because you no longer have disconnected VMs in vCenter; also "ps | grep vcpu | cut -f 2 -d : | sort | uniq" makes an acceptable substitute for "esxcli vm process list" -- the latter of which will just hang in this case.)

Once the affected host is empty either reboot it (or power cycle it if you cannot get in to reboot it), or if it did reconnect to vCenter, put it into maintenance mode and then reboot it. That way all the management agents and the vmkernel get a fresh start. If the VMware host logs (eg hostd.log, vmkernel.log, vpxa.log) do not show an obvious hardware cause of the problems -- so that it seems like a bug was triggered instead -- then it is probably safe to put the affected host back into production once it has been rebooted/power cycled.

Thanks to devnull4 and routereflector for the very useful hints for this process (and other useful information). In our not entirely hypothetical situation none of the esxcli commands or vim-cmds to run on the non-working host worked -- they all just hung indefinitely -- so we skipped all of those, and just shut down the guests from within the guest OS. (As best we can tell from context it seems like maybe something confused CBT on a specific guest on this host, which caused a pile up of processes waiting on a lock, which caused all the symptoms. Moving a VM that we found out was being backed up via CBT tracking at the time seemed to be the magic step that freed everything else up. The ESXi hosts affected were on the nearly-latest patch level, but we plan to patch them up to date in a maintenance window soon in case the bug we seem to have stumbled across has been fixed.)

The moral of this story is: if you find yourself in this situation, try to start the SSH shell before trying to restart the Management Agents. It will give you a second way to look at the affected host if the Management Agents do not just restart. In our case it took about 5-10 minutes before the SSH shell started, and during that time the DCUI did not respond to keyboard input. But by contrast restarting the Management Agents through the DCUI took literally 3 hours, during which the DCUI was unusable -- so if we had not started the SSH shell first we would have had no visibility.

Posted Mon May 15 13:07:30 2017 Tags:

For many years, including on OS X 10.9 (Mavericks) and on OS X 10.11 (El Capitan), I have relied on AutoImporter.app to automatically import photos from my iPhone when it is connected to my Mac.

Unfortunately AutoImporter.app stopped working when I upgraded to iOS 10.3.1 -- it launches, but does not import anything, when the phone is connected while logged in (whether or not the phone is locked). It seems like this issue might have started with iOS 10.2 -- but I went straight from 9.x to 10.3.1, so I do not know if 10.2 or 10.3 first introduced the problem.

In my case even logging in with the phone connected and unlocked does not seem to work for me on OS X 10.11 (El Capitan). From reading bug reports (and some hints in a Reddit thread) it appears the issue is that AutoImporter.app is starting correctly, but is unable to get a list of photos to import -- which is consistent with opening, but showing a 100% progress bar immediately then closing.

Manually importing into Photos.app (with Photos.app auto-launched) appears to be Apple's official answer, but I do not want a manual import -- or to use Photos.app. Supposedly the iCloud My Photo Stream feature can get in the way, but I do not have any iCloud features turned on, and other users report it made no difference for them. So my current pick is that something changed around enumerating the list of photos on the iPhone, and no one bothered to upgrade AutoImporter.app to match :-( Some users report upgrading to recent versions of macOS Sierra (but only recent versions) made it work again -- however I am not about to embark on a major OS upgrade at present, just in the hope of making this one thing work again.

For now I am trying to remember to periodically use "Image Capture.app" to manually import new images off my iPhone as a backup (in addition to the automatic copy in the iTunes backup of my phone). Sadly this is rather less convenient than the automatically running import.

To use "Image Capture.app" to do a manual import:

  • connect the iPhone to the computer

  • open the "Image Capture.app"

  • select the phone from the list of devices, and if necessary enter the passcode on the phone to unlock it and grant access

  • highlight the photos since the last import (:-( ), eg determined by the file names or date

  • make sure the destination setting at the bottom is "Pictures"

  • make sure that the "Make subfolders per camera" setting is ticked (to replicate how I had AutoImporter.app set up)

  • make sure "Delete after import" is not ticked (in the bottom left area)

Then click on "Import" at the bottom right.

It appears that "Image Capture.app" may recognise the photos which are already imported, once everything else is set up (eg, they have a green tick), so the step of selecting the new photos may not be necessary -- or may be able to be guided by those green ticks -- but for now I am still doing that step of determining what to import manually, in order to have more control over what is imported.

Hopefully there will eventually be an OS X 10.11 (El Capitan) update that restores this useful automatic import functionality.

Posted Sun May 14 11:22:02 2017 Tags:

I have had "always on" Internet at home, via a cable modem, since August 2000 (literally "always on" -- the modem never gets turned off due to overnight backups, etc). The brand on the service has changed a few times over the years (TelstraSaturn, TelstraClear, and in recent years, Vodafone) but in that time I have had just two cable modems and two IPv4 IP addresses (static IPs -- one associated with each cable modem; it was rather a challenge when the forced cable modem upgrade forced a change of static IP; the first was 203.79.72.36 and the second was 203.167.144.68).

This past week I had my service upgraded to the latest Vodafone cable modem variant -- Vodafone FibreX.

DOCSIS 1.1/pre-DOCSIS

The first iteration of the cable modem network was provided via rented ($17/month!) Com 21 cable modems. The network was built by Saturn; but I did not sign on until after the TelstraSaturn buyout, ordering via Paradise Internet, who were bought by TelstraSaturn earlier in 2000. From the timeline (original deployments around 1998) I think the technology would have either been DOCSIS 1.0/1.1 or possibly a proprietary precursor to DOCSIS 1.1. As usually deployed it was expected that you would have a single computer, or your own NAT gateway (but from memory you could also get a routed subnet over the cable modem network for additional cost on a "business" plan).

I forget exactly what service I originally had (the first invoices just say "Paradise HighSpeed Internet" at $73.00/month plus cable modem rental of $17.00/month), but I think it might have been 256 kbps down, 128 kbps up (see also plans available in 2002) -- and it originally came with a 512MB monthly data quota, increased to 1GB per month by the end of 2000 (with the total price also going down by a few dollars). Additional data usage was charged at $0.20/MB outside New Zealand, and $0.02/MB within New Zealand -- for modern comparison, that is about $200/GB for overseas traffic, and $20/GB for New Zealand traffic. There was obviously both a big incentive to keep traffic within New Zealand, and also to track data usage fairly closely via the Paradise "Member Internet Usage" page.

From memory the final plan on that cable modem technology got to about 2Mbps down and 256kbps up, with around 1GB or so of data transfer included (and excess traffic at around $10 per 512MB, so $20/GB). Since I have worked from home for many years, I do remember wanting to double the upstream bandwidth to 512kbps but finding no such plan available.

DOCSIS 2.0

The second generation of cable modem technology, deployed in a forced rolling upgrade that moved everyone to a different subnet, used Motorola SB5100 SURFboard cable modems (Vodafone SB5100 troubleshooting guide) which is a DOCSIS 1.1/DOCSIS 2.0 cable modem. I believe the TelstraClear Cable Modem network at that point was using DOCSIS 2.0.

From memory my original plan was something like the HighSpeed 10G plan -- 4Mbps down, 2Mbps up, 10GB of data. Over time I switched to something like a LightSpeed 40G plan (which was 10Mbps down, 2Mbps up and 40GB of data) and other larger plans -- by the end my plan was 15Mbps down, 2Mbps up, and 100GB of data. Traffic usage beyond the bandwidth cap was still charged, at a flat rate for all usage with no overseas/New Zealand distinction, of $3/GB or $3/2GB on the higher datacap plans. (A few years back I even upgraded to the next higher datacap plan in order to get into the plan range that was entitled to $3/2GB overage charges, just to reduce the financial impact of excess usage.)

While it appears that the Motorola SB5100 could act as a NAT gateway in some deployments, I always used it in bridge mode with the static IP assigned to my NAT router. I think that was the most common home deployment -- a separate consumer supplied NAT/WiFi router connected to the SB5100 cable modem.

DOCSIS 3.0

Vodafone did roll out a third generation of cable modems a couple of years ago, using the TechniColor 7210D (Vodafone manual), which could operate either in "CM" (Cable Modem) mode or in "RG" (Residential Gateway) mode -- basically an all-in-one consumer NAT/firewall/WiFi router. I believe this was used on the 50Mbps down/2 Mbps up, and 100 Mbps down/10 Mbps up plans, intended to be used stand alone (ie, without a separate NAT router/WiFi router). From the hardware and speeds available I believe this iteration of the network used DOCSIS 3.0.

I did consider upgrading when these plans came out, but originally there was a 2 year term contract, and I was reluctant to sign up to a 2 year term contract for a cable modem plan when the Government-funded UFB (Ultra Fast Broadband) fibre to the home (FTTH) rollout was already well under way. As it turns out the UFB build did not make it to my street until late last year, so with hindsight I could have upgraded to this plan and completed the two year term before I had other options available. (I was also reluctant to lose the static IP that I had, because over years of working from home that static IP had ended up hard coded in several clients' firewall rules.)

Vodafone FibreX / DOCSIS 3.1

Vodafone FibreX, like the previous TelstraSaturn/TelstraClear/Vodafone cable modem networks, is a Hybrid Fibre-Coaxial (HFC) network -- basically fibre to the node (FTTN). I suspect the original cable modem networks probably trunked further back towards the network core on coaxial cable than the current deployments do, since I imagine the cable length limitations for DOCSIS 3.1 are much shorter than the original DOCSIS 1.1 supported lengths; it appears the modern Vodafone HFC is GPON 2 to the cabinet, and DOCSIS 3.1 from there (this "GPON 2" appears to be 10G-PON, a 10Gbps Passive Optical Networking (PON) technology; there is also a NG-PON 2 technology now, which is a 4 * 10Gbps PON technology).

The FibreX rebranding appears to be trying to reflect this "mixed Fibre and Coax" network, in an age where UFB/Fibre to the Home (FTTH) is what people are talking about; and the upgrade to DOCSIS 3.1 makes it speed competitive with the common FTTH plans. However hiding the cable modem part of it in the marketing makes it less obvious that it is a direct upgrade from the previous cable modem plans rather than, eg, Vodafone's UFB offering over fibre to the home (which is what I originally assumed when I first saw the marketing); others were confused by the FibreX name too. It was not until I started discussing it with friends, and reading, eg, the Vodafone FibreX FAQ in detail that I realised that it is a direct upgrade for what I had without any new cabling being required. (By contrast Chorus UFB cabling to the home has quite a bit of "new build" complexity -- but does support overhead install where the existing cabling is already overhead, which eliminates some of the complexity I was concerned about.)

Vodafone FibreX launched in October 2016, but it was only this past month that I finally had enough free time to research upgrading my old cable modem to something else and deal with the impact of changing static IP addresses. The final straw was my old (Motorola SB5100) cable modem taking four hours to reconnect to the cable network after a brief power cycle to plug it back into the UPS (after replacing the battery in the UPS) -- and then finding out that the tech callout would take several days. (Vodafone did load extra data usage onto my mobile, as per their "Always Connected" (FAQ) promise, but 3G data through a single device, even with WiFi tethering, is nowhere near as convenient as "whole house" wired access, particularly for work use.)

I switched to the "FibreX 200" plan, with home phone, from my old (Motorola SB5100 based) "LightSpeed 100GB" plan -- and overall the monthly bill should drop by over $40/month, mostly due to the home phone component going from about $35/month to about $10/month. In addition to the cost savings the "FibreX 200" plan has brought:

  • 200Mbps down, 20 Mbps up peak speed (although as usual for "unlimited" consumer-grade plans, that marketing peak speed is available to the nearest speed test site, but not so much in the real world; it seems like people do reach those peak speeds in the middle of the night.)

  • "Unlimited" data transfer "for standard residential use only", so that I do not have to time shift larger data transfers to fit inside fixed monthly quotas (or "sneakernet" particularly large files). (The old Unlimited Broadband Data terms which appear still to be linked to from the FibreX product page but now just redirect to the residential terms, suggested that P2P traffic would be shaped in the face of any congestion and that 22:00-06:00 was the preferred time for such bulk data transfers; presumably that shaping, and probably more, still happens; eg better transfers overnight.)

  • Dynamic IPv4 addresses by default (although you can still request a static IPv4 address, apparently for an extra $5/month on Vodafone FibreX).

  • IPv6 addresses by default (the Vodafone and IPv6 "coming soon" for FibreX appears to have arrived at my location, as it was there as soon as the modem was installed; apparently the rollout is still under way throughout the network). The IPv6 addresses appear to be a dynamic /56 prefix; there is no option for a static address, but in theory the IPv6 prefix received should be fairly stable.

  • Both a cable modem (Vodafone badged TechniColor TC4400 DOCSIS 3.1 modem -- model TC4400VDFV4), and a residential gateway (Vodafone badged Huawei HG659) with gigabit ethernet and 802.11ac WiFi, are supplied as part of the monthly cost. (The Huawei HG659 seems to be supplied as part of several Vodafone Internet services, including DSL and fibre based ones.) The TechniColor TC4400VDF appears to be configured as a straight cable modem bridge, so it is possible to use your own NAT router/gateway if you want -- users on GeekZone report using Ubiquiti EdgeRouters or Mikrotik RB750gr3 routers; the main trick is apparently DHCP on VLAN 10 on the WAN interface (see the sketch below).
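
    As a rough illustration of that trick on a Ubiquiti EdgeRouter (EdgeOS commands; eth0 as the WAN port is an assumption, and I have not tested this configuration myself):

    configure
    # WAN uplink on a tagged VLAN 10 sub-interface, getting its address by DHCP
    set interfaces ethernet eth0 vif 10 address dhcp
    commit
    save
    exit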

Vodafone FibreX installation

For a house with an existing cable modem install (so no new cabling is required) the installation seems to be typically a 10-15 minute procedure (at least both for me, and for a friend), consisting of a Downer technician:

  • Disconnecting the old cable modem

  • Checking the quality of the cable signal with a diagnostic tool to verify it is still good enough for DOCSIS 3.1 (in my case this particular cable install is 14 years old)

  • Connecting the TechniColor TC4400VDF cable modem up to the cable plant

  • Calling back to Downer/Vodafone with the MAC address of the cable modem to get service switched over to the new modem's MAC address

  • Powering on the TechniColor TC4400VDF and waiting for it to connect to the cable network

  • Connecting the Huawei HG659 to the TechniColor TC4400VDF with a Cat 6 ethernet cable, and powering on the Huawei HG659

  • Testing the Internet service with a Windows-based laptop connected via Cat 6 ethernet cable to the Huawei HG659 by going to the ACS Data Speed test site

Then, providing the speed test showed the expected speed (in my case 200Mbps down, 20 Mbps up), declaring success, and taking the old cable modem away (after giving me a receipt to prove it had been returned).

My installer was on his third call of the day (mine) by just after 09:30, and heading back out the door literally 15 minutes after arriving -- pretty good for a "morning install" window given as 09:30-12:30 :-)

As installed the Huawei HG659 provides IPv4 DHCP service on the 192.168.1.0/24 range, with the HG659 on 192.168.1.1 as the IPv4 default gateway, and IPv6 Neighbor Discovery (NDP) on one of the /64s from the /56 supplied to the router by DHCPv6 on the WAN interface. It also provides 802.11b/g/n WiFi on 2.4GHz and 802.11a/ac WiFi on 5GHz, with some default SSID and MACs based on the serial number (the installer pointed out the sticker with the default credentials on it).

Huawei HG659 customisation

After testing that the connection was working from a directly connected laptop, I moved on to customising the Huawei HG659 so that it could replace my existing NAT gateway (an old Linksys WRT54GS v1.1, with a custom OpenWRT install) -- since the Linksys WRT54GS v1.1 was never going to be capable of routing at anything like 200Mbps. (The WRT54GS v1.1 had done very well with ethernet routing even up to 15Mbps down, 2Mbps up; I turned off its WiFi function in favour of an Apple Airport with 802.11-pre-N functionality many years ago. It appears the WRT54GS v1.1 can run OpenWRT 15.05, so I may upgrade it to the last supported version at some point and use it for devices that can only do 802.11b/g, such as "IoT" devices, to keep them separated from my main WiFi for both security and performance reasons -- 802.11b in particular takes a lot of radio time for not much bandwidth. See also, more Linksys WRT54GS v1.1 information.)

Fortunately Vodafone document the default admin password for their HG659, and in recent models it is based on the serial number (username: Admin, password: @NNNNNNNN, where the user name is case sensitive and NNNNNNNN is the last 8 digits of the serial number), so it was easy to get into the web management interface by going to http://192.168.1.1 from the laptop directly connected to the HG659.

To work on my network I have changed:

  • the LAN interface address to match the internal /24 I have used on my home network for years (I learnt long, long ago not to use 192.168.1.0/24 or similar common RFC1918 default ranges; it just causes too many conflicts later when you want to connect things together). The LAN address is changed in Home Network -> LAN Interface -> LAN Interface Settings, and the HG659 will reboot when the change is applied (which will result in a new external IP address too). The HG659 web interface attempts to redirect to the new IP, but in practice it took long enough to reboot (and for the configuring client to re-DHCP) that the redirect timed out; going to the new IP after it finished booting worked fine.

  • disabled the IPv4 DHCP server for the LAN, so as to continue using my existing DHCP server (on one of my home Linux systems) which has all the static leases for my existing devices. This is done in Home Network -> LAN Interface -> DHCP Server, by unticking the box which enables it; disabling it does not require a reboot. (I left the IPv6 router advertisement functionality enabled, as I did not have anything else doing non-link-local IPv6 allocation previously.) The main disadvantage of not using the internal DHCP server is that it then only recognises devices in the web view by their MAC address, rather than also being able to display the name used during DHCP.

  • at this point I could unplug my old gateway (Linksys WRT54GS) and connect the HG659 to the rest of my internal network, and verify that I could reach the Internet again.

  • I turned off the 2.4GHz radio, since for now I plan to continue using the Apple Airport Express 802.11n functionality for 2.4GHz-only devices, as it is better located in my house for good coverage. To turn off the 2.4GHz radio, untick Home Network -> WLAN Settings -> Basic Settings -> Enable WLAN 2.4 GHz.

  • because I wanted to use the 802.11ac 5GHz functionality, to get better WiFi performance on modern devices, I left the 5GHz radio on and instead configured the SSID and WPA2 password -- I set the SSID to the same as my Apple Airport, but with a "-5" suffix so I can recognise it, and the WPA2 password to the same long random password as my Apple Airport -- that way I can move laptops over to the new WiFi AP by just cut'n'pasting the password from the existing setup, instead of trying to type another long random password in by hand. These settings are changed in Home Network -> WLAN Settings -> WLAN Encryption -> 5GHz Frequency Band. (All my current laptops seem to support 802.11ac, which should reduce the need to run ethernet cables to them for larger transfers, as does my iPad Mini 4, but my now relatively old iPhone does not.)

  • I also changed the passwords of the user and "Admin" accounts, and changed the user name of the "user" account (default: vodafone) because the HG659 was prompting to change them on every log in (and because it is generally good security practice). These can be changed in Maintain -> Modify Login Password, by clicking on the "Edit" link next to each item; the username of the user account is editable, but the username of the "Admin" account is not. (I understand that Vodafone can remotely manage their supplied home gateway, but that appears to be via ACS/TR-069, as configured in the Maintain -> Remote Management section; in theory it can be turned off, but I do not know if that would break useful service functionality.)

  • ETA, 2017-05-03: Enable ping from the WAN side (so I can monitor my connection remotely), by adding a new ACL in Internet -> Network Security -> ACL, with the protocol "ICMP" and the source "WAN".

The HG659 appeared to come with the latest firmware (16V100R001C206B020 at present), so that did not need upgrading at this point.

I have switched some of my laptops over to the 802.11ac WiFi and that seems to work; the remainder of devices are either on wired ethernet or the existing Apple Airport Express 802.11b/g/n network, and also seem to work (just with some WiFi induced speed limitations). Fortunately I upgraded the last of my network switches to 1Gbps last year/early this year, so every device with a 1Gbps Ethernet port can be plugged in at 1Gbps.

Several of my devices just transparently started using IPv6 to access various services -- going to, eg, test-ipv6.com from one of my OS X 10.11 laptops showed a perfect score. The availability of IPv6 by default on my home Internet connection gives me some hope that the Internet might manage to transition to "more IPv6 than IPv4" sometime in the next 10 years -- perhaps completing the IPv4 to IPv6 transition in a mere 2-3 decades. (IPv6 development started over 20 years ago, eventually standardising in RFC2460 from nearly 20 years ago; I remember doing IPv6 interop tests in a network user group around 15 years ago...)

The main issue I found is that there are still networks with much worse IPv6 connectivity than IPv4 connectivity, so following the IPv6 path can involve tromboning via another country and returning instead of a cross-town or cross-island connection in the case of IPv4. One of my clients is affected by this -- they are advertising IPv6 addresses for things I need to reach, but they effectively only have international IPv6 connectivity, and no "domestic" IPv6 connectivity outside of Internet Exchanges (and sadly while it seems that some bits of the Vodafone FibreX IP ranges are advertised onto some Internet Exchanges, none of the IPv6 space appears to be). I worked around that by adding config options for ssh (the main case where the extra latency is obvious) to force IPv4:

Host *.example.com
    AddressFamily inet

Some simple tests with IPv4 suggest that I can download from my Wellington based colo server at 100Mbps - 150 Mbps from an Ethernet connected device, at least some of the time, ie 10MB/s - 15 MB/s, which is roughly 10 times as fast as with the previous cable modem. The 802.11ac WiFi path appears to be capable of 50 Mbps - 70 Mbps for the same download, ie 5MB/s - 7MB/s, which is also still an improvement over the previous cable modem (topping out at 15Mbps -- 1.5MB/s). (I think in part that relies on the IPv4 routing going over the same Internet exchange as the ACS Data speed test server, so hopefully that direct routing is maintained as the IPv4 address changes around -- including to the IPv4 static IP pool.)
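
For anyone wanting to repeat that sort of rough measurement, a simple sketch (the URL is a placeholder for a large file on a server you control):

# Force IPv4, discard the downloaded data, and report the average download speed
curl -4 -o /dev/null -w 'average: %{speed_download} bytes/sec\n' https://colo.example.com/largefile.bin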

Future changes

As deployed the Vodafone FibreX service has a dynamic IP -- changing every time the modem is rebooted. Because I want to update the ACLs on client firewalls with my new IP I attempted to request a static IP once the new modem was installed (I also tried asking when I ordered the upgrade but that just caused a lot of confusion, and resulted in three people handling my call instead of one, so by the third person I kept my request simple!). When I called the afternoon of my morning install, I was told that not enough of the FibreX upgrade had processed for them to run the script that enabled a static IP -- apparently it takes up to 24 hours for the billing database to update, and then another 24-48 hours for the static IP allocation to take effect. In theory the "Vodafone Ninja" who took my call was going to follow up once the billing database updated, and process my static IP request then -- but I will probably have to check in a week whether or not it took effect. (So far my IP has not changed since I have not rebooted the HG659 after the initial LAN address change...; ETA, 2017-05-03: I got a call that the static IP had been applied, and I should reboot my modem; it took a couple of power cycle/reboots of the Technicolor TC4400VDF and the Huawei HG659 before the IP changed from a 27.252.0.0/16 address to a 203.118.0.0/16 address, including one where the HG659 had the old IPv4 IP address but IPv4 Internet access no longer worked -- only IPv6 Internet access did. After those first few reboots everything seems to be working and stable on the new IP address even over more reboots/power cycling of everything. FWIW, AFAICT my IP address did not change in the previous week due to never rebooting the HG659, so if you can live with your IP changing occasionally you may not need to pay for a static IP; in my case the IP ends up in too many ACLs to want it changing even every few weeks. It also appears that the IPv6 prefix I've been assigned has been consistent over the past week, including the multiple reboots to get the IPv4 address to change.)

Currently I am undecided whether I will stick to using the supplied HG659 as my Internet gateway and/or WiFi device. I would prefer to be using a device that I could secure and customise completely myself, rather than something at the edge of my network which I cannot completely verify is secure. But as an interim step the supplied HG659 avoided me having to immediately find a suitably fast NAT router, and a 802.11ac WiFi bridge.

I am also undecided whether I will still get UFB installed to my house, so as to have a redundant Internet connection. The UFB-based Internet plans appear to cost roughly the same amount, so in total it would roughly double my Internet access costs (that list appears to be out of date; there are other providers available now, and other plan combinations). Because I work from home a lot of the time I could probably justify the extra monthly cost (having just reduced my cable modem/phone bill substantially), but I am unsure whether it is worth the install costs or install complexity. (UFB residential installs used to be free, and it appears this may be renewed until 2019, so it might make sense to arrange the UFB install sooner rather than later. But at minimum I will need a 1Gbps capable router that can make policy routing decisions.)

Posted Sun Apr 30 16:09:51 2017 Tags:

This past week there has been a lot of hype about CVE-2016-10229 which seems to have been one of those "just a bug" bugs that later turned out to be exploitable. The description:

udp.c in the Linux kernel before 4.5 allows remote attackers to
execute arbitrary code via UDP traffic that triggers an unsafe
second checksum calculation during execution of a recv system call
with the MSG_PEEK flag.

implies that Linux versions before Linux 4.5 are vulnerable, which seems to have led to misleading things like Security Focus listing dozens of Linux versions as vulnerable.

But according to the author of the patch, "Whoever said that linux [before] 4.5 was vulnerable made a mistake", and only kernels which had Linux kernel git commit 89c22d8c3b278212eef6a8cc66b570bc840a6f5a backported need the fix, which is in Linux kernel git commit 197c949e7798fbf28cfadc69d9ca0c2abbf93191. The fix was created in late 2015, and applied to the main Linux git repository in early 2016.

Debian patched CVE-2016-10229 before there was any CVE assigned, as a result of Debian Bug #808293 where UDP in IPv6 did not always work correctly. The fix was released in, eg, Debian Linux kernel 3.2.73-2+deb7u2 (for Debian Wheezy):

ewen@debian-squeeze:~$ zgrep -A 18 3.2.73-2+deb7u2 /usr/share/doc/linux-image-3.2.0-4-686-pae/changelog.Debian.gz | egrep "udp|808293|-- |^ *$"

  * udp: properly support MSG_PEEK with truncated buffers
    (Closes: #808293, regression in 3.2.72)

 -- Ben Hutchings [...email omitted...]  Sat, 02 Jan 2016 03:31:22 +0000
ewen@debian-squeeze:~$ 

in January 2016, which means that Debian Wheezy has not been vulnerable since very early 2016.
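
A rough way to check another Debian machine for the same fix is to look for that bug number in the changelog of the running kernel package (a sketch, assuming the usual linux-image package naming):

# Prints the changelog path if Debian bug #808293 (the MSG_PEEK fix) is mentioned
zgrep -l 808293 /usr/share/doc/linux-image-$(uname -r)/changelog.Debian.gz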

Ubuntu patched CVE-2016-10229 before there was any CVE assigned, as a result of Ubuntu Bug #1527902, triggered by different symptoms but referencing the Debian Bug and the net-next patch that got committed above. For Ubuntu 14.04 the patch was released in 3.13.0-79.123, which is so long ago that the installed changelogs do not even include that release in the installed changelog.Debian.gz. The full Linux Trusty kernel changelog does not have a date for 3.13.0-79.123, but it must have been released at least by Monday 2016-02-22 when 3.13.0-80.124 was released (the next release). So Ubuntu has also been fixed since early 2016.

Red Hat Linux never included CVE-2016-10229, due to not backporting the vulnerable code, so they have never been vulnerable. And it appears that Debian and Ubuntu were vulnerable for only a few Linux kernel releases before realising they had a regression and fixing it.

At this point it would be difficult to be running a modern server Linux distribution, assuming you ever install patches, and still have been vulnerable to CVE-2016-10229 at any time in the past year. Which means no rush-patching is required. (Rather like last month's Microsoft MS17-010 SMB fixes turned out to patch the bugs in the Shadow Brokers Release that were not already patched, and were released weeks before the Shadow Brokers Release. Pro Tip: Stop using SMB1!)

So why the hype now? As best I can tell it is because Android only just patched CVE-2016-10229 this month, and called it out as a security issue whereas no one else had. That plus the imprecise CVE-2016-10229 description "udp.c in the Linux kernel before 4.5 allows remote attackers to execute arbitrary code via UDP" seems to have caused all the noise.

It probably did not help that the Register, Reddit, and Hacker News describe it as patched "earlier this year", or "in Jan/Feb 2017" or "a while ago", without pointing out that it has been patched for around 14-15 months (early 2016, weeks after being introduced) in most non-Android locations. Plus of course the brokenness of the Android security update eco-system (most handsets are patched via a chain of Google, phone manufacturer and/or telco -- and many fixes do not make it through that chain to devices in real world use -- which leads to a lot of non-patchable devices).

Sometimes Linus Torvalds's "So I personally consider security bugs to be just "normal bugs"" does pay off; this bug was mostly fixed as a regression (except by Android who were a year late to the party). But it seems like the lack of CVE identifiers being back-tagged onto older bugs that were fixed, combined with a lack of research by journalists, leads to more hype when the security risks (rather than just regressions) are later realised.

At least CVE-2016-10229 did not have a vanity website.

Posted Mon Apr 17 11:35:25 2017 Tags:

Recently I ordered a Synology DS216+ II Linux based NAS with two 6TB WD60EFRX (WD Red NAS) drives, as an "end of (business) year" special. I had been considering buying a NAS for a while as I have lots of data collected over years from many different computers scattered over lots of drives (including several copies of that data), and having a definitive central copy of that data would make things a lot easier. My other hope is to finally get rid of the attached external drive by my main workstation (which has been full for a while anyway), as that is the loudest thing near my work area (at least when it spins up; and the drive spin up causes annoying disk IO pauses even on things that should in theory just need the internal SSD).

I went with Synology because I have friends who have used them for years, and know that I can get a ssh connection into them to check things. In addition the data recovery options for getting data off the disks elsewhere are pretty good -- it is Linux mdadm and lvm under the hood. The DS216+ II happened to be one on sale, and the bundle turned out to be not that much more expensive (on sale) than buying a DS216j and the drives separately -- so the better RAM and CPU specifications seemed worth the small extra cost, and hot swappable drives are also a useful addition (the DS216j requires opening the case with a screwdriver).

The single Gigabit Ethernet of both models was not a major limitation for me, as my use case is basically "single user", and each of the client machines also has only Gigabit Ethernet (or less); it is very rare I'm using more than one of those client machines at a time. (Besides, the 100MB/s maximum of a single Gigabit Ethernet is still faster than the USB2 speed of older drive attachments, around 48 MB/s due to 480 Mbps -- and, eg, the external drive on my main desktop is USB2 attached due to that being what is available on the Apple Thunderbolt Cinema Display monitor I have.) The 6TB WD Red NAS drives were basically chosen based on price/capacity being reasonable, and expecting to only use 3-4TB in the immediate future. (Only WD Red NAS drives were available in the bundle, but I would probably have chosen them anyway.)

Because the DS216+ was ordered as a bundle it arrived with the drives pre-installed, and a note attached to check that they were still properly inserted. It also appears to have been delivered with DSM (Disk Station Manager) pre-installed on the drives -- DSM 6.1-15047 to be precise -- which means that I did not have to go through some of the setup steps. But it also meant that it had been preinstalled with some defaults that I did not necessarily want -- so I chose to delete the disk volume and start again (given that they apparently cannot be shrunk, and I do want to leave space for more than one volume at this stage).

Out of the box, the DS216+ found an IP address with DHCP, and then was reachable on http://IP:5000/ and also on http://diskstation.local:5000/ -- the latter being found by Multicast DNS (mDNS)/Bonjour. The default username was admin, and it appears if you do not complete all the setup the default password is no password (ie, enter admin and then just press enter for the password).

My first setup step was to assign a static DHCP lease for the DS216+ MAC address, and a DNS name, so that I could more easily find it (nas01 in my local domain). The only way I could find to persuade the DS216+ to switch over to the new IP address was to force it to restart ("person" icon -> Restart).
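
If, for instance, your DHCP server is ISC dhcpd, the static lease is just a host stanza along these lines (the MAC address and IP here are invented for illustration):

host nas01 {
    hardware ethernet 00:11:32:aa:bb:cc;   # MAC address of the DS216+ (made up here)
    fixed-address 192.168.100.10;          # whatever IP you gave it in DNS as nas01
}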

Once that was done, it seemed worth updating to the latest DSM, which is currently 6.1-15047-2 that appears to just have some bug fixes for 6.1-15047. To do that in the DSM interface (http://nas01:5000/) go to the Control Panel -> Update & Restore, and it should tell you that a new DSM is available and offer to download it. Clicking on Download will download the software off the Internet, and when that finishes clicking on "Update Now" will install the update. After the "are you sure you want to do this now" prompt, warning you that it will reset the DS216+, the update will start and then the DS216+ will restart. It said it would take up to 10 minutes, but actually took about 2 minutes (presumably at least in part due to being a minor software update).

The other "attention needed" task was an update in the Package Center, which is Synology's "app store". It needed me to agree to the Package Center Terms of Service, and then I could see there was an update to the "File Station" application which I assume is in the default install. I also updated that at this point (by clicking on "Update", which seemed to do everything fairly transparently).

At this point it also seemed useful to create a user for myself, and set the "admin" password to something longer than an empty string. Both are done in the Control Panel -> User area. There are a lot of options in the new user creation (around volume access, and quotas), but I left them all at the default other than putting my user into the administrators group so that it could be used via ssh.

With the user/passwords set up, I could ssh into the DS216+ (since ssh seemed to be on by default):

ssh nas01

and look around at how things were set up out of the box.

The DS216+ has a Linux 3.10 kernel:

ewen@nas01:/$ uname -a
Linux nas01 3.10.102 #15047 SMP Thu Feb 23 02:23:28 CST 2017 x86_64 GNU/Linux synology_braswell_216+II
ewen@nas01:/$

with a dual core Intel N3060 CPU:

ewen@nas01:/$ grep "model name" /proc/cpuinfo
model name  : Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz
model name  : Intel(R) Celeron(R) CPU  N3060  @ 1.60GHz
ewen@nas01:/$

The two physical hard drives appear as SATA ("SCSI") disks, along with what looks like a third internal disk:

ewen@nas01:/$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: WDC      Model: WD60EFRX-68L0BN1         Rev: 82.0
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: WDC      Model: WD60EFRX-68L0BN1         Rev: 82.0
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: Synology Model: DiskStation              Rev: PMAP
  Type:   Direct-Access                    ANSI  SCSI revision: 06
ewen@nas01:/$

On the first two disks there are three Linux MD RAID partitions:

ewen@nas01:/$ sudo fdisk -l /dev/sda
Disk /dev/sda: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: FB4736A9-5AAF-4D25-905D-97A8A8035FC2

Device       Start         End     Sectors  Size Type
/dev/sda1     2048     4982527     4980480  2.4G Linux RAID
/dev/sda2  4982528     9176831     4194304    2G Linux RAID
/dev/sda5  9453280 11720838239 11711384960  5.5T Linux RAID
ewen@nas01:/$

ewen@nas01:/$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 5.5 TiB, 6001175126016 bytes, 11721045168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 1322083B-9F26-47B5-825A-56C09FAB9C39

Device       Start         End     Sectors  Size Type
/dev/sdb1     2048     4982527     4980480  2.4G Linux RAID
/dev/sdb2  4982528     9176831     4194304    2G Linux RAID
/dev/sdb5  9453280 11720838239 11711384960  5.5T Linux RAID
ewen@nas01:/$

which are then joined together into three Linux MD software RAID arrays, using RAID 1 (mirroring):

ewen@nas01:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda5[0] sdb5[1]
      5855691456 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
ewen@nas01:/$

The first is used for the root file system:

ewen@nas01:/$ mount | grep md0
/dev/md0 on / type ext4 (rw,relatime,journal_checksum,barrier,data=ordered)
ewen@nas01:/$

The second is used as a swap volume:

ewen@nas01:/$ grep md1 /proc/swaps
/dev/md1                                partition   2097084 0   -1
ewen@nas01:/$

and the third is used for LVM:

ewen@nas01:/$ sudo pvdisplay
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               vg1000
  PV Size               5.45 TiB / not usable 704.00 KiB
  Allocatable           yes (but full)
  PE Size               4.00 MiB
  Total PE              1429612
  Free PE               0
  Allocated PE          1429612
  PV UUID               mcSYoC-774T-T6Qj-bk1g-juLe-bqfi-cPRBCS

ewen@nas01:/$

By default there is one volume group:

ewen@nas01:/$ sudo vgdisplay
  --- Volume group ---
  VG Name               vg1000
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.45 TiB
  PE Size               4.00 MiB
  Total PE              1429612
  Alloc PE / Size       1429612 / 5.45 TiB
  Free  PE / Size       0 / 0
  VG UUID               Qw9A2i-F3aQ-txow-XUIk-OP6o-pVCf-sIsz1g

ewen@nas01:/$

with a single volume in it:

ewen@nas01:/$ sudo lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg1000/lv
  LV Name                lv
  VG Name                vg1000
  LV UUID                KRcrco-cOGl-gdOt-GVJ7-IWvc-jogO-ZqyA4G
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                5.45 TiB
  Current LE             1429612
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           253:0

ewen@nas01:/$

I believe this is the result of going through the default setup process and choosing a "quick" volume -- resulting in a single volume on RAID. This appears to result in the single data RAID 1, with a single LVM volume group and logical volume -- and not be possible to shrink, or turn into a multi-volume setup without adding hard drives, which obviously is not possible in a two drive chassis.

After some reading, my aim is a SHR (Synology Hybrid RAID)/RAID 1 disk group, with about a 3.5TB disk volume for the initial storage, and the rest left for future use (either expanding the existing volume or, eg, presenting as iSCSI LUNs). In the case of a two drive system Synology Hybrid RAID is basically just a way to say "RAID 1", but possibly having it recorded on disk that way would allow transferring the disks to a larger (more drive bays) unit later on.

That 3.5TB layout is chosen knowing that the recommended Time Machine Server setup is to use a share out of a common volume, with a disk quota to limit the maximum disk usage -- rather than a separate volume, which was my original idea. (The DS216+ can also create a file-backed iSCSI LUN, but the performance is probably not as good, so I would rather keep my options open to have more than one volume.)

The DS216+ II (unlike the DS216j) will support btrfs as a local file system (on wikipedia), which is a Linux file system that has been "in development" for about 10 years, designed to compete with the ZFS file system originally developed by Sun Microsystems. Historically btrfs has been fairly untrusted (with multiple people reporting data loss in the early years), but it has been the default file system for SLES 12 since 2014, and it is also now the default file system for the DS216+. Apparently btrfs is also heavily used at Facebook. The stability of btrfs appears to depend on the features you need, with much of the core file system functionality being listed as "OK" in recent kernels -- which is around Linux 4.9 at present, about 4 years newer than the Linux 3.10 kernel, presumably with many patches, running on the DS216+. (Hopefully missing some or all of those 4 years of development does not cause btrfs stability issues...)

Since the btrfs metadata and data checksums seem useful in a large file system, and the snapshot functionality might be useful, I decided to stick with the Synology DS216+ default of btrfs. Hopefully the older Linux kernel (and thus older btrfs code) does not bite me! (The "quotas for shared folders" are also potentially useful, eg, for the Time Machine Server use case.)
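
As a side note, once a btrfs volume exists those checksums can be exercised from the ssh session with a scrub -- a sketch, assuming the volume ends up mounted as /volume1 as is usual on DSM:

sudo btrfs scrub start /volume1     # read and verify all data/metadata checksums
sudo btrfs scrub status /volume1    # check progress and any errors found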

Given that there is (a) no way to shrink a volume that I could find, and (b) no way to convert a volume to a disk group (without adding disks, which I cannot do), my next step was to delete the pre-configured, empty, volume so that I could start the disk layout again. To do this go to the main menu -> Storage Manager -> Volume, and choose to remove the volume.

There were two confirmation prompts -- one to remove the volume, and one "are you sure" warning that data will be deleted, and services will restart. Finally it asked for the account password before continuing, which is a useful verification step for such a destructive action (although you do have to remember which user you used to log in, and thus which password applies -- there does not seem to be anything displaying the logged in user).

The removal process is very thorough -- after removal there is no LVM configuration left on the system, and the md2 RAID array is removed as well:

ewen@nas01:/$ sudo lvdisplay
ewen@nas01:/$ sudo vgdisplay
ewen@nas01:/$ sudo pvdisplay
ewen@nas01:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
ewen@nas01:/$

so you are effectively back to a bare system, which amongst other things will mean that the RAID array gets rebuilt from scratch. (I had sort of hoped to avoid that for time reasons -- but at least forcing it to be rebuilt will also force a check of reading/writing the disks, which is a useful step prior to trusting it with "real" data.)

Once you are back to an empty system, it is possible to go back through the volume creation wizard and choose "custom" and "multi-volume", but I chose to explicitly create the Disk Group first, by going to Storage Manager -> Disk Group, and agreeing to use the two disks that it found. There was a warning that all data on the disks would be erased, and then I could choose the desired RAID mode -- I choose Synology Hybrid RAID (SHR) to leave my options open, as discussed above. I also chose to perform the optional disk check given that these are new drives which I have not tested before. Finally it wanted a description for the disk group, which I have called "shr1". (An example with pictures.)

Once that was applied (which took a few seconds as described in the wizard) there was a new md2 raid partition on the disk, which was rebuilding:

ewen@nas01:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb5[1] sda5[0]
      5855691456 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.0% (2592768/5855691456) finish=790.1min speed=123465K/sec

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
ewen@nas01:/$

as well as new LVM physical volumes and volume groups:

ewen@nas01:/$ sudo pvdisplay
  --- Physical volume ---
  PV Name               /dev/md2
  VG Name               vg1
  PV Size               5.45 TiB / not usable 704.00 KiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              1429612
  Free PE               1429609
  Allocated PE          3
  PV UUID               l03e6f-X3Wa-zGsW-a6yo-3NKG-5YI9-5ghHit

ewen@nas01:/$ sudo vgdisplay
  --- Volume group ---
  VG Name               vg1
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  2
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               0
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.45 TiB
  PE Size               4.00 MiB
  Total PE              1429612
  Alloc PE / Size       3 / 12.00 MiB
  Free  PE / Size       1429609 / 5.45 TiB
  VG UUID               RjMnEQ-IKst-3N2V-3vJb-s8GE-15RO-qQOdOc

ewen@nas01:/$

And to my surprise there was even a small LVM logical volume:

ewen@nas01:/$ sudo lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg1/syno_vg_reserved_area
  LV Name                syno_vg_reserved_area
  VG Name                vg1
  LV UUID                4IdgrT-c5A6-3IOo-6Tq6-3rej-9nL9-i2SQou
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                12.00 MiB
  Current LE             3
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     384
  Block device           253:0

ewen@nas01:/$

That syno_vg_reserved_area volume seems to appear in other installs too, but I do not know what it is used for (other than perhaps as a marker that there is a "real" Disk Group and multiple volumes).

Since even once the MD RAID 1 rebuild picked up to full speed:

ewen@nas01:/$ cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb5[1] sda5[0]
      5855691456 blocks super 1.2 [2/2] [UU]
      [>....................]  resync =  0.6% (40464000/5855691456) finish=585.6min speed=165497K/sec

md1 : active raid1 sda2[0] sdb2[1]
      2097088 blocks [2/2] [UU]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [2/2] [UU]

unused devices: <none>
ewen@nas01:/$

it was going to take about 10 hours to finish the rebuild, I left the DS216+ to its own devices overnight before carrying on.

As an aside, since there is an implicit "Disk Group" (RAID set, LVM volume group) even in the "One Volume" case, it is not obvious to me why the Synology DSM chose to also delete the original "Disk Group" (RAID set) when the single volume was deleted -- it could have just dropped the logical volume, and left the RAID alone, saving a lot of disk IO. Possibly the quick setup should more explicitly create a Disk Group, so that an easier transition becomes an obvious option, rather than retaining what appear to be two distinct code paths.

By the next morning the RAID array had rebuilt. I then forced an extended SMART disk check on each disk in turn by going to Storage Manager -> HDD/SSD, highlighting the disk in question, and clicking on "Health Info", then setting up the test in the "S.M.A.R.T Test" tab. Each Extended Disk Test took about 11 hours, which I left running while doing other things. I did them approximately one at a time, so that the DS216+ RAID array could still be somewhat responsive -- but ended up with a slight overlap as I started the second one just before going to bed, and the first one had not quite finished by then. (It turns out that I got a bonus second extended disk check on the first disk, because there is a Smart Test scheduled to run once a week on all disks starting at 22:00 on Saturday -- and that must have kicked in on the first disk minutes after the one I manually started in the morning finished, but of course by then the manual one on the second disk was already running.)
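
For reference, the same extended test can also be started from the ssh session instead of the web interface (a sketch, using the same -d ata device type as the smartctl output below):

sudo smartctl -d ata -t long /dev/sda    # extended (long) self-test on the first drive
sudo smartctl -d ata -t long /dev/sdb    # and the second, ideally after the first finishes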

The results of the S.M.A.R.T tests are visible in the "History" tab of the "Health Info" page for each drive (in Storage Manager -> HDD/SSD), and I also checked them via the ssh connection:

ewen@nas01:/$ sudo smartctl -d ata -l selftest /dev/sda
smartctl 6.5 (build date Feb 14 2017) [x86_64-linux-3.10.102] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       105         -
# 2  Extended offline    Completed without error       00%        92         -
# 3  Short offline       Completed without error       00%        63         -
# 4  Extended offline    Completed without error       00%        42         -

ewen@nas01:/$ sudo smartctl -d ata -l selftest /dev/sdb
smartctl 6.5 (build date Feb 14 2017) [x86_64-linux-3.10.102] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       105         -
# 2  Short offline       Completed without error       00%        63         -
# 3  Extended offline    Completed without error       00%        42         -

ewen@nas01:/$

just to be sure I knew where to find them later. (That also reveals that there was an Extended test and a short test done before the drives were shipped to me; presumably by the distributor of the "DS216+ and drives" bundle.)

Once that was done, I created a new Volume to hold the 3.5TB of data that I had in mind originally, leaving the remaining space for future expansion. Since there was already a manually created Disk Group, the Storage Manager -> Volume -> Create process automatically selected a Custom setup (and Quick was greyed out). It also automatically selected Multiple Volumes on RAID (and Single Volume on RAID was greyed out), and "Choose an existing Disk Group" (with "Create a new Disk Group" being greyed out), since there are only two disks in the DS216+, both already used in the Disk Group created above.

It told me there was 5.45TB available, which is about right for "6" TB drives less some overhead for the DSM software install (about 4.5GB AFAICT -- 2.4GB for root on md0 and 2GB for swap on md1). As described above I chose btrfs for the disk volume, and then 3584 GB (3.5 * 1024) for the size (out of a maximum 5585 GB available, so leaving roughly 2TB free for later use). For the description I used "Shared data on SHR1" (it appears to be used only within the web interface and editable later). After applying the changes there was roughly 3.36 TiB available in the volume (with 58.7MB used by the system -- I assume file system structure) -- and a /dev/vg1/volume_1 volume created in the LVM:

ewen@nas01:/$ sudo lvdisplay
  --- Logical volume ---
  LV Path                /dev/vg1/syno_vg_reserved_area
  LV Name                syno_vg_reserved_area
  VG Name                vg1
  LV UUID                4IdgrT-c5A6-3IOo-6Tq6-3rej-9nL9-i2SQou
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 0
  LV Size                12.00 MiB
  Current LE             3
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     384
  Block device           253:0

  --- Logical volume ---
  LV Path                /dev/vg1/volume_1
  LV Name                volume_1
  VG Name                vg1
  LV UUID                J9FKic-QYdA-mTCK-W01z-dO7V-GDk6-JD41mC
  LV Write Access        read/write
  LV Creation host, time ,
  LV Status              available
  # open                 1
  LV Size                3.50 TiB
  Current LE             917504
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     4096
  Block device           253:1

ewen@nas01:/$

which shows a 3.5TiB volume. There is 1.95TiB left:

ewen@nas01:/$ sudo vgdisplay
  --- Volume group ---
  VG Name               vg1
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               5.45 TiB
  PE Size               4.00 MiB
  Total PE              1429612
  Alloc PE / Size       917507 / 3.50 TiB
  Free  PE / Size       512105 / 1.95 TiB
  VG UUID               RjMnEQ-IKst-3N2V-3vJb-s8GE-15RO-qQOdOc

ewen@nas01:/$

for future expansion (either of that volume, or creating new volumes).

The new volume was automatically mounted on volume1:

ewen@nas01:/$ mount | grep vg1-volume_1
/dev/mapper/vg1-volume_1 on /volume1 type btrfs (rw,relatime,synoacl,nospace_cache,flushoncommit_threshold=1000,metadata_ratio=50)
ewen@nas01:/$

ready to be used (eg, by creating shares).

From here I was ready to create shares for the various data that I wanted to store, which I will do over time. It appears that thanks to choosing btrfs I can have quotas on the shares as well as the users, which may be useful for things like Time Machine backups.

ETA, 2017-04-23: Some additional file sharing setup:

  • In Control Panel -> File Services -> SMB/AFP/NFS -> SMB -> Advanced Settings, change the Maximum SMB protocol to "SMB3" and the Minimum SMB Protocol to "SMB2" (Pro Tip: Stop using SMB1!)

  • Also in Control Panel -> File Services -> SMB/AFP/NFS -> SMB -> Advanced Settings, tick "Allow symbolic links within shared folders"

  • In Control Panel -> File Services -> SMB/AFP/NFS -> NFS, tick "Enable NFS" to simplify automatic mounting from Linux systems without passwords. Also tick "Enable NFSv4 support" to allow NFSv4 mounting, which allows more flexibility around authentication and UID/GID mapping than earlier NFS versions (earlier NFS versions basically assumed you had a way to enforce the same UID/GID enterprise wide, via NIS, LDAP or similar).

Once that is done, new file shares can be created in Control Panel -> Shared Folder -> Create. With btrfs you also get an Advanced -> "Enable advanced data integrity protection" option, which seems to be on by default, and is useful to have enabled. If you do not want a #recycle directory in your share it is best to untick the "Enable Recycle Bin" option on the first page (that option seems most useful on shares intended for Microsoft Windows systems, and an annoying top level directory anywhere else).

Once the shared folder is created you can grant access to users/groups, and if NFS is turned on you can also grant access to machines (since NFS clients are authenticated by IP) in the "NFS Permissions" tab. Obviously you then have all the usual unix UID/GID issues after that if you are using NFS v3 or NFSv4 without ID mapping, and do not have synchronised UID/GID values across your whole network (which I do not, not least because the Synology DS216+ makes up its own local uid values).

I had hoped to get NFS v4 ID mapping working, by setting the "NFSv4 domain" to the same string on the Synology DS216+ and the clients (on the Synology it appears to default to an empty string; on Linux clients it effectively defaults to the DNS domain name). But even setting both of those (in /etc/idmapd.conf on Linux) did not result in idmapping happening :-( As best I can tell this is because Linux defaults to sec=sys for NFSv4 mounts, the Synology DS216+ defaults to AUTH_SYS (which turns into sec=sys) for NFS shares, and UID mapping does not happen with sec=sys, because what is passed over the wire is still NFS v3 style UID/GID. (See confirmation from the Linux NFS maintainer that this is intended by modern NFS; the same confirmation can be found in RFC 7530.) Also of note, in sec=sys (AUTH_SYS) NFS the UID/GID values are used for authentication, even if file system UID/GID mapping is happening for what is displayed, which causes confusion. (From my tests no keys appear in /proc/keys, indicating no ID mappings are being created.)
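
For reference, the Linux client side of that setting is the Domain value in /etc/idmapd.conf, roughly as below (with example.com standing in for whatever common string is chosen, and the idmapping daemon restarted afterwards):

[General]
Domain = example.com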

There is no UID/GID mapping because /sys/module/nfs/parameters/nfs4_disable_idmapping is set to "Y" by default on the (Linux) client, and /sys/module/nfsd/parameters/nfs4_disable_idmapping is set to "Y" by default on the Synology DS216+. This is a change from 2012 to the client, and another change from 2012 for the server, apparently for backwards compatibility with NFS v3. These changes appear to have landed in Linux 3.4; and both my Linux client and the Synology have Linux kernels newer than 3.4.

The idea seems to be that if the unix UID/GID (ie, AUTH_SYS) are used for authentication then they should also be used in the file system, as happened in NFS v3 (to avoid files being owned by nobody:nogroup due to mapping failing). The default is thus to disable the id mapping at both ends in the sec=sys / AUTH_SYS case. It is possible to change the default on the Linux client (eg, echo "N" into /sys/module/nfs/parameters/nfs4_disable_idmapping), but I cannot find a way to persistently change it on the Synology DS216+, which means that NFS v4 id mapping can really only be used with Kerberos-based authentication :-( (In sec=sys mode, you can see the UID/GID going over the wire, so idmap does not work. This is mostly a NFS, and NFS v4 in particular, issue rather than a Synology NAS issue as such.)
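
For completeness, on the Linux client the non-persistent change is just a write to the sysfs parameter, and the usual way to make it persist across reboots would be a modprobe options file (the file name below is arbitrary; and as noted this does not help while the Synology end still defaults to "Y"):

echo N | sudo tee /sys/module/nfs/parameters/nfs4_disable_idmapping
echo "options nfs nfs4_disable_idmapping=N" | sudo tee /etc/modprobe.d/nfs-idmapping.conf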

Anyway, effectively this means that in order to use the UID/GID mapping in NFS v4, you need to set up Kerberos authentication, and then presumably add those Kerberos keys into the Synology DS216+ in Control Panel -> File Services -> SMB/AFP/NFS -> NFS -> Advanced Settings -> Kerberos Settings, and set up the ID mapping. All of which feels like too much work for now. (It seems other Synology users wish UID/GID mapping worked without Kerberos too; it is unfortunate there is no UID:UID mapping option available as a NFS translation layer, but that is not the approach taken by NFS v4. The only reference I could find to a NFS server with UID:UID mapping was the old Linux user-mode NFS server with map_static, which is no longer used, and thus not available on a Synology NAS.)

It is possible to set NFS "Squash: Map all users to admin" to create effectively a single UID file share, which is sufficient for some of my simple shares (eg, music), so that is what I have done for now. (See a simple example with screenshots and another example with screenshots; see also Synology notes on NFS Security Flavours.)

Setting "Squash: Map all users to admin" in the UI, turns into all_squash,anonuid=1024,anongid=100 in /etc/exports:

ewen@nas01:/$ sudo cat /etc/exports; echo

/volume1/music  172.21.1.0/24(rw,async,no_wdelay,all_squash,insecure_locks,sec=sys,anonuid=1024,anongid=100)
ewen@nas01:/$

and results in files that are owned by uid 1024 and gid 100 no matter which user created them. I could then mount the share on my Linux client with:

ewen@client:~$ sudo mkdir /nas01
ewen@client:~$ sudo mkdir /nas01/music
ewen@client:~$ sudo mount -t nfs -o hard,bg,intr,rsize=65536,wsize=65536  nas01:/volume1/music /nas01/music/

and then look at with:

ewen@client:~$ ls -l /nas01/music/
total 0
drwxrwxrwx 1 1024 users 1142 Sep 10  2016 flac
ewen@client:~$

For my network that is mostly acceptable for basic ("equal access for all") file shares, as gid 100 is "users" on my Linux machines, and thus most machines have my user in that group. (Unfortunately there is no way in the UI to specify that all access should be squashed to a specific user-specified uid, or I would squash them to my own user in these simple cases. There is also no apparent way to assign uids to the Synology DS216+ users when they are created, so presumably the only way to set the UIDs of users is by having them supplied by a directory server like LDAP.)

The main issue I notice (eg, with rsync) is that attempts to chown or chgrp files as root fail with "Invalid argument", so this will not work for anything requiring "root" ownership. (I found this while rsyncing music onto the share, but all the instances of music files owned by root were mistakes, so I fixed them at the source and re-ran rsync.)
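
One workaround, where fixing the source is not practical, is simply not to ask rsync to preserve owner/group when copying onto an all-squashed share -- ie, something like the following rather than rsync -a (which implies -o and -g); /local/music/ is just a placeholder for the source directory:

rsync -rltpvz /local/music/ /nas01/music/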

For more complicated shares I probably need either to use SMB mounts, with appropriate username/password authentication to get access to the share as that user (which also effectively results in single-user access to the share, but will properly map the user values for the user I am accessing as), or to dedicate the NFS share to a single machine, in which case it can function without ID mapping, as the file IDs will be used only by that machine.

Note that on OS X cifs:// forces SMBv1 over TCP/445, and we turned SMBv1 off above -- so use smb:// to connect to the NAS from OS X Finder (Go -> Connect to Server... (Apple-K)), which has used SMB 2.0 since OS X 10.9 (Mavericks). (CIFS is rarely used these days; instead SMB2 and SMB3 are used, which also work over TCP/445 -- TCP/445 was one of the distinguishing features of the original Microsoft CIFS implementation. By contrast the Linux kernel "CIFS" client has supported SMB 2.0 since Linux 3.7, so Linux has hung onto the CIFS name longer than other systems; it now supports CIFS, SMB2, SMB2.1 and SMB3, implemented by the Samba team.)
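
For the record, on Linux (with cifs-utils installed) mounting one of these shares over SMB looks roughly like the following; the share, user, and mount point are placeholders, vers=3.0 matches the SMB3 maximum configured above, and uid=/gid= make the files appear owned by the local user:

sudo mkdir -p /mnt/nas01-music
sudo mount -t cifs //nas01/music /mnt/nas01-music -o username=ewen,vers=3.0,uid=$(id -u),gid=$(id -g)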

On a related note, while testing git-annex on a SMB mount I encountered a timeout, so I ended up installing a later version of git-annex. That allowed git annex init to complete, but transferring files around still failed with locking issues. (Possibly the ssh shell and the git server application for the Synology NAS provide another path to getting git-annex working? See the example of using the git server application. Or use that plus a standalone build of git-annex on the Synology NAS. Another option is the git annex rsync special remote, but that is content only and I think might only have the internal (SHA hash) filenames.)

ETA, 2017-05-26: While trying to patch the Synology for the Samba bug (fixed in 6.1.1-15101-4) I ran into the "Cannot connect to the Internet" issue in the update screen, despite having working IPv4 and IPv6 connectivity (as tested from a ssh session). On a hunch I checked the IPv6 settings, and found that the IPv6 DNS server there was pointed at my ISP-supplied home gateway rather than my internal DHCP server (used for IPv4) -- which resulted in the Synology trying to use both. So I tried disabling IPv6, but that did not seem sufficient (it is not clear if it ever tried reconnecting); a reboot with IPv6 disabled did seem to be sufficient. Since I am not actively using IPv6 internally at present, for now I am going to leave IPv6 turned off on the Synology to see if that makes any difference. (My desktops have not had any issues with IPv6 being enabled on my home gateway, but they appear to only be using the internal DNS server AFAICT -- so maybe the issue is the DNS server on the home gateway not responding? In which case perhaps a static IPv6 configuration would fix the issue.)

Unfortunately it does not seem to be well documented precisely what the web interface tries to connect to, and when, to find out if there are updates -- which makes debugging the exact root cause more difficult. However there are forum posts on how to do the upgrade from the ssh shell using the synoupgrade tool, which may help if the problem returns later.

Posted Sun Apr 16 11:48:21 2017 Tags:

Introduction

A couple of months ago I bought a Numato Mimas v2 with the intention of running MicroPython on it.

Today, with a bit of guess work, a lot of CPU time, and some assistance from the #upy-fpga channel on FreeNode I managed to get it going. Below are my notes on how to get MicroPython on FPGAs running on my Numato Mimas v2. This project is very much a work in progress (I am told multiple people were working on it this weekend), so if you are following this guide later I would definitely suggest seeking out updated instructions.

ETA, 2017-05-07: Indeed new getting started instructions were posted a few weeks later; see the update at the end of this post for more details on the later install approach.

Prerequisites

  • Ubuntu 16.04 LTS x86_64 system

  • Numato Mimas V2 Spartan6 FPGA board, with MimasV2Config.py set up to be able to upload "gateware" to the FPGA board (there is also a copy of MimasV2Config.py installed as part of the environment setup below, which is used for flashing the MicroPython FPGA gateware).

  • USB A to USB Mini B cable, to connect Numato Mimas V2 to the Ubuntu 16.04 LTS system.

  • Xilinx ISE WebPACK installed, reachable from /opt/Xilinx (or optionally installed within the Xilinx directory inside your build directory).

Before you begin it would be a very good idea to check that the Numato Mimas v2 sample.bin example will run on your Mimas v2, and that you can successfully replace it with a program of your own (eg, the Numato tutorial synthesis example).
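
That check uses the same tool and switch position as the flashing steps below; with the Mimas v2 in program mode and sample.bin saved in the current directory, it is roughly:

MimasV2Config.py /dev/ttyACM0 sample.bin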

Building the gateware

The "gateware" consists of the compiled FPGA definitions of the soft CPU (lm32) and peripheral devices that you need. For MicroPython a relatively small "base" set is sufficient.

Setup

Clone the upy-fpga-litex-gateware repository, which originated with the HDMI2USB project (hence the dependencies listed):

git clone https://github.com/upy-fpga/upy-fpga-litex-gateware

Install the relevant bits of the environment in two parts, firstly as root:

cd upy-fpga-litex-gateware
sudo bash -x scripts/download-env-root.sh

which will install dozens of packages as direct or indirect dependencies, including some from Tim Ansell's Ubuntu PPA.

And then as a non-root user (eg, your own user) for the remaining parts:

cd upy-fpga-litex-gateware
bash -x scripts/download-env.sh

which will download a bunch more packages, and execute/install them. Among other things it installs the lm32 cross build tools (binutils, gcc, etc) as pre-built binaries. The install process is managed with conda, a Python environment management tool. (It currently installs a Python 3.6 environment, and then downgrades it to Python 3.5.1 for compatibility, as well as a lot of other old tools.)

It also automatically git clones the relevant Enjoy Digital litex git modules, as listed in the README.

The environment install process will take several minutes, mostly depending on the download speed.

Build

From a terminal which has not entered the Xilinx ISE WebPACK environment, set the desired PLATFORM and TARGET to select what will be built, then enter the upy-fpga environment:

cd upy-fpga-litex-gateware
PLATFORM=mimasv2
TARGET=base
export PLATFORM TARGET
source scripts/enter-env.sh

All going well, it should do some checking, report the directories being used, and then change the prompt to include the PLATFORM and TARGET values. Eg,

(H2U P=mimasv2 T=base R=nextgen)

make help will show you the valid PLATFORM and TARGET values, but cannot be run until after scripts/enter-env.sh has been done; in theory you can change PLATFORM and TARGET after entering the environment, but it might be safest to start with a fresh terminal. (README.targets has some information on the possible TARGET values.)

From there, you can build the "gateware" for your selected PLATFORM/TARGET combination with:

make gateware

which will result in a lot of output, most of it from the Xilinx ISE WebPACK tools. This step will also take a few minutes, and will keep your CPU pretty busy. All going well you should end up with a build/mimasv2_base_lm32/gateware/top.bin file which is the system on a chip to be loaded onto the Mimas V2.

Next you can build the "firmware" to run on the softcore CPU to provide MicroPython on the FPGA. You can build this for your selected PLATFORM/TARGET combination with:

make firmware

This step appears to use a pre-compiled firmware file, and builds quite quickly. It should result in a build/mimasv2_base_lm32/software/firmware/firmware.bin file.

Gateware and Firmware install

Ensure that the Numato Mimas v2 "operation mode" switch (SW7) is set to program mode -- the side nearest the USB connector is program mode (see the Numato Mimas V2 documentation).

Bundle up the gateware, BIOS, and firmware together with:

make image

(which runs ./mkimage.py), to create build/mimasv2_base_lm32/flash.bin.

Then install the gateware, BIOS and firmware bundle with:

make image-flash

(which effectively runs make image-flash-mimasv2 due to the PLATFORM setting).

Because the upload happens at 19200 bps, this will take a couple of minutes to complete -- it does an erase cycle, a write cycle, and a read-back verification cycle.

The upload process looks something like:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ make gateware-flash-mimasv2
python $(which MimasV2Config.py) /dev/ttyACM0 build/mimasv2_base_lm32//flash.bin
****************************************
* Numato Lab Mimas V2 Configuration Tool *
****************************************
Micron M25P16 SPI Flash detected
Loading file build/mimasv2_base_lm32//flash.bin...
Erasing flash sectors...
Writing to flash 100% complete...
Verifying flash contents...
Flash verification successful...
Booting FPGA...
Done.
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

The Mimas v2 will reboot into the default firmware which is not MicroPython (see later in the document on getting MicroPython running).

Modern lm32 build environment for MicroPython

Building MicroPython needs a fairly recent cross compiler build environment, newer than the one installed by the gateware build environment above.

This step is best done in a terminal which does not have the gateware configuration (above) in it, so start a fresh terminal.

To build this newer crosscompiler environment clone the lm32-build-scripts repository:

git clone https://github.com/shenki/lm32-build-scripts.git

Then build the cross compilers on your system with:

cd lm32-build-scripts
./build-lm32-toolchain.sh

It will download the source for several key build tools (gcc, gmp, mpfr, mpc, binutils, gdb, etc), and then use them to build a cross compiler for lm32, designed to be installed in /opt/lm32.

This step will take several minutes, particularly to download the required source to build. Expect your CPU to be very busy for a while as it does a make -j32 when building everything; this also makes the console output fairly tricky to follow, and makes it fairly difficult to tell how far through the build process it has got. (Currently there does not seem to be any check that the downloads are complete, or as intended -- nor any stop and continue steps in the build process -- so it is a bit "hope this works". There is partial support for building a Docker environment with the cross-compilers, but it appears to do a one-shot build and then remove them; presumably it is there only for testing that the build scripts work.)

Assuming that it finishes without obvious error, and the return code is 0:

echo $?

then we can assume that it worked. The build directory should include a lot of built code (approximately 2GB).

The built code can then be installed somewhere central with:

sudo mkdir /opt/lm32
sudo chown $USER:$USER /opt/lm32
(cd build && make install)

which will also generate a lot of output, but run much quicker.

After this /opt/lm32/bin should contain a bunch of useful cross-compile tools, eg:

ewen@parthenon:/opt/lm32/bin$ ls
lm32-elf-addr2line  lm32-elf-gcc-6.2.0   lm32-elf-gprof    lm32-elf-readelf
lm32-elf-ar         lm32-elf-gcc-ar      lm32-elf-ld       lm32-elf-run
lm32-elf-as         lm32-elf-gcc-nm      lm32-elf-ld.bfd   lm32-elf-size
lm32-elf-c++filt    lm32-elf-gcc-ranlib  lm32-elf-nm       lm32-elf-strings
lm32-elf-cpp        lm32-elf-gcov        lm32-elf-objcopy  lm32-elf-strip
lm32-elf-elfedit    lm32-elf-gcov-tool   lm32-elf-objdump
lm32-elf-gcc        lm32-elf-gdb         lm32-elf-ranlib
ewen@parthenon:/opt/lm32/bin$

MicroPython

MicroPython is also best built in a new terminal, without the gateware build environment variables. It needs to be built from an in-development repository with changes for MicroPython on FPGA and the Mimas v2.

In a fresh terminal, clone the forked MicroPython repository, with Mimas V2 support in it:

git clone https://github.com/upy-fpga/micropython.git

(There are other repositories too; I chose this one to try first as it had been reported as working on the Mimas v2. Apparently the lm32-v2 branch is the main one being worked on at present.)

Enter the repository, and checkout the lm32-mimas2 branch:

cd micropython
git checkout lm32-mimas2

ETA, 2017-03-16: The upy-fpga/micropython has been rebased onto the upstream micropython/micropython, with the lm32 patches merged onto the master branch; it is now best just to use the master branch (and there is no lm32-mimas2 branch any longer).

Change into the lm32 directory, and build with a cross compiler:

cd lm32
PATH="${PATH}:/opt/lm32/bin" make CROSS=1

That should build fairly quickly, and result in a build/firmware.elf file. Convert that into a firmware.bin file that can be uploaded to the Mimas v2 with:

PATH="${PATH}:/opt/lm32/bin" make build/firmware.bin CROSS=1

ETA, 2017-03-13: Apparently one should copy the contents of the build/mimasv2_base_lm32/software/include/generated from the gateware build environment (above) into the micropython/lm32/generated directory before building, to keep them in sync. I did not do this, and presumably it worked due to having an old "compatible enough" version checked in.
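
I did not test it, but based on the paths involved that copy is presumably something like the following, run from the micropython/lm32 directory (and assuming the gateware checkout is a sibling directory of the micropython checkout):

cp -rp ../../upy-fpga-litex-gateware/build/mimasv2_base_lm32/software/include/generated/. generated/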

Installing MicroPython on the Numato Mimas v2

To actually install MicroPython, we have a few options. Firstly we can build a complete flash image including MicroPython instead of the default firmware. Secondly we can reset the soft CPU running the default firmware and trigger an upload of MicroPython to run on that boot instead of the default firmware. Thirdly we can upload a flash image without any default firmware, and rely on always uploading the application we want to run.

Flash image including MicroPython

Return to the gateware build top directory, with the environment set up, ie as before (possibly you still have a suitable terminal open):

cd upy-fpga-litex-gateware
PLATFORM=mimasv2
TARGET=base
export PLATFORM TARGET
source scripts/enter-env.sh

Make a directory to build up the MicroPython flash image:

mkdir micropython
cd micropython

And then copy over the MicroPython firmware.elf and firmware.bin file:

cp -p .../micropython/lm32/build/firmware.bin .

Run:

python -m litex.soc.tools.mkmscimg -f firmware.bin -o firmware.fbi

to build a firmware.fbi file.

Change back up to the top directory, and then use mkimage to build a complete flash image including MicroPython:

cd ..
rm build/mimasv2_base_lm32/flash.bin
./mkimage.py --override-firmware micropython/firmware.fbi

This should build a new flash image in build/mimasv2_base_lm32/flash.bin, with output something like:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ ./mkimage.py --override-firmware micropython/firmware.fbi

Gateware @ 0x00000000 (    341436 bytes) build/mimasv2_base_lm32/gateware/top.bin                     - Xilinx FPGA Bitstream
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff aa 99 55 66 30 a1 00 07 20 00 31 a1 03 80 31 41 3d 00 31 61 09 ee 31 c2 04 00 10 93 30 e1 00 cf 30 c1 00 81 20 00 20 00 20 00 20 00 20 00 20 00
    BIOS @ 0x00080000 (     19356 bytes) build/mimasv2_base_lm32/software/bios/bios.bin               - LiteX BIOS with CRC
98 00 00 00 d0 00 00 00 78 01 00 08 38 21 00 00 d0 e1 00 00 e0 00 00 3b 34 00 00 00 34 00 00 00 e0 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00
Firmware @ 0x00088000 (    153736 bytes) micropython/firmware.fbi                                     - HDMI2USB Firmware in FBI format (loaded into DRAM)
00 02 58 80 36 67 08 1a 98 00 00 00 d0 00 00 00 78 01 40 00 38 21 00 00 d0 e1 00 00 e0 00 00 3b 34 00 00 00 34 00 00 00 e0 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00
----------------------------------------
       Remaining space    1386360 bytes (10 Megabits, 1.32 Megabytes)
           Total space    2097152 bytes (16 Megabits, 2.00 Megabytes)

Flash image: build/mimasv2_base_lm32/flash.bin
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff aa 99 55 66 30 a1 00 07 20 00 31 a1 03 80 31 41 3d 00 31 61 09 ee 31 c2 04 00 10 93 30 e1 00 cf 30 c1 00 81 20 00 20 00 20 00 20 00 20 00 20 00
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

Then this custom flash image can be loaded onto the Numato Mimas v2, by ensuring that the "operation mode" switch (SW7) is in "program mode" (nearest to the USB connector), then running:

MimasV2Config.py /dev/ttyACM0 build/mimasv2_base_lm32/flash.bin

to program MicroPython onto the Mimas v2. This will take a few minutes to write, as it is uploading at 19200 bps.

The result should look something like:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ MimasV2Config.py /dev/ttyACM0 build/mimasv2_base_lm32/flash.bin
****************************************
* Numato Lab Mimas V2 Configuration Tool *
****************************************
Micron M25P16 SPI Flash detected
Loading file build/mimasv2_base_lm32/flash.bin...
Erasing flash sectors...
Writing to flash 100% complete...
Verifying flash contents...
Flash verification successful...
Booting FPGA...
Done.
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

Using the REPL of MicroPython on the FPGA

Unplug the Numato Mimas v2, to power it down.

Move the Numato Mimas V2 "operation mode" switch (SW7) to serial port mode (furthest away from the USB port), so that a terminal program on the computer can communicate with the softcore in the FPGA.

Then run:

screen /dev/ttyACM0 19200

to connect to the MicroPython REPL.

(For some reason MicroPython does not work with flterm as the serial console program, hence "make firmware-connect-mimasv2" which uses flterm does not work; thus the use of screen as a simple terminal emulator. ETA, 2017-05-15: Recent builds of flterm fix this issue so it is not necessary to start screen to interact with MicroPython.)

All going well, if you hit enter a couple of times, you should get a prompt, and then be at the Python REPL:

>>>
>>>
>>>
>>> print("Hello World!")
Hello World!
>>>

To get out of the screen session, use ctrl-a \ (backslash) to quit screen.

Uploading MicroPython at boot

The disadvantage of including MicroPython in the flash image is that the whole system needs to be reflashed for every change to MicroPython. As an alternative it is possible to program the Mimas v2 flash with the default application, and then upload the MicroPython firmware application over the serial link, through the BIOS boot loader.

To do this, build the default firmware image as above, and upload that:

make image
make image-flash

then change the operation mode (SW7) to "serial port" (away from the USB connector), and start flterm to upload the MicroPython firmware.bin into RAM on the Mimas v2:

flterm --port=/dev/ttyACM0 --kernel=micropython/firmware.bin --speed=19200

Once flterm is running, press SW6 (button 6, at top right), to send a reset to the soft CPU. (In theory one should be able to type reboot at the H2U> application prompt, but at present on the Mimas v2 that jumps to the wrong address and just hangs.)

You should see the BIOS/boot loader messages appear, and then it should prompt flterm to send the kernel image to run. The upload should start automatically and look something like:

LiteX SoC BIOS (lm32)
(c) Copyright 2012-2017 Enjoy-Digital
(c) Copyright 2007-2017 M-Labs Limited
Built Mar 12 2017 15:39:00

BIOS CRC passed (cdfe4dda)
Initializing SDRAM...
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[FLTERM] Received firmware download request from the device.
[FLTERM] Uploading kernel (153728 bytes)...
[FLTERM] Upload complete (1.6KB/s).
[FLTERM] Booting the device.
[FLTERM] Done.
Executing booted program.
MicroPython v1.8.7-38-gafd8920 on 2017-03-12; litex with lm32
Type "help()" for more information.
>>>

Unfortunately the problem of MicroPython and flterm disagreeing about something still exists, so once you reach this point, you need to disconnect flterm (ctrl-c) and reconnect with screen at this point to use the REPL (ETA, 2017-03-15: unless you have a recent build of flterm):

screen /dev/ttyACM0 19200

and then the Python REPL should work:

>>>
>>>
>>>
>>> print("hello world!")
hello world!
>>>

To get out of the screen session, use ctrl-a \ (backslash) to quit screen.

A third option: no default application

It is also possible to build the flash image without a default application, and then simply rely on resetting the Mimas v2 and flterm uploading the application to run.

To do this:

rm build/mimasv2_base_lm32/flash.bin
./mkimage.py --override-firmware none

which should result in something like:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ ./mkimage.py --override-firmware none

Gateware @ 0x00000000 (    341436 bytes) build/mimasv2_base_lm32/gateware/top.bin                     - Xilinx FPGA Bitstream
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff aa 99 55 66 30 a1 00 07 20 00 31 a1 03 80 31 41 3d 00 31 61 09 ee 31 c2 04 00 10 93 30 e1 00 cf 30 c1 00 81 20 00 20 00 20 00 20 00 20 00 20 00
    BIOS @ 0x00080000 (     19356 bytes) build/mimasv2_base_lm32/software/bios/bios.bin               - LiteX BIOS with CRC
98 00 00 00 d0 00 00 00 78 01 00 08 38 21 00 00 d0 e1 00 00 e0 00 00 3b 34 00 00 00 34 00 00 00 e0 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00 34 00 00 00
Firmware @ 0x00088000 (         0 bytes) Skipped                                                      - HDMI2USB Firmware in FBI format (loaded into DRAM)

----------------------------------------
       Remaining space    1540096 bytes (11 Megabits, 1.47 Megabytes)
           Total space    2097152 bytes (16 Megabits, 2.00 Megabytes)

Flash image: build/mimasv2_base_lm32/flash.bin
ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff aa 99 55 66 30 a1 00 07 20 00 31 a1 03 80 31 41 3d 00 31 61 09 ee 31 c2 04 00 10 93 30 e1 00 cf 30 c1 00 81 20 00 20 00 20 00 20 00 20 00 20 00
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

That can be programmed onto the Mimas v2, by putting the Mimas v2 into programming mode (SW7 to the side nearest the USB connector), then running:

MimasV2Config.py /dev/ttyACM0 build/mimasv2_base_lm32/flash.bin

Once that completes put the "operation mode" switch (SW7) back to the "serial console" mode (furthest from the USB connector), then run flterm as above:

flterm --port=/dev/ttyACM0 --kernel=micropython/firmware.bin --speed=19200

and hit enter a couple of times. You should get a BIOS> prompt. At that prompt you can type serialboot to kick off the application upload:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ flterm --port=/dev/ttyACM0 --kernel=micropython/firmware.bin --speed=19200
[FLTERM] Starting...

BIOS>
BIOS>
BIOS>
BIOS> serialboot
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[FLTERM] Received firmware download request from the device.
[FLTERM] Uploading kernel (153728 bytes)...
[FLTERM] Upload complete (1.6KB/s).
[FLTERM] Booting the device.
[FLTERM] Done.
Executing booted program.
MicroPython v1.8.7-38-gafd8920 on 2017-03-12; litex with lm32
Type "help()" for more information.
>>>
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

(or just press SW6 to reset the soft CPU back into the start of the boot loader, once you have flterm running, as with the second option above).

As above, you will need to disconnect flterm (ctrl-c) once the MicroPython banner appears, and start screen to interact with the MicroPython REPL (ETA, 2017-03-15: unless you have a recent build of flterm):

screen /dev/ttyACM0 19200

To get out of the screen session, use ctrl-a \ (backslash) to quit screen.

Other references

ETA, 2017-03-13: Lots of proof reading edits, and tweaks based on advice from Tim Ansell.

ETA, 2017-03-15: A newer version of flterm is now available, which does work with MicroPython, so it is now possible to do both the MicroPython firmware upload and interact with MicroPython from one program (ie, no need to exit out to screen).

To update, after building everything, do conda install flterm, which should install flterm 2.4_15_gd17828f-0 timvideos:

(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$ conda install flterm
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /home/ewen/work/naos/src/upy-fpga/upy-fpga-litex-gateware/build/conda:

The following packages will be UPDATED:

    flterm: 0+git20160123_1-0 timvideos --> 2.4_15_gd17828f-0 timvideos

flterm-2.4_15_ 100% |################################| Time: 0:00:01   6.17 kB/s
(H2U P=mimasv2 T=base R=nextgen) ewen@parthenon:~/work/naos/src/upy-fpga/upy-fpga-litex-gateware$

Run flterm as usual (eg, as explained above). You can tell that it is working properly if hitting enter at the MicroPython REPL gives you another prompt, and you are able to enter Python programs.

ETA, 2017-03-15: Katie Bell pointed out that there is a not-merged testing branch which enables controlling the LEDs on the Numato Mimas v2 board. It is in the lm32-leds branch of shenki's MicroPython repository on GitHub, with an example given in the lm32: Add leds module commit comment.

I was able to build it:

git clone https://github.com/shenki/micropython micropython-shenki
cd micropython-shenki
git checkout lm32-leds
cd lm32
PATH="${PATH}:/opt/lm32/bin" make CROSS=1
PATH="${PATH}:/opt/lm32/bin" make build/firmware.bin CROSS=1

and then get it working on my Numato Mimas v2 board with:

cd upy-fpga-litex-gateware
PLATFORM=mimasv2
TARGET=base
export PLATFORM TARGET
source scripts/enter-env.sh
cd micropython
mkdir leds
cd leds
cp -p ..../micropython-shenki/lm32/build/firmware.bin .
python -m litex.soc.tools.mkmscimg -f firmware.bin -o firmware.fbi
cd ../..
flterm --port=/dev/ttyACM0 --kernel=micropython/leds/firmware.bin --speed=19200

and then press SW6 (top right) to reset the soft CPU into the boot loader, and load that newer build of MicroPython.

>>>
LiteX SoC BIOS (lm32)
(c) Copyright 2012-2017 Enjoy-Digital
(c) Copyright 2007-2017 M-Labs Limited
Built Mar 12 2017 15:39:00

BIOS CRC passed (cdfe4dda)
Initializing SDRAM...
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[FLTERM] Received firmware download request from the device.
[FLTERM] Uploading kernel (180724 bytes)...
[FLTERM] Upload complete (1.6KB/s).
[FLTERM] Booting the device.
[FLTERM] Done.
Executing booted program.
MicroPython v1.8.7-182-gd86a88c on 2017-03-15; litex with lm32
>>>
>>> import litex
>>> leds = [ litex.LED(n) for n in range(1,9) ]
>>> print(leds)
[LED(1), LED(2), LED(3), LED(4), LED(5), LED(6), LED(7), LED(8)]
>>> for led in leds:
...     led.on()
...
>>> for led in leds:
...     led.off()
...
GC: total: 1984, used: 1408, free: 576
 No. of 1-blocks: 7, 2-blocks: 7, max blk sz: 32, max free sz: 8
>>>

During the led.on() loop all the LEDs should turn on; during the led.off() loop, all the LEDs should turn off. But note the GC line indicating that about 75% of the memory resources are in use, just holding those 8 LED objects open -- so this may not be the most memory efficient approach.

(Unfortunately this particular MicroPython build does not have as many features turned on as the MicroPython for the ESP8266, so things like time.sleep() do not seem to be available.)

ETA, 2017-03-16: The upy-fpga/micropython has been rebased onto the upstream micropython/micropython, with the lm32 patches merged onto the master branch; it is now best just to use the master branch (and there is no lm32-mimas2 branch any longer).

ETA, 2017-05-07:

Updated upy-fpga bootstrap instructions

Tim Ansell posted new upy-fpga bootstrap instructions, which simplified the steps to get a working upy-fpga system on the Mimas V2 board.

The instructions point at the upy-fpga-litex-gateware "getting started" document, which uses a curl | sudo bash bootstrap method.

Since I am not that fond of curl | sudo bash, I chose a slightly different way to get started. Firstly I downloaded the bootstrap script to review:

curl -fsS -o hdmi2usb-litex-firmware-bootstrap.sh https://raw.githubusercontent.com/timvideos/HDMI2USB-litex-firmware/master/scripts/bootstrap.sh

Then I looked through the script to check that everything seemed sensible. Similar to the approach that I followed originally (described above), it does a recursive clone of the git repository:

https://github.com/timvideos/HDMI2USB-litex-firmware.git

which is effectively the upstream from which upy-fpga-litex-gateware was forked. This results in building from the timvideos/HDMI2USB-litex-firmware GitHub repository instead of the upy-fpga/upy-fpga-litex-gateware repository. (It is possible to point it at the HDMI2USB-litex-firmware.git repository in another GitHub account by setting GITHUB_USER first, but not obviously at another repository name. GITHUB_USER defaults to timvideos, note the singular "tim", as "timvideos" is something else.)

The recursive clone will pick up a number of third party git repositories as well, including litex and several lite.... component ones. Once the git clone finishes, it will automatically run the scripts/download-env-root.sh script as root, which uses apt-get to install a bunch of additional tools, and then also automatically run the scripts/download-env.sh as your own user, which will use conda to install a bunch more things. The things installed are similar to what I described in my previous blog post, but these scripts are run from the timvideos/HDMI2USB-litex-firmware repository.

Once you are happy with what it is going to do, run the bootstrap.sh script (as a regular non-root user) with something like:

bash -x hdmi2usb-litex-firmware-bootstrap.sh

(You might want to run it inside a script session to get a record of what it is doing.)
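
If you do want a record, something like this works (bootstrap.log being an arbitrary file name for the captured output):

script -c 'bash -x hdmi2usb-litex-firmware-bootstrap.sh' bootstrap.log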

When it gets to the point of needing root credentials (to run scripts/download-env-root.sh), sudo will probably prompt for your password, which gives you a chance to review that what it downloaded in the git clone is the same as what you expected from the git repository. You should also review the downloaded scripts/download-env.sh script at this point too, to make sure you are happy with that, as it will run immediately afterwards.

If you have already run the download-env-root.sh script from a previous install, it will probably not install anything new at this point. However the download-env.sh script will use conda to install a large number of things, including downloading various lm32 tools very slowly (20-60 kB/s in my case, so well under 1Mbps). The conda installed things appear to end up in build/conda/bin and friends, so as the downloads progress you will see more lm32-... tools in build/conda/bin. Altogether after these steps the build directory contains about 500MB (0.5GB) of programs.

Once you reach the line "Bootstrap: Set up complete, you're good to go!" the gateware build environment bootstrap phase is complete. You might want to make a backup of the build directory at this point, to save time if you want to start again later:

cd HDMI2USB-litex-firmware
mkdir -p ../bootstrap
tar -cpzf ../bootstrap/hdmi2usb-litex-firmware-build-after-bootstrap.tar.gz build

After that, you can move onto building MicroPython.

To build MicroPython for the Mimas V2, enter the gateware build environment for the Mimas V2 with:

cd HDMI2USB-litex-firmware
export PLATFORM=mimasv2
export TARGET=base
source scripts/enter-env.sh

and you should get your prompt changed to include "(H2U P=mimasv2 T=base)" at the start. (It appears that scripts/settings.sh is defaulting the CPU to lm32, otherwise you would also need export CPU=lm32.)

Next, build the gateware (ie, the FPGA logic for the lm32 soft CPU and peripherals):

make gateware

All going well, that should end with "Bitstream generation is complete."

Then you can run the scripts/build-micropython.sh script, which will clone the upy-fpga repository (https://github.com/upy-fpga/micropython.git), and then build a MicroPython binary image for your specified platform. Eg,

bash -x scripts/build-micropython.sh

However, one of the first things that script will do is use conda to install another lm32 gcc package -- an elf-newlib variation -- if you do not already have it installed. So be prepared for another bootstrap phase (with another slow download) the first time you run it.

After two long, slow download attempts failed, I tried grabbing the URL that conda reported being unable to download on another system, and copying the result into build/conda/pkgs, to try to avoid conda downloading it from the network itself. Unfortunately conda was determined to download the file itself, and just moved my pre-downloaded file out of the way. I eventually managed to make it install by setting http_proxy and https_proxy to point at a web proxy, at another site, which seemed to be able to keep streaming data fast enough that conda did not time out. (conda seems to have rather short timeouts, and its only recovery mechanism appears to be to delete the download attempt and start again from scratch -- which is an incredibly user and bandwidth hostile approach. At minimum it should be able to resume downloads, or use a download tool, like wget, which will automatically attempt to resume downloads.)
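
For reference, that workaround was just the usual proxy environment variables, exported before re-running the build script (proxy.example.com and port 3128 being placeholders for whatever proxy you have access to):

http_proxy=http://proxy.example.com:3128/
https_proxy=http://proxy.example.com:3128/
export http_proxy https_proxy
bash -x scripts/build-micropython.sh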

Assuming that (eventually!) works, you should end up with build/mimasv2_base_lm32/software/micropython/firmware.fbi (and .../micropython/firmware.bin) containing the MicroPython build, and build/mimasv2_base_lm32/micropython.bin with a ready-to-flash image containing the gateware and firmware.

If the board already has earlier gateware loaded (eg from following the earlier instructions above), you can upload the new MicroPython build temporarily by setting SW7 to the serial interaction mode (right hand side), and then running:

flterm --port=/dev/ttyACM0 --kernel="build/${PLATFORM}_${TARGET}_${CPU}/software/micropython/firmware.bin" --speed=19200

and then press SW6 (top right of the Mimas V2) to cause the Mimas V2 to restart, which should start the upload of the MicroPython software. This should look something like:

(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$ flterm --port=/dev/ttyACM0 --kernel=build/${PLATFORM}_${TARGET}_${CPU}/software/micropython/firmware.bin --speed=19200
[FLTERM] Starting...

LiteX SoC BIOS (lm32)
(c) Copyright 2012-2017 Enjoy-Digital
(c) Copyright 2007-2017 M-Labs Limited
Built Mar 12 2017 15:39:00

BIOS CRC passed (cdfe4dda)
Initializing SDRAM...
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
[FLTERM] Received firmware download request from the device.
[FLTERM] Uploading kernel (167952 bytes)...
[FLTERM] Upload complete (1.6KB/s).
[FLTERM] Booting the device.
[FLTERM] Done.
Executing booted program.
MicroPython v1.8.7-458-gb63852f on 2017-05-07; litex with lm32
>>>
>>>
>>> print("Hello World!")
Hello World!
>>>
(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$

It is also possible to flash the Mimas V2 with the new gateware and firmware combination, in build/mimasv2_base_lm32/micropython.bin, by setting SW7 to be in the gateware upload mode (to the left hand side) and then running:

MimasV2Config.py /dev/ttyACM0 "build/${PLATFORM}_${TARGET}_${CPU}/micropython.bin"

which should look something like:

(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$ MimasV2Config.py /dev/ttyACM0 "build/${PLATFORM}_${TARGET}_${CPU}/micropython.bin"
****************************************
* Numato Lab Mimas V2 Configuration Tool *
****************************************
Micron M25P16 SPI Flash detected
Loading file build/mimasv2_base_lm32/micropython.bin...
Erasing flash sectors...
Writing to flash 100% complete...
Verifying flash contents...
Flash verification successful...
Booting FPGA...
Done.
(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$

Once that finishes, move SW7 to the serial interaction setting (to the right), and use flterm to connect to the Python REPL:

flterm --port=/dev/ttyACM0 --speed=19200

Press enter a few times until you get a Python REPL prompt:

(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$ flterm --port=/dev/ttyACM0 --speed=19200
[FLTERM] Starting...

>>>
>>>
>>>
>>> print("Hello World!")
Hello World!
>>>
>>>
(H2U P=mimasv2 T=base) ewen@parthenon:/src/upy-fpga/HDMI2USB-litex-firmware$

and ctrl-C when you are done.

The now-merged upy-fpga/micropython.git repository includes the Mimas V2 LED support, so the LED example above should work too.

Posted Sun Mar 12 22:12:05 2017 Tags:

Imagine, not entirely hypothetically, that you have a client that needs you to work on multiple systems only accessible via https (where the domain name needs to match), all located behind the client's firewall. Further suppose that the only access they can provide to their network is ssh to a bastion host -- even while located in their physical office, only "guest" network access to the Internet is available. Assume, also not entirely hypothetically, that they have no VPN server. Finally assume, again not entirely hypothetically, that no software can be installed on the bastion host, and that it runs Ubuntu Linux 12.04 LTS (hey, there is at least a month of maintenance support left for that version...).

In this situation there are a few reasonable approaches that preserve the browser's view of the domain name:

  • an outgoing (forward) web proxy, supporting CONNECT

  • transparent redirection of outgoing TCP connections to a proxy

  • tricks with DNS resolution (eg /etc/hosts), possibly combined with one of the above.

I did briefly experiment with transparent redirection of the outgoing TCP connections (which works well on Linux: iptables -t nat -A OUTPUT ...), but since I was working from a Mac OS X desktop system it was more complicated (Mac OS X uses pf, and pf.conf can include rdr statements to redirect packets, but intercepting locally originated traffic involves multiple steps and seems somewhat fragile and was not working reliably for me).
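
For reference, on Linux that redirection can be combined with a ssh port forward on a per-server basis -- roughly the sketch below, where 10.0.0.5, local port 8443, and the "bastion" host name are made-up examples, and the server's hostname still has to resolve to its real internal IP on the desktop (eg, via /etc/hosts) so that the browser sees the right name:

ssh -L 8443:10.0.0.5:443 -o ExitOnForwardFailure=yes -N -f bastion
sudo iptables -t nat -A OUTPUT -p tcp -d 10.0.0.5 --dport 443 -j REDIRECT --to-ports 8443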

Instead I went looking for a way to implement a web forward proxy, on the bastion host. Since I could not install software on the bastion host, I needed to find something already installed which could be repurposed to be a forward proxy. Fortunately it turned out that the bastion host had been installed with apache2 (2.2.2), to support another role of the host (remote access to monitoring output). I then needed a configuration that could use apache2 in a forward proxy mode.

Apache 2.2 provides a forward proxy feature through mod_proxy, but it is definitely something you want to secure carefully as the documentation repeatedly warns. In addition the bastion host naturally was firewalled from the Internet to allow only certain ports to be reached directly, including ssh, so simply running a web proxy on some port on an Internet reachable IP was never an option.

To solve both of these problems I created a configuration to run another instance of Apache 2.2, with mod_proxy enabled in forward proxy mode, listening on localhost, that could be reached only via ssh port forward (based on examples from the Internet).

This involved creating a custom configuration to run Apache 2.2 with:

Listen 127.0.0.1:3128

# Access control functionality
LoadModule authz_host_module /usr/lib/apache2/modules/mod_authz_host.so

# Proxy functionality
LoadModule proxy_module         /usr/lib/apache2/modules/mod_proxy.so
LoadModule proxy_http_module    /usr/lib/apache2/modules/mod_proxy_http.so
LoadModule proxy_connect_module /usr/lib/apache2/modules/mod_proxy_connect.so

# Logging
LogFormat "%h %l %u %t \"%r\" %s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined
LogLevel warn
ErrorLog  logs/error.log
CustomLog logs/access.log combined

# PID file
PidFile   logs/apache2.pid

# Forward Proxy
ProxyRequests On

# Allow CONNECT (and thus HTTPS) to additional ports
AllowCONNECT 443 563 8080

as well as a <Proxy *> section (ie, in angle brackets), which limits access to the proxy to localhost:

<Proxy *>
    Order Deny,Allow
    Deny from all
    Allow from 127.0.0.1
</Proxy>

(Full Apache 2.2 example config.)

Note that the above configuration will work only with Apache 2.2 (or earlier); the configuration for the Apache 2.4 mod_proxy, and other Apache features, changed significantly, particularly around authorization. (ETA, 2017-04-03: See end of post for update with Apache 2.4 configuration.)

The key features of the above configuration are that it loads the modules needed for HTTP/HTTPS proxying and IP-based access control, and then listens on 127.0.0.1:3128 for connections and treats those connections as forward proxy requests -- thanks to ProxyRequests On and the "Proxy *" section. The "Proxy *" section is intended to allow access only from localhost itself. It may also be useful to add user/password authentication to the proxy.

Put the config in a directory, and then make a "logs" directory:

cd ....
mkdir logs

And create a simple wrapper script to start up the proxy on the bastion host (eg, called "go"):

#! /bin/sh
# Run Apache in forward proxy mode to be reached via ssh tunnel
#
exec apache2 -d ${PWD} -f ${PWD}/apache-2.2-forward-proxy.conf -E /dev/stderr

Then the proxy can be started up when needed with:

cd ....
./go

and will run in the background. When run, you should see something listening on TCP/3128 on localhost on the bastion host, eg with netstat -na | grep 3128:

tcp      0       0 127.0.0.1:3128        0.0.0.0:*              LISTEN

From there, the next step is to ssh into the bastion host with a port forward that allows you to reach 127.0.0.1:3128:

ssh -L 3128:127.0.0.1:3128 -o ExitOnForwardFailure=yes -4 -N -f HOST

which should cause ssh to listen on our local (desktop) system, here also on port 3128, so netstat -na | grep 3128 on the local system should also show:

tcp      0       0 127.0.0.1:3128        0.0.0.0:*              LISTEN

The final step is to configure your web browser to use a web proxy at 127.0.0.1 on port 3128, so that it will use the proxy for HTTP/HTTPS connections. For Safari this can be done with Safari -> Preferences... -> Advanced -> Proxies: Change Settings..., which will open the system-wide proxy settings. You need to change both "Web Proxy (HTTP)" and "Secure web proxy (HTTPS)" for this to work in most cases, ticking them and then setting the "Web Proxy Server" to "127.0.0.1" and the port to "3128".
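
The same settings can also be changed from the command line with the macOS networksetup tool, which may be handier for toggling the proxy on and off; "Wi-Fi" here is a placeholder for whichever network service is in use (it might be, eg, "Ethernet"):

sudo networksetup -setwebproxy "Wi-Fi" 127.0.0.1 3128
sudo networksetup -setsecurewebproxy "Wi-Fi" 127.0.0.1 3128

and turned off again later with:

sudo networksetup -setwebproxystate "Wi-Fi" off
sudo networksetup -setsecurewebproxystate "Wi-Fi" off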

After that your web browsing should automatically go through the 127.0.0.1:3128 connection on your desktop system, via the ssh port forward to Apache 2.2/mod_proxy on the bastion host, and then CONNECT out to the desired system, giving a transparent HTTPS connection so that TLS certificate validation will just work.

To access from a command line client it is typically useful to set:

http_proxy=http://127.0.0.1:3128/
https_proxy=http://127.0.0.1:3128/
export http_proxy https_proxy

because most client libraries will look for those (lowercase) environment variables when deciding how to make the connection. This enables using, eg, web REST/JSON APIs from Python, with the proxy.
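
For example, with those variables exported, a quick check with curl (internal.example.com being a placeholder for one of the client's internal hosts) should go via the proxy; the same proxy can also be forced explicitly with -x:

curl -v https://internal.example.com/
curl -v -x http://127.0.0.1:3128/ https://internal.example.com/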

If the bastion host is restarted then the proxy will have to be manually restarted (as above); but if it is running on a legacy Linux install chances are that the client will be unwilling to reboot the host regularly due to being uncertain if it will boot up cleanly with everything needed running. I got through the entire project without having to restart the proxy.

The main catch with this configuration is that because the Safari proxy settings are system-wide, they affect all traffic, including things like iTunes. I worked around that by using another browser which had its own network connection settings (Firefox) for regular web browsing, and turning the proxy settings on and off as I needed them to work on that project. If this were to be a semi-permanent solution it might be better either to use a browser with its own proxy settings (eg, Firefox, or one in a VM) dedicated to the project and leave the system-wide settings alone, or perhaps to create a Proxy Auto-Config (.pac) file which redirects only certain URLs to the proxy -- it is possible to arrange to load those from a local file instead of a web URL.

ETA, 2017-04-03: Inevitably, given that the support for Ubuntu 12.04 LTS runs out this month, the client upgraded the bastion host to Ubuntu 14.04 LTS (but not Ubuntu 16.04 LTS, presumably due to being a smaller jump). This brings in Apache 2.4 instead of Apache 2.2, which brings non-trivial changes in configuration syntax.

The incremental differences are relatively small though, for a basic config. You need to load a couple of additional modules:

# Apache 2.4 worker
LoadModule mpm_worker_module /usr/lib/apache2/modules/mod_mpm_worker.so

LoadModule authz_core_module /usr/lib/apache2/modules/mod_authz_core.so

and change the Proxy authorization section to be:

Require ip 127.0.0.1

rather than "Order Deny, Allow", "Deny from all", "Allow from 127.0.0.1".

It is now also possible to use additional proxy directives including:

ProxyAddHeaders On
ProxySourceAddress A.B.C.D

but these are optional. With those changes the same minimal configuration approach should work with Apache 2.4.

(Full Apache 2.4 example config.)

Posted Tue Mar 7 09:15:35 2017 Tags: