Introduction
Over the last few years I have accumulated quite a lot of purchased, downloadable-only, content -- 200+ (DRM free) eBooks, quite a few software programs (mainly photography related), and an increasing amount of audio and video content (mostly photography training).
Some of it is only available for download for a short period of time (the 5daydeal specialises in that -- content is available for download only for about two weeks), so I really do not want to lose my copies of it. But I also do not want to lose my copies of other downloaded items, especially video, because even if I can still download them again, in New Zealand the bandwidth involved in downloading it again costs Real Money (tm). (On my current plan, excess data is $3/2GB -- which is cheap compared with what it used to be, but definitely not free when you're talking about many 10s of GBs.)
When the content was relatively small, my main approach was to keep it on my laptop, which has a good set of hourly (Time Machine), and weekly (SuperDuper!) backups (including an off-site rotation). Plus maybe another copy or two by hand onto a couple of Linux servers in different locations. But as the content grows -- both in disk space consumed, and quantity of files -- that becomes increasingly difficult to manage.
git-annex is an extension to
the git version control system, which allows
it to track "large" files -- larger than you would want to keep
stored in every copy of the git repository -- in a more sensible
fashion. Put very simply, git-annex uses git to keep track of
the metadata of the files, including which git-annex repositories
have actual copies of the file data -- and then provides commands for
copying the data around on request. Amongst other useful things it
can ensure that there are always N copies of the file data, and --
through the robust file checksums it keeps -- ensure that it knows when
it has a complete, unchanged, copy of the file data.
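For example, a minimal sketch of those two features, using standard git-annex commands run inside an annex:

git annex numcopies 2   # refuse to drop content unless at least 2 copies would remain
git annex fsck          # re-checksum local content and warn if numcopies is not met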
git-annex is written by Joey Hess, who
amongst other things was a Debian Linux Developer until very
recently, and wrote the
blog software used for my blog. git-annex
is a command line tool, and also comes with an
assistant, which
provides an easier-to-use "synchronised folder" mechanism -- a bit
like the dropbox, etc, workflow. (The assistant was funded by a
Kickstarter,
and Joey has continued working on git-annex since then, picking
up funding periodically since
then.)
Since I am very familiar with the command line, and wanted to use both
local and remote systems, I went with the basic command line
git-annex.
My aim was to take:
- Files on my laptop internal drive (running OS X 10.9)
- Files on external drives attached to my laptop (some "most of the time" -- eg an external data drive; and some just periodically, such as various "backup" drives)
- Files on my home Debian Linux server
- Files on my colocated Debian Linux server
and bring them together into a coherent git-annex managed set of
files, without making additional copies of the files -- either on
the local drives (due to not having a massive abundance of disk space)
or especially over the network unless the machine/location had no
copies of the files (due to limited, expensive, bandwidth).
git-annex installation
git-annex is written in
Haskell which is... not
the most mainstream language (but from following the git-annex
devblog does seem to
have made development of some features easier). This means it is
not in MacPorts, and
building it from
source requires
a lot of dependencies -- and MacPorts haskell-platform is well out
of date (but it does look
promising that it might get updated to a reasonably current version
soon). So after an aborted earlier attempt to get enough Haskell
things installed to build from source -- via MacPorts, etc -- this
time I just installed from the binary packages.
OS X 10.9 (Mavericks) install
The git-annex OS X autobuilder was recently switched to OS X
10.10,
which means that the only recent builds are for OS X 10.10 (and
I'd guess these will be the only ones available in future). However
it appears that, at least for the git-annex command line tool, it does
also work on OS X 10.9.
So I installed as follows:
- Downloaded git-annex.dmg from the git-annex OS X install page.
- Installed the GPG key to verify the git-annex download.
- Verified the downloaded dmg using the signature file next to it in the same directory with:
  gpg --import gpg-pubkey.asc
  gpg --verify git-annex.dmg.sig
  and verified that key by checking it against the Debian keyring on a Debian Linux system:
  sudo apt-get install debian-keyring
  gpg --import gpg-pubkey.asc
  gpg --keyring /usr/share/keyrings/debian-keyring.gpg --check-sigs 89C809CB
- Opened the git-annex.dmg file, and dragged the git-annex.app into /Applications/OpenSource/
- Added a symlink to the git-annex command line tool in /usr/local/bin so I could run git annex interactively, with:
  cd /usr/local/bin/
  sudo ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex .
- Added a symlink to the git-annex-shell command line helper into a directory that my ~/.bashrc adds to my $PATH, in this case ~/.bin, so that git-annex from another system can find it, with:
  cd ~/.bin
  ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex-shell .

(Note that /usr/local/bin is not on the OS X ssh path by default -- nor is ~/.bin -- so some other action is required to ensure that a git-annex that ssh's into the OS X system can find its helper; adding the Contents/MacOS directory to your $PATH in ~/.bashrc might also work, but my $PATH is long enough already!)
Note: I already had git and several other dependencies installed from
MacPorts, so I did not need to add those bundled commands into the
$PATH as well -- I already had usable versions on my $PATH.
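As a quick sanity check (not part of the install steps above), it is worth confirming that the symlinked binary is the one being found:

which git-annex       # should report /usr/local/bin/git-annex
git annex version     # prints the version of the installed git-annex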
Debian 7.0 (Wheezy) install
There is a git-annex package in Debian 7.0 (Wheezy) -- Joey Hess was a
Debian Developer after all! -- but it is pretty out of
date, because the
next release of Debian Stable (Jessie) is due out fairly soon.
The suggested approach is to enable the Debian Wheezy Backports with:
deb http://http.debian.net/debian wheezy-backports main
(eg create /etc/apt/sources.list.d/wheezy-backports.list with that
contents, or put it in /etc/apt/sources.list), and then install
the version in backports with:
sudo apt-get update
sudo apt-get -t wheezy-backports install git-annex
which is what I did. Because it's pre-packaged, and comes in via a
checksum-verified installation method (Debian apt), that is all
that is required to install; it will pull in a newer version of git
as well as git-annex and some other dependencies from the Debian
Wheezy archive. (In theory updates to the
Backports will be
installed automatically as required, but new packages will only be
installed from Backports if you specifically request them, like the
above.)
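To confirm that the newer Backports version is the one actually installed (rather than the stock Wheezy package), something like the following will show both the version in use and which archive it came from:

apt-cache policy git-annex
git annex version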
Adding media into git-annex without copying it again
I started out following the git-annex
walkthrough, planning to adapt it to avoid copying the data over the
network. I did run into some problems, because git-annex wants to be
able to ssh from the client system to other remote systems --
especially to git clone the archive -- but found a work
around
to get across the firewall in the middle, so I will describe that
approach below for reference. However, next time I do this I will
probably start from a system that all the others can easily ssh
to, even though it is a remote system -- now that I know how to
migrate two separate copies of the same data into
git-annex.
Local copy
My first step was to create a git-annex archive on my laptop from
the local repository of one set of video I had there (in a separate
directory for testing purposes):
mkdir /bkup/purchased/craft-and-vision/the-created-image
cd /bkup/purchased/craft-and-vision/the-created-image
ln /bkup/video/the-created-image/*.mov . # Hard links, no extra disk usage
Then the git-annex repository can be set up:
git init
git annex init "ashram (laptop) internal drive"
and the files added to the repository:
git annex add .
git commit -a -m 'The Created Image, Volume 1'
(the git annex add . command will take a little while, as it is
checksumming the content, and then moving it deep into the git archive
and replacing it with a symlink -- indirect mode).
Throughout that process the extra disk space used was quite small, and
the files remained hard linked to the original ones -- thus saving
space. (When I am happy with this workflow I will remove the original
copy, just leaving the git annex.)
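If you want to double check that no extra space was used: the annexed files in the working tree are now symlinks into .git/annex/objects, and the originals under /bkup/video should still show a hard link count of 2, because they now share their inode with the object file inside the annex. For example (the file name here is just illustrative):

ls -l the-created-image-01.mov              # now a symlink into .git/annex/objects/...
ls -l /bkup/video/the-created-image/*.mov   # link count of 2 on each original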
External drive
My second step was to create a second copy on my "usually attached" external drive -- this time making a copy because the drive had sufficient free space to create a new copy, and I wanted to test that process:
mkdir /data/purchased/craft-and-vision/
cd /data/purchased/craft-and-vision/
git clone /bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "ashram external data drive"
At this point the git annex archive was initialised, and it knew how
to find the content -- but it did not have a copy of it (which I could
check by seeing the symlinks did not point to files that existed, and
by the disk space consumed).
To prepare it to get the content easily we want to do:
git remote add ashram /bkup/purchased/craft-and-vision/the-created-image
git annex sync ashram
And then to actually get (copy over) all the content:
git annex get .
(the . expands to "everything in this directory"; you can also
specify individual files).
That data copy took a while (copying onto a relatively slow external USB drive -- a USB3 drive connected to a USB2 hub in the Apple Thunderbolt Display; apparently Apple chose not to put USB3 into the Thunderbolt Display :-( ), but did have good progress updates.
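An alternative, if the laptop repository also has the external drive repository configured as a remote, is to push the content from the source side instead; "datadrive" here is just an illustrative remote name:

cd /bkup/purchased/craft-and-vision/the-created-image
git remote add datadrive /data/purchased/craft-and-vision/the-created-image
git annex copy --to datadrive .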
Remote fileserver
My third step was to integrate the copy on a remote (colocated)
fileserver -- it already had the content, but it was not in git-annex
so I needed to tell git-annex about it, and make sure that the other
copies knew about it. Without copying it over the (slow, bandwidth
counted) network link in between. (It was at this point that hindsight
shows that I would have been better off starting with that remote copy, as
all my systems can easily ssh into that remote fileserver, via
assorted VPN connections -- but ssh into my OS X laptop is seldom used.)
The first task was getting to a point where, starting from the
remote system, I could do, eg, a git clone from my laptop. There were
a couple of challenges:
- A firewall in between my laptop and the remote file server allowed ssh from my laptop to the remote file server, but not back again.
- It took a while to be able to run git-annex-shell on OS X, via that ssh connection -- for which the eventual solution was to symlink git-annex-shell into a directory that my ~/.bashrc added into my non-interactive shell $PATH (as mentioned above).
To solve the firewall issue, I made use of ssh reverse port forwarding,
and .ssh/config to effectively alias my laptop name to the
forwarded
port:
- From my laptop, forward remote port 2222 back to the ssh daemon on my laptop:
  ssh -R 2222:localhost:22 172.20.2.3
- And on that remote server, add an entry to ~/.ssh/config, which maps my laptop name to that port:
  # Work around to reach back to laptop via ssh tunnels
  #
  Host ashram
      Hostname localhost
      port 2222
This can then be tested with ssh ashram, and should result in a
shell -- this also primes ssh's view of the host fingerprint, which
is a useful step.
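It is also worth checking that git itself can reach the laptop repository through that alias before going further, for example:

git ls-remote ssh://ashram/bkup/purchased/craft-and-vision/the-created-image

which just lists the refs in the laptop repository, and proves the tunnel plus host alias work for git (and hence for git-annex).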
Having done that, if I did not already have a copy of the data, then I could have carried on much like the external drive case with:
mkdir /data/purchased/craft-and-vision
cd /data/purchased/craft-and-vision
git clone ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "Colo fileserver"
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
...
and then to actually copy the data across the network (eg, if I was in
a well connected location and did not care about bandwidth used), to do
git annex get ., as in the external drive case above.
But because I already had copied the data to that remote fileserver
I did not want to copy it again. The solution is to set up the
remote repository slightly
differently,
so that you can add the files you have locally before you tell
git-annex to sync with the remote repository:
mkdir /data/purchased/craft-and-vision
mkdir /data/purchased/craft-and-vision/the-created-image
cd /data/purchased/craft-and-vision/the-created-image
ln ~/the-created-image/* . # Hard link in existing files
Then set up the basic git-annex repository around that data:
git init
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
git fetch ashram
git annex init "Colo fileserver"
and add the files you have locally:
git annex add .
before you tell git-annex to sync with the remote one, so that it
learns it already has the files. At this point the repository is
in a state where it is ready for you to commit them into git -- but
you should avoid doing that, because you want git annex to realise
the files are already committed and tell git everything is fine, rather
than creating a second conflicting commit.
You can use:
git annex whereis
to check that your new repository recognises that each file is in
two locations (in this case locally and "ashram"); then you get it
to tell the local git that the files are already known, and to tell the
other git-annex repository what it has:
git annex sync
After doing that, running:
git annex whereis
on either the remote file server repository, or the local copy on my laptop, showed that both recognised there were two copies.
The final step to make this useful is to go back to the laptop copy and give it a way to reach out to the remote file server:
git remote add colo-fileserver ssh://172.20.2.3/data/purchased/craft-and-vision/the-created-image
git annex sync
(No special ssh tricks required in that direction, since that is the direction that the firewall allows ssh.)
(FTR, someone else used a shell readlink
hack
to "fill in" the missing files, plus a git annex fsck to cause
git-annex to catch up with what had happened behind its back.
See also Joey's advice on recovering from lost
files.
It seems worth being aware of that method, but the method listed
above seems cleaner if you are starting from scratch. This guide
to using git-annex for
backups also
has some useful hints on "not completely meshed" git-annex usage.)
Hindsight
With the benefit of hindsight, I would have started on the remote fileserver first, since everything can ssh into that. So the plan would be:
- Set up the copy on the remote fileserver, stand alone.
- Set up the copy on my laptop, running in this order:
  git init
  git remote add NAME LOCATION
  git fetch NAME
  git annex init 'DESCRIPTION'
  git annex add .
  git annex sync
- Then set up the copies on the external drives, from the copy on the laptop drive.
That would have avoided some of the work arounds to be able to ssh
back into my laptop, and also avoided the git-annex-shell $PATH
issues, since the Debian Linux git-annex package just works in that
respect.
(Another option is to use a centralised bare git repository to be the glue in the middle -- without git-annex on that central repository. It is just there to be a transport point.)
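A minimal sketch of that approach (the paths and names here are illustrative): create a bare repository on the central server, add it as a remote of every git-annex repository, and let git annex sync push the git and git-annex branches through it -- the file content itself still moves directly between the repositories that actually hold it:

# on the central server
git init --bare /srv/git/the-created-image.git

# in each git-annex repository
git remote add origin ssh://central-server/srv/git/the-created-image.git
git annex sync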
ETA, 2014-12-22: Another very useful command if you already have a local copy of the file is:
git annex reinject ${PATH_TO_FILE} ${FILE_IN_ANNEX}
which can be done after a git clone, to take the local file and put
it into the git annex (removing it from its original location).
git annex will checksum the file to make sure it is the same
before moving it into the annex. If you want to keep the file in
its original location as well, without duplicating the file on disk,
the easiest answer appears to be to hard link it to a new name and
then reinject the hard link. Eg:
mkdir ../import
ln ${PATH_TO_FILE} ../import
git annex reinject ../import/${FILE_IN_ANNEX} ${FILE_IN_ANNEX}
Because the reinjected file is moved into the git annex, the hard link will stay intact, so you can keep the same content in another directory without having to duplicate the large file.
(Based on hints on stackexchange and tips on recovering data from lost and found)
Also useful is:
git annex addurl --file ${FILE_IN_ANNEX} ${URL}
which will associate a web URL with an existing file (see git
annex
addurl
for more details). Unless told to relax, it will check that the URL can
be fetched and that it returns a file of the same size.
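For example, the relaxed form, which just records the URL without performing that check, would be:

git annex addurl --relaxed --file ${FILE_IN_ANNEX} ${URL}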