Introduction
Over the last few years I have accumulated quite a lot of purchased, downloadable-only, content -- 200+ (DRM free) eBooks, quite a few software programs (mainly photography related), and an increasing amount of audio and video content (mostly photography training).
Some of it is only available for download for a short period of time (the 5daydeal specialises in that -- content is available for download only for about two weeks), so I really do not want to lose my copies of it. But I also do not want to lose my copies of other downloaded items, especially video, because even if I can still download them again, in New Zealand the bandwidth involved in downloading it again costs Real Money (tm). (On my current plan, excess data is $3/2GB -- which is cheap compared with what it used to be, but definitely not free when you're talking about many 10s of GBs.)
When the content was relatively small, my main approach was to keep it on my laptop, which has a good set of hourly (Time Machine), and weekly (SuperDuper!) backups (including an off-site rotation). Plus maybe another copy or two by hand onto a couple of Linux servers in different locations. But as the content grows -- both in disk space consumed, and quantity of files -- that becomes increasingly difficult to manage.
git-annex is an extension to
the git version control system, which allows
it to track "large" files -- larger than you would want to keep
stored in every copy of the git repository -- in a more sensible
fashion. Put very simply, git-annex uses git to keep track of
the metadata of the files, including which git-annex repositories
have actual copies of the file data -- and then provides commands for
copying the data around on request. Amongst other useful things it
can ensure that there are always N copies of the file data, and --
through the robust file checksums it keeps -- ensure that it knows when
it has a complete, unchanged, copy of the file data.
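For example, a minimal sketch of those two features, using standard git-annex commands run inside an annex:

git annex numcopies 2   # refuse to drop content unless at least 2 copies would remain
git annex fsck          # re-checksum local content and warn if numcopies is not met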
git-annex is written by Joey Hess, who
amongst other things was a Debian Linux Developer until very
recently, and wrote the
blog software used for my blog. git-annex
is a command line tool, and also comes with an
assistant, which
provides an easier-to-use "synchronised folder" mechanism -- a bit
like the dropbox, etc, workflow. (The assistant was funded by a
Kickstarter,
and Joey has continued working on git-annex since then, picking
up funding periodically since
then.)
Since I am very familiar with the command line, and wanted to use both
local and remote systems, I went with the basic command line
git-annex.
My aim was to take:
- Files on my laptop internal drive (running OS X 10.9)
- Files on external drives attached to my laptop (some "most of the time" -- eg an external data drive; and some just periodically, such as various "backup" drives)
- Files on my home Debian Linux server
- Files on my colocated Debian Linux server
and bring them together into a coherent git-annex managed set of
files, without making additional copies of the files -- either on
the local drives (due to not having a massive abundance of disk space)
or especially over the network unless the machine/location had no
copies of the files (due to limited, expensive, bandwidth).
git-annex installation
git-annex is written in
Haskell which is... not
the most mainstream language (but from following the git-annex
devblog does seem to
have made development of some features easier). This means it is
not in MacPorts, and
building it from
source requires
a lot of dependencies -- and MacPorts haskell-platform is well out
of date (but it does look
promising that it might get updated to a reasonably current version
soon). So after an aborted earlier attempt to get enough Haskell
things installed to build from source -- via MacPorts, etc -- this
time I just installed from the binary packages.
OS X 10.9 (Mavericks) install
The git-annex OS X autobuilder was recently switched to OS X
10.10,
which means that the only recent builds are for OS X 10.10 (and
I'd guess these will be the only ones available in future). However
it appears that, at least for the git-annex command line tool, it does
also work on OS X 10.9.
So I installed as follows:
- Downloaded git-annex.dmg from the git-annex OS X install page.
- Installed the GPG key to verify the git-annex download.
- Verified the downloaded dmg using the signature file next to it in the same directory with:
  gpg --import gpg-pubkey.asc
  gpg --verify git-annex.dmg.sig
  and verified that key by checking it against the Debian keyring on a Debian Linux system:
  sudo apt-get install debian-keyring
  gpg --import gpg-pubkey.asc
  gpg --keyring /usr/share/keyrings/debian-keyring.gpg --check-sigs 89C809CB
- Opened the git-annex.dmg file, and dragged the git-annex.app into /Applications/OpenSource/
- Added a symlink to the git-annex command line tool in /usr/local/bin so I could run git annex interactively, with:
  cd /usr/local/bin/
  sudo ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex .
- Added a symlink to the git-annex-shell command line helper into a directory that my ~/.bashrc adds to my $PATH, in this case ~/.bin, so that git-annex from another system can find it, with:
  cd ~/.bin
  ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex-shell .

(Note that /usr/local/bin is not on the OS X ssh path by default -- nor is ~/.bin -- so some other action is required to ensure that a git-annex that ssh's into the OS X system can find its helper; adding the Contents/MacOS directory to your $PATH in ~/.bashrc might also work, but my $PATH is long enough already!)
Note: I already had git and several other dependencies installed from
MacPorts, so I did not need to add those bundled commands into the
$PATH as well -- I already had usable versions on my $PATH.
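As a quick sanity check (not part of the install steps above), it is worth confirming that the symlinked binary is the one being found:

which git-annex       # should report /usr/local/bin/git-annex
git annex version     # prints the version of the installed git-annex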
Debian 7.0 (Wheezy) install
There is a git-annex package in Debian 7.0 (Wheezy) -- Joey Hess was a
Debian Developer after all! -- but it is pretty out of
date, because the
next release of Debian Stable (Jessie) is due out fairly soon.
The suggested approach is to enable the Debian Wheezy Backports with:
deb http://http.debian.net/debian wheezy-backports main
(eg create /etc/apt/sources.list.d/wheezy-backports.list with that
contents, or put it in /etc/apt/sources.list), and then install
the version in backports with:
sudo apt-get update
sudo apt-get -t wheezy-backports install git-annex
which is what I did. Because it's pre-packaged, and comes in via a
checksum-verified installation method (Debian apt), that is all
that is required to install; it will pull in a newer version of git
as well as git-annex and some other dependencies from the Debian
Wheezy archive. (In theory updates to the
Backports will be
installed automatically as required, but new packages will only be
installed from Backports if you specifically request them, like the
above.)
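To confirm that the newer Backports version is the one actually installed (rather than the stock Wheezy package), something like the following will show both the version in use and which archive it came from:

apt-cache policy git-annex
git annex version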
Adding media into git-annex without copying it again
I started out following the git-annex
walkthrough, planning to adapt it to avoid copying the data over the
network. I did run into some problems, because git-annex wants to be
able to ssh from the client system to other remote systems --
especially to git clone the archive -- but found a work
around
to get across the firewall in the middle, so I will describe that
approach below for reference. However, next time I do this I will
probably start from a system that all the others can easily ssh
to, even though it is a remote system -- now that I know how to
migrate two separate copies of the same data into
git-annex.
Local copy
My first step was to create a git-annex archive on my laptop from
the local repository of one set of video I had there (in a separate
directory for testing purposes):
mkdir /bkup/purchased/craft-and-vision/the-created-image
cd /bkup/purchased/craft-and-vision/the-created-image
ln /bkup/video/the-created-image/*.mov . # Hard links, no extra disk usage
Then the git-annex repository can be set up:
git init
git annex init "ashram (laptop) internal drive"
and the files added to the repository:
git annex add .
git commit -a -m 'The Created Image, Volume 1'
(the git annex add . command will take a little while, as it is
checksumming the content, and then moving it deep into the git archive
and replacing it with a symlink -- indirect mode).
Throughout that process the extra disk space used was quite small, and
the files remained hard linked to the original ones -- thus saving
space. (When I am happy with this workflow I will remove the original
copy, just leaving the git annex.)
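If you want to double check that no extra space was used: the annexed files in the working tree are now symlinks into .git/annex/objects, and the originals under /bkup/video should still show a hard link count of 2, because they now share their inode with the object file inside the annex. For example (the file name here is just illustrative):

ls -l the-created-image-01.mov              # now a symlink into .git/annex/objects/...
ls -l /bkup/video/the-created-image/*.mov   # link count of 2 on each original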
External drive
My second step was to create a second copy on my "usually attached" external drive -- this time making a copy because the drive had sufficient free space to create a new copy, and I wanted to test that process:
mkdir /data/purchased/craft-and-vision/
cd /data/purchased/craft-and-vision/
git clone /bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "ashram external data drive"
At this point the git annex archive was initialised, and it knew how
to find the content -- but it did not have a copy of it (which I could
check by seeing the symlinks did not point to files that existed, and
by the disk space consumed).
To prepare it to get the content easily we want to do:
git remote add ashram /bkup/purchased/craft-and-vision/the-created-image
git annex sync ashram
And then to actually get (copy over) all the content:
git annex get .
(the . expands to "everything in this directory"; you can also
specify individual files).
That data copy took a while (copying onto a relatively slow external USB drive -- a USB3 drive connected to a USB2 hub in the Apple Thunderbolt Display; apparently Apple chose not to put USB3 into the Thunderbolt Display :-( ), but did have good progress updates.
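An alternative, if the laptop repository also has the external drive repository configured as a remote, is to push the content from the source side instead; "datadrive" here is just an illustrative remote name:

cd /bkup/purchased/craft-and-vision/the-created-image
git remote add datadrive /data/purchased/craft-and-vision/the-created-image
git annex copy --to datadrive .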
Remote fileserver
My third step was to integrate the copy on a remote (colocated)
fileserver -- it already had the content, but it was not in git-annex
so I needed to tell git-annex about it, and make sure that the other
copies knew about it. Without copying it over the (slow, bandwidth
counted) network link in between. (It was at this point that hindsight
shows that I would have been better off starting with that remote copy, as
all my systems can easily ssh into that remote fileserver, via
assorted VPN connections -- but ssh into my OS X laptop is seldom used.)
The first task was getting to a point where, starting from the
remote system, I could do, eg, a git clone from my laptop. There were
a couple of challenges:
- A firewall in between my laptop and the remote file server allowed ssh from my laptop to the remote file server, but not back again.
- It took a while to be able to run git-annex-shell on OS X, via that ssh connection -- for which the eventual solution was to symlink git-annex-shell into a directory that my ~/.bashrc added into my non-interactive shell $PATH (as mentioned above).
To solve the firewall issue, I made use of ssh reverse port forwarding,
and .ssh/config to effectively alias my laptop name to the
forwarded
port:
- From my laptop, forward remote port 2222 back to the ssh daemon on my laptop:
  ssh -R 2222:localhost:22 172.20.2.3
- And on that remote server, add an entry to ~/.ssh/config, which maps my laptop name to that port:
  # Work around to reach back to laptop via ssh tunnels
  #
  Host ashram
      Hostname localhost
      port 2222
This can then be tested with ssh ashram, and should result in a
shell -- this also primes ssh's view of the host fingerprint, which
is a useful step.
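It is also worth checking that git itself can reach the laptop repository through that alias before going further, for example:

git ls-remote ssh://ashram/bkup/purchased/craft-and-vision/the-created-image

which just lists the refs in the laptop repository, and proves the tunnel plus host alias work for git (and hence for git-annex).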
Having done that, if I did not already have a copy of the data, then I could have carried on much like the external drive case with:
mkdir /data/purchased/craft-and-vision
cd /data/purchased/craft-and-vision
git clone ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "Colo fileserver"
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
...
and then to actually copy the data across the network (eg, if I was in
a well connected location and did not care about bandwidth used), to do
git annex get ., as in the external drive case above.
But because I already had copied the data to that remote fileserver
I did not want to copy it again. The solution is to set up the
remote repository slightly
differently,
so that you can add the files you have locally before you tell
git-annex to sync with the remote repository:
mkdir /data/purchased/craft-and-vision
mkdir /data/purchased/craft-and-vision/the-created-image
cd /data/purchased/craft-and-vision/the-created-image
ln ~/the-created-image/* . # Hard link in existing files
Then set up the basic git-annex repository around that data:
git init
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
git fetch ashram
git annex init "Colo fileserver"
and add the files you have locally:
git annex add .
before you tell git-annex to sync with the remote one, so that it
learns it already has the files. At this point the repository is
in a state where it is ready for you to commit them into git -- but
you should avoid doing that, because you want git annex to realise
the files are already committed and tell git everything is fine, rather
than creating a second conflicting commit.
You can use:
git annex whereis
to check that your new repository recognises that each file is in
two locations (in this case locally and "ashram"); then you get it
to tell the local git that the files are already known, and to tell the
other git-annex repository what it has:
git annex sync
After doing that, running:
git annex whereis
on either the remote file server repository, or the local copy on my laptop, showed that both recognised there were two copies.
The final step to make this useful is to go back to the laptop copy and give it a way to reach out to the remote file server:
git remote add colo-fileserver ssh://172.20.2.3/data/purchased/craft-and-vision/the-created-image
git annex sync
(No special ssh tricks required in that direction, since that is the direction that the firewall allows ssh.)
(FTR, someone else used a shell readlink
hack
to "fill in" the missing files, plus a git annex fsck to cause
git-annex to catch up with what had happened behind its back.
See also Joey's advice on recovering from lost
files.
It seems worth being aware of that method, but the method listed
above seems cleaner if you are starting from scratch. This guide
to using git-annex for
backups also
has some useful hints on "not completely meshed" git-annex usage.)
Hindsight
With the benefit of hindsight, I would have started on the remote fileserver first, since everything can ssh into that. So the plan would be:
- Set up the copy on the remote fileserver, stand alone.
- Set up the copy on my laptop, running in this order:
  git init
  git remote add NAME LOCATION
  git fetch NAME
  git annex init 'DESCRIPTION'
  git annex add .
  git annex sync
- Then set up the copies on the external drives, from the copy on the laptop drive.
That would have avoided some of the work arounds to be able to ssh
back into my laptop, and also avoided the git-annex-shell $PATH
issues, since the Debian Linux git-annex package just works in that
respect.
(Another option is to use a centralised bare git repository to be the glue in the middle -- without git-annex on that central repository. It is just there to be a transport point.)
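A minimal sketch of that approach (the paths and names here are illustrative): create a bare repository on the central server, add it as a remote of every git-annex repository, and let git annex sync push the git and git-annex branches through it -- the file content itself still moves directly between the repositories that actually hold it:

# on the central server
git init --bare /srv/git/the-created-image.git

# in each git-annex repository
git remote add origin ssh://central-server/srv/git/the-created-image.git
git annex sync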
ETA, 2014-12-22: Another very useful command if you already have a local copy of the file is:
git annex reinject ${PATH_TO_FILE} ${FILE_IN_ANNEX}
which can be done after a git clone, to take the local file and put
it into the git annex (removing it from its original location).
git annex will checksum the file to make sure it is the same
before moving it into the annex. If you want to keep the file in
its original location as well, without duplicating the file on disk,
the easiest answer appears to be to hard link it to a new name and
then reinject the hard link. Eg:
mkdir ../import
ln ${PATH_TO_FILE} ../import
git annex reinject ../import/${FILE_IN_ANNEX} ${FILE_IN_ANNEX}
Because the reinjected file is moved into the git annex, the hard link will stay intact, so you can keep the same content in another directory without having to duplicate the large file.
(Based on hints on stackexchange and tips on recovering data from lost and found)
Also useful is:
git annex addurl --file ${FILE_IN_ANNEX} ${URL}
which will associate a web URL with an existing file (see git
annex
addurl
for more details). Unless told to relax, it will check that the URL can
be fetched and that it returns a file of the same size.
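For example, the relaxed form, which just records the URL without performing that check, would be:

git annex addurl --relaxed --file ${FILE_IN_ANNEX} ${URL}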