Introduction
Over the last few years I have accumulated quite a lot of purchased, downloadable-only, content -- 200+ (DRM free) eBooks, quite a few software programs (mainly photography related), and an increasing amount of audio and video content (mostly photography training).
Some of it is only available for download for a short period of time (the 5daydeal specialises in that -- content is available for download only for about two weeks), so I really do not want to lose my copies of it. But I also do not want to lose my copies of other downloaded items, especially video, because even if I can still download them again, in New Zealand the bandwidth involved in downloading it again costs Real Money (tm). (On my current plan, excess data is $3/2GB -- which is cheap compared with what it used to be, but definitely not free when you're talking about many 10s of GBs.)
When the content was relatively small, my main approach was to keep it on my laptop, which has a good set of hourly (Time Machine), and weekly (SuperDuper!) backups (including an off-site rotation). Plus maybe another copy or two by hand onto a couple of Linux servers in different locations. But as the content grows -- both in disk space consumed, and quantity of files -- that becomes increasingly difficult to manage.
git-annex is an extension to the git version control system, which allows it to track "large" files -- larger than you would want to keep stored in every copy of the git repository -- in a more sensible fashion. Put very simply, git-annex uses git to keep track of the metadata of the files, including which git-annex repositories have actual copies of the file data -- and then provides commands for copying the data around on request. Amongst other useful things it can ensure that there are always N copies of the file data, and -- through the robust file checksums it keeps -- ensure that it knows when it has a complete, unchanged copy of the file data.
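The "always N copies" behaviour can be seen with the numcopies setting and fsck; a minimal sketch, assuming an existing git-annex repository (the file name is illustrative):

```shell
# Require at least 2 known copies of every file
git annex numcopies 2

# Verify checksums, and report any file with fewer than 2 known copies
git annex fsck

# git-annex will refuse to drop local content if doing so
# would leave fewer than 2 copies across all repositories
git annex drop some-video.mov
```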
git-annex is written by Joey Hess, who amongst other things was a Debian Developer until very recently, and wrote the blog software used for my blog. git-annex is both a command line tool, and also comes with an assistant, which provides an easier to use "synchronised folder" mechanism -- a bit like the Dropbox, etc, workflow. (The assistant was funded by a Kickstarter, and Joey has continued working on git-annex since then, picking up additional funding periodically.)
Since I am very familiar with the command line, and wanted to use both local and remote systems, I went with the basic command line git-annex.
My aim was to take:
- Files on my laptop internal drive (running OS X 10.9)
- Files on external drives attached to my laptop (some "most of the time" -- eg, an external data drive; and some just periodically, such as various "backup" drives)
- Files on my home Debian Linux server
- Files on my colocated Debian Linux server
and bring them together into a coherent git-annex managed set of files, without making additional copies of the files -- either on the local drives (due to not having a massive abundance of disk space) or, especially, over the network unless the machine/location had no copies of the files (due to limited, expensive, bandwidth).
git-annex installation
git-annex is written in Haskell, which is... not the most mainstream language (but from following the git-annex devblog it does seem to have made development of some features easier). This means it is not in MacPorts, and building it from source requires a lot of dependencies -- and the MacPorts haskell-platform is well out of date (although it does look promising that it might get updated to a reasonably current version soon). So after an aborted earlier attempt to get enough Haskell things installed to build from source -- via MacPorts, etc -- this time I just installed from the binary packages.
OS X 10.9 (Mavericks) install
The git-annex OS X autobuilder was recently switched to OS X 10.10, which means that the only recent builds are for OS X 10.10 (and I'd guess these will be the only ones available in future). However it appears that, at least for the git-annex command line tool, it does also work on OS X 10.9.
So I installed as follows:
- Downloaded git-annex.dmg from the git-annex OS X install page.
- Installed the GPG key to verify the git-annex download.
- Verified the downloaded dmg using the signature file next to it in the same directory, with:
gpg --import gpg-pubkey.asc
gpg --verify git-annex.dmg.sig
and verified that key by checking it against the Debian keyring on a Debian Linux system:
sudo apt-get install debian-keyring
gpg --import gpg-pubkey.asc
gpg --keyring /usr/share/keyrings/debian-keyring.gpg --check-sigs 89C809CB
- Opened the git-annex.dmg file, and dragged the git-annex.app into /Applications/OpenSource/
- Added a symlink to the git-annex command line tool in /usr/local/bin, so I could run git annex interactively, with:
cd /usr/local/bin/
sudo ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex .
- Added a symlink to the git-annex-shell command line helper into a directory that my ~/.bashrc adds to my $PATH, in this case ~/.bin, so that a git-annex connecting from another system can find it, with:
cd ~/.bin
ln -s /Applications/OpenSource/git-annex.app/Contents/MacOS/git-annex-shell .
(Note that /usr/local/bin is not on the OS X ssh path by default -- nor is ~/.bin -- so some other action is required to ensure that a git-annex that ssh's into the OS X system can find its helper; adding the Contents/MacOS directory to your $PATH in ~/.bashrc might also work, but my $PATH is long enough already!)
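For reference, that $PATH alternative would be a one-line addition to ~/.bashrc; a sketch, using the install location from the steps above:

```shell
# Make the bundled git-annex commands visible to shells (including the
# non-interactive shells bash starts for "ssh host command")
export PATH="$PATH:/Applications/OpenSource/git-annex.app/Contents/MacOS"
```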
Note: I already had git and several other dependencies installed from MacPorts, so I did not need to add those bundled commands into the $PATH as well -- I already had usable versions on my $PATH.
Debian 7.0 (Wheezy) install
There is a git-annex package in Debian 7.0 (Wheezy) -- Joey Hess was a Debian Developer after all! -- but it is pretty out of date, because the next release of Debian Stable (Jessie) is due out fairly soon.
The suggested approach is to enable the Debian Wheezy Backports with:
deb http://http.debian.net/debian wheezy-backports main
(eg, create /etc/apt/sources.list.d/wheezy-backports.list with that content, or put it in /etc/apt/sources.list), and then install the version in backports with:
sudo apt-get update
sudo apt-get -t wheezy-backports install git-annex
which is what I did. Because it is pre-packaged, and comes in via a checksum-verified installation method (Debian apt), that is all that is required to install; it will pull in a newer version of git as well as git-annex and some other dependencies from the Debian Wheezy archive. (In theory updates to the Backports will be installed automatically as required, but new packages will only be installed from Backports if you specifically request them, as above.)
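You can confirm which version apt will install, and that it comes from backports rather than the stock Wheezy archive, with (a sketch; the exact versions shown will vary):

```shell
# Show installed/candidate versions and which archive each comes from
apt-cache policy git-annex

# After installing, confirm the running version
git annex version
```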
Adding media into git-annex without copying it again
I started out following the git-annex walkthrough, planning to adapt it to avoid copying the data over the network. I did run into some problems, because git-annex wants to be able to ssh from the client system to other remote systems -- especially to git clone the archive -- but I found a work around to get across the firewall in the middle, so I will describe that approach below for reference. However, next time I do this I will probably start from a system that all the others can easily ssh to, even though it is a remote system -- now that I know how to migrate two separate copies of the same data into git-annex.
Local copy
My first step was to create a git-annex archive on my laptop from the local repository of one set of videos I had there (in a separate directory for testing purposes):
mkdir /bkup/purchased/craft-and-vision/the-created-image
cd /bkup/purchased/craft-and-vision/the-created-image
ln /bkup/video/the-created-image/*.mov . # Hard links, no extra disk usage
Then the git-annex repository can be set up:
git init
git annex init "ashram (laptop) internal drive"
and the files added to the repository:
git annex add .
git commit -a -m 'The Created Image, Volume 1'
(the git annex add . command will take a little while, as it is checksumming the content, and then moving it deep into the git archive and replacing it with a symlink -- indirect mode).
Throughout that process the extra disk space used was quite small, and the files remained hard linked to the original ones -- thus saving space. (When I am happy with this workflow I will remove the original copy, just leaving the git annex.)
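That hard linking can be confirmed with ls -li: both directory entries show the same inode number, and a link count of 2 (one file on disk, two names). A self-contained sketch, with illustrative paths:

```shell
# Demonstrate that hard links share one copy of the data on disk
work=$(mktemp -d)
mkdir "$work/original" "$work/staging"
echo "video data" > "$work/original/clip.mov"

# Hard link, as with the ln command above -- no extra disk usage
ln "$work/original/clip.mov" "$work/staging/clip.mov"

# Both names show the same inode number, and a link count of 2
ls -li "$work/original/clip.mov" "$work/staging/clip.mov"
```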
External drive
My second step was to create a second copy on my "usually attached" external drive -- this time making a copy because the drive had sufficient free space to create a new copy, and I wanted to test that process:
mkdir /data/purchased/craft-and-vision/
cd /data/purchased/craft-and-vision/
git clone /bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "ashram external data drive"
At this point the git annex archive was initialised, and it knew how to find the content -- but it did not have a copy of it (which I could check by seeing that the symlinks pointed to files that did not exist, and by the disk space consumed).
To prepare it to get the content easily we want to do:
git remote add ashram /bkup/purchased/craft-and-vision/the-created-image
git annex sync ashram
And then to actually get (copy over) all the content:
git annex get .
(the . expands to "everything in this directory"; you can also specify individual files).
That data copy took a while (copying onto a relatively slow external USB drive -- a USB3 drive connected to a USB2 hub in the Apple Thunderbolt Display; apparently Apple chose not to put USB3 into the Thunderbolt Display :-( ), but did have good progress updates.
Remote fileserver
My third step was to integrate the copy on a remote (colocated) fileserver -- it already had the content, but it was not in git-annex, so I needed to tell git-annex about it, and make sure that the other copies knew about it -- without copying it over the (slow, bandwidth counted) network link in between. (It was at this point that hindsight showed I would have been better off starting with that remote copy, as all my systems can easily ssh into that remote fileserver, via assorted VPN connections -- but ssh into my OS X laptop is seldom used.)
The first challenge was getting to a point where, starting from the remote system, I could do, eg, a git clone from my laptop. There were a couple of problems:
- A firewall in between my laptop and the remote file server allowed ssh from my laptop to the remote file server, but not back again.
- It took a while to be able to run git-annex-shell on OS X via that ssh connection -- for which the eventual solution was to symlink git-annex-shell into a directory that my ~/.bashrc added into my non-interactive shell $PATH (as mentioned above).
To solve the firewall issue, I made use of ssh reverse port forwarding, and .ssh/config, to effectively alias my laptop name to the forwarded port:
- From my laptop, forward remote port 2222 back to the ssh daemon on my laptop:
ssh -R 2222:localhost:22 172.20.2.3
- And on that remote server, add an entry to ~/.ssh/config which maps my laptop name to that port:
# Work around to reach back to laptop via ssh tunnels
Host ashram
    Hostname localhost
    Port 2222
This can then be tested with ssh ashram, and should result in a shell -- it also primes ssh's view of the host fingerprint, which is a useful step.
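A quick way to check both the tunnel and the git-annex-shell $PATH fix in one go is to run commands through the "ashram" alias defined above (a sketch; the second command relies on bash sourcing ~/.bashrc for remote non-interactive shells):

```shell
# Confirm the alias reaches the laptop through the reverse tunnel
# (and primes the host fingerprint at the same time)
ssh ashram uname -a

# Confirm git-annex-shell is findable on the non-interactive $PATH
ssh ashram 'command -v git-annex-shell'
```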
Having done that, if I did not already have a copy of the data, then I could have carried on much like the external drive case with:
mkdir /data/purchased/craft-and-vision
cd /data/purchased/craft-and-vision
git clone ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
cd the-created-image
git annex init "Colo fileserver"
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
...
and then, to actually copy the data across the network (eg, if I was in a well connected location and did not care about the bandwidth used), to do git annex get ., as in the external drive case above.
But because I had already copied the data to that remote fileserver, I did not want to copy it again. The solution is to set up the remote repository slightly differently, so that you can add the files you have locally before you tell git-annex to sync with the remote repository:
mkdir /data/purchased/craft-and-vision
mkdir /data/purchased/craft-and-vision/the-created-image
cd /data/purchased/craft-and-vision/the-created-image
ln ~/the-created-image/* . # Hard link in existing files
Then set up the basic git-annex repository around that data:
git init
git remote add ashram ssh://ashram/bkup/purchased/craft-and-vision/the-created-image
git fetch ashram
git annex init "Colo fileserver"
and add the files you have locally:
git annex add .
before you tell git-annex to sync with the remote one, so that it learns it already has the files. At this point the repository is in a state where it is ready for you to commit the files into git -- but you should avoid doing that, because you want git annex to realise the files are already committed, and tell git everything is fine, rather than creating a second, conflicting, commit.
You can use:
git annex whereis
to check that your new repository recognises that each file is in two locations (in this case locally and "ashram"); then you get it to tell the local git that the files are already known, and to tell the other git-annex repository what it has:
git annex sync
After doing that, running:
git annex whereis
on either the remote file server repository, or the local copy on my laptop, showed that both recognised there were two copies.
The final step to make this useful, is to go back to the laptop copy and give it a way to reach out to the remote file server:
git remote add colo-fileserver ssh://172.20.2.3/data/purchased/craft-and-vision/the-created-image
git annex sync
(No special ssh tricks required in that direction, since that is the direction that the firewall allows ssh.)
(FTR, someone else used a shell readlink hack to "fill in" the missing files, plus a git annex fsck to cause git-annex to catch up with what had happened behind its back. See also Joey's advice on recovering from lost files. It seems worth being aware of that method, but the method listed above seems cleaner if you are starting from scratch. This guide to using git-annex for backups also has some useful hints on "not completely meshed" git-annex usage.)
Hindsight
With the benefit of hindsight, I would have started on the remote fileserver first, since everything can ssh into that. So the plan would be:
Set up the copy on the remote fileserver, stand alone.
Set up the copy on my laptop using the:
git init
git remote add NAME LOCATION
git fetch NAME
git annex init 'DESCRIPTION'
git annex add .
git annex sync
order.
Then set up the copies on the external drives, from the copy on the laptop drive.
That would have avoided some of the work arounds needed to ssh back into my laptop, and also avoided the git-annex-shell $PATH issues, since the Debian Linux git-annex package just works in that respect.
(Another option is to use a centralised bare git repository to be the glue in the middle -- without git-annex on that central repository. It is just there to be a transport point.)
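A minimal sketch of that centralised variant (the server path and remote name here are illustrative): the bare repository holds only the git metadata, never the annexed content, and each real repository syncs through it:

```shell
# On the always-reachable server: a bare repository,
# used purely as a transport point for git-annex's metadata
git init --bare /srv/annex/the-created-image.git

# On each client: add it as a remote, and sync the git branches through it
git remote add central ssh://server/srv/annex/the-created-image.git
git annex sync

# File content still moves directly between repositories that hold it,
# eg: git annex get . --from ashram
```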
ETA, 2014-12-22: Another very useful command if you already have a local copy of the file is:
git annex reinject ${PATH_TO_FILE} ${FILE_IN_ANNEX}
which can be done after a git clone, to take the local file and put it into the git annex (removing it from its original location). git annex will checksum the file to make sure it is the same before moving it into the annex. If you want to keep the file in its original location as well, without duplicating the file on disk, the easiest answer appears to be to hard link it to a new name and then reinject the hard link. Eg:
mkdir ../import
ln ${PATH_TO_FILE} ../import
git annex reinject ../import/${FILE_IN_ANNEX} ${FILE_IN_ANNEX}
Because the reinjected file is moved into the git annex, the hard link will stay intact, so you can keep the same content in another directory without having to duplicate the large file.
(Based on hints on stackexchange and tips on recovering data from lost and found)
Also useful is:
git annex addurl --file ${FILE_IN_ANNEX} ${URL}
which will associate a web URL with an existing file (see git annex addurl for more details). Unless told to relax, it will check that the URL can be fetched, and that it returns a file of the same size.
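The relaxed variant mentioned above looks like this (a sketch; the file name and URL are illustrative):

```shell
# Record the URL against the file without verifying it now;
# availability is only checked when the content is actually fetched
git annex addurl --relaxed --file the-created-image-01.mov \
    http://example.com/downloads/the-created-image-01.mov
```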