After having spent the last five years feeling guilty, I now, finally, have my laptop backing up the data I care about to another machine on my network. Here’s how I did it. This is a relatively long and complicated process, but it means that it all happens automatically and by magic, and I don’t ever have to interact with it, which is what I want.
The first component I needed was some backup space: a machine on my network that I could send the backups to. I did look at online backup space (Amazon’s S3 and similar) like all the cool kids, but I just can’t get on with it, and I resent paying because I’m a cheapskate. So, it was to be a box on my network.
Now, there are useful NAS machines around, which just get plugged in and automatically export their disc space (normally as a Windows share, with Samba), and I looked at those too (there’s the Terastation, etc, etc). However, I needed an always-on server for another purpose anyway, so I decided to go with a real machine. A machine cobbled together out of the Big Box Of Machine Bits, of course.
Setting up the server
It’s got two disc drives in it, and I divided the first disc into two partitions, one with 1GB and the other with all the rest. Install Ubuntu Linux 6.10 Edgy, server edition, on the 1GB partition. (I actually installed Dapper and then upgraded it to Edgy, for that bleeding edge greatness; at this writing, Edgy is only at RC stage.) After that, we want to take all the remaining space on the machine (one big partition on disc 1, and all of disc 2) and make them one big block of disc space; this is what LVM, the Logical Volume Manager, is for. Note that all this stuff can be done with proper GUI tools, but I don’t have a GUI on this machine because it’s a server and I’m trying to conserve disc space. This bit’s also from memory, so be very careful and don’t just slavishly follow it.
# First, make the partition available to LVM, by
# making it a "physical volume". This is LVM-speak for
# "a bit of a disk that I can use"
pvcreate /dev/hda3 # the big partition on the first disc
pvcreate /dev/hdb # and all of the second disc
# Now, create a "volume group". This is LVM-speak for
# "a big block of disc space all managed together"
vgcreate volumegroup /dev/hda3 /dev/hdb
# Next, create a "logical volume". LVM-speak: "something
# that looks like a disc drive, so you can mount it"
# First, find out how big it can be
vgdisplay | grep "Total PE"
# which prints something like: Total PE 11833
# now create the logical volume at that size
lvcreate -l 11833 volumegroup -n logical1
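# You now have a device /dev/volumegroup/logical1
# which you can treat as if it were a disc
# It needs a filesystem before it can be mounted
# (ext3 here is an assumption; use whichever you prefer)
mkfs.ext3 /dev/volumegroup/logical1
# Create a dir to put it in
mkdir /space
# and add it to /etc/fstab so it gets mounted. Add the line:
/dev/volumegroup/logical1 /space auto defaults 0 0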
After that complex little bit (again, if you aren’t tight like me, do it with the GUI, it’s easier), you will have a directory /space on the machine with loads of space in it. Install openssh-server and rsync, because we’ll need them later.
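For a rough sense of how much space that Total PE figure represents, multiply the number of extents by the extent size. The sketch below assumes LVM2's default physical extent size of 4 MiB, which may not match your volume group (check the "PE Size" line in the vgdisplay output); the function name pe_to_mib is made up for illustration.

```shell
#!/bin/bash
# Convert a count of LVM physical extents into MiB.
# Assumption: extents are 4 MiB (the LVM2 default) unless a second
# argument says otherwise; check "PE Size" in vgdisplay to be sure.
pe_to_mib() {
    local extents=$1
    local pe_size_mib=${2:-4}
    echo $(( extents * pe_size_mib ))
}

total_mib=$(pe_to_mib 11833)
echo "11833 extents is about ${total_mib} MiB ($(( total_mib / 1024 )) GiB)"
```

Under that assumption the 11833 extents above come to roughly 46 GiB.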
Rotating backups
The way I want my backups to work is as follows. Every night, each machine on my network should connect, and send everything that’s changed since yesterday. When I look on the backup server, there should be a folder for each machine, and inside each of those a folder per day. Each folder should look like a complete backup, but if a file hasn’t changed since yesterday it shouldn’t take up any more disc space. So, the folder structure should look, say, like this:
/space
  /stuart
    /2006-10-24
      /folder1
        /file1
        /file2
        /newfile1
      /folder2
        /file3
    /2006-10-23
      /folder1
        /file1
        /file2
      /folder2
        /file3
and the 2006-10-24 folder should have all the files in it but only take up as much space as newfile1. Complicated, but part of the reason I specified this is that I know it’s possible. (The main reason, of course, is that I’m tight and want to save disc space.) Making this happen involves two stages: making a hardlink tree, and using rsync.
The hardlink tree
If you can get over how much this sounds like something out of an Enid Blyton book, it’s a cool technique. I’m not going to explain hardlinks and inodes and things like that here, because there are many other descriptions elsewhere. Suffice to say that, if you have a folder, you can make a duplicate of that folder with cp -al folder newfolder, and that duplicate will look the same and be full of real files but take up almost no extra disc space. My nightly backup therefore needs to do the following:
- Copy last night’s backup to a new folder, named for the current date
- Change the data in the new folder to look like my laptop, so it’s got all yesterday’s data but with any changes I’ve made today
The issue here is: how do you know what last night’s backup is called? I’ve solved this by making sure there’s a symbolic link called current which always points to the most recent backup. So, the above process actually becomes:
- Copy the current folder to a new folder, named for the current date
- Change the data in the new folder to look like my laptop, so it’s got all yesterday’s data but with any changes I’ve made today
- Change the current link so it points to the newly created most recent backup
The script that does this is stored in /space/begin-backup, made executable with chmod +x /space/begin-backup, and looks like this:
#!/bin/bash
PERSON=$1
BROOT=/space
if [ -z "$PERSON" ]; then
    echo "You must pass the name of a backup dir"
    exit 1
fi
PDIR=$BROOT/$PERSON
# If person dir doesn't exist, create it
if [ ! -d "$PDIR" ]; then mkdir "$PDIR"; fi
# If there's no current dir, create an empty one and link it
if [ ! -d "$PDIR/current" ]; then
    mkdir "$PDIR/first"
    ln -s first "$PDIR/current"
fi
# Name the new backup with an ISO 8601 timestamp (GNU date)
DT=$(date -Iseconds)
# Hardlink-tree the existing recent dir
cp -al "$(readlink -f "$PDIR/current")" "$PDIR/$DT"
# and link current to the new hardlink tree
rm "$PDIR/current"
ln -s "$DT" "$PDIR/current"
We’ll come back to how you run this in a minute.
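To see what the cp -al step actually produces, here is a small throwaway demonstration. It is only a sketch: it works in a temporary directory rather than /space, the folder names yesterday and today are invented for the example, and it assumes GNU cp and a filesystem that supports hard links.

```shell
#!/bin/bash
# Throwaway demonstration of a hardlink tree; nothing here touches /space.
# Assumption: GNU cp (for -l) and a filesystem that supports hard links.
demo=$(mktemp -d)

# "Yesterday's" backup, with one file in it
mkdir -p "$demo/yesterday/folder1"
echo "hello" > "$demo/yesterday/folder1/file1"

# Make the hardlink tree, the same way begin-backup does
cp -al "$demo/yesterday" "$demo/today"

# -ef is true when two names refer to the same file on disc,
# which means the data is stored only once
if [ "$demo/yesterday/folder1/file1" -ef "$demo/today/folder1/file1" ]; then
    same_file=yes
else
    same_file=no
fi
echo "same underlying file: $same_file"

# The directories are real, separate directories, so a file added
# to today's tree does not show up in yesterday's
echo "new" > "$demo/today/folder1/newfile1"
if [ -e "$demo/yesterday/folder1/newfile1" ]; then
    leaked=yes
else
    leaked=no
fi
echo "new file leaked into yesterday: $leaked"

# Tidy up the temporary directory
rm -rf "$demo"
```

Running ls -li on the two copies of file1 before the tidy-up would show the same inode number for both, which is another way of checking the same thing.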
Rsync
The “change the data in the new folder to look like my laptop” bit is done with rsync, which is complex but brilliantly clever. In essence, rsync is like copy (or cp), except that it compares the source and the destination and only sends the changes over. On my laptop, I can do
rsync -avz --delete -e ssh \
    /some/folder/to/back/up \
    myserver:/space/stuart/current/
and that will copy /some/folder/to/back/up over to the server. Importantly, if that folder is already in the backup space, in the current folder (because we backed it up yesterday) then it’ll only copy the changes over. This is why we make sure that there’s a folder called current with the contents of last night’s backup! Exactly how we run this rsync command we’ll come on to in a minute. Patience, Iago.
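One subtlety when pointing rsync at a hardlink tree: today's copy of an unchanged file and yesterday's are the same file on disc until something replaces one of them. By default (that is, without --inplace) rsync builds the updated file under a temporary name and renames it over the old one, so yesterday's name keeps the old contents. The sketch below imitates that behaviour with plain shell commands rather than running rsync itself; all the paths are throwaway temporary ones invented for the example.

```shell
#!/bin/bash
# Imitates (does not run) rsync's default write-new-file-then-rename update,
# to show why older snapshots in a hardlink tree keep their old contents.
demo=$(mktemp -d)
mkdir "$demo/yesterday" "$demo/today"
echo "old contents" > "$demo/yesterday/file1"
ln "$demo/yesterday/file1" "$demo/today/file1"   # same file, two names

# Update today's copy the way rsync does by default:
# write a temporary file alongside it, then rename it into place
echo "new contents" > "$demo/today/.file1.tmp"
mv "$demo/today/.file1.tmp" "$demo/today/file1"

yesterday_after_rename=$(cat "$demo/yesterday/file1")
today_after_rename=$(cat "$demo/today/file1")
echo "yesterday: $yesterday_after_rename / today: $today_after_rename"

# By contrast, overwriting a file in place changes every hardlinked name
ln "$demo/today/file1" "$demo/other_name"
echo "overwritten in place" > "$demo/today/file1"
linked_after_overwrite=$(cat "$demo/other_name")
echo "other name now reads: $linked_after_overwrite"

# Tidy up the temporary directory
rm -rf "$demo"
```

The second half is the behaviour to avoid, which is why rsync's --inplace option is a poor fit for this scheme.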
Choosing what gets backed up
I don’t want to back up everything. I don’t have the space, and to be frank I have a lot of crap lying around on my machine. So I need a very easy way of tagging something for backups. This is a perfect use of emblems; I can “tag” a file or a folder in the file manager with a special “backup” emblem, and that should indicate to my backup process that that file or folder wants to be included in the backup. Ubuntu doesn’t have a backup emblem included by default, but adding one is easy, and explained in the docs. Pick yourself an image (I use this little tape) and add it as an emblem, and then go through your machine and add it to every file or folder that needs backing up. (This will, if applied to a folder, back up everything inside it. If you need it to back up only some of the stuff inside it then you’ll have to not apply it to the folder. Yes, this is awkward, but I don’t need to do that.) Applying emblems is also in the documentation; a quick way if you’re doing this a lot is to pop up the Edit > Backgrounds and Emblems window and just repeatedly drag your new backup emblem to everything.
SSH with no password
One final preparation step: in order that the backup can run without me being around, I need to be able to make an ssh connection from my laptop to the server without entering a password. I’m not going to describe how to do this because there are plenty of guides out there on the web.
Make it so
Now, finally, after lots of setup, it’s time to actually make it all happen. To summarise, then, to do a backup, we need to:
- Run, on the server, the copy-last-night’s-backup script
- Get the list of all the files with the backup emblem
- Use rsync to copy all those files into the new backup folder on the server
To get the list, we can use my findemblem.py script (and you thought I just wrote it for fun!). The final script, dobackup.sh, which actually does the work, just does the above steps, and looks like this:
#!/bin/bash
# Do backups to the rsync server
# You must have already set up a passphraseless ssh key to the ssh server
# so that "ssh servername" just logs you in.
BK=$(dirname "$0")
BKNAME=stuart
# First, tell it to clock over the backup
ssh servername /space/begin-backup "$BKNAME"
# Now, do the backup (one rsync per emblemed file or folder)
python "$BK/findemblem.py" backup | while IFS= read -r fn; do
    rsync -avzq --delete -e ssh "$fn" \
        "servername:/space/$BKNAME/current"
done
All that remains now is to schedule this script to run every night, by editing your tasklist with crontab -e and adding the line
40 4 * * * /full/path/to/dobackup.sh
And, lo and behold, you have overnight backups. All done and dusted. Phew.
Did you consider LVM snapshots? (http://tldp.org/HOWTO/LVM-HOWTO/snapshotintro.html) If so, why did you go with massive trees of hardlinks instead?
I did briefly consider them, but, well, I don’t understand them. For example: does the snapshot space get deducted from the space in your LVM, or do you need space *outside* the LVM to put them in? The documentation seems to pretty much assume that you already know about snapshots, and I don’t…
[...] aquarius, did you know about rsnapshot before doing this? It seems to do pretty much everything you want. I’ve been using it for several years, and currently back up two local and twelve remote machines every four hours. I’m very happy with it. [...]
Hi,
Very cool! I’ve been doing basically the same thing for years – but I must say that the use of emblems is just awesome!
I did some modification to your dobackup.sh to handle remote and local backups (or both).
#!/bin/bash
# Do backups to local media or/and an rsync server
# You must have already set up a passphraseless ssh key to the ssh server
# so that "ssh servername" just logs you in.
BK=$(dirname "$0")
BKNAME=$USER
#BKROTATE=begin-backup
BKROTATE="rotate-backup"
BKREMOTEHOST=
BKREMOTEDIR=
BKLOCALDIR=/media/LACIE/backup
BKEMBLEMNAME="Backup"
RSYNCFLAGS="-avzq --delete"
# Rotate the backup
## Try to rotate remote backup location
if [ -n "$BKREMOTEHOST" ]; then
    ssh $BKREMOTEHOST $BKREMOTEDIR/$BKROTATE $BKNAME
    if [ $? != 0 ]; then
        echo "Failed to connect to backup host (${BKREMOTEHOST})."
    else
        ### Now, do the backup
        $BK/findemblem.py $BKEMBLEMNAME | while read fn; do
            rsync $RSYNCFLAGS -e ssh "$fn" $BKREMOTEHOST:$BKREMOTEDIR/$BKNAME/current
        done
    fi
fi
## Try to rotate local backup location
if [ -n "$BKLOCALDIR" -a -d "$BKLOCALDIR" ]; then
    $BKLOCALDIR/$BKROTATE $BKNAME
    ### Now, do the backup
    $BK/findemblem.py $BKEMBLEMNAME | while read fn; do
        rsync $RSYNCFLAGS "$fn" $BKLOCALDIR/$BKNAME/current
    done
fi
## If no backup destination is specified, abort
if [ -z "$BKREMOTEHOST" -a -z "$BKREMOTEDIR" -a -z "$BKLOCALDIR" ]; then
    echo "No backup destination specified."
    exit 1
fi
Ok, that came out completely garbled.
I put it here http://www.update.uu.se/~peterl/tmp/dobackup.sh.txt
if someone wants it.
Oh, and a final note – you probably have some good reason for not using -R (use relative pathnames), but it really is a good idea and makes it much easier to restore from backup
[...] Today I saw a fun idea about using Nautilus emblems to mark what should be backed up, so now I have nice backup tapes here and there in my file manager. I’ve tinkered a bit with this and that; maybe it should be polished into something more general and packageable? [...]
Unless I missed something when I read this, this backup method won’t create proper incremental backups. The reason being that each new backup contains hard links to all the files in the previous one. That’s fine when you create/delete a file, but when you edit an existing file rsync will update that file with the new content. Since every copy of that file in each day’s backup is a hard link to the same file, they’ll all be updated.
astopy: you have missed something, and it’s this: when rsync needs to change a file, it unlinks it first.
Ah, thanks. I had no idea
[...] Anyway, this means that something I’ve been thinking of doing for a while leaps up my priority list. One of the 6 computers is a server: an old tower case machine with 90GB of storage in it made up of scrounged hard drives I had lying around. I use it as the server for my home backup system, and it does a good job. However, since I can now play media across the network (exciting! it is!), I could use it as the storage place for all our films and TV and music and games and whatnot. However, it’s old and a bit noisy, so it’d be nice to get rid of it. I still need whatever I replace it with to be an actual server, though, not just a NAS-style stack o’ discs, so I don’t want a Buffalo Terastation or similar — it not only needs to run the rsync server but I also plan on having it be used for BitTorrenting stuff I’d like to watch and to be always on for downloading things. [...]