I use Mac OS X’s Time Machine to back up my MacBook Pro, and it has saved my skin more than once, including one time when I lost the entire disk and had to rebuild it from scratch. A while ago I wrote a Ruby script to implement a less ambitious but similar backup strategy for my FreeBSD servers, allowing me to go back in time for lost email or blog posts. Here is how I did it.
In principle, I could have solved my backup needs by simply setting up a cron job that took a snapshot copy of, say, the email server’s repository once every hour and kept the last 24 of these, plus another job that took a copy of the oldest hourly snapshot once a day and kept those for a few months. It would work, in the sense that I would have access to old data, but it would be horribly inefficient in terms of storage because every snapshot would be a full copy.
A different attempt would be to find just the files that had changed and only copy those to a safe location. That too would work, but at the expense of making it difficult to recreate the email repository just as it looked at any one particular snapshot moment. It could be done, of course, but it would be a hassle and certainly not as simple as browsing the corresponding snapshot directory.
Fortunately, we can combine these two solutions and get the best of both worlds: we populate each new snapshot with copies of the files that have changed since the previous snapshot and, for everything else, with hard links to the corresponding entries in the previous snapshot.
In Unix file systems, hard links are really just directory entries that refer to the underlying file system level objects (usually referred to as i-nodes). As long as at least one hard link to an i-node exists, it remains in the file system. When the final hard link disappears, so does the i-node.
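To make the hard link behaviour concrete, here is a minimal Ruby sketch, assuming a Unix-like file system. The file names and the helper name `hard_link_demo` are made up for illustration; it creates a file, adds a second hard link, deletes the first name and shows that the data is still reachable through the second.

```ruby
require 'tmpdir'

# Returns [same_inode, link_count, content_after_delete] for a throwaway
# file and a hard link to it. File names are hypothetical.
def hard_link_demo
  Dir.mktmpdir do |dir|
    original = File.join(dir, 'original.txt')
    link     = File.join(dir, 'link.txt')

    File.write(original, 'hello')
    File.link(original, link)          # a second name for the same i-node

    same_inode = File.stat(original).ino == File.stat(link).ino
    link_count = File.stat(original).nlink

    File.delete(original)              # removes one name, not the data
    [same_inode, link_count, File.read(link)]
  end
end

p hard_link_demo
```

This is why a snapshot made mostly of hard links costs almost no extra disk space, and why deleting an old snapshot never damages a newer one that shares files with it.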
Calculating which files to copy and which to hard link to can be somewhat involved, but luckily the brilliant rsync utility can do this for us. If we supply the --link-dest option, rsync will only copy a source file to the target directory if an unchanged version of it does not already exist in the link-dest directory. Otherwise it will create a hard link to the link-dest version instead.
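As a rough illustration of what such an invocation looks like, the sketch below only builds the rsync argument list and prints it. The helper name `rsync_command` and the paths are hypothetical; the flags are the same ones the text relies on. Note that a relative `--link-dest` path is interpreted relative to the destination directory.

```ruby
# Builds the rsync argument list for one snapshot run.
# source:   directory to back up (hypothetical path in the example below)
# snapshot: directory that receives the new snapshot
# previous: the previous snapshot, relative to the destination directory
def rsync_command(source, snapshot, previous)
  [
    'rsync', '--archive', '--delete',
    "--link-dest=#{previous}",
    "#{source.chomp('/')}/",   # trailing slash: copy the contents of source
    snapshot
  ]
end

cmd = rsync_command('/var/mail', '/backup/mail.0', '../mail.1')
puts cmd.join(' ')
# prints: rsync --archive --delete --link-dest=../mail.1 /var/mail/ /backup/mail.0
```

Passing the array to `system(*cmd)` would run the command without going through a shell, which avoids trouble with spaces in path names.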
This means we can implement a Time Machine-like backup procedure in three easy steps:
- Rotate (rename) existing snapshots so that snapshot i becomes snapshot i+1. This makes room for a new “most recent snapshot” 0. Delete the oldest snapshot if it falls outside the number of snapshots you have decided to keep.
- Use rsync to copy files from the source (the backup root) to snapshot 0, pointing --link-dest to snapshot 1.
- Profit
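The rotation step can be sketched in isolation as follows. This is a simplified version under stated assumptions: the helper names `rotate` and `rotate_demo` are made up, the snapshots live in a temporary directory, and the oldest snapshot is simply deleted.

```ruby
require 'fileutils'
require 'tmpdir'

# Renames base.i to base.(i+1), oldest first, after removing the snapshot
# that would fall outside the `keep` most recent ones.
def rotate(base, keep)
  FileUtils.rm_rf("#{base}.#{keep - 1}")
  (keep - 2).downto(0) do |i|
    from = "#{base}.#{i}"
    File.rename(from, "#{base}.#{i + 1}") if File.exist?(from)
  end
end

# Creates three empty snapshots, rotates them and reports what is left.
def rotate_demo
  Dir.mktmpdir do |dir|
    base = File.join(dir, 'mail')
    3.times { |i| FileUtils.mkdir_p("#{base}.#{i}") }
    File.write(File.join("#{base}.0", 'marker'), 'newest')

    rotate(base, 3)
    Dir.children(dir).sort + [File.read(File.join("#{base}.1", 'marker'))]
  end
end

p rotate_demo
# prints: ["mail.1", "mail.2", "newest"]
```

Working from the oldest snapshot down to the newest matters: each rename then lands on a name that has just been vacated, leaving slot 0 free for the next rsync run.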
Here’s a Ruby script that follows this recipe outline — it adds a few extra bells and whistles but more on that after the code:
#!/usr/bin/env ruby

require 'tempfile'
require 'fileutils'
require 'pathname'

# A snapshot cycle: what to back up, how many snapshots to keep and which
# patterns to exclude. The spec file is plain Ruby that is evaluated in the
# context of this object, so it can call cycle and exclude directly.
class Cycle
  attr_reader :length, :root, :name, :exclusions

  def initialize(spec)
    @spec = spec
    @exclusions = []
    @length = 0
    @root = nil
    @name = File.basename(spec, '.*')
    eval(IO.read(spec))
  end

  def cycle(root, length, &block)
    @root = root
    @length = length
    yield(self) if block_given?
    die "Cycle length must be at least 1 (#{@length} found in #{@spec})" unless @length > 0
    die "Root #{@root} does not exist" unless File.exist?(@root)
  end

  def exclude(*patterns)
    @exclusions.concat(patterns)
  end
end

class Snapshot
  include FileUtils

  def rotate_existing_snapshots(cycle, snapshot_dir)
    # If the oldest expected snapshot exists, move it to the fixed spillover
    # directory (from which it may be picked up by another snapshot
    # with a longer cycle).
    oldest = "#{snapshot_dir}.#{cycle.length - 1}"
    if File.exist?(oldest)
      rm_rf "#{snapshot_dir}.spillover"
      mv oldest, "#{snapshot_dir}.spillover"
    end

    # Shift existing snapshot(s) by one, oldest first, if they exist.
    (cycle.length - 1).downto(0) do |i|
      if File.exist?("#{snapshot_dir}.#{i}")
        mv "#{snapshot_dir}.#{i}", "#{snapshot_dir}.#{i + 1}"
      end
    end
  end

  def take(cycle, target_dir)
    snapshot_dir = (Pathname.new(target_dir) + Pathname.new(cycle.name)).to_s
    rotate_existing_snapshots(cycle, snapshot_dir)

    Tempfile.open("exclude_list") do |f|
      f << cycle.exclusions.join("\n")
      f.flush # make sure rsync sees the complete exclude list

      # Run rsync while the temporary exclude file is guaranteed to exist.
      # Passing the arguments separately avoids shell quoting problems.
      system "rsync", "-vP", "--archive", "--delete",
             "--delete-excluded", "--exclude-from=#{f.path}",
             "--link-dest=../#{cycle.name}.1",
             "#{cycle.root}/", "#{snapshot_dir}.0"
    end
    touch "#{snapshot_dir}.0" # reflect snapshot end-time
  end
end

# Print an error message and exit with a non-zero status.
def die(message)
  warn message
  exit 1
end

die "Wrong number of arguments. Expected config-file and target-dir" unless ARGV.length == 2

config_file = ARGV[0]
target_dir = ARGV[1]

die "Configuration file #{config_file} does not exist" unless File.exist?(config_file)
die "Configuration file #{config_file} is not a file" unless File.file?(config_file)
die "Configuration file #{config_file} is not readable" unless File.readable?(config_file)

cycle = Cycle.new(config_file)

die "Target directory #{target_dir} does not exist" unless File.exist?(target_dir)
die "Target directory #{target_dir} is not a directory" unless File.directory?(target_dir)
die "Target directory #{target_dir} is not writable" unless File.writable?(target_dir)

Snapshot.new.take(cycle, target_dir)
The script should be invoked as follows:
ruby snapshot.rb cycle-file target-directory
where
- cycle-file is the name of a small Ruby file specifying what to back up and how many snapshots to keep
- target-directory is the name of an existing directory in which the snapshots should be kept
Here’s an example of a cycle-file:
cycle "/Users/jacob/Downloads", 3 do
  exclude "*.dmg", "*.pdf"
end
It defines a snapshot cycle (a series of snapshots) that backs up the directory tree rooted in /Users/jacob/Downloads, excluding *.dmg and *.pdf files. The cycle is defined to be three steps long, meaning that the script will keep the three most recent snapshots. We say that the cycle length is 3.
If you look closely at the script you’ll discover that it doesn’t delete snapshots that fall outside the cycle length straight away, but instead renames them to #{snapshot_dir}.spillover. This makes it easy to “chain” snapshot scripts and have a less frequently run version (e.g. the “daily backup”) use the spillover directory as its backup root without having to know about the cycle length used by the more frequently run instance (e.g. the “hourly backup”).
When the script runs it places each snapshot inside the target directory. The names of the snapshots are derived from the base name of the cycle definition file and the snapshot number. So, if the cycle file is called ‘daily.cycle’ and the target directory is ‘snapshots’, then we’ll end up with
snapshots
|
+-- daily.0
+-- daily.1
+-- daily.2
and so on.
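The naming rule can be illustrated with a few lines of Ruby. The helper `snapshot_paths` is hypothetical and only computes names; it does not touch the disk.

```ruby
# Lists the snapshot directory names for a cycle file, a target directory
# and a cycle length, using the base name of the cycle file as the prefix.
def snapshot_paths(cycle_file, target_dir, length)
  name = File.basename(cycle_file, '.*')   # 'daily.cycle' becomes 'daily'
  (0...length).map { |i| File.join(target_dir, "#{name}.#{i}") }
end

p snapshot_paths('daily.cycle', 'snapshots', 3)
# prints: ["snapshots/daily.0", "snapshots/daily.1", "snapshots/daily.2"]
```

Because the prefix comes from the cycle file name, two cycle files with different names can safely share one target directory.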
The beauty of this approach is that if I need to find a certain file in the backup root, then I simply cd into the snapshot taken closest to the point-in-time I’m interested in and search for it there. I won’t have to deal with incremental restores or the like — each snapshot directory will appear to contain the entire file tree rooted in the backup root as it appeared at the moment the snapshot was generated.
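Picking the right snapshot can also be automated. The sketch below is one possible approach, not part of the script: it assumes that each snapshot directory's modification time reflects when the snapshot finished (the script touches the directory for that purpose), and the helper names `closest_snapshot` and `closest_snapshot_demo` are made up.

```ruby
require 'fileutils'
require 'tmpdir'

# Returns the snapshot directory under target_dir whose modification time
# is closest to `moment`, or nil if no numbered snapshot exists.
def closest_snapshot(target_dir, name, moment)
  Dir.glob(File.join(target_dir, "#{name}.[0-9]*"))
     .select { |path| File.directory?(path) }
     .min_by { |path| (File.mtime(path) - moment).abs }
end

# Builds three fake daily snapshots, one day apart, and looks up the one
# closest to a point in time roughly 22 hours ago.
def closest_snapshot_demo
  Dir.mktmpdir do |dir|
    now = Time.now
    3.times do |i|
      path = File.join(dir, "daily.#{i}")
      FileUtils.mkdir_p(path)
      stamp = now - i * 86_400
      File.utime(stamp, stamp, path)   # pretend the snapshot is i days old
    end
    File.basename(closest_snapshot(dir, 'daily', now - 80_000))
  end
end

p closest_snapshot_demo
# prints: "daily.1"
```

The glob pattern only matches numbered snapshots, so a spillover directory in the same target directory is ignored.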
Next stop: moving to ZFS and its built-in snapshot facilities.