A user at $WORK was running a series of jobs on the cluster -- dozens
at any moment. Other users have their quota set to 60 GB, but this
user's was not (long story). His home directory is at 400 GB, but it was
closer to a terabyte not so long ago....right when we had a hard drive
and a tape drive fail at the same time on our backup server.
We do backups every night to tape using Bacula. Most backups are
incremental (whatever changed since the last backup, usually the day
before) and are small...maybe tens of GB per day. But backups for
this user, because of the proliferation of logs from his jobs, were
closer to the size of his home directory every day -- simply because
all these log files were being updated as each job progressed.
Ordinarily this wouldn't be a problem, but the cluster of hardware
failures has really fucked things up; they're better now, but I'm
very slowly playing catchup backups. Eating a tape or more every day
is not in my budget right this moment.
I asked him if any of the log files could be excluded from backups
without any great loss. After talking it over with him, we came to
this agreement:
- His home directory would be backed up (obvs)
- but within "projects/output", only files that contained "rep0"
somewhere in the filename would be backed up.
This would exclude lots of other files like "1rep2.foo", "8rep9.log",
etc., and would cut out about 200 GB of useless churn every day.
Bacula has the ability to do this sort of thing...but I found its
methods somewhat counterintuitive, so I want to set down what I did
and how I tested it.
First off, the original, let's-include-everything FileSet looked like
this:
    FileSet {
      Name = "example"
      Include {
        File = /home/example
        Options {
          signature = SHA1
        }
      }
      Exclude {
        File = /proc
        File = /tmp
        File = /.journal
        File = /.fsck
        File = /.zfs
      }
    }
We back up everything under /home/example, we keep SHA1 signatures,
and we exclude a handful of directories (most of which are
boilerplate, applied to every FileSet by default).
In order to get Bacula to change the FileSet definition, you have to
get the director to reload its configuration file. But some errors
-- not all -- cause a running bacula-dir process to die. So before I
started fiddling around, I added a Makefile to the /opt/bacula/etc
directory that looked like this:
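    test:
    	@/opt/bacula/sbin/bacula-dir -t && echo "bacula-dir.conf looks good" || { echo "problem with bacula-dir.conf" ; exit 1 ; }
    reload: test
    	echo "reload" | /opt/bacula/sbin/bconsole
Whenever I made a change, I'd run "make reload", which would test the
configuration first; if it failed, the test recipe would exit non-zero,
make would stop, and bacula would not be reloaded. (The "@" symbol, in
a Makefile, keeps make from echoing the command itself before running
it; the command's own output still shows up. Recipe lines have to start
with a tab.)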
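One thing worth checking with a guard like this is the exit status the
test recipe actually hands back to make, since that is what decides
whether the reload runs. A minimal sketch, using "false" as a stand-in
for a failing "bacula-dir -t" (the variable names are made up for
illustration):

```shell
#!/bin/sh
# "false" stands in for a failing "bacula-dir -t"; nothing here touches Bacula.

# Chain ending in echo: the echo succeeds, so the whole line reports success.
status_echo_only=0
sh -c 'false && echo "looks good" || echo "problem"' >/dev/null || status_echo_only=$?

# Chain ending in an explicit exit: the failure is passed along to the caller.
status_with_exit=0
sh -c 'false && echo "looks good" || { echo "problem" ; exit 1 ; }' >/dev/null || status_with_exit=$?

echo "echo only: $status_echo_only, with exit: $status_with_exit"
```

Make treats a zero status as success, so only the second form stops a
target that depends on "test".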
Next, I needed a listing of what we were backing up now, before I
started fiddling with things:
    echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-before
The "estimate" command gets Bacula to estimate how big the job is; the
"listing" argument tells it to list the files it'd back up. By
default it gives you the info for a full backup. (You can also append
a job level, so you can see how big a Differential or Incremental
would be; I didn't need that here, but it's worth remembering for next
time.)
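A sketch of what the job-level variant might look like, assuming the
console's estimate command takes the level as a "level=" keyword (worth
checking against your Bacula version). The helper name "estimate_cmd"
and the output file are made up for illustration:

```shell
#!/bin/sh
# Build the console command as a string so it can be inspected before use.
estimate_cmd() {
    # $1 = job name, $2 = job level (Full, Differential, Incremental)
    printf 'estimate job=%s listing level=%s\n' "$1" "$2"
}

# Show the command that would be sent to the director.
estimate_cmd fileserver-example Incremental

# Only talk to the director if bconsole is actually installed.
if command -v bconsole >/dev/null 2>&1 ; then
    estimate_cmd fileserver-example Incremental | bconsole > /tmp/listing-incremental
fi
```

Printing the command first makes it easy to paste into an interactive
bconsole session if the piped version misbehaves.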
After that, I made another Makefile that looked like this:
    test: estimate shouldwork shouldfail
    estimate:
    	@echo "estimate job=fileserver-example listing" | bconsole > /tmp/listing-after ; wc -l /tmp/listing*
    shouldwork: estimate
    	grep rep0 /tmp/listing-before | grep projects/output | while IFS= read -r i ; do grep -qF -- "$$i" /tmp/listing-after || exit 1 ; done
    shouldfail: estimate
    	grep rep2 /tmp/listing-before | grep projects/output | while IFS= read -r i ; do if grep -qF -- "$$i" /tmp/listing-after ; then exit 1 ; fi ; done
This is a little hackish, so in detail:
- The estimate target gets an updated listing of what Bacula will
  back up; the line count lets me eyeball how it compares to the old,
  all-inclusive listing.
- The shouldwork target gives me a quick way to make sure that all
  the files with "rep0" in the name and "projects/output" in the path
  are still in that updated listing. We grep for each of these files
  in the new listing (as a fixed string, quoted, so odd characters in
  a filename don't turn into a regex); it either works or exits with
  error code 1, which make will catch and declare an error.
- The shouldfail target is similar, except I'm making sure that
  files with "rep2" in the name are excluded from the new listing,
  so the loop bails out with error code 1 as soon as any of them is
  found. It's written as an "if" rather than "grep && exit 1" so that
  a miss leaves the loop with a zero status. Tacking "; true" on the
  end instead would not work: in most shells the loop at the end of a
  pipeline runs in a subshell, so "exit 1" only leaves the subshell
  and the "true" then papers over the failure. This is what makes the
  test a "MUST NOT".
(That's probably not explained very well.)
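The two checks can be tried out away from Bacula entirely. Here is a
minimal sketch against toy listings in a scratch directory; the paths
and the function names "must_keep" and "must_drop" are invented for the
demo, and the function bodies are subshells so that "exit" never leaves
the calling script:

```shell
#!/bin/sh
# Toy listings in a scratch directory; the paths are invented for the demo.
tmp=$(mktemp -d)
printf '%s\n' \
    /home/example/projects/output/1rep0.log \
    /home/example/projects/output/1rep2.foo \
    /home/example/notes.txt > "$tmp/before"
printf '%s\n' \
    /home/example/projects/output/1rep0.log \
    /home/example/notes.txt > "$tmp/after"

# must_keep PATTERN: every matching line in "before" must also be in "after".
must_keep() (
    grep "$1" "$tmp/before" | grep projects/output | while IFS= read -r i ; do
        grep -qxF -- "$i" "$tmp/after" || exit 1
    done
)

# must_drop PATTERN: no matching line in "before" may appear in "after".
must_drop() (
    grep "$1" "$tmp/before" | grep projects/output | while IFS= read -r i ; do
        if grep -qxF -- "$i" "$tmp/after" ; then exit 1 ; fi
    done
)

keep_status=0  ; must_keep rep0 || keep_status=$?
drop_status=0  ; must_drop rep2 || drop_status=$?
# rep0 files are still in "after", so demanding they be dropped should fail.
wrong_status=0 ; must_drop rep0 || wrong_status=$?

echo "keep=$keep_status drop=$drop_status wrong=$wrong_status"
rm -rf "$tmp"
```

The third call is there to show that the "MUST NOT" check really can
fail; a test that can only pass is not telling you much.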
Anyhow: after each change, I'd run "make reload" as root to make sure
that the syntax worked. After that, I'd run "make test" as an
ordinary user (no need for root privileges) to make sure that I was on
the right track. After a while, I got this:
    FileSet {
      Name = "example"
      Include {
        File = /home/example
        Options {
          signature = SHA1
          Wilddir = /home/example/projects/output
          Exclude = yes
        }
      }
      Include {
        File = /home/example/projects/output
        Options {
          WildFile = "*rep0*"
          Signature = SHA1
        }
        Options {
          Exclude = yes
          RegexFile = ".*"
        }
      }
      Exclude {
        File = /proc
        File = /tmp
        File = /.journal
        File = /.fsck
        File = /.zfs
      }
    }
Again, this is a little counterintuitive to me, so here's how it works out.
The first "Include" stanza is the same, except that in the "Options"
section we're excluding "/home/example/projects/output". That's what
the "Wilddir" and "Exclude = yes" directives are for.
The second "Include" stanza puts "/home/example/projects/output"
back in, but modified with two "Options" sections. Bacula checks them
in order and uses the first one that matches a file: the first matches
anything with "rep0" in the name (a simple fileglob) and includes it;
the second matches everything else and excludes it. What ends up
being included by this stanza is only the files with "rep0" in the
name under "/home/example/projects/output".
Last, the third stanza is our standard "Exclude" boilerplate.
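To make the ordering concrete, here is a rough model of the intended
outcome in plain shell. It is not Bacula's matcher, just "case"
patterns that mimic the stanzas above, assuming Bacula applies the
first Options section that matches a file. The function name "decide"
and the sample paths are invented:

```shell
#!/bin/sh
# decide PATH: print "backup" or "skip" for a path, checking rules in order.
decide() {
    case "$1" in
        /home/example/projects/output/*)
            # Second Include stanza: the first Options that matches wins.
            case "${1##*/}" in
                *rep0*) echo backup ;;   # WildFile = "*rep0*"
                *)      echo skip ;;     # RegexFile = ".*" with Exclude = yes
            esac ;;
        /home/example/*)
            echo backup ;;               # first Include stanza
        *)
            echo skip ;;                 # outside the FileSet entirely
    esac
}

for p in \
    /home/example/notes.txt \
    /home/example/projects/output/3rep0.log \
    /home/example/projects/output/8rep9.log
do
    printf '%s %s\n' "$(decide "$p")" "$p"
done
```

Swapping the two inner patterns would skip everything under
projects/output, which is the same reason the order of the two
"Options" sections matters.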
After I was confident that I had the right set of files excluded, I
sent the user a list of files to confirm that all was well:
    cat /tmp/listing-before | while IFS= read -r i ; do grep -qF -- "$i" /tmp/listing-after || echo "$i" ; done > /tmp/excluded
Now, I'm the first to admit that that is ugly. Diff, useless use of
cat...lots of objections to raise. But it's been a long day and I got
what I wanted. I pointed the user at it, made sure it was okay, and
committed the changes.
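For the record, a tidier way to get the same list is to let grep do
the whole comparison in one pass. A sketch on toy listings (the paths
are invented; in practice the inputs would be the before and after
listings in /tmp):

```shell
#!/bin/sh
# Toy listings standing in for the real before/after files.
tmp=$(mktemp -d)
printf '%s\n' \
    /home/example/a.txt \
    /home/example/projects/output/1rep0.log \
    /home/example/projects/output/1rep2.foo > "$tmp/before"
printf '%s\n' \
    /home/example/a.txt \
    /home/example/projects/output/1rep0.log > "$tmp/after"

# -F fixed strings, -x whole-line match, -v invert, -f patterns from a file:
# print each line of "before" that is not an exact line of "after".
excluded=$(grep -vxFf "$tmp/after" "$tmp/before")
echo "$excluded"
rm -rf "$tmp"
```

One caveat: real bconsole output also carries a few non-path lines
(connection banner, totals), and any of those that differ between runs
will show up in the result too.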
All in all, this gave me a good loop for testing: it caught fatal
errors before they happened, it let me be sure I was excluding the
right things, and I was able to work in a stepwise fashion to get
where I wanted.