the life of a sysadmin.
Carousel is a lie!


Email: aardvark at saintaardvarkthecarpeted dot com

GPT and MBR

Fri Jul 3 12:17:25 PDT 2009

I've run into an interesting problem with the new backup machine.

It's a Sun X4240 with 10 x 15k disks in it: 2 x 73GB (mirrored for the OS) and 8 x, um, a bunch (250GB?), RAID0 for Bacula spooling. (I want fast disk access, so RAID0 it is.) RAID is taken care of by an onboard RAID card, so these look like regular disks to Linux.

Now the spool disk works out to about 2.4TB, which is big enough to make baby fdisk cry:

WARNING: The size of this disk is 2.4 TB (2391994793984 bytes).
DOS partition table format can not be used on drives for volumes
larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID
partition table format (GPT).
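
A note on where that limit comes from: the DOS partition table stores sector addresses in 32-bit fields, and (2^32 - 1) sectors of 512 bytes each comes to exactly the 2199023255040 bytes quoted in the warning. Here is a minimal sketch of the arithmetic; the function names are invented for illustration and 512-byte sectors are an assumption.

```python
# Sketch of where fdisk's 2.2 TB figure comes from.
# Assumptions: 512-byte sectors, 32-bit sector addresses in the DOS (MBR) table.
SECTOR_BYTES = 512
MAX_ADDRESSABLE_SECTORS = 2**32 - 1


def mbr_limit_bytes(sector_bytes: int = SECTOR_BYTES) -> int:
    """Largest disk size a DOS partition table can describe, in bytes."""
    return MAX_ADDRESSABLE_SECTORS * sector_bytes


def needs_gpt(disk_bytes: int, sector_bytes: int = SECTOR_BYTES) -> bool:
    """True when the disk is too big for a DOS partition table."""
    return disk_bytes > mbr_limit_bytes(sector_bytes)


spool_disk_bytes = 2391994793984  # size fdisk reported for the spool array
print(mbr_limit_bytes())            # 2199023255040
print(needs_gpt(spool_disk_bytes))  # True
```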

Well, okay, haven't used parted before but that's no reason to hold back. I follow directions and eventually figure out that mkpart gpt ext3 0 2392G will do what I want. GPT? Piece of cake! And then I rebooted, and I couldn't boot up again. Blank screen after the POST. Crap!

The first time this happened, the reboot also coincided with some additional problems during the POST where too many cards were trying to shove their ROM into the BIOS memory (or some such); I thought the two were connected. But then I did it again today, and I finally started digging.

The problem is that parted overwrites the MBR when setting up a GPT disklabel. This has been noted and argued over. My understanding of the two sides of the debate is:

  • the MBR is not part of the EFI standard, so it's entirely rational that it should be erased;

  • but very few x86 machines are EFI-only;

  • and traditional disklabels don't support partitions over 2TB, so what's a brother gonna do?;

  • and an MBR-GPT hybrid seems a nice way out of this.
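
To see what state sector 0 is in after a relabel, something like the following sketch would do. It relies only on the classic MBR layout: 440 bytes of boot code, four 16-byte partition entries starting at offset 446 with the type byte at offset 4 of each entry, and the 0x55 0xAA signature at offset 510. Type 0xEE marks a GPT protective entry. The function name and the dictionary keys are made up for illustration.

```python
def inspect_mbr(sector: bytes) -> dict:
    """Summarize the first 512 bytes of a disk (hypothetical helper)."""
    if len(sector) < 512:
        raise ValueError("need at least the first 512 bytes of the disk")
    boot_code = sector[:440]
    # Partition type is the fifth byte of each 16-byte table entry.
    partition_types = [sector[446 + 16 * i + 4] for i in range(4)]
    return {
        "has_boot_code": any(boot_code),              # all zeros means nothing to boot
        "has_signature": sector[510:512] == b"\x55\xaa",
        "is_protective_mbr": 0xEE in partition_types,
    }
```

To use it, read 512 bytes from the device as root (for example `open("/dev/sdb", "rb").read(512)`, where the device path is only an example) and pass them in. A disk that reports a protective MBR but no boot code would be consistent with the blank screen after POST described above.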

Meanwhile, the parted camp has a number of bugs dealing with this very issue, two opened a year ago, and none have any response in them.

This enterprising soul submitted a patch back in December 2008, which appears to have fallen to the floor.

As for me, I was able to convince the BIOS to boot from the smaller disk, then get a rescue CentOS image going via PXE booting, and then reinstall grub on the smaller disk. Sorted. All I had to do was change root (hd1,0) to root (hd0,0) in grub.conf.
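
For what it's worth, that grub.conf change is a one-liner in an editor, but here is a sketch of doing it programmatically, say across several rescued machines. It only touches lines that begin with the root directive; the function name and defaults are assumptions for illustration, not anything from grub itself.

```python
import re


def retarget_grub_root(conf_text: str, old: str = "(hd1,0)", new: str = "(hd0,0)") -> str:
    """Rewrite 'root (hdX,Y)' lines in GRUB legacy config text (hypothetical helper)."""
    # Match only lines whose first word is 'root', so kernel arguments such as
    # root=/dev/... are left alone.
    pattern = re.compile(r"^(\s*root\s+)" + re.escape(old), re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + new, conf_text)
```

It returns the new text rather than writing the file, so the result can be diffed against the original before anything is committed to disk.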

A touch anti-climactic after all that, perhaps. But it was interesting a) to learn about all this (I hadn't really thought about successors to the DOS partition format before), and b) to see what a slender thread we (okay, I) hang our hopes on sometimes. It's a necessary, sobering thing to realize how much of what I use, depend on, and believe in is created by volunteers who are smart, hard-working people. They argue and focus and forget just like real people, not the inhabitants of some shining city on a hill I sometimes take them for ("Next beer in Jerusalem!").


Bacula, gossip, advice

Thu Jul 2 16:31:35 PDT 2009

  • Bacula config coming along; figured out today that /dev/nst0 corresponds to what mtx sees as Data Transfer Element 1 (as opposed to DTE 0), which explains why previous attempts to run label barcode just failed miserably. (Neat command that.) And I had thought that DTE meant the arm, but no: upon reflection, it's a subtle/obtuse (not the right word, but oh well) way of referring to the tape drive itself.

  • Rather interesting comment, if you like that sort of thing, from Mark Burgess (originator of Cfengine) on Puppet and Luke Kanies. I know, I should remain above it, but it is weirdly fascinating.

  • And to go out on a high note, some excellent advice from Tom Limoncelli on setting priorities as a sysadmin:

This sounds like when I was at my previous employer and they asked if
I could develop a web-based system to take surveys.  I nearly said,
"yes" because, well, I know perl, I know CGI, and I could do it.
However, I was smart enough to say "no, but surveymonkey.com will do
it for cheap."  Best of all it was self-service and the HR person was
able to do it entirely without me.  If I had said I could write such a
program, it would have been days of back-and-forth changes which would
have driven me crazy.  Instead, she was happy to be empowered to do it
herself.  In fact, doing it herself without any help became a feather
in her cap.

The lesson I learned is that "can I do it?" includes "do I want to do
it?".  If I can do something but don't want to, the answer is, "No, I
don't know how" not "I know how but don't want to".  The first makes
you look like you know your limits.  The latter sounds like you are
just being difficult.


1246317421 seconds since the epoch…

Mon Jun 29 16:17:01 PDT 2009
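
The title and the timestamp above are the same moment written two ways. This short sketch does the conversion, assuming a fixed UTC-7 offset for PDT; the function name is invented.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Pacific Daylight Time (assumption: no DST lookup is done).
PDT = timezone(timedelta(hours=-7), "PDT")


def epoch_to_pdt(seconds: int) -> datetime:
    """Convert seconds since the Unix epoch to a PDT-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=PDT)


print(epoch_to_pdt(1246317421).isoformat())  # 2009-06-29T16:17:01-07:00
```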

I'm back at work after a week off. The UPS control panel continues to work (!), but there is no word back from the manufacturer (says the contractor who installed the thing and filed the ticket). I find this troubling; either the manufacturer really hasn't got back to us yet (bad), or I should have insisted on being a contact for the ticket. I'll have to sort this out tomorrow.

Spent much of my day tearing my hair out over mod_proxy_html. Turns out that, by default, it strips the DTD from the HTML it proxies; this is a problem for one app that we're proxying. Not only that, the DTDs it does support are HTML, XHTML, and either with a "Transitional"/Legacy flag — but no URI to a DTD, like the one pointing to the Loose DTD that our app uses and the damned thing threw to the floor. (Sorry, brain cells on strike today and my ability to write clearly is going downhill.)

You can specify your own DTD, including a URI (undocumented feature, whee!), and thus put back in the original — but it doesn't append a newline, there's no way to append a newline that I could figure out, and so it mushes the DTD together with the first html opening tag and makes baby Firefox cry and render the page badly.
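
To make the symptom concrete, the doctype and the opening html tag end up jammed together on one line. The sketch below shows one conceivable workaround, which is to post-process the markup and put the newline back. It is an illustration only: it is not something mod_proxy_html does, nor something that was tried here, and the function name is invented.

```python
import re


def ensure_newline_after_doctype(html: str) -> str:
    """Insert a newline after the doctype if markup follows it directly."""
    # [^>]* is safe here because neither quoted identifier contains '>'.
    return re.sub(r"(<!DOCTYPE[^>]*>)(?=\S)", r"\1\n", html, count=1, flags=re.IGNORECASE)


mushed = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
    '"http://www.w3.org/TR/html4/loose.dtd"><html><head></head></html>'
)
print(ensure_newline_after_doctype(mushed))
```

Text that already has whitespace after the doctype is returned unchanged.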

My rule of thumb for a long time was that if I start looking at source code, I'm in over my head. I'm starting to think that may not be entirely true anymore, that I've advanced to the point where I can read C (say) and generally understand what's going on. But when I start looking for API documentation for Apache 2.2 (surprisingly hard to find) to find out if, say, ap_fputs or apr_pstrdup chomp newlines or something (near as I can tell, they don't), or just what AP_INIT_TAKE12 takes as arguments…well, then I am in over my head. If nothing else, I don't want to make some silly error because I don't know what the hell I'm doing. (That's not a slam against the Debian folks; I just mean that I felt shivers when I read about that, because I dread making the same sort of highly-visible, catastrophic error) (unlike the rest of the planet, you understand).


Busyness

Thu Jun 18 16:12:32 PDT 2009

Full day:

  • Prepare new network map

  • Take stand-in techie around server room and explain new network setup

  • Check UPS; still not crashed

  • New Sun 4240 server unable to get past POST after hooking up fibre cable yesterday to SL-500 library. Try various things, no luck. Fortunately installers coming back next week to finish the job.

  • Over to server room w/boss to take pictures for website

  • Get programmer familiar w/the server she'll be using, how to set up services, etc. Arguably my job, but a) she'll want to learn and b) I'm off on vacation next week.

  • Unless of course the UPS folks need to schedule downtime to make it work. But then I'll just use it as an excuse to show my dad and kids around the server room.

  • Gotta pick out an IPA recipe to brew with my dad. Leaning toward the Cream IPA from Radical Brewing. May need to get a cooler to use as a lauter tun, since I think it's around 13 pounds of grain — more than I can comfortably do in my paint bag strainer setup.

  • Still got out to walk around at lunch time, which was nice; I have a bad habit of skipping that.


Now that's irritating…

Tue Jun 16 10:48:54 PDT 2009

Just discovered, while trying to test the mail server at $WORK, that my ISP filters outgoing port 25. I'd give them a call but I can't dig up my account info at the moment.
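
A quick way to check for this sort of filtering is to attempt a plain TCP connection to port 25 from the network in question and compare the result with the same test run from elsewhere. Here is a minimal sketch, with a made-up function name:

```python
import socket


def can_connect(host: str, port: int = 25, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False
```

For example, `can_connect("mail.example.com")` (a placeholder hostname) returning False at home but True from another network would point at the ISP rather than the mail server. A False result on its own does not distinguish a filter from a server that is simply down.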


Once more, with feeling:

Mon Jun 15 12:16:46 PDT 2009

Dress rehearsal includes checking to see if you can, in fact, unrack something. I was unable to move a switch this morning because it was stuck behind a PDU. Arghh.

The saga of our crashing UPS continues. The techs came out to visit this morning, which meant I needed to schedule downtime so they could bypass the UPS manually. They were unable to find any smoking gun (or capacitors), and need to confer with HQ again. Best case: the UPS control panel continues to work, and they can do the next round of work w/o a manual bypass. Worst case: the control panel crashes again, and we schedule another round of downtime.


Rack tip #54, or Murphy's Law of Rack PDUs

Fri Jun 12 12:12:08 PDT 2009

If you have space for two PDUs and you put one on each side of the rack, you will have no separate space for network cables and you'll get interference. If you put those two PDUs on one side of the rack, you'll put it on the wrong side and your power cords will interfere with your network cables. If you put those two PDUs on the correct side of the rack, you'll find that racking new items is a pain because the cords block the post holes on that side.


Tour, FC

Thu Jun 11 20:42:19 PDT 2009

Gave a tour of the new server room today to about 30-odd people in the department. Ended on a bit of a low note ("…and that's the end! Any questions?") but other than that it went well. Even got an ounce of champagne at the end of it.

Oh, and yesterday I found out that our SL-500 has three fibre channel interfaces, compared to the one interface in the server we bought. I think the sales folks assumed we had a fibre switch, and I didn't realize it all (data + control) wouldn't go over one cable. Arghh.

Just saw a character named Terence on "Entourage" who was not Terence Stamp. Now I want to see "Bowfinger" and "The Limey", in that order.


New server room ours at last

Wed Jun 10 21:07:30 PDT 2009

Given the recent hoo-ha about abandoned blogs, and my own tendency to lose interest in writing about something the longer I put it off (I haven't graphed it, but I suspect it's a nice exponential decay), I figured I should finally write up what I've been doing the last week: the move at $WORK to our new server room.

So: construction finally got finished on our new server room. Our UPS was installed, our racks set up, and the keys handed over (though they were to be changed again twice). Our new netblock was assigned, the Internet access at the new location was in place, and movers were booked.

Things I did in advance which helped immensely:

  • Checklist in Org mode, plus printed copies; the ability to constantly edit a nice todo list, complete with checkboxes and statistics, was wonderful.

  • Printed copies of the spreadsheet showing rack assignment, cabling requirements, VLAN changes, etc

  • Tested new firewall with VMs (thus pointing out that "antispoof quick" is not a good thing to do with a bridging OpenBSD firewall)

  • Cardboard for the floor of the new server room to lay the servers on (since we weren't going to be able to rack the machines as quickly as they came from the movers)

Last Thursday morning, it all started. I got the machines shut down (thank you, SSH and ubiquitous wireless access at UBC) before the two volunteers who were helping me showed up. We started getting machines unracked; since it was only about 20 machines, I figured it wouldn't take too long. While that was true, I had not counted on the rat's nest of power cables (our power requirements were such that we had to connect machines to PDUs in adjacent racks), or the fact that we wouldn't be able to disassemble that 'til we'd got the machines out.

There was one heartstopping moment: a 1U server, while extended on its rails, came off one of the rails while no one was supporting it. Amazingly the other rail held on while it rotated quickly through 90 degrees to bang loudly against the rack. "You swear quickly," the movers remarked. (Doubly amazingly, the machine seems to be fine, though the rails for the thing are shot.)

The movers were big and burly, which was wonderful when it came to moving the Thumper. I weigh more than it does, but not by much, and I'd had the bad fortune to screw up my back a week before the move. It was tricky trying to figure out how to remove it from the rails, but the movers' trick of supporting it with a couple of big blankets, while fully extended from the rack, made such considerations less urgent. Eventually we got it figured out. I don't know how that could have gone smoother, since we'd got Sun to rack the thing and, frankly, it's not like you spend a lot of time un- and re-racking something like that. Anyhow, a minor point.

The new location was right around the corner, which was handy. The movers had put the servers in these big laundry-like carts on wheels; in the end, we only had four of 'em. We got the machines unloaded, racked the Thumper with the movers' help, signed the paper, then went off for lunch where we picked up two more volunteers.

After that, we started racking servers. Having only one sysadmin around (me) proved to be a bottleneck; the volunteers had not worked with rackmounted machines before, and I kept having to stop what I was doing to explain something to them. It would have been a great help to have another admin around; in fact, I think this is the biggest move I'd want to make without some other admin around.

Problems we ran into:

  • Cage nut pullers are small and get lost easily. (Moral: designate one place for tools, just like it sez here)

  • Mounting brackets didn't work. One of 'em, I just figured out today, we had in backwards. The other wasn't threaded for the bolts from APC, and I had only the right bolts — no cage nuts to fit. (Moral: photograph the racks for anything non-standard; if you have to ask, it's non-standard)

  • One of the things we couldn't mount was a Very Important Disk Array. Fortunately it held a database which had been mirrored on another Very Important Disk Array, which also couldn't be mounted in its brackets. Instead, we used a rack shelf I happened to have around, and that worked well….but its advertised capacity wasn't enough to hold all four trays (2 trays per array), so we made do with one. (Moral: have a spare rack shelf or two on hand)

  • The bolts from APC had these enormous heads, which would end up impinging on the rack unit above/below. This got to be a pain. Only today did I discover that there were plenty of bolts and cage nuts provided by the contractor who installed the racks. (Moral: dress rehearsal includes putting cage nuts and bolts in adjacent holes to see how they fit)

  • We had to re-hang the PDUs so they'd reach the power supplies. There were two in each rack, and both were on the right; the power supplies were all on the left, and I'd bought a bunch of 2' power cords to help with cable management. (Moral: Think about cable management for power, not just network)

  • Another thing about the PDUs: The outlets don't stretch throughout the length of the bar, but instead are clustered such that there's a dead space at the bottom/top 8" or so. The power cables had to be chained together sometimes to reach the extremes. (Moral: dress rehearsal includes plugging things in)

  • My plan to mount the switch in the middle of the rack with all the equipment has the advantage of shorter network cables (no running back to front, and no running top to bottom). But I should have noticed the middle empty spot in the PDUs and mounted it there; as it is, there's a block of outlets in the PDUs I can't use because the power cables will get too close to the network cables. (Moral: think about cable management for network, not just power)

  • Underestimated the amount of time it'd take to get things racked. I suppose this can only be bettered with experience.

  • Underestimated the amount of time it'd take to get cables dressed; did not realize how important this was for working with things.

  • Did not bring warm shirt for when the cooling was turned on. Mistake!

  • Did not have lots of water on hand; did not figure out in advance where bathroom was (important in a building where you only have access to one room)

  • Really could have used a phone in advance in the room; cell coverage was spotty

  • Ratchet set very handy when tightening screws in awkward places (i.e., behind power bar); last resort: hold bit in jaws of pliers/Leatherman. (Moral: dress rehearsal includes looking for tight corners and figuring out how you're going to work in them)

  • Preserve all bits and label them; carry masking tape/removable labels and sharpies; label anything and everything you haven't already; use ziplock bags for stuff and tape them to the machines they're associated with

  • Firewall not modified to allow LDAPS to LDAP server from new netblock

  • Monitoring machine came up with no ethernet interfaces; modprobe tg3 gave "probe of 0000:04:04.0 failed with error -22". (Moral: figure out how you're going to get information off a machine with no network)

  • Anyone else notice that C13-C14 power cords are just plain wobbly in the PDU sockets? I had more than one pop out on me while moving cords around. (Moral: Andy Rooney lives!)

  • Coulda used more printouts of the rack assignments.

  • One cable was flaky: it worked for a while, then didn't. This was the cable that connected our firewall to the ILOMs for the servers, which meant I was unable to work from home on getting them up and running. This was probably for the best; I sorely underestimated just how wired I was when I went home. (Moral: you're more tired than you think)

  • One of the racks was designated as the networking rack; however, since we didn't have that many switches to mount, I figured I'd use it for other stuff too. This turned out not to work: the distance between the front and back rails had been shortened to make room for network cables, and that meant the rack rails for the equipment I wanted to mount didn't fit.
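
A footnote to the tg3 item in the list above: the kernel reports failures as negated errno values, so "error -22" should correspond to EINVAL (invalid argument). That narrows things down less than one would like, but it is quick to decode. A small sketch, with a made-up helper name:

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Translate a (possibly negated) errno value into its name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name}: {os.strerror(number)}"


print(describe_kernel_error(-22))  # begins with "EINVAL" on Linux
```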

Things that went well:

  • Ripwrap is awesome. So are cordless drills that come with two batteries.

  • The rack rails from Sun that just clip in are also awesome. Man, that makes things fast.

  • There was good beer in the fridge when I got home. Thanks, Pre.

  • Frankly, all the prep meant that things went pretty well overall. This was good.

I'm going to post this now because if I don't, it'll never get done. I may come back and revise it later, but better this than nothing at all.


Squint

Tue Apr 28 16:34:11 PDT 2009

This has been one of those days where all I've done is stare at monitors too closely.

I know, I'm a sysadmin, what do I expect? But some days I get up, move around; I'm sedentary (and introverted) by nature but I try to talk to people, stare off into the distance, get away from my desk. Going to the server room is always a good break.

Not today, though. My carefully-chosen ATI video card (the Radeon 4550) is giving me headaches, metaphorical and real:

  • the proprietary fglrx drivers work if you want a cloned display, but enabling Xinerama makes X segfault

  • or, interestingly, the fglrx driver will show the desktop on one monitor, and an "uninitialized" (X checker pattern, chunky X cursor) screen on the other

  • the radeonhd drivers work perfectly for VGA out, but the DVI out is flickery and "noisy"

Dual monitors are important. My own damn fault for not getting something old enough…
