Outage Details

I don’t know if anyone noticed, but my server (along with this site and a few others) was down from sometime on Saturday until around noon on Sunday.

I had multiple hardware failures and had to do some scrambling to get back on the air.

On Friday when I got home from work I noticed that the server was making a horrible noise. I thought it was just a fan, and since I didn’t have time to fix it then, I ignored it. I sent an email to the various mailing lists that I host stating that I would be taking the server down Sunday to fix the bad fan.

On Saturday afternoon I had some time available, so I decided to do the work then. I tried to log on to the server to send an email and discovered that it was really slow and heavily loaded, and when I tried to run any command I got an I/O error.

That can’t be good.

So I got on the console only to discover that it was scrolling with hard drive IO errors and I couldn’t do anything. Crap. Guess the hard drive is bad.

I powered it down and headed to Best Buy to buy a new hard drive. Gah.

Then I had to figure out how to get the data from the old drive to the new one. I ended up downloading a new copy of Knoppix (I must have lent my CD to someone) which took about an hour and a quarter. Then I tore apart my desktop PC, plugged in the two hard drives and booted up Knoppix. (I love Knoppix!)

I fired up a dd command and copied the entire old drive over to the new one, all 250GB of it. It was doing about 9MB/s, so I let it run overnight.
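For reference, a whole-drive copy with dd looks something like the sketch below. The exact command used here isn’t recorded, so the block size and the conv=noerror,sync options (keep going past read errors and pad the unreadable blocks with zeros) are assumptions, as are the function names. The second function is just the back-of-the-envelope arithmetic for how long 250GB takes at 9MB/s.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC onto DST block by block with dd.
# conv=noerror,sync carries on past read errors and pads short or
# unreadable blocks with zeros so offsets on the copy stay aligned.
# Usage: clone_drive SRC DST
clone_drive() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}

# Rough copy time in whole hours (decimal units, integer division).
# Usage: copy_hours SIZE_GB RATE_MB_PER_S
copy_hours() {
    echo $(( $1 * 1000 / $2 / 3600 ))
}
```

It would be called as something like `clone_drive /dev/OLD /dev/NEW`, with the real device names filled in very carefully, since dd will happily overwrite the wrong drive. `copy_hours 250 9` prints 7 (integer division; the real figure is closer to 7.7 hours), which fits with letting it run overnight.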

This morning I got up and the copy had finished so I stuck the new hard drive into the server, buttoned it up, put it in the rack and turned it on.

WTF? It’s still making that damned noise! Crap. I guess there is a bad fan in addition to the hard drive, so I take it all apart again. This is a nice 1U server case and there are four fans across the back. I’m only using two of them since that keeps it plenty cool and running all four is too damned noisy. So I unplug the pair that were running and plug in the other pair. This time I power it up to test it before I put it all back together. (See? I can learn.) All is well so I rack it up again.

I tell it to fsck the partitions during boot and all appears to be fine now. MySQL didn’t start at boot and SQLgrey failed because of that, but I started them afterwards and we appear to be back in business.

Except

While I was working on the server I noticed that the firewall was also complaining about hard drive errors. It was still passing traffic (good thing, otherwise I’d have had some trouble getting a copy of Knoppix) so I figured I’d just give it a reboot to clean it up.

Oops

Now you have to understand something about this firewall. It was running RedHat 6.4 :-O on a really, really old desktop Pentium 133 with 64MB of RAM. It’s old. It’s been running for a long time. It’s tired.

Apparently it decided that it was done and refused to boot.

I guess when it rains, it pours.

I had previously attempted to build an IPCop firewall to replace the old one but I had run into some problems and couldn’t make it go so I just stopped. But now I had to do something.

I grabbed the hard drive from the previous attempt and put it into another 1U server that I had available. I tried to boot off the install CD for IPCop but I couldn’t make it go (bad CD-ROM drive? Who knows?), so I just booted the version that was on the drive.

After a bit of fumbling around to figure out how to reconfigure it I got it sorted out and all configured.

I ended up installing the SNATGui addon for IPCop so that I could SNAT my server to a different IP than the firewall – something that’s not natively supported by IPCop – but I had to mess around to make it go. When you turn on the SNATGui module, it breaks the SNAT configuration that IPCop sets up (by default it NATs everything to the external IP address of the firewall) so you have to manually define ALL the SNATs that you want.
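For anyone wondering what a one-to-one SNAT boils down to, this is roughly what such a rule looks like in plain iptables. It is not necessarily what SNATGui generates, so treat it as a sketch: the function name, the addresses and the eth0 interface name are all made up, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) an iptables rule that rewrites the
# source address of traffic from one internal host to a chosen external IP.
# Usage: snat_rule INTERNAL_IP EXTERNAL_IP OUT_INTERFACE
snat_rule() {
    echo "iptables -t nat -A POSTROUTING -s $1 -o $3 -j SNAT --to-source $2"
}

# Example with made-up addresses and interface name:
snat_rule 192.168.100.10 10.10.10.11 eth0
```

The default behaviour described above (everything NATed to the firewall’s own external IP) is the same idea with a single external address for every internal host.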

I have a /28 netblock so I can NAT my servers and desktops to different IPs.

I have a DHCP range of 128-191 set up for the desktops and such (that’s a big range for a few machines, but sometimes I have guests…) and entering the IPs one at a time (you can’t do a range) through a web interface was not cutting it.

So I wrote a quick shell script to crank out the lines to put into the /var/ipcop/snat/snatsettings file (it’s nothing fancy, but it works):

#!/bin/sh
# Append one SNAT settings line to /tmp/out for each address
# in the DHCP range (128-191).

I=128
while [ "$I" -lt 192 ]
do
    echo "192.168.100.$I,10.10.10.10,on,DHCP-$I" >> /tmp/out
    I=$((I + 1))
done
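
If the range or the external address ever changes, the same loop can be wrapped in a function. This is only a sketch: the function name snat_lines and its arguments are made up here, and the line format is simply copied from the script above rather than taken from any IPCop documentation.

```shell
#!/bin/sh
# Hypothetical wrapper around the loop above.
# Usage: snat_lines FIRST LAST EXTERNAL_IP
# Writes one line per host to stdout, in the same comma-separated
# format the original script produces.
snat_lines() {
    i=$1
    while [ "$i" -le "$2" ]
    do
        echo "192.168.100.$i,$3,on,DHCP-$i"
        i=$((i + 1))
    done
}
```

Something like `snat_lines 128 191 10.10.10.10 > /tmp/out` should produce the same 64 lines as the original script. It’s probably worth keeping a copy of the existing snatsettings file before pasting anything in.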

Despite this limitation, IPCop appears to be a pretty decent firewall package. It includes Snort, Squid, DDNS and some other useful things. It also graphs the traffic and has better logging. I need to see if you can set alerts for any of this, but overall it’s a nice setup.

There also appears to be lots of support and plenty of people using it; a Google search turns up lots of hits.

We’ll see how it runs.

Oh and the new firewall hardware is a 2.4GHz Xeon with a gig of RAM. Overkill for a firewall, but it’s what I had…

I guess I should really come up with a backup strategy too… Hrm.