I like jam!
Server woes

Dear readers, you'll probably have noticed there hasn't been an update for a couple of weeks now. That's because I've been having some serious issues with the server that hosts ILikeJam (as well as several other sites, all my email, and so on).

The problems came to my attention about a month ago. The server had just stopped responding. I requested a power cycle. Done... no joy. Tried again. Still no joy. A KVM session was showing the machine hanging on startup. No amount of rebooting would fix it. So I had to go flying down to London one Sunday morning to sort it out.

Once in the hosting provider's build room, it all mysteriously started working again. So the server was racked back up and everything seemed fine.

The thought did cross my mind that there was an unidentified problem with the Easywebz hosting platform which runs this site and several others. Fortunately this turned out not to be the case, and it was a sheer coincidence that these problems had started just as I'd moved all but 2 of my clients' web sites across to the new platform.

Anyway, the server ran fine for a couple of weeks, and then stopped responding again. I got a KVM session set up and watched carefully as the startup messages flashed past on a reboot, paying particular attention to the disks.

The server is fitted with a RAID5 array consisting of three 3TB SATA disks (wd0, wd1 and wd2), and a 64MB CF card (wd3) which contains the bootloader and kernel. It's done this way because, although NetBSD will autoconfigure a RAID set and make it the root partition, this doesn't work where the disk (or logical disk in the case of RAID) is GPT rather than MBR. Because the filesystem is bigger than 2.2TB, it has to be GPT. So the way I'd solved it was to have a kernel with an embedded ramdisk containing an /etc/rc script which configures the RAID, whose GPT partition is then autoconfigured as dk3 (the RAID components configure as dk0-2 on wd0-2 respectively, as the components are also bigger than 2.2TB). The script then mounts the RAID on /raid and sets init.root to /raid before exiting. NetBSD will then chroot the entire filesystem to /raid and run the standard /etc/rc script in the chrooted filesystem. It's a complicated hack but it's necessary.
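For anyone wondering where the 2.2TB figure comes from, here is a quick back-of-envelope sketch. It assumes 512-byte sectors and decimal terabytes, neither of which is stated above, and relies on the fact that MBR partition tables store sector counts in 32-bit fields. The variable names are just for illustration.

```python
# Rough arithmetic only: assumes 512-byte sectors and 1 TB = 10**12 bytes.
SECTOR_BYTES = 512          # assumed logical sector size
MBR_MAX_SECTORS = 2**32     # MBR holds sector counts in 32-bit fields

mbr_limit_bytes = MBR_MAX_SECTORS * SECTOR_BYTES
mbr_limit_tb = mbr_limit_bytes / 10**12

disk_tb = 3                 # each RAID component
disk_count = 3              # disks in the RAID5 set
# RAID5 gives up one disk's worth of space to parity.
raid5_usable_tb = (disk_count - 1) * disk_tb

print(f"MBR limit: {mbr_limit_bytes} bytes (about {mbr_limit_tb:.1f} TB)")
print(f"RAID5 usable space: {raid5_usable_tb} TB")
print(f"Each component needs GPT: {disk_tb > mbr_limit_tb}")
print(f"RAID volume needs GPT: {raid5_usable_tb > mbr_limit_tb}")
```

Both the individual 3TB components and the resulting RAID volume come out over the limit, which is why GPT is needed at both levels.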

Anyway, as the disks were probed and displayed I saw wd0, wd1, wd2... wait, where's wd3?

Looked again. wd0 and wd1 were showing up as 3TB SATA. wd2 showed as 64MB Flash device. Wait - that should be wd3. Then the light bulb went on in my head.

Clearly there was a loose connection on one of the RAID components. When the server had been moved from the datacentre into the build room, the movement had disturbed the connector and re-made the connection. It had remained ok for a couple of weeks and then gone again.
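The reasoning behind the light-bulb moment can be sketched as a toy model. It assumes the kernel hands out wd unit numbers in probe order to whatever drives actually respond (true where units aren't hardwired to specific channels in the kernel config, which I'm assuming is the case here). The function name and drive labels are made up for illustration, and which of the three disks dropped out is picked arbitrarily.

```python
def assign_wd_names(slots):
    """Name the drives that respond wd0, wd1, ... in probe order."""
    detected = [label for label, present in slots if present]
    return {f"wd{i}": label for i, label in enumerate(detected)}

# Normal boot: all three RAID disks and the CF card respond.
healthy = [
    ("3TB SATA (RAID disk)", True),
    ("3TB SATA (RAID disk)", True),
    ("3TB SATA (RAID disk)", True),
    ("64MB CF card", True),
]

# Faulty boot: one RAID disk (arbitrarily the second) has a loose connector.
faulty = [
    ("3TB SATA (RAID disk)", True),
    ("3TB SATA (RAID disk)", False),
    ("3TB SATA (RAID disk)", True),
    ("64MB CF card", True),
]

for name, label in assign_wd_names(healthy).items():
    print("healthy:", name, label)
for name, label in assign_wd_names(faulty).items():
    print("faulty: ", name, label)
```

With one disk missing, everything after it shuffles down a unit, so the CF card turns up as wd2 and there is no wd3 at all.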

So I asked the technicians very nicely if they could open up the server and re-seat all the SATA and power cables between the hard drives and the motherboard. This was duly done and the machine restarted. It booted up fine.

On investigation I discovered that disk unit 1 (the middle drive) on the RAID was marked as "failed". I ran the command to reconstruct. No worries, I thought. WRONG!

The rebuild got to 39% and failed.

At this point syslog was showing disk read errors on unit 0. So, as you do, I tried repeating the command. Same again. Failed at 39%.

By now I had a potentially serious problem, as another disk failure could lose a few terabytes of data. So I did the only sensible thing: I ordered parts and built a new server. The new motherboard didn't have an ATA port, only SATA, so I couldn't boot from a CF card. That wasn't a problem: I fitted a USB DOM (Disk-On-Module) and configured it as the boot device. I then configured the RAID, installed all the required software, and synced a few hundred GB of data from the old server to the new.

Then came the day (yesterday) to change over the servers. All appeared to go smoothly, except the new server wasn't responding. I was working in London all day, so a few frantic phone calls between myself and the hosting provider resulted in me calling in on the way back.

Essentially the problem was that when I set the server up in the office here I'd used DHCP to auto-configure the network. I'd since set everything manually, but for some reason the changes hadn't taken. NetBSD is supposed to read the ifconfig parameters from (in the case of re0) /etc/ifconfig.re0, but it wasn't doing so. It also wasn't reading the default route from /etc/mygate. It took a few moments to figure out that this wasn't happening. I still haven't figured out why, but I've solved the problem by sticking the "ifconfig re0" and "route add default" commands at the very top of /etc/rc so they get run as soon as the system enters multiuser mode.

Within a few minutes of sorting this out, emails started to come in. However, there was another snag with the web sites. It appears that mod_php is broken on Apache 2.4, and one has to use php-fpm instead, which requires the FastCGI module and some awkward configuration. Attempting to install FastCGI failed, as the version in pkgsrc is broken on Apache 2.4, although I did sort of get it working. But then I discovered that FastCGI and mod_rewrite don't play nicely together.

The short version? I uninstalled Apache 2.4 and installed 2.2, with mod_php. And everything (almost) worked.

All the sites hosted on the Easywebz platform are now back up. There are three legacy sites which aren't yet running due to database issues. They are all sites that will be moved to Easywebz in due course, but at present one of the modules required for two of the sites is not yet completed. So that's the weekend's work: get the three remaining sites back up and running.

But now I'm off to the pub.
