The GrabPERF database server failed sometime early this morning. The hosting facility is working to install a new machine, and then will begin the long process of restoring from backups and memory.
Updates will be posted here.
UPDATE – Sep 4 2009 22:00 GMT: The database listener is up, data is flowing into the database, and it can be viewed in the GrabPERF interface. However, I have lost all of the management scripts that aggregate and drop data. These will be critical, as the new database server has a substantially smaller drive. There is a larger attached drive, and I will try to move the data onto it.
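(Since the originals are gone, here is a minimal sketch of what an aggregate-and-drop script of this kind does: roll old raw rows up into hourly summaries, then delete them so the small drive does not fill. The database, table, and column names below are hypothetical placeholders, not the real GrabPERF schema.)

    #!/bin/bash
    # Sketch of an aggregate-and-drop maintenance script (not the original).
    # Assumes MySQL credentials in ~/.my.cnf and a database named "grabperf";
    # table and column names are placeholders, not the actual schema.
    mysql grabperf -e "
      -- Roll raw measurements older than 7 days up into hourly summaries,
      INSERT INTO measurement_hourly (test_id, hour_start, avg_response, samples)
        SELECT test_id,
               DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00'),
               AVG(response_time),
               COUNT(*)
        FROM measurement_raw
        WHERE measured_at < NOW() - INTERVAL 7 DAY
        GROUP BY test_id, DATE_FORMAT(measured_at, '%Y-%m-%d %H:00:00');
      -- then delete the raw rows so the small partition does not fill up.
      DELETE FROM measurement_raw
        WHERE measured_at < NOW() - INTERVAL 7 DAY;
    "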
Restoring GrabPERF to its pre-existing state and keeping it maintained will likely take more time than I have at the moment. You can expect serious outages and changes to the system over the next few weeks.
[Whining removed. Self-inflicted injuries are always the hardest to bear.]
UPDATE – Sep 5 2009 03:30 GMT: The database is back up and absorbing data. Attempts to move it to the larger drive on the system failed, so the entire database is running on an 11GB partition. <GULP>.
The two most vital maintenance scripts are also running the way they should be. I had to rewrite those from very old archives.
Status: Good, but not where I would like it. I will work with Technorati to see if there is something I’m missing in trying to use the larger partition. Likely it comes down to my own lame-o Linux admin skillz.
I want to thank the ops team from Technorati for spending time on this today. They did an amazing job of finding a machine for this database to live on in record time.
I have also learned the hard lesson of backups. May I not have to learn it again.
UPDATE – Sep 5 2009 04:00 GMT: Thanks again to Jerry Huff at Technorati. He pointed out that if I use a symbolic link, I can move the db files over to the large partition with no problem. Storage is no longer an issue.
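(For anyone in the same bind, the move is roughly this. /mnt/bigdisk is an assumed mount point for the larger partition, and /var/lib/mysql is the stock CentOS data directory; adjust both to taste.)

    #!/bin/bash
    # Relocate the MySQL data directory to the larger partition and leave a
    # symbolic link at the old path so the default configuration still works.
    # /mnt/bigdisk is an assumption, not the actual mount point on this box.
    service mysqld stop
    cp -a /var/lib/mysql /mnt/bigdisk/mysql     # copy, preserving ownership
    mv /var/lib/mysql /var/lib/mysql.old        # keep the original until verified
    ln -s /mnt/bigdisk/mysql /var/lib/mysql
    service mysqld start                        # SELinux may need adjusting if enforcing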
[And why, you ask, is Tara Hunt (@missrogue) on this post? Hey, when I asked Tagaroo for Technorati images, this is what it gave me. It was a bit of a shock after 8 hours of mind-stretching recovery work, but hey, ask and ye shall receive.]
UPDATE – Sep 7 2009 01:00 GMT: It seems I got myself into trouble by using the default MySQL configuration that came with the CentOS distro. As a result, I ran out of database connections! Something I have chided others for, I did myself.
The symptom appeared when I reactivated my logging database, which runs against the same MySQL installation, just in a separate database. It started to use up the default pool of connections (100) and the agents couldn’t report in.
This has been resolved and everything is back to normal.
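(For anyone hitting the same wall, the fix boils down to raising max_connections above the default of 100. A minimal sketch, assuming the stock /etc/my.cnf location; 300 is an arbitrary example value, not the setting actually in use.)

    #!/bin/bash
    # Raise the MySQL connection cap so the logging database and the GrabPERF
    # agents can share the same instance without exhausting the default 100.
    # Appending a second [mysqld] group works; later values override earlier ones.
    printf '\n[mysqld]\nmax_connections = 300\n' >> /etc/my.cnf
    service mysqld restart
    mysql -e "SHOW VARIABLES LIKE 'max_connections'"   # confirm the new limit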