Yesterday I was working on a project to migrate an elderly physical SQL Server (v. 2000) to a new VMware virtual server offering better performance and availability.

I installed the new server instance, and tasked my offshore colleagues in India (who are absolutely excellent) with the actual migration, as they can do it at 3am our time during their regular working day.

This is a simple server running a database which supports a Communities-based forum used by the company I work for in supporting their customers. However, it had previously hosted a range of other databases, supporting various website shops used by subsidiaries of my company. Several weeks ago the databases that serve these commercial sites were moved onto a SQL Server 2005 clustered instance.

At the time of that migration, I noticed that the database used for Session State on these sites was still on the old server in error, and pointed this out. Unfortunately this was during a big fire fight over related web server issues on the same project, and I was told this was a low priority, since the sites were still working (they didn’t mind where they were storing their session state!). The issue was never addressed, and I certainly forgot about it.

Then during this migration, a series of errors occurred in sequence, in much the same way that several unusual events chain together to cause plane crashes.

First, application logins could not connect to the new server. After lots of troubleshooting, we realised I had installed the wrong collation! What a terrible, basic, embarrassing mistake for a DBA! So we needed to invoke rollback: simply switch the old server back in, reinstall SQL Server on the new virtual server, and redo the migration at a later date, yeah?

Problem. One of my colleagues in India, while renaming the old server and changing its IP address, had clicked ‘disable’ on its network card, instead of ‘properties’. Disconnected. No RDP. No ‘lights out’ card. No one on site in the server room; it was still 7am. Alarm raised.

Also, it now became apparent that all the websites mentioned above were no longer working, because they were still incorrectly using the old server for their Session State database: no one had got round to changing the relevant IIS connection configs. Big problem. Commercial sales sites unavailable, rollback position lost. Red faces all round.

By 9.09am UK time we had recovered the situation: someone had arrived and was able to re-enable the network card at the console in the server room. With the IP and NetBIOS name changes complete, the old server was back online and the websites were up once more.

Lessons learned – get your collation right. SQL_1xCompat_CP850_CI_AS really does look very much like SQL_1xCompat_CP850_CS_AS when you give it a cursory glance instead of a proper check. Get your setups thoroughly peer reviewed. Never jeopardise your fallback position – ensure you have Lights Out connectivity if possible. And make sure you know how to use it if you do!
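Since the two collation names differ by a single character, a mechanical comparison is more reliable than eyeballing them. Here is a minimal sketch in Python of the kind of check I mean. The function name `collation_diff` is my own invention for illustration, and it assumes you have already fetched each server's collation name as a string (for example by running `SELECT SERVERPROPERTY('Collation')` on each instance); it does not talk to SQL Server itself.

```python
from typing import List, Tuple


def collation_diff(expected: str, actual: str) -> List[Tuple[int, str, str]]:
    """Compare two collation names token by token.

    Collation names are split on underscores, and every position where
    the tokens differ is returned as (index, expected_token, actual_token).
    An empty list means the names are identical.
    """
    exp_tokens = expected.split("_")
    act_tokens = actual.split("_")
    diffs = []
    for i in range(max(len(exp_tokens), len(act_tokens))):
        # Pad the shorter name with empty tokens so length mismatches show up too.
        e = exp_tokens[i] if i < len(exp_tokens) else ""
        a = act_tokens[i] if i < len(act_tokens) else ""
        if e != a:
            diffs.append((i, e, a))
    return diffs


if __name__ == "__main__":
    # The two names from this post; which one belongs to which server
    # is not something this sketch knows.
    first = "SQL_1xCompat_CP850_CI_AS"
    second = "SQL_1xCompat_CP850_CS_AS"
    for index, e, a in collation_diff(first, second):
        print(f"token {index}: {e!r} != {a!r}")
```

Run against the two names above, this flags the CI versus CS token (case-insensitive versus case-sensitive), which is exactly the difference a cursory glance misses.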

The inevitable inquest, mud-flinging, finger-pointing, blame-apportioning and I-told-you-so accusations have not been at all pleasant, and have made this into a really bad week for me, one which is barely halfway through.

