Add a new Windows 2003 node to a cluster and install a new SQL Server 2005 instance on it: issues!

Recently I was tasked with adding a new node to an existing 5-node Windows 2003/SQL Server 2005 cluster. This was much more troublesome than you’d expect!

The first issue I ran into was during the very first step, adding the new node to the cluster (the node had already been built with OS, networking, etc). The node add failed with the following error:-

CLUS01: Could not verify that node "CLUS01N6.domain.internal.net" can host the quorum resource. (hr=0x800713de, {A976DC09-0108-410A-AF57-68C05F9A42F7}, {4518DA37-669B-4C46-9A2B-E903835A8E6B}, 1, 1, 1), Ensure that your hardware is properly configured and that all nodes have access to a quorum-capable resource.

The workaround is detailed in this MSKB article: run the Cluster Add Node step with minimal rather than detailed checks, because the detailed checks don’t understand complex SAN pathing.

That fixed the first issue. On to the next! On trying to add the node again, I repeatedly got the error:-

There is not enough server storage to process this command.

This error could also be generated by attempting to run Cluster Administrator on the new node. You used to get this error on Windows 2000 if your registry wasn’t sized adequately. Fortunately one of my colleagues tracked down the problem, which was incorrect settings of the following two registry values:-

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0\NtlmMinClientSec
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\MSV1_0\NtlmMinServerSec

Both were set to 0, but needed to be set to 537395248 (Hex 20080030). Hmmm.
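For anyone wondering what that magic number means, here is a minimal Python sketch that checks the decimal/hex conversion and breaks the value into its bit flags. The flag names are my reading of Microsoft's documentation for these two values, and `decode_ntlm_min_sec` is just a helper name I made up for illustration, so treat it as a sketch rather than a reference.

```python
# Bit flags for NtlmMinClientSec / NtlmMinServerSec, as I understand
# Microsoft's documentation (assumption: check it against the KB yourself).
NTLM_MIN_SEC_FLAGS = {
    0x00000010: "message integrity",
    0x00000020: "message confidentiality",
    0x00080000: "NTLMv2 session security",
    0x20000000: "128-bit encryption",
    0x80000000: "56-bit encryption",
}


def decode_ntlm_min_sec(value):
    """Return the names of the flags set in value, lowest bit first."""
    return [name for bit, name in sorted(NTLM_MIN_SEC_FLAGS.items()) if value & bit]


required = 537395248  # the value both registry entries needed
print(hex(required))                  # 0x20080030
print(decode_ntlm_min_sec(required))  # the four flags that make up the value
print(decode_ntlm_min_sec(0))         # [] - what the new node had
```

If that reading is right, the value asks for integrity, confidentiality, NTLMv2 session security and 128-bit encryption, whereas a value of 0 asks for none of them.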

Finally, I had my new node added and could format and mount my new LUNs and add them as cluster resources. Windows 2003 has a little ‘glitch’ which stung me here, as it would on this blighted job: it seems to forget the drive letter you have assigned to the new drive. You have to delete the partitions in disk administrator and start again.

So finally I could proceed to my SQL Server 2005 instance install. For this (I learned the hard way) you must do an unattended install, because an interactive install gives you no way to specify the Windows AD account for the SQL Browser service on the new node. If you don’t, setup fails with a totally unhelpful error:-

Could not find the file specified.

So finally I got somewhere with my unattended install. End of woes? No way. The setup did not complete gracefully, even though to all intents and purposes it had finished (I could see all the SQL Server cluster resources, and these were working fine, with a clean error log). I had to end task on all the setup.exe processes on the cluster nodes. But after that it was all sweetness and light. A true test of perseverance and tenacity!

The DBA equivalent of a plane crash

Yesterday at work I was working on a project to migrate an elderly physical SQL Server (v. 2000) to a new VMware virtual server offering better performance and availability.

I installed the new server instance, and tasked my offshore colleagues in India (who are absolutely excellent) with the actual migration, as they can do it at 3am our time during their regular working day.

This is a simple server running a database which supports a Communities-based forum used by the company I work for in supporting their customers. However, it has previously hosted a range of other databases, supporting various website shops used by subsidiaries of my company. Several weeks ago the databases that serve these commercial sites had been moved onto a SQL Server 2005 clustered instance.

At the time of that migration, I noticed that the database used for Session State on these sites was still on the old server in error, and pointed this out. Unfortunately this was during a big firefight over related web server issues on the same project, and I was told this was a low priority, since the sites were still working (they didn’t mind where they were storing their session state!). This issue was never addressed, and I certainly forgot about it.

Then during this migration, a series of errors occurred in sequence, much in the same way that several unusual events usually chain together to cause plane crashes.

First, application logins could not connect to the new server. After lots of troubleshooting, we realised I had installed the wrong collation! What a terrible, basic, embarrassing mistake for a DBA! So we needed to invoke rollback: simply switch the old server back in, reinstall SQL Server on the new virtual server, and redo the migration at a later date, yeah?

Problem. One of my colleagues in India, while renaming the old server and changing its IP address, had clicked ‘disable’ on its network card instead of ‘properties’. Disconnected. No RDP. No ‘lights out’ card. No one on site in the server room, as it was still 7am. Alarm raised.

Also, it now became apparent that all the websites mentioned above were no longer working, because they were still incorrectly using the old server for their Session State database: no one had got round to changing the relevant IIS connection configs. Big problem. Commercial sales sites unavailable, rollback position lost. Red faces all round.

By 9.09am UK time we had recovered the situation: someone had arrived and was able to re-enable the network card at the console in the server room. With the IP and NetBIOS name changes complete, the old server was back online and the websites were up once more.

Lessons learned – get your collation right. SQL_1xCompat_CP850_CI_AS really does look very much like SQL_1xCompat_CP850_CS_AS when you give it a cursory glance instead of a proper check. Get your setups thoroughly peer reviewed. Never jeopardise your fallback position – ensure you have Lights Out connectivity if possible. And make sure you know how to use it if you do!
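Since the eye clearly can't be trusted with this, here is a rough Python sketch of the kind of check that would have caught it: split both collation names on the underscores and report only the parts that differ. `collation_diff` is a hypothetical helper of my own, and which of the two names counts as "expected" in the example is arbitrary.

```python
from itertools import zip_longest


def collation_diff(expected, actual):
    """Return (position, expected_part, actual_part) for each '_'-separated part that differs."""
    pairs = zip_longest(expected.split("_"), actual.split("_"))
    return [(i, e, a) for i, (e, a) in enumerate(pairs) if e != a]


# The two names from this incident; the order of the arguments is arbitrary here.
print(collation_diff("SQL_1xCompat_CP850_CI_AS", "SQL_1xCompat_CP850_CS_AS"))
# prints [(3, 'CI', 'CS')]
```

An empty list means the names match, and anything else points straight at the part to look at, here case-insensitive (CI) versus case-sensitive (CS).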

The inevitable inquest, mud-flinging, finger-pointing, blame-apportioning, I-told-you-so accusations have not been at all pleasant, and have made this into a really bad week for me which is barely half way through.