Update on SQL Server 2008 Clustered Instance connection timeout issues

I am still struggling with this issue. Despite several weeks of effort from both me and PSS, we are no nearer a resolution.

I took advantage of New Year downtime in the factory at work to completely remove then reinstall the instance. Sledgehammer to crack a nut? Well..

A lot of effort and many hours of planning and preparation, all to no avail – both the timeout issue and the (probably unrelated) Access Violation issue on a certain update remain.

I thought I’d share a quick overview of how I went about the reinstall, in case any of you ever find yourselves in a similar boat. The outline plan went like this:-

1. Put all cluster resources in maintenance mode in monitoring tool. Full backup to disk of all DBs.
2. Disable all relevant jobs, interfaces, etc, keeping a list for re-enabling (or script as appropriate)
3. Unsubscribe publications to and from the instance.
4. Detach user DBs and leave data and log files intact in cluster partitions.
5. Remove instance in turn from each cluster node.
6. Reinstall instance on each node in turn (using a vanilla RTM install this time, to try to eliminate slipstreaming SP1 as a possible cause of issues).
7. Apply SP1 and CU5 to the instance on each node.
8. Reconfigure (trace flags, memory, static port assignment, tempdb layout etc)
9. Restore msdb (with SQL Agent cluster resource offline)
10. Stop SQL Server cluster resource.
11. Start SQL Server locally via command line in single user mode (outside of clustering).
12. Restore master using SQLCMD.
13. Restart all cluster resources and check. Check all logins and jobs are back.
14. Remove cluster resources from maintenance mode.
15. Recreate publications as necessary, re-subscribe and re-sync.
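
Steps 1 and 4 are the kind of thing worth scripting rather than clicking through. Here is a minimal Python sketch that only generates the T-SQL for a full backup and a detach per database. It is not the script I actually ran: the helper names, database names, backup folder and the one-file-per-database naming are all assumptions for illustration.

```python
def quote_name(name):
    """Bracket-quote an identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text):
    """Quote a T-SQL Unicode string literal, doubling any single quote."""
    return "N'" + text.replace("'", "''") + "'"


def backup_statement(db, backup_dir):
    """Full backup to disk (step 1). The <db>.bak file naming is an assumption."""
    path = backup_dir.rstrip("\\") + "\\" + db + ".bak"
    return (
        f"BACKUP DATABASE {quote_name(db)} "
        f"TO DISK = {quote_literal(path)} WITH INIT, CHECKSUM;"
    )


def detach_statement(db):
    """Detach a user database (step 4); data and log files stay on disk."""
    return f"EXEC sp_detach_db @dbname = {quote_literal(db)};"


# Hypothetical database names and backup folder, for illustration only.
for db in ["Sales", "Planning"]:
    print(backup_statement(db, r"\\path\backups"))
    print(detach_statement(db))
```

The output can be reviewed and pasted into a query window, which keeps a record of exactly what was backed up and detached.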

Restoring the master database on a cluster is interesting (first time I’d ever done it).

From the command prompt, navigate to the location of the SQL Server executables. In my case, something like:
C:\Program Files\Microsoft SQL Server\MSSQL10.INSTANCENAME\MSSQL\Binn

and type

sqlservr -c -m -s INSTANCENAME

where INSTANCENAME is simply the named instance name, not including the SQL Server virtual name (that is not clear in BOL).
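
If you script this step, the instance-name gotcha is easy to guard against. A minimal Python sketch, assuming a hypothetical helper called `single_user_start_args` (my own name, nothing SQL Server ships). It only builds the argument list and does not start anything.

```python
def single_user_start_args(instance_name):
    """Build the argument list for starting a named instance in single-user mode.

    Hypothetical helper for illustration. It expects the bare instance name,
    so anything containing a backslash (VIRTUALNAME\\INSTANCENAME) is rejected.
    """
    if "\\" in instance_name or not instance_name.strip():
        raise ValueError("pass the bare instance name, without the virtual server name")
    # -c: skip the service control manager call when started from a prompt
    # -m: single-user mode
    # -s: the named instance to start
    return ["sqlservr.exe", "-c", "-m", "-s", instance_name]


print(" ".join(single_user_start_args("INSTANCENAME")))
```

The list could be handed to `subprocess.Popen` with the Binn folder as the working directory, although typing the command by hand works just as well for a one-off.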

Then open a SQLCMD connection and restore the master db:-

>restore database master from disk = '\\path\backup.bak' with replace

This restores master and then SQL Server is stopped and you are unceremoniously disconnected.
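
For anyone who would rather run the restore non-interactively, here is a rough Python sketch that builds a one-shot sqlcmd invocation. The helper name `restore_master_args` and the server name are placeholders of mine, and the right value for `-S` depends on how your instance is reachable while it runs outside the cluster, so treat that part as an assumption.

```python
def restore_master_args(server, backup_path):
    """Build a sqlcmd argument list that restores master from a disk backup.

    Hypothetical helper; `server` and `backup_path` are values you supply.
    """
    # Double any single quotes so the path is a valid T-SQL string literal.
    literal = backup_path.replace("'", "''")
    query = f"RESTORE DATABASE master FROM DISK = N'{literal}' WITH REPLACE"
    # -S: server to connect to, -E: Windows authentication, -Q: run query then exit
    return ["sqlcmd", "-S", server, "-E", "-Q", query]


# Placeholder server name and the same example path as above.
args = restore_master_args(r"SERVERNAME\INSTANCENAME", r"\\path\backup.bak")
print(args[-1])
```

Because the instance shuts itself down once master is restored, a dropped connection at the end of the run is expected rather than a sign that the restore failed.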

Of course, this was vaguely interesting, but I certainly had better things to be doing at 6am on the 2nd of January! And it didn’t fix my problems. But it was worth a try, I guess.

Under Microsoft Premier Support Services’ direction, we are testing installing a brand new 2008 instance into the cluster to see if we can isolate the issue to this one problematic instance.