I am still struggling with this issue. Despite several weeks of effort from both me and PSS, we are no nearer a resolution.

I took advantage of New Year downtime in the factory at work to completely remove then reinstall the instance. Sledgehammer to crack a nut? Well..

A lot of effort, many hours of planning and preparation, and all to no avail – both the timeout issue and the (probably unrelated) Access Violation on a certain update remain.

I thought I’d share a quick overview of how I went about the reinstall, in case any of you ever find yourselves in a similar boat. The outline plan went like this:

1. Put all cluster resources in maintenance mode in monitoring tool. Full backup to disk of all DBs.
2. Disable all relevant jobs, interfaces, etc, keeping a list for re-enabling (or script as appropriate)
3. Unsubscribe publications to and from the instance.
4. Detach user dbs and leave data and log files intact in cluster partitions.
5. Remove instance in turn from each cluster node.
6. Reinstall instance on each node in turn (using a vanilla RTM install this time, to try to eliminate slipstreaming SP1 as a possible cause of issues).
7. Apply SP1 and CU5 to the instance on each node.
8. Reconfigure (trace flags, memory, static port assignment, tempdb layout etc)
9. Restore msdb (with SQL Agent cluster resource offline)
10. Stop SQL Server cluster resource.
11. Start SQL Server locally via command line in single user mode (outside of clustering).
12. Restore master using SQLCMD.
13. Restart all cluster resources and check. Check all logins and jobs are back.
14. Remove cluster resources from maintenance mode.
15. Recreate publications as necessary, re-subscribe and re-sync.
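
For steps 1 and 4, typing the backup and detach statements by hand for every user database gets tedious, so it can help to generate them. Below is a minimal sketch, not the script I actually ran: the function names, the database names and the backup path are all made up for illustration, and it assumes one .bak file per database. It only builds the T-SQL text, so you can read it over before running any of it. Note that WITH INIT overwrites any existing backup sets in the target file.

```python
def quote_name(name: str) -> str:
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_literal(text: str) -> str:
    """Quote a value as a T-SQL Unicode string literal, doubling single quotes."""
    return "N'" + text.replace("'", "''") + "'"


def backup_and_detach_script(databases, backup_dir):
    """Return T-SQL that backs up every database to disk, then detaches each one.

    All backups are emitted first, so nothing is detached until every
    backup statement has had a chance to run.
    """
    lines = []
    for db in databases:
        # Assumed layout: one backup file per database, named after it.
        backup_file = backup_dir.rstrip("\\") + "\\" + db + ".bak"
        lines.append(
            f"BACKUP DATABASE {quote_name(db)} "
            f"TO DISK = {quote_literal(backup_file)} WITH INIT, CHECKSUM;"
        )
    for db in databases:
        # Disconnect other sessions first; sp_detach_db fails if the db is in use.
        lines.append(
            f"ALTER DATABASE {quote_name(db)} "
            "SET SINGLE_USER WITH ROLLBACK IMMEDIATE;"
        )
        lines.append(f"EXEC sp_detach_db @dbname = {quote_literal(db)};")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical database names and backup share, for illustration only.
    print(backup_and_detach_script(["Sales", "Manufacturing"], r"\\backupserver\sql"))
```

The output can be pasted into a query window or saved to a file and reviewed. Keeping it as generated text rather than executing it directly means a human still looks at every statement before anything is detached.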

Restoring the master database on a cluster is interesting (first time I’d ever done it).

From the command prompt, navigate to the location of the SQL Server executables. In my case, it was something like:
C:\Program Files\Microsoft SQL Server\MSSQL10.INSTANCENAME\MSSQL\Binn

and type

sqlservr -c -m -s INSTANCENAME

where INSTANCENAME is simply the named instance name, not including the SQL Server virtual name (that is not clear in BOL).

Then open a SQLCMD connection like this and restore the master db:

sqlcmd -S SERVER\INSTANCENAME -E
>restore database master from disk = '\\path\backup.bak' with replace
>go
 

This restores master, after which SQL Server shuts itself down and you are unceremoniously disconnected.

Of course, this was vaguely interesting, but I certainly had better things to be doing at 6am on the 2nd of January! And it didn’t fix my problems. But it was worth a try, I guess.

Under Microsoft Premier Support Services’ direction, we are testing installing a brand new 2008 instance into the cluster to see if we can isolate the issue to this one problematic instance.

5 thoughts on “Update on SQL Server 2008 Clustered Instance connection timeout issues”

  1. Please don’t put too much effort into fixing this. I will miss the "SQL Server Job System: ’Upload_INT_Manufacturing’ completed on \\ARGUS" failure mails, I’m getting quite used to deleting them now.

  2. Hey, did you manage to pinpoint/fix this issue? I think we are getting a similar issue, as we are getting about 10,000 connection exceptions from one of our clustered servers each day. The issues are "A network-related or instance-specific error occurred while establishing a connection to SQL Server" but also a lot of "Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached". These errors are all from our ASP.NET website and are intermittent. Admittedly we don’t seem to have any problems (not that I’ve noticed anyway) connecting via SSMS, but maybe that is because we are connecting at just the right time. Also, we are still running on SP1, but looking through the CU list nothing matches our problem, and by the sounds of it, it will not make any difference anyway.

    We have tried all the same things as you regarding WireShark etc (except for the re-install) and nothing is coming up. Also, the server just seems to be taking everything in its stride in the activity monitor and not even registering these connection issues.

    I am really starting to tear my hair out with this, as I’m having to work with ComputaCenter who are hosting the cluster and having multiple conf calls a day where all we are doing is grasping at straws because no one has any idea at all, which is incredibly soul destroying!!! :( Please help before I explode and do something really bad!! 🙂

  3. Richard, thanks for the comment, apologies for the delay in replying, couldn’t see the wood for the spam. No I have STILL not fixed my issue, please email me if you’d like to discuss this as soon as poss. thomaspullen at homtail dot co dot uk .

  4. Hi Thomas

    Not sure if you found a fix for the timeout issue you were experiencing but we had a similar setup and issue.

    We have an SQL Server 2008 (Standard Edition) failover cluster (running on Windows 2008 Server) and were experiencing a lot of connection timeouts for months. After trying many different things we eventually fixed the issue by applying a firmware upgrade to our DL360 servers. This appears to have sorted it and we haven’t had any timeouts since.

    Hope this helps.

    Regards

    Matt

    1. Matt

      Thanks for that. I no longer work at the place where this was an issue, and I never sorted it while I was there. They upgraded to Windows 2008 and I think that helped a bit but did not fix the problem completely.
