More on SQL Server 2008 Connection Timeouts

A few more details on my current woes with SQL Server 2008. This is 64-bit Enterprise Edition, SP1 + CU5. We are getting severe, intermittent, but easily reproducible connection timeouts: you attempt to connect to the instance, but the client eventually gets fed up with waiting and gives up, like this in Management Studio:

[Screenshot: ssms_connection_timeout]

SSMS sets this timeout to 15 seconds by default, which is plenty long enough to establish a connection from a Gigabit-connected LAN client to a fast server:

[Screenshot: connection_timeout_default_ssms]

The error is slightly different when you generate it using SQLCMD:

[Screenshot: connection_timeout]

I am getting other symptoms of the same underlying issue, e.g. occasionally when trying to read the error log in a query window using xp_readerrorlog:

[Screenshot: xp_readerrorlog_error2]

I have no resolution for this issue yet; I will update if/when I do. Troubleshooting with Microsoft Premier Support has involved registry changes for things like Max Ports, switching TCP Chimney off, and SynAttackProtect (no benefit). We have also done network tracing using Wireshark. I have done some testing with a custom Microsoft-supplied app, and also my own testing using SQLCMD. And before you ask, yes, I have tried rebooting!! That’s not an easy thing to arrange when it’s a live 4-node cluster supporting 24x7 websites and applications… one of the (many) compelling reasons to have a passive node in your cluster, which you can use for precisely this kind of thing.
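For anyone wanting to reproduce this sort of intermittent timeout without SSMS or SQLCMD in the way, here is a minimal sketch of a repeat-connection probe in Python. It is not the Microsoft-supplied app or my SQLCMD test. It only times the raw TCP connect, so it says nothing about the SQL login itself. The function names and the 15-second default (borrowed from the SSMS default) are my own choices for illustration.

```python
import socket
import time


def probe_connection(host, port, timeout_s=15.0):
    """Attempt one TCP connection; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc) or type(exc).__name__


def run_probes(host, port, attempts=10, timeout_s=15.0, pause_s=1.0):
    """Probe repeatedly, pausing between attempts, and collect every result."""
    results = []
    for _ in range(attempts):
        results.append(probe_connection(host, port, timeout_s))
        time.sleep(pause_s)
    return results


def summarise(results):
    """Count failures and report the slowest attempt in seconds."""
    failures = sum(1 for ok, _, _ in results if not ok)
    slowest = max((elapsed for _, elapsed, _ in results), default=0.0)
    return {"attempts": len(results), "failures": failures, "slowest_s": slowest}
```

Something like `summarise(run_probes("my-sql-host", 1433, attempts=100))` would run it, where the host name is a placeholder and 1433 is the default port for a default instance (a named or clustered instance may well listen elsewhere). If the raw TCP connects never fail but SSMS still times out, that would point away from the network layer.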

Wish me luck... even the Director has started asking for daily updates. Urgh!

SQL Server 2008 stress

Ten days ago I upgraded several databases from an old, standalone, 32-bit Windows 2003/SQL Server 2000 server. I moved the databases onto a new SQL Server 2008 x64 SP1 instance, one of six instances running on our core 4-node cluster (4 x 2005 instances and 2 x 2008).

Shiny, spangly, highly available, modern, up-to-date, 64-bit, loads more memory and processing power… nothing could go wrong, right?

The first problem was exposed by transactional replication into a subscribed table with triggers on it. This caused Access Violations and stack dumps so frequently that the small(ish) system partition soon filled up with mini-dumps. We worked around this by changing the trigger (but the underlying issue remains unsolved).
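To keep an eye on how quickly dumps are eating a partition, a rough sketch like the following would total them up. It assumes the mini-dumps land in one directory and carry an .mdmp extension. Both are assumptions to check against your own instance, and the function name is made up for illustration.

```python
from pathlib import Path


def dump_usage(log_dir, pattern="*.mdmp"):
    """Return (file_count, total_bytes) for files matching pattern in log_dir."""
    files = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Example call (the path is a placeholder, not a real location):
# count, total = dump_usage(r"D:\path\to\instance\LOG")
# print(f"{count} dump files, {total / 1024 ** 2:.1f} MiB")
```

Run on a schedule, the count alone would show whether the trigger workaround has actually stopped new dumps from appearing.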

But a more problematic issue is frequent, intermittent connection problems and timeouts (in SSMS, the default timeout period is a mere 15 seconds). This server is used in my company’s factory, in assembling PCs. The timeouts have been causing so many problems for PCs being built and burn-in tested on the production line that I am getting serious heat.

So... I installed CU5 for SP1 at the weekend. The timeouts went away for a tantalising 48 hours and I truly believed the issue was fixed, only for it to return with a vengeance this morning. Cue seriously unhappy users and senior managers. Severity A case logged with Microsoft Premier Support. No resolution yet… but a very tired DBA here after a pretty stressful 10 days at work. Christmas can’t come too soon.

MPS are really good and the service can’t be faulted, but I’d rather not have to talk to them at all (with the greatest of respect!)