In one of the systems we support, high availability is provided through a combination of clustering and log shipping. We have a primary data centre with two clusters and a few standalone servers. One of these is a 5-node, 11-instance cluster which log ships to a 3-node partner in a separate data centre several miles away (a pretty nifty setup!).
One of the instances which is shipping logs is a large SharePoint install, with over 60 content databases, ranging in size currently from 2 to 10 GB each. This is SQL Server 2005 Enterprise Edition (64-bit). We started to encounter issues with the log shipping, copy and restore jobs would hang on the standby server, and I was having to resort to sledgehammer/nut cracking methods for resolution, like restarting the SQL Server agent on the standby. Not understanding what was going on was beginning to annoy me, when I just happened to notice some weird messages on this screen, from Job Activity Monitor:
This picture is a healthy snapshot, but in the one I looked at, I could see in the Status column, ‘Waiting For Worker Threads’. this was at a time when my copy and restore jobs were hanging. Now you can imagine if you are log shipping 60 databases, it’s almost impossible to spread all the copy and restore jobs out enough to stop them overlapping completely. Essentially, SQL Server Agent was getting starved of threads. There was nothing in the error logs about this.
So I googled this setting. Normally you don’t need to fiddle with the MAX WORKER THREADS sp_configure setting, but in instances like this, you may have to. I followed the recommendations in this handy MSDN article, in my case it was a 8 processor 64-bit server, so I set max worker threads to 576. I did this on the Primary Server too, as well as the standby server. Gratifyingly, this fixed my issue immediately.
Resynching all 63 databases is something I’ve had quite enough of for one lifetime …..