“Tom, there was a production problem yesterday at about 2.30pm for 20 minutes. Users complained about timeouts. What was going on in the database?”
This is a classic “umm … (oh damn)” moment for a DBA. You weren’t running Profiler. You had no Perfmon running.Your boss won’t countenance spending several thousand big ones on 3rd party performance apps. You can’t go back in time. But you want to! What was running? What plans? What were they waiting on? Was there blocking?
Help is at hand.
It is quite easy to implement a session capture system. This will query the DMVs to capture current running process information, every (specified) X seconds, indefinitely. It will capture a lot of data but will auto-prune itself to prevent excessive data bloat. And having it will mean you can precisely answer that formerly embarrassing question: “well, yesterday at 2.30pm, the following was happening … “. You will look good, prepared, clever, proactive, and in charge. Your users, your boss, your developers – your business – will love you.
This is how you do it.
1. Create a utility DBA database on your Prod server. It will need 10 – 15 GB disk space. Put it in the SIMPLE recovery mode. Ensure it is being backed up regularly.
2. Create the following table:-
CREATE TABLE [dbo].[session_capture](
[date_captured] [datetime] NULL,
[session_id] [smallint] NULL,
[database_id] [smallint] NULL,
[host_name] [nvarchar](256) NULL,
[command] [nvarchar](32) NULL,
[sql_text] [xml] NULL,
[wait_type] [nvarchar](120) NULL,
[wait_time] [int] NULL,
[wait_resource] [nvarchar](512) NULL,
[last_wait_type] [nvarchar](120) NULL,
[total_elapsed_time] [int] NULL,
[blocking_session_id] [int] NULL,
[blocking_text] [xml] NULL,
[start_time] [datetime] NULL,
[login_name] [nvarchar](256) NULL,
[program_name] [nvarchar](128) NULL,
[login_time] [datetime] NULL,
[last_request_start_time] [datetime] NULL,
[last_request_end_time] [datetime] NULL,
[transaction_id] [bigint] NULL,
[transaction_isolation_level] [smallint] NULL,
[open_transaction_count] [int] NULL,
[totalReads] [bigint] NULL,
[totalWrites] [bigint] NULL,
[totalCPU] [int] NULL,
[writes_in_tempdb] [bigint] NULL,
[sql_plan] [xml] NULL
) ON [PRIMARY]
CREATE CLUSTERED INDEX [IDX1] ON [dbo].[session_capture]
(
[date_captured] ASC,
[session_id] ASC
)
CREATE NONCLUSTERED INDEX [IDX2] ON [dbo].[session_capture]
(
[date_captured] ASC,
[wait_time] ASC,
[blocking_session_id] ASC
)
CREATE NONCLUSTERED INDEX [IDX3] ON [dbo].[session_capture]
(
[database_id] ASC,
[wait_time] ASC,
[wait_type] ASC,
[host_name] ASC
)
3. Create the following 2 Stored Procedures in your utility DB:-
CREATE PROC [dbo].[SessionCapture]
@TargetSizeMB bigint = 4000, — Purge oldest 20% of data if SessionCapture table is larger than this size (MB)
@Threshold int = 15000, — Alert on blocking that has wait-time greater than threshold (millisecs)
@Recipients varchar(500) = NULL, — Who will receive blocking alert emails. Separate email addresses with ; Set to NULL to disable alerting
@MailInterval int = 60 — Minimum interval between alert emails (secs)
AS
SET NOCOUNT ON
DECLARE @mailsubject varchar(50),
@tableHTML nvarchar(max),
@LastEmail datetime
SET @LastEmail = GETDATE()
— Loop indefinitely every 10 seonds
WHILE 1=1
BEGIN
BEGIN TRY
INSERT DBAdata.dbo.session_capture
SELECT
GETDATE(),
x.session_id,
x.database_id,
x.host_name,
x.command,
(
SELECT
text AS [text()]
FROM sys.dm_exec_sql_text(x.sql_handle)
FOR XML PATH(”), TYPE
) AS sql_text,
x.wait_type,
x.wait_time,
x.wait_resource,
x.last_wait_type,
x.total_elapsed_time,
COALESCE(x.blocking_session_id, 0) AS blocking_session_id,
(
SELECT
p.text
FROM
(
SELECT
MIN(sql_handle) AS sql_handle
FROM sys.dm_exec_requests r2
WHERE
r2.session_id = x.blocking_session_id
) AS r_blocking
CROSS APPLY
(
SELECT
text AS [text()]
FROM sys.dm_exec_sql_text(r_blocking.sql_handle)
FOR XML PATH(”), TYPE
) p (text)
) AS blocking_text,
x.start_time,
x.login_name,
x.program_name,
x.login_time,
x.last_request_start_time,
x.last_request_end_time,
x.transaction_id,
x.transaction_isolation_level,
x.open_transaction_count,
x.totalReads,
x.totalWrites,
x.totalCPU,
x.writes_in_tempdb
,(select query_plan from sys.dm_exec_query_plan(x.plan_handle)) as sql_plan
FROM
(
SELECT
r.session_id,
r.database_id,
s.host_name,
s.login_name,
s.program_name,
s.login_time,
r.start_time,
r.sql_handle,
r.plan_handle,
r.blocking_session_id,
r.wait_type,
r.wait_resource,
r.wait_time,
r.last_wait_type,
r.total_elapsed_time,
r.transaction_id,
r.transaction_isolation_level,
r.open_transaction_count,
r.command,
s.last_request_start_time,
s.last_request_end_time,
SUM(r.reads) AS totalReads,
SUM(r.writes) AS totalWrites,
SUM(r.cpu_time) AS totalCPU,
SUM(tsu.user_objects_alloc_page_count + tsu.internal_objects_alloc_page_count) AS writes_in_tempdb
FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id
JOIN sys.dm_db_task_space_usage tsu ON s.session_id = tsu.session_id and r.request_id = tsu.request_id
WHERE r.status IN (‘running’, ‘runnable’, ‘suspended’)
AND r.session_id <> @@spid
AND r.session_id > 50
GROUP BY
r.session_id,
r.database_id,
s.host_name,
s.login_name,
s.login_time,
s.program_name,
r.start_time,
r.sql_handle,
r.plan_handle,
r.blocking_session_id,
r.wait_type,
r.wait_resource,
r.wait_time,
r.last_wait_type,
r.total_elapsed_time,
r.transaction_id,
r.transaction_isolation_level,
r.open_transaction_count,
r.command,
s.last_request_start_time,
s.last_request_end_time
) x
— If email recipients set and blocking above threshold then email blocking alert
— But don’t send another email within @MailInterval
IF @Recipients IS NOT NULL AND DATEDIFF (ss, @LastEmail, GETDATE()) > @MailInterval
BEGIN
IF EXISTS (
SELECT 1
FROM session_capture sc
WHERE sc.date_captured = (SELECT MAX(Date_Captured) FROM Session_Capture)
AND wait_time > @Threshold
AND blocking_session_id <> 0
)
BEGIN
SELECT @MailSubject = ‘Prod Blocking Alert’
SET @tableHTML =
N'<H5 style=”font-family:verdana”> Blocking and blocked sessions</H5>’ +
N'<table width = 1300 border=”1″ style=”font-family:verdana;font-size:60%”>’ +
N'<tr><th width = 100>Session_ID</th><th width = 100>Blocking_Session</th><th width = 100>Wait_Time</th><th width = 100>Hostname</th>’ +
N'<th width = 100>NT_Username</th><th width = 100>DB_Name</th><th width = 500>Text</th>’ +
N'<th width = 100>Current_Command</th><th width = 100>Sequence</th></tr>’
+
— Query blocked and lead blocker session data
CAST ( ( SELECT DISTINCT
td = sp.spid,’ ‘,
td = sp.blocked,’ ‘,
td = sp.waittime,’ ‘,
td = sp.hostname,’ ‘,
td = sp.nt_username + ‘ ‘,’ ‘,
td = sd.name,’ ‘,
td = cast(st.text as nvarchar(500)),’ ‘,
td = case when sp.stmt_start > 0 then
case when sp.stmt_end > sp.stmt_start then
substring(st2.text, 1 + sp.stmt_start/2, 1 + (sp.stmt_end/2)-(sp.stmt_start/2))
else
substring(st2.text, 1 + sp.stmt_start/2, len(st2.text))
end
else ”
end, ‘ ‘,
td = case sp.waittime
when 0 then 999999
else sp.waittime
end, ‘ ‘
from sys.sysprocesses (nolock) sp
left outer join sys.sysprocesses (nolock) s2
on sp.spid = s2.blocked
inner join sys.sysdatabases (nolock) sd
on sp.dbid = sd.dbid
left outer join sys.dm_exec_requests er
on sp.spid = er.session_id
cross apply
( SELECT text AS [text()]
FROM sys.dm_exec_sql_text(sp.sql_handle)
) st (text)
cross apply
sys.dm_exec_sql_text(sp.sql_handle) st2
WHERE ((s2.spid IS NOT NULL AND sp.blocked = 0) OR sp.blocked <> 0)
AND sp.SPID > 50
and (s2.ecid is null or s2.ecid = 0)
order by case sp.waittime
when 0 then 999999
else sp.waittime
end desc
FOR XML PATH(‘tr’), TYPE
) AS NVARCHAR(MAX) ) +
N'</table>’
EXEC msdb.dbo.sp_send_dbmail
@recipients = @Recipients,
@subject = @MailSubject,
@body = @TableHTML,
@body_format = ‘HTML’
SET @LastEmail = GETDATE()
END
END
— Check table size, if > @TargetSizeMB then purge old data
IF (SELECT SUM(reserved_page_count) FROM sys.dm_db_partition_stats where object_id = object_id(‘session_capture’)) > @TargetSizeMB*1024*1024/8192
BEGIN
— shrink table size to 80% of @TargetSizeMB
DECLARE @totalrows bigint, @totalpages bigint
SELECT @totalpages = SUM(reserved_page_count),
@totalrows = SUM(CASE WHEN (index_id < 2) THEN row_count ELSE 0 END)
FROM sys.dm_db_partition_stats ps
WHERE object_id = object_id(‘session_capture’)
— Calculate how many rows to delete to be left with 80% of TargetSize
SELECT @totalrows = @totalrows – (@totalrows * @TargetSizeMB*1024*1024/8192*0.8 / @totalpages)
DELETE sc
FROM (SELECT TOP (@totalrows) *
FROM SESSION_CAPTURE
ORDER BY date_captured) sc
END
WAITFOR DELAY ’00:00:10′
END TRY
BEGIN CATCH
WAITFOR DELAY ’00:00:10′
END CATCH
END
GO
CREATE PROC [dbo].[SessionCapturePurge]
@RetentionDays int = 14
AS
DELETE
FROM session_capture
WHERE date_captured < DATEADD(dd, – @RetentionDays, GETDATE())
GO
And create this one in master:-
USE [master]
GO
/****** Object: StoredProcedure [dbo].[SessionCapture] Script Date: 04/05/2011 10:21:03 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE procedure [dbo].[SessionCapture]
as
exec DBAdata.dbo.SessionCapture
@TargetSizeMB = 4000,
@Threshold = 20000,
@Recipients = ‘yourname@youremaildomain.com’,
@MailInterval = 900
GO
4. Mark the master.dbo.SessionCapture as a startup proc, by executing the following 2 SPs:-
exec sp_configure ‘scan for startup procs’, 1
RECONFIGURE WITH OVERRIDE
exec sp_procoption ‘SessionCapture’, ‘startup’, 1
5. Create an SQL Server Agent job to daily prune the old data.
Schedule the SessionCapturePurge SP every day, to prune (at least) the last 7 days’ capture.
This data & its collection can be very helpful Firstly it can alert you (by email/database mail) to blocking. It can be very useful in analyzing long-running SQL batches, and blocking, historically. Also, it can be very helpful for analyzing wait stats in hindsight.