Friday, September 6, 2019

Lightweight SQL Server Health Monitoring & Visualization


Even with the introduction of Query Store in SQL Server 2016 and with the introduction of automatic tuning in Azure SQL database, there is benefit to capture and retain diagnostic performance metrics. This post explains a lightweight way to use well-known DMV queries to capture key SQL Server performance data for alerting and long-term reporting. It involves creating a database on the server(s) to be monitored. The script for the database is here.

This system captures data for (1) weight statistics, (2) SQL I/O and (3) stored procedure performance. Below are a few reports to illustrate what sort of visualizations are possible with this approach. After that, I provide a brief explanation of how it works including everything you need to try it out yourself.

(1)  Wait Statistics showing by date - wait task/sec (gray bars), avg resource wait time (blue) and avg signal wait time (green). For more information on wait statistics, visit this article by Paul Randal.




(2)  SQL I/O showing by date - read/write counts (bars), avg write wait time (blue) and avg read wait time (orange).






(3)  Stored Procedure Performance – 3 samples below
(a)  Graph showing sproc execution rate (gray bars) and tuning benefit reducing CPU and Wall Clock from 20ms to 5ms




(b)  Two graphs (CPU/Wall Clock & Logical Reads) showing sproc degradation (resulting from uplift of compatibility to SQL 2014) and benefit after resolution. Weekly subscription of these graphs provided quick alerts and timely resolution, also with the visibility to quantify/confirm the benefit.








(c)  Sample illustrating performance degradation and long-term benefit of resolution.




These three metrics (wait stats, i/o, sproc performance) have corresponding tables in the database each joining to common dimension tables. Below is a diagram of the sproc performance tables.



SQL Server dynamic management views (DMVs) which are the source of this data are cumulative and so capturing data for a slice of time involves taking a data sample, waiting some time, taking a 2nd data sample and then calculating deltas. In the database there is a sproc named uspCaptureMetrics which does all of this this. For wait stats and i/o, it captures the two data samples with a 10 min interval in between. Due to complexities of sproc (re)compilation and procedure cache, sproc performance works by taking 120 samples waiting 5 sec in between each sample. Then calculates the performance deltas by sproc for the greatest delta duration. 

uspCaptureMetrics takes 10 min to run with almost all of the time being WIATFOR in between execution of DMV queries. It is very lightweight. Recommendation is to create an agent job to execute uspCaptureMetrics every 15 minutes resulting in the ability to report down to the quarter-hour, making detailed visualizations like the one below possible – showing a CPU/exec reduction specifically between 5:00pm and 5:15pm on 8/1/2018. Likewise, this level of detail for a performance degradation greatly helps identify root cause.




To retain data long-term, you can implement a nightly ETL to a warehouse. This enables identifying longer term cyclical patterns and also makes possible long-term visualizations emphasizing for management the on-going benefits of specific tuning actions.

Running this system in development and test environments provides for a shift-left to identify and resolve performance issues before going to production.

The visualizations are reports in SSRS. That is beyond the scope of this post. Once you have the data, writing queries and reports is a necessary but not difficult task for someone with that skill set. 

Query Store and Automatic Tuning have their place but as a complement to these tools, capturing and retaining system performance metrics provides invaluable insights.

Hope this is of some use.