Tracking CPU Usage and Memory Utilization
After you bring up the Dashboard, what can you do with it?
For starters, we recommend you start by looking at CPU and memory metrics. Looking at the Linux Run Queue, System: % CPU and CPU Utilization metrics can help you identify runaway processes and unanticipated system activity. If your CPUs are highly saturated, it’s time to start looking at the processes on the system using command-line tools like pcp atop
to identify processes hogging CPU or memory on the system.
From a SQL Server perspective, the SQL wait statistics are pretty important to help you track down CPU and memory utilization issues. A high number of SOS_SCHEDULAR_YIELD or THREADPOOL might indicate that your workload needs additional processing resources.
Similarly, SQL Server: Memory Manager metrics can help you identify how much memory SQL Server is working with overall, as well as how much memory is being “stolen” for purposes other than the SQL page cache. An usually high number for stolen server memory indicates memory pressure in the database system, and fixing it might require a change either to how SQL Server is utilizing its memory or increasing the amount of memory provided by the underlying OS platform.
Tracking Disk Utilization
Memory problems can trigger disk problems, so we need to investigate this area next. Without enough memory, disk activity can rapidly increase because we’re paging memory out to disk (swapping).
You’ll have already determined whether you’re swapping if you’ve gone to use the pcp atop
command we described above.
Perhaps the most interesting thing you can look at with Disk Utilization is whether you are running out of throughput or IOPS.
You can do some basic testing to confirm the maximum capability of your devices. For this, fio
is a great tool for generating synthetic workloads. If you know the max throughput and max input/output operations per second (IOPS) you can then use that knowledge to determine when to add more storage via RAID, move to faster storage, add caching to fast storage tiers such as those using Intel’s Optane DC Persistent Memory, or when to split up your data stores.
SQL statistics can also help you understand storage contention issues. For example, you can look at the SQL OS Wait Stats and examine the PAGE_I/O_LATCH metric. Some number of waits for SQL Server is typically acceptable but an order of magnitude difference in this statistic versus other wait types might indicate a problem that needs fixing by moving to faster storage.
That said, there is also the possibility that you have a database tuning problem such as a poorly designed index that is spiking up disk I/O.
We have now covered some of the key metrics. In this table we include several more useful metrics listed in the order of their appearance in the dashboard panels:
Linux: Run Queue
|
This panel displays the number of processes that are marked runnable by the kernel at any given time.
|
System: % CPU
|
Shows the percentage of CPU being used by sys (kernel) and user processes.
|
Per CPU Busy (User)
|
Shows the percentage of each CPU busy running processes in user space.
|
Per CPU Busy (Sys)
|
Shows the percentage of each CPU busy running processes in sys (kernel) space.
|
Disk Utilization
|
Represents the overall utilization of each disk device on the system.
|
Disk Throughput
|
Represents read and write throughput.
|
Disk IOPS
|
Represents read and write IOs per second.
|
SQL Server: General Statistics – User Connections
|
The number of connected users, Running PCP will require a user connection. A second user is typically consumed for health monitoring if running an Availability Group or Failover Cluster using the Red Hat High Availability add-on.
|
SQL Server: Memory Manager
|
Total, stolen and reserved server memory as seen by SQL Server.
|
SQL Server Statistics: Batch Requests
|
Batch requests, compilations and recompilations per second.
|
SQL Server: Network IO Waits
|
Average network wait time in milliseconds, number of waits in progress, waits started per second and cumulative wait time in milliseconds per second. All statistics relevant to wait on network I/O.
|
SQL Server: Access Methods Page Splits
|
Number of page splits per second that occur as the result of overflowing index pages.
|
SQL Server: Statistics Compilations
|
Number of compilations and recompilations.
|
SQL Server: Plan Cache (Cache Hit Ratio)
|
Cache hit ratio for temporary tables and their variables, extended stored procedures, bound trees, SQL plans and Object plans along with the total cache hit ratio.
|
SQL OS Waiting Tasks
|
Number of waits for each wait type. This counter is incremented at the start of each wait.
|
SQL OS Maximum Wait Stats
|
Maximum wait time for each wait type.
|
SQL OS Wait Times
|
Total wait time for this wait type in milliseconds.
|
SQL Server Latch Waits
|
Internal SQL Server resource locks known as latches can help you to determine user activity and resource usage and identify performance bottlenecks. This metric tracks the number of latch requests that cannot be granted immediately.
|
SQL Server Latch Wait Times
|
This metric tracks the total time (in milliseconds) spent waiting for latch requests in the last second.
|
The dashboard tries to present SQL Server stats that are interesting at a glance, but that doesn’t mean that you don’t need to baseline your environment for those statistics to understand a definition of “normal."
Data annotations and sharing
Annotations in Grafana provide a way to mark points on the graph, linking data metrics with stories that you want to tell with your data. You can also easily export and share your PCP metrics using Grafana with other administrators so that they can use the same dashboard to view and troubleshoot performance problems.