Visualizing performance with RHEL 8.4 Grafana Dashboard for Microsoft SQL Server

7 giugno 2021Don Pinto, Louis Imershein6 minuti (tempo di lettura)

If you’re managing databases, you need to be able to identify performance bottlenecks and figure out how to address them. This often involves pinpointing the exact component in the complex web of interconnected dependencies that is causing the issue.

If you’re a Microsoft Windows administrator, the tool to start with is obvious, it’s called perfmon. However, if you’re one of the nearly 1/3rd of respondents to a recent survey who are running SQL Server on Linux, you probably want to consider using Performance Co-Pilot (PCP) and Grafana on Red Hat Enterprise Linux (RHEL), which brings together SQL Server and RHEL metrics in a single easy-to-use dashboard.

When it comes to measuring performance on RHEL, PCP is Red Hat’s preferred choice for collecting metrics. With Grafana becoming a popular open source analytics and monitoring interface for many use-cases, you can use it in RHEL to visualize metrics collected by PCP. Grafana can be configured to show metrics over a specific time range, and the metrics can be exported if you wish to your favorite spreadsheet.

Grafana also comes ready to handle several data sources, and the RHEL PCP-Grafana integration provides vector, redis, and bpftrace as data sources. These integrations also bring several pre-configured dashboards or allow you to customize your own dashboard based on your particular use case.
Visualizing with grafana dashboard Fig 1

Since RHEL 8.2, PCP has had support for SQL Server metrics. You can learn more about how to configure it here. RHEL 8.4 adds several changes:

The Python dependency pyodbc is now available as an RPM, which is automatically brought in as a dependency when installing the pcp-pmda-mssql package. This means no more need to use pip to install pyodbc.
There is a new PCP vector based Microsoft SQL Server dashboard. To enable this, go to Data Sources->PCP Vector->Dashboards as shown in the screenshot below, and click import. This imported dashboard can then be accessed from the dashboards section in Grafana found at http://<server-name>:3000/dashboards
For more information on how to set up PCP with Grafana and PCP Vector, please see: Visualizing system performance with RHEL 8 using Performance Co-Pilot (PCP) and Grafana (Part 1).

Using this dashboard, RHEL administrators and SQL Server DBAs can spot key metrics that might be slowing down the performance of their SQL Server workloads on RHEL.
If you want to automate the installation of PCP for SQL Server, you can also use the Performance Metrics Role for configuring PCP for SQL Server. More information about it can be found here.

Exploring the new Microsoft SQL Server Dashboard

Figuring out where the performance bottlenecks are within the operating system is a challenging problem by itself, and running a database with layers of abstraction on top of it, adds additional complexities.

Operating system and database performance metrics visualizations are available through tools like the new PCP Vector-based Microsoft SQL Server Dashboard in RHEL 8.4. This provides easier performance data correlation and helps to get to the root cause of performance issues more quickly than before.

The dashboard brings together information from the Linux kernel, SQLOS, and the SQL Server database layer and then displays it in Grafana panels using PCP vector graphs.

Visualizing with grafana dashboard Fig 3

Tracking CPU Usage and Memory Utilization

After you bring up the Dashboard, what can you do with it?

For starters, we recommend you start by looking at CPU and memory metrics. Looking at the Linux Run Queue, System: % CPU and CPU Utilization metrics can help you identify runaway processes and unanticipated system activity. If your CPUs are highly saturated, it’s time to start looking at the processes on the system using command-line tools like pcp atop to identify processes hogging CPU or memory on the system.

From a SQL Server perspective, the SQL wait statistics are pretty important to help you track down CPU and memory utilization issues. A high number of SOS_SCHEDULAR_YIELD or THREADPOOL might indicate that your workload needs additional processing resources.

Similarly, SQL Server: Memory Manager metrics can help you identify how much memory SQL Server is working with overall, as well as how much memory is being “stolen” for purposes other than the SQL page cache. An usually high number for stolen server memory indicates memory pressure in the database system, and fixing it might require a change either to how SQL Server is utilizing its memory or increasing the amount of memory provided by the underlying OS platform.

Tracking Disk Utilization

Memory problems can trigger disk problems, so we need to investigate this area next. Without enough memory, disk activity can rapidly increase because we’re paging memory out to disk (swapping).

You’ll have already determined whether you’re swapping if you’ve gone to use the pcp atop command we described above.

Perhaps the most interesting thing you can look at with Disk Utilization is whether you are running out of throughput or IOPS.

You can do some basic testing to confirm the maximum capability of your devices. For this, fio is a great tool for generating synthetic workloads. If you know the max throughput and max input/output operations per second (IOPS) you can then use that knowledge to determine when to add more storage via RAID, move to faster storage, add caching to fast storage tiers such as those using Intel’s Optane DC Persistent Memory, or when to split up your data stores.

SQL statistics can also help you understand storage contention issues. For example, you can look at the SQL OS Wait Stats and examine the PAGE_I/O_LATCH metric. Some number of waits for SQL Server is typically acceptable but an order of magnitude difference in this statistic versus other wait types might indicate a problem that needs fixing by moving to faster storage.

That said, there is also the possibility that you have a database tuning problem such as a poorly designed index that is spiking up disk I/O.

We have now covered some of the key metrics. In this table we include several more useful metrics listed in the order of their appearance in the dashboard panels:

Linux: Run Queue	This panel displays the number of processes that are marked runnable by the kernel at any given time.
System: % CPU	Shows the percentage of CPU being used by sys (kernel) and user processes.
Per CPU Busy (User)	Shows the percentage of each CPU busy running processes in user space.
Per CPU Busy (Sys)	Shows the percentage of each CPU busy running processes in sys (kernel) space.
Disk Utilization	Represents the overall utilization of each disk device on the system.
Disk Throughput	Represents read and write throughput.
Disk IOPS	Represents read and write IOs per second.
SQL Server: General Statistics – User Connections	The number of connected users, Running PCP will require a user connection. A second user is typically consumed for health monitoring if running an Availability Group or Failover Cluster using the Red Hat High Availability add-on.
SQL Server: Memory Manager	Total, stolen and reserved server memory as seen by SQL Server.
SQL Server Statistics: Batch Requests	Batch requests, compilations and recompilations per second.
SQL Server: Network IO Waits	Average network wait time in milliseconds, number of waits in progress, waits started per second and cumulative wait time in milliseconds per second. All statistics relevant to wait on network I/O.
SQL Server: Access Methods Page Splits	Number of page splits per second that occur as the result of overflowing index pages.
SQL Server: Statistics Compilations	Number of compilations and recompilations.
SQL Server: Plan Cache (Cache Hit Ratio)	Cache hit ratio for temporary tables and their variables, extended stored procedures, bound trees, SQL plans and Object plans along with the total cache hit ratio.
SQL OS Waiting Tasks	Number of waits for each wait type. This counter is incremented at the start of each wait.
SQL OS Maximum Wait Stats	Maximum wait time for each wait type.
SQL OS Wait Times	Total wait time for this wait type in milliseconds.
SQL Server Latch Waits	Internal SQL Server resource locks known as latches can help you to determine user activity and resource usage and identify performance bottlenecks. This metric tracks the number of latch requests that cannot be granted immediately.
SQL Server Latch Wait Times	This metric tracks the total time (in milliseconds) spent waiting for latch requests in the last second.

The dashboard tries to present SQL Server stats that are interesting at a glance, but that doesn’t mean that you don’t need to baseline your environment for those statistics to understand a definition of “normal."

Data annotations and sharing

Annotations in Grafana provide a way to mark points on the graph, linking data metrics with stories that you want to tell with your data. You can also easily export and share your PCP metrics using Grafana with other administrators so that they can use the same dashboard to view and troubleshoot performance problems.

Conclusion

Now that you have another performance troubleshooting tool for SQL Server on RHEL in your toolbox, you should start using it. A good place to start in understanding the Red Hat side of the equation is the latest RHEL documentation on Managing System Status and Performance.

You can also learn more about SQL Server performance metrics through Microsoft’s SQL Server Performance Monitoring and Tuning Tools documentation. The new Grafana SQL Server dashboard in RHEL 8.4 is off to a great start, bringing together useful statistics from RHEL and SQL Server for monitoring.

To learn more about SQL Server on Red Hat Enterprise Linux visit: Running Microsoft SQL Server on Red Hat Enterprise Linux.

Sugli autori

Don Pinto

Technical Product Marketing Manager

Don Pinto is a Technical Product Marketing Manager at Red Hat focused on helping customers understand why Red Hat Enterprise Linux is an ideal operating system platform for modern application workloads. Pinto is passionate about data management and operating systems, having authored several technical blogs and white papers on various tech topics. Pinto holds a Masters degree in Computer Science and a Bachelor's degree in Computer Engineering from the University of Toronto, Canada.

Read full bio

Louis Imershein

Product Manager

Louis Imershein is a Product Manager at Red Hat where his focus is on improving the Red Hat Enterprise Linux software developer experience. He has an extensive background in software development, architecture, quality assurance, technical support, and product management on a wide range of OS, management, security, and storage software projects. Louis joined Red Hat as part of the acquisition of Permabit Technology Corporation, where he was VP of Product.

Read full bio