TL;DR

ncu is often considered too high-overhead to use for profiling within larger codebases. This isn’t necessarily true.

In this three-part series, we show how to use NVTX scoping, kernel filtering, replay modes, and highly selective metric collection to profile individual kernels, even in workloads that launch thousands of kernels, without prohibitive overhead.

I’ll also show why relying on ncu alone can be misleading, and how combining nsys / torch.profiler with ncu leads to more accurate performance conclusions.

If you don’t want to read the whole post, all you essentially need is the following command, along with NVTX markers in your codebase:

ncu \
	--nvtx \
	--nvtx-include <regex to match nvtx region used to constrain profiling region> \
	--metrics <list of very specific metrics to actually query to reduce replays> \
	--set <alternative way of constraining the metrics using pre-created sets of metrics> \
	--replay-mode <depending on how many kernels are profiled and the metrics being profiled, both `kernel` and `application` can be good candidates> \
	--kernel-id ::<kernel_name>:<n-th instance of launch> \
	<command that launches your application>
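
To make the two ingredients above concrete, here is a minimal Python sketch, assuming a PyTorch codebase. It shows one way to wrap a region in an NVTX range and one way to assemble the matching `ncu` argument list. The helper names (`nvtx_range`, `build_ncu_command`), the region name `attention_block`, the kernel name `my_kernel`, the `train.py` launch command, and the two metric names are illustrative choices, not part of the original command. The trailing `/` in the filter is an assumption about how ncu denotes push/pop ranges, so check the exact filter syntax against the Nsight Compute documentation for your version.

```python
import contextlib
import shlex
from typing import Iterator, List, Optional, Sequence

try:
    import torch
    _HAVE_CUDA = torch.cuda.is_available()
except ImportError:
    _HAVE_CUDA = False


@contextlib.contextmanager
def nvtx_range(name: str) -> Iterator[None]:
    """Wrap a code region in an NVTX push/pop range (no-op without CUDA)."""
    if _HAVE_CUDA:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _HAVE_CUDA:
            torch.cuda.nvtx.range_pop()


def build_ncu_command(
    target: Sequence[str],
    nvtx_include: str,
    metrics: Sequence[str] = (),
    metric_set: Optional[str] = None,
    replay_mode: str = "kernel",
    kernel_name: Optional[str] = None,
    launch_index: int = 1,
) -> List[str]:
    """Assemble an ncu argv list using only the flags from the command above."""
    # This sketch only accepts the two replay modes the post mentions.
    if replay_mode not in ("kernel", "application"):
        raise ValueError(f"unsupported replay mode in this sketch: {replay_mode}")
    cmd = ["ncu", "--nvtx", "--nvtx-include", nvtx_include]
    if metrics:
        # --metrics takes a comma-separated list of metric names.
        cmd += ["--metrics", ",".join(metrics)]
    if metric_set is not None:
        # --set is the alternative: a pre-created set of metrics.
        cmd += ["--set", metric_set]
    cmd += ["--replay-mode", replay_mode]
    if kernel_name is not None:
        # Same shape as above: ::<kernel_name>:<n-th instance of launch>
        cmd += ["--kernel-id", f"::{kernel_name}:{launch_index}"]
    return cmd + list(target)


# In the application: mark the region you want ncu to focus on.
with nvtx_range("attention_block"):
    pass  # launch the kernels of interest here

# On the profiling side: build the command that targets that region.
example_cmd = build_ncu_command(
    target=["python", "train.py"],      # hypothetical launch command
    nvtx_include="attention_block/",    # assumed push/pop filter syntax
    metrics=["gpu__time_duration.sum", "dram__bytes_read.sum"],
    replay_mode="kernel",
    kernel_name="my_kernel",            # hypothetical kernel name
    launch_index=1,
)
print(shlex.join(example_cmd))
```

Building the command as a list rather than a single string keeps quoting explicit and makes it easy to pass straight to `subprocess.run` on a machine where `ncu` is installed.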

This blog post is split into three parts.

Part 1 - Covers the basics of ncu and how to reason about the various metrics you can collect with it.
Part 2 - Covers how to reduce the overhead of profiling with ncu so that you can use it in large codebases. (Coming Soon)
Part 3 - Examines how accurate profiling with ncu alone is and how we think the analysis can be improved. (Coming Soon)

Part 1 - Nsight Compute (`ncu`) Metrics

Quick intro on `ncu`

Nsight Compute is one of NVIDIA’s profiling tools, built to extract low-level performance data from kernel execution on the GPU. At a high level, the profiler works by intercepting calls between your GPU kernel and the driver to track every interaction the kernel has with the GPU, and by reading specific registers in the GPU to extract data such as op counts, bytes transferred, and cache usage.

The rest of this section focuses on building a correct mental model for ncu metrics. We don’t show how to collect metrics with low overhead yet (Part 2 handles that). The goal here is to better understand the metrics you collect from ncu. This matters because one way to reduce profiling overhead is to be very selective about the metrics you collect. **This section will help you decide exactly what to collect so you can reduce the profiling overhead.**

Metrics you can collect

Nsight Compute exposes 3601 hardware and derived metrics. The docs provide a partial list; if you would like to see them all, you can enumerate the full set by running:
