NVIDIA Nsight Graphics for UE
Nsight Graphics is what you reach for when RenderDoc tells you which draw is slow but not why. SM occupancy, warp stall reasons, Shader Execution Reordering analysis, ray-tracing live-state inspection — all NVIDIA-specific telemetry you can't get anywhere else. This tutorial covers the launch flow for UE5, Frame Capture vs GPU Trace, the Ray Tracing Live State workflow that produced the 24% gain in Indiana Jones path tracing, and Aftermath crash dumps for shipping titles.
When Nsight beats RenderDoc / PIX
Three categories where Nsight is the right answer:
- SM-level performance. SOL%, warp stall reasons (L1 Long Scoreboard, lg_throttle, L2 Limited), L1/L2 hit rates — not exposed by RenderDoc or PIX.
- Ray tracing on RTX. TLAS/BLAS inspection, AABB overlap, RT live state size, Shader Execution Reordering analysis.
- Crash post-mortem. Nsight Aftermath ingests .nv-gpudmp dumps and shows page-fault address → last completed event marker mapping. Required for any shipped title.
Latest version: Nsight Graphics 2026.1.0 (build 26067 Windows x64, 26083 ARM64 Linux). RenderDoc still wins on cross-API and pixel history; PIX wins on Windows D3D12 multi-frame timing. Nsight wins on NVIDIA hardware specifics.
Install & launch flow
Install Nsight Graphics 2026.1 from developer.nvidia.com. The senior pattern is to launch UnrealEditor.exe from Nsight rather than attaching after the fact, because the hooks fire correctly only at process start.
- Nsight Graphics → New Project → "Application Executable" → point at UnrealEditor.exe with your project as cmdline.
- Activity: Frame Profiler for individual frame analysis; GPU Trace Profiler for SM/warp metrics; Generate C++ Capture for offline replay.
- Click "Launch Frame Profiler" (or equivalent for the activity).
- Editor boots; play in PIE; capture from the Nsight overlay (F11 default).
Frame Capture (Graphics Capture)
Nsight 2025.4+ promoted Graphics Capture from beta — captures persist directly to disk for offline analysis. Workflow:
- Pre-warm shaders in PIE (UE's PSO compile spikes corrupt timing).
- Click Capture in the Nsight overlay; capture writes to disk.
- Open the .nsight-gfx file later for offline analysis.
This is the recommended workflow for CI: a small script can drive a packaged build to a known scenario, capture, and ship the file as an artifact for engineers to inspect.
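The artifact step of such a CI script might look like the sketch below. It assumes captures land in a known directory and use the .nsight-gfx extension mentioned above. `collect_captures`, the directory layout, and the naming scheme are hypothetical; launching the build and triggering the capture are left out because they depend on your setup.

```python
import shutil
from pathlib import Path


def collect_captures(capture_dir, artifact_dir, build_id, suffix=".nsight-gfx"):
    """Copy capture files into artifact_dir, prefixed with build_id.

    Hypothetical helper: the paths, suffix, and naming scheme are
    assumptions for illustration, not part of Nsight or UE.
    """
    capture_dir = Path(capture_dir)
    artifact_dir = Path(artifact_dir)
    artifact_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # Sort so the artifact list is stable from run to run.
    for src in sorted(capture_dir.rglob(f"*{suffix}")):
        dst = artifact_dir / f"{build_id}_{src.name}"
        shutil.copy2(src, dst)  # copy2 keeps timestamps, useful when triaging
        copied.append(dst)
    return copied
```

Prefixing each file with the build identifier keeps captures from different CI runs distinguishable once they sit side by side in an artifact store.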
Frame Profiler & ProfileGPU bridge
Use UE's built-in ProfileGPU first to localize the regression to a pass, then Nsight Frame Profiler → "Profile Shaders" on the offending pass for SM-level detail. This two-step is the senior pattern; Nsight first is a waste of capture time when you don't yet know which pass to focus on.
Set r.RDG.Debug 1 + r.RDG.Events 1 for richer event markers so Nsight's nesting matches UE's RDG passes.
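One way to apply those cvars at startup is UE's -ExecCmds command-line switch, which runs a comma-separated list of console commands. The sketch below only builds the argument string to paste into the Nsight project's command line. `exec_cmds_arg` is a hypothetical helper, and setting the cvars in an ini file or the in-editor console would work just as well.

```python
def exec_cmds_arg(cvars):
    """Build a -ExecCmds argument that sets console variables at startup.

    cvars maps a console variable name to the value to set.
    Hypothetical helper; only the cvar names come from the text.
    """
    commands = [f"{name} {value}" for name, value in cvars.items()]
    # -ExecCmds separates commands with commas and the whole list is quoted,
    # so a comma or quote inside a single command would corrupt the argument.
    for command in commands:
        if "," in command or '"' in command:
            raise ValueError(f"unsupported character in command: {command!r}")
    return '-ExecCmds="' + ",".join(commands) + '"'


rdg_markers = {"r.RDG.Debug": 1, "r.RDG.Events": 1}
print(exec_cmds_arg(rdg_markers))
# prints: -ExecCmds="r.RDG.Debug 1,r.RDG.Events 1"
```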
GPU Trace & Advanced Mode (SM occupancy)
GPU Trace Profiler is the deep-telemetry mode. Advanced Mode metrics expose:
- SOL% (Speed of Light). Percentage of theoretical max throughput hit. SM SOL < 50% on a Lumen pass = under-launched dispatch.
- SM Throughput For Active Cycles. Compute density when warps are running.
- L1/L2 hit rates. Memory pressure indicator.
- Warp stall reasons. Why warps aren't issuing — L1 Long Scoreboard (memory wait), lg_throttle (LD/ST throttle), L2 Limited.
The Range Profiler was deprecated; older UE blog posts referencing it are out of date. In 2026.1 the replacement is GPU Trace Profiler (per NVIDIA's migration guide).
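As a rough illustration of how these metrics combine, the sketch below turns a few per-pass numbers into triage hints. The 50% SM SOL threshold and the stall-reason glosses come from the list above. The `PassMetrics` structure and the hint wording are assumptions, and the values are meant to be read off the GPU Trace UI by hand, not fetched from any Nsight API.

```python
from dataclasses import dataclass

# Glosses for stall reasons, taken from the metric list above.
STALL_HINTS = {
    "L1 Long Scoreboard": "warps are waiting on memory",
    "lg_throttle": "the LD/ST path is throttling",
}


@dataclass
class PassMetrics:
    """Numbers copied by hand from a GPU Trace pass (hypothetical structure)."""
    name: str
    sm_sol_pct: float
    top_stall_reason: str


def triage(metrics):
    """Return a list of human-readable hints for one pass."""
    hints = []
    if metrics.sm_sol_pct < 50.0:
        hints.append("SM SOL below 50%: dispatch may be under-launched")
    gloss = STALL_HINTS.get(metrics.top_stall_reason)
    if gloss is not None:
        hints.append(f"top stall {metrics.top_stall_reason}: {gloss}")
    return hints
```

For example, `triage(PassMetrics("ExamplePass", 42.0, "lg_throttle"))` returns both hints, while a pass at 80% SM SOL with an unlisted stall reason returns an empty list.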
Ray Tracing Inspector for Lumen HWRT
For DXR/Lumen HWRT investigations:
- Inspect TLAS/BLAS — instance counts, BLAS overlap, build flags.
- UE5 emits these via RDG; this requires r.RayTracing=1 and Lumen HWRT to be enabled before launching Nsight.
- RGP-equivalent BVH analysis is in the Ray Tracing Inspector tab.
NVIDIA's Indiana Jones path-tracing case study reported an order-of-magnitude traversal-cost improvement after splitting misaligned BLAS — the exact pattern Nsight surfaces (per NVIDIA's case study).
Shader Execution Reordering (SER)
SER is the RTX 40-series' hardware-accelerated reordering of divergent rays into coherent groups. To analyze it, open the Ray Tracing Live State tab in GPU Trace and identify the GLSL/HLSL declarations that are kept live, and therefore spilled, across traceRay/callable boundaries.
From the Indiana Jones case study: SER alone 4.08 ms → 3.63 ms (~11% gain); with RT live state reduced from 222 to 84 bytes, SER yielded 4.07 ms → 3.08 ms (~24% gain); active threads/warp went 38% to 68–70%.
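The percentages follow directly from the quoted timings. The short check below recomputes them; `percent_gain` is just a helper name chosen here, and the live-state reduction figure is derived from the 222 and 84 byte values and is not a number quoted from the case study.

```python
def percent_gain(before, after):
    """Relative reduction, as a percentage of the 'before' value."""
    return (before - after) / before * 100.0


ser_only = percent_gain(4.08, 3.63)            # frame time in ms
ser_small_live_state = percent_gain(4.07, 3.08)  # frame time in ms
live_state_cut = percent_gain(222, 84)         # same formula applied to bytes

print(f"SER alone: {ser_only:.1f}%")                           # 11.0%
print(f"SER + reduced live state: {ser_small_live_state:.1f}%")  # 24.3%
print(f"live state reduction: {live_state_cut:.1f}%")          # 62.2%
```

Read together, shrinking the live state by roughly 62% more than doubled what SER delivered on its own.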
UE 5.6 enables SER by default for Lumen HWRT (r.Lumen.HardwareRayTracing.ShaderExecutionReordering=1) on supported hardware. The NVIDIA-published number for Lumen reflections in CitySample is 20–30% faster with SER on RTX 40 (per NVIDIA's SER announcement).
Nsight Aftermath for GPU crashes
For shipping titles, Aftermath is non-negotiable. Crash dumps from production let you find the offending warp without a repro:
- Build with the Aftermath SDK integrated (UE 5.3+ docs cover this; SDK 2023.2+ recommended).
- Run with -gpucrashdebugging on the command line. (Do not combine with -d3ddebug; UE docs explicitly warn this combination produces unusable output.)
- On crash, a .nv-gpudmp file is written to Saved/Crashes/.
- Ingest in Nsight Graphics 2023.3+: page-fault address → last-completed event marker; SM register dump → offending warp.
Configure r.D3D12.GPUTimeout to be more permissive for long Aftermath captures — default Windows TDR cuts the dump short.
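Because the -d3ddebug conflict is easy to reintroduce through a shared launch script, a small guard can help. The sketch below assumes a launcher that assembles the argument list in Python. `gpu_crash_args` is a hypothetical helper, and only the two switch names come from the steps above.

```python
def gpu_crash_args(base_args):
    """Return base_args plus -gpucrashdebugging, rejecting -d3ddebug.

    Hypothetical launcher helper; it does not start the process itself.
    """
    # Compare in lower case on the assumption that UE matches switches
    # case-insensitively.
    lowered = {arg.lower() for arg in base_args}
    if "-d3ddebug" in lowered:
        raise ValueError("-d3ddebug must not be combined with -gpucrashdebugging")

    args = list(base_args)
    if "-gpucrashdebugging" not in lowered:
        args.append("-gpucrashdebugging")
    return args
```

Failing loudly at launch time is a deliberate choice here: a dump that turns out to be unusable is usually discovered only after the crash you wanted to catch.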
Decision matrix: symptom → tool
| Symptom | Tool |
|---|---|
| Low SM SOL on a dispatch | GPU Trace Profiler |
| High RT live state, divergent rays | Ray Tracing Live State tab + SER inspection |
| TDR / GPU device removal | Aftermath crash dump |
| "Why is this draw slow on RTX?" | Frame Profiler → Profile Shaders |
| BVH traversal cost | Ray Tracing Inspector |
| "PSO precaching coverage" | Built-in r.PSOPrecache.Validation 2 |
| CUDA-backed UE plugin (NIM/ACE, custom denoiser) | Nsight Compute |
For non-NVIDIA hardware (AMD, Intel), reach for RGP / Radeon tools or RenderDoc. Nsight Graphics is NVIDIA-only.
PerfGuard regression reports flag pass-level GPU timing deltas; the natural drill-down for an NVIDIA-specific RT regression is opening that scenario in Nsight's Ray Tracing Live State tab. PerfGuard CI can drop .nv-gpudmp and capture artifacts next to its baseline JSON so the engineer arrives with the data already loaded.
- RenderDoc for Unreal — cross-API alternative.
- PIX on Windows — D3D12-specific timing capture.
- AMD Vendor Tools — the AMD analog (RGP / RRA / RMV).
- Lumen Performance — the system most often investigated in Nsight.