AMD Radeon GPU Profiler & Vendor Tools
RGP shows what UE's GPU Visualizer and Insights cannot — wave occupancy, latency hiding, barrier stalls, and ISA-level cost on RDNA hardware. AMD's UE perf guide explicitly recommends RGP "in lieu of GPU Visualizer as your profiling workhorse for RDNA GPUs." Bonus: RGP patterns translate directly to PS5 silicon (RDNA 2), making it the public scaffold for understanding console performance even when Razor for PS5 is NDA-locked.
Why a vendor profiler matters for RDNA
UE Insights and ProfileGPU answer "what took time"; RGP answers "why." The patterns RGP surfaces:
- Wave occupancy per dispatch — how many wavefronts the SIMD is hiding latency with.
- Stall reasons — LDS, VGPR pressure, memory wait, instruction throughput.
- Async overlap — whether your async compute jobs actually overlap with graphics work, vs serializing on barriers.
- ISA cost — instruction-level breakdown of what each shader is doing.
For UE5 specifically, RGP is the right tool to confirm whether r.TSR.AsyncCompute=2 actually overlaps TSR with the rest of the frame, whether Lumen reflections are stalling on a barrier, and whether Nanite cluster shading is occupancy-bound.
The Radeon Developer Tool Suite
- RGP (Radeon GPU Profiler) — timing/ISA/occupancy.
- RMV (Radeon Memory Visualizer) — VRAM allocation timeline.
- RRA (Radeon Raytracing Analyzer) — BVH structure, ray traversal cost.
- RGD (Radeon GPU Detective) — post-mortem GPU crash analysis.
- RGA (Radeon GPU Analyzer) — offline shader ISA analysis.
All five drive from RDP (Radeon Developer Panel) — the connection point to a target GPU/process. GPU PerfStudio is archived/legacy; don't reach for it on modern RDNA.
Preparing UE 5.5/5.6 for capture
RGP needs SCOPED_DRAW_EVENT scopes emitted as RGP user markers, plus deterministic build flags:
D3D12.EmitRgpFrameMarkers=1
r.GPUStatsEnabled=1
r.ShowMaterialDrawEvents=0 ; eliminates per-material marker overhead
r.RHICmdBypass=0
r.RHIThread.Enable=1
YourGame.exe -d3d12 -deterministic -fps=60 -benchmarkseconds=<n>
ALLOW_CHEAT_CVARS_IN_TEST or move [ConsoleVariables] values into DefaultEngine.ini. RGP markers silently disappear otherwise.
Driving RDP and capturing a profile
- Launch RDP. The system tab shows attached GPUs.
- Workflow tab → Profile workflow. RDP intercepts the next D3D12/Vulkan app launched.
- Launch your packaged UE5 build with the flags above. RDP confirms attach.
- In game, press the configured capture hotkey (default Ctrl+Shift+C from RDP).
- RDP "Recently collected profiles" tab → double-click to open in RGP.
Reading the RGP frame
RGP's main views:
- Frame Summary — bird's-eye time breakdown.
- Event Timing — the timeline; events here map to UE's
SCOPED_DRAW_EVENTmacros (Lumen, Nanite, etc.). - Pipeline State — PSO + ISA per dispatch; VGPR/SGPR/LDS counts.
- Wavefront Occupancy — the headline view. Graphics queue and Compute queue rows; bars = wavefronts in flight.
Wave occupancy on RDNA in UE5
Per AMD's "Occupancy explained":
- RDNA 2/3: 16 wavefronts per SIMD max. RDNA 1: 20.
- RX 7900 XTX: 1536 VGPRs/SIMD. A 120-VGPR shader caps at 12 wavefronts (wave32). High-VGPR shaders reduce occupancy.
- wave32 vs wave64 — UE compute shaders may compile wave64; wave64 runs each wavefront over 2 cycles with higher resource pressure. Reading occupancy as if wave32 misreads the budget.
In a typical UE5 capture, look for: Nanite cluster shading at <6/16 wavefronts (likely VGPR-limited), Lumen ScreenProbeGather with low occupancy on RDNA 1 (where wave64 is forced), and TSR running on the compute queue with full occupancy.
Async compute analysis for Lumen/TSR
Open Wavefront Occupancy → look at the Graphics row vs the Compute row. Dispatches on the Compute row that overlap (run simultaneously with) Graphics work are the win. Dispatches that serialize on a barrier are the bug.
UE5 routes Lumen reflections, Lumen GI, TSR, and Nanite streaming onto async compute by default on supported platforms. If you see no Compute row activity, async compute is disabled or barriers are serializing it. Verify r.TSR.AsyncCompute=2, r.LumenScene.Lighting.AsyncCompute=1, etc., and look at barrier annotations in RGP.
From AMD's UE Performance Guide (RX 7900 XTX, 4K): TSR History.ScreenPercentage 100 vs default 200 saves up to 1.2 ms; Lumen StochasticInterpolation 1 = ~30% screen probe gather speedup; Lumen ExecuteIndirect optimization = ~0.3 ms (~2% of 60 Hz budget).
RMV for memory, RRA for ray tracing
RMV (Radeon Memory Visualizer) — capture under the "Memory Trace" workflow in RDP. Use Timeline + Heap Overview to spot:
- Resource thrash (textures going in/out of VRAM repeatedly).
- Virtual texture pool growth.
- Render-target heap fragmentation.
Compare snapshots across level loads to identify leaks. RMV traces from older GPUs require RMV 1.8 or older; current RMV is RDNA-only.
RRA (Radeon Raytracing Analyzer) — for HWRT (Lumen HWRT, MegaLights). BVH traversal counter, SAH coloring, instance overlap, axis-aligned BLAS splits. Per AMD's RRA case study: splitting misaligned BLAS reduced traversal cost "by a whole order of magnitude"; iterative ray loop ran in 69.6% of the time of the recursive approach on RX 6700 XT at 1080p.
Apply RRA when r.Lumen.HardwareRayTracing 1 or MegaLights is enabled in 5.5/5.6.
PS5 Razor caveat (NDA)
PS5 RDNA 2 silicon is profiled with Sony's first-party Razor for PS5 toolset (timing/ISA/memory analogs). Bulk of Razor specifics are NDA. Publicly: it's not RGP, but the patterns translate — RDNA 2 occupancy semantics, async overlap, BVH structure all behave the same way.
Practical implication for non-NDA developers: use RGP on a desktop RX 6000-series RDNA 2 GPU as a public proxy for understanding PS5 silicon behavior. The numbers won't be identical (PS5 has different memory bandwidth and frequency targets) but the bottleneck patterns are isomorphic.
Never publish Razor screenshots, marker dumps, or specific counter names. Tutorials and public docs must keep PS5 guidance at architecture-level (RDNA 2 patterns) only.
PerfGuard can surface, when a Lumen/Nanite/MegaLights pass regresses in CI, an "AMD evidence pack" with the precise marker name to search for in RGP's Event Timing and a link to the matching playbook section (occupancy / BVH / memory) — closing the gap between Insights "this pass got slower" and RGP "wave occupancy dropped from 12/16 to 6/16."
- RenderDoc — cross-API single-frame inspection.
- PIX on Windows — D3D12-on-Windows + Xbox sibling.
- NVIDIA Nsight — the NVIDIA analog.
- Console Performance — how RGP work translates to PS5/XSX.