Advanced ~22 min read

Nanite Performance Deep Dive

Nanite is a different rasterizer, not just a different LOD system. This tutorial walks the cluster-culling pipeline end to end, separates the cost of the software and hardware raster paths, and zeroes in on the two content patterns — masked materials and World Position Offset — that move opaque geometry into the much more expensive programmable raster bin.

Why Nanite, and the 5.5/5.6 baseline

Nanite buys you three things: LOD popping disappears, geometric density stops being a draw-call problem, and your authored mesh is what the player sees, instead of a four-step LOD chain you spent a sprint building. The cost is that everything Nanite renders flows through a custom culling pipeline and rasterizer that has its own performance shape — one that doesn't always behave the way fixed-function raster instinct suggests.

The 5.5 and 5.6 releases meaningfully improved Nanite's runtime cost. The most relevant changes:

5.5 — Raster sort (r.Nanite.RasterSort) raised early-Z rejection rates for masked and PDO materials; experimental DX12 work-graph compute materials (r.Nanite.AllowWorkGraphMaterials); reserved-resource streaming pool sizing (r.Nanite.Streaming.ReservedResources=1) to avoid VRAM spikes on resize (per Tom Looman's 5.5 highlights).
5.6 — Static-instance culling (r.Nanite.StaticGeometryInstanceCull); chunk-based hierarchical instance culling on chunks of 64 instances saving "~100 µs in CitySample"; new Ray Tracing Proxy in the static mesh editor as an alternative to r.RayTracing.Nanite.Mode 1 (per Tom Looman's 5.6 highlights).

If you're on 5.4 or older, upgrading is the single largest Nanite optimization you have available.

The cluster-culling pipeline, end to end

Nanite organizes geometry into clusters of exactly 128 triangles / 384 vertices (sized to fill a GPU wavefront), then groups clusters into a hierarchical DAG (per Lopez's "A Macro View of Nanite", building on Karis et al., "Nanite: A Deep Dive", SIGGRAPH 2021). Each frame, a persistent-thread compute shader walks the DAG and selects clusters at the right level of detail for the current camera, then runs them through a two-pass HZB occlusion culler.
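To make "the right level of detail" concrete, here is a minimal sketch of the kind of screen-space error test a DAG walk can use: a cluster is drawn when its own projected error is below a pixel threshold while its parent's is not. The function names, the perspective projection formula, and the default field of view, viewport height, and threshold are assumptions for illustration, not the engine's actual code.

```python
import math


def projected_error_px(world_error: float, distance: float,
                       fov_y_deg: float, viewport_height_px: int) -> float:
    """Approximate on-screen size, in pixels, of a world-space error.

    Assumes a simple perspective projection; the engine's real metric
    is more involved.
    """
    scale = viewport_height_px / (2.0 * math.tan(math.radians(fov_y_deg) / 2.0))
    return world_error / max(distance, 1e-6) * scale


def is_on_cut(cluster_error: float, parent_error: float, distance: float,
              fov_y_deg: float = 90.0, viewport_height_px: int = 1080,
              threshold_px: float = 1.0) -> bool:
    """True when this cluster is detailed enough but its parent is too coarse."""
    own = projected_error_px(cluster_error, distance, fov_y_deg, viewport_height_px)
    parent = projected_error_px(parent_error, distance, fov_y_deg, viewport_height_px)
    return own <= threshold_px and parent > threshold_px


# Hypothetical cluster: 0.01 units of error, parent with 0.04 units.
for d in (1.0, 10.0, 100.0):
    print(d, is_on_cut(0.01, 0.04, distance=d))
```

With these illustrative numbers the cluster is selected only at the middle distance: up close it is too coarse (a finer child is needed), and far away its parent is already good enough.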

The named passes you'll see in stat GPU:

NaniteVisBuffer — cluster culling, raster setup, and writing the visibility buffer.
NaniteRaster — the actual rasterization step. This is where the SW vs HW split lives.
NaniteBasePass — G-buffer fill from the visibility buffer.
NaniteShading — running materials over the visibility buffer (raster bins).

The visibility buffer encodes ClusterID in 25 bits and TriangleID in 7 bits, so the two IDs together fill 32 bits and can address an upper bound of ~4 billion (2^32) visible triangles (per Lopez). That ceiling is irrelevant for game frames. What matters is that each pixel stores the ID of a single visible triangle, which is why Nanite's per-pixel work scales with screen pixels rather than triangles.
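The bit budget is easy to check by hand. The sketch below packs a 25-bit cluster ID and a 7-bit triangle ID into one 32-bit value and unpacks it again. The bit order (cluster ID in the high bits) and the function names are assumptions for illustration; only the 25/7 split comes from the text above.

```python
CLUSTER_BITS = 25
TRIANGLE_BITS = 7
TRIANGLE_MASK = (1 << TRIANGLE_BITS) - 1

# 2^(25+7) distinct (cluster, triangle) pairs.
MAX_ADDRESSABLE_TRIANGLES = 1 << (CLUSTER_BITS + TRIANGLE_BITS)


def pack_visible_id(cluster_id: int, triangle_id: int) -> int:
    """Pack both IDs into a single 32-bit integer (assumed layout)."""
    if not 0 <= cluster_id < (1 << CLUSTER_BITS):
        raise ValueError("cluster_id does not fit in 25 bits")
    if not 0 <= triangle_id < (1 << TRIANGLE_BITS):
        raise ValueError("triangle_id does not fit in 7 bits")
    return (cluster_id << TRIANGLE_BITS) | triangle_id


def unpack_visible_id(packed: int) -> tuple[int, int]:
    """Recover (cluster_id, triangle_id) from a packed value."""
    return packed >> TRIANGLE_BITS, packed & TRIANGLE_MASK


print(MAX_ADDRESSABLE_TRIANGLES)  # 4294967296
print(unpack_visible_id(pack_visible_id(123456, 100)))  # (123456, 100)
```

Seven bits cover exactly the 128 triangles of one cluster, which is why the split lands where it does.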

📝

Read the SIGGRAPH talks. If this section feels too dense, the canonical references are Karis et al. 2021 and the follow-up Karis HPG 2022 keynote. They're slide-form, fully public, and the best explanation of the persistent-thread DAG walk you'll find.

Hardware vs software rasterizer

Nanite has two rasterizers and dispatches each cluster to whichever is faster for that cluster's on-screen triangle size. The crossover sits at a triangle edge length of roughly 32 pixels (configurable via r.Nanite.MinPixelsPerEdgeHW):

Software (compute) raster — clusters whose triangles project to less than ~32 pixels of screen edge run through a custom compute shader. Tight-loop atomics into a R32_UINT visibility buffer. Outperforms HW raster for sub-pixel triangles.
Hardware raster — clusters with larger on-screen triangles use mesh shaders (r.Nanite.MeshShaderRasterization=1) or the primitive-shader path (r.Nanite.PrimShaderRasterization=1) to feed fixed-function raster.

In a typical scene, over 90% of triangles are routed to the compute rasterizer (per Lopez's measurements). That's the headline reason Nanite is fast on dense geometry.
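As a rough mental model of the split, the sketch below classifies clusters by their largest projected triangle edge against the crossover threshold. The function names and the sample edge lengths are hypothetical, and the real engine's per-cluster decision considers more than a single edge length.

```python
def pick_raster_path(max_edge_px: float,
                     min_pixels_per_edge_hw: float = 32.0) -> str:
    """Return "SW" for small on-screen triangles, "HW" for large ones."""
    return "HW" if max_edge_px >= min_pixels_per_edge_hw else "SW"


def software_share(edge_lengths_px: list[float],
                   min_pixels_per_edge_hw: float = 32.0) -> float:
    """Fraction of clusters that would take the compute (SW) path."""
    if not edge_lengths_px:
        raise ValueError("need at least one cluster")
    paths = [pick_raster_path(e, min_pixels_per_edge_hw) for e in edge_lengths_px]
    return paths.count("SW") / len(paths)


# Illustrative per-cluster edge lengths in pixels, not measured data.
edges = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 40.0, 64.0]
print(software_share(edges, 32.0))  # 0.75
print(software_share(edges, 8.0))   # 0.5
```

Raising the threshold keeps more clusters on the compute path; lowering it hands more of them to the hardware rasterizer.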

The two CVars worth knowing:

DefaultEngine.ini

[/Script/Engine.RendererSettings]
; Higher = more clusters stay on compute (SW) raster (cheaper for tiny triangles)
; Lower  = more clusters routed to HW raster (cheaper for large triangles)
; Default is platform-dependent. AMD reports 32 as a sweet spot on RDNA 3.
r.Nanite.MinPixelsPerEdgeHW=32

; Pixel error per cluster edge. Higher = fewer clusters drawn (cheaper).
; Default 1.0. Try 2 for low-tier; the visual cost is faceting on silhouettes.
r.Nanite.MaxPixelsPerEdge=1.0

The rasterizer pick is automatic per-cluster — you don't normally tune individual clusters. But on hardware where one path is much weaker (some integrated GPUs have weak compute throughput), shifting the crossover via MinPixelsPerEdgeHW can recover frame time.

Programmable raster: the cost of leaving fixed-function

Most Nanite content goes through the standard raster path described above. Some content cannot — specifically, anything that needs to evaluate material at raster time. That kicks the cluster into the programmable raster, a separate pipeline with its own bins, classify/reserve/scatter sort steps, and a meaningfully higher per-cluster cost.

What forces a cluster into programmable raster:

Masked materials — the alpha test has to run before depth-write decisions.
World Position Offset (WPO) — vertices move at raster time, so cluster bounds aren't predictable.
Pixel Depth Offset (PDO) — depth is computed at shading time.
Two-sided materials — backfaces matter; you can't cull them at the cluster level.
Custom UV / vertex animation — same reasoning as WPO.

Wihlidal's "Nanite GPU Driven Materials" GDC 2024 talk is the deep reference here. The headline: in a measured CitySample frame, 81% of shading bins were empty (3,075 of 3,779), and the new bin compaction in 5.4+ saved roughly 1 ms per frame (per Scthe's notes on the talk). The takeaway for you: every additional masked or WPO material adds an active bin, and bins have a fixed cost even when the cluster count in them is small.
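The quoted bin figures can be checked with a few lines of arithmetic. This sketch uses only the two counts cited above; the function name is made up for illustration.

```python
def bin_occupancy(total_bins: int, empty_bins: int) -> tuple[int, float]:
    """Return (active bin count, fraction of bins that are empty)."""
    if not 0 <= empty_bins <= total_bins or total_bins == 0:
        raise ValueError("counts must satisfy 0 <= empty <= total, total > 0")
    return total_bins - empty_bins, empty_bins / total_bins


active, empty_fraction = bin_occupancy(total_bins=3779, empty_bins=3075)
print(f"{active} active bins, {empty_fraction:.1%} empty")
# prints "704 active bins, 81.4% empty"
```

So only 704 of the 3,779 bins in that frame did any work, which is the overhead that bin compaction targets.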

Visualizing the bins:

In-PIE

NaniteVisualize Overview
NaniteVisualize MaterialID
NaniteVisualize MaterialComplexity
NaniteVisualize RasterMode    // SW (red) vs HW (green) per cluster
NaniteVisualize RasterBins    // Each programmable raster bin gets a colour

Masked materials and WPO — the two killers

Masked materials and WPO are the two most common reasons a Nanite project profiles worse than expected. Both are visually justifiable in many cases; both should be deliberate decisions, not accidents.

Masked materials. Epic's GDC 2024 talk confirms masked clusters land in their own raster bin and cannot benefit from early-Z. Foliage cards, fence railings, anything with a mask — all programmable raster. The 5.5 raster sort improves rejection rates but doesn't change the fundamental cost shape. Convert masked to fully-opaque geometry where you can. A leaf authored as a tessellated mesh costs less than the same leaf with a texture-mask cutout, on Nanite at least.

WPO. Wind-driven foliage, animated banners, ribbons, anything that moves a vertex — all kicked to programmable raster. Worse, WPO invalidates Virtual Shadow Map cache pages every frame the geometry touches them, which can be catastrophic on dense forest scenes. We cover that interaction in detail in the VSM tutorial.

The mitigations:

Set WPO Disable Distance on every WPO-bearing static mesh. Past this distance, the engine treats the mesh as if WPO were zero, restoring the cluster to the standard raster path.
Use r.Shadow.Virtual.Cache.MaxMaterialPositionInvalidationRange — default -1 (unlimited). Set to a finite range and WPO stops invalidating VSM cache pages outside that radius.
Per-component ShadowCacheInvalidationBehavior=Static — tell the engine that this component's WPO doesn't actually change shadow casting, even though it changes vertex positions. Useful for subtle wind on background trees. Documented in the Fortnite Battle Royale Chapter 4 VSM tech blog.
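The first mitigation is easy to audit automatically once you have a list of meshes. The sketch below flags WPO-bearing meshes with no disable distance set. The MeshRecord type, its field names, and the sample data are hypothetical, and it assumes a disable distance of 0 means "never disable". In a real project you would fill these records from an editor script or an asset export.

```python
from dataclasses import dataclass


@dataclass
class MeshRecord:
    """Hypothetical per-mesh audit record, not an engine type."""
    name: str
    uses_wpo: bool
    wpo_disable_distance: float  # assumed: 0 means no disable distance set


def missing_wpo_disable_distance(records: list[MeshRecord]) -> list[str]:
    """Names of meshes that use WPO but never switch it off with distance."""
    return [r.name for r in records
            if r.uses_wpo and r.wpo_disable_distance <= 0.0]


# Made-up sample data for illustration.
meshes = [
    MeshRecord("SM_Pine_A", uses_wpo=True, wpo_disable_distance=0.0),
    MeshRecord("SM_Pine_B", uses_wpo=True, wpo_disable_distance=5000.0),
    MeshRecord("SM_Rock_A", uses_wpo=False, wpo_disable_distance=0.0),
]
print(missing_wpo_disable_distance(meshes))  # ['SM_Pine_A']
```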

⚠

5.6 Nanite + foliage WPO regression: there is an active issue in 5.6 where Nanite trees painted with the Foliage Tool stop casting shadows when WPO is enabled. The community-reported workaround is to disable WPO or Nanite on the affected proxies until Epic ships a fix; the issue is tracked in this Epic forums thread.

Nanite + Virtual Shadow Maps

VSM only really works because Nanite gives it cluster-level culling for shadow passes. The two systems share an HZB and a culling pipeline; the cost picture of one is partly a function of the other.

The practical implications:

Convert background props to Nanite first. Non-Nanite meshes use a different VSM path that's much more expensive on dense scenes. The Fortnite VSM tech blog calls this out explicitly: VSM only scales well when the scene is mostly Nanite.
HZB occlusion culling for VSM — r.Shadow.Virtual.UseHZB=2 enables the two-pass HZB culler for the Nanite shadow path; r.Shadow.Virtual.NonNanite.UseHZB for non-Nanite. Both are on by default; verify in case you've inherited an older project.
VSM resolution bias for Nanite — r.Shadow.NaniteLODBias biases LOD selection in VSM passes. Positive values produce cheaper, blurrier shadows; in 5.5+ this is exposed as a scalability variable.
Page diagnostics — r.Shadow.Virtual.NonNanite.NumPageAreaDiagSlots=-1 prints the worst non-Nanite VSM page hogs to the screen. Useful for identifying which not-yet-Nanite props are blowing the page budget.

The full VSM tuning workflow lives in our VSM Performance & Tuning tutorial. The Nanite-side message: more Nanite content = faster shadows, almost universally.

Fallback meshes for non-Nanite paths

Nanite is great everywhere it runs. It does not run everywhere:

Hardware ray tracing, including Lumen HWRT — uses a separate proxy.
Mobile platforms below the Nanite hardware floor.
Collision queries — collision uses a simple proxy regardless of rendering path.

Every Nanite mesh carries a fallback mesh that the engine uses on those paths. By default the engine generates it automatically; the result is usually acceptable for collision and far ray tracing but often visibly inferior for close ray-traced reflections.

UE 5.6 added a new Ray Tracing Proxy directly in the static mesh editor, with explicit Relative Error tuning — allowing you to set quality per asset rather than relying on the auto-generated fallback. This is the recommended approach in 5.6+; it also avoids the VRAM regression that r.RayTracing.Nanite.Mode 1 introduces (see callout below).

⚠

The r.RayTracing.Nanite.Mode 1 VRAM regression: setting r.RayTracing.Nanite.Mode=1 ray-traces against the full Nanite geometry, which sounds great until you measure VRAM. There's a community-reproduced regression in 5.6.1 showing growing VRAM after upgrade from 5.4.4. Use the new Ray Tracing Proxy at Relative Error 0 in the static mesh editor instead.

Tessellation and programmable displacement (5.3+)

Nanite gained programmable displacement via tessellation in 5.3. It is gated behind two CVars and is, frankly, expensive:

DefaultEngine.ini

; Project-side compile gate (off by default)
r.Nanite.AllowTessellation=1
; Runtime enable (off by default)
r.Nanite.Tessellation=1

Community measurements show that enabling Nanite tessellation on a 6-layer landscape can drop a 60 fps target to 30 fps (per this forums thread). This isn't necessarily a bug — tessellation is doing real geometric work and competing with VSM cache for the same pass-budget. But it's a feature you measure carefully before turning on, not one you flip globally and forget about.

Use cases where it earns its keep: hero terrain regions, close-up rocky surfaces with displacement maps, prominent architecture with normal-map detail that would benefit from real silhouette change. For everything else, baked normals are still the right answer.

Streaming and memory

Nanite streams cluster data into a GPU pool. Tuning this pool well is mostly about avoiding two failure modes: the pool is too small (visible LOD pop, async streaming hitches) or it's too large (you're paying VRAM that other systems need).

Key settings:

r.Nanite.Streaming.StreamingPoolSize — runtime adjustable in 5.5. Historical default was 512 MB; set lower on memory-constrained platforms, higher on PC.
r.Nanite.Streaming.ReservedResources=1 — uses reserved-resource heaps to avoid the large VRAM spike that pool resize used to cause. Default in 5.5+.

For ongoing diagnosis, NaniteStats in PIE dumps the cluster, instance, and raster-bin counters for the current frame. NaniteStats list shows every available stat group.

🔎

Annotate captures with r.Nanite.ShowMeshDrawEvents=1: when a RenderDoc or PIX capture has a Nanite pass you can't decompose, this CVar annotates the capture with per-material draw events. Combined with NaniteVisualize MaterialID, you can match captures back to specific assets fast.

Debugging cookbook

The Nanite-specific debug visualizers are cheaply enabled and overwhelmingly useful. Memorize this set:

NaniteVisualize Overview — first port of call. Shows triangle, cluster, and instance counts visually overlaid.
NaniteVisualize Triangles / Clusters / Instances — debug each level of the hierarchy individually.
NaniteVisualize MaterialID — per-pixel material ID visualization. Identifies how many materials a frame is shading.
NaniteVisualize MaterialComplexity — like Shader Complexity but Nanite-aware.
NaniteVisualize RasterMode — SW vs HW raster split. Most clusters should be SW; if you see large HW regions on small geometry, MinPixelsPerEdgeHW is misconfigured.
NaniteVisualize Overdraw — per-pixel quad cost. High on poorly-LOD'd hero meshes or aggressive masked content.
NaniteVisualize RasterBins — each programmable raster bin gets its own color. Lots of distinct colors = lots of programmable bins.
NaniteVisualize MassCulling — visualize which clusters are being culled at HZB time.
vis Nanite.VisBuffer — visualizes the visibility buffer GPU resource directly.

Pair these with the existing Common UE Performance Gotchas tutorial — many of the gotchas (collision complexity, draw call sprawl, missing LODs) interact with Nanite in non-obvious ways.

✓

Locking the win in CI

Nanite regressions usually come from one of three sources: (1) a material author flips a previously-opaque material to masked, kicking it into programmable raster; (2) a designer adds WPO to a foliage type, demolishing VSM caching; (3) a content team grows a Nanite-heavy scene past the streaming pool budget. None of those are visible to the author at submit time.

PerfGuard can baseline the Nanite-related GPU passes (NaniteVisBuffer, NaniteRaster, NaniteShading) plus the VSM Shadow.Virtual.Render pass, and gate CI on per-pass regressions. The signal is clean: when a PR pushes opaque clusters into the masked or programmable bin, the per-pass timing regresses by a measurable amount. Catch it on PR review, not in QA.
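The gating logic itself is simple, whichever tool runs it. The following is a generic sketch of a per-pass regression check, not PerfGuard's API: the function name, the tolerance values, and the timing numbers are all assumptions for illustration. Only the pass names come from this tutorial.

```python
def find_regressions(baseline_ms: dict[str, float],
                     current_ms: dict[str, float],
                     tolerance: float = 0.10,
                     min_delta_ms: float = 0.05) -> dict[str, float]:
    """Return {pass name: slowdown in ms} for passes that regressed.

    A pass counts as regressed when it is slower than baseline by more
    than `tolerance` (relative) and by more than `min_delta_ms`
    (absolute), so timing noise on cheap passes is ignored.
    """
    regressions = {}
    for name, base in baseline_ms.items():
        current = current_ms.get(name)
        if current is None:
            continue  # pass missing from this capture; handle separately
        delta = current - base
        if delta > min_delta_ms and delta > base * tolerance:
            regressions[name] = delta
    return regressions


# Made-up timings in milliseconds, purely illustrative.
baseline = {"NaniteVisBuffer": 1.20, "NaniteRaster": 0.80, "NaniteShading": 1.50}
current = {"NaniteVisBuffer": 1.22, "NaniteRaster": 1.10, "NaniteShading": 1.52}

for name, delta in find_regressions(baseline, current).items():
    print(f"{name} regressed by {delta:.2f} ms")
```

In a CI job you would exit non-zero when the returned dictionary is non-empty, so the PR is blocked until someone looks at the offending pass.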

Lumen Performance Deep Dive — Nanite's neighbor; the surface cache leans on cluster-level visibility from Nanite.
Virtual Shadow Maps — the system Nanite is most directly entwined with.
Common UE Performance Gotchas — especially #6 (foliage WPO + VSM) and #8 (LODs).