Statistical Rigor in Benchmarking
Mytkowicz et al. demonstrated at ASPLOS 2009 that link order alone shifted gcc -O2 vs -O3 conclusions by ±15 percentage points. Environment-variable size produced 33–300% slowdowns. The lesson generalizes: you can produce wrong data without doing anything obviously wrong. This tutorial covers the four statistical concepts every UE5 perf engineer should know, the operational discipline that produces trustworthy benchmarks, and the percentile reporting that catches stutter the mean hides.
Why methodology matters more than tooling
The Mytkowicz et al. ASPLOS '09 result is worth restating because neither change touched the code under test: reordering object files at link time shifted gcc compiler-flag benchmark conclusions by ±15 percentage points, and changing the size of an unused environment variable produced 33–300% slowdowns. The takeaway: details that look irrelevant can move the results more than the thing you're measuring.
UE5 has its own version: PSO compilation hits the first 60–300 frames; if you don't discard them, the "regression" you see is just the cold cache. Texture streaming pool fills over the first 5–30 seconds; if you measure during fill, you measure pool fill, not steady-state.
Four statistical concepts
Coefficient of Variation (CoV). Standard deviation divided by mean, as a percentage. For UE5 frame-time captures, CoV under ~3% means your noise floor is tight enough to detect 5% regressions. CoV above ~10% means your noise is masking real signal — fix the bench before fixing the code.
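As a minimal sketch of the CoV calculation described above, assuming you already have one mean frame time per repeat capture. The function name and the sample values are illustrative, not taken from any tool.

```python
import numpy as np

def coefficient_of_variation(samples):
    """Return CoV as a percentage: sample standard deviation / mean * 100."""
    values = np.asarray(samples, dtype=float)
    return float(values.std(ddof=1) / values.mean() * 100.0)

# Hypothetical per-capture mean frame times (ms) from five repeat runs.
capture_means_ms = [16.4, 16.5, 16.3, 16.6, 16.2]
cov_pct = coefficient_of_variation(capture_means_ms)
print(f"CoV = {cov_pct:.2f}%")
```

Using `ddof=1` gives the sample standard deviation, which is the usual choice when a handful of captures stands in for the whole population of possible runs.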
p-value. Probability of seeing a difference at least as large as the one observed between baseline and candidate if both were drawn from the same distribution. Conventional threshold: p < 0.05 = "statistically significant." With 10,000 frames per capture, a 0.2 ms difference can be highly significant statistically yet meaningless to the player. Statistical significance ≠ practical significance.
Confidence Interval (CI). A range produced by a procedure that captures the true mean in a stated fraction of repeated experiments (typically 95%). CI width shrinks in proportion to 1/√N: quadrupling captures halves the CI width. Reporting "GPUTime mean 12.4 ms, 95% CI [12.1, 12.7]" tells reviewers the precision; reporting just "12.4 ms" hides it.
Welch's t-test. Two-sample location test comparing means when sample sizes and variances differ. Preferred over Student's t-test for game perf because baseline and candidate runs frequently have unequal variance. Frame times have a heavy right tail (rare hitches), so for stutter-sensitive analysis pair Welch's (mean) with Mann-Whitney U or Kolmogorov-Smirnov (distribution shape).
Hardware isolation & clock locking
- Dedicated bare-metal box. No VM. No other workloads. Locked to the test config.
- NVIDIA: nvidia-smi --lock-gpu-clocks=<base>,<base> and --lock-memory-clocks. NVIDIA's docs note this is approximate — thermal limits still preempt.
- AMD: Radeon Developer Panel Device Clocks tab.
- CPU power-state pinning — disable Intel SpeedStep / AMD Cool'n'Quiet via BIOS or powercfg (Windows). Use a "High Performance" power plan.
NVIDIA Nsight docs explicitly recommend GPU Trace's lock-clocks option over SetStablePowerState — the latter does not lock memory clocks, and benchmarks regress invisibly.
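The nvidia-smi clock-lock step can be scripted so every bench run pins and releases clocks the same way. The sketch below only builds the argument lists and does not execute them. The clock values are placeholders you would replace with your card's base clocks, and the helper names are hypothetical. Running the resulting commands typically needs administrator rights.

```python
def lock_clock_commands(gpu_mhz, mem_mhz, gpu_index=0):
    """Build nvidia-smi argument lists that pin min == max for both clocks."""
    base = ["nvidia-smi", "-i", str(gpu_index)]
    return [
        base + [f"--lock-gpu-clocks={gpu_mhz},{gpu_mhz}"],
        base + [f"--lock-memory-clocks={mem_mhz},{mem_mhz}"],
    ]

def reset_clock_commands(gpu_index=0):
    """Build the matching commands that release both locks after the run."""
    base = ["nvidia-smi", "-i", str(gpu_index)]
    return [base + ["--reset-gpu-clocks"], base + ["--reset-memory-clocks"]]

# Placeholder clocks (MHz); look up the real base clocks for your GPU.
for argv in lock_clock_commands(1500, 7000) + reset_clock_commands():
    print(" ".join(argv))
```

Each list can be handed to `subprocess.run(argv, check=True)` before and after a capture. Keeping the reset commands next to the lock commands makes it harder to leave a shared machine pinned.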
Thermal management
Modern desktop GPUs reach thermal equilibrium in ~10 minutes; laptops 20+. Throttling typically begins around 80–95°C and can cost 10–30% sustained performance.
There are two legitimate capture-length strategies and they answer different questions:
- Short captures (30–90 s) — goal is steady-state measurement of code change impact without thermal noise. Pre-warm to avoid cold-cache outliers, then capture. Pair with cool-down between runs so each capture starts at a similar thermal baseline.
- Sustained captures (10–30 min) — goal is exposing thermal throttle behavior that short bursts hide. Volume Shader BM's thermal throttling guide specifically recommends long-duration runs for this reason: a 30 s benchmark on a desktop GPU rarely sees the same clocks as a 30 min one.
Pick by what you're measuring. For PR-time regression detection (this CVar made the same scene 0.3 ms slower), short. For "does this build hold its frame rate over an hour of play" (the cert / ship question), long.
Either way:
- Cool down between captures so they start at comparable temperatures.
- Track GPU temperature alongside frame time. A run that ended at 92°C is statistically compromised regardless of length.
- Custom fan curves on benchmarking machines — aggressive cooling reduces the variance you have to explain away.
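The temperature-tracking advice above can be turned into two simple checks on a per-run temperature log. This is a sketch under stated assumptions: the 90°C limit and the 3°C start tolerance are illustrative values chosen for the example, not vendor figures, and the function names are hypothetical.

```python
def thermally_compromised(temps_c, limit_c=90.0):
    """True when any sample reached the (assumed) throttle limit."""
    return max(temps_c) >= limit_c

def comparable_start(temps_a, temps_b, tolerance_c=3.0):
    """True when two captures began within tolerance_c of each other."""
    return abs(temps_a[0] - temps_b[0]) <= tolerance_c

# Hypothetical GPU temperature samples (°C), one list per capture.
run_a = [62.0, 68.5, 73.0, 75.5, 76.0]
run_b = [71.0, 80.0, 88.5, 92.0, 92.0]

print(thermally_compromised(run_a), thermally_compromised(run_b))
print(comparable_start(run_a, run_b))
```

Here the second run both started hotter and reached the limit, so comparing its frame times against the first would mix a thermal effect into the code-change signal.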
Environmental controls
The Windows audit checklist:
- VBS / Memory Integrity / HVCI off — costs measurable single-digit-percent FPS in many titles (per Tom's Hardware, MakeUseOf benchmarks).
- Game Bar / Game Mode off.
- Defender real-time scanning paused during run window.
- Steam / Discord overlays off.
- Background apps killed — Slack, browsers, dev tools.
- Scheduled tasks paused — Windows Update, Defender scans.
Per AMD GPUOpen UE perf guide, the cluster of UE-specific switches needed for reproducible RGP captures: -deterministic, -novsync, -benchmark, -fps=<n>, bSmoothFrameRate=false, r.ShowMaterialDrawEvents 0.
Warmup strategy
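Discard at least the first 60–300 frames, and longer when a slower effect applies (texture streaming pool fill can take 5–30 seconds), to skip: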
- PSO compilation.
- Texture streaming pool fill.
- Nanite DAG cluster culling state.
- Lumen surface cache fill.
- Driver-side shader cache (cold first run).
Validate empirically: run the same scenario twice without changing anything. If the post-warmup frame times of run 1 and run 2 differ by more than your noise floor, increase warmup until they converge. JMH's default (5 warmup + 5 measurement iterations, warmup discarded entirely from stats) is a reasonable starting point.
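One way to automate the convergence check is to grow the discarded prefix until the two runs' means agree within a tolerance. The sketch below assumes run 1 is the cold run and run 2 the warm repeat. The 1% tolerance, the 50-frame step, and the function name are illustrative choices.

```python
import numpy as np

def find_warmup_frames(run_a, run_b, step=50, tolerance=0.01, max_warmup=None):
    """Return the smallest discarded-frame count at which the two runs'
    mean frame times agree within `tolerance` (relative), or None."""
    a = np.asarray(run_a, dtype=float)
    b = np.asarray(run_b, dtype=float)
    limit = min(a.size, b.size) // 2 if max_warmup is None else max_warmup
    for k in range(0, limit + 1, step):
        mean_a, mean_b = a[k:].mean(), b[k:].mean()
        if abs(mean_a - mean_b) / mean_b <= tolerance:
            return k
    return None

# Synthetic data: a cold run with 100 slow frames, then a warm repeat.
cold_run = [30.0] * 100 + [16.0] * 900
warm_run = [16.0] * 1000
print(find_warmup_frames(cold_run, warm_run))
```

A `None` result means the runs never converged inside the search limit, which usually points at a noise source other than warmup.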
Sample size and confidence
Per Pizzolla / MeasuringU: 20–30 samples gives the best precision-per-effort tradeoff. Mytkowicz et al. used thousands. Minimum: 5–10 captures per condition. Below that, you cannot meaningfully compute a t-statistic.
The √N rule: quadrupling captures halves the CI width. If your CI is ±1.0 ms and you want ±0.5 ms, run 4× more captures.
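The √N rule can be sketched as two small helpers: one computes the 95% CI half-width from a set of per-capture means, the other estimates how many captures a target half-width needs. The estimate assumes the standard deviation stays the same and ignores the small change in the t critical value as N grows. Names and sample values are illustrative.

```python
import math
import numpy as np
from scipy import stats

def ci_half_width(samples, confidence=0.95):
    """Half-width of the t-based confidence interval on the mean."""
    values = np.asarray(samples, dtype=float)
    n = values.size
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, df=n - 1)
    return float(t_crit * values.std(ddof=1) / math.sqrt(n))

def captures_needed(current_n, current_half_width, target_half_width):
    """Approximate capture count for a target half-width (width ~ 1/sqrt(N))."""
    ratio = current_half_width / target_half_width
    return math.ceil(current_n * ratio ** 2)

# Hypothetical per-capture mean frame times (ms).
capture_means_ms = [16.4, 16.5, 16.3, 16.6, 16.2]
print(f"half-width = {ci_half_width(capture_means_ms):.3f} ms")
print(captures_needed(10, 1.0, 0.5))
```

Going from ±1.0 ms to ±0.5 ms with 10 captures in hand gives 40, matching the 4× figure in the text.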
For frame-time arrays specifically, "captures" means independent runs — each from a fresh process, not one long-running session. Mozilla's Talos / Bugzilla 1973820 documents that running each test in a fresh process is required for repeatability. Same lesson applies to UE5 perf testing — restart the engine per scenario.
Percentile-based reporting
P50 (median) hides stutter; P99 catches it. Report:
- Mean — for budget calculations.
- P50 / P95 / P99 — for tail behavior.
- P99.9 — for "is this going to drop a frame?" detection. In VR, a single P99.9 frame > refresh period is a comfort failure.
- 1% lows / 0.1% lows — common in DF-style reviews. Per CapFrameX, 0.1% low is statistically unreliable on captures under ~30 seconds (only 2–3 frames). Use percentile-FPS over distribution-FPS for short captures.
NVIDIA FrameView reports P90/P95/P99 percentile FPS — the modern preferred metric over 1% lows on shorter captures.
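A percentile report along these lines can be sketched with NumPy. The "1% low" here is computed as the FPS equivalent of the mean of the slowest 1% of frame times, which is one common definition. Tools differ on this, so treat the field names and that definition as assumptions.

```python
import numpy as np

def frame_time_report(frame_times_ms):
    """Summarize a post-warmup frame-time array (ms)."""
    ft = np.asarray(frame_times_ms, dtype=float)
    p50, p95, p99, p999 = np.percentile(ft, [50, 95, 99, 99.9])
    tail_count = max(1, ft.size // 100)        # slowest 1% of frames
    slowest = np.sort(ft)[-tail_count:]
    return {
        "mean_ms": float(ft.mean()),
        "p50_ms": float(p50),
        "p95_ms": float(p95),
        "p99_ms": float(p99),
        "p999_ms": float(p999),
        "p99_fps": float(1000.0 / p99),        # percentile FPS
        "low_1pct_avg_fps": float(1000.0 / slowest.mean()),
    }

# Synthetic capture: 990 smooth frames plus 10 hitches of 40 ms.
frames = np.concatenate([np.full(990, 16.0), np.full(10, 40.0)])
report = frame_time_report(frames)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

In this synthetic capture the median stays at 16 ms while P99.9 lands on the 40 ms hitches, which is the gap between P50 and tail percentiles that the section describes.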
Welch's t-test in practice
Frame-time arrays in, p-value out. A Python sketch using NumPy and SciPy:

    import numpy as np
    from scipy import stats

    # Post-warmup frame times in ms (remaining samples elided).
    baseline = np.array([16.45, 16.51, 16.38])
    candidate = np.array([16.78, 16.89, 16.65])

    # Welch's t-test: equal_var=False drops the equal-variance assumption.
    t_stat, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)

    # Two-sample KS test for distribution shape (catches stutter not in the mean).
    ks_stat, ks_p = stats.ks_2samp(baseline, candidate)

    # Reportable summary: mean shift plus a 95% CI on the candidate mean.
    delta = candidate.mean() - baseline.mean()
    ci_95 = stats.t.interval(0.95, len(candidate) - 1,
                             loc=candidate.mean(), scale=stats.sem(candidate))
Decision: p < 0.05 + delta > practical-significance threshold = real regression. p < 0.05 + delta < threshold = statistically real but practically irrelevant. p ≥ 0.05 = not distinguishable from noise at this sample size.
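The decision rule can be written as a small classifier. This is a sketch: the 0.5 ms practical threshold, the label strings, and the added "improvement" branch for large negative deltas are assumptions made for the example.

```python
import numpy as np
from scipy import stats

def classify(baseline, candidate, practical_ms, alpha=0.05):
    """Combine Welch's t-test with a practical-significance threshold (ms)."""
    b = np.asarray(baseline, dtype=float)
    c = np.asarray(candidate, dtype=float)
    delta = c.mean() - b.mean()
    p_value = stats.ttest_ind(b, c, equal_var=False).pvalue
    if p_value >= alpha:
        return "no detectable difference"
    if delta > practical_ms:
        return "regression"
    if delta < -practical_ms:
        return "improvement"
    return "significant but below practical threshold"

# Synthetic 10,000-frame captures: a 0.2 ms shift and a 1.0 ms shift.
baseline = 16.0 + np.linspace(-1.0, 1.0, 10_000)
print(classify(baseline, baseline + 0.2, practical_ms=0.5))
print(classify(baseline, baseline + 1.0, practical_ms=0.5))
```

With 10,000 frames the 0.2 ms shift is statistically significant yet falls under the practical threshold, which is the large-N case described in the p-value definition.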
Multiple-comparison correction
If you run 50 stats × 10 scenarios = 500 comparisons per PR, a 5% false-positive rate yields about 25 false alarms per PR even when nothing has changed. Two corrections:
- Bonferroni — conservative; multiply each p-value by the number of comparisons N (equivalently, test at α/N). Easy but over-corrects.
- Benjamini-Hochberg (FDR) — controls false discovery rate; less conservative; preferred for many comparisons.
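Both corrections fit in a few lines of NumPy. The sketch below returns adjusted p-values that can be compared directly against 0.05. The input p-values are made up for illustration.

```python
import numpy as np

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by N and cap at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical raw p-values from five stat comparisons.
raw_p = [0.001, 0.008, 0.02, 0.03, 0.20]
print("Bonferroni flags:", int((bonferroni(raw_p) < 0.05).sum()))
print("BH flags:        ", int((benjamini_hochberg(raw_p) < 0.05).sum()))
```

On this input Bonferroni keeps two comparisons and Benjamini-Hochberg keeps four, which shows the difference in conservatism between the two.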
Chromium's Speed team uses configurable improve/regress thresholds with directionality (higher-is-better vs lower-is-better) per metric — no global threshold. Same pattern for UE5: a frame-time regression and a memory regression have different significance bars.
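Per-metric thresholds with directionality might look like the sketch below. This is not Chromium's or any tool's actual configuration format. The class, field names, and percentages are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricRule:
    name: str
    higher_is_better: bool
    regress_pct: float   # worsening (in %) that counts as a regression
    improve_pct: float   # gain (in %) that counts as an improvement

def judge(rule, baseline, candidate):
    """Classify a change using the metric's own direction and thresholds."""
    change_pct = (candidate - baseline) / baseline * 100.0
    # Normalize so that a positive value always means "got worse".
    worse_pct = -change_pct if rule.higher_is_better else change_pct
    if worse_pct >= rule.regress_pct:
        return "regress"
    if -worse_pct >= rule.improve_pct:
        return "improve"
    return "unchanged"

frame_time = MetricRule("FrameTime_ms", higher_is_better=False,
                        regress_pct=3.0, improve_pct=3.0)
fps = MetricRule("AvgFPS", higher_is_better=True,
                 regress_pct=3.0, improve_pct=3.0)

print(judge(frame_time, 16.0, 16.8))  # frame time went up
print(judge(fps, 60.0, 57.0))         # FPS went down
```

Normalizing to a single "worse" direction keeps the comparison logic identical for every metric, so only the rule table changes when a new stat is added.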
Baseline locking ethics
Lock baselines under version control. Re-baseline only on:
- Intentional optimization — budget reduced.
- Engine version upgrade — characteristics changed.
- Hardware change — new CI machine.
Never re-baseline to silence a regression. Document every re-baseline reason. Without this, the baseline drifts away from "what we promised" toward "what we currently ship," and the regression-detection signal degrades to noise.
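The re-baseline policy can be enforced mechanically, for example by validating a small record that is committed alongside the new baseline. The record fields and reason codes below are hypothetical. They mirror the three allowed reasons listed above.

```python
ALLOWED_REASONS = {"intentional_optimization", "engine_upgrade", "hardware_change"}

def validate_rebaseline(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    if record.get("reason") not in ALLOWED_REASONS:
        problems.append("reason must be one of: " + ", ".join(sorted(ALLOWED_REASONS)))
    if not record.get("justification", "").strip():
        problems.append("justification text is required")
    if not record.get("approved_by", "").strip():
        problems.append("approver is required")
    return problems

# Hypothetical records: one acceptable, one attempting to hide a regression.
ok_record = {"reason": "engine_upgrade",
             "justification": "Engine version upgrade changed frame-time characteristics.",
             "approved_by": "perf-lead"}
bad_record = {"reason": "silence_regression", "justification": "", "approved_by": ""}

print(validate_rebaseline(ok_record))
print(len(validate_rebaseline(bad_record)))
```

Running a check like this in CI on any commit that touches baseline files means an undocumented re-baseline fails the build instead of slipping through unnoticed.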
PerfGuard's auto-threshold feature operationalizes CoV: it measures noise per stat over historical runs and recommends thresholds at ~2–3σ (high statistical power for the >3σ events that actually matter). The multi-run workflow is Welch's t-test in practice. The thermal-throttle diagnostic flag means "your data is statistically compromised." Connecting these to the underlying statistics in the PerfGuard report is the discipline that prevents teams from treating any red number as a regression.
- Performance Testing Best Practices — the prescriptive companion to this engineering tutorial.
- Advanced: Threshold Tuning & Multi-Run Analysis — the PerfGuard-specific workflow.
- UE5 Postmortems Compendium — how shipped titles measured what they shipped.