
Advanced: Threshold Tuning & Multi-Run Analysis

Reduce false positives and catch real regressions. Learn statistical techniques for reliable performance testing in noisy environments.

1. Understanding Noise vs Real Regressions

The fundamental challenge of automated performance testing is distinguishing signal from noise. Even on dedicated hardware, frame times vary between runs due to OS scheduling, thermal state, driver behavior, and background services.

A naive comparison (single run A vs. single run B) has a high false-positive rate. PerfGuard uses several techniques to improve signal quality:

  • Warmup frame trimming — Removes initial spikes from shader compilation
  • IQR outlier removal — Removes extreme values that skew aggregates
  • Multi-run aggregation — Averages across multiple captures for statistical confidence
  • CoV flagging — Warns when a stat is too volatile to compare reliably
💡
Tip The single most effective way to reduce noise is to run on dedicated hardware with no other processes competing for resources. Statistical techniques help, but they can't fix a fundamentally noisy test environment.
2. IQR Outlier Trimming Explained

PerfGuard uses the Interquartile Range (IQR) method to remove outlier frames before computing aggregates. This is a robust statistical technique that handles skewed distributions well.

How it works:

  1. Compute Q1 (25th percentile) and Q3 (75th percentile) of the frame data
  2. Calculate IQR = Q3 - Q1
  3. Define bounds: lower = Q1 - 1.5 * IQR, upper = Q3 + 1.5 * IQR
  4. Discard any frames outside [lower, upper]
  5. Compute aggregates on the remaining "clean" data
Example
# Raw frame times (ms):
[12.1, 13.4, 13.8, 14.0, 14.2, 14.5, 14.8, 15.1, 15.3, 45.2, 67.8]

# Q1 = 13.8, Q3 = 15.1, IQR = 1.3
# Lower bound = 13.8 - 1.95 = 11.85
# Upper bound = 15.1 + 1.95 = 17.05

# After trimming (45.2 and 67.8 removed):
[12.1, 13.4, 13.8, 14.0, 14.2, 14.5, 14.8, 15.1, 15.3]

# Mean: 14.13ms (vs 21.84ms with outliers)

This prevents shader compilation hitches or GC stalls from inflating your baseline and masking real regressions.
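The five steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not PerfGuard's actual implementation; note that `statistics.quantiles` with its default "exclusive" method reproduces the Q1/Q3 values from the example:

```python
from statistics import mean, quantiles

def iqr_trim(frames, k=1.5):
    """Drop frames outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(frames, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [f for f in frames if lo <= f <= hi]

raw = [12.1, 13.4, 13.8, 14.0, 14.2, 14.5, 14.8, 15.1, 15.3, 45.2, 67.8]
clean = iqr_trim(raw)
print(f"kept {len(clean)}/{len(raw)} frames, mean {mean(clean):.2f}ms")
# → kept 9/11 frames, mean 14.13ms
```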

3. Multi-Run Statistical Analysis

For the highest confidence, run the same scenario multiple times and let PerfGuard compute confidence intervals. This tells you not just the mean, but how confident you can be that the mean is accurate.

Terminal (Bash)
# Run 5 captures of the same scenario
for i in 1 2 3 4 5; do
    UnrealEditor Project.uproject -game \
        -gauntlet=PerfGuardGauntletController \
        -scenario=MyScenario -csvprofile \
        -RenderOffScreen -unattended -log
done

# Analyze all 5 runs together
python3 perfguard_cli.py analyze \
    run1.csv run2.csv run3.csv run4.csv run5.csv \
    --threshold-percent 5.0
PowerShell
# Run 5 captures of the same scenario
for ($i=1; $i -le 5; $i++) {
    & "UnrealEditor-Cmd.exe" Project.uproject -game `
        -gauntlet=PerfGuardGauntletController `
        -scenario=MyScenario -csvprofile `
        -unattended -log
}

# Analyze all 5 runs together
python perfguard_cli.py analyze `
    run1.csv run2.csv run3.csv run4.csv run5.csv `
    --threshold-percent 5.0

With 5 runs, PerfGuard computes 95% confidence intervals. A regression is flagged only if the lower bound of the delta's confidence interval exceeds the threshold, meaning you can be statistically confident the regression is real, not noise.
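The interval arithmetic is standard Student's t over per-run means. A sketch (the run values and the fixed t critical value for 4 degrees of freedom are illustrative assumptions, not PerfGuard output):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-run FrameTime means (ms) from 5 captures
runs = [14.6, 14.9, 14.7, 15.1, 14.7]

m = mean(runs)
sem = stdev(runs) / sqrt(len(runs))  # standard error of the mean
t95 = 2.776                          # two-sided t critical value, df = 4
lo, hi = m - t95 * sem, m + t95 * sem
print(f"mean {m:.2f}ms, 95% CI [{lo:.2f}, {hi:.2f}]")
# → mean 14.80ms, 95% CI [14.55, 15.05]
```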

📊
Multi-run analysis output showing mean FrameTime: 14.8ms with 95% CI [14.3, 15.2], baseline: 14.2ms, verdict: within noise
Screenshot: Multi-run confidence interval output
💡
Tip Three runs give reasonable confidence; five is ideal, and more than five has diminishing returns relative to the CI time cost. For nightly builds, use 5 runs. For PR checks where speed matters, use 1–3.
4. Coefficient of Variation (CoV) Flagging

The Coefficient of Variation (CoV) is the standard deviation divided by the mean, expressed as a percentage. It measures how "noisy" a stat is relative to its magnitude.

PerfGuard flags stats with CoV above a configurable threshold (default: 10%). A high CoV means the stat varies too much between frames to produce reliable comparisons.

Interpretation
# Low CoV (good, reliable comparison):
FrameTime: mean=14.2ms, stddev=0.8ms, CoV=5.6%

# High CoV (noisy, comparison may be unreliable):
DrawCalls: mean=2400, stddev=480, CoV=20.0%  ⚠ Flagged

When a stat is flagged for high CoV, the report shows a warning. You can still use the comparison, but treat it with caution — the variance is high enough that the delta might be noise.
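The flagging logic itself is only a few lines. A minimal sketch (the 10% default mirrors the text; the stat samples are made up):

```python
from statistics import mean, stdev

def cov_percent(samples):
    """Coefficient of variation: stddev as a percentage of the mean."""
    return stdev(samples) / mean(samples) * 100

def flag_noisy(stats, cov_threshold=10.0):
    """Return names of stats whose CoV exceeds the threshold."""
    return [name for name, samples in stats.items()
            if cov_percent(samples) > cov_threshold]

stats = {
    "FrameTime": [14.2, 14.0, 14.4, 14.1, 14.3],  # low run-to-run variance
    "DrawCalls": [2400, 1900, 2900, 2100, 2700],  # high variance
}
print(flag_noisy(stats))
# → ['DrawCalls']
```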

Warning A stat with 20% CoV and a 5% threshold will produce frequent false positives. Either increase the threshold for that specific stat, use multi-run averaging, or investigate why the stat is so volatile.
5. Thermal Throttle Detection

GPU and CPU thermal throttling causes frame times to gradually increase over the course of a capture. PerfGuard detects this by analyzing the trend slope of frame times over time.

If frame times consistently increase across the capture duration (statistically significant positive slope), the diagnostics card flags potential thermal throttling.

🌡
Frame time scatter plot showing gradual upward drift from ~14ms at the start to ~17ms at the end of a 30-second capture, with a trend line overlaid
Screenshot: Thermal throttle drift visualization
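The trend analysis reduces to a least-squares fit of frame time against frame index. A sketch of the slope computation only (PerfGuard's significance test and exact units aren't specified here, so both are assumptions):

```python
def frame_time_slope(frame_times, fps=60.0):
    """Least-squares slope of frame time vs frame index, scaled to
    ms of drift per second of capture. A sustained positive slope
    hints at thermal throttling."""
    n = len(frame_times)
    mx = (n - 1) / 2                  # mean frame index
    my = sum(frame_times) / n
    num = sum((i - mx) * (y - my) for i, y in enumerate(frame_times))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den * fps

# Synthetic 30s capture at 60fps, drifting upward 0.002ms per frame
ramp = [14.0 + 0.002 * i for i in range(1800)]
print(f"{frame_time_slope(ramp):.2f} ms/s drift")
# → 0.12 ms/s drift
```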

Mitigation strategies:

  • Add a cooling period between consecutive scenario runs
  • Use shorter captures (15–20 seconds) to avoid the throttle window
  • Improve case airflow or use a cooling pad on laptops
  • Run CI captures during cooler ambient conditions (nightly, not midday)
6. Auto-Threshold Recommendations

PerfGuard can analyze your historical run data and recommend thresholds based on the natural variance of each stat. This takes the guesswork out of threshold tuning.

The recommendation algorithm:

  1. Collects the last N runs for each stat (from history)
  2. Computes the standard deviation of each stat across runs
  3. Recommends a threshold at 2–3 standard deviations above zero delta
  4. Ensures the threshold is high enough to avoid false positives from natural variance, but low enough to catch real regressions
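Under the steps above, the recommendation reduces to a few lines. A sketch (the 2.5-sigma midpoint of the 2–3 range and the 1% floor are illustrative choices, not PerfGuard's documented defaults):

```python
from statistics import mean, stdev

def recommend_threshold(history, k=2.5, floor_pct=1.0):
    """Suggest a per-stat threshold (%) from historical run values:
    k standard deviations of run-to-run variance, relative to the mean."""
    run_cov_pct = stdev(history) / mean(history) * 100
    return max(floor_pct, round(k * run_cov_pct, 1))

# Hypothetical FrameTime means (ms) from the last 10 runs
history = [14.0, 14.2, 14.1, 14.3, 13.9, 14.2, 14.0, 14.1, 14.3, 14.0]
print(recommend_threshold(history))
# → 2.4
```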
💡
Tip Run the auto-threshold analysis after accumulating at least 10 runs of historical data. Fewer runs produce unreliable variance estimates.
7. Tuning Per-Stat Thresholds

A single global threshold rarely works well for all stats. Frame time and GPU time are stable; draw call counts fluctuate more. PerfGuard supports per-stat threshold overrides in Project Settings.

Configure thresholds in DefaultPerfGuard.ini:

DefaultPerfGuard.ini
[/Script/PerfGuardRuntime.PerfGuardSettings]
; Global default
DefaultThresholdPercent=5.0

; Per-stat overrides
+StatThresholds=(StatName="FrameTime", ThresholdPercent=3.0)
+StatThresholds=(StatName="GPUTime", ThresholdPercent=3.0)
+StatThresholds=(StatName="DrawCalls", ThresholdPercent=15.0)
+StatThresholds=(StatName="TrianglesDrawn", ThresholdPercent=10.0)

Tight thresholds (3%) on timing stats catch meaningful regressions. Loose thresholds (10–15%) on count-based stats avoid noise from draw call batching and LOD differences.

8. Hitch Detection Configuration

Hitch detection thresholds determine what counts as a minor, major, or severe hitch. The defaults are based on the frame budget:

Default Hitch Thresholds (60fps budget)
# Minor:  1.0x - 2.0x budget  (16.67ms - 33.33ms)
# Major:  2.0x - 4.0x budget  (33.33ms - 66.67ms)
# Severe: 4.0x+ budget        (66.67ms+)

For VR projects where any hitch is unacceptable, tighten these multipliers. For less latency-sensitive games, you might relax the minor threshold to reduce noise.
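The bucketing can be sketched directly from the default multipliers above (this is an illustration of the classification rule, not PerfGuard's actual API):

```python
def classify_hitch(frame_ms, budget_ms=1000 / 60, multipliers=(1.0, 2.0, 4.0)):
    """Bucket a frame time against multiples of the frame budget."""
    minor, major, severe = multipliers
    ratio = frame_ms / budget_ms
    if ratio >= severe:
        return "severe"
    if ratio >= major:
        return "major"
    if ratio >= minor:
        return "minor"
    return None  # within budget

print([classify_hitch(ms) for ms in (15.0, 20.0, 40.0, 70.0)])
# → [None, 'minor', 'major', 'severe']
```

For a VR project you might pass something like `multipliers=(1.0, 1.5, 2.0)` to tighten every bucket.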

💡
Tip Focus on major and severe hitches for CI gating. Minor hitches are useful for analysis but often too noisy for automated pass/fail. Set your CI to only fail on major+ hitches while still reporting minors.
9. Custom CSV Stat Tracking

UE's CSV profiler can output hundreds of columns. PerfGuard tracks whichever columns you specify in the scenario's Tracked Stats array. You can track any column that appears in the CSV header.

Common custom stats beyond the defaults:

Custom Stats
# Rendering
"SceneRendering"       # Total scene render cost
"ShadowDepths"        # Shadow map rendering
"Translucency"        # Translucent object cost

# Memory
"Physical Memory Used" # Process physical memory
"Texture Memory"      # GPU texture allocations

# Gameplay
"Physics"             # Physics simulation time
"AI"                  # AI tick cost
"Navigation"          # Navmesh queries
Warning Stat names must match the CSV column headers exactly (case-sensitive). Run a capture first and inspect the CSV to find the exact column names. UE sometimes adds spaces or changes casing between versions.
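Inspecting the header row takes one function; a quick sketch (the capture path is hypothetical):

```python
import csv

def csv_columns(path):
    """Return the exact (case-sensitive) column headers of a CSV capture."""
    with open(path, newline="") as f:
        return next(csv.reader(f))

# e.g. csv_columns("Saved/Profiling/CSV/capture.csv")
```

Copy the returned strings verbatim into your scenario's Tracked Stats array.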
10. Integrating with External Dashboards

For teams that use Grafana, Datadog, or custom dashboards, PerfGuard can output comparison results as machine-readable JSON for ingestion by external systems.

Terminal
python3 perfguard_cli.py baseline compare MyScenario \
    --csv capture.csv \
    --json-output results/comparison.json
Command Prompt
python perfguard_cli.py baseline compare MyScenario ^
    --csv capture.csv ^
    --json-output results\comparison.json

The JSON output contains all stat values, deltas, thresholds, and pass/fail results in a structured format. Feed this into your dashboard pipeline to build long-term performance tracking across hundreds of builds.

comparison.json (excerpt)
{
  "scenario": "MainMenu_Flythrough",
  "passed": false,
  "results": [
    {
      "stat": "FrameTime",
      "baseline": 14.23,
      "current": 15.89,
      "delta_ms": 1.66,
      "delta_pct": 11.67,
      "threshold_pct": 5.0,
      "passed": false
    }
  ]
}
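Consuming that JSON in a pipeline script can be as simple as the sketch below (the schema is taken from the excerpt above; emitting metrics to Grafana or Datadog is left to your own client):

```python
import json

def failing_stats(path):
    """Extract (stat, delta_pct) pairs for every failed comparison."""
    with open(path) as f:
        report = json.load(f)
    return [(r["stat"], r["delta_pct"])
            for r in report["results"] if not r["passed"]]

# e.g. failing_stats("results/comparison.json")
```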
💡
Tip Use the JSON output together with the perfguard history show command to build trend visualizations in your team's existing monitoring stack. The JSON format is stable and designed for programmatic consumption.
💡
You've completed all tutorials! You now have a comprehensive understanding of PerfGuard's capabilities. Return to the home page to explore other resources, or revisit any tutorial whenever you need a refresher.