Advantage variance spike

Watches: the standard deviation of GAE advantages (or equivalent). Failure mode: the value function estimates have become unstable. This often shows up just before training starts producing garbage gradients — the value head is no longer giving a useful baseline, so policy gradient updates have wildly inconsistent signs.

How it fires

Uses the same z-score model as KL explosion. The detector maintains a rolling window over advantage_std. It fires a warning when the current value exceeds std_multiplier times the rolling mean, and a critical alert when it exceeds 2 × std_multiplier times the rolling mean.
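The thresholding rule described here can be sketched in a few lines of Python. This is a minimal illustration, not the library's implementation: the class name AdvantageVarianceDetector and its update method are hypothetical, and the sketch assumes that warmup_steps means "stay silent for the first N steps" and that the current value is compared against the mean of earlier values only.

```python
from collections import deque
from statistics import fmean


class AdvantageVarianceDetector:
    """Hypothetical sketch of the warning/critical rule; names are illustrative."""

    def __init__(self, std_multiplier=3.0, rolling_window=100, warmup_steps=20):
        self.std_multiplier = std_multiplier
        self.warmup_steps = warmup_steps
        self.window = deque(maxlen=rolling_window)  # recent advantage_std values
        self.steps_seen = 0

    def update(self, advantage_std):
        """Return None, "warning", or "critical" for the current step."""
        level = None
        # Assumption: no alerts are raised until warmup_steps values were seen.
        if self.steps_seen >= self.warmup_steps and self.window:
            mean = fmean(self.window)
            if advantage_std > 2 * self.std_multiplier * mean:
                level = "critical"
            elif advantage_std > self.std_multiplier * mean:
                level = "warning"
        # Assumption: the current value joins the window after the check,
        # so a spike is not compared against a mean it has already inflated.
        self.window.append(advantage_std)
        self.steps_seen += 1
        return level


detector = AdvantageVarianceDetector()
for _ in range(20):
    detector.update(1.0)       # warmup: always None
print(detector.update(3.5))    # above 3.0 x rolling mean
print(detector.update(10.0))   # above 6.0 x rolling mean
```

With the default std_multiplier of 3.0, a value a little over three times the recent mean produces a warning, and a value over six times the recent mean produces a critical alert.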

Configuration

advantage_variance:
  enabled: true
  std_multiplier: 3.0
  rolling_window: 100
  warmup_steps: 20
  baseline_mode: rolling   # or "frozen"

Rolling vs frozen baseline

Same tradeoff as KL explosion. The rolling baseline catches sharp spikes; the frozen baseline catches slow drift. The default remains rolling for backwards compatibility.
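To see why the two modes catch different failures, here is a small Python sketch that runs both baselines over a slowly drifting series. It rests on assumptions not stated on this page: the function first_firing_step is hypothetical, and "frozen" is taken to mean that the baseline mean is captured once, when the window first fills, and never updated afterwards.

```python
from collections import deque
from statistics import fmean


def first_firing_step(values, baseline_mode, std_multiplier=3.0, rolling_window=100):
    """Return the index of the first value over the warning threshold, or None."""
    window = deque(maxlen=rolling_window)
    frozen_mean = None
    for step, value in enumerate(values):
        if len(window) == rolling_window:
            if baseline_mode == "frozen":
                # Assumption: capture the baseline once and keep it fixed.
                if frozen_mean is None:
                    frozen_mean = fmean(window)
                mean = frozen_mean
            else:
                mean = fmean(window)
            if value > std_multiplier * mean:
                return step
        # The rolling baseline keeps absorbing new values; the frozen one stops.
        if baseline_mode == "rolling" or frozen_mean is None:
            window.append(value)
    return None


# Slow drift: advantage_std grows by 1% per step, with no sharp spike anywhere.
drift = [1.01 ** i for i in range(400)]
rolling_hit = first_firing_step(drift, "rolling", rolling_window=20)
frozen_hit = first_firing_step(drift, "frozen", rolling_window=20)
print("rolling fired at:", rolling_hit)
print("frozen fired at:", frozen_hit)
```

In this synthetic series the rolling baseline never fires, because the mean drifts upward along with the signal. The frozen baseline fires once the signal has grown past three times its early level.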

When to tune it

  • Higher std_multiplier for noisier value functions (small batch sizes amplify advantage variance).
  • baseline_mode: frozen if you suspect slow drift in advantage statistics.
  • Larger rolling_window for long runs where the default 100-step window is too short.
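The first point, that small batches make advantage_std noisier, can be checked with a quick simulation. This is an illustrative sketch only: it draws synthetic "advantages" from a fixed standard normal distribution (an assumption, since real advantages are not stationary), and the helper spread_of_batch_std is hypothetical.

```python
import random
from statistics import fmean, pstdev


def spread_of_batch_std(batch_size, trials=200, seed=0):
    """Relative spread (std / mean) of per-batch advantage_std estimates."""
    rng = random.Random(seed)
    estimates = [
        pstdev([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return pstdev(estimates) / fmean(estimates)


small = spread_of_batch_std(batch_size=16)
large = spread_of_batch_std(batch_size=1024)
print(f"batch 16: {small:.3f}, batch 1024: {large:.3f}")
```

The per-step estimate fluctuates far more at batch size 16 than at 1024 even though the underlying distribution never changes, which is why a higher std_multiplier can be reasonable for small-batch runs.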

What to do when it fires

Value function estimates are unstable. Reduce the learning rate, increase the batch size, or add advantage normalization.

In order of preference: (1) enable advantage normalization in your framework if it is not already on; (2) increase the batch size to smooth out per-step variance; (3) reduce the learning rate. If none of these helps, the value function may need more capacity or a longer warmup before being trusted as a baseline.
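If your framework has no built-in option, per-batch advantage normalization is commonly written along these lines. This is a generic sketch rather than any particular framework's code: the function name normalize_advantages and the eps default are assumptions, and a real training loop would apply this to tensors rather than Python lists.

```python
from statistics import fmean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Shift a batch of advantages to zero mean and scale it to roughly unit std."""
    mean = fmean(advantages)
    std = pstdev(advantages)
    # eps (an assumed small constant) avoids division by zero when every
    # advantage in the batch is identical.
    return [(a - mean) / (std + eps) for a in advantages]


normalized = normalize_advantages([1.0, 2.0, 3.0, 4.0])
print(normalized)
```

Prefer the framework's own switch where one exists, so that normalization happens at the point in the pipeline the rest of the training code expects.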