Advantage variance spike
Watches: the standard deviation of GAE advantages (or an equivalent advantage estimate). Failure mode: the value function's estimates have become unstable. This often shows up just before training starts producing garbage gradients: the value head no longer provides a useful baseline, so policy gradient updates have wildly inconsistent signs.
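To make the watched quantity concrete, here is a minimal sketch of how an advantage_std value could be computed for one rollout using the standard GAE recursion. The function name, the gamma and lam defaults, and the use of a population standard deviation are assumptions for illustration; the detector may consume whatever advantage statistic your framework already logs, and this sketch ignores episode boundaries.

```python
from statistics import pstdev


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    `values` must hold one more entry than `rewards`: the final entry is the
    bootstrap value of the state reached after the last reward.
    Assumption: no episode terminates inside the rollout.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


# Hypothetical rollout: three rewards and four value estimates.
advantages = gae_advantages([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0])
advantage_std = pstdev(advantages)  # the scalar a detector like this would track
```

When the value estimates swing between steps, the TD residuals swing with them, which is what inflates this standard deviation.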
How it fires
Uses the same z-score model as KL explosion. The detector maintains a rolling window over advantage_std. It fires a warning when the current value is more than std_multiplier times the rolling mean, and a critical alert when it is more than 2 × std_multiplier times the rolling mean.
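The thresholds above can be sketched as follows. This is an illustrative reading, not the detector's actual source: the class name, the method name, and the choice to compare against the window before appending the current value are all assumptions. It implements the thresholds exactly as stated (multiples of the rolling mean); a strict z-score would instead compare against the mean plus a multiple of the rolling standard deviation, and this page does not confirm which form the tool uses internally.

```python
from collections import deque
from statistics import fmean


class AdvantageSpikeDetector:
    """Hypothetical sketch of the ratio-to-rolling-mean check."""

    def __init__(self, std_multiplier=3.0, rolling_window=100, warmup_steps=20):
        self.std_multiplier = std_multiplier
        self.warmup_steps = warmup_steps
        self.history = deque(maxlen=rolling_window)  # oldest values fall off

    def update(self, advantage_std):
        """Return None, "warning", or "critical" for the current step."""
        level = None
        # Assumption: stay silent until warmup_steps values have been seen.
        if len(self.history) >= self.warmup_steps:
            mean = fmean(self.history)
            if advantage_std > 2 * self.std_multiplier * mean:
                level = "critical"
            elif advantage_std > self.std_multiplier * mean:
                level = "warning"
        # Assumption: the current value joins the window after the comparison,
        # so a spike does not dilute the baseline it is judged against.
        self.history.append(advantage_std)
        return level


detector = AdvantageSpikeDetector()
for _ in range(20):
    detector.update(1.0)        # warmup: no alerts are raised
alert = detector.update(3.5)    # compared against a rolling mean of 1.0
```

With the default std_multiplier of 3.0, the warning threshold here is 3 times the rolling mean and the critical threshold is 6 times the rolling mean.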
Configuration
advantage_variance:
  enabled: true
  std_multiplier: 3.0
  rolling_window: 100
  warmup_steps: 20
  baseline_mode: rolling  # or "frozen"
Rolling vs frozen baseline
Same tradeoff as KL explosion. The rolling baseline catches sharp spikes; the frozen baseline catches slow drift. The default stays rolling for backwards compatibility.
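A small simulation can illustrate the tradeoff. The sketch below assumes that a frozen baseline is the mean of the values collected during warmup and is never updated afterwards; that definition, the function name, and the synthetic 1%-per-step drift series are illustrative assumptions rather than the tool's documented behavior.

```python
from collections import deque
from statistics import fmean


def first_alert_step(values, baseline_mode, std_multiplier=3.0,
                     rolling_window=100, warmup_steps=20):
    """Return the first step whose value exceeds std_multiplier times the
    baseline, or None if no step does."""
    window = deque(maxlen=rolling_window)
    frozen_mean = None
    for step, value in enumerate(values):
        if len(window) >= warmup_steps:
            if baseline_mode == "frozen":
                # Assumption: freeze the baseline at the end of warmup.
                if frozen_mean is None:
                    frozen_mean = fmean(window)
                baseline = frozen_mean
            else:
                # Rolling: the baseline follows the most recent values.
                baseline = fmean(window)
            if value > std_multiplier * baseline:
                return step
        window.append(value)
    return None


# Synthetic slow drift: advantage_std grows by 1% per step.
drift = [1.01 ** t for t in range(400)]
print(first_alert_step(drift, "rolling"))
print(first_alert_step(drift, "frozen"))
```

On this synthetic series the rolling baseline keeps pace with the drift and never alerts, while the frozen baseline eventually does. A sudden jump, by contrast, is caught by either mode.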
When to tune it
- Higher std_multiplier for noisier value functions (small batch sizes amplify advantage variance).
- baseline_mode: frozen if you suspect slow drift in advantage statistics.
- Larger rolling_window for long runs where the default 100-step window is too short.
Recommended action when it fires
Value function estimates are unstable. Reduce learning rate, increase batch size, or add advantage normalization.
In order of preference: (1) enable advantage normalization in your framework if it's not already on; (2) increase the batch size to smooth out per-step variance; (3) reduce the learning rate. If none of those help, the value function may need more capacity or a longer warmup before being trusted as a baseline.
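If your framework has no built-in switch for option (1), per-batch advantage normalization is usually a few lines. The following is a minimal sketch using only the standard library; the function name and the eps default are assumptions, and real frameworks typically apply the same operation to tensors rather than Python lists.

```python
from statistics import fmean, pstdev


def normalize_advantages(advantages, eps=1e-8):
    """Standardize one batch of advantages to zero mean and unit std.

    eps guards against division by zero when every advantage is identical.
    """
    mean = fmean(advantages)
    std = pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


# Hypothetical batch with one outlier advantage.
normalized = normalize_advantages([0.1, -0.2, 0.05, 4.0])
```

Normalization rescales each batch, so it limits how much a single unstable batch can inflate the gradient magnitude; it does not repair the value function itself.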