Detectors¶
rlwatch ships with seven detectors. Each one watches a specific failure mode that GRPO/PPO runs hit in practice. Every detector follows the same contract: configurable warmup, configurable severity thresholds, fires once per episode (not per step), and every alert ships with a recommended action.
The lineup¶
| Detector | In plain English | Default trip-wire |
|---|---|---|
| Entropy collapse | The model stopped exploring — it's now just repeating itself. | Entropy < 1.0 for 50 steps in a row |
| KL divergence explosion | The policy is running away from the reference model (usually the prelude to reward hacking). | KL > 3σ above the rolling mean |
| Reward hacking proxy | Rewards suddenly got weird — either way more variance than before, or split into two clusters. | Variance > 3× baseline, or Hartigan dip test p < 0.05 |
| Advantage variance spike | The value function estimates just became unstable. | Advantage std > 3× rolling baseline |
| Loss NaN / Inf | The optimizer has blown up; any further updates corrupt the policy. | Loss is non-finite (one step is enough) |
| Gradient norm spike | Gradients exploded — usually the precursor to a loss NaN. | Grad norm > 3σ above frozen baseline |
| Reward mean drift | Reward mean is drifting monotonically — suspicious for slow reward hacking. | Monotone for 50 steps with magnitude > 0.1 |
Every detector has two severity levels (warning and critical), a configurable warmup period so it doesn't fire at step 3, and a cooldown so you don't get spammed.
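As a concrete illustration of one trip-wire from the table, the entropy-collapse rule ("entropy < 1.0 for 50 steps in a row") amounts to a consecutive-steps counter with a warmup guard. This is a sketch under assumed names (`ConsecutiveBelowThreshold` is not rlwatch's actual API):

```python
class ConsecutiveBelowThreshold:
    """Fires when a metric stays below `threshold` for `patience` steps in a row."""

    def __init__(self, threshold: float = 1.0, patience: int = 50, warmup_steps: int = 0):
        self.threshold = threshold
        self.patience = patience
        self.warmup_steps = warmup_steps
        self._streak = 0

    def update(self, step: int, entropy: float) -> bool:
        if step < self.warmup_steps:
            return False  # ignore noisy early steps
        self._streak = self._streak + 1 if entropy < self.threshold else 0
        return self._streak >= self.patience

check = ConsecutiveBelowThreshold(threshold=1.0, patience=50)
fired = [step for step in range(60) if check.update(step, entropy=0.5)]
# trips at step 49: the 50th consecutive below-threshold step
```

Note that a single step above the threshold resets the streak, which is what makes this a "for N steps in a row" rule rather than a plain threshold.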
Common contract¶
Every detector exposes the same shape of config (see configuration):
- `enabled: bool` — turn the detector off without removing it from your YAML.
- `warmup_steps: int` — number of steps to ignore at the start of training. Use a longer warmup for noisy initial steps.
- One or more numeric thresholds (the trip-wires).
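In YAML, that common shape might look like the following. The per-detector threshold keys (`entropy_threshold`, `patience_steps`) are illustrative placeholders; see the configuration reference for each detector's actual key names:

```yaml
detectors:
  entropy_collapse:
    enabled: true
    warmup_steps: 100      # ignore noisy early steps
    entropy_threshold: 1.0 # hypothetical key: this detector's trip-wire
    patience_steps: 50     # hypothetical key: consecutive steps required
```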
And every detector emits an Alert with the same fields:
- `detector` — the detector's identifier (e.g., `"entropy_collapse"`)
- `severity` — `"warning"` or `"critical"`
- `step` — the training step at which the alert fired
- `message` — a one-sentence human-readable explanation
- `metric_values` — a dict of the relevant numeric values
- `recommendation` — what to actually do about it
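That field list maps naturally onto a dataclass. A minimal sketch (field names come from the contract above; the class layout itself is an assumption, not rlwatch's actual definition):

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    detector: str          # e.g. "entropy_collapse"
    severity: str          # "warning" or "critical"
    step: int              # training step at which the alert fired
    message: str           # one-sentence human-readable explanation
    metric_values: dict = field(default_factory=dict)
    recommendation: str = ""

alert = Alert(
    detector="entropy_collapse",
    severity="critical",
    step=1250,
    message="Entropy below 1.0 for 50 consecutive steps.",
    metric_values={"entropy": 0.42},
    recommendation="Lower the learning rate or raise the entropy bonus.",
)
```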
Severity tiers¶
Most detectors have two tiers:
- Warning — "this might be becoming a problem; check it out next time you look at the dashboard."
- Critical — "this is a problem right now; you should probably stop the run."
Critical alerts can preempt warnings within the cooldown window — if you see a warning and then a critical from the same detector five steps later, the critical is allowed through. The reverse is not true (a warning will not preempt an earlier critical).
Cooldown and rate limiting¶
The AlertManager tracks (detector, severity) pairs against a cooldown window. By default, the same detector at the same severity won't re-fire within 100 steps, and a single run is capped at 50 total alerts. Both are configurable via `alerts.cooldown_steps` and `alerts.max_alerts_per_run`.
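The bookkeeping behind both rules, plus the one-way severity preemption described above, can be sketched with a dict and a counter. This is an illustration of the described behavior, not AlertManager's actual code:

```python
SEVERITY_RANK = {"warning": 0, "critical": 1}

class AlertRateLimiter:
    def __init__(self, cooldown_steps: int = 100, max_alerts_per_run: int = 50):
        self.cooldown_steps = cooldown_steps
        self.max_alerts_per_run = max_alerts_per_run
        self._last_fired = {}  # detector -> (step, severity)
        self._total = 0

    def allow(self, detector: str, severity: str, step: int) -> bool:
        if self._total >= self.max_alerts_per_run:
            return False  # per-run cap reached
        last = self._last_fired.get(detector)
        if last is not None:
            last_step, last_severity = last
            in_cooldown = step - last_step < self.cooldown_steps
            # A critical may preempt an earlier warning, never the reverse.
            if in_cooldown and SEVERITY_RANK[severity] <= SEVERITY_RANK[last_severity]:
                return False
        self._last_fired[detector] = (step, severity)
        self._total += 1
        return True

rl = AlertRateLimiter()
assert rl.allow("kl_explosion", "warning", step=10)      # first alert passes
assert not rl.allow("kl_explosion", "warning", step=20)  # same severity, in cooldown
assert rl.allow("kl_explosion", "critical", step=30)     # critical preempts the warning
assert not rl.allow("kl_explosion", "warning", step=40)  # reverse is blocked
```

Keying the cooldown state per detector (with the severity stored alongside) is one simple way to get the asymmetric preemption: an escalation replaces the cooldown entry, so a later warning is compared against the critical, not the original warning.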
Frozen vs rolling baselines¶
Two detectors (KL explosion, advantage variance spike) use a z-score model against a rolling baseline by default — this catches sharp spikes reliably but silently follows slow drift. The baseline_mode: "rolling" | "frozen" option freezes the baseline once the rolling window first fills, mirroring how RewardHackingDetector already works.
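The difference between the two modes comes down to one decision: whether the baseline statistics keep updating, or get captured once when the window first fills. A sketch under assumed names (`Baseline` is illustrative, not rlwatch's class), showing why a rolling baseline follows slow drift while a frozen one catches it:

```python
from collections import deque
import math

class Baseline:
    """Z-score a metric stream against a rolling or frozen baseline."""

    def __init__(self, window: int = 10, mode: str = "rolling"):
        assert mode in ("rolling", "frozen")
        self.mode = mode
        self.values = deque(maxlen=window)
        self.frozen = None  # (mean, std), captured when the window first fills

    def _stats(self):
        mean = sum(self.values) / len(self.values)
        var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
        return mean, math.sqrt(var)

    def update(self, value: float):
        """Return the z-score of `value` against the baseline, then fold it in."""
        z = None
        if self.frozen is not None:
            mean, std = self.frozen
            z = (value - mean) / std if std > 0 else None
        elif len(self.values) >= 2:
            mean, std = self._stats()
            z = (value - mean) / std if std > 0 else None
        if self.frozen is None:
            self.values.append(value)
            if self.mode == "frozen" and len(self.values) == self.values.maxlen:
                self.frozen = self._stats()  # freeze once the window first fills
        return z

rolling = Baseline(window=10, mode="rolling")
frozen = Baseline(window=10, mode="frozen")
rz = [rolling.update(float(i)) for i in range(50)]
fz = [frozen.update(float(i)) for i in range(50)]
# On a slow monotone drift, the rolling z-score plateaus below a 3-sigma
# threshold, while the frozen z-score grows without bound and trips it.
```

The rolling window re-centers on the drifting values every step, so the deviation it measures stays roughly constant; the frozen baseline keeps measuring distance from where the run started.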
The new gradient norm spike detector defaults to "frozen" because grad norms drift slowly on healthy runs. KL and advantage default to "rolling" for backwards compatibility — if you want the frozen behavior, opt in explicitly:
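An explicit opt-in might look like this in YAML (the detector key names `kl_explosion` and `advantage_variance` are assumptions; check the configuration reference for the actual keys):

```yaml
detectors:
  kl_explosion:
    baseline_mode: "frozen"   # default is "rolling"
  advantage_variance:
    baseline_mode: "frozen"
```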
See BUILD_DECISIONS.md on GitHub for the full rationale.