Introducing rlwatch — catch broken RL training runs before they waste your GPU budget¶
v0.3.0 — first PyPI release
If you train language models with GRPO or PPO, you already know the pain. You kick off a run on 8 H100s, go to sleep, and wake up to find the policy collapsed into repeating the same token 12 hours ago. Nobody saw it. Nothing paged. The run just quietly rotted.
rlwatch is a tiny Python library that watches your training metrics in real time and pings you on Slack, Discord, email, or any HTTP endpoint the moment things start going wrong — before the run is ruined.
For the common case, the whole user-facing API is two lines: one to install, one to hook rlwatch into your training loop.
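As an illustration of that two-line shape, here is a minimal stand-in. The `Watcher` class and `log` method below are assumptions for the sketch, not rlwatch's real API; check the docs for the actual names.

```python
# Illustrative stand-in only: `Watcher` and `log` are assumptions,
# NOT rlwatch's real API — see the docs for the actual integration.
class Watcher:
    """Collects per-step metrics; real detectors would inspect them."""
    def __init__(self):
        self.steps = []

    def log(self, step, **metrics):
        # In the real library, this is where detectors would run
        # and alert channels would fire.
        self.steps.append((step, dict(metrics)))

# The two-line shape: construct once, then log metrics from the loop.
watcher = Watcher()
watcher.log(step=1, entropy=2.31, kl=0.02)
```

The point of the shape is that monitoring rides along with logging you already do, so adoption costs nothing extra per step.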
What it catches¶
The four most painful failure modes that GRPO/PPO post-training runs hit in practice:
- Entropy collapse — the model stopped exploring and is repeating itself. Usually the prelude to a flat reward signal you don't notice for hours.
- KL divergence explosion — the policy is running away from the reference model. Often the prelude to reward hacking.
- Reward hacking — the model discovered an adversarial way to game the reward function. rlwatch catches this via reward variance explosion or bimodal reward distribution.
- Loss NaN / gradient explosion — the optimizer has blown up. By the time you see this, every further update corrupts the policy. rlwatch alerts on the gradient norm spike before the loss goes NaN, so you have a chance to clip and recover.
(Plus: advantage variance spikes for value function instability. Six detectors total.)
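To make the detector idea concrete, here is a minimal sketch of an entropy-collapse check, comparing recent entropy against an early-run baseline. The window sizes and threshold ratio are illustrative assumptions, not rlwatch's actual implementation.

```python
def entropy_collapse(history, window=20, baseline_n=50, ratio=0.25):
    """Flag collapse when recent mean entropy falls below a fraction
    of the early-run baseline. All thresholds here are illustrative."""
    if len(history) < baseline_n + window:
        return False  # not enough data to establish a baseline
    baseline = sum(history[:baseline_n]) / baseline_n
    recent = sum(history[-window:]) / window
    return recent < ratio * baseline

# Healthy run: entropy stays near 2.0 — no alert.
healthy = [2.0] * 200
# Collapsing run: entropy decays toward zero — alert fires.
collapsed = [2.0] * 100 + [2.0 * 0.9 ** i for i in range(100)]
print(entropy_collapse(healthy), entropy_collapse(collapsed))
```

A real detector also has to handle noise, warmup, and cooldowns, but the core is this kind of cheap statistical check run on every logged step.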
Every alert ships with a recommended action. Not just "KL exploded" — "KL exploded; immediately reduce learning rate or revert to a previous checkpoint." Because what you actually want at 3am is the fix, not the diagnosis.
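Structurally, pairing diagnosis with fix just means the alert object carries an action field. A minimal sketch (this `Alert` type and its fields are assumptions, not rlwatch's real schema), using the KL example from the text:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    detector: str
    message: str   # the diagnosis
    action: str    # the fix — the part you want at 3am

# Illustrative only; mirrors the KL example above.
alert = Alert(
    detector="kl_divergence",
    message="KL exploded",
    action="immediately reduce learning rate or revert to a previous checkpoint",
)
print(f"{alert.message}; {alert.action}")
```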
What's in v0.3.0¶
This is the first PyPI release. It includes everything from stage one (production-grade test harness, six detectors, schema-versioned SQLite store, AlertManager with cooldown preemption) plus the adoption-focused work:
- Discord webhook alert channel — because lots of OSS ML teams live there
- Generic HTTP webhook — universal escape hatch for PagerDuty, Mattermost, internal incident trackers, anything else
- `[dashboard]` extra — the Streamlit dashboard moved out of core, shrinking `pip install rlwatch` from ~150MB to ~20MB
- End-to-end TRL tutorial — a real GPT-2 + TRL GRPO run with a deliberately misconfigured LR that catches a real entropy collapse, on CPU, in ~5 minutes. No GPU, no API keys, deterministic.
- mkdocs-material docs site — every detector explained in depth, configuration reference, alerts setup, FAQ
- OIDC PyPI publishing — zero long-lived tokens, manual approval gate, full test suite + 90% coverage gate before any publish
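The generic HTTP webhook in the list above is just a JSON POST to a URL you configure. A sketch of that escape hatch, using only the standard library (the payload fields and endpoint are illustrative assumptions, not rlwatch's actual schema):

```python
import json
import urllib.request

def build_webhook_request(url, alert):
    """Build (but don't send) a POST carrying the alert as JSON.
    The payload shape here is illustrative, not rlwatch's schema."""
    body = json.dumps(alert).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_webhook_request(
    "https://example.com/hooks/incidents",  # hypothetical endpoint
    {"detector": "entropy_collapse", "step": 320, "action": "lower LR"},
)
# Actually sending it would be: urllib.request.urlopen(req)
print(req.get_method(), req.get_header("Content-type"))
```

Because the contract is just "POST JSON here," anything that accepts inbound webhooks — PagerDuty, Mattermost, an internal tracker — can consume it without a dedicated integration.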
Why it exists¶
TensorBoard and W&B log metrics. Grafana works if you set it up. None of them page you when something is wrong. The dominant failure mode of post-training RL is not "the loss diverged" — it's "the run quietly rotted overnight while you slept and you found out 12 hours later."
The thesis: a two-line install bar is the load-bearing UX claim. If installing rlwatch costs more than 30 seconds, researchers won't bother, and the value (saving hours of GPU time) never materializes. Everything else in the project is in service of that claim.
How to try it¶
The fastest path to "is this thing real" is the simulation:
pip install rlwatch
git clone https://github.com/varun1724/rlwatch
cd rlwatch
python examples/simulate_grpo_run.py
That runs a synthetic GRPO trace through the full pipeline in ~5 seconds. You'll see rlwatch's console panel fire when entropy collapses around step 320. Then `rlwatch diagnose` gives you the retrospective report.
For a real model, try the end-to-end TRL tutorial — GPT-2 + TRL GRPO + deliberately misconfigured LR + real entropy_collapse alert, all in ~5 minutes on a laptop CPU.
What's next¶
The library is local-first and will stay that way. The default install makes zero network calls — no telemetry, no opt-out signup, no analytics dashboard. If you want alerts in Slack/Discord/email/webhook, you wire up the channel yourself. CI enforces this with a forbidden-pattern grep.
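A forbidden-pattern check of the kind described above can be as simple as scanning source lines for network-call signatures. This sketch is an assumption about how such a check might look; the actual pattern list rlwatch's CI uses is not shown here.

```python
import re

# Illustrative pattern list — an assumption, not rlwatch's real CI check.
FORBIDDEN = re.compile(
    r"urllib\.request\.urlopen|requests\.(get|post)|socket\.connect"
)

def network_call_lines(source: str):
    """Return 1-based line numbers in `source` that look like network calls."""
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), 1)
        if FORBIDDEN.search(line)
    ]

clean = "x = 1\ny = compute(x)\n"
dirty = "import requests\nrequests.post(url, data=payload)\n"
print(network_call_lines(clean), network_call_lines(dirty))
```

Run over every file in the core package in CI, a check like this makes "zero network calls by default" an enforced invariant rather than a promise.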
Stage three (v0.4) is the depth chunk: deep integrations for veRL and OpenRLHF, real diptest integration for the bimodal reward detection, dashboard polish (run comparison view, alert timeline). Stage four (v1.0) is the beginning of the hosted service — for teams who want a multi-run dashboard and shared alert routing without standing up their own SQLite + Streamlit instance. The local-first OSS library will stay free regardless.
Links¶
- PyPI: pypi.org/project/rlwatch
- GitHub: github.com/varun1724/rlwatch
- Docs: varun1724.github.io/rlwatch
- Tutorial: TRL + GRPO end-to-end
- CHANGELOG: v0.3.0
If you try it and it breaks, please open an issue. If it works and saves your GPU budget, I'd love to hear that too.