How I Wired Claude into Cron to Triage My Own Outages

I run RydeRush.com solo, which means I'm also the on-call engineer at 2 a.m. So I gave the job to an AI agent: a sandboxed Linux user, a hard turn cap, and the Claude Code CLI doing the actual reasoning. Here is exactly how it works.

I run a small theme-park content site, RydeRush.com. It's a one-person operation (technically two people, but the other half is the business side), which means I am - in order - the writer, the editor, the SEO person, and, at 2 a.m. on a Tuesday, the on-call engineer. That last hat fits worst. When a Celery worker hangs or gunicorn forgets how to talk to the database, I usually find out hours later from a sad Search Console graph.

So a few weekends ago I decided to give the job to someone more reliable than half-asleep me: an AI agent that wakes up only when something is actually broken, looks around with a flashlight, writes a clear report, and goes back to bed. Here's how it actually works - verified by SSH-ing into the box while writing this post - what's stopping it from going sideways, and what I learned in the process.

The problem with 'alert me when it's down'

The classic move is to wire your services into a monitoring service like Healthchecks.io and have it ping you when a check goes red. That's great - until you get the ping. Then you still have to:

  1. SSH into the box.
  2. Run systemctl status on the suspect service.
  3. Tail journalctl and a few log files.
  4. Form a hypothesis.
  5. Restart, or escalate to yourself.

For a hobby-scale operation, steps 1-4 are 90% of the work and 100% of the reason I procrastinate fixing things. The fix itself is usually trivial; the diagnosis is the part that drains a Sunday afternoon. I wanted to compress steps 1-4 into a single Telegram message that tells me what's broken and why - generated by something that actually read the logs.

The setup at a glance

The whole system is four small pieces glued together:

  1. Healthchecks.io monitors my server-side services. Every service pings its check URL on a schedule; if a ping is missed, the check goes red.
  2. A cron job runs every 5 minutes on the server and polls the Healthchecks.io API.
  3. An incident-responder script notices a newly-red check and runs the Claude Code CLI with a tight tool allowlist and a hard turn cap.
  4. A dedicated, locked-down Linux user - claude-responder - is the body that Claude lives in. It can read logs and restart a tight allowlist of services. It cannot do anything else.

That's it. No Kubernetes, no message bus, no fancy orchestrator. The whole thing fits in about 300 lines of Bash and Python on a single VPS.

Step 1: cron does the boring part

Cron is the unglamorous hero here. The claude-responder user owns its own crontab:

claude-responder crontab
PATH=/home/claude-responder/.local/bin:/usr/bin:/bin
MAILTO=""

# Poll healthchecks.io every 5 min. On new DOWN: telegram alert + dispatch responder.
*/5 * * * * /home/claude-responder/scripts/healthcheck-monitor.sh >> /home/claude-responder/logs/monitor.log 2>&1

# Nightly cleanup of per-incident files.
5 3 * * * find /home/claude-responder/logs -name "incident-*.log" -mtime +10 -delete

The monitor script polls the Healthchecks.io v3 API, compares the current state against a JSON state file, and computes three buckets: new DOWNs, recoveries (previously alerted, now back up), and repeat DOWNs that are still inside a 55-minute dedup window. Only new DOWNs fire the agent. The whole thing is curl plus an inline python3 heredoc, with no external dependencies:

healthcheck-monitor.sh (excerpt)
# healthcheck-monitor.sh (excerpt)
RESPONSE=$(curl -s --max-time 20 -H "X-Api-Key: $HC_API_KEY" \
  https://healthchecks.io/api/v3/checks/)

# Returns JSON: {"down": [...], "newDowns": [...], "recoveries": [...]}
TRANSITIONS=$(STATE_FILE="$STATE_FILE" RESPONSE="$RESPONSE" python3 <<'PY'
import json, os, time
state = json.load(open(os.environ['STATE_FILE'])) if os.path.exists(...) else {}
alerted = state.get('alertedChecks', {})
# ... figure out new_downs, recoveries, persist updated state ...
PY
)
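
The elided part of that excerpt is the bucketing logic. Here is a minimal Python sketch of how it could work, not the script's actual code. The function name, the state layout (an alertedChecks map keyed by check name, holding a "since" and a "lastAlert" timestamp), and the choice to treat a check as a new DOWN again once the dedup window expires are all assumptions made for illustration.

```python
import time

DEDUP_WINDOW_SECONDS = 55 * 60  # repeat DOWNs inside this window stay quiet


def compute_transitions(response, state, now=None):
    """Bucket checks into new DOWNs, quiet repeats and recoveries.

    response: parsed body of GET /api/v3/checks/ ({"checks": [...]}).
    state:    parsed state file. Assumed layout (hypothetical):
              {"alertedChecks": {name: {"since": t, "lastAlert": t}}}
    Returns (transitions, new_state) without mutating the inputs.
    """
    now = time.time() if now is None else now
    alerted = dict(state.get("alertedChecks", {}))
    down = [c["name"] for c in response.get("checks", [])
            if c.get("status") == "down"]

    new_downs = []
    for name in down:
        entry = alerted.get(name)
        if entry is None:
            # First time we see this check red: alert and remember when.
            alerted[name] = {"since": now, "lastAlert": now}
            new_downs.append(name)
        elif now - entry["lastAlert"] >= DEDUP_WINDOW_SECONDS:
            # Still red after the window expired: alert again (assumption).
            alerted[name] = {"since": entry["since"], "lastAlert": now}
            new_downs.append(name)
        # Otherwise it is a repeat DOWN inside the window: stay quiet.

    recoveries = []
    for name in list(alerted):
        if name not in down:
            entry = alerted.pop(name)
            recoveries.append({"name": name,
                               "downSeconds": now - entry["since"]})

    transitions = {"down": down, "newDowns": new_downs,
                   "recoveries": recoveries}
    return transitions, {"alertedChecks": alerted}
```

Keeping "since" separate from "lastAlert" is what lets a recovery message report the full downtime even if the check was re-alerted along the way.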

When the monitor sees a new DOWN, it does two things, in this order:

  1. Fires a Telegram message immediately so I know within five minutes that something broke - before Claude has even loaded.
  2. Detaches the responder with setsid ... & disown, so cron isn't blocked waiting for Claude.

Recoveries get their own Telegram with the downtime duration. That detail was a small win - I was tired of wondering whether the thing was still down.
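
For readers who would rather do the dispatch from Python than from Bash, the same two ideas can be sketched with the standard library. This is an illustration under assumptions, not the monitor's code: dispatch_detached and format_downtime are hypothetical names, and start_new_session=True stands in for setsid.

```python
import subprocess


def dispatch_detached(argv, log_path):
    """Start argv in its own session so the caller never waits on it.

    start_new_session=True makes the child call setsid() before exec,
    which is the Python counterpart of `setsid ... &` in the shell.
    """
    with open(log_path, "ab") as log:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )


def format_downtime(seconds):
    """Render a downtime in seconds as '4m' or '1h 05m'."""
    minutes = int(seconds) // 60
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m" if hours else f"{minutes}m"
```

The child keeps its own copy of the log file descriptor, so closing the file in the parent right after the spawn is safe.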

Step 2: a sandboxed Linux user is Claude's body

This is the part I think about the most, because 'give an LLM shell access' is a sentence that should make anyone twitch. The trick is that the LLM doesn't really have shell access - a Linux user does, and the LLM just sends instructions to it. Whatever the model decides to try, the kernel and sudo are the final word.

The claude-responder user is a dedicated account with very specific powers:

user setup + sudoers allowlist
# Created as a normal account, then added to two key groups:
#   adm       -> read /var/log/* (system journal, nginx, etc.)
#   www-data  -> read /srv/vhosts/ryderush.com/logs/* via ACL
sudo usermod -a -G adm,www-data claude-responder

# /etc/sudoers.d/99-claude-responder - the entire allowlist:
claude-responder ALL=(root) NOPASSWD: \
    /bin/systemctl restart ryde_prod.service, \
    /bin/systemctl restart daphne.service, \
    /bin/systemctl restart celery_ryderush.service, \
    /bin/systemctl restart celerybeat_ryderush.service, \
    /bin/systemctl restart nginx.service, \
    /bin/journalctl

So Claude - running as claude-responder - can:

  • Read the logs it would need to diagnose a problem (via group membership, not file-by-file ACLs).
  • Run journalctl against any unit.
  • Restart exactly five services I've decided are safe to restart: gunicorn, daphne, the Celery worker, Celery beat, and nginx.

It cannot install packages, edit code, touch the database, push to git, or restart anything off that list. If the model goes off the rails, the worst it can do is systemctl restart a service that didn't need restarting. That's a tolerable failure mode for a 2 a.m. agent.

Step 3: the Claude Code CLI, with hard caps

The responder script doesn't talk to the Anthropic API directly - it shells out to the Claude Code CLI, which gives me three things almost for free: a built-in tool-use loop, a --max-turns budget, and an --allowed-tools allowlist that the harness enforces before any command runs.

The whole invocation looks like this:

incident-responder.sh - calling Claude
timeout --kill-after=30 600 ~/.local/bin/claude \
  -p "$PROMPT" \
  --max-turns 15 \
  --allowed-tools "$ALLOWED_TOOLS_ARG" \
  --permission-mode default \
  >"$CLAUDE_OUTPUT" 2>>"$LOG"

Four layers of leash, stacked:

  • flock at the top of the script - only one responder session can run at a time.
  • timeout --kill-after=30 600 - the whole Claude run gets SIGTERM after 10 minutes of wall time, and SIGKILL 30 seconds later if it's still alive.
  • --max-turns 15 - Claude gets at most 15 turns to investigate and report. If it doesn't reach a conclusion, it has to write up what it does know.
  • --allowed-tools - a tight allowlist: Read, Grep, Glob, tail/head/cat/ls, systemctl status, journalctl, df, free, curl -sI, git -C /srv/vhosts/ryderush.git, and the five sudo systemctl restart lines.
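
The first two layers are plain process hygiene and don't depend on Claude at all. A rough Python equivalent, as a sketch under assumptions: run_exclusive and the lock path are hypothetical, and unlike timeout --kill-after=30 this version kills the child outright when the limit is hit instead of sending SIGTERM first.

```python
import fcntl
import subprocess


def run_exclusive(argv, lock_path, timeout_seconds=600):
    """Run argv under an exclusive lock and a wall-clock limit.

    Returns ("busy", None) if another run holds the lock,
    ("timeout", None) if the limit was hit, or ("done", exit_code).
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking: a second responder gives up instead of queueing.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return ("busy", None)
        try:
            # subprocess.run kills the child once the timeout expires.
            done = subprocess.run(argv, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            return ("timeout", None)
        return ("done", done.returncode)
    # The lock is released when lock_file is closed on leaving the block.
```

Returning a status instead of raising keeps the caller's reporting path simple: every outcome, including "someone else is already on it", can become a line in the incident log.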

Fifteen turns is enough to look at two or three log files, check service status, and form a clear hypothesis. It's not enough to spiral into a 45-minute exploration that costs me $4 in tokens and leaves me with no clearer picture than I started with.

The prompt itself does a fair bit of heavy lifting. It tells Claude which Healthchecks checks map to which services ('Web App (HTTP)' goes to gunicorn + daphne; everything else to Celery), points it at the exact log paths, and includes a strict rule for the only mutating action it's allowed:

Investigate first. You MAY perform ONE targeted service restart at the end of your investigation IF - and only if - all four conditions hold: (a) the log clearly points at a transient condition a restart would resolve; (b) confidence is HIGH with specific log lines cited; (c) the target is one of the 5 allowed services; (d) the report states what you intend to restart and why, BEFORE you run it.

— from the responder system prompt
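
The check-to-service mapping is simple enough to write down as data. A small sketch, with assumptions labeled: the unit names come from the sudoers allowlist, but pairing ryde_prod.service with gunicorn, sending "everything else" to both Celery units, and the helper names and header wording are guesses for illustration.

```python
WEB_CHECK = "Web App (HTTP)"

# Assumption: ryde_prod.service is the gunicorn unit.
WEB_UNITS = ["ryde_prod.service", "daphne.service"]
# Assumption: "Celery" covers both the worker and beat.
CELERY_UNITS = ["celery_ryderush.service", "celerybeat_ryderush.service"]


def suspect_units(check_name):
    """Map a Healthchecks check name to the units worth inspecting first."""
    return WEB_UNITS if check_name == WEB_CHECK else CELERY_UNITS


def build_prompt_header(down_checks):
    """Render the DOWN checks and their suspect units for the prompt."""
    lines = ["DOWN checks and the units to investigate first:"]
    for name in down_checks:
        lines.append(f"- {name}: {', '.join(suspect_units(name))}")
    return "\n".join(lines)
```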

The output contract is the other half of the discipline. Claude has to emit its report between literal markers:

required output structure
<<<REPORT
# RydeRush Incident Report
## DOWN checks
## Likely root cause
## Evidence
## Action taken
## Recommended next action
REPORT>>>

A short Python regex pulls the block out of stdout. If the markers aren't there - usually because Claude ran out of turns mid-investigation - the report just says so, and I get a pointer to the full transcript on disk.
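
A minimal sketch of that extraction step, assuming the markers sit on their own lines. The function name and the fallback wording are made up for illustration. Taking the last match is a design guess that guards against the model echoing the template before writing the real report.

```python
import re

# Non-greedy body between the literal markers; DOTALL lets "." span lines.
REPORT_RE = re.compile(r"<<<REPORT\s*\n(.*?)\n\s*REPORT>>>", re.DOTALL)


def extract_report(stdout_text, transcript_path):
    """Return the report body, or a fallback notice if markers are missing."""
    matches = REPORT_RE.findall(stdout_text)
    if matches:
        return matches[-1].strip()
    # Hypothetical fallback text: no complete block was emitted.
    return (
        "# RydeRush Incident Report\n"
        "The responder did not reach a conclusion.\n"
        f"Full transcript: {transcript_path}"
    )
```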

Step 4: a report I can act on from my phone

When the responder finishes, it dispatches the report on two channels:

  • Telegram, truncated to ~3,800 characters (their hard limit is 4,096). This is the headline I read in line at the coffee shop.
  • Email via a small Go CLI called gog, which sends through the Gmail API and renders the report as <pre>-formatted HTML.

notification dispatch
# Telegram
curl -s -X POST "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
  --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
  --data-urlencode "text=$TG_BODY_TRIMMED"

# Email - gog reads keyring credentials from $GOG_KEYRING_PASSWORD
~/.local/bin/gog gmail send \
  --to me@example.com \
  --subject "RydeRush DOWN: $DOWN_NAMES" \
  --body-html "$(cat "$HOME/state/incident-body.html")" \
  --account me@example.com

The Telegram is the glanceable headline. The email is the full transcript: log excerpts the agent quoted as evidence, the action it took (or didn't), and the recommended next step - even if it restarted something successfully, the report still has to tell me what the underlying fix should be.
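
The two message bodies need slightly different preparation. A sketch of both, with assumptions labeled: the function names and the truncation notice are hypothetical, and cutting at the last newline before the budget is one way to avoid chopping a line in half.

```python
import html

TELEGRAM_BUDGET = 3800  # stays well under Telegram's 4,096-character limit


def trim_for_telegram(report, budget=TELEGRAM_BUDGET):
    """Shorten a report to fit the budget, cutting at a line boundary."""
    if len(report) <= budget:
        return report
    notice = "\n[truncated - full report sent by email]"
    cut = report.rfind("\n", 0, budget - len(notice))
    if cut <= 0:
        # No newline to cut at: fall back to a hard character cut.
        cut = budget - len(notice)
    return report[:cut] + notice


def render_email_html(report):
    """Wrap the report in <pre>, escaping anything HTML would interpret."""
    return "<pre>" + html.escape(report) + "</pre>"
```

Escaping matters more than it looks: log excerpts are full of angle brackets and ampersands, and an unescaped traceback can swallow half the email.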

What it actually looks like in practice

A recent real-world incident: the Celery check 'RydeRush: Wait Times Update' went red. The monitor pinged me on Telegram within five minutes. The responder spawned. Claude used all 15 turns and hit the cap without emitting a complete REPORT block - so the email I got was a 'did not reach a conclusion' notice with a pointer to /home/claude-responder/logs/incident-20260507-192003.log on the box. Not a triumph, but the right failure mode: it told me where to look instead of pretending it knew the answer.

The wins are smaller and quieter. When a worker dies cleanly with a Python traceback at the top of the journal, Claude finds it in three turns, writes a one-paragraph diagnosis with the exact line number, recommends a restart, and (sometimes) executes it. The first time it correctly chose not to restart - citing logs showing the worker would recover on its own - was the moment I stopped thinking of this as 'automation' and started thinking of it as 'a junior teammate I trust with very small decisions.'

What I'd do differently

A few things I'd change on the next pass:

  • Per-check playbooks. Right now every incident gets the same prompt. Different services have different failure modes; a short per-check addendum would tighten the agent's first guesses.
  • Cross-incident memory. Each run starts cold. If the same Celery task has died for the same reason three times this week, the agent should be the one telling me, not the other way around.
  • A confidence floor for restarts. 'HIGH confidence' is currently judged by the model. I'd like a numeric self-score in the report, and a hard cutoff below which restarts are blocked at the script layer.
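
The last idea is easy to prototype. A sketch of what a script-layer gate could look like, assuming the report carries a line such as "Confidence: 85" and that the script, not the model, would run the restart. The line format, the floor of 80, and the function name are all hypothetical.

```python
import re

# Hypothetical report line: "Confidence: 85" on a line of its own.
CONFIDENCE_RE = re.compile(r"^Confidence:\s*(\d{1,3})\s*$", re.MULTILINE)

ALLOWED_UNITS = {
    "ryde_prod.service",
    "daphne.service",
    "celery_ryderush.service",
    "celerybeat_ryderush.service",
    "nginx.service",
}


def restart_permitted(report, unit, floor=80):
    """Allow a restart only for an allowlisted unit with a high enough score."""
    if unit not in ALLOWED_UNITS:
        return False
    match = CONFIDENCE_RE.search(report)
    if match is None:
        return False  # no score in the report means no restart
    score = int(match.group(1))
    return floor <= score <= 100
```

A missing or out-of-range score fails closed, which matches the rest of the design: when in doubt, the agent reports and I decide.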

The takeaway

The thing I keep coming back to is how small this system is. It's a cron entry, two Bash scripts with a little inline Python (~300 lines total), a JSON state file, a locked-down user, and the Claude Code CLI doing the actual reasoning. There is no platform here. There is no framework. Most of the engineering work was deciding what the agent shouldn't be able to do, and writing those rules down as Linux permissions and CLI flags - not as prompt instructions the model could rationalize its way around.

If you're a solo operator with a server you care about, I'd encourage you to try a version of this. The sandbox-first mindset - give the model a body with narrow powers, then trust the body, not the model - is what makes the difference between 'scary' and 'useful.' And honestly, getting a clean diagnostic Telegram instead of a cryptic Healthchecks email at 2 a.m. is the kind of small dignity worth a weekend project.