
Hosting Operations

Build a Restart Loop Timeline

A service keeps restarting, and you need to separate the first application failure from the supervisor's later retries.

Command

journalctl -u app-worker -b --no-pager -o short-iso | grep -E 'Started|Failed|Scheduled restart|Main process exited'

What changed

Nothing changes on the system. The pipeline only reads the journal and prints a compact timeline of start, exit, failure, and scheduled-restart lines.

Danger

safe

When to use it

Use it when a restart policy such as Restart=on-failure buries the first useful failure under repeated retries.

When not to use it

Do not use this as the only diagnosis. The filter keeps systemd's lifecycle lines and drops the application's own log lines, so read the unfiltered app log lines around the first failure.
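One way to read those adjacent lines is to stop at the first "Failed with result" line and print the few unfiltered lines that led up to it. The sketch below is a minimal helper under stated assumptions: the function name first_failure_context and the default of 5 context lines are hypothetical, and it assumes the journal text arrives on standard input.

```shell
# Hypothetical helper: print up to N unfiltered lines before the first
# "Failed with result" line, plus that line itself, then stop reading.
first_failure_context() {
  awk -v n="${1:-5}" '
    # Keep the last n+1 lines in a small ring buffer.
    { buf[NR % (n + 1)] = $0 }
    /Failed with result/ {
      start = NR - n
      if (start < 1) start = 1
      for (i = start; i <= NR; i++) print buf[i % (n + 1)]
      exit
    }
  '
}

# Possible usage (not run here):
#   journalctl -u app-worker -b --no-pager -o short-iso | first_failure_context 5
```

Because it stops at the first match, later retries never reach the output, which keeps the focus on the earliest failure in the current boot.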

Undo or recovery

No undo needed because the command is read-only.

Expected output

Timestamped service lifecycle lines showing starts, main-process exits, failed results, and scheduled restarts.
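If the lifecycle lines are still too wide to scan, they can be reduced to a timestamp plus a short label. This is a sketch rather than part of the lesson's command: the function name label_timeline and the labels START, EXIT, FAIL, and RESTART are hypothetical, and it assumes short-iso output where the third field is the logging process (for example systemd[1]:).

```shell
# Hypothetical helper: reduce systemd lifecycle lines to "timestamp LABEL detail".
label_timeline() {
  awk '
    # Only consider lines logged by the service manager (PID 1).
    $3 != "systemd[1]:" { next }
    # Take the last field as the detail and drop a trailing period.
    { v = $NF; sub(/\.$/, "", v) }
    / Started /              { print $1, "START"; next }
    /Main process exited/    { print $1, "EXIT", v; next }
    /Failed with result/     { print $1, "FAIL", v; next }
    /Scheduled restart job/  { print $1, "RESTART", v; next }
  '
}

# Possible usage (not run here):
#   journalctl -u app-worker -b --no-pager -o short-iso | label_timeline
```

Lines from other processes, including the application's own errors, are skipped here, so this view complements the full log rather than replacing it.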

demo script

Disposable terminal steps

  1. journalctl -u app-worker -b --no-pager -o short-iso
  2. journalctl -u app-worker -b --no-pager -o short-iso | grep -E 'Started|Failed|Scheduled restart|Main process exited'

simulated output

What it looks like

disposable vessel
::fixture-ready::
$ journalctl -u app-worker -b --no-pager -o short-iso
2026-06-25T14:20:58-05:00 vps systemd[1]: Started app-worker.service - Background job worker.
2026-06-25T14:20:58-05:00 vps worker[2081]: loading /etc/app/worker.env
2026-06-25T14:20:58-05:00 vps worker[2081]: ERROR redis connection refused at 127.0.0.1:6379
2026-06-25T14:20:59-05:00 vps systemd[1]: app-worker.service: Failed with result 'exit-code'.
2026-06-25T14:21:04-05:00 vps systemd[1]: app-worker.service: Scheduled restart job, restart counter is at 4.
2026-06-25T14:22:17-05:00 vps systemd[1]: Started app-worker.service - Background job worker.
2026-06-25T14:22:17-05:00 vps systemd[2144]: app-worker.service: Failed to determine user credentials: No such process
2026-06-25T14:22:17-05:00 vps systemd[2144]: app-worker.service: Failed at step USER spawning /srv/app/bin/worker: No such process
2026-06-25T14:22:17-05:00 vps systemd[1]: app-worker.service: Main process exited, code=exited, status=217/USER
2026-06-25T14:22:17-05:00 vps systemd[1]: app-worker.service: Failed with result 'exit-code'.
::exit-code::0
$ journalctl -u app-worker -b --no-pager -o short-iso | grep -E 'Started|Failed|Scheduled restart|Main process exited'
2026-06-25T14:20:58-05:00 vps systemd[1]: Started app-worker.service - Background job worker.
2026-06-25T14:20:59-05:00 vps systemd[1]: app-worker.service: Failed with result 'exit-code'.
2026-06-25T14:21:04-05:00 vps systemd[1]: app-worker.service: Scheduled restart job, restart counter is at 4.
2026-06-25T14:22:17-05:00 vps systemd[1]: Started app-worker.service - Background job worker.
2026-06-25T14:22:17-05:00 vps systemd[2144]: app-worker.service: Failed to determine user credentials: No such process
2026-06-25T14:22:17-05:00 vps systemd[2144]: app-worker.service: Failed at step USER spawning /srv/app/bin/worker: No such process
2026-06-25T14:22:17-05:00 vps systemd[1]: app-worker.service: Main process exited, code=exited, status=217/USER
2026-06-25T14:22:17-05:00 vps systemd[1]: app-worker.service: Failed with result 'exit-code'.
::exit-code::0
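
For a quick sense of how long the loop has been running, the same journal text can be tallied into one summary line. The helper below is a sketch under assumptions: the name summarize_restarts and the output field names are hypothetical, and it relies on the message wording shown in the simulated output above.

```shell
# Hypothetical helper: count starts and failures, and report the most
# recent restart counter seen in the journal text on standard input.
summarize_restarts() {
  awk '
    /systemd\[1\]: Started /  { starts++ }
    /Failed with result/      { failures++ }
    /restart counter is at/   { c = $NF; sub(/\.$/, "", c); counter = c }
    END {
      if (counter == "") counter = "n/a"
      printf "starts=%d failures=%d last_restart_counter=%s\n", starts, failures, counter
    }
  '
}

# Possible usage (not run here):
#   journalctl -u app-worker -b --no-pager -o short-iso | summarize_restarts
```

A restart counter that is higher than the number of starts visible in the output suggests that earlier retries happened before the lines you are looking at.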

YouTube Short

Make the restart loop visible.

When systemd retries a service, line up the starts, exits, failures, and restart counters before blaming the latest line.

LinkedIn hook

Restart loops make more sense when you line up starts, failures, and counters.

Question: When a service is flapping, how do you find the first useful failure?

experiments

A/B tests to run

Metric: save_rate

A: Line up starts and failures.

B: Restart loops hide the first clue.