List Newest Source Files Before Backup
Before trusting a backup, know which files changed most recently.
find source -type f -printf '%TY-%Tm-%Td %TH:%TM %p\n' | sort
A file list says what exists; checksums say whether bytes match.
sha256sum source/app/config.yml source/content/index.md source/content/about.md source/assets/logo.svg
A checksum file is only useful if you actually verify it.
sha256sum -c checksums.sha256
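One way to get a checksum file worth verifying is to record every file under the source tree with relative paths, then check a copy against that list. This is a minimal sketch under assumptions: write_checksums and verify_copy are hypothetical helper names, GNU find, sort, xargs, and sha256sum are available, and the checksum file is kept outside the tree being hashed.

```shell
# Hypothetical helpers, not part of any existing tool.

# Record a checksum for every file under $1, with paths relative to $1.
# Keep the output file ($2) outside the tree so it does not hash itself.
write_checksums() {
  local src=$1 out=$2
  (cd "$src" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum) > "$out"
}

# Verify a copy ($1) against a checksum list ($2).
# --quiet prints only failing files; the exit status is non-zero on any mismatch.
verify_copy() {
  local copy=$1 sums=$2
  (cd "$copy" && sha256sum --quiet -c -) < "$sums"
}

# Example usage (paths are illustrative):
#   write_checksums source checksums.sha256
#   verify_copy backup checksums.sha256 && echo "backup matches"
```

Because the paths are relative, the same list can be checked against a backup, a restore sandbox, or the source itself.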
A backup can be missing files and still look plausible at a glance.
comm -3 <(find source -type f | sed 's#^source/##' | sort) <(find backup -type f | sed 's#^backup/##' | sort)
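The two-column comm -3 output separates its columns only by a leading tab, which is easy to misread. A possible variant labels each line instead. This is a sketch, assuming GNU find and coreutils; compare_trees and the two labels are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: label files that exist on only one side.
compare_trees() {
  local src=$1 dst=$2 a b
  a=$(mktemp) || return 2
  b=$(mktemp) || return 2
  # List both trees with paths relative to their roots, sorted the same way.
  (cd "$src" && find . -type f | LC_ALL=C sort) > "$a"
  (cd "$dst" && find . -type f | LC_ALL=C sort) > "$b"
  # comm -23: lines only in the first list; comm -13: lines only in the second.
  LC_ALL=C comm -23 "$a" "$b" | sed 's#^\./#missing-from-backup: #'
  LC_ALL=C comm -13 "$a" "$b" | sed 's#^\./#only-in-backup: #'
  rm -f "$a" "$b"
}

# Example usage (paths are illustrative):
#   compare_trees source backup
```

No output means both trees contain the same set of paths; it says nothing about whether the bytes match.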
Rsync can tell you what would change before it changes anything.
rsync -ain --delete source/ backup/
Zero-byte files can be normal, or they can be failed writes.
find backup -type f -size 0 -print
Large backup files are where storage surprises usually start.
find backup -type f -printf '%s %p\n' | sort -nr | head
Files newer than the last snapshot are the ones most likely missing from it.
find source -type f -newer backup/.snapshot -print | sort
A restore drill starts by proving which backups actually exist.
cd restore-dr && find backups -maxdepth 2 -type f -name MANIFEST.txt -printf '%TY-%Tm-%Td %TH:%TM %h\n' | sort -r
The manifest should say what backup you are about to trust.
cd restore-dr && cat backups/2026-06-25/MANIFEST.txt
You can inspect a tar backup before it writes a single file.
cd restore-dr && tar -tf backups/2026-06-25/site.tar | sed 's#^\./##' | sort
The fastest failed restore drill is the one that finds missing critical files early.
cd restore-dr && tar -tf backups/2026-06-24/site.tar | sed 's#^\./##' | sort | comm -23 required-files.txt -
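For a scripted drill, the same required-file check can return an exit status so the drill stops early. The sketch below is one way to do that, not a definitive implementation: missing_required is a hypothetical name, and it assumes the required list holds one relative path per line.

```shell
# Hypothetical helper: print required paths that the archive lacks.
# Returns 0 when nothing is missing, 1 when at least one path is missing.
missing_required() {
  local archive=$1 required=$2 sorted missing
  sorted=$(mktemp) || return 2
  # comm needs both inputs sorted identically, so sort both under the C locale.
  LC_ALL=C sort -u "$required" > "$sorted"
  missing=$(tar -tf "$archive" | sed 's#^\./##' | LC_ALL=C sort -u |
    LC_ALL=C comm -23 "$sorted" -)
  rm -f "$sorted"
  [ -z "$missing" ] && return 0
  printf '%s\n' "$missing"
  return 1
}

# Example usage (paths are illustrative):
#   missing_required backups/2026-06-24/site.tar required-files.txt || echo "drill failed"
```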
A restore drill should write to a sandbox, not production.
cd restore-dr && rm -rf restore-sandbox/full && mkdir -p restore-sandbox/full && tar -xf backups/2026-06-25/site.tar -C restore-sandbox/full
A restore is not validated until the bytes match.
cd restore-dr && rm -rf restore-sandbox/full && mkdir -p restore-sandbox/full && tar -xf backups/2026-06-25/site.tar -C restore-sandbox/full && (cd restore-sandbox/full && sha256sum -c CHECKSUMS.sha256)
A restored config can exist and still be the wrong config.
cd restore-dr && rm -rf restore-sandbox/full && mkdir -p restore-sandbox/full && tar -xf backups/2026-06-25/site.tar -C restore-sandbox/full && diff -u expected/app/config.yml restore-sandbox/full/app/config.yml
A successful extraction still needs a required-file check.
cd restore-dr && rm -rf restore-sandbox/full && mkdir -p restore-sandbox/full && tar -xf backups/2026-06-25/site.tar -C restore-sandbox/full && find restore-sandbox/full -type f | sed 's#^restore-sandbox/full/##' | sort | comm -23 required-files.txt -
Permissions are part of the restore, not decoration.
cd restore-dr && tar -tvf backups/2026-06-25/site.tar | awk '/secrets.env|deploy.sh/ {print $1, $6}'
A restore drill that leaves no evidence is hard to trust later.
cd restore-dr && grep -E 'status=|rpo_minutes=|rto_seconds=|checksum=|file_count=' reports/restore-dr-2026-06-25.txt
The failing file is usually one of the newest artifacts.
find artifacts logs -type f \( -name '*.log' -o -name '*.txt' \) -printf '%TY-%Tm-%Td %TH:%TM %p\n' | sort -r | head
One grep pass can turn a log pile into a failure list.
grep -RInE 'error|failed|failure|exception|traceback' artifacts logs | head -50
The line before the error often explains the error.
grep -RInC 3 -m 1 'ERROR' artifacts logs
The XML report already knows which tests failed.
grep -RIn '
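One possible way to pull failed test names out of a JUnit-style report is a small awk pass. This is a sketch under assumptions: the report uses testcase elements with a name attribute and nested failure or error elements, each opening tag sits on one line, and failed_tests is a hypothetical helper name. A real XML parser is more robust for reports that do not follow that layout.

```shell
# Hypothetical helper: print "file: test name" for each failure or error element.
failed_tests() {
  awk '
    # Remember the most recent test name. The leading space in the pattern
    # keeps classname="..." from being mistaken for name="...".
    /<testcase / {
      if (match($0, / name="[^"]*"/)) name = substr($0, RSTART + 7, RLENGTH - 8)
    }
    # A failure or error element belongs to the test case seen last.
    /<(failure|error)[ >\/]/ { print FILENAME ": " name }
  ' "$@"
}

# Example usage (path is illustrative):
#   failed_tests artifacts/test/*.xml
```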
Before debugging a test failure, measure the blast radius.
grep -RhoE 'tests="[0-9]+"|failures="[0-9]+"|errors="[0-9]+"|skipped="[0-9]+"' artifacts/test/*.xml | sort | uniq -c
Coverage failures usually say the threshold out loud.
grep -RInE 'coverage|threshold|minimum|below' artifacts logs
A bloated artifact can explain a slow or failed pipeline.
find artifacts -type f -printf '%s %p\n' | sort -nr | head -10
The deploy failed because the build never produced the file.
find artifacts/dist -maxdepth 2 -type f | sort
Artifacts are public more often than you think.
grep -RInE 'AWS_ACCESS_KEY|SECRET|TOKEN|PRIVATE KEY|PASSWORD' artifacts logs | head -50
A green retry can still hide a flaky test.
grep -RInE 'rerun|retry|flaky|passed on retry|failed attempt' artifacts logs
The database was running, but it was not ready.
pg_isready -h 127.0.0.1 -p 5432
The database was not down. Its connection slots were full.
psql -X -A -F '|' -c "select pid,usename,datname,state,client_addr from pg_stat_activity order by state, pid;"
One query can make the whole app look broken.
psql -X -c "select pid, now() - query_start as age, state, left(query, 80) as query from pg_stat_activity where query_start is not null order by age desc limit 10;"
The outage was a queue, not a crash.
psql -X -c "select pid, wait_event_type, wait_event, state, left(query, 80) as query from pg_stat_activity where wait_event_type is not null order by pid;"
Disk pressure starts with knowing what grew.
psql -X -c "select datname, pg_size_pretty(pg_database_size(datname)) as size from pg_database order by pg_database_size(datname) desc;"
The port was open. MySQL still had to answer.
mysqladmin ping -h 127.0.0.1 -P 3306
The app was waiting behind busy sessions.
mysql -e "show full processlist;"
One old query explained the whole slowdown.
mysql -e "select id,user,host,db,command,time,state,left(info,80) as info from information_schema.processlist where command <> 'Sleep' order by time desc limit 10;"
The storage alert needed a database name.
mysql -e "select table_schema, round(sum(data_length + index_length)/1024/1024, 1) as mb from information_schema.tables group by table_schema order by mb desc;"
Skip the full CI log and jump straight to lines that usually explain the failure.
grep -RInE 'error|failed|exception|traceback|fatal' logs/ | tail -50
Confirm what your pipeline actually produced before you deploy it.
find artifacts/ -type f -printf '%TY-%Tm-%Td %TH:%TM %10s %p\n' | sort | tail -20
See your newest release directories without opening a dashboard.
find releases/ -mindepth 1 -maxdepth 1 -type d -printf '%T@ %TY-%Tm-%Td %TH:%TM %p\n' | sort -nr | head -10 | cut -d' ' -f2-
Verify two artifact copies match before blaming deployment code.
sha256sum artifacts/app.tar.gz releases/current/app.tar.gz
Find the image tags your deployment files reference without printing env values.
grep -RhoE 'image:[[:space:]]*[^[:space:]]+' deploy/ | sort -u
No space left can mean full bytes, full inodes, or both.
df -h /lab/disk-inode-cleanup && df -ih /lab/disk-inode-cleanup
A cleanup scan should not wander into mounted backups or network storage.
du -xh --max-depth=1 /lab/disk-inode-cleanup/var 2>/dev/null | sort -h
The safe version of cleanup is a candidate list first.
find /lab/disk-inode-cleanup/var/tmp/uploads -xdev -type f -mtime +7 -printf '%TY-%Tm-%Td %10s %p\n' | sort
Inode cleanup starts by finding the directory with too many files.
find /lab/disk-inode-cleanup/var/cache/app -xdev -type f -printf '%h\n' | sort | uniq -c | sort -nr | head
Empty directories are low-risk candidates, but they still deserve a preview.
find /lab/disk-inode-cleanup/var/cache/app -xdev -depth -type d -empty -print
Release cleanup should prove what current points to before listing old directories.
current=$(readlink -f /lab/disk-inode-cleanup/home/deploy/current); find /lab/disk-inode-cleanup/home/deploy/releases -mindepth 1 -maxdepth 1 -type d ! -samefile "$current" -printf '%TY-%Tm-%Td %p\n' | sort
The oldest file is not always the file that buys back meaningful space.
find /lab/disk-inode-cleanup/var -xdev -type f -mtime +7 -printf '%s %TY-%Tm-%Td %p\n' | sort -nr | head
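Before deleting anything, it can help to know how many bytes a candidate list would actually free. The sketch below totals the sizes of files older than a given age. candidate_bytes is a hypothetical name, and the sketch assumes GNU find for -printf.

```shell
# Hypothetical helper: total size in bytes of files under $1 older than $2 days.
candidate_bytes() {
  local dir=$1 days=$2
  # -xdev keeps the scan on one filesystem; %s is the file size in bytes.
  find "$dir" -xdev -type f -mtime +"$days" -printf '%s\n' |
    awk '{total += $1} END {print total + 0}'
}

# Example usage (path is illustrative):
#   candidate_bytes /lab/disk-inode-cleanup/var 7
```

The total is apparent file size, not blocks on disk, so sparse or hard-linked files can make the real savings differ.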
Before truncating logs, prove which log files are large and how old they are.
find /lab/disk-inode-cleanup/var/log -xdev -type f -printf '%10s %TY-%Tm-%Td %p\n' | sort -nr
Cache cleanup is safer when you know whether files are stale or still active.
find /lab/disk-inode-cleanup/var/cache/app -xdev -type f -printf '%TY-%Tm-%Td\n' | sort | uniq -c
Turn noisy docker ps output into the few fields operators scan first.
docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}\t{{.Ports}}'
Docker may say a container is running while its health check says otherwise.
docker inspect --format '{{.Name}} health={{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}} status={{.State.Status}}' web
Get Docker resource usage once, without leaving a live dashboard running.
docker stats --no-stream --format 'table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.NetIO}}\t{{.BlockIO}}'
See how Docker storage is split across images, containers, volumes, and cache.
docker system df -v
A container can be healthy and still attached to the wrong network.
docker inspect --format '{{.Name}} {{range $name, $net := .NetworkSettings.Networks}}{{$name}} {{$net.IPAddress}} {{end}}' api
Before rollback commands, capture the branch and dirty files.
cd /lab/git-recovery-rollback && git status --short --branch
A rollback is easier when the last few release tags are visible.
cd /lab/git-recovery-rollback && git log --oneline --decorate --graph --all -8
Compare the suspect release against the last known-good tag.
cd /lab/git-recovery-rollback && git diff --name-status release-2026-06-25-1000..HEAD
A reset does not mean the commit vanished.
cd /lab/git-recovery-rollback && git reflog --date=iso --format='%h %gd %gs' -6
Put a name on the reflog commit before it slips away.
cd /lab/git-recovery-rollback && git branch recovered-incident-note HEAD@{1}
Recover a config file without rolling back the whole branch.
cd /lab/git-recovery-rollback && git restore --source=release-2026-06-25-1000 -- app/config.yml
Git may say one thing while the release pointer serves another.
cd /lab/git-recovery-rollback && readlink releases/current && cat releases/current/VERSION
Practice the pointer switch where the blast radius is zero.
cd /lab/git-recovery-rollback && ln -sfn 2026-06-25-1000 releases/current
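A pointer switch is safer when it refuses to point at a release that does not exist. The following is a minimal sketch of that guard; switch_release is a hypothetical name, and it assumes releases live as sibling directories next to the current symlink.

```shell
# Hypothetical helper: repoint <releases>/current at an existing release directory.
switch_release() {
  local releases=$1 target=$2
  # Refuse to create a dangling pointer.
  [ -d "$releases/$target" ] || { echo "no such release: $target" >&2; return 1; }
  # -n replaces the symlink itself instead of descending into the old target.
  ln -sfn "$target" "$releases/current" && readlink "$releases/current"
}

# Example usage (paths are illustrative):
#   switch_release releases 2026-06-25-1000
```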
Show the exact file changes before moving the branch back.
cd /lab/git-recovery-rollback && git diff --stat HEAD..release-2026-06-25-1000
Undo a bad release with a new commit instead of rewriting history.
cd /lab/git-recovery-rollback && git restore -- app/config.yml && git revert --no-edit release-2026-06-25-1030
The config looked fine. Nginx disagreed before reload broke anything.
nginx -t
The config existed, but it was not enabled.
ls -l /etc/nginx/sites-enabled/
The wrong server block was answering the domain.
grep -R "server_name" /etc/nginx/sites-enabled/
HTTPS worked. The plain HTTP redirect still mattered.
curl -I http://example.com
The page loaded, but the headers told the operational story.
curl -sI https://example.com
The site was fine. The domain was pointed somewhere else.
dig +short example.com A
The certificate existed. The question was which domains it covered.
certbot certificates
The deploy finished. The symlink told me what was actually live.
readlink -f /srv/www/example.com/current
The missing file was not random. The access log had a pattern.
awk '$9==404 {print $7}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head
LinkedIn traffic was not a guess. The referrer field showed it.
awk -F'"' '{print $4}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head
Start with severity counts before opening every log line. The awk step counts the first word of each message, so it assumes messages begin with their level.
journalctl -p warning..alert --since "2 hours ago" --no-pager -o short-iso | awk '{count[$4]++} END {for (level in count) print count[level], level}' | sort -nr
A noisy incident usually has a noisy source.
journalctl -p err..alert --since "2 hours ago" --no-pager -o short-iso | awk '{split($3,a,"["); unit=a[1]; count[unit]++} END {for (u in count) print count[u], u}' | sort -nr
Timeline beats guesswork when several failures happen close together.
journalctl -p err..alert --since "2 hours ago" --no-pager -o short-iso | awk '{print $1, $3, $4, substr($0,index($0,$5))}'
A minute-by-minute count shows whether an incident is a spike or a drip.
awk 'tolower($0) ~ /(error|fatal|timeout|exception)/ {minute=substr($1,1,16); count[minute]++} END {for (m in count) print count[m], m}' fixtures/incidents/app.log | sort -nr
Repeated request IDs can connect separate error lines to one failing path.
grep -Ei 'error|timeout|fatal|exception' fixtures/incidents/app.log | awk '{for (i=1;i<=NF;i++) if ($i ~ /^request_id=/) print $i}' | sort | uniq -c | sort -nr
Deploys and restarts are incident landmarks.
grep -Eh 'deploy|release|restart|started|stopped|rolled back' fixtures/incidents/*.log | sort
The biggest log is not always right, but it is worth knowing.
wc -l fixtures/incidents/*.log | sort -nr
A release file that someone besides the owner can modify deserves a second look.
find fixtures/perm-audit/releases/2026-06-25 -type f -perm /0022 -printf '%M %u:%g %p\n' | sort
Runtime directories often need writes, but the write boundary should be visible.
find fixtures/perm-audit/releases/2026-06-25/storage fixtures/perm-audit/releases/2026-06-25/uploads -type d -perm /0022 -printf '%M %u:%g %p\n' | sort
Group-writable files are not automatically wrong, but the owning group decides the risk.
find fixtures/perm-audit -type f -perm -0020 -printf '%g %M %p\n' | sort
A symlink can make the path you audited different from the file the app opens.
find fixtures/perm-audit -type l -printf '%p -> %l\n' -exec namei -l {} \;
A server feels slow, but you need proof before restarting anything.
ps -eo pid,ppid,stat,pcpu,pmem,comm,args --sort=-pcpu | head -n 10
Memory pressure can look like a slow app, a stuck deploy, or random crashes.
ps -eo pid,ppid,stat,pcpu,pmem,rss,comm,args --sort=-pmem | head -n 10
Sometimes the disk has free bytes but still cannot create files.
df -ih
A file can be deleted but still occupy disk while a process holds it open.
lsof +L1
A job can be nowhere in your crontab and still run every night.
find /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly /etc/cron.monthly -maxdepth 1 -type f -print 2>/dev/null | sort
A silent cron job is a future incident with no witness.
crontab -l | awk 'NF && $1 !~ /^#/ && $0 !~ /(>>|2>|logger|mail)/ {print}'
A dot in a filename can keep a cron.daily script from running.
run-parts --test /etc/cron.daily
The suspicious timer is the one with no next run.
systemctl list-timers --all --no-pager --plain | awk 'NR==1 || $1=="n/a" || /backup\.timer|logrotate\.timer/'
When a timer fires, the useful logs are usually on the service.
journalctl -u backup.service -n 20 --no-pager
Logrotate can explain its plan without rotating anything.
logrotate -d /etc/logrotate.conf 2>&1 | sed -n '/rotating pattern/p;/considering log/p;/error:/p'
The biggest log risk is often the file no policy mentions.
find /var/log -type f -name '*.log' -printf '%p\n' | while read -r log; do grep -Rqs -- "$log" /etc/logrotate.conf /etc/logrotate.d || grep -Rqs -- "$(dirname "$log")/[*].log" /etc/logrotate.conf /etc/logrotate.d || printf '%s\n' "$log"; done
A failed query is often just a wrong assumption about column names.
sqlite3 app.db ".schema users"
When a SQLite-backed app behaves strangely, first rule out file corruption.
sqlite3 app.db "PRAGMA integrity_check;"
System metadata tables can distract from the app tables you care about.
sqlite3 app.db "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"
A quick row count can reveal empty imports, runaway events, or missing data.
sqlite3 app.db "SELECT 'users', count(*) FROM users UNION ALL SELECT 'orders', count(*) FROM orders UNION ALL SELECT 'events', count(*) FROM events;"
Slow lookups often start with missing or misunderstood indexes.
sqlite3 app.db "PRAGMA index_list('orders');"
For small apps, the quickest timeline may be inside the SQLite file.
sqlite3 app.db "SELECT created_at, event_type FROM events ORDER BY created_at DESC LIMIT 5;"
A noisy event type stands out faster when you group it.
sqlite3 app.db "SELECT event_type, count(*) FROM events GROUP BY event_type ORDER BY count(*) DESC;"
Duplicate account data is easier to spot with one grouped query.
sqlite3 app.db "SELECT email, count(*) FROM users GROUP BY email HAVING count(*) > 1;"
Copying a live SQLite file blindly can produce a bad backup.
sqlite3 app.db ".backup backup/app.db"
Duplicate titles make a static site harder to scan in search results and browser tabs.
grep -Rho --include='*.html' '<title>[^<]*</title>' public | sed 's#<title>##;s#</title>##' | sort | uniq -c | sort -nr
Canonical tags are easy to drop when templates branch.
find public -name '*.html' -print | while read -r f; do grep -qi 'rel="canonical"' "$f" || echo "$f"; done
A leftover noindex can hide a page after launch.
grep -Rni --include='*.html' 'noindex' public
Missing descriptions are usually a content template problem, not a mystery.
find public -name '*.html' -print | while read -r f; do grep -qi 'name="description"' "$f" || echo "$f"; done
A sitemap can exist and still be hard to discover.
grep -n '^Sitemap:' public/robots.txt
A page can exist in the build but never make it into the sitemap.
find public -name '*.html' -print | sed 's#^public#https://example.com#' | while read -r url; do grep -q "$url" public/sitemap.xml || echo "$url"; done
Social previews often fail because one template missed Open Graph tags.
find public -name '*.html' -print | while read -r f; do grep -qi 'property="og:title"' "$f" || echo "$f"; done
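The canonical, description, and Open Graph checks above share one shape: list the pages that do not contain a given string. A possible generalization is sketched below. pages_missing is a hypothetical name, and the match is a case-insensitive fixed string, not real HTML parsing.

```shell
# Hypothetical helper: list HTML files under $1 that do not contain string $2.
pages_missing() {
  local dir=$1 needle=$2
  # grep -L prints files with no match; -i ignores case; -F treats the needle literally.
  find "$dir" -type f -name '*.html' -exec grep -LiF -- "$needle" {} + | sort
}

# Example usage (directory is illustrative):
#   pages_missing public 'rel="canonical"'
#   pages_missing public 'name="description"'
#   pages_missing public 'property="og:title"'
```

Letting find pass many files to one grep also avoids the per-file loop and copes with spaces in filenames.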
Your feed can advertise URLs that the sitemap never lists.
grep -o 'https://example.com/[^<]*' public/feed.xml | while read -r url; do grep -qF -- "$url" public/sitemap.xml || echo "$url"; done
Filter a failed unit's journal to the lines most likely to explain the stop.
journalctl -u app-worker -b -p warning..alert --no-pager -n 80
Restart loops make more sense when you line up starts, failures, and counters.
journalctl -u app-worker -b --no-pager -o short-iso | grep -E 'Started|Failed|Scheduled restart|Main process exited'
Confirm the user, working directory, env file, and ExecStart systemd is actually using.
systemctl show app-worker --property=FragmentPath,DropInPaths,EnvironmentFiles,ExecStart,User,WorkingDirectory --no-pager
Sometimes the service is only the messenger for a failed dependency.
systemctl list-dependencies app-worker --failed --no-pager
The first failure line is often more useful than the last restart message.
journalctl -u app-worker -b --no-pager -o short-iso | grep -m1 -E 'ERROR|Failed|status='
The site was configured, but the port was not.
grep -RInE '^[[:space:]]*listen[[:space:]]' fixtures/nginx/conf.d fixtures/nginx/sites-enabled
The wrong site answered because it was the fallback.
grep -RIn 'default_server' fixtures/nginx/conf.d fixtures/nginx/sites-enabled
The config was valid; it just was not included.
grep -RInE '^[[:space:]]*include[[:space:]]' fixtures/nginx/nginx.conf fixtures/nginx/conf.d fixtures/nginx/sites-enabled
The URL was right. The filesystem path was not.
grep -RInE '^[[:space:]]*(root|alias)[[:space:]]' fixtures/nginx/conf.d fixtures/nginx/sites-enabled
Nginx was healthy. It was proxying to the wrong place.
grep -RInE '^[[:space:]]*proxy_pass[[:space:]]' fixtures/nginx/conf.d fixtures/nginx/sites-enabled
The Apache config existed. The enabled symlink did not.
find fixtures/apache/sites-enabled -maxdepth 1 -type l -printf '%f -> %l\n' | sort
Apache chose a virtual host. You need to know which one.
grep -RInE '
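To see which virtual hosts are candidates, one approach is to list the VirtualHost, ServerName, and ServerAlias lines together with file names and line numbers. This is only a sketch: list_vhosts is a hypothetical name, and the pattern assumes one directive per line.

```shell
# Hypothetical helper: show vhost-defining directives in the given files or directories.
list_vhosts() {
  # -R follows symlinks, which matters because sites-enabled entries are usually links.
  grep -RInE '^[[:space:]]*(<VirtualHost|ServerName|ServerAlias)[[:space:]]' "$@"
}

# Example usage (path is illustrative):
#   list_vhosts fixtures/apache/sites-enabled
```

On a live server, Apache can report its own parsed view of the virtual hosts, which is more authoritative than a text search.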
Apache was serving files from a different directory than expected.
grep -RInE '^[[:space:]]*DocumentRoot[[:space:]]' fixtures/apache/sites-enabled
Apache was up. The reverse proxy target was wrong.
grep -RInE '^[[:space:]]*(ProxyPass|ProxyPassReverse)[[:space:]]' fixtures/apache/sites-enabled
The redirect loop was hiding in plain text.
grep -RInE 'return[[:space:]]+30[18]|rewrite[[:space:]]|Redirect[[:space:]]|RewriteRule|RewriteCond' fixtures/nginx fixtures/apache
Before chasing individual lines, get the shape of the whole log.
awk '{count[$9]++} END {for (code in count) print count[code], code}' ./fixtures/nginx/access.log | sort -nr
A 500 spike is easier to triage when the broken path is obvious.
awk '$9 ~ /^5/ {count[$7]++} END {for (path in count) print count[path], path}' ./fixtures/nginx/access.log | sort -nr | head
A few huge responses can explain bandwidth, latency, and suspicious download patterns.
awk '$10 ~ /^[0-9]+$/ && $10 > 1000000 {print $10, $1, $7, $9}' ./fixtures/nginx/access.log | sort -nr | head