Nobody thinks about disaster recovery until the disaster happens. Then it's the only thing that matters. If your OpenClaw (Clawdbot) instance handles customer service, financial alerts, or any workflow where downtime has real consequences, you need a DR plan — and more importantly, you need to test it before you need it.
This guide walks through building a practical disaster recovery strategy for OpenClaw deployments on Tencent Cloud Lighthouse, including runbooks and drill procedures you can execute today.
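Before building anything, define two critical numbers: the Recovery Time Objective (RTO), the maximum downtime you can tolerate, and the Recovery Point Objective (RPO), the maximum amount of data, measured in time, that you can afford to lose. Typical targets by use case: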
| Use Case | Typical RTO | Typical RPO | Backup Strategy |
|---|---|---|---|
| Personal bot | 24h | 24h | Daily snapshots |
| Team/community bot | 4h | 12h | Twice-daily snapshots |
| Customer service | 1h | 1h | Hourly data backup + snapshots |
| Financial alerts | 15min | Near-zero | Real-time replication + hot standby |
Tencent Cloud Lighthouse provides instance-level snapshots — full disk images that capture your entire server state. This is your first line of defense.
# Create a snapshot via Tencent Cloud CLI
tccli lighthouse CreateInstanceSnapshot \
    --InstanceId lhins-xxxxxxxx \
    --SnapshotName "openclaw-pre-update-$(date +%Y%m%d)"
Take a snapshot before every update, and also on a regular cadence that matches your RPO.
Snapshots are stored independently from your instance. Even if the instance is completely destroyed, you can spin up a new one from the snapshot.
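One way to put snapshots on a schedule is a small wrapper around the same `tccli` call, run from cron. The following is a minimal sketch, not a definitive implementation: the script name, the `INSTANCE_ID` placeholder, the name prefix, and the `DRY_RUN` switch are all assumptions made for illustration.

```shell
#!/bin/bash
# snapshot.sh: hypothetical wrapper around the CreateInstanceSnapshot call shown above.
# INSTANCE_ID is a placeholder; DRY_RUN is an illustrative switch, not a tccli feature.

INSTANCE_ID="${INSTANCE_ID:-lhins-xxxxxxxx}"

# Build a snapshot name such as "openclaw-daily-20260115".
snapshot_name() {
    local prefix="$1"
    echo "openclaw-${prefix}-$(date +%Y%m%d)"
}

# Create the snapshot, or only print the command when DRY_RUN=1.
create_snapshot() {
    local name
    name="$(snapshot_name "$1")"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "tccli lighthouse CreateInstanceSnapshot --InstanceId $INSTANCE_ID --SnapshotName $name"
    else
        tccli lighthouse CreateInstanceSnapshot \
            --InstanceId "$INSTANCE_ID" \
            --SnapshotName "$name"
    fi
}

# Run only when a prefix is passed, e.g. ./snapshot.sh daily
if [ "$#" -gt 0 ]; then
    create_snapshot "$1"
fi

# Example cron entry (daily at 03:00):
# 0 3 * * * /opt/scripts/snapshot.sh daily >> /var/log/openclaw-snapshot.log 2>&1
```

If your account has a snapshot quota, old snapshots need to be pruned before new ones can be created, so a scheduled job like this should be paired with a cleanup step.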
Snapshots capture everything, but they're coarse-grained. For faster, more granular recovery, also back up OpenClaw-specific data:
#!/bin/bash
# openclaw-backup.sh: run via cron
set -euo pipefail
BACKUP_DIR="/opt/backups/openclaw"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OPENCLAW_DIR="/opt/openclaw"
mkdir -p "$BACKUP_DIR"
# Back up configuration (.env may hold secrets, so restrict permissions)
cp "$OPENCLAW_DIR/.env" "$BACKUP_DIR/env_$TIMESTAMP"
chmod 600 "$BACKUP_DIR/env_$TIMESTAMP"
# Back up conversation data and skills
tar -czf "$BACKUP_DIR/data_$TIMESTAMP.tar.gz" \
    "$OPENCLAW_DIR/data/" \
    "$OPENCLAW_DIR/skills/" \
    "$OPENCLAW_DIR/config/"
# Retain only the last 7 days of backups (archives and .env copies)
find "$BACKUP_DIR" \( -name "data_*.tar.gz" -o -name "env_*" \) -mtime +7 -delete
echo "Backup completed: $TIMESTAMP"
Add to cron:
# Run every 6 hours
0 */6 * * * /opt/scripts/openclaw-backup.sh >> /var/log/openclaw-backup.log 2>&1
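The cron interval bounds your effective RPO: with a run every 6 hours, up to 6 hours of data can be lost, and more if a run silently fails. A freshness check makes that visible. The sketch below assumes the `data_*.tar.gz` naming from the backup script; the function names and the 6-hour threshold are illustrative.

```shell
#!/bin/bash
# check-backup-age.sh: hypothetical freshness check; names and thresholds are illustrative.

# Print the age in seconds of the newest data_*.tar.gz in a directory.
# Returns 1 if no archive exists.
newest_backup_age() {
    local dir="$1" newest now mtime
    newest="$(ls -t "$dir"/data_*.tar.gz 2>/dev/null | head -n 1)"
    [ -n "$newest" ] || return 1
    now="$(date +%s)"
    mtime="$(stat -c %Y "$newest")"   # GNU stat; on BSD/macOS use: stat -f %m
    echo $(( now - mtime ))
}

# Succeed only if the newest backup is younger than the RPO (in seconds).
backup_within_rpo() {
    local dir="$1" rpo_seconds="$2" age
    age="$(newest_backup_age "$dir")" || return 1
    [ "$age" -le "$rpo_seconds" ]
}

# Example: warn when the newest backup is older than 6 hours (21600 s).
# backup_within_rpo /opt/backups/openclaw 21600 || echo "WARNING: backup older than RPO"
```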
Backups on the same server as the application are useless if the server dies. Always copy backups to a separate location:
# Sync backups to Tencent Cloud Object Storage (COS)
coscli sync /opt/backups/openclaw/ cos://your-bucket/openclaw-backups/ -r
This ensures you can recover even from a complete instance loss.
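An off-instance copy only helps if the archive is readable. A possible safeguard is to test each archive before uploading. This is a sketch under assumptions: the function names are hypothetical, the bucket path is the placeholder from the sync command above, and `-r` is used so `coscli` recurses into the directory.

```shell
#!/bin/bash
# verify-and-sync.sh: hypothetical pre-upload check; bucket path is a placeholder.

# Succeed only if the archive is non-empty and tar can list it end to end.
verify_archive() {
    local archive="$1"
    [ -s "$archive" ] && tar -tzf "$archive" > /dev/null 2>&1
}

# Verify every archive in the directory; upload only if all of them are readable.
verify_and_sync() {
    local dir="$1" archive
    for archive in "$dir"/data_*.tar.gz; do
        [ -e "$archive" ] || continue
        if ! verify_archive "$archive"; then
            echo "Corrupt archive, skipping sync: $archive" >&2
            return 1
        fi
    done
    coscli sync "$dir/" cos://your-bucket/openclaw-backups/ -r
}

# Example:
# verify_and_sync /opt/backups/openclaw
```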
Symptoms: Bot stops responding, health check fails, process not running.
Runbook:
# 1. Check process status
systemctl status openclaw
# 2. Check logs for crash reason
journalctl -u openclaw --since "10 minutes ago" --no-pager
# 3. Restart the service
systemctl restart openclaw
# 4. Verify recovery
curl -f http://localhost:3000/health && echo "OK" || echo "STILL DOWN"
# 5. If restart fails, check for resource exhaustion
free -h
df -h
Expected recovery time: Under 2 minutes.
Symptoms: SSH fails, console shows instance as stopped or unreachable.
Runbook:
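Going by the recovery times quoted for this scenario, a plausible sequence is a reboot attempt first, followed by a snapshot restore if the reboot does not bring the instance back. The sketch below uses `tccli` from a separate machine. The action names (`RebootInstances`, `DescribeSnapshots`, `ApplyInstanceSnapshot`), the `instance-id` filter, and the parameter formats are assumptions to verify against the Lighthouse API reference; the IDs are placeholders and `DRY_RUN` is an illustrative switch.

```shell
#!/bin/bash
# instance-recover.sh: hypothetical runbook sketch for an unreachable instance.
# IDs are placeholders; action and parameter names should be checked against
# the Lighthouse API reference before use.

INSTANCE_ID="${INSTANCE_ID:-lhins-xxxxxxxx}"

# Execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Try a reboot first (the fastest path).
reboot_instance() {
    run tccli lighthouse RebootInstances --InstanceIds "[\"$INSTANCE_ID\"]"
}

# 2. List snapshots to pick the last known good one.
list_snapshots() {
    run tccli lighthouse DescribeSnapshots \
        --Filters "[{\"Name\":\"instance-id\",\"Values\":[\"$INSTANCE_ID\"]}]"
}

# 3. If the reboot does not help, roll the instance back to a snapshot.
restore_snapshot() {
    local snapshot_id="$1"
    run tccli lighthouse ApplyInstanceSnapshot \
        --InstanceId "$INSTANCE_ID" --SnapshotId "$snapshot_id"
}
```

A snapshot restore returns the disk to its state at snapshot time, so any newer application-level backup still needs to be restored afterwards.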
Expected recovery time: 5-15 minutes for reboot, 15-30 minutes for snapshot restore.
Symptoms: Bot responds but with incorrect data, skills malfunction, configuration lost.
Runbook:
# 1. Stop OpenClaw
systemctl stop openclaw
# 2. Identify the last known good backup
ls -lt /opt/backups/openclaw/
# 3. Restore from backup
tar -xzf /opt/backups/openclaw/data_YYYYMMDD_HHMMSS.tar.gz -C /
# 4. Restore environment config
cp /opt/backups/openclaw/env_YYYYMMDD_HHMMSS /opt/openclaw/.env
# 5. Restart
systemctl start openclaw
# 6. Verify skills are loaded
curl http://localhost:3000/api/skills/status
Expected recovery time: 5-10 minutes.
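Before extracting over live data, it can help to unpack the candidate backup into a scratch directory and compare it with what is on disk. The helper below is a sketch; the function names and staging path are hypothetical. It relies on GNU tar having stripped the leading `/` when the archive was created, so files land under `opt/openclaw/` inside the staging directory.

```shell
#!/bin/bash
# stage-restore.sh: hypothetical helper; extracts a backup to a scratch directory
# so it can be inspected before anything under /opt/openclaw is overwritten.

# Print the newest data archive in the backup directory.
latest_backup() {
    ls -t "$1"/data_*.tar.gz 2>/dev/null | head -n 1
}

# Extract an archive into a staging directory instead of /.
stage_backup() {
    local archive="$1" staging="$2"
    mkdir -p "$staging" && tar -xzf "$archive" -C "$staging"
}

# Example:
# archive="$(latest_backup /opt/backups/openclaw)"
# stage_backup "$archive" /tmp/openclaw-restore
# diff -r /tmp/openclaw-restore/opt/openclaw/config /opt/openclaw/config
```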
Symptoms: Instance terminated, disk destroyed, unrecoverable.
Runbook:
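A complete loss means provisioning a replacement instance (from a snapshot if one survives, otherwise fresh), reinstalling OpenClaw if needed, and restoring from the off-instance backups. The following is a minimal sketch of the restore step only, run on the replacement instance. The script and function names are hypothetical, and the bucket path and directories are the placeholders used earlier in this guide.

```shell
#!/bin/bash
# rebuild-restore.sh: hypothetical sketch, run on the replacement instance.
# Bucket path and directories are placeholders.

BACKUP_DIR="${BACKUP_DIR:-/opt/backups/openclaw}"
OPENCLAW_DIR="${OPENCLAW_DIR:-/opt/openclaw}"

# Pull all off-instance backups down from COS.
fetch_backups() {
    mkdir -p "$BACKUP_DIR"
    coscli cp cos://your-bucket/openclaw-backups/ "$BACKUP_DIR/" -r
}

# Print the newest file with the given prefix (data_ or env_).
# Sorts by name, because downloaded files may not keep their original mtime
# and the embedded timestamp sorts chronologically.
newest() {
    ls "$BACKUP_DIR"/"$1"* 2>/dev/null | sort | tail -n 1
}

# Restore data and configuration from the newest backups, then start the service.
restore_latest() {
    local data_file env_file
    data_file="$(newest data_)"
    env_file="$(newest env_)"
    [ -n "$data_file" ] && [ -n "$env_file" ] || {
        echo "No backups found in $BACKUP_DIR" >&2
        return 1
    }
    tar -xzf "$data_file" -C / \
        && cp "$env_file" "$OPENCLAW_DIR/.env" \
        && systemctl start openclaw \
        && curl -f http://localhost:3000/health
}

# Example:
# fetch_backups && restore_latest
```

After the service is healthy, remember to repoint anything that referenced the old instance, such as webhook URLs or DNS records.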
Expected recovery time: 30-60 minutes.
Don't wait for users to tell you the bot is down. Implement proactive monitoring:
#!/bin/bash
# health-monitor.sh: run every minute via cron
HEALTH_URL="http://localhost:3000/health"
ALERT_EMAIL="admin@example.com"
ALERT_WEBHOOK="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$HEALTH_URL")
if [ "$HTTP_CODE" != "200" ]; then
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    MESSAGE="ALERT: OpenClaw health check failed at $TIMESTAMP (HTTP $HTTP_CODE)"
    # Email alert
    echo "$MESSAGE" | mail -s "OpenClaw DOWN" "$ALERT_EMAIL"
    # Slack/webhook alert
    curl -s -X POST "$ALERT_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\": \"$MESSAGE\"}"
    # Attempt auto-recovery
    systemctl restart openclaw
    sleep 10
    # Check if auto-recovery worked
    RETRY_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$HEALTH_URL")
    if [ "$RETRY_CODE" = "200" ]; then
        curl -s -X POST "$ALERT_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text": "RESOLVED: OpenClaw auto-recovered after restart"}'
    fi
fi
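Run every minute, a monitor like this sends a fresh alert on every failed check, which gets noisy during a long outage. One possible refinement is to remember the last observed state and alert only when it changes. This is a sketch, not part of the original script: the state file path, the `state_changed` function, and the `send_alert` placeholder are all assumptions.

```shell
#!/bin/bash
# alert-state.sh: hypothetical helper for health-monitor.sh; the state file path is an assumption.

STATE_FILE="${STATE_FILE:-/var/run/openclaw-health.state}"

# Record the current state ("up" or "down") and succeed only when it differs
# from the previously recorded one, i.e. when an alert is worth sending.
# With no state file yet, the previous state is assumed to be "up".
state_changed() {
    local current="$1" previous="up"
    [ -f "$STATE_FILE" ] && previous="$(cat "$STATE_FILE")"
    echo "$current" > "$STATE_FILE"
    [ "$current" != "$previous" ]
}

# Example use inside the monitor (send_alert is a placeholder):
# if [ "$HTTP_CODE" != "200" ]; then
#     state_changed down && send_alert "$MESSAGE"
# else
#     state_changed up && send_alert "RESOLVED: OpenClaw is healthy again"
# fi
```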
A plan that hasn't been tested is just a document. Schedule quarterly DR drills (for example, simulating a process crash with kill -9) and track your drill results over time:
| Drill Date | Scenario | Target RTO | Actual RTO | Issues Found |
|---|---|---|---|---|
| 2026-01-15 | Process crash | 2 min | 1.5 min | None |
| 2026-01-15 | Snapshot restore | 30 min | 42 min | Webhook URLs not documented |
| 2026-04-01 | Full rebuild | 60 min | 75 min | Forgot to backup .env to COS |
Each drill should improve the next one. The goal is to make recovery boring and predictable.
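To fill in the "Actual RTO" column with measured numbers rather than estimates, a drill can be timed by polling the health endpoint until it answers. The helper below is a sketch; the function name, the timeout and interval arguments, and the process name in the example are assumptions.

```shell
#!/bin/bash
# drill-timer.sh: hypothetical helper for measuring actual RTO during a drill.

# Poll a check command once per interval until it succeeds, then print the
# elapsed seconds. Returns 1 if the timeout (in seconds) is reached first.
# Usage: measure_recovery TIMEOUT INTERVAL COMMAND [ARGS...]
measure_recovery() {
    local timeout="$1" interval="$2"
    shift 2
    local start elapsed
    start="$(date +%s)"
    while true; do
        if "$@" > /dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        elapsed=$(( $(date +%s) - start ))
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
    done
}

# Example: simulate a crash, then time recovery against a 2-minute target.
# sudo pkill -9 -f openclaw      # process name is an assumption
# measure_recovery 120 1 curl -sf http://localhost:3000/health
```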
Running DR infrastructure on Lighthouse is remarkably affordable. Snapshots are included in the platform's features, off-instance backup storage costs pennies per GB, and spinning up a temporary instance for drills costs only for the hours used. Check the Tencent Cloud Lighthouse Special Offer for cost-effective plans that make maintaining a DR-ready posture practical even for individual developers.
The best disaster recovery plan is one you've practiced until it's muscle memory. Build your backups, write your runbooks, schedule your drills, and iterate. When the real disaster comes — and eventually it will — you'll handle it calmly, methodically, and fast. That's the difference between a hobby project and a production system.