A bot that crashes once a week isn't just annoying — it erodes trust. Users stop relying on it, admins stop maintaining it, and eventually it becomes that thing nobody uses but nobody bothers to shut down either.
Stability isn't a feature you add later. It's a foundation you build from day one. Here's how to make your OpenClaw QQ robot rock-solid on Tencent Cloud Lighthouse.
Think of stability as five layers, each building on the last.
Running a QQ bot on a local machine or a shared hosting plan is asking for trouble. Tencent Cloud Lighthouse gives you dedicated resources, guaranteed uptime, and one-click snapshots for recovery.
Configure systemd to keep your bot alive no matter what:
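# /etc/systemd/system/clawdbot.service.d/stability.conf
[Unit]
# Start rate limiting belongs in [Unit] (systemd 230 and later);
# StartLimitIntervalSec is not recognized in [Service]
StartLimitIntervalSec=300
StartLimitBurst=10
[Service]
Restart=always
RestartSec=3
# Watchdog: systemd kills and restarts the service if it misses a keep-alive ping for 30 seconds.
# This only works if the bot sends WATCHDOG=1 via sd_notify; if it does not, remove this line,
# or systemd will restart a healthy bot every 30 seconds.
WatchdogSec=30
TimeoutStopSec=10
# Resource limits
MemoryMax=1G
CPUQuota=80%
TasksMax=256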
Apply and verify:
sudo systemctl daemon-reload
sudo systemctl restart clawdbot
systemctl show clawdbot | grep -E "Restart|StartLimit|Watchdog|Memory|CPU|Tasks"
With StartLimitBurst=10 and StartLimitIntervalSec=300, systemd allows up to 10 start attempts within 5 minutes. After that it marks the unit as failed and stops restarting it, which is enough to survive transient issues without masking persistent failures.
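A quick piece of arithmetic shows how fast that limit is reached in practice. The sketch below is a hypothetical helper, not part of systemd. It assumes every start counts toward StartLimitBurst and that the process runs for a fixed number of seconds before each crash, so treat the numbers as rough estimates.

```shell
#!/bin/bash
# Hypothetical helper: rough time until the start limit is hit in a crash loop.
# Assumes each start counts toward StartLimitBurst and the process
# crashes a fixed number of seconds after every start.
# Usage: seconds_until_start_limit <burst> <restart_sec> <runtime_sec>
seconds_until_start_limit() {
    local burst="$1" restart_sec="$2" runtime_sec="$3"
    # Each cycle is: run until the crash, then wait RestartSec before the next start
    echo $(( burst * (runtime_sec + restart_sec) ))
}

# Succeeds if the crash loop would hit the limit inside the interval.
# Usage: crash_loop_is_caught <burst> <restart_sec> <runtime_sec> <interval_sec>
crash_loop_is_caught() {
    local elapsed
    elapsed=$(seconds_until_start_limit "$1" "$2" "$3")
    [ "$elapsed" -lt "$4" ]
}

seconds_until_start_limit 10 3 0   # instant crash: prints 30
```

Under these assumptions, a bot that crashes immediately exhausts its 10 starts in about 30 seconds. A bot that survives 30 seconds before each crash never trips the limit (`crash_loop_is_caught 10 3 30 300` fails), so it keeps restarting indefinitely and needs log-based alerting to be noticed.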
Long-running bots accumulate memory over time. Set up monitoring and automatic mitigation:
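#!/bin/bash
# /opt/clawdbot/memory-guard.sh
# Ask systemd for the main PID; pgrep -f clawdbot would also match this script's own path
PID=$(systemctl show -p MainPID --value clawdbot)
THRESHOLD=800 # MB
# Nothing to check if the service is not running (MainPID is 0)
[ "${PID:-0}" -gt 0 ] || exit 0
# ps reports RSS in KB; convert to whole MB so no bc is needed
MEM_USAGE=$(ps -o rss= -p "$PID" | awk '{print int($1/1024)}')
if [ "${MEM_USAGE:-0}" -gt "$THRESHOLD" ]; then
    echo "$(date) Memory usage ${MEM_USAGE}MB exceeds threshold. Restarting..." >> /var/log/clawdbot/stability.log
    systemctl restart clawdbot
fi
# Run every 10 minutes from root's crontab.
# Append to the existing crontab; piping a single line into "crontab -" would replace it.
(crontab -l 2>/dev/null; echo "*/10 * * * * /opt/clawdbot/memory-guard.sh") | crontab -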
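Restarting at a threshold treats the symptom. To see whether memory is actually climbing, it can help to log samples and look at the trend. The following is a minimal sketch under assumptions: the sample format (`<epoch_seconds> <rss_mb>` per line) and the function name are hypothetical, not something the bot produces on its own.

```shell
#!/bin/bash
# Hypothetical trend check: estimate memory growth from logged samples.
# Input on stdin: one sample per line, "<epoch_seconds> <rss_mb>".
# Output: growth in MB per hour between the first and last sample.
growth_mb_per_hour() {
    awk '
        NR == 1 { t0 = $1; m0 = $2 }   # remember the first sample
                { t1 = $1; m1 = $2 }   # keep overwriting with the latest sample
        END {
            # Fewer than two samples, or no elapsed time: no trend to report
            if (NR < 2 || t1 == t0) { print 0; exit }
            printf "%.1f\n", (m1 - m0) * 3600 / (t1 - t0)
        }'
}

# Example: three samples taken 10 minutes apart, growing 10 MB each time
printf '%s\n' "1700000000 400" "1700000600 410" "1700001200 420" | growth_mb_per_hour
```

The example prints 60.0, meaning 60 MB per hour. A rate that stays positive across days suggests a leak worth fixing in the bot itself, rather than relying on the periodic restart.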
Configure your bot to degrade gracefully instead of crashing:
# /opt/clawdbot/config/qq-stability.yaml
resilience:
  model_fallback:
    primary: "claude-sonnet-4-20250514"
    fallback: "claude-haiku"
    trigger: "timeout_or_error"
  retry_policy:
    max_attempts: 3
    backoff_ms: [500, 1000, 2000]
    retry_on: ["timeout", "rate_limit", "server_error"]
  circuit_breaker:
    enabled: true
    failure_threshold: 5
    reset_timeout_sec: 60
    half_open_requests: 2
  graceful_degradation:
    on_model_failure: "I'm experiencing some issues right now. Please try again in a moment."
    on_skill_failure: "That feature is temporarily unavailable. I can still help with general questions."
    on_overload: "I'm handling a lot of requests right now. Your message is queued."
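To make the retry_policy concrete, here is a minimal sketch of what retrying with increasing delays looks like. It is an illustration in shell, not how OpenClaw implements it: the function and variable names are hypothetical, it retries on any non-zero exit status rather than on the specific retry_on conditions, and it assumes a `sleep` that accepts fractional seconds (GNU coreutils does).

```shell
#!/bin/bash
# Illustrative retry loop mirroring max_attempts and backoff_ms from the config.
# Names are hypothetical; this is not OpenClaw's implementation.
MAX_ATTEMPTS=3
BACKOFF_MS=(500 1000 2000)

# Run a command until it succeeds or MAX_ATTEMPTS is reached.
# Usage: retry_with_backoff <command> [args...]
retry_with_backoff() {
    local attempt delay_ms
    for (( attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )); do
        "$@" && return 0
        # No point sleeping after the final attempt
        (( attempt == MAX_ATTEMPTS )) && break
        # Delay grows with each failed attempt
        delay_ms=${BACKOFF_MS[attempt - 1]:-0}
        sleep "$(awk -v ms="$delay_ms" 'BEGIN { printf "%.3f", ms / 1000 }')"
    done
    return 1
}
```

In this sketch, three attempts use only the first two delays (500 ms and 1000 ms), because there is nothing to wait for after the last attempt. A call such as `retry_with_backoff curl -fsS "$SOME_URL"` (URL hypothetical) gives up after the third failure and returns a non-zero status.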
The circuit breaker is especially important — if the model API fails 5 times in a row, the bot stops hammering it and waits 60 seconds before trying again. This prevents cascading failures.
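The state machine behind that behavior is small. The sketch below is an illustration with hypothetical function names, not OpenClaw's implementation. It models the closed, open, and half-open states using failure_threshold and reset_timeout_sec, takes the current time as an argument so the logic is easy to test, and does not model the half_open_requests limit.

```shell
#!/bin/bash
# Illustrative circuit breaker; names are hypothetical, not OpenClaw's implementation.
FAILURE_THRESHOLD=5     # mirrors failure_threshold
RESET_TIMEOUT_SEC=60    # mirrors reset_timeout_sec

cb_failures=0           # consecutive failures seen so far
cb_opened_at=0          # epoch seconds when the breaker opened; 0 means closed

# Succeeds if a request may be sent at time $1 (epoch seconds).
cb_allow() {
    local now="$1"
    (( cb_opened_at == 0 )) && return 0                         # closed: normal traffic
    (( now - cb_opened_at >= RESET_TIMEOUT_SEC )) && return 0   # half-open: allow a probe
    return 1                                                    # open: fail fast
}

# Record the outcome of a request.
# Usage: cb_record ok|fail <now>
cb_record() {
    local outcome="$1" now="$2"
    if [ "$outcome" = ok ]; then
        # Any success closes the breaker and resets the counter
        cb_failures=0
        cb_opened_at=0
    else
        cb_failures=$((cb_failures + 1))
        # Open (or re-open after a failed probe) once the threshold is reached
        if (( cb_failures >= FAILURE_THRESHOLD )); then
            cb_opened_at="$now"
        fi
    fi
}
```

A caller would check `cb_allow "$(date +%s)"` before each model request, send the graceful_degradation message when it fails, and call `cb_record` with the result of every request that was sent.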
Don't wait for users to report problems:
#!/bin/bash
# /opt/clawdbot/stability-check.sh
ISSUES=0
REPORT=""
# Check 1: Is the process running?
if ! systemctl is-active --quiet clawdbot; then
    REPORT+="[CRITICAL] Bot process is DOWN\n"
    ISSUES=$((ISSUES + 1))
fi
# Check 2: Memory usage of the service's main process
# (MainPID comes from systemd; pgrep -f clawdbot would also match this script's path)
PID=$(systemctl show -p MainPID --value clawdbot)
MEM=$(ps -o rss= -p "${PID:-0}" 2>/dev/null | awk '{print int($1/1024)}')
if [ "${MEM:-0}" -gt 800 ]; then
    REPORT+="[WARNING] Memory usage: ${MEM}MB\n"
    ISSUES=$((ISSUES + 1))
fi
# Check 3: Recent errors (-q suppresses informational lines so they are not counted)
ERRORS=$(journalctl -u clawdbot --since "10 min ago" -p err --no-pager -q | wc -l)
if [ "$ERRORS" -gt 10 ]; then
    REPORT+="[WARNING] $ERRORS errors in last 10 minutes\n"
    ISSUES=$((ISSUES + 1))
fi
# Check 4: Average response time over the last 20 requests logged today
# (the awk guard avoids a division by zero when no requests were logged)
AVG_RT=$(grep "$(date +%Y-%m-%d)" /var/log/clawdbot/output.log 2>/dev/null | \
    grep -oP 'response_time=\K[0-9]+' | tail -20 | awk '{sum+=$1;n++} END{if (n) print int(sum/n)}')
if [ "${AVG_RT:-0}" -gt 5000 ]; then
    REPORT+="[WARNING] Avg response time: ${AVG_RT}ms\n"
    ISSUES=$((ISSUES + 1))
fi
if [ "$ISSUES" -gt 0 ]; then
    echo -e "$REPORT"
    # Send alert via webhook
fi
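The webhook step is left as a comment in the script. One possible way to fill it in is sketched below. Everything specific here is an assumption: the WEBHOOK_URL value is a placeholder, the `{"text": ...}` payload shape is hypothetical and depends on your alerting service, and the JSON escaping is minimal (backslashes, double quotes, and newlines only).

```shell
#!/bin/bash
# Hypothetical webhook alert. WEBHOOK_URL and the {"text": ...} payload shape
# are placeholders; use whatever your alerting service expects.
WEBHOOK_URL="${WEBHOOK_URL:-https://example.com/hooks/clawdbot}"

# Escape backslashes, double quotes, and newlines for use inside a JSON string
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    printf '%s' "$s"
}

# Build the JSON body for an alert message
alert_payload() {
    printf '{"text":"%s"}' "$(json_escape "$1")"
}

# POST the alert; --max-time keeps a slow webhook from hanging the health check
send_alert() {
    curl -sS --max-time 10 -H 'Content-Type: application/json' \
        -d "$(alert_payload "$1")" "$WEBHOOK_URL"
}
```

Inside the final `if` of the health check, the call would be `send_alert "$(echo -e "$REPORT")"`, which expands the `\n` sequences in REPORT into real newlines before they are escaped for JSON.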
Keep a simple uptime log:
# Add to crontab — runs every minute
* * * * * systemctl is-active --quiet clawdbot && echo "$(date +\%s) UP" >> /var/log/clawdbot/uptime.log || echo "$(date +\%s) DOWN" >> /var/log/clawdbot/uptime.log
Calculate uptime percentage:
TOTAL=$(wc -l < /var/log/clawdbot/uptime.log)
UP=$(grep -c " UP$" /var/log/clawdbot/uptime.log)
# Skip the calculation on an empty log to avoid dividing by zero
[ "$TOTAL" -gt 0 ] && echo "Uptime: $(echo "scale=2; $UP * 100 / $TOTAL" | bc)%"
Target: 99.9% uptime, which allows about 43 minutes of downtime in a 30-day month.
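The downtime budget behind a target like this is simple arithmetic, and a small helper makes it easy to try other targets. The function name below is hypothetical.

```shell
#!/bin/bash
# Downtime (in minutes) allowed by an uptime target over a period.
# Usage: downtime_budget_min <target_percent> <days>
downtime_budget_min() {
    awk -v target="$1" -v days="$2" \
        'BEGIN { printf "%.1f\n", (100 - target) / 100 * days * 24 * 60 }'
}

downtime_budget_min 99.9 30    # 30-day month: prints 43.2
downtime_budget_min 99.9 365   # full year: prints 525.6
```

Because the uptime log records one sample per minute, each DOWN line represents roughly one minute of downtime, so a 99.9% target leaves room for about 43 DOWN lines in a 30-day month.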
A stable bot is a trusted bot. Users rely on it, admins sleep through the night, and the organization gets consistent value from its AI investment.
Build on the right foundation.
Stability isn't boring. It's professional.