The worst way to find out your QQ bot is down? A user telling you. The second worst? Checking manually every few hours. Proper status monitoring means you know about problems before anyone else does — and ideally, the system fixes itself before you even wake up.
Your OpenClaw QQ bot has four critical health dimensions:
| Dimension | Metric | Healthy Range |
|---|---|---|
| Availability | Process running, port open | Always up |
| Performance | Response time | < 3 seconds p95 |
| Resources | CPU, memory, disk | < 80% utilization |
| Functionality | Successful message processing | > 99% success rate |
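The scripts below measure availability and resources directly; the performance dimension needs an active probe. A minimal sketch, assuming the bot answers HTTP on the same port 8080 the monitor checks (the /health path is hypothetical; substitute a real endpoint):
# Time one round trip; feed samples into whatever p95 tracking you use
RT=$(curl -o /dev/null -s -w '%{time_total}' --max-time 5 http://127.0.0.1:8080/health)
echo "Response time: ${RT}s"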
On Tencent Cloud Lighthouse, build a lightweight monitoring system using shell scripts and cron — no Prometheus required:
#!/bin/bash
# /opt/clawdbot/monitor.sh
# Comprehensive status check — runs every minute
TIMESTAMP=$(date +%Y-%m-%dT%H:%M:%S)
STATUS_FILE="/opt/clawdbot/data/status.json"
# Process check
PROCESS_UP=$(systemctl is-active --quiet clawdbot && echo "true" || echo "false")
# Port check
PORT_OPEN=$(ss -tlnp | grep -q ":8080" && echo "true" || echo "false")
# PID from systemd; pgrep -f clawdbot would match this monitor script itself
MAIN_PID=$(systemctl show -p MainPID --value clawdbot 2>/dev/null)
# Memory (MB)
MEM_MB=$(ps -o rss= -p "$MAIN_PID" 2>/dev/null | awk '{print int($1/1024)}')
MEM_MB=${MEM_MB:-0}
# CPU (%): ps reports a lifetime average, not an instantaneous reading
CPU_PCT=$(ps -o %cpu= -p "$MAIN_PID" 2>/dev/null | awk '{print int($1)}')
CPU_PCT=${CPU_PCT:-0}
# Disk usage (%)
DISK_PCT=$(df /opt/clawdbot | tail -1 | awk '{print $5}' | tr -d '%')
# Recent errors (last 5 min); -q stops journalctl printing "-- No entries --"
RECENT_ERRORS=$(journalctl -q -u clawdbot --since "5 min ago" -p err --no-pager 2>/dev/null | wc -l)
# Messages processed (last 5 min)
RECENT_MSGS=$(journalctl -u clawdbot --since "5 min ago" --no-pager 2>/dev/null | grep -c "msg_processed")
# Write status
cat > "$STATUS_FILE" <<EOF
{
  "timestamp": "$TIMESTAMP",
  "process_running": $PROCESS_UP,
  "port_open": $PORT_OPEN,
  "memory_mb": $MEM_MB,
  "cpu_percent": $CPU_PCT,
  "disk_percent": $DISK_PCT,
  "recent_errors": $RECENT_ERRORS,
  "recent_messages": $RECENT_MSGS,
  "health": "$([ "$PROCESS_UP" = "true" ] && [ "$PORT_OPEN" = "true" ] && [ "$MEM_MB" -lt 800 ] && [ "$DISK_PCT" -lt 80 ] && echo "healthy" || echo "degraded")"
}
EOF
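Create the data directory, make the script executable, and run it once to confirm the JSON parses:
mkdir -p /opt/clawdbot/data
chmod +x /opt/clawdbot/monitor.sh
/opt/clawdbot/monitor.sh && jq . /opt/clawdbot/data/status.json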
Add to crontab:
* * * * * /opt/clawdbot/monitor.sh
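Verify the job keeps firing: the status file should never be more than a minute or two old, and a stale file means cron (or the script) has quietly died:
# Freshness check; 120s of slack for a once-a-minute job
test $(( $(date +%s) - $(stat -c %Y /opt/clawdbot/data/status.json) )) -lt 120 \
  && echo "fresh" || echo "STALE: check cron"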
Expose the status as a simple HTTP endpoint:
# Quick status server using Python
cat > /opt/clawdbot/status-server.py <<'EOF'
from http.server import HTTPServer, BaseHTTPRequestHandler

class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/status':
            try:
                # Serve the latest snapshot written by monitor.sh
                with open('/opt/clawdbot/data/status.json', 'rb') as f:
                    data = f.read()
                self.send_response(200)
                self.send_header('Content-Type', 'application/json')
                self.send_header('Content-Length', str(len(data)))
                self.end_headers()
                self.wfile.write(data)
            except OSError:
                # Status file missing or unreadable: report unavailable
                self.send_response(503)
                self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

HTTPServer(('0.0.0.0', 9090), StatusHandler).serve_forever()
EOF
Now you can check status from anywhere: curl http://YOUR_LIGHTHOUSE_IP:9090/status. The server binds 0.0.0.0, so open port 9090 in the Lighthouse firewall, and restrict the source IP there if the numbers shouldn't be public.
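Keep the endpoint alive across reboots with a small unit file. A sketch; the unit name and python3 path are assumptions, adjust to your layout:
# /etc/systemd/system/clawdbot-status.service (name is an assumption)
[Unit]
Description=clawdbot status endpoint
After=network.target

[Service]
ExecStart=/usr/bin/python3 /opt/clawdbot/status-server.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
Enable it with: systemctl enable --now clawdbot-status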
Different severity levels need different responses:
#!/bin/bash
# /opt/clawdbot/alert.sh
STATUS=$(cat /opt/clawdbot/data/status.json)
HEALTH=$(echo "$STATUS" | jq -r '.health')
PROCESS=$(echo "$STATUS" | jq -r '.process_running')
MEM=$(echo "$STATUS" | jq -r '.memory_mb')
ERRORS=$(echo "$STATUS" | jq -r '.recent_errors')
# CRITICAL: Process down
if [ "$PROCESS" = "false" ]; then
  echo "[CRITICAL] Bot process is DOWN. Attempting restart..."
  sudo systemctl restart clawdbot
  # Send alert
fi
# WARNING: High memory
if [ "$MEM" -gt 700 ]; then
  echo "[WARNING] Memory usage: ${MEM}MB"
fi
# WARNING: Error spike
if [ "$ERRORS" -gt 20 ]; then
  echo "[WARNING] $ERRORS errors in last 5 minutes"
fi
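Schedule it alongside the monitor, in root's crontab so the restart needs no password:
* * * * * /opt/clawdbot/alert.sh
Echoed warnings go nowhere on their own, so replace the # Send alert placeholder with a push to a channel you actually watch. A sketch, assuming a hypothetical webhook URL; enterprise WeChat, DingTalk, and similar bots all accept a plain JSON POST:
# Hypothetical webhook notifier; the URL is a placeholder
send_alert() {
  curl -s -X POST -H 'Content-Type: application/json' \
    -d "{\"text\": \"$1\"}" "https://example.com/your-webhook" >/dev/null
}
# e.g. send_alert "[CRITICAL] clawdbot down on $(hostname)"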
Monitoring only works if the monitoring system itself is reliable. Running it on the same Lighthouse instance as your bot is fine for single-instance setups; just remember that if the instance goes down, the monitor goes with it, so an external uptime checker pointed at the :9090 endpoint covers that blind spot.
Append status snapshots to a history file for trend analysis. Compact each snapshot to a single line first: the pretty-printed JSON spans eleven lines per record, which would break both the .jsonl format and the tail arithmetic below.
# Add to the end of monitor.sh; one record per line, one per minute
jq -c . /opt/clawdbot/data/status.json >> /opt/clawdbot/data/status-history.jsonl
Analyze trends:
# Average memory over the last 24 hours
tail -1440 /opt/clawdbot/data/status-history.jsonl | \
jq -r '.memory_mb' | awk '{sum+=$1;n++} END{print "Avg memory: " sum/n "MB"}'
# Error trend
tail -1440 /opt/clawdbot/data/status-history.jsonl | \
jq -r '.recent_errors' | awk '{sum+=$1} END{print "Total errors (24h): " sum}'
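The same history answers availability questions, for example how many minutes in the last day were degraded:
# Minutes not marked healthy in the last 24 hours
tail -1440 /opt/clawdbot/data/status-history.jsonl | \
  jq -r 'select(.health != "healthy") | .timestamp' | wc -l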
Go beyond alerting and automate the fix. These checks extend alert.sh, which already reads MEM; pull the disk figure the same way:
DISK=$(echo "$STATUS" | jq -r '.disk_percent')
# Auto-restart on crash (already handled by systemd)
# Auto-cleanup on high disk
if [ "$DISK" -gt 85 ]; then
  find /var/log/clawdbot/ -name "*.log.gz" -mtime +7 -delete
  journalctl --vacuum-time=3d
fi
# Auto-restart on a suspected memory leak
if [ "$MEM" -gt 900 ]; then
  sudo systemctl restart clawdbot
fi
Monitoring isn't a one-time setup — it's a living system that evolves with your bot. Start simple, add complexity as needed, and always ask: "Would this alert wake me up at 3 AM? If so, can the system fix it automatically?"
Monitor everything. Alert wisely. Automate relentlessly.