Nothing humbles a developer faster than a bot that silently fails. No error message, no log entry, just... silence. The user sends a message, the bot does nothing, and you're left staring at a terminal wondering if the problem is the webhook, the model API, the config, or the alignment of the planets.
Let's build a proper error handling and debugging toolkit for your OpenClaw Lark robot.
Before you can handle errors, you need to categorize them. OpenClaw Lark bot failures typically fall into four buckets:
| Category | Symptoms | Common Cause |
|---|---|---|
| Webhook errors | Bot never receives messages | Wrong callback URL, expired token, firewall blocking |
| Model API errors | Bot receives but doesn't respond | API key invalid, rate limit hit, model timeout |
| Skill errors | Partial or garbled responses | Skill misconfiguration, missing dependencies |
| Channel errors | Response sent but not delivered | Lark API rate limit, message format mismatch |
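One way to make these buckets actionable is to tag log lines by category. The sketch below is a minimal shell function, `classify_error` (a hypothetical helper, not part of OpenClaw), that maps a log line to one of the four buckets by keyword. The keywords are illustrative guesses; adjust them to whatever your logs actually contain.

```shell
# Map one log line to a failure bucket from the table above.
# Keywords are assumptions for illustration, not OpenClaw's real log vocabulary.
classify_error() {
  local line
  # Lowercase the input so matching is case-insensitive.
  line=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$line" in
    *webhook*|*callback*|*signature*)           echo "webhook" ;;
    # Checked before model_api so a Lark-side rate limit is not misfiled.
    *"lark api"*|*delivery*|*"message format"*) echo "channel" ;;
    *skill*|*dependency*|*plugin*)              echo "skill" ;;
    *"api key"*|*"rate limit"*|*timeout*|*model*) echo "model_api" ;;
    *)                                          echo "unknown" ;;
  esac
}

# Usage:
# classify_error "Model request timeout after 30s"
```

The order of the patterns matters: the first match wins, so more specific buckets should come before broader ones.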
First, make sure errors are actually captured. On your Tencent Cloud Lighthouse instance:
# Enable debug-level logging for troubleshooting
sudo systemctl edit clawdbot
Add the override:
[Service]
Environment="LOG_LEVEL=debug"
Environment="LOG_FORMAT=json"
sudo systemctl daemon-reload
sudo systemctl restart clawdbot
Now every error includes structured context — timestamps, request IDs, user IDs, and stack traces.
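With JSON output you can filter by field instead of eyeballing raw text. The helper below is a rough sketch that keeps only error-level lines. It assumes one JSON object per line and a field named "level"; both are assumptions about the log shape, so check a real log line first.

```shell
# Keep only JSON log lines whose "level" field is error or fatal.
# The field name "level" and its values are assumptions about the log format.
errors_only() {
  grep -E '"level"[[:space:]]*:[[:space:]]*"(error|fatal)"'
}

# Usage (-o cat prints just the message, without journal prefixes):
# journalctl -u clawdbot -o cat --since "10 min ago" --no-pager | errors_only
```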
Webhook failures are the most common cause of a bot that doesn't respond. Start here:
# Check if the webhook endpoint is reachable
curl -v https://YOUR_LIGHTHOUSE_IP/webhook/lark
# Check if Lark events are arriving
journalctl -u clawdbot -f --no-pager | grep -i "webhook\|event"
If nothing shows up, the request is being dropped before it reaches your service, usually at the firewall:
# Quick firewall check
sudo iptables -L -n | grep 443
# Or on Lighthouse, check via the console's firewall settings
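The HTTP status from the curl probe above already narrows things down. As a rough guide, the hypothetical helper below turns a status code into a likely cause. The mapping is a rule of thumb, not documented OpenClaw behaviour.

```shell
# Translate the HTTP status from a webhook probe into a likely cause.
# The status-to-cause mapping is a heuristic assumption.
explain_status() {
  case "$1" in
    000)     echo "no connection: check DNS, TLS, and that port 443 is open" ;;
    2??)     echo "endpoint reachable: check the Lark event subscription" ;;
    401|403) echo "rejected: check the verification token or app credentials" ;;
    404)     echo "not found: check the callback URL path" ;;
    405)     echo "endpoint reachable but expects POST: check the Lark event subscription" ;;
    5??)     echo "service error: check the clawdbot logs" ;;
    *)       echo "unexpected status $1" ;;
  esac
}

# Usage (curl reports 000 when it cannot complete the connection):
# status=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "https://YOUR_LIGHTHOUSE_IP/webhook/lark")
# explain_status "$status"
```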
When the bot receives messages but doesn't respond:
# Look for API-related errors
journalctl -u clawdbot --since "30 min ago" --no-pager | grep -i "api\|model\|timeout\|rate"
Common fixes:
# /opt/clawdbot/config/lark.yaml
model:
  api_key: "${MODEL_API_KEY}"
  timeout: 30s                    # Increase if you're hitting timeouts
  max_retries: 3                  # Retry on transient failures
  retry_backoff: 1s               # Wait between retries
  fallback_model: "claude-haiku"  # Use a faster model if primary fails
The fallback model is crucial. If your primary model is overloaded, the bot gracefully degrades instead of going silent.
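To make the retry-then-fallback behaviour concrete, here is a minimal shell sketch of the same control flow. It is not how OpenClaw implements it. `retry_with_fallback`, `MAX_RETRIES`, and `RETRY_BACKOFF` are names invented for this example, and "max retries" is assumed to mean retries after the first attempt.

```shell
# Try a primary command, retry on failure with a fixed backoff,
# then run a fallback command instead of giving up silently.
# $1 = primary command name, $2 = fallback command name (both hypothetical).
retry_with_fallback() {
  local primary="$1" fallback="$2"
  local max_retries="${MAX_RETRIES:-3}"
  local backoff="${RETRY_BACKOFF:-1}"
  local attempt=0
  while ! "$primary"; do
    if [ "$attempt" -ge "$max_retries" ]; then
      # Retries exhausted: degrade to the fallback and return its status.
      "$fallback"
      return
    fi
    attempt=$((attempt + 1))
    sleep "$backoff"
  done
  return 0
}

# Usage, with two functions you define yourself:
# retry_with_fallback call_primary_model call_fallback_model
```

A fixed backoff keeps the sketch short; exponential backoff is the usual choice when the failure is a rate limit.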
Never let the user see silence. Configure error messages for each failure type:
error_handling:
  webhook_error:
    user_message: "I'm having trouble receiving your message. Please try again in a moment."
    log_level: error
    alert: true
  model_timeout:
    user_message: "My thinking is taking longer than usual. Let me try a simpler approach..."
    action: retry_with_fallback
    log_level: warn
  skill_error:
    user_message: "I encountered an issue with that capability. Here's what I can help with instead: [list available skills]"
    log_level: error
    alert: true
  rate_limit:
    user_message: "I'm getting a lot of requests right now. Please try again in {retry_after} seconds."
    log_level: warn
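The `{retry_after}` placeholder in the rate-limit message has to be filled in before the message is sent. How OpenClaw does this internally is not shown here. The sketch below illustrates the substitution with a hypothetical `render_message` helper, including a made-up default of 60 seconds for when no valid value is available.

```shell
# Fill the {retry_after} placeholder in a user-facing message.
# $1 = message template, $2 = seconds to wait.
render_message() {
  local template="$1" seconds="$2"
  # Accept digits only, so the value is safe to splice into the sed expression.
  case "$seconds" in
    ''|*[!0-9]*) seconds=60 ;;  # hypothetical default when the value is missing or invalid
  esac
  printf '%s\n' "$template" | sed "s/{retry_after}/$seconds/g"
}

# Usage:
# render_message "Please try again in {retry_after} seconds." 12
```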
Build a script that helps you diagnose issues fast:
#!/bin/bash
# /opt/clawdbot/debug-toolkit.sh
echo "=== OpenClaw Lark Bot Diagnostic ==="
echo ""
echo "1. Service Status:"
systemctl is-active clawdbot && echo " RUNNING" || echo " DOWN"
echo ""
echo "2. Last 5 Errors:"
journalctl -u clawdbot --no-pager -p err -n 5
echo ""
echo "3. Port Binding:"
ss -tlnp | grep clawdbot
echo ""
echo "4. Memory Usage:"
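# Ask systemd for the service's main PID; grepping ps output would also match this script's own path
MAIN_PID=$(systemctl show clawdbot -p MainPID --value)
[ "$MAIN_PID" -gt 0 ] && ps -o rss= -p "$MAIN_PID" | awk '{printf " RSS: %.1f MB\n", $1/1024}'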
echo ""
echo "5. Recent Webhook Events (last 10 min):"
journalctl -u clawdbot --since "10 min ago" --no-pager | grep -c "webhook"
echo " events received"
echo ""
echo "6. Error Count (last hour):"
journalctl -u clawdbot --since "1 hour ago" --no-pager | grep -c "ERROR"
echo " errors"
Make it executable and run it whenever something feels off:
chmod +x /opt/clawdbot/debug-toolkit.sh
sudo /opt/clawdbot/debug-toolkit.sh
Don't wait for users to report problems. Get notified proactively:
#!/bin/bash
# /opt/clawdbot/error-alert.sh
# -q suppresses journalctl's informational lines (such as "-- No entries --") so they are never counted
ERROR_COUNT=$(journalctl -u clawdbot --since "5 min ago" --no-pager -q -p err | wc -l)
if [ "$ERROR_COUNT" -gt 5 ]; then
  curl -X POST "https://open.larksuite.com/open-apis/bot/v2/hook/YOUR_ALERT_WEBHOOK" \
    -H "Content-Type: application/json" \
    -d "{\"msg_type\":\"text\",\"content\":{\"text\":\"[ALERT] $ERROR_COUNT errors in last 5 minutes. Run debug-toolkit.sh for details.\"}}"
fi
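Make the script executable (chmod +x /opt/clawdbot/error-alert.sh), then add it to root's crontab so it runs every five minutes. Reading the service journal requires root or membership in the systemd-journal group:
*/5 * * * * /opt/clawdbot/error-alert.sh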
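If you want to tune the threshold or reuse the message elsewhere, the decision and the payload from the alert script can be pulled into two small functions. This is only a sketch: `should_alert` and `build_alert_payload` are names invented for this example, and the payload simply mirrors the text message used above.

```shell
# Return success only when the error count exceeds the threshold.
# $1 = error count, $2 = threshold.
should_alert() {
  [ "$1" -gt "$2" ]
}

# Print the Lark text-message payload for an alert.
# $1 = error count, $2 = window in minutes. Both are integers, so no JSON escaping is needed.
build_alert_payload() {
  printf '{"msg_type":"text","content":{"text":"[ALERT] %d errors in last %d minutes. Run debug-toolkit.sh for details."}}\n' "$1" "$2"
}

# Usage inside error-alert.sh:
# if should_alert "$ERROR_COUNT" 5; then
#   curl -X POST "$ALERT_WEBHOOK_URL" -H "Content-Type: application/json" -d "$(build_alert_payload "$ERROR_COUNT" 5)"
# fi
```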
All of this runs best on a properly configured instance. If you're debugging on a local machine, you're missing systemd integration, persistent logs, and snapshot-based rollbacks.
When something breaks, resist the urge to restart and hope. Instead: check logs, identify the error category, apply the targeted fix, verify the fix, then document what happened.
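The last step, documenting what happened, is the one most often skipped, so it helps to make it a one-liner. The sketch below appends a timestamped record to a plain-text file. `record_incident` and the `INCIDENT_LOG` path are hypothetical; use whatever location your team already keeps notes in.

```shell
# Append a one-line incident record: UTC timestamp, category, cause, fix.
# INCIDENT_LOG is a hypothetical path; override it to suit your setup.
record_incident() {
  local log_file="${INCIDENT_LOG:-/opt/clawdbot/incidents.log}"
  printf '%s | %s | %s | %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "$log_file"
}

# Usage:
# record_incident "webhook" "expired verification token" "rotated token and restarted clawdbot"
```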
Your future self — and your team — will thank you.
Errors are inevitable. Silent failures are not.