Health Checks
The Gateway provides a built-in health check endpoint for monitoring its runtime status and integrating with external monitoring systems.
Health check endpoint
```text
GET /health
```
```bash
curl http://127.0.0.1:18789/health
```
Response example
```json
{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": 9240,
  "timestamp": "2025-01-15T13:04:00Z",
  "components": {
    "gateway": "healthy",
    "channels": "healthy",
    "sessions": "healthy",
    "storage": "healthy"
  }
}
```
Status codes
| HTTP status code | Meaning |
|---|---|
| 200 OK | The Gateway is running normally |
| 503 Service Unavailable | The Gateway is abnormal or degraded |
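As a rough illustration of how a client might read the status code together with the response body, the sketch below classifies a response and lists the components that are not healthy. The helper name `summarize_health` and the choice to treat every non-200 code as a failure are assumptions for illustration, not part of the Gateway API. The field names follow the response example above.

```python
import json


def summarize_health(status_code: int, body: str) -> dict:
    """Combine the HTTP status code and JSON body of GET /health.

    Hypothetical helper: expects the simple form where each component
    maps directly to a status string.
    """
    payload = json.loads(body) if body else {}
    components = payload.get("components", {})
    # Collect every component whose reported state is not "healthy".
    failing = sorted(name for name, state in components.items() if state != "healthy")
    return {
        "ok": status_code == 200,  # 200: normal; 503: abnormal or degraded
        "status": payload.get("status", "unknown"),
        "failing": failing,
    }


if __name__ == "__main__":
    sample = '{"status": "healthy", "components": {"gateway": "healthy", "channels": "healthy"}}'
    print(summarize_health(200, sample))
```

Keeping the function free of network calls makes it easy to test; the caller can fetch the response with any HTTP client and pass the code and body in.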
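Status indicators
Overall status
| Status value | Description |
|---|---|
| healthy | All components are running normally |
| degraded | Some components are abnormal, but core functionality is available |
| unhealthy | Severe failure; the Gateway cannot serve requests normally |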
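One way to feed these three values into an external monitoring tool is to map them onto plugin exit codes. The sketch below uses the Nagios-style convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The mapping itself is an assumption chosen for illustration; the Gateway does not define it.

```python
# Nagios-style plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# Assumed mapping from the overall status values in the table above.
EXIT_CODES = {"healthy": 0, "degraded": 1, "unhealthy": 2}


def exit_code_for(status: str) -> int:
    """Map an overall status value to an exit code; unrecognized values become UNKNOWN (3)."""
    return EXIT_CODES.get(status, 3)


if __name__ == "__main__":
    for value in ("healthy", "degraded", "unhealthy", "something-else"):
        print(value, exit_code_for(value))
```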
Detailed status endpoint
```text
GET /health/detailed
```
```json
{
  "status": "degraded",
  "version": "0.5.0",
  "uptime": 9240,
  "components": {
    "gateway": {
      "status": "healthy",
      "pid": 42187,
      "memory": "85 MB",
      "cpu": "2.1%"
    },
    "channels": {
      "status": "degraded",
      "details": [
        {"name": "openai", "status": "healthy", "latency": "45ms"},
        {"name": "anthropic", "status": "unhealthy", "error": "connection timeout"}
      ]
    },
    "sessions": {
      "status": "healthy",
      "active": 3,
      "total": 156
    },
    "storage": {
      "status": "healthy",
      "diskFree": "45 GB",
      "dataSize": "128 MB"
    }
  }
}
```
Degraded status
When one Channel is unavailable but the other Channels are working normally, the overall status is degraded rather than unhealthy.
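The rule above can be sketched as a small aggregation function. The text only specifies the partial-failure case, so two behaviors here are assumptions: all Channels failing is reported as unhealthy, and an empty Channel list is reported as healthy. The function name `overall_channel_status` is hypothetical.

```python
from typing import Iterable


def overall_channel_status(channel_states: Iterable[str]) -> str:
    """Aggregate per-Channel states into one status value.

    Assumed rule: no failures -> healthy, some failures -> degraded,
    every Channel failing -> unhealthy.
    """
    states = list(channel_states)
    failing = sum(1 for state in states if state != "healthy")
    if failing == 0:
        return "healthy"
    if failing < len(states):
        return "degraded"
    return "unhealthy"


if __name__ == "__main__":
    # Mirrors the detailed response example: openai healthy, anthropic unhealthy.
    print(overall_channel_status(["healthy", "unhealthy"]))
```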
Metrics endpoint
The Gateway exposes a Prometheus-compatible metrics endpoint:
```text
GET /metrics
```
```text
# HELP openclaw_gateway_uptime_seconds Gateway uptime in seconds
# TYPE openclaw_gateway_uptime_seconds gauge
openclaw_gateway_uptime_seconds 9240
# HELP openclaw_sessions_active Current active sessions
# TYPE openclaw_sessions_active gauge
openclaw_sessions_active 3
# HELP openclaw_requests_total Total requests processed
# TYPE openclaw_requests_total counter
openclaw_requests_total{channel="openai"} 1523
openclaw_requests_total{channel="anthropic"} 847
# HELP openclaw_request_duration_seconds Request duration histogram
# TYPE openclaw_request_duration_seconds histogram
openclaw_request_duration_seconds_bucket{le="0.1"} 500
openclaw_request_duration_seconds_bucket{le="1.0"} 1200
openclaw_request_duration_seconds_bucket{le="10.0"} 1500
```
Monitoring integration
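For a quick ad hoc check without a full monitoring stack, the text returned by the metrics endpoint can also be read by a small script. The parser below is a minimal sketch that handles only simple `name{labels} value` lines like those in the sample output, not the full Prometheus exposition format. `parse_metrics` and `total_requests` are hypothetical helpers.

```python
import re

# Matches lines such as: openclaw_requests_total{channel="openai"} 1523
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping # HELP and # TYPE comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_requests(samples) -> float:
    """Sum openclaw_requests_total across all channel labels."""
    return sum(value for name, _, value in samples if name == "openclaw_requests_total")


if __name__ == "__main__":
    sample = """\
# TYPE openclaw_requests_total counter
openclaw_requests_total{channel="openai"} 1523
openclaw_requests_total{channel="anthropic"} 847
"""
    print(total_requests(parse_metrics(sample)))  # 2370.0
```

For anything beyond a one-off script, a maintained Prometheus client library is the safer choice, since it handles escaping, timestamps, and the other cases this sketch skips.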
Prometheus + Grafana
```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'openclaw-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['127.0.0.1:18789']
    metrics_path: '/metrics'
```
Docker health check
```yaml
# docker-compose.yml
services:
  openclaw:
    image: openclaw/gateway:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:18789/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s
```
Kubernetes Probes
```yaml
# k8s deployment
spec:
  containers:
    - name: openclaw
      livenessProbe:
        httpGet:
          path: /health
          port: 18789
        initialDelaySeconds: 10
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health/detailed
          port: 18789
        initialDelaySeconds: 5
        periodSeconds: 10
```
Alert configuration
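Simple alerting based on health checks
```bash
#!/bin/bash
# health-check.sh: a simple health check script
# --max-time keeps a hung Gateway from stalling the cron job;
# curl reports 000 as the status code when the connection fails.
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 http://127.0.0.1:18789/health)
if [ "$STATUS" != "200" ]; then
  echo "Gateway health check failed! Status: $STATUS" | \
    mail -s "OpenClaw Alert" admin@example.com
fi
```
```bash
# crontab: check every minute
* * * * * /path/to/health-check.sh
```
Prometheus alert rules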
```yaml
# alert_rules.yml
groups:
  - name: openclaw
    rules:
      - alert: GatewayDown
        expr: up{job="openclaw-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw Gateway is down"
      - alert: ChannelUnhealthy
        expr: openclaw_channel_status{status="unhealthy"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Channel {{ $labels.channel }} is unhealthy"
```
Alerting strategy
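We recommend a critical alert on Gateway process status that fires after 1 minute, and a warning alert on Channel status that fires after 5 minutes so that transient fluctuations are tolerated.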
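The effect of these hold durations can be illustrated with a small sketch: an alert fires only once the failing condition has held continuously for the configured window, so a brief blip resets the timer. This imitates the idea behind the Prometheus `for` clause in simplified form and is not how Prometheus is implemented. The class name `PendingAlert` and its `observe` method are hypothetical.

```python
from typing import Optional


class PendingAlert:
    """Fire only after a condition has been continuously true for `hold_seconds`."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._failing_since: Optional[float] = None

    def observe(self, failing: bool, now: float) -> bool:
        """Record one check result at time `now` (in seconds); return True if firing."""
        if not failing:
            self._failing_since = None  # recovery resets the timer
            return False
        if self._failing_since is None:
            self._failing_since = now  # first failing observation starts the timer
        return now - self._failing_since >= self.hold_seconds


if __name__ == "__main__":
    gateway = PendingAlert(hold_seconds=60)   # critical: 1 minute
    channel = PendingAlert(hold_seconds=300)  # warning: 5 minutes
    for t in (0, 30, 60):
        print(t, gateway.observe(True, t), channel.observe(True, t))
```

Passing the timestamp in explicitly, rather than reading the clock inside `observe`, keeps the logic deterministic and easy to test.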
CLI health checks
```bash
# Quick health check
openclaw status
# Detailed status
openclaw status --detailed
# JSON output
openclaw status --format json
```