Health Checks

The Gateway exposes built-in health check endpoints for monitoring its runtime status and for integrating with external monitoring systems.

Health Check Endpoint

text

GET /health

bash

curl http://127.0.0.1:18789/health

Example Response

json

{
  "status": "healthy",
  "version": "0.5.0",
  "uptime": 9240,
  "timestamp": "2025-01-15T13:04:00Z",
  "components": {
    "gateway": "healthy",
    "channels": "healthy",
    "sessions": "healthy",
    "storage": "healthy"
  }
}

Status Codes

HTTP status code	Meaning
`200 OK`	The Gateway is running normally
`503 Service Unavailable`	The Gateway is unhealthy or degraded

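Status Indicators

Overall Status

Status	Description
`healthy`	All components are running normally
`degraded`	Some components are failing, but core functionality is still available
`unhealthy`	Critical failure; the Gateway cannot serve requests normally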
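
Scripts that need more detail than the HTTP status code can map the overall status in the response body to an exit code. The sketch below is one possible approach and is not part of the Gateway: the function names `extract_overall_status` and `status_to_exit_code` are hypothetical, the exit codes follow the common Nagios plugin convention (0 OK, 1 warning, 2 critical, 3 unknown), and the `grep`-based extraction assumes the overall `status` field is the first `status` key in the body, as in the example response. A real JSON parser would be more robust.

```bash
#!/bin/bash
# Hypothetical helpers; not part of the Gateway or its CLI.

# Print the first "status" value found in the JSON read from stdin.
# Assumption: the overall status appears before any component status.
extract_overall_status() {
  grep -o '"status"[[:space:]]*:[[:space:]]*"[a-z]*"' \
    | head -n 1 \
    | sed 's/.*"\([a-z]*\)"$/\1/'
}

# Map an overall status to a Nagios-style exit code.
status_to_exit_code() {
  case "$1" in
    healthy)   echo 0 ;;
    degraded)  echo 1 ;;
    unhealthy) echo 2 ;;
    *)         echo 3 ;;  # unknown or missing status
  esac
}
```

For example, `status_to_exit_code "$(curl -s http://127.0.0.1:18789/health | extract_overall_status)"` prints `1` when the Gateway reports `degraded`.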

Detailed Status Endpoint

text

GET /health/detailed

json

{
  "status": "degraded",
  "version": "0.5.0",
  "uptime": 9240,
  "components": {
    "gateway": {
      "status": "healthy",
      "pid": 42187,
      "memory": "85 MB",
      "cpu": "2.1%"
    },
    "channels": {
      "status": "degraded",
      "details": [
        {"name": "openai", "status": "healthy", "latency": "45ms"},
        {"name": "anthropic", "status": "unhealthy", "error": "connection timeout"}
      ]
    },
    "sessions": {
      "status": "healthy",
      "active": 3,
      "total": 156
    },
    "storage": {
      "status": "healthy",
      "diskFree": "45 GB",
      "dataSize": "128 MB"
    }
  }
}

Degraded State

When one Channel is unavailable but the other Channels are working, the overall status is `degraded` rather than `unhealthy`.
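
The rule above can be sketched as a small aggregation function. This is an illustration, not the Gateway's implementation: the name `aggregate_channel_status` is hypothetical, and reporting `unhealthy` when every Channel is unavailable is an assumption that the text does not state.

```bash
#!/bin/bash
# aggregate_channel_status STATUS...: derive one status from per-Channel statuses.
# Hypothetical helper. Any value other than "healthy" counts as unavailable.
aggregate_channel_status() {
  local healthy=0 unavailable=0 s
  for s in "$@"; do
    if [ "$s" = "healthy" ]; then
      healthy=$((healthy + 1))
    else
      unavailable=$((unavailable + 1))
    fi
  done

  if [ "$unavailable" -eq 0 ]; then
    echo healthy      # every Channel is working (also printed for an empty list)
  elif [ "$healthy" -gt 0 ]; then
    echo degraded     # some Channels are down, others still work
  else
    echo unhealthy    # assumption: no working Channel at all
  fi
}
```

With the Channels from the detailed response (`openai` healthy, `anthropic` unhealthy), `aggregate_channel_status healthy unhealthy` prints `degraded`, which matches the `channels` status shown there.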

Metrics Endpoint

The Gateway provides a Prometheus-compatible metrics endpoint:

text

GET /metrics

text

# HELP openclaw_gateway_uptime_seconds Gateway uptime in seconds
# TYPE openclaw_gateway_uptime_seconds gauge
openclaw_gateway_uptime_seconds 9240

# HELP openclaw_sessions_active Current active sessions
# TYPE openclaw_sessions_active gauge
openclaw_sessions_active 3

# HELP openclaw_requests_total Total requests processed
# TYPE openclaw_requests_total counter
openclaw_requests_total{channel="openai"} 1523
openclaw_requests_total{channel="anthropic"} 847

# HELP openclaw_request_duration_seconds Request duration histogram
# TYPE openclaw_request_duration_seconds histogram
openclaw_request_duration_seconds_bucket{le="0.1"} 500
openclaw_request_duration_seconds_bucket{le="1.0"} 1200
openclaw_request_duration_seconds_bucket{le="10.0"} 1500
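
Prometheus histogram buckets are cumulative, so the sample output can be turned into a latency share with simple arithmetic. A minimal sketch, assuming the bucket lines carry only the `le` label as shown above: `bucket_count` and `percent` are hypothetical helper names, and the largest bucket shown (`le="10.0"`) stands in for the total because the sample has no `+Inf` bucket.

```bash
#!/bin/bash
# Hypothetical helpers for reading the sample /metrics output.

# Print the cumulative count of the request-duration bucket whose "le" label
# equals $1. Reads Prometheus text format from stdin.
bucket_count() {
  awk -v le="$1" '
    BEGIN { prefix = "openclaw_request_duration_seconds_bucket{le=\"" le "\"}" }
    index($0, prefix) == 1 { print $2 }
  '
}

# Print $1 as a percentage of $2 with one decimal place.
# Exits non-zero without printing when the total is zero.
percent() {
  awk -v part="$1" -v total="$2" 'BEGIN {
    if (total == 0) exit 1
    printf "%.1f\n", 100 * part / total
  }'
}
```

With the sample output saved to a file such as `metrics.txt` (a hypothetical name), `percent "$(bucket_count 1.0 < metrics.txt)" "$(bucket_count 10.0 < metrics.txt)"` prints `80.0`: 1200 of the 1500 requests in the 10-second bucket finished within one second.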

Monitoring Integration

Prometheus + Grafana

yaml

# prometheus.yml
scrape_configs:
  - job_name: 'openclaw-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['127.0.0.1:18789']
    metrics_path: '/metrics'

Docker Health Check

yaml

# docker-compose.yml
services:
  openclaw:
    image: openclaw/gateway:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:18789/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 10s

Kubernetes Probes

yaml

# k8s deployment
spec:
  containers:
    - name: openclaw
      livenessProbe:
        httpGet:
          path: /health
          port: 18789
        initialDelaySeconds: 10
        periodSeconds: 30
      readinessProbe:
        httpGet:
          path: /health/detailed
          port: 18789
        initialDelaySeconds: 5
        periodSeconds: 10

Alerting Configuration

Simple Alerting Based on the Health Check

bash

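#!/bin/bash
# health-check.sh: a simple health check script
# Set ALERT_EMAIL in the environment (for cron, on a line above the job in the crontab).
ALERT_EMAIL="${ALERT_EMAIL:?set ALERT_EMAIL to the alert recipient}"

# --max-time keeps a hung Gateway from stalling the job; curl reports 000 when no response arrives.
STATUS=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 http://127.0.0.1:18789/health)

if [ "$STATUS" != "200" ]; then
  echo "Gateway health check failed! Status: $STATUS" | \
    mail -s "OpenClaw Alert" "$ALERT_EMAIL"
fi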

bash

# crontab: run the check every minute
* * * * * /path/to/health-check.sh

Prometheus Alerting Rules

yaml

# alert_rules.yml
groups:
  - name: openclaw
    rules:
      - alert: GatewayDown
        expr: up{job="openclaw-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "OpenClaw Gateway is down"

      - alert: ChannelUnhealthy
        expr: openclaw_channel_status{status="unhealthy"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Channel {{ $labels.channel }} is unhealthy"

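Alerting Strategy

Set a critical alert on Gateway process status (firing after 1 minute) and a warning alert on Channel status (firing after 5 minutes, to tolerate transient fluctuations).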
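
The same tolerance for transient fluctuations can be approximated in a cron-based script by retrying the check before alerting. The sketch below is illustrative only: `retry` is a hypothetical helper, and the attempt count and delay in the usage note are arbitrary example values.

```bash
#!/bin/bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS times,
# sleeping DELAY seconds between attempts. Returns 0 on the first success and
# 1 if every attempt fails. Hypothetical helper, not part of the Gateway CLI.
retry() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 3 5 curl -sf --max-time 10 -o /dev/null http://127.0.0.1:18789/health` fails only if three checks, spaced five seconds apart, all fail. The `-f` flag makes `curl` exit non-zero on a `503` response.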

CLI Health Check

bash

# Quick health check
openclaw status

# Detailed status
openclaw status --detailed

# JSON output
openclaw status --format json
