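34.3 Enterprise Monitoring and Maintenance

Learn how to build an enterprise-grade monitoring and maintenance system that keeps Claude Code stable in production and supports continuous optimization.

34.3.1 Monitoring Overview

Why Monitoring Matters

Enterprise-grade monitoring is essential for a Claude Code deployment. It helps you:

Ensure availability: detect and resolve service outages promptly
Optimize performance: identify bottlenecks and tune resource usage
Protect security: detect anomalous behavior and security threats
Control costs: track usage and resource consumption
Support compliance audits: meet enterprise compliance requirements

Monitoring Dimensions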

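# Enterprise monitoring dimensions
MONITORING_DIMENSIONS = {
    "availability": {
        "metrics": ["service status", "response time", "error rate"],
        "target": "99.9% availability",
    },
    "performance": {
        "metrics": ["API latency", "token usage", "concurrent connections"],
        "target": "P95 latency < 2s",
    },
    "resources": {
        "metrics": ["CPU usage", "memory usage", "disk I/O", "network bandwidth"],
        "target": "resource utilization < 80%",
    },
    "security": {
        "metrics": ["anomalous access", "permission violations", "data leaks"],
        "target": "zero security incidents",
    },
    "cost": {
        "metrics": ["API call cost", "token cost", "infrastructure cost"],
        "target": "cost within budget",
    },
}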
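A target such as 99.9% availability is easier to reason about once it is converted into a downtime budget. The following is a minimal sketch of that arithmetic; the function name `allowed_downtime_minutes` and the 30-day window are illustrative assumptions, not part of any Claude Code tooling.

```python
def allowed_downtime_minutes(availability_target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target.

    availability_target is a fraction such as 0.999 (99.9%).
    window_days is the length of the measurement window (assumed: 30 days).
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


if __name__ == "__main__":
    budget = allowed_downtime_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30-day window, which gives a sense of how quickly alerts need to fire and be acted on.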

34.3.2 Metrics Collection

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules defined in alert_rules.yml; without this entry
# Prometheus never evaluates them.
rule_files:
  - 'alert_rules.yml'

scrape_configs:
  # Claude Code API monitoring
  - job_name: 'claude-code-api'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # LLM gateway monitoring
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:4000']
    metrics_path: '/metrics'
    scrape_interval: 10s

  # Dev container monitoring
  - job_name: 'dev-containers'
    static_configs:
      - targets: ['localhost:9323']
    metrics_path: '/metrics'
    scrape_interval: 30s

  # Sandbox monitoring
  - job_name: 'sandbox'
    static_configs:
      - targets: ['localhost:9100']
    metrics_path: '/metrics'
    scrape_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

Custom Metrics Exporter

claude_code_exporter.py

python

import time

import requests
from prometheus_client import start_http_server, Gauge, Counter, Histogram

# Metric definitions
api_requests_total = Counter(
    'claude_code_api_requests_total',
    'Total API requests',
    ['endpoint', 'status'],
)

api_latency = Histogram(
    'claude_code_api_latency_seconds',
    'API request latency',
    ['endpoint'],
)

active_sessions = Gauge(
    'claude_code_active_sessions',
    'Number of active sessions',
)

tokens_used = Counter(
    'claude_code_tokens_used_total',
    'Total tokens used',
    ['model', 'type'],
)

cost_incurred = Gauge(
    'claude_code_cost_usd',
    'Total cost incurred in USD',
)


class ClaudeCodeMetricsCollector:
    def __init__(self, api_base_url='http://localhost:8080'):
        self.api_base_url = api_base_url

    def collect_api_metrics(self):
        """Collect API metrics."""
        try:
            # Fetch the API status
            response = requests.get(f'{self.api_base_url}/health', timeout=5)
            if response.status_code == 200:
                data = response.json()

                # Update the active session count
                active_sessions.set(data.get('active_sessions', 0))

                # Update token usage. inc() adds to the counter, so this assumes
                # the endpoint reports tokens used since the previous poll;
                # cumulative totals would be double counted.
                tokens = data.get('tokens_used', {})
                for model, count in tokens.items():
                    tokens_used.labels(model=model, type='input').inc(count.get('input', 0))
                    tokens_used.labels(model=model, type='output').inc(count.get('output', 0))

                # Update cost
                cost_incurred.set(data.get('total_cost', 0.0))
        except Exception as e:
            print(f"Error collecting API metrics: {e}")

    def collect_performance_metrics(self):
        """Collect performance metrics."""
        try:
            # Measure API latency
            start_time = time.time()
            response = requests.get(f'{self.api_base_url}/health', timeout=5)
            latency = time.time() - start_time

            # Record latency
            api_latency.labels(endpoint='/health').observe(latency)

            # Record the request
            api_requests_total.labels(
                endpoint='/health',
                status=str(response.status_code),
            ).inc()
        except Exception as e:
            print(f"Error collecting performance metrics: {e}")

    def collect_sandbox_metrics(self):
        """Collect sandbox metrics."""
        try:
            response = requests.get(f'{self.api_base_url}/sandbox/status', timeout=5)
            if response.status_code == 200:
                data = response.json()

                # Sandbox violation count (read here but not exported yet)
                violations = data.get('violations', 0)
                # More sandbox-related metrics can be added here
        except Exception as e:
            print(f"Error collecting sandbox metrics: {e}")

    def run(self, interval=10):
        """Run the metrics collector."""
        start_http_server(9100)
        print("Metrics server started on port 9100")

        while True:
            self.collect_api_metrics()
            self.collect_performance_metrics()
            self.collect_sandbox_metrics()
            time.sleep(interval)


if __name__ == '__main__':
    collector = ClaudeCodeMetricsCollector()
    collector.run()
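A Prometheus counter is incremented, not set, so an exporter that polls a status endpoint needs to know whether the endpoint reports per-interval values or running totals. If it reports running totals, the exporter has to increment by the difference between polls. The sketch below shows one way to do that; the class name `CumulativeToDelta` is hypothetical, and returning 0 on the first observation is a design assumption made to avoid a spike when the exporter restarts.

```python
class CumulativeToDelta:
    """Convert cumulative totals into per-poll increments (hypothetical helper)."""

    def __init__(self):
        # Last cumulative total seen for each key, e.g. (model, token_type)
        self._last = {}

    def delta(self, key, total):
        """Return how much `total` grew since the previous call for `key`."""
        previous = self._last.get(key)
        self._last[key] = total
        if previous is None:
            # First observation: no baseline yet, so report no growth.
            return 0
        if total < previous:
            # The upstream total went backwards, which suggests it was reset;
            # treat the new total as growth since the reset.
            return total
        return total - previous


if __name__ == "__main__":
    tracker = CumulativeToDelta()
    for observed in (100, 130, 130, 20):
        print(observed, "->", tracker.delta(("example-model", "input"), observed))
```

The returned value could then be passed to a counter's `inc()` in place of the raw total.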

Log Collection Configuration

# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/claude-code/*.log
    fields:
      service: claude-code
      environment: production
    fields_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/llm-gateway/*.log
    fields:
      service: llm-gateway
      environment: production
    fields_under_root: true

  - type: log
    enabled: true
    paths:
      - /var/log/claude-sandbox/*.log
    fields:
      service: claude-sandbox
      environment: production
    fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "claude-code-%{+yyyy.MM.dd}"

# Filebeat requires a template name and pattern when the index name is customized
setup.template.name: "claude-code"
setup.template.pattern: "claude-code-*"

setup.kibana:
  host: "kibana:5601"

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~

34.3.3 Alerting Configuration

Prometheus Alert Rules

# alert_rules.yml
groups:
  - name: claude_code_alerts
    interval: 30s
    rules:
      # Service availability alert
      - alert: ClaudeCodeServiceDown
        expr: up{job="claude-code-api"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Claude Code service unavailable"
          description: "The Claude Code API service has been down for more than 1 minute"

      # API error rate alert (sum both sides so the label sets match)
      - alert: HighAPIErrorRate
        expr: |
          sum(rate(claude_code_api_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(claude_code_api_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API error rate too high"
          description: "API error rate is above 5% (current: {{ $value }})"

      # API latency alert
      - alert: HighAPILatency
        expr: |
          histogram_quantile(0.95,
            rate(claude_code_api_latency_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API latency too high"
          description: "API P95 latency is above 2 seconds (current: {{ $value }}s)"

      # Token usage alert (increase() gives tokens per hour; rate() is per second)
      - alert: HighTokenUsage
        expr: |
          sum(increase(claude_code_tokens_used_total[1h])) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token usage too high"
          description: "Token usage is above 100,000 per hour (current: {{ $value }})"

      # Cost alert
      - alert: HighCostIncurred
        expr: claude_code_cost_usd > 1000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Cost above threshold"
          description: "Cumulative cost is above $1000 (current: ${{ $value }})"

      # Sandbox violation alert (rate() is per second, so multiply by 60 for per minute)
      - alert: SandboxViolations
        expr: |
          rate(claude_sandbox_violations_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Frequent sandbox violations"
          description: "Sandbox violation rate is above 10 per minute (current: {{ $value }})"

      # Resource usage alerts
      - alert: HighCPUUsage
        expr: |
          rate(process_cpu_seconds_total{job="claude-code-api"}[5m]) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "CPU usage is above 80% (current: {{ $value }})"

      # node_memory_MemTotal_bytes comes from node_exporter and carries different
      # labels, so match on no labels (assumes a single-host setup)
      - alert: HighMemoryUsage
        expr: |
          process_resident_memory_bytes{job="claude-code-api"}
            / on() group_left() sum(node_memory_MemTotal_bytes) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage too high"
          description: "Memory usage is above 80% (current: {{ $value }})"
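The latency rule relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below is a simplified Python model of that idea, written to show why the estimate depends on bucket boundaries; it is not Prometheus's implementation, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound, with the +Inf bucket last (mirroring the `le` label).
    """
    if not buckets or not 0.0 <= q <= 1.0:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Hypothetical latency buckets in seconds: (le, cumulative count)
    sample = [(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 99), (math.inf, 100)]
    print("estimated P90:", histogram_quantile(0.90, sample))
```

Because the result is interpolated, a 2-second threshold is only as precise as the bucket boundaries around it, so it can be worth placing a bucket boundary at the alert threshold.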

Alertmanager Configuration

yaml

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      continue: false

    - match:
        severity: warning
      receiver: 'warning-alerts'
      continue: false

receivers:
  - name: 'default'
    email_configs:
      - to: 'team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'

  - name: 'critical-alerts'
    email_configs:
      - to: 'oncall@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#critical-alerts'
        title: 'Claude Code Critical Alert'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'warning-alerts'
    email_configs:
      - to: 'dev-team@company.com'
        from: 'alerts@company.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alerts@company.com'
        auth_password: 'password'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#warnings'
        title: 'Claude Code Warning'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname']
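The routing tree above sends each alert to the first child route whose `match` labels all agree with the alert, and falls back to the top-level receiver when none match. The sketch below models only that part of the behavior in Python, as an aid to reading the config; the function name `pick_receivers` is hypothetical, and grouping, inhibition, and nested routes are deliberately left out.

```python
def pick_receivers(labels, routes, default_receiver):
    """Return the receiver names an alert with `labels` would be routed to.

    Simplified model: routes are checked in order, a route matches when
    every key/value in its `match` dict equals the alert's label, and
    evaluation stops at the first match unless `continue` is true.
    """
    receivers = []
    for route in routes:
        match = route.get("match", {})
        if all(labels.get(key) == value for key, value in match.items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    return receivers or [default_receiver]


if __name__ == "__main__":
    # Mirrors the child routes in the configuration above
    routes = [
        {"match": {"severity": "critical"}, "receiver": "critical-alerts", "continue": False},
        {"match": {"severity": "warning"}, "receiver": "warning-alerts", "continue": False},
    ]
    alert = {"alertname": "HighAPILatency", "severity": "warning"}
    print(pick_receivers(alert, routes, "default"))
```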

34.3.4 Visualization Dashboards

Grafana Dashboard Configuration

json

{
  "dashboard": {
    "title": "Claude Code Enterprise Dashboard",
    "panels": [
      {
        "title": "API Request Rate",
        "targets": [
          {
            "expr": "rate(claude_code_api_requests_total[5m])",
            "legendFormat": "{{ endpoint }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "API Latency (P95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(claude_code_api_latency_seconds_bucket[5m]))",
            "legendFormat": "P95"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Active Sessions",
        "targets": [
          {
            "expr": "claude_code_active_sessions",
            "legendFormat": "Sessions"
          }
        ],
        "type": "stat"
      },
      {
        "title": "Token Usage Rate",
        "targets": [
          {
            "expr": "rate(claude_code_tokens_used_total[1h])",
            "legendFormat": "{{ model }} - {{ type }}"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Cumulative Cost",
        "targets": [
          {
            "expr": "claude_code_cost_usd",
            "legendFormat": "Cost (USD)"
          }
        ],
        "type": "stat"
      },
      {
        "title": "API Error Rate",
        "targets": [
          {
            "expr": "sum(rate(claude_code_api_requests_total{status=~\"5..\"}[5m])) / sum(rate(claude_code_api_requests_total[5m]))",
            "legendFormat": "Error Rate"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Sandbox Violations",
        "targets": [
          {
            "expr": "rate(claude_sandbox_violations_total[5m]) * 60",
            "legendFormat": "Violations/min"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Resource Usage",
        "targets": [
          {
            "expr": "rate(process_cpu_seconds_total{job=\"claude-code-api\"}[5m])",
            "legendFormat": "CPU"
          },
          {
            "expr": "process_resident_memory_bytes{job=\"claude-code-api\"} / 1024 / 1024 / 1024",
            "legendFormat": "Memory (GB)"
          }
        ],
        "type": "graph"
      }
    ]
  }
}

34.3.5 Log Analysis

ELK Stack Configuration

log_analyzer.py

python

from datetime import datetime, timedelta, timezone

from elasticsearch import Elasticsearch


def _since(**delta):
    """Return an ISO 8601 UTC timestamp for now minus timedelta(**delta)."""
    return (datetime.now(timezone.utc) - timedelta(**delta)).isoformat()


class ClaudeCodeLogAnalyzer:
    def __init__(self, es_host='http://localhost:9200'):
        self.es = Elasticsearch([es_host])
        self.index_pattern = 'claude-code-*'

    def search_errors(self, hours=24):
        """Search error logs."""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"level": "ERROR"}},
                        {"range": {"@timestamp": {"gte": _since(hours=hours)}}},
                    ]
                }
            }
        }
        response = self.es.search(index=self.index_pattern, body=query)
        return response['hits']['hits']

    def search_slow_requests(self, threshold_seconds=2, hours=24):
        """Search slow requests."""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"range": {"latency": {"gte": threshold_seconds}}},
                        {"range": {"@timestamp": {"gte": _since(hours=hours)}}},
                    ]
                }
            }
        }
        response = self.es.search(index=self.index_pattern, body=query)
        return response['hits']['hits']

    def analyze_user_activity(self, user_id, days=7):
        """Analyze a user's activity."""
        query = {
            "query": {
                "bool": {
                    "must": [
                        {"match": {"user_id": user_id}},
                        {"range": {"@timestamp": {"gte": _since(days=days)}}},
                    ]
                }
            },
            "aggs": {
                "daily_requests": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "calendar_interval": "day"
                    },
                    "aggs": {
                        "total_tokens": {"sum": {"field": "tokens_used"}}
                    }
                }
            }
        }
        response = self.es.search(index=self.index_pattern, body=query)
        return response

    def _avg_request_count(self, time_range):
        """Average the request_count field over a time range."""
        query = {
            "size": 0,
            "query": {"range": {"@timestamp": time_range}},
            "aggs": {"avg_rate": {"avg": {"field": "request_count"}}},
        }
        response = self.es.search(index=self.index_pattern, body=query)
        return response['aggregations']['avg_rate']['value']

    def detect_anomalies(self, hours=1):
        """Detect anomalies."""
        # Baseline: the window immediately before the current one
        avg_rate = self._avg_request_count(
            {"gte": _since(hours=hours * 2), "lt": _since(hours=hours)}
        )
        # Current window
        current_rate = self._avg_request_count({"gte": _since(hours=hours)})

        # The avg aggregation returns None when no documents match
        if avg_rate is None or current_rate is None:
            return {"anomaly": False}

        # Treat the current rate as anomalous if it exceeds twice the baseline
        if current_rate > avg_rate * 2:
            return {
                "anomaly": True,
                "avg_rate": avg_rate,
                "current_rate": current_rate,
                "threshold": avg_rate * 2
            }

        return {"anomaly": False}


# Usage example
analyzer = ClaudeCodeLogAnalyzer()

# Search errors
errors = analyzer.search_errors(hours=24)
print(f"Found {len(errors)} errors")

# Search slow requests
slow_requests = analyzer.search_slow_requests(threshold_seconds=2, hours=24)
print(f"Found {len(slow_requests)} slow requests")

# Analyze user activity
user_activity = analyzer.analyze_user_activity(user_id="user123", days=7)

# Detect anomalies
anomalies = analyzer.detect_anomalies(hours=1)
if anomalies['anomaly']:
    print(f"Anomaly detected! Current rate: {anomalies['current_rate']}, threshold: {anomalies['threshold']}")
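Search results come back as a list of hits, each with its document under `_source`. Since the Filebeat configuration in this section attaches a `service` field to every log line, a natural next step is to group hits by service. The sketch below assumes that field is present; the helper name `count_by_service` and the sample hits are illustrative.

```python
from collections import Counter


def count_by_service(hits):
    """Count Elasticsearch hits per value of the `service` field.

    Hits without a `service` field are counted under "unknown".
    """
    counts = Counter()
    for hit in hits:
        source = hit.get("_source", {})
        counts[source.get("service", "unknown")] += 1
    return counts


if __name__ == "__main__":
    # Hand-written sample hits standing in for a real search response
    sample_hits = [
        {"_source": {"service": "claude-code", "level": "ERROR"}},
        {"_source": {"service": "llm-gateway", "level": "ERROR"}},
        {"_source": {"service": "claude-code", "level": "ERROR"}},
        {"_source": {"level": "ERROR"}},
    ]
    for service, count in count_by_service(sample_hits).most_common():
        print(f"{service}: {count}")
```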

34.3.6 Maintenance Strategy

Routine Maintenance Tasks

bash

#!/bin/bash
# maintenance.sh

set -e

LOG_DIR="/var/log/claude-code"
BACKUP_DIR="/backup/claude-code"
DATE=$(date +%Y-%m-%d)

echo "=== Claude Code maintenance script - $DATE ==="

# 1. Rotate logs
echo "Rotating logs..."
logrotate -f /etc/logrotate.d/claude-code

# 2. Clean up old logs
echo "Removing logs older than 30 days..."
find "$LOG_DIR" -name "*.log" -mtime +30 -delete

# 3. Back up configuration
echo "Backing up configuration files..."
mkdir -p "$BACKUP_DIR/$DATE"
cp -r /etc/claude-code "$BACKUP_DIR/$DATE/"

# 4. Clear the cache
echo "Clearing cache..."
rm -rf /tmp/claude-code-cache/*

# 5. Database maintenance (if used)
echo "Running database maintenance..."
# psql -U claude -d claude_code -c "VACUUM ANALYZE;"

# 6. Generate the maintenance report
echo "Generating maintenance report..."
cat > "$BACKUP_DIR/$DATE/maintenance-report.txt" << EOF
Claude Code maintenance report
Date: $DATE

Log rotation: done
Old log cleanup: done
Configuration backup: done
Cache cleanup: done
Database maintenance: done

Disk usage:
$(df -h /var/log/claude-code)

Service status:
$(systemctl status claude-code --no-pager)
EOF

echo "Maintenance complete! Report saved to $BACKUP_DIR/$DATE/maintenance-report.txt"
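Before letting a cleanup step delete files, it can be useful to list what would be removed. The sketch below approximates the script's `find ... -mtime +30` selection in Python as a dry run. The helper name `find_old_logs` is hypothetical, and the age comparison is a plain cutoff in seconds, so results may differ from `find` by up to a day because `find` rounds ages to whole days.

```python
import time
from pathlib import Path


def find_old_logs(log_dir, days=30, now=None):
    """Return *.log files under log_dir last modified more than `days` ago.

    `now` is a Unix timestamp; it defaults to the current time and is
    exposed so the function can be tested deterministically.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    return sorted(
        path
        for path in Path(log_dir).rglob("*.log")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    # Dry run: print candidates without deleting anything
    for path in find_old_logs("/var/log/claude-code"):
        print("would delete:", path)
```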

Health Check Script

health_check.py

python

import requests
import json
import sys
from datetime import datetime

class ClaudeCodeHealthChecker:
    def __init__(self, api_base_url='http://localhost:8080'):
        self.api_base_url = api_base_url
        self.checks = []

    def check_api_health(self):
        """Check API health."""
        try:
            response = requests.get(f'{self.api_base_url}/health', timeout=5)
            if response.status_code == 200:
                data = response.json()
                self.checks.append({
                    "name": "API Health",
                    "status": "healthy",
                    "details": data
                })
                return True
            else:
                self.checks.append({
                    "name": "API Health",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "API Health",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_llm_gateway(self):
        """Check the LLM gateway."""
        try:
            response = requests.get('http://localhost:4000/health', timeout=5)
            if response.status_code == 200:
                self.checks.append({
                    "name": "LLM Gateway",
                    "status": "healthy",
                    "details": response.json()
                })
                return True
            else:
                self.checks.append({
                    "name": "LLM Gateway",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "LLM Gateway",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_sandbox(self):
        """Check sandbox status."""
        try:
            response = requests.get(f'{self.api_base_url}/sandbox/status', timeout=5)
            if response.status_code == 200:
                data = response.json()
                self.checks.append({
                    "name": "Sandbox",
                    "status": "healthy",
                    "details": data
                })
                return True
            else:
                self.checks.append({
                    "name": "Sandbox",
                    "status": "unhealthy",
                    "details": f"Status code: {response.status_code}"
                })
                return False
        except Exception as e:
            self.checks.append({
                "name": "Sandbox",
                "status": "unhealthy",
                "details": str(e)
            })
            return False

    def check_disk_space(self, threshold=90):
        """Check disk space."""
        import shutil
        usage = shutil.disk_usage('/')
        percent = (usage.used / usage.total) * 100

        if percent < threshold:
            self.checks.append({
                "name": "Disk Space",
                "status": "healthy",
                "details": f"Usage: {percent:.1f}%"
            })
            return True
        else:
            self.checks.append({
                "name": "Disk Space",
                "status": "unhealthy",
                "details": f"Usage: {percent:.1f}% (Threshold: {threshold}%)"
            })
            return False

    def check_memory(self, threshold=90):
        """Check memory usage (requires the third-party psutil package)."""
        import psutil
        percent = psutil.virtual_memory().percent

        if percent < threshold:
            self.checks.append({
                "name": "Memory",
                "status": "healthy",
                "details": f"Usage: {percent:.1f}%"
            })
            return True
        else:
            self.checks.append({
                "name": "Memory",
                "status": "unhealthy",
                "details": f"Usage: {percent:.1f}% (Threshold: {threshold}%)"
            })
            return False

    def run_all_checks(self):
        """Run all checks."""
        self.check_api_health()
        self.check_llm_gateway()
        self.check_sandbox()
        self.check_disk_space()
        self.check_memory()

        return self.checks

    def generate_report(self):
        """Generate the health check report."""
        report = {
            "timestamp": datetime.now().isoformat(),
            "overall_status": "healthy",
            "checks": self.checks
        }

        # Determine the overall status
        for check in self.checks:
            if check['status'] == 'unhealthy':
                report['overall_status'] = 'unhealthy'
                break

        return report

    def print_report(self):
        """Print the report."""
        report = self.generate_report()

        print("=" * 50)
        print("Claude Code health check report")
        print(f"Time: {report['timestamp']}")
        print(f"Overall status: {report['overall_status'].upper()}")
        print("=" * 50)

        for check in report['checks']:
            status_icon = "✓" if check['status'] == 'healthy' else "✗"
            print(f"{status_icon} {check['name']}: {check['status']}")
            print(f"  Details: {check['details']}")
            print()

        return report['overall_status'] == 'healthy'

if __name__ == '__main__':
    checker = ClaudeCodeHealthChecker()
    checker.run_all_checks()
    is_healthy = checker.print_report()

    sys.exit(0 if is_healthy else 1)

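34.3.7 Disaster Recovery

Backup Strategy

bash

#!/bin/bash
# backup.sh

set -e

BACKUP_DIR="/backup/claude-code"
DATE=$(date +%Y-%m-%d_%H-%M-%S)
BACKUP_PATH="$BACKUP_DIR/$DATE"

echo "=== Claude Code backup script - $DATE ==="

# Create the backup directory
mkdir -p "$BACKUP_PATH"

# 1. Back up configuration files
echo "Backing up configuration files..."
tar -czf "$BACKUP_PATH/config.tar.gz" /etc/claude-code

# 2. Back up the database
echo "Backing up database..."
# pg_dump -U claude claude_code > $BACKUP_PATH/database.sql

# 3. Back up logs
echo "Backing up logs..."
tar -czf "$BACKUP_PATH/logs.tar.gz" /var/log/claude-code

# 4. Back up user data
echo "Backing up user data..."
tar -czf "$BACKUP_PATH/user-data.tar.gz" /var/lib/claude-code

# 5. Generate the backup manifest
echo "Generating backup manifest..."
cat > "$BACKUP_PATH/manifest.txt" << EOF
Backup manifest
Date: $DATE
Configuration: config.tar.gz
Database: database.sql
Logs: logs.tar.gz
User data: user-data.tar.gz

File sizes:
$(du -sh "$BACKUP_PATH"/*)
EOF

# 6. Upload to remote storage (optional)
echo "Uploading to remote storage..."
# aws s3 cp $BACKUP_PATH s3://company-backups/claude-code/$DATE --recursive

# 7. Remove old backups (keep the last 30 days)
# -mindepth/-maxdepth limit the match to the per-run directories directly
# under BACKUP_DIR, so BACKUP_DIR itself is never removed.
echo "Removing old backups..."
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

echo "Backup complete! Backup location: $BACKUP_PATH"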
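A backup is only useful if it can be read back intact, so one common addition is to record a checksum per file when the backup is made and compare against it before restoring. The sketch below shows that idea with SHA-256. The function names are hypothetical, and how the recorded checksums are stored (for example, alongside the manifest) is left as an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_checksums(backup_path):
    """Map each file name directly under backup_path to its digest."""
    root = Path(backup_path)
    return {p.name: sha256_of(p) for p in sorted(root.iterdir()) if p.is_file()}


def changed_files(backup_path, expected):
    """Return names whose current digest differs from `expected` or that are missing."""
    actual = build_checksums(backup_path)
    return sorted(name for name, digest in expected.items() if actual.get(name) != digest)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "config.tar.gz").write_bytes(b"example archive bytes")
        recorded = build_checksums(tmp)
        print("changed:", changed_files(tmp, recorded))
```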

Restore Script

bash

#!/bin/bash
# restore.sh

set -e

if [ -z "$1" ]; then
    echo "Usage: $0 <backup directory>"
    exit 1
fi

BACKUP_PATH="$1"

echo "=== Claude Code restore script ==="
echo "Backup directory: $BACKUP_PATH"

# 1. Stop the service
echo "Stopping service..."
systemctl stop claude-code

# 2. Restore configuration files
echo "Restoring configuration files..."
tar -xzf "$BACKUP_PATH/config.tar.gz" -C /

# 3. Restore the database
echo "Restoring database..."
# psql -U claude -d claude_code < $BACKUP_PATH/database.sql

# 4. Restore user data
echo "Restoring user data..."
tar -xzf "$BACKUP_PATH/user-data.tar.gz" -C /

# 5. Start the service
echo "Starting service..."
systemctl start claude-code

# 6. Verify the restore
echo "Verifying restore..."
sleep 5

if systemctl is-active --quiet claude-code; then
    echo "Service started successfully!"
else
    echo "Service failed to start!"
    exit 1
fi

echo "Restore complete!"
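The restore script stops the service before it touches any archive, so a missing file would leave the service down partway through. A pre-flight check that runs before the stop step reduces that risk. The sketch below checks for the two archives the restore script extracts; the helper name `missing_backup_files` is hypothetical.

```python
from pathlib import Path

# Archives that restore.sh extracts; database.sql is optional there
REQUIRED_FILES = ("config.tar.gz", "user-data.tar.gz")


def missing_backup_files(backup_path, required=REQUIRED_FILES):
    """Return the required file names that are absent from backup_path."""
    root = Path(backup_path)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    import sys

    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <backup directory>")
        sys.exit(1)
    missing = missing_backup_files(sys.argv[1])
    if missing:
        print("Backup is incomplete, missing:", ", ".join(missing))
        sys.exit(1)
    print("Backup directory looks complete")
```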

34.3.8 Performance Optimization

Caching Strategy

cache_manager.py

python

import json
from datetime import datetime

import redis


class CacheManager:
    def __init__(self, redis_host='localhost', redis_port=6379):
        self.redis = redis.Redis(host=redis_host, port=redis_port, decode_responses=True)

    def cache_api_response(self, key, response, ttl=3600):
        """Cache an API response."""
        self.redis.setex(key, ttl, json.dumps(response))

    def get_cached_response(self, key):
        """Get a cached response."""
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None

    def cache_token_count(self, user_id, count, ttl=86400):
        """Add to the user's token count for today."""
        key = f"tokens:{user_id}:{datetime.now().strftime('%Y-%m-%d')}"
        self.redis.incrby(key, count)
        self.redis.expire(key, ttl)

    def get_token_count(self, user_id):
        """Get the user's token count for today."""
        key = f"tokens:{user_id}:{datetime.now().strftime('%Y-%m-%d')}"
        count = self.redis.get(key)
        return int(count) if count else 0

    def cache_model_response(self, model, prompt_hash, response, ttl=7200):
        """Cache a model response."""
        key = f"model:{model}:{prompt_hash}"
        self.redis.setex(key, ttl, json.dumps(response))

    def get_cached_model_response(self, model, prompt_hash):
        """Get a cached model response."""
        key = f"model:{model}:{prompt_hash}"
        cached = self.redis.get(key)
        if cached:
            return json.loads(cached)
        return None


# Usage example
cache = CacheManager()

# Cache an API response
cache.cache_api_response("api:user:123:profile", {"name": "John"}, ttl=3600)

# Get the cached response
cached = cache.get_cached_response("api:user:123:profile")
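The model-response cache is keyed by a `prompt_hash`, but how that hash is computed is not shown. One reasonable approach, sketched below as an assumption, is to hash a canonical JSON encoding of the prompt together with any request parameters that affect the output, so that the same request always maps to the same key. The function name `make_prompt_hash` and the `params` argument are illustrative.

```python
import hashlib
import json


def make_prompt_hash(prompt, params=None):
    """Return a stable SHA-256 hex digest for a prompt and its parameters.

    sort_keys=True makes the encoding independent of dict key order, so
    logically identical requests produce the same hash.
    """
    payload = json.dumps(
        {"prompt": prompt, "params": params or {}},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    key = make_prompt_hash("Summarize this file", {"temperature": 0, "max_tokens": 256})
    print(key)
```

Responses generated with sampling enabled can differ between calls, so caching by prompt hash fits best where repeated identical output is acceptable.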

Load Balancing Configuration

# nginx.conf
upstream claude_code_backend {
    least_conn;
    server claude-code-1:8080 weight=3;
    server claude-code-2:8080 weight=2;
    server claude-code-3:8080 weight=1;

    keepalive 32;
}

server {
    listen 80;
    server_name claude-code.company.com;

    # Redirect to HTTPS
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name claude-code.company.com;

    ssl_certificate /etc/nginx/ssl/claude-code.crt;
    ssl_certificate_key /etc/nginx/ssl/claude-code.key;

    # SSL settings
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Logs
    access_log /var/log/nginx/claude-code-access.log;
    error_log /var/log/nginx/claude-code-error.log;

    # Proxy settings
    location / {
        proxy_pass http://claude_code_backend;

        # Upstream keepalive needs HTTP/1.1 and an empty Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
        proxy_busy_buffers_size 8k;

        # Active health checks (this directive is only available in NGINX Plus;
        # remove it when running open-source nginx)
        health_check interval=10s fails=3 passes=2;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://claude_code_backend/health;
        access_log off;
    }
}
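With `least_conn` and per-server weights, nginx prefers the server with the fewest active connections while taking the weights into account. The sketch below is a simplified Python model of that selection, comparing active connections divided by weight; it is meant to build intuition for the weights in the upstream block and is not nginx's actual algorithm. The function name `pick_backend` and the tie-break by name are assumptions.

```python
def pick_backend(active_connections, weights):
    """Pick the server with the lowest active-connections-to-weight ratio.

    Ties are broken by server name so the result is deterministic.
    """
    return min(
        weights,
        key=lambda name: (active_connections.get(name, 0) / weights[name], name),
    )


if __name__ == "__main__":
    # Weights taken from the upstream block above
    weights = {"claude-code-1": 3, "claude-code-2": 2, "claude-code-3": 1}
    active = {"claude-code-1": 6, "claude-code-2": 2, "claude-code-3": 1}
    print(pick_backend(active, weights))
```

In this model a server with weight 3 carrying six connections is treated as busier than a server with weight 2 carrying two, which is why the heavier-weighted server does not always win.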

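34.3.9 Summary

This section covered the main aspects of enterprise monitoring and maintenance:

Monitoring overview and monitoring dimensions
Metrics collection (Prometheus, custom exporter)
Alerting configuration (Prometheus, Alertmanager)
Visualization dashboards (Grafana)
Log analysis (ELK Stack)
Maintenance strategy (routine maintenance, health checks)
Disaster recovery (backup and restore)
Performance optimization (caching, load balancing)

With a solid monitoring and maintenance system in place, an enterprise can keep Claude Code running reliably in production, detect and resolve problems promptly, and keep performance and cost under control.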
