
# 33.2 LiteLLM Gateway Deployment

## 33.2.1 Introduction to LiteLLM

LiteLLM is an open-source LLM gateway that supports 100+ LLM providers, including Anthropic, OpenAI, and Cohere. It exposes a unified API, which simplifies using and managing multiple providers.

### Core Features of LiteLLM

  1. Multi-provider support: works with 100+ LLM providers
  2. Unified API: one consistent interface that simplifies integration (see the sketch after this list)
  3. Built-in caching: reduces cost and latency
  4. Rate limiting: configurable limits to control usage
  5. Cost tracking: detailed usage and cost analytics
  6. Load balancing: distributes requests across multiple API keys
  7. Retry on failure: automatically retries failed requests
  8. Streaming: supports streamed responses
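To illustrate the unified API, here is a minimal sketch using the `litellm` Python SDK directly (without the proxy). It assumes `ANTHROPIC_API_KEY` is set in the environment and that the model identifier is available to your account:

```python
# pip install litellm
# Assumes ANTHROPIC_API_KEY is set in the environment.
from litellm import completion

# The same call shape works for every supported provider; only the model string
# changes (e.g. "openai/gpt-4o" would route to OpenAI instead).
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```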

### LiteLLM Architecture

```
┌─────────────────────────────────────────┐
│           Claude Code client            │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│              LiteLLM Proxy              │
│  ┌───────────────────────────────────┐  │
│  │ API layer                         │  │
│  │ (Anthropic, OpenAI, ...)          │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ Cache layer                       │  │
│  │ (Redis, Memcached)                │  │
│  └───────────────────────────────────┘  │
│  ┌───────────────────────────────────┐  │
│  │ Monitoring layer                  │  │
│  │ (Prometheus, Grafana)             │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────────┐
│             LLM providers               │
│    (Anthropic, OpenAI, Cohere, ...)     │
└─────────────────────────────────────────┘
```

## 33.2.2 Installation and Configuration

### 1. Install LiteLLM

#### Install with Docker (recommended)

```bash
# Pull the LiteLLM image
docker pull litellm/litellm:latest

# Create a configuration directory
mkdir -p ~/litellm/config
cd ~/litellm

# Create the configuration file
cat > config.yaml << EOF
model_list:
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  set_verbose: true

general_settings:
  master_key: sk-litellm-master-key-123456
  database_url: postgresql://user:password@localhost:5432/litellm

security_settings:
  valid_api_keys:
    - sk-team-a-key-123
    - sk-team-b-key-456
EOF

# Start LiteLLM, passing the mounted config to the proxy
docker run -d \
  --name litellm \
  -p 4000:4000 \
  -v $(pwd)/config.yaml:/app/config.yaml \
  -e ANTHROPIC_API_KEY=sk-ant-xxx \
  litellm/litellm:latest \
  --config /app/config.yaml
```
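Once the container is running, a quick way to confirm the proxy is reachable is to hit its health endpoint with the master key from the sample configuration above (the host, port, and key here are the values assumed in this chapter's examples):

```python
# A minimal sketch: verify the proxy answers before wiring up any clients.
import requests

PROXY_URL = "http://localhost:4000"            # adjust if the container runs elsewhere
MASTER_KEY = "sk-litellm-master-key-123456"    # master_key from config.yaml above

resp = requests.get(
    f"{PROXY_URL}/health",
    headers={"Authorization": f"Bearer {MASTER_KEY}"},
    timeout=10,
)
print(resp.status_code, resp.text)
```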

#### Install with Python

```bash
# Install LiteLLM with the proxy extras
pip install 'litellm[proxy]'

# Initialize a configuration
litellm init

# Edit the configuration file
nano litellm_config.yaml

# Start the proxy server
litellm --config litellm_config.yaml --port 4000
```

### 2. Configuration File Reference

```yaml
# litellm_config.yaml

# Model list
model_list:
  # Anthropic Claude models
  - model_name: claude-sonnet-4
    litellm_params:
      model: claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      api_base: https://api.anthropic.com
      max_tokens: 4096
      temperature: 0.7

  - model_name: claude-opus-4
    litellm_params:
      model: claude-opus-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096

  - model_name: claude-haiku-4
    litellm_params:
      model: claude-haiku-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
      max_tokens: 4096

  # Amazon Bedrock models
  - model_name: bedrock-claude-sonnet
    litellm_params:
      model: anthropic.claude-sonnet-4-5-20250929-v1:0
      api_base: https://bedrock-runtime.us-east-1.amazonaws.com
      api_key: os.environ/AWS_ACCESS_KEY_ID
      aws_secret_access_key: os.environ/AWS_SECRET_ACCESS_KEY
      aws_region_name: us-east-1

  # Google Vertex AI models
  - model_name: vertex-claude-sonnet
    litellm_params:
      model: claude-sonnet-4-5@20250929
      api_base: https://us-central1-aiplatform.googleapis.com
      api_key: os.environ/GOOGLE_APPLICATION_CREDENTIALS
      vertex_project: os.environ/VERTEX_PROJECT_ID
      vertex_location: us-central1

# LiteLLM settings
litellm_settings:
  drop_params: true              # drop unsupported parameters
  set_verbose: true              # enable verbose logging
  json_logs: true                # log in JSON format
  success_callback: http://localhost:5000/callback  # success callback
  failure_callback: http://localhost:5000/failure   # failure callback

# General settings
general_settings:
  master_key: sk-litellm-master-key-123456                         # master key
  database_url: postgresql://user:password@localhost:5432/litellm  # database URL
  cache: redis://localhost:6379                                    # Redis cache
  cache_seconds: 3600                                              # cache TTL (seconds)

# Security settings
security_settings:
  valid_api_keys:            # valid API keys
    - sk-team-a-key-123
    - sk-team-b-key-456
    - sk-team-c-key-789
  max_budget: 1000.0         # maximum budget (USD)
  budget_duration: monthly   # budget period
  rpm_limit: 100             # requests per minute
  tpm_limit: 10000           # tokens per minute

# Load-balancing settings
load_balancing_settings:
  routing_strategy: usage-based   # routing strategy: usage-based, round-robin, least-latency
  health_check: true              # enable health checks
  health_check_interval: 60       # health-check interval (seconds)

# Monitoring settings
monitoring_settings:
  enable_prometheus: true    # enable Prometheus
  prometheus_port: 9090      # Prometheus port
  enable_slack_alerts: true  # enable Slack alerts
  slack_webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
  alert_thresholds:
    error_rate: 0.05         # error-rate threshold
    latency_p99: 5000        # P99 latency threshold (milliseconds)
```
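With the proxy configured, clients can address any of the `model_name` aliases above through LiteLLM's OpenAI-compatible endpoint. A minimal sketch, assuming the proxy runs at `http://localhost:4000` and uses the team keys from the sample configuration:

```python
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy.
client = OpenAI(
    base_url="http://localhost:4000",
    api_key="sk-team-a-key-123",   # a team key accepted by the proxy
)

# "claude-sonnet-4" is the model_name alias from model_list; LiteLLM maps it to
# claude-sonnet-4-20250514 and injects the real provider credentials.
response = client.chat.completions.create(
    model="claude-sonnet-4",
    messages=[{"role": "user", "content": "Summarize what an LLM gateway does."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```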

## 33.2.3 Advanced Configuration

### 1. Cache Configuration

```yaml
# Cache settings
cache_settings:
  type: redis                      # cache backend: redis, memory, none
  redis_url: redis://localhost:6379/0
  cache_ttl: 3600                  # cache time-to-live (seconds)
  cache_key_prefix: litellm        # cache key prefix
  enable_cache_for_stream: false   # whether to cache streaming responses
  cache_control_headers: true      # honour cache-control headers
```
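When response caching is enabled, an identical request served within `cache_ttl` should come back noticeably faster. A rough sketch to observe this through the proxy (endpoint, key, and model alias are the values assumed earlier; the actual speed-up depends on the cache backend):

```python
import time
import requests

PROXY_URL = "http://localhost:4000"
HEADERS = {"Authorization": "Bearer sk-team-a-key-123"}
BODY = {
    "model": "claude-sonnet-4",
    "messages": [{"role": "user", "content": "What is a cache?"}],
    "max_tokens": 64,
}

# Send the same request twice; the second call should be served from the cache.
for attempt in (1, 2):
    start = time.time()
    resp = requests.post(f"{PROXY_URL}/v1/chat/completions",
                         headers=HEADERS, json=BODY, timeout=60)
    print(f"attempt {attempt}: status={resp.status_code}, latency={time.time() - start:.2f}s")
```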

### 2. Rate-Limit Configuration

```yaml
# Rate-limit settings
rate_limit_settings:
  enabled: true
  strategy: sliding_window   # strategy: sliding_window, token_bucket, fixed_window
  limits:
    - api_key: sk-team-a-key-123
      rpm: 100      # requests per minute
      tpm: 10000    # tokens per minute
      rpd: 10000    # requests per day
    - api_key: sk-team-b-key-456
      rpm: 50
      tpm: 5000
      rpd: 5000
  default_limits:
    rpm: 10
    tpm: 1000
    rpd: 100
  burst_size: 20    # burst size
```
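A quick way to see a per-key limit in action is to send more requests than the key's `rpm` allows and count the HTTP 429 responses (a sketch; the exact status code and behaviour depend on how the proxy enforces the limit):

```python
import requests

PROXY_URL = "http://localhost:4000"
HEADERS = {"Authorization": "Bearer sk-team-b-key-456"}  # rpm: 50 in the sample limits
BODY = {
    "model": "claude-haiku-4",
    "messages": [{"role": "user", "content": "ping"}],
    "max_tokens": 8,
}

# Exceed the per-minute allowance and count how many requests are rejected.
rejected = 0
for _ in range(60):
    resp = requests.post(f"{PROXY_URL}/v1/chat/completions",
                         headers=HEADERS, json=BODY, timeout=60)
    if resp.status_code == 429:   # "Too Many Requests"
        rejected += 1
print(f"rejected {rejected} of 60 requests")
```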

### 3. Budget Control Configuration

```yaml
# Budget settings
budget_settings:
  enabled: true
  currency: USD
  budgets:
    - name: team-a-budget
      api_keys:
        - sk-team-a-key-123
      limit: 1000.0
      period: monthly
      alert_threshold: 0.8   # alert at 80% of the budget
      hard_limit: true       # block requests once the limit is reached
    - name: team-b-budget
      api_keys:
        - sk-team-b-key-456
      limit: 500.0
      period: monthly
      alert_threshold: 0.9
      hard_limit: false
  cost_tracking:
    enabled: true
    update_interval: 60   # update interval (seconds)
    storage: database     # storage backend: database, file
```
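To make the thresholds concrete: a $1000 budget with `alert_threshold: 0.8` raises an alert at $800 of tracked spend, and `hard_limit: true` blocks further requests at $1000. A small sketch of that decision logic (illustrative only, not LiteLLM's internal implementation):

```python
from dataclasses import dataclass

@dataclass
class Budget:
    limit: float            # e.g. 1000.0 USD
    alert_threshold: float  # e.g. 0.8 -> alert at 80%
    hard_limit: bool        # block requests once the limit is reached

def check_budget(budget: Budget, spend: float) -> tuple[bool, bool]:
    """Return (should_alert, should_block) for the current tracked spend."""
    should_alert = spend >= budget.limit * budget.alert_threshold
    should_block = budget.hard_limit and spend >= budget.limit
    return should_alert, should_block

# team-a-budget from the sample config: alerts at $800, blocks at $1000.
print(check_budget(Budget(1000.0, 0.8, True), 850.0))   # (True, False)
print(check_budget(Budget(1000.0, 0.8, True), 1000.0))  # (True, True)
```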

### 4. Monitoring and Alerting Configuration

```yaml
# Monitoring settings
monitoring_settings:
  prometheus:
    enabled: true
    port: 9090
    metrics:
      - request_count
      - request_duration
      - error_count
      - cache_hit_rate
      - token_usage
      - cost

  grafana:
    enabled: true
    dashboard_url: http://localhost:3000/d/litellm

  alerts:
    slack:
      enabled: true
      webhook_url: https://hooks.slack.com/services/xxx/yyy/zzz
      channels:
        - litellm-alerts
        - devops-notifications
      alert_rules:
        - name: high_error_rate
          condition: error_rate > 0.05
          duration: 5m
          severity: warning
        - name: high_latency
          condition: p99_latency > 5000
          duration: 2m
          severity: critical
        - name: budget_exceeded
          condition: budget_usage > 1.0
          severity: critical

    email:
      enabled: true
      smtp_server: smtp.gmail.com
      smtp_port: 587
      smtp_username: alerts@company.com
      smtp_password: ${SMTP_PASSWORD}
      from_address: litellm-alerts@company.com
      to_addresses:
        - devops@company.com
        - finance@company.com
```
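As a concrete reading of the rules above: `high_error_rate` fires only if the error rate stays above 5% for a full 5 minutes. A small sketch of that windowed evaluation (illustrative logic, not the alerting engine LiteLLM ships):

```python
def rule_fires(samples: list[float], threshold: float, window: int) -> bool:
    """samples: one error-rate reading per minute, newest last.
    The rule fires when every reading in the last `window` minutes exceeds the threshold."""
    if len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])

# high_error_rate: error_rate > 0.05 sustained for 5 minutes.
readings = [0.02, 0.06, 0.07, 0.08, 0.09, 0.06]
print(rule_fires(readings, threshold=0.05, window=5))  # True
```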

## 33.2.4 Integrating Claude Code

### 1. Configure Claude Code to Use LiteLLM

```bash
# Option 1: unified endpoint (recommended)
export ANTHROPIC_BASE_URL=https://litellm-server:4000
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Option 2: Anthropic-format endpoint
export ANTHROPIC_BASE_URL=https://litellm-server:4000/anthropic
export ANTHROPIC_AUTH_TOKEN=sk-litellm-static-key

# Option 3: API key helper script
# Create the helper script
cat > ~/bin/get-litellm-key.sh << 'EOF'
#!/bin/bash
# Fetch the key from Vault
vault kv get -field=api_key secret/litellm/claude-code
EOF
chmod +x ~/bin/get-litellm-key.sh

# Configure Claude Code to use the helper
cat > ~/.claude/settings.json << EOF
{
  "apiKeyHelper": "~/bin/get-litellm-key.sh",
  "env": {
    "ANTHROPIC_BASE_URL": "https://litellm-server:4000"
  }
}
EOF
```

### 2. Verify the Configuration

```python
from dataclasses import dataclass, field

import requests


@dataclass
class ValidationResult:
    """Result of a single validation step."""
    success: bool = False
    message: str = ""
    error: str = ""


@dataclass
class ValidationReport:
    """Aggregated validation results."""
    connection: ValidationResult = field(default_factory=ValidationResult)
    models: dict = field(default_factory=dict)
    summary: str = ""


class LiteLLMValidator:
    """LiteLLM validator"""

    def __init__(self, gateway_url: str, auth_token: str):
        self.gateway_url = gateway_url
        self.auth_token = auth_token

    def validate_connection(self) -> ValidationResult:
        """Validate connectivity to the gateway"""
        result = ValidationResult()

        try:
            # Probe the health-check endpoint
            response = requests.get(
                f"{self.gateway_url}/health",
                headers={'Authorization': f'Bearer {self.auth_token}'},
                timeout=10
            )

            if response.status_code == 200:
                result.success = True
                result.message = "Connection successful"
            else:
                result.success = False
                result.message = f"Health check failed: {response.status_code}"

        except requests.exceptions.Timeout:
            result.success = False
            result.message = "Connection timeout"
        except requests.exceptions.ConnectionError:
            result.success = False
            result.message = "Connection error"
        except Exception as e:
            result.success = False
            result.message = f"Unexpected error: {str(e)}"

        return result

    def validate_model_access(self, model: str) -> ValidationResult:
        """Validate access to a specific model"""
        result = ValidationResult()

        try:
            # Issue a minimal completion request against the model
            response = requests.post(
                f"{self.gateway_url}/v1/completions",
                headers={
                    'Authorization': f'Bearer {self.auth_token}',
                    'Content-Type': 'application/json'
                },
                json={
                    'model': model,
                    'prompt': 'Hello',
                    'max_tokens': 10
                },
                timeout=30
            )

            if response.status_code == 200:
                result.success = True
                result.message = f"Model {model} accessible"
            else:
                result.success = False
                result.message = f"Model access failed: {response.status_code}"
                result.error = response.text

        except Exception as e:
            result.success = False
            result.message = f"Model access error: {str(e)}"

        return result

    def validate_all(self) -> ValidationReport:
        """Validate the whole configuration"""
        report = ValidationReport()

        # Validate connectivity
        report.connection = self.validate_connection()

        # Validate model access
        models = ['claude-sonnet-4', 'claude-opus-4', 'claude-haiku-4']
        report.models = {}
        for model in models:
            report.models[model] = self.validate_model_access(model)

        # Generate the summary
        report.summary = self._generate_summary(report)

        return report

    def _generate_summary(self, report: ValidationReport) -> str:
        """Generate a validation summary"""
        summary = "LiteLLM Validation Summary:\n\n"

        summary += f"Connection: {'✓' if report.connection.success else '✗'} "
        summary += f"{report.connection.message}\n\n"

        summary += "Model Access:\n"
        for model, result in report.models.items():
            status = '✓' if result.success else '✗'
            summary += f"  {status} {model}: {result.message}\n"

        return summary
```
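A short usage sketch for the validator above, run against the proxy endpoint and master key from this chapter's sample setup (both values are assumptions):

```python
validator = LiteLLMValidator(
    gateway_url="http://localhost:4000",
    auth_token="sk-litellm-master-key-123456",
)

report = validator.validate_all()
print(report.summary)
```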
## 33.2.5 Monitoring and Maintenance
### 1. Prometheus Monitoring

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'litellm'
    static_configs:
      - targets: ['litellm-server:9090']
    metrics_path: '/metrics'
```

### 2. Grafana Dashboard

```json
    {
      "dashboard": {
        "title": "LiteLLM Dashboard",
        "panels": [
          {
            "title": "Request Rate",
            "targets": [
              {
                "expr": "rate(litellm_request_count[1m])"
              }
            ]
          },
          {
            "title": "Error Rate",
            "targets": [
              {
                "expr": "rate(litellm_error_count[1m]) / rate(litellm_request_count[1m])"
              }
            ]
          },
          {
            "title": "P99 Latency",
            "targets": [
              {
                "expr": "histogram_quantile(0.99, rate(litellm_request_duration_bucket[1m]))"
              }
            ]
          },
          {
            "title": "Cache Hit Rate",
            "targets": [
              {
                "expr": "rate(litellm_cache_hits[1m]) / rate(litellm_cache_requests[1m])"
              }
            ]
          },
          {
            "title": "Token Usage",
            "targets": [
              {
                "expr": "rate(litellm_token_usage[1m])"
              }
            ]
          },
          {
            "title": "Cost",
            "targets": [
              {
                "expr": "litellm_cost_total"
              }
            ]
          }
        ]
      }
    }

```

### 3. Log Management

```python
import logging
from datetime import datetime

import numpy as np

logger = logging.getLogger(__name__)

# LiteLLMLogParser and LogAnalysis are assumed to be defined elsewhere in the project.


class LiteLLMLogManager:
    """LiteLLM log manager"""

    def __init__(self, log_file: str):
        self.log_file = log_file
        self.log_parser = LiteLLMLogParser()

    def analyze_logs(self,
                     start_time: datetime = None,
                     end_time: datetime = None) -> LogAnalysis:
        """Analyze the log file"""
        analysis = LogAnalysis()

        # Read the log file
        with open(self.log_file, 'r') as f:
            logs = f.readlines()

        # Parse each log line
        parsed_logs = []
        for log in logs:
            try:
                parsed = self.log_parser.parse(log)
                parsed_logs.append(parsed)
            except Exception as e:
                logger.warning(f"Failed to parse log: {e}")

        # Filter by time range
        if start_time or end_time:
            parsed_logs = [
                log for log in parsed_logs
                if (not start_time or log.timestamp >= start_time) and
                   (not end_time or log.timestamp <= end_time)
            ]

        # Request counts and error rate
        analysis.total_requests = len(parsed_logs)
        analysis.successful_requests = sum(
            1 for log in parsed_logs if log.status == 'success'
        )
        analysis.failed_requests = sum(
            1 for log in parsed_logs if log.status == 'error'
        )
        analysis.error_rate = (
            analysis.failed_requests / analysis.total_requests
            if analysis.total_requests > 0 else 0
        )

        # Latency statistics
        latencies = [log.duration for log in parsed_logs if log.duration]
        if latencies:
            analysis.avg_latency = sum(latencies) / len(latencies)
            analysis.p50_latency = np.percentile(latencies, 50)
            analysis.p95_latency = np.percentile(latencies, 95)
            analysis.p99_latency = np.percentile(latencies, 99)

        # Token usage
        analysis.total_tokens = sum(
            log.input_tokens + log.output_tokens
            for log in parsed_logs
        )

        # Cost
        analysis.total_cost = sum(log.cost for log in parsed_logs)

        return analysis

    def generate_report(self, analysis: LogAnalysis) -> str:
        """Generate a text report"""
        report = "LiteLLM Log Analysis Report\n"
        report += "=" * 50 + "\n\n"

        report += "Request Summary:\n"
        report += f"  Total: {analysis.total_requests}\n"
        report += f"  Successful: {analysis.successful_requests}\n"
        report += f"  Failed: {analysis.failed_requests}\n"
        report += f"  Error Rate: {analysis.error_rate:.2%}\n\n"

        report += "Latency (ms):\n"
        report += f"  Average: {analysis.avg_latency:.0f}\n"
        report += f"  P50: {analysis.p50_latency:.0f}\n"
        report += f"  P95: {analysis.p95_latency:.0f}\n"
        report += f"  P99: {analysis.p99_latency:.0f}\n\n"

        report += "Token Usage:\n"
        report += f"  Total: {analysis.total_tokens:,}\n\n"

        report += "Cost:\n"
        report += f"  Total: ${analysis.total_cost:.2f}\n"

        return report
```
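A usage sketch for the log manager, assuming the proxy writes its logs to a local file (the path here, and the availability of `LiteLLMLogParser`/`LogAnalysis`, are assumptions):

```python
from datetime import datetime, timedelta

manager = LiteLLMLogManager(log_file="/var/log/litellm/litellm.log")

# Analyze the last 24 hours and print the report.
analysis = manager.analyze_logs(start_time=datetime.now() - timedelta(days=1))
print(manager.generate_report(analysis))
```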
