19 Failure Recovery and Model Fallback

Module goal

Break down the system's three-layer recovery strategy for real production errors: thinking downgrade → auth profile rotation → model-level fallback (runWithModelFallback).

Core files

src/agents/pi-embedded-helpers/errors.ts: error classification (isContextOverflowError and related helpers)
src/agents/pi-embedded-helpers/thinking.ts: thinking downgrade
src/agents/pi-embedded-runner/run.ts: auth profile rotation + error decision tree
src/agents/model-fallback.ts: model-level fallback
src/agents/failover-error.ts: the FailoverError type

1. Error classification functions (exact patterns from the source)

isContextOverflowError

// src/agents/pi-embedded-helpers/errors.ts

export function isContextOverflowError(errorMessage?: string): boolean {
  if (!errorMessage) return false;
  const lower = errorMessage.toLowerCase();
  const hasRequestSizeExceeds = lower.includes("request size exceeds");
  const hasContextWindow =
    lower.includes("context window") ||
    lower.includes("context length") ||
    lower.includes("maximum context length");
  return (
    lower.includes("request_too_large") ||
    lower.includes("request exceeds the maximum size") ||
    lower.includes("context length exceeded") ||
    lower.includes("maximum context length") ||
    lower.includes("prompt is too long") ||
    lower.includes("exceeds model context window") ||
    (hasRequestSizeExceeds && hasContextWindow) ||  // combined check
    lower.includes("context overflow:") ||           // internal OpenClaw marker
    (lower.includes("413") && lower.includes("too large"))  // HTTP 413
  );
}

That makes 9 matching paths in total, covering the error message formats of the mainstream LLM providers.

isLikelyContextOverflowError (a looser heuristic)

export function isLikelyContextOverflowError(errorMessage?: string): boolean {
  if (!errorMessage) return false;
  if (CONTEXT_WINDOW_TOO_SMALL_RE.test(errorMessage)) return false;  // the model itself cannot support this; exclude
  if (isRateLimitErrorMessage(errorMessage)) return false;            // avoid false positives on rate limits
  if (isContextOverflowError(errorMessage)) return true;             // exact match takes priority
  if (RATE_LIMIT_HINT_RE.test(errorMessage)) return false;
  return CONTEXT_OVERFLOW_HINT_RE.test(errorMessage);                // loose regex as a last resort
}
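
The order of the guards is what keeps the loose regex from misfiring. The following is a minimal sketch of that ordering only. Every regex and helper here is a hypothetical stand-in, because the real CONTEXT_WINDOW_TOO_SMALL_RE, RATE_LIMIT_HINT_RE, CONTEXT_OVERFLOW_HINT_RE and isRateLimitErrorMessage are not shown in this section.

```typescript
// Hypothetical stand-ins: the real patterns live in errors.ts and are not shown here.
const WINDOW_TOO_SMALL_RE = /context window.*too small/i;
const RATE_LIMIT_RE = /rate[_ ]limit/i;
const RATE_LIMIT_HINT_RE = /too many requests|quota/i;
const OVERFLOW_HINT_RE = /context.*(overflow|too (large|long))|too many tokens/i;

// Simplified stand-in for the exact matcher (only two of its nine paths).
function isExactOverflow(message: string): boolean {
  const lower = message.toLowerCase();
  return lower.includes("prompt is too long") || lower.includes("context length exceeded");
}

function isLikelyOverflowSketch(message?: string): boolean {
  if (!message) return false;
  if (WINDOW_TOO_SMALL_RE.test(message)) return false; // exclusion comes first
  if (RATE_LIMIT_RE.test(message)) return false;       // rate limits are never overflow
  if (isExactOverflow(message)) return true;           // exact match wins over hints
  if (RATE_LIMIT_HINT_RE.test(message)) return false;  // weaker rate-limit hint
  return OVERFLOW_HINT_RE.test(message);               // loose fallback
}
```

With these assumed patterns, a message like "Rate limit reached: too many tokens per minute" is rejected by the rate-limit guard even though the loose overflow regex would have matched it.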

2. Thinking downgrade (first recovery layer)

// src/agents/pi-embedded-helpers/thinking.ts

export function pickFallbackThinkingLevel(params: {
  message?: string;
  attempted: Set<ThinkLevel>;
}): ThinkLevel | undefined {
  const raw = params.message?.trim();
  if (!raw) return undefined;

  // Extract the list of supported thinking levels from the error message
  const supported = extractSupportedValues(raw);
  if (supported.length === 0) return undefined;

  // Find a level that has not been tried yet
  for (const entry of supported) {
    const normalized = normalizeThinkLevel(entry);
    if (!normalized) continue;
    if (params.attempted.has(normalized)) continue;  // skip levels already tried
    return normalized;
  }
  return undefined;
}

When it is called (inside the retry loop in run.ts):

const fallbackThinking = pickFallbackThinkingLevel({
  message: errorText,
  attempted: attemptedThinking,
});
if (fallbackThinking) {
  log.warn(`unsupported thinking level for ${provider}/${modelId}; retrying with ${fallbackThinking}`);
  thinkLevel = fallbackThinking;
  continue;   // same model, retry with the downgraded thinking level
}

How it works: some models return an error message such as "supported values: none, low"; extractSupportedValues parses the available levels out of it, and the run downgrades automatically.
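
To make the parsing step concrete, here is a minimal sketch of how such a message could be turned into a list of levels. The real extractSupportedValues is not shown in this section, so the regex, the function names, and the use of plain strings instead of ThinkLevel are all assumptions made for illustration.

```typescript
// Hypothetical parser: the real extractSupportedValues may work differently.
function extractSupportedValuesSketch(message: string): string[] {
  const match = /supported values(?: are)?:\s*([^.\n]+)/i.exec(message);
  const list = match?.[1];
  if (!list) return [];
  return list
    .split(/,|\bor\b|\band\b/i)                       // separators: comma, "or", "and"
    .map((part) => part.trim().replace(/^['"`]+|['"`]+$/g, "").toLowerCase())
    .filter((part) => part.length > 0);
}

// Pick the first advertised level that has not been attempted yet.
function pickUntriedLevel(message: string, attempted: Set<string>): string | undefined {
  return extractSupportedValuesSketch(message).find((level) => !attempted.has(level));
}
```

Because the candidate levels come from the provider's own message, the order in which they are tried follows the message rather than a fixed table.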

3. Auth profile rotation (second recovery layer)

// src/agents/pi-embedded-runner/run.ts

// Condition: failover-eligible error + not a timeout + another profile is available
if (
  isFailoverErrorMessage(errorText) &&
  promptFailoverReason !== "timeout" &&
  (await advanceAuthProfile())   // ← try to switch to the next profile
) {
  continue;   // same model, retry with the new profile
}

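When rotation fails (every profile is in cooldown), throwAuthProfileFailover throws a FailoverError if a fallback is configured, and otherwise rethrows the original error: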

const throwAuthProfileFailover = (params: {
  allInCooldown: boolean;
  message?: string;
  error?: unknown;
}): never => {
  // `message`, `fallbackConfigured`, `provider`, and `modelId` are defined outside this excerpt.
  const reason = resolveAuthProfileFailoverReason({ allInCooldown: params.allInCooldown, message });
  if (fallbackConfigured) {
    throw new FailoverError(message, {
      reason,
      provider,
      model: modelId,
      status: resolveFailoverStatus(reason),
      cause: params.error,
    });
  }
  // Without a configured fallback, throw the original error directly
  if (params.error instanceof Error) throw params.error;
  throw new Error(message);
};

Cooldown design: after a failure, a profile enters an exponential-backoff cooldown, while the other profiles of the same provider remain usable.
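
The cooldown behavior can be sketched as follows. This is an illustration only: the base delay, the doubling factor, the cap, and the ProfileState shape are assumed values and names, since the section does not give the real ones.

```typescript
// Assumed parameters for illustration; the real values are not given in this section.
const BASE_COOLDOWN_MS = 60_000;        // 1 minute after the first failure
const MAX_COOLDOWN_MS = 60 * 60_000;    // never longer than 1 hour

// Exponential backoff: each consecutive failure doubles the cooldown, up to the cap.
function cooldownMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0;
  const raw = BASE_COOLDOWN_MS * 2 ** (consecutiveFailures - 1);
  return Math.min(raw, MAX_COOLDOWN_MS);
}

// Hypothetical per-profile state: a profile is usable once its cooldown has expired.
type ProfileState = { id: string; cooldownUntil: number };

function nextUsableProfile(profiles: ProfileState[], now: number): ProfileState | undefined {
  return profiles.find((profile) => profile.cooldownUntil <= now);
}
```

Tracking the cooldown per profile rather than per provider is what lets one failing key sit out while its siblings keep serving requests.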

4. runWithModelFallback (third recovery layer, exact signature)

// src/agents/model-fallback.ts

export async function runWithModelFallback<T>(params: {
  cfg: OpenClawConfig | undefined;
  provider: string;
  model: string;
  agentDir?: string;
  /** Explicit fallback list; when provided (even if empty) it replaces agents.defaults.model.fallbacks */
  fallbacksOverride?: string[];
  run: (provider: string, model: string) => Promise<T>;
  onError?: (attempt: {
    provider: string;
    model: string;
    error: unknown;
    attempt: number;
    total: number;
  }) => void | Promise<void>;
}): Promise<{
  result: T;
  provider: string;
  model: string;
  attempts: FallbackAttempt[];
}>

FallbackAttempt type

type FallbackAttempt = {
  provider: string;
  model: string;
  error: string;   // already-formatted error description
  reason?: FailoverReason;
  status?: number;
  code?: string;
};
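
A caller can turn the attempts array into a readable failure summary. The helper below is a hypothetical sketch, not part of model-fallback.ts; it redeclares FallbackAttempt locally (with reason simplified to string) so the example is self-contained.

```typescript
// Local copy of the shape above; `reason` is simplified to string for this sketch.
type FallbackAttempt = {
  provider: string;
  model: string;
  error: string;
  reason?: string;
  status?: number;
  code?: string;
};

// Hypothetical helper: one header line plus one line per failed candidate.
function summarizeAttempts(attempts: FallbackAttempt[]): string {
  const lines = attempts.map((attempt) => {
    const tag = attempt.reason ? ` [${attempt.reason}]` : "";
    return `${attempt.provider}/${attempt.model}${tag}: ${attempt.error}`;
  });
  return [`all ${attempts.length} models failed`, ...lines].join("\n");
}
```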

5. Building the candidate list (resolveFallbackCandidates)

function resolveFallbackCandidates(params: {
  cfg: OpenClawConfig | undefined;
  provider: string;
  model: string;
  fallbacksOverride?: string[];
}): ModelCandidate[] {
  // 1. Add the primary model first
  addCandidate({ provider, model }, false);

  // 2. Decide where the fallback list comes from
  const modelFallbacks = (() => {
    if (params.fallbacksOverride !== undefined) {
      return params.fallbacksOverride;  // the override fully replaces the config (even when it is an empty [])
    }
    // Read from cfg.agents.defaults.model.fallbacks
    const model = params.cfg?.agents?.defaults?.model as { fallbacks?: string[] } | string | undefined;
    if (model && typeof model === "object") return model.fallbacks ?? [];
    return [];
  })();

  // 3. Resolve each fallback string into a provider/model ref
  for (const raw of modelFallbacks) {
    const resolved = resolveModelRefFromString({ raw, defaultProvider, aliasIndex });
    if (!resolved) continue;
    addCandidate(resolved.ref, true);
  }

  return candidates;  // [primary model, ...fallbacks]
}

Exact semantics of fallbacksOverride:

undefined → use agents.defaults.model.fallbacks from the config
[] (empty array) → disable fallback (only the primary model is tried)
["openai/gpt-4.1"] → replace the config with this list (even if the config defines fallbacks)
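
The three cases above reduce to a single `!== undefined` check. The sketch below isolates that rule; resolveFallbackList and the Params type are hypothetical names, and "provider-b/model-b" is a placeholder model ref.

```typescript
// Hypothetical helper isolating the override rule from resolveFallbackCandidates.
function resolveFallbackList(
  override: string[] | undefined,
  configFallbacks: string[] | undefined,
): string[] {
  // An empty array is still "provided", so it wins over the config.
  if (override !== undefined) return override;
  return configFallbacks ?? [];
}

type Params = { fallbacksOverride?: string[] };

const fromConfig = ["provider-b/model-b"]; // placeholder config value

const omitted: Params = {};
const explicitUndefined: Params = { fallbacksOverride: undefined };
const disabled: Params = { fallbacksOverride: [] };
const replaced: Params = { fallbacksOverride: ["openai/gpt-4.1"] };

const resolved = [omitted, explicitUndefined, disabled, replaced].map((params) =>
  resolveFallbackList(params.fallbacksOverride, fromConfig),
);
```

Here the first two entries of resolved both equal the config list, the third is empty, and the fourth contains only "openai/gpt-4.1".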

6. Fallback handles only FailoverError

for (let i = 0; i < candidates.length; i += 1) {
  const candidate = candidates[i];
  try {
    const result = await params.run(candidate.provider, candidate.model);
    return { result, provider: candidate.provider, model: candidate.model, attempts };
  } catch (err) {
    if (shouldRethrowAbort(err)) throw err;   // aborts pass straight through

    const normalized = coerceToFailoverError(err, {
      provider: candidate.provider,
      model: candidate.model,
    }) ?? err;

    if (!isFailoverError(normalized)) {
      throw err;   // ← not a FailoverError: rethrow immediately, do not try the next model
    }

    // FailoverError: record this failure and move on to the next candidate
    lastError = normalized;
    const described = describeFailoverError(normalized);
    attempts.push({
      provider: candidate.provider,
      model: candidate.model,
      error: described.message,
      reason: described.reason,
      status: described.status,
      code: described.code,
    });
    await params.onError?.({...});
  }
}

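Design intent: only a FailoverError means "this model is unavailable, try the next one"; business-logic errors (such as a malformed user query) should be reported straight to the user instead of being absorbed by fallback.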
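
The rethrow-versus-continue rule can be exercised in isolation with a reduced loop. This sketch uses a stand-in SketchFailoverError class and leaves out shouldRethrowAbort, coerceToFailoverError, and onError, so it is a simplification of the excerpt above rather than the real implementation.

```typescript
// Stand-in for the real FailoverError; only `reason` is modeled here.
class SketchFailoverError extends Error {
  reason: string;
  constructor(message: string, reason: string) {
    super(message);
    this.name = "SketchFailoverError";
    this.reason = reason;
    Object.setPrototypeOf(this, SketchFailoverError.prototype); // keeps instanceof working on ES5 targets
  }
}

type Candidate = { provider: string; model: string };
type Attempt = Candidate & { error: string; reason: string };

async function runWithFallbackSketch<T>(
  candidates: Candidate[],
  run: (candidate: Candidate) => Promise<T>,
): Promise<{ result: T; used: Candidate; attempts: Attempt[] }> {
  const attempts: Attempt[] = [];
  for (const candidate of candidates) {
    try {
      const result = await run(candidate);
      return { result, used: candidate, attempts };
    } catch (err) {
      // Anything that is not a failover error goes straight back to the caller.
      if (!(err instanceof SketchFailoverError)) throw err;
      attempts.push({ ...candidate, error: err.message, reason: err.reason });
    }
  }
  throw new Error(`all ${attempts.length} models failed`);
}
```

In this sketch, a run callback that throws a TypeError stops the loop after the first candidate, while one that throws SketchFailoverError lets the loop advance to the next candidate.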

7. Three-layer recovery decision tree

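LLM call fails
    │
    ▼
[Layer 1] thinking downgrade
    pickFallbackThinkingLevel({ message: errorText, attempted: attemptedThinking })
    │
    ├─ an untried level is available → retry the same model with the downgraded thinkLevel (continue)
    │
    └─ no level left to try
           │
           ▼
       [Layer 2] auth profile rotation
           isFailoverErrorMessage(errorText) && promptFailoverReason !== "timeout" && (await advanceAuthProfile())
           │
           ├─ a profile is available → retry the same model (continue)
           │
           └─ all profiles in cooldown → throw FailoverError (only when a fallback is configured; otherwise the original error)
                  │
                  ▼
              [Layer 3] model fallback
                  runWithModelFallback → try the next candidate model
                  │
                  ├─ a candidate succeeds → return the result (with the attempts record)
                  │
                  └─ all candidates fail → throw the final error (including the FallbackAttempt[] of every attempt)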

8. FailoverReason type

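type FailoverReason =
  | "auth"              // authentication failure
  | "rate_limit"        // rate limit exceeded
  | "billing"           // billing or balance problem
  | "timeout"           // request timed out
  | "context_overflow"  // context limit exceeded
  | "model_unavailable" // model unavailable
  | "unknown";          // unknown cause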

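Each reason maps to a different recovery strategy:

timeout → does not trigger auth profile rotation (a timeout is transient, not an auth problem)
context_overflow → triggers compaction/truncation (handled in the outer layer of run.ts)
anything else → attempts auth profile rotation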
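
The mapping above can be written as a small dispatch function. The RecoveryStep labels and the function name are hypothetical; they only name the three behaviors listed and do not correspond to identifiers in run.ts.

```typescript
type FailoverReason =
  | "auth"
  | "rate_limit"
  | "billing"
  | "timeout"
  | "context_overflow"
  | "model_unavailable"
  | "unknown";

// Hypothetical labels for the three behaviors described above.
type RecoveryStep = "skip_profile_rotation" | "compact_context" | "rotate_auth_profile";

function firstRecoveryStep(reason: FailoverReason): RecoveryStep {
  switch (reason) {
    case "timeout":
      return "skip_profile_rotation"; // transient, not an auth problem
    case "context_overflow":
      return "compact_context";       // handled by compaction/truncation
    default:
      return "rotate_auth_profile";   // auth, rate_limit, billing, ...
  }
}
```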

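9. Self-check list

isContextOverflowError checks 9 patterns, including the combined HTTP 413 check.
pickFallbackThinkingLevel parses the supported levels out of the error message; the downgrade order is not hard-coded.
auth profile rotation: promptFailoverReason !== "timeout" is a required condition (timeouts never rotate the profile).
fallbacksOverride = [] disables fallback (stricter than omitting fallbacksOverride).
A non-FailoverError is rethrown immediately and is never absorbed by fallback.
The attempts array records every failed attempt so the caller can build an error message such as "all N models failed".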

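10. Development pitfalls

If a custom run function throws an error that is not a FailoverError (and that coerceToFailoverError cannot convert into one), fallback does not kick in and the call fails immediately. To force a model switch, throw new FailoverError(msg, { reason: "rate_limit", ... }).
Passing fallbacksOverride: undefined is equivalent to omitting the field (both use the config); only [] explicitly disables fallback.
thinking downgrade happens inside the run.ts retry loop and is completely transparent to runWithModelFallback.
auth profile cooldown uses exponential backoff, so repeated failures within a short window make the cooldown progressively longer.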
