﻿# AI Incident Runbook Template

## 1. Incident Overview
- Incident ID:
- Date and Time (UTC):
- Incident Commander:
- Severity (SEV1/SEV2/SEV3):
- Affected Workflows:
- Customer Impact Summary:

## 2. Detection
- How was the incident detected?
- First alert timestamp:
- Trace URL(s):
- Error signature or failing step name:

## 3. Triage
- Confirmed scope (which services/providers/models):
- Last known good deployment:
- Approximate start time:
- Is this still active? (yes/no)

## 4. Containment Actions
- Action 1:
- Action 2:
- Temporary mitigation in place:
- Owner for each action:

## 5. Root Cause Investigation
- Primary failing step:
- Root cause hypothesis:
- Evidence (trace IDs, logs, provider responses):
- Confirmed root cause:

## 6. Resolution
- Final fix deployed at:
- Validation steps completed:
- Incident status closed at:

## 7. Follow-up
- Prevention action items:
- Owner and due date:
- Monitoring/alert updates required:
- Prompt/model routing changes required:

## 8. Metrics
- Mean Time to Detect (MTTD):
- Mean Time to Resolve (MTTR):
- Estimated user impact count:
- Estimated wasted token spend:
