incident-runbook-templates

Name: incident-runbook-templates
Author: W. Shobson

incident response, runbooks, devops, SRE, operations, templates, on-call, incident management

⭐ 36.8k · 📄 MIT · 🕒 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The incident-runbook-templates skill gives DevOps teams standardized response workflows for production failures. Instead of starting from scratch during an outage, engineers work from pre-defined templates covering detection, triage, mitigation, and verification. The templates favor rapid assessment through CLI-based health checks, rollout status monitoring, and service-level diagnostics. They define four severity tiers (SEV1 to SEV4) and systematic debugging paths for database connectivity problems, high latency, and deployment-related crashes, which reduces cognitive load during high-pressure events. Once service-specific variables are filled in, the templates give on-call teams a baseline for standardizing communication, escalating issues correctly, and verifying service restoration through documented command-line procedures.

When to Use This Skill

• Setting up on-call runbooks for new microservices
• Responding to production outages where time-to-resolution is critical
• Training junior engineers on standard incident investigation procedures
• Building consistent troubleshooting documentation for internal service dependencies

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

“generate a new incident response template”
“show me the runbook steps for a payment service outage”
“how do I handle a SEV2 service latency issue”
“create a Kubernetes rollback runbook procedure”
“initialize an escalation matrix for our incident response”
“what are the health check commands for the incident runbook”

Pro Tips

💡 Customize templates with service-specific alerts and diagnostic commands for faster triage.
💡 Integrate communication templates directly into your incident management platform for consistent stakeholder updates.
💡 Regularly review and update runbooks based on post-incident analyses to ensure they remain effective and current.

What this skill does

• Categorizes incidents by severity from SEV1 to SEV4 with defined response time targets
• Provides shell commands for Kubernetes pod health, log analysis, and database query monitoring
• Includes specific workflows for rolling back deployments and scaling services under load
• Establishes standardized communication and escalation protocols for active incidents
• Guides the user through isolating partial versus total service failures

When not to use it

✕ Handling incidents that require manual architectural code refactoring
✕ Managing incidents for platforms that are not hosted on Kubernetes or similar cloud-native infrastructure

Example workflow

1. Identify current incident severity based on impact level
2. Deploy the service-specific runbook template to the incident repository
3. Execute initial triage commands to verify pod and database health
4. Apply mitigation steps such as rolling back a bad deployment
5. Perform verification checks to confirm the service is healthy
6. Log resolution details for the post-mortem record

Prerequisites

– Kubernetes access for the target cluster
– Basic knowledge of CLI-based diagnostics
– PagerDuty or similar on-call alert configuration

Pitfalls & limitations

! Generic templates may require significant customization to match specific production architectures
! Over-reliance on automated steps can lead to missing subtle, non-standard failure modes
! Outdated runbooks can provide dangerous instructions if infrastructure configurations change

FAQ

Can I use these templates for non-Kubernetes services?

The provided command patterns are Kubernetes-specific. You would need to translate the pod and deployment commands into the equivalent operations for your specific platform (e.g., VMs or serverless).

How often should I update these runbooks?

Runbooks should be updated whenever the underlying service architecture changes significantly or after every post-mortem that reveals a new, more efficient remediation path.

Do these templates cover communication?

Yes, the structure includes sections for communication templates and escalation matrices to ensure stakeholders remain informed throughout the incident lifecycle.

How it compares

This skill provides a structured, templated approach that enforces consistency across incidents, whereas manual troubleshooting often leads to fragmented knowledge and inconsistent response quality during stress.


📄 Full skill instructions — original source: wshobson/agents

# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |
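
The severity table above can be encoded as a small lookup so that scripts and chat tooling print the same targets. This is a minimal sketch: the function name `response_target` and its output strings are assumptions made for illustration, not part of the skill.

```bash
#!/usr/bin/env bash
# Map a severity label to the response-time target from the table above.
# response_target is a hypothetical helper name.
response_target() {
  case "$1" in
    SEV1) echo "15 min" ;;
    SEV2) echo "30 min" ;;
    SEV3) echo "2 hours" ;;
    SEV4) echo "next business day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_target SEV2   # prints: 30 min
```

Returning a non-zero status for unknown labels lets a calling script stop early instead of paging with an empty target.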

### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
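
To start a new runbook with the nine sections above already in place, a skeleton can be generated from the service name. The helper below is a sketch; `new_runbook_skeleton` and the heading layout it prints are assumptions, not something the skill ships.

```bash
#!/usr/bin/env bash
# Print a Markdown skeleton containing the nine runbook sections listed above.
# new_runbook_skeleton is a hypothetical helper name.
new_runbook_skeleton() {
  local service=$1 n=1 section
  echo "# $service Runbook"
  for section in \
    "Overview & Impact" "Detection & Alerts" "Initial Triage" \
    "Mitigation Steps" "Root Cause Investigation" "Resolution Procedures" \
    "Verification & Rollback" "Communication Templates" "Escalation Matrix"
  do
    printf '\n## %s. %s\n' "$n" "$section"
    n=$((n + 1))
  done
}

# Example: new_runbook_skeleton "Payment Service" > payment-service-runbook.md
```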

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- payment_error_rate > 5% (PagerDuty)
- payment_latency_p99 > 2s (Slack)
- payment_success_rate < 95% (PagerDuty)
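
The three alert conditions above can be checked by hand against sampled values when the alerting pipeline itself is in doubt. The sketch below assumes you already have the three numbers (for example from a dashboard); `firing_alerts` and its argument order are hypothetical.

```bash
#!/usr/bin/env bash
# Evaluate the three alert conditions listed above against sampled values.
# Arguments (hypothetical interface):
#   $1 error rate in percent, $2 p99 latency in seconds, $3 success rate in percent
firing_alerts() {
  awk -v err="$1" -v p99="$2" -v ok="$3" 'BEGIN {
    if (err > 5)  print "payment_error_rate -> PagerDuty"
    if (p99 > 2)  print "payment_latency_p99 -> Slack"
    if (ok  < 95) print "payment_success_rate -> PagerDuty"
  }'
}

# Error rate and success rate breach their thresholds; latency does not.
firing_alerts 7.5 1.2 92
```

awk is used for the comparisons because shell `[ ... ]` tests only handle integers.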

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

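# Check error rates
# (-G with --data-urlencode stops curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"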

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
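
Steps 4 and 5 above change cluster state, so it can help to review the exact commands before running them. One way to do that is a dry-run wrapper; in this sketch `run` and the `DRY_RUN` variable are names chosen for illustration, not part of the skill or of kubectl.

```bash
#!/usr/bin/env bash
# Dry-run wrapper: with DRY_RUN=1 each command is printed instead of executed,
# so the sequence can be reviewed before it touches the cluster.
# run and DRY_RUN are hypothetical names.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Prints the rollback command without contacting the cluster.
DRY_RUN=1 run kubectl rollout undo deployment/payment-service -n payments
```

With `DRY_RUN` unset, the same line executes the command unchanged, so one script can serve both rehearsal and the real incident.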

### 4.2 High Latency

# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# (a column alias such as "duration" cannot be used in WHERE, so the expression is repeated)
psql -h "$DB_HOST" -U "$DB_USER" -c "
  SELECT pid, now() - query_start AS duration, query
  FROM pg_stat_activity
  WHERE state = 'active' AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# (replace <pid> with a pid returned by Step 2)
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend(<pid>);"

# Step 4: Check external dependency latency
# (curl-format.txt is a local --write-out format file; it must exist in the working directory)
curl -w "@curl-format.txt" -o /dev/null -s https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments

### 4.3 Partial Failures (Specific Errors)

# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

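# Step 3: If specific endpoint, enable feature flag to disable
# (curl -d sends form-encoded data by default, so the JSON content type is set explicitly)
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'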

# Step 4: If data issue, check recent data changes
psql -h $DB_HOST -c "
  SELECT * FROM audit_log
  WHERE table_name = 'payment_methods'
  AND created_at > now() - interval '1 hour';"
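
One caveat with the Step 1 pipeline: if every log line starts with a timestamp, no two lines are identical and `uniq -c` reports a count of 1 for each. A possible variant strips the leading field before counting. This sketch assumes each line begins with a single space-delimited timestamp; `top_errors` and the sample lines are hypothetical.

```bash
#!/usr/bin/env bash
# Count error messages after dropping the first space-delimited field,
# so identical messages group together even when timestamps differ.
# Assumes the timestamp is exactly one field; adjust -f for other layouts.
top_errors() {
  grep -i error | cut -d' ' -f2- | sort | uniq -c | sort -rn | head -20
}

# Made-up sample input to show the grouping; in an incident, pipe kubectl logs in.
printf '%s\n' \
  '2026-01-01T00:00:01Z ERROR card declined' \
  '2026-01-01T00:00:02Z INFO ok' \
  '2026-01-01T00:00:03Z ERROR card declined' \
  '2026-01-01T00:00:04Z ERROR db timeout' | top_errors
```

In practice the input would come from `kubectl logs -n payments -l app=payment-service --tail=500 | top_errors`.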

### 4.4 Traffic Surge

# Step 1: Check pod CPU and memory usage (kubectl top does not report request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF

## Verification Steps

# Verify service is healthy
curl -s https://api.company.com/payments/health | jq

# Verify error rate is back to normal
# (-G with --data-urlencode stops curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" | jq

# Smoke test critical flows
./scripts/smoke-test-payments.sh
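
A service often takes a little while to report healthy after a rollback, so a single check can give a false negative. The sketch below retries any check command a fixed number of times; `retry_until`, its argument order, and the messages it prints are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Retry a check command until it succeeds or the attempts run out.
# Usage: retry_until <attempts> <sleep_seconds> <command...>
# retry_until is a hypothetical helper name.
retry_until() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      echo "check passed on attempt $i"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  echo "check still failing after $attempts attempts" >&2
  return 1
}

# Example (not executed here):
#   retry_until 5 10 curl -fsS -o /dev/null https://api.company.com/payments/health
```

`curl -f` makes HTTP error statuses count as failures, which is what lets the health URL act as the check command.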

## Rollback Procedures

# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh $MIGRATION_VERSION

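# Rollback feature flag
# (curl -d sends form-encoded data by default, so the JSON content type is set explicitly)
curl -X POST https://api.company.com/internal/feature-flags \
  -H "Content-Type: application/json" \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'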

## Escalation Matrix

| Condition | Escalate To | Contact |
| ----------------------------- | ------------------- | ------------------- |
| SEV1 unresolved for > 15 min | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |
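
The first three rows of the matrix depend on facts that can be checked mechanically. As a sketch, they could be encoded like this; the function name and argument order are hypothetical, and the "customer communication needed" row is left out because it is a judgment call.

```bash
#!/usr/bin/env bash
# Print the escalation targets from the matrix above for the given facts.
# Arguments (hypothetical interface): severity, minutes unresolved,
# estimated financial impact in whole dollars, breach suspected (yes/no).
escalation_targets() {
  local sev=$1 minutes=$2 impact=$3 breach=$4
  if [ "$sev" = "SEV1" ] && [ "$minutes" -gt 15 ]; then
    echo "Engineering Manager (@manager)"
  fi
  if [ "$breach" = "yes" ]; then
    echo "Security Team (#security-incidents)"
  fi
  if [ "$impact" -gt 10000 ]; then
    echo "Finance + Legal (@finance-oncall)"
  fi
  return 0
}

escalation_targets SEV1 20 2500 no   # prints: Engineering Manager (@manager)
```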

## Communication Templates

### Initial Notification (Internal)

🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
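
Filling in the notification by hand is error-prone under pressure. A small function can render it from a few arguments; this sketch covers the header fields only (the "Current Actions" list is left for the responder), and `render_initial_notification` with its argument order is an assumption.

```bash
#!/usr/bin/env bash
# Render the initial internal notification from the template above.
# Arguments (hypothetical interface): title, severity, impact, commander, channel.
# Start Time is filled with the current UTC time.
render_initial_notification() {
  local title=$1 severity=$2 impact=$3 commander=$4 channel=$5
  cat <<EOF
🚨 INCIDENT: $title

Severity: $severity
Status: Investigating
Impact: $impact
Start Time: $(date -u +"%Y-%m-%d %H:%M UTC")
Incident Commander: $commander

Updates in $channel
EOF
}

render_initial_notification "Payment Service Degradation" SEV2 \
  "~20% of payment requests failing" "[NAME]" "#payments-incidents"
```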

### Status Update

📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes

### Resolution Notification

✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress

### Template 2: Database Incident Runbook
# Database Incident Runbook

## Quick Reference
| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

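-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when the connection became idle; query_start is when its last query began)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
AND state_change < now() - interval '10 minutes';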

## Replication Lag

-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
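
The 60-second threshold in the comments above can be applied to the query result in a script. This sketch takes the lag as an argument; `lag_status` and the host variable in the usage note are hypothetical names.

```bash
#!/usr/bin/env bash
# Classify replica lag using the 60-second threshold from the comments above.
# Input: lag in seconds (may be fractional), e.g. the lag_seconds column.
# lag_status is a hypothetical helper name.
lag_status() {
  awk -v lag="$1" 'BEGIN {
    if (lag > 60) print "investigate: check network, replica disk I/O, consider failover"
    else          print "ok"
  }'
}

lag_status 12.5   # prints: ok
```

To feed it a live value, run the lag query with `psql -At` (unaligned, tuples only) so the output is the bare number, for example `lag_status "$(psql -h "$REPLICA_HOST" -At -c "...")"`, where `REPLICA_HOST` is a placeholder.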

## Disk Space Critical

# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

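# VACUUM to reclaim space
# (VACUUM FULL rewrites the table: it holds an ACCESS EXCLUSIVE lock and needs free
#  disk space for the new copy, so it may not be possible when the disk is nearly full)
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk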
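
The `df` check above can be turned into a yes/no test for scripts. In this sketch the 90% default threshold is an assumption, not a value from the runbook, and both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Print the use% (as an integer) of the filesystem holding the given path.
# df -P forces the POSIX single-line layout, so column 5 is always capacity.
disk_use_percent() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeed when usage is at or above the threshold (default 90, an assumed value).
disk_is_critical() {
  local used
  used=$(disk_use_percent "$1") || return 2
  [ "$used" -ge "${2:-90}" ]
}

# Example: disk_is_critical /var/lib/postgresql/data && echo "disk critical"
```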

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)

By W. Shobson

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Click "Download" above
In your project, create the directory: .agent/skills/incident-runbook-templates/
Save the file as SKILL.md
The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/wshobson/agents/incident-runbook-templates/SKILL.md
Cursor: ~/.cursor/skills/wshobson/agents/incident-runbook-templates/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/incident-runbook-templates/SKILL.md

🚀 Install with CLI:
npx skills add wshobson/agents
