incident-runbook-templates

incident response · runbooks · devops · SRE · operations · templates · on-call · incident management
⭐ 36.8k · MIT · 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The incident-runbook-templates skill provides a structured framework for managing production failures, with standardized response workflows for DevOps teams. Instead of starting from scratch during an outage, engineers start from pre-defined templates covering detection, triage, mitigation, and verification. The templates prioritize rapid assessment through CLI-based health checks, rollout status monitoring, and service-level diagnostics. Clear severity tiers and systematic debugging paths for database connectivity, high latency, and deployment-related crashes reduce cognitive load during high-pressure events. Teams fill in service-specific values (namespaces, service names, alert thresholds, contacts) to get a working baseline, which also gives on-call engineers a starting point for standardizing communication, escalating correctly, and verifying service restoration through documented command-line procedures.

When to Use This Skill

  • Setting up on-call runbooks for new microservices
  • Responding to production outages where time-to-resolution is critical
  • Training junior engineers on standard incident investigation procedures
  • Building consistent troubleshooting documentation for internal service dependencies

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

  • “generate a new incident response template”
  • “show me the runbook steps for a payment service outage”
  • “how do I handle a SEV2 service latency issue”
  • “create a Kubernetes rollback runbook procedure”
  • “initialize an escalation matrix for our incident response”
  • “what are the health check commands for the incident runbook”

Pro Tips

  • Customize templates with service-specific alerts and diagnostic commands for faster triage.
  • Integrate communication templates directly into your incident management platform for consistent stakeholder updates.
  • Regularly review and update runbooks based on post-incident analyses to ensure they remain effective and current.

What this skill does

  • Categorizes incidents by severity from SEV1 to SEV4 with defined response time targets
  • Provides shell commands for Kubernetes pod health, log analysis, and database query monitoring
  • Includes specific workflows for rolling back deployments and scaling services under load
  • Establishes standardized communication and escalation protocols for active incidents
  • Guides the user through isolating partial versus total service failures

When not to use it

  • Handling incidents that require manual architectural code refactoring
  • Managing incidents for platforms that are not hosted on Kubernetes or similar cloud-native infrastructure

Example workflow

  1. Identify current incident severity based on impact level
  2. Deploy the service-specific runbook template to the incident repository
  3. Execute initial triage commands to verify pod and database health
  4. Apply mitigation steps such as rolling back a bad deployment
  5. Perform verification checks to confirm the service is healthy
  6. Log resolution details for the post-mortem record

Prerequisites

  • Kubernetes access for the target cluster
  • Basic knowledge of CLI-based diagnostics
  • PagerDuty or similar on-call alert configuration

Pitfalls & limitations

  • Generic templates may require significant customization to match specific production architectures
  • Over-reliance on automated steps can lead to missing subtle, non-standard failure modes
  • Outdated runbooks can provide dangerous instructions if infrastructure configurations change

FAQ

Can I use these templates for non-Kubernetes services?
The provided command patterns are Kubernetes-specific. You would need to translate the pod and deployment commands into the equivalent operations for your specific platform (e.g., VMs or serverless).
How often should I update these runbooks?
Runbooks should be updated whenever the underlying service architecture changes significantly or after every post-mortem that reveals a new, more efficient remediation path.
Do these templates cover communication?
Yes, the structure includes sections for communication templates and escalation matrices to ensure stakeholders remain informed throughout the incident lifecycle.

How it compares

This skill provides a structured, templated approach that enforces consistency across incidents, whereas manual troubleshooting often leads to fragmented knowledge and inconsistent response quality during stress.

Source & trust

⭐ 37k stars · MIT · Updated 2026-06-16
Full skill instructions (original source: wshobson/agents)
# Incident Runbook Templates

Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.

## When to Use This Skill

- Creating incident response procedures
- Building service-specific runbooks
- Establishing escalation paths
- Documenting recovery procedures
- Responding to active incidents
- Onboarding on-call engineers

## Core Concepts

### 1. Incident Severity Levels

| Severity | Impact | Response Time | Example |
| -------- | -------------------------- | ----------------- | ----------------------- |
| **SEV1** | Complete outage, data loss | 15 min | Production down |
| **SEV2** | Major degradation | 30 min | Critical feature broken |
| **SEV3** | Minor impact | 2 hours | Non-critical bug |
| **SEV4** | Minimal impact | Next business day | Cosmetic issue |

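The severity table can also be encoded as a small lookup so that paging or reminder scripts share one definition of the targets. This is a minimal sketch, not part of the original skill; the function name `sev_response_target` is hypothetical, and the values are copied from the table above.

```shell
# Map a severity label to its response-time target in minutes.
# SEV4 ("next business day") has no fixed minute value, so a label is printed instead.
sev_response_target() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 120 ;;
    SEV4) echo "next-business-day" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `sev_response_target SEV2` prints `30`, and an unrecognized label returns a non-zero status so callers can fail loudly.
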
### 2. Runbook Structure

1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix

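The nine sections listed above can be scaffolded into an empty runbook file. The sketch below assumes you want plain markdown headings; `runbook_skeleton` is an illustrative name, not something the skill ships.

```shell
# Print an empty runbook with the nine standard sections as numbered headings.
# Usage: runbook_skeleton "Payment Service" > payment-service-runbook.md
runbook_skeleton() {
  printf '# %s Runbook\n\n' "$1"
  n=1
  for section in \
    "Overview & Impact" "Detection & Alerts" "Initial Triage" \
    "Mitigation Steps" "Root Cause Investigation" "Resolution Procedures" \
    "Verification & Rollback" "Communication Templates" "Escalation Matrix"
  do
    printf '## %d. %s\n\n' "$n" "$section"
    n=$((n + 1))
  done
}
```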

## Runbook Templates

### Template 1: Service Outage Runbook

# [Service Name] Outage Runbook

## Overview

**Service**: Payment Processing Service
**Owner**: Platform Team
**Slack**: #payments-incidents
**PagerDuty**: payments-oncall

## Impact Assessment

- [ ] Which customers are affected?
- [ ] What percentage of traffic is impacted?
- [ ] Are there financial implications?
- [ ] What's the blast radius?

## Detection

### Alerts

- `payment_error_rate > 5%` (PagerDuty)
- `payment_latency_p99 > 2s` (Slack)
- `payment_success_rate < 95%` (PagerDuty)

### Dashboards

- [Payment Service Dashboard](https://grafana/d/payments)
- [Error Tracking](https://sentry.io/payments)
- [Dependency Status](https://status.stripe.com)

## Initial Triage (First 5 Minutes)

### 1. Assess Scope
```bash
# Check service health
kubectl get pods -n payments -l app=payment-service

# Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Check error rates
# (-G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))"
```

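When a namespace has many pods, it can help to filter the pod listing down to the ones that need attention. The following is a sketch that assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print the names of
# pods that are not Running or whose ready-container count is below the total.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")            # READY column looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl get pods -n payments -l app=payment-service --no-headers | unhealthy_pods
```

An empty result suggests the pods themselves are up, which points triage toward dependencies or traffic rather than the deployment.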

### 2. Quick Health Checks

- [ ] Can you reach the service? `curl -I https://api.company.com/payments/health`
- [ ] Database connectivity? Check connection pool metrics
- [ ] External dependencies? Check Stripe, bank API status
- [ ] Recent changes? Check deploy history

### 3. Initial Classification

| Symptom | Likely Cause | Go To Section |
| -------------------- | ------------------- | ------------- |
| All requests failing | Service down | Section 4.1 |
| High latency | Database/dependency | Section 4.2 |
| Partial failures | Code bug | Section 4.3 |
| Spike in errors | Traffic surge | Section 4.4 |

## Mitigation Procedures

### 4.1 Service Completely Down

```bash
# Step 1: Check pod status
kubectl get pods -n payments

# Step 2: If pods are crash-looping, check logs
kubectl logs -n payments -l app=payment-service --tail=100

# Step 3: Check recent deployments
kubectl rollout history deployment/payment-service -n payments

# Step 4: ROLLBACK if recent deploy is suspect
kubectl rollout undo deployment/payment-service -n payments

# Step 5: Scale up if resource constrained
kubectl scale deployment/payment-service -n payments --replicas=10

# Step 6: Verify recovery
kubectl rollout status deployment/payment-service -n payments
```

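The rollback and scale steps above change live state, so some teams rehearse them first. One possible approach, sketched here under the assumption that a dry-run default is acceptable for your team, is a small wrapper that prints a command unless `DRY_RUN=0` is set. Both `run` and `DRY_RUN` are illustrative names, not kubectl features.

```shell
# Print the command instead of executing it unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'DRY-RUN:'
    printf ' %s' "$@"
    printf '\n'
  else
    "$@"
  fi
}

# Example (prints the command; set DRY_RUN=0 to actually execute it):
# run kubectl rollout undo deployment/payment-service -n payments
```

Defaulting to dry-run means a pasted command does nothing destructive until the responder opts in explicitly.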

### 4.2 High Latency

```bash
# Step 1: Check database connections
kubectl exec -n payments deploy/payment-service -- \
  curl -s localhost:8080/metrics | grep db_pool

# Step 2: Check slow queries (if DB issue)
# A column alias cannot be referenced in WHERE, so the interval expression is repeated there.
psql -h "$DB_HOST" -U "$DB_USER" -c "
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;"

# Step 3: Kill long-running queries if needed
# Set PID to a pid returned by Step 2 before running this.
psql -h "$DB_HOST" -U "$DB_USER" -c "SELECT pg_terminate_backend($PID);"

# Step 4: Check external dependency latency
curl -o /dev/null -s \
  -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
  https://api.stripe.com/v1/health

# Step 5: Enable circuit breaker if dependency is slow
kubectl set env deployment/payment-service \
  STRIPE_CIRCUIT_BREAKER_ENABLED=true -n payments
```

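Step 1 greps the raw metrics, which still leaves the responder to judge how full the pool is. The sketch below turns two gauge lines into a percentage. The metric names `db_pool_active` and `db_pool_max` are assumptions for illustration only; substitute whatever your service actually exports.

```shell
# Read Prometheus text-format metrics on stdin and print pool utilization in percent.
# ASSUMPTION: the service exports gauges named db_pool_active and db_pool_max.
pool_utilization() {
  awk '
    $1 == "db_pool_active" { active = $2 }
    $1 == "db_pool_max"    { max = $2 }
    END {
      if (max > 0) printf "%d\n", (active * 100) / max
      else exit 1                    # metric missing or zero-sized pool
    }'
}

# Example (needs cluster access, so it is left commented out):
# kubectl exec -n payments deploy/payment-service -- \
#   curl -s localhost:8080/metrics | pool_utilization
```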

### 4.3 Partial Failures (Specific Errors)

```bash
# Step 1: Identify error pattern
kubectl logs -n payments -l app=payment-service --tail=500 | \
  grep -i error | sort | uniq -c | sort -rn | head -20

# Step 2: Check error tracking
# Go to Sentry: https://sentry.io/payments

# Step 3: If specific endpoint, enable feature flag to disable
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "DISABLE_PROBLEMATIC_FEATURE", "enabled": true}'

# Step 4: If data issue, check recent data changes
psql -h "$DB_HOST" -c "
SELECT * FROM audit_log
WHERE table_name = 'payment_methods'
AND created_at > now() - interval '1 hour';"
```

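The `sort | uniq -c` pipeline in Step 1 only groups identical lines, so logs that begin with a timestamp will each count once. A possible refinement, assuming log lines start with an ISO-8601 timestamp (an assumption about your log format), strips that prefix before counting. `top_errors` is a hypothetical helper name.

```shell
# Read log lines on stdin, keep those mentioning "error", strip a leading
# ISO-8601 timestamp, and print the N most frequent messages (default 20).
top_errors() {
  grep -i 'error' \
    | sed -E 's/^[0-9]{4}-[0-9]{2}-[0-9]{2}[T ][0-9:.]+Z?[[:space:]]*//' \
    | sort | uniq -c | sort -rn | head -n "${1:-20}"
}

# Example (needs cluster access, so it is left commented out):
# kubectl logs -n payments -l app=payment-service --tail=500 | top_errors 10
```

If your logs carry request IDs or other per-line values, they would need the same kind of normalization before the counts become meaningful.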

### 4.4 Traffic Surge

```bash
# Step 1: Check current load (kubectl top reports CPU and memory, not request rate)
kubectl top pods -n payments

# Step 2: Scale horizontally
kubectl scale deployment/payment-service -n payments --replicas=20

# Step 3: Enable rate limiting
kubectl set env deployment/payment-service \
  RATE_LIMIT_ENABLED=true \
  RATE_LIMIT_RPS=1000 -n payments

# Step 4: If attack, block suspicious IPs
# (allows ingress from everywhere except the listed range)
kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-suspicious
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payment-service
  ingress:
  - from:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 192.168.1.0/24  # Suspicious range
EOF
```

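The replica counts in the scaling step are fixed numbers. If you know roughly how many requests per second one pod can serve, the target can be derived instead. This is a back-of-the-envelope sketch: the per-pod capacity is an assumption you would need to measure for your own service, and `replicas_needed` is an illustrative name.

```shell
# Ceiling division: how many replicas are needed to serve a given request rate.
# $1 = observed requests per second, $2 = measured capacity of one pod (req/s).
replicas_needed() {
  rps="$1"
  per_pod="$2"
  [ "$per_pod" -gt 0 ] || return 1
  echo $(( (rps + per_pod - 1) / per_pod ))
}

# Example (hypothetical numbers; needs cluster access, so it is left commented out):
# kubectl scale deployment/payment-service -n payments \
#   --replicas="$(replicas_needed 4500 250)"
```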

## Verification Steps

```bash
# Verify service is healthy
curl -s https://api.company.com/payments/health | jq .

# Verify error rate is back to normal
# (-G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs)
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=sum(rate(http_requests_total{status=~'5..'}[5m]))" \
  | jq '.data.result[0].value[1]'

# Verify latency is acceptable
curl -sG "http://prometheus:9090/api/v1/query" \
  --data-urlencode "query=histogram_quantile(0.99,sum(rate(http_request_duration_seconds_bucket[5m]))by(le))" \
  | jq .

# Smoke test critical flows
./scripts/smoke-test-payments.sh
```

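The verification queries print numbers, but deciding whether a value is "back to normal" is left to the reader. A small comparison helper can make that check explicit. In this sketch the threshold is whatever your team defines as normal, and `within_threshold` is a hypothetical name; awk is used because shell arithmetic does not handle decimals.

```shell
# Succeed (exit 0) when the measured value is at or below the threshold.
# $1 = measured value, $2 = threshold; both may be decimals.
within_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 <= t + 0) }'
}

# Example: p99 latency of 1.4s against the 2s alert threshold
# within_threshold 1.4 2 && echo "latency OK"
```

When feeding it the Prometheus result, use `jq -r` so the value arrives without surrounding quotes.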

## Rollback Procedures

```bash
# Rollback Kubernetes deployment
kubectl rollout undo deployment/payment-service -n payments

# Rollback database migration (if applicable)
./scripts/db-rollback.sh "$MIGRATION_VERSION"

# Rollback feature flag
curl -X POST https://api.company.com/internal/feature-flags \
  -H 'Content-Type: application/json' \
  -d '{"flag": "NEW_PAYMENT_FLOW", "enabled": false}'
```


## Escalation Matrix

| Condition | Escalate To | Contact |
| ----------------------------- | ------------------- | ------------------- |
| > 15 min unresolved SEV1 | Engineering Manager | @manager (Slack) |
| Data breach suspected | Security Team | #security-incidents |
| Financial impact > $10k | Finance + Legal | @finance-oncall |
| Customer communication needed | Support Lead | @support-lead |

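The first row of the matrix is a mechanical rule, so it can be checked by a script or a bot reminder. The sketch below encodes only that row; the other conditions need human judgment. `needs_manager_escalation` is an illustrative name.

```shell
# Succeed (exit 0) when a SEV1 has been unresolved for more than 15 minutes.
# $1 = severity label, $2 = minutes since the incident started.
needs_manager_escalation() {
  [ "$1" = "SEV1" ] && [ "$2" -gt 15 ]
}

# Example:
# needs_manager_escalation SEV1 20 && echo "Escalate to Engineering Manager"
```
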
## Communication Templates

### Initial Notification (Internal)

```
🚨 INCIDENT: Payment Service Degradation

Severity: SEV2
Status: Investigating
Impact: ~20% of payment requests failing
Start Time: [TIME]
Incident Commander: [NAME]

Current Actions:
- Investigating root cause
- Scaling up service
- Monitoring dashboards

Updates in #payments-incidents
```


### Status Update

```
📊 UPDATE: Payment Service Incident

Status: Mitigating
Impact: Reduced to ~5% failure rate
Duration: 25 minutes

Actions Taken:
- Rolled back deployment v2.3.4 → v2.3.3
- Scaled service from 5 → 10 replicas

Next Steps:
- Continuing to monitor
- Root cause analysis in progress

ETA to Resolution: ~15 minutes
```


### Resolution Notification

```
✅ RESOLVED: Payment Service Incident

Duration: 45 minutes
Impact: ~5,000 affected transactions
Root Cause: Memory leak in v2.3.4

Resolution:
- Rolled back to v2.3.3
- Transactions auto-retried successfully

Follow-up:
- Postmortem scheduled for [DATE]
- Bug fix in progress
```

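Filling in the Initial Notification by hand under pressure invites typos, so the template can be rendered from a few arguments. This is a minimal sketch: `initial_notification` is a hypothetical helper, the UTC timestamp format is an assumption, and posting the result to Slack or PagerDuty is left to your own tooling.

```shell
# Render the initial internal notification.
# $1 = title, $2 = severity, $3 = impact summary, $4 = incident commander.
initial_notification() {
  cat <<EOF
INCIDENT: $1

Severity: $2
Status: Investigating
Impact: $3
Start Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)
Incident Commander: $4

Updates in #payments-incidents
EOF
}

# Example:
# initial_notification "Payment Service Degradation" SEV2 \
#   "~20% of payment requests failing" "Jane Doe"
```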

### Template 2: Database Incident Runbook

# Database Incident Runbook

## Quick Reference

| Issue | Command |
|-------|---------|
| Check connections | `SELECT count(*) FROM pg_stat_activity;` |
| Kill query | `SELECT pg_terminate_backend(pid);` |
| Check replication lag | `SELECT extract(epoch from (now() - pg_last_xact_replay_timestamp()));` |
| Check locks | `SELECT * FROM pg_locks WHERE NOT granted;` |

## Connection Pool Exhaustion

```sql
-- Check current connections
SELECT datname, usename, state, count(*)
FROM pg_stat_activity
GROUP BY datname, usename, state
ORDER BY count(*) DESC;

-- Identify long-running connections
SELECT pid, usename, datname, state, query_start, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;

-- Terminate connections that have been idle for more than 10 minutes
-- (state_change records when a connection entered its current state;
--  the current session is excluded so the command does not kill itself)
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes'
  AND pid <> pg_backend_pid();
```


## Replication Lag

```sql
-- Check lag on replica
SELECT
  CASE
    WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
    ELSE extract(epoch from now() - pg_last_xact_replay_timestamp())
  END AS lag_seconds;

-- If lag > 60s, consider:
-- 1. Check network between primary/replica
-- 2. Check replica disk I/O
-- 3. Consider failover if unrecoverable
```


## Disk Space Critical

```bash
# Check disk usage
df -h /var/lib/postgresql/data

# Find large tables
psql -c "SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;"

# VACUUM to reclaim space
# Caution: VACUUM FULL rewrites the table, so it holds an ACCESS EXCLUSIVE lock
# and temporarily needs extra disk space for the new copy.
psql -c "VACUUM FULL large_table;"

# If emergency, delete old data or expand disk
```

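"Critical" is easier to act on when it is a number. The sketch below reads `df -P` output (the POSIX format, which keeps each filesystem on one line) and compares the used percentage with a threshold. The 90% default is an assumption, and both function names are hypothetical.

```shell
# Read `df -P <path>` output on stdin and print the used percentage as an integer.
disk_used_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Read `df -P <path>` output on stdin; succeed when usage is at or above
# the threshold given in $1 (default 90).
disk_is_critical() {
  pct=$(disk_used_pct)
  [ "$pct" -ge "${1:-90}" ]
}

# Example:
# df -P /var/lib/postgresql/data | disk_is_critical 90 && echo "Disk space critical"
```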

## Best Practices

### Do's
- **Keep runbooks updated** - Review after every incident
- **Test runbooks regularly** - Game days, chaos engineering
- **Include rollback steps** - Always have an escape hatch
- **Document assumptions** - What must be true for steps to work
- **Link to dashboards** - Quick access during stress

### Don'ts
- **Don't assume knowledge** - Write for 3 AM brain
- **Don't skip verification** - Confirm each step worked
- **Don't forget communication** - Keep stakeholders informed
- **Don't work alone** - Escalate early
- **Don't skip postmortems** - Learn from every incident

## Resources

- [Google SRE Book - Incident Management](https://sre.google/sre-book/managing-incidents/)
- [PagerDuty Incident Response](https://response.pagerduty.com/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management)

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/incident-runbook-templates/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

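The project-specific steps in Option A can also be done from a terminal. This sketch assumes you have already downloaded the skill file; `install_skill` is an illustrative helper, and the destination path is the one given in step 2.

```shell
# Copy a downloaded skill file into the project-specific skills directory.
# $1 = path to the downloaded file, $2 = project root (defaults to the current directory).
install_skill() {
  dest="${2:-.}/.agent/skills/incident-runbook-templates"
  mkdir -p "$dest" && cp "$1" "$dest/SKILL.md"
}

# Example:
# install_skill ~/Downloads/SKILL.md /path/to/project
```
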
Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/wshobson/agents/incident-runbook-templates/SKILL.md
  • Cursor: ~/.cursor/skills/wshobson/agents/incident-runbook-templates/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/incident-runbook-templates/SKILL.md

Install with CLI:
npx skills add wshobson/agents

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best placed in the "Rules for AI" section. This specific unit helps the agent avoid DevOps & CI/CD issues, leading to cleaner, more efficient code.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under DevOps & CI/CD and is published by W. Shobson, maintained in wshobson/agents.
