
slo-implementation

SRE, SLO, SLI, Error Budgets, Reliability, Monitoring, DevOps, Performance
36.8k stars | MIT license | 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The SLO Implementation skill provides a structured method for setting and enforcing reliability targets. It translates business goals into concrete measurements by defining Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, with the aim of balancing service stability against deployment frequency. Prometheus recording rules automate the calculation of burn rates and compliance ratios, so the decision to accelerate development or prioritize stability rests on data. This moves teams from reactive troubleshooting toward proactive reliability management and keeps performance metrics aligned with user experience and operational constraints. The skill also defines how service performance is monitored and when corrective action must occur, based on the remaining error budget.

When to Use This Skill

  • Setting reliability targets for high-traffic microservice APIs
  • Transitioning from reactive manual monitoring to automated budget-based alerting
  • Determining when to freeze features due to excessive incident frequency
  • Measuring the impact of new code releases on established latency targets

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

  • "Define a new SLO for our production API"
  • "Calculate the error budget for 99.9% availability"
  • "Create Prometheus rules for tracking my latency SLI"
  • "Set up alerting for high error budget burn rates"
  • "How much downtime does 99.99% availability allow per month?"

Pro Tips

  • Always cross-reference generated PromQL queries with your specific monitoring system's documentation to ensure syntax and metric names are accurate for your environment.
  • Use the agent to generate initial SLO documentation outlines, then collaborate with your team to refine the human-readable context and business impact for each objective.
  • When defining SLIs, guide the agent to focus on user-centric metrics (e.g., successful requests from the user's perspective) rather than purely internal system health, for a more accurate reflection of customer experience.

What this skill does

  • Define quantitative SLIs for availability, latency, and durability
  • Calculate monthly and annual downtime windows for various reliability percentages
  • Implement automated error budget tracking using Prometheus recording rules
  • Configure alerting thresholds for rapid and slow error budget consumption
  • Establish policy-based actions for service degradation based on remaining budget

When not to use it

  • Systems with extremely low traffic where statistical significance is difficult to achieve
  • Internal non-production environments where reliability impacts do not influence business outcomes

Example workflow

  1. Identify critical user journeys and their associated performance metrics
  2. Select appropriate SLI types like availability or latency for each journey
  3. Establish the target SLO percentage and calculate the permissible error budget
  4. Deploy Prometheus recording rules to track burn rate and compliance
  5. Configure Prometheus alerting rules, routed through Alertmanager, to trigger notifications when burn rates exceed defined limits
  6. Adjust development velocity based on the current state of the error budget

Prerequisites

  • Prometheus monitoring stack
  • Baseline metrics for service latency and success rates

Pitfalls & limitations

  • Choosing targets that are too ambitious for the current system architecture
  • Ignoring the lag inherent in long-window aggregations
  • Over-alerting on transient blips rather than sustained budget consumption

FAQ

What is the difference between an SLI and an SLO?
An SLI is the specific metric you measure, while an SLO is the target percentage or threshold you want that metric to maintain.
How does an error budget change my development cycle?
When the error budget is healthy, you prioritize new features. When it is exhausted, the policy dictates a shift in focus to reliability and bug fixes.
Can I use this for non-web services?
Yes, as long as the service emits metrics that can be aggregated and evaluated as a ratio of successful events versus total events.
Why use 28 days as a window for calculations?
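A 28-day window is exactly four weeks, so every window contains the same number of each weekday and weekly traffic patterns do not skew comparisons between windows. It is also close enough to a calendar month to measure performance against monthly business goals.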

How it compares

Unlike generic monitoring, this skill forces a strict contractual relationship between metrics and deployment policy, preventing arbitrary alerting and focusing solely on user-impacting thresholds.

Source & trust

37k stars | MIT license | Updated 2026-06-16
Full skill instructions (original source: wshobson/agents)
# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

    SLA (Service Level Agreement)
      ↓ Contract with customers
    SLO (Service Level Objective)
      ↓ Internal reliability target
    SLI (Service Level Indicator)
      ↓ Actual measurement


## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

    # Successful requests / Total requests
    sum(rate(http_requests_total{status!~"5.."}[28d]))
    /
    sum(rate(http_requests_total[28d]))


#### 2. Latency SLI

    # Requests below latency threshold / Total requests
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
    /
    sum(rate(http_request_duration_seconds_count[28d]))


#### 3. Durability SLI

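    # Successful writes / Total writes over the SLO window
    # increase() is used so counter resets do not distort the ratio
    sum(increase(storage_writes_successful_total[28d]))
    /
    sum(increase(storage_writes_total[28d]))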
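
All three SLI types above reduce to the same shape: good events divided by total events over a window. A minimal Python sketch of that ratio follows. The function name `sli_ratio`, the sample counts, and the choice to return `None` when there is no traffic are illustrative assumptions, not part of the skill.

```python
from typing import Optional


def sli_ratio(good_events: float, total_events: float) -> Optional[float]:
    """Return good/total for one window, or None when nothing was measured."""
    if total_events <= 0:
        # Assumption: zero traffic means "no data" rather than 100% or 0%.
        return None
    return good_events / total_events


# Hypothetical window: 1,000,000 requests, 400 of which returned 5xx.
availability = sli_ratio(1_000_000 - 400, 1_000_000)
print(f"availability SLI: {availability:.4%}")
```

Treating an empty window as "no data" keeps a quiet service from registering as either perfect or broken. A real implementation may prefer a different convention.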


**Reference:** See references/slo-definitions.md

## Setting SLO Targets

### Availability SLO Examples

| SLO % | Downtime/Month | Downtime/Year |
| ------ | -------------- | ------------- |
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
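
The figures in this table follow from multiplying the error budget (1 minus the SLO) by the length of the period. The sketch below reproduces them, assuming a 30-day month and a 365-day year, which is what the table's numbers imply. The function name `allowed_downtime_minutes` is hypothetical.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of full downtime an availability SLO permits over a period."""
    error_budget = 1 - slo_percent / 100
    return error_budget * period_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99):
    per_month = allowed_downtime_minutes(slo, 30)   # 30-day month, as the table assumes
    per_year = allowed_downtime_minutes(slo, 365)   # 365-day year, as the table assumes
    print(f"{slo}%: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```

The same function gives the allowance for the 28-day window used elsewhere in this document: `allowed_downtime_minutes(99.9, 28)` is about 40.32 minutes, slightly less than the 43.2 minutes of a 30-day month.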

### Choose Appropriate SLOs

**Consider:**

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**

    slos:
      - name: api_availability
        target: 99.9
        window: 28d
        sli: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Target: 99% of requests complete within 0.5 s (the le="0.5" bucket)
      - name: api_latency_p95
        target: 99
        window: 28d
        sli: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))
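
To show how such a definition might be consumed, here is a small sketch that models an SLO entry and checks a measured SLI against its target. The `SLO` class, its field names, and the measured values are assumptions made for illustration. The skill itself does not prescribe a loader or schema beyond the YAML shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    name: str
    target_percent: float  # e.g. 99.9, matching the `target` field
    window: str            # e.g. "28d"; kept as a label only in this sketch

    def is_met(self, sli_ratio: float) -> bool:
        """True when the measured SLI (a 0..1 ratio) reaches the target."""
        return sli_ratio >= self.target_percent / 100


slos = [SLO("api_availability", 99.9, "28d"), SLO("api_latency_p95", 99.0, "28d")]
measured = {"api_availability": 0.9995, "api_latency_p95": 0.985}  # hypothetical values
for slo in slos:
    status = "met" if slo.is_met(measured[slo.name]) else "violated"
    print(f"{slo.name}: {status}")
```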


## Error Budget Calculation

### Error Budget Formula

Error Budget = 1 - SLO Target


**Example:**

- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
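
The arithmetic in this example can be sketched in a few lines of Python. The function name `error_budget_remaining` is hypothetical, and inputs are ratios between 0 and 1 rather than percentages.

```python
def error_budget_remaining(sli_ratio: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1 - slo_target      # allowed error ratio, e.g. 0.001 for 99.9%
    consumed = 1 - sli_ratio     # observed error ratio, e.g. 0.0005
    return 1 - consumed / budget


remaining = error_budget_remaining(sli_ratio=0.9995, slo_target=0.999)
print(f"remaining budget: {remaining:.0%}")
```

A negative result means the service has spent more than its budget for the window.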

### Error Budget Policy

    error_budget_policy:
      - remaining_budget: 100%
        action: Normal development velocity
      - remaining_budget: 50%
        action: Consider postponing risky changes
      - remaining_budget: 10%
        action: Freeze non-critical changes
      - remaining_budget: 0%
        action: Feature freeze, focus on reliability
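
One way to apply such a policy in tooling is a simple threshold lookup. The sketch below assumes each listed percentage is an upper bound ("at or below 50% remaining, consider postponing risky changes"). The policy itself does not state how values between the listed points are handled, so that interpretation and the function name `policy_action` are assumptions.

```python
def policy_action(remaining_percent: float) -> str:
    """Map remaining error budget (in percent) to the policy's action text."""
    # Assumption: each threshold applies at or below the listed percentage.
    if remaining_percent <= 0:
        return "Feature freeze, focus on reliability"
    if remaining_percent <= 10:
        return "Freeze non-critical changes"
    if remaining_percent <= 50:
        return "Consider postponing risky changes"
    return "Normal development velocity"


for pct in (100, 50, 10, 0, -5):
    print(f"{pct:>4}% remaining -> {policy_action(pct)}")
```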


**Reference:** See references/error-budget.md

## SLO Implementation

### Prometheus Recording Rules

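    # SLI Recording Rules
    groups:
      - name: sli_rules
        interval: 30s
        rules:
          # Availability SLI
          - record: sli:http_availability:ratio
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[28d]))
              /
              sum(rate(http_requests_total[28d]))

          # Latency SLI (requests < 500ms)
          - record: sli:http_latency:ratio
            expr: |
              sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
              /
              sum(rate(http_request_duration_seconds_count[28d]))

      - name: slo_rules
        interval: 5m
        rules:
          # SLO compliance (1 = meeting SLO, 0 = violating)
          - record: slo:http_availability:compliance
            expr: sli:http_availability:ratio >= bool 0.999

          - record: slo:http_latency:compliance
            expr: sli:http_latency:ratio >= bool 0.99

          # Error budget remaining (percentage)
          - record: slo:http_availability:error_budget_remaining
            expr: |
              (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

          # Error budget burn rate, one rule per window used by the alerting rules
          - record: slo:http_availability:burn_rate_5m
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_30m
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[30m]))
                /
                sum(rate(http_requests_total[30m]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_1h
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[1h]))
                /
                sum(rate(http_requests_total[1h]))
              )) / (1 - 0.999)

          - record: slo:http_availability:burn_rate_6h
            expr: |
              (1 - (
                sum(rate(http_requests_total{status!~"5.."}[6h]))
                /
                sum(rate(http_requests_total[6h]))
              )) / (1 - 0.999)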
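
The burn-rate expression divides the observed error ratio by the error ratio the SLO allows, so a value of 1 means the budget is being spent exactly as fast as the target permits. A minimal Python sketch of the same calculation follows. The function name `burn_rate`, the sample counts, and the convention that zero traffic burns nothing are assumptions.

```python
def burn_rate(errors: float, total: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total <= 0:
        return 0.0  # assumption: no traffic burns no budget
    return (errors / total) / (1 - slo_target)


# Hypothetical 5-minute window: 10,000 requests, 144 failures, 99.9% target.
rate = burn_rate(errors=144, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```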


### SLO Alerting Rules

    groups:
      - name: slo_alerts
        interval: 1m
        rules:
          # Fast burn: 14.4x rate, 1 hour window
          # Consumes 2% of a 30-day error budget in 1 hour (about 2.1% of a 28-day budget)
          - alert: SLOErrorBudgetBurnFast
            expr: |
              slo:http_availability:burn_rate_1h > 14.4
              and
              slo:http_availability:burn_rate_5m > 14.4
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Fast error budget burn detected"
              description: "Error budget burning at {{ $value }}x rate"

          # Slow burn: 6x rate, 6 hour window
          # Consumes 5% of a 30-day error budget in 6 hours (about 5.4% of a 28-day budget)
          - alert: SLOErrorBudgetBurnSlow
            expr: |
              slo:http_availability:burn_rate_6h > 6
              and
              slo:http_availability:burn_rate_30m > 6
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Slow error budget burn detected"
              description: "Error budget burning at {{ $value }}x rate"

          # Error budget exhausted
          - alert: SLOErrorBudgetExhausted
            expr: slo:http_availability:error_budget_remaining < 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "SLO error budget exhausted"
              description: "Error budget remaining: {{ $value }}%"
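
The thresholds 14.4 and 6 come from a simple relation: the budget consumed equals the burn rate times the alert window divided by the SLO window. The sketch below checks that relation and inverts it to derive a threshold for a different window length. Both function names are hypothetical, and the 30-day figures are the ones under which 14.4 and 6 yield exactly 2% and 5%.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: float) -> float:
    """Fraction of the whole error budget spent during one alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)


def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in one alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


print(f"{budget_consumed(14.4, 1, 30):.1%} of a 30-day budget in 1 hour")
print(f"{budget_consumed(6, 6, 30):.1%} of a 30-day budget in 6 hours")
print(f"28-day equivalents: {burn_rate_threshold(0.02, 1, 28):.2f}x, "
      f"{burn_rate_threshold(0.05, 6, 28):.2f}x")
```

Whether to keep 14.4 and 6 with a 28-day window or switch to the 28-day equivalents is a judgment call. The difference is small, and the original thresholds simply fire at a slightly higher share of the budget.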


## SLO Dashboard

**Grafana Dashboard Structure:**

    ┌────────────────────────────────────┐
    │ SLO Compliance (Current)           │
    │ ✓ 99.95% (Target: 99.9%)           │
    ├────────────────────────────────────┤
    │ Error Budget Remaining: 65%        │
    │ ████████░░ 65%                     │
    ├────────────────────────────────────┤
    │ SLI Trend (28 days)                │
    │ [Time series graph]                │
    ├────────────────────────────────────┤
    │ Burn Rate Analysis                 │
    │ [Burn rate by time window]         │
    └────────────────────────────────────┘


**Example Queries:**

    # Current SLO compliance
    sli:http_availability:ratio * 100

    # Error budget remaining
    slo:http_availability:error_budget_remaining

    # Days until error budget exhausted (at the 28-day average burn rate)
    (slo:http_availability:error_budget_remaining / 100)
    *
    28
    /
    ((1 - sli:http_availability:ratio) / (1 - 0.999))
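
The last query is the remaining budget fraction multiplied by the window length and divided by the burn rate. A short Python sketch of the same projection follows. The function name `days_until_exhausted` and the handling of the edge cases (budget already spent, nothing burning) are assumptions.

```python
import math


def days_until_exhausted(remaining_fraction: float, burn_rate: float,
                         window_days: float = 28.0) -> float:
    """Project how long the remaining budget lasts at a constant burn rate."""
    if remaining_fraction <= 0:
        return 0.0        # budget already spent
    if burn_rate <= 0:
        return math.inf   # nothing is burning, so the budget never runs out
    return remaining_fraction * window_days / burn_rate


# Hypothetical state: 65% of the budget left, burning at twice the allowed rate.
print(f"{days_until_exhausted(0.65, 2.0):.1f} days")
```

The projection is only as current as the burn rate fed into it. The query above uses the 28-day average, so a recent spike shows up slowly. Substituting a shorter-window burn rate gives a more reactive estimate.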


## Multi-Window Burn Rate Alerts

    # Combination of short and long windows reduces false positives
    rules:
      - alert: SLOBurnRateHigh
        expr: |
          (
            slo:http_availability:burn_rate_1h > 14.4
            and
            slo:http_availability:burn_rate_5m > 14.4
          )
          or
          (
            slo:http_availability:burn_rate_6h > 6
            and
            slo:http_availability:burn_rate_30m > 6
          )
        labels:
          severity: critical
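
The logic of this rule can be sketched outside Prometheus to see why pairing windows suppresses noise: the long window confirms the burn is significant, and the short window confirms it is still happening. The function name `multi_window_alert` and the sample burn rates are hypothetical.

```python
from typing import Mapping


def multi_window_alert(burn: Mapping[str, float]) -> bool:
    """Fire only when a long window and its paired short window both exceed the threshold."""
    fast = burn["1h"] > 14.4 and burn["5m"] > 14.4
    slow = burn["6h"] > 6 and burn["30m"] > 6
    return fast or slow


# A brief spike: the 5m window is hot, but the 1h window is not, so no alert.
spike = {"5m": 40.0, "30m": 7.0, "1h": 3.0, "6h": 0.8}
# A sustained incident: both fast-burn windows exceed 14.4, so the alert fires.
sustained = {"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 3.0}
print(multi_window_alert(spike), multi_window_alert(sustained))
```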


## SLO Review Process

### Weekly Review

- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review

- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review

- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Reference Files

- assets/slo-template.md - SLO definition template
- references/slo-definitions.md - SLO definition patterns
- references/error-budget.md - Error budget calculations

## Related Skills

- prometheus-configuration - For metric collection
- grafana-dashboards - For SLO visualization

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/slo-implementation/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/wshobson/agents/slo-implementation/SKILL.md
  • Cursor: ~/.cursor/skills/wshobson/agents/slo-implementation/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/slo-implementation/SKILL.md

Install with CLI:
npx skills add wshobson/agents


How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best used in the "Rules for AI" section. This skill helps the agent avoid common DevOps and CI/CD issues.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under DevOps & CI/CD and is published by W. Shobson, maintained in wshobson/agents.

← Browse All Agent Skills
Sponsored AI assistant. Recommendations may be paid.