slo-implementation
Install this skill
npx skills add wshobson/agents
Works across Claude Code, Cursor, Codex, Copilot & Antigravity
The SLO Implementation skill provides a structured method for setting and tracking reliability targets. It translates business goals into concrete measurements by defining Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets, with the aim of balancing service stability against deployment frequency. Burn rates and compliance ratios are computed with Prometheus recording rules, so the decision to accelerate development or prioritize stability rests on data. This moves teams from reactive troubleshooting toward proactive reliability management and keeps performance metrics aligned with user experience requirements and operational constraints. It also defines how service performance is monitored and when corrective action is required, based on the remaining error budget.
When to Use This Skill
- Setting reliability targets for high-traffic microservice APIs
- Transitioning from reactive manual monitoring to automated budget-based alerting
- Determining when to freeze features due to excessive incident frequency
- Measuring the impact of new code releases on established latency targets
How to Invoke This Skill
Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:
- Define a new SLO for our production API
- Calculate the error budget for 99.9% availability
- Create Prometheus rules for tracking my latency SLI
- Set up alerting for high error budget burn rates
- How much downtime does 99.99% availability allow per month?
Pro Tips
- Always cross-reference generated PromQL queries with your specific monitoring system's documentation to ensure syntax and metric names are accurate for your environment.
- Use the agent to generate initial SLO documentation outlines, then collaborate with your team to refine the human-readable context and business impact for each objective.
- When defining SLIs, guide the agent to focus on user-centric metrics (e.g., successful requests from the user's perspective) rather than purely internal system health, for a more accurate reflection of customer experience.
What this skill does
- Define quantitative SLIs for availability, latency, and durability
- Calculate monthly and annual downtime windows for various reliability percentages
- Implement automated error budget tracking using Prometheus recording rules
- Configure alerting thresholds for rapid and slow error budget consumption
- Establish policy-based actions for service degradation based on remaining budget
When not to use it
- Systems with extremely low traffic where statistical significance is difficult to achieve
- Internal non-production environments where reliability impacts do not influence business outcomes
Example workflow
- Identify critical user journeys and their associated performance metrics
- Select appropriate SLI types like availability or latency for each journey
- Establish the target SLO percentage and calculate the permissible error budget
- Deploy Prometheus recording rules to track burn rate and compliance
- Configure Prometheus alerting rules, routed through Alertmanager, to trigger notifications when burn rates exceed defined limits
- Adjust development velocity based on the current state of the error budget
Prerequisites
- Prometheus monitoring stack
- Baseline metrics for service latency and success rates
Pitfalls & limitations
- Choosing targets that are too ambitious for the current system architecture
- Ignoring the lag inherent in long-window aggregations
- Over-alerting on transient blips rather than sustained budget consumption
How it compares
Unlike generic monitoring, this skill ties metrics to deployment policy through an explicit error budget, which discourages arbitrary alerting and keeps the focus on user-impacting thresholds.
Full skill instructions (original source: wshobson/agents)
Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
## Purpose
Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.
## When to Use
- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals
## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs
### Common SLI Types
#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```promql
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets
### Availability SLO Examples
| SLO % | Downtime/Month | Downtime/Year |
| ------ | -------------- | ------------- |
| 99% | 7.2 hours | 3.65 days |
| 99.9% | 43.2 minutes | 8.76 hours |
| 99.95% | 21.6 minutes | 4.38 hours |
| 99.99% | 4.32 minutes | 52.56 minutes |
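
The downtime figures in this table follow directly from the target percentage. A minimal Python sketch of the arithmetic, assuming a 30-day month and a 365-day year (the function name `allowed_downtime_minutes` is illustrative, not part of the skill):

```python
# Sketch: allowed downtime for an availability target.
# Assumption: a month is 30 days and a year is 365 days, as the table implies.

def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Return the minutes of full downtime the SLO permits within the period."""
    error_budget = 1 - slo_percent / 100      # e.g. 99.9 -> 0.001
    return error_budget * period_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.95, 99.99):
        month = allowed_downtime_minutes(slo, 30)
        year = allowed_downtime_minutes(slo, 365)
        print(f"{slo}%: {month:.2f} min/month, {year / 60:.2f} h/year")
```

Note that the SLO examples elsewhere in this document use a 28-day window, for which the same function gives slightly smaller allowances.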
### Choose Appropriate SLOs
**Consider:**
- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks
**Example SLOs:**

```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation
### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**
- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%
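
The example above can be reproduced with a few lines of Python. This is a sketch only; `error_budget_remaining` is a hypothetical helper, and it takes the SLI and SLO as fractions (0.9995 rather than 99.95).

```python
# Sketch: fraction of error budget remaining, given a measured SLI.
# Hypothetical helper; inputs are fractions in [0, 1].

def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Return the remaining budget as a fraction; negative once the SLO is violated."""
    if not 0 <= slo_target < 1:
        raise ValueError("slo_target must be in [0, 1); a 100% target has no budget")
    budget = 1 - slo_target      # total error budget, e.g. 0.001 for 99.9%
    consumed = 1 - sli           # observed error ratio, e.g. 0.0005
    return (budget - consumed) / budget


if __name__ == "__main__":
    remaining = error_budget_remaining(sli=0.9995, slo_target=0.999)
    print(f"Remaining budget: {remaining:.0%}")
```

Algebraically this is `(sli - slo_target) / (1 - slo_target)`, the same expression the `slo:http_availability:error_budget_remaining` recording rule uses before scaling to a percentage.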
### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```

**Reference:** See `references/error-budget.md`

## SLO Implementation
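
Before wiring up Prometheus, it can help to express the error budget policy from the previous section as a small lookup. The sketch below is one possible reading: it assumes each listed `remaining_budget` value is an upper bound ("at or below this level, take this action"), which the policy itself does not state. The names `POLICY` and `policy_action` are illustrative.

```python
# Sketch: map remaining error budget (percent) to a policy action.
# Assumption: each threshold means "at or below this much budget remaining".

POLICY = [
    (0.0, "Feature freeze, focus on reliability"),
    (10.0, "Freeze non-critical changes"),
    (50.0, "Consider postponing risky changes"),
]
DEFAULT_ACTION = "Normal development velocity"


def policy_action(remaining_percent: float) -> str:
    """Return the action for the lowest threshold the remaining budget falls under."""
    for threshold, action in POLICY:   # ordered from most to least severe
        if remaining_percent <= threshold:
            return action
    return DEFAULT_ACTION


if __name__ == "__main__":
    for pct in (65, 50, 8, -3):
        print(f"{pct}% remaining: {policy_action(pct)}")
```

Teams that prefer the thresholds to act as lower bounds can adjust the comparison; the point is that the policy is mechanical once the remaining budget is known.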
### Prometheus Recording Rules
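
```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (5m window)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Burn rates over the 30m, 1h, and 6h windows referenced by the alerting rules
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

### SLO Alerting Rules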
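
The 14.4 and 6 thresholds used in the alerting rules in this section are not arbitrary: each is the burn rate that would spend a chosen share of the error budget within the alert window. A short Python sketch of that relationship follows; the helper names are illustrative, and the 30-day default is an assumption chosen because it reproduces the 14.4 and 6 figures.

```python
# Sketch: relate burn-rate thresholds to budget consumption.
# Assumption: thresholds are derived against a 30-day SLO window.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1 - slo_target)


def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_days: float = 30.0,
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


if __name__ == "__main__":
    print(f"fast burn (2% in 1h): {burn_rate_threshold(0.02, 1):.1f}x")
    print(f"slow burn (5% in 6h): {burn_rate_threshold(0.05, 6):.1f}x")
    print(f"fast burn, 28d window: {burn_rate_threshold(0.02, 1, 28):.2f}x")
```

With the 28-day window used by the SLI rules in this document, a 14.4x burn consumes slightly more than 2% of the budget per hour, so the thresholds err on the conservative side.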

```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```

## SLO Dashboard
**Grafana Dashboard Structure:**

```
┌──────────────────────────────────┐
│ SLO Compliance (Current)         │
│ ✓ 99.95% (Target: 99.9%)         │
├──────────────────────────────────┤
│ Error Budget Remaining: 65%      │
│ ████████░░ 65%                   │
├──────────────────────────────────┤
│ SLI Trend (28 days)              │
│ [Time series graph]              │
├──────────────────────────────────┤
│ Burn Rate Analysis               │
│ [Burn rate by time window]       │
└──────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted
# (at the average burn rate observed over the 28-day window)
(slo:http_availability:error_budget_remaining / 100)
*
28
/
(1 - sli:http_availability:ratio) * (1 - 0.999)
```

## Multi-Window Burn Rate Alerts
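
The combined rule in this section fires only when a long window and its paired short window both exceed the same threshold, so a brief spike that has already ended does not page anyone. The decision logic can be sketched in Python as follows; `should_alert` and the dictionary keys are hypothetical names, and the burn-rate values would come from the corresponding recording rules.

```python
# Sketch: multi-window burn-rate decision, mirroring the PromQL `and` / `or` logic.
# Hypothetical helper; keys name the burn-rate windows.
from typing import Mapping

FAST_THRESHOLD = 14.4   # paired windows: 1h and 5m
SLOW_THRESHOLD = 6.0    # paired windows: 6h and 30m


def should_alert(burn_rates: Mapping[str, float]) -> bool:
    """Return True when either window pair exceeds its threshold on both windows."""
    fast = burn_rates["1h"] > FAST_THRESHOLD and burn_rates["5m"] > FAST_THRESHOLD
    slow = burn_rates["6h"] > SLOW_THRESHOLD and burn_rates["30m"] > SLOW_THRESHOLD
    return fast or slow


if __name__ == "__main__":
    # A short spike: only the 5m window is hot, so no alert.
    print(should_alert({"5m": 20.0, "30m": 3.0, "1h": 2.0, "6h": 1.0}))
    # A sustained incident: 1h and 5m both exceed 14.4.
    print(should_alert({"5m": 20.0, "30m": 18.0, "1h": 16.0, "6h": 3.0}))
```

The short window also lets the alert clear soon after the burn stops, instead of waiting for the long window to drain.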

```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```

## SLO Review Process
### Weekly Review
- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact
### Monthly Review
- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments
### Quarterly Review
- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements
## Best Practices
1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**
## Reference Files
- `assets/slo-template.md` - SLO definition template
- `references/slo-definitions.md` - SLO definition patterns
- `references/error-budget.md` - Error budget calculations

## Related Skills
- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization

How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory: `.agent/skills/slo-implementation/`
- Save the file as `SKILL.md`
- The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code: `~/.claude/skills/wshobson/agents/slo-implementation/SKILL.md`
- Cursor: `~/.cursor/skills/wshobson/agents/slo-implementation/SKILL.md`
- Antigravity: `~/.gemini/antigravity/skills/wshobson/agents/slo-implementation/SKILL.md`
Install with CLI: `npx skills add wshobson/agents`