prometheus-configuration
Install this skill
npx skills add wshobson/agents
Works across Claude Code, Cursor, Codex, Copilot & Antigravity
The prometheus-configuration skill helps generate and maintain Prometheus server configuration. It produces prometheus.yml files, sets scrape intervals, and configures target discovery: static host definitions, file-based service discovery, and Kubernetes relabeling rules. With this configuration in place, the server can scrape infrastructure and application metrics endpoints, route alerts to Alertmanager, and act as a data source for visualization platforms such as Grafana. The skill focuses on the structure of the Prometheus monitoring stack: consistent ingestion from containerized services, node exporters, and custom application endpoints, plus retention settings for production deployments.
When to Use This Skill
- Setting up a new monitoring stack for a Kubernetes cluster
- Integrating application-specific metrics from custom HTTP endpoints
- Centralizing infrastructure monitoring from multiple Node Exporters
- Scaling scrape targets dynamically via file-based service discovery
How to Invoke This Skill
Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:
- Generate a prometheus.yml for my kubernetes pods
- How do I configure file-based service discovery in Prometheus?
- Add a scrape job for node-exporter to my prometheus configuration
- Help me write relabeling rules for my k8s service annotations
- What is the syntax for adding TLS to a prometheus scrape target?
Pro Tips
- Prioritize Service Discovery: Instead of maintaining static target lists by hand, use Prometheus's service discovery mechanisms (e.g., Kubernetes SD, EC2 SD) to discover targets dynamically and reduce configuration overhead.
- Optimize Storage and Retention: Plan storage allocation and retention policies carefully. For long-term storage, use a system such as Thanos or Cortex (Thanos can also downsample older data) to balance cost against historical data needs.
- Refine Alerting: Design alert rules with clear thresholds and effective notification channels. Use `for` clauses to prevent flapping alerts, and make sure every alert is actionable.
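As a rough illustration of the service discovery tip, a scrape job using EC2 discovery might look like the sketch below. The job name, region, port, and the `Environment` tag are assumptions chosen for the example, not values from this skill, and AWS credentials are assumed to come from the environment or an instance role.

```yaml
scrape_configs:
  - job_name: "ec2-node-exporter"   # hypothetical job name
    ec2_sd_configs:
      - region: us-west-2           # assumed region
        port: 9100                  # assumed node-exporter port
    relabel_configs:
      # Keep only instances tagged Environment=production (assumed tag)
      - source_labels: [__meta_ec2_tag_Environment]
        action: keep
        regex: production
      # Use the EC2 instance ID as the instance label
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance
```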
What this skill does
- Generates prometheus.yml files with global scrape and evaluation intervals
- Configures static and dynamic target discovery using file-based or Kubernetes metadata
- Defines relabeling rules for standardizing incoming metric labels
- Sets up Alertmanager integration and rule file loading
- Configures secure communication via TLS and client certificate authentication
When not to use it
- When you require long-term storage for historical metrics exceeding local disk capacity
- When you need push-based metrics delivery rather than pull-based scraping
Example workflow
- Identify target endpoints and their required scraping frequency
- Determine the appropriate service discovery method (static, file-based, or Kubernetes)
- Draft the scrape_configs block including necessary relabel_configs for metadata filtering
- Validate the generated configuration file syntax against the Prometheus schema
- Apply the configuration via Helm chart update or deployment volume reload
- Verify target health on the Targets page of the Prometheus web UI, or by querying `up` in the expression browser
Prerequisites
- Running Prometheus server instance
- Access to the Prometheus configuration directory
- Target endpoints with functioning /metrics endpoints
Pitfalls & limitations
- Incorrect relabeling regex patterns can drop all metrics or cause duplicate target errors
- Excessively frequent scrape intervals can significantly increase CPU and memory usage on the Prometheus server
- Missing TLS files on the server side will cause scrape failures for HTTPS targets
FAQ
How it compares
This skill automates complex YAML syntax and relabeling logic that is error-prone when written manually, ensuring valid Prometheus configuration blocks that follow best practices.
Full skill instructions (original source: wshobson/agents)
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
## Purpose
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
## When to Use
- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery
## Prometheus Architecture
    ┌──────────────┐
    │ Applications │ ← Instrumented with client libraries
    └──────┬───────┘
           │ /metrics endpoint
           ↓
    ┌──────────────┐
    │  Prometheus  │ ← Scrapes metrics periodically
    │    Server    │
    └──────┬───────┘
           │
           ├─→ AlertManager (alerts)
           ├─→ Grafana (visualization)
           └─→ Long-term storage (Thanos/Cortex)

## Installation
### Kubernetes with Helm
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/kube-prometheus-stack \
      --namespace monitoring \
      --create-namespace \
      --set prometheus.prometheusSpec.retention=30d \
      --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi

### Docker Compose
    version: "3.8"
    services:
      prometheus:
        image: prom/prometheus:latest
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - prometheus-data:/prometheus
        command:
          - "--config.file=/etc/prometheus/prometheus.yml"
          - "--storage.tsdb.path=/prometheus"
          - "--storage.tsdb.retention.time=30d"

    volumes:
      prometheus-data:

## Configuration File
**prometheus.yml:**

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: "production"
        region: "us-west-2"

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager:9093

    # Load rules files
    rule_files:
      - /etc/prometheus/rules/*.yml

    # Scrape configurations
    scrape_configs:
      # Prometheus itself
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]

      # Node exporters
      - job_name: "node-exporter"
        static_configs:
          - targets:
              - "node1:9100"
              - "node2:9100"
              - "node3:9100"
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance
            regex: "([^:]+)(:[0-9]+)?"
            replacement: "${1}"

      # Kubernetes pods with annotations
      - job_name: "kubernetes-pods"
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels:
              [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: pod

      # Application metrics
      - job_name: "my-app"
        static_configs:
          - targets:
              - "app1.example.com:9090"
              - "app2.example.com:9090"
        metrics_path: "/metrics"
        scheme: "https"
        tls_config:
          ca_file: /etc/prometheus/ca.crt
          cert_file: /etc/prometheus/client.crt
          key_file: /etc/prometheus/client.key

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations
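For reference, a pod that the `kubernetes-pods` job in prometheus.yml would keep could be annotated as in this sketch. The pod name, image, port, and path are hypothetical; the annotation keys follow from the `__meta_kubernetes_pod_annotation_prometheus_io_*` labels used in the relabel rules, since Prometheus replaces characters such as `.` and `/` in annotation names with `_`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app                 # hypothetical pod name
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule (regex: true)
    prometheus.io/path: "/metrics"  # copied into __metrics_path__
    prometheus.io/port: "8080"      # rewritten into __address__
spec:
  containers:
    - name: app
      image: example/app:1.0        # hypothetical image
      ports:
        - containerPort: 8080
```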
### Static Targets
    scrape_configs:
      - job_name: "static-targets"
        static_configs:
          - targets: ["host1:9100", "host2:9100"]
            labels:
              env: "production"
              region: "us-west-2"

### File-based Service Discovery
    scrape_configs:
      - job_name: "file-sd"
        file_sd_configs:
          - files:
              - /etc/prometheus/targets/*.json
              - /etc/prometheus/targets/*.yml
            refresh_interval: 5m

**targets/production.json:**

    [
      {
        "targets": ["app1:9090", "app2:9090"],
        "labels": {
          "env": "production",
          "service": "api"
        }
      }
    ]

### Kubernetes Service Discovery
    scrape_configs:
      - job_name: "kubernetes-services"
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          - source_labels:
              [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels:
              [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)

**Reference:** See `references/scrape-configs.md`

## Recording Rules
Create pre-computed metrics for frequently queried expressions:

    # /etc/prometheus/rules/recording_rules.yml
    groups:
      - name: api_metrics
        interval: 15s
        rules:
          # HTTP request rate per service
          - record: job:http_requests:rate5m
            expr: sum by (job) (rate(http_requests_total[5m]))

          # Error rate percentage
          - record: job:http_requests_errors:rate5m
            expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

          - record: job:http_requests_error_rate:percentage
            expr: |
              (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

          # P95 latency
          - record: job:http_request_duration:p95
            expr: |
              histogram_quantile(0.95,
                sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
              )

      - name: resource_metrics
        interval: 30s
        rules:
          # CPU utilization percentage
          - record: instance:node_cpu:utilization
            expr: |
              100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

          # Memory utilization percentage
          - record: instance:node_memory:utilization
            expr: |
              100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

          # Disk usage percentage
          - record: instance:node_disk:utilization
            expr: |
              100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

**Reference:** See `references/recording-rules.md`

## Alert Rules
    # /etc/prometheus/rules/alert_rules.yml
    groups:
      - name: availability
        interval: 30s
        rules:
          - alert: ServiceDown
            expr: up{job="my-app"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Service {{ $labels.instance }} is down"
              description: "{{ $labels.job }} has been down for more than 1 minute"

          - alert: HighErrorRate
            expr: job:http_requests_error_rate:percentage > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is {{ $value }}% (threshold: 5%)"

          - alert: HighLatency
            expr: job:http_request_duration:p95 > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High latency for {{ $labels.job }}"
              description: "P95 latency is {{ $value }}s (threshold: 1s)"

      - name: resources
        interval: 1m
        rules:
          - alert: HighCPUUsage
            expr: instance:node_cpu:utilization > 80
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.instance }}"
              description: "CPU usage is {{ $value }}%"

          - alert: HighMemoryUsage
            expr: instance:node_memory:utilization > 85
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage on {{ $labels.instance }}"
              description: "Memory usage is {{ $value }}%"

          - alert: DiskSpaceLow
            expr: instance:node_disk:utilization > 90
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Low disk space on {{ $labels.instance }}"
              description: "Disk usage is {{ $value }}%"

## Validation
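Alert rules such as `ServiceDown` can also be unit-tested with `promtool test rules`. The test file below is a sketch under stated assumptions: the file name `alert_rules_test.yml`, the input series, and the instance value are made up for illustration, and `alert_rules.yml` is assumed to sit next to the test file. It could be run with `promtool test rules alert_rules_test.yml`.

```yaml
# alert_rules_test.yml (hypothetical file name)
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Simulated target that reports down for five consecutive samples
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # After 2m the 1m `for` clause has elapsed, so the alert should be firing
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```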
    # Validate configuration
    promtool check config prometheus.yml

    # Validate rules
    promtool check rules /etc/prometheus/rules/*.yml

    # Test query
    promtool query instant http://localhost:9090 'up'

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices
1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**
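Practices 6 and 8 can be sketched in configuration as follows. This is an illustration only: the dropped metric pattern, the `request_id` label, and the downstream server address are assumptions, and the `federate` job would normally live in the configuration of a separate global Prometheus server.

```yaml
scrape_configs:
  # Metric cleanup (practice 6): filter series after the scrape, before storage
  - job_name: "my-app"
    static_configs:
      - targets: ["app1.example.com:9090"]
    metric_relabel_configs:
      # Drop metrics by name (assumed to be unneeded in this setup)
      - source_labels: [__name__]
        regex: "go_gc_.*"
        action: drop
      # Remove a hypothetical high-cardinality label
      - action: labeldrop
        regex: "request_id"

  # Federation (practice 8): pull only aggregated recording-rule series
  - job_name: "federate"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-us-west-2:9090"]  # hypothetical downstream server
```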
## Troubleshooting
**Check scrape targets:**

    curl http://localhost:9090/api/v1/targets

**Check configuration:**

    curl http://localhost:9090/api/v1/status/config

**Test query:**

    curl 'http://localhost:9090/api/v1/query?query=up'

## Reference Files
- `assets/prometheus.yml.template` - Complete configuration template
- `references/scrape-configs.md` - Scrape configuration patterns
- `references/recording-rules.md` - Recording rule examples
- `scripts/validate-prometheus.sh` - Validation script

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing

How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory: `.agent/skills/prometheus-configuration/`
- Save the file as `SKILL.md`
- The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code: `~/.claude/skills/wshobson/agents/prometheus-configuration/SKILL.md`
- Cursor: `~/.cursor/skills/wshobson/agents/prometheus-configuration/SKILL.md`
- Antigravity: `~/.gemini/antigravity/skills/wshobson/agents/prometheus-configuration/SKILL.md`
Install with CLI: `npx skills add wshobson/agents`