prometheus-configuration

Prometheus, monitoring, metrics, alerting, devops, kubernetes, grafana, observability
⭐ 36.8k 📄 MIT 🕒 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The prometheus-configuration skill generates and maintains Prometheus server configuration. It creates prometheus.yml files, sets scrape intervals, and configures target discovery: static host lists, file-based service discovery, and Kubernetes relabeling rules. With this configuration in place, the server polls infrastructure and application metrics endpoints, routes alerts to Alertmanager, and serves as a data source for visualization tools such as Grafana. The skill concentrates on the structure of the monitoring stack, covering ingestion from containerized services, node exporters, and custom application endpoints, along with retention settings for production deployments.

When to Use This Skill

  • Setting up a new monitoring stack for a Kubernetes cluster
  • Integrating application-specific metrics from custom HTTP endpoints
  • Centralizing infrastructure monitoring from multiple Node Exporters
  • Scaling scrape targets dynamically via file-based service discovery

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

  • “Generate a prometheus.yml for my kubernetes pods”
  • “How do I configure file-based service discovery in Prometheus?”
  • “Add a scrape job for node-exporter to my prometheus configuration”
  • “Help me write relabeling rules for my k8s service annotations”
  • “What is the syntax for adding TLS to a prometheus scrape target?”

Pro Tips

  • 💡 Prioritize Service Discovery: Instead of manual scrape configurations, use Prometheus's service discovery mechanisms (e.g., Kubernetes SD, EC2 SD) to discover targets dynamically and reduce configuration overhead.
  • 💡 Optimize Storage and Retention: Plan storage allocation and retention policies up front. For long-term storage, ship data to a system such as Thanos or Cortex (Thanos can also downsample older data) to balance cost against historical data needs.
  • 💡 Refine Alerting: Design alert rules with clear thresholds and effective notification channels. Use `for` clauses to prevent flapping and keep every alert actionable.
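
As a rough illustration of the storage tip, the fragment below forwards samples from prometheus.yml to a remote store with `remote_write`. It is only a sketch: the host name is hypothetical, and the port and path assume a Thanos Receive endpoint with default settings; a Cortex deployment would use its own push URL.

```yaml
# prometheus.yml fragment (sketch, not a complete configuration)
global:
  external_labels:
    cluster: "production"   # lets the remote store tell Prometheus instances apart

remote_write:
  # Hypothetical endpoint; assumes Thanos Receive listening on its default
  # remote-write port and path.
  - url: "http://thanos-receive.example.internal:19291/api/v1/receive"
```

Local retention can then stay short, since older data is served from the remote store.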

What this skill does

  • Generates prometheus.yml files with global scrape and evaluation intervals
  • Configures static and dynamic target discovery using file-based or Kubernetes metadata
  • Defines relabeling rules for standardizing incoming metric labels
  • Sets up Alertmanager integration and rule file loading
  • Configures secure communication via TLS and client certificate authentication

When not to use it

  • ✕ When you require long-term storage for historical metrics exceeding local disk capacity
  • ✕ When you need push-based metrics delivery rather than pull-based scraping

Example workflow

  1. Identify target endpoints and their required scraping frequency
  2. Determine the appropriate service discovery method (static, file-based, or Kubernetes)
  3. Draft the scrape_configs block including necessary relabel_configs for metadata filtering
  4. Validate the generated configuration file with promtool check config
  5. Apply the configuration via a Helm chart update, or update the mounted file and reload Prometheus
  6. Verify target health on the Status > Targets page of the Prometheus web UI

Prerequisites

  • Running Prometheus server instance
  • Access to the Prometheus configuration directory
  • Targets that expose a working /metrics endpoint

Pitfalls & limitations

  • Incorrect relabeling regex patterns can drop all targets or metrics, or cause duplicate target errors
  • Excessively frequent scrape intervals can significantly increase CPU and memory usage on the Prometheus server
  • Missing CA, certificate, or key files on the Prometheus server will cause scrape failures for HTTPS targets
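
One way to limit scrape load is to override the interval per job rather than lowering the global value. The sketch below assumes a hypothetical job name and target; `scrape_timeout` must not exceed `scrape_interval`.

```yaml
# prometheus.yml fragment (sketch)
global:
  scrape_interval: 15s        # default for jobs that do not override it

scrape_configs:
  - job_name: "slow-batch-exporter"   # hypothetical job
    scrape_interval: 60s              # scrape this job less often than the default
    scrape_timeout: 30s               # must be <= scrape_interval
    static_configs:
      - targets: ["batch1:9100"]      # hypothetical target
```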

FAQ

How do I refresh Prometheus configuration without restarting?
You can trigger a hot reload by sending a SIGHUP signal to the Prometheus process, or by sending an HTTP POST request to the /-/reload endpoint, which is available only when Prometheus is started with --web.enable-lifecycle.
What is the difference between static_configs and file_sd_configs?
Static configs use hardcoded IP or DNS lists in the YAML file, whereas file_sd_configs allows Prometheus to watch external files for target changes without requiring a config reload.
Why is my Kubernetes pod not being scraped?
Check your relabel_configs to ensure the scraping annotations match the keys configured in your Prometheus scrape job.
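
For the reload question above, the HTTP endpoint only works when the lifecycle API is switched on. The fragment below is a sketch based on the Docker Compose service used later in this guide, with the `--web.enable-lifecycle` flag added to the command list.

```yaml
# docker-compose.yml fragment (sketch)
services:
  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      # Enables the /-/reload and /-/quit HTTP endpoints
      - "--web.enable-lifecycle"
```

With that flag set, `curl -X POST http://localhost:9090/-/reload` asks the server to re-read its configuration. Note that the endpoint is unauthenticated, so it should not be exposed outside a trusted network.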

How it compares

This skill automates complex YAML syntax and relabeling logic that is error-prone when written manually, ensuring valid Prometheus configuration blocks that follow best practices.

Source & trust

⭐ 37k stars 📄 MIT 🕒 Updated 2026-06-16
📄 Full skill instructions, original source: wshobson/agents
# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ AlertManager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)


## Installation

### Kubernetes with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
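
The same settings can also live in a values file, which is easier to review than a long list of `--set` flags. This is a sketch, assuming the kube-prometheus-stack chart's `storageSpec.volumeClaimTemplate` layout; the file name is hypothetical.

```yaml
# values.yaml (hypothetical file name)
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
```

It would be passed to the install command with `-f values.yaml` in place of the two `--set prometheus.prometheusSpec...` flags.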


### Docker Compose

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:


## Configuration File

**prometheus.yml:**

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key


**Reference:** See assets/prometheus.yml.template
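
The kubernetes-pods job above keeps only pods whose `prometheus.io/scrape` annotation is `true`, and reads the port and path from sibling annotations. A pod that this job would pick up might look like the sketch below; the pod name, image, and port are hypothetical, and annotation values must be quoted strings.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                      # hypothetical
  namespace: default
  annotations:
    prometheus.io/scrape: "true"    # matched by the keep rule
    prometheus.io/port: "8080"      # rewritten into __address__
    prometheus.io/path: "/metrics"  # copied to __metrics_path__
spec:
  containers:
    - name: my-app
      image: example.com/my-app:1.0.0   # hypothetical image
      ports:
        - containerPort: 8080
```

If a pod is missing from the targets list, comparing its annotations against these three keys is usually the first check.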

## Scrape Configurations

### Static Targets

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"


### File-based Service Discovery

scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m


**targets/production.json:**

[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
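
Because the job above also globs `*.yml`, the same target groups can be written in YAML. A sketch of an equivalent file, using a hypothetical file name and hypothetical hosts:

```yaml
# /etc/prometheus/targets/staging.yml (hypothetical file)
- targets:
    - "app3:9090"
    - "app4:9090"
  labels:
    env: "staging"
    service: "api"
```

Prometheus watches these files for changes, so editing one adds or removes targets without a config reload.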


### Kubernetes Service Discovery

scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)


**Reference:** See references/scrape-configs.md

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)


**Reference:** See references/recording-rules.md

## Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
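
The `severity` labels set by the rules above only matter once Alertmanager routes on them. A minimal alertmanager.yml sketch is shown below; the receiver names and webhook URLs are hypothetical, and the timing values are illustrative rather than recommended.

```yaml
# alertmanager.yml (sketch)
route:
  receiver: "default"               # fallback for anything not matched below
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"       # ServiceDown and DiskSpaceLow carry this label
      receiver: "oncall"

receivers:
  - name: "default"
    webhook_configs:
      - url: "http://alert-webhook.example.internal:8080/alerts"   # hypothetical
  - name: "oncall"
    webhook_configs:
      - url: "http://alert-webhook.example.internal:8080/oncall"   # hypothetical
```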


## Validation

# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'


**Reference:** See scripts/validate-prometheus.sh
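
Beyond syntax checks, promtool can unit-test rules against synthetic series with `promtool test rules <file>`. The sketch below is a hypothetical test file for the ServiceDown alert, assuming the alert rules live at the path used earlier in this guide.

```yaml
# /etc/prometheus/rules/tests/alert_rules_test.yml (hypothetical file)
rule_files:
  - /etc/prometheus/rules/alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic input: the target reports up == 0 for five consecutive samples
    input_series:
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # Well past the 1m "for" clause, so the alert is expected to be firing
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Tests like this catch a mistyped label matcher or `for` duration before the rule reaches production.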

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**
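
Practices 6 and 8 can be sketched as scrape jobs. The example below assumes a hypothetical `myapp_debug_` metric prefix and a hypothetical downstream Prometheus host; the federation job pulls only series whose names start with `job:`, which matches the recording rules defined earlier.

```yaml
# prometheus.yml fragment (sketch)
scrape_configs:
  # Practice 6: drop a noisy metric family before it is stored
  - job_name: "my-app"
    static_configs:
      - targets: ["app1.example.com:9090"]
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "myapp_debug_.*"     # hypothetical metric prefix
        action: drop

  # Practice 8: federate pre-aggregated series from another Prometheus
  - job_name: "federate"
    honor_labels: true              # keep the labels set by the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-shard-1:9090"]   # hypothetical downstream server
```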

## Troubleshooting

**Check scrape targets:**

curl http://localhost:9090/api/v1/targets


**Check configuration:**

curl http://localhost:9090/api/v1/status/config


**Test query:**

curl 'http://localhost:9090/api/v1/query?query=up'


## Reference Files

- assets/prometheus.yml.template - Complete configuration template
- references/scrape-configs.md - Scrape configuration patterns
- references/recording-rules.md - Recording rule examples
- scripts/validate-prometheus.sh - Validation script

## Related Skills

- grafana-dashboards - For visualization
- slo-implementation - For SLO monitoring
- distributed-tracing - For request tracing

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/prometheus-configuration/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/wshobson/agents/prometheus-configuration/SKILL.md
  • Cursor: ~/.cursor/skills/wshobson/agents/prometheus-configuration/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/prometheus-configuration/SKILL.md

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

To use this skill in Claude Code, copy the rule content into your project's custom instructions or follow our Add-Skill CLI guide. This ensures Claude follows your standards during every code generation.

For Cursor & Windsurf

For Cursor or Windsurf, individual skills are best used in the "Rules for AI" section. This unit helps the agent avoid common DevOps & CI/CD configuration mistakes.

Why the skill format matters: the standardized Agent Skills format lets your AI agent load detailed instructions only when they are relevant, keeping your prompt clean while improving results.

Source & attribution

This skill is categorized under DevOps & CI/CD and is published by W. Shobson, maintained in wshobson/agents.
