prometheus-configuration

Name: prometheus-configuration
Author: W. Shobson

Prometheus, monitoring, metrics, alerting, devops, kubernetes, grafana, observability

⭐ 36.8k · 📄 MIT · 🕒 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

The prometheus-configuration skill facilitates the automated management of time-series monitoring environments. It streamlines the creation of prometheus.yml files, defines scrape intervals, and manages target discovery mechanisms. This skill maps how to connect Prometheus to infrastructure and application metrics endpoints. It handles the syntax for static host definitions, file-based service discovery, and complex Kubernetes label-relabeling logic. By implementing this configuration, you enable the server to poll targets for performance data, route alerts to Alertmanager, and integrate with external visualization platforms like Grafana. The skill focuses on the structural requirements of the Prometheus monitoring stack, ensuring consistent data ingestion from containerized services, node exporters, and custom application endpoints while maintaining retention settings for long-term storage requirements in production deployments.

When to Use This Skill

•Setting up a new monitoring stack for a Kubernetes cluster
•Integrating application-specific metrics from custom HTTP endpoints
•Centralizing infrastructure monitoring from multiple Node Exporters
•Scaling scrape targets dynamically via file-based service discovery

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

“Generate a prometheus.yml for my kubernetes pods”
“How do I configure file-based service discovery in Prometheus?”
“Add a scrape job for node-exporter to my prometheus configuration”
“Help me write relabeling rules for my k8s service annotations”
“What is the syntax for adding TLS to a prometheus scrape target?”

Pro Tips

💡Prioritize Service Discovery: Instead of manual scrape configurations, leverage Prometheus's service discovery mechanisms (e.g., Kubernetes SD, EC2 SD) to dynamically discover targets and reduce configuration overhead.
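💡Optimize Storage and Retention: Plan storage allocation and retention policies together, because local disk usage grows with both retention time and the number of active series. For long-term storage, use a system such as Thanos or Cortex; Thanos can also downsample older data to balance cost against historical detail.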
💡Refine Alerting: Design your alert rules with clear thresholds and effective notification channels. Use `for` clauses to prevent flapping alerts and ensure alerts are actionable.
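
As an illustration of the service discovery tip, here is a minimal sketch of an EC2-based scrape job. The job name, region, tag filter, and target label names are assumptions chosen for the example; credentials are assumed to come from the default AWS credential chain.

```yaml
scrape_configs:
  - job_name: "ec2-node-exporter"        # hypothetical job name
    ec2_sd_configs:
      - region: "us-west-2"              # assumed region
        port: 9100                       # node exporter port on each instance
        refresh_interval: 5m
        filters:
          # Only discover instances tagged Environment=production (assumed tag)
          - name: "tag:Environment"
            values: ["production"]
    relabel_configs:
      # Use the EC2 Name tag as the instance label instead of ip:port
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_availability_zone]
        target_label: availability_zone
```

New instances that match the filter should appear as targets after the next refresh, with no edit to prometheus.yml.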

What this skill does

•Generates prometheus.yml files with global scrape and evaluation intervals
•Configures static and dynamic target discovery using file-based or Kubernetes metadata
•Defines relabeling rules for standardizing incoming metric labels
•Sets up Alertmanager integrations and rule file loading
•Configures secure communication via TLS and client certificate authentication

When not to use it

✕When you require long-term storage for historical metrics exceeding local disk capacity
✕When you need push-based metrics delivery rather than pull-based scraping
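
When these limits apply only partly, Prometheus itself offers two escape hatches that may be enough. The sketch below assumes a remote-write compatible store at a placeholder URL and a Pushgateway reachable as `pushgateway:9091`; both names are hypothetical.

```yaml
# Forward samples to external long-term storage (placeholder endpoint)
remote_write:
  - url: "https://metrics-store.example.com/api/v1/write"

scrape_configs:
  # Short-lived jobs push to a Pushgateway; Prometheus still pulls from it
  - job_name: "pushgateway"
    honor_labels: true   # keep the job/instance labels set by the pushing client
    static_configs:
      - targets: ["pushgateway:9091"]
```

The Pushgateway is intended for service-level batch jobs rather than as a general push pipeline, so a fully push-based design is still outside the scope of this skill.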

Example workflow

1. Identify target endpoints and their required scraping frequency
2. Determine the appropriate service discovery method (static, file-based, or Kubernetes)
3. Draft the scrape_configs block including necessary relabel_configs for metadata filtering
4. Validate the generated configuration file syntax against the Prometheus schema
5. Apply the configuration via Helm chart update or deployment volume reload
6. Verify target health status in the Prometheus expression browser UI

Prerequisites

–Running Prometheus server instance
–Access to the Prometheus configuration directory
–Target endpoints with functioning /metrics endpoints

Pitfalls & limitations

!Incorrect relabeling regex patterns can drop all metrics or cause duplicate target errors
!Excessively frequent scrape intervals can significantly increase CPU and memory usage on the Prometheus server
!Missing TLS files on the server side will cause scrape failures for HTTPS targets
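
The first two pitfalls can be made concrete with a small sketch. Relabeling regexes are fully anchored, so a pattern must match the whole label value. The `env` label and its value are hypothetical, and the per-job interval is only an example.

```yaml
scrape_configs:
  - job_name: "node-exporter"
    scrape_interval: 30s          # per-job override; avoid very short intervals
    static_configs:
      - targets: ["node1:9100"]
        labels:
          env: "prod-eu"          # hypothetical label value
    relabel_configs:
      # regex "prod" would match only the exact value "prod", so this keep
      # rule would drop the env="prod-eu" target. The pattern below also
      # accepts suffixed values such as "prod-eu".
      - source_labels: [env]
        action: keep
        regex: "prod(-.+)?"
```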

FAQ

How do I refresh Prometheus configuration without restarting?

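You can trigger a hot reload by sending a SIGHUP signal to the Prometheus process or by sending an HTTP POST request to the /-/reload endpoint. The endpoint is available only when Prometheus is started with the --web.enable-lifecycle flag.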
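
The reload endpoint is disabled by default. A minimal Docker Compose sketch that turns it on might look like this, assuming the same image and paths used in the Docker Compose section of the full instructions:

```yaml
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      # Enables the /-/reload and /-/quit HTTP endpoints
      - "--web.enable-lifecycle"
```

With this in place, `curl -X POST http://localhost:9090/-/reload` asks the server to re-read its configuration. Because the endpoint is unauthenticated, it is usually exposed only on trusted networks.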

What is the difference between static_configs and file_sd_configs?

Static configs use hardcoded IP or DNS lists in the YAML file, whereas file_sd_configs allows Prometheus to watch external files for target changes without requiring a config reload.

Why is my Kubernetes pod not being scraped?

Check your relabel_configs to ensure the scraping annotations match the keys configured in your Prometheus scrape job.
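
For reference, a pod that the annotation-based `kubernetes-pods` job would keep could look like the sketch below. The pod name, image, and port are hypothetical; the annotation keys are the ones that map to the `__meta_kubernetes_pod_annotation_prometheus_io_*` labels used in the relabeling rules.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app                      # hypothetical pod name
  namespace: default
  annotations:
    # Annotation values must be strings, hence the quotes
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
spec:
  containers:
    - name: my-app
      image: example.com/my-app:1.0.0   # placeholder image
      ports:
        - containerPort: 8080
```

If the annotations are set on the Deployment rather than on its pod template, the pods will not carry them and the `keep` rule will drop the targets.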

How it compares

This skill automates complex YAML syntax and relabeling logic that is error-prone when written manually, ensuring valid Prometheus configuration blocks that follow best practices.

Source & trust


📄 Full skill instructions — original source: wshobson/agents

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)

## Installation

### Kubernetes with Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
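
Retention and storage settings are often easier to review in a values file than in `--set` flags. The sketch below assumes a file named `values.yaml` (a hypothetical name) passed with `-f values.yaml`; the access mode is also an assumption.

```yaml
# values.yaml for the kube-prometheus-stack chart
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]   # assumed access mode
          resources:
            requests:
              storage: 50Gi
```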

### Docker Compose

version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:

## Configuration File

**prometheus.yml:**

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key

**Reference:** See assets/prometheus.yml.template

## Scrape Configurations

### Static Targets

scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"

### File-based Service Discovery

scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m

**targets/production.json:**

[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
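
The job above also globs `*.yml`, so the same targets can be written in YAML. This is a sketch of an equivalent file; the file name is hypothetical.

```yaml
# /etc/prometheus/targets/production.yml (hypothetical file name)
- targets:
    - "app1:9090"
    - "app2:9090"
  labels:
    env: "production"
    service: "api"
```

Prometheus watches these files for changes, so edits are picked up without reloading the main configuration.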

### Kubernetes Service Discovery

scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

**Reference:** See references/scrape-configs.md
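
With `role: service`, Prometheus discovers one target per service port at the service address, which is mostly useful for blackbox-style probing. To scrape every pod behind an annotated service, a job using `role: endpoints` is a common alternative. The following is a sketch under the same `prometheus.io/*` annotation convention; the job name and the `service` target label are assumptions.

```yaml
scrape_configs:
  - job_name: "kubernetes-service-endpoints"   # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      # Keep only endpoints whose service opts in via annotation
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Rewrite the target port to the one named in the annotation
      - source_labels:
          [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service
```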

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # 5xx error request rate per job
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Error rate as a percentage of all requests
      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)

**Reference:** See references/recording-rules.md

## Alert Rules

# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"

## Validation

# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'

**Reference:** See scripts/validate-prometheus.sh
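
Syntax checks do not show whether an alert fires when expected. `promtool test rules` accepts a YAML test file for that purpose. The sketch below exercises the `ServiceDown` alert from the Alert Rules section; the test file name and the input series values are assumptions.

```yaml
# alert_rules_test.yml (hypothetical file name)
rule_files:
  - /etc/prometheus/rules/alert_rules.yml

evaluation_interval: 30s

tests:
  - interval: 1m
    input_series:
      # The target reports up == 0 at every sample
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # Two minutes in, the 1m "for" clause has elapsed
      - eval_time: 2m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Run it with `promtool test rules alert_rules_test.yml`. The test fails if the labels or annotations produced by the rule differ from the expected ones.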

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**
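
Items 6 and 8 can be sketched in configuration form. The target host, the dropped metric pattern, and the dropped label are hypothetical examples. The jobs are shown standalone and would be merged into an existing `scrape_configs` list.

```yaml
scrape_configs:
  # Item 8: a global server pulls pre-aggregated series from another Prometheus
  - job_name: "federate"
    honor_labels: true              # keep the labels of the source server
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'    # only recording-rule outputs at job level
    static_configs:
      - targets: ["prometheus-cluster-a:9090"]   # hypothetical source server

  # Item 6: clean up series after the scrape, before they are stored
  - job_name: "my-app-cleanup-example"
    static_configs:
      - targets: ["app1.example.com:9090"]
    metric_relabel_configs:
      # Drop a family of metrics that is not needed (assumed pattern)
      - source_labels: [__name__]
        action: drop
        regex: "go_gc_.*"
      # Remove a high-churn label from all remaining series (assumed label)
      - action: labeldrop
        regex: "pod_template_hash"
```

`relabel_configs` act on targets before the scrape, while `metric_relabel_configs` act on the scraped samples, so metric cleanup belongs in the latter.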

## Troubleshooting

**Check scrape targets:**

curl http://localhost:9090/api/v1/targets

**Check configuration:**

curl http://localhost:9090/api/v1/status/config

**Test query:**

curl 'http://localhost:9090/api/v1/query?query=up'

## Reference Files

- assets/prometheus.yml.template - Complete configuration template
- references/scrape-configs.md - Scrape configuration patterns
- references/recording-rules.md - Recording rule examples
- scripts/validate-prometheus.sh - Validation script

## Related Skills

- grafana-dashboards - For visualization
- slo-implementation - For SLO monitoring
- distributed-tracing - For request tracing

By W. Shobson

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

1. Click "Download" above
2. In your project, create the directory: .agent/skills/prometheus-configuration/
3. Save the file as SKILL.md
4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/wshobson/agents/prometheus-configuration/SKILL.md
Cursor: ~/.cursor/skills/wshobson/agents/prometheus-configuration/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/prometheus-configuration/SKILL.md

🚀 Install with CLI:
npx skills add wshobson/agents
