service-mesh-observability

Name: service-mesh-observability
Author: W. Shobson

service meshobservabilityistiolinkerddistributed tracingmetricsmonitoringslodevops

⭐ 36.8k📄 MIT🕒 2026-06-16Source ↗

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

Service mesh observability focuses on surfacing traffic patterns, internal latency, and failure points within microservice architectures. This skill manages the integration of telemetry collectors like Prometheus, Jaeger, and Linkerd-viz to monitor service communication. By tracking the four golden signals—latency, traffic, errors, and saturation—it enables agents to interpret mesh-specific metrics such as request rates and success percentages. It facilitates the diagnosis of network-level bottlenecks and inter-service dependencies. The skill translates raw proxy data into actionable insights, providing agents with the logic to build dashboard queries, configure tracing spans, and perform live traffic inspections. It is essential for maintaining visibility across sidecar-injected workloads where standard application logs fail to capture the complexity of the underlying service mesh topology.

When to Use This Skill

•Identifying the specific service responsible for cascading 5xx errors
•Detecting latency spikes in cross-cluster gRPC communications
•Visualizing dependency bottlenecks using traffic topology maps
•Verifying canary deployment health through request success rates

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

“Show me the error rate for the checkout service
“Visualize dependencies for my production namespace
“Configure Jaeger tracing for my Istio mesh
“Isolate the source of p99 latency spikes in the mesh
“Generate a Grafana dashboard for Istio traffic

Pro Tips

💡Always start by establishing the 'Golden Signals' (latency, errors, requests, saturation) as your core monitoring targets.
💡Ensure distributed tracing is fully integrated end-to-end to accurately follow requests across all services, even outside the mesh.
💡Leverage powerful visualization tools like Grafana with Prometheus to build insightful dashboards that correlate metrics, traces, and logs.

What this skill does

•Querying Istio and Linkerd telemetry using PromQL
•Configuring distributed tracing collectors for span analysis
•Visualizing service dependencies and traffic flow
•Defining and monitoring service-level objective thresholds
•Inspecting live proxy traffic and request routing

When not to use it

✕Debugging code-level application logic inside a binary
✕Monitoring bare-metal servers lacking sidecar proxies
✕When standard logs provide sufficient resolution without mesh overhead

Example workflow

Confirm the mesh provider is running and reachable
Install Prometheus or Linkerd-viz as the telemetry backend
Define ServiceMonitor resources to scrape proxy metrics
Apply PromQL queries to identify high-latency endpoints
Run tap or tracing commands to isolate faulty requests

Prerequisites

–Operational service mesh (Istio or Linkerd)
–Kubernetes cluster with admin access
–Prometheus or similar metrics engine

Pitfalls & limitations

!Over-sampling traces significantly increases network overhead
!High-cardinality metrics can exhaust memory in Prometheus
!Reliance on destination labels requires consistent naming conventions

FAQ

Should I use 100% sampling for Jaeger traces?

No. High sampling rates in production can lead to significant network congestion and storage costs. Start with 1-5% for production and use higher rates only for specific debug sessions.

Why do my latency metrics look high?

Latency metrics in a mesh include both network transit time and proxy processing time. Ensure you are measuring from the destination side to account for actual processing duration.

Can I use this for non-Kubernetes services?

Most service mesh observability tooling is tightly coupled with Kubernetes sidecars. Non-Kubernetes workloads usually require alternative agent-based monitoring methods.

How it compares

This skill automates the complex configuration of telemetry stacks, whereas manual methods require manual YAML maintenance and ad-hoc query writing which are prone to syntax errors.

Source & trust

⭐ 37k stars📄 MIT🕒 Updated 2026-06-16

View original skill on GitHub →

📄 Full skill instructions — original source: wshobson/agents

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐
│                  Observability                       │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘

### 2. Golden Signals for Mesh

| Signal | Description | Alert Threshold |
| -------------- | ------------------------- | ----------------- |
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |

## Templates

### Template 1: Istio with Prometheus & Grafana

# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s

### Template 2: Key Istio Metrics Queries

# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

### Template 3: Jaeger Distributed Tracing

# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # UDP
            - containerPort: 6831 # Thrift
            - containerPort: 6832 # Thrift
            - containerPort: 5778 # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411 # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"

### Template 4: Linkerd Viz Dashboard

# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace

### Template 5: Grafana Dashboard JSON

{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}

### Template 6: Kiali Service Mesh Visualization

# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000

### Template 7: OpenTelemetry Integration

# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10

## Alerting Rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"

## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers

### Don'ts

- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs

## Resources

- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)

By W. Shobson

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Click "Download" above
In your project, create the directory: .agent/skills/service-mesh-observability/
Save the file as SKILL.md
The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

Claude Code: ~/.claude/skills/wshobson/agents/service-mesh-observability/SKILL.md
Cursor: ~/.cursor/skills/wshobson/agents/service-mesh-observability/SKILL.md
Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/service-mesh-observability/SKILL.md

🚀 Install with CLI:
npx skills add wshobson/agents

Read the Master Guide: Mastering Agent Skills →

Recommended Rules

View more rules →

Recommended Workflows

View more workflows →

Check SSL Certificates

SecurityDevOpsSSL

--- description: Verify SSL certificate validity and expiration --- 1. **Check Expiry**: - Use openssl to check a domain. Replace `google.com` wit...

Implement Feature Flags

Feature FlagsDeploymentA/B Testing

--- description: Safely release features with toggles for gradual rollouts --- 1. **Simple Approach: Environment Variables**: - Use env vars for b...

Implement Blue-Green Deployment

DeploymentDevOpsZero-Downtime

--- description: Zero-downtime deploys --- 1. **Setup Two Environments**: - Blue: Current (v1.0) - Green: New (v1.1) 2. **Route Traffic Gradua...

Recommended MCP Servers

View more MCP servers →

VictoriaMetrics

Official

Comprehensive integration with [VictoriaMetrics APIs](https://docs.victoriametrics.com/victoriametrics/url-examples/) and [documentation](https://docs.victoriametrics.com/) for monitoring, observability, and debugging tasks related to your VictoriaMetrics instances.

Dynatrace

Official

Manage and interact with the [Dynatrace Platform ](https://www.dynatrace.com/platform) for real-time observability and monitoring.

Netdata

Official

Discovery, exploration, reporting and root cause analysis using all observability data, including metrics, logs, systems, containers, processes, and network connections

Take It Further

Maximize your productivity with these powerful resources

📋

Define Your Standards

Set up coding standards to ensure this workflow produces consistent, high-quality results.

Browse Rules Library

📖

Master Workflows

Learn how to create custom workflows, use Turbo Mode, and build your automation library.

Complete Guide

service-mesh-observability

Install this skill

When to Use This Skill

How to Invoke This Skill

Pro Tips

What this skill does

When not to use it

Example workflow

Prerequisites

Pitfalls & limitations

FAQ

How it compares

Source & trust

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

Option B: Global Installation (All Agents)

Recommended Rules

Monitoring & Observability (Prometheus, Grafana)

🚀 DevOps & CI/CD Agent - Pipeline Expert

Python DevOps & Infrastructure Automation

Recommended Workflows

Check SSL Certificates

Implement Feature Flags

Implement Blue-Green Deployment

Recommended MCP Servers

VictoriaMetrics

Dynatrace

Netdata

Take It Further

Define Your Standards

Master Workflows

How to use this Skill in Claude Code & Cursor

For Claude Code (CLI)

For Cursor & Windsurf

Source & attribution