service-mesh-observability
Install this skill
npx skills add wshobson/agents
Works across Claude Code, Cursor, Codex, Copilot & Antigravity
Service mesh observability focuses on surfacing traffic patterns, internal latency, and failure points within microservice architectures. This skill manages the integration of telemetry collectors like Prometheus, Jaeger, and Linkerd-viz to monitor service communication. By tracking the four golden signals (latency, traffic, errors, and saturation), it enables agents to interpret mesh-specific metrics such as request rates and success percentages. It facilitates the diagnosis of network-level bottlenecks and inter-service dependencies. The skill translates raw proxy data into actionable insights, providing agents with the logic to build dashboard queries, configure tracing spans, and perform live traffic inspections. It is essential for maintaining visibility across sidecar-injected workloads where standard application logs fail to capture the complexity of the underlying service mesh topology.
When to Use This Skill
- Identifying the specific service responsible for cascading 5xx errors
- Detecting latency spikes in cross-cluster gRPC communications
- Visualizing dependency bottlenecks using traffic topology maps
- Verifying canary deployment health through request success rates
How to Invoke This Skill
Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:
- "Show me the error rate for the checkout service"
- "Visualize dependencies for my production namespace"
- "Configure Jaeger tracing for my Istio mesh"
- "Isolate the source of p99 latency spikes in the mesh"
- "Generate a Grafana dashboard for Istio traffic"
Pro Tips
- Start by establishing the golden signals (latency, traffic, errors, saturation) as your core monitoring targets.
- Integrate distributed tracing end-to-end so requests can be followed across all services, including those outside the mesh.
- Use Grafana with Prometheus to build dashboards that correlate metrics, traces, and logs.
What this skill does
- Querying Istio and Linkerd telemetry using PromQL
- Configuring distributed tracing collectors for span analysis
- Visualizing service dependencies and traffic flow
- Defining and monitoring service-level objective thresholds
- Inspecting live proxy traffic and request routing
When not to use it
- Debugging code-level application logic inside a binary
- Monitoring bare-metal servers lacking sidecar proxies
- When standard logs provide sufficient resolution without mesh overhead
Example workflow
- Confirm the mesh provider is running and reachable
- Install Prometheus or Linkerd-viz as the telemetry backend
- Define ServiceMonitor resources to scrape proxy metrics
- Apply PromQL queries to identify high-latency endpoints
- Run tap or tracing commands to isolate faulty requests
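The query step of this workflow can be sketched in Python against the Prometheus HTTP API. This is a minimal sketch, not the skill's own implementation: the function names `build_query_url`, `slow_services`, and `fetch` are hypothetical, the PromQL string is the P99 latency query from Template 2, and the 500 ms threshold is only an example value.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# P99 latency per destination service (same query as Template 2).
P99_QUERY = (
    "histogram_quantile(0.99, "
    'sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) '
    "by (le, destination_service_name))"
)


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def slow_services(payload: dict, threshold_ms: float) -> list[tuple[str, float]]:
    """Return (service, p99_ms) pairs above the threshold, slowest first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        service = series["metric"].get("destination_service_name", "unknown")
        value = float(series["value"][1])  # each value is [timestamp, "number as string"]
        if value == value and value > threshold_ms:  # value == value filters out NaN
            rows.append((service, value))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run the query against a reachable Prometheus server (hypothetical helper)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as response:
        return json.load(response)
```

With a reachable server, `slow_services(fetch("http://prometheus.istio-system:9090", P99_QUERY), 500)` would list the services to inspect next with tap or tracing. Parsing is kept separate from the network call so it can be checked against a saved response.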
Prerequisites
- Operational service mesh (Istio or Linkerd)
- Kubernetes cluster with admin access
- Prometheus or similar metrics engine
Pitfalls & limitations
- Over-sampling traces significantly increases network overhead
- High-cardinality metrics can exhaust memory in Prometheus
- Reliance on destination labels requires consistent naming conventions
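To make the cardinality pitfall concrete, here is a rough worst-case estimate of how many time series one metric can produce. It is a sketch under stated assumptions: `estimate_series` is a hypothetical helper, the label counts are invented for illustration, and real workloads rarely hit every label combination.

```python
from math import prod


def estimate_series(label_values: dict[str, int], series_per_label_set: int = 1) -> int:
    """Worst-case series count: every label combination times series per combination.

    series_per_label_set is 1 for a counter or gauge; for a histogram it is the
    number of buckets plus the _sum and _count series.
    """
    return prod(label_values.values()) * series_per_label_set


# Hypothetical mesh: 20 source workloads, 20 destination services, 5 response codes.
labels = {"source_workload": 20, "destination_service_name": 20, "response_code": 5}

counter_series = estimate_series(labels)
# Assumed histogram layout: 20 buckets + _sum + _count = 22 series per label set.
histogram_series = estimate_series(labels, series_per_label_set=22)
# One unbounded label (here a made-up user_id) multiplies everything.
with_user_id = estimate_series({**labels, "user_id": 10_000})
```

The multiplication is the point: a single unbounded label turns a few thousand series into tens of millions, which is why label values should stay bounded.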
How it compares
This skill automates the configuration of telemetry stacks. The manual alternative means maintaining YAML by hand and writing ad-hoc queries, both of which are prone to syntax errors.
Full skill instructions (original source: wshobson/agents)
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
## When to Use This Skill
- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity
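For the SLO use case, the usual starting point is an error budget: the share of requests allowed to fail under the target. The sketch below assumes an availability-style SLO measured on request counts. The function name and the sample numbers are illustrative, not part of the skill.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    slo_target is a success ratio such as 0.99 for a 99% SLO.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic means nothing has been spent
    allowed_failures = (1 - slo_target) * total
    return 1 - failed / allowed_failures


# Hypothetical window: 100,000 requests with 250 failures against a 99% SLO.
remaining = error_budget_remaining(0.99, total=100_000, failed=250)
```

In this example 1,000 failures are allowed and 250 have occurred, so about three quarters of the budget remains. The counts would typically come from `istio_requests_total` summed over the SLO window.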
## Core Concepts
### 1. Three Pillars of Observability
| Metrics | Traces | Logs |
| -------------- | -------------- | --------------- |
| Request rate | Span context | Access logs |
| Error rate | Latency | Error details |
| Latency P50 | Dependencies | Debug info |
| Saturation | Bottlenecks | Audit trail |
### 2. Golden Signals for Mesh
| Signal | Description | Alert Threshold |
| -------------- | ------------------------- | ----------------- |
| **Latency** | Request duration P50, P99 | P99 > 500ms |
| **Traffic** | Requests per second | Anomaly detection |
| **Errors** | 5xx error rate | > 1% |
| **Saturation** | Resource utilization | > 80% |
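The thresholds in the table can be read as a simple check over one snapshot of mesh metrics. The sketch below is illustrative only: `MeshSnapshot` and `breached_signals` are hypothetical names, and traffic is left out because the table assigns it anomaly detection instead of a fixed threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshSnapshot:
    """One observation window for a single service (hypothetical shape)."""
    p99_latency_ms: float
    requests_total: float
    requests_5xx: float
    saturation_pct: float


def breached_signals(snapshot: MeshSnapshot) -> list[str]:
    """Return the golden signals that exceed the thresholds from the table."""
    breaches = []
    if snapshot.p99_latency_ms > 500:  # Latency: P99 > 500ms
        breaches.append("latency")
    if snapshot.requests_total > 0:
        error_pct = 100 * snapshot.requests_5xx / snapshot.requests_total
    else:
        error_pct = 0.0  # no traffic, so no measurable error rate
    if error_pct > 1:  # Errors: 5xx rate > 1%
        breaches.append("errors")
    if snapshot.saturation_pct > 80:  # Saturation: > 80%
        breaches.append("saturation")
    return breaches
```

For example, `breached_signals(MeshSnapshot(620, 1000, 5, 40))` flags only latency, since 5 failures in 1,000 requests is 0.5% and stays under the error threshold.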
## Templates
### Template 1: Istio with Prometheus & Grafana
```yaml
# Prometheus scrape configuration for the mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      # Sidecar proxies serve the istio_* request metrics on their
      # *-envoy-prom container port. The istio-telemetry service belonged
      # to Mixer and is not present on current Istio releases.
      - job_name: 'envoy-stats'
        metrics_path: /stats/prometheus
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_container_port_name]
            action: keep
            regex: '.*-envoy-prom'
---
# ServiceMonitor for Prometheus Operator (scrapes the istiod control plane)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```
### Template 2: Key Istio Metrics Queries
```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx), as a percentage
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
### Template 3: Jaeger Distributed Tracing
```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          # Assumes a Service named jaeger-collector exposes port 9411
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775 # UDP
              protocol: UDP
            - containerPort: 6831 # Thrift compact (UDP)
              protocol: UDP
            - containerPort: 6832 # Thrift binary (UDP)
              protocol: UDP
            - containerPort: 5778 # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411 # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```
### Template 4: Linkerd Viz Dashboard
```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```
### Template 5: Grafana Dashboard JSON
```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```
### Template 6: Kiali Service Mesh Visualization
```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```
### Template 7: OpenTelemetry Integration
```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      # The jaeger exporter ships only with older Collector releases;
      # newer ones send traces to Jaeger through the otlp exporter.
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel # must match an extensionProviders entry in meshConfig
      randomSamplingPercentage: 10
```
## Alerting Rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```
## Best Practices
### Do's
- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers
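Propagating trace context is the item that most often needs application code: the sidecar cannot link an inbound request to the outbound calls it triggers unless the service copies the tracing headers across. The sketch below shows one way to do that. The function name is hypothetical, and the header list covers the common B3 and W3C Trace Context names, so which ones apply depends on the tracer your mesh is configured with.

```python
# Header names are matched case-insensitively and emitted in lowercase.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
    "traceparent",
    "tracestate",
)


def propagate_trace_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Pick the tracing headers from an inbound request for reuse on outbound calls."""
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


# Example inbound headers (values are made up).
inbound = {
    "X-Request-Id": "abc-123",
    "X-B3-TraceId": "463ac35c9f6413ad",
    "Cookie": "session=redacted",
}
outbound = propagate_trace_headers(inbound)
```

The returned dictionary can be merged into the headers of every downstream HTTP call made while handling the request. Unrelated headers such as cookies are deliberately not forwarded.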
### Don'ts
- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
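The cost of over-sampling is easy to estimate before changing a sampling rate. The following back-of-the-envelope sketch assumes a uniform request rate and a fixed average span size. The function name and every input value are hypothetical, so substitute numbers measured from your own mesh.

```python
def daily_trace_storage_gb(
    requests_per_second: float,
    sampling_pct: float,
    spans_per_trace: float,
    bytes_per_span: float,
) -> float:
    """Estimate trace data written per day, in gigabytes (10^9 bytes)."""
    traces_per_day = requests_per_second * 86_400 * sampling_pct / 100
    return traces_per_day * spans_per_trace * bytes_per_span / 1e9


# Hypothetical mesh: 1,000 req/s, 8 spans per trace, 500 bytes per span.
full_sampling = daily_trace_storage_gb(1000, 100, 8, 500)
one_percent = daily_trace_storage_gb(1000, 1, 8, 500)
```

Under these assumptions, sampling everything writes about 345.6 GB per day, while 1% sampling writes about 3.5 GB, in line with the 1-10% production guidance above.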
## Resources
- [Istio Observability](https://istio.io/latest/docs/tasks/observability/)
- [Linkerd Observability](https://linkerd.io/2.14/features/dashboard/)
- [OpenTelemetry](https://opentelemetry.io/)
- [Kiali](https://kiali.io/)
How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory: `.agent/skills/service-mesh-observability/`
- Save the file as `SKILL.md`
- The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code: `~/.claude/skills/wshobson/agents/service-mesh-observability/SKILL.md`
- Cursor: `~/.cursor/skills/wshobson/agents/service-mesh-observability/SKILL.md`
- Antigravity: `~/.gemini/antigravity/skills/wshobson/agents/service-mesh-observability/SKILL.md`