distributed-tracing
Install this skill
npx skills add wshobson/agents
Works across Claude Code, Cursor, Codex, Copilot & Antigravity
Distributed tracing provides end-to-end visibility into a request's lifecycle across a microservice architecture. Using OpenTelemetry instrumentation, this skill records each transaction as it passes through services, gateways, and databases. Every trace is made up of spans, and each span captures metadata, execution duration, and any error state for one operation. Seeing the whole execution path as a single ordered sequence makes it much easier to pin down the source of latency spikes or failures that cross service boundaries. With a tracing backend such as Jaeger or Tempo, operators can map service dependencies and find bottlenecks that local logs or metrics alone do not reveal.
When to Use This Skill
- Isolating services responsible for slow API response times
- Debugging error propagation during cascading service failures
- Mapping complex inter-service communication patterns
- Verifying that database queries align with expected request flows
How to Invoke This Skill
Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:
- trace the request path for this endpoint
- find why this microservice is timing out
- show me the dependency graph for our services
- instrument this flask app for distributed tracing
- how is this request failing across services
Pro Tips
- Ensure consistent context propagation (e.g., HTTP headers) across all services, even non-instrumented ones, to maintain trace integrity.
- Use meaningful tags and structured logs within spans to enrich trace data, making filtering and debugging more effective.
- Regularly review your distributed traces to identify common patterns, recurring bottlenecks, or unexpected service dependencies for proactive optimization.
What this skill does
- Visualize request paths across multiple microservices
- Measure duration of individual operations within a request
- Propagate trace context metadata through service calls
- Identify specific nodes causing latency or errors
- Map service dependency topology
When not to use it
- Monolithic applications with simple execution paths
- Low-traffic services where overhead outweighs diagnostic value
- Environments where security policies prohibit external metadata propagation
Example workflow
- Deploy a Jaeger collector into the observability namespace
- Configure OpenTelemetry providers in the target microservice
- Instrument specific methods and network calls with spans
- Generate traffic to trigger service-to-service communication
- Access the Jaeger dashboard to review the generated traces
Prerequisites
- A microservices architecture
- OpenTelemetry SDKs for your language
- A tracing backend such as Jaeger or Tempo
Pitfalls & limitations
- High sampling rates can consume significant storage and network bandwidth
- Incomplete instrumentation in one service will result in fragmented trace views
- Adding too much metadata to spans may degrade service performance
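To make the first pitfall concrete, here is a back-of-the-envelope estimate of trace storage per day. This is only a sketch: the function name and every input figure (1,000 requests per second, 10 spans per trace, 500 bytes per span) are hypothetical, so substitute measurements from your own system.

```python
def daily_trace_storage_gb(requests_per_second, spans_per_trace,
                           bytes_per_span, sampling_rate):
    """Rough daily storage for sampled traces, in gigabytes (10**9 bytes)."""
    seconds_per_day = 86_400
    sampled_traces = requests_per_second * seconds_per_day * sampling_rate
    return sampled_traces * spans_per_trace * bytes_per_span / 1e9


# Hypothetical workload: 1,000 req/s, 10 spans per trace, 500 bytes per span.
full = daily_trace_storage_gb(1_000, 10, 500, sampling_rate=1.0)      # 432 GB/day
sampled = daily_trace_storage_gb(1_000, 10, 500, sampling_rate=0.01)  # about 4.32 GB/day
```

Under these assumptions, moving from 100% to 1% sampling cuts storage by two orders of magnitude.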
FAQ
How it compares
Unlike manual logging, which produces disjointed events, distributed tracing creates a unified, queryable timeline that links related actions across servers and processes.
Full skill instructions (original source: wshobson/agents)
Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.
## Purpose
Track requests across distributed systems to understand latency, dependencies, and failure points.
## When to Use
- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths
## Distributed Tracing Concepts
### Trace Structure
    Trace (Request ID: abc123)
      ↓
    Span (frontend) [100ms]
      ↓
    Span (api-gateway) [80ms]
      ├→ Span (auth-service) [10ms]
      └→ Span (user-service) [60ms]
          └→ Span (database) [40ms]

### Key Components
- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span
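These components can be pictured as a small data model. The sketch below is not the OpenTelemetry API; the `Span` dataclass, its field names, and `self_time_ms` are hypothetical, and the data reproduces the example trace from the Trace Structure diagram. It also shows a typical first analysis step: subtracting child durations to see where time is actually spent.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Illustrative span record (field names are assumptions, not a library API)."""
    trace_id: str              # shared by every span in one trace
    span_id: str
    parent_id: Optional[str]   # None marks the root span
    name: str
    duration_ms: int
    tags: dict = field(default_factory=dict)   # key-value pairs for filtering
    logs: list = field(default_factory=list)   # timestamped events


# The example trace from the diagram.
spans = [
    Span("abc123", "s1", None, "frontend", 100),
    Span("abc123", "s2", "s1", "api-gateway", 80),
    Span("abc123", "s3", "s2", "auth-service", 10),
    Span("abc123", "s4", "s2", "user-service", 60),
    Span("abc123", "s5", "s4", "database", 40),
]


def self_time_ms(span, all_spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run one after another without overlapping.
    """
    children = sum(s.duration_ms for s in all_spans if s.parent_id == span.span_id)
    return span.duration_ms - children


self_times = {s.name: self_time_ms(s, spans) for s in spans}
```

Here the database span accounts for 40 of the 100 ms, which makes it the first place to look.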
## Jaeger Setup
### Kubernetes Deployment
    # Deploy Jaeger Operator
    kubectl create namespace observability
    kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

    # Deploy Jaeger instance
    kubectl apply -f - <<EOF
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
      namespace: observability
    spec:
      strategy: production
      storage:
        type: elasticsearch
        options:
          es:
            server-urls: http://elasticsearch:9200
      ingress:
        enabled: true
    EOF

### Docker Compose
    version: "3.8"
    services:
      jaeger:
        image: jaegertracing/all-in-one:latest
        ports:
          - "5775:5775/udp"
          - "6831:6831/udp"
          - "6832:6832/udp"
          - "5778:5778"
          - "16686:16686" # UI
          - "14268:14268" # Collector
          - "14250:14250" # gRPC
          - "9411:9411" # Zipkin
        environment:
          - COLLECTOR_ZIPKIN_HOST_PORT=:9411

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation
### OpenTelemetry (Recommended)
#### Python (Flask)
    from opentelemetry import trace
    # Note: the Jaeger Thrift exporter is deprecated upstream. Recent Jaeger
    # versions also accept OTLP, which OpenTelemetry recommends for new setups.
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from flask import Flask

    # Initialize tracer
    resource = Resource(attributes={SERVICE_NAME: "my-service"})
    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    ))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    # Instrument Flask
    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)

    @app.route('/api/users')
    def get_users():
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("get_users") as span:
            span.set_attribute("user.count", 100)

            # Business logic
            users = fetch_users_from_db()

            return {"users": users}

    def fetch_users_from_db():
        tracer = trace.get_tracer(__name__)
        with tracer.start_as_current_span("database_query") as span:
            span.set_attribute("db.system", "postgresql")
            span.set_attribute("db.statement", "SELECT * FROM users")

            # Database query (query_database is a placeholder for your data-access code)
            return query_database()

#### Node.js (Express)
    // Written against the 1.x OpenTelemetry JS SDK API.
    const { trace } = require("@opentelemetry/api");
    const { Resource } = require("@opentelemetry/resources");
    const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
    const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
    const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
    const { registerInstrumentations } = require("@opentelemetry/instrumentation");
    const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
    const {
      ExpressInstrumentation,
    } = require("@opentelemetry/instrumentation-express");

    // Initialize tracer
    const provider = new NodeTracerProvider({
      resource: new Resource({ "service.name": "my-service" }),
    });

    const exporter = new JaegerExporter({
      endpoint: "http://jaeger:14268/api/traces",
    });

    provider.addSpanProcessor(new BatchSpanProcessor(exporter));
    provider.register();

    // Instrument libraries (must run before express is required)
    registerInstrumentations({
      instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
    });

    const express = require("express");
    const app = express();

    app.get("/api/users", async (req, res) => {
      const tracer = trace.getTracer("my-service");
      const span = tracer.startSpan("get_users");

      try {
        // fetchUsers is a placeholder for your data-access code
        const users = await fetchUsers();
        span.setAttributes({ "user.count": users.length });
        res.json({ users });
      } finally {
        span.end();
      }
    });

#### Go
    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/jaeger"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    )

    func initTracer() (*sdktrace.TracerProvider, error) {
        exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
            jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
        ))
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String("my-service"),
            )),
        )

        otel.SetTracerProvider(tp)
        return tp, nil
    }

    // User and fetchUsersFromDB are assumed to be defined elsewhere in the package.
    func getUsers(ctx context.Context) ([]User, error) {
        tracer := otel.Tracer("my-service")
        ctx, span := tracer.Start(ctx, "get_users")
        defer span.End()

        span.SetAttributes(attribute.String("user.filter", "active"))

        users, err := fetchUsersFromDB(ctx)
        if err != nil {
            span.RecordError(err)
            return nil, err
        }

        span.SetAttributes(attribute.Int("user.count", len(users)))
        return users, nil
    }

**Reference:** See `references/instrumentation.md`

## Context Propagation
### HTTP Headers
    traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
    tracestate: congo=t61rcWkgMzE

### Propagation in HTTP Requests
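The `traceparent` value shown under HTTP Headers follows the W3C Trace Context layout: a 2-digit version, a 32-digit trace ID, a 16-digit parent span ID, and 2-digit flags, all lowercase hex, where the lowest flag bit means "sampled". The stdlib-only sketch below shows roughly what a propagator does with that header; `parse_traceparent` and `child_traceparent` are hypothetical helper names, and only the version `00` four-field form is handled.

```python
import re
import secrets

# version-traceid-parentid-flags, lowercase hex only
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Header for an outgoing call: same trace ID, fresh span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex digits
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
outgoing = child_traceparent(incoming)
```

In practice the SDK's propagator handles this for you; the point is that the trace ID stays constant across hops while the span ID changes at each one.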
#### Python
    import requests
    from opentelemetry.propagate import inject

    headers = {}
    inject(headers)  # Injects trace context
    response = requests.get('http://downstream-service/api', headers=headers)

#### Node.js

    const axios = require("axios");
    const { context, propagation } = require("@opentelemetry/api");

    const headers = {};
    propagation.inject(context.active(), headers);
    axios.get("http://downstream-service/api", { headers });

## Tempo Setup (Grafana)
### Kubernetes Deployment
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tempo-config
    data:
      tempo.yaml: |
        server:
          http_listen_port: 3200

        distributor:
          receivers:
            jaeger:
              protocols:
                thrift_http:
                grpc:
            otlp:
              protocols:
                http:
                grpc:

        storage:
          trace:
            backend: s3
            s3:
              bucket: tempo-traces
              endpoint: s3.amazonaws.com

        querier:
          frontend_worker:
            frontend_address: tempo-query-frontend:9095
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tempo
    spec:
      replicas: 1
      # apps/v1 requires a selector that matches the pod template labels
      selector:
        matchLabels:
          app: tempo
      template:
        metadata:
          labels:
            app: tempo
        spec:
          containers:
            - name: tempo
              image: grafana/tempo:latest
              args:
                - -config.file=/etc/tempo/tempo.yaml
              volumeMounts:
                - name: config
                  mountPath: /etc/tempo
          volumes:
            - name: config
              configMap:
                name: tempo-config

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies
### Probabilistic Sampling
    # Sample 1% of traces
    sampler:
      type: probabilistic
      param: 0.01

### Rate Limiting Sampling
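A rate-limiting sampler caps how many new traces are kept per second regardless of traffic volume. One common way to build such a limiter is a token bucket, sketched below with the standard library only. The class name, its parameters, and the injectable `clock` are hypothetical and are not taken from Jaeger's implementation.

```python
import time


class RateLimitingSampler:
    """Token-bucket sketch: keep at most `max_traces_per_second` new traces."""

    def __init__(self, max_traces_per_second, clock=time.monotonic):
        self.rate = float(max_traces_per_second)
        self.clock = clock            # injectable so the behavior can be tested
        self.balance = self.rate      # start with a full bucket
        self.last = clock()

    def should_sample(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond one second's worth.
        self.balance = min(self.rate, self.balance + (now - self.last) * self.rate)
        self.last = now
        if self.balance >= 1.0:
            self.balance -= 1.0
            return True
        return False


sampler = RateLimitingSampler(max_traces_per_second=100)
keep_first_trace = sampler.should_sample()  # True: the bucket starts full
```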
    # Sample max 100 traces per second
    sampler:
      type: ratelimiting
      param: 100

### Adaptive Sampling
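The sampler used in this section decides from the trace ID itself, so every service reaches the same verdict for a given trace without any coordination. The stdlib-only sketch below illustrates that idea; `should_sample` is a hypothetical name, and the comparison is similar in spirit to, but not a copy of, what an SDK's trace-ID ratio sampler does.

```python
def should_sample(trace_id_hex, ratio):
    """Keep a trace when the low 64 bits of its ID fall below ratio * 2**64.

    The decision depends only on the trace ID, so it is repeatable everywhere.
    """
    bound = int(ratio * (1 << 64))
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < bound


trace_id = "0af7651916cd43dd8448eb211c80319c"
decision_a = should_sample(trace_id, 0.6)
decision_b = should_sample(trace_id, 0.6)  # same ID and ratio, same decision
```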
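    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Sample based on trace ID (deterministic); child spans follow the parent's decision.
    # Note: the ratio is fixed at 1% and does not adjust to traffic on its own.
    sampler = ParentBased(root=TraceIdRatioBased(0.01))

## Trace Analysis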
### Finding Slow Requests
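**Jaeger Query** (search form fields in the Jaeger UI):

    Service: my-service
    Min Duration: 1s

### Finding Errors
**Jaeger Query** (search form fields in the Jaeger UI):

    Service: my-service
    Tags: error=true http.status_code=500

Tag search matches exact values, so query each 5xx status code of interest separately.

### Service Dependency Graph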
Jaeger automatically generates service dependency graphs showing:
- Service relationships
- Request rates
- Error rates
- Average latencies
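The service relationships in such a graph can be derived from parent-child span links: whenever a span's parent belongs to a different service, that is one call along an edge. The sketch below shows only this idea and is not Jaeger's implementation; the tuple layout, the sample data, and `dependency_edges` are hypothetical.

```python
from collections import Counter

# (span_id, parent_id, service) tuples; hypothetical sample data
spans = [
    ("s1", None, "frontend"),
    ("s2", "s1", "api-gateway"),
    ("s3", "s2", "auth-service"),
    ("s4", "s2", "user-service"),
    ("s5", "s4", "user-service"),  # internal span, stays within one service
]


def dependency_edges(spans):
    """Count calls between services as (caller, callee) pairs."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges = Counter()
    for _, parent_id, service in spans:
        parent_service = service_of.get(parent_id)
        # Only cross-service parent-child links count as a dependency.
        if parent_service is not None and parent_service != service:
            edges[(parent_service, service)] += 1
    return edges


edges = dependency_edges(spans)
```

Aggregating these counts over many traces yields the per-edge request rates.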
## Best Practices
1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards
## Integration with Logging
### Correlated Logs
    import logging
    from opentelemetry import trace

    logger = logging.getLogger(__name__)

    def process_request():
        span = trace.get_current_span()
        trace_id = span.get_span_context().trace_id

        logger.info(
            "Processing request",
            extra={"trace_id": format(trace_id, '032x')}
        )

## Troubleshooting
**No traces appearing:**
- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
**High latency overhead:**
- Reduce sampling rate
- Use batch span processor
- Check exporter configuration
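Batching helps with overhead because finished spans are buffered and sent in groups instead of triggering one network call each. The sketch below shows only the size-based part of that idea. It is not the SDK's batch span processor, which also flushes on a timer from a background thread. `BatchingExporter`, `send`, and the batch size are hypothetical.

```python
class BatchingExporter:
    """Buffers finished spans and hands them to `send` in groups."""

    def __init__(self, send, max_batch_size=512):
        self.send = send                  # callable that ships a list of spans
        self.max_batch_size = max_batch_size
        self.buffer = []

    def on_end(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered; call this on shutdown so nothing is lost."""
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []


sent_batches = []
exporter = BatchingExporter(sent_batches.append, max_batch_size=3)
for span_number in range(7):
    exporter.on_end(span_number)
exporter.flush()  # sends the final partial batch
```

Seven spans produce three export calls here instead of seven.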
## Reference Files
- `references/jaeger-setup.md` - Jaeger installation
- `references/instrumentation.md` - Instrumentation patterns
- `assets/jaeger-config.yaml.template` - Jaeger configuration

## Related Skills
- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs

How to Use This Skill Unit
Option A: Project-Specific (Recommended)
- Click "Download" above
- In your project, create the directory: .agent/skills/distributed-tracing/
- Save the file as SKILL.md
- The agent will automatically discover the skill based on its description.
Option B: Global Installation (All Agents)
Save the file to these locations to make it available across all projects:
- Claude Code: ~/.claude/skills/wshobson/agents/distributed-tracing/SKILL.md
- Cursor: ~/.cursor/skills/wshobson/agents/distributed-tracing/SKILL.md
- Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/distributed-tracing/SKILL.md