
distributed-tracing

Tags: distributed tracing, Jaeger, Tempo, microservices, observability, debugging, performance monitoring, cloud-native
36.8k stars | MIT license | Updated 2026-06-16

Install this skill

npx skills add wshobson/agents

Works across Claude Code, Cursor, Codex, Copilot & Antigravity

Distributed tracing provides end-to-end visibility into a request as it crosses services, gateways, and databases in a microservice architecture. This skill instruments applications with OpenTelemetry so that each request produces a trace made up of spans, each recording one operation's metadata, duration, and error state. Viewed in a tracing backend such as Jaeger or Tempo, the spans form a single timeline of the execution path. That timeline makes it possible to pinpoint latency spikes and failures, map service dependencies, and find bottlenecks that per-service logs or metrics do not reveal.

When to Use This Skill

  • Isolating services responsible for slow API response times
  • Debugging error propagation during cascading service failures
  • Mapping complex inter-service communication patterns
  • Verifying that database queries align with expected request flows

How to Invoke This Skill

Example prompts that trigger this skill in Claude Code, Cursor, or Antigravity:

  • “trace the request path for this endpoint”
  • “find why this microservice is timing out”
  • “show me the dependency graph for our services”
  • “instrument this flask app for distributed tracing”
  • “how is this request failing across services”

Pro Tips

  • Propagate trace context (for example, the W3C traceparent and tracestate HTTP headers) through every service, including ones that are not instrumented. A hop that drops the headers splits the trace in two.
  • Attach meaningful tags and structured events to spans so traces are easier to filter and debug.
  • Review traces regularly to spot recurring bottlenecks and unexpected service dependencies before they cause incidents.

What this skill does

  • Visualize request paths across multiple microservices
  • Measure duration of individual operations within a request
  • Propagate trace context metadata through service calls
  • Identify specific nodes causing latency or errors
  • Map service dependency topology

When not to use it

  • Monolithic applications with simple execution paths
  • Low-traffic services where overhead outweighs diagnostic value
  • Environments where security policies prohibit external metadata propagation

Example workflow

  1. Deploy a Jaeger collector into the observability namespace
  2. Configure OpenTelemetry providers in the target microservice
  3. Instrument specific methods and network calls with spans
  4. Generate traffic to trigger service-to-service communication
  5. Access the Jaeger dashboard to review the generated traces

Prerequisites

  • A microservices architecture
  • OpenTelemetry SDKs for your language
  • A tracing backend such as Jaeger or Tempo

Pitfalls & limitations

  • High sampling rates can consume significant storage and network bandwidth
  • Incomplete instrumentation in one service will result in fragmented trace views
  • Adding too much metadata to spans may degrade service performance

FAQ

What is the difference between a trace and a span?
A trace represents the entire end-to-end request journey, while a span represents a single operation or unit of work within that journey.
Do I need to change my application code to use tracing?
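Usually, yes. At minimum the application must initialize a tracer provider and an exporter. Instrumentation libraries (such as the Flask and Express instrumentors) then create spans for framework and HTTP calls automatically, while spans around business logic still have to be added by hand.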
Can I use distributed tracing for database queries?
Yes, by wrapping database calls in spans, you can track query latency and identify inefficient database operations within the request path.

How it compares

Unlike manual logging, which produces disjointed events, distributed tracing creates a unified, queryable timeline that links related actions across different servers and processes.

Full skill instructions (original source: wshobson/agents)
# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths

## Distributed Tracing Concepts

### Trace Structure

    Trace (Request ID: abc123)
      ↓
    Span (frontend) [100ms]
      ↓
    Span (api-gateway) [80ms]
      ├→ Span (auth-service) [10ms]
      └→ Span (user-service) [60ms]
          └→ Span (database) [40ms]


### Key Components

- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span
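
To make the relationship between these components concrete, here is a minimal sketch that models the trace from the diagram above as plain Python data. The `Span` class, `self_time_ms`, and `render` are illustrative names, not part of any tracing SDK, and the self-time calculation assumes child spans run one after another.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed operation; parent_id links it into the trace tree."""
    span_id: str
    name: str
    duration_ms: int
    parent_id: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)


def children_of(spans: List[Span], parent_id: Optional[str]) -> List[Span]:
    """Return the spans whose parent is parent_id (None selects root spans)."""
    return [s for s in spans if s.parent_id == parent_id]


def self_time_ms(spans: List[Span], span: Span) -> int:
    """Time spent in the span itself, excluding its direct children."""
    child_total = sum(c.duration_ms for c in children_of(spans, span.span_id))
    return span.duration_ms - child_total


def render(spans: List[Span], parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Render the trace as indented lines, one per span, depth first."""
    lines: List[str] = []
    for s in children_of(spans, parent_id):
        lines.append("  " * depth + f"{s.name} [{s.duration_ms}ms]")
        lines.extend(render(spans, s.span_id, depth + 1))
    return lines


# The example trace from the diagram above.
TRACE = [
    Span("1", "frontend", 100),
    Span("2", "api-gateway", 80, parent_id="1"),
    Span("3", "auth-service", 10, parent_id="2"),
    Span("4", "user-service", 60, parent_id="2"),
    Span("5", "database", 40, parent_id="4", tags={"db.system": "postgresql"}),
]

if __name__ == "__main__":
    print("\n".join(render(TRACE)))
```

Self time is often the more useful number when hunting bottlenecks: here the api-gateway span lasts 80 ms but spends only 10 ms outside its children.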

## Jaeger Setup

### Kubernetes Deployment

    # Deploy the Jaeger Operator
    kubectl create namespace observability
    kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

    # Deploy a Jaeger instance
    kubectl apply -f - <<EOF
    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
      namespace: observability
    spec:
      strategy: production
      storage:
        type: elasticsearch
        options:
          es:
            server-urls: http://elasticsearch:9200
      ingress:
        enabled: true
    EOF


### Docker Compose

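    version: "3.8"
    services:
      jaeger:
        image: jaegertracing/all-in-one:latest
        ports:
          - "5775:5775/udp"   # Agent: legacy zipkin.thrift (compact)
          - "6831:6831/udp"   # Agent: jaeger.thrift (compact)
          - "6832:6832/udp"   # Agent: jaeger.thrift (binary)
          - "5778:5778"       # Agent: configuration and sampling strategies
          - "16686:16686"     # UI
          - "14268:14268"     # Collector (HTTP)
          - "14250:14250"     # Collector (gRPC)
          - "9411:9411"       # Zipkin-compatible endpoint
        environment:
          - COLLECTOR_ZIPKIN_HOST_PORT=:9411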


**Reference:** See references/jaeger-setup.md

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)

    from flask import Flask
    from opentelemetry import trace
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter
    from opentelemetry.instrumentation.flask import FlaskInstrumentor
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Initialize the tracer provider once at startup.
    # Note: the Jaeger Thrift exporter package is deprecated upstream;
    # recent Jaeger versions also accept OTLP directly.
    resource = Resource(attributes={SERVICE_NAME: "my-service"})
    provider = TracerProvider(resource=resource)
    processor = BatchSpanProcessor(JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    ))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    # Instrument Flask so every incoming request gets a server span
    app = Flask(__name__)
    FlaskInstrumentor().instrument_app(app)

    @app.route("/api/users")
    def get_users():
        tracer = trace.get_tracer(__name__)

        with tracer.start_as_current_span("get_users") as span:
            # Business logic
            users = fetch_users_from_db()
            # Record the actual result size rather than a hard-coded value
            span.set_attribute("user.count", len(users))
            return {"users": users}

    def fetch_users_from_db():
        tracer = trace.get_tracer(__name__)

        # Child span: nested under "get_users" via the current context
        with tracer.start_as_current_span("database_query") as span:
            span.set_attribute("db.system", "postgresql")
            span.set_attribute("db.statement", "SELECT * FROM users")
            # query_database() is a placeholder for the application's data access code
            return query_database()


#### Node.js (Express)

    // Written against the 1.x JavaScript SDK API (new Resource, addSpanProcessor).
    const { trace } = require("@opentelemetry/api");
    const { Resource } = require("@opentelemetry/resources");
    const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
    const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
    const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
    const { registerInstrumentations } = require("@opentelemetry/instrumentation");
    const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
    const {
      ExpressInstrumentation,
    } = require("@opentelemetry/instrumentation-express");

    // Initialize tracer
    const provider = new NodeTracerProvider({
      resource: new Resource({ "service.name": "my-service" }),
    });

    const exporter = new JaegerExporter({
      endpoint: "http://jaeger:14268/api/traces",
    });

    provider.addSpanProcessor(new BatchSpanProcessor(exporter));
    provider.register();

    // Instrument libraries (must run before express is required)
    registerInstrumentations({
      instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
    });

    const express = require("express");
    const app = express();

    app.get("/api/users", async (req, res) => {
      const tracer = trace.getTracer("my-service");
      const span = tracer.startSpan("get_users");

      try {
        // fetchUsers() is a placeholder for the application's data access code
        const users = await fetchUsers();
        span.setAttributes({ "user.count": users.length });
        res.json({ users });
      } finally {
        // Always end the span, even if the handler throws
        span.end();
      }
    });


#### Go

    package main

    import (
        "context"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/attribute"
        "go.opentelemetry.io/otel/exporters/jaeger"
        "go.opentelemetry.io/otel/sdk/resource"
        sdktrace "go.opentelemetry.io/otel/sdk/trace"
        semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
    )

    // initTracer configures a global tracer provider that exports to Jaeger.
    // Note: the Go Jaeger exporter is deprecated upstream; recent Jaeger
    // versions also accept OTLP directly.
    func initTracer() (*sdktrace.TracerProvider, error) {
        exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
            jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
        ))
        if err != nil {
            return nil, err
        }

        tp := sdktrace.NewTracerProvider(
            sdktrace.WithBatcher(exporter),
            sdktrace.WithResource(resource.NewWithAttributes(
                semconv.SchemaURL,
                semconv.ServiceNameKey.String("my-service"),
            )),
        )

        otel.SetTracerProvider(tp)
        return tp, nil
    }

    // getUsers wraps the lookup in a span. User and fetchUsersFromDB are
    // placeholders for application code defined elsewhere.
    func getUsers(ctx context.Context) ([]User, error) {
        tracer := otel.Tracer("my-service")
        ctx, span := tracer.Start(ctx, "get_users")
        defer span.End()

        span.SetAttributes(attribute.String("user.filter", "active"))

        // Pass ctx on so spans created downstream become children of this one
        users, err := fetchUsersFromDB(ctx)
        if err != nil {
            span.RecordError(err)
            return nil, err
        }

        span.SetAttributes(attribute.Int("user.count", len(users)))
        return users, nil
    }


**Reference:** See references/instrumentation.md

## Context Propagation

### HTTP Headers

    traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
    tracestate: congo=t61rcWkgMzE
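
The `traceparent` value packs four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters), and trace flags (2 hex characters, where the lowest bit means "sampled"). As a rough illustration, the sketch below parses the version `00` layout with the standard library only. `TraceParent` and `parse_traceparent` are hypothetical helper names; in real services the OpenTelemetry propagators do this work.

```python
import re
from typing import NamedTuple, Optional

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are defined as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    print(parse_traceparent(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    ))
```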


### Propagation in HTTP Requests

#### Python

    import requests
    from opentelemetry.propagate import inject

    headers = {}
    inject(headers)  # Adds traceparent/tracestate for the current span context

    response = requests.get("http://downstream-service/api", headers=headers)


#### Node.js

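    const axios = require("axios");
    const { context, propagation } = require("@opentelemetry/api");

    const headers = {};
    // Writes the active trace context into the outgoing headers
    propagation.inject(context.active(), headers);

    axios.get("http://downstream-service/api", { headers });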
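
A service that carries no tracing SDK at all (a thin proxy, for instance) can still keep traces connected by copying the W3C headers from the inbound request to its outbound calls unchanged. The sketch below shows that idea; `forward_trace_headers` is a hypothetical helper, and it assumes headers arrive as a simple string-to-string mapping. The hop itself will not appear as a span, but downstream spans stay attached to the same trace.

```python
from typing import Dict, Mapping

# Header names defined by the W3C Trace Context format.
TRACE_HEADERS = ("traceparent", "tracestate")


def forward_trace_headers(incoming: Mapping[str, str]) -> Dict[str, str]:
    """Pick the trace context headers out of an inbound request.

    Header names are matched case-insensitively and returned lowercased,
    with their values left untouched.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {
        "Accept": "application/json",
        "Traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
        "Tracestate": "congo=t61rcWkgMzE",
    }
    # Merge the result into the headers of any outbound call.
    print(forward_trace_headers(inbound))
```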


## Tempo Setup (Grafana)

### Kubernetes Deployment

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: tempo-config
    data:
      tempo.yaml: |
        server:
          http_listen_port: 3200

        distributor:
          receivers:
            jaeger:
              protocols:
                thrift_http:
                grpc:
            otlp:
              protocols:
                http:
                grpc:

        storage:
          trace:
            backend: s3
            s3:
              bucket: tempo-traces
              endpoint: s3.amazonaws.com

        querier:
          frontend_worker:
            frontend_address: tempo-query-frontend:9095
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: tempo
    spec:
      replicas: 1
      # apps/v1 requires a selector matching the pod labels;
      # the "app: tempo" label is an illustrative choice.
      selector:
        matchLabels:
          app: tempo
      template:
        metadata:
          labels:
            app: tempo
        spec:
          containers:
            - name: tempo
              image: grafana/tempo:latest
              args:
                - -config.file=/etc/tempo/tempo.yaml
              volumeMounts:
                - name: config
                  mountPath: /etc/tempo
          volumes:
            - name: config
              configMap:
                name: tempo-config


**Reference:** See assets/jaeger-config.yaml.template

## Sampling Strategies

### Probabilistic Sampling

    # Sample 1% of traces
    sampler:
      type: probabilistic
      param: 0.01
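
To show what a probabilistic decision can look like in practice, here is a minimal sketch that derives the decision from the trace ID itself, so every service that sees the same trace ID reaches the same answer. It resembles the approach taken by trace-ID ratio samplers but is not a copy of Jaeger's or OpenTelemetry's implementation; `should_sample` is an illustrative name.

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    """Decide deterministically whether a trace is kept.

    The lower 64 bits of the trace ID are compared against a threshold
    proportional to the sampling ratio, so roughly `ratio` of uniformly
    random trace IDs pass.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound


if __name__ == "__main__":
    trace_id = int("0af7651916cd43dd8448eb211c80319c", 16)
    # The same trace ID always yields the same decision for a given ratio.
    print(should_sample(trace_id, 0.01), should_sample(trace_id, 1.0))
```

Because the decision depends only on the trace ID and the ratio, no coordination between services is needed to keep sampled traces complete.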


### Rate Limiting Sampling

    # Sample max 100 traces per second
    sampler:
      type: ratelimiting
      param: 100
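
One common way to enforce a cap like this is a token bucket: tokens refill at the configured rate, and a trace is sampled only when a token is available. The sketch below illustrates the technique under that assumption and is not Jaeger's actual implementation. `RateLimitingSampler` and its `clock` parameter (injectable to make the behavior easy to test) are hypothetical names.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Token-bucket sampler that admits at most `max_per_second` traces per second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = max_per_second
        # Allow a burst of up to one second's worth of traces.
        self.capacity = max(max_per_second, 1.0)
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    sampler = RateLimitingSampler(max_per_second=100)
    kept = sum(sampler.should_sample() for _ in range(1000))
    print(f"kept {kept} of 1000 traces in one burst")
```

Unlike probabilistic sampling, the volume of kept traces stays bounded during traffic spikes, at the cost of sampling a smaller share of requests when load is high.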


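### Parent-Based Ratio Sampling (OpenTelemetry SDK)

    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Root spans: keep about 1% of traces, decided deterministically from the trace ID.
    # Child spans: follow the sampling decision already made by the parent.
    sampler = ParentBased(root=TraceIdRatioBased(0.01))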


## Trace Analysis

### Finding Slow Requests

**Jaeger search (UI fields):**

    Service: my-service
    Min Duration: 1s


### Finding Errors

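**Jaeger search (UI fields):**

    Service: my-service
    Tags: error=true http.status_code=500

Tag search matches exact values, so a range such as "status code 500 or above" has to be covered with one search per status code.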


### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:

- Service relationships
- Request rates
- Error rates
- Average latencies
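
The underlying idea of a dependency graph can be sketched in a few lines: whenever a span's parent belongs to a different service, that pair of services forms a caller-to-callee edge. The example below counts such edges from a flat list of spans. The dictionary keys (`span_id`, `parent_id`, `service`) and the function name are assumptions made for illustration and do not reflect Jaeger's storage format or its actual algorithm.

```python
from collections import Counter
from typing import Dict, List, Optional


def dependency_edges(spans: List[Dict[str, Optional[str]]]) -> Counter:
    """Count caller -> callee edges between services in a list of spans."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges: Counter = Counter()
    for s in spans:
        parent_service = service_of.get(s["parent_id"])
        # Skip root spans and calls that stay inside one service.
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


SPANS = [
    {"span_id": "1", "parent_id": None, "service": "frontend"},
    {"span_id": "2", "parent_id": "1", "service": "api-gateway"},
    {"span_id": "3", "parent_id": "2", "service": "auth-service"},
    {"span_id": "4", "parent_id": "2", "service": "user-service"},
    {"span_id": "5", "parent_id": "4", "service": "user-service"},
]

if __name__ == "__main__":
    for (caller, callee), count in dependency_edges(SPANS).items():
        print(f"{caller} -> {callee}: {count}")
```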

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs

    import logging
    from opentelemetry import trace

    logger = logging.getLogger(__name__)

    def process_request():
        span = trace.get_current_span()
        trace_id = span.get_span_context().trace_id

        # Log the trace ID as 32 lowercase hex characters, the form shown in tracing UIs
        logger.info(
            "Processing request",
            extra={"trace_id": format(trace_id, '032x')}
        )
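
Passing `extra=` on every call is easy to forget. An alternative is a `logging.Filter` that stamps each record with the trace ID automatically. The sketch below uses only the standard library. The `current_trace_id` context variable is a hypothetical stand-in for the tracer's active context; with OpenTelemetry the value would come from `trace.get_current_span().get_span_context().trace_id` instead. The logger name and format string are likewise illustrative.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the tracer's active span context.
current_trace_id = contextvars.ContextVar("current_trace_id", default=0)


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record as 32 hex characters."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = format(current_trace_id.get(), "032x")
        return True  # never drop the record, only enrich it


def build_logger(stream) -> logging.Logger:
    """Create a logger whose output lines always include trace_id."""
    logger = logging.getLogger("traced-app")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
logger = build_logger(buffer)
current_trace_id.set(int("0af7651916cd43dd8448eb211c80319c", 16))
logger.info("Processing request")

if __name__ == "__main__":
    print(buffer.getvalue(), end="")
```

With the trace ID in every log line, a trace found in the tracing UI can be matched to its log entries by searching for that ID.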


## Troubleshooting

**No traces appearing:**

- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs
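
For the first two checks, a quick TCP probe run from inside the application's container or pod can confirm that the collector endpoint is reachable at all. This is a minimal sketch using the standard library; `can_reach` is an illustrative helper, and the host and port in the usage example are taken from the collector endpoint used earlier in this document. It applies only to TCP endpoints (such as the collector's HTTP port), not to the UDP agent ports.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # "jaeger" and 14268 match the collector endpoint in the examples above.
    print("collector reachable:", can_reach("jaeger", 14268))
```

A successful connection only proves network reachability; a wrong path or protocol on the exporter side can still prevent traces from arriving.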

**High latency overhead:**

- Reduce sampling rate
- Use batch span processor
- Check exporter configuration

## Reference Files

- references/jaeger-setup.md - Jaeger installation
- references/instrumentation.md - Instrumentation patterns
- assets/jaeger-config.yaml.template - Jaeger configuration

## Related Skills

- prometheus-configuration - For metrics
- grafana-dashboards - For visualization
- slo-implementation - For latency SLOs

How to Use This Skill Unit

Option A: Project-Specific (Recommended)

  1. Click "Download" above
  2. In your project, create the directory: .agent/skills/distributed-tracing/
  3. Save the file as SKILL.md
  4. The agent will automatically discover the skill based on its description.

Option B: Global Installation (All Agents)

Save the file to these locations to make it available across all projects:

  • Claude Code: ~/.claude/skills/wshobson/agents/distributed-tracing/SKILL.md
  • Cursor: ~/.cursor/skills/wshobson/agents/distributed-tracing/SKILL.md
  • Antigravity: ~/.gemini/antigravity/skills/wshobson/agents/distributed-tracing/SKILL.md

Source & attribution

This skill is categorized under DevOps & CI/CD and is published by W. Shobson, maintained in wshobson/agents.
