Observability Stack: Prometheus, Grafana & Jaeger Complete Guide

December 19, 2025
10 min read
Tushar Agrawal

Build production-grade observability with Prometheus metrics, Grafana dashboards, and Jaeger distributed tracing. Complete setup guide with alerting, custom metrics, and troubleshooting patterns.

Introduction

You can't fix what you can't see. In distributed systems, observability isn't a luxury—it's survival. When a healthcare platform serving millions of patients experiences a latency spike at 2 AM, you need answers in seconds, not hours.

This guide covers the three pillars of observability: Metrics (Prometheus), Visualization (Grafana), and Distributed Tracing (Jaeger). I'll share production patterns from building HIPAA-compliant systems that demand 99.99% uptime.

The Three Pillars of Observability

┌─────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                             │
├───────────────────┬───────────────────┬─────────────────────┤
│      METRICS      │       LOGS        │      TRACES         │
│                   │                   │                     │
│  ┌─────────────┐  │  ┌─────────────┐  │  ┌───────────────┐  │
│  │ Prometheus  │  │  │    Loki     │  │  │    Jaeger     │  │
│  │             │  │  │   (ELK)     │  │  │   (Zipkin)    │  │
│  └─────────────┘  │  └─────────────┘  │  └───────────────┘  │
│                   │                   │                     │
│  What happened?   │ Why did it happen?│  How did it happen? │
│  (Numeric data)   │  (Event context)  │  (Request flow)     │
└───────────────────┴───────────────────┴─────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │     Grafana     │
                    │  (Visualization)│
                    └─────────────────┘

Setting Up Prometheus

Docker Compose Setup

# docker-compose.observability.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - observability

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - observability

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=secure_password
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - observability

  jaeger:
    image: jaegertracing/all-in-one:1.51
    container_name: jaeger
    ports:
      - "16686:16686"  # UI
      - "14268:14268"  # HTTP collector
      - "6831:6831/udp"  # Thrift compact
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks:
      - observability

volumes:
  prometheus_data:
  grafana_data:

networks:
  observability:
    driver: bridge
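
Once the stack is up (`docker compose -f docker-compose.observability.yml up -d`), it's worth confirming each component answers before wiring in any applications. Below is a minimal smoke-test sketch using httpx and the health endpoints each tool exposes; it assumes you run it from the Docker host with the port mappings above.

# check_stack.py -- quick smoke test for the local observability stack
import httpx

CHECKS = {
    "Prometheus": "http://localhost:9090/-/healthy",
    "Alertmanager": "http://localhost:9093/-/healthy",
    "Grafana": "http://localhost:3000/api/health",
    "Jaeger UI": "http://localhost:16686/",
}

def main() -> None:
    for name, url in CHECKS.items():
        try:
            resp = httpx.get(url, timeout=5)
            status = "OK" if resp.status_code == 200 else f"HTTP {resp.status_code}"
        except httpx.HTTPError as exc:
            status = f"unreachable ({exc})"
        print(f"{name:<14} {status}")

if __name__ == "__main__":
    main()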

Prometheus Configuration

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    env: 'prod'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Application services
  - job_name: 'api-services'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'user-service:8000'
          - 'order-service:8000'
          - 'payment-service:8000'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__

  # Node exporter for system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  # PostgreSQL exporter
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']

  # Redis exporter
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
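
Because the compose file starts Prometheus with `--web.enable-lifecycle`, you can reload this configuration without restarting the container and then confirm that every scrape target is healthy. A rough sketch against the Prometheus HTTP API (Prometheus assumed reachable on localhost:9090):

# reload_and_check_targets.py -- reload Prometheus config and list scrape target health
import httpx

PROMETHEUS = "http://localhost:9090"  # adjust if Prometheus is exposed elsewhere

# POST /-/reload only works because --web.enable-lifecycle is set in the compose file
httpx.post(f"{PROMETHEUS}/-/reload", timeout=10).raise_for_status()

# /api/v1/targets returns the state of every configured scrape target
targets = httpx.get(f"{PROMETHEUS}/api/v1/targets", timeout=10).json()
for t in targets["data"]["activeTargets"]:
    print(f'{t["labels"]["job"]:<20} {t["scrapeUrl"]:<50} {t["health"]}')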

Instrumenting Python Applications

FastAPI with Prometheus Metrics

# metrics.py
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
from fastapi import FastAPI, Request, Response
from starlette.middleware.base import BaseHTTPMiddleware
import functools
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

REQUESTS_IN_PROGRESS = Gauge(
    'http_requests_in_progress',
    'Number of HTTP requests in progress',
    ['method', 'endpoint']
)

DB_QUERY_LATENCY = Histogram(
    'db_query_duration_seconds',
    'Database query latency',
    ['query_type', 'table'],
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5]
)

CACHE_HITS = Counter(
    'cache_hits_total',
    'Cache hit count',
    ['cache_name']
)

CACHE_MISSES = Counter(
    'cache_misses_total',
    'Cache miss count',
    ['cache_name']
)

APP_INFO = Info('app', 'Application information')
APP_INFO.info({
    'version': '1.2.3',
    'environment': 'production'
})


class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        endpoint = self._get_endpoint(request)

        REQUESTS_IN_PROGRESS.labels(method=method, endpoint=endpoint).inc()

        start_time = time.perf_counter()
        try:
            response = await call_next(request)
            status_code = response.status_code
        except Exception:
            status_code = 500
            raise
        finally:
            duration = time.perf_counter() - start_time

            REQUEST_COUNT.labels(
                method=method,
                endpoint=endpoint,
                status_code=status_code
            ).inc()

            REQUEST_LATENCY.labels(
                method=method,
                endpoint=endpoint
            ).observe(duration)

            REQUESTS_IN_PROGRESS.labels(
                method=method,
                endpoint=endpoint
            ).dec()

        return response

    def _get_endpoint(self, request: Request) -> str:
        # Normalize path parameters
        path = request.url.path
        for route in request.app.routes:
            if hasattr(route, 'path_regex'):
                match = route.path_regex.match(path)
                if match:
                    return route.path
        return path


# FastAPI setup
app = FastAPI()
app.add_middleware(PrometheusMiddleware)


@app.get('/metrics')
async def metrics():
    return Response(
        content=generate_latest(),
        media_type=CONTENT_TYPE_LATEST
    )


# Custom metrics decorator
def track_db_query(query_type: str, table: str):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                duration = time.perf_counter() - start
                DB_QUERY_LATENCY.labels(
                    query_type=query_type,
                    table=table
                ).observe(duration)
        return wrapper
    return decorator


# Usage
@track_db_query('select', 'users')
async def get_user(user_id: int):
    return await db.fetch_one("SELECT * FROM users WHERE id = $1", user_id)
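
The CACHE_HITS and CACHE_MISSES counters defined above only become useful once they are incremented wherever you read through the cache. Here is a hedged sketch of a cache-aside helper; `redis_client`, `db`, and the loader function are placeholders for whatever clients your service already uses.

# Cache-aside helper that feeds the cache hit/miss counters
# (redis_client is assumed to be an async Redis client configured elsewhere)
import json

async def cached_fetch(cache_name: str, key: str, loader, ttl_seconds: int = 300):
    cached = await redis_client.get(key)
    if cached is not None:
        CACHE_HITS.labels(cache_name=cache_name).inc()
        return json.loads(cached)

    CACHE_MISSES.labels(cache_name=cache_name).inc()
    value = await loader()
    await redis_client.set(key, json.dumps(value), ex=ttl_seconds)
    return value


# Usage
async def get_user_profile(user_id: int):
    return await cached_fetch(
        cache_name="user_profile",
        key=f"user:{user_id}",
        loader=lambda: db.fetch_one("SELECT * FROM users WHERE id = $1", user_id),
    )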

Business Metrics

# business_metrics.py
from prometheus_client import Counter, Gauge, Histogram

# Revenue metrics
ORDERS_TOTAL = Counter(
    'orders_total',
    'Total orders placed',
    ['status', 'payment_method']
)

ORDER_VALUE = Histogram(
    'order_value_dollars',
    'Order value distribution',
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000]
)

# User metrics
ACTIVE_USERS = Gauge(
    'active_users_current',
    'Currently active users',
    ['user_type']
)

USER_SIGNUPS = Counter(
    'user_signups_total',
    'Total user signups',
    ['source', 'plan']
)

# Healthcare specific
LAB_TESTS_PROCESSED = Counter(
    'lab_tests_processed_total',
    'Lab tests processed',
    ['test_type', 'priority']
)

REPORT_GENERATION_TIME = Histogram(
    'report_generation_seconds',
    'Time to generate patient reports',
    ['report_type'],
    buckets=[1, 5, 10, 30, 60, 120, 300]
)


# Usage in business logic
async def create_order(order: OrderCreate) -> Order:
    result = await db.create_order(order)

    # Record business metrics
    ORDERS_TOTAL.labels(
        status='created',
        payment_method=order.payment_method
    ).inc()

    ORDER_VALUE.observe(float(order.total))

    return result
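
Counters only ever go up, but the ACTIVE_USERS gauge has to be set (or incremented and decremented) explicitly. One way to keep it fresh is a small background task that recomputes the value periodically; the sketch below assumes a `count_active_users` query helper and example user types that your own service would define.

# Periodically refresh the ACTIVE_USERS gauge from the source of truth
# (count_active_users is a placeholder for your own query)
import asyncio

async def refresh_active_users(interval_seconds: int = 30):
    while True:
        for user_type in ("patient", "clinician", "admin"):
            count = await count_active_users(user_type)
            ACTIVE_USERS.labels(user_type=user_type).set(count)
        await asyncio.sleep(interval_seconds)


# Start it alongside the app, e.g. in a FastAPI startup hook
@app.on_event("startup")
async def start_metric_refresher():
    asyncio.create_task(refresh_active_users())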

Distributed Tracing with Jaeger

OpenTelemetry Setup

# tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.asyncpg import AsyncPGInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.b3 import B3MultiFormat

def setup_tracing(service_name: str):
    """Initialize OpenTelemetry tracing with Jaeger."""

    # Create resource with service info
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "1.2.3",
        "deployment.environment": "production",
    })

    # Create tracer provider
    provider = TracerProvider(resource=resource)

    # Configure Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name="jaeger",
        agent_port=6831,
    )

    # Add span processor
    provider.add_span_processor(
        BatchSpanProcessor(jaeger_exporter)
    )

    # Set global tracer provider
    trace.set_tracer_provider(provider)

    # Set propagation format (B3 for compatibility)
    set_global_textmap(B3MultiFormat())

    return trace.get_tracer(service_name)


def instrument_app(app):
    """Instrument FastAPI and dependencies."""

    # FastAPI
    FastAPIInstrumentor.instrument_app(app)

    # HTTP client
    HTTPXClientInstrumentor().instrument()

    # Database
    AsyncPGInstrumentor().instrument()

    # Redis
    RedisInstrumentor().instrument()


# Usage
tracer = setup_tracing("user-service")


# Custom span creation
async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        # Validate order
        with tracer.start_as_current_span("validate_order"):
            await validate_order(order_id)

        # Process payment
        with tracer.start_as_current_span("process_payment") as payment_span:
            result = await payment_service.charge(order_id)
            payment_span.set_attribute("payment.status", result.status)

        # Send notifications
        with tracer.start_as_current_span("send_notifications"):
            await notification_service.send(order_id)

        span.set_attribute("order.status", "completed")

Trace Context Propagation

# context_propagation.py
import httpx
from opentelemetry import trace
from opentelemetry.propagate import inject

async def call_downstream_service(endpoint: str, data: dict):
    """Call downstream service with trace context."""

    headers = {}
    inject(headers)  # Inject trace context into headers

    async with httpx.AsyncClient() as client:
        response = await client.post(
            endpoint,
            json=data,
            headers=headers
        )
        return response.json()


# gRPC context propagation
import grpc
from opentelemetry.propagate import inject

def create_grpc_metadata():
    """Create gRPC metadata with trace context."""
    carrier = {}
    inject(carrier)
    return [(k, v) for k, v in carrier.items()]


async def call_grpc_service(stub, request):
    metadata = create_grpc_metadata()
    return await stub.SomeMethod(request, metadata=metadata)
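
Injection is only half the story: the receiving service has to extract the context from incoming headers so its spans join the same trace. FastAPIInstrumentor does this automatically for HTTP, but for hand-rolled consumers (for example, a message-queue worker) you can do it explicitly. A minimal sketch, with `process` standing in for the actual business logic:

# Manually continue a trace from carried headers (e.g. a message pulled off a queue)
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer("worker-service")

def handle_message(headers: dict, payload: dict):
    # Rebuild the remote context from whatever carrier the producer injected into
    parent_context = extract(headers)
    with tracer.start_as_current_span("handle_message", context=parent_context) as span:
        span.set_attribute("message.type", payload.get("type", "unknown"))
        process(payload)  # placeholder for the actual handler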

Grafana Dashboards

API Performance Dashboard

{
  "dashboard": {
    "title": "API Performance",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (endpoint)",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "title": "Latency P95",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))",
            "legendFormat": "{{endpoint}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "thresholds": {
          "steps": [
            {"color": "green", "value": 0},
            {"color": "yellow", "value": 1},
            {"color": "red", "value": 5}
          ]
        }
      },
      {
        "title": "Requests In Progress",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(http_requests_in_progress)",
            "legendFormat": "In Progress"
          }
        ]
      }
    ]
  }
}
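
You can click dashboards together in the UI, but keeping the JSON in version control and pushing it through Grafana's HTTP API makes them reproducible across environments. A rough sketch using a service-account token (the token value and file path are placeholders):

# push_dashboard.py -- upload a dashboard JSON to Grafana via its HTTP API
import json
import httpx

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "replace-with-a-service-account-token"  # needs dashboard write access

with open("dashboards/api-performance.json") as f:
    dashboard = json.load(f)["dashboard"]

payload = {"dashboard": dashboard, "overwrite": True, "folderId": 0}
resp = httpx.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    json=payload,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json()["url"])

For a fully declarative setup, the same JSON can instead be dropped into the provisioning directory that the compose file already mounts at /etc/grafana/provisioning.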

Database Dashboard Queries

# Connection pool usage
pg_stat_activity_count{state="active"} / pg_settings_max_connections * 100

# Query latency by type
histogram_quantile(0.95, sum(rate(db_query_duration_seconds_bucket[5m])) by (le, query_type))

# Queries slower than 1 second
sum(rate(db_query_duration_seconds_count[5m])) - sum(rate(db_query_duration_seconds_bucket{le="1"}[5m]))

# Cache hit ratio
sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100
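
These expressions work outside Grafana too, which is handy for ad-hoc checks or for wiring a metric into a script. A quick sketch evaluating the cache hit ratio through the Prometheus query API:

# query_cache_ratio.py -- evaluate a PromQL expression via the HTTP API
import httpx

PROMETHEUS = "http://localhost:9090"
QUERY = (
    'sum(rate(cache_hits_total[5m])) / '
    '(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m]))) * 100'
)

resp = httpx.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
result = resp.json()["data"]["result"]
if result:
    print(f"Cache hit ratio: {float(result[0]['value'][1]):.1f}%")
else:
    print("No cache metrics yet")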

Alerting Rules

# prometheus/rules/alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
          > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency"
          description: "P95 latency is {{ $value }}s (threshold: 2s)"

      - alert: ServiceDown
        expr: up{job="api-services"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service has been unreachable for more than 1 minute"

  - name: database_alerts
    rules:
      - alert: HighConnectionUsage
        expr: |
          pg_stat_activity_count / pg_settings_max_connections > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High database connection usage"
          description: "{{ $value | humanizePercentage }} of connections in use"

      - alert: SlowQueries
        expr: |
          pg_stat_activity_max_tx_duration{state="active"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Slow database queries detected"

  - name: business_alerts
    rules:
      - alert: OrderProcessingDelayed
        expr: |
          histogram_quantile(0.95, rate(order_processing_duration_seconds_bucket[5m]))
          > 60
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Order processing is delayed"

      - alert: LowOrderVolume
        expr: |
          sum(rate(orders_total[1h])) < 10
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Unusually low order volume"

AlertManager Configuration

# alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    - match:
        severity: critical
      receiver: 'slack-critical'
    - match:
        severity: warning
      receiver: 'slack-warning'

receivers:
  - name: 'default-receiver'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true

  - name: 'slack-critical'
    slack_configs:
      - channel: '#alerts-critical'
        color: '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true

  - name: 'slack-warning'
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
        severity: critical
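
Routing trees are easy to get subtly wrong, so it pays to push a synthetic alert through Alertmanager and watch which Slack channel (or PagerDuty service) it lands in. A sketch using the v2 alerts endpoint:

# send_test_alert.py -- post a synthetic alert to verify Alertmanager routing and receivers
from datetime import datetime, timedelta, timezone
import httpx

ALERTMANAGER = "http://localhost:9093"

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "RoutingTest",
        "severity": "warning",   # switch to "critical" to exercise the PagerDuty route
    },
    "annotations": {
        "summary": "Synthetic alert to verify Alertmanager routing",
        "description": "Safe to ignore; sent by send_test_alert.py",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),
}]

httpx.post(f"{ALERTMANAGER}/api/v2/alerts", json=test_alert, timeout=10).raise_for_status()
print("Test alert sent -- check the #alerts-warning channel")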

SLO/SLI Implementation

# slo_metrics.py
from prometheus_client import Counter, Histogram

# SLI: Request success rate
REQUESTS_TOTAL = Counter(
    'sli_requests_total',
    'Total requests for SLO calculation',
    ['service', 'endpoint']
)

REQUESTS_SUCCESS = Counter(
    'sli_requests_success_total',
    'Successful requests for SLO calculation',
    ['service', 'endpoint']
)

# SLI: Latency
REQUEST_LATENCY = Histogram(
    'sli_request_latency_seconds',
    'Request latency for SLO calculation',
    ['service', 'endpoint'],
    buckets=[.1, .25, .5, 1, 2.5]
)


# SLO Prometheus rules
"""
# prometheus/rules/slo.yml
groups:
  - name: slo_rules
    rules:
      # 30-day error rate (the budget for a 99.9% SLO is 0.1%)
      - record: slo:error_budget:ratio
        expr: |
          1 - (
            sum(rate(sli_requests_success_total[30d]))
            / sum(rate(sli_requests_total[30d]))
          )

      # SLO: 99.9% availability
      - record: slo:availability:ratio
        expr: |
          sum(rate(sli_requests_success_total[5m]))
          / sum(rate(sli_requests_total[5m]))

      # SLO: 95% of requests under 500ms
      - record: slo:latency:ratio
        expr: |
          sum(rate(sli_request_latency_seconds_bucket{le="0.5"}[5m]))
          / sum(rate(sli_request_latency_seconds_count[5m]))

      # Alert when the 30-day error rate exceeds the 0.1% budget
      - alert: ErrorBudgetBurn
        expr: slo:error_budget:ratio > 0.001
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error budget exhausted (30-day window)"
          description: "30-day error rate is {{ $value | humanizePercentage }} (budget: 0.1%)"
"""

Conclusion

A robust observability stack is essential for operating reliable distributed systems:

  • Prometheus for time-series metrics and alerting
  • Grafana for visualization and dashboards
  • Jaeger for distributed tracing across services

Key takeaways:

  • Instrument early, not after problems occur
  • Use business metrics alongside technical metrics
  • Set up SLOs and error budgets
  • Create runbooks linked to alerts
  • Practice observability-driven development

This stack has helped me maintain 99.99% uptime for healthcare systems where reliability isn't optional.
