What metrics matter most for LLM inference monitoring?

The most important metrics are time-to-first-token (TTFT), end-to-end request latency, input and output token throughput, queue length, batch size, GPU utilisation, and GPU memory usage. Token throughput matters more than raw request rate because the amount of work per LLM request varies widely with prompt and output length.

Why is GPU memory a critical signal for LLM systems?

LLM models are loaded entirely into GPU memory, so GPU memory saturation causes failures and performance degradation long before CPU or network show stress. Monitoring GPU memory with tools like the NVIDIA DCGM exporter lets teams detect saturation early, configure alerts, and plan capacity before queuing or out-of-memory errors begin.

What is time-to-first-token and why should you monitor it?

Time-to-first-token (TTFT) is the elapsed time between sending an inference request and receiving the first generated token in a streaming response. Users perceive high TTFT as a model that is hanging or slow to start, making it the primary latency indicator for streaming LLM interfaces even when total generation time remains acceptable.

How does distributed tracing help diagnose slow LLM requests?

Distributed tracing records timing and causality for each stage of an LLM request — retrieval, tool calls, safety filters, and model execution. By correlating spans from the API gateway to the inference server, engineers can pinpoint which stage is slow or failing without guesswork or expensive log searches.

What is the difference between Prometheus and OpenTelemetry for LLM observability?

Prometheus is a pull-based metrics system that scrapes numeric time-series data from endpoints at a fixed interval, suited to fast aggregation and alerting. OpenTelemetry is a vendor-neutral instrumentation framework covering metrics, traces, and logs through a unified SDK and Collector pipeline. Most production LLM stacks use both together, routing OpenTelemetry metrics to Prometheus and traces to a backend such as Tempo or Jaeger.

Observability for LLM Systems: Metrics, Traces, Logs, and Testing in Production

End-to-end observability strategy for LLM inference and LLM applications

Page content

LLM systems fail in ways that traditional API monitoring cannot surface — queues fill silently, GPU memory saturates long before CPU looks busy, and latency blows up at the batching layer rather than the application layer. This guide covers an end-to-end observability strategy for LLM inference and LLM applications: what to measure, how to instrument it with Prometheus, OpenTelemetry, and Grafana, and how to deploy the telemetry pipeline at scale.

observability monitoring dashboard happy user

TL;DR (Executive summary)

LLM systems degrade in ways that classic “HTTP latency + error rate” monitoring cannot explain. Production-grade Observability for LLM systems must answer, quickly and defensibly:

Whether user experience is degrading (tail latency, time-to-first-token, inter-token latency, errors and aborts).
Where time is spent (queueing vs batching vs model execution; retrieval/tools/safety filters vs inference).
What saturates first (GPU utilisation and memory pressure, KV-cache/queue pressure, CPU tokenisation).
How cost and capacity drift (tokens per request, tokens/sec per GPU, cache hit rate, wasted generations).
Whether telemetry is safe to store (prompts can contain PII; prevent sensitive leakage into logs/attributes).

The most resilient design is a multi-signal pipeline:

Metrics for fast detection and capacity planning (Prometheus + PromQL; optional long-term storage via Thanos/Cortex/Mimir/VictoriaMetrics).
Traces for request-level causality (OpenTelemetry with OTLP; backends such as Tempo/Jaeger/Zipkin/Elastic APM).
Logs for context, correlated to traces (Loki/Elastic/OpenSearch), designed for low-cardinality metadata.
Profiling for CPU/memory hotspots and tail latency (Grafana Pyroscope).
Synthetic + load tests to detect regressions before users do (Grafana k6; blackbox-style probing).
SLOs to measure user outcomes and drive actionable alerting (error budgets; burn-rate style).

What makes observability for LLM systems different

Scope note: the target LLM framework is unspecified. Examples in this article cover common servers/frameworks (Triton, vLLM, TGI, LangChain/LangSmith) and remain applicable to other stacks by substituting equivalent metrics and spans.

LLMs introduce operational behaviours that differ from conventional web services:

Variable work per request: token counts (input/output) vary widely, so “requests per second” can look stable while token throughput collapses. TGI and vLLM explicitly export token- and token-latency-related telemetry to support this style of monitoring.
Queueing + continuous batching: throughput depends on batching/queue disciplines; queue size and batch size become first-class indicators (TGI exposes both).
Streaming UX: users care about TTFT and inter-token latency at least as much as full response time; OpenTelemetry even standardises server TTFT/time-per-token metrics under GenAI semantic conventions.
GPU pressure dominates failure modes: GPU utilisation and GPU memory (including memory used) are central to reliability; NVIDIA’s DCGM exporter exists specifically to expose GPU telemetry at a Prometheus /metrics endpoint.
Multi-step pipelines: retrieval, tool calls, safety filters, and post-processing mean end-to-end latency is a composition of multiple spans/queues—making distributed tracing and careful metric design essential.

Concrete examples from popular inference servers highlight this:

NVIDIA Triton Inference Server exposes metrics as plain text via /metrics (commonly :8002/metrics) and provides flags to enable/disable metrics and select a metrics port.
vLLM exposes an extensive Prometheus /metrics endpoint with a vllm: prefix; its documentation includes counters for generation tokens and histograms such as time to first token.
Hugging Face TGI documents a /metrics endpoint with queue size, batch size, end-to-end request duration, generated tokens, and queue duration.

Core observability tasks and required LLM telemetry

Observability for LLM systems is easiest to implement when you map tasks → signals → tools, and then constrain cardinality and sampling from day one.

Metrics: For online-serving systems, Prometheus’ own instrumentation guidance highlights query count, errors, and latency as key metrics; LLMs expand this with TTFT, per-token throughput/latency, queue length, batch size, and GPU utilisation.

Traces: Traces are how you attribute latency and failures across retrieval/tools/safety/inference stages; OpenTelemetry frames traces/exporters as a vendor-neutral way to emit and send telemetry to collectors or backends.

Logs: Logs provide human-readable context and “why”, but only remain usable at scale if you avoid indexing unbounded values (example: Loki indexes labels only and stores compressed log chunks in object storage).

Profiling: Continuous profiling captures production CPU/memory behaviour with low-overhead sampling; Grafana Pyroscope is explicitly positioned for this.

Synthetic tests and load tests: Grafana k6 is an open-source load testing tool, and Grafana notes Synthetic Monitoring is powered by k6 and extends beyond simple protocol checks.

SLOs: Google’s SRE guidance defines an SLO as a target value/range for a service level measured by an SLI, and provides guidance for alerting on SLOs (precision/recall/detection time trade-offs).

Key LLM metrics blueprint

Category	Example metric names (real examples)	Type	Why it matters	Example sources
End-to-end latency	`tgi_request_duration`	Histogram	Tail latency is the user experience	TGI exports this explicitly
Time-to-first-token	`vllm:time_to_first_token_seconds` ; `gen_ai.server.time_to_first_token`	Histogram	Streaming/delayed-first-token is often first sign of saturation	vLLM and OTel semconv GenAI
Time per output token	`tgi_request_mean_time_per_token_duration` ; `gen_ai.server.time_per_output_token`	Histogram	Inter-token latency; “feels slow” even if request completes	TGI and OTel semconv GenAI
Token usage/volume	`tgi_request_generated_tokens` ; `gen_ai.client.token.usage`	Histogram / Counter	Cost + capacity are token-driven	TGI and OTel semconv GenAI
Requests	`tgi_request_count` ; `vllm:request_success_total`	Counter	Traffic baseline and outcomes	TGI and vLLM
Queue length	`tgi_queue_size`	Gauge	Queueing predicts latency blowups	TGI
Batch size and batch limits	`tgi_batch_current_size` ; `tgi_batch_current_max_tokens`	Gauge	Throughput–latency trade-offs	TGI
GPU utilisation/memory	`DCGM_*` (exporter-provided)	Gauge	Saturation, OOM risk, scaling trigger	DCGM exporter exposes GPU metrics at `/metrics`
Inference server telemetry endpoint	`:8002/metrics` (Triton default in docs/archives)	—	Standard scrape target for Prometheus	Triton docs

OpenTelemetry GenAI semantic conventions for standardisation

OpenTelemetry provides GenAI semantic conventions (status: “Development”) with standard naming for GenAI metrics such as:

gen_ai.client.token.usage and gen_ai.client.operation.duration
gen_ai.server.request.duration, gen_ai.server.time_per_output_token, and gen_ai.server.time_to_first_token

This standardisation is a practical lever for portable “monitoring LLM models with OpenTelemetry” strategies: emit once and route the same telemetry to OSS or vendor backends later.

Designing the telemetry pipeline

llm observability flowchart

Pull vs push

Prometheus is pull-first. Processes expose metrics in a supported exposition format, and Prometheus scrapes them according to configured scrape jobs.

Push is for exceptions. Prometheus’ “When to use the Pushgateway” guide explicitly recommends the Pushgateway only in limited cases (not as a general push replacement), and the Pushgateway README emphasises it cannot “turn Prometheus into a push-based monitoring system”.

LLM-specific practical pattern:

Use pull for inference servers/exporters (Triton/vLLM/TGI metrics endpoints; DCGM exporter; node metrics).
Use OTLP push for traces/logs/OTel metrics (OpenTelemetry Protocol defines transport/encoding/delivery between sources, collectors, and backends).
Use remote write when scaling beyond a single Prometheus (Prometheus provides remote write tuning guidance; Mimir/Thanos/Cortex provide long-term and/or HA storage options).

Agent vs sidecar vs gateway collectors

OpenTelemetry documents an agent deployment pattern, where telemetry is sent to a Collector running alongside the application or on the same host (sidecar/DaemonSet), then exported.

For Kubernetes, sidecar injection is supported via the OpenTelemetry Operator (annotation-based injection).

Pragmatic rule-of-thumb for LLM stacks:

Use a DaemonSet agent for host-level enrichment and shared pipelines across many pods.
Use a sidecar when you need strict per-workload isolation or dedicated local filtering (common when prompts may contain sensitive data).
Use a gateway collector for centralised tail sampling, batching, retries, and export fan-out.

Sampling and cardinality control

OpenTelemetry clarifies that tail sampling allows sampling decisions using criteria derived from a trace (not possible with head sampling alone).

Prometheus’ instrumentation guidance warns against label overuse, provides a rule-of-thumb to keep cardinality low, and advises redesigning metrics if potential cardinality exceeds ~100.

LLM-specific “cardinality traps” to ban early:

Prompt text, response text, conversation IDs, request IDs as labels/attributes.
Tool argument blobs as span attributes.
Unbounded “user_id” labels.

Prefer bounded dimensions: model, model_family, endpoint, region, status_code, deployment, tenant (only if bounded).

LLM observability tools comparison

Tools mapped to observability tasks

Tool	Metrics	Traces	Logs	Profiling	Synthetic tests	SLOs / alerting	LLM relevance
Prometheus	✅	◻️	◻️	◻️	◻️	✅	Instrumentation guidance + alerting model; pull-based scraping
Grafana	✅ (viz)	✅ (viz)	✅ (viz)	✅	✅	✅	Dashboards are panels over data sources; supports broad data sources
OpenTelemetry	✅	✅	✅	✅ (profiles evolving)	◻️	◻️	OTLP spec + GenAI semantic conventions; vendor-neutral instrumentation
Jaeger	◻️	✅	◻️	◻️	◻️	◻️	Accepts OTLP (gRPC/HTTP) and is a common tracing backend
Grafana Tempo	◻️	✅	◻️	◻️	◻️	◻️	High-scale tracing; can generate metrics from spans via metrics-generator
Grafana Loki	◻️	◻️	✅	◻️	◻️	◻️	Indexes labels only; stores compressed chunks; reduces log cost at scale
Elastic Stack (ELK)	✅	✅	✅	◻️	◻️	✅	Elastic Stack lists Elasticsearch + Kibana foundations; Elastic APM supports OTel integration
DCGM exporter	✅	◻️	◻️	◻️	◻️	◻️	GPU metrics exporter exposing `/metrics` scrape endpoint
Mimir / Thanos / Cortex	✅	◻️	◻️	◻️	◻️	◻️	Long-term/HA Prometheus-compatible metrics storage
Datadog	✅	✅	✅	✅	✅	✅	Accepts OTel traces/metrics/logs; includes sensitive data scanning features
New Relic	✅	✅	✅	✅	✅	✅	Documents OTLP endpoint configuration and supported OTLP/HTTP practices
Honeycomb	✅	✅	✅	◻️	◻️	✅	Supports receiving OTLP over gRPC/HTTP; OTel-first ingestion
LangSmith	◻️	✅	◻️	◻️	◻️	◻️	Supports OpenTelemetry-based tracing for LLM apps

Grafana vs alternatives for visualisation

Grafana dashboards are composed of panels that query data sources (including Loki and Mimir) to produce charts and visualisations.
Kibana provides dashboards/visualisations as the UI layer within the Elastic Stack.
OpenSearch Dashboards provides data visualisation tooling for OpenSearch.
InfluxData’s documentation positions Chronograf as the visualisation component within the Influx ecosystem.

Prometheus vs alternatives for metrics backends

Prometheus local storage defaults: if retention flags aren’t set, retention defaults to 15d (plan retention/cost early).
Grafana Mimir is described as horizontally scalable, HA, multi-tenant long-term storage for Prometheus and OpenTelemetry metrics.
Thanos is described as a highly available Prometheus setup with long-term storage capabilities.
Cortex describes itself as a horizontally scalable, HA, multi-tenant long-term storage solution for Prometheus and OpenTelemetry metrics.
VictoriaMetrics Cloud documents Prometheus remote write integration for long-term storage.
Amazon Managed Service for Prometheus describes a managed offering that scales with ingestion/query needs and supports PromQL and remote write.

Practical implementation cookbook

Metric names and types to implement today

OpenTelemetry GenAI semantic conventions (status: Development) define metric names you can standardise on immediately:

gen_ai.client.token.usage
gen_ai.client.operation.duration
gen_ai.server.request.duration
gen_ai.server.time_per_output_token
gen_ai.server.time_to_first_token

Server-side examples you can scrape right away:

vLLM’s Prometheus endpoint includes counters (e.g., generation tokens total) and histograms (TTFT) and documents a model_name label strategy.
TGI documents metrics including queue size, request duration, generated tokens, and mean time per token.
Triton documents /metrics exposure and metric toggles.

PromQL examples for LLM latency and throughput dashboards

# p95 end-to-end latency for an application histogram
histogram_quantile(
  0.95,
  sum(rate(llm_request_latency_seconds_bucket[5m])) by (le, model)
)

# Error rate percentage (5xx)
100 *
(
  sum(rate(llm_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(llm_requests_total[5m]))
)

# Tokens/sec (output) across all models
sum(rate(llm_tokens_total{direction="out"}[5m]))

# TGI queue size (gauge)
max(tgi_queue_size) by (instance)

# vLLM TTFT p95
histogram_quantile(
  0.95,
  sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le, model_name)
)

Prometheus’ histogram guidance explains that histogram quantiles are computed server-side from buckets using histogram_quantile().

OpenTelemetry instrumentation notes for LLM systems

OTLP is the OpenTelemetry Protocol specifying how telemetry is encoded/transmitted between sources, collectors, and backends.
OpenTelemetry SDK configuration documents environment variables such as OTEL_EXPORTER_OTLP_ENDPOINT (and protocol options) for exporting telemetry.
OpenTelemetry Python contrib documents FastAPI instrumentation support for automatic and manual instrumentation.
GenAI semantic conventions include an opt-in stability mechanism via OTEL_SEMCONV_STABILITY_OPT_IN for GenAI conventions migration.

Short Python example: metrics + traces + logs

The snippet below demonstrates:

Prometheus metrics exposition (/metrics) for “monitoring LLM inference with Prometheus”
OpenTelemetry traces exported via OTLP (vendor-neutral)
Structured logs correlated with trace context, with a privacy-safe default (don’t log raw prompts)

import logging
import time

from fastapi import FastAPI, Request
from pydantic import BaseModel

# Prometheus (pull-based metrics)
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

# OpenTelemetry (OTLP traces)
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

app = FastAPI(title="LLM Inference API", version="1.0.0")
FastAPIInstrumentor.instrument_app(app)

# --- Logging (privacy-safe default) ---
logger = logging.getLogger("llm")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def trace_id_hex() -> str:
    span = trace.get_current_span()
    ctx = span.get_span_context()
    return format(ctx.trace_id, "032x") if ctx.is_valid else ""

# --- Prometheus metrics ---
LLM_REQUESTS = Counter(
    "llm_requests_total",
    "LLM requests total",
    ["route", "status_code", "model"],
)
LLM_LATENCY = Histogram(
    "llm_request_latency_seconds",
    "End-to-end LLM request latency (seconds)",
    ["route", "model"],
    buckets=(0.1, 0.2, 0.35, 0.5, 0.75, 1, 1.5, 2, 3, 5, 8, 13),
)

# --- OpenTelemetry tracer provider ---
resource = Resource.create({"service.name": "llm-inference-api"})
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter())  # configure via OTEL_EXPORTER_OTLP_* env vars
)
tracer = trace.get_tracer(__name__)

class GenerateRequest(BaseModel):
    prompt: str
    model: str = "unspecified"
    max_tokens: int = 256

class GenerateResponse(BaseModel):
    model: str
    output: str
    latency_ms: int

@app.post("/v1/generate", response_model=GenerateResponse)
async def generate(req: GenerateRequest, request: Request):
    route = "/v1/generate"
    start = time.perf_counter()

    with tracer.start_as_current_span("llm.generate") as span:
        # Avoid logging the full prompt; emit safe metadata
        span.set_attribute("gen_ai.request.model", req.model)
        span.set_attribute("gen_ai.request.max_tokens", req.max_tokens)

        # Replace with actual LLM call (Triton/vLLM/TGI client)
        time.sleep(0.15)
        output = "Hello from the model."

        latency_s = time.perf_counter() - start
        LLM_LATENCY.labels(route=route, model=req.model).observe(latency_s)
        LLM_REQUESTS.labels(route=route, status_code="200", model=req.model).inc()

        logger.info(
            {
                "msg": "llm_request_complete",
                "trace_id": trace_id_hex(),
                "model": req.model,
                "latency_ms": int(latency_s * 1000),
                # Do NOT include raw prompt/output unless policy allows it.
            }
        )

        return GenerateResponse(model=req.model, output=output, latency_ms=int(latency_s * 1000))

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Deployment, scaling, security, and troubleshooting

llm dashboard deployment

Deployment options

Deployment option	Best for	Trade-offs
Kubernetes + kube-prometheus-stack (Helm)	Standardised cluster monitoring bundle (Prometheus Operator, dashboards, rules)	CRDs/operator lifecycle management
Kubernetes + OpenTelemetry Collector (DaemonSet/sidecar)	Standardised OTLP pipelines; local sensitive filtering	Needs sampling/limit tuning
Docker Compose	Rapid prototyping on a single host	Not HA; storage is manual
systemd / VM installs	Bare-metal GPU fleets and traditional ops	Manual discovery and config
Managed services (Grafana Cloud / Datadog / New Relic / AMP)	Fast time-to-value; managed scaling	Cost and governance; vendor lock-in trade-offs

Scaling and retention: practical constraints

Prometheus local storage: without explicit size/time flags, the retention time defaults to 15d.
Prometheus remote write: Prometheus documents remote write tuning for scaling beyond “sane defaults”.
Grafana Tempo: positioned as a high-scale tracing backend and can generate metrics from spans using the metrics-generator (remote writes to a Prometheus data source).
Loki storage: Loki’s docs emphasise label-only indexing and compressed chunk storage (object storage), making label strategy central to scale and cost.

Security and privacy: prompts can contain PII

OpenTelemetry’s security guidance stresses that telemetry collection can inadvertently capture sensitive/personal information; you are responsible for handling it appropriately.

Prometheus’ security model warns that Prometheus endpoints should not be exposed to publicly accessible networks (like the internet) because they serve information about monitored systems.

Operational privacy controls that keep “observability for LLM systems” safe:

Default to not logging raw prompts/responses; log token counts, model name, latency, and trace IDs instead.
Redact/drop sensitive attributes in collectors/pipelines (Collector-level filtering is a common approach across ecosystems).
Enforce RBAC and retention policies for logs/traces; consider sensitive-data scanners where appropriate (e.g., vendors document scanners for telemetry).

Troubleshooting checklist

If your Grafana dashboard for LLM latency looks wrong, debug in this order:

Ingestion health
- Prometheus: validate scrape success and config semantics (Prometheus configuration defines scrape jobs/instances).
- OTLP: confirm exporter endpoint configuration (SDKs use OTEL_EXPORTER_OTLP_ENDPOINT, protocol settings).
Schema mismatch
- Dashboard expects model, but your server emits model_name (vLLM explicitly documents model_name labels).
Cardinality explosion
- Someone labelled by request IDs/prompt hashes; Prometheus warns labelsets increase RAM/CPU/disk/network costs and provides cardinality guidance.
Histogram misuse
- Ensure you compute quantiles from _bucket series with rate() and le; Prometheus explains histogram quantile calculation trade-offs.
Trace sampling gaps
- If you head-sample too aggressively, rare slow/error traces disappear; tail sampling retains “important” traces based on full trace criteria.
Tempo span-metrics issues
- If using Tempo metrics-generator and span-metrics, confirm it is enabled and tuned (Tempo documents metrics-generator and span-metrics processors; troubleshooting exists for generator issues).
GPU metrics absent
- Confirm DCGM exporter is deployed and /metrics is reachable (DCGM exporter exposes GPU metrics over HTTP for Prometheus).