What is Prometheus and why should I use it?

Prometheus is an open-source monitoring and alerting system designed for reliability and scalability. It excels at collecting time-series metrics from various targets, storing them efficiently, and providing powerful querying capabilities through PromQL. It’s ideal for microservices, containers, and cloud-native applications.

How does Prometheus differ from traditional monitoring tools?

Prometheus uses a pull-based model where it scrapes metrics from targets at regular intervals, unlike push-based systems. It has a dimensional data model with labels, making it flexible for complex queries. It’s designed for dynamic cloud environments and integrates seamlessly with Kubernetes and container orchestration platforms.

What are Prometheus exporters?

Exporters are components that expose metrics from third-party systems in a format Prometheus can scrape. Popular exporters include Node Exporter for hardware/OS metrics, Blackbox Exporter for probing endpoints, and database-specific exporters for MySQL, PostgreSQL, etc. They act as bridges between Prometheus and systems that don’t natively expose metrics.

Can Prometheus be used for logging and tracing?

Not on its own. Prometheus focuses specifically on metrics and does not store logs or traces. For a complete observability stack, combine it with Loki for logs and Jaeger or Tempo for distributed tracing.

How do I scale Prometheus for large deployments?

For large-scale deployments, use Prometheus federation to create hierarchical setups, implement Thanos or Cortex for long-term storage and global querying, or use remote storage solutions. Implement proper retention policies and use recording rules to pre-aggregate frequently queried data.

What is PromQL and how difficult is it to learn?

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data. While it has a learning curve, basic queries are straightforward. Start with simple metric queries and gradually learn aggregation operators, functions, and advanced filtering. The official documentation provides excellent examples.

Prometheus Monitoring: Complete Setup & Best Practices

Set up robust infrastructure monitoring with Prometheus

Prometheus has become the de facto standard for monitoring cloud-native applications and infrastructure, offering metrics collection, querying, and integration with visualization tools.

What is Prometheus?

Prometheus is an open-source monitoring and alerting toolkit originally developed at SoundCloud in 2012 and now a Cloud Native Computing Foundation (CNCF) graduated project. It’s specifically designed for reliability and scalability in dynamic cloud environments, making it the go-to solution for monitoring microservices, containers, and Kubernetes clusters.

Key Features

Time-Series Database: Prometheus stores all data as time-series, identified by metric names and key-value pairs (labels), enabling flexible and powerful querying capabilities.

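Pull-Based Model: Unlike traditional push-based systems, Prometheus actively scrapes metrics from configured targets at specified intervals. Because the server initiates every scrape, a failed scrape is itself a signal that a target is down (recorded in the up metric), and targets do not need to know where the monitoring server lives.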

PromQL Query Language: A powerful functional query language allows you to slice and dice your metrics data in real-time, performing aggregations, transformations, and complex calculations.

Service Discovery: Automatic discovery of monitoring targets through various mechanisms including Kubernetes, Consul, EC2, and static configurations.

No External Dependencies: Prometheus operates as a single binary with no required external dependencies, simplifying deployment and reducing operational complexity.

Built-in Alerting: Prometheus evaluates alerting rules and sends firing alerts to AlertManager, a separate component that provides deduplication, grouping, and routing to notification channels like email, PagerDuty, or Slack.
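To make the data model from the feature list concrete, here is a minimal Python sketch of how a series is identified by its metric name plus its label set. The series_key and format_series helpers are hypothetical names used for illustration only; this is not how Prometheus stores data internally.

```python
def series_key(name, labels):
    """Identity of a time series: metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


def format_series(name, labels):
    """Render the identity in the familiar name{label="value"} notation."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


# Same metric name, different label values: two distinct series.
a = series_key("node_cpu_seconds_total", {"cpu": "0", "mode": "idle"})
b = series_key("node_cpu_seconds_total", {"cpu": "0", "mode": "user"})
# Same labels written in a different order: the same series.
c = series_key("node_cpu_seconds_total", {"mode": "idle", "cpu": "0"})

print(a != b)  # True
print(a == c)  # True
print(format_series("node_cpu_seconds_total", {"mode": "idle", "cpu": "0"}))
```

Every distinct combination of label values creates a separate series, which is why label choice matters so much for storage and query cost.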

Architecture Overview

Understanding Prometheus architecture is crucial for effective deployment. The main components include:

Prometheus Server: Scrapes and stores metrics, evaluates rules, and serves queries
Client Libraries: Instrument application code to expose metrics
Exporters: Bridge third-party systems to Prometheus format
AlertManager: Handles alerts and notifications
Pushgateway: Accepts metrics from short-lived jobs that can’t be scraped

The typical data flow: Applications expose metrics endpoints → Prometheus scrapes these endpoints → Data is stored in time-series database → PromQL queries retrieve and analyze data → Alerts are generated based on rules → AlertManager processes and routes notifications.
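The first step of that flow, an application exposing a metrics endpoint, can be sketched with nothing but the Python standard library. This is an illustrative sketch, not a recommended implementation: the metric name app_http_requests_total, the REQUESTS dictionary, and port 8000 are assumptions, and a real application would normally use an official client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status code).
REQUESTS = {("get", "200"): 0, ("post", "500"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_http_requests_total Total HTTP requests handled.",
        "# TYPE app_http_requests_total counter",
    ]
    for (method, code), value in sorted(counts.items()):
        lines.append(
            f'app_http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; Prometheus would scrape http://<host>:<port>/metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a scrape target would let Prometheus collect these counters on its normal scrape interval.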

When deploying infrastructure on Ubuntu 24.04, Prometheus provides an excellent foundation for comprehensive monitoring.

Installing Prometheus on Ubuntu

Let’s walk through installing Prometheus on a Linux system. We’ll use Ubuntu as the example, but the process is similar for other distributions.

Download and Install

First, create a dedicated user for Prometheus:

sudo useradd --no-create-home --shell /bin/false prometheus

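Download a Prometheus release. The commands below use v2.48.0; check the project's releases page for the current version and adjust the file names accordingly: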

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar xvf prometheus-2.48.0.linux-amd64.tar.gz
cd prometheus-2.48.0.linux-amd64

Copy binaries and create directories:

sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus
sudo chown prometheus:prometheus /etc/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

sudo cp -r consoles /etc/prometheus
sudo cp -r console_libraries /etc/prometheus
sudo cp prometheus.yml /etc/prometheus/prometheus.yml
sudo chown -R prometheus:prometheus /etc/prometheus

For package management on Ubuntu, refer to our comprehensive Ubuntu Package Management guide.

Configure Prometheus

Edit /etc/prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - 'alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

Create Systemd Service

Create /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries \
  --storage.tsdb.retention.time=30d

[Install]
WantedBy=multi-user.target

Start and enable Prometheus:

sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
sudo systemctl status prometheus

Access the Prometheus web interface at http://localhost:9090.
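Besides the web interface, the same server answers queries over its HTTP API at /api/v1/query. The following Python sketch shows one way to call it and read an instant-vector result; the helper names are hypothetical, and the query function assumes a Prometheus server is reachable at the URL you pass in.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """URL for an instant query against the /api/v1/query endpoint."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Map each series' sorted label pairs to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    results = {}
    for item in payload["data"]["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results


def query(base_url, promql):
    """Run the query over HTTP; needs a reachable Prometheus server."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, query("http://localhost:9090", "up") should return one entry per scrape target, with 1.0 for targets that were scraped successfully and 0.0 for those that were not.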

Setting Up Node Exporter

Node Exporter exposes hardware and OS metrics for Linux systems. Install it to monitor your servers:

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Create systemd service /etc/systemd/system/node_exporter.service:

[Unit]
Description=Node Exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target

Start Node Exporter:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

Node Exporter now exposes metrics on port 9100.

Understanding PromQL

PromQL (Prometheus Query Language) is the heart of querying Prometheus data. Here are essential query patterns:

Basic Queries

Select all time-series for a metric:

node_cpu_seconds_total

Filter by label:

node_cpu_seconds_total{mode="idle"}

Multiple label filters:

node_cpu_seconds_total{mode="idle",cpu="0"}

Range Vectors and Aggregations

Calculate rate over time:

rate(node_cpu_seconds_total{mode="idle"}[5m])

Sum across all CPUs:

sum(rate(node_cpu_seconds_total{mode="idle"}[5m]))

Group by label:

sum by (mode) (rate(node_cpu_seconds_total[5m]))
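To build intuition for what rate() computes, here is a simplified Python sketch: the per-second increase of a counter across the samples in a window, with counter resets treated as a restart from zero. It is an approximation under stated assumptions; the real function also extrapolates towards the window boundaries, which this sketch leaves out, and simple_rate is a hypothetical name.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _timestamp, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. process restart): count from zero again.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# A counter scraped every 15s that restarts between the 30s and 45s samples.
window = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(window))  # total increase of 100 over 60 seconds
```

The reset handling is why rate() should be applied to raw counters before any aggregation such as sum: once series are summed, individual resets can no longer be detected.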

Practical Examples

CPU usage percentage:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
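The arithmetic behind that expression can be checked by hand. In this small Python sketch, each input is the idle rate of one core (idle seconds per second, so between 0 and 1); the function name and the sample values are made up for illustration.

```python
def cpu_usage_percent(idle_rates):
    """idle_rates: per-core idle seconds per second for a single instance."""
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Four cores that were idle 100%, 50%, 75% and 75% of the time.
print(cpu_usage_percent([1.0, 0.5, 0.75, 0.75]))  # 25.0
```

Averaging the idle rate across cores first, then subtracting from 100, gives one usage figure per instance regardless of how many cores it has.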

Memory usage:

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Disk usage:

(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

Network traffic rate:

rate(node_network_receive_bytes_total[5m])

Docker Deployment

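Running Prometheus in Docker containers offers flexibility and easier management. Containers on the same Compose network reach each other by service name, so when reusing the prometheus.yml from earlier, point the targets at node_exporter:9100 and alertmanager:9093 instead of localhost.

Create docker-compose.yml: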

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    ports:
      - "9090:9090"
    restart: unless-stopped

  node_exporter:
    image: prom/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    volumes:
      - '/:/host:ro,rslave'
    ports:
      - "9100:9100"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    restart: unless-stopped

volumes:
  prometheus_data:
  alertmanager_data:

Start the stack (with the Compose v2 plugin, the equivalent command is docker compose up -d):

docker-compose up -d

Kubernetes Monitoring

Prometheus excels at monitoring Kubernetes clusters. The kube-prometheus-stack Helm chart provides a complete monitoring solution.

Install using Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack

This installs:

Prometheus Operator
Prometheus instance
AlertManager
Grafana
Node Exporter
kube-state-metrics
Pre-configured dashboards and alerts

Access Grafana:

kubectl port-forward svc/prometheus-grafana 3000:80

Default credentials: admin/prom-operator (the chart default; change the password after the first login)

For various Kubernetes distributions, the deployment process is similar with minor adjustments for platform-specific features.

Setting Up Alerting

AlertManager handles alerts sent by Prometheus. Configure alert rules and notification channels.

Alert Rules

Create /etc/prometheus/alert_rules.yml:

groups:
  - name: system_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is below 15% on {{ $labels.mountpoint }}"

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 2 minutes"
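The for: clause in these rules means an alert does not fire the moment its expression becomes true; it stays pending until the condition has held for the whole duration. A simplified Python sketch of that state progression follows. The function name is hypothetical, and evaluation is assumed to happen at perfectly regular intervals.

```python
def alert_states(condition_results, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    condition_results: one boolean per evaluation, oldest first.
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, is_true in enumerate(condition_results):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluated every 30s with "for: 2m": fires at the fifth consecutive true result.
print(alert_states([True] * 6, eval_interval_s=30, for_s=120))
```

A single false evaluation resets the timer, which is how for: filters out short spikes at the cost of delaying the notification.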

AlertManager Configuration

Create /etc/prometheus/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'your-password'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'team-pagerduty'
    - matchers:
        - severity="warning"
      receiver: 'team-slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        headers:
          Subject: '{{ .GroupLabels.alertname }}: {{ .Status | toUpper }}'

  - name: 'team-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'team-pagerduty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
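The routing tree in this configuration can be read as: try each child route in order, use the first one whose label conditions all match, and otherwise fall back to the top-level receiver. A simplified Python sketch of that decision is shown below. It deliberately ignores features such as continue, nested routes, and regular-expression matching, and pick_receiver is a hypothetical helper, not part of AlertManager.

```python
def pick_receiver(alert_labels, child_routes, default_receiver):
    """Return the receiver of the first child route whose conditions all match."""
    for route in child_routes:
        conditions = route["match"]
        if all(alert_labels.get(name) == value for name, value in conditions.items()):
            return route["receiver"]
    return default_receiver


# Mirrors the two severity-based child routes from the configuration.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "team-pagerduty"},
    {"match": {"severity": "warning"}, "receiver": "team-slack"},
]

print(pick_receiver({"alertname": "InstanceDown", "severity": "critical"}, ROUTES, "team-email"))
print(pick_receiver({"alertname": "HighCPUUsage", "severity": "warning"}, ROUTES, "team-email"))
print(pick_receiver({"alertname": "Custom", "severity": "info"}, ROUTES, "team-email"))
```

With the rules defined earlier, InstanceDown and DiskSpaceLow would page, the CPU and memory warnings would go to Slack, and anything without a matching severity would land in the email receiver.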

Integration with Grafana

While Prometheus has a basic web interface, Grafana provides superior visualization capabilities for creating comprehensive dashboards.

Add Prometheus as Data Source

Open Grafana and navigate to Configuration → Data Sources (Connections → Data sources in newer Grafana versions)
Click “Add data source”
Select “Prometheus”
Set URL to http://localhost:9090 (or your Prometheus server)
Click “Save & Test”

Popular Dashboard IDs

Import pre-built dashboards from grafana.com:

Node Exporter Full (ID: 1860): Comprehensive Linux metrics
Kubernetes Cluster Monitoring (ID: 7249): K8s overview
Docker Container Monitoring (ID: 193): Container metrics
Prometheus Stats (ID: 2): Prometheus internal metrics

Creating Custom Dashboards

Create panels using PromQL queries:

{
  "title": "CPU Usage",
  "targets": [{
    "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
  }]
}

Popular Exporters

Extend Prometheus monitoring with specialized exporters:

Blackbox Exporter

Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP:

scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115
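The three relabeling steps are easier to follow when traced on a single target. This Python sketch mimics them in order and then builds the URL Prometheus would end up scraping; the function names are hypothetical, and the sketch assumes the exporter is served over plain HTTP on localhost:9115 as in the configuration.

```python
from urllib.parse import urlencode


def blackbox_relabel(target, exporter_address="localhost:9115"):
    """Apply the three relabel steps from the scrape config to one target."""
    labels = {"__address__": target}
    # 1. Copy the original target into the "target" URL parameter.
    labels["__param_target"] = labels["__address__"]
    # 2. Keep the probed URL as the instance label for readable graphs.
    labels["instance"] = labels["__param_target"]
    # 3. Point the actual scrape at the Blackbox Exporter itself.
    labels["__address__"] = exporter_address
    return labels


def scrape_url(labels, module="http_2xx", metrics_path="/probe"):
    """URL that ends up being scraped after relabeling."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}{metrics_path}?{query}"


labels = blackbox_relabel("https://example.com")
print(labels["instance"])  # https://example.com
print(scrape_url(labels))
```

The net effect is that Prometheus talks only to the exporter, while the resulting series still carry the probed URL in their instance label.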

Database Exporters

mysqld_exporter: MySQL/MariaDB metrics
postgres_exporter: PostgreSQL metrics
mongodb_exporter: MongoDB metrics
redis_exporter: Redis metrics

Application Exporters

nginx_exporter: NGINX web server metrics
apache_exporter: Apache HTTP server metrics
haproxy_exporter: HAProxy load balancer metrics

Cloud Exporters

cloudwatch_exporter: AWS CloudWatch metrics
stackdriver_exporter: Google Cloud metrics
azure_exporter: Azure Monitor metrics

Best Practices

Data Retention

Configure appropriate retention based on your needs:

--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
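To pick sensible values for these flags, the Prometheus storage documentation offers a rough capacity formula: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The Python sketch below applies it; the function names and the example workload (150,000 active series scraped every 15 seconds) are assumptions chosen for illustration.

```python
def ingestion_rate(active_series, scrape_interval_s):
    """Samples ingested per second for a given number of active series."""
    return active_series / scrape_interval_s


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2):
    """Rough disk requirement: retention seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


rate = ingestion_rate(active_series=150_000, scrape_interval_s=15)
needed = estimate_disk_bytes(retention_days=30, samples_per_second=rate)
print(f"{rate:.0f} samples/s, about {needed / 1e9:.1f} GB for 30 days")
```

Under these assumptions the estimate comes to roughly 52 GB, so a 50GB size limit would start trimming data slightly before the 30-day time limit is reached.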

Recording Rules

Pre-calculate frequently queried expressions:

groups:
  - name: example_rules
    interval: 30s
    rules:
      - record: job:node_cpu_utilization:avg
        expr: 100 - (avg by (job) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Label Management

Keep label cardinality low
Use consistent naming conventions
Avoid high-cardinality labels (user IDs, timestamps)
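A quick way to see why high-cardinality labels hurt is to multiply the number of possible values per label, which gives the worst-case number of series for a single metric. The label names and counts in this Python sketch are made-up examples.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of values per label."""
    return prod(label_value_counts.values())


reasonable = {"method": 5, "status": 10, "instance": 20}
with_user_id = {**reasonable, "user_id": 100_000}

print(worst_case_series(reasonable))    # 1000
print(worst_case_series(with_user_id))  # 100000000
```

Adding one unbounded label turns a thousand potential series into a hundred million, which is why identifiers like user IDs belong in logs or traces rather than in metric labels.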

Security

Enable authentication and HTTPS
Restrict access to Prometheus API
Use network policies in Kubernetes
Implement RBAC for sensitive metrics

High Availability

Run multiple Prometheus instances
Use Thanos or Cortex for long-term storage
Implement federation for hierarchical setups

Troubleshooting Common Issues

High Memory Usage

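Reduce scrape frequency
Reduce the number of active series by dropping unused metrics and high-cardinality labels
Decrease retention period (this mainly frees disk space; memory is driven mostly by the number of active series)
Optimize PromQL queries
Implement recording rules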

Missing Metrics

Check target status in /targets
Verify network connectivity
Validate scrape configuration
Check exporter logs

Slow Queries

Use recording rules for complex aggregations
Optimize label filters
Reduce time range
If using remote storage, consult that backend's own query tuning options

Performance Optimization

Query Optimization

# Coarse: collapses every series into a single total, hiding which requests dominate
sum(rate(http_requests_total[5m]))

# More useful: aggregate away noisy labels but keep the ones you need
sum by (status, method) (rate(http_requests_total[5m]))
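What the by clause does to the result can be sketched in a few lines of Python: series are grouped on the labels you keep, every other label is dropped, and the values in each group are summed. The sum_by helper and the sample rates are hypothetical and only illustrate the grouping logic.

```python
from collections import defaultdict


def sum_by(series, keep_labels):
    """Sum values per combination of the kept labels.

    series: list of (labels_dict, value). A missing label counts as "".
    """
    totals = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels.get(name, "")) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Per-instance request rates (requests per second).
rates = [
    ({"instance": "a", "method": "get", "status": "200"}, 2.0),
    ({"instance": "b", "method": "get", "status": "200"}, 3.0),
    ({"instance": "a", "method": "post", "status": "500"}, 0.5),
]

print(sum_by(rates, ["status", "method"]))  # one total per status/method pair
print(sum_by(rates, []))                    # a single grand total
```

Three input series collapse into two groups when grouped by status and method, and into one when no labels are kept, so the labels in the by clause directly control how many series the query returns.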

Resource Limits

For Kubernetes deployments:

resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

Conclusion

Prometheus provides a robust, scalable monitoring solution for modern infrastructure. Its pull-based architecture, powerful query language, and extensive ecosystem of exporters make it ideal for monitoring everything from bare-metal servers to complex Kubernetes clusters.

By combining Prometheus with Grafana for visualization and AlertManager for notifications, you create a comprehensive observability platform capable of handling enterprise-scale monitoring requirements. The active community and CNCF backing ensure continued development and support.

Start with basic metrics collection, gradually add exporters for your specific services, and refine your alerting rules based on real-world experience. Prometheus scales with your infrastructure, from single-server deployments to multi-datacenter monitoring architectures.