Metrics Monitoring and Alerting System (Datadog, Prometheus)

Overview
Introduction
- A metrics monitoring and alerting system (e.g., Datadog, Prometheus) allows engineering teams to collect and visualize telemetry from their infrastructure and applications, and to trigger alarms based on its real-time operational health.
- Designing a robust monitoring platform revolves around the write-heavy nature of telemetry. The system must ingest millions of data points per second, compress them efficiently for long-term storage, and evaluate complex alerting rules against sliding time windows without falling behind.
Requirements
- Functional Requirements
- Metric Ingestion: The system must accept infrastructure and application metrics (e.g., CPU usage, request latency) paired with tags/labels (e.g., host:A, env:prod).
- Querying & Visualization: Users can build dashboards to query, aggregate, and visualize metrics over arbitrary time ranges.
- Alerting: Users can configure thresholds (e.g., "Alert if CPU > 90% for 5 minutes").
- Notifications: The system must dispatch alerts to external channels (e.g., PagerDuty, Slack, Email).
- Non-Functional Requirements
- Massive Scalability: Support ingesting 10M+ data points per second.
- High Write Availability: 99.99% uptime for the ingestion path; dropping metrics during an incident blinds the customer.
- Low Latency Alerting: Alerts must trigger within seconds of the threshold being breached.
- Query Performance: Dashboard queries spanning days of data should return in < 1 second.
- Storage Efficiency: Metrics volume grows rapidly with infrastructure size and tag cardinality; aggressive compression and downsampling are required to keep storage costs bounded.
Data Model
To handle the contrasting demands of massive time-series ingestion and relational configuration management, a Polyglot Persistence strategy is required.
- Time-Series Database (TSDB) (e.g., Cassandra / custom HBase): The core storage engine for raw metric data. It is optimized for heavy append-only writes and sequential range scans. Time-series data is fundamentally immutable (metrics from the past don't change); a keying sketch follows this list.
- Relational DB (PostgreSQL): Serves as the "Source of Truth" for system metadata. It stores user accounts, organization structures, dashboard layouts, and Alerting Rules. ACID compliance ensures robust configuration management.
- In-Memory Cache (Redis): Used to cache frequently accessed dashboard queries, rate-limit API ingestion requests, and store the "last known state" of active alerts to prevent duplicate notifications.
- Message Broker (Apache Kafka): While not a database, Kafka acts as the durable ingestion buffer. It absorbs sudden spikes in metric traffic (backpressure), preventing the TSDB from being overwhelmed and decoupling the ingestion path from the alerting processors.
- Object Storage (S3 / GCS): Serves as the cold storage layer for downsampled, historical metrics.
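To make the time-series side concrete, here is a minimal sketch of how a series could be keyed. This is an illustration, not the actual Cassandra schema: it assumes a series' identity is its metric name plus its canonicalized tag set.

```python
# Illustrative data model: a series is identified by metric name + sorted
# tags; raw points are (timestamp, value) pairs appended under that ID.
import hashlib
from dataclasses import dataclass, field

def series_id(metric: str, tags: list[str]) -> str:
    """Same metric + same tag set always resolve to the same series ID."""
    canonical = metric + "|" + ",".join(sorted(tags))
    return hashlib.sha1(canonical.encode()).hexdigest()

@dataclass
class Series:
    metric: str
    tags: list[str]
    points: list[tuple[int, float]] = field(default_factory=list)  # (unix_ts, value)

# Tag order must not matter: both calls resolve to the same series.
assert series_id("cpu.idle", ["env:prod", "host:web-01"]) == series_id(
    "cpu.idle", ["host:web-01", "env:prod"]
)
```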
API Design
The API is the primary boundary between the backend and its clients: lightweight agents (such as StatsD or the Datadog Agent) and the web UI.
- Ingest Metrics:
- POST /v1/series
- Accepts a batch payload of metrics: [{"metric": "cpu.idle", "points": [[timestamp, value]], "tags": ["env:prod", "host:web-01"]}] (a minimal client sketch follows this list)
- Query Metrics:
- GET /v1/query?query={expression}&from={start}&to={end}
- Evaluates Datadog-style query expressions (e.g., avg:cpu.idle{env:prod} by {host}) and returns aggregated time-series arrays.
- Configure Alert:
- POST /v1/alerts
- Creates a new alerting rule from a payload containing the query, threshold, evaluation window, and notification target.
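As referenced above, a minimal agent-side client for the ingestion endpoint might look like the following; the host, auth header, and batching policy are illustrative assumptions.

```python
# Hypothetical client for POST /v1/series; agents batch points to
# amortize HTTP/TLS overhead rather than sending one point per request.
import time
import requests

INGEST_URL = "https://metrics.example.com/v1/series"  # placeholder host

def flush(series_batch: list[dict], api_key: str) -> None:
    resp = requests.post(
        INGEST_URL,
        json={"series": series_batch},
        headers={"X-API-KEY": api_key},  # assumed auth header
        timeout=5,
    )
    resp.raise_for_status()  # surface 4xx (quota/validation) and 5xx errors

flush(
    [{"metric": "cpu.idle",
      "points": [[int(time.time()), 72.5]],
      "tags": ["env:prod", "host:web-01"]}],
    api_key="...",
)
```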
High Level Design
At a high level, the system architecture separates the concerns of ingesting massive data streams, evaluating real-time alerts, serving user queries, and managing configuration state. The components along these four paths are:
- Distributed Agents: Lightweight processes running on customer infrastructure that push raw metrics to the system.
- LB / Gateway: The entry point that handles load balancing, TLS termination, and request routing.
- Ingestion Service: Validates incoming payloads and acts as the initial gatekeeper.
- Quota & Rate Limits (Redis): Before accepting data, the Ingestion Service synchronously checks Redis to ensure the customer hasn't exceeded their allowed metric limits or cardinality caps.
- Kafka: Acts as the central, durable message buffer. It absorbs sudden traffic spikes, completely decoupling the fast ingestion API from the slower database writers.
- TSDB Writers: Worker processes that consume batches of metrics from Kafka, compress them, and write them sequentially to the primary time-series database (see the writer sketch after this list).
- Cassandra: The core Time-Series Database holding "hot" (recent and highly granular) metric data.
- Spark Jobs & Historical Metrics (S3): Background batch processes that continuously pull older data from Cassandra, downsample it (e.g., from 10-second to 1-hour resolution), and move it to cheaper object storage (S3) for long-term retention.
- Real-Time Alert Engine (Flink): A stateful stream processing engine that hooks directly into Kafka. It maintains sliding time windows in memory and evaluates incoming metrics against active rules (the core evaluation logic is sketched after this list).
- Notification Service: Receives trigger events from Flink when a threshold is breached and prepares the outbound payload.
- Deduplication (Redis): The Notification Service checks a state key in Redis to ensure it doesn't repeatedly spam users for an ongoing incident (see the dedup sketch after this list).
- PagerDuty / Slack / Email: The external third-party integrations where the final alert is dispatched.
- Web UI: The frontend application where engineers view dashboards and graphs.
- Query Engine: The intelligent router for read requests. When a query arrives via the Gateway, it analyzes the requested time window.
- Smart Routing: If the UI requests data from the last 24 hours, the Query Engine fetches it directly from Cassandra. If the user requests a 6-month trend, it automatically routes the query to the pre-aggregated Historical Metrics (S3) bucket to ensure fast load times (see the routing sketch after this list).
- Metadata Service: A standard CRUD web service that handles user configuration requests from the Web UI.
- Metadata DB (PostgreSQL): The relational "Source of Truth" storing user accounts, billing tiers, dashboard layouts, and the exact definitions of the alerting rules.
- CDC (Change Data Capture): A background process that monitors the PostgreSQL transaction log. The moment a user creates or updates an alerting rule, the CDC pipeline broadcasts that change directly to the Real-Time Alert Engine (Flink), ensuring the stream processor always has the latest rules in memory without polling the database (see the CDC sketch after this list).
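The sketches below make several of the steps above concrete. First, the synchronous Redis checks the Ingestion Service performs; the key names, window sizes, and limits are illustrative assumptions.

```python
# Fixed-window rate limiting plus an approximate cardinality cap.
import time
import redis

r = redis.Redis()

def within_rate_limit(org_id: str, num_points: int, per_min_cap: int = 1_000_000) -> bool:
    """Reject the batch once the org's per-minute point budget is exhausted."""
    key = f"quota:{org_id}:{int(time.time()) // 60}"   # one counter per minute
    count = r.incrby(key, num_points)
    if count == num_points:                            # first hit in this window
        r.expire(key, 120)                             # let old windows expire
    return count <= per_min_cap

def within_cardinality_cap(org_id: str, series_ids: list[str], cap: int = 100_000) -> bool:
    """HyperLogLog keeps an approximate count of distinct series per org."""
    key = f"cardinality:{org_id}"
    r.pfadd(key, *series_ids)
    return r.pfcount(key) <= cap
```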
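Next, a TSDB writer. The topic name and tsdb_write() are stand-ins, and a real writer would compress points (e.g., Gorilla-style delta-of-delta encoding) before the append.

```python
# Consume metric batches from Kafka and write them sequentially; offsets
# are committed only after a durable write (at-least-once delivery).
import json
from kafka import KafkaConsumer  # kafka-python

def tsdb_write(points: list[dict]) -> None:
    """Placeholder for the compressed, sequential write into Cassandra."""
    ...

consumer = KafkaConsumer(
    "metrics.ingested",                      # assumed topic name
    bootstrap_servers="kafka:9092",
    group_id="tsdb-writers",
    enable_auto_commit=False,
    value_deserializer=lambda raw: json.loads(raw),
)

while True:
    records = consumer.poll(timeout_ms=1000, max_records=5000)
    batch = [msg.value for msgs in records.values() for msg in msgs]
    if not batch:
        continue
    tsdb_write(batch)   # write first...
    consumer.commit()   # ...then commit the offsets
```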
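The rollup step the Spark jobs perform, shown as plain Python for clarity (a real job would express this as a distributed group-by):

```python
# Collapse raw points into fixed-width buckets by averaging, e.g.
# 10-second points into one point per hour.
from collections import defaultdict

def downsample(points: list[tuple[int, float]],
               bucket_sec: int = 3600) -> list[tuple[int, float]]:
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_sec].append(value)   # align ts to bucket start
    return sorted((start, sum(vs) / len(vs)) for start, vs in buckets.items())
```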
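A simplified version of the per-(rule, series) check the Flink engine runs. Real jobs keep this as keyed state with event-time windows, but the core "above threshold for N seconds" logic is just:

```python
from typing import Optional

class ThresholdRule:
    """Fires once the metric has stayed above `threshold` for `window_sec`."""

    def __init__(self, threshold: float, window_sec: int):
        self.threshold = threshold
        self.window_sec = window_sec
        self.breach_start: Optional[int] = None   # when the current breach began

    def observe(self, ts: int, value: float) -> bool:
        if value <= self.threshold:
            self.breach_start = None              # streak broken; reset
            return False
        if self.breach_start is None:
            self.breach_start = ts                # streak starts now
        return ts - self.breach_start >= self.window_sec

# "Alert if CPU > 90% for 5 minutes":
rule = ThresholdRule(threshold=90.0, window_sec=300)
```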
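The Notification Service's dedup latch: a SET NX with a TTL means only the first trigger within the window sends a page. The key format and re-notify interval are assumptions.

```python
import redis

r = redis.Redis()

def should_notify(rule_id: str, series_id: str, renotify_sec: int = 3600) -> bool:
    key = f"alert:fired:{rule_id}:{series_id}"
    # SET ... NX EX succeeds only if the key doesn't already exist, so an
    # ongoing incident produces one page instead of one per evaluation tick.
    return bool(r.set(key, "1", nx=True, ex=renotify_sec))
```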
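The Query Engine's hot/cold routing decision; the 24-hour cutoff and the two store clients are illustrative assumptions.

```python
import time

HOT_RETENTION_SEC = 24 * 3600

def query_cassandra(expr: str, start: int, end: int) -> list:
    return []   # stand-in for the hot-store (full-resolution) client

def query_s3_rollups(expr: str, start: int, end: int) -> list:
    return []   # stand-in for the cold-store (downsampled) client

def route_query(expr: str, start: int, end: int) -> list:
    cutoff = int(time.time()) - HOT_RETENTION_SEC
    if start >= cutoff:                  # entirely recent: hot store only
        return query_cassandra(expr, start, end)
    if end <= cutoff:                    # entirely historical: rollups only
        return query_s3_rollups(expr, start, end)
    # Straddles the boundary: stitch cold rollups and hot raw data together.
    return query_s3_rollups(expr, start, cutoff) + query_cassandra(expr, cutoff, end)
```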
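Finally, how the alert engine could keep its rules fresh from the CDC stream, assuming a Debezium-style change topic over the Postgres WAL (the topic name and event shape are assumptions):

```python
# Tail the CDC topic for the alert-rules table and mirror it into the
# engine's in-memory rule table.
import json
from kafka import KafkaConsumer

rules: dict[int, dict] = {}   # rule_id -> rule definition, held in memory

cdc = KafkaConsumer(
    "cdc.alert_rules",                        # assumed Debezium topic
    bootstrap_servers="kafka:9092",
    value_deserializer=lambda raw: json.loads(raw),
)

for event in cdc:
    change = event.value
    if change["op"] == "d":                   # delete: drop the rule
        rules.pop(change["before"]["id"], None)
    else:                                     # create/update/snapshot: upsert
        rules[change["after"]["id"]] = change["after"]
```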
While this architecture successfully outlines the data pathways, operating these components at a scale of 10M+ metrics per second introduces severe distributed systems challenges. We must dive deeper into exactly how the TSDB achieves extreme data compression, how Flink manages stateful alerting without constantly polling the database, and how the system protects itself from cardinality explosions.
Deep Dive 1: Time-Series Storage & Compression
Deep Dive 2: Real-Time Alerting Engine
Deep Dive 3: Downsampling & Data Retention
Deep Dive 4: Cardinality & Sharding Strategies
Complete Architecture