Ad Click Event Aggregation (Google Ads)

Overview
Introduction
- An Ad Click Event Aggregation system is the financial and analytical backbone of digital advertising platforms like Google Ads or Facebook Ads. It ingests a massive, continuous stream of user interactions (impressions, clicks, conversions) and aggregates them for real-time reporting and billing.
- Designing this system requires balancing the opposing forces of extreme throughput (millions of events per second) and strict financial accuracy (advertisers are billed per click). It tackles tough distributed systems challenges including deduplication, late-arriving data over unreliable mobile networks, and sub-second analytical querying.
Requirements
- Functional Requirements
- Ingestion: The system must track ad impressions and clicks from client applications.
- Real-Time Aggregation: Events must be aggregated by various dimensions (e.g., campaign_id, ad_id, country, device_type) across multiple time windows (minute, hour, day).
- Analytics Querying: Advertisers can view dashboards to query performance metrics with dynamic filtering and grouping.
- Fraud Filtering: The system must identify and discard invalid or bot-generated clicks before they impact billing.
- Non-Functional Requirements
- High Throughput: Must scale to handle 1M+ write events per second (a rough capacity estimate follows this list).
- Strict Accuracy: "No dropped clicks, no double counting." Inaccuracies directly impact revenue and advertiser trust.
- Low Latency (Read): Analytical queries on the advertiser dashboard must return in < 1 second.
- High Availability: 99.99% for the ingestion path; dropping events means permanently lost revenue.
- Fault Tolerance: System must recover gracefully from worker node failures without data loss or duplication.
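To put the 1M events-per-second target in perspective, a rough capacity estimate helps; the ~200-byte average event size below is an assumption for illustration, not a stated requirement.

```python
# Back-of-the-envelope estimate for the write path.
# Assumption: ~200 bytes per raw event (not specified in the requirements).
events_per_second = 1_000_000
bytes_per_event = 200

ingest_bandwidth_mb_s = events_per_second * bytes_per_event / 1e6             # ~200 MB/s
raw_volume_per_day_tb = events_per_second * bytes_per_event * 86_400 / 1e12   # ~17.3 TB/day

print(f"Ingest bandwidth: ~{ingest_bandwidth_mb_s:.0f} MB/s")
print(f"Raw event volume: ~{raw_volume_per_day_tb:.1f} TB/day")
```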
Data Model
To handle the competing demands of strictly consistent financial data, high-throughput event buffering, and ad-hoc analytical queries, the system requires a polyglot persistence strategy: no single database serves all three well.
- OLAP Database (Apache Druid / ClickHouse): An Online Analytical Processing database specifically designed for fast aggregations over massive, append-only datasets. It uses columnar storage to execute complex GROUP BY and time-series queries in milliseconds.
- Object Storage (S3 / GCS): The system's immutable Data Lake. It stores the raw, unaggregated event logs permanently for auditing, machine learning model training, and batch reconciliation pipelines.
- Relational (PostgreSQL): Serves as the "Source of Truth" for configuration and financial data. It stores advertiser accounts, campaign metadata, ad targeting rules, and the final billing ledgers. ACID compliance is non-negotiable here.
- In-Memory Store (Redis): Used for low-latency state lookups during stream processing, specifically for high-speed deduplication (using TTL keys or Bloom Filters) and caching frequently accessed campaign metadata.
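As an illustration of the deduplication role above, here is a minimal sketch using Redis TTL keys via redis-py; the key prefix, TTL window, and connection settings are assumptions, not prescribed values.

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # connection details are placeholders

DEDUP_TTL_SECONDS = 24 * 60 * 60  # keep dedup keys for one day (assumed window)

def is_first_occurrence(event_id: str) -> bool:
    """Return True only the first time an event_id is seen within the TTL window.

    SET ... NX EX atomically writes the key only if it does not already exist,
    so two stream workers cannot both count the same click.
    """
    return bool(r.set(f"dedup:click:{event_id}", 1, nx=True, ex=DEDUP_TTL_SECONDS))

# Usage inside the stream processor (simplified):
# if is_first_occurrence(event["event_id"]):
#     aggregate(event)      # first sighting: count it
# else:
#     pass                  # duplicate: drop it
```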
API Design
The API must support ultra-fast, fire-and-forget data ingestion as well as complex read queries for the advertiser dashboards.
- Ingest Event:
- POST /v1/events
- Client sends a lightweight payload: {"event_id": "uuid", "type": "click", "ad_id": "123", "timestamp": "1710150050", "user_ip": "..."}
- Returns 202 Accepted to minimize client blocking (a minimal handler sketch follows the endpoint list).
- Query Analytics:
- GET /v1/analytics/query?campaign_id=999&start=X&end=Y&group_by=device,country
- Returns aggregated metrics (clicks, impressions, spend) for the dashboard.
- Manage Campaign:
- POST /v1/campaigns
- Standard REST endpoint to create or update ad configurations.
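The following is a minimal sketch of the fire-and-forget ingestion endpoint, assuming FastAPI and kafka-python; the topic name, broker address, and field set are illustrative, not part of the spec above.

```python
import json

from fastapi import FastAPI, status
from kafka import KafkaProducer
from pydantic import BaseModel

app = FastAPI()
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                      # placeholder address
    value_serializer=lambda v: json.dumps(v).encode(),
)

class AdEvent(BaseModel):
    event_id: str   # client-generated UUID, used downstream for deduplication
    type: str       # "click" or "impression"
    ad_id: str
    timestamp: str
    user_ip: str

@app.post("/v1/events", status_code=status.HTTP_202_ACCEPTED)
def ingest_event(event: AdEvent):
    # Validate, hand off to Kafka, and return immediately; aggregation happens downstream.
    producer.send("ad-events", value=event.dict())       # topic name is an assumption
    return {"status": "accepted"}
```

Returning 202 before the event is aggregated keeps client latency low; durability is delegated to Kafka's replication rather than to the API tier.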
High Level Design
At a high level, the system architecture separates the concerns of ingesting massive data streams in real time from the heavy lifting of batch reconciliation and analytical querying. The architecture breaks down into four primary paths:
- Path 1: Real-Time Ingestion & Aggregation (Write Path)
- Client: The mobile device or web browser that generates the ad click event.
- LB / API Gateway: Handles raw ingress, connection pooling, and routes the traffic to the ingestion workers.
- Ingestion Service: A lightweight, stateless microservice that validates the payload and instantly pushes the event to the message broker.
- Apache Kafka: The central event bus. It buffers the massive incoming click stream and safely decouples the fast API from the heavier downstream processing.
- Apache Flink (Stream Processing): Consumes the live stream from Kafka. It groups clicks into time windows and calculates real-time aggregations (the windowing logic is sketched after this path).
- Redis: A high-speed in-memory cache used by Flink to instantly check for duplicate clicks and fetch campaign metadata without slowing down the stream.
- ML Service: A dedicated machine learning model queried by Flink in real time to flag obvious botnet traffic or invalid clicks before they are aggregated.
- OLAP Database (Druid / ClickHouse): Ingests the provisional, real-time aggregations from Flink, making them immediately available for dashboard querying.
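Flink's actual DataStream API is not shown here; the plain-Python sketch below only illustrates the keyed, tumbling-window counting that the streaming job performs, with all names assumed for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # one-minute tumbling windows

# (ad_id, window_start) -> click count; in Flink this lives in managed keyed state
counts: dict[tuple[str, int], int] = defaultdict(int)

def window_start(event_ts: int) -> int:
    """Align an event timestamp to the start of its one-minute window."""
    return event_ts - (event_ts % WINDOW_SECONDS)

def process(event: dict) -> None:
    """Count one deduplicated, non-fraudulent click into its window bucket."""
    key = (event["ad_id"], window_start(int(event["timestamp"])))
    counts[key] += 1

def flush(window: int) -> list[dict]:
    """Emit finished window aggregates (in Flink, this is triggered by watermarks)."""
    rows = [
        {"ad_id": ad_id, "window_start": w, "clicks": n}
        for (ad_id, w), n in counts.items()
        if w == window
    ]
    # rows would be written to the OLAP database here
    return rows
```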
- Path 2: Batch Reconciliation (Accuracy Path)
- Amazon S3: Acts as the permanent Data Lake. A connector continuously archives the raw, unedited event stream from Kafka into cheap, long-term object storage.
- Apache Spark (Batch Jobs): Runs heavy, nightly reconciliation jobs over the massive datasets in S3 to catch slow-moving fraud patterns and calculate the perfectly accurate, final click counts (a simplified job is sketched after this path).
- PostgreSQL: The relational "Source of Truth." It receives the finalized, audited financial totals from the Spark jobs to update the advertiser's billing ledger.
- OLAP Database (Druid / ClickHouse): Spark simultaneously overwrites the "provisional" historical segments in the OLAP database with the "perfected" batch data to ensure the advertiser's dashboard exactly matches their bill.
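A sketch of the nightly reconciliation job, assuming PySpark over JSON event files in S3; the bucket layout, the `is_fraud` flag, and the output location are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("click-reconciliation").getOrCreate()

# Read one day of raw events from the data lake (path layout is an assumption).
events = spark.read.json("s3://ad-events-raw/date=2024-03-11/")

finalized = (
    events
    .filter(F.col("type") == "click")
    .filter(F.col("is_fraud") == False)      # assumes a fraud flag added upstream
    .dropDuplicates(["event_id"])            # exact dedup over the full day
    .groupBy("ad_id", "campaign_id")
    .agg(F.count("*").alias("final_clicks"))
)

# Write the audited totals; these overwrite the provisional OLAP segments
# and feed the billing ledger in PostgreSQL (sink details omitted).
finalized.write.mode("overwrite").parquet("s3://ad-events-final/date=2024-03-11/")
```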
- Path 3: Metadata Synchronization
- PostgreSQL: Stores the authoritative state of all advertiser accounts, budgets, and active ad campaign configurations.
- Redis (Metadata Sync): A background process continuously synchronizes active campaign metadata from PostgreSQL into Redis. This ensures Flink can enrich incoming clicks with campaign IDs in microseconds without querying the heavy relational database (a minimal sync loop is sketched below).
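A minimal sketch of the metadata sync loop, assuming psycopg2 and redis-py; the table schema, key format, TTL, and refresh interval are assumptions.

```python
import json
import time

import psycopg2
import redis

pg = psycopg2.connect("dbname=ads user=sync")   # placeholder DSN
r = redis.Redis(host="localhost", port=6379)

def sync_campaign_metadata() -> None:
    """Copy active campaign rows from PostgreSQL into Redis for fast stream-side lookups."""
    with pg.cursor() as cur:
        cur.execute("SELECT campaign_id, advertiser_id, cpc_bid FROM campaigns WHERE active")
        for campaign_id, advertiser_id, cpc_bid in cur.fetchall():
            r.set(
                f"campaign:{campaign_id}",
                json.dumps({"advertiser_id": advertiser_id, "cpc_bid": float(cpc_bid)}),
                ex=300,  # expire stale entries if a campaign is paused (assumed TTL)
            )

while True:
    sync_campaign_metadata()
    time.sleep(60)  # refresh every minute (assumed interval)
```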
- Path 4: Analytics Query (Read Path)
- Advertiser: The end-user logging into the ad platform's web portal.
- LB / API Gateway: Routes the dashboard API requests to the appropriate backend service.
- Query Service: The backend engine that translates dashboard UI interactions into highly optimized database queries (a query-building sketch follows this path).
- PostgreSQL: Queried by the Query Service for standard CRUD operations, such as loading user profiles, billing history, or campaign settings.
- OLAP Database (Druid / ClickHouse): Queried by the Query Service to execute complex, multi-dimensional analytical queries (e.g., "clicks by country and device for the last 30 days"), returning results in sub-second times.
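To make the read path concrete, the sketch below shows how a Query Service might translate dashboard parameters into a ClickHouse-style GROUP BY statement; the table name, columns, and dimension whitelist are assumptions.

```python
from datetime import date

ALLOWED_GROUP_BYS = {"device", "country", "ad_id"}  # whitelist guards against SQL injection

def build_analytics_query(campaign_id: int, start: date, end: date, group_by: list[str]) -> str:
    """Translate dashboard parameters into a ClickHouse-style aggregation query."""
    dims = [d for d in group_by if d in ALLOWED_GROUP_BYS]
    select_dims = ", ".join(dims) + ", " if dims else ""
    group_clause = f"GROUP BY {', '.join(dims)}" if dims else ""
    return f"""
        SELECT {select_dims}
               sum(clicks)      AS clicks,
               sum(impressions) AS impressions,
               sum(spend)       AS spend
        FROM ad_stats_by_minute              -- pre-aggregated table written by Flink/Spark
        WHERE campaign_id = {campaign_id}
          AND event_date BETWEEN '{start}' AND '{end}'
        {group_clause}
    """

# Example: GET /v1/analytics/query?campaign_id=999&...&group_by=device,country
print(build_analytics_query(999, date(2024, 2, 10), date(2024, 3, 11), ["device", "country"]))
```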
While this architecture successfully outlines the flow of data, operating these components at a scale of 1M+ clicks per second introduces severe distributed systems challenges. We must dive deeper into exactly how Flink guarantees deduplication, how the system handles late-arriving mobile events, and why the OLAP engine is strictly required for the serving layer.
Deep Dive 1: Deduplication
Deep Dive 2: Handling Late Events
Deep Dive 3: Columnar Storage
Deep Dive 4: Click Fraud Detection
Complete Architecture
Additional Discussion Points