Distributed Message Queue

Overview

Introduction
Requirements
API Design
Data Storage Mechanics
High Level Design
Deep Dive 1: Performance
Deep Dive 2: Replication
Deep Dive 3: Rebalancing
Complete Architecture
Additional Discussion Points

Introduction

A Distributed Message Queue acts as the central nervous system for modern microservice architectures, enabling asynchronous communication, event-driven designs, and massive data streaming.
Designing an infrastructure system like Kafka differs fundamentally from designing a standard web application. The core challenges shift away from relational data modeling and load balancers toward low-level disk I/O optimization, strict delivery semantics, and distributed consensus to maintain high availability under extreme throughput.

Requirements

Functional Requirements
- Topic Management: Users can create and configure topics (logical channels).
- Publishing: Producers can publish messages to specific topics.
- Consuming: Consumers can subscribe to topics and retrieve messages.
- Data Retention: Messages must be durably stored for a configurable period, regardless of whether they have been consumed.
Non-Functional Requirements
- Extreme Scalability: Support millions of messages per second (GBs/sec of throughput).
- Low Latency: < 10ms for acknowledging a published message; near real-time consumption.
- High Availability & Durability: No data loss if a broker node crashes. 99.99% availability.
- Delivery Semantics: Guarantee At-Least-Once delivery by default.
- Ordering Guarantees: Strict ordering of messages within a specific partition. Consumers read messages in the order they were appended; no ordering is guaranteed across partitions.
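
At-Least-Once delivery largely comes down to the order of operations on the consumer side: process the batch first, commit the offset afterwards. The sketch below illustrates this under stated assumptions. `InMemoryPartition` and `consume_at_least_once` are hypothetical names, and a Python list stands in for the partition on disk.

```python
class InMemoryPartition:
    """Hypothetical stand-in for one partition: a list indexed by offset."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # next offset the consumer group should read

    def fetch(self, max_batch_size):
        # Always resume from the last committed offset.
        start = self.committed
        return start, self.messages[start:start + max_batch_size]

    def commit(self, next_offset):
        self.committed = next_offset


def consume_at_least_once(partition, handler, max_batch_size=2):
    """Process a batch, then commit. A crash before the commit means redelivery."""
    start, batch = partition.fetch(max_batch_size)
    for message in batch:
        handler(message)  # may raise, which simulates a consumer crash
    # Only reached if every message in the batch was handled.
    partition.commit(start + len(batch))
    return len(batch)
```

If the handler fails partway through a batch, the offset is never committed and the whole batch is fetched again, so handlers should be idempotent or tolerate duplicates.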

API Design

Unlike standard REST APIs, high-throughput message queues typically utilize highly optimized binary protocols over TCP. However, the conceptual interface remains straightforward.

Create Topic: createTopic(topic_name, partition_count, replication_factor)
Publish Message: publish(topic_name, routing_key, message_payload)
- The routing_key dictates which partition the message lands in. If null, round-robin is used.
Consume Messages (Pull Model): consume(topic_name, consumer_group_id, max_batch_size)
- Architectural Note: We use a pull model (like Kafka) rather than a push model (like RabbitMQ). With a pull model the queue cannot overwhelm consumers during traffic spikes: clients apply backpressure simply by slowing down their fetch requests.
Commit Offset: commitOffset(topic_name, consumer_group_id, partition_id, offset_number)
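
The routing rule for publish can be sketched as follows. The text above only says that the routing_key dictates the partition. Hashing the key with CRC32 is an assumption made for illustration, and `Partitioner` is a hypothetical name rather than part of the API.

```python
import itertools
import zlib


class Partitioner:
    """Hypothetical helper: picks a partition for each published message."""

    def __init__(self, partition_count):
        self.partition_count = partition_count
        self._round_robin = itertools.cycle(range(partition_count))

    def select(self, routing_key):
        if routing_key is None:
            # No key: spread messages evenly across partitions.
            return next(self._round_robin)
        # Assumption: a stable hash of the key, modulo the partition count,
        # so the same key always lands in the same partition.
        return zlib.crc32(routing_key.encode("utf-8")) % self.partition_count
```

Because the same key always maps to the same partition, all messages for one key share that partition's ordering guarantee.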

Data Storage Mechanics

Instead of using a relational database like PostgreSQL, the broker manages the file system directly. To achieve millions of writes per second, we must build a custom storage engine optimized entirely for sequential disk I/O using an Append-Only Log architecture.

The Broker: At the core of the data plane is the Broker, a stateful server (e.g., an AWS EC2 instance with an attached SSD) responsible for receiving, storing, and serving messages. It acts as its own database, managing its local file system directly rather than relying on an external database layer.
Topics & Partitions: A Topic is a logical concept (e.g. user-clicks). Physically, a topic is split into Partitions. Partitions allow a single topic to be horizontally scaled and distributed across multiple broker nodes.
Append-Only Commit Log: Each partition is stored as an append-only file on the broker's disk. New messages are always written to the end of the file. This ensures Sequential I/O, which can be orders of magnitude faster than random I/O (especially on spinning disks, where every random access costs a seek), allowing modern HDDs/SSDs to sustain hundreds of megabytes per second of write throughput.
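
A minimal sketch of the append-only idea, assuming a new, empty file per partition and a simple 4-byte length prefix per record. The record format and the `AppendOnlyLog` name are illustrative assumptions, not a prescribed on-disk layout.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """One partition stored as a single file of length-prefixed records."""

    def __init__(self, path):
        self.path = path
        self.next_offset = 0            # assumes the file starts out empty
        self._file = open(path, "ab")   # append mode: writes always go to the end

    def append(self, payload: bytes) -> int:
        # 4-byte big-endian length, then the payload itself.
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read_all(self):
        # Sequential scan from the start of the file.
        records = []
        with open(self.path, "rb") as f:
            while header := f.read(4):
                (length,) = struct.unpack(">I", header)
                records.append(f.read(length))
        return records

    def close(self):
        self._file.close()


if __name__ == "__main__":
    log = AppendOnlyLog(os.path.join(tempfile.mkdtemp(), "partition-0.log"))
    log.append(b"hello")
    log.append(b"world")
    print(log.read_all())
    log.close()
```

Nothing in the file is ever rewritten, so the write path never seeks backwards.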
Segments: To prevent a partition from becoming a single massive, unmanageable file, the log is broken into Segments (e.g., 1GB rolling files). This allows the broker to efficiently purge old messages by deleting entire segment files once the retention period expires, without pausing reads or writes.
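
Segment rolling and retention can be sketched as below. To stay short, this tracks only segment metadata in memory. A real broker would create and delete files on disk, and the names `Segment` and `SegmentedLog` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int       # offset of the first message in this segment
    last_write_ts: float   # time of the most recent append, in seconds
    size_bytes: int = 0


class SegmentedLog:
    def __init__(self, max_segment_bytes):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [Segment(base_offset=0, last_write_ts=0.0)]
        self.next_offset = 0

    def append(self, payload_size, now):
        active = self.segments[-1]
        # Roll to a new segment when the active one would exceed its size limit.
        if active.size_bytes > 0 and active.size_bytes + payload_size > self.max_segment_bytes:
            active = Segment(base_offset=self.next_offset, last_write_ts=now)
            self.segments.append(active)
        active.size_bytes += payload_size
        active.last_write_ts = now
        self.next_offset += 1
        return self.next_offset - 1

    def purge_expired(self, now, retention_seconds):
        """Drop whole closed segments whose newest message is past retention."""
        closed, active = self.segments[:-1], self.segments[-1]
        kept = [s for s in closed if now - s.last_write_ts <= retention_seconds]
        removed = len(closed) - len(kept)
        # The active segment is never purged, so writes are not interrupted.
        self.segments = kept + [active]
        return removed
```

Purging touches only closed segments, which is why it does not need to pause reads or writes on the active one.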
Offset Indexing: Every message in a partition is assigned a sequential, immutable ID called an Offset. To allow consumers to quickly jump to a specific offset without scanning a 1GB file, the broker maintains a memory-mapped Sparse Index that maps logical offsets to physical byte positions in the segment file.
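
The sparse index lookup can be illustrated with a binary search. This sketch keeps the index in plain Python lists rather than a memory-mapped file, and it indexes every Nth message. Both choices, along with the `SparseIndex` name, are simplifying assumptions.

```python
import bisect


class SparseIndex:
    """Maps a subset of offsets to byte positions in a segment file."""

    def __init__(self, interval):
        self.interval = interval  # index one entry per `interval` messages
        self.offsets = []         # indexed offsets, in ascending order
        self.positions = []       # byte position for each indexed offset

    def maybe_add(self, offset, byte_position):
        # Called on every append; only some offsets are recorded.
        if offset % self.interval == 0:
            self.offsets.append(offset)
            self.positions.append(byte_position)

    def lookup(self, target_offset):
        """Return the nearest indexed (offset, byte_position) at or below the target."""
        i = bisect.bisect_right(self.offsets, target_offset) - 1
        if i < 0:
            raise KeyError(f"offset {target_offset} is before the first index entry")
        return self.offsets[i], self.positions[i]
```

A reader seeks to the returned byte position and scans forward a few records to reach the exact offset, trading a short scan for a much smaller index.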

High Level Design

Producers: Client applications that push data. They locally cache cluster metadata to route messages directly to the correct partition's leader broker, bypassing any central proxy.
Broker Cluster: A fleet of stateful nodes that manage the physical storage and serving of data. They host the Partitions that make up logical Topics (for example, the partitions of user-clicks and analytics spread across the nodes). They store the partition segment files on local disk and handle all direct read/write requests for the partitions they lead.
Consumers / Consumer Groups: Client applications pulling data. They track their progress using offsets.
Metadata Consensus Cluster (KRaft / ZooKeeper): A highly available quorum of nodes that acts as the "brain" of the system. It tracks which brokers are alive, stores the topic configurations, and dictates which broker is the Leader for any given partition.
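
The producer-side metadata cache described above might look roughly like this. `MetadataCache` and the `fetch_metadata` callable are hypothetical. In a real client the fetch would be a network request, and a refresh would typically be triggered when a broker reports that it no longer leads a partition.

```python
class MetadataCache:
    """Producer-side copy of {(topic, partition): leader broker address}."""

    def __init__(self, fetch_metadata):
        # fetch_metadata: callable returning a fresh leader map (assumed interface).
        self._fetch_metadata = fetch_metadata
        self._leaders = fetch_metadata()

    def leader_for(self, topic, partition):
        # Served from the local cache, so no lookup round trip per message.
        return self._leaders[(topic, partition)]

    def refresh(self):
        # Re-fetch after a leadership change or a routing error.
        self._leaders = self._fetch_metadata()
```

Until it is refreshed, the cache can point at a stale leader, so the producer needs a retry path that refreshes and resends.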

While this topology supports message routing, it does not explain how the system survives node failures, coordinates distributed consumption, or achieves sub-10ms latency. We must dive into the low-level mechanics.

Deep Dive 1: Performance
