Distributed Task Scheduler


Introduction

  • A distributed task scheduler (similar to Celery, AWS Step Functions, or Apache Airflow) is a system designed to execute arbitrary code or trigger events at a specific time in the future, either once or on a recurring basis.
  • Designing a robust task scheduler requires solving complex distributed systems problems: ensuring high precision (tasks execute exactly when they are supposed to), maintaining fault tolerance (no tasks are lost if a worker node crashes), and avoiding duplicate executions in a highly concurrent environment.

Requirements

  • Functional Requirements
    • Task Submission: Users can submit one-off delayed tasks or recurring tasks (e.g., cron expressions).
    • Task Execution: The system must execute the provided task payload (HTTP webhook, script, or message push) at the scheduled time.
    • Task Management: Users can query the state of a task (Pending, Running, Completed, Failed, Canceled) or cancel an upcoming task.
    • Execution History: The system must store the output and execution logs of completed tasks for auditing.
  • Non Functional Requirements
    • Scalability: System must handle 10,000+ task submissions per second and peak execution bursts of 50,000+ tasks per second.
    • High Availability: 99.99% uptime for task submission and execution routing.
    • Reliability (No Data Loss): Once a task is acknowledged as accepted, it must be durably stored and guaranteed to execute.
    • Precision/Latency: Tasks should begin execution within 200ms of their scheduled time.

Data Model

To handle the contrasting demands of strict state management and massive write-heavy logging, a polyglot persistence strategy is essential at this scale.

  • Relational DB (PostgreSQL): Serves as the system's "Source of Truth" for task definitions, schedules, and metadata. We rely heavily on its ACID properties and row-level locking (e.g., SELECT ... FOR UPDATE) to manage the state machine of a task without race conditions. Core entities include:
    • TaskDefinition: The core configuration object. It stores the user's intent, the payload to be executed (e.g., an HTTP webhook URL and body), the retry policy, and the scheduling logic (either a one-off timestamp or a cron expression).
    • TaskExecution: Represents a single, distinct run of a TaskDefinition. It acts as the primary state machine (PENDING -> QUEUED -> RUNNING -> SUCCESS / FAILED). It includes an idempotency_key and tracks exactly when the task was picked up by a worker and when it finished.
  • In-Memory Store (Redis): Acts as the high-speed timer and scheduling buffer. We utilize Redis Sorted Sets (ZSET) where the score is the epoch execution time. This allows the system to query for "tasks that need to run right now" in O(log N) time, bypassing the need to constantly scan the relational database.
  • Message Broker (Kafka / RabbitMQ): Serves as the durable execution queue. It decouples the scheduling components from the worker pool, allowing the system to absorb massive spikes in task executions (backpressure) without overwhelming the database.
  • Wide-Column Store (Cassandra): Used exclusively for execution logs and history. Since a single recurring task might generate millions of logs over its lifetime, a high-write-throughput, horizontally scalable NoSQL database is ideal.
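To make the TaskExecution state machine concrete, here is a minimal Python sketch. The state names and the idempotency_key field follow the description above; the class itself and its timestamp fields are illustrative, and in production each transition would run inside a database transaction guarded by SELECT ... FOR UPDATE.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Legal transitions of the TaskExecution state machine
# (PENDING -> QUEUED -> RUNNING -> SUCCESS / FAILED, plus cancellation).
TRANSITIONS = {
    "PENDING": {"QUEUED", "CANCELED"},
    "QUEUED": {"RUNNING", "CANCELED"},
    "RUNNING": {"SUCCESS", "FAILED"},
}

@dataclass
class TaskExecution:
    task_id: str
    idempotency_key: str
    state: str = "PENDING"
    picked_up_at: Optional[datetime] = None   # when a worker claimed the task
    finished_at: Optional[datetime] = None    # when the run terminated

    def transition(self, new_state: str) -> None:
        """Reject illegal jumps such as PENDING -> SUCCESS."""
        if new_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        now = datetime.now(timezone.utc)
        if new_state == "RUNNING":
            self.picked_up_at = now
        elif new_state in ("SUCCESS", "FAILED"):
            self.finished_at = now
```

Enforcing the transition table in code (and in the database, via a CHECK or a guarded UPDATE) is what prevents race conditions such as two workers both marking the same execution RUNNING.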

API Design

For the Distributed Task Scheduler we will expose a classic RESTful API. RESTful APIs are simple, widely adopted, stateless, and cache-friendly, which makes them a good fit for our system.

  • Create Task:
    • POST /v1/tasks
    • Accepts payload, execution time/cron, and target.
    • Returns task_id
  • Get Task Status:
    • GET /v1/tasks/{task_id}
    • Returns current state and metadata.
  • Cancel Task:
    • DELETE /v1/tasks/{task_id}
    • Marks a pending task as canceled.
  • Get Logs:
    • GET /v1/tasks/{task_id}/executions
    • Returns paginated execution history.
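The four endpoints above can be sketched as a tiny client-side request builder. Only the paths and methods come from the design; field names such as run_at, cron, and page_token are illustrative assumptions.

```python
import json
from typing import Optional

def create_task_request(payload: dict, target: str,
                        run_at: Optional[str] = None,
                        cron: Optional[str] = None) -> dict:
    """POST /v1/tasks: exactly one of run_at (ISO timestamp) or cron."""
    if (run_at is None) == (cron is None):
        raise ValueError("provide exactly one of run_at or cron")
    schedule = {"run_at": run_at} if run_at else {"cron": cron}
    body = {"payload": payload, "target": target, **schedule}
    return {"method": "POST", "path": "/v1/tasks", "body": json.dumps(body)}

def task_status_request(task_id: str) -> dict:
    """GET /v1/tasks/{task_id}: current state and metadata."""
    return {"method": "GET", "path": f"/v1/tasks/{task_id}"}

def cancel_task_request(task_id: str) -> dict:
    """DELETE /v1/tasks/{task_id}: marks a pending task as canceled."""
    return {"method": "DELETE", "path": f"/v1/tasks/{task_id}"}

def executions_request(task_id: str, page_token: Optional[str] = None) -> dict:
    """GET /v1/tasks/{task_id}/executions: paginated execution history."""
    qs = f"?page_token={page_token}" if page_token else ""
    return {"method": "GET", "path": f"/v1/tasks/{task_id}/executions{qs}"}
```

Note that create rejects a request carrying both a one-off timestamp and a cron expression, mirroring the "one-off or recurring" split in the functional requirements.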

High Level Design

At a high level, the system architecture separates the concerns of accepting a task from executing it.

1. The Submission Path (Control Plane)
This path focuses on securely receiving the task and durably persisting its schedule.
  • API Gateway: Routes incoming requests, authenticates clients, and enforces rate limits.
  • Task Management Service: Validates the payload and cron expressions. It writes the task definition to the Task DB (PostgreSQL) with a state of PENDING.
  • Short-Term Buffer: If the task is scheduled to run in the very near future (e.g., within the next 15 minutes), the service also pushes a lightweight reference of the task directly into the Redis ZSET timer.
2. The Execution Path (Data Plane)
This path focuses on time-triggering the task and handing it off to the worker nodes.
  • Relay Nodes (The Bridge): A background batch process that periodically queries the Task DB (PostgreSQL) for tasks scheduled to run in the near future (e.g., the next 10 minutes) and stages them into the Redis ZSET timer. This prevents the database from being continuously hammered by high-frequency polling.
  • Time Trigger (Scheduler Nodes): A fleet of lightweight, continuously running processes that poll the Redis ZSET for tasks whose scheduled time is <= the current time.
  • Task Queue (RabbitMQ / Kafka): Once triggered, the task payload is fetched and pushed onto a highly available message queue. To prevent Noisy Neighbors (where one customer submitting 50,000 tasks saturates the queue and delays others), tasks are routed to Tenant-Specific Queues using a hash of the tenant_id. This allows workers to round-robin between tenant queues, ensuring equitable compute distribution.
  • Worker Pool: A fleet of horizontally scaled compute nodes that consume messages from the queue, execute the business logic (e.g., making an HTTP call), report the success or failure to the Task DB, and keep a historical log in Cassandra.
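The tenant-queue routing described above can be sketched in a few lines. The shard count and queue-name format are assumptions for illustration; the key property is that the mapping is stable, so one noisy tenant saturates only its own queue.

```python
import hashlib

NUM_TENANT_QUEUES = 16  # illustrative shard count, not from the design

def queue_for_tenant(tenant_id: str) -> str:
    """Deterministically map a tenant to one of the tenant-specific queues.

    Workers round-robin across the queue names, so a burst of 50,000
    tasks from one tenant delays only that tenant's shard.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % NUM_TENANT_QUEUES
    return f"tasks.tenant-{shard}"
```

A cryptographic hash is overkill for sharding but avoids the pitfall of Python's built-in hash(), which is randomized per process and would route the same tenant to different queues on different nodes.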

While this high-level design separates concerns, it contains critical flaws. Polling a database efficiently is difficult, workers can crash mid-execution leaving tasks stuck in "limbo", and network retries could cause a payment task to execute twice. We must dive deeper to resolve these.

Redis ZSET: Internal Implementation

1. THE HASH MAP (For O(1) Lookups - e.g., checking status or canceling a task)
   Maps task_id -> execution_timestamp (UNIX Epoch)

     [ "task_A" ] --> 1710150000  (Ready now)
     [ "task_B" ] --> 1710150050  (Ready in 50s)
     [ "task_C" ] --> 1710150300  (Ready in 300s)
     [ "task_D" ] --> 1710150300  (Ready in 300s)

2. THE SKIP LIST (For O(log N) Time-Based Polling via ZRANGEBYSCORE)
   A multi-layer linked list. Higher levels act as "express lanes" to skip over future tasks.

     Level 3: [HEAD] ----------------------------------------------------------> [task_C:1710150300] -> NULL
     Level 2: [HEAD] --------------------------> [task_B:1710150050] ----------> [task_C:1710150300] -> NULL
     Level 1: [HEAD] -> [task_A:1710150000] ---> [task_B:1710150050] -> ... ---> [task_C:1710150300] -> [task_D:1710150300] -> NULL

A Redis Sorted Set (ZSET) is an in-memory data structure that keeps unique elements ordered by a score, in this case the UNIX execution timestamp. This lets the scheduler identify and extract tasks that are ready to run in O(log N) time, instead of repeatedly scanning the primary database. Internally, a ZSET combines two structures:

  • The Hash Map (For O(1) Point Lookups)
    • Purpose: Instant access by Task ID.
    • Scenario: A user wants to cancel an upcoming task, or the system needs to check its exact scheduled time. The system runs ZSCORE scheduled_tasks task_B (or ZREM).
    • Mechanism: Redis ignores the time-based sorting entirely. It consults the internal Hash Map, finds the key task_B, and instantly returns the epoch timestamp 1710150050 (or deletes the task) in O(1) time. It does not traverse any list.
  • The Skip List (For O(log N) Range Queries)
    • Purpose: Fast traversal by Execution Timestamp (Score).
    • Scenario: The Scheduler Node polls for tasks ready to execute right now. It runs ZRANGEBYSCORE scheduled_tasks -inf CURRENT_TIMESTAMP ("Find all tasks scheduled at or before this exact second").
    • Mechanism: Redis cannot use the Hash Map here because it doesn't know which specific Task IDs are ripe for execution. Instead, it traverses the Skip List.
      • It uses the upper-level "express lanes" to quickly skip over millions of tasks scheduled far into the future (like task_C scheduled for tomorrow).
      • It drops down to the base level only when it nears the target timestamp, landing directly on the threshold of ready tasks to instantly extract them for the message queue.
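The dual structure above can be mimicked with a short, stdlib-only Python sketch: a dict plays the hash map, and a sorted list stands in for the skip list (real Redis uses a probabilistic skip list, but a sorted list gives the same ordering semantics for illustration).

```python
import bisect

class MiniZSet:
    """Toy mirror of a Redis ZSET for scheduled tasks.

    A dict gives O(1) ZSCORE/ZREM-style point lookups by task_id,
    while a sorted (timestamp, task_id) list answers
    ZRANGEBYSCORE-style "everything due by now" queries.
    """

    def __init__(self):
        self.scores = {}   # task_id -> timestamp  (the hash map)
        self.sorted = []   # (timestamp, task_id), kept in order

    def zadd(self, task_id, ts):
        if task_id in self.scores:          # re-adding updates the score
            self.zrem(task_id)
        self.scores[task_id] = ts
        bisect.insort(self.sorted, (ts, task_id))  # O(log N) search, O(N) shift

    def zscore(self, task_id):
        return self.scores.get(task_id)     # O(1), no list traversal

    def zrem(self, task_id):
        ts = self.scores.pop(task_id, None)
        if ts is not None:
            self.sorted.remove((ts, task_id))  # O(N) here; O(log N) in Redis

    def zrangebyscore(self, max_ts):
        """All task_ids due at or before max_ts (ZRANGEBYSCORE -inf max_ts)."""
        i = bisect.bisect_right(self.sorted, (max_ts, "\uffff"))
        return [task_id for _, task_id in self.sorted[:i]]
```

A Scheduler Node's poll loop then reduces to zrangebyscore(now) followed by zrem for each extracted task before enqueueing it, which is exactly the pattern the skip-list diagram above supports.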

Deep Dive 1: High-Precision Scheduling


Deep Dive 2: Worker Fault Tolerance


Deep Dive 3: Execution Guarantees


Deep Dive 4: Scaling the Storage Layer


Complete Architecture

