Messaging App (WhatsApp, WeChat, Messenger)

Overview

Requirements
Estimates
Data Model
Architecture Overview
Messaging Flow
Presence Service
Additional Discussion Points

Requirements

Functional Requirements
- 1-1 chats
- Group chats (max. 150 members)
- Media Sharing
- Push notifications
- Online / offline indicator
Non Functional Requirements
- High availability
- High scalability
Not covered
- End to end encryption

Estimates

Queries Per Second (QPS)

Daily active users (DAU) = 100 million (100 * 10⁶)
Average number of daily messages sent per user: 50 messages / day / user
Average number of messages sent per day: (100 * 10⁶) * 50 = (5 * 10⁹) messages
(5 * 10⁹) / (24 hours * 60 minutes * 60 seconds) ≈ 58,000 QPS

Storage Estimate

Average message size: 150 bytes
Daily message storage: (5 * 10⁹) messages * 150 bytes ≈ 750 billion bytes ≈ 750 GB / day
Percentage of messages that include media: 10%
Daily media messages: (5 * 10⁹) * 0.1 = (5 * 10⁸)
Average media size: 100 KB
Daily media storage: (5 * 10⁸) * 100 KB ≈ 50 TB / day
Total daily storage: 750 GB + 50 TB = 50.75 TB / day
Assume data is stored for 10 years
Total storage: 50.75 TB * 365 days * 10 years ≈ 185 PB
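
The estimates above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 KB = 10³ bytes, 1 TB = 10¹² bytes, 1 PB = 10¹⁵ bytes); the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Inputs taken from the estimates above.
dau = 100 * 10**6                 # daily active users
messages_per_user_per_day = 50
avg_message_bytes = 150
media_percent = 10                # share of messages that include media
avg_media_bytes = 100 * 10**3     # 100 KB, assuming decimal kilobytes
retention_years = 10

messages_per_day = dau * messages_per_user_per_day
qps = messages_per_day / SECONDS_PER_DAY

text_bytes_per_day = messages_per_day * avg_message_bytes
media_messages_per_day = messages_per_day * media_percent // 100
media_bytes_per_day = media_messages_per_day * avg_media_bytes
total_bytes_per_day = text_bytes_per_day + media_bytes_per_day
total_bytes = total_bytes_per_day * 365 * retention_years

print(f"QPS: {qps:,.0f}")
print(f"Daily storage: {total_bytes_per_day / 10**12:.2f} TB")
print(f"Total storage: {total_bytes / 10**15:.1f} PB")
```

Note that this is an average rate. Peak traffic is typically higher than the daily average, so capacity planning would normally apply a peak multiplier on top of this figure.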

Data Model

This is a basic outline of some of the core tables that could be included in a messaging system data model.

users
- Contains information related to the user.
devices
- user_id: Foreign key that identifies which user the device belongs to. This is a many-to-one relationship: a user can have many devices, but a device can only belong to one user.
- last_seen_message_id: Identifies the last message seen by the device, so that if any message has a larger message_id, the device knows it needs to fetch that message.
conversations
- Covers both 1-1 messaging as well as group messaging, where 1-1 messaging is simply a group with only two people and a different user interface (UI).
user_conversations
- user_id: Foreign key which is used to identify a user.
- conversation_id: Foreign key which is used to identify a conversation.
- This table facilitates the many-to-many relationship between users and conversations.
messages
- Represents a message in the system.
- user_id: Who the message was sent by.
- conversation_id: Which conversation the message belongs to.
- content: The content of the message.
- sent_at, delivered_at, seen_at: Timestamps that indicate different points in a message lifecycle.
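
As a rough illustration, the tables above could be modelled with plain dataclasses. This is only a sketch: the primary key names (device_id, message_id as a field on the message), the display_name field, and the unseen_messages helper are assumptions added for the example, not part of the outline above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class User:
    user_id: int
    display_name: str  # hypothetical field, for illustration only


@dataclass
class Device:
    device_id: int                 # hypothetical primary key
    user_id: int                   # foreign key to User (many devices per user)
    last_seen_message_id: int = 0  # 0 means "nothing seen yet" in this sketch


@dataclass
class Conversation:
    conversation_id: int  # covers both 1-1 and group conversations


@dataclass
class UserConversation:
    user_id: int          # foreign key to User
    conversation_id: int  # foreign key to Conversation


@dataclass
class Message:
    message_id: int       # assumed to be time-sequenced
    user_id: int          # sender
    conversation_id: int
    content: str
    sent_at: datetime
    delivered_at: Optional[datetime] = None
    seen_at: Optional[datetime] = None


def unseen_messages(device: Device, messages: List[Message]) -> List[Message]:
    """Return the messages whose ID is larger than the device's last seen ID."""
    return [m for m in messages if m.message_id > device.last_seen_message_id]
```

The unseen_messages helper shows why a time-sequenced message_id is useful: a device only needs to remember one number to know which messages it still has to fetch.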

Architecture Overview

Gateway

The system uses protocols at two different layers of the Open Systems Interconnection (OSI) model: HTTP (application layer) and TCP (transport layer). The OSI model is a conceptual framework that provides a protocol-agnostic description of how the various layers of a network stack combine to enable network communications.
The Gateway (e.g. AWS API Gateway) will also include rate limiting.
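
The text does not say which rate limiting algorithm the Gateway uses. One common choice is a token bucket, sketched below under that assumption. The class name, parameters, and the explicit `now` argument (used here so the behaviour is deterministic) are all hypothetical.

```python
class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_s: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._tokens = capacity      # start with a full bucket
        self._updated_at = now

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self._updated_at)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_s)
        self._updated_at = max(self._updated_at, now)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

In a real gateway there would be one bucket per client (for example per user_id or per IP address), and the timestamp would come from a monotonic clock rather than being passed in.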

User Service

HTTP-based servers that handle all user-related information as well as authentication.
User information can be quite relational in nature so we could choose an SQL database like MySQL or Postgres.
We may also have to introduce sharding to handle the scale of the application.
The data can also be replicated in different geographic regions, with a multi datacenter approach to handle availability requirements.
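
To make the sharding idea concrete, here is a minimal sketch of hash-based shard selection for the user data. It assumes user_id is the shard key and that the number of shards is fixed; the function name and both choices are illustrative, not something the design above prescribes.

```python
import hashlib


def shard_for_user(user_id: int, shard_count: int) -> int:
    """Map a user_id to a shard index in the range [0, shard_count)."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # A stable hash is used so that every server computes the same shard
    # for the same user (Python's built-in hash() is salted per process
    # for strings, so it is avoided here).
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

A plain modulo scheme like this moves most keys when the shard count changes. If shards need to be added often, consistent hashing is a commonly used alternative.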

Media Service

The Media Service stores content in object storage (e.g. Amazon S3).
Given the volume of data being stored, we could also do some analysis and move old data into cold storage (e.g. Amazon S3 Glacier) to reduce storage costs.

Chat Service

Pull (polling) vs. Push
- Pull: Long polling is where the client makes a request to the server which holds the request open ("hangs") until new data is available or a timeout occurs. Once the client receives new data or a timeout signal, it immediately sends another request, and the process repeats. This creates a near real-time effect, but with the overhead of repeatedly establishing HTTP connections.
- Push: WebSocket is a TCP-based protocol that provides bi-directional, full-duplex communication. It is lightweight, and both the client and the server can independently send messages to each other.
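
The long polling behaviour described above (hold the request open until data arrives or a timeout occurs) can be simulated in-process with asyncio. This is only a sketch: the queue stands in for the server-side inbox of a client, no real HTTP is involved, and the function names are hypothetical.

```python
import asyncio
from typing import Optional, Tuple


async def long_poll(inbox: asyncio.Queue, timeout_s: float) -> Optional[str]:
    """Hold the 'request' open until a message arrives or the timeout expires."""
    try:
        return await asyncio.wait_for(inbox.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None  # timeout signal: the client should poll again


async def demo() -> Tuple[Optional[str], Optional[str]]:
    inbox: asyncio.Queue = asyncio.Queue()

    # First poll: nothing arrives, so the request times out.
    first = await long_poll(inbox, timeout_s=0.05)

    # Second poll: a message arrives while the request is being held open.
    async def deliver() -> None:
        await asyncio.sleep(0.01)
        await inbox.put("hello")

    task = asyncio.create_task(deliver())
    second = await long_poll(inbox, timeout_s=1.0)
    await task
    return first, second


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a WebSocket, the client would not need to re-issue a request after every message or timeout, because the server can push over the connection that is already open.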
Stateless vs. Stateful
- Stateless services process each request independently without retaining user session information, enhancing simplicity and scalability. Stateless protocols include HTTP (Hypertext Transfer Protocol). For example, in our User Service it doesn't matter which server a user connects to, as the User Database is the source of truth.
- Stateful services, on the other hand, maintain session data across requests, adding complexity but enabling continuous user interactions. Stateful protocols include TCP (Transmission Control Protocol). For example, suppose Client A creates a WebSocket connection with Chat Server 1 and then sends and receives messages through that server. Client A stays connected to that same server until the client disconnects or the server goes down.
Database Selection
- Given the scale discussed in the Estimates section, a NoSQL database like Cassandra or DynamoDB would be a suitable choice. Discord used Cassandra for a long time before moving to ScyllaDB.
- If the system were to use an SQL database, indexes would speed up data retrieval but could significantly slow down inserts and updates. We want to avoid that in a high-volume, real-time chat application where new messages are constantly being sent and received.

ID Service

Twitter Snowflake is a distributed system for generating unique, time-sequenced identifiers (IDs) at high scale.
We want message IDs to be globally unique, but we also want them to be time-sequenced. This lets each device keep a last_seen_message_id: if a message on the user's queue has a message_id larger than the device's last_seen_message_id, the device knows that it hasn't seen that message.
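
A minimal Snowflake-style generator is sketched below. It assumes the commonly described 64-bit layout (41 bits of millisecond timestamp, 10 bits of worker ID, 12 bits of per-millisecond sequence) and a custom epoch; the class name, the injectable clock, and the error handling are choices made for this example.

```python
import threading
import time
from typing import Callable, Optional


class SnowflakeGenerator:
    EPOCH_MS = 1288834974657  # custom epoch (assumed; any fixed past instant works)
    WORKER_BITS = 10
    SEQUENCE_BITS = 12
    MAX_WORKER_ID = (1 << WORKER_BITS) - 1
    MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

    def __init__(self, worker_id: int, clock: Optional[Callable[[], int]] = None):
        if not 0 <= worker_id <= self.MAX_WORKER_ID:
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        # The clock returns the current time in milliseconds.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & self.MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - self.EPOCH_MS
            return (
                (timestamp << (self.WORKER_BITS + self.SEQUENCE_BITS))
                | (self._worker_id << self.SEQUENCE_BITS)
                | self._sequence
            )
```

IDs from a single worker are strictly increasing. Across workers they are unique and ordered by timestamp, but two IDs generated in the same millisecond on different workers are ordered by worker ID rather than by exact arrival time.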

Service Discovery

ZooKeeper can be used to maintain a list of available chat servers and their statuses. This allows each client to query ZooKeeper to discover which server to connect to.
ZooKeeper's strong consistency model ensures that all clients see the same view of the server list and their statuses, which is crucial for correctly balancing the load and ensuring reliability.
A Mapping Database (user_id, server_id) can be used to store which users are connected to which chat servers. This data consists solely of key-value pairs, which makes MongoDB a good candidate for this mapping table.
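
The interaction between the server list and the Mapping Database can be sketched with in-memory stand-ins. Nothing here talks to a real ZooKeeper or database: the registry class, the "least loaded server" policy, and the dictionary used as the mapping table are all assumptions made for illustration.

```python
from typing import Dict


class ChatServerRegistry:
    """In-memory stand-in for the list of chat servers kept in ZooKeeper."""

    def __init__(self) -> None:
        self._connections: Dict[str, int] = {}  # server_id -> open connections

    def register(self, server_id: str) -> None:
        self._connections.setdefault(server_id, 0)

    def deregister(self, server_id: str) -> None:
        self._connections.pop(server_id, None)

    def record_connection(self, server_id: str) -> None:
        self._connections[server_id] += 1

    def least_loaded(self) -> str:
        if not self._connections:
            raise RuntimeError("no chat servers available")
        # Ties are broken by server_id so the choice is deterministic.
        return min(self._connections, key=lambda s: (self._connections[s], s))


def connect(user_id: str, registry: ChatServerRegistry, mapping: Dict[str, str]) -> str:
    """Pick a chat server for the user and record it in the mapping table."""
    server_id = registry.least_loaded()
    registry.record_connection(server_id)
    mapping[user_id] = server_id  # stand-in for the (user_id, server_id) table
    return server_id
```

When a chat server goes down it would be deregistered, and the clients that were connected to it would go through connect again to be assigned a new server.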

Messaging Flow

Client A is authenticated.
Client A is allocated to a chat server by ZooKeeper and this mapping is stored in the Mapping Database.
Client A can then send a message to the Chat Service.
The Chat Service reaches out to the ID Service to generate a unique ID.
The message is stored in the Chat Database.
The message is then put into the relevant Sync Queues.
In this case the message is sent to a group chat with two other users (Client B and Client C).
A message will be put on both of their queues (we can do this because of the 150 person group chat limit).
Client C is offline, so the Notification Service sends a notification to the relevant service (e.g. Apple Push Notifications [iOS] or Firebase Cloud Messaging [Android]).
Client B is online, so the message is forwarded to the Chat Server that Client B is connected to, which is known from the Mapping Database.
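
The flow above can be sketched as a small in-memory simulation. Every component here is a stand-in (a counter for the ID Service, a list for the Chat Database, dictionaries for the Sync Queues and Mapping Database), and the class and attribute names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import count
from typing import Deque, Dict, List, Tuple


class ChatSystem:
    def __init__(self, members: Dict[str, List[str]], mapping: Dict[str, str]) -> None:
        self.members = members    # conversation_id -> user_ids in the conversation
        self.mapping = mapping    # user_id -> server_id, online users only
        self.chat_db: List[dict] = []                                  # Chat Database
        self.sync_queues: Dict[str, Deque[int]] = defaultdict(deque)   # per-user queues
        self.forwarded: List[Tuple[str, str, int]] = []   # (server_id, user_id, message_id)
        self.notifications: List[Tuple[str, int]] = []    # (user_id, message_id)
        self._ids = count(1)      # stand-in for the ID Service

    def send(self, sender_id: str, conversation_id: str, content: str) -> int:
        message_id = next(self._ids)
        self.chat_db.append({
            "message_id": message_id,
            "user_id": sender_id,
            "conversation_id": conversation_id,
            "content": content,
        })
        # Fan out: one queue entry per recipient (cheap because groups are small).
        for user_id in self.members[conversation_id]:
            if user_id == sender_id:
                continue
            self.sync_queues[user_id].append(message_id)
            server_id = self.mapping.get(user_id)
            if server_id is None:
                self.notifications.append((user_id, message_id))       # offline
            else:
                self.forwarded.append((server_id, user_id, message_id))  # online
        return message_id
```

For example, with Client A and Client B online and Client C offline, a message from A to the group ends up forwarded to B's chat server and queued as a push notification for C.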

Presence Service

Each client will have a WebSocket connection to a stateful Presence server.
A heartbeat mechanism will be used for maintaining real-time user status (e.g. online, offline).
A periodic signal is sent from the client (the user's device) to the server to indicate that the client is still connected and active. This information (e.g. user_id, online_status, last_online_timestamp) is stored in the Presence Database, and messages are pushed onto Presence Queues so that the other users that user was messaging know that the user is online.
If the Presence Service does not receive a heartbeat from a client for a specified period of time, it can insert a new row in the database indicating that the client is no longer online, and push messages onto the relevant Presence Queues to let other users know that the user is no longer online.
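
A minimal sketch of the heartbeat logic follows. It keeps state in memory instead of a Presence Database, uses a plain list in place of the Presence Queues, and takes timestamps as arguments so the behaviour is deterministic. The class name, the 30 second timeout, and the periodic sweep are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


class PresenceTracker:
    def __init__(self, timeout_s: float = 30.0) -> None:
        self._timeout_s = timeout_s
        self._last_heartbeat: Dict[str, float] = {}  # user_id -> last heartbeat time
        self._online: Set[str] = set()
        # Stand-in for the Presence Queues: (user_id, status, timestamp).
        self.events: List[Tuple[str, str, float]] = []

    def heartbeat(self, user_id: str, now: float) -> None:
        """Record a heartbeat; emit an 'online' event on the first one."""
        self._last_heartbeat[user_id] = now
        if user_id not in self._online:
            self._online.add(user_id)
            self.events.append((user_id, "online", now))

    def sweep(self, now: float) -> None:
        """Mark users offline if no heartbeat arrived within the timeout."""
        for user_id in sorted(self._online):  # sorted() copies, so removal is safe
            if now - self._last_heartbeat[user_id] > self._timeout_s:
                self._online.discard(user_id)
                self.events.append((user_id, "offline", now))

    def is_online(self, user_id: str) -> bool:
        return user_id in self._online
```

The timeout is usually set to a few heartbeat intervals so that a single delayed or dropped heartbeat does not make a user flicker between online and offline.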

Additional Discussion Points

End to end encryption
Media file handling
Rate limiting and abuse prevention
Monitoring and logging
Compliance and data retention
Disaster recovery
