Cloud File Sharing Service (Dropbox, Google Drive)

Overview
Introduction
A cloud file sharing service is an online platform that allows users to store, share, and collaborate on files and documents over the internet. Examples include:
- Google Drive
- Dropbox
Requirements
- Functional Requirements
- File upload and download
- Sync files across multiple users and devices
- Non-Functional Requirements
- Highly scalable
- Highly reliable
- Not Included
- Real-time multi-user editing
Estimates
- Daily active users (DAU) = 500 million (500 * 10⁶)
- Number of files uploaded = 1 file / user / day
- Average file size = 150 KB (15 * 10⁴ bytes)
- Daily storage:
- (500 * 10⁶ users) * (15 * 10⁴ bytes) ≈ 7.5 * 10¹³ bytes
- Yearly storage requirements:
- 365 days * (7.5 * 10¹³ bytes) ≈ 2.7 * 10¹⁶ bytes ≈ 27 PB
- Assume files are stored for 10 years
- Total storage: 10 years * (27 PB) = 270 PB
- Upload QPS: (500 * 10⁶ uploads/day) / (24 hours * 60 minutes * 60 seconds) ≈ 6,000 QPS
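The estimates above can be checked with a quick back-of-envelope script (constants taken directly from the assumptions in this section):

```python
# Back-of-envelope check of the storage and QPS estimates above.
DAU = 500_000_000          # daily active users
FILE_SIZE = 150 * 1000     # average file size in bytes (150 KB)

daily_storage = DAU * FILE_SIZE        # bytes written per day
yearly_storage = 365 * daily_storage   # bytes per year
total_storage = 10 * yearly_storage    # 10-year retention

upload_qps = DAU / (24 * 60 * 60)      # one upload per user per day

print(f"daily:  {daily_storage:.1e} bytes")      # ~7.5e13
print(f"yearly: {yearly_storage:.1e} bytes")     # ~2.7e16 (~27 PB)
print(f"total:  {total_storage / 1e15:.0f} PB")  # ~274 PB (rounded to 270 above)
print(f"QPS:    {upload_qps:.0f}")               # ~5787, rounded to ~6,000
```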
Core Problem
The core problem revolves around sending lots of unnecessary data across the network.
For example, suppose we were working on a large file of size 2 GB. If we added a single character to the end of that file, we would have to send the entire 2 GB to the web servers, which would store the entire 2 GB in content storage (e.g. S3), and then that 2 GB would have to be sent to every other client with access to the file.
That's a lot of data to be sent across the network for a single character change!
To solve this problem we will introduce the concept of blocks.
Each file will be composed of blocks of a predetermined size (4 MB in our case). By sending only the blocks that changed, instead of the entire file, we greatly reduce the amount of data sent across the network.
With this new approach, we would only need to send and store 4 MB as opposed to 2 GB.
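A minimal sketch of this idea in Python (function names are hypothetical): split a file into fixed-size blocks, hash each block, and compare hashes to find which blocks need to be uploaded. Note that fixed-size blocking only localizes changes that don't shift block boundaries; an insertion at the start of a file would dirty every block, which is why production systems often use content-defined chunking instead.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, per the design above

def blockify(data: bytes) -> list[bytes]:
    """Split a file's contents into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def changed_blocks(old: bytes, new: bytes) -> list[int]:
    """Return positions of blocks whose hashes differ between versions."""
    old_hashes = [hashlib.sha256(b).digest() for b in blockify(old)]
    new_hashes = [hashlib.sha256(b).digest() for b in blockify(new)]
    return [
        i for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    ]

# Appending one byte to a 10 MB file (3 blocks) only dirties the last block.
old = bytes(10 * 1024 * 1024)
new = old + b"x"
print(changed_blocks(old, new))  # [2]
```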
Data Model
This is a basic outline of some of the core tables that could be included in a cloud file sharing service data model.
- users
- Contains information related to the user.
- devices
- user_id: Foreign key which is used to identify which user the device belongs to. Many-to-one relationship where a user can have many devices and a device can only belong to one user.
- team_space
- name: The name given to the space (e.g. personal files)
- user_team_space
- user_id: Foreign key which is used to identify a user.
- team_space_id: Foreign key which is used to identify a team space.
- This table facilitates the many-to-many relationship between users and team spaces.
- object
- Represents a file in the system.
- latest_history_number: Points to the latest version of the object and enables versioning of objects.
- team_space_id: Indicates which team_space the object belongs to.
- object_history
- history_number: Indicates where in the object's history this version belongs.
- block
- object_history_id: The object version (object_history row) the block belongs to.
- block_position: The position of the block within the object. Allows files to be reconstructed in the correct order.
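The tables above could be sketched as an illustrative SQLite schema (the extra columns and types are assumptions for illustration, not a production design):

```python
import sqlite3

# Illustrative schema for the core tables above; columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users          (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE devices        (id INTEGER PRIMARY KEY,
                             user_id INTEGER REFERENCES users(id));
CREATE TABLE team_space     (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_team_space(user_id INTEGER REFERENCES users(id),
                             team_space_id INTEGER REFERENCES team_space(id),
                             PRIMARY KEY (user_id, team_space_id));
CREATE TABLE object         (id INTEGER PRIMARY KEY,
                             team_space_id INTEGER REFERENCES team_space(id),
                             latest_history_number INTEGER);
CREATE TABLE object_history (id INTEGER PRIMARY KEY,
                             object_id INTEGER REFERENCES object(id),
                             history_number INTEGER);
CREATE TABLE block          (id INTEGER PRIMARY KEY,
                             object_history_id INTEGER
                                 REFERENCES object_history(id),
                             block_position INTEGER,
                             hash TEXT);
""")
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

The `user_team_space` join table carries the many-to-many relationship, while `object_history` plus `latest_history_number` gives cheap versioning: a new version appends rows rather than mutating old ones.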
Architecture Overview
The Client is responsible for several functions in this system.
- Monitor
- Monitors for changes to files in the local workspace and notifies the Blockify service when changes have been made.
- Blockify
- Blockify splits files into smaller blocks and is also responsible for reconstructing blocks into their corresponding files.
- Once the blocks are created the Synchronizer is notified.
- Synchronizer
- The Synchronizer, as the name suggests, relays these changes to the Client Database as well as to the Block and Metadata services, which are discussed later.
- Client Database
- Locally stores information about the shared space, blocks, object hashes, object history, etc. Can be implemented with a lightweight database like SQLite.
- It is focused on managing the local file state and is tailored to individual clients.
- The Block Service handles storing blocks sent by clients in the content storage.
- To reduce storage costs and improve security the Block service will also implement compression and encryption.
- Compression: Huffman coding or arithmetic coding. (Note: Google and Dropbox would have their own proprietary compression algorithms as well).
- Encryption: 256-bit Advanced Encryption Standard (AES).
- To further reduce storage costs the system could utilize cold storage like Amazon S3 Glacier.
- However, the policy for determining when to place blocks in cold storage is important. Analysis should be done beforehand to ensure with high confidence that the block will not be accessed again as it is expensive to access data from cold storage.
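As a rough sketch of the compress-then-store step, here is a content-addressed block store using stdlib `zlib` as a stand-in for the proprietary codecs mentioned above (AES encryption would typically come from a library such as `cryptography` and is omitted here; function names are hypothetical):

```python
import hashlib
import zlib

def store_block(block: bytes, storage: dict) -> str:
    """Compress a block and store it under its content hash.

    Content-addressing also gives deduplication for free: identical
    blocks from different files map to the same storage key.
    """
    key = hashlib.sha256(block).hexdigest()
    if key not in storage:                    # skip duplicate blocks
        storage[key] = zlib.compress(block)   # stand-in for a real codec
    return key

def load_block(key: str, storage: dict) -> bytes:
    return zlib.decompress(storage[key])

storage: dict[str, bytes] = {}
block = b"hello world" * 1000
key = store_block(block, storage)
assert load_block(key, storage) == block
print(len(storage[key]), "compressed bytes vs", len(block), "raw")
```

Note that the ordering of compression, encryption, and deduplication matters in practice: naive per-user encryption before hashing would defeat cross-user deduplication.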
- The Metadata Service is concerned with the global state of files across the system, facilitating synchronization, and is designed to handle the complexities of a multi-client system.
- Information that gets sent here relates to shared spaces, blocks, hashes of the objects, object history etc.
- Having a cache (e.g. Redis or Memcached) to store frequently accessed objects could improve the overall system's performance.
- The choice of eviction policy is very important, as getting this right could significantly reduce the load on the database.
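To make the eviction-policy point concrete, here is a minimal sketch of LRU (least recently used) eviction, one of the most common default policies, built on an `OrderedDict`:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU eviction policy sketch."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # touching "a" makes "b" the eviction candidate
cache.put("c", 3)  # evicts "b"
print(list(cache.items))  # ['a', 'c']
```

Redis offers LRU alongside LFU and TTL-based policies; which one reduces database load most depends on the access pattern of file metadata.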
- The database will be used to store the metadata about users, objects, blocks etc.
- We can use the CAP theorem trade-offs to help determine which type of database to use. Reliability and consistency are key for this system, as we never want two users viewing the same file to see different data.
- Therefore, an SQL database with its ACID properties would be a good choice (e.g. MySQL, PostgreSQL).
- Given the scale shown in the Estimates section, sharding might be a good choice to distribute the load.
- Sharding involves partitioning the database into smaller, more manageable pieces, called shards, each of which can be stored on a different server.
- This strategy is used to scale databases horizontally, allowing them to handle more data and traffic by distributing the load across multiple servers. Key things to note:
- Shard Key Selection: For file sharing metadata, common shard keys might include user IDs, file IDs, or a hash of the file name. The choice of shard key is crucial, as it affects the database's performance and scalability.
- Sharding Strategy: There are different sharding strategies, such as range-based sharding, hash-based sharding, etc. For file sharing metadata, hash-based sharding is often used, where a hash function is applied to the shard key to distribute rows evenly across shards. This can help ensure balanced load distribution and improve query performance.
- Consistency and Replication: To ensure high availability and data durability, each shard can be replicated across multiple servers. This adds another layer of complexity to shard management but is essential for fault tolerance.
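A minimal sketch of hash-based sharding on a user-ID shard key (the shard count and function names are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count for illustration

def shard_for(user_id: int) -> int:
    """Hash-based sharding: hash the shard key, then mod the shard count.

    Hashing spreads sequential IDs evenly; a plain `user_id % NUM_SHARDS`
    can hot-spot when IDs are assigned in contiguous ranges.
    """
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The mapping is deterministic, so all of a user's metadata rows land on
# the same shard and per-user queries never fan out across shards.
print(shard_for(42) == shard_for(42))  # True
```

One caveat with a fixed modulo: changing `NUM_SHARDS` remaps nearly every key, which is why consistent hashing is often used when shards must be added over time.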
- The Metadata Service is also responsible for pushing messages onto queues. Each workspace will typically have several members, so each member gets a message pushed onto their queue when changes are made.
- The Notification Service notifies clients when changes have been made to a file with which they are associated.
- The client's Synchronizer listens for changes from the Notification Service via long polling, which creates a near real-time effect at the cost of repeatedly re-establishing HTTP connections.
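The long-polling behavior can be simulated in-process with a blocking queue standing in for the held-open HTTP request (all names here are hypothetical): the client blocks until a notification arrives or the request times out, then immediately re-polls.

```python
import queue
import threading
import time

# One per-member notification queue, standing in for the held-open request.
notifications: "queue.Queue[str]" = queue.Queue()

def long_poll(timeout: float = 30.0):
    """Block up to `timeout` seconds waiting for a change event.

    On timeout the real client would immediately re-issue the request,
    which is the 'repeated HTTP connection' overhead noted above.
    """
    try:
        return notifications.get(timeout=timeout)
    except queue.Empty:
        return None

def metadata_service():
    time.sleep(0.1)  # a change arrives shortly after the poll starts
    notifications.put("object 7: new block at position 2")

threading.Thread(target=metadata_service).start()
event = long_poll(timeout=5.0)
print(event)  # object 7: new block at position 2
```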
System Flow
Additional Discussion Points