Proximity Service (Yelp)

Overview
Introduction
Proximity services (like Yelp) are applications that help users find and access services, businesses, or points of interest based on their current location.
Designing a proximity service is a very popular system design question as it covers many core system design concepts including storing and retrieving data from multiple data sources, scalability, reliability, and interacting with geolocation services.
Requirements
- Functional Requirements
- Create an attraction (location of service, business etc.)
- Search attractions
- Add a review to an attraction
- Non-Functional Requirements
- High scalability
- High availability (willing to accept eventual consistency)
- Low latency
Estimates
- Storage
- 50 million DAU (daily active users)
- Average number of attractions added: 1,000 / day
- Average number of reviews posted per user: 2 / month
- Average attraction size (including metadata, images): 500 KB
- Average review size: 2 KB
- Yearly attraction storage: 1,000 attractions * 365 days * 500 KB = 182.5 GB
- Yearly reviews storage: 50 million users * 2 reviews * 12 months * 2 KB = 2.4 TB
- Total yearly storage: 182.5 GB (attractions) + 2.4 TB (reviews) ≈ 2.6 TB / year
- Queries Per Second (QPS)
- Writes per second (attractions): 1,000 / (24 hours * 3600 seconds) ≈ 0.0116 writes / second
- Writes per second (reviews): (50 million users * 2 reviews) / (30 days * 24 hours * 3600 seconds) ≈ 38.58 writes / second
- Total writes per second: 0.0116 + 38.58 ≈ 38.59 writes / second
- Average number of reads per user: 20 / month
- Reads per second: (50 million users * 20 reads) / (30 days * 24 hours * 3600 seconds) ≈ 385.8 reads / second
- Queries per second: 38.59 (writes) + 385.8 (reads) ≈ 424.39 QPS
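The arithmetic above can be reproduced with a short script. This is a minimal sketch that uses only the assumptions listed in this section; the constant and function names are illustrative, and decimal units (1 GB = 1,000,000 KB) are assumed so the results match the figures above.

```python
# Back-of-the-envelope estimates built from the assumptions listed above.
DAU = 50_000_000
ATTRACTIONS_PER_DAY = 1_000
REVIEWS_PER_USER_PER_MONTH = 2
READS_PER_USER_PER_MONTH = 20
ATTRACTION_SIZE_KB = 500
REVIEW_SIZE_KB = 2

KB_PER_GB = 1_000_000        # decimal units (assumption)
KB_PER_TB = 1_000_000_000
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_MONTH = 30 * SECONDS_PER_DAY


def estimate() -> dict:
    """Return yearly storage and per-second traffic estimates."""
    attraction_gb = ATTRACTIONS_PER_DAY * 365 * ATTRACTION_SIZE_KB / KB_PER_GB
    review_tb = DAU * REVIEWS_PER_USER_PER_MONTH * 12 * REVIEW_SIZE_KB / KB_PER_TB
    attraction_wps = ATTRACTIONS_PER_DAY / SECONDS_PER_DAY
    review_wps = DAU * REVIEWS_PER_USER_PER_MONTH / SECONDS_PER_MONTH
    reads_ps = DAU * READS_PER_USER_PER_MONTH / SECONDS_PER_MONTH
    return {
        "attraction_storage_gb": attraction_gb,
        "review_storage_tb": review_tb,
        "total_storage_tb": attraction_gb / 1_000 + review_tb,
        "writes_per_second": attraction_wps + review_wps,
        "reads_per_second": reads_ps,
        "qps": attraction_wps + review_wps + reads_ps,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.2f}")
```

Reads outnumber writes by roughly ten to one under these assumptions, which is why the later sections favor read-optimized choices such as caching and spatial indexing.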
Data Model
This is a basic outline of some of the core tables that could be included in a proximity service data model.
- users
- Contains information related to the user.
- attractions
- attraction_id: Uniquely identifies each attraction.
- user_id: Identifies the user who created the attraction.
- latitude: The latitude of the attraction.
- longitude: The longitude of the attraction.
- geohash: The geohash of the attraction location.
- media
- media_id: Uniquely identifies each media item.
- attraction_id: Links each media item to an attraction.
- media_type: Describes the media type (e.g., image, video).
- media_url: Media file location.
- reviews
- review_id: Uniquely identifies each review.
- user_id: Links each review to a user, indicating who created it.
- attraction_id: Links each review to an attraction.
- rating: The rating given to the attraction.
- comment: The content of the review.
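To make the relationships between these tables concrete, here is a minimal sketch of the model as Python dataclasses. The field names come from the outline above; the field types (string IDs, integer rating) and the helper `average_rating` are assumptions added for illustration, not part of the original design.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Attraction:
    attraction_id: str
    user_id: str       # the user who created the attraction
    latitude: float
    longitude: float
    geohash: str


@dataclass
class Media:
    media_id: str
    attraction_id: str  # links the media item to an attraction
    media_type: str     # e.g. "image" or "video"
    media_url: str


@dataclass
class Review:
    review_id: str
    user_id: str        # the user who wrote the review
    attraction_id: str  # the attraction being reviewed
    rating: int
    comment: str = ""


def average_rating(reviews: List[Review], attraction_id: str) -> Optional[float]:
    """Hypothetical helper: mean rating for one attraction, or None if unreviewed."""
    ratings = [r.rating for r in reviews if r.attraction_id == attraction_id]
    return sum(ratings) / len(ratings) if ratings else None
```

In a relational database, `attraction_id` and `user_id` on the media and reviews tables would typically be foreign keys.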
API Design
For a proximity service, we will use a classic RESTful API to interact with the data. RESTful APIs are simple, widely used, stateless, and cacheable, which makes them a good fit for our system.
Our REST API will comprise three main endpoints:
- POST: /api/attractions
- Params:
- name: string (required)
- description: string
- category: string
- latitude: float (required)
- longitude: float (required)
- creator_id: string (required)
- media: array of binary files
- metadata: JSON object (additional details like opening hours, phone, website, etc.)
- Response code: 201 Created
- GET: /api/attractions/search
- Params:
- latitude: float (required)
- longitude: float (required)
- radius: float (in kilometers, default: 5)
- category: string (optional)
- query: string (optional, for text search)
- sort: string (options: "distance", "rating", "review_count", default: "distance")
- page: integer (for pagination, default: 1)
- per_page: integer (items per page, default: 20)
- Response code: 200 OK
- POST: /api/attractions/{attraction_id}/reviews
- Params:
- attraction_id: string (required, path parameter in the URL)
- user_id: string (required)
- rating: integer (required, 1-5)
- comment: string
- Response code: 201 Created
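As an illustration of how the search endpoint's parameters and defaults might be handled on the server, the sketch below parses a raw query-string mapping into a typed object. The names `SearchParams` and `parse_search_params` are hypothetical, and the range checks (valid coordinates, positive radius, positive page numbers) are assumptions that the endpoint description does not spell out.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

ALLOWED_SORTS = ("distance", "rating", "review_count")


@dataclass(frozen=True)
class SearchParams:
    latitude: float
    longitude: float
    radius: float = 5.0             # kilometers
    category: Optional[str] = None
    query: Optional[str] = None
    sort: str = "distance"
    page: int = 1
    per_page: int = 20


def parse_search_params(raw: Mapping[str, str]) -> SearchParams:
    """Apply the documented defaults and reject malformed input."""
    for name in ("latitude", "longitude"):
        if name not in raw:
            raise ValueError(f"missing required parameter: {name}")
    params = SearchParams(
        latitude=float(raw["latitude"]),
        longitude=float(raw["longitude"]),
        radius=float(raw.get("radius", 5.0)),
        category=raw.get("category"),
        query=raw.get("query"),
        sort=raw.get("sort", "distance"),
        page=int(raw.get("page", 1)),
        per_page=int(raw.get("per_page", 20)),
    )
    # The checks below are assumed validation rules, not part of the spec above.
    if not -90.0 <= params.latitude <= 90.0:
        raise ValueError("latitude must be between -90 and 90")
    if not -180.0 <= params.longitude <= 180.0:
        raise ValueError("longitude must be between -180 and 180")
    if params.radius <= 0:
        raise ValueError("radius must be positive")
    if params.sort not in ALLOWED_SORTS:
        raise ValueError(f"sort must be one of {ALLOWED_SORTS}")
    if params.page < 1 or params.per_page < 1:
        raise ValueError("page and per_page must be at least 1")
    return params
```

A request such as `GET /api/attractions/search?latitude=40.7&longitude=-74.0` would then resolve to a 5 km radius, sorted by distance, returning the first page of 20 results.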
Creating Attraction Flow
- User sends a POST request
- Request Initiation: The user, who must be signed in, sends a POST request to the /api/attractions API endpoint.
- Load Balancer
- Routing and Rate Limiting: The load balancer routes the request to an instance of the Attraction Write Service. It also performs rate limiting using either a sliding window or token bucket algorithm to ensure that the service does not get overwhelmed by malicious users. This helps maintain high availability and prevents service abuse.
- The sliding window algorithm tracks requests over a rolling time period, allowing for more granular control and smoother traffic shaping.
- The token bucket algorithm uses a fixed-capacity bucket that refills with tokens at a constant rate. Each request consumes a token, so short bursts of traffic are allowed up to the bucket's capacity.
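The token bucket approach described above can be sketched in a few lines. This is a single-process illustration only: the class name `TokenBucket` and its parameters are hypothetical, and a real load balancer would typically keep one bucket per client and store the state in a shared store.

```python
import time


class TokenBucket:
    """Fixed-capacity bucket that refills at a constant rate."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock             # injectable for testing
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume tokens if the request fits, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

For example, `TokenBucket(capacity=10, refill_rate=5)` would allow a burst of 10 requests and then sustain about 5 requests per second; rejected requests would usually receive a 429 Too Many Requests response.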
- Image Upload:
- Object Storage: The Attraction Write Service uploads the images included in the request to an object storage service, such as Amazon S3.
- CDN: The images are then served through a Content Delivery Network (CDN) to ensure fast and reliable delivery to users globally.
- URL Generation: The URLs pointing to the stored images are returned and saved as media_url values in the media table, which links each image to its attraction.
- Geohashing Service:
- Geohashing: The Geohash of the location is calculated based on the provided latitude and longitude. Geohashing is a method of encoding latitude/longitude coordinates into a single string of letters and digits, which allows for efficient proximity searches by spatially indexing the location data. The length of the geohash determines the precision of the location representation: longer geohashes represent more precise locations.
- Implementation Strategy:
- Store the full 8-character geohash (a cell of roughly 38 m × 19 m) in the database for each attraction.
- For broader area searches, use prefix matching on the first 5-6 characters.
- For precise local searches, match on 7-8 characters.
- Alternative Approach: A quadtree could be used instead. A quadtree is a tree data structure in which each internal node has exactly four children; it partitions a two-dimensional space by recursively subdividing it into four quadrants. For this system design, however, we will focus on geohashing due to its simplicity and efficiency for our use case.
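To show how geohash encoding and prefix matching fit together, here is a minimal sketch of the standard geohash algorithm: bits alternate between longitude and latitude, each bit halves the remaining range, and every 5 bits become one base-32 character. The function names `geohash_encode` and `filter_by_prefix` are illustrative; in practice a library or the database's own functions would likely be used.

```python
from typing import Iterable, List

# Geohash base-32 alphabet (digits plus lowercase letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(latitude: float, longitude: float, precision: int = 8) -> str:
    """Encode a coordinate pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars: List[str] = []
    bit_count = 0
    value = 0
    use_longitude = True  # the first bit always refines longitude
    while len(chars) < precision:
        rng, coord = (lon_range, longitude) if use_longitude else (lat_range, latitude)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # 5 bits make one base-32 character
            chars.append(BASE32[value])
            bit_count = 0
            value = 0
    return "".join(chars)


def filter_by_prefix(geohashes: Iterable[str], center: str, prefix_len: int) -> List[str]:
    """Keep the geohashes that share the first `prefix_len` characters with `center`."""
    prefix = center[:prefix_len]
    return [g for g in geohashes if g.startswith(prefix)]
```

One caveat worth raising in an interview: two nearby points on opposite sides of a cell boundary can have different prefixes, so a prefix search alone may miss close results. A common mitigation is to also query the neighboring cells and then filter by exact distance.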
- Storing the Attraction
- Postgres Database: The attraction details, including the Geohash and image URLs, are stored in the main Postgres database. A Postgres database is chosen because it supports PostGIS, an extension that provides robust spatial database capabilities, enabling efficient geospatial queries (e.g., distance calculations, area overlaps) and indexing.
- Asynchronous Processing:
- Message Queue: Non-critical metadata is processed asynchronously. A message containing this metadata is sent to a message queue (e.g., RabbitMQ or Kafka).
- Metadata Service: The Metadata Service reads the message from the queue and processes the data. Given the non-rigid nature of attraction metadata, a NoSQL database like DynamoDB is a suitable option for storage due to its flexibility and scalability.
- Metadata Index: The Metadata Service also updates the metadata index, allowing for efficient querying and retrieval of metadata. This could involve updating an Elasticsearch index to enable advanced search capabilities based on the metadata.
After successfully storing the attraction details and initiating metadata processing, a response is sent back to the user indicating that the attraction was successfully created.
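The asynchronous metadata step in this flow can be illustrated with an in-process stand-in. In the sketch below, Python's `queue.Queue` plays the role of the message queue (RabbitMQ or Kafka in the design), and plain dictionaries stand in for the NoSQL store and the search index. All names here are hypothetical, and the synchronous Postgres write is omitted.

```python
import queue
import threading
from typing import Any, Dict, Set

metadata_queue: "queue.Queue[Any]" = queue.Queue()   # stand-in for RabbitMQ/Kafka
metadata_store: Dict[str, Dict[str, Any]] = {}       # stand-in for the NoSQL store
metadata_index: Dict[str, Set[str]] = {}             # stand-in for the search index


def create_attraction(attraction_id: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Write path: enqueue non-critical metadata and respond without waiting."""
    # The core attraction row would be written to the main database here.
    metadata_queue.put({"attraction_id": attraction_id, "metadata": metadata})
    return {"status": 201, "attraction_id": attraction_id}


def metadata_worker() -> None:
    """Metadata Service: consume messages, store metadata, update the index."""
    while True:
        message = metadata_queue.get()
        try:
            if message is None:  # sentinel value used to stop the worker
                return
            attraction_id = message["attraction_id"]
            metadata_store[attraction_id] = message["metadata"]
            for key in message["metadata"]:
                metadata_index.setdefault(key, set()).add(attraction_id)
        finally:
            metadata_queue.task_done()


def run_demo() -> Dict[str, Any]:
    """Start a worker, create one attraction, and wait for processing to finish."""
    worker = threading.Thread(target=metadata_worker, daemon=True)
    worker.start()
    response = create_attraction("a1", {"phone": "555-0100", "website": "example.com"})
    metadata_queue.put(None)
    metadata_queue.join()  # block until every queued message has been handled
    return response
```

The key property is that `create_attraction` returns as soon as the message is enqueued, so slow metadata processing or indexing does not add latency to the user's request.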
Searching Attractions Flow
Reviewing Attraction Flow
Complete Architecture
Additional Discussion Points