System Design Interview - Design a Real-Time Emoji Broadcast for Millions?

Problem #1 - How to Build a Scalable System for Sending and Receiving Emojis in Live Sports Streaming?

Sep 15, 2024

Hello folks. If you have sat through System Design interviews, you must have seen some common problems like designing Twitter, an Instagram feed, or a job scheduler.

Those are some old, classic questions that people usually mug up while preparing for their interviews.

As an interviewer, I don’t see any value in asking these questions.

I have conducted more than 100 System Design interviews in my career and rarely ask these classic problems, and neither do many companies like Google, Atlassian, and Meta. Many non-MAANG companies have also adopted open-ended, real-world system design questions.

When answering system design questions, remember that the interviewer knows nothing about your experience—whether you've solved hundreds of design problems or none. You have just 45 minutes to 1 hour for them to evaluate you.
Since there's no single "right" solution to a system design problem, the quality of your answer depends on how well you understand the fundamentals and your familiarity with technologies that address specific challenges. So, focus on keeping your core concepts sharp.

By the end of this system design tutorial, you will understand how to approach a real-world System Design problem, structure your responses, and identify key questions to ask.

Let’s dive right into the problem.

Problem Statement

Imagine watching a cricket match on Disney+ Hotstar, where bursts of emojis 😀🤑—👏 (claps), ❤️ (hearts), and 👍 (thumbs-up)—pop up from users during exciting moments like boundaries or wickets.
In this scenario, 10 million users are watching the match simultaneously, sending reactions that need to be broadcast to all other viewers in near real-time.

I usually prefer answering system design questions in the following format:-

Clarify Requirements: Ensure a clear understanding of both functional and non-functional requirements by asking clarifying questions.
Functional Requirements: Outline the key features the system must support, such as user actions and expected behaviors.
Non-Functional Requirements: Address the system's scalability, availability, performance, and other quality attributes.
Define Core Entities: Identify and define the core entities and data models the system will work with.
Define High-Level API: Draft the essential API endpoints and interactions between system components.
High-Level Design: Sketch out the overall system architecture, focusing on how components interact to meet the requirements.
Estimations: Perform back-of-the-envelope calculations to ensure the system can handle the expected load.
Detailed Design: Dive deeper into specific components, handling non-functional requirements such as fault tolerance, latency, and database design.

1. Clarifying Requirements

The first step in tackling open-ended system design questions is to ask clarifying questions. Since these questions are often broad, it’s crucial to fully understand the requirements—misinterpretation can lead to a bad solution.

These open-ended questions are designed to assess whether, given a real situation, you can understand the requirements and navigate through the design.

Generally, you should ask:

Who are the primary users (e.g., internal users, end-users)?
What is the main function of the system?
Are there any constraints around what the system must deliver?
What are the performance expectations (e.g., how quickly must responses be sent)?
Are there any edge cases that should be taken into account (e.g., network failure, system load spikes)?
Anything which can affect the storage type, speed, and latency?

For this emoji problem, these are the clarifying questions you can ask:-

Before reading further, think about which clarifying questions you would ask.

How many types of emojis are allowed to be sent during the match?
Interviewer:- We support basic reactions like claps, hearts, thumbs-up, and possibly a few others (e.g., 5-6 types of emojis in total).
Do we want to store the emojis sent?
Interviewer:- Yes, let’s store the emojis for 24 to 48 hours post-event.
Is there a limit on how frequently users can send emojis?
Interviewer:- No, there’s no hard limit on emoji frequency, but we expect bursts during key moments in the match (e.g., during a boundary or wicket).
What is the expected maximum number of concurrent users during peak moments?
Interviewer:- The system should handle up to 10 million concurrent users during high-traffic moments.
Should all users see every single emoji sent by every other user?
Interviewer:- That is a wonderful question. No, not every single emoji needs to be shown to every user. It’s more of a flood of reactions, so a representative subset (e.g., a few emojis every second) should be broadcast globally.
What’s the expected latency for sending and receiving emojis?
Interviewer:- Emojis should be broadcast in near real-time, ideally with a latency under 2 seconds.
Should we support emojis sent from users on different platforms (mobile, web, etc.)?
Interviewer:- To keep the scope small, the system should support sending and receiving emojis on mobile apps (iOS and Android) only. Consider the user base to be global.
What happens when a user experiences network issues or a delay?
Interviewer:- In case of minor delays, emojis should be sent as soon as connectivity is restored. If the user is offline for too long, those emojis will not be displayed to others.
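
The reconnect behaviour in the last answer can be sketched on the client side as a small buffer that drops reactions once they are too old. This is only an illustration: `PendingReaction`, `flush_pending`, and `MAX_AGE_SECONDS` are hypothetical names, and the 5-second cutoff is an assumed value because the interviewer did not give a number.

```python
from dataclasses import dataclass

# Assumed cutoff; the interview answer only says "offline for too long".
MAX_AGE_SECONDS = 5.0


@dataclass(frozen=True)
class PendingReaction:
    emoji_type: str
    created_at: float  # client clock, in seconds


def flush_pending(pending, now, max_age=MAX_AGE_SECONDS):
    """Split buffered reactions into those still worth sending and those to drop."""
    to_send = [r for r in pending if now - r.created_at <= max_age]
    dropped = [r for r in pending if now - r.created_at > max_age]
    return to_send, dropped


# Connectivity comes back at t=110: the 10-second-old clap is stale, the heart is not.
buffered = [PendingReaction("clap", 100.0), PendingReaction("heart", 108.0)]
to_send, dropped = flush_pending(buffered, now=110.0)
print([r.emoji_type for r in to_send])  # ['heart']
print([r.emoji_type for r in dropped])  # ['clap']
```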

Would you ask any other questions? Let me know.

2. Functional Requirements (FR)

Functional requirements define the essential behaviors and features that the system must deliver.

Functional Requirements guide the High-Level Design and should be communicated clearly with the interviewer.

The requirements list user actions and core user flows and generally start with “Users should be able to...”.

Focus on the top 3-4 core requirements and move less critical aspects to the "Out of Scope" section after discussion with your interviewer.

Let’s write down the requirements for this problem

3. Non Functional Requirements (NFR)

Non-Functional Requirements define how the system should perform under various conditions, including increased load and distributed environments.

System design interview success depends on your ability to identify challenges, bottlenecks, and limitations within a given problem and solve them effectively, all while balancing trade-offs such as scalability, reliability, and performance.

Ask questions to yourself like:-

Does this system prioritize availability or consistency?
What is the expected scale and throughput? Is it read-heavy or write-heavy?

Do not just write down cliché words such as “I will make the system scalable, fault-tolerant, and resilient.” Instead, focus on precise, actionable requirements like

“We will prioritize availability for x and consistency for y” or

“The system is read-heavy, and thus needs to be able to support high read throughput.”

Let’s write down the NFR for this problem:-

Back-of-the-envelope calculations are best done after the High-Level Design (HLD).
Once the HLD outlines the system’s architecture, these calculations help estimate resource requirements, such as storage and bandwidth, and identify potential bottlenecks.
This validates whether your HLD can handle the expected load and perform efficiently. If not, you can make informed adjustments to the design based on practical constraints and performance expectations, which can be a good starting point for the Detailed Design.


4. Core Entities

Now is the right time to define the core entities that will lay the foundation for API design. Focus on entities essential for meeting functional requirements:

Core Entities for the Emoji problem are:-

User: Represents each individual using the system (viewer of the match).
- Attributes: UserId, Name, Location, DeviceType
Emoji: Represents the type of emoji (clap, thumbs-up, heart, etc.) being sent by users.
- Attributes: EmojiId, Type, CreatedAt
Reaction: This entity tracks each reaction (emoji) sent by a user during the match.
- Attributes: ReactionId, UserId, EmojiId, Timestamp
Match/Event: Represents the live event (e.g., a cricket match) where users are sending and receiving emojis in real-time.
- Attributes: EventId, Teams, StartTime, EndTime, CurrentState
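
As a rough sketch, the four entities above could be written down as plain data classes. The attribute names follow the lists above; the field types (strings for ids, `datetime` for times, a list of team names) are assumptions made for illustration, not part of the original design.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class User:
    user_id: str
    name: str
    location: str
    device_type: str


@dataclass
class Emoji:
    emoji_id: str
    type: str  # e.g. "clap", "heart", "thumbs_up"
    created_at: datetime


@dataclass
class Reaction:
    reaction_id: str
    user_id: str
    emoji_id: str
    timestamp: datetime


@dataclass
class Event:
    event_id: str
    teams: list[str]
    start_time: datetime
    end_time: Optional[datetime]  # unknown while the match is still live
    current_state: str


# One reaction sent by a viewer during a match (all values are made up).
reaction = Reaction("r-1", "u-42", "e-clap", datetime(2024, 9, 15, tzinfo=timezone.utc))
print(reaction.user_id, reaction.emoji_id)
```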

At this stage, we've outlined the core entities. I have detailed the attributes here, but they need not be written down now; we can do that during our high-level design.

You can always say to your interviewer:-

At this point, I've identified the core entities at a high level, and we can proceed with these for now. We can always revisit or refine them as we progress.

5. API

1. Submit Emoji API:

This API allows users to send emoji reactions while watching the match. It stores the reaction in the system.

Endpoint: /api/v1/emoji/send
Method: POST
Request Headers:
- Authorization: Bearer token (JWT for authentication and user identification)
Request Body:

{
  "emojiId": "uuid",
  "eventId": "event-abc",
  "emojiType": "clap"
}

UserId is not passed by the client. It is derived from the JWT token used in the Authorization header.

Timestamps for the reaction are generated server-side to avoid manipulation.

Response:

{
  "status": "success"
}
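
A minimal sketch of what the server might do with this request, assuming the JWT has already been verified elsewhere and that the user id lives in the standard `sub` claim. `build_reaction` and `ALLOWED_EMOJI_TYPES` are hypothetical names, and the allowed set is an assumption based on the emoji types mentioned earlier.

```python
import time
import uuid

# Assumed set of supported reactions; the real list would come from configuration.
ALLOWED_EMOJI_TYPES = {"clap", "heart", "thumbs_up"}


def build_reaction(body, token_claims, now=None):
    """Validate a submit request and build the record to store.

    body: the parsed JSON request body
    token_claims: claims of an already verified JWT (verification is not shown)
    """
    if body.get("emojiType") not in ALLOWED_EMOJI_TYPES:
        raise ValueError("unsupported emojiType")
    if not body.get("eventId"):
        raise ValueError("eventId is required")
    return {
        "reactionId": str(uuid.uuid4()),
        "userId": token_claims["sub"],  # taken from the token, never from the body
        "eventId": body["eventId"],
        "emojiType": body["emojiType"],
        "timestamp": time.time() if now is None else now,  # server-side clock
    }


record = build_reaction({"eventId": "event-abc", "emojiType": "clap"}, {"sub": "user-42"})
print(record["userId"], record["emojiType"])  # user-42 clap
```

Any `userId` a client tries to smuggle into the body is simply ignored, which is the point of deriving it from the token.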

2. Fetch Live Reactions API

This API fetches the top (most popular) emojis sent during an event for clients to render on the user interface. We have not yet decided how the emojis will be broadcast, so for now we can write this down and revisit it later:-

Endpoint: /api/v1/emoji/fetch

Method: GET
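Request (for a GET, eventId would normally travel as a query parameter, e.g. /api/v1/emoji/fetch?eventId=xyz; it is shown as JSON here for readability):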
```
{
  "eventId": "xyz"
}
```

Response:

{
  "emojis": [
    {"emojiType": "clap", "count": 5000},
    {"emojiType": "heart", "count": 3000}
  ],
  "status": "success"
}

Again, writing the whole request/response is optional and we can revisit this once we have clarity in HLD.

6. High-Level Design

As said previously, HLD focuses on solving the functional requirements. Let’s go through them one by one.

1. The user should be able to send emojis

When a user sends an emoji from their device, it should be successfully broadcast to all the viewers watching that event.

We will start laying out core components by adding an Emoji Service to process the emojis.

The Core Components are:-

Clients: Users will interact with the system through the client app. All client requests will be routed to the system's backend through an API Gateway.
API Gateway: This serves as an entry point for clients to access the different microservices of the system. It's primarily responsible for routing requests to the appropriate services but can also be configured to handle cross-cutting concerns like authentication, rate limiting, and logging.
Emoji Service: Our first microservice. It is responsible for receiving the API requests and saving each emoji request to the database.
Emoji DB: Stores the emojis sent by each user.

To summarize what happens when the user sends an emoji:

The client makes a REST POST request with the eventId and the emoji type.
The API Gateway then forwards the request to the Emoji Service.
The Emoji Service saves the emoji request, userId, and metadata, and returns a success response to the client.

2. User should be able to receive emojis in near real-time

There are a couple of ways we can design this:-

Polling Approach:
- Description: Users would repeatedly hit an API endpoint at regular intervals to fetch the latest emojis.
- Challenges:
  - Polling Overhead: Frequent polling increases the load on the server and network. This can lead to inefficiencies and higher operational costs.
  - Complexity: The server must track which emojis have been sent to each user and handle scenarios where emojis might be missed or duplicated.
  - Latency: This approach introduces delays between when an emoji is sent and when it is received by users, impacting the real-time experience.
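
To put a rough number on the polling overhead, the snippet below estimates the request rate if every viewer polls on a fixed interval. The 2-second interval is an assumption, chosen only because it matches the latency target from the requirements; `polling_requests_per_second` is a hypothetical helper.

```python
def polling_requests_per_second(concurrent_users, poll_interval_seconds):
    """Average request rate when every user polls once per interval."""
    return concurrent_users / poll_interval_seconds


# 10 million viewers, each polling every 2 seconds.
print(polling_requests_per_second(10_000_000, 2))  # 5000000.0
```

Even before any emoji is sent, the backend would have to absorb about 5 million requests per second, most of which return little new data.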

Push-Based Approach:
- Description: The server actively pushes emojis to connected clients as soon as they are received. This method eliminates the need for users to poll for updates.
- Advantages:
  - Real-Time Delivery: Emojis are delivered instantaneously, providing a seamless and engaging user experience.
  - Efficiency: Reduces server load and network traffic, as there is no need for continuous polling.
  - Simplified Tracking: The server maintains state and delivery tracking, reducing the complexity of client-side implementation.

The Push-Based Approach is generally preferred for real-time applications where low latency and high responsiveness are critical. A message queue is a commonly used mechanism for asynchronous communication between applications, and we will use Kafka as the message queue for this approach.

Companies like Google don’t usually emphasize the technology name. In that case, you can just say that we will use a message queue.

At this point we can go back to our API design and discard API 2 (Fetch Live Reactions), which relied on polling.

We will extend our previous solution a bit:-

After Emoji Service receives the emoji, it publishes emojis to a topic within a message queue, tagged with the eventId.
Clients subscribed to the topic receive emojis asynchronously in real-time, ensuring efficient and prompt delivery.
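
The "save, then publish" step can be sketched with an in-memory stand-in for the message queue. Nothing here is Kafka's API: `InMemoryQueue`, `EmojiService`, and the topic name `emoji-reactions` are hypothetical and exist only to show the order of operations.

```python
from collections import defaultdict, deque


class InMemoryQueue:
    """Stand-in for a message queue: one FIFO per topic."""

    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, topic, max_messages=100):
        """Remove and return up to max_messages from the topic, oldest first."""
        q = self._topics[topic]
        return [q.popleft() for _ in range(min(max_messages, len(q)))]


class EmojiService:
    def __init__(self, db, queue, topic="emoji-reactions"):
        self._db = db  # any list-like store stands in for the Emoji DB
        self._queue = queue
        self._topic = topic

    def handle(self, reaction):
        self._db.append(reaction)                   # persist the reaction first
        self._queue.publish(self._topic, reaction)  # then hand it off for broadcast
        return {"status": "success"}


db, queue = [], InMemoryQueue()
service = EmojiService(db, queue)
print(service.handle({"eventId": "event-abc", "emojiType": "clap"}))
print(queue.poll("emoji-reactions"))
```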

Emoji Aggregation

Sending emojis as soon as they are received can be inefficient. To optimize performance and enhance user experience, we aggregate emojis over a short time interval—let’s say 1 second—before broadcasting them.

Why Aggregate?

Efficiency: Aggregating emojis reduces the number of messages sent over the network, minimizing the load on both the message queue and client applications.
Network Optimization: By batching emojis, we lower the frequency of updates sent to clients, thereby reducing network traffic and improving overall performance.
User Experience: Aggregation helps in presenting a more cohesive and visually appealing set of emojis to users, reflecting a more accurate real-time sentiment.

Implementation Details:

Time Interval: The interval for aggregation should be short enough to maintain a real-time experience, such as 1 second.
Streaming Platform: We use streaming platforms like Flink, Spark, Storm, or Kafka Streams to aggregate emojis in batches (e.g., every second).
Data Flow: After aggregation, the processed emoji stream is published to a Kafka topic for further consumption by clients.
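
The 1-second aggregation can be illustrated with a plain-Python tumbling window. This is a batch stand-in for what a stream processor would do continuously, not Flink or Kafka Streams code; `aggregate_by_window` and the tuple layout are assumptions made for the example.

```python
from collections import Counter, defaultdict


def aggregate_by_window(reactions, window_seconds=1.0):
    """Group reactions into fixed (tumbling) windows and count each emoji type.

    reactions: iterable of (timestamp_seconds, event_id, emoji_type)
    returns: {(event_id, window_start): Counter({emoji_type: count})}
    """
    windows = defaultdict(Counter)
    for ts, event_id, emoji_type in reactions:
        # Round the timestamp down to the start of its window.
        window_start = int(ts // window_seconds) * window_seconds
        windows[(event_id, window_start)][emoji_type] += 1
    return dict(windows)


sample = [
    (10.2, "event-abc", "clap"),
    (10.5, "event-abc", "heart"),
    (10.9, "event-abc", "clap"),
    (11.1, "event-abc", "clap"),
]
for key, counts in aggregate_by_window(sample).items():
    print(key, dict(counts))
```

Four incoming reactions collapse into two outgoing messages, one per window, which is where the network savings come from.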

Data Delivery

The next problem is delivering the emojis to the client. There are various problems now:

Direct Kafka Subscription: Subscribing mobile devices to Kafka is impractical due to several challenges:
- Binary Protocol: Kafka uses a binary protocol unsuitable for direct integration with mobile apps.
- Persistent Connections: Kafka requires persistent connections, which are not ideal for mobile networks that can be unstable.
- Security Risks: Exposing Kafka directly to the internet introduces security vulnerabilities.
- Scalability Issues: Managing many concurrent mobile connections with Kafka would create significant scalability and performance challenges.
Pub/Sub System: Instead, we use a Pub/Sub system to handle high-throughput, low-latency message delivery. A Kafka consumer, which listens to the emoji topic from the streaming platform, normalizes or aggregates the data (e.g., identifies the most popular emojis). Instead of delivering every emoji, we focus on sending only the most relevant ones. The aggregated data is then published to the Pub/Sub system.
Client Delivery via Pub/Sub: Clients subscribe to Pub/Sub topics based on eventId and receive real-time updates efficiently. Pub/Sub manages persistent connections and delivers aggregated emoji data to clients.
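
A small sketch of the last two steps: pick the most popular emojis from an aggregated window and fan them out to subscribers of that eventId. `PubSubHub` is an in-memory stand-in with callbacks in place of real client connections, and `top_emojis` with its limit of 3 is an assumed selection rule.

```python
from collections import defaultdict


def top_emojis(counts, limit=3):
    """Return the `limit` most frequent emojis as (emoji_type, count), highest first."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


class PubSubHub:
    """In-memory stand-in for the Pub/Sub layer: callbacks keyed by eventId."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_id, callback):
        self._subscribers[event_id].append(callback)

    def publish(self, event_id, payload):
        """Deliver payload to every subscriber of event_id; return how many received it."""
        callbacks = self._subscribers.get(event_id, [])
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


hub = PubSubHub()
received = []
hub.subscribe("event-abc", received.append)

window_counts = {"clap": 5000, "heart": 3000, "thumbs_up": 120, "fire": 40}
hub.publish("event-abc", top_emojis(window_counts))
print(received)  # [[('clap', 5000), ('heart', 3000), ('thumbs_up', 120)]]
```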

Let’s move to the next section which is:-

7. Back of Envelope calculations

Concurrent Users: 10 million
Emojis per User per Minute: 1 (average)
Emoji Size: 500 bytes (including metadata)
Event Duration: 3 hours (180 minutes)

1. Throughput Calculation

Emojis per Second: 10 million users * 1 emoji/user/minute ÷ 60 seconds = ~167K emojis per second.

2. Data Size Calculation

Total Emojis in the Entire Event: 10 million users * 1 emoji/user/minute * 180 minutes = 1.8 billion emojis.
Total Data Size: 1.8 billion emojis * 500 bytes/emoji = 900 GB per event.

3. Bandwidth Calculation

Data Per Second: 167K emojis/second * 500 bytes/emoji = ~83 MB/second.
Data Per Minute: 83 MB/second * 60 seconds = ~5 GB/minute.
Total Bandwidth for Event Duration: 5 GB/minute * 180 minutes = 900 GB.

Summary

Throughput: ~167K emojis per second
Data Size: 900 GB for the entire event
Bandwidth: ~83 MB/second, totalling 900 GB for the entire event
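
As a cross-check, the same estimates can be computed from the stated assumptions without intermediate rounding. Decimal units are assumed (1 MB = 10^6 bytes, 1 GB = 10^9 bytes), and the figures cover incoming reactions only, not the fan-out to viewers.

```python
USERS = 10_000_000
EMOJIS_PER_USER_PER_MINUTE = 1
EMOJI_SIZE_BYTES = 500
EVENT_MINUTES = 180

emojis_per_second = USERS * EMOJIS_PER_USER_PER_MINUTE / 60
bytes_per_second = emojis_per_second * EMOJI_SIZE_BYTES
total_emojis = USERS * EMOJIS_PER_USER_PER_MINUTE * EVENT_MINUTES
total_bytes = total_emojis * EMOJI_SIZE_BYTES

print(f"{emojis_per_second:,.0f} emojis/s")         # 166,667 emojis/s
print(f"{bytes_per_second / 1e6:.1f} MB/s")         # 83.3 MB/s
print(f"{total_emojis / 1e9:.1f} billion emojis")   # 1.8 billion emojis
print(f"{total_bytes / 1e9:.0f} GB per event")      # 900 GB per event
```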

For this problem, back-of-the-envelope calculations offer limited value because they won’t significantly impact our design decisions. Given the constraints, these estimations are unlikely to influence key aspects of the system. Therefore, it's more strategic to focus on the High-Level Design (HLD) and key design considerations.
If time is tight during the interview, focus on aspects that align with the interviewer's expectations. Back-of-the-envelope calculations might be deprioritized if the interviewer suggests concentrating on specific design elements.
Use your time wisely by concentrating on the areas that demonstrate your problem-solving skills and understanding of system design, rather than spending time on calculations that won’t influence the outcome.
This approach will enhance your chances of success by ensuring that you address the most relevant aspects of the system design.

8. Detailed Design

This is the part where we should look into each of the non-functional requirements and try to satisfy them.

a. Scalability - Handling 10M concurrent users

Horizontal Scaling: Use load balancers to distribute incoming API requests across multiple instances of the Emoji Service.
Kafka Partitioning:
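- Topic Partitioning: Partition Kafka topics using eventId as part of the key. This allows for parallel processing and distributes the load across multiple Kafka brokers. Because all messages with the same key land on the same partition, a single 10-million-viewer match keyed only by eventId would overload one partition, so combine eventId with a shard value (for example, a hash bucket of userId). Each partition then handles a subset of the data, improving throughput and reducing latency.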
- Replication: Configure Kafka topics with replication to ensure high availability and fault tolerance. This ensures that data is not lost if a broker fails.
Auto-Scaling:
- Emoji Service: Implement auto-scaling policies that adjust the number of service instances based on metrics such as CPU usage and request rate.
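- Kafka Cluster: Scale the Kafka cluster by adding or removing brokers based on load. New brokers do not take over existing partitions until partitions are reassigned, so for a scheduled match it is safer to provision capacity ahead of time.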
Database:
- NoSQL Databases: Use NoSQL databases like DynamoDB or Cassandra for storing emojis. These databases offer high write throughput and can scale horizontally to handle large volumes of data.
- Data Retention: Configure data retention policies to store emojis for 24-48 hours post-event. Ensure efficient data access and deletion policies to manage storage usage.
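
One caveat with partitioning on eventId alone is that a single large match would map to a single partition. A hedged alternative is a composite key, sketched below. `partition_for` and `buckets_per_event` are hypothetical, and `crc32` is only a stable stand-in hash; a real Kafka producer applies its own partitioner to the message key.

```python
import zlib


def partition_for(event_id, user_id, num_partitions, buckets_per_event=64):
    """Pick a partition from a composite key of eventId and a user bucket."""
    # Users are spread over a fixed number of buckets per event.
    bucket = zlib.crc32(user_id.encode("utf-8")) % buckets_per_event
    key = f"{event_id}:{bucket}"
    # The composite key, not the bare eventId, decides the partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


used = {partition_for("event-abc", f"user-{i}", num_partitions=12) for i in range(1000)}
print(f"{len(used)} of 12 partitions used by one event")
```

The same user always lands on the same partition for a given event, while the event's traffic as a whole is spread out.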

b. Low Latency (less than 2 seconds)

Real-Time Processing: Use streaming platforms like Flink, Spark, or Kafka Streams as discussed earlier, and set a short aggregation interval.
Efficient Data Delivery: Utilize a Pub/Sub system to manage real-time data delivery which was covered earlier.

c. High Availability: Some emojis may not be delivered

Some emojis may not be delivered, but the system is designed to maintain high availability through redundant infrastructure, load balancing, message persistence in Kafka, graceful degradation, retry mechanisms, and data replication.
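
Of the techniques listed, the retry mechanism is the easiest to show in a few lines. Below is a minimal sketch of retry with exponential backoff and jitter; `retry_with_backoff`, its defaults, and the `flaky_publish` example are all hypothetical.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))


calls = {"n": 0}


def flaky_publish():
    """Fails twice, then succeeds, to imitate a briefly unavailable broker."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("broker unavailable")
    return "published"


# A no-op sleep keeps the demo instant; real code would use the default time.sleep.
print(retry_with_backoff(flaky_publish, sleep=lambda _: None))  # published
```

Since losing a few emojis is acceptable here, the number of attempts can stay small so that retries never push delivery past the latency target.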

Thanks for reading this article so far. If you like this real-world System Design interview solution, please share it with your friends and colleagues.

That’s it, folks! 😄.

I spent ~10+ hours curating this article—thinking, designing, and writing. It would mean the world to me if you could spare just 10 seconds to like 👍 and share ❤️ this with your friends. Your support motivates me to create even more valuable content! 🙌

Stay tuned for more deep dives on Real-World System Design Problems. Let me know in the comments if you want me to solve a specific design problem.

And hey, don’t forget to follow me on LinkedIn for more insights and updates. Let's stay connected!

Y School Of Tech
