System Design Interview - Real-Time Error Monitoring Tool (Part-1)(Asked at Google, Apr '25)
S1E3 - Backend of Error Monitoring Tool
Hey everyone 👋 and welcome to Episode 3 of my System Design Interview Series!
Today, we’ll walk through a real-world design asked at Google in Apr '25, where the question was posed by a Staff engineer from Texas, US.
We’ll break it down step-by-step and also highlight how to approach this in a real interview — with tips, structure, and thought process.
👉 Know someone prepping for system design interviews? Share this with them — this one covers streaming, real-time alerting, polyglot storage, and fault tolerance — everything you’d love to nerd out on.
🧩 Problem Statement
💬 Seen a similar question before?
How would you start breaking it down in a live interview? Drop your thoughts in the comments 👇
1. ❓Clarifying Requirements (2 mins)
👥 Tenants & Volume:
Assume ~1,000 clients, sending a total of 100M error events/day

🧠 Error Grouping Logic:
Based on stack trace, exception type, and file + line number

⚡ Ingestion Latency:
Yes — ingestion must be fast (<100ms)
👉 Grouping & deduplication can happen asynchronously

🚨 Real-Time Alerting:
Yes — system must alert when a new unique error type is seen

🔐 Multi-Tenancy:
Yes — each SaaS customer must only see their own data

🗓️ Retention Period:
Configurable — system should allow data retention settings per tenant and support deletion of old error data

🏷️ Metadata Payload:
Each error may carry 2–4KB of metadata (e.g., tags, OS, browser, user ID)
❌ Mistake Alert:
Candidates often skip clarification. In real interviews, vague problems are intentional — show your thinking by asking back.
💬 Got your own checklist of clarifying questions? I’d love to hear it — comment below 👇
2. 📋 Functional Requirements (3 mins)
Although the functional requirements are mentioned in the question itself, I recommend explicitly listing them out to make it easier to refer back later.
In an interview setting, you could write it out like this:
3. 🚦Non Functional Requirements (3 mins)
Here's what I would write in an interview setting:
Next, we lay out the data model and APIs, mapping them to the relevant FRs. However, ask your interviewer if they prefer to skip straight to the High-Level Design.
4. 🧱Data Model Design (2 mins)
Here's a smooth way to respond without getting stuck early on:
"I’ll jot down an initial data model based on what’s top-of-mind. We can refine this as we dive deeper into requirements and trade-offs."
🧠 Food for Thought
Think about basic classes which will be important to design the solution. Then, sketch something simple :
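As one way to sketch this quickly, here are minimal Python dataclasses for the core entities. The attribute names are my illustrative assumptions, not a fixed schema from the question — exactly the kind of "first pass" model you'd refine later:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorEvent:
    """One raw error occurrence reported by a client SDK."""
    event_instance_id: str
    tenant_id: str
    exception_type: str
    stack_trace: str
    file: str
    line: int
    metadata: dict = field(default_factory=dict)  # tags, OS, browser, user ID (2–4KB)

@dataclass
class ErrorGroup:
    """A deduplicated group of events sharing one fingerprint."""
    fingerprint: str
    tenant_id: str
    status: str = "open"          # open / resolved / ignored / assigned
    occurrence_count: int = 0

@dataclass
class Tenant:
    """A SaaS customer with its own isolation and settings."""
    tenant_id: str
    retention_days: int = 90      # configurable per-tenant retention
    alert_channels: list = field(default_factory=list)  # e.g. ["slack", "email"]
```

Keeping the model this small up front leaves room to add fields (release version, environment, assignee) once the HLD makes the access patterns clear.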
As I said, you might not come up with all the attributes of each model at this point. Let your interviewer know this is not final and that you will come back to it if needed.
5. 🔌API Design (3 mins)
❌ Common Mistakes To Avoid
Ask the interviewer if they want to spend time here, and keep it brief — they may be prioritising the High-Level Design.
🧠 Food for Thought:
Go through each ✨ functional requirement and identify which ones need an API to support or expose that feature. Focus on mapping each capability to a clear, purposeful endpoint
Always communicate to your interviewer that you will come back to these if any modification is required.
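One possible mapping from the functional requirements to endpoints — the paths and names below are illustrative assumptions, not part of the original question:

```
POST  /v1/errors                      # SDK batch ingestion (auth key in header)
GET   /v1/errors/groups               # list grouped errors for a tenant (filter by status, version)
GET   /v1/errors/{event_instance_id}  # fetch a full error payload
PATCH /v1/errors/groups/{group_id}    # update state: resolved / ignored / assigned
PUT   /v1/tenants/{id}/retention      # configure per-tenant retention
PUT   /v1/tenants/{id}/alert-rules    # notification channels & rules
```

Walking the FR list and pointing at the endpoint that serves each one is usually enough here.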
Aim to reach here in 13-15 minutes
6. 🏗️High Level Design (10-15 minutes)
🧠 Food for Thought
In this section, we outline the High-Level Design (HLD), mapping it directly to the core Functional Requirements (FRs).
The first thing I draw is the user and the API Gateway as the entry point.
A) Ingestion & Processing Pipeline
1. 📲 Ingestion:
A client SDK captures an error, batches it with others, compresses the payload, and sends it to the Ingestion Service via an API Gateway.
2. ✅ Validation & Buffering:
The Ingestion Service:
🔐 Validates the auth key
📦 Checks the payload schema
📤 Publishes the raw event to Kafka (raw-events topic)
📬 Responds to the SDK with 202 Accepted
🧱 Kafka acts as a durable buffer, decoupling ingestion from processing — enabling resilience, scalability, and replayability.
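The validate-then-buffer handler can be sketched as below. This is a minimal sketch, not a production service: the producer is injected (a real deployment would use a Kafka client such as kafka-python's `KafkaProducer`, whose `send(topic, value)` this mirrors), and the schema check is deliberately shallow:

```python
import json
import zlib

RAW_TOPIC = "raw-events"

def handle_ingest(body: bytes, auth_key: str, valid_keys: dict, producer) -> int:
    """Validate cheaply, buffer in Kafka, return fast; heavy work happens downstream."""
    tenant_id = valid_keys.get(auth_key)            # 🔐 auth-key check
    if tenant_id is None:
        return 401
    try:
        events = json.loads(zlib.decompress(body))  # SDK sends compressed batches
    except (zlib.error, json.JSONDecodeError):
        return 400
    for event in events:                            # 📦 minimal schema check
        if "stack_trace" not in event or "exception_type" not in event:
            return 400
        event["tenant_id"] = tenant_id              # tag for multi-tenant isolation
        producer.send(RAW_TOPIC, json.dumps(event).encode())  # 📤 durable buffer
    return 202                                      # 📬 accepted, not yet processed
```

Returning 202 rather than 200 signals to the SDK that the event is queued, not fully processed — which is exactly the decoupling the <100ms ingestion budget requires.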
B) ⚡ Real-Time Processing with Apache Flink
Why Not Just Store and Query Later?
Per our Functional Requirements, we need to process a continuous stream of error events and support:
🚀 Fast ingestion
🧬 Deduplication & fingerprinting
🆕 New error detection
🔔 Real-time alerting
At first glance, you might think of storing raw events in a database and running periodic jobs to apply business logic. Sounds reasonable, right?
But this introduces serious problems 👇
❌ Why storing in the DB first doesn't work well:

🕒 Latency:
Inserting and then querying each event introduces significant I/O and processing delays.
👉 By the time you detect a new error, it's too late for real-time alerting.

💸 Cost & Scalability:
Performing high-frequency reads/writes (e.g., every second) can overload your DB, forcing expensive horizontal scaling just to keep up.

⚙️ Complexity:
Modeling logic like stateful deduplication or time-window aggregations directly inside a DB is hard to express, inefficient to execute, and not fault-tolerant by design.
✅ Why choose Apache Flink instead
Flink is a stream processing engine purpose-built for low-latency, high-throughput, and stateful computations over unbounded data streams — exactly what we need here.
Instead of relying on a DB for real-time logic, we process events as they arrive using Flink:
🔁 Deduplication: Flink computes a stable error_fingerprint using the stack trace, exception type, and file/line info — and checks its in-memory state to see if this fingerprint has already been seen for this tenant.

🚨 New Error Detection: If it’s a new error, Flink emits an event to the new-error-alerts Kafka topic immediately — no DB lookup required.

🧠 Enrichment: Flink adds metadata like a unique event_instance_id, processing timestamp, etc., and publishes this enriched event to a processed-events Kafka topic for downstream consumers.

🛡️ Fault Tolerance: Flink supports checkpointing and exactly-once semantics, making the entire pipeline reliable under failure without requiring expensive retries or duplication handling in a DB.
🧠 Food for Thought
🔍 You need Flink (or a Flink-like engine) when your system must process millions of events per second, apply business logic in real time, and maintain per-user or per-tenant state — all with low latency.
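To make the fingerprint-and-dedup step concrete, here is a plain-Python sketch. The hashing scheme is an illustrative assumption (real tools also normalize stack frames before hashing), and the `DedupState` class stands in for Flink's per-key managed state:

```python
import hashlib

def error_fingerprint(stack_trace: str, exception_type: str, file: str, line: int) -> str:
    """Stable fingerprint from the grouping signals: stack trace, type, file:line."""
    raw = f"{exception_type}|{file}:{line}|{stack_trace}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

class DedupState:
    """Stand-in for Flink keyed state: one seen-set per tenant."""
    def __init__(self):
        self._seen = {}  # tenant_id -> set of fingerprints

    def is_new(self, tenant_id: str, fingerprint: str) -> bool:
        """Return True (and record it) the first time a tenant sees this fingerprint."""
        seen = self._seen.setdefault(tenant_id, set())
        if fingerprint in seen:
            return False
        seen.add(fingerprint)
        return True
```

Keying the state by tenant is what enforces multi-tenancy here: the same fingerprint from two different customers is still "new" to each of them.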
C) 🚨 Real-Time Alerting
🔁 The Flink app publishes new error fingerprints to a dedicated Kafka topic — new-error-alerts.
📩 A Notification Service consumes from this topic and:
🔍 Looks up the tenant’s notification rules
📨 Sends alerts via preferred channels — e.g., Slack, Email, or Webhook
⚡ This ensures real-time alerts for first-time errors, helping teams respond instantly to new failures.
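The Notification Service's core loop reduces to a small dispatch routine. A minimal sketch, assuming per-tenant rules are already loaded and each channel (Slack, Email, Webhook) is a pluggable sender callable:

```python
def dispatch_alert(alert: dict, rules: dict, senders: dict) -> list:
    """Fan one new-error alert out to every channel the tenant has configured."""
    channels = rules.get(alert["tenant_id"], [])   # 🔍 tenant's notification rules
    delivered = []
    for channel in channels:                       # e.g. "slack", "email", "webhook"
        send = senders.get(channel)
        if send is not None:                       # skip channels with no sender wired up
            send(alert)                            # 📨 hand off to the channel client
            delivered.append(channel)
    return delivered
```

Because the service consumes from Kafka, a crashed delivery attempt can simply be retried from the last committed offset — no alert is silently lost.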
🧠 Pause and Think
How would you design the storage layer here?
👉 Drop a comment — monolithic or polyglot?
D) 💾 Storage Fan-Out
Once enriched, the event from Apache Flink is published to the processed-events Kafka topic.
From here, multiple consumer services fan out the event to specialized storage systems:
🔥 Apache Cassandra
Stores full error payloads for fast lookup by event_instance_id
👉 Ideal for high-throughput writes and low-latency reads

🔍 Elasticsearch
Powers flexible search across metadata
📌 Maintains grouped error states (e.g., resolved, ignored, assigned)

📊 ClickHouse
Handles real-time time-series aggregations
⚡ Powers fast dashboards like “Errors per version per hour”

❄️ Amazon S3 + Parquet
Archives enriched events in columnar format for long-term storage
🔍 Queryable on-demand via tools like Athena, Presto, or Spark
💰 Great for cost-efficient, durable historical analysis without inflating hot/warm tiers
💡 This polyglot strategy ensures each workload is handled by the right tool, optimising for performance, scale, and cost.
🎉 That’s it for Part 1!
In Part 2, we’ll go deeper into:
🔍 Persistence Deep Dive
📡 Query Layer & APIs
📏 Back-of-the-Envelope Estimations
🧠 Component Deep Dives
⚖️ Trade-Offs & Design Alternatives
🙌 If you found this helpful:
❤️ Hit like — it helps others discover this series
💬 Comment with your feedback or alternate designs