System Design Interview - Real-Time Error Monitoring Tool (Part-1)(Asked at Google, Apr '25)
S1E3 - Backend of Error Monitoring Tool
Hey everyone 👋 and welcome to Episode 3 of my System Design Interview Series!
Today, we’ll walk through a real-world design asked at Google in Apr '25, where the question was posed by a Staff engineer from Texas, US.
We’ll break it down step-by-step and also highlight how to approach this in a real interview — with tips, structure, and thought process.
👉 Know someone prepping for system design interviews? Share this with them — this one covers streaming, real-time alerting, polyglot storage, and fault tolerance — everything you’d love to nerd out on.
🧩 Problem Statement
💬 Seen a similar question before?
How would you start breaking it down in a live interview? Drop your thoughts in the comments 👇
1. ❓Clarifying Requirements (2 mins)
👥 Tenants & Volume:
Assume ~1,000 clients, sending a total of 100M error events/day

🧠 Error Grouping Logic:
Based on stack trace, exception type, and file + line number

⚡ Ingestion Latency:
Yes — ingestion must be fast (<100ms)
👉 Grouping & deduplication can happen asynchronously

🚨 Real-Time Alerting:
Yes — system must alert when a new unique error type is seen

🔐 Multi-Tenancy:
Yes — each SaaS customer must only see their own data

🗓️ Retention Period:
Configurable — system should allow data retention settings per tenant and support deletion of old error data

🏷️ Metadata Payload:
Each error may carry 2–4KB of metadata (e.g., tags, OS, browser, user ID)
❌ Mistake Alert:
Candidates often skip clarification. In real interviews, vague problems are intentional — show your thinking by asking back.
💬 Got your own checklist of clarifying questions? I’d love to hear it — comment below 👇
2. 📋 Functional Requirements (3 mins)
Although the functional requirements are mentioned in the question itself, I recommend explicitly listing them out to make it easier to refer back later.
In an interview setting, you could write it out like this:
3. 🚦Non Functional Requirements (3 mins)
Here's what I would write in an interview setting:
Next, we lay out the data model and APIs, mapping them to the relevant FRs. However, ask your interviewer if they prefer to skip straight to the High-Level Design.
4. 🧱Data Model Design (2 mins)
Here's a smooth way to respond without getting stuck early on:
"I’ll jot down an initial data model based on what’s top-of-mind. We can refine this as we dive deeper into requirements and trade-offs."
🧠 Food for Thought
Think about basic classes which will be important to design the solution. Then, sketch something simple :
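As one way to sketch this quickly, here are minimal Python dataclasses for the core entities. The attribute names are my illustrative assumptions, not a fixed schema from the question — exactly the kind of "first pass" model you'd refine later:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorEvent:
    """One raw error occurrence reported by a client SDK."""
    event_instance_id: str
    tenant_id: str
    exception_type: str
    stack_trace: str
    file: str
    line: int
    metadata: dict = field(default_factory=dict)  # tags, OS, browser, user ID (2–4KB)

@dataclass
class ErrorGroup:
    """A deduplicated group of events sharing one fingerprint."""
    fingerprint: str
    tenant_id: str
    status: str = "open"          # open / resolved / ignored / assigned
    occurrence_count: int = 0

@dataclass
class Tenant:
    """A SaaS customer with its own isolation and settings."""
    tenant_id: str
    retention_days: int = 90      # configurable per-tenant retention
    alert_channels: list = field(default_factory=list)  # e.g. ["slack", "email"]
```

Keeping the model this small up front leaves room to add fields (release version, environment, assignee) once the HLD makes the access patterns clear.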
As I said, you might not come up with all the attributes of each model at this point. Let your interviewer know this is not final and that you will come back to it if needed.
5. 🔌API Design (3 mins)
❌ Common Mistakes To Avoid
Ask the interviewer if they want to spend time here, and keep it brief — they may be prioritising the High-Level Design.
🧠 Food for Thought:
Go through each ✨ functional requirement and identify which ones need an API to support or expose that feature. Focus on mapping each capability to a clear, purposeful endpoint
Always communicate to your interviewer that you will come back to these if any modification is required.
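One possible mapping from the functional requirements to endpoints — the paths and names below are illustrative assumptions, not part of the original question:

```
POST  /v1/errors                      # SDK batch ingestion (auth key in header)
GET   /v1/errors/groups               # list grouped errors for a tenant (filter by status, version)
GET   /v1/errors/{event_instance_id}  # fetch a full error payload
PATCH /v1/errors/groups/{group_id}    # update state: resolved / ignored / assigned
PUT   /v1/tenants/{id}/retention      # configure per-tenant retention
PUT   /v1/tenants/{id}/alert-rules    # notification channels & rules
```

Walking the FR list and pointing at the endpoint that serves each one is usually enough here.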
Aim to reach here in 13-15 minutes
6. 🏗️High Level Design (10-15 minutes)
🧠 Food for Thought
In this section, we outline the High-Level Design (HLD), mapping it directly to the core Functional Requirements (FRs).
The first thing I draw is the user and the API Gateway as the entry point.
A) Ingestion & Processing Pipeline
1. 📲 Ingestion:
A client SDK captures an error, batches it with others, compresses the payload, and sends it to the Ingestion Service via an API Gateway.
2. ✅ Validation & Buffering:
The Ingestion Service:
🔐 Validates the auth key
📦 Checks the payload schema
📤 Publishes the raw event to Kafka (raw-events topic)
📬 Responds to the SDK with 202 Accepted
🧱 Kafka acts as a durable buffer, decoupling ingestion from processing — enabling resilience, scalability, and replayability.
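The validate-then-buffer handler can be sketched as below. This is a minimal sketch, not a production service: the producer is injected (a real deployment would use a Kafka client such as kafka-python's `KafkaProducer`, whose `send(topic, value)` this mirrors), and the schema check is deliberately shallow:

```python
import json
import zlib

RAW_TOPIC = "raw-events"

def handle_ingest(body: bytes, auth_key: str, valid_keys: dict, producer) -> int:
    """Validate cheaply, buffer in Kafka, return fast; heavy work happens downstream."""
    tenant_id = valid_keys.get(auth_key)            # 🔐 auth-key check
    if tenant_id is None:
        return 401
    try:
        events = json.loads(zlib.decompress(body))  # SDK sends compressed batches
    except (zlib.error, json.JSONDecodeError):
        return 400
    for event in events:                            # 📦 minimal schema check
        if "stack_trace" not in event or "exception_type" not in event:
            return 400
        event["tenant_id"] = tenant_id              # tag for multi-tenant isolation
        producer.send(RAW_TOPIC, json.dumps(event).encode())  # 📤 durable buffer
    return 202                                      # 📬 accepted, not yet processed
```

Returning 202 rather than 200 signals to the SDK that the event is queued, not fully processed — which is exactly the decoupling the <100ms ingestion budget requires.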
B) ⚡ Real-Time Processing with Apache Flink
Why Not Just Store and Query Later?
Per our Functional Requirements, we need to process a continuous stream of error events and support:
🚀 Fast ingestion
🧬 Deduplication & fingerprinting
🆕 New error detection
🔔 Real-time alerting
At first glance, you might think of storing raw events in a database and running periodic jobs to apply business logic. Sounds reasonable, right?
But this introduces serious problems 👇
❌ Why storing in the DB first doesn't work well:

🕒 Latency:
Inserting and then querying each event introduces significant I/O and processing delays.
👉 By the time you detect a new error, it's too late for real-time alerting.

💸 Cost & Scalability:
Performing high-frequency reads/writes (e.g., every second) can overload your DB, forcing expensive horizontal scaling just to keep up.

⚙️ Complexity:
Modeling logic like stateful deduplication or time-window aggregations directly inside a DB is hard to express, inefficient to execute, and not fault-tolerant by design.
✅ Why choose Apache Flink instead
Flink is a stream processing engine purpose-built for low-latency, high-throughput, and stateful computations over unbounded data streams — exactly what we need here.
Instead of relying on a DB for real-time logic, we process events as they arrive using Flink:
🔁 Deduplication: Flink computes a stable error_fingerprint using the stack trace, exception type, and file/line info — and checks its in-memory state to see if this fingerprint has already been seen for this tenant.

🚨 New Error Detection: If it’s a new error, Flink emits an event to the new-error-alerts Kafka topic immediately — no DB lookup required.

🧠 Enrichment: Flink adds metadata like a unique event_instance_id, processing timestamp, etc., and publishes this enriched event to a processed-events Kafka topic for downstream consumers.

🛡️ Fault Tolerance: Flink supports checkpointing and exactly-once semantics, making the entire pipeline reliable under failure without requiring expensive retries or duplication handling in a DB.
🧠 Food for Thought
🔍 You need Flink (or a Flink-like engine) when your system must process millions of events per second, apply business logic in real time, and maintain per-user or per-tenant state — all with low latency.
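To make the fingerprint-and-dedup step concrete, here is a plain-Python sketch. The hashing scheme is an illustrative assumption (real tools also normalize stack frames before hashing), and the `DedupState` class stands in for Flink's per-key managed state:

```python
import hashlib

def error_fingerprint(stack_trace: str, exception_type: str, file: str, line: int) -> str:
    """Stable fingerprint from the grouping signals: stack trace, type, file:line."""
    raw = f"{exception_type}|{file}:{line}|{stack_trace}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

class DedupState:
    """Stand-in for Flink keyed state: one seen-set per tenant."""
    def __init__(self):
        self._seen = {}  # tenant_id -> set of fingerprints

    def is_new(self, tenant_id: str, fingerprint: str) -> bool:
        """Return True (and record it) the first time a tenant sees this fingerprint."""
        seen = self._seen.setdefault(tenant_id, set())
        if fingerprint in seen:
            return False
        seen.add(fingerprint)
        return True
```

Keying the state by tenant is what enforces multi-tenancy here: the same fingerprint from two different customers is still "new" to each of them.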
C) 🚨 Real-Time Alerting
🔁 The Flink app publishes new error fingerprints to a dedicated Kafka topic — new-error-alerts.
📩 A Notification Service consumes from this topic and:
🔍 Looks up the tenant’s notification rules
📨 Sends alerts via preferred channels — e.g., Slack, Email, or Webhook
⚡ This ensures real-time alerts for first-time errors, helping teams respond instantly to new failures.
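The Notification Service's core loop reduces to a small dispatch routine. A minimal sketch, assuming per-tenant rules are already loaded and each channel (Slack, Email, Webhook) is a pluggable sender callable:

```python
def dispatch_alert(alert: dict, rules: dict, senders: dict) -> list:
    """Fan one new-error alert out to every channel the tenant has configured."""
    channels = rules.get(alert["tenant_id"], [])   # 🔍 tenant's notification rules
    delivered = []
    for channel in channels:                       # e.g. "slack", "email", "webhook"
        send = senders.get(channel)
        if send is not None:                       # skip channels with no sender wired up
            send(alert)                            # 📨 hand off to the channel client
            delivered.append(channel)
    return delivered
```

Because the service consumes from Kafka, a crashed delivery attempt can simply be retried from the last committed offset — no alert is silently lost.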
🧠 Pause and Think
How would you design the storage layer here?
👉 Drop a comment — monolithic or polyglot?
D) 💾 Storage Fan-Out
Once enriched, the event from Apache Flink is published to the processed-events Kafka topic.
From here, multiple consumer services fan out the event to specialized storage systems:
🔥 Apache Cassandra
Stores full error payloads for fast lookup by event_instance_id
👉 Ideal for high-throughput writes and low-latency reads

🔍 Elasticsearch
Powers flexible search across metadata
📌 Maintains grouped error states (e.g., resolved, ignored, assigned)

📊 ClickHouse
Handles real-time time-series aggregations
⚡ Powers fast dashboards like “Errors per version per hour”

❄️ Amazon S3 + Parquet
Archives enriched events in columnar format for long-term storage
🔍 Queryable on-demand via tools like Athena, Presto, or Spark
💰 Great for cost-efficient, durable historical analysis without inflating hot/warm tiers
💡 This polyglot strategy ensures each workload is handled by the right tool, optimising for performance, scale, and cost.
🎉 That’s it for Part 1!
In Part 2, we’ll go deeper into:
🔍 Persistence Deep Dive
📡 Query Layer & APIs
📏 Back-of-the-Envelope Estimations
🧠 Component Deep Dives
⚖️ Trade-Offs & Design Alternatives
🙌 If you found this helpful:
❤️ Hit like — it helps others discover this series
💬 Comment with your feedback or alternate designs