System Design Interview - Real-Time Error Monitoring Tool (Part-2)(Asked at Google, Apr '25)
S1E4 - Backend of Error Monitoring Tool
This is Part 2 of our deep dive into building an Error Monitoring Platform.
If you missed Part 1, you can catch up here 👇 before diving in.
System Design Interview - Real-Time Error Monitoring Tool (Part-1)(Asked at Google, Apr '25)
Hey everyone 👋 and welcome to Episode 4 of my System Design Interview Series!
🔍 Persistence Deep Dive
We split processed events into multiple specialised storage systems to meet access pattern needs efficiently:
🗄️ Storing Every Error Instance — The Cassandra Way
We keep every single error’s full details for 30–90 days so developers can see exactly what happened.
This means massive writes coming in fast 🚀 and simple lookups when we need them.
💡 Why Apache Cassandra?
Cassandra is a masterless, distributed database that’s built for exactly this type of workload:
⚡ Extreme write throughput — it can ingest millions of events per second
🛡 High availability — no single point of failure
🔍 Fast lookups — optimized for queries by partition key
🗂️ How We Store the Data
We design Cassandra’s schema to match our access patterns:
Main Table: error_instances
Partition key:
(tenant_id, error_fingerprint)
→ groups all instances of the same error for a given customer together on the same nodes.Clustering columns:
(event_timestamp DESC, event_instance_id)
→ orders errors so the most recent are fetched first.Schema: tenant_id (UUID), error_fingerprint (TEXT), event_timestamp (TIMESTAMP), event_instance_id (UUID), raw_payload (BLOB), metadata_tags (MAP<TEXT, TEXT>), PRIMARY KEY ((tenant_id, error_fingerprint), event_timestamp, event_instance_id) with CLUSTERING ORDER BY (event_timestamp DESC)
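The schema above maps to CQL roughly as follows. This is a minimal sketch using the DataStax Python driver; the contact point and the keyspace name error_monitoring are my placeholders, not part of the design:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace; adjust to your cluster.
cluster = Cluster(["cassandra-node-1"])
session = cluster.connect("error_monitoring")

# Main table: partitioned by (tenant_id, error_fingerprint),
# newest instances first within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS error_instances (
        tenant_id          UUID,
        error_fingerprint  TEXT,
        event_timestamp    TIMESTAMP,
        event_instance_id  UUID,
        raw_payload        BLOB,
        metadata_tags      MAP<TEXT, TEXT>,
        PRIMARY KEY ((tenant_id, error_fingerprint), event_timestamp, event_instance_id)
    ) WITH CLUSTERING ORDER BY (event_timestamp DESC, event_instance_id ASC)
""")
```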
To fetch a single instance directly by its unique ID, a secondary "lookup" table is needed:
Lookup Table: instance_lookup
Partition key:
event_instance_id
→ quick way to find the full primary key for an error, then fetch it from error_instances (see the read-path sketch below).
Schema: event_instance_id (UUID), tenant_id (UUID), error_fingerprint (TEXT), event_timestamp (TIMESTAMP), PRIMARY KEY (event_instance_id)
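Fetching one instance by ID is then a two-step read. A minimal sketch with the Python driver, reusing the session from the snippet above (the helper name fetch_instance is mine):

```python
def fetch_instance(event_instance_id):
    # `session` is the cassandra-driver Session created in the DDL sketch above.
    # Step 1: resolve the full primary key from the lookup table.
    key = session.execute(
        "SELECT tenant_id, error_fingerprint, event_timestamp "
        "FROM instance_lookup WHERE event_instance_id = %s",
        (event_instance_id,),
    ).one()
    if key is None:
        return None

    # Step 2: use that key to read the full payload from the main table.
    return session.execute(
        "SELECT raw_payload, metadata_tags FROM error_instances "
        "WHERE tenant_id = %s AND error_fingerprint = %s "
        "AND event_timestamp = %s AND event_instance_id = %s",
        (key.tenant_id, key.error_fingerprint, key.event_timestamp, event_instance_id),
    ).one()
```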
⏳ Automatic Expiry with TTL
Each record is written with a time-to-live.
When it expires, Cassandra automatically deletes it — no manual cleanup jobs, no wasted storage 💰.
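As a sketch, a 90-day retention is just a USING TTL clause on the insert (again reusing the session from the earlier snippet; all values below are placeholders, and a 30-day tier works the same way):

```python
import uuid
from datetime import datetime, timezone

TTL_90_DAYS = 90 * 24 * 3600  # retention in seconds

insert_stmt = session.prepare(
    "INSERT INTO error_instances "
    "(tenant_id, error_fingerprint, event_timestamp, event_instance_id, raw_payload, metadata_tags) "
    f"VALUES (?, ?, ?, ?, ?, ?) USING TTL {TTL_90_DAYS}"
)

session.execute(insert_stmt, (
    uuid.uuid4(),                          # tenant_id (placeholder)
    "NullPointerException:checkout",       # error_fingerprint (placeholder)
    datetime.now(timezone.utc),            # event_timestamp
    uuid.uuid4(),                          # event_instance_id
    b'{"stack": "..."}',                   # raw_payload, serialized event
    {"os": "linux", "browser": "chrome"},  # metadata_tags
))
```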
💬 Would you use ScyllaDB or DynamoDB instead? Share your take below.
🔍 Elasticsearch for Search & Metadata
We need:
🎯 Powerful filtering on any metadata (OS, browser, user ID, tags…)
📜 Full-text search on error messages
🗂 Mutable state tracking (status, assignee, annotations)
💡 Why Elasticsearch?
⚡ Near real-time search across billions of docs
🧩 Schema-flexible — perfect for semi-structured grouped errors
🔎 Inverted index magic for lightning-fast filtering
📐 Data Model
Index:
grouped_errors
Doc ID:
tenant_id + error_fingerprint
Fields: tenant_id, project_id, status, title, timestamps, tags, annotations
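A rough mapping for that index, sketched with the official Python client. The field names follow the data model above; the exact field types and timestamp/count names are my assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

es.indices.create(
    index="grouped_errors",
    mappings={
        "properties": {
            "tenant_id":            {"type": "keyword"},
            "project_id":           {"type": "keyword"},
            "status":               {"type": "keyword"},   # e.g. unresolved / resolved / ignored
            "title":                {"type": "text"},      # full-text searchable error message
            "first_seen_timestamp": {"type": "date"},
            "last_seen_timestamp":  {"type": "date"},
            "event_count":          {"type": "long"},
            "tags":                 {"type": "flattened"}, # arbitrary key/value metadata
            "annotations":          {"type": "text"},
        }
    },
)
```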
📊 Flow
1️⃣ New error detected → Kafka consumer creates a new doc in grouped_errors.
2️⃣ Same error happens again → Update the doc: increment count, update last_seen_timestamp.
3️⃣ User actions (e.g., change status, add note) → API directly updates the document in Elasticsearch.
4️⃣ Search & filter → Developers instantly query by metadata, tags, or text to find exactly what they need.
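One way to implement this flow with the Python client is a scripted upsert, so steps 1 and 2 become a single call, plus a partial update for step 3. A sketch, with the doc ID scheme taken from the data model above and the example values as placeholders:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

tenant_id, fingerprint = "tenant-123", "NullPointerException:checkout"  # placeholders
doc_id = f"{tenant_id}:{fingerprint}"   # Doc ID = tenant_id + error_fingerprint
now = datetime.now(timezone.utc).isoformat()

# Steps 1 & 2: create the group on first sight, otherwise bump count + last_seen.
es.update(
    index="grouped_errors",
    id=doc_id,
    script={
        "source": "ctx._source.event_count += 1; ctx._source.last_seen_timestamp = params.ts",
        "params": {"ts": now},
    },
    upsert={
        "tenant_id": tenant_id,
        "status": "unresolved",
        "title": "NullPointerException in checkout",
        "event_count": 1,
        "first_seen_timestamp": now,
        "last_seen_timestamp": now,
    },
)

# Step 3: user action, e.g. resolving the error group.
es.update(index="grouped_errors", id=doc_id, doc={"status": "resolved"})

# Step 4: search & filter, e.g. unresolved errors for this tenant mentioning "checkout".
hits = es.search(
    index="grouped_errors",
    query={
        "bool": {
            "filter": [{"term": {"tenant_id": tenant_id}}, {"term": {"status": "unresolved"}}],
            "must": [{"match": {"title": "checkout"}}],
        }
    },
)
```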
📊 ClickHouse for Analytics & Aggregations
We need:
📈 Fast aggregations on billions of error events
📅 Time-series queries for trends, counts, breakdowns
🚀 Low-latency analytics even with huge datasets
💡 Why ClickHouse?
⚡ Columnar storage → blazing-fast scans for large datasets
🪶 Compression → keeps storage costs low while handling massive data
🛠 Rich SQL support for GROUP BY, filtering, and time-window calculations
📐 Data Model
Table:
error_events_analytics
Partition Key:
event_date
Order By:
(tenant_id, error_fingerprint, event_timestamp)
Fields: tenant_id, project_id, timestamps, error_fingerprint, tags, counts
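In ClickHouse DDL this could look roughly like the following, executed here through the clickhouse-driver Python client. Column names follow the model above; the exact types and table engine settings are my assumptions:

```python
from clickhouse_driver import Client

ch = Client(host="clickhouse-host")  # hypothetical host

ch.execute("""
    CREATE TABLE IF NOT EXISTS error_events_analytics (
        event_date        Date,
        event_timestamp   DateTime,
        tenant_id         UUID,
        project_id        UUID,
        error_fingerprint String,
        tags              Map(String, String),
        event_count       UInt64
    )
    ENGINE = MergeTree
    PARTITION BY event_date
    ORDER BY (tenant_id, error_fingerprint, event_timestamp)
""")
```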
📊 Flow
1️⃣ Raw events ingested into Kafka → Flink processes & aggregates → writes to ClickHouse.
2️⃣ Query layer fetches aggregated data for dashboards, charts, and trend lines.
3️⃣ Developers can drill down by date, tags, or fingerprint to see patterns.
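A typical drill-down from step 3, daily counts for one fingerprint over the last 30 days, might look like this sketch (reusing the client from the DDL snippet above; the parameter values are placeholders):

```python
rows = ch.execute(
    """
    SELECT event_date, sum(event_count) AS errors
    FROM error_events_analytics
    WHERE tenant_id = toUUID(%(tenant)s)
      AND error_fingerprint = %(fp)s
      AND event_date >= today() - 30
    GROUP BY event_date
    ORDER BY event_date
    """,
    {"tenant": "11111111-2222-3333-4444-555555555555", "fp": "NullPointerException:checkout"},
)
```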
🧠 Food for Thought
We use Cassandra for massive write throughput with high availability. Its masterless architecture allows linear horizontal scaling — perfect for append-heavy, always-on ingestion workloads.
👉 When your writes outpace your reads, think Cassandra.
Elasticsearch: Ideal for full-text, multi-field search
👉 When users need “find anything, filter everything,” think Elasticsearch.
ClickHouse: Built for blazing-fast OLAP queries over massive datasets. It excels in high read throughput and is typically optimised for batch writes.
👉 When you care about latency and aggregation speed, think ClickHouse.
📣 If you’ve used ClickHouse or Elasticsearch in prod — I’d love to hear your war stories. Share below 👇
☁️ Amazon S3 for Raw Event Archival
We need:
📦 Long-term storage for compliance & offline analysis
🛡 Durability — must never lose archived data
💰 Cost-effective for large historical datasets
💡 Why S3?
🛡 11 nines of durability (99.999999999%) — data is safe for years
💵 Cheap storage tiers (Standard, IA, Glacier)
🔄 Easy integration with Athena, Spark, EMR for ad-hoc queries
📐 Data Model
Bucket:
error-events-archive
Prefix:
/tenant_id/year/month/day/
Format: Parquet (compressed, query-friendly)
📊 Flow
1️⃣ Flink or batch job periodically writes compressed raw events to S3.
2️⃣ Athena/Spark can query archived data for historical analysis.
3️⃣ Old data automatically transitions to cheaper storage tiers via lifecycle policies.
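A sketch of the archival write under the layout above, using boto3 and pyarrow. The bucket and key pattern come from the data model; the batching/scheduling around it (Flink or a cron job) is left out:

```python
import io
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(tenant_id: str, events: list) -> None:
    """Write one batch of raw events as compressed Parquet under /tenant_id/year/month/day/."""
    table = pa.Table.from_pylist(events)   # events: list of dicts (raw error payloads)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")

    now = datetime.now(timezone.utc)
    key = f"{tenant_id}/{now:%Y/%m/%d}/events-{int(now.timestamp())}.parquet"

    boto3.client("s3").put_object(
        Bucket="error-events-archive",
        Key=key,
        Body=buf.getvalue(),
    )
```

The transition to IA/Glacier in step 3 is configured as a lifecycle policy on the bucket itself, not in this write path.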
🧠 Food for Thought
We use S3 + Parquet for cheap, durable storage that’s also queryable on demand.
👉 When access is rare but retention matters, think S3.
🔍 Query & API Layer — The Bridge Between Storage & Users
We need:
⚡ Fast responses to UI and API requests
🧠 Business logic orchestration — combining results from multiple stores
🔒 Access control & multi-tenancy
🔄 Real-time + historical data blending
💡 Why it’s critical
The Query & API Layer acts as the traffic controller for all requests.
Instead of each client querying Elasticsearch, ClickHouse, Cassandra, or S3 directly, the API layer:
🎯 Routes requests to the right backend store
🧩 Merges & shapes the data before returning
🛡 Applies security filters so each tenant sees only their data
📊 Caches common queries to improve performance
📐 Responsibilities
Search & Filtering — Uses Elasticsearch for full-text & metadata filters
Analytics & Trends — Fetches aggregated counts from ClickHouse
Instance Lookups — Pulls full payload from Cassandra by event_instance_id
Historical Fetches — Retrieves archived data from S3 if needed
Mutations — Updates error status, annotations, or assignments in Elasticsearch
📊 Flow
1️⃣ User clicks "Search errors" in the UI → API queries the Elasticsearch index for matches.
2️⃣ User opens an error group → API fetches latest occurrences from ClickHouse & instance payload from Cassandra.
3️⃣ User requests a 1-year-old error → API fetches from S3 archive via Athena or Spark.
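In code, the orchestration behind step 2 might look like this sketch. search_metadata, fetch_trend, and fetch_latest_instance are hypothetical wrappers around the Elasticsearch, ClickHouse, and Cassandra calls shown earlier:

```python
def get_error_group_view(tenant_id: str, fingerprint: str) -> dict:
    """Assemble one API response for an error group from the specialised stores."""
    metadata = search_metadata(tenant_id, fingerprint)       # Elasticsearch: status, title, tags, assignee
    trend = fetch_trend(tenant_id, fingerprint, days=30)     # ClickHouse: daily occurrence counts
    latest = fetch_latest_instance(tenant_id, fingerprint)   # Cassandra: newest raw payload

    # Every downstream call is scoped by tenant_id, so tenant isolation
    # is enforced here even if a client tries to query another tenant.
    return {
        "tenant_id": tenant_id,
        "fingerprint": fingerprint,
        "metadata": metadata,
        "occurrence_trend": trend,
        "latest_instance": latest,
    }
```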
This is the final design.
💬 Comment with your feedback or alternate designs
📊 Back-of-the-Envelope Estimations
🌍 Total Daily Events → 10,000 customers × 5 projects × 10,000 errors/day = 500M events/day
⚡ Avg Ingestion Rate → 500M ÷ 86,400s ≈ 5.8K QPS
🚀 Peak Ingestion Rate → 5.8K × 3 ≈ 17.4K QPS
💾 Daily Data Volume → 500M × 2 KB = ~1 TB/day
🔥 Hot Storage (30 days, Cassandra) → 1 TB × 30 = 30 TB → with RF=3 → 90 TB
📈 Analytics Storage (ClickHouse, 1 year) → 100:1 aggregation → ~10 GB/day → yearly ~3.65 TB → with replication 8–10 TB
🔍 Metadata Storage (Elasticsearch) → ~1M unique error groups × 5 KB = ~5 GB → with replicas 10–20 GB
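The same numbers as quick Python arithmetic, in case you want to tweak the assumptions:

```python
customers, projects, errors_per_day = 10_000, 5, 10_000
events_per_day = customers * projects * errors_per_day   # 500,000,000 events/day

avg_qps = events_per_day / 86_400                         # ≈ 5.8K QPS
peak_qps = avg_qps * 3                                    # ≈ 17.4K QPS

event_size_kb = 2
daily_tb = events_per_day * event_size_kb / 1024 ** 3     # ≈ 1 TB/day

hot_storage_tb = daily_tb * 30 * 3                        # 30 days × RF=3 ≈ 90 TB (rounding to 1 TB/day)
analytics_tb_per_year = (daily_tb / 100) * 365            # 100:1 aggregation ≈ 3.5 TB before replication
metadata_gb = 1_000_000 * 5 / 1024 ** 2                   # ~1M groups × 5 KB ≈ 5 GB before replicas
```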
💡 Key Takeaways
⚡ Cassandra → Absorb massive writes, 90 TB hot storage with RF=3.
📅 ClickHouse DB → Store 1-year aggregated analytics (~8–10 TB).
🔍 Elasticsearch → Fast metadata & grouped error searches (10–20 GB).
🛠 Detailed Design — How We Meet the NFRs
We’ve engineered this system to not just “work,” but to excel under real-world scale, speed, and security demands. Here’s how each Non-Functional Requirement is handled:
📈 NFR1 — Highly Scalable
⚙ Horizontal Scaling Everywhere → Add more nodes to Ingestion Service, Kafka, Flink, Cassandra, Elasticsearch, and ClickHouse without downtime.
✂ Tenant Partitioning → tenant_id sharding keeps noisy neighbours from impacting others.
📝 Write-Optimized → Handles huge ingestion bursts without choking.
⚡ NFR2 — Low Latency
⏱ Ingestion < 100ms → Immediate Kafka write, async processing.
🚨 Error Alerts (p99 < 10s) → Flink streams process in-memory for near-real-time detection.
📊 Analytics (p95 < 2s) → ClickHouse pre-aggregates to make dashboards instant.
🛡 NFR3 — High Availability (99.99% Uptime)
🔄 Multi-Node Clusters (RF=3) → Survive node failures with no downtime.
🗂 Durable Kafka Buffering → Data stays safe until consumers recover.
🔁 Self-Healing → Automatic failover keeps the system running.
💾 NFR4 — Durability (No Data Loss)
📌 Kafka as the Source of Truth → Persistent before processing begins.
🌍 Cross-Zone Replication → Survives data center outages.
🧩 NFR5 — Data Isolation
🔐 Strict Tenant Partitioning → Enforced at Kafka, Cassandra, Elasticsearch, ClickHouse.
🛠 App-Layer Enforcement → Every query auto-filters by tenant_id.
🔒 NFR6 — Security
🔑 Encryption Everywhere → TLS in transit, AES-256 at rest.
🧍 Role-Based Access Control → Fine-grained API & data permissions.
🛡 DDoS Protection → Rate limiting + WAF at the edge.
💡 TL;DR:
We’ve built a system that’s fast ⚡, scalable 📈, always-on 🛡, loss-proof 💾, tenant-safe 🧩, and locked-down 🔒 — ready for internet-scale traffic without compromise.
🔄 Trade-offs & Alternatives
In interviews, it’s a good idea to highlight one or two key trade-offs you considered. This shows the interviewer not just what decision you made, but why you made it—and demonstrates your ability to weigh options thoughtfully.
1️⃣ Polyglot Persistence 🗂️ vs. Monolithic DB 🏢
Choice: Use multiple purpose-built stores — Cassandra ⚡ (hot writes), Elasticsearch 🔍 (search & metadata), ClickHouse 📊 (analytics).
Alternative: One big DB (e.g., PostgreSQL).
Trade-off:
✅ Best tool for each workload → faster, cheaper at scale.
⚠️ More systems to run, monitor, and secure.
❌ Monolithic DB = simpler ops, but slower writes & weaker analytics under high load.
2️⃣ Event Streaming 🚀 vs. Batch ETL ⏳
Choice: Real-time processing with Kafka + Flink 🌀.
Alternative: Periodic batch jobs (e.g., hourly Spark runs).
Trade-off:
✅ Alerts in seconds, live dashboards.
⚠️ More infra complexity.
❌ Batch = simpler, but stale data + missed SLAs (p99 < 10s alerting would fail).
3️⃣ Async Writes 📬 vs. Sync Writes ⏱️
Choice: API ACKs after Kafka enqueue.
Alternative: Write directly to DBs in request path.
Trade-off:
✅ <100ms ingestion latency, smooth spike handling.
⚠️ Slightly delayed DB persistence.
❌ Sync = stronger immediate durability, but slower & more failure-prone.
4️⃣ Loose Coupling 🪢 vs. Direct Calls 📞
Choice: Kafka backbone for service decoupling.
Alternative: Services call each other directly.
Trade-off:
✅ Fault-tolerant, independently scalable services.
⚠️ More moving parts.
❌ Direct = simpler, but brittle at high QPS.
💬 Your Move
🙌 If you found this helpful:
❤️ Hit like — it helps others discover this series