System Design Interview - Real-Time Error Monitoring Tool (Part-2)(Asked at Google, Apr '25)
S1E4 - Backend of Error Monitoring Tool
This is Part 2 of our deep dive into building an Error Monitoring Platform.
If you missed Part 1, you can catch up here 👇 before diving in.
System Design Interview - Real-Time Error Monitoring Tool (Part-1)(Asked at Google, Apr '25)
Hey everyone 👋 and welcome to Episode 4 of my System Design Interview Series!
🔍 Persistence Deep Dive
We split processed events into multiple specialised storage systems to meet access pattern needs efficiently:
🗄️ Storing Every Error Instance — The Cassandra Way
We keep every single error’s full details for 30–90 days so developers can see exactly what happened.
This means massive writes coming in fast 🚀 and simple lookups when we need them.
💡 Why Apache Cassandra?
Cassandra is a masterless, distributed database that’s built for exactly this type of workload:
⚡ Extreme write throughput — it can ingest millions of events per second
🛡 High availability — no single point of failure
🔍 Fast lookups — optimized for queries by partition key
🗂️ How We Store the Data
We design Cassandra’s schema to match our access patterns:
Main Table: error_instances
Partition key:
(tenant_id, error_fingerprint)
→ groups all instances of the same error for a given customer together on the same nodes.Clustering columns:
(event_timestamp DESC, event_instance_id)
→ orders errors so the most recent are fetched first.Schema: tenant_id (UUID), error_fingerprint (TEXT), event_timestamp (TIMESTAMP), event_instance_id (UUID), raw_payload (BLOB), metadata_tags (MAP<TEXT, TEXT>), PRIMARY KEY ((tenant_id, error_fingerprint), event_timestamp, event_instance_id) with CLUSTERING ORDER BY (event_timestamp DESC)
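The schema above maps to CQL roughly as follows. This is a minimal sketch using the DataStax Python driver; the contact point and the keyspace name error_monitoring are my placeholders, not part of the design:

```python
from cassandra.cluster import Cluster

# Hypothetical contact point and keyspace; adjust to your cluster.
cluster = Cluster(["cassandra-node-1"])
session = cluster.connect("error_monitoring")

# Main table: partitioned by (tenant_id, error_fingerprint),
# newest instances first within each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS error_instances (
        tenant_id          UUID,
        error_fingerprint  TEXT,
        event_timestamp    TIMESTAMP,
        event_instance_id  UUID,
        raw_payload        BLOB,
        metadata_tags      MAP<TEXT, TEXT>,
        PRIMARY KEY ((tenant_id, error_fingerprint), event_timestamp, event_instance_id)
    ) WITH CLUSTERING ORDER BY (event_timestamp DESC, event_instance_id ASC)
""")
```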
To fetch a single instance directly by its unique ID, a secondary "lookup" table is needed:
Lookup Table: instance_lookup
Partition key:
event_instance_id
→ quick way to find the full primary key for an error, then fetch it from error_instances (see the read-path sketch below).
Schema: event_instance_id (UUID), tenant_id (UUID), error_fingerprint (TEXT), event_timestamp (TIMESTAMP), PRIMARY KEY (event_instance_id)
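Fetching one instance by ID is then a two-step read. A minimal sketch with the Python driver, reusing the session from the snippet above (the helper name fetch_instance is mine):

```python
def fetch_instance(event_instance_id):
    # `session` is the cassandra-driver Session created in the DDL sketch above.
    # Step 1: resolve the full primary key from the lookup table.
    key = session.execute(
        "SELECT tenant_id, error_fingerprint, event_timestamp "
        "FROM instance_lookup WHERE event_instance_id = %s",
        (event_instance_id,),
    ).one()
    if key is None:
        return None

    # Step 2: use that key to read the full payload from the main table.
    return session.execute(
        "SELECT raw_payload, metadata_tags FROM error_instances "
        "WHERE tenant_id = %s AND error_fingerprint = %s "
        "AND event_timestamp = %s AND event_instance_id = %s",
        (key.tenant_id, key.error_fingerprint, key.event_timestamp, event_instance_id),
    ).one()
```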
⏳ Automatic Expiry with TTL
Each record is written with a time-to-live.
When it expires, Cassandra automatically deletes it — no manual cleanup jobs, no wasted storage 💰.
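As a sketch, a 90-day retention is just a USING TTL clause on the insert (again reusing the session from the earlier snippet; all values below are placeholders, and a 30-day tier works the same way):

```python
import uuid
from datetime import datetime, timezone

TTL_90_DAYS = 90 * 24 * 3600  # retention in seconds

insert_stmt = session.prepare(
    "INSERT INTO error_instances "
    "(tenant_id, error_fingerprint, event_timestamp, event_instance_id, raw_payload, metadata_tags) "
    f"VALUES (?, ?, ?, ?, ?, ?) USING TTL {TTL_90_DAYS}"
)

session.execute(insert_stmt, (
    uuid.uuid4(),                          # tenant_id (placeholder)
    "NullPointerException:checkout",       # error_fingerprint (placeholder)
    datetime.now(timezone.utc),            # event_timestamp
    uuid.uuid4(),                          # event_instance_id
    b'{"stack": "..."}',                   # raw_payload, serialized event
    {"os": "linux", "browser": "chrome"},  # metadata_tags
))
```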
💬 Would you use ScyllaDB or DynamoDB instead? Share your take below.
🔍 Elasticsearch for Search & Metadata
We need:
🎯 Powerful filtering on any metadata (OS, browser, user ID, tags…)
📜 Full-text search on error messages
🗂 Mutable state tracking (status, assignee, annotations)
💡 Why Elasticsearch?
⚡ Near real-time search across billions of docs
🧩 Schema-flexible — perfect for semi-structured grouped errors
🔎 Inverted index magic for lightning-fast filtering
📐 Data Model
Index:
grouped_errors
Doc ID:
tenant_id + error_fingerprint
Fields: tenant_id, project_id, status, title, timestamps, tags, annotations
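A rough mapping for that index, sketched with the official Python client. The field names follow the data model above; the exact field types and timestamp/count names are my assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

es.indices.create(
    index="grouped_errors",
    mappings={
        "properties": {
            "tenant_id":            {"type": "keyword"},
            "project_id":           {"type": "keyword"},
            "status":               {"type": "keyword"},   # e.g. unresolved / resolved / ignored
            "title":                {"type": "text"},      # full-text searchable error message
            "first_seen_timestamp": {"type": "date"},
            "last_seen_timestamp":  {"type": "date"},
            "event_count":          {"type": "long"},
            "tags":                 {"type": "flattened"}, # arbitrary key/value metadata
            "annotations":          {"type": "text"},
        }
    },
)
```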
📊 Flow
1️⃣ New error detected → Kafka consumer creates a new doc in grouped_errors.
2️⃣ Same error happens again → Update the doc: increment count, update last_seen_timestamp.
3️⃣ User actions (e.g., change status, add note) → API directly updates the document in Elasticsearch.
4️⃣ Search & filter → Developers instantly query by metadata, tags, or text to find exactly what they need.
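One way to implement this flow with the Python client is a scripted upsert, so steps 1 and 2 become a single call, plus a partial update for step 3. A sketch, with the doc ID scheme taken from the data model above and the example values as placeholders:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

tenant_id, fingerprint = "tenant-123", "NullPointerException:checkout"  # placeholders
doc_id = f"{tenant_id}:{fingerprint}"   # Doc ID = tenant_id + error_fingerprint
now = datetime.now(timezone.utc).isoformat()

# Steps 1 & 2: create the group on first sight, otherwise bump count + last_seen.
es.update(
    index="grouped_errors",
    id=doc_id,
    script={
        "source": "ctx._source.event_count += 1; ctx._source.last_seen_timestamp = params.ts",
        "params": {"ts": now},
    },
    upsert={
        "tenant_id": tenant_id,
        "status": "unresolved",
        "title": "NullPointerException in checkout",
        "event_count": 1,
        "first_seen_timestamp": now,
        "last_seen_timestamp": now,
    },
)

# Step 3: user action, e.g. resolving the error group.
es.update(index="grouped_errors", id=doc_id, doc={"status": "resolved"})

# Step 4: search & filter, e.g. unresolved errors for this tenant mentioning "checkout".
hits = es.search(
    index="grouped_errors",
    query={
        "bool": {
            "filter": [{"term": {"tenant_id": tenant_id}}, {"term": {"status": "unresolved"}}],
            "must": [{"match": {"title": "checkout"}}],
        }
    },
)
```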
📊 ClickHouse for Analytics & Aggregations
We need:
📈 Fast aggregations on billions of error events
📅 Time-series queries for trends, counts, breakdowns
🚀 Low-latency analytics even with huge datasets
💡 Why ClickHouse?
⚡ Columnar storage → blazing-fast scans for large datasets
🪶 Compression → keeps storage costs low while handling massive data
🛠 Rich SQL support for GROUP BY, filtering, and time-window calculations
📐 Data Model
Table:
error_events_analytics
Partition Key:
event_date
Order By:
(tenant_id, error_fingerprint, event_timestamp)
Fields: tenant_id, project_id, timestamps, error_fingerprint, tags, counts
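In ClickHouse DDL this could look roughly like the following, executed here through the clickhouse-driver Python client. Column names follow the model above; the exact types and table engine settings are my assumptions:

```python
from clickhouse_driver import Client

ch = Client(host="clickhouse-host")  # hypothetical host

ch.execute("""
    CREATE TABLE IF NOT EXISTS error_events_analytics (
        event_date        Date,
        event_timestamp   DateTime,
        tenant_id         UUID,
        project_id        UUID,
        error_fingerprint String,
        tags              Map(String, String),
        event_count       UInt64
    )
    ENGINE = MergeTree
    PARTITION BY event_date
    ORDER BY (tenant_id, error_fingerprint, event_timestamp)
""")
```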
📊 Flow
1️⃣ Raw events ingested into Kafka → Flink processes & aggregates → writes to ClickHouse.
2️⃣ Query layer fetches aggregated data for dashboards, charts, and trend lines.
3️⃣ Developers can drill down by date, tags, or fingerprint to see patterns.
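A typical drill-down from step 3, daily counts for one fingerprint over the last 30 days, might look like this sketch (reusing the client from the DDL snippet above; the parameter values are placeholders):

```python
rows = ch.execute(
    """
    SELECT event_date, sum(event_count) AS errors
    FROM error_events_analytics
    WHERE tenant_id = toUUID(%(tenant)s)
      AND error_fingerprint = %(fp)s
      AND event_date >= today() - 30
    GROUP BY event_date
    ORDER BY event_date
    """,
    {"tenant": "11111111-2222-3333-4444-555555555555", "fp": "NullPointerException:checkout"},
)
```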
🧠 Food for Thought
We use Cassandra for massive write throughput with high availability. Its masterless architecture allows linear horizontal scaling — perfect for append-heavy, always-on ingestion workloads.
👉 When your writes outpace your reads, think Cassandra.
Elasticsearch: Ideal for full-text, multi-field search
👉 When users need “find anything, filter everything,” think Elasticsearch.
ClickHouse: Built for blazing-fast OLAP queries over massive datasets. It excels in high read throughput and is typically optimised for batch writes.
👉 When you care about latency and aggregation speed, think ClickHouse.
📣 If you’ve used ClickHouse or Elasticsearch in prod — I’d love to hear your war stories. Share below 👇
☁️ Amazon S3 for Raw Event Archival
We need:
📦 Long-term storage for compliance & offline analysis
🛡 Durability — must never lose archived data
💰 Cost-effective for large historical datasets
💡 Why S3?
🛡 11 nines of durability (99.999999999%) — data is safe for years
💵 Cheap storage tiers (Standard, IA, Glacier)
🔄 Easy integration with Athena, Spark, EMR for ad-hoc queries
📐 Data Model
Bucket:
error-events-archive
Prefix:
/tenant_id/year/month/day/
Format: Parquet (compressed, query-friendly)
📊 Flow
1️⃣ Flink or batch job periodically writes compressed raw events to S3.
2️⃣ Athena/Spark can query archived data for historical analysis.
3️⃣ Old data automatically transitions to cheaper storage tiers via lifecycle policies.
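A sketch of the archival write under the layout above, using boto3 and pyarrow. The bucket and key pattern come from the data model; the batching/scheduling around it (Flink or a cron job) is left out:

```python
import io
from datetime import datetime, timezone

import boto3
import pyarrow as pa
import pyarrow.parquet as pq

def archive_batch(tenant_id: str, events: list) -> None:
    """Write one batch of raw events as compressed Parquet under /tenant_id/year/month/day/."""
    table = pa.Table.from_pylist(events)   # events: list of dicts (raw error payloads)
    buf = io.BytesIO()
    pq.write_table(table, buf, compression="snappy")

    now = datetime.now(timezone.utc)
    key = f"{tenant_id}/{now:%Y/%m/%d}/events-{int(now.timestamp())}.parquet"

    boto3.client("s3").put_object(
        Bucket="error-events-archive",
        Key=key,
        Body=buf.getvalue(),
    )
```

The transition to IA/Glacier in step 3 is configured as a lifecycle policy on the bucket itself, not in this write path.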
🧠 Food for Thought
We use S3 + Parquet for cheap, durable storage that’s also queryable on demand.
👉 When access is rare but retention matters, think S3.
🔍 Query & API Layer — The Bridge Between Storage & Users
We need:
⚡ Fast responses to UI and API requests
🧠 Business logic orchestration — combining results from multiple stores
🔒 Access control & multi-tenancy
🔄 Real-time + historical data blending
💡 Why it’s critical
The Query & API Layer acts as the traffic controller for all requests.
Instead of each client querying Elasticsearch, ClickHouse, Cassandra, or S3 directly, the API layer:
🎯 Routes requests to the right backend store
🧩 Merges & shapes the data before returning
🛡 Applies security filters so each tenant sees only their data
📊 Caches common queries to improve performance
📐 Responsibilities
Search & Filtering — Uses Elasticsearch for full-text & metadata filters
Analytics & Trends — Fetches aggregated counts from ClickHouse
Instance Lookups — Pulls full payload from Cassandra by event_instance_id
Historical Fetches — Retrieves archived data from S3 if needed
Mutations — Updates error status, annotations, or assignments in Elasticsearch
📊 Flow
1️⃣ User clicks "Search errors" in the UI → API queries the Elasticsearch index for matches.
2️⃣ User opens an error group → API fetches latest occurrences from ClickHouse & instance payload from Cassandra.
3️⃣ User requests a 1-year-old error → API fetches from S3 archive via Athena or Spark.
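In code, the orchestration behind step 2 might look like this sketch. search_metadata, fetch_trend, and fetch_latest_instance are hypothetical wrappers around the Elasticsearch, ClickHouse, and Cassandra calls shown earlier:

```python
def get_error_group_view(tenant_id: str, fingerprint: str) -> dict:
    """Assemble one API response for an error group from the specialised stores."""
    metadata = search_metadata(tenant_id, fingerprint)       # Elasticsearch: status, title, tags, assignee
    trend = fetch_trend(tenant_id, fingerprint, days=30)     # ClickHouse: daily occurrence counts
    latest = fetch_latest_instance(tenant_id, fingerprint)   # Cassandra: newest raw payload

    # Every downstream call is scoped by tenant_id, so tenant isolation
    # is enforced here even if a client tries to query another tenant.
    return {
        "tenant_id": tenant_id,
        "fingerprint": fingerprint,
        "metadata": metadata,
        "occurrence_trend": trend,
        "latest_instance": latest,
    }
```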
This is the final design.
💬 Comment with your feedback or alternate designs
📊 Back-of-the-Envelope Estimations
🌍 Total Daily Events → 10,000 customers × 5 projects × 10,000 errors/day = 500M events/day
⚡ Avg Ingestion Rate → 500M ÷ 86,400s ≈ 5.8K QPS
🚀 Peak Ingestion Rate → 5.8K × 3 ≈ 17.4K QPS
💾 Daily Data Volume → 500M × 2 KB = ~1 TB/day
🔥 Hot Storage (30 days, Cassandra) → 1 TB × 30 = 30 TB → with RF=3 → 90 TB
📈 Analytics Storage (ClickHouse, 1 year) → 100:1 aggregation → ~10 GB/day → yearly ~3.65 TB → with replication 8–10 TB
🔍 Metadata Storage (Elasticsearch) → ~1M unique error groups × 5 KB = ~5 GB → with replicas 10–20 GB
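The same numbers as quick Python arithmetic, in case you want to tweak the assumptions:

```python
customers, projects, errors_per_day = 10_000, 5, 10_000
events_per_day = customers * projects * errors_per_day   # 500,000,000 events/day

avg_qps = events_per_day / 86_400                         # ≈ 5.8K QPS
peak_qps = avg_qps * 3                                    # ≈ 17.4K QPS

event_size_kb = 2
daily_tb = events_per_day * event_size_kb / 1024 ** 3     # ≈ 1 TB/day

hot_storage_tb = daily_tb * 30 * 3                        # 30 days × RF=3 ≈ 90 TB (rounding to 1 TB/day)
analytics_tb_per_year = (daily_tb / 100) * 365            # 100:1 aggregation ≈ 3.5 TB before replication
metadata_gb = 1_000_000 * 5 / 1024 ** 2                   # ~1M groups × 5 KB ≈ 5 GB before replicas
```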
💡 Key Takeaways
⚡ Cassandra → Absorb massive writes, 90 TB hot storage with RF=3.
📅 ClickHouse DB → Store 1-year aggregated analytics (~8–10 TB).
🔍 Elasticsearch → Fast metadata & grouped error searches (10–20 GB).
🛠 Detailed Design — How We Meet the NFRs
We’ve engineered this system to not just “work,” but to excel under real-world scale, speed, and security demands. Here’s how each Non-Functional Requirement is handled:
📈 NFR1 — Highly Scalable
⚙ Horizontal Scaling Everywhere → Add more nodes to Ingestion Service, Kafka, Flink, Cassandra, Elasticsearch, and ClickHouse without downtime.
✂ Tenant Partitioning → tenant_id sharding keeps noisy neighbours from impacting others.
📝 Write-Optimized → Handles huge ingestion bursts without choking.
⚡ NFR2 — Low Latency
⏱ Ingestion < 100ms → Immediate Kafka write, async processing.
🚨 Error Alerts (p99 < 10s) → Flink streams process in-memory for near-real-time detection.
📊 Analytics (p95 < 2s) → ClickHouse pre-aggregates to make dashboards instant.
🛡 NFR3 — High Availability (99.99% Uptime)
🔄 Multi-Node Clusters (RF=3) → Survive node failures with no downtime.
🗂 Durable Kafka Buffering → Data stays safe until consumers recover.
🔁 Self-Healing → Automatic failover keeps the system running.
💾 NFR4 — Durability (No Data Loss)
📌 Kafka as the Source of Truth → Persistent before processing begins.
🌍 Cross-Zone Replication → Survives data center outages.
🧩 NFR5 — Data Isolation
🔐 Strict Tenant Partitioning → Enforced at Kafka, Cassandra, Elasticsearch, ClickHouse.
🛠 App-Layer Enforcement → Every query auto-filters by tenant_id.
🔒 NFR6 — Security
🔑 Encryption Everywhere → TLS in transit, AES-256 at rest.
🧍 Role-Based Access Control → Fine-grained API & data permissions.
🛡 DDoS Protection → Rate limiting + WAF at the edge.
💡 TL;DR:
We’ve built a system that’s fast ⚡, scalable 📈, always-on 🛡, loss-proof 💾, tenant-safe 🧩, and locked-down 🔒 — ready for internet-scale traffic without compromise.
🔄 Trade-offs & Alternatives
In interviews, it’s a good idea to highlight one or two key trade-offs you considered. This shows the interviewer not just what decision you made, but why you made it—and demonstrates your ability to weigh options thoughtfully.
1️⃣ Polyglot Persistence 🗂️ vs. Monolithic DB 🏢
Choice: Use multiple purpose-built stores — Cassandra ⚡ (hot writes), Elasticsearch 🔍 (search & metadata), ClickHouse 📊 (analytics).
Alternative: One big DB (e.g., PostgreSQL).
Trade-off:
✅ Best tool for each workload → faster, cheaper at scale.
⚠️ More systems to run, monitor, and secure.
❌ Monolithic DB = simpler ops, but slower writes & weaker analytics under high load.
2️⃣ Event Streaming 🚀 vs. Batch ETL ⏳
Choice: Real-time processing with Kafka + Flink 🌀.
Alternative: Periodic batch jobs (e.g., hourly Spark runs).
Trade-off:
✅ Alerts in seconds, live dashboards.
⚠️ More infra complexity.
❌ Batch = simpler, but stale data + missed SLAs (p99 < 10s alerting would fail).
3️⃣ Async Writes 📬 vs. Sync Writes ⏱️
Choice: API ACKs after Kafka enqueue.
Alternative: Write directly to DBs in request path.
Trade-off:
✅ <100ms ingestion latency, smooth spike handling.
⚠️ Slightly delayed DB persistence.
❌ Sync = stronger immediate durability, but slower & more failure-prone.
4️⃣ Loose Coupling 🪢 vs. Direct Calls 📞
Choice: Kafka backbone for service decoupling.
Alternative: Services call each other directly.
Trade-off:
✅ Fault-tolerant, independently scalable services.
⚠️ More moving parts.
❌ Direct = simpler, but brittle at high QPS.
💬 Your Move
🙌 If you found this helpful:
❤️ Hit like — it helps others discover this series