Welcome to Y School of Tech, where we dive into the 'why' behind critical tech concepts. In this issue, we’re unpacking the CAP Theorem, a fundamental principle in distributed systems.
The Inherent Nature of Distributed Systems vs. Centralized Systems
Note: We will use the terms "distributed system" and "distributed data store", and "node" and "server", interchangeably.
In a single, centralised database system, all data operations are managed in one place, ensuring consistency and simplicity.
However, as systems scale and require higher availability, we move towards distributed systems, where data is spread across multiple nodes or locations. This distribution introduces complexity, particularly when it comes to maintaining data consistency, ensuring availability, and dealing with network partitions.
The Core Trade-Off: Consistency vs. Availability
When designing real-world distributed systems, we often encounter a crucial trade-off: Consistency vs. Availability. Achieving both simultaneously, especially during network failures, is challenging. (Do you know why? It is explained below.)
In this discussion, we’ll explore what the CAP Theorem actually means, why these trade-offs are unavoidable, and how they impact system design.
Let’s start by understanding the basics:
1. Distributed System
A distributed system spreads tasks and data across multiple nodes (servers) in a network, enhancing speed and efficiency. Unlike centralised systems, which store data in one location, distributed systems spread data across various nodes.
(Why we need Distributed System? This will be explained later.)
2. Understanding CAP Theorem
The CAP Theorem, introduced by Eric Brewer in 2000, states that in any distributed data store, you can guarantee at most two of the following three properties:
Consistency (C)
Availability (A)
Partition Tolerance (P)
Example :-
Let’s understand the CAP Theorem with the help of an example. Imagine you (Mr. X) and your friend (Mr. Y) are co-founders of a startup, Y School of Tech. You both share a joint bank account with a balance of $1000.
One day, you and your co-founder each go to different ATMs (ATMx and ATMy, respectively) to access the same account. These ATMs are part of the bank's distributed system, meaning they each have their own database and are connected to each other through a network.
Image Courtesy - Link
Here's how the CAP Theorem principles apply in this scenario:
3. Consistency (C)
Definition: Every read receives the most recent write or an error. Consistency in our example means that if you withdraw $200 from ATMx, ATMy should immediately reflect the updated balance of $800.
Challenge: In a distributed system, updates might not propagate instantly across all nodes. If there’s a delay in updating ATMy after your withdrawal at ATMx, ATMy might still display the old balance of $1000, leading to inconsistencies across the system.
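The replication lag described above can be sketched in a few lines of Python. This is a toy model, not a real banking system: the node class, the delay value, and the background replication thread are all hypothetical, chosen only to show how a read at ATMy can return a stale balance while ATMx has already applied the write.

```python
import threading
import time

class AtmNode:
    """Hypothetical ATM node holding its own local copy of the balance."""
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance

    def read_balance(self):
        return self.balance

def withdraw(node, peer, amount, replication_delay):
    """Apply a withdrawal locally, then replicate to the peer after a delay."""
    node.balance -= amount

    def replicate():
        # Asynchronous replication: the peer only learns about the
        # write after `replication_delay` seconds.
        time.sleep(replication_delay)
        peer.balance = node.balance

    threading.Thread(target=replicate).start()

atm_x = AtmNode("ATMx", 1000)
atm_y = AtmNode("ATMy", 1000)

withdraw(atm_x, atm_y, 200, replication_delay=0.5)
print(atm_x.read_balance())  # 800  -- the node that took the write
print(atm_y.read_balance())  # 1000 -- stale until replication catches up
time.sleep(1)
print(atm_y.read_balance())  # 800  -- consistent again, but only eventually
```

The window between the first and last read of ATMy is exactly the inconsistency the CAP Theorem is concerned with.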
4. Availability (A)
Definition: Availability guarantees that every request (like checking a balance or making a withdrawal) receives a response, even if some components of the system are down or unreachable.
Challenge: If the link between ATMx and ATMy is temporarily down or degraded, both you and your co-founder could each withdraw the full $1000, since each ATM is unaware of the other's transaction. (More on this later.)
5. Partition
Definition: A network partition occurs when the network loses arbitrarily many messages sent from one node to another; in other words, nodes become unable to communicate with each other.
6. Partition Tolerance (P)
Definition: Partition tolerance means the system continues to operate even if there is a network partition or communication breakdown between different data nodes of the system.
Challenge: Partition tolerance ensures that both ATMs can still operate even if they cannot communicate with each other during the partition, but we then have to choose between availability and consistency.
Applying CAP Theorem Principles
System without Partition :-
Consistency + Availability (Without Partition Tolerance)
Scenario:
Consider a hypothetical situation where a network partition never occurs (not possible in real-world scenarios) and both ATMs (ATMx and ATMy) have perfectly synchronised databases.
In this idealised setup, both ATMs always show the correct balance immediately after any transaction, ensuring both consistency and availability.
Consistency: Changes made at one ATM (e.g., a withdrawal) are immediately reflected at the other ATM.
Availability: Both ATMs respond to every request promptly.
Impact: This scenario is practically impossible in real-world distributed systems because network connections are inherently unreliable.
Network partitions and failures are inevitable, making it unrealistic to achieve both consistency and availability without accounting for partition tolerance.
Examples: Traditional relational databases like PostgreSQL and MariaDB are designed for consistency and availability in centralised setups or scenarios with minimal network partitions. They are not distributed databases and do not address partition tolerance as part of their core design.
System under Partitions :-
In real-world distributed systems, network partitions are an unavoidable reality. There are two possibilities in this case:
CP - Consistency + Partition Tolerance (Without Availability):
Scenario: In a real-world distributed system, the ATMs (ATMx and ATMy) are located in different places, and the requirement is to show the correct, consistent balance after every transaction, even during a network partition. To provide consistency, the ATMs may become temporarily unavailable until the partition is resolved; we are dropping availability.
Impact: The system sacrifices availability to maintain consistency and handle partitions. During network issues, users might not be able to perform transactions until the system is fully operational.
Examples: MongoDB, HBase
(Note - You might be wondering, "Is the system considered down in a CP system if both Mr. X and Mr. Y are unable to withdraw money?" The answer is yes, but this would only occur during a network partition: a temporary disruption in communication between the ATMs. These partitions typically don't last long, so the system is usually able to recover and resume normal operation quickly.)
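A CP node's behaviour during a partition can be sketched as follows. This is a hypothetical toy class, not how MongoDB or HBase are actually implemented; it only illustrates the choice of returning an error rather than a possibly stale or conflicting answer.

```python
class CpAtmNode:
    """Hypothetical CP-style ATM: refuses requests when its peer is unreachable."""
    def __init__(self, balance):
        self.balance = balance
        self.peer_reachable = True  # flips to False during a partition

    def withdraw(self, amount):
        if not self.peer_reachable:
            # Sacrifice availability: an error instead of an unverified answer.
            raise ConnectionError("Partition detected: ATM temporarily unavailable")
        self.balance -= amount
        return self.balance

atm_x = CpAtmNode(1000)
print(atm_x.withdraw(200))  # 800 -- normal operation while the peer is reachable

atm_x.peer_reachable = False  # network partition begins
try:
    atm_x.withdraw(200)
except ConnectionError as e:
    print(e)  # the system chooses consistency over availability
```

Once the partition heals and `peer_reachable` is restored, the node resumes serving requests with a balance that was never allowed to diverge.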
Availability + Partition Tolerance (Without Consistency):
Scenario: In this case, ATMx and ATMy remain operational and continue to process transactions even if a network partition occurs. However, during this period, both you and your co-founder are able to withdraw the whole amount.
Impact: The system prioritises keeping ATMs operational and available despite network partitions, but allows for temporary inconsistencies in balance information until the partition is resolved.
Examples - Cassandra, CouchDB
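The AP double-withdrawal problem can be sketched the same way. Again a hypothetical toy model: real AP stores reconcile such divergent writes using timestamps or application-level conflict resolution, which this sketch deliberately omits so the conflict stays visible.

```python
class ApAtmNode:
    """Hypothetical AP-style ATM: always answers from its local copy."""
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance

    def withdraw(self, amount):
        # Always respond, even if the peer is unreachable.
        self.balance -= amount
        return self.balance

atm_x = ApAtmNode("ATMx", 1000)
atm_y = ApAtmNode("ATMy", 1000)

# During the partition, neither node can see the other's write.
print(atm_x.withdraw(1000))  # 0 -- Mr. X empties the account at ATMx
print(atm_y.withdraw(1000))  # 0 -- Mr. Y does the same at ATMy

# After the partition heals, reconciliation reveals the conflict:
# $2000 was withdrawn from a $1000 account.
total_withdrawn = (1000 - atm_x.balance) + (1000 - atm_y.balance)
print(f"Overdrawn by ${total_withdrawn - 1000}")  # Overdrawn by $1000
```

The system stayed available throughout, but at the cost of a temporary inconsistency that must be resolved after the partition.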
Due to the inherent nature of Distributed Data Stores to be Partition Tolerant, we must typically choose between Consistency and Availability.
Application to real world problems
In the example above, opting for an AP system for bank transactions could lead to disastrous outcomes, making a CP approach more viable. In real-world distributed systems, where partitions are inevitable, it becomes essential to choose between consistency and availability; that is why there is a trade-off between them.
Now, let’s address two key questions:
wh(Y) Do We Need Distributed Systems?
Distributed systems are necessary for achieving horizontal scalability. They allow us to handle increasing requests and manage vast amounts of data, something a single node or server cannot efficiently manage due to limitations in cost, memory, and CPU resources.
wh(Y) Are Partitions Inevitable in Distributed Systems?
To distribute data across multiple nodes, we inherently separate them geographically or logically. Because a network's uptime can never be guaranteed, this separation makes partitions unavoidable, leading to temporary communication breakdowns between nodes.
The trade-offs dictated by the CAP Theorem are not just theoretical—they manifest in real-world scenarios where network failures, latency, and other issues force systems to make tough choices.
Application to System Design Interviews
In system design interviews, you will often be tasked with designing systems that need to address real-world challenges, including network partitions.
Interviewers will expect you to evaluate whether consistency or availability is more critical based on the system’s requirements. This trade-off is usually weighed while defining the non-functional requirements.
Consistency might be preferred in scenarios where data accuracy and synchronization are crucial, such as in financial transactions or inventory management systems.
Availability might be prioritised in scenarios where uninterrupted access to the system is essential, such as in web applications, user-facing services, or social media systems.
Real-World wh(Y): Case Studies and Applications
Google’s Spanner:
Spanner is a globally distributed database that focuses on consistency and partition tolerance, sometimes sacrificing availability to ensure that transactions are consistent and isolated.
Amazon DynamoDB:
DynamoDB, on the other hand, opts for availability and partition tolerance, providing eventual consistency to keep the system highly available across distributed environments.
Apache Cassandra:
Cassandra allows configuration of the consistency level, providing flexibility between strong and eventual consistency based on specific use cases, highlighting the practical application of the CAP Theorem.
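Cassandra's tunable consistency boils down to a simple quorum inequality: with replication factor N, reading from R replicas and writing to W replicas guarantees that every read set overlaps the latest write set whenever R + W > N. A minimal sketch of that check (the function name is ours, not a Cassandra API):

```python
def is_strongly_consistent(n, r, w):
    """Quorum rule: reads see the latest write when read and write sets overlap."""
    return r + w > n

# Typical settings for N = 3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  -- QUORUM reads + QUORUM writes
print(is_strongly_consistent(3, 1, 1))  # False -- ONE/ONE: fast but only eventual
print(is_strongly_consistent(3, 1, 3))  # True  -- write to ALL, read from ONE
```

Lowering R and W buys latency and availability at the price of consistency, which is exactly the CAP trade-off expressed as configuration.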
Breaking the Myths
Does the CAP Theorem apply to stateless web servers as well?
No. This is often misunderstood: the CAP Theorem applies only to data nodes (servers) that manage state.
Is Availability in the CAP Theorem the same as High Availability?
No, Availability in the CAP Theorem is not the same as High Availability (HA).
CAP Theorem Availability: This refers to the system's ability to remain responsive during network partitions between distributed data nodes. It ensures that the system can continue to provide responses to requests even if some nodes are unable to communicate with each other.
High Availability (HA): This refers to the system's ability to remain operational during server or node failures, ensuring minimal downtime and maintaining service availability.
In essence, CAP Theorem Availability is about maintaining functionality during network partitions, while High Availability focuses on the system's resilience to hardware or software failures.