CAP THEOREM – Decoding the Complexity: Unveiling the Intricacies of CAP Theorem in Distributed Systems

Consistency:

  1. Eventual Consistency.
  2. Strong Consistency.

Eventual Consistency: As the name suggests, eventual consistency means that changes to the value of a data item will eventually propagate to all replicas, but there is a lag, and during this lag, the replicas might return stale data. A scenario where changes in Database 1 take a minute to replicate to Databases 2 and 3 is an example of eventual consistency. 

Suppose you have a blog post counter. If you increment the counter in Database 1, Databases 2 and 3 might still show the old count until they sync up after that 1-minute lag. (RYW – Consistency) (Read your write consistency) RYW (Read-Your-Writes) consistency is achieved when the system guarantees that any attempt to read a record after it has been updated will return the updated value. RDBMS typically provides read-write consistency. When we read immediately, we get old value as there is delayed sync.

Strong Consistency: In strong consistency, all replicas agree on the value of a data item before any of them responds to a read or a write. If a write operation occurs, it’s not considered successful until the update has been received by all replicas. For example, consider a banking transaction. If you withdraw money from an ATM (Database 1), that new balance is immediately propagated to Databases 2 and 3 before the transaction is considered complete. This ensures that any subsequent transactions, perhaps from another ATM (representing Databases 2 or 3), will have the correct balance and you won’t be able to withdraw more money than you have. Even when we read immediately, we get new value as there is Immediate sync.

Functional Requirements vs Non-Functional Requirements:

Functional Requirements are the basic things a system must do. They describe the tasks or processes the system needs to perform. For example, an e-commerce site must be able to process payments and track orders.

Non-Functional Requirements are qualities a system must have. They describe characteristics or attributes of the system. For example, the e-commerce site must be secure (to protect user data), fast (for good user experience), Availability (system shouldn’t be down for very long) and scalable (to support growth in users and orders).

Availability

Availability in terms of information technology refers to the ability of a system or a service to be operational and accessible when users need it. It’s usually expressed as a percentage of the total system downtime over a predefined period.

Let’s illustrate it with an example:

Consider an e-commerce website like Amazon. Availability refers to the system being operational and accessible for users to browse products, add items to the cart, and make purchases. If Amazon’s website is down and users can’t access it to shop, then the website is experiencing downtime and its availability is affected.

In the world of distributed systems, we often aim for high availability. The term “Five Nines” (99.999%) availability is often mentioned as the gold standard, meaning the service is guaranteed to be operational 99.999% of the time, which translates to about 5.26 minutes of downtime per year.

SLA stands for Service Level Agreement. It’s a contract or agreement between a service provider and a customer that specifies, usually in measurable terms, what services the provider will furnish.

AvailabilityDowntime per year
90% (one nine)More than 36 days
95%About 18 days
98%About 7 days
99% (two nines)About 3.65 days
99.9% (three nines)About 8.76 hours
99.99% (four nines)About 52.6 minutes
99.999% (five nines)About 5.26 minutes
99.9999% (six nines)About 31.5 seconds
99.99999% (seven nines)About 3.15 seconds

To increase the availability of the system:

StrategyExplanationExample
ReplicationCreating duplicate instances of data or servicesKeeping multiple copies of a database, so if one crashes, others can handle requests
RedundancyHaving backup components that can take over if the primary one failsUsing multiple servers to host a website, so if one server goes down, others can continue serving
ScalingAdding more resources to a system to handle increased loadAdding more servers during peak traffic times to maintain system performance
Geographical Distribution (CDN)Distributing resources in different physical locationsUsing a Content Delivery Network (CDN) to serve web content to users from the closest server
Load-BalancingDistributing workload across multiple systems to prevent any single system from getting overwhelmedUsing a load balancer to distribute incoming network traffic across several servers
Failover MechanismsAutomatically switching to a redundant system upon the failure of a primary systemIf the primary server fails, an automatic failover process redirects traffic to backup servers
MonitoringKeeping track of system performance and operationUsing monitoring software to identify when system performance degrades, or a component fails
Cloud ServicesUsing cloud resources that can be scaled as neededUsing cloud-based storage that can be increased or decreased based on demand
Scheduled MaintenancesPerforming regular system maintenance during off-peak timesScheduling system updates and maintenance during times when user traffic is typically low
Testing & SimulationRegularly testing system performance and failover proceduresConducting stress tests to simulate high load conditions and ensure the system can handle it

CAP THEOREM

The CAP theorem is a fundamental principle that specifies that it’s impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:

Consistency (C): Every read from the system receives the latest write or an error.

Availability (A): Every request to the system receives a non-error response, without guarantee that it contains the most recent write.

Partition Tolerance (P): The system continues to operate despite an arbitrary number of network failures.

Let’s illustrate this with an example:

Think of a popular social media platform where users post updates (like Twitter). This platform uses a distributed system to store all the tweets. The system is designed in such a way that it spreads its data across many servers for better performance, scalability, and resilience.

Consistency: When a user posts a new tweet, the tweet becomes instantly available to everyone. When this happens, it means the system has a high level of consistency.

Availability: Every time a user tries to fetch a tweet, the system guarantees to return a tweet (although it might not be the most recent one). This is a high level of availability.

Partition Tolerance: If a network problem happens and servers can’t communicate with each other, the system continues to operate and serve tweets. It might show outdated tweets, but it’s still operational.

According to the CAP theorem, only two of these guarantees can be met at any given time. So, if the network fails (Partition), the system must choose between Consistency and Availability. It might stop showing new tweets until the network problem is resolved (Consistency over Availability), or it might show outdated tweets (Availability over Consistency). It can’t guarantee to show new tweets (Consistency) and never fail to deliver a tweet (Availability) at the same time when there is a network problem.

CA in a distributed system:

Correct, in a single-node system (a system that is not distributed), we can indeed have Consistency and Availability (CA) since the issue of network partitions doesn’t arise. Every read receives the latest write (Consistency), and every request receives a non-error response (Availability). There’s no need for Partition Tolerance since there are no network partitions within a single-node system.

However, once you move to a distributed system where data is spread across multiple nodes (computers, servers, regions), you need to handle the possibility of network partitions. Network partitions are inevitable in a distributed system due to various reasons such as network failures, hardware failures, etc. The CAP theorem stipulates that during a network partition, you can only have either Consistency or Availability.

That is why it’s said you can’t achieve CA in a distributed system. You have to choose between Consistency and Availability when a Partition happens. This choice will largely depend on the nature and requirements of your specific application. For example, a banking system might prefer Consistency over Availability, while a social media platform might prefer Availability over Consistency.

Stateful Systems vs Stateless systems:

 Stateful SystemsStateless Systems
DefinitionSystems that maintain or remember state of the interactions.Systems that don’t maintain any state information from previous interactions.
ExampleE-commerce website remembering items in your shopping cart.HTTP protocol treating each request independently.