Distributed Transactions

In a monolithic application with a single SQL database, you can rely on ACID transactions to ensure that a series of operations either all succeed or all fail together. This makes it easy to maintain data consistency.

But what happens in a microservices architecture? A single business operation (like placing an order) might require changes in several services, each with its own database.

The Order Service creates an order record.
The Payment Service processes the payment.
The Inventory Service decrements the stock level of the ordered items.

How do you ensure that all three of these operations either succeed together or fail together? What happens if the Inventory Service fails after the payment has already been processed? You can't use a traditional database transaction because the data is spread across multiple, independent databases.

This is the problem of distributed transactions.
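
To make the problem concrete, here is a minimal sketch in which three plain dictionaries stand in for the three services' private databases. The names (orders, payments, inventory, place_order) are hypothetical and exist only for this illustration. Each write "commits" independently, so a failure in the last step leaves the earlier writes in place.

```python
# Each dict stands in for one service's private database (illustrative only).
orders = {}
payments = {}
inventory = {"widget": 0}  # the item is out of stock


def place_order(order_id: str, item: str, amount: int) -> None:
    # Order Service: commits its own write.
    orders[order_id] = {"item": item, "status": "CREATED"}
    # Payment Service: commits its own write.
    payments[order_id] = amount
    # Inventory Service: fails after the two writes above are already durable.
    if inventory[item] < 1:
        raise RuntimeError(f"{item} is out of stock")
    inventory[item] -= 1


try:
    place_order("o-1", "widget", 25)
except RuntimeError as exc:
    print("order failed:", exc)

# Nothing rolls the earlier writes back: an order record and a charge exist,
# but no stock was taken.
print(orders, payments, inventory)
```

With a single database, one ROLLBACK would undo all three writes. Here there is no shared transaction to roll back.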

Two-Phase Commit (2PC)

The classic solution to distributed transactions is the Two-Phase Commit (2PC) protocol.

How it works:

A central Coordinator manages the transaction.
Phase 1: The Prepare Phase
- The Coordinator sends a "prepare" message to all the participating services.
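- Each service gets ready to commit its part of the transaction (e.g., by writing the pending changes to a durable log and locking the affected rows) and then votes "yes" or "no" back to the Coordinator. A "yes" vote is a promise that the service can still commit later, even if it crashes in between.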
Phase 2: The Commit Phase
- If all services vote "yes": The Coordinator sends a "commit" message to all services. Each service then makes its changes permanent.
- If any service votes "no" (or fails to respond): The Coordinator sends an "abort" message to all services. Each service then rolls back its changes.
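
The two phases above can be sketched in a few lines. This is a simplified in-memory model, not a production protocol: the Participant class, its can_commit flag, and the two_phase_commit function are assumptions made for illustration, and there is no durable logging, locking, timeout handling, or network.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """One service taking part in the transaction (in-memory stand-in)."""
    name: str
    can_commit: bool = True   # hypothetical switch to simulate a "no" vote
    state: str = "INIT"

    def prepare(self) -> bool:
        # Phase 1: stage the change and vote. A real participant would also
        # persist the staged change and hold locks until phase 2.
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self) -> None:
        self.state = "COMMITTED"

    def abort(self) -> None:
        self.state = "ABORTED"


def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: ask every participant to prepare and collect the votes.
    # An exception is treated like a missing response, i.e. a "no" vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare())
        except Exception:
            votes.append(False)

    # Phase 2: commit only if the vote was unanimous, otherwise abort all.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


services = [
    Participant("order"),
    Participant("payment"),
    Participant("inventory", can_commit=False),  # votes "no"
]
print(two_phase_commit(services))            # False
print([p.state for p in services])           # ['ABORTED', 'ABORTED', 'ABORTED']
```

Note that a single "no" vote aborts every participant, including the ones that voted "yes".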

Why 2PC is Rarely Used in Modern Systems: 2PC provides strong consistency, and it is still used inside some databases and transaction managers. For coordinating independent services in high-availability web applications, however, it has several major drawbacks:

It's a blocking protocol: During the prepare phase, each service has to lock the resources it needs to modify. It must hold these locks until it receives the final commit or abort message from the Coordinator. On a high-latency network this can take a long time, and while the locks are held, other transactions that touch the same data must wait, which reduces both throughput and availability.
The Coordinator is a single point of failure: If the Coordinator crashes after the prepare phase but before the commit phase, the participating services are left in a blocked state, holding their locks, until the Coordinator recovers and announces a decision.
It doesn't scale well: Every participant must vote and hold its locks for the full round trip, so the transaction is only as fast as its slowest participant, and lock contention grows as more services and more concurrent transactions are involved.

The Saga Pattern

Because of the problems with 2PC, most modern microservices architectures embrace eventual consistency and use a pattern called Saga to manage long-running transactions.

A Saga is a sequence of local transactions. Each local transaction updates the database in a single service and then publishes an event or message that triggers the next local transaction in the saga.

If a local transaction fails, the saga executes a series of compensating transactions that undo the changes made by the preceding local transactions.
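
The core mechanic, pairing every step with an action that undoes it, can be sketched as follows. To isolate the compensation logic, this sketch uses a single in-process runner instead of events; run_saga, the Step tuple, and the step functions are hypothetical names chosen for this illustration.

```python
from typing import Callable

# (step name, local transaction, compensating transaction)
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step], log: list[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception as exc:
            log.append(f"{name} failed: {exc}")
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated {done_name}")
            return False
        completed.append(step)
        log.append(f"{name} done")
    return True


# In-memory stand-in for the state owned by the services.
state = {"order": None, "charged": False}

def create_order() -> None:
    state["order"] = "PENDING"

def fail_order() -> None:
    state["order"] = "FAILED"

def charge() -> None:
    state["charged"] = True

def refund() -> None:
    state["charged"] = False

def reserve_stock() -> None:
    raise RuntimeError("out of stock")


log: list[str] = []
ok = run_saga(
    [
        ("create_order", create_order, fail_order),
        ("charge", charge, refund),
        ("reserve_stock", reserve_stock, lambda: None),
    ],
    log,
)
print(ok)     # False
print(state)  # {'order': 'FAILED', 'charged': False}
print(log)
```

The failed step itself is not compensated, because its local transaction never committed. Only the steps that completed before it are undone, last first.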

Example: The Order Placement Saga

Let's look at our e-commerce example again.

The "Happy Path" (Successful transaction):

1. The client sends a Create Order request to the Order Service.
2. The Order Service starts a local transaction, creates the order with a PENDING status, and saves it. It then publishes an OrderCreated event.
3. The Payment Service consumes the OrderCreated event, starts a local transaction, and processes the payment. On success, it publishes a PaymentSucceeded event.
4. The Inventory Service consumes the PaymentSucceeded event, starts a local transaction, and decrements the stock. On success, it publishes an InventoryUpdated event.
5. The Order Service consumes the InventoryUpdated event and updates the order's status from PENDING to CONFIRMED. The saga is complete.
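
The happy path can be traced with a tiny synchronous event bus standing in for a real message broker. Everything here (the EventBus class, the handler names, the in-memory dictionaries) is an assumption made for illustration; only the event names and statuses come from the steps above.

```python
from collections import defaultdict, deque


class EventBus:
    """Minimal synchronous stand-in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()
        self.history = []

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, order_id):
        self.queue.append((event_type, order_id))

    def run(self):
        # Deliver events in publication order until none are left.
        while self.queue:
            event_type, order_id = self.queue.popleft()
            self.history.append(event_type)
            for handler in self.handlers[event_type]:
                handler(order_id)


bus = EventBus()
orders, payments, stock = {}, {}, {"widget": 3}


def create_order(order_id):            # Order Service
    orders[order_id] = "PENDING"
    bus.publish("OrderCreated", order_id)


def on_order_created(order_id):        # Payment Service
    payments[order_id] = "PAID"
    bus.publish("PaymentSucceeded", order_id)


def on_payment_succeeded(order_id):    # Inventory Service
    stock["widget"] -= 1
    bus.publish("InventoryUpdated", order_id)


def on_inventory_updated(order_id):    # Order Service
    orders[order_id] = "CONFIRMED"


bus.subscribe("OrderCreated", on_order_created)
bus.subscribe("PaymentSucceeded", on_payment_succeeded)
bus.subscribe("InventoryUpdated", on_inventory_updated)

create_order("o-1")
bus.run()
print(bus.history)  # ['OrderCreated', 'PaymentSucceeded', 'InventoryUpdated']
print(orders, payments, stock)
```

No service calls another directly. Each one only reacts to an event and publishes the next one, which is what keeps the services decoupled.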

The "Failure Path" (with compensating transactions):

Let's say the Inventory Service fails.

Steps 1-3 are the same. The payment succeeds.
The Inventory Service consumes the PaymentSucceeded event but finds that the item is out of stock. It publishes an InventoryUpdateFailed event.
Now, the compensating transactions run in reverse order:
- The Payment Service consumes the InventoryUpdateFailed event and executes a compensating transaction: it refunds the payment to the user. It then publishes a PaymentRefunded event.
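- The Order Service consumes the PaymentRefunded event and executes a compensating transaction: it updates the order's status from PENDING to FAILED. (It could react to the InventoryUpdateFailed event directly, but waiting for PaymentRefunded keeps the compensations in strict reverse order and avoids marking the order as failed before the refund has gone through.)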

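The end result is that the system is back in a consistent state. The order is marked as failed, and the user's payment has been refunded. Note that the charge was not prevented, only reversed: between the payment and the refund, the user could see the charge.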
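
The failure path can be traced the same way. This sketch uses a plain queue and a handler table as a stand-in for the broker, and it assumes the Order Service reacts to PaymentRefunded so that compensation runs in reverse order. The function names and in-memory dictionaries are hypothetical; the event names and statuses are the ones used in the example.

```python
from collections import deque

orders, payments = {}, {}
stock = {"widget": 0}      # the item is out of stock
queue = deque()            # stands in for the message broker
history = []


def create_order(order_id):             # Order Service
    orders[order_id] = "PENDING"
    queue.append(("OrderCreated", order_id))


def charge(order_id):                   # Payment Service
    payments[order_id] = "PAID"
    queue.append(("PaymentSucceeded", order_id))


def reserve_stock(order_id):            # Inventory Service
    if stock["widget"] < 1:
        queue.append(("InventoryUpdateFailed", order_id))
        return
    stock["widget"] -= 1
    queue.append(("InventoryUpdated", order_id))


def confirm(order_id):                  # Order Service
    orders[order_id] = "CONFIRMED"


def refund(order_id):                   # compensates charge
    payments[order_id] = "REFUNDED"
    queue.append(("PaymentRefunded", order_id))


def mark_failed(order_id):              # compensates create_order
    orders[order_id] = "FAILED"


HANDLERS = {
    "OrderCreated": charge,
    "PaymentSucceeded": reserve_stock,
    "InventoryUpdated": confirm,
    "InventoryUpdateFailed": refund,
    "PaymentRefunded": mark_failed,
}


def drain():
    # Deliver events in order until the saga has nothing left to do.
    while queue:
        event, order_id = queue.popleft()
        history.append(event)
        HANDLERS[event](order_id)


create_order("o-1")
drain()
print(history)
# ['OrderCreated', 'PaymentSucceeded', 'InventoryUpdateFailed', 'PaymentRefunded']
print(orders, payments, stock)
# {'o-1': 'FAILED'} {'o-1': 'REFUNDED'} {'widget': 0}
```

The payment record ends up as REFUNDED rather than disappearing. A compensating transaction is a new forward action that semantically undoes an earlier one; it does not erase history the way a database rollback does.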

Pros of the Saga Pattern:

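High Availability and Scalability: In the event-driven form described here, there are no direct, synchronous calls between services and no long-held locks. A slow or temporarily unavailable service delays the saga but does not block the others, which makes the system highly available and scalable.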
Loose Coupling: Services are decoupled and only communicate via events.

Cons of the Saga Pattern:

Complex Programming Model: The logic is more complex to design and debug than a simple ACID transaction. You have to explicitly design and implement the compensating transactions for every step of the saga.
Eventual Consistency: The data is only consistent at the end of the saga. During the saga, the system is in a temporarily inconsistent state (e.g., the payment has been made, but the inventory has not yet been updated). This requires the application to be able to handle this temporary inconsistency.

In a system design interview, if you are faced with a problem that requires transactional consistency across multiple microservices, the Saga pattern is usually the right proposal. Explaining how it works with events and compensating transactions, and where it falls short of ACID, shows a deep understanding of modern distributed systems design and its trade-offs.