Why 'Unit of Work' Fails in a Distributed System


The Unit of Work (UOW) pattern feels almost invisible when it’s working well. You load a set of entities, make changes, and commit once. If something goes wrong, everything rolls back. For a single application talking to a single database, this model is intuitive, reliable, and hard to argue with.

The trouble starts when that same mental model is carried into a distributed system.

As systems grow, you split responsibilities into services, introduce messaging, and isolate data stores. At that point, the guarantees that made UOW safe begin to dissolve. Rather than abandoning the pattern, many people try to stretch it across service boundaries, coordinating commits or wrapping multiple operations in a higher-level “distributed” transaction.

This is where the abstraction breaks down.

Martin Fowler puts it bluntly when discussing microservices and transactions:

“You can’t maintain ACID transactions across microservices.”

That single sentence quietly invalidates the idea of a distributed UOW. UOW is an ACID-coordinating pattern. Its entire purpose is to provide atomic commit and rollback across a set of changes. If ACID cannot span services, then neither can UOW. This is not a tooling limitation or a framework gap. It is a consequence of running code across processes, machines, and networks that fail independently. Once a UOW crosses a distributed boundary, the assumptions it relies on no longer hold, even if the code still looks correct.

From that point on, the question is no longer “How do we implement distributed Unit of Work safely?”
The real question becomes “What patterns replace it when atomicity is no longer possible?”

The Illusion of Distributed Atomicity

Distributed UOW usually shows up disguised as coordination. One service acts as a conductor, asking each participant whether it is ready to commit. This is the foundation of two-phase commit, and it looks convincing on paper. In practice, it creates a system that is tightly coupled, slow under load, and fragile under failure. The coordinator can crash after some participants have committed and others have not. A participant can commit successfully but fail before acknowledging. A retry can arrive after a partial commit and cause duplicate work. None of these scenarios are edge cases. They are normal behaviour in a real distributed environment.

The deeper issue is that the network itself is unreliable. Messages can be delayed, reordered, duplicated, or lost. UOW assumes a level of synchrony and trust that the network cannot provide.

When services are forced to coordinate a single commit, they must all be alive and responsive at the same time. A slow or unhealthy service does not just fail its own work; it blocks everyone else. Locks are held longer. Throughput drops. Timeouts increase. Recovery becomes manual. In theory, the system is “more correct”. In reality, it is down more often.

This is why many distributed systems that start with coordinated transactions eventually remove them under production pressure. The cost shows up in incidents, not in code reviews.

The Shift in Thinking

The failure of distributed UOW forces a change in mindset. Instead of asking, “How do I make everything commit together?”, the question becomes, “How do I make each step reliable on its own?” This is the pivot from atomicity to durability. Modern distributed systems accept that work happens in stages. Each stage commits locally. Communication between stages is durable. Failures are expected, retried, and compensated for rather than magically rolled back. Once you accept this, the replacement patterns start to make sense.

The first rule is simple: transactions stop at the service boundary.

Inside a service, you still use UOW. Entity Framework’s DbContext remains a perfectly valid abstraction. You load data, apply changes, and commit once. That part does not change. What changes is the expectation that this transaction somehow covers the rest of the system. It doesn’t, and it never will.
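
As a rough sketch, here is what that local unit of work looks like with EF Core. The OrdersDbContext, Order entity, and field names below are illustrative, not a prescribed model; the point is the single SaveChangesAsync call that persists everything tracked by the context, or nothing at all.

```csharp
using Microsoft.EntityFrameworkCore;

public async Task ConfirmOrderAsync(Guid orderId, CancellationToken ct)
{
    // In a real service the DbContext would normally come from dependency injection.
    await using var db = new OrdersDbContext();

    // Load the entities that belong to this unit of work.
    var order = await db.Orders
        .Include(o => o.Lines)
        .SingleAsync(o => o.Id == orderId, ct);

    // Apply changes in memory; nothing has been written yet.
    order.Status = OrderStatus.Confirmed;
    order.ConfirmedAtUtc = DateTime.UtcNow;

    // One commit for everything tracked by this context.
    // If SaveChangesAsync throws, no partial state is persisted.
    await db.SaveChangesAsync(ct);
}
```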

Anything that crosses the boundary is treated as asynchronous and unreliable by default.

Outbox as the New Commit Boundary

The transactional outbox pattern exists precisely to bridge the gap between local atomicity and distributed communication.

Instead of updating the database and publishing a message as two separate actions, both are recorded inside the same local transaction. The database change represents what happened. The outbox record represents what needs to be communicated. Once the transaction commits, the system is in a consistent state. Even if the process crashes immediately afterwards, the intent to publish the message is safely stored.

Later, a background process reads the outbox and delivers messages until they succeed.

This approach does not pretend that messaging is atomic with database writes. It acknowledges that messaging is eventually reliable and builds around that reality.
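A minimal sketch of the idea, assuming EF Core and a hypothetical OutboxMessages table; the schema and type names are illustrative:

```csharp
using System.Text.Json;
using Microsoft.EntityFrameworkCore;

public async Task ConfirmOrderAsync(Guid orderId, CancellationToken ct)
{
    await using var db = new OrdersDbContext();

    var order = await db.Orders.SingleAsync(o => o.Id == orderId, ct);
    order.Status = OrderStatus.Confirmed;

    // The intent to publish lives in the same database as the business data.
    db.OutboxMessages.Add(new OutboxMessage
    {
        Id = Guid.NewGuid(),
        Type = "OrderConfirmed",
        Payload = JsonSerializer.Serialize(new { OrderId = order.Id }),
        OccurredAtUtc = DateTime.UtcNow
    });

    // One local commit covers both the state change and the outbox record.
    await db.SaveChangesAsync(ct);
}
```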

Idempotency Replaces Exactly-Once

Once you introduce retries, you introduce duplicates. In a distributed system, messages can be delivered more than once. Any design that assumes otherwise will eventually corrupt data.

The replacement for “exactly once” delivery is idempotent handling. Every message is treated as something that may already have been processed. The handler checks, records, and moves on. This makes retries safe. It allows consumers to crash and restart. It allows operators to replay messages from a dead-letter queue without fear. Most importantly, it removes the psychological need for a distributed UOW. You no longer rely on perfect coordination to avoid duplicates, because duplicates are harmless by design.
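
One common implementation is to record handled message ids alongside the handler’s own writes. The sketch below assumes a hypothetical ProcessedMessages table and a BillingDbContext; the names are illustrative:

```csharp
public async Task HandleAsync(OrderConfirmed message, CancellationToken ct)
{
    await using var db = new BillingDbContext();

    // If this message id has been seen before, the work is already done.
    var alreadyProcessed = await db.ProcessedMessages
        .AnyAsync(m => m.MessageId == message.MessageId, ct);
    if (alreadyProcessed)
        return;

    // The real work...
    db.Invoices.Add(new Invoice
    {
        OrderId = message.OrderId,
        CreatedAtUtc = DateTime.UtcNow
    });

    // ...and the record that this message was handled, committed together,
    // so a redelivery of the same message becomes a harmless no-op.
    db.ProcessedMessages.Add(new ProcessedMessage { MessageId = message.MessageId });

    await db.SaveChangesAsync(ct);
}
```

A unique index on MessageId closes the remaining gap where two copies of the same message arrive at once: one insert wins, the other fails and is retried as a harmless duplicate.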

Sagas Replace Distributed Rollback

UOW relies on rollback as its safety net. If something fails, everything is undone. In distributed systems, rollback is replaced by compensation. A saga is a sequence of local transactions, each with a defined compensating action. Instead of pretending the work never happened, the system acknowledges that it did happen and applies an explicit reversal. Charging a card can be compensated with a refund. Reserving inventory can be compensated by releasing it. Creating a shipment can be compensated by cancelling it.

This is more work than rollback, but it is also more honest. The system reflects reality rather than hiding it behind abstractions.
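
A saga can be sketched as a list of steps, each paired with its compensating action. The payments, inventory, and shipping clients below are hypothetical; the shape is what matters:

```csharp
// Each step is a local transaction in the service that owns it,
// paired with an explicit compensation.
var steps = new List<(string Name, Func<Task> Act, Func<Task> Compensate)>
{
    ("charge-card",       () => payments.ChargeAsync(order),   () => payments.RefundAsync(order)),
    ("reserve-inventory", () => inventory.ReserveAsync(order),  () => inventory.ReleaseAsync(order)),
    ("create-shipment",   () => shipping.CreateAsync(order),    () => shipping.CancelAsync(order)),
};

var completed = new Stack<Func<Task>>();
try
{
    foreach (var (name, act, compensate) in steps)
    {
        await act();                 // commits locally in the owning service
        completed.Push(compensate);
    }
}
catch
{
    // There is no distributed rollback. Undo the steps that did happen,
    // in reverse order, with explicit compensating actions.
    while (completed.Count > 0)
        await completed.Pop()();
    throw;
}
```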

In simple cases, services can react to events without central coordination. In more complex workflows, orchestration becomes valuable. A process manager or workflow engine tracks which step has completed, which is pending, and which compensations are required. State transitions are persisted. Timeouts are handled deliberately. Retries are visible. This structure replaces the false simplicity of distributed UOW with explicit control. You can see where a process is stuck. You can reason about partial completion. You can resume or compensate without guesswork.
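
In practice the persisted state can be as simple as a row per process instance. The shape below is a hypothetical example of what an orchestrator might store, not any specific framework’s model:

```csharp
// One row per running order process. Because progress is persisted after every
// transition, a crash leaves an exact record of where the workflow stopped.
public class OrderProcessState
{
    public Guid OrderId { get; set; }
    public string CurrentStep { get; set; } = "";   // e.g. "card-charged", "inventory-reserved"
    public int RetryCount { get; set; }
    public bool CompensationRequired { get; set; }
    public DateTime UpdatedAtUtc { get; set; }
    public DateTime? TimeoutAtUtc { get; set; }     // when to give up and compensate
}
```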

That visibility is impossible when everything is hidden behind a single commit call.

Why the Pattern Is So Tempting

Distributed UOW is tempting because it feels familiar. It allows developers to pretend they are still working in a monolith, just stretched across the network. But the complexity does not disappear. It accumulates in places that are harder to debug: locks, timeouts, partial commits, and recovery scripts run at three in the morning!

The patterns that replace it feel more complex at first because they surface reality instead of hiding it. Over time, they reduce surprises, incidents, and data corruption.

None of this means UOW is obsolete.

It remains a great pattern inside a service boundary. It remains the right abstraction for coordinating changes within a single database. The mistake is letting it leak beyond that boundary.

Once the boundary is crossed, the rules change.

The failure of UOW in distributed systems is not a tooling issue or a framework limitation. It is a mismatch between assumptions and reality. Distributed systems do not offer global atomicity. They offer unreliable communication, partial failure, and eventual delivery. Designs that embrace those properties succeed. Designs that fight them collapse under load.

If you stop trying to make a distributed system behave like a local one, the architecture becomes clearer, the failure modes become manageable, and the system becomes something you can actually operate in production.

You also don’t get as many call-outs at 3:00 am when all the overnight processes collapse!