Patterns for Resilience and Integration at Scale

Modern distributed systems rarely fail because the core business logic is too hard. They fail because the edges are messy. One service is slow, another is flaky, a third is legacy, a fourth is owned by another team, and a fifth needs a human to click an approval link before anything can continue. The logic inside your own codebase might be clean and deterministic, but the moment a workflow starts crossing boundaries, certainty disappears. That is where resilience stops being a nice architectural word and starts becoming the difference between a system that degrades gracefully and one that creates operational chaos.

This is the part many developers underestimate. They can design a clean domain model, expose a tidy API, and even get the happy path flowing nicely in development. Then integration begins. Payments have retry semantics you do not control. Fraud services throttle under burst load. ERP platforms respond eventually, but only after translating your request into formats that feel like they were invented in another decade. Humans approve things late, suppliers call back twice, webhooks arrive out of order, and support teams need answers while the workflow is still in flight. None of those problems are unusual. They are the normal operating environment of serious enterprise systems.

That reality creates what is best described as an integration tax. Every dependency adds latency, risk, state mismatch, and behavioural quirks. Every new handoff expands the number of ways a process can stall or become inconsistent. This tax cannot be avoided. If your system has to interact with payment providers, CRM tools, ERP platforms, shipping carriers, external risk engines, old databases, partner APIs, or human approvers, then complexity is already part of the deal. The real question is whether that complexity is handled intentionally or left to leak through the architecture.

The good news is that the same failure shapes show up again and again. Systems struggle with overload, duplicate work, half-completed transactions, tight coupling, invisible state, and awkward coexistence between new platforms and old ones. Once you see those patterns clearly, the architecture becomes much easier to reason about. Durable orchestration platforms such as Azure Durable Functions are especially useful here because they provide a strong set of building blocks for stateful workflows, retries, timers, external events, and long-running coordination. But the bigger lesson is not tied to one platform. The patterns in this article apply whether you are orchestrating with Durable Functions, Temporal, Step Functions, Camunda, MassTransit sagas, or even a carefully designed internal workflow engine.

This article takes those recurring resilience and integration problems and turns them into a practical operating model. We will look at circuit breakers, idempotency, compensation, event-driven handoffs, workflow status, hybrid architecture, and resilience-first design. The goal is not to repeat a chapter from a book. The goal is to turn those ideas into a standalone guide for engineers and architects building systems that need to survive the real world.

Why Integration Gets Harder at Scale

A single integration in a low-volume system is often manageable with little more than an HTTP client, a timeout, and a retry policy. That is why many systems look fine during the first release. The real trouble appears later, once transaction volume grows, external dependencies multiply, and the business starts relying on workflows that stretch across multiple bounded contexts.

At that point, latency stops being an isolated technical concern and starts shaping business outcomes. A fraud service that takes three seconds instead of two might not sound catastrophic, but if that call sits in the middle of a checkout flow, the extra second now becomes customer friction. Multiply that by retries, duplicate callbacks, rate limiting, and a few downstream dependencies, and what looked like a simple workflow becomes a slow-motion queueing problem. Enterprise systems rarely collapse in one dramatic moment. More often, they drown gradually in coordination overhead.

Another issue is failure diversity. Internal services often fail in relatively predictable ways because the same teams own the deployment model, monitoring stack, and operational practices. External systems are different. One dependency might fail fast with clear error codes. Another might hang without responding. Another might accept the request but finish it later. Another might partially succeed and provide no clean rollback. Legacy platforms are especially problematic because they often expose interfaces that were never designed for modern reliability expectations, yet still sit on the critical path of important business processes.

Human interaction adds another layer of uncertainty. Approvals, escalations, document review, manual intervention, and exception handling all introduce variable time windows that cannot be compressed by throwing more CPU at the problem. A workflow might be technically healthy but still paused for six hours waiting on someone in a different department. If the system does not model that state explicitly, operators end up guessing whether it is broken or simply waiting.

This is why mature integration architecture is less about making every dependency perfect and more about building a workflow that can absorb imperfect behaviour. You are not designing for a world where all systems are reliable. You are designing for a world where some systems are slow, some are inconsistent, some are overloaded, and some are still useful enough that the business cannot function without them.

The Core Idea: Resilience Is a Workflow Concern

Developers think about resilience at the level of individual service calls. They add retries to HTTP clients, configure exponential backoff, maybe wrap a few dependencies in a circuit breaker, and consider the job largely done. That helps, but it is not enough. In distributed systems, resilience is rarely just a call-level concern. It is a workflow concern.

A payment retry is not just a payment retry. It is part of a broader transaction that may also reserve inventory, create an order record, notify a customer, update a loyalty profile, and send data into finance systems. A supplier callback is not just an inbound event. It affects which timer should be cancelled, what status should be shown to support, and whether the workflow can proceed to the next stage. A human approval is not just a pause in processing. It changes how you monitor state, set expectations, and decide when intervention is needed.

This is where orchestration platforms earn their keep. They provide a durable memory of the workflow so that retries, waiting, state transitions, and external signals are modelled as first-class behaviour instead of being spread across controller methods, background jobs, and database flags. That durable state is not just useful for implementation. It also creates a place where resilience patterns can be applied consistently.

The rest of this article focuses on those patterns.

Pattern 1: Circuit Breakers Prevent a Bad Dependency from Taking the Workflow Down with It

One of the most common mistakes in integration-heavy systems is treating every failure as a reason to retry harder. That instinct is understandable. Retries solve a lot of transient faults, especially network blips, brief throttling, and short-lived platform issues. The problem is that retries are not free. When a downstream service is genuinely unhealthy, repeated retries can amplify the damage by increasing traffic against a struggling dependency and consuming resources in your own system while little useful work gets done.

That is why circuit breakers matter. A circuit breaker watches failure behaviour over time. If failures cross a threshold, the breaker opens and temporarily blocks new requests to the dependency. Rather than continuing to hammer a service that is already in trouble, the workflow fails fast or routes into a fallback path. After a cooldown period, the breaker can move into a half-open state and allow limited traffic to test whether the downstream system has recovered.

In a long-running workflow, this pattern is especially valuable because it prevents an unhealthy external service from dragging large volumes of orchestration instances into pointless retry loops. Imagine an order pipeline that calls an external fraud scoring API before taking payment. If that provider is returning 500 errors for ten minutes, the wrong response is to let every new order attempt the call repeatedly until the orchestration backlog expands and customer-facing latency spikes. A better response is to trip the breaker, fail new attempts quickly with a clear status, and alert operators that the fraud dependency is down.

A simplified view is a small state machine. The breaker is closed while the dependency looks healthy, opens after repeated failures, and moves to half-open while it tests whether recovery has happened.
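
A minimal sketch of that state machine in plain C#, with the names, threshold, and cooldown purely illustrative:

public enum BreakerState { Closed, Open, HalfOpen }

public class SimpleCircuitBreaker
{
    private const int FailureThreshold = 5;
    private static readonly TimeSpan Cooldown = TimeSpan.FromMinutes(2);

    private BreakerState _state = BreakerState.Closed;
    private int _failures;
    private DateTime _openedAtUtc;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> call)
    {
        if (_state == BreakerState.Open)
        {
            // While the cooldown is running, fail fast instead of hitting the dependency.
            if (DateTime.UtcNow - _openedAtUtc < Cooldown)
                throw new InvalidOperationException("Breaker is open; dependency unavailable.");

            // Cooldown has elapsed, so allow a probe call.
            _state = BreakerState.HalfOpen;
        }

        try
        {
            var result = await call();

            // Any success closes the breaker and resets the failure count.
            _failures = 0;
            _state = BreakerState.Closed;
            return result;
        }
        catch
        {
            _failures++;

            // A failed probe, or too many consecutive failures, opens the breaker.
            if (_state == BreakerState.HalfOpen || _failures >= FailureThreshold)
            {
                _state = BreakerState.Open;
                _openedAtUtc = DateTime.UtcNow;
            }

            throw;
        }
    }
}

This sketch deliberately ignores thread safety; in production you would typically reach for a hardened library such as Polly, or for the durable variant described next.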

In Durable Functions, one practical implementation is to use a Durable Entity to hold breaker state for a dependency. The entity can track consecutive failures, the time the breaker opened, and whether a call is allowed. Each orchestration or activity checks the entity before making the dependency call. That gives you a central, durable place to enforce the breaker rather than leaving each workflow instance to make its own isolated decision.

A stripped-back example might look like this in C#:

public record CircuitBreakerState(
    int ConsecutiveFailures,
    DateTime? OpenedAtUtc,
    bool IsOpen);

public class FraudServiceBreakerEntity
{
    public CircuitBreakerState State { get; set; } = new(0, null, false);

    // A closed breaker always allows calls. An open breaker only allows a
    // probe once the cooldown has elapsed.
    public bool CanExecute(DateTime nowUtc)
    {
        if (!State.IsOpen)
            return true;

        var cooldown = TimeSpan.FromMinutes(2);

        return State.OpenedAtUtc is { } openedAt && nowUtc - openedAt >= cooldown;
    }

    // A successful call closes the breaker and resets the failure count.
    public void RecordSuccess()
    {
        State = new CircuitBreakerState(0, null, false);
    }

    // Failures accumulate; once the threshold is reached the breaker opens
    // and remembers when it opened so the cooldown can be measured.
    public void RecordFailure(DateTime nowUtc)
    {
        var failures = State.ConsecutiveFailures + 1;
        if (failures >= 5)
        {
            State = new CircuitBreakerState(failures, nowUtc, true);
            return;
        }

        State = new CircuitBreakerState(failures, State.OpenedAtUtc, false);
    }
}

The important part is not the code. It is the operational behaviour. Once the breaker opens, you stop turning a bad dependency into a system-wide slowdown. You make the failure explicit, measurable, and bounded.

That said, circuit breakers are not magic. They must be tuned carefully. Thresholds that are too aggressive can block useful traffic. Cooldowns that are too long can delay recovery. Breakers also need observability. If the team cannot see when they open, why they opened, and how often they are being exercised, they become another hidden state machine nobody trusts during an incident.

Pattern 2: Idempotency Turns Retries from a Risk into a Safety Net

If you work on distributed systems long enough, you stop asking whether duplicate requests will happen and start asking where they will happen first. Retries from clients, retries from orchestrators, webhook replays, queue redelivery, supplier callbacks, double clicks from users, and timeouts that hide already-completed work all create duplicate execution paths. If your system is not designed for that, it will eventually perform the same side effect twice.

That is where idempotency becomes non-negotiable. An idempotent operation can be executed multiple times with the same logical input and still produce the same final outcome. This does not mean every call is naturally idempotent. It means the system is built so that repeated attempts are recognised and handled safely.

Payment flows are the classic example. If a payment service receives the same charge request twice because the first response timed out, the customer must not be charged twice. The standard approach is to send an idempotency key, often the order ID or payment request ID, with the outbound call. The payment provider stores the first result for that key and returns the same outcome for later retries instead of executing a second charge.

But idempotency belongs far beyond payments. ERP submission endpoints should reject duplicate order registration for the same business reference. Customer reward updates should not apply points twice. Shipping requests should not create duplicate consignments. Inventory allocation should not reserve the same units repeatedly because a callback was delivered more than once.

The shape of the idea is simple: the caller supplies a stable key for the business operation, and the callee stores the first outcome under that key and replays it for any retry.
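
From the caller's side, that can be as small as the following sketch. The header name and endpoint are illustrative conventions rather than a specific provider's API:

var request = new HttpRequestMessage(HttpMethod.Post, "https://payments.example.com/charges")
{
    Content = JsonContent.Create(new { orderId = order.OrderId, amount = order.Total })
};

// The key is derived from the business operation, not the HTTP attempt,
// so every retry of the same charge carries the same key.
request.Headers.Add("Idempotency-Key", $"charge:{order.OrderId}");

var response = await httpClient.SendAsync(request);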

In your own services, the idempotency mechanism often comes down to a durable write model. You persist a unique business operation key before or alongside the side effect, and later requests with the same key return the stored outcome. Sometimes that means a dedicated idempotency table. Sometimes it means a natural domain guard such as a unique constraint on an external reference. Sometimes it means tracking processed event IDs in an entity or aggregate.

A simple service-side pattern in C# could look like this:

public sealed class ProcessedRequest
{
    public string RequestId { get; init; } = default!;
    public string ResultJson { get; init; } = default!;
    public DateTime ProcessedAtUtc { get; init; }
}

// db is an EF Core DbContext and gateway is the payment client, both
// injected into the surrounding service.
public async Task<PaymentResult> ChargeAsync(
    string requestId, decimal amount, CancellationToken stopToken)
{
    // If this request ID has already been processed, replay the stored
    // result instead of charging again.
    var existing = await db.ProcessedRequests
        .SingleOrDefaultAsync(x => x.RequestId == requestId, stopToken);

    if (existing is not null)
        return JsonSerializer.Deserialize<PaymentResult>(existing.ResultJson)!;

    var result = await gateway.ChargeAsync(amount, stopToken);

    // Persist the outcome under the request ID so later retries return it.
    db.ProcessedRequests.Add(new ProcessedRequest
    {
        RequestId = requestId,
        ResultJson = JsonSerializer.Serialize(result),
        ProcessedAtUtc = DateTime.UtcNow
    });

    await db.SaveChangesAsync(stopToken);
    return result;
}

The hard part is deciding the correct scope of the idempotency key. If it is too broad, distinct operations can accidentally collapse into one. If it is too narrow, duplicates slip through. Good idempotency design requires a clear understanding of the business operation, not just the transport request.
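A hedged illustration of that trade-off, reusing the charge operation from the example above; the keys are hypothetical:

// Reasonable: one key per business operation, stable across retries.
var operationKey = $"charge:{order.OrderId}";

// Too narrow: a fresh key per attempt means duplicates are never detected.
var tooNarrow = $"charge:{Guid.NewGuid()}";

// Too broad: one key per customer collapses distinct orders into a single operation.
var tooBroad = $"charge:{order.CustomerId}";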

It is also worth being blunt about this: retries without idempotency are reckless. They create the appearance of resilience while quietly shifting the cost onto customers, finance teams, and support operations. Once you understand that, idempotency stops feeling like a technical detail and starts feeling like table stakes.

Pattern 3: Compensation Is How You Survive Without Distributed Transactions

Enterprise workflows almost always cross boundaries where a single atomic transaction is impossible. You might charge a card in one system, reserve inventory in another, create a shipment in a third, and register the order in an ERP platform that still thinks SOAP is modern. No transaction coordinator is going to make all of that commit or roll back as one neat unit. Even if it could, you probably would not want the coupling and latency that came with it.

So what happens when part of the workflow succeeds and a later step fails? That is where compensation comes in. Compensation is the deliberate reversal of already-completed actions so that the broader workflow returns to a consistent business state.

Suppose a checkout flow successfully charges the customer, then later fails to allocate stock. Without compensation, the system has taken money for an order it cannot fulfil. That is not a mere technical defect. It is a business failure. The workflow needs a compensating action, such as issuing a refund, releasing provisional customer benefits, and notifying operations if manual review is required.

The same applies in other domains. If a claims workflow opens a financial reserve and later discovers a validation failure, the reserve may need reversing. If an onboarding workflow provisions downstream access and then fails a compliance check, those accounts may need disabling. If a shipping request is accepted and the ERP later rejects the order, logistics and customer communication may both need corrective action.

A compensation flow usually mirrors the forward flow: each completed step is recorded, and when a later step fails, the workflow walks back through those records in reverse and reverses their effects.

Compensation is frequently misunderstood as just calling an undo API. Sometimes that is possible, but often it is not. Real compensations can be asynchronous, partial, or manual. A refund might take time. A shipment might be cancellable only before handoff to the carrier. A legacy platform might support reversal only through an overnight batch. That means compensation needs its own design, status tracking, and operational visibility.

In orchestrated systems, a common pattern is to record which forward steps have completed, then execute compensations in reverse order if the workflow later fails. Durable Functions makes this practical because orchestration state can keep track of what has happened so far.

A simplified orchestration sketch might look like this:

[Function(nameof(ProcessOrderOrchestrator))]
public static async Task Run(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var order = context.GetInput<OrderRequest>();
    var completedSteps = new List<string>();

    try
    {
        // Forward path: record each step as soon as it has completed.
        await context.CallActivityAsync(nameof(ReserveInventoryActivity), order);
        completedSteps.Add("inventory");

        await context.CallActivityAsync(nameof(ChargePaymentActivity), order);
        completedSteps.Add("payment");

        await context.CallActivityAsync(nameof(RegisterOrderInErpActivity), order);
        completedSteps.Add("erp");
    }
    catch (Exception)
    {
        // Compensate in reverse order, but only the steps that completed.
        if (completedSteps.Contains("payment"))
        {
            await context.CallActivityAsync(nameof(RefundPaymentActivity), order);
        }

        if (completedSteps.Contains("inventory"))
        {
            await context.CallActivityAsync(nameof(ReleaseInventoryActivity), order);
        }

        await context.CallActivityAsync(nameof(RaiseOpsAlertActivity), order.OrderId);

        throw;
    }
}

This is deliberately simple, but it makes the main point. Compensation is not an optional extra you add later. It is part of the workflow contract. If a business process can leave the world half-changed, then it also needs a defined path to recover from that condition.

There is another important truth here. Compensation is rarely perfect. You should not promise exact rollback semantics where the domain does not support them. Some workflows are better described as eventually corrected rather than fully undone. That is fine, provided the state transitions are explicit and visible. False certainty is more dangerous than honest eventual consistency.

Pattern 4: Event-Driven Integration Reduces Coupling and Preserves Flow

One of the easiest ways to make orchestration brittle is to let the central workflow call every downstream system directly. It feels simple at first because all the logic is in one place. The orchestrator confirms the order, then calls the ERP, then calls analytics, then calls CRM, then calls some downstream fulfilment component, then maybe calls a notification service. The problem is that each of those direct calls adds latency and dependency pressure to the core flow.

A better option in many cases is to separate the business milestone from the downstream reactions. Once the workflow reaches a meaningful state, such as order confirmed, claim submitted, policy approved, or onboarding completed, it can publish an event. Other systems subscribe independently and handle their own processing. That removes non-essential side effects from the critical path and reduces direct coupling between the orchestrator and every consumer.

The contrast is straightforward. In the direct model, the orchestrator synchronously calls every downstream consumer itself. In the event-driven model, it publishes one milestone event and each consumer reacts on its own schedule.

This shift matters for several reasons. First, it shortens the synchronous path of the core workflow. Second, it allows new consumers to be added later without modifying the orchestrator. Third, it isolates failure. If analytics is down, that should not usually block order confirmation. If CRM processing is delayed, the business milestone may still be valid.

That does not mean direct calls disappear entirely. Some steps remain essential to the transaction outcome and must stay in the workflow. Payment authorisation is usually not optional. Inventory reservation is often not optional. But secondary reactions are usually better handled as event subscribers.

In Azure, this might mean an orchestration step publishes an OrderConfirmed event into Event Grid or a queue topic after core invariants are satisfied. Separate Functions then react to that event and perform ERP synchronisation, customer communications, and reporting updates. In other stacks, the same pattern could use Kafka, RabbitMQ, SNS/SQS, NATS, or any eventing platform with durable delivery.

A typical event contract should be boring and explicit. That is a good thing. It might include a business ID, event type, timestamp, correlation ID, schema version, and only the data consumers genuinely need. Resist the urge to publish a raw dump of internal objects. Events are integration contracts, not convenient serialisation shortcuts.

A simple event model could look like this:

public sealed record OrderConfirmedEvent(
    string OrderId,
    string CustomerId,
    decimal Total,
    DateTime ConfirmedAtUtc,
    string CorrelationId,
    int SchemaVersion);
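
Publishing that event is usually a small activity of its own. A hedged sketch using an Azure Event Grid custom topic, with the endpoint, key, and correlation ID as placeholders:

// Build the event from the workflow's current state; schema version 1.
var confirmed = new OrderConfirmedEvent(
    order.OrderId, order.CustomerId, order.Total,
    DateTime.UtcNow, correlationId, 1);

var publisher = new EventGridPublisherClient(
    new Uri("https://<your-topic>.<region>.eventgrid.azure.net/api/events"),
    new AzureKeyCredential("<topic-key>"));

var gridEvent = new EventGridEvent(
    $"orders/{confirmed.OrderId}",        // subject
    "OrderConfirmed",                     // event type
    confirmed.SchemaVersion.ToString(),   // data version
    confirmed);                           // payload

await publisher.SendEventAsync(gridEvent);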

There is a trade-off, of course. Event-driven systems push you toward eventual consistency. Consumers may process at different times. Delivery may be at least once, not exactly once. That takes us right back to idempotency and observability. Event-driven integration works well when paired with those patterns, not when treated as a shortcut that somehow removes the need for them.
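
On the consumer side, that usually means a dedup check before the side effect. A minimal sketch, assuming a hypothetical processedEvents store and crmClient:

public async Task HandleAsync(OrderConfirmedEvent evt, CancellationToken ct)
{
    // Build a stable identity for the event so redeliveries are recognised.
    var eventKey = $"OrderConfirmed:{evt.OrderId}:{evt.CorrelationId}";

    if (await processedEvents.ExistsAsync(eventKey, ct))
        return;

    await crmClient.RecordConfirmedOrderAsync(evt.OrderId, evt.CustomerId, evt.Total, ct);

    // Ideally the marker and the side effect commit together; a crash between
    // the two reintroduces the duplicate problem this check exists to solve.
    await processedEvents.MarkProcessedAsync(eventKey, ct);
}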

Pattern 5: Custom Status and Observability Keep Workflows from Becoming Black Boxes

Many operational incidents are not caused by the workflow being broken. They are caused by nobody being able to tell what the workflow is doing. A long-running integration process can be perfectly healthy while waiting on a supplier response, a human approval, or an overnight ERP batch. Without good status signals, support teams often interpret waiting as failure and failure as waiting. That confusion creates noise, escalations, and manual work that should never have existed.

The fix is simple in principle and often neglected in practice. Long-running workflows need explicit, queryable status. Not vague technical state, but business-meaningful status. A fraud check should not just be running. It should be FraudCheckPending or FraudCheckFailed. An ERP handoff should be ErpSubmissionPending, ErpRegistered, or ErpRejected. A supplier callback stage should be AwaitingSupplierApproval. A manual review should be PendingHumanDecision.

Durable Functions supports custom orchestration status, which is a powerful way to surface this information directly from the workflow runtime. But the same idea applies on any platform. You need a state model that answers the basic operational question: where is this process now, and why is it there?

A practical lifecycle is a small, explicit set of stages that an operator can read without knowing the code.
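
One hedged way to express it, using the statuses mentioned above plus a few illustrative additions:

public enum OrderWorkflowStage
{
    Received,
    FraudCheckPending,
    FraudCheckFailed,
    PaymentPending,
    PaymentCompleted,
    ErpSubmissionPending,
    ErpRegistered,
    ErpRejected,
    AwaitingSupplierApproval,
    PendingHumanDecision,
    CompensationInProgress,
    Completed,
    Failed
}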

In code, that might be as straightforward as setting custom status at each meaningful stage:

context.SetCustomStatus(new
{
    orderId = order.OrderId,
    stage = "FraudCheckPending",
    updatedAtUtc = context.CurrentUtcDateTime
});

That single line is more valuable than many teams realise. Once status is queryable, you can power dashboards, operator portals, support tooling, and incident triage without reverse engineering workflow behaviour from logs.

Observability also needs more than status labels. Correlation IDs must flow through the entire chain, from inbound request to orchestration instance to activity calls to outbound dependency calls and published events. Logs need consistent structured fields. Metrics should cover latency, retries, breaker state, queue depth, timeout counts, compensation frequency, and downstream failure rates. Tracing should allow engineers to follow a transaction through multiple services without playing archaeology across disconnected log stores.
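
On the logging side, a small sketch using standard Microsoft.Extensions.Logging scopes; the field names and the erpClient call are illustrative:

using (logger.BeginScope(new Dictionary<string, object>
{
    ["CorrelationId"] = correlationId,
    ["OrderId"] = order.OrderId,
    ["WorkflowStage"] = "ErpSubmissionPending"
}))
{
    // Every entry written inside this scope carries the same structured fields,
    // so a single correlation ID query can follow the transaction across services.
    logger.LogInformation("Submitting order to ERP");
    await erpClient.SubmitAsync(order, ct);
}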

Here is the ugly truth. If your workflow depends on several systems and you do not have proper correlation and state visibility, you do not have an operable architecture. You have a hope-based architecture.

Pattern 6: Hybrid Integration Accepts Reality Instead of Demanding a Rewrite

A lot of technical content on serverless and orchestration quietly assumes the organisation has the freedom to build a clean greenfield system. That is rarely how enterprise work actually looks. Most teams are not replacing everything. They are inserting modern capability into an environment where a mixture of old and new already exists.

That is why hybrid integration matters. Serverless does not have to replace the ERP. It can orchestrate around it. Durable workflows do not have to own every business rule. They can coordinate specialised services that already exist. Modern data stores can support fast projections and reporting while a different platform remains the canonical system of record. New cloud-native capabilities can coexist with legacy systems provided the architectural boundaries are clear.

A realistic enterprise shape is rarely a single platform. It is usually a deliberate mix: new orchestration coordinating the process, existing specialised services keeping their responsibilities, and legacy systems continuing to hold the data they still own.

This hybrid model is often the most pragmatic route to value. The orchestration layer becomes the coordinator of the business process. Compliance-sensitive payment logic can remain in a dedicated service. A legacy ERP can continue as the source of truth for certain financial or operational records. Cloud-native projections can power responsive read models and dashboards without forcing the organisation to migrate everything at once.

That also means architects need discipline around ownership. The orchestration engine should coordinate process state, not become the dumping ground for every piece of business logic in the company. The ERP should retain the responsibilities it is still good at, not be called for every trivial lookup. Projection stores should serve read performance and user experience, not quietly evolve into shadow systems with ambiguous truth boundaries.

The big win in hybrid architecture is incremental progress. You do not need a grand rewrite to improve resilience, observability, and flow control. You can wrap brittle integrations with better orchestration. You can isolate long-running handoffs. You can publish cleaner events. You can add compensations and workflow visibility around systems that were never built with those ideas in mind.

That is usually how real transformation succeeds, not through replacement fantasies but through carefully chosen seams.

Pattern 7: Resilience by Design Means Assuming the System Will Be Incomplete, Slow, and Wrong Sometimes

The strongest systems are not the ones that assume everything will go right. They are the ones that assume at least some parts will go wrong and still define how the workflow should behave. That mindset is what resilience by design really means.

It means assuming partial failure is normal. A dependency might succeed after a retry, fail permanently, or accept work and complete later. A callback might arrive twice. A timer might expire before a human responds. An event consumer might process late. An external system might hold the canonical answer even though your local projection says otherwise. These are not edge cases. They are part of the design space.

Resilience by design also means being honest about consistency. Many distributed workflows are eventually consistent, and pretending otherwise helps nobody. The real architectural task is to define where temporary inconsistency is acceptable, how it is reconciled, and what the user or operator sees while it exists. Good systems make the transition states explicit instead of hiding them behind vague processing messages.

It also means measuring the behaviour that matters. You should know which dependencies are slowest, which steps retry most often, which compensations are frequent, how long workflows remain in waiting states, and which manual interventions are recurring. Teams that do not measure this tend to rediscover the same operational pain every quarter and act surprised each time.

Finally, resilience by design means accepting that supportability is part of architecture. A workflow is not finished when it compiles and passes tests. It is finished when operators can understand it, support teams can explain it, incidents can be triaged quickly, and business stakeholders can trust that failures are bounded and recoverable.

A Concrete End-to-End Example

Let us pull these patterns together in a single scenario. Imagine a large B2B order workflow. An order enters the system through an API. The orchestration starts and immediately assigns a correlation ID that follows the transaction everywhere. The workflow sets its status to Received. It then checks whether the fraud provider breaker is open. If it is, the workflow fails fast with a visible dependency-unavailable status rather than quietly piling into retries.

If the breaker allows execution, the workflow sends a fraud request with a request ID that can be used for deduplication if the provider supports it. Once fraud is approved, payment is attempted with an idempotency key derived from the order ID. That ensures retries cannot double-charge the customer. After payment succeeds, the workflow publishes an OrderConfirmed event so downstream analytics and CRM updates can proceed independently instead of extending the critical path.

Next, the workflow submits the order to a legacy ERP. The ERP is slow and sometimes responds asynchronously, so the orchestration switches status to ErpSubmissionPending and waits for either an external callback or a timeout. If the callback arrives with success, the workflow completes. If the ERP rejects the order, the orchestration enters CompensationInProgress, triggers a refund, releases any provisional inventory state, raises an operational alert, and finally moves the order into a failed terminal state with a reason that support can actually understand.

That end-to-end shape can be condensed into a single orchestration.
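
A hedged sketch of that orchestration, reusing activity names from earlier examples and assuming a few new ones (CheckFraudBreakerActivity, FraudCheckActivity, PublishOrderConfirmedActivity, ErpResult) that a real codebase would define:

[Function("B2BOrderOrchestrator")]
public static async Task Run(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var order = context.GetInput<OrderRequest>();
    context.SetCustomStatus(new { stage = "Received", orderId = order.OrderId });

    // Fail fast if the fraud dependency breaker is open.
    var canCallFraud = await context.CallActivityAsync<bool>(
        nameof(CheckFraudBreakerActivity), order.OrderId);

    if (!canCallFraud)
    {
        context.SetCustomStatus(new { stage = "FraudDependencyUnavailable", orderId = order.OrderId });
        return;
    }

    await context.CallActivityAsync(nameof(FraudCheckActivity), order);

    // The payment activity derives its idempotency key from the order ID,
    // so retries cannot double-charge the customer.
    await context.CallActivityAsync(nameof(ChargePaymentActivity), order);

    // Publish the milestone; analytics and CRM react off the critical path.
    await context.CallActivityAsync(nameof(PublishOrderConfirmedActivity), order);

    // Hand off to the legacy ERP, then wait for its callback or a timeout.
    await context.CallActivityAsync(nameof(RegisterOrderInErpActivity), order);
    context.SetCustomStatus(new { stage = "ErpSubmissionPending", orderId = order.OrderId });

    using var cts = new CancellationTokenSource();
    var callback = context.WaitForExternalEvent<ErpResult>("ErpCallback");
    var timeout = context.CreateTimer(context.CurrentUtcDateTime.AddHours(6), cts.Token);

    var winner = await Task.WhenAny(callback, timeout);
    if (winner == callback)
    {
        cts.Cancel(); // the ERP answered, so the timeout timer is no longer needed

        if (callback.Result.Accepted)
        {
            context.SetCustomStatus(new { stage = "Completed", orderId = order.OrderId });
            return;
        }
    }

    // ERP rejection or timeout: compensate and surface a reason support can read.
    context.SetCustomStatus(new { stage = "CompensationInProgress", orderId = order.OrderId });
    await context.CallActivityAsync(nameof(RefundPaymentActivity), order);
    await context.CallActivityAsync(nameof(ReleaseInventoryActivity), order);
    await context.CallActivityAsync(nameof(RaiseOpsAlertActivity), order.OrderId);
    context.SetCustomStatus(new { stage = "Failed", orderId = order.OrderId, reason = "ErpRejectedOrTimedOut" });
}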

Nothing in that flow is exotic. That is exactly the point. Most resilient architectures are not built from obscure theory. They are built from boring patterns applied consistently and early enough that the system does not rot under growth.

What Developers Usually Get Wrong

The first common mistake is over-centralising the orchestration. Developers discover a workflow engine and start putting every rule, integration, and transformation into the orchestrator itself. That turns the orchestrator into a giant god-process that becomes hard to change and impossible to reason about. The workflow should coordinate. It should not absorb every responsibility.

The second mistake is believing retries are a resilience strategy on their own. They are not. Retries without idempotency, compensation, status visibility, and bounded dependency behaviour are just a way of failing repeatedly.

The third mistake is underestimating operational visibility. Teams often spend far more time designing the happy path than designing the support path. Then the first real incident happens and nobody can answer the obvious questions. Which stage is this order at. Did payment happen already. Has ERP seen it. Is this waiting for a callback or stuck in a retry loop. Those questions should not require an engineer to grep logs across five systems.

The fourth mistake is assuming greenfield purity is required before improvement is possible. It is not. Some of the best resilience gains come from putting orchestration, status modelling, idempotency, and compensations around existing systems rather than replacing them.

The fifth mistake is treating eventual consistency as a flaw to be hidden instead of a reality to be designed for. Users and operators can cope with transition states if those states are honest and understandable. What they cannot cope with is silent ambiguity.

How to Apply These Patterns in Practice

If you are building or modernising an integration-heavy workflow, start by identifying the true business milestones rather than the raw API calls. Ask where side effects happen, which ones must be synchronous, which ones can be event-driven, and which ones need compensation if a later step fails. That alone will usually reveal whether your current workflow is too tightly coupled.

Then look at duplicate execution risk. Anywhere you have retries, redelivery, callbacks, or human re-submission, you need a defined idempotency strategy. Be precise about the operation key and where the result is recorded. Vague assurances that the provider should handle duplicates are not enough.

Next, inspect dependency behaviour. Which integrations deserve a circuit breaker. Which ones should fail fast. Which ones should shift into async wait mode with timers and external events. Which ones are important enough to stay on the critical path and which ones should react to events later.

After that, design your status model. Not your log messages, your status model. What are the meaningful states of the workflow from an operator and business perspective. How are those states exposed. Where do correlation IDs flow. What metrics would tell you this process is degrading before customers notice.

Finally, decide how the new workflow lives alongside existing systems. Be explicit about what remains the source of truth, what becomes a projection, and what the orchestrator does and does not own. Hybrid architecture becomes dangerous only when ownership is vague.

Resilience at scale is not about making distributed systems behave like a single local transaction. That fantasy does not survive contact with real dependencies, real organisations, or real time. The job is to build workflows that remain understandable and recoverable when the surrounding systems behave imperfectly.

That is why these patterns are useful. Circuit breakers keep one bad dependency from turning into systemic slowdown. Idempotency makes retries safe. Compensation gives workflows a path back from partial success. Event-driven integration reduces unnecessary coupling. Custom status and observability make the process operable. Hybrid architecture accepts the systems you actually have. Resilience by design ties all of it together into a mindset rather than a patchwork of technical tricks.

Once you start thinking this way, integration architecture changes. You stop asking how to make the happy path pass one more test and start asking how the workflow behaves when the world around it is late, duplicated, unavailable, or inconsistent. That is the right question. It is also the one that separates systems that merely work from systems that keep working.