The Most Dangerous Line in a .NET Background Worker

Senior Software Engineer specialising in cloud architecture, distributed systems, and modern .NET development, with over two decades of experience designing and delivering enterprise platforms in financial, insurance, and high-scale commercial environments. My focus is on building systems that are reliable, scalable, and maintainable over the long term. I’ve led modernisation initiatives moving legacy platforms to cloud-native Azure architectures, designed high-throughput streaming solutions to eliminate performance bottlenecks, and implemented secure microservices environments using container-based deployment models and event-driven integration patterns. From an architecture perspective, I have strong practical experience applying approaches such as Vertical Slice Architecture, Domain-Driven Design, Clean Architecture, and Hexagonal Architecture. I’m particularly interested in modular system design that balances delivery speed with long-term sustainability, and I enjoy solving complex problems involving distributed workflows, performance optimisation, and system reliability. I enjoy mentoring engineers, contributing to architectural decisions, and helping teams simplify complex systems into clear, maintainable designs. I’m always open to connecting with other engineers, architects, and technology leaders working on modern cloud and distributed system challenges.

Every production system has a few lines of code that look too simple to question. In a .NET background worker, one of them is usually this:

while (true)
{
    await DoWorkAsync();
}

It looks OK. It looks like the simplest possible worker. Keep running. Keep processing. Keep doing the job. That line is also how you build a worker that ignores shutdown, hammers failed dependencies, hides exceptions, duplicates work, burns CPU, blocks deployments, and turns a small downstream outage into a production incident. The dangerous part is not the loop itself. Long-running workers need loops. The problem is what that loop says about the design. It says the worker owns time, retries, failure, cancellation, pacing, and recovery, but none of those things have been made explicit. That's where the incident starts.

Background workers are production code, not side code

A lot of developers treat background workers differently from APIs. API endpoints get validation, cancellation tokens, logging, metrics, timeouts, idempotency checks, and careful error handling. Workers often get a while (true) loop, a scoped service, and a vague hope that the hosted service will just keep running. That's backwards.

A background worker is usually much closer to the dangerous part of the system than an API endpoint. It's often the code that charges a card, sends an email, processes a file, or moves a message to the next stage of a workflow. It may also be the thing calling external services in the background, retrying failed jobs, or changing state without a user watching the screen. That makes the worker easy to underestimate. When it fails, it doesn't always fail loudly, and nobody is there to notice. It can keep running quietly while doing the wrong thing again and again.

That's why a poor worker loop is so expensive. It doesn't fail once. It fails repeatedly.

The naive worker

This is the kind of code that shows up in plenty of real applications:

public sealed class PaymentWorker : BackgroundService
{
    private readonly IServiceProvider _serviceProvider;
    public PaymentWorker(IServiceProvider serviceProvider)
    {
        _serviceProvider = serviceProvider;
    }

    protected override async Task ExecuteAsync(CancellationToken stopToken)
    {
        while (true)
        {
            using var scope = _serviceProvider.CreateScope();
            var processor = scope.ServiceProvider.GetRequiredService<PaymentProcessor>();

            await processor.ProcessPendingPaymentsAsync();
        }
    }
}

At first glance, this seems fine. The worker creates a scope, resolves the processor, and processes pending payments. But there are several problems hiding inside it. It ignores stopToken. It has no delay when there is no work. It has no pacing when there is too much work. It has no exception boundary. It has no timeout per operation. It has no clear retry behaviour. There's no way to stop cleanly during deployment. It gives you no signal about whether the worker is healthy, stuck, or spinning. That one loop has quietly accepted responsibility for production behaviour it does not actually control.

Ignoring cancellation breaks shutdown

The first mistake is simple. The loop never observes cancellation. When the host is shutting down, .NET passes a cancellation token into ExecuteAsync. That token is the worker's signal to finish what it is doing and stop. If the worker ignores it, shutdown becomes a guess. That can cause slow deployments. It can leave work half processed. It can make Kubernetes, Azure App Service, containers, or Windows services terminate the process more aggressively because the app did not stop in time.

The fix starts with the loop condition:

protected override async Task ExecuteAsync(CancellationToken stopToken)
{
    while (!stopToken.IsCancellationRequested)
    {
        await DoWorkAsync(stopToken);
    }
}

That's better, but it is still not enough. The loop condition is only checked between iterations, so passing the token into the work itself is the important part.

private static async Task DoWorkAsync(CancellationToken stopToken)
{
    await Task.Delay(TimeSpan.FromSeconds(1), stopToken);
}

If your worker calls a database, queue, HTTP API, blob store, or another service, the token should flow into those calls as well.

await dbContext.SaveChangesAsync(stopToken);
await httpClient.SendAsync(request, stopToken);
await queueClient.ReceiveMessagesAsync(cancellationToken: stopToken);

Cancellation is not decoration. It is how your worker cooperates with the host.
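
The naive worker was also missing a timeout per operation. One way to add that without losing shutdown responsiveness is a linked cancellation token source, sketched below. The helper name `OperationTimeouts.RunWithTimeoutAsync` is hypothetical, and the sketch assumes the operation honours the token it is given.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class OperationTimeouts
{
    // Hypothetical helper: runs one unit of work under a token that fires on
    // host shutdown OR when the per-operation timeout elapses, whichever is first.
    // Returns true if the operation completed, false if it was cancelled
    // for any reason other than host shutdown (normally the timeout).
    public static async Task<bool> RunWithTimeoutAsync(
        Func<CancellationToken, Task> operation,
        TimeSpan timeout,
        CancellationToken stopToken)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(stopToken);
        cts.CancelAfter(timeout);

        try
        {
            await operation(cts.Token);
            return true;
        }
        catch (OperationCanceledException) when (!stopToken.IsCancellationRequested)
        {
            // The host is not shutting down, so report this as a timed-out
            // operation. Shutdown cancellation is not caught and propagates.
            return false;
        }
    }
}
```

The caller can then decide what a timeout means for the batch (log it, back off, count it as a failure) instead of letting one slow dependency hold the loop indefinitely.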

The missing delay becomes a CPU bug

The next mistake is the tight loop. If ProcessPendingPaymentsAsync finds no work and returns quickly, the worker immediately calls it again. Then again. Then again. That can turn an empty database table into constant polling. It can turn a quiet queue into unnecessary network traffic. It can turn a broken dependency into a retry storm.

A simple delay helps, but it needs to be cancellable:

protected override async Task ExecuteAsync(CancellationToken stopToken)
{
    while (!stopToken.IsCancellationRequested)
    {
        await ProcessNextBatchAsync(stopToken);
        await Task.Delay(TimeSpan.FromSeconds(5), stopToken);
    }
}

This is still basic, but it is already safer. The worker does not spin when there is no work, and the delay does not block shutdown. For more serious systems, the delay should usually depend on the outcome. If work was found, continue quickly. If no work was found, back off. If a dependency failed, back off more aggressively. The point is not to add a magic sleep. The point is to make pacing deliberate.
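
As a rough illustration of outcome-based pacing, the sketch below maps a batch outcome to the next delay: none when work was found, a fixed pause when idle, and a capped exponential backoff on failure. The `BatchOutcome` enum, the `Pacing` class, and all the durations are assumptions for the example, not values from a real system.

```csharp
using System;

public enum BatchOutcome { WorkFound, Idle, Failed }

public static class Pacing
{
    // Illustrative values only; tune them for the real workload.
    private static readonly TimeSpan IdleDelay = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan BaseFailureDelay = TimeSpan.FromSeconds(2);
    private static readonly TimeSpan MaxFailureDelay = TimeSpan.FromMinutes(2);

    // consecutiveFailures counts failed batches in a row, including the current one.
    public static TimeSpan NextDelay(BatchOutcome outcome, int consecutiveFailures) =>
        outcome switch
        {
            BatchOutcome.WorkFound => TimeSpan.Zero, // more work is likely: continue at once
            BatchOutcome.Idle => IdleDelay,          // nothing to do: poll gently
            _ => FailureDelay(consecutiveFailures),  // dependency trouble: back off harder
        };

    private static TimeSpan FailureDelay(int consecutiveFailures)
    {
        // 2s, 4s, 8s, ... doubling per consecutive failure, capped at MaxFailureDelay.
        var exponent = Math.Clamp(consecutiveFailures - 1, 0, 16);
        var seconds = BaseFailureDelay.TotalSeconds * Math.Pow(2, exponent);
        return TimeSpan.FromSeconds(Math.Min(seconds, MaxFailureDelay.TotalSeconds));
    }
}
```

The loop would pass the result to `Task.Delay(delay, stopToken)` and reset the failure counter after any successful batch. When many instances fail at the same moment, adding a little random jitter to the failure delay is a common refinement.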

Exceptions should not decide your architecture

A background worker needs a clear exception boundary. Without one, an unhandled exception can stop the worker. Depending on your host options, that may stop the whole application or leave you with a dead background process while the web app still responds to health checks.

This version is too fragile:

protected override async Task ExecuteAsync(CancellationToken stopToken)
{
    while (!stopToken.IsCancellationRequested)
    {
        await ProcessNextBatchAsync(stopToken);
        await Task.Delay(TimeSpan.FromSeconds(5), stopToken);
    }
}

If ProcessNextBatchAsync throws once, your worker may be gone. A better worker makes failure explicit:

protected override async Task ExecuteAsync(CancellationToken stopToken)
{
    while (!stopToken.IsCancellationRequested)
    {
        try
        {
            await ProcessNextBatchAsync(stopToken);
            await Task.Delay(TimeSpan.FromSeconds(5), stopToken);
        }
        catch (OperationCanceledException) when (stopToken.IsCancellationRequested)
        {
            break;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Payment worker failed while processing a batch.");

            await Task.Delay(TimeSpan.FromSeconds(30), stopToken);
        }
    }
}

The OperationCanceledException case matters. Cancellation is not the same thing as failure. You do not want noisy error logs every time the application shuts down cleanly. The general exception case also matters. You do not want one bad record, one timeout, or one transient network issue to permanently kill the worker. But this is not permission to swallow everything and move on. The worker should log the failure, emit metrics, back off, and make it visible. Silent recovery is how systems rot.
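
The host options mentioned earlier can be set explicitly, so that what happens when an exception does escape ExecuteAsync is a decision rather than a default. The fragment below is a sketch that assumes the Microsoft.Extensions.Hosting package and .NET 6 or later, where `BackgroundServiceExceptionBehavior` was introduced with `StopHost` as the default. The timeout value is a placeholder.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
    {
        services.Configure<HostOptions>(options =>
        {
            // StopHost: an unhandled exception in a BackgroundService stops the
            // whole application. Ignore: the exception is logged and the host
            // keeps running without that worker.
            options.BackgroundServiceExceptionBehavior =
                BackgroundServiceExceptionBehavior.StopHost;

            // How long the host waits for hosted services to stop on shutdown.
            // Placeholder value; align it with the platform's termination grace period.
            options.ShutdownTimeout = TimeSpan.FromSeconds(30);
        });

        // Register the worker and its dependencies here,
        // for example services.AddHostedService<PaymentWorker>();
    })
    .Build();

await host.RunAsync();
```

Stopping the host is often the safer choice when an orchestrator will restart the process, because a dead worker inside a live process is much harder to notice.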

Retrying the loop is not the same as retrying the operation

A common worker bug is accidental retry behaviour. The code fails during processing. The loop catches the exception. The next iteration runs the same query again. The same item is picked up again. The same external call happens again. Sometimes that's fine. Often it's not.

Imagine this flow:

await paymentGateway.ChargeAsync(payment, stopToken);
payment.MarkAsCharged();
await dbContext.SaveChangesAsync(stopToken);

If the gateway charge succeeds but SaveChangesAsync fails, the database still says the payment is pending. The next loop picks it up again. Now you may charge the customer twice. That is not a background worker problem in isolation. It is a workflow design problem. The worker only exposes it. The fix depends on the domain, but the principles are stable. Use idempotency keys when calling external providers. Store external operation IDs. Make state transitions explicit. Avoid selecting the same work item concurrently from multiple workers. Do not assume "retry the method" is safe just because the code is inside a loop. For a payment worker, the provider call should include an idempotency key based on a stable business operation:

var request = new ChargePaymentRequest
{
    PaymentId = payment.Id,
    Amount = payment.Amount,
    Currency = payment.Currency,
    IdempotencyKey = $"payment-charge-{payment.Id}"
};

var result = await paymentGateway.ChargeAsync(request, stopToken);

Then the local state should record what happened:

payment.MarkChargeSubmitted(result.ProviderReference);
await dbContext.SaveChangesAsync(stopToken);

The exact design will vary. The important part is accepting that the worker loop will retry. Your business operation has to survive that.
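
To make the effect concrete, here is a small in-memory sketch of a provider that honours idempotency keys. `FakePaymentGateway` and `IdempotencyDemo` are hypothetical stand-ins; real providers differ in how they store keys, how long they keep them, and what they return on a replay.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a provider that honours idempotency keys.
public sealed class FakePaymentGateway
{
    private readonly Dictionary<string, string> _referencesByKey = new();
    private readonly object _gate = new();

    public int ChargesExecuted { get; private set; }

    public string Charge(string idempotencyKey, decimal amount)
    {
        lock (_gate)
        {
            // Replay of a known key: return the original result, charge nothing.
            if (_referencesByKey.TryGetValue(idempotencyKey, out var existing))
            {
                return existing;
            }

            ChargesExecuted++;
            var reference = $"prov-{ChargesExecuted}";
            _referencesByKey[idempotencyKey] = reference;
            return reference;
        }
    }
}

public static class IdempotencyDemo
{
    public static (string First, string Second, int Charges) Run()
    {
        var gateway = new FakePaymentGateway();
        var key = "payment-charge-42"; // stable per business operation

        // First attempt: the charge succeeds, then the local save fails.
        var first = gateway.Charge(key, 10.00m);

        // Next loop iteration picks up the same pending payment and retries.
        var second = gateway.Charge(key, 10.00m);

        return (first, second, gateway.ChargesExecuted);
    }
}
```

Both calls return the same provider reference and only one charge is executed, which is the property the worker loop relies on when it retries.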

Multiple workers make the bug worse

A loop that works locally can fail badly when scaled out. On your machine, there's one worker. In production, there may be three app instances. During deployment, there may briefly be old and new instances running at the same time. If each instance runs the same worker, they may all select the same pending rows.

This code is suspicious:

var payments = await dbContext.Payments
    .Where(x => x.Status == PaymentStatus.Pending)
    .OrderBy(x => x.CreatedAt)
    .Take(50)
    .ToListAsync(stopToken);

It reads pending work, but it doesn't claim it. Two workers can read the same rows before either one saves a status change. Both think they own the work. A safer design needs an ownership step. In SQL Server, that often means moving work from Pending to Processing in a way that is atomic enough for your concurrency model. You may use row versioning, locking hints, a stored procedure, an outbox table, a queue, or a dedicated work-claim pattern. The details are less important than the rule: reading work is not the same as owning work. If your worker can run on more than one process, it needs a real claim strategy.
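
The core of any claim strategy is a conditional state change that only one caller can win. The sketch below shows that idea in memory with a compare-and-swap; `WorkItem` and `ClaimDemo` are hypothetical names, and a database-backed worker would express the same rule as a conditional update instead.

```csharp
using System.Threading;
using System.Threading.Tasks;

public sealed class WorkItem
{
    private const int Pending = 0;
    private const int Processing = 1;

    private int _status = Pending;

    // Atomically moves Pending -> Processing. Exactly one caller gets true.
    // The database analogue is a conditional update such as
    //   UPDATE Payments SET Status = 'Processing'
    //   WHERE Id = @id AND Status = 'Pending'
    // followed by a check that exactly one row was affected.
    public bool TryClaim() =>
        Interlocked.CompareExchange(ref _status, Processing, Pending) == Pending;
}

public static class ClaimDemo
{
    // Simulates several workers racing for the same item and counts the winners.
    public static int CountWinners(int workers)
    {
        var item = new WorkItem();
        var winners = 0;

        Parallel.For(0, workers, _ =>
        {
            if (item.TryClaim())
            {
                Interlocked.Increment(ref winners);
            }
        });

        return winners;
    }
}
```

However many workers race, only one claim succeeds. A production design also needs a way to release or expire claims held by a worker that crashed, which this sketch leaves out.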

Scoped services need scoped lifetimes

Another worker smell is injecting a scoped service directly into a singleton hosted service. A hosted service registered with AddHostedService is a singleton. A DbContext is scoped. Those lifetimes do not line up.

This is wrong:

public sealed class PaymentWorker(AppDbContext dbContext) : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stopToken)
    {
        while (!stopToken.IsCancellationRequested)
        {
            await dbContext.SaveChangesAsync(stopToken);
        }
    }
}

Use a scope per iteration or per batch:

public sealed class PaymentWorker(
    IServiceScopeFactory scopeFactory,
    ILogger<PaymentWorker> logger)
    : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stopToken)
    {
        while (!stopToken.IsCancellationRequested)
        {
            try
            {
                await using var scope = scopeFactory.CreateAsyncScope();

                var processor = scope.ServiceProvider
                    .GetRequiredService<PaymentProcessor>();

                await processor.ProcessNextBatchAsync(stopToken);
                await Task.Delay(TimeSpan.FromSeconds(5), stopToken);
            }
            catch (OperationCanceledException) when (stopToken.IsCancellationRequested)
            {
                break;
            }
            catch (Exception ex)
            {
                logger.LogError(ex, "Payment worker failed.");
                await Task.Delay(TimeSpan.FromSeconds(30), stopToken);
            }
        }
    }
}

The scope gives each batch a clean set of scoped dependencies. That's important for DbContext, unit-of-work boundaries, and services that hold request-level state.

A safer worker shape

A production worker does not need to be complicated. It needs to be honest about the behaviours it owns.

Here is a more sensible shape:

public sealed class PaymentWorker(
    IServiceScopeFactory scopeFactory,
    ILogger<PaymentWorker> logger)
    : BackgroundService
{
    private static readonly TimeSpan IdleDelay = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan FailureDelay = TimeSpan.FromSeconds(30);

    protected override async Task ExecuteAsync(CancellationToken stopToken)
    {
        logger.LogInformation("Payment worker started.");

        while (!stopToken.IsCancellationRequested)
        {
            try
            {
                var processed = await ProcessBatchAsync(stopToken);

                if (processed == 0)
                {
                    await Task.Delay(IdleDelay, stopToken);
                }
            }
            catch (OperationCanceledException) when (stopToken.IsCancellationRequested)
            {
                break;
            }
            catch (Exception ex)
            {
                logger.LogError(ex, "Payment worker batch failed.");

                await Task.Delay(FailureDelay, stopToken);
            }
        }

        logger.LogInformation("Payment worker stopped.");
    }

    private async Task<int> ProcessBatchAsync(CancellationToken stopToken)
    {
        await using var scope = scopeFactory.CreateAsyncScope();

        var processor = scope.ServiceProvider
            .GetRequiredService<PaymentProcessor>();

        return await processor.ProcessNextBatchAsync(stopToken);
    }
}

This version is still small, but the behaviour is much easier to reason about. Cancellation now flows through the worker properly, idle periods don’t cause a tight loop, and failures are handled separately from shutdown. The worker also creates scoped dependencies in the right place and gives the processor a simple way to tell the loop whether any work was actually done. One caveat remains: if shutdown is requested during the failure delay, the cancellation exception is thrown from inside the catch block, so it escapes ExecuteAsync and the final "stopped" log line is skipped. The worker also still needs domain-level safety around idempotency, ownership, retries, and state transitions. The loop cannot solve those alone. But it no longer makes everything worse by default.

The worker should be observable

A worker that only logs errors is hard to operate. You want to know whether it is alive, whether it is doing useful work, how long batches take, how many items it processes, how often it fails, and how old the oldest pending item is. That last one is especially useful. Queue length can lie. A queue with 100 items may be fine if they are fresh. A queue with 3 items may be a serious problem if the oldest one is 12 hours old. At a minimum, record batch duration, processed count, failure count, retry count, and work age. If the worker handles business-critical tasks, expose those numbers in dashboards and alerts.
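
One possible way to record those numbers in .NET is the built-in System.Diagnostics.Metrics API, which exporters such as OpenTelemetry can listen to. The sketch below is a minimal example; the meter name, the instrument names, and the `PaymentWorkerMetrics` class are assumptions, not an established convention.

```csharp
using System;
using System.Diagnostics.Metrics;

public sealed class PaymentWorkerMetrics : IDisposable
{
    // Hypothetical meter and instrument names.
    private readonly Meter _meter = new("PaymentWorker");
    private readonly Counter<long> _processed;
    private readonly Counter<long> _failures;
    private readonly Histogram<double> _batchDuration;

    // oldestPendingAge should return a cached value: the metrics pipeline
    // invokes it whenever the gauge is collected.
    public PaymentWorkerMetrics(Func<TimeSpan> oldestPendingAge)
    {
        _processed = _meter.CreateCounter<long>("payment_worker.items_processed");
        _failures = _meter.CreateCounter<long>("payment_worker.batch_failures");
        _batchDuration = _meter.CreateHistogram<double>(
            "payment_worker.batch_duration", unit: "ms");

        _meter.CreateObservableGauge<double>(
            "payment_worker.oldest_pending_age",
            () => oldestPendingAge().TotalSeconds,
            unit: "s");
    }

    // Call after each batch that completes without throwing.
    public void RecordBatch(int processedCount, TimeSpan duration)
    {
        _processed.Add(processedCount);
        _batchDuration.Record(duration.TotalMilliseconds);
    }

    // Call from the general exception handler in the loop.
    public void RecordFailure() => _failures.Add(1);

    public void Dispose() => _meter.Dispose();
}
```

The worker would time each batch, call `RecordBatch` on success and `RecordFailure` in the catch block. A retry counter can be added in the same way once the worker tracks retries explicitly.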

A background worker should not be trusted just because the process is running. The process can be healthy while the worker is stuck. The worker can be running while the work is failing. Health checks need to reflect actual progress, not just application uptime.
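
A progress-based check can be as small as remembering when the loop last completed a batch successfully. The sketch below assumes a hypothetical `WorkerHeartbeat` class that the worker updates and a health endpoint reads; the clock is passed in explicitly so the logic stays easy to test.

```csharp
using System;
using System.Threading;

public sealed class WorkerHeartbeat
{
    private long _lastProgressTicks;

    public WorkerHeartbeat(DateTimeOffset startedAt) =>
        _lastProgressTicks = startedAt.UtcTicks;

    // Call after each batch that completes without throwing, including
    // idle polls that found no work. Do not call it from the failure path.
    public void RecordProgress(DateTimeOffset now) =>
        Interlocked.Exchange(ref _lastProgressTicks, now.UtcTicks);

    // Healthy only if the loop has made progress within maxSilence.
    public bool IsHealthy(DateTimeOffset now, TimeSpan maxSilence)
    {
        var last = new DateTimeOffset(
            Interlocked.Read(ref _lastProgressTicks), TimeSpan.Zero);

        return now - last <= maxSilence;
    }
}
```

Registered as a singleton, the same instance could back an ASP.NET Core health check, so that a stuck or permanently failing loop turns the instance unhealthy even though the process is still up. The threshold has to be longer than the slowest legitimate batch plus the longest backoff delay, otherwise the check will flap.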

The line is dangerous because it hides decisions

while (true) is not evil.

The problem is that the loop often appears before the team has decided how the worker should behave under pressure. An idle queue, a failed database call, a half-successful external API request, or a poison message all need different handling. So does deployment shutdown. So does running more than one worker at the same time. Those decisions are easy to ignore when the code is just a loop. They’re much harder to ignore at 2am when the same job keeps running, failing, retrying, and touching production data. A good worker answers those questions in code. A bad worker answers them during the incident.

The most dangerous line in a .NET background worker is not dangerous because it is clever.

It's dangerous because it is ordinary.

while (true)

It slips through code review because everyone understands it. It works locally because there is no real load, no deployment pressure, no duplicate worker, no flaky dependency, and no awkward half-success from an external provider. Then production adds all of those things at once.

A background worker is not just a loop. It is a production workflow running without a user watching it. Treat it with the same seriousness you give your APIs, because when it fails, it may fail quietly, repeatedly, and expensively.