Running LLMs Locally in .NET with Microsoft.Extensions.AI

The rise of large language models (LLMs) has created a curious dilemma for developers. We have access to cloud-scale giants like GPT-5 that deliver state-of-the-art reasoning, but at the same time a wave of open models (Llama 3.1, Mistral, Qwen2, Phi-3) is available to download and run directly on our own machines. On the surface this looks like an odd competition: why would anyone settle for a smaller local model when ChatGPT already delivers better answers with minimal setup? The answer lies in trade-offs around privacy, cost, latency, and compliance.
This post looks at the question from the perspective of a .NET engineer: what does running an LLM locally really mean, why do some teams choose this route despite the quality gap, and how does Microsoft.Extensions.AI provide a clean abstraction that lets you combine both worlds? By the end of this post, you should be able to build a .NET service that uses GPT-5 when it can, but falls back to a local Ollama-hosted model when privacy rules or network conditions demand it.
Why Local Models are Still Important
It’s important to be honest: if you are free to call GPT-5 in the cloud, and you care solely about the best possible quality of answers, then there is little reason to use a local LLM. GPT-5’s performance remains ahead of anything you can run on a laptop or a single workstation GPU. So why bother? Because many developers are not operating in a vacuum. In regulated industries (insurance, finance, health, defence) the barrier to entry is not “is it good enough?” but “can we even legally send this data to a third party?” Local models provide a way to use LLMs while keeping data inside your network boundary.
Even when regulations aren’t the driver, local models bring predictable cost curves, offline capability, and faster round-trip times for certain workloads. None of these advantages overrides the quality difference on its own, but together they justify experimentation. It may even be that one department, such as Engineering, needs access to GPT-5 for the latest cutting-edge answers to technical questions, while another, more regulated department can make use of local LLMs and still see a rise in productivity.
Microsoft.Extensions.AI
If you have built ASP.NET Core apps, you already know the Microsoft.Extensions.* family: dependency injection, configuration, options, and logging. The Microsoft.Extensions.AI libraries apply the same approach to language models.
The key principle is provider abstraction. Instead of coding directly against Azure OpenAI, OpenAI, or Ollama, you target interfaces like IChatClient, then decide in configuration whether that interface maps to a cloud provider or a local engine. This makes hybrid strategies feasible: you can design your services once and swap between GPT-5 in the cloud and Llama 3.1 running locally without changing business logic.
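To make that concrete, here is a minimal sketch of what provider-agnostic business logic looks like. SummaryService is a hypothetical example, not part of the library; it depends only on IChatClient and never mentions a provider:
using Microsoft.Extensions.AI;
// Hypothetical service: depends on the IChatClient abstraction,
// not on any concrete provider.
public sealed class SummaryService(IChatClient chat)
{
    public async Task<string> SummariseAsync(string text, CancellationToken ct) =>
        (await chat.GetResponseAsync($"Summarise the following:\n{text}", cancellationToken: ct)).Text;
}
Whether DI resolves IChatClient to Azure OpenAI or to Ollama is invisible to this class.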
Setting up a Local Model with Ollama
First, let’s set up a local model. Ollama is a lightweight runtime that lets you download and serve LLMs on your machine using a single command.
Install Ollama and pull a model:
ollama pull llama3.1:8b
ollama run llama3.1:8b
Ollama exposes a simple HTTP API on port 11434. This is perfect for local experimentation and integrates cleanly with Microsoft.Extensions.AI.
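Before wiring anything into .NET, you can sanity-check the endpoint directly. A throwaway snippet, assuming a default install on port 11434 and Ollama’s non-streaming generate endpoint:
using System.Net.Http.Json;
// Quick sanity check against Ollama's HTTP API (default local install assumed).
using var http = new HttpClient();
var reply = await http.PostAsJsonAsync("http://localhost:11434/api/generate",
    new { model = "llama3.1:8b", prompt = "Say hello", stream = false });
Console.WriteLine(await reply.Content.ReadAsStringAsync());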
Configuring GPT-5 via Azure OpenAI
Assume you also have an Azure subscription with a GPT-5 deployment. Normally you would register the Azure OpenAI client directly. With Microsoft.Extensions.AI, you can wire both providers side by side:
using Azure;
using Azure.AI.OpenAI;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
var builder = Host.CreateApplicationBuilder(args);
// Local Ollama provider, registered under the key "Ollama"
builder.Services.AddKeyedChatClient("Ollama",
    new OllamaChatClient(new Uri("http://localhost:11434"), "llama3.1:8b"));
// Azure OpenAI provider, registered under the key "AzureOpenAI"
// (AsIChatClient comes from the Microsoft.Extensions.AI.OpenAI package)
builder.Services.AddKeyedChatClient("AzureOpenAI",
    new AzureOpenAIClient(
            new Uri(builder.Configuration["AOAI:Endpoint"]!),
            new AzureKeyCredential(builder.Configuration["AOAI:ApiKey"]!))
        .GetChatClient(builder.Configuration["AOAI:Deployment"]!)
        .AsIChatClient());
// App services
builder.Services.AddTransient<HybridChatService>();
var app = builder.Build();
await app.RunAsync();
Now both providers are available in the DI container, each under its own service key.
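If you ever need one of them outside of constructor injection, you can resolve a keyed client directly from the container. A quick sketch, continuing in the same Program.cs (before the call to RunAsync):
// Resolve each provider by its service key.
var cloud = app.Services.GetRequiredKeyedService<IChatClient>("AzureOpenAI");
var local = app.Services.GetRequiredKeyedService<IChatClient>("Ollama");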
A Hybrid Service
Let’s define a service that routes to GPT-5 by default, but switches to Ollama when a privacy flag is set:
using Microsoft.Extensions.AI;
using Microsoft.Extensions.DependencyInjection;
public sealed class HybridChatService
{
    private readonly IChatClient _cloud;
    private readonly IChatClient _local;
    public HybridChatService(
        [FromKeyedServices("AzureOpenAI")] IChatClient cloud,
        [FromKeyedServices("Ollama")] IChatClient local)
    {
        _cloud = cloud;
        _local = local;
    }
    public async Task<string> AskAsync(string prompt, bool sensitive, CancellationToken cancellationToken)
    {
        // Sensitive prompts stay on the local model; everything else goes to the cloud.
        var client = sensitive ? _local : _cloud;
        var response = await client.GetResponseAsync(prompt, cancellationToken: cancellationToken);
        return response.Text;
    }
}
With this design, your application logic never changes. It simply asks HybridChatService for a completion, and the decision about local versus cloud is handled by policy.
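Using it is then a one-liner per call site. A hypothetical example (the prompts are purely illustrative):
// Resolve the service and route based on sensitivity.
var chat = app.Services.GetRequiredService<HybridChatService>();
// Goes to GPT-5 in the cloud.
var publicAnswer = await chat.AskAsync(
    "Compare records and classes in C#.", sensitive: false, CancellationToken.None);
// Stays on the local Llama 3.1 model.
var privateAnswer = await chat.AskAsync(
    "Summarise this patient note: ...", sensitive: true, CancellationToken.None);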
Streaming Responses
One benefit of Microsoft.Extensions.AI is built-in support for token streaming. This matters if you want your UI to behave like ChatGPT’s interface, displaying output as it arrives:
await foreach (var update in _cloud.GetStreamingResponseAsync("Write a poem about Dublin", cancellationToken: cancellationToken))
{
    // Each update carries the next chunk of generated text.
    Console.Write(update.Text);
}
This works identically regardless of whether you are talking to GPT-5 in the cloud or a local Ollama process.
Error Handling and Resilience
Another advantage of adopting Microsoft.Extensions.* is consistent resilience. Just as IHttpClientFactory supports retries, backoff, and circuit breakers, you can compose AI clients with the same patterns, for example by wrapping calls in a Polly v8 pipeline:
using Polly;
using Polly.Retry;
// Microsoft.Extensions.AI has no dedicated AddResilience extension, but a
// Polly v8 pipeline composes cleanly around any IChatClient call.
var retry = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromSeconds(2)
    })
    .Build();
// Wrap a chat call (client is any resolved IChatClient).
var response = await retry.ExecuteAsync(
    async ct => await client.GetResponseAsync(prompt, cancellationToken: ct),
    cancellationToken);
This ensures that transient failures, whether a network blip on the way to Azure or an overloaded local runtime, don’t immediately break your application.
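You can push the same idea further and fall back to the local model when the cloud call fails outright. A sketch of a hypothetical method you might add to HybridChatService, not a library feature:
// Hypothetical addition to HybridChatService: degrade to local on cloud failure.
public async Task<string> AskWithFallbackAsync(string prompt, CancellationToken cancellationToken)
{
    try
    {
        var response = await _cloud.GetResponseAsync(prompt, cancellationToken: cancellationToken);
        return response.Text;
    }
    catch (Exception ex) when (ex is not OperationCanceledException)
    {
        // Cloud unavailable or rate-limited: answer from the local model instead.
        var response = await _local.GetResponseAsync(prompt, cancellationToken: cancellationToken);
        return response.Text;
    }
}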
When Local is Pointless
It’s worth being blunt: if you are an independent developer with no compliance requirements, then GPT-5 is the better choice almost every time. Running an 8B or even 70B parameter model locally on consumer hardware will not give you the same level of reasoning.
But the calculation changes when:
You need answers within 50ms on an edge device with no internet.
You cannot legally send text off-premises.
You want predictable cost curves for millions of daily completions.
For these cases, raw quality is not the only thing that matters.
The gap between open and closed models is narrowing. Llama 3.1 405B already rivals GPT-4-Turbo in benchmarks, and open models are improving at a pace that makes local deployment viable for more workloads each year. In a year or two, the difference between local and cloud may be more about scale and fine-tuning than raw capability. The point of Microsoft.Extensions.AI is to ensure your .NET applications are insulated from these shifts: whether GPT-6 or Llama-4 is the winner, your code should not need to change.
Running a local LLM is not about outperforming GPT-5; it’s about meeting constraints that the cloud cannot solve. Privacy, cost, offline capability, and compliance drive the decision. The real power emerges when you design a hybrid architecture: local when you must, GPT-5 when you can.
By adopting Microsoft.Extensions.AI, you gain the ability to swap providers without rewriting your application, use the same streaming and resilience primitives across both, and prepare for a future where the line between local and cloud models is increasingly blurred.
If you are experimenting today, try setting up an Ollama instance alongside your Azure OpenAI subscription and wire them both into your services. The best way to understand the trade-offs is to experience them in code.