The cleanest architectural decisions aren't the ones where you evaluate all the options and pick the best one. They're the ones where a constraint eliminates the bad options and forces you to build something more durable than unconstrained choice would have produced. Compliance requirements did that here — and the result was a better design than I would have chosen on my own.
Key Tradeoffs
Building a RAG system on a large-scale compliance platform meant one requirement was non-negotiable from the start: confidential information and PII could not leave the network. That single constraint made the obvious architectural choice unavailable.
OpenAI's embedding API was off the table entirely. Every document, code file, and configuration item passed through the embedding pipeline would have crossed a network boundary into a third-party service. For a platform handling sensitive compliance data across multiple jurisdictions, that wasn't a risk to evaluate — it was a line not to cross.
Azure OpenAI was a more viable path. Data stays within your Azure tenant, which satisfies many compliance requirements. But it still represented a cloud dependency — infrastructure outside direct control, subject to service availability, pricing changes, and the evolving compliance guidance that shapes what's acceptable in regulated environments.
Ollama running locally eliminated the cloud dependency entirely. All embedding happened on-premises, on hardware we controlled, with no data leaving the network. It was the option that most directly satisfied the requirement.
It also came with three costs that deserved honest accounting.
Embedding quality. Local models aren't equivalent to OpenAI's best offerings. The gap is measurable — it shows up in retrieval precision, in the relevance of what surfaces when an engineer asks a complex question about system behavior. For a system whose entire value is the quality of what it retrieves, that gap matters. It's not a dealbreaker, but it's a real cost that shapes what the system can deliver.
Inference latency. Local inference is slower than a well-provisioned API call, particularly under indexing load. When the system is processing a large batch — a full Confluence export, a significant codebase update — the throughput is lower than it would be with a hosted provider. This affects how quickly the index stays current after a major change.
Operational complexity. Running Ollama in production adds infrastructure overhead that a hosted API doesn't. Model management, hardware provisioning, the failure modes that come with running ML inference locally — none of these exist when you're making an HTTP call to someone else's service. They're manageable, but they're real, and they represent ongoing maintenance cost that a hosted solution would have absorbed.
These weren't theoretical tradeoffs evaluated on a whiteboard. They're the costs that showed up in production and had to be managed. Honest accounting upfront is what made the decision defensible — and what drove the design that followed.
Rather than accepting lock-in to either option, I built an abstraction layer that made the embedding provider a configuration decision rather than an architectural one.
The pipeline routes through TrackingEmbeddingService — which wraps usage metrics around any provider — and optionally through RoundRobinEmbeddingService for multi-endpoint load balancing. The concrete implementations, OllamaEmbeddingService and OpenAiEmbeddingService, sit behind the same interface. Swapping providers doesn't touch the indexing pipeline, the query path, or the storage layer. It changes a config value.

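A minimal sketch of that layering, assuming a single `embed()` method on the shared interface (the class names match the services described above; the method signature, stub bodies, and endpoint URLs are illustrative assumptions, not the platform's actual code):

```typescript
// Shared interface: everything upstream (indexing, query path, storage)
// depends only on this contract.
interface EmbeddingService {
  embed(texts: string[]): Promise<number[][]>;
}

// Concrete providers sit behind the same interface. Bodies are stubbed
// here; the real implementations would call Ollama or Azure OpenAI.
class OllamaEmbeddingService implements EmbeddingService {
  constructor(private endpoint: string) {}
  async embed(texts: string[]): Promise<number[][]> {
    return texts.map(() => [0]); // stub: would POST to the local Ollama endpoint
  }
}

class OpenAiEmbeddingService implements EmbeddingService {
  async embed(texts: string[]): Promise<number[][]> {
    return texts.map(() => [0]); // stub: would call the Azure OpenAI embeddings API
  }
}

// Wraps any provider with usage metrics.
class TrackingEmbeddingService implements EmbeddingService {
  textsEmbedded = 0;
  constructor(private inner: EmbeddingService) {}
  async embed(texts: string[]): Promise<number[][]> {
    this.textsEmbedded += texts.length;
    return this.inner.embed(texts);
  }
}

// Distributes successive batches across multiple endpoints.
class RoundRobinEmbeddingService implements EmbeddingService {
  private next = 0;
  constructor(private endpoints: EmbeddingService[]) {}
  async embed(texts: string[]): Promise<number[][]> {
    const chosen = this.endpoints[this.next];
    this.next = (this.next + 1) % this.endpoints.length;
    return chosen.embed(texts);
  }
}

// Provider selection is a config value, not an architectural decision.
function buildEmbeddingService(provider: "ollama" | "openai"): EmbeddingService {
  const base =
    provider === "ollama"
      ? new RoundRobinEmbeddingService([
          new OllamaEmbeddingService("http://gpu-node-1:11434"),
          new OllamaEmbeddingService("http://gpu-node-2:11434"),
        ])
      : new OpenAiEmbeddingService();
  return new TrackingEmbeddingService(base);
}
```

Because the tracking and round-robin wrappers implement the same interface they wrap, they compose freely: metrics and load balancing apply to any provider without that provider knowing about them.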
The result is a system that can run fully on-premises with Ollama, fully on Azure OpenAI, or split across both — with different providers for different workloads if the compliance posture or performance requirements call for it. As models improve, as compliance guidance evolves, as Azure OpenAI achieves certification for additional regulatory frameworks, the system adapts without an architectural change.
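Configuration along those lines might look like this (a hypothetical sketch; the key names are illustrative, not the platform's actual schema):

```yaml
embedding:
  provider: ollama            # or: openai
  ollama:
    endpoints:                # round-robin pool of on-premises nodes
      - http://gpu-node-1:11434
      - http://gpu-node-2:11434
  workloads:                  # per-workload overrides, if posture allows
    confidential-corpus:
      provider: ollama        # never leaves the network
    public-docs:
      provider: openai        # hosted quality where data sensitivity permits
```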
Compliance drove the requirement. The abstraction made it livable — and made the system more adaptable than it would have been if the decision had been made under no pressure at all.
The instinct when a compliance requirement narrows your options is to treat it as a limitation. That framing is wrong.
The abstraction layer I built because I had to is the same abstraction layer I would eventually have gotten around to building if I'd had unlimited time and no external pressure. The difference is that the compliance requirement made it non-negotiable from the start. The pressure to satisfy the constraint produced a design that was more deliberate, more portable, and more durable than the path of least resistance would have created.
This is a pattern worth recognizing. The best architectural decisions often look obvious in retrospect — and they're frequently obvious because a constraint forced the design that should have existed from the beginning. A security requirement that forces proper isolation. A compliance rule that prevents the shortcut that would have become technical debt. A regulatory boundary that demands the abstraction that makes the system maintainable.
Constraints eliminate options. The options they tend to eliminate are the ones you'd have regretted anyway.
Constraints don't limit good architecture. Sometimes they're the reason it gets built.