Insights & Thinking

Practical thinking on enterprise AI

Written from production experience, not theory. Every piece is grounded in real deployments in regulated environments.

AI Cost Optimisation · March 2026 · 7 min read

How We Cut AI Infrastructure Cost by 93% in Production

Azure AI Foundry pricing at scale will surprise most organisations. Here is the exact architecture decision that changed everything.

When you process a few hundred documents a month through a managed LLM API, the cost is negligible. When you scale to 5,000+ documents per hour, the economics break entirely.

That was the situation at a major UK life insurer. Their AI infrastructure bill was tracking toward £2M annually and growing linearly with volume. The CFO flagged it. The CTO needed a solution that didn't compromise on performance or data governance.

The problem with the obvious answers

The simple responses — switch to a cheaper model, reduce usage, negotiate a volume discount — didn't hold up:

  • Cheaper managed models introduced accuracy risk on underwriting decisions carrying significant liability
  • Reducing usage meant the business value case collapsed
  • Volume discounts don't change the fundamental economics at scale

The real answer was structural, not transactional.

The hybrid routing architecture

The insight was that not all LLM requests are equal. Some require low latency and high accuracy — they must use the best available model. Others are high-volume, tolerant of slightly higher latency, and don't justify managed API pricing.

We built an intelligent routing layer that classifies each request at inference time and directs it to the optimal endpoint:

  • Latency-sensitive, low-volume queries → Azure AI Foundry (GPT-4o)
  • High-volume batch processing → Self-hosted GPU cluster (Llama 3.x, Qwen 2.5 on AKS with the KAITO operator)
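A routing layer of this shape can be sketched in a few lines. Everything below is illustrative: the endpoint names, thresholds, and classification heuristic are placeholders, not the patent-pending implementation described in the article.

```python
from dataclasses import dataclass

# Illustrative thresholds -- the real classifier is proprietary.
LATENCY_SENSITIVE_MS = 2_000   # requests needing a faster response than this
BATCH_MIN_DOCS = 100           # batches this large favour the GPU cluster

@dataclass
class LLMRequest:
    doc_count: int          # documents in this request/batch
    max_latency_ms: int     # caller's latency budget
    high_stakes: bool       # e.g. underwriting decisions carrying liability

def route(req: LLMRequest) -> str:
    """Pick an endpoint for a request at inference time.

    High-stakes or latency-sensitive work goes to the managed API;
    latency-tolerant, high-volume work goes to the self-hosted cluster.
    """
    if req.high_stakes or req.max_latency_ms < LATENCY_SENSITIVE_MS:
        return "azure-ai-foundry"   # GPT-4o, managed
    if req.doc_count >= BATCH_MIN_DOCS:
        return "aks-gpu-cluster"    # Llama 3.x / Qwen 2.5 via KAITO
    return "azure-ai-foundry"       # small jobs default to managed

# A bulk, latency-tolerant batch lands on the self-hosted cluster:
print(route(LLMRequest(doc_count=5_000, max_latency_ms=60_000, high_stakes=False)))
```

The key design choice is that classification happens per request at inference time, so the expensive managed endpoint is reserved for the traffic that actually justifies it.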

The self-hosted cluster runs on NC48ads A100 v4 nodes within the organisation's own Azure estate. No data leaves the perimeter. Full FCA compliance maintained.

The numbers

The cost reduction was 93%. Not 20%, not 40% — 93%. Annualised saving: over £1.8M. Throughput achieved: 47,637 requests per hour.

The routing architecture is now patent-pending. It has been adopted as the organisation's standard for all LLM deployments.

What this means for your organisation

If you are running AI workloads at scale on managed APIs, you almost certainly have a cost problem you haven't fully quantified yet. The break-even point for self-hosted infrastructure is lower than most engineering teams assume — typically around 50,000 tokens per day.

The calculation isn't complicated. The architecture isn't exotic. What's missing, usually, is someone who has done it in a regulated production environment and can de-risk the transition.
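The shape of that calculation can be sketched as follows. All prices and throughput figures here are hypothetical placeholders for illustration, not quotes or the engagement's actual numbers.

```python
# Illustrative monthly cost comparison: managed API vs self-hosted GPU.
# Every constant below is an assumption, not a real price.
API_COST_PER_1K_TOKENS = 0.01      # blended input/output rate, hypothetical
GPU_NODE_COST_PER_HOUR = 15.0      # A100-class node rate, hypothetical
GPU_TOKENS_PER_HOUR = 2_000_000    # sustained cluster throughput, hypothetical

def monthly_cost_managed(tokens_per_day: int) -> float:
    """Managed API cost scales linearly with token volume."""
    return tokens_per_day * 30 / 1_000 * API_COST_PER_1K_TOKENS

def monthly_cost_self_hosted(tokens_per_day: int) -> float:
    """Self-hosted cost scales with GPU-hours needed to serve the volume."""
    hours_needed = tokens_per_day * 30 / GPU_TOKENS_PER_HOUR
    return hours_needed * GPU_NODE_COST_PER_HOUR

for tokens_per_day in (1_000_000, 50_000_000, 500_000_000):
    managed = monthly_cost_managed(tokens_per_day)
    hosted = monthly_cost_self_hosted(tokens_per_day)
    print(f"{tokens_per_day:>12,} tok/day  managed £{managed:>10,.0f}  "
          f"self-hosted £{hosted:>10,.0f}")
```

In practice the self-hosted side also carries fixed costs (reserved nodes, operations, compliance overhead), which is why the break-even point depends on sustained volume rather than peak volume.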

AI Governance · 6 min read

The 5 Questions Your Board Will Ask About AI — and How to Answer Them

Your legal team isn't blocking AI because they don't understand it. They're blocking it because your technical team can't answer five specific questions.

AI Governance · 5 min read

Shadow AI in Regulated Organisations: The Risk You're Not Measuring

Employees at regulated firms are using consumer AI tools with production data. Most compliance teams know. Almost none have quantified the exposure.

AI Architecture · 8 min read

Principal-Led, Agent-Augmented: Why the Boutique Advisory Model Is Changing

The answer to 'you're just one person' is no longer what it used to be. Here is what AI-augmented delivery actually means in a regulated advisory context.

More on LinkedIn

Dr. Sam Arora publishes regularly on AI governance, production LLM architecture, and cost optimisation for regulated industries.

Follow on LinkedIn