How a lightweight “traffic‑cop” model keeps your enterprise chatbots from hallucinating — while finally giving mini‑models a job they’re great at.

TL;DR Retrieval‑Augmented Generation (RAG) grounds LLMs in private data, but it can still inject irrelevant or noisy snippets that confuse downstream tools and users. AI Routing adds a thin, fast model in front of your RAG pipeline to decide which collections, databases, or micro‑capabilities (“MCPs”) a query truly needs. The result: leaner context windows, cleaner tool calls, and more reliable enterprise answers, with a target latency under 100 ms after a few optimizations.

Razroo’s AI Router, currently at 640 ms latency

The Problem: RAG Is Powerful — but Indiscriminate

RAG’s promise is simple: combine the reasoning strength of an LLM with the factual grounding of your knowledge base. In practice, though, a vanilla RAG pipeline often:

  • Over‑retrieves: pulls 50 passages when the answer lives in two.
  • Mixes domains: blends HR policy with product docs because both mention “vacation.”
  • Clogs tool calls: bloated context prompts push function‑calling JSON over 16 k tokens.

In an enterprise setting — think compliance, analytics, or critical decision support — those mistakes translate to hallucinations, wrong dashboards, or mis‑routed tasks. Stakeholders lose trust fast.

Enter AI Routing

AI Routing inserts a micro‑model, small enough to run on CPU, in front of your retriever. Its job has three steps:

  1. Classify the incoming query (intent, domain, sensitivity).
  2. Select the minimal set of data collections / DB tables / MCPs required.
  3. Inject only those URIs or indices into the RAG prompt.

If no collection passes a confidence threshold, the router can even return a “no‑answer” token, preventing garbage retrieval.
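The selection step and the “no‑answer” fallback can be sketched as follows. This is a minimal illustration, not the router's actual code: it assumes the mini‑model emits one independent score per collection, and the names `RouteDecision` and `select_collections`, the 0.5 threshold, and the cap of three collections are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    """Hypothetical container for the router's verdict."""
    collections: tuple[str, ...]
    no_answer: bool


def select_collections(
    scores: dict[str, float],
    threshold: float = 0.5,    # assumed cut-off; tune on held-out queries
    max_collections: int = 3,  # assumed cap to keep the context window lean
) -> RouteDecision:
    """Keep the highest-scoring collections at or above the threshold."""
    passing = sorted(
        ((name, score) for name, score in scores.items() if score >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    chosen = tuple(name for name, _ in passing[:max_collections])
    # An empty selection is the "no-answer" case: skip retrieval entirely.
    return RouteDecision(collections=chosen, no_answer=not chosen)
```

Calling `select_collections({"SLA_DB": 0.91, "HR": 0.08})` keeps only `SLA_DB`, while a query whose scores all fall below the threshold comes back with `no_answer=True`, so the retriever is never invoked.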

Why Mini Models Shine Here

  • Low latency (<50 ms even on commodity instances).
  • Narrow focus reduces the need for gigantic parameter counts.
  • Easier fine‑tuning on small, labeled routing datasets.

In my own deployments, a distilled‑MiniLM variant scored >92 % F1 on routing accuracy and brought overall answer relevance up by 17 pp.
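For readers who want to measure routing quality the same way, here is one way to compute F1 over multi‑label routing decisions. The figure above does not say which averaging was used, so micro‑averaging is an assumption here, and `micro_f1` is a hypothetical helper name.

```python
from __future__ import annotations


def micro_f1(predicted: list[set[str]], expected: list[set[str]]) -> float:
    """Micro-averaged F1 across queries, each routed to a set of collections."""
    tp = fp = fn = 0
    for pred, gold in zip(predicted, expected):
        tp += len(pred & gold)   # collections correctly selected
        fp += len(pred - gold)   # collections selected but not needed
        fn += len(gold - pred)   # collections needed but missed
    denominator = 2 * tp + fp + fn
    # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there is nothing to score.
    return 2 * tp / denominator if denominator else 0.0
```

Micro‑averaging weights every routing decision equally, so a frequently used collection influences the score more than a rare one. Macro‑averaging per collection would be the alternative if rare collections matter just as much.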

Architecture at a Glance

flowchart LR
    A[User Query] --> B["AI Router (Mini Model)"]
    B -->|"collections[]"| C[Vector Retriever]
    C --> D[LLM + Tool Calling]
    D --> E[Response]

  • Router (32–128 M params, CPU)
  • Retriever (FAISS / Elastic KNN / PGVector / Qdrant)
  • LLM (GPT‑4o, Claude‑3, etc.)
  • MCP layer (Function‑calling micro‑services)

Performance Tuning: From 640 ms to Sub‑100 ms

My baseline implementation clocked ≈640 ms end‑to‑end. Here’s the shaving plan:

  1. Vectorization cache for repeat queries (−120 ms).
  2. Indexed routing outputs (think intent→collection map; −90 ms).
  3. Asynchronous prefetch of top‑k docs while the router streams logits (−70 ms).
  4. Quantization + batch‑size 4 on the mini‑model (−110 ms).

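Target: 80–100 ms median latency, fast enough for chat UIs and background tool calls alike. The four savings listed above sum to 390 ms, which takes the 640 ms baseline to roughly 250 ms, so hitting the target requires further gains beyond this list.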

Implementation Blueprint

  1. Label 1–2 k queries with correct collections/MCPs.
  2. Fine‑tune a MiniLM/BERT‑base using multi‑label classification.
  3. Wrap the router in an API that emits JSON:
{
  "collections": ["ProdDocs", "PricingDB"],
  "tools": ["QuoteGenerator"]
}

  4. Gate the retriever + LLM behind the router’s output.
  5. Log + retrain monthly; routing drift is slow but real.
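Steps 3 and 4 can be sketched together: parse the router’s JSON, discard anything that is not on an allow‑list, and only then touch the retriever. This is a rough sketch under assumptions. The allow‑lists, the function names, and the `search(collection, query)` callback signature are hypothetical and would be replaced by your own retriever client.

```python
from __future__ import annotations

import json
from typing import Callable

# Hypothetical allow-lists; in practice these come from your own registry.
KNOWN_COLLECTIONS = {"ProdDocs", "PricingDB", "SLA_DB"}
KNOWN_TOOLS = {"QuoteGenerator", "CreateTicket"}


def parse_router_output(raw: str) -> dict[str, list[str]]:
    """Parse the router's JSON and drop names that are not on the allow-lists."""
    payload = json.loads(raw)
    return {
        "collections": [c for c in payload.get("collections", []) if c in KNOWN_COLLECTIONS],
        "tools": [t for t in payload.get("tools", []) if t in KNOWN_TOOLS],
    }


def gated_retrieve(
    raw_router_json: str,
    query: str,
    search: Callable[[str, str], list[str]],  # assumed retriever callback
) -> list[str]:
    """Query only the collections the router selected; skip retrieval if none."""
    route = parse_router_output(raw_router_json)
    passages: list[str] = []
    for collection in route["collections"]:
        passages.extend(search(collection, query))
    return passages
```

Validating against an allow‑list means a malformed or unexpected router output degrades to “retrieve nothing” rather than querying a collection that was never intended to be reachable.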

Real‑World Example

Query: “What’s the SLA for our EU‑West region and how do I file an escalation?”

| Pipeline Stage | Outcome |
| --- | --- |
| Router | Picks SLA_DB, Escalation_Playbook collections; ignores Marketing, HR, etc. |
| Retriever | Returns 3 SLA docs + 1 escalation SOP (total 620 tokens). |
| LLM | Generates a concise answer plus auto‑calls the CreateTicket MCP with correct priority. |

Without routing, the retriever previously stuffed 18 docs (3 k tokens) into the prompt, and the LLM called the wrong ticketing function. ☠️

Why This Matters for Enterprise AI

  • Compliance: Fewer off‑topic docs mean lower exposure of restricted data.
  • Cost: Slimmer prompts cut token spend by 30–50 %.
  • Trust: Business users see crisp, reliable answers, not hallucinated mash‑ups.

Looking Ahead

  • Graph‑based routing to capture inter‑collection dependencies.
  • Self‑optimizing indices that learn which vectors matter most per intent.
  • Router distillation into on‑device models (<10 M params) for edge apps.

Final Thought

AI Routing won’t replace RAG; it enables RAG to hit enterprise‑grade SLAs. By giving humble mini‑models a focused, high‑leverage role, we keep our large models honest — and our users happy.

Latency is just engineering; relevance is strategy.

Have you tried AI Routing in production? Share your wins (and war stories) in the comments.