
Cost, latency or throughput: there is no "best model"

"Which model is best?" is the wrong question. We swept every model×provider route on BharatRouter through the production gateway (2026-07-02) and the answer splits three ways: the cheapest model, the snappiest model and the fastest-streaming model are not the same model. The right question is "best for what?" — and that has a measurable answer.

Three lenses, two winners

| Lens | Winner right now | Measured |
| --- | --- | --- |
| Lowest ₹ / task | llama-3.1-8b-instruct | ₹3/Mtok blended |
| First token, fastest | llama-3.1-8b-instruct | 428 ms via groq |
| Sustained stream, fastest | gpt-oss-20b | 606 tok/s |

Why the split? ₹/task is a rate-card fact: small open models on commodity hosts win. Time to first token is an infrastructure fact: it rewards providers running custom inference silicon or aggressive batching, whatever the model. Sustained tok/s rewards a third thing: decode-optimized serving of big models. No single model sits on all three podiums, and the podium reshuffles whenever providers re-price or re-platform, which now happens about monthly.

So pick by workload, not by leaderboard

What your application feels is a weighted mix of the three lenses, and the weights are a property of the workload.

We turned this into a picker: bharatrouter.com/compare ranks the whole catalog per use case — 7 presets from coding agents to Indic-language work — using measured medians, and shows every model under all three lenses so you can see what a choice costs you on the axes you didn't optimize.
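To make "a weighted mix of the three lenses" concrete, here is a minimal Python sketch. It is not the picker's actual scoring, which this post does not document. The `Route` type, the min-max normalisation, the weight presets and every number in `ROUTES` are hypothetical placeholders chosen for illustration; only the three lens names come from the post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    inr_per_mtok: float  # price lens, lower is better
    ttft_ms: float       # latency lens, lower is better
    tok_per_s: float     # throughput lens, higher is better


def normalise(values, higher_is_better):
    """Map values onto [0, 1], where 1 is the best value in the list."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    if higher_is_better:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def rank(routes, weights):
    """Return model names ordered by a weighted sum of per-lens scores."""
    price = normalise([r.inr_per_mtok for r in routes], higher_is_better=False)
    ttft = normalise([r.ttft_ms for r in routes], higher_is_better=False)
    speed = normalise([r.tok_per_s for r in routes], higher_is_better=True)
    scored = [
        (weights["price"] * p + weights["latency"] * t + weights["throughput"] * s, r.model)
        for r, p, t, s in zip(routes, price, ttft, speed)
    ]
    return [model for _, model in sorted(scored, reverse=True)]


# Placeholder numbers for illustration only, NOT measurements from the sweep.
ROUTES = [
    Route("model-a", inr_per_mtok=3.0, ttft_ms=400.0, tok_per_s=150.0),
    Route("model-b", inr_per_mtok=20.0, ttft_ms=900.0, tok_per_s=600.0),
    Route("model-c", inr_per_mtok=60.0, ttft_ms=1500.0, tok_per_s=80.0),
]

# Hypothetical workload presets: interactive chat cares about first token,
# a batch job cares about sustained decode speed.
CHAT = {"price": 0.2, "latency": 0.7, "throughput": 0.1}
BATCH = {"price": 0.2, "latency": 0.0, "throughput": 0.8}

if __name__ == "__main__":
    print("chat :", rank(ROUTES, CHAT))
    print("batch:", rank(ROUTES, BATCH))
```

With these placeholder inputs the chat preset puts `model-a` first and the batch preset puts `model-b` first: the same catalog, ranked differently once the weights change.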

Agents get the same chooser

The page is a view over a public endpoint — your agent can ask the same question programmatically and re-ask it monthly as the podium moves:

GET /v1/compare/models      # no auth
{
  "use_cases": [ { "id": "coding-agent", "ranked": [ ... ] }, ... ],
  "data":      [ { "id": "glm-5.2", "routes": [ { "provider": "...",
                   "pricing_inr_per_mtok": {...}, "perf": {...} } ] } ]
}
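A minimal Python sketch of how a client might read that response. It relies only on the keys visible in the sample above (`use_cases`, `id`, `ranked`, `data`, `routes`, `provider`). The API base URL is not stated in this post, so `base_url` is left as a parameter, and because the contents of `ranked`, `pricing_inr_per_mtok` and `perf` are elided above, the helpers pass those values through without assuming their shape. `SAMPLE` holds placeholder values for offline use.

```python
import json
import urllib.request


def fetch_compare(base_url):
    """GET {base_url}/v1/compare/models and decode the JSON body (no auth header)."""
    with urllib.request.urlopen(f"{base_url}/v1/compare/models", timeout=10) as resp:
        return json.load(resp)


def top_pick(payload, use_case_id):
    """Return the first entry of `ranked` for a use case, or None if absent."""
    for use_case in payload.get("use_cases", []):
        if use_case.get("id") == use_case_id:
            ranked = use_case.get("ranked") or []
            return ranked[0] if ranked else None
    return None


def routes_by_model(payload):
    """Map each model id to the provider names of its routes."""
    return {
        model["id"]: [route.get("provider") for route in model.get("routes", [])]
        for model in payload.get("data", [])
    }


# Offline sample in the documented shape; every value is a placeholder.
SAMPLE = {
    "use_cases": [{"id": "coding-agent", "ranked": ["entry-1", "entry-2"]}],
    "data": [
        {
            "id": "glm-5.2",
            "routes": [
                {"provider": "host-x", "pricing_inr_per_mtok": {}, "perf": {}}
            ],
        }
    ],
}

if __name__ == "__main__":
    print(top_pick(SAMPLE, "coding-agent"))
    print(routes_by_model(SAMPLE))
```

An agent that re-asks monthly would call `fetch_compare` on a schedule and pass the result to `top_pick`, rather than hard-coding a model name.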

There is also a choose_model tool on the BharatRouter MCP server, so an agent picking a model for a sub-task can do it with one tool call. It can then route the actual request with optimize: "price" | "latency" | "throughput" and let the gateway pick the winning host per request, with failover when a host blips.

In the open

Method: multi-run streamed medians per (model × provider) route via the production gateway on 2026-07-02, rounds interleaved across hosts, identical ~300-token prompt; TTFT counts the first content or reasoning token; tok/s is the post-first-token decode rate. Routes we couldn't measure (missing org key, provider billing pause) are listed in the API response, never silently dropped. Numbers refresh with every sweep; this page pulls them from the same data file as /compare, so the post and the picker can't disagree.
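The two timing definitions in the method can be sketched in a few lines of Python. This is an illustration of the definitions as stated, not the sweep harness itself: it assumes each run is recorded as a request start time plus one arrival timestamp per streamed token (in seconds), and the function names and `RUNS` values are hypothetical.

```python
from statistics import median


def ttft_ms(request_start, token_times):
    """Time to first token: first arrival minus request start, in milliseconds."""
    return (token_times[0] - request_start) * 1000.0


def decode_tok_per_s(token_times):
    """Post-first-token decode rate: tokens 2..n divided by the time they took."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    elapsed = token_times[-1] - token_times[0]
    if elapsed <= 0:
        raise ValueError("token timestamps must be increasing")
    return (len(token_times) - 1) / elapsed


def summarise(runs):
    """Median TTFT and median decode rate over several runs of one route.

    `runs` is a list of (request_start, token_times) pairs.
    """
    return {
        "ttft_ms": median(ttft_ms(start, times) for start, times in runs),
        "tok_per_s": median(decode_tok_per_s(times) for _, times in runs),
    }


# Placeholder timestamps for illustration only, NOT data from the sweep.
RUNS = [
    (0.0, [0.5, 0.75, 1.0, 1.25, 1.5]),
    (10.0, [10.25, 10.5, 10.75]),
    (20.0, [21.0, 21.125, 21.25, 21.375, 21.5]),
]

if __name__ == "__main__":
    print(summarise(RUNS))
```

Excluding the first token from the rate keeps queueing and prefill time out of the throughput figure, so the two lenses measure different things. Taking the median rather than the mean keeps one slow run from dominating either number.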
