Cost, latency or throughput: there is no "best model"
"Which model is best?" is the wrong question. We swept every model×provider route on BharatRouter through the production gateway (2026-07-02) and the answer splits three ways: the cheapest model, the snappiest model and the fastest-streaming model are not the same model. The right question is "best for what?" — and that has a measurable answer.
Three lenses, 2 winners
| Lens | Winner right now | Measured |
|---|---|---|
| Lowest ₹ / task | llama-3.1-8b-instruct | ₹3/Mtok blended |
| First token, fastest | llama-3.1-8b-instruct | 428 ms via groq |
| Sustained stream, fastest | gpt-oss-20b | 606 tok/s |
Why the split? ₹/task is a rate-card fact: small open models on commodity hosts win. Time to first token (TTFT) is an infrastructure fact: it rewards providers running custom inference silicon or aggressive batching, whatever the model. Sustained tok/s rewards a third thing: decode-optimized serving of big models. No single model tops all three podiums, and the podiums reshuffle whenever providers re-price or re-platform, which now happens roughly monthly.
So pick by workload, not by leaderboard
What your application feels is a weighted mix of the three lenses, and the weights are a property of the workload:
- Coding agent. Agent loops make dozens of short, tool-calling turns per task — time to first token dominates how fast the agent feels, throughput matters for big diffs, and cost adds up across the loop. Reasoning support is required. Top picks from the current sweep: gpt-oss-120b, qwen3-32b, gpt-oss-20b.
- Chat assistant. A user is watching the screen: first token wins or loses the experience. Answers are short-to-medium, so sustained throughput matters less than snap. Top picks from the current sweep: llama-3.1-8b-instruct, gpt-oss-120b, qwen3-32b.
- Bulk extraction / classification. Nobody is watching a batch job — ₹ per task is nearly everything, throughput sets how long the batch takes, and TTFT is irrelevant. Top picks from the current sweep: llama-3.1-8b-instruct, gpt-oss-20b, glm-4.7-flash.
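The "weighted mix of the three lenses" can be sketched as a small scoring function. This is only an illustration, not the ranking that /compare uses: the weights, the workload keys other than coding-agent, the `Route` type and the placeholder numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    inr_per_mtok: float  # blended price, lower is better
    ttft_ms: float       # time to first token, lower is better
    tok_per_s: float     # sustained decode rate, higher is better


# Hypothetical weights as (cost, ttft, throughput); each row sums to 1.
WEIGHTS = {
    "coding-agent": (0.2, 0.5, 0.3),
    "chat-assistant": (0.2, 0.7, 0.1),
    "bulk-extraction": (0.8, 0.0, 0.2),
}


def rank(routes, workload):
    """Return routes sorted best-first for the given workload."""
    w_cost, w_ttft, w_tps = WEIGHTS[workload]
    best_cost = min(r.inr_per_mtok for r in routes)
    best_ttft = min(r.ttft_ms for r in routes)
    best_tps = max(r.tok_per_s for r in routes)

    def score(r):
        # Each ratio is 1.0 for the best route on that lens and smaller otherwise.
        return (
            w_cost * best_cost / r.inr_per_mtok
            + w_ttft * best_ttft / r.ttft_ms
            + w_tps * r.tok_per_s / best_tps
        )

    return sorted(routes, key=score, reverse=True)


# Placeholder numbers for illustration only, NOT measurements from the sweep.
ROUTES = [
    Route("model-a", inr_per_mtok=5.0, ttft_ms=400.0, tok_per_s=150.0),
    Route("model-b", inr_per_mtok=20.0, ttft_ms=900.0, tok_per_s=600.0),
    Route("model-c", inr_per_mtok=10.0, ttft_ms=200.0, tok_per_s=300.0),
]

if __name__ == "__main__":
    for workload in WEIGHTS:
        print(workload, [r.model for r in rank(ROUTES, workload)])
```

With these placeholder inputs the cheap route leads for bulk work and the low-TTFT route leads for chat, which is the pattern the bullets describe: the same catalog, reordered by the weights.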
We turned this into a picker: bharatrouter.com/compare ranks the whole catalog per use case — 7 presets from coding agents to Indic-language work — using measured medians, and shows every model under all three lenses so you can see what a choice costs you on the axes you didn't optimize.
Agents get the same chooser
The page is a view over a public endpoint — your agent can ask the same question programmatically and re-ask it monthly as the podium moves:
```
GET /v1/compare/models   # no auth

{
  "use_cases": [ { "id": "coding-agent", "ranked": [ ... ] }, ... ],
  "data": [ { "id": "glm-5.2", "routes": [ { "provider": "...",
      "pricing_inr_per_mtok": {...}, "perf": {...} } ] } ]
}
```
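A minimal sketch of reading that response from Python, using only the fields shown above (use_cases[].id, use_cases[].ranked, data[].id, data[].routes[].provider). The `base_url` argument is an assumption, since the post gives the path but not the API host, and the entries inside `ranked` are treated as opaque because their shape is not shown here.

```python
import json
import urllib.request


def fetch_compare(base_url):
    """GET /v1/compare/models (no auth) and return the decoded JSON.

    base_url is an assumption; pass whatever host serves the API.
    """
    url = base_url.rstrip("/") + "/v1/compare/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def ranked_for(payload, use_case_id):
    """Return the 'ranked' list for one use case, or [] if it is absent."""
    for use_case in payload.get("use_cases", []):
        if use_case.get("id") == use_case_id:
            return use_case.get("ranked", [])
    return []


def providers_by_model(payload):
    """Map each model id to the providers listed under its routes."""
    return {
        model["id"]: [route.get("provider") for route in model.get("routes", [])]
        for model in payload.get("data", [])
    }
```

Keeping the parsing separate from the fetch means an agent can cache the payload and re-query it per sub-task, then refetch when the next sweep lands.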
There is also a choose_model tool on the BharatRouter MCP server, so an agent picking a model for a sub-task can do it with one tool call, then route the actual request with optimize: "price" | "latency" | "throughput" and let the gateway pick the winning host per request, with failover when a host blips.
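As a rough sketch of how an agent might choose the optimize value per workload: the three allowed values come from the text above, but the workload labels, the mapping, and the placement of optimize as a top-level key in the request body are assumptions, not documented API.

```python
# The three values named in the post.
OPTIMIZE_VALUES = ("price", "latency", "throughput")

# Hypothetical mapping, following the workload descriptions earlier in the post:
# interactive work leans on first-token latency, batch work leans on price.
WORKLOAD_TO_OPTIMIZE = {
    "coding-agent": "latency",
    "chat-assistant": "latency",
    "bulk-extraction": "price",
}


def with_optimize(body, optimize):
    """Return a copy of a request body with the optimize preference set.

    Assumes optimize is a top-level field of the request body; check the
    gateway's API reference for the real placement.
    """
    if optimize not in OPTIMIZE_VALUES:
        raise ValueError(f"optimize must be one of {OPTIMIZE_VALUES}, got {optimize!r}")
    return {**body, "optimize": optimize}
```

Validating the value on the client keeps a typo from silently falling back to whatever default the gateway applies.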
In the open
Method: multi-run streamed medians per (model × provider) route via the production gateway on 2026-07-02, rounds interleaved across hosts, identical ~300-token prompt; TTFT counts the first content or reasoning token; tok/s is the post-first-token decode rate. Routes we couldn't measure (missing org key, provider billing pause) are listed in the API response, never silently dropped. Numbers refresh with every sweep; this page pulls them from the same data file as /compare, so the post and the picker can't disagree.
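The two timing metrics can be sketched from per-chunk arrival times. This is one reading of the method, not the sweep's actual harness: it assumes each run is recorded as a send time plus a list of (arrival time, token count) pairs for chunks carrying content or reasoning tokens, and it takes "post-first-token decode rate" to mean tokens after the first chunk divided by the time from first to last chunk.

```python
from statistics import median


def run_metrics(sent_at, chunks):
    """Return (ttft_ms, tok_per_s) for one streamed run.

    sent_at: request send time in seconds.
    chunks: [(arrival_time_s, n_tokens), ...] in arrival order, counting
            content and reasoning tokens alike.
    """
    if not chunks:
        raise ValueError("no tokens streamed")
    first_t = chunks[0][0]
    last_t = chunks[-1][0]
    ttft_ms = (first_t - sent_at) * 1000.0
    # Decode rate excludes the first chunk, so queueing and prefill time
    # do not leak into the throughput figure.
    decode_tokens = sum(n for _, n in chunks[1:])
    decode_s = last_t - first_t
    tok_per_s = decode_tokens / decode_s if decode_s > 0 else None
    return ttft_ms, tok_per_s


def route_medians(runs):
    """Median TTFT and decode rate across several runs of one route.

    runs: [(sent_at, chunks), ...]
    """
    metrics = [run_metrics(sent_at, chunks) for sent_at, chunks in runs]
    ttfts = [m[0] for m in metrics]
    rates = [m[1] for m in metrics if m[1] is not None]
    return median(ttfts), (median(rates) if rates else None)
```

Medians rather than means keep one slow run (a cold start, a transient queue) from dominating the figure reported for a route.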