How we benchmark — and why you can trust the numbers
A benchmark is only as good as its method. Before you act on our GLM-vs-frontier numbers, here's exactly how we produce them — every choice anchored to the eval literature, so you can audit it, re-run it, and disagree with it on the merits. These are interim results on a 14-task set; a broader, contamination-controlled benchmark is coming.
1. Correctness is execution-checked, not judged
We don't ask a model to grade another model. Every task ships assert-based unit tests, and a run passes only if its code passes every assert in a fresh subprocess. This is the functional-correctness tradition of MBPP (Austin et al., Google, 2021 — arXiv:2108.07732): judge generated code by whether it actually runs and passes tests. For the harder, repo-level end of the spectrum, the gold standard is SWE-bench (Jimenez, Yang, Press, Narasimhan et al., Princeton, ICLR 2024 — arXiv:2310.06770), which grades real GitHub fixes against the repository's own test suite. Our 14 tasks are the small, fast end of that idea — real tests, no opinions.
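As a rough illustration of "passes every assert in a fresh subprocess", a check along these lines would do the job. This is a minimal sketch, not the harness itself: the function name `passes_tests`, the 10-second timeout and the temp-file layout are all assumptions made for the example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True only if every assert passes in a fresh interpreter.

    Hypothetical helper: the name, timeout and file layout are illustrative.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_task.py"
        # Candidate solution first, then the task's assert-based tests.
        script.write_text(candidate_code + "\n\n" + test_code + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang is treated as a failed run in this sketch
    # A failed assert raises AssertionError, so the interpreter exits non-zero.
    return result.returncode == 0
```

Because each run starts a new interpreter, state left behind by one candidate cannot leak into the next.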
2. Accuracy, latency and cost — reported together
A model that's 1% more accurate but 10× slower and pricier is not automatically "better"; whether that trade is worth it depends on your workload. So we report accuracy, latency and ₹/$ cost as first-class axes on the same table, the way HELM argues evaluation should work (Liang, Bommasani, Lee et al., Stanford CRFM, 2022 / TMLR 2023 — arXiv:2211.09110): many desiderata, measured simultaneously and transparently. Cost is computed from the real measured token counts against a dated, sourced price sheet (FX ₹96/$), never list-price guesses.
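The cost arithmetic can be sketched as follows. Only the ₹96/$ rate comes from the text above; the model name, the per-million-token prices and the `run_cost` helper are placeholders invented for illustration, not entries from the real price sheet.

```python
FX_INR_PER_USD = 96.0  # FX rate quoted in the text

# Hypothetical price sheet in USD per million tokens. Placeholder numbers only.
PRICE_SHEET = {
    "example-model": {"input_per_mtok": 0.50, "output_per_mtok": 2.00},
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Cost of one run from measured token counts (illustrative helper)."""
    price = PRICE_SHEET[model]
    usd = (
        input_tokens * price["input_per_mtok"]
        + output_tokens * price["output_per_mtok"]
    ) / 1_000_000
    return {"usd": usd, "inr": usd * FX_INR_PER_USD}
```

Feeding in measured token counts, rather than estimates, is what ties the cost column to the same runs as the accuracy and latency columns.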
3. Confidence intervals, not bare point estimates
"99.4%" without a range is a vibe, not a measurement. We run 100 repetitions per task (1,400 results per system) and put a Wilson 95% confidence interval on every pass-rate. Wilson (Wilson, E.B., 1927, JASA — DOI:10.2307/2276774) is the right interval near 100%: it stays inside [0,1] and keeps a non-zero width even at a perfect score, where the textbook Wald interval collapses to zero and lies to you (Brown, Cai & DasGupta, 2001, Statistical Science — DOI:10.1214/ss/1009213286).
# Wilson score 95% CI — correct for pass-rates near 100% (Wilson, 1927)
def wilson(passes, n, z=1.96):
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    p = passes / n
    denom = 1 + z*z/n
    center = (p + z*z/(2*n)) / denom
    half = z * ((p*(1-p)/n + z*z/(4*n*n)) ** 0.5) / denom
    # Clamp so floating-point rounding cannot push a bound outside [0, 1].
    return (max(0.0, center - half), min(1.0, center + half))  # non-zero width even at p=1

This mirrors the argument in "Adding Error Bars to Evals" (Miller, Anthropic, 2024 — arXiv:2411.00640): treat an eval as an experiment — report confidence intervals, compare models with overlap in mind, and plan sample sizes. For aggregate metrics like mean latency we use bootstrap intervals (Efron, 1979, Annals of Statistics — DOI:10.1214/aos/1176344552).
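For the bootstrap side, one common variant is the percentile bootstrap, sketched below for mean latency. The text does not say which bootstrap variant is used, so the percentile method, the resample count, the fixed seed and the name `bootstrap_mean_ci` are all assumptions of this example.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `samples` (illustrative)."""
    if not samples:
        raise ValueError("need at least one sample")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(samples)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]
```

The interval is read straight off the sorted resample means, so it needs no normality assumption about the latency distribution.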
4. Infrastructure errors are not model errors
When a provider returns HTTP 429 (rate limit) or a timeout, that's an infrastructure event, not a wrong answer — so we exclude it from the accuracy denominator (uniformly across systems; most legs have zero) and report the error count separately. Conflating the two would unfairly punish whichever model we happened to hammer hardest on a throttled host.
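The denominator rule can be sketched like this. The outcome labels (`"pass"`, `"fail"`, `"rate_limited"`, `"timeout"`) and the `summarize` helper are hypothetical names chosen for the example; the real result schema may differ.

```python
from collections import Counter

# Hypothetical labels for HTTP 429 responses and timeouts.
INFRA_OUTCOMES = {"rate_limited", "timeout"}


def summarize(outcomes):
    """Pass-rate over scored runs only, with infra errors counted separately."""
    counts = Counter(outcomes)
    infra_errors = sum(counts[o] for o in INFRA_OUTCOMES)
    scored = len(outcomes) - infra_errors  # the accuracy denominator
    passes = counts["pass"]
    return {
        "passes": passes,
        "scored": scored,
        "infra_errors": infra_errors,
        "pass_rate": passes / scored if scored else None,
    }
```

With 97 passes, 1 failure and 2 rate-limited calls, this reports 97 of 98 scored runs and 2 infrastructure errors, rather than 97 of 100.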
5. Serving metrics, defined
For host comparisons (Baseten vs OpenRouter vs Zhipu) we separate time-to-first-token (TTFT, dominated by prefill) from time-per-output-token (TPOT, the decode phase), the split formalized by vLLM / PagedAttention (Kwon et al., Berkeley, SOSP 2023 — arXiv:2309.06180) and standardized as TTFT/TPOT in the MLPerf Inference LLM rounds (MLCommons, v4.0+, 2024). Latency-bounded throughput under realistic arrivals is the MLPerf "Server" scenario (Reddi et al., ISCA 2020 — arXiv:1911.02549).
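One common way to compute the two metrics from a streamed response is shown below. It assumes you have a send timestamp and one arrival timestamp per streamed token; the `serving_metrics` name and the exact TPOT definition (decode time divided by tokens after the first) are assumptions of this sketch, and a given benchmark's rules may define them slightly differently.

```python
def serving_metrics(request_sent_s, token_times_s):
    """TTFT and TPOT from per-token arrival timestamps in seconds (illustrative)."""
    if not token_times_s:
        raise ValueError("no tokens received")
    # Time to first token: dominated by queueing plus prefill.
    ttft = token_times_s[0] - request_sent_s
    # Time per output token: decode time averaged over tokens after the first.
    n_decode = len(token_times_s) - 1
    tpot = (token_times_s[-1] - token_times_s[0]) / n_decode if n_decode else None
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "decode_tokens_per_s": (1 / tpot) if tpot else None,
    }
```

Keeping the two apart matters because a host can have a fast first token and slow decode, or the reverse, and a single end-to-end latency hides which one you are paying for.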
What's interim, and what's coming
We're publishing interim numbers on a deliberately small, transparent task set so you can see the method end-to-end today. Next is a broader benchmark: more tasks, repo-level problems in the SWE-bench tradition, and contamination control — scoring on problems released after a model's training cutoff, the discipline introduced by LiveCodeBench (Jain, Han, Gu et al., Berkeley/MIT, 2024 — arXiv:2403.07974), so high scores reflect reasoning, not memorized training data. We'll publish it the same way: every script, prompt, price and raw result on GitHub.
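The contamination rule reduces to a date filter. The sketch below assumes each problem carries a `released` date and that the model's training cutoff is known; both field names and the `uncontaminated` helper are hypothetical, and the upcoming benchmark may implement this differently.

```python
from datetime import date


def uncontaminated(problems, training_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]
```

Because cutoffs differ by model, the scored subset would be computed per model rather than once for the whole benchmark.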
Reproduce everything: scripts, dated prices and raw results → github.com/bharatrouter/cookbook · See it applied: GLM vs frontier · The host showdown