How we benchmark — and why you can trust the numbers

A benchmark is only as good as its method. Before you act on our GLM-vs-frontier numbers, here's exactly how we produce them — every choice anchored to the eval literature, so you can audit it, re-run it, and disagree with it on the merits. These are interim results on a 14-task set; a broader, contamination-controlled benchmark is coming.

Methodology at a glance: execution-checked correctness, accuracy + latency + cost reported together, and Wilson 95% confidence intervals on every pass-rate.

1. Correctness is execution-checked, not judged

We don't ask a model to grade another model. Every task ships assert-based unit tests, and a run passes only if its code passes every assert in a fresh subprocess. This is the functional-correctness tradition of MBPP (Austin et al., Google, 2021 — arXiv:2108.07732): judge generated code by whether it actually runs and passes tests. For the harder, repo-level end of the spectrum, the gold standard is SWE-bench (Jimenez, Yang, Press, Narasimhan et al., Princeton, ICLR 2024 — arXiv:2310.06770), which grades real GitHub fixes against the repository's own test suite. Our 14 tasks are the small, fast end of that idea — real tests, no opinions.
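
A minimal sketch of that pass/fail rule, assuming each task supplies its tests as a string of assert statements. The helper name `run_case` and the 10-second timeout are illustrative assumptions, not the harness's actual interface.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_case(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True only if every assert passes in a fresh interpreter.

    Hypothetical helper: the name and the timeout value are assumptions.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "case.py"
        # Candidate first, then the asserts, so any failed assert exits non-zero.
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False  # a hang counts as a failed run
    return proc.returncode == 0
```

Running each case in its own interpreter means a crash, an infinite loop or leftover global state in one run cannot leak into the next.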

2. Accuracy, latency and cost — reported together

A model that's 1% more accurate but 10× slower and pricier is not automatically "better": whether that trade is worth it depends on your workload. So we report accuracy, latency and ₹/$ cost as first-class axes on the same table, the way HELM argues evaluation should work (Liang, Bommasani, Lee et al., Stanford CRFM, 2022 / TMLR 2023; arXiv:2211.09110): many desiderata, measured simultaneously and transparently. Cost is computed from the real measured token counts against a dated, sourced price sheet (FX ₹96/$), never from guessed prices.
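
The cost arithmetic can be sketched as follows. The only number taken from the text is the FX rate of ₹96/$; the per-token prices and the names `Price` and `run_cost` are placeholders for illustration, not entries from the actual price sheet.

```python
from dataclasses import dataclass

FX_INR_PER_USD = 96.0  # FX rate stated in the text


@dataclass(frozen=True)
class Price:
    """USD per one million tokens (values used below are placeholders)."""
    input_per_mtok: float
    output_per_mtok: float


def run_cost(prompt_tokens: int, completion_tokens: int, price: Price) -> tuple[float, float]:
    """Return (usd, inr) for one run from measured token counts."""
    usd = (prompt_tokens * price.input_per_mtok
           + completion_tokens * price.output_per_mtok) / 1_000_000
    return usd, usd * FX_INR_PER_USD


# Placeholder prices, NOT taken from any real price sheet.
example_price = Price(input_per_mtok=1.0, output_per_mtok=2.0)
usd, inr = run_cost(prompt_tokens=1_000_000, completion_tokens=500_000, price=example_price)
print(f"${usd:.2f} = ₹{inr:.2f}")
```

Keeping input and output prices separate matters because providers typically bill the two at different rates.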

3. Confidence intervals, not bare point estimates

"99.4%" without a range is a vibe, not a measurement. We run 100 repetitions per task (1,400 results per system) and put a Wilson 95% confidence interval on every pass-rate. Wilson (Wilson, E.B., 1927, JASA — DOI:10.2307/2276774) is the right interval near 100%: it stays inside [0,1] and keeps a non-zero width even at a perfect score, where the textbook Wald interval collapses to zero and lies to you (Brown, Cai & DasGupta, 2001, Statistical Science — DOI:10.1214/ss/1009213286).

# Wilson score 95% CI: correct for pass-rates near 100% (Wilson, 1927)
def wilson(passes, n, z=1.96):
    if n <= 0:
        raise ValueError("n must be positive")   # no scored runs, no interval
    p = passes / n
    denom = 1 + z*z/n
    center = (p + z*z/(2*n)) / denom
    half = z * ((p*(1-p)/n + z*z/(4*n*n)) ** 0.5) / denom
    # clamp float rounding so the bounds stay in [0,1]; width is non-zero even at p=1
    return (max(0.0, center - half), min(1.0, center + half))
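
To make the contrast concrete, here is a small comparison of the two intervals at a perfect score of 1,400 out of 1,400. The `wald` function is added only for contrast, and `wilson` repeats the formula above so the block runs on its own.

```python
import math


def wald(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Textbook normal-approximation interval, shown only for contrast."""
    p = passes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)


def wilson(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval, same formula as the snippet above."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - half, center + half)


wald_lo, wald_hi = wald(1400, 1400)
wilson_lo, wilson_hi = wilson(1400, 1400)
print(f"Wald width:         {wald_hi - wald_lo:.5f}")
print(f"Wilson lower bound: {wilson_lo:.5f}")
```

At p = 1 the Wilson lower bound simplifies to n / (n + z²), roughly 99.73% for n = 1,400, while the Wald width is exactly zero.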

This mirrors the argument in "Adding Error Bars to Evals" (Miller, Anthropic, 2024 — arXiv:2411.00640): treat an eval as an experiment — report confidence intervals, compare models with overlap in mind, and plan sample sizes. For aggregate metrics like mean latency we use bootstrap intervals (Efron, 1979, Annals of Statistics — DOI:10.1214/aos/1176344552).
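
A bootstrap interval for mean latency might look like the sketch below. It uses the percentile variant, which is one common choice; the text does not say which variant is used, and the resample count, seed and latency values are made up for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean (assumed variant, see lead-in)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(samples)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


latencies_s = [0.82, 0.91, 1.05, 0.77, 1.40, 0.95, 0.88, 2.10, 0.93, 1.01]  # made-up values
low, high = bootstrap_mean_ci(latencies_s)
print(f"mean {statistics.fmean(latencies_s):.3f}s, 95% CI [{low:.3f}, {high:.3f}]")
```

The bootstrap suits latency because latency distributions are usually skewed, so a symmetric normal-theory interval around the mean can mislead.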

4. Infrastructure errors are not model errors

When a provider returns HTTP 429 (rate limit) or times out, that's an infrastructure event, not a wrong answer. So we exclude it from the accuracy denominator (applied uniformly across systems; most legs recorded zero such errors) and report the error count separately. Conflating the two would unfairly punish whichever model we happened to hammer hardest on a throttled host.
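
The bookkeeping can be sketched like this, assuming each run has already been labelled with one outcome. The label strings and the `summarize` helper are hypothetical names chosen for illustration.

```python
from collections import Counter

# The two infrastructure events named in the text; label strings are hypothetical.
INFRA_OUTCOMES = {"http_429", "timeout"}


def summarize(outcomes):
    """Summarize a list of 'pass', 'fail', 'http_429' or 'timeout' labels."""
    counts = Counter(outcomes)
    infra_errors = sum(counts[o] for o in INFRA_OUTCOMES)
    scored = counts["pass"] + counts["fail"]  # denominator excludes infra errors
    pass_rate = counts["pass"] / scored if scored else None
    return {"pass_rate": pass_rate, "scored": scored, "infra_errors": infra_errors}


report = summarize(["pass"] * 97 + ["fail"] * 1 + ["http_429"] * 2)
print(report)
```

Reporting `infra_errors` alongside the pass-rate keeps the exclusion auditable: a reader can see how many runs were dropped and recompute the rate either way.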

5. Serving metrics, defined

For host comparisons (Baseten vs OpenRouter vs Zhipu) we separate time-to-first-token (TTFT, dominated by prefill) from time-per-output-token (TPOT, the decode phase), the prefill/decode split formalized by vLLM / PagedAttention (Kwon et al., Berkeley, SOSP 2023; arXiv:2309.06180) and standardized as TTFT/TPOT in the MLPerf Inference LLM rounds (MLCommons, v4.0+, 2024). Latency-bounded throughput under realistic arrivals is the MLPerf "Server" scenario (Reddi et al., ISCA 2020; arXiv:1911.02549).
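
As an illustration, both metrics can be derived from the arrival timestamps of a streamed response. The sketch uses one common definition of TPOT (decode time divided by the number of tokens after the first); the function name and the sample timestamps are assumptions.

```python
def serving_metrics(request_sent_s: float, token_times_s: list[float]) -> dict:
    """Compute TTFT and mean TPOT from per-token arrival timestamps (seconds)."""
    if not token_times_s:
        raise ValueError("no tokens received")
    ttft = token_times_s[0] - request_sent_s      # prefill-dominated wait
    n_decode = len(token_times_s) - 1             # tokens after the first
    tpot = (token_times_s[-1] - token_times_s[0]) / n_decode if n_decode > 0 else None
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "decode_tok_per_s": (1 / tpot) if tpot else None,
    }


# Made-up timestamps: first token at 0.5 s, then one token every 0.25 s.
m = serving_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

Separating the two keeps a slow first token from being averaged into the streaming speed, which matters when comparing hosts with different queueing behavior.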

What's interim, and what's coming

We're publishing interim numbers on a deliberately small, transparent task set so you can see the method end-to-end today. Next is a broader benchmark: more tasks, repo-level problems in the SWE-bench tradition, and contamination control — scoring on problems released after a model's training cutoff, the discipline introduced by LiveCodeBench (Jain, Han, Gu et al., Berkeley/MIT, 2024 — arXiv:2403.07974), so high scores reflect reasoning, not memorized training data. We'll publish it the same way: every script, prompt, price and raw result on GitHub.
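
The contamination filter described above reduces to a date comparison. A minimal sketch, assuming each problem carries a release date and each model a known training cutoff; the `Problem` type and the dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date


def post_cutoff(problems, training_cutoff: date):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Invented problems and cutoff date, for illustration only.
pool = [
    Problem("a", date(2024, 1, 10)),
    Problem("b", date(2024, 9, 1)),
    Problem("c", date(2025, 2, 3)),
]
eligible = post_cutoff(pool, training_cutoff=date(2024, 6, 30))
print([p.problem_id for p in eligible])
```

Because cutoffs differ by model, the eligible set is computed per model, so any cross-model comparison should be restricted to problems that are post-cutoff for every model being compared.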

Reproduce everything: scripts, dated prices and raw results are at github.com/bharatrouter/cookbook. See it applied: GLM vs frontier · The host showdown

