Methodology
How we benchmark frontier AI models — our test design, scoring system, and quality controls.
Overview
The suite runs once daily at 3am ET and covers 11 benchmark categories, each with 3 tests per run.
Models Tested
All models are called through their official APIs with temperature set to 0.0 to minimize run-to-run variation. When a new frontier model is released, it is added to the benchmark suite and tracked from its first run onward.
Benchmark Categories
Token Efficiency
A synthetic benchmark that rewards lower token burn per successful task and penalizes wasteful completions.
Weight: 1x · 3 tests per run
Long Reasoning
Multi-step logic puzzles, legal reasoning chains, and mathematical proofs that test sustained analytical thinking.
Weight: 1x · 3 tests per run
Coding Tasks
Implement data structures, REST APIs, and algorithmic challenges from specifications.
Weight: 1.2x · 3 tests per run
Bug Fixes
Identify and fix subtle bugs including off-by-one errors, race conditions, and memory leaks.
Weight: 1.2x · 3 tests per run
Feature Implementation
Add search/filter, rate limiting, pagination, and other features to existing codebases.
Weight: 1x · 3 tests per run
Code Thoroughness
Error handling completeness, input validation depth, and test generation quality.
Weight: 0.9x · 3 tests per run
Bug Introduction Rate
Modify existing code, refactor safely, and add features without breaking existing functionality.
Weight: 1.1x · 3 tests per run
Security Awareness
SQL injection prevention, vulnerability identification, and secure file handling practices.
Weight: 1.1x · 3 tests per run
Instruction Following
Format constraints, multi-constraint outputs, and adherence to negative instructions.
Weight: 0.8x · 3 tests per run
Code Quality
Clean code standards, readable complex logic, and well-designed API interfaces.
Weight: 0.9x · 3 tests per run
Performance & Efficiency
Algorithm optimization, memory-efficient processing, and database query optimization.
Weight: 1x · 3 tests per run
Scoring System
Each test produces a score from 0 to 100. Scores are computed using one or more evaluation methods:
- Model-generated code runs in a Docker sandbox. The score is based on the test case pass rate.
- A separate judge model evaluates the response against a rubric. The judge is never one of the models under test.
- Regex and structural checks verify format compliance, constraints, and output correctness.
- Multiple evaluators are combined with weighted scoring for a more nuanced assessment.
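The evaluation methods above can be illustrated with a minimal Python sketch. The function names, the regex patterns, and the evaluator weights are all hypothetical and chosen only for illustration; the actual harness, rubric, and per-evaluator weights are not described here.

```python
import re


def pass_rate_score(results: list[bool]) -> float:
    """Map test case outcomes to a 0-100 score (fraction passed)."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)


def structural_score(response: str, required: list[str], forbidden: list[str]) -> float:
    """Score 0-100 from regex checks.

    Each required pattern must match and each forbidden pattern must not.
    Every check counts equally (an assumption for this sketch).
    """
    checks = [re.search(p, response) is not None for p in required]
    checks += [re.search(p, response) is None for p in forbidden]
    return pass_rate_score(checks)


def combined_score(parts: list[tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs from several evaluators."""
    total_weight = sum(weight for _, weight in parts)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weight for score, weight in parts) / total_weight


# Hypothetical example: 3 of 4 sandboxed test cases pass, and the
# response satisfies both format checks.
execution = pass_rate_score([True, True, False, True])
fmt = structural_score("Answer: 42", [r"^Answer: \d+$"], [r"(?i)sorry"])
hybrid = combined_score([(execution, 3.0), (fmt, 1.0)])
```

With these made-up inputs the execution score is 75, the format score is 100, and the hybrid score weights execution three times as heavily as format.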
Composite scores are weighted averages across all 11 categories. This includes a token-efficiency benchmark that penalizes higher average token burn per successful task. Weights reflect the relative importance of each capability (coding and security are weighted higher than instruction following).
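As a rough sketch of the composite calculation, the weighted average can be written in Python using the category weights listed above. The function name is hypothetical, and the sketch assumes each category has already been reduced to a single 0 to 100 score; how the 3 tests in a category are aggregated is not specified here.

```python
# Category weights as listed in the Benchmark Categories section.
CATEGORY_WEIGHTS = {
    "Token Efficiency": 1.0,
    "Long Reasoning": 1.0,
    "Coding Tasks": 1.2,
    "Bug Fixes": 1.2,
    "Feature Implementation": 1.0,
    "Code Thoroughness": 0.9,
    "Bug Introduction Rate": 1.1,
    "Security Awareness": 1.1,
    "Instruction Following": 0.8,
    "Code Quality": 0.9,
    "Performance & Efficiency": 1.0,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each assumed to be 0-100)."""
    missing = CATEGORY_WEIGHTS.keys() - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    weighted_sum = sum(
        category_scores[name] * weight
        for name, weight in CATEGORY_WEIGHTS.items()
    )
    return weighted_sum / sum(CATEGORY_WEIGHTS.values())
```

Because the result is divided by the sum of the weights, a model that scores the same value in every category gets that value as its composite, and a point gained in Coding Tasks (1.2x) moves the composite more than a point gained in Instruction Following (0.8x).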
Regression Detection
After each benchmark run, we compare each score against its rolling historical average to detect regressions.
A regression is considered “resolved” when the score returns to within 2% of the pre-regression average over 3 consecutive runs.
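The resolution rule can be sketched as a small Python check. This is one possible reading, not the production logic: it assumes "within 2%" is relative to the pre-regression average, and that a score above that average also counts as recovered. The function and parameter names are hypothetical.

```python
def is_resolved(
    pre_regression_avg: float,
    scores_since_regression: list[float],
    tolerance: float = 0.02,
    runs_required: int = 3,
) -> bool:
    """Return True if the latest runs have all recovered to the baseline.

    Assumption: a run counts as recovered when its score is no more than
    `tolerance` (relative) below the pre-regression average.
    """
    if len(scores_since_regression) < runs_required:
        return False
    lower_bound = pre_regression_avg * (1.0 - tolerance)
    latest = scores_since_regression[-runs_required:]
    return all(score >= lower_bound for score in latest)
```

For example, with a pre-regression average of 80, the recovery threshold is 78.4, so scores of 79, 78.5, and 81 on the last three runs would mark the regression as resolved, while a single run at 77 among them would not.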
Limitations
- LLM-as-judge evaluation introduces variability. We mitigate this with temperature 0.0 and structured rubrics.
- API performance can vary due to server load, rate limiting, and geographic location.
- 30 tests cannot cover all capabilities. Results indicate trends, not absolute quality.
- Model versioning is opaque: providers may update models silently, which is exactly the kind of change this benchmark is designed to track.
- Code execution benchmarks are limited to Python and JavaScript in sandboxed environments.