Methodology

How we benchmark frontier AI models — our test design, scoring system, and quality controls.

Overview

33
Individual tests per run

11
Benchmark categories

1x
Daily benchmark run (3am ET)

Models Tested

Claude Opus 4.8 · Anthropic
Claude Sonnet 4.6 · Anthropic
GPT-5.5 · OpenAI
Grok · xAI

All models are called via their official APIs with temperature set to 0.0 to keep outputs as reproducible as possible. When a new frontier model is released, we add it to the benchmark suite and start tracking it immediately.

Benchmark Categories

Token Efficiency

A synthetic benchmark that rewards lower token burn per successful task and penalizes wasteful completions.

Weight: 1x · 3 tests per run

Long Reasoning

Multi-step logic puzzles, legal reasoning chains, and mathematical proofs that test sustained analytical thinking.

Weight: 1x · 3 tests per run

Coding Tasks

Implement data structures, REST APIs, and algorithmic challenges from specifications.

Weight: 1.2x · 3 tests per run

Bug Fixes

Identify and fix subtle bugs including off-by-one errors, race conditions, and memory leaks.

Weight: 1.2x · 3 tests per run

Feature Implementation

Add search/filter, rate limiting, pagination, and other features to existing codebases.

Weight: 1x · 3 tests per run

Code Thoroughness

Error handling completeness, input validation depth, and test generation quality.

Weight: 0.9x · 3 tests per run

Bug Introduction Rate

Modify existing code, refactor safely, and add features without breaking existing functionality.

Weight: 1.1x · 3 tests per run

Security Awareness

SQL injection prevention, vulnerability identification, and secure file handling practices.

Weight: 1.1x · 3 tests per run

Instruction Following

Format constraints, multi-constraint outputs, and adherence to negative instructions.

Weight: 0.8x · 3 tests per run

Code Quality

Clean code standards, readable complex logic, and well-designed API interfaces.

Weight: 0.9x · 3 tests per run

Performance & Efficiency

Algorithm optimization, memory-efficient processing, and database query optimization.

Weight: 1x · 3 tests per run

Scoring System

Each test produces a score from 0 to 100. Scores are computed using one or more evaluation methods:

Code Execution

Model-generated code runs in a Docker sandbox. Score based on test case pass rate.
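
A minimal sketch of pass-rate scoring, mapping per-test-case results to the 0 to 100 scale. The function name and the choice to score an empty result list as 0 are illustrative assumptions, not a description of the production harness.

```python
def pass_rate_score(results: list[bool]) -> float:
    """Map per-test-case pass/fail results to a 0-100 score."""
    if not results:
        # Assumption: code that produced no test results earns no credit.
        return 0.0
    # True counts as 1 and False as 0, so sum() is the number of passes.
    return 100.0 * sum(results) / len(results)


# Three of four test cases passed.
print(pass_rate_score([True, True, False, True]))
```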

LLM Judge

A separate model evaluates the response against a rubric. Judge model differs from test subjects.

Pattern Matching

Regex and structural checks verify format compliance, constraints, and output correctness.
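
To make this concrete, here is a hedged sketch of a pattern-matching evaluator. The three checks (valid JSON object, a word limit, and a banned word standing in for a negative instruction) are hypothetical examples, and equal weighting of the checks is an assumption.

```python
import json
import re


def is_json_object(text: str) -> bool:
    """Structural check: does the text parse as a JSON object?"""
    try:
        return isinstance(json.loads(text), dict)
    except json.JSONDecodeError:
        return False


def check_format(response: str, max_words: int = 50) -> float:
    """Return the percentage of (hypothetical) format checks that pass."""
    checks = [
        # Format compliance: output must be a JSON object.
        is_json_object(response),
        # Constraint: stay within the word limit.
        len(response.split()) <= max_words,
        # Negative instruction: the word "sorry" must not appear.
        re.search(r"\bsorry\b", response, flags=re.IGNORECASE) is None,
    ]
    return 100.0 * sum(checks) / len(checks)
```

For example, `check_format('{"answer": 42}')` passes all three checks, while a plain-text apology fails both the JSON check and the banned-word check.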

Composite

Combines multiple evaluators with weighted scoring for nuanced assessment.

A model's overall composite score is the weighted average of its category scores across all 11 categories, including the token-efficiency benchmark, which penalizes higher average token burn per successful task. Weights reflect the relative importance of each capability: Coding Tasks and Bug Fixes carry the highest weight (1.2x), Security Awareness is weighted 1.1x, and Instruction Following carries the lowest (0.8x).
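
The weighted average can be sketched as follows, using the category weights listed above. Normalizing by the sum of the weights is an assumption, as is the idea that each category score is already a single 0 to 100 number (for example, the mean of its three tests).

```python
# Category weights as listed in the Benchmark Categories section.
WEIGHTS = {
    "Token Efficiency": 1.0,
    "Long Reasoning": 1.0,
    "Coding Tasks": 1.2,
    "Bug Fixes": 1.2,
    "Feature Implementation": 1.0,
    "Code Thoroughness": 0.9,
    "Bug Introduction Rate": 1.1,
    "Security Awareness": 1.1,
    "Instruction Following": 0.8,
    "Code Quality": 0.9,
    "Performance & Efficiency": 1.0,
}


def composite_score(category_scores: dict[str, float]) -> float:
    """Weighted average of per-category scores (each 0-100)."""
    # A missing category raises KeyError rather than being silently skipped.
    weighted_sum = sum(w * category_scores[name] for name, w in WEIGHTS.items())
    return weighted_sum / sum(WEIGHTS.values())
```

Under this normalization, a model that scores 80 in every category gets a composite of 80, and a drop in a 1.2x category moves the composite more than the same drop in a 0.8x category.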

Regression Detection

After each benchmark run, we compare scores against rolling historical averages:

Minor: 3-5% drop
Moderate: 5-15% drop
Major: >15% drop

A regression is considered “resolved” when the score returns to within 2% of the pre-regression average over 3 consecutive runs.
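
A minimal sketch of these rules, under stated assumptions: drops are measured as a relative percentage of the baseline average, a drop of exactly 5% is classed as moderate, and "resolved" means each of the three most recent runs is no more than 2% below the pre-regression average. The function names are hypothetical.

```python
from typing import Optional


def classify_drop(baseline: float, current: float) -> Optional[str]:
    """Classify a score drop relative to the rolling baseline average."""
    if baseline <= 0:
        return None
    drop_pct = 100.0 * (baseline - current) / baseline
    if drop_pct > 15:
        return "major"
    if drop_pct >= 5:  # Assumption: the 5% boundary counts as moderate.
        return "moderate"
    if drop_pct >= 3:
        return "minor"
    return None  # Below the 3% threshold: not flagged.


def is_resolved(pre_regression_avg: float, recent_scores: list[float],
                tolerance_pct: float = 2.0, runs: int = 3) -> bool:
    """True if the last `runs` scores are all within tolerance of the old average."""
    if len(recent_scores) < runs:
        return False
    floor = pre_regression_avg * (1 - tolerance_pct / 100.0)
    return all(score >= floor for score in recent_scores[-runs:])
```

With a baseline of 80, a score of 77 is flagged as minor, 76 as moderate, and 60 as major.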

Limitations

  • LLM-as-judge evaluation introduces variability. We mitigate this with temperature 0.0 and structured rubrics.
  • API performance can vary due to server load, rate limiting, and geographic location.
  • 33 tests per run cannot cover all capabilities. Results indicate trends, not absolute quality.
  • Model versioning is opaque — providers may update models silently (which is exactly what we track).
  • Code execution benchmarks are limited to Python and JavaScript in sandboxed environments.