Category Weight: 1.0x
Token Efficiency
Measures how efficiently a model solves tasks by penalizing higher token consumption. Lower usage earns a higher score.
Best Score: 0.0
Avg Score: 0.0
Measured Models: 4
Performance Over Time: All Models
Model Rankings
Benchmark Construction
How It Scores
We total prompt and completion tokens across all successful benchmark tasks, compute the average per successful task, then assign 100 to the model with the lowest average in that run. Every other model is scaled down in proportion to its higher usage, so higher usage means a lower benchmark score.
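The scoring rule above can be sketched in a few lines of Python. This is an illustration, not the dashboard's actual implementation: the `TaskResult` type and `token_efficiency_scores` function are hypothetical names, and the sketch assumes "scaled down proportionally" means score = 100 × (lowest average) / (model's average). It also assumes a model with no successful tasks receives no score.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record of one benchmark task for one model.
    model: str
    prompt_tokens: int
    completion_tokens: int
    success: bool


def token_efficiency_scores(results):
    """Return {model: score}; the lowest average burn in the run scores 100."""
    totals = {}
    counts = {}
    for r in results:
        if not r.success:
            continue  # only successful tasks count toward the average
        totals[r.model] = totals.get(r.model, 0) + r.prompt_tokens + r.completion_tokens
        counts[r.model] = counts.get(r.model, 0) + 1

    # Average tokens per successful task for each model.
    averages = {m: totals[m] / counts[m] for m in totals}
    if not averages:
        return {}

    best = min(averages.values())
    # Assumed scaling: inversely proportional to average usage.
    return {
        m: 100.0 if avg == 0 else 100.0 * best / avg
        for m, avg in averages.items()
    }


run = [
    TaskResult("model-a", 100, 100, True),
    TaskResult("model-a", 300, 100, True),
    TaskResult("model-a", 5000, 5000, False),  # failed task, ignored
    TaskResult("model-b", 400, 200, True),
]
scores = token_efficiency_scores(run)
```

Under this assumed formula, a model that averages twice as many tokens per successful task as the most efficient model scores 50.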
What To Read In This View
The ranking table above is the benchmark itself. Use Avg/Test to compare per-task burn and Total to spot larger absolute usage across the whole run.
Historical scores show whether a model is becoming more or less token-efficient over time, independent of raw quality improvements in the other ten categories.