Category Weight: 1.0x

Token Efficiency

Measures how efficiently a model solves tasks by penalizing higher token consumption. Lower usage earns a higher score.

Best Score: 0.0
Avg Score: 0.0
Measured Models: 4

Performance Over Time — All Models

Model Rankings

Rank  Model              Score  vs. Best   Avg/Test  Total
1     Claude Sonnet 4.6  100.0  BEST       4.8k      121.0k
2     Claude Opus 4.8     68.3  -31.7 pts  7.1k      134.7k
3     GPT-5.5             36.8  -63.2 pts  13.2k     368.2k
4     Grok                14.1  -85.9 pts  34.2k     991.9k

Benchmark Construction

How It Scores

We total prompt and completion tokens across all successful benchmark tasks, compute the average per successful task, then assign 100 to the model with the lowest average in that run. Every other model's score is scaled down in proportion to how much more it uses, so higher usage means a lower benchmark score.
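The scoring steps described here can be sketched in Python. This is an illustration under stated assumptions, not the dashboard's actual implementation: the `TaskResult` type and both function names are hypothetical, and the formula `100 * lowest_average / model_average` is one reading of "scaled down proportionally" that happens to match the published scores to within rounding.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record for one benchmark task."""
    prompt_tokens: int
    completion_tokens: int
    succeeded: bool


def avg_tokens_per_success(results: list[TaskResult]) -> float:
    """Average prompt + completion tokens over successful tasks only."""
    totals = [r.prompt_tokens + r.completion_tokens for r in results if r.succeeded]
    if not totals:
        raise ValueError("no successful tasks to average")
    return sum(totals) / len(totals)


def efficiency_scores(avg_by_model: dict[str, float]) -> dict[str, float]:
    """Assumed formula: the lowest average gets 100, others scale inversely."""
    lowest = min(avg_by_model.values())
    return {model: 100.0 * lowest / avg for model, avg in avg_by_model.items()}


# Rounded Avg/Test values taken from the ranking table.
averages = {
    "Claude Sonnet 4.6": 4800.0,
    "Claude Opus 4.8": 7100.0,
    "GPT-5.5": 13200.0,
    "Grok": 34200.0,
}
scores = efficiency_scores(averages)
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Because the table only shows averages rounded to the nearest 0.1k, scores recomputed this way land close to the published ones but not exactly on them.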

What To Read In This View

The ranking table above is the benchmark itself. Use Avg/Test to compare per-task burn and Total to spot larger absolute usage across the whole run.

Historical scores show whether a model is becoming more or less token-efficient over time, independent of raw quality improvements in the other ten categories.