Category Weight: 1.0x

Token Efficiency

Measures how efficiently a model solves tasks by penalizing higher token consumption. Lower usage earns a higher score.

Best Score

0.0

Avg Score

0.0

Measured Models

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Avg/Test	Total	vs. Best
1	Claude Opus 4.8	100.0	7.7k/test	231.9k	BEST
2	Claude Sonnet 4.6	70.3	11.0k/test	318.7k	-29.7 pts
3	GPT-5.5	51.7	15.0k/test	448.9k	-48.3 pts
4	Grok 4.5	22.6	34.2k/test	1.0M	-77.4 pts

Benchmark Construction

How It Scores

For each model, we total prompt and completion tokens across all successful benchmark tasks and divide by the number of successful tasks. The model with the lowest average in that run scores 100. Every other model is scaled down in proportion to its average, so higher usage means a lower benchmark score.
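As a rough illustration, the scoring rule described above can be sketched in Python. The exact scaling formula is not stated here; this sketch assumes score = 100 × (lowest average) / (model average), which is roughly consistent with the published figures (for example, 7.7k / 15.0k ≈ 0.51 against a reported 51.7). The `RunResult` type, its field names, and `token_efficiency_scores` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical per-model totals over successful tasks in one run."""
    model: str
    prompt_tokens: int
    completion_tokens: int
    successful_tasks: int

    @property
    def avg_tokens_per_task(self) -> float:
        # Prompt and completion tokens are summed, then averaged per successful task.
        return (self.prompt_tokens + self.completion_tokens) / self.successful_tasks


def token_efficiency_scores(results: list[RunResult]) -> dict[str, float]:
    """Assign 100 to the lowest average and scale the rest by best / average.

    Assumption: inverse-proportional scaling. Models with no successful
    tasks are skipped because they have no defined average.
    """
    averages = {
        r.model: r.avg_tokens_per_task
        for r in results
        if r.successful_tasks > 0
    }
    if not averages:
        return {}
    best = min(averages.values())
    return {model: 100.0 * best / avg for model, avg in averages.items()}


# Illustrative numbers only, not benchmark data.
example = [
    RunResult("model-a", prompt_tokens=60_000, completion_tokens=40_000, successful_tasks=10),
    RunResult("model-b", prompt_tokens=150_000, completion_tokens=50_000, successful_tasks=10),
]
scores = token_efficiency_scores(example)
```

Under this assumption, a model that uses twice as many tokens per successful task as the leader lands at half the leader's score.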

What To Read In This View

The ranking table above is the benchmark itself. Use Avg/Test to compare per-task burn and Total to spot larger absolute usage across the whole run.
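When comparing Avg/Test against Total, it can help to convert the abbreviated figures to plain numbers. The sketch below parses values such as `7.7k/test` and `1.0M`, and estimates how many successful tasks a run included by dividing Total by Avg/Test. The helper names `parse_count` and `implied_task_count` are hypothetical, and because the displayed figures are rounded, the result is only an approximation.

```python
def parse_count(text: str) -> float:
    """Convert a dashboard figure such as '7.7k/test' or '1.0M' to a float."""
    suffixes = {"k": 1_000, "M": 1_000_000}
    value = text.split("/")[0].strip()  # drop a trailing '/test' unit if present
    if value and value[-1] in suffixes:
        return float(value[:-1]) * suffixes[value[-1]]
    return float(value)


def implied_task_count(avg_per_test: str, total: str) -> float:
    """Estimate the number of successful tasks as Total / Avg/Test."""
    return parse_count(total) / parse_count(avg_per_test)


# Figures copied from the rankings above: (Avg/Test, Total).
rows = {
    "Claude Opus 4.8": ("7.7k/test", "231.9k"),
    "Claude Sonnet 4.6": ("11.0k/test", "318.7k"),
    "GPT-5.5": ("15.0k/test", "448.9k"),
    "Grok 4.5": ("34.2k/test", "1.0M"),
}

for model, (avg, total) in rows.items():
    print(f"{model}: about {implied_task_count(avg, total):.0f} successful tasks")
```

Because the score is based on the per-task average, two models with similar Total values can still score differently if they completed different numbers of tasks successfully.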

Historical scores show whether a model is becoming more or less token-efficient over time, independent of raw quality improvements in the other ten categories.