Category weight: 1.0x

Long Reasoning

Multi-step logic puzzles, extended chain-of-thought, and complex analytical reasoning tasks requiring sustained coherence over many steps.

Best Score: 0.0
Avg Score: 0.0
Tests: 3

[Chart: Performance Over Time, All Models]

Model Rankings

1. Claude Opus 4.8
Category score: 77.8 (best)
Tokens: 10.7k
Total: 10.7k

2. Grok
Category score: 70.1 (-7.7 pts)
Tokens: 92.2k
Total: 92.2k

3. Claude Sonnet 4.6
Category score: 68.9 (-8.9 pts)
Tokens: 20.0k
Total: 20.0k

4. GPT-5.5
Category score: 66.6 (-11.2 pts)
Tokens: 38.1k
Total: 38.1k

Test Breakdown

Multi-step Logic Puzzle

Complex optimization with 8+ constraints across multiple variables

Claude Opus 4.8: 77.8
Grok: 70.1
Claude Sonnet 4.6: 68.9
GPT-5.5: 66.6

Legal Reasoning Chain

Contract dispute analysis requiring multi-party obligation tracking

Claude Opus 4.8: 77.8
Grok: 70.1
Claude Sonnet 4.6: 68.9
GPT-5.5: 66.6

Mathematical Proof

Prove divisibility properties using induction and modular arithmetic

Claude Opus 4.8: 77.8
Grok: 70.1
Claude Sonnet 4.6: 68.9
GPT-5.5: 66.6