Category Weight: 1.0x

Long Reasoning

Multi-step logic puzzles, extended chain-of-thought, and complex analytical reasoning tasks requiring sustained coherence over many steps.

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Tokens	vs. Best
1	GPT-5.5	69.6	39.2k	BEST
2	Claude Sonnet 4.6	68.6	34.4k	-1.0 pts
3	Grok 4.5	67.7	106.3k	-1.9 pts
4	Claude Opus 4.8	66.3	22.6k	-3.3 pts
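
The "vs. Best" column appears to be each model's category score minus the top score, shown to one decimal place. A minimal sketch of that arithmetic follows, assuming this is how the column is derived; the `vs_best` helper and the `scores` dictionary are hypothetical names introduced for illustration, and the values are copied from the rankings above.

```python
# Category scores copied from the rankings table (Long Reasoning).
scores = {
    "GPT-5.5": 69.6,
    "Claude Sonnet 4.6": 68.6,
    "Grok 4.5": 67.7,
    "Claude Opus 4.8": 66.3,
}


def vs_best(scores):
    """Return each model's gap to the top score as a display string.

    Assumption: the gap is score minus best score, rounded to one decimal,
    and the top-scoring model is labelled "BEST" instead of "+0.0 pts".
    """
    best = max(scores.values())
    return {
        model: "BEST" if score == best else f"{round(score - best, 1):+.1f} pts"
        for model, score in scores.items()
    }


gaps = vs_best(scores)
for model, gap in gaps.items():
    print(f"{model}: {gap}")
```

Rounding before formatting avoids floating-point residue such as -1.8999999 showing up in the output.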

Test Breakdown

Multi-step Logic Puzzle

Complex optimization with 8+ constraints across multiple variables

Model	Score
GPT-5.5	69.6
Claude Sonnet 4.6	68.6
Grok 4.5	67.7
Claude Opus 4.8	66.3

Legal Reasoning Chain

Contract dispute analysis requiring multi-party obligation tracking

Model	Score
GPT-5.5	69.6
Claude Sonnet 4.6	68.6
Grok 4.5	67.7
Claude Opus 4.8	66.3

Mathematical Proof

Prove divisibility properties using induction and modular arithmetic

Model	Score
GPT-5.5	69.6
Claude Sonnet 4.6	68.6
Grok 4.5	67.7
Claude Opus 4.8	66.3