Compare Models
Side-by-side performance comparison across all benchmark categories.
Category Radar Comparison
Performance Over Time
Detailed Score Comparison
Composite
Claude Opus 4.8
89.8
Claude Sonnet 4.6
93.4
GPT-5.5
86.2
Grok
85.5
Token Benchmark
Claude Opus 4.8
68.3
Claude Sonnet 4.6
100.0
GPT-5.5
36.8
Grok
14.1
Avg Tokens/Test
Claude Opus 4.8
7.1k
Claude Sonnet 4.6
4.8k
GPT-5.5
13.2k
Grok
34.2k
Total Tokens
Claude Opus 4.8
134.7k
Claude Sonnet 4.6
121.0k
GPT-5.5
368.2k
Grok
991.9k
Token Efficiency
Claude Opus 4.8
68.3
Claude Sonnet 4.6
100.0 BEST
GPT-5.5
36.8
Grok
14.1
Long Reasoning
Claude Opus 4.8
77.8 BEST
Claude Sonnet 4.6
68.9
GPT-5.5
66.6
Grok
70.1
Coding Tasks
Claude Opus 4.8
99.0 BEST
Claude Sonnet 4.6
97.3
GPT-5.5
96.7
Grok
98.7
Bug Fixes
Claude Opus 4.8
90.0
Claude Sonnet 4.6
96.7 BEST
GPT-5.5
93.3
Grok
96.3
Feature Implementation
Claude Opus 4.8
96.7
Claude Sonnet 4.6
97.3
GPT-5.5
97.7
Grok
99.0 BEST
Code Thoroughness
Claude Opus 4.8
74.0
Claude Sonnet 4.6
84.0
GPT-5.5
85.0
Grok
92.0 BEST
Bug Introduction Rate
Claude Opus 4.8
100.0 BEST
Claude Sonnet 4.6
97.3
GPT-5.5
96.3
Grok
97.0
Security Awareness
Claude Opus 4.8
94.5
Claude Sonnet 4.6
95.7 BEST
GPT-5.5
90.9
Grok
84.3
Instruction Following
Claude Opus 4.8
100.0 BEST
Claude Sonnet 4.6
96.7
GPT-5.5
100.0 BEST
Grok
100.0 BEST
Code Quality
Claude Opus 4.8
93.5
Claude Sonnet 4.6
99.0 BEST
GPT-5.5
93.0
Grok
94.3
Performance & Efficiency
Claude Opus 4.8
94.0
Claude Sonnet 4.6
94.5
GPT-5.5
92.0
Grok
94.7 BEST
| Category | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5 | Grok |
|---|---|---|---|---|
| Token Efficiency | 68.3 | 100.0 | 36.8 | 14.1 |
| Long Reasoning | 77.8 | 68.9 | 66.6 | 70.1 |
| Coding Tasks | 99.0 | 97.3 | 96.7 | 98.7 |
| Bug Fixes | 90.0 | 96.7 | 93.3 | 96.3 |
| Feature Implementation | 96.7 | 97.3 | 97.7 | 99.0 |
| Code Thoroughness | 74.0 | 84.0 | 85.0 | 92.0 |
| Bug Introduction Rate | 100.0 | 97.3 | 96.3 | 97.0 |
| Security Awareness | 94.5 | 95.7 | 90.9 | 84.3 |
| Instruction Following | 100.0 | 96.7 | 100.0 | 100.0 |
| Code Quality | 93.5 | 99.0 | 93.0 | 94.3 |
| Performance & Efficiency | 94.0 | 94.5 | 92.0 | 94.7 |
| Composite | 89.8 | 93.4 | 86.2 | 85.5 |
| Total Tokens | 134.7k | 121.0k | 368.2k | 991.9k |
| Avg Tokens / Test | 7.1k | 4.8k | 13.2k | 34.2k |
| Token Benchmark | 68.3 | 100.0 | 36.8 | 14.1 |
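
The page does not say how the Composite row is derived. A quick check suggests it matches the unweighted mean of the eleven category scores, rounded to one decimal. The sketch below recomputes that mean from the table values and also lists the top scorer(s) per category. The averaging rule is an assumption inferred from the published numbers, not a documented formula, and the names `composite_by_mean` and `best_models` are hypothetical helpers for illustration.

```python
from statistics import mean

MODELS = ["Claude Opus 4.8", "Claude Sonnet 4.6", "GPT-5.5", "Grok"]

# Category scores copied from the table above, in MODELS order.
CATEGORY_SCORES = {
    "Token Efficiency": [68.3, 100.0, 36.8, 14.1],
    "Long Reasoning": [77.8, 68.9, 66.6, 70.1],
    "Coding Tasks": [99.0, 97.3, 96.7, 98.7],
    "Bug Fixes": [90.0, 96.7, 93.3, 96.3],
    "Feature Implementation": [96.7, 97.3, 97.7, 99.0],
    "Code Thoroughness": [74.0, 84.0, 85.0, 92.0],
    "Bug Introduction Rate": [100.0, 97.3, 96.3, 97.0],
    "Security Awareness": [94.5, 95.7, 90.9, 84.3],
    "Instruction Following": [100.0, 96.7, 100.0, 100.0],
    "Code Quality": [93.5, 99.0, 93.0, 94.3],
    "Performance & Efficiency": [94.0, 94.5, 92.0, 94.7],
}

# Composite row as reported in the table.
REPORTED_COMPOSITE = [89.8, 93.4, 86.2, 85.5]


def composite_by_mean(scores):
    """Unweighted mean per model across all categories (assumed rule)."""
    per_model_columns = zip(*scores.values())
    return [round(mean(column), 1) for column in per_model_columns]


def best_models(scores, models):
    """Map each category to the model(s) holding the highest score, ties included."""
    result = {}
    for category, row in scores.items():
        top = max(row)
        result[category] = [m for m, s in zip(models, row) if s == top]
    return result


if __name__ == "__main__":
    recomputed = composite_by_mean(CATEGORY_SCORES)
    for model, ours, theirs in zip(MODELS, recomputed, REPORTED_COMPOSITE):
        print(f"{model}: recomputed {ours}, reported {theirs}")
    for category, winners in best_models(CATEGORY_SCORES, MODELS).items():
        print(f"{category}: {', '.join(winners)}")
```

Because the inputs are already rounded to one decimal, agreement with the reported Composite is a consistency check rather than proof of the scoring method. Note also that under this rule Token Efficiency carries the same weight as every other category, which largely explains the gap between models whose other scores are close.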