Top Performing Model
Based on composite benchmark scores
anthropic
Claude Sonnet 4.6
Leading today's benchmarks
93.4/100
+6.7% vs prev day
Performance Timeline
Active Regressions (4)
Grok · major
Long Reasoning dropped 19.1% from 86.7 to 70.1
Detected Jun 5, 2026 · 7-day window
Grok · minor
Bug Fixes dropped 3.7% from 100.0 to 96.3
Detected Jun 5, 2026 · 7-day window
Grok · moderate
Overall dropped 12.0% from 91.8 to 80.8
Detected Jun 5, 2026 · 7-day window
Grok · major
Token Efficiency dropped 62.7% from 73.0 to 27.2
Detected Jun 5, 2026 · 7-day window
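
The percentages in the regression entries above are consistent with a relative drop, (previous − current) / previous. The following is a minimal sketch under that assumption; the function name `relative_drop` and the `regressions` dictionary are illustrative and not part of the dashboard, and the severity labels (minor, moderate, major) are not modeled because their thresholds are not stated.

```python
def relative_drop(previous: float, current: float) -> float:
    """Return the percentage drop from previous to current, relative to previous."""
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (previous - current) / previous * 100.0


# (previous, current) scores copied from the Grok regression entries.
regressions = {
    "Long Reasoning": (86.7, 70.1),
    "Bug Fixes": (100.0, 96.3),
    "Overall": (91.8, 80.8),
    "Token Efficiency": (73.0, 27.2),
}

# Round to one decimal place, matching the precision shown on the page.
drops = {
    name: round(relative_drop(prev, cur), 1)
    for name, (prev, cur) in regressions.items()
}

for name, drop in drops.items():
    print(f"{name}: dropped {drop}%")
```

Under this assumption the computed values match the four reported drops (19.1%, 3.7%, 12.0%, 62.7%).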
Category Performance Heatmap
Latest Benchmark Run
Jun 6, 8:00 AM · daily
Claude Opus 4.8
Composite benchmark summary
Composite
89.8
Token Benchmark
68.3
Total tokens
134.7k
~7.1k/test
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (68.3)
Claude Sonnet 4.6
Composite benchmark summary
Composite
93.4
Token Benchmark
100.0
Total tokens
121.0k
~4.8k/test
Best category: Token Efficiency (100.0)
Worst category: Long Reasoning (68.9)
GPT-5.5
Composite benchmark summary
Composite
86.2
Token Benchmark
36.8
Total tokens
368.2k
~13.2k/test
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (36.8)
Grok
Composite benchmark summary
Composite
85.5
Token Benchmark
14.1
Total tokens
991.9k
~34.2k/test
Best category: Instruction Following (100.0)
Worst category: Token Efficiency (14.1)
| Model | Composite | Rank | Best Category | Worst Category | Tokens |
|---|---|---|---|---|---|
| Claude Opus 4.8 | 89.8 | #2 | Instruction Following (100.0) | Token Efficiency (68.3) | 134.7k |
| Claude Sonnet 4.6 | 93.4 | #1 | Token Efficiency (100.0) | Long Reasoning (68.9) | 121.0k |
| GPT-5.5 | 86.2 | #3 | Instruction Following (100.0) | Token Efficiency (36.8) | 368.2k |
| Grok | 85.5 | #4 | Instruction Following (100.0) | Token Efficiency (14.1) | 991.9k |