Anthropic
Claude Opus 4.8
Comprehensive benchmark performance across 11 evaluation categories
Composite Score: 0.0/100
Rank: #2
Token Benchmark: 68.3 (lower burn, higher score)
Total Tokens: 134.7k (~7.1k/test)
Category Breakdown
| # | Category | Score | Tests | Weight |
|---|---|---|---|---|
| 1 | Bug Introduction Rate | 100.0 | 1 | 1.1x |
| 2 | Instruction Following | 100.0 | 1 | 0.8x |
| 3 | Coding Tasks | 99.0 | 2 | 1.2x |
| 4 | Feature Implementation | 96.7 | 3 | 1.0x |
| 5 | Security Awareness | 94.5 | 2 | 1.1x |
| 6 | Performance & Efficiency | 94.0 | 3 | 1.0x |
| 7 | Code Quality | 93.5 | 2 | 0.9x |
| 8 | Bug Fixes | 90.0 | 1 | 1.2x |
| 9 | Long Reasoning | 77.8 | 2 | 1.0x |
| 10 | Code Thoroughness | 74.0 | 2 | 0.9x |
| 11 | Token Efficiency | 68.3 | 19 | 1.0x |
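The per-category multipliers (1.1x, 0.8x, and so on) suggest that category scores are combined with weights. The dashboard does not state its aggregation formula, so the sketch below assumes a plain weighted mean; the function name `weighted_composite` is hypothetical, and the scores and weights are copied from the table above.

```python
# Sketch only: assumes the composite is a weighted mean of category scores.
# The dashboard does not confirm this formula.

# (category, score, weight) rows copied from the Category Breakdown table.
CATEGORIES = [
    ("Bug Introduction Rate", 100.0, 1.1),
    ("Instruction Following", 100.0, 0.8),
    ("Coding Tasks", 99.0, 1.2),
    ("Feature Implementation", 96.7, 1.0),
    ("Security Awareness", 94.5, 1.1),
    ("Performance & Efficiency", 94.0, 1.0),
    ("Code Quality", 93.5, 0.9),
    ("Bug Fixes", 90.0, 1.2),
    ("Long Reasoning", 77.8, 1.0),
    ("Code Thoroughness", 74.0, 0.9),
    ("Token Efficiency", 68.3, 1.0),
]


def weighted_composite(rows):
    """Return sum(score * weight) / sum(weight) over all rows."""
    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(score * weight for _, score, weight in rows) / total_weight


if __name__ == "__main__":
    print(f"Weighted mean of category scores: {weighted_composite(CATEGORIES):.1f}")
```

A weighted mean keeps the result on the same 0 to 100 scale as the individual categories, which is why it is used here as the stand-in aggregation.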
Individual Test Results
Token Efficiency: 68.3
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test: 7.1k
Total Tokens: 134.7k
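The description says the lowest average token burn receives 100 and heavier usage is "penalized proportionally", but it does not give the exact rule. The sketch below assumes the simplest proportional form, `100 * lowest_avg / model_avg`; the function names and the `HYPOTHETICAL_LOWEST_AVG` value are illustrative and do not come from the dashboard. Only the total token count and the test count are taken from this page.

```python
# Sketch only: the proportional penalty rule below is an assumption.

def avg_tokens_per_test(total_tokens, num_tests):
    """Average token burn across the tests in a run."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return total_tokens / num_tests


def token_efficiency_score(model_avg, lowest_avg):
    """Assumed rule: the lowest-burn model scores 100, others scale inversely."""
    if model_avg <= 0 or lowest_avg <= 0:
        raise ValueError("averages must be positive")
    return min(100.0, 100.0 * lowest_avg / model_avg)


if __name__ == "__main__":
    # Figures from this page: 134.7k total tokens over 19 tests.
    avg = avg_tokens_per_test(134_700, 19)
    print(f"Average tokens per test: {avg / 1000:.1f}k")

    # Hypothetical baseline, NOT a value reported by the dashboard.
    HYPOTHETICAL_LOWEST_AVG = 5_000.0
    score = token_efficiency_score(avg, HYPOTHETICAL_LOWEST_AVG)
    print(f"Score under the assumed rule: {score:.1f}")
```

Under this assumed rule, doubling the average token burn relative to the best model halves the score.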
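| Category | Category Score | Test | Tokens | Time | Test Score |
|---|---|---|---|---|---|
| Long Reasoning | 77.8 | Mathematical Proof | 4.5k | 25.3s | 78.7 |
| Long Reasoning | 77.8 | Legal Reasoning Chain | 6.2k | 52.1s | 77.0 |
| Coding Tasks | 99.0 | Graph Algorithm Implementation | 3.9k | 17.4s | 98.0 |
| Coding Tasks | 99.0 | REST API Design | 4.7k | 26.0s | 100.0 |
| Bug Fixes | 90.0 | Off-by-One Boundary Fix | 4.3k | 26.0s | 90.0 |
| Feature Implementation | 96.7 | OAuth2 Integration | 4.6k | 26.0s | 90.0 |
| Feature Implementation | 96.7 | Webhook System | 4.2k | 25.3s | 100.0 |
| Feature Implementation | 96.7 | Search Autocomplete | 7.2k | 50.2s | 100.0 |
| Code Thoroughness | 74.0 | Test Suite Completeness | 6.1k | 42.3s | 86.0 |
| Code Thoroughness | 74.0 | Edge Case Coverage | 16.0k | 2m 24s | 62.0 |
| Bug Introduction Rate | 100.0 | Refactor Without Regression | 4.5k | 28.9s | 100.0 |
| Security Awareness | 94.5 | XSS Mitigation | 6.7k | 54.4s | 100.0 |
| Security Awareness | 94.5 | Secret Management | 5.4k | 2m 18s | 89.0 |
| Instruction Following | 100.0 | Structured Output Compliance | 2.8k | 10.3s | 100.0 |
| Code Quality | 93.5 | Idiomatic Python | 10.9k | 1m 20s | 97.0 |
| Code Quality | 93.5 | Clean Architecture Patterns | 16.4k | 2m 40s | 90.0 |
| Performance & Efficiency | 94.0 | Algorithm Complexity | 4.6k | 29.6s | 100.0 |
| Performance & Efficiency | 94.0 | Query Optimization | 5.9k | 48.8s | 85.0 |
| Performance & Efficiency | 94.0 | Memory-efficient Processing | 15.6k | 2m 13s | 97.0 |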
Outage History
1 error, ongoing
Started Jun 5, 10:00 PM · checks affected