Anthropic

Claude Opus 4.8

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

#2

Token Benchmark

68.3

Lower burn, higher score

Total Tokens

134.7k

~7.1k/test

Category Radar

Historical Composite Score

Category Breakdown

1 test · 1.1x weight
1 test · 0.8x weight
2 tests · 1.2x weight
3 tests · 1.0x weight
2 tests · 1.1x weight
3 tests · 1.0x weight
2 tests · 0.9x weight
90.0
1 test · 1.2x weight
2 tests · 1.0x weight
2 tests · 0.9x weight
19 tests · 1.0x weight

Individual Test Results

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
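
The page does not state the exact formula behind this score. As a minimal sketch, assuming "penalized proportionally" means the score is inversely proportional to average token use, the rule could look like the following. The function name, the model names, and the token figures are hypothetical and are not taken from the benchmark.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Map each model's average tokens per test to a 0-100 score.

    Assumed rule (not confirmed by the source):
        score = 100 * (lowest average across models) / (this model's average)
    so the most frugal model gets 100 and a model that burns twice as
    many tokens gets 50.
    """
    if not avg_tokens_by_model:
        return {}
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")

    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Hypothetical leaderboard with made-up averages (tokens per test).
scores = token_efficiency_scores({
    "model_a": 4000.0,
    "model_b": 5000.0,
    "model_c": 8000.0,
})
print(scores)
```

Under this assumed rule, a model's score depends on the other models in the run, so the same token usage can score differently once a more frugal model is added to the leaderboard.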

Avg Tokens/Test

7.1k

Total Tokens

134.7k

Mathematical Proof
4.5k tok · 25.3s · 78.7
Legal Reasoning Chain
6.2k tok · 52.1s · 77.0
Graph Algorithm Implementation
3.9k tok · 17.4s · 98.0
REST API Design
4.7k tok · 26.0s · 100.0
Off-by-One Boundary Fix
4.3k tok · 26.0s · 90.0
OAuth2 Integration
4.6k tok · 26.0s · 90.0
Webhook System
4.2k tok · 25.3s · 100.0
Search Autocomplete
7.2k tok · 50.2s · 100.0
Test Suite Completeness
6.1k tok · 42.3s · 86.0
Edge Case Coverage
16.0k tok · 2m 24s · 62.0
Refactor Without Regression
4.5k tok · 28.9s · 100.0
XSS Mitigation
6.7k tok · 54.4s · 100.0
Secret Management
5.4k tok · 2m 18s · 89.0
Structured Output Compliance
2.8k tok · 10.3s · 100.0
Idiomatic Python
10.9k tok · 1m 20s · 97.0
Clean Architecture Patterns
16.4k tok · 2m 40s · 90.0
Algorithm Complexity
4.6k tok · 29.6s · 100.0
Query Optimization
5.9k tok · 48.8s · 85.0
Memory-efficient Processing
15.6k tok · 2m 13s · 97.0

Outage History

1
error · ongoing

Started Jun 5, 10:00 PM · checks affected