Anthropic

Claude Sonnet 4.6

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

#1

Token Benchmark

100.0

Lower burn, higher score

Total Tokens

121.0k

~4.8k/test

Category Radar


Historical Composite Score

Category Breakdown

25 tests · 1.0x weight
1 test · 0.9x weight
3 tests · 1.2x weight
3 tests · 1.0x weight
3 tests · 1.1x weight
96.7
3 tests · 1.2x weight
3 tests · 0.8x weight
3 tests · 1.1x weight
2 tests · 1.0x weight
1 test · 0.9x weight
3 tests · 1.0x weight
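
The page does not state how the per-category weights feed into the composite score. One plausible reading is a weight-normalized mean of the category scores, sketched below. The function name `weighted_composite` is hypothetical, the formula is an assumption rather than the dashboard's documented method, and the sample scores are placeholders, not values from this run. Only the weights are taken from the breakdown above.

```python
def weighted_composite(scores, weights):
    """Weight-normalized mean of category scores (assumed formula, not confirmed by the page)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Weights as listed in the category breakdown, in page order.
WEIGHTS = [1.0, 0.9, 1.2, 1.0, 1.1, 1.2, 0.8, 1.1, 1.0, 0.9, 1.0]

# Placeholder scores for illustration only; the page's per-category scores are not used here.
placeholder_scores = [90.0] * len(WEIGHTS)
composite = weighted_composite(placeholder_scores, WEIGHTS)
```

Under this reading, a category with a 1.2x weight moves the composite 1.5 times as much as one with a 0.8x weight for the same change in score.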

Individual Test Results

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
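
As a rough illustration of the rule above, the sketch below scores each model by the ratio of the lowest average token burn to its own average. The exact penalty formula is not given on the page, so the inverse-ratio form is an assumption. The function name `token_efficiency_scores` and the second model are hypothetical. The only figures taken from this run are the 121.0k total tokens and the 25 tests.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Give the lowest average token burn 100 and scale the rest by best/avg (assumed formula)."""
    if not avg_tokens_by_model:
        return {}
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")
    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Average burn for this run: 121.0k total tokens over 25 tests, about 4.8k per test.
avg_this_run = 121_000 / 25

# "model_b" is a hypothetical comparison model that burns twice as many tokens per test.
scores = token_efficiency_scores({"this_run": avg_this_run, "model_b": 2 * avg_this_run})
```

With these inputs the run with the lowest burn scores 100.0 and the hypothetical model that uses twice the tokens scores 50.0.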

Avg Tokens/Test

4.8k

Total Tokens

121.0k

Test                              Tokens   Time     Score
Mathematical Proof                3.2k     44.6s    83.6
Legal Reasoning Chain             4.8k     1m 30s   74.0
Multi-step Logic Puzzle           12.0k    3m 12s   49.0
Graph Algorithm Implementation    2.0k     29.9s    92.0
REST API Design                   1.3k     16.9s    100.0
Concurrent Data Pipeline          5.8k     1m 20s   100.0
Race Condition Detection          3.0k     52.3s    100.0
Off-by-One Boundary Fix           3.2k     50.7s    90.0
Memory Leak Fix                   2.9k     51.7s    100.0
OAuth2 Integration                4.2k     1m 2s    93.0
Webhook System                    2.7k     41.3s    99.0
Search Autocomplete               7.3k     2m 0s    100.0
Test Suite Completeness           9.0k     1m 51s   84.0
Refactor Without Regression       3.9k     1m 3s    99.0
Merge Conflict Resolution         5.6k     1m 13s   96.0
Dependency Upgrade Safety         9.0k     2m 2s    97.0
SQL Injection Prevention          3.0k     48.8s    100.0
Secret Management                 3.6k     1m 4s    87.2
XSS Mitigation                    6.5k     1m 47s   100.0
Structured Output Compliance      910      16.3s    100.0
Constraint Adherence              535      9.0s     100.0
Multi-step Instruction Chain      1.0k     17.5s    90.0
Idiomatic Python                  10.9k    2m 14s   99.0
Algorithm Complexity              6.0k     1m 27s   100.0
Query Optimization                8.5k     2m 24s   89.0

Outage History

1
error · ongoing

Started Jun 5, 10:00 PM · checks affected