Anthropic
Claude Sonnet 4.6
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#1
Token Benchmark
100.0
Lower burn, higher score
Total Tokens
121.0k
~4.8k/test
Category Breakdown
| # | Category | Score | Tests | Weight |
|---|---|---|---|---|
| 1 | Token Efficiency | 100.0 | 25 | 1.0x |
| 2 | Code Quality | 99.0 | 1 | 0.9x |
| 3 | Coding Tasks | 97.3 | 3 | 1.2x |
| 4 | Feature Implementation | 97.3 | 3 | 1.0x |
| 5 | Bug Introduction Rate | 97.3 | 3 | 1.1x |
| 6 | Bug Fixes | 96.7 | 3 | 1.2x |
| 7 | Instruction Following | 96.7 | 3 | 0.8x |
| 8 | Security Awareness | 95.7 | 3 | 1.1x |
| 9 | Performance & Efficiency | 94.5 | 2 | 1.0x |
| 10 | Code Thoroughness | 84.0 | 1 | 0.9x |
| 11 | Long Reasoning | 68.9 | 3 | 1.0x |
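The page does not state how the per-category weights feed into the composite score. One plausible reading is a weighted mean of the category scores, using the multipliers listed in the table (1.0x, 0.9x, and so on). The sketch below works under that assumption only; `weighted_composite` is a hypothetical helper, not the dashboard's actual formula.

```python
# (score, weight) pairs copied from the Category Breakdown table.
CATEGORIES = {
    "Token Efficiency": (100.0, 1.0),
    "Code Quality": (99.0, 0.9),
    "Coding Tasks": (97.3, 1.2),
    "Feature Implementation": (97.3, 1.0),
    "Bug Introduction Rate": (97.3, 1.1),
    "Bug Fixes": (96.7, 1.2),
    "Instruction Following": (96.7, 0.8),
    "Security Awareness": (95.7, 1.1),
    "Performance & Efficiency": (94.5, 1.0),
    "Code Thoroughness": (84.0, 0.9),
    "Long Reasoning": (68.9, 1.0),
}


def weighted_composite(categories):
    """Weighted mean of category scores.

    ASSUMPTION: the dashboard does not document its aggregation rule;
    a weighted mean is only one plausible choice.
    """
    total_weight = sum(weight for _, weight in categories.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight


composite = weighted_composite(CATEGORIES)
print(f"{composite:.1f}/100")
```

Under this assumption the table's values work out to roughly 93.6, which may differ from whatever figure the dashboard itself would display.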
Individual Test Results
Token Efficiency 100.0
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
4.8k
Total Tokens
121.0k
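The Token Efficiency description above says the lowest average token burn receives 100 and heavier usage is penalized proportionally, but it does not give the exact formula. The sketch below assumes the penalty is the simple ratio `best_average / model_average`. The function name and the second model are hypothetical and exist only for illustration.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Score each model as 100 * (lowest average / its own average).

    ASSUMPTION: "penalized proportionally" is read here as an inverse
    ratio against the most frugal model; the dashboard may use another rule.
    """
    if not avg_tokens_by_model:
        raise ValueError("need at least one model")
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")
    best = min(avg_tokens_by_model.values())
    return {
        model: 100.0 * best / avg
        for model, avg in avg_tokens_by_model.items()
    }


# Average tokens per test for this run: 121.0k total over 25 tests.
avg_this_run = 121_000 / 25  # 4840.0, shown on the page as ~4.8k/test

scores = token_efficiency_scores({
    "claude-sonnet-4.6": avg_this_run,
    "hypothetical-model-b": 2 * avg_this_run,  # made-up comparison point
})
print(scores)
```

With these inputs the most frugal model scores 100 and a model that burns twice as many tokens per test scores 50.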
Long Reasoning 68.9
Mathematical Proof
3.2k tok · 44.6s · 83.6
Legal Reasoning Chain
4.8k tok · 1m 30s · 74.0
Multi-step Logic Puzzle
12.0k tok · 3m 12s · 49.0
Coding Tasks 97.3
Graph Algorithm Implementation
2.0k tok · 29.9s · 92.0
REST API Design
1.3k tok · 16.9s · 100.0
Concurrent Data Pipeline
5.8k tok · 1m 20s · 100.0
Bug Fixes 96.7
Race Condition Detection
3.0k tok · 52.3s · 100.0
Off-by-One Boundary Fix
3.2k tok · 50.7s · 90.0
Memory Leak Fix
2.9k tok · 51.7s · 100.0
Feature Implementation 97.3
OAuth2 Integration
4.2k tok · 1m 2s · 93.0
Webhook System
2.7k tok · 41.3s · 99.0
Search Autocomplete
7.3k tok · 2m 0s · 100.0
Code Thoroughness 84.0
Test Suite Completeness
9.0k tok · 1m 51s · 84.0
Bug Introduction Rate 97.3
Refactor Without Regression
3.9k tok · 1m 3s · 99.0
Merge Conflict Resolution
5.6k tok · 1m 13s · 96.0
Dependency Upgrade Safety
9.0k tok · 2m 2s · 97.0
Security Awareness 95.7
SQL Injection Prevention
3.0k tok · 48.8s · 100.0
Secret Management
3.6k tok · 1m 4s · 87.2
XSS Mitigation
6.5k tok · 1m 47s · 100.0
Instruction Following 96.7
Structured Output Compliance
910 tok · 16.3s · 100.0
Constraint Adherence
535 tok · 9.0s · 100.0
Multi-step Instruction Chain
1.0k tok · 17.5s · 90.0
Code Quality 99.0
Idiomatic Python
10.9k tok · 2m 14s · 99.0
Performance & Efficiency 94.5
Algorithm Complexity
6.0k tok · 1m 27s · 100.0
Query Optimization
8.5k tok · 2m 24s · 89.0
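The category scores in the breakdown table appear to be the arithmetic mean of that category's individual test scores, rounded to one decimal place. The page does not say so explicitly, so treat this as an inference from the numbers. The short check below uses the three categories whose results open this section; `category_score` is a hypothetical helper.

```python
from statistics import mean

# Individual test scores as listed under each category heading.
TEST_SCORES = {
    "Long Reasoning": [83.6, 74.0, 49.0],
    "Coding Tasks": [92.0, 100.0, 100.0],
    "Bug Fixes": [100.0, 90.0, 100.0],
}

# Category scores as reported in the Category Breakdown table.
REPORTED = {
    "Long Reasoning": 68.9,
    "Coding Tasks": 97.3,
    "Bug Fixes": 96.7,
}


def category_score(scores):
    """ASSUMPTION: a category score is the plain mean of its test scores."""
    return round(mean(scores), 1)


for name, scores in TEST_SCORES.items():
    computed = category_score(scores)
    status = "matches" if computed == REPORTED[name] else "differs"
    print(f"{name}: computed {computed}, reported {REPORTED[name]} ({status})")
```

All three computed means agree with the reported values, which is consistent with the assumption but does not prove it is the rule used for every category.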
Outage History
1 error · ongoing
Started Jun 5, 10:00 PM · checks affected