Anthropic

Claude Sonnet 4.6

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

70.3

Lower burn, higher score

Total Tokens

318.7k

~11.0k/test

Category Breakdown

#	Category	Score	Tests	Weight
1	Coding Tasks	100.0	3	1.0x
2	Instruction Following	100.0	3	1.0x
3	Feature Implementation	97.7	3	1.0x
4	Bug Introduction Rate	97.0	3	1.0x
5	Code Quality	96.7	3	1.0x
6	Bug Fixes	96.5	3	1.0x
7	Security Awareness	96.3	3	1.0x
8	Code Thoroughness	92.3	3	1.0x
9	Token Efficiency	70.3	29	1.0x
10	Long Reasoning	68.6	3	1.0x
11	Performance & Efficiency	33.3	3	1.0x
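The Weight column suggests the categories are combined into the composite by some weighted aggregate. The page does not state the formula, and the composite shown here (0.0/100) does not match any simple aggregate of these rows, so the following is only a sketch that assumes a plain weighted mean. The names `CATEGORIES` and `weighted_composite` are hypothetical.

```python
# Rows copied from the category table: (category, score, weight).
CATEGORIES = [
    ("Coding Tasks", 100.0, 1.0),
    ("Instruction Following", 100.0, 1.0),
    ("Feature Implementation", 97.7, 1.0),
    ("Bug Introduction Rate", 97.0, 1.0),
    ("Code Quality", 96.7, 1.0),
    ("Bug Fixes", 96.5, 1.0),
    ("Security Awareness", 96.3, 1.0),
    ("Code Thoroughness", 92.3, 1.0),
    ("Token Efficiency", 70.3, 1.0),
    ("Long Reasoning", 68.6, 1.0),
    ("Performance & Efficiency", 33.3, 1.0),
]


def weighted_composite(rows):
    """Weighted mean of category scores (ASSUMED formula, not confirmed by the page)."""
    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for _, score, weight in rows) / total_weight


composite = round(weighted_composite(CATEGORIES), 1)
print(composite)  # 86.2 under the weighted-mean assumption
```

Because every weight is 1.0x, the weighted mean reduces to the simple average of the eleven category scores.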

Individual Test Results

Token Efficiency: 70.3

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

11.0k

Total Tokens

318.7k
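The Token Efficiency description above says the leanest model receives 100 and heavier usage is "penalized proportionally", but it does not give the exact formula. One plausible reading, sketched below, scores each model as the ratio of the lowest average to its own average. The function name, the model names, and the token figures are hypothetical and are not taken from the benchmark.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Score each model as 100 * (lowest average / its own average).

    ASSUMPTION: the benchmark only says usage is "penalized proportionally";
    an inverse ratio is one interpretation, not the confirmed formula.
    """
    if not avg_tokens_by_model:
        return {}
    if any(avg <= 0 for avg in avg_tokens_by_model.values()):
        raise ValueError("average token counts must be positive")
    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Hypothetical averages (tokens per test) for three made-up models.
example = {"model_a": 8000, "model_b": 10000, "model_c": 16000}
scores = token_efficiency_scores(example)
print(scores)  # {'model_a': 100.0, 'model_b': 80.0, 'model_c': 50.0}
```

Under this reading a model's score depends on the other models in the run, so the same token usage can score differently when the leanest competitor changes.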

Long Reasoning: 68.6

Legal Reasoning Chain

4.5k tok · 1m 24s · 68.0

Multi-step Logic Puzzle

26.9k tok · 7m 2s · 51.8

Mathematical Proof

3.0k tok · 7m 31s · 86.0

Coding Tasks: 100.0

Graph Algorithm Implementation

5.0k tok · 4m 45s · 100.0

REST API Design

1.2k tok · 3m 16s · 100.0

Concurrent Data Pipeline

5.8k tok · 3m 4s · 100.0

Bug Fixes: 96.5

Off-by-One Boundary Fix

3.3k tok · 6m 7s · 89.4

Race Condition Detection

2.7k tok · 3m 32s · 100.0

Memory Leak Fix

2.9k tok · 2m 57s · 100.0

Feature Implementation: 97.7

Search Autocomplete

5.3k tok · 6m 11s · 100.0

Webhook System

2.0k tok · 6m 52s · 100.0

OAuth2 Integration

5.6k tok · 16m 53s · 93.0

Code Thoroughness: 92.3

Error Path Completeness

52.9k tok · 19m 47s · 98.8

Test Suite Completeness

13.3k tok · 7m 3s · 83.0

Edge Case Coverage

43.8k tok · 16m 23s · 95.0

Bug Introduction Rate: 97.0

Refactor Without Regression

5.5k tok · 11m 16s · 98.0

Merge Conflict Resolution

3.2k tok · 3m 19s · 96.0

Dependency Upgrade Safety

11.9k tok · 3m 39s · 97.0

Security Awareness: 96.3

SQL Injection Prevention

2.6k tok · 2m 33s · 100.0

XSS Mitigation

6.7k tok · 3m 5s · 100.0

Secret Management

4.5k tok · 3m 23s · 89.0

Instruction Following: 100.0

Structured Output Compliance

1.3k tok · 3m 33s · 100.0

Constraint Adherence

1.1k tok · 2m 1s · 100.0

Multi-step Instruction Chain

990 tok · 46.8s · 100.0

Code Quality: 96.7

Idiomatic Python

12.2k tok · 4m 15s · 97.0

TypeScript Best Practices

37.8k tok · 10m 4s · 96.0

Clean Architecture Patterns

39.1k tok · 11m 35s · 97.0

Performance & Efficiency: 33.3

Algorithm Complexity

5.3k tok · 4m 24s · 100.0

Memory-efficient Processing

860ms · 0.0

Query Optimization

8.4k tok · 2m 13s · 0.0
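The per-test scores above appear consistent with each category score being the unweighted mean of its three tests, rounded to one decimal. The page does not state this rule, so the check below is a sketch under that assumption. `TEST_SCORES` and `category_mean` are hypothetical names, and only four categories are included for brevity.

```python
# Per-test scores copied from the results above (subset of categories).
TEST_SCORES = {
    "Long Reasoning": [68.0, 51.8, 86.0],
    "Bug Fixes": [89.4, 100.0, 100.0],
    "Code Thoroughness": [98.8, 83.0, 95.0],
    "Performance & Efficiency": [100.0, 0.0, 0.0],
}

# Category scores as displayed on the page.
REPORTED = {
    "Long Reasoning": 68.6,
    "Bug Fixes": 96.5,
    "Code Thoroughness": 92.3,
    "Performance & Efficiency": 33.3,
}


def category_mean(scores):
    """Unweighted mean rounded to one decimal (ASSUMED aggregation rule)."""
    return round(sum(scores) / len(scores), 1)


computed = {name: category_mean(scores) for name, scores in TEST_SCORES.items()}
for name, value in computed.items():
    status = "matches" if value == REPORTED[name] else "differs"
    print(f"{name}: computed {value}, reported {REPORTED[name]} ({status})")
```

With only three tests per category, a single failed test pulls the category down sharply: the two 0.0 results account for the whole Performance & Efficiency gap.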

Regression History

Security Awareness · moderate · resolved

Score dropped 7.9% from 95.6 to 88.1

Detected Jun 27, 2026 · Resolved Jun 27, 2026

Security Awareness · minor · resolved

Score dropped 3.0% from 95.2 to 92.3

Detected Jun 16, 2026 · Resolved Jun 16, 2026

Overall Composite · moderate

Score dropped 9.7% from 88.4 to 79.8

Detected Jun 14, 2026

Code Thoroughness · major

Score dropped 17.8% from 74.9 to 61.6

Detected Jun 14, 2026

Code Quality · major

Score dropped 55.7% from 88.7 to 39.3

Detected Jun 14, 2026

Performance & Efficiency · major

Score dropped 18.7% from 81.5 to 66.3

Detected Jun 12, 2026

Long Reasoning · moderate · resolved

Score dropped 14.8% from 67.9 to 57.9

Detected Jun 12, 2026 · Resolved Jun 28, 2026

Token Efficiency · major

Score dropped 24.8% from 98.8 to 74.3

Detected Jun 8, 2026

Token Efficiency · minor · resolved

Score dropped 4.7% from 100.0 to 95.3

Detected Jun 7, 2026 · Resolved Jun 7, 2026
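The percentages in the regression entries above look like relative changes against the earlier score. The sketch below assumes that reading; `percent_change` is a hypothetical helper, not part of the benchmark. A few displayed values differ from this calculation by 0.1, which may simply reflect unrounded scores behind the page.

```python
def percent_change(previous, current):
    """Relative change from previous to current, in percent.

    ASSUMPTION: the page's drop percentages are (current - previous) / previous.
    """
    if previous == 0:
        raise ValueError("previous score must be non-zero")
    return (current - previous) / previous * 100.0


# (previous, current) score pairs copied from three regression entries.
code_quality = round(percent_change(88.7, 39.3), 1)
token_efficiency_major = round(percent_change(98.8, 74.3), 1)
token_efficiency_minor = round(percent_change(100.0, 95.3), 1)

print(code_quality, token_efficiency_major, token_efficiency_minor)
# -55.7 -24.8 -4.7
```

The result is negative for a drop, so reporting it as "dropped 55.7%" uses the magnitude only.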

Outage History

error · ongoing

Started Jun 30, 5:00 PM · checks affected

error

Started Jun 28, 6:00 AM · Ended Jun 28, 7:30 AM · checks affected

error

Started Jun 27, 5:30 AM · Ended Jun 27, 7:00 AM · checks affected

timeout

Started Jun 23, 2:31 PM · Ended Jun 23, 3:01 PM · checks affected

error

Started Jun 23, 5:00 AM · Ended Jun 23, 8:00 AM · checks affected

error

Started Jun 22, 5:00 AM · Ended Jun 22, 7:00 AM · checks affected

error

Started Jun 21, 5:00 AM · Ended Jun 21, 6:00 AM · checks affected

error

Started Jun 18, 5:00 AM · Ended Jun 18, 6:30 AM · checks affected

error

Started Jun 14, 5:30 AM · Ended Jun 14, 7:30 AM · checks affected

error

Started Jun 11, 4:30 PM · Ended Jun 11, 7:00 PM · checks affected

error

Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected