Anthropic

Claude Opus 4.8

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

100.0

Lower burn, higher score

Total Tokens

231.9k

~7.7k/test

Category Breakdown


#	Category	Score	Tests	Weight
1	Token Efficiency	100.0	30	1.0x
2	Instruction Following	100.0	3	1.0x
3	Coding Tasks	99.3	3	1.0x
4	Feature Implementation	99.3	3	1.0x
5	Bug Introduction Rate	97.0	3	1.0x
6	Bug Fixes	96.3	3	1.0x
7	Performance & Efficiency	94.0	3	1.0x
8	Security Awareness	90.1	3	1.0x
9	Code Thoroughness	89.9	3	1.0x
10	Code Quality	82.7	3	1.0x
11	Long Reasoning	66.3	3	1.0x
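The page does not say how the composite score is derived from the categories. As a rough sketch, assuming it is a weighted mean of the category scores in the table above (an assumption, not something the page confirms), the calculation could look like this:

```python
# Category scores and weights copied from the table above.
CATEGORIES = [
    ("Token Efficiency", 100.0, 1.0),
    ("Instruction Following", 100.0, 1.0),
    ("Coding Tasks", 99.3, 1.0),
    ("Feature Implementation", 99.3, 1.0),
    ("Bug Introduction Rate", 97.0, 1.0),
    ("Bug Fixes", 96.3, 1.0),
    ("Performance & Efficiency", 94.0, 1.0),
    ("Security Awareness", 90.1, 1.0),
    ("Code Thoroughness", 89.9, 1.0),
    ("Code Quality", 82.7, 1.0),
    ("Long Reasoning", 66.3, 1.0),
]


def weighted_composite(categories):
    """Weighted mean of category scores.

    ASSUMPTION: the real composite formula is not stated on the page;
    a weighted mean is only one plausible reading of the "weight" column.
    """
    total_weight = sum(weight for _, _, weight in categories)
    if total_weight == 0:
        raise ValueError("total weight must be non-zero")
    return sum(score * weight for _, score, weight in categories) / total_weight


print(round(weighted_composite(CATEGORIES), 2))  # 92.26
```

Because every category carries a 1.0x weight, this reduces to a plain average of the eleven scores.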

Individual Test Results

Token Efficiency: 100.0

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

7.7k

Total Tokens

231.9k
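The description says heavier token usage is "penalized proportionally" but gives no formula. A minimal sketch, assuming the score is the ratio of the lowest average burn to the model's own average burn (the function names and the ratio rule are hypothetical), might be:

```python
def avg_tokens_per_test(total_tokens, num_tests):
    """Average token burn per test."""
    if num_tests <= 0:
        raise ValueError("num_tests must be positive")
    return total_tokens / num_tests


def token_efficiency_score(model_avg, best_avg):
    """Score a model's average burn against the lowest average burn.

    ASSUMPTION: "penalized proportionally" is read here as
    100 * best_avg / model_avg; the page does not give the real rule.
    """
    if model_avg <= 0 or best_avg <= 0:
        raise ValueError("averages must be positive")
    return min(100.0, 100.0 * best_avg / model_avg)


# Figures from this page: 231.9k total tokens over 30 tests.
avg = avg_tokens_per_test(231_900, 30)
print(avg)                                               # 7730.0
print(token_efficiency_score(avg, best_avg=avg))         # 100.0
print(token_efficiency_score(2 * avg, best_avg=avg))     # 50.0
```

Under this reading, a model that burns twice as many tokens per test as the most frugal one would score 50.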

Long Reasoning: 66.3

Test	Tokens	Duration	Score
Legal Reasoning Chain	6.3k	3m 4s	70.0
Mathematical Proof	5.2k	6m 44s	78.7
Multi-step Logic Puzzle	11.1k	10m 10s	50.2
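The Long Reasoning category score of 66.3 is consistent with a simple mean of its three test scores. The page does not state this rule, so the sketch below treats it as an assumption and only checks it against the published numbers:

```python
from statistics import mean

# Test scores for the Long Reasoning category, as listed on this page.
long_reasoning = {
    "Legal Reasoning Chain": 70.0,
    "Mathematical Proof": 78.7,
    "Multi-step Logic Puzzle": 50.2,
}

# ASSUMPTION: a category score is the unweighted mean of its test scores,
# rounded to one decimal place.
category_score = round(mean(long_reasoning.values()), 1)
print(category_score)  # 66.3
```

The same check reproduces the other category scores on this page, for example (97.0 + 55.0 + 96.0) / 3 ≈ 82.7 for Code Quality.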

Coding Tasks: 99.3

Test	Tokens	Duration	Score
REST API Design	5.1k	2m 25s	100.0
Graph Algorithm Implementation	6.9k	6m 7s	98.0
Concurrent Data Pipeline	4.8k	3m 36s	100.0

Bug Fixes: 96.3

Test	Tokens	Duration	Score
Race Condition Detection	4.6k	2m 38s	100.0
Off-by-One Boundary Fix	4.7k	5m 30s	88.8
Memory Leak Fix	5.4k	3m 46s	100.0

Feature Implementation: 99.3

Test	Tokens	Duration	Score
OAuth2 Integration	5.0k	4m 37s	98.0
Search Autocomplete	8.5k	5m 44s	100.0
Webhook System	5.0k	7m 57s	100.0

Code Thoroughness: 89.9

Test	Tokens	Duration	Score
Edge Case Coverage	17.2k	8m 10s	86.0
Error Path Completeness	21.9k	6m 33s	98.8
Test Suite Completeness	9.1k	18m 46s	85.0

Bug Introduction Rate: 97.0

Test	Tokens	Duration	Score
Refactor Without Regression	5.3k	17m 48s	99.0
Merge Conflict Resolution	4.7k	7m 54s	98.0
Dependency Upgrade Safety	6.6k	8m 35s	94.0

Security Awareness: 90.1

Test	Tokens	Duration	Score
SQL Injection Prevention	5.6k	6m 17s	85.0
XSS Mitigation	7.1k	4m 49s	100.0
Secret Management	7.4k	2m 55s	85.4

Instruction Following: 100.0

Test	Tokens	Duration	Score
Structured Output Compliance	3.3k	1m 45s	100.0
Multi-step Instruction Chain	3.0k	2m 23s	100.0
Constraint Adherence	2.8k	3m 33s	100.0

Code Quality: 82.7

Test	Tokens	Duration	Score
Idiomatic Python	10.2k	1m 55s	97.0
TypeScript Best Practices	10.8k	3m 51s	55.0
Clean Architecture Patterns	17.1k	4m 34s	96.0

Performance & Efficiency: 94.0

Test	Tokens	Duration	Score
Algorithm Complexity	4.9k	1m 38s	99.0
Query Optimization	6.7k	4m 30s	85.0
Memory-efficient Processing	15.8k	11m 46s	98.0

Regression History

Long Reasoning · moderate

Score dropped 6.1% from 69.1 to 64.9

Detected Jun 27, 2026

Feature Implementation · minor · resolved

Score dropped 3.3% from 98.2 to 95.0

Detected Jun 25, 2026 · Resolved Jun 29, 2026

Bug Fixes · moderate · resolved

Score dropped 13.5% from 97.0 to 83.9

Detected Jun 23, 2026 · Resolved Jun 23, 2026

Feature Implementation · minor · resolved

Score dropped 3.0% from 99.0 to 96.0

Detected Jun 22, 2026 · Resolved Jun 22, 2026

Long Reasoning · minor · resolved

Score dropped 4.6% from 68.9 to 65.7

Detected Jun 21, 2026 · Resolved Jun 21, 2026

Instruction Following · minor · resolved

Score dropped 4.1% from 98.8 to 94.7

Detected Jun 20, 2026 · Resolved Jun 26, 2026

Code Thoroughness · moderate · resolved

Score dropped 10.7% from 91.6 to 81.8

Detected Jun 20, 2026 · Resolved Jun 25, 2026

Coding Tasks · major · resolved

Score dropped 27.3% from 99.0 to 72.0

Detected Jun 18, 2026 · Resolved Jun 18, 2026

Security Awareness · minor

Score dropped 3.3% from 93.8 to 90.7

Detected Jun 15, 2026

Code Quality · major

Score dropped 25.6% from 91.8 to 68.3

Detected Jun 12, 2026

Long Reasoning · minor · resolved

Score dropped 3.9% from 66.5 to 63.9

Detected Jun 9, 2026 · Resolved Jun 14, 2026
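The percentages in this history match a relative change against the earlier score. The sketch below reproduces that arithmetic. The severity thresholds are hypothetical: they were chosen only so that they agree with the labels listed here, and the page does not publish the real cut-offs.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


def severity(change_pct):
    """Map a score drop to a severity label.

    ASSUMPTION: the 5% and 20% thresholds are illustrative guesses that
    happen to fit the entries on this page; the real rule is not stated.
    """
    magnitude = abs(change_pct)
    if magnitude >= 20.0:
        return "major"
    if magnitude >= 5.0:
        return "moderate"
    return "minor"


for before, after in [(69.1, 64.9), (99.0, 72.0), (98.2, 95.0)]:
    change = percent_change(before, after)
    print(f"{before} -> {after}: {change:.1f}% ({severity(change)})")
# 69.1 -> 64.9: -6.1% (moderate)
# 99.0 -> 72.0: -27.3% (major)
# 98.2 -> 95.0: -3.3% (minor)
```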

Outage History

error · ongoing

Started Jun 30, 5:00 PM · checks affected

error

Started Jun 28, 6:30 AM · Ended Jun 28, 8:00 AM · checks affected

error

Started Jun 27, 5:30 AM · Ended Jun 27, 7:00 AM · checks affected

timeout

Started Jun 24, 2:01 PM · Ended Jun 24, 2:30 PM · checks affected

timeout

Started Jun 23, 2:30 PM · Ended Jun 23, 3:30 PM · checks affected

error

Started Jun 23, 5:00 AM · Ended Jun 23, 8:00 AM · checks affected

timeout

Started Jun 22, 8:31 AM · Ended Jun 22, 9:00 AM · checks affected

error

Started Jun 22, 5:00 AM · Ended Jun 22, 7:00 AM · checks affected

error

Started Jun 21, 5:00 AM · Ended Jun 21, 6:00 AM · checks affected

error

Started Jun 18, 5:00 AM · Ended Jun 18, 6:30 AM · checks affected

timeout

Started Jun 15, 7:01 AM · Ended Jun 15, 7:30 AM · checks affected

error

Started Jun 14, 5:30 AM · Ended Jun 14, 7:30 AM · checks affected

error

Started Jun 11, 4:30 PM · Ended Jun 11, 7:00 PM · checks affected

error

Started Jun 9, 5:00 AM · Ended Jun 9, 6:00 AM · checks affected

timeout

Started Jun 7, 4:01 AM · Ended Jun 7, 4:30 AM · checks affected
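The outage entries give start and end times but no durations. A small sketch for deriving them follows. The outage timestamps carry no year, so the code assumes 2026 to match the regression dates on this page, and it assumes an English locale for month abbreviations. The helper names are hypothetical.

```python
from datetime import datetime

ASSUMED_YEAR = 2026  # ASSUMPTION: outage lines omit the year


def parse_stamp(text, year=ASSUMED_YEAR):
    """Parse a stamp such as 'Jun 28, 6:30 AM' into a datetime."""
    return datetime.strptime(f"{text} {year}", "%b %d, %I:%M %p %Y")


def outage_minutes(started, ended):
    """Length of an outage in whole minutes."""
    delta = parse_stamp(ended) - parse_stamp(started)
    if delta.total_seconds() < 0:
        raise ValueError("outage ends before it starts")
    return int(delta.total_seconds() // 60)


print(outage_minutes("Jun 28, 6:30 AM", "Jun 28, 8:00 AM"))  # 90
print(outage_minutes("Jun 24, 2:01 PM", "Jun 24, 2:30 PM"))  # 29
```

Ongoing outages have no end time, so they would need to be measured against the current time or skipped.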