OpenAI

GPT-5.5

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

Token Benchmark

51.7

Lower burn, higher score

Total Tokens

448.9k

~15.0k/test

Category Radar (chart not reproduced; the Category Breakdown below lists the same values)

Historical Composite Score (chart not reproduced)

Category Breakdown

#	Category	Score	Tests	Weight
1	Instruction Following	100.0	3	1.0x
2	Coding Tasks	98.3	3	1.0x
3	Feature Implementation	98.3	3	1.0x
4	Bug Introduction Rate	96.0	3	1.0x
5	Security Awareness	95.3	3	1.0x
6	Code Quality	94.3	3	1.0x
7	Bug Fixes	93.3	3	1.0x
8	Code Thoroughness	92.7	3	1.0x
9	Performance & Efficiency	92.3	3	1.0x
10	Long Reasoning	69.6	3	1.0x
11	Token Efficiency	51.7	30	1.0x
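
The Weight column suggests that category scores are combined into the composite as a weighted mean. The page does not state the formula, so the following is only a sketch under that assumption; `weighted_composite` is a hypothetical helper, and its result should not be read as the leaderboard's official composite.

```python
# (score, weight) pairs copied from the Category Breakdown table.
CATEGORIES = {
    "Instruction Following": (100.0, 1.0),
    "Coding Tasks": (98.3, 1.0),
    "Feature Implementation": (98.3, 1.0),
    "Bug Introduction Rate": (96.0, 1.0),
    "Security Awareness": (95.3, 1.0),
    "Code Quality": (94.3, 1.0),
    "Bug Fixes": (93.3, 1.0),
    "Code Thoroughness": (92.7, 1.0),
    "Performance & Efficiency": (92.3, 1.0),
    "Long Reasoning": (69.6, 1.0),
    "Token Efficiency": (51.7, 1.0),
}


def weighted_composite(categories):
    """Weighted mean of category scores (assumed formula, not confirmed by the page)."""
    total_weight = sum(weight for _, weight in categories.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(score * weight for score, weight in categories.values())
    return weighted_sum / total_weight


print(round(weighted_composite(CATEGORIES), 1))
```

With every weight at 1.0x, this reduces to the arithmetic mean of the eleven category scores.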

Individual Test Results

Token Efficiency · 51.7

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.

Avg Tokens/Test

15.0k

Total Tokens

448.9k
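
One way to read "penalized proportionally" is that a model's score is 100 times the ratio of the lowest average token burn to its own average. The page does not give the exact formula, so this is a sketch under that assumption; `token_efficiency_score`, `score_run`, and the model names and token counts in the example are hypothetical.

```python
def token_efficiency_score(avg_tokens, best_avg_tokens):
    """Assumed scoring: 100 for the lowest average burn, scaled by the ratio otherwise."""
    if avg_tokens <= 0 or best_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    if best_avg_tokens > avg_tokens:
        raise ValueError("best_avg_tokens must be the lowest average in the run")
    return 100.0 * best_avg_tokens / avg_tokens


def score_run(avg_tokens_by_model):
    """Score every model in a run against the model with the lowest average burn."""
    best = min(avg_tokens_by_model.values())
    return {
        model: token_efficiency_score(avg, best)
        for model, avg in avg_tokens_by_model.items()
    }


# Hypothetical averages (tokens per successful test), not taken from the leaderboard.
example = {"model-a": 8_000, "model-b": 16_000}
print(score_run(example))  # {'model-a': 100.0, 'model-b': 50.0}
```

Under this reading, a model that burns twice as many tokens per test as the most frugal model scores 50.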

Long Reasoning · 69.6

Multi-step Logic Puzzle

13.1k tok · 24.5s · score 43.8

Mathematical Proof

13.0k tok · 28.9s · score 78.1

Legal Reasoning Chain

13.1k tok · 21.9s · score 87.0

Coding Tasks · 98.3

Graph Algorithm Implementation

13.0k tok · 23.7s · score 95.0

Concurrent Data Pipeline

12.9k tok · 24.1s · score 100.0

REST API Design

12.6k tok · 19.8s · score 100.0

Bug Fixes · 93.3

Off-by-One Boundary Fix

13.1k tok · 23.5s · score 80.0

Race Condition Detection

12.8k tok · 30.1s · score 100.0

Memory Leak Fix

12.9k tok · 39.9s · score 100.0

Feature Implementation · 98.3

Search Autocomplete

13.7k tok · 36.2s · score 100.0

Webhook System

13.0k tok · 21.2s · score 98.0

OAuth2 Integration

13.3k tok · 27.9s · score 97.0

Code Thoroughness · 92.7

Test Suite Completeness

13.9k tok · 37.7s · score 87.0

Edge Case Coverage

30.5k tok · 5m 39s · score 93.0

Error Path Completeness

28.8k tok · 1m 29s · score 98.2

Bug Introduction Rate · 96.0

Refactor Without Regression

13.3k tok · 23.7s · score 93.0

Merge Conflict Resolution

13.7k tok · 26.5s · score 97.0

Dependency Upgrade Safety

24.3k tok · 3m 34s · score 98.0

Security Awareness · 95.3

SQL Injection Prevention

13.0k tok · 25.3s · score 98.8

XSS Mitigation

14.2k tok · 40.7s · score 100.0

Secret Management

13.5k tok · 32.1s · score 87.2

Instruction Following · 100.0

Structured Output Compliance

12.4k tok · 17.1s · score 100.0

Multi-step Instruction Chain

12.2k tok · 13.5s · score 100.0

Constraint Adherence

12.2k tok · 11.7s · score 100.0

Code Quality · 94.3

Idiomatic Python

13.8k tok · 39.7s · score 96.0

TypeScript Best Practices

14.9k tok · 55.5s · score 93.0

Clean Architecture Patterns

14.0k tok · 42.0s · score 94.0

Performance & Efficiency · 92.3

Algorithm Complexity

13.3k tok · 27.4s · score 97.0

Query Optimization

13.8k tok · 34.1s · score 83.0

Memory-efficient Processing

20.5k tok · 2m 37s · score 97.0

Regression History

Code Thoroughness · minor

Score dropped 3.1% from 90.9 to 88.1

Detected Jun 28, 2026

Security Awareness · minor · resolved

Score dropped 4.2% from 92.6 to 88.7

Detected Jun 26, 2026 · Resolved Jun 30, 2026

Security Awareness · moderate · resolved

Score dropped 5.1% from 92.4 to 87.7

Detected Jun 21, 2026 · Resolved Jun 21, 2026

Bug Introduction Rate · minor · resolved

Score dropped 3.4% from 96.6 to 93.3

Detected Jun 21, 2026 · Resolved Jun 21, 2026

Token Efficiency · moderate

Score dropped 9.0% from 53.8 to 49.0

Detected Jun 19, 2026

Security Awareness · minor · resolved

Score dropped 4.3% from 92.3 to 88.3

Detected Jun 9, 2026 · Resolved Jun 20, 2026

Long Reasoning · moderate

Score dropped 5.7% from 65.0 to 61.3

Detected Jun 7, 2026
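
The percentages in these entries are consistent with a relative change measured against the earlier score. A minimal sketch, assuming that definition (`relative_change_pct` is a hypothetical helper; small differences from the displayed figures may come from rounding of the underlying scores):

```python
def relative_change_pct(before, after):
    """Percentage change from `before` to `after`, relative to `before`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Code Thoroughness entry: 90.9 -> 88.1
print(round(relative_change_pct(90.9, 88.1), 1))  # -3.1

# Long Reasoning entry: 65.0 -> 61.3
print(round(relative_change_pct(65.0, 61.3), 1))  # -5.7
```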

Outage History

error · ongoing

Started Jul 18, 10:00 PM · checks affected

error

Started Jul 17, 3:00 AM · Ended Jul 17, 3:00 AM · checks affected

error

Started Jul 12, 6:30 PM · Ended Jul 13, 11:30 PM · checks affected

error

Started Jul 12, 3:00 AM · Ended Jul 12, 3:00 AM · checks affected

error

Started Jul 11, 3:00 AM · Ended Jul 11, 3:00 AM · checks affected

error

Started Jul 10, 3:00 AM · Ended Jul 10, 3:00 AM · checks affected

error

Started Jul 9, 7:00 PM · Ended Jul 9, 7:30 PM · checks affected

error

Started Jul 9, 3:00 AM · Ended Jul 9, 3:00 AM · checks affected

error

Started Jul 8, 3:00 AM · Ended Jul 8, 3:00 AM · checks affected

error

Started Jul 7, 3:00 AM · Ended Jul 7, 3:00 AM · checks affected

error

Started Jul 3, 10:30 PM · Ended Jul 4, 12:00 AM · checks affected

error

Started Jul 2, 1:00 AM · Ended Jul 2, 2:00 AM · checks affected

error

Started Jun 24, 5:30 PM · Ended Jun 24, 6:00 PM · checks affected

error

Started Jun 18, 5:30 PM · Ended Jun 18, 6:30 PM · checks affected

error

Started Jun 16, 3:00 AM · Ended Jun 16, 3:00 AM · checks affected