OpenAI
GPT-5.5
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank #3
Token Benchmark
36.8
Lower burn, higher score
Total Tokens
368.2k
~13.2k/test
Category Radar (chart not reproduced here; the Category Breakdown below lists the same values)
Historical Composite Score (chart not reproduced here)
Category Breakdown
| # | Category | Score | Tests | 7-Day Trend | Weight |
|---|---|---|---|---|---|
| 1 | Instruction Following | 100.0 | 3 | | 0.8x |
| 2 | Feature Implementation | 97.7 | 3 | | 1.0x |
| 3 | Coding Tasks | 96.7 | 3 | | 1.2x |
| 4 | Bug Introduction Rate | 96.3 | 3 | | 1.1x |
| 5 | Bug Fixes | 93.3 | 3 | | 1.2x |
| 6 | Code Quality | 93.0 | 3 | | 0.9x |
| 7 | Performance & Efficiency | 92.0 | 3 | | 1.0x |
| 8 | Security Awareness | 90.9 | 3 | | 1.1x |
| 9 | Code Thoroughness | 85.0 | 1 | | 0.9x |
| 10 | Long Reasoning | 66.6 | 3 | | 1.0x |
| 11 | Token Efficiency | 36.8 | 28 | | 1.0x |
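The page does not say how the category scores combine into the composite. As a minimal sketch, assuming the `x` multipliers listed for each category are weights in a plain weighted mean (an assumption, not a documented formula), the combination could look like this. The names `CATEGORIES` and `weighted_composite` are hypothetical.

```python
# Scores and multipliers copied from the Category Breakdown table.
# ASSUMPTION: the "x" multipliers act as weights in a weighted mean;
# the dashboard does not confirm this.
CATEGORIES = [
    ("Instruction Following", 100.0, 0.8),
    ("Feature Implementation", 97.7, 1.0),
    ("Coding Tasks", 96.7, 1.2),
    ("Bug Introduction Rate", 96.3, 1.1),
    ("Bug Fixes", 93.3, 1.2),
    ("Code Quality", 93.0, 0.9),
    ("Performance & Efficiency", 92.0, 1.0),
    ("Security Awareness", 90.9, 1.1),
    ("Code Thoroughness", 85.0, 0.9),
    ("Long Reasoning", 66.6, 1.0),
    ("Token Efficiency", 36.8, 1.0),
]


def weighted_composite(rows):
    """Return the weighted mean of (name, score, weight) rows."""
    total_weight = sum(weight for _, _, weight in rows)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(score * weight for _, score, weight in rows) / total_weight


if __name__ == "__main__":
    print(f"Weighted mean of category scores: {weighted_composite(CATEGORIES):.1f}")
```

A weighted mean keeps the result on the same 0 to 100 scale as the inputs, so a heavier weight shifts the composite toward that category without changing its range.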
Individual Test Results
Token Efficiency 36.8
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
13.2k
Total Tokens
368.2k
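The description above says the lowest average token burn scores 100 and heavier usage is "penalized proportionally", without giving the exact formula. One possible reading, sketched here purely as an assumption, is an inverse-proportional score: 100 times the lowest average divided by the model's own average. The function names are hypothetical, and the lowest average in the example is an illustrative figure, not a value from this page.

```python
def avg_tokens_per_test(total_tokens, tests):
    """Average token burn per successful test."""
    if tests <= 0:
        raise ValueError("tests must be positive")
    return total_tokens / tests


def token_efficiency(avg_tokens, lowest_avg_tokens):
    """ASSUMED scoring rule: 100 for the lowest average, scaled down
    in inverse proportion to heavier usage. Not confirmed by the page."""
    if avg_tokens <= 0 or lowest_avg_tokens <= 0:
        raise ValueError("token averages must be positive")
    return min(100.0, 100.0 * lowest_avg_tokens / avg_tokens)


if __name__ == "__main__":
    # Totals reported on this page: 368.2k tokens over 28 tests.
    avg = avg_tokens_per_test(368_200, 28)
    print(f"Average tokens per test: {avg:.0f}")
    # Hypothetical lowest average of 5,000 tokens, for illustration only.
    print(f"Score under the assumed rule: {token_efficiency(avg, 5_000):.1f}")
```

Under this reading, a model that burns twice the tokens of the most frugal one would score 50.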
Long Reasoning 66.6
Mathematical Proof
12.7k tok · 33.6s · 79.9
Legal Reasoning Chain
12.6k tok · 20.1s · 67.0
Multi-step Logic Puzzle
12.8k tok · 27.4s · 53.0
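The Long Reasoning score matches the unweighted mean of its three test scores. The page does not state this rule, so the sketch below treats it as an assumption and simply checks the arithmetic; `category_score` is a hypothetical helper.

```python
from statistics import mean


def category_score(test_scores):
    """ASSUMED rule: a category score is the plain mean of its test
    scores, rounded to one decimal as displayed on the page."""
    return round(mean(test_scores), 1)


# Test scores from the Long Reasoning group above.
long_reasoning = [79.9, 67.0, 53.0]

if __name__ == "__main__":
    print(category_score(long_reasoning))
```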
Coding Tasks 96.7
Graph Algorithm Implementation
12.7k tok · 28.2s · 93.0
REST API Design
12.2k tok · 19.0s · 100.0
Concurrent Data Pipeline
12.5k tok · 23.5s · 97.0
Bug Fixes 93.3
Race Condition Detection
12.3k tok · 20.9s · 100.0
Off-by-One Boundary Fix
12.7k tok · 25.7s · 80.0
Memory Leak Fix
12.6k tok · 22.9s · 100.0
Feature Implementation 97.7
OAuth2 Integration
12.9k tok · 26.4s · 97.0
Webhook System
12.7k tok · 25.9s · 98.0
Search Autocomplete
12.8k tok · 28.5s · 98.0
Code Thoroughness 85.0
Test Suite Completeness
13.8k tok · 43.9s · 85.0
Bug Introduction Rate 96.3
Refactor Without Regression
15.1k tok · 1m 5s · 94.0
Merge Conflict Resolution
13.3k tok · 27.1s · 98.0
Dependency Upgrade Safety
14.5k tok · 52.9s · 97.0
Security Awareness 90.9
SQL Injection Prevention
12.6k tok · 27.0s · 86.8
XSS Mitigation
13.6k tok · 39.1s · 100.0
Secret Management
12.9k tok · 29.5s · 86.0
Instruction Following 100.0
Structured Output Compliance
12.0k tok · 18.3s · 100.0
Constraint Adherence
11.8k tok · 14.0s · 100.0
Multi-step Instruction Chain
11.8k tok · 15.3s · 100.0
Code Quality 93.0
Idiomatic Python
13.4k tok · 41.2s · 96.0
Clean Architecture Patterns
13.7k tok · 44.4s · 95.0
TypeScript Best Practices
14.4k tok · 54.9s · 88.0
Performance & Efficiency 92.0
Algorithm Complexity
12.8k tok · 26.1s · 98.0
Memory-efficient Processing
14.2k tok · 51.5s · 94.0
Query Optimization
16.9k tok · 1m 41s · 84.0
Outage History
1 error, ongoing
Started Jun 5, 10:00 PM · checks affected