Category Weight: 1.0x

Bug Introduction Rate

Measures how often the model introduces new bugs while writing or modifying code. A lower bug introduction rate is better; the rate is inverted for scoring, so a higher score means fewer introduced bugs.
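As a rough illustration of what "inverted for scoring" could mean, the sketch below maps a bug introduction rate to a 0-100 score where higher is better. The linear formula and the names `inverted_score`, `bugs_introduced`, and `tasks` are assumptions made for illustration; this page does not state the actual formula.

```python
def inverted_score(bugs_introduced: int, tasks: int) -> float:
    """Map a bug introduction rate to a 0-100 score (higher is better).

    Assumption: a simple linear inversion, score = 100 * (1 - rate).
    The benchmark's real formula is not given on this page.
    """
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    if not 0 <= bugs_introduced <= tasks:
        raise ValueError("bugs_introduced must be between 0 and tasks")
    rate = bugs_introduced / tasks  # fraction of tasks with a new bug
    return round(100.0 * (1.0 - rate), 1)
```

Under this assumed formula, a model that introduced no bugs would score 100.0, and one that introduced a bug in every task would score 0.0.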

Best Score

97.0

Avg Score

96.4

Tests

Performance Over Time — All Models

Model Rankings

Rank	Model	Score	Tokens	vs. Best	Details
1	Claude Opus 4.8	97.0	16.6k	BEST	View
2	Claude Sonnet 4.6	97.0	20.7k	BEST	View
3	GPT-5.5	96.0	51.3k	-1.0 pts	View
4	Grok 4.5	95.7	118.3k	-1.3 pts	View
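The "vs. Best" column in the table above appears to be each model's score minus the top score. A minimal sketch of that arithmetic follows; the function name `vs_best` and the dictionary layout are hypothetical, and the scores are copied from the table.

```python
# Category scores copied from the rankings table.
scores = {
    "Claude Opus 4.8": 97.0,
    "Claude Sonnet 4.6": 97.0,
    "GPT-5.5": 96.0,
    "Grok 4.5": 95.7,
}

def vs_best(scores: dict[str, float]) -> dict[str, str]:
    """Label each model with BEST or its point gap to the top score."""
    best = max(scores.values())
    labels = {}
    for model, score in scores.items():
        # Round to one decimal to avoid floating-point noise in the gap.
        gap = round(score - best, 1)
        labels[model] = "BEST" if gap == 0 else f"{gap:+.1f} pts"
    return labels

for model, label in vs_best(scores).items():
    print(f"{model}\t{scores[model]}\t{label}")
```

Models tied for the top score all receive the BEST label, which matches the two tied entries in the table.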

Test Breakdown

Refactor Without Regression

Refactor a function without introducing new failures in existing tests

Model	Score
Claude Opus 4.8	97.0
Claude Sonnet 4.6	97.0
GPT-5.5	96.0
Grok 4.5	95.7
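
One way to read "without introducing new failures in existing tests" for the Refactor Without Regression test is as a set difference between the tests failing after the change and those failing before it. The sketch below assumes that reading; the helper names `new_failures` and `refactor_is_safe` and the test names are hypothetical, and the benchmark's real harness is not described on this page.

```python
def new_failures(failing_before: set[str], failing_after: set[str]) -> set[str]:
    """Return tests that fail after the change but did not fail before it."""
    return failing_after - failing_before

def refactor_is_safe(failing_before: set[str], failing_after: set[str]) -> bool:
    """A refactor counts as safe here if it adds no new failing tests."""
    return not new_failures(failing_before, failing_after)

# Hypothetical example: test_parse already failed before the refactor,
# so only test_render counts as a newly introduced failure.
before = {"test_parse"}
after = {"test_parse", "test_render"}
print(sorted(new_failures(before, after)))
```

Pre-existing failures are deliberately ignored in this sketch, so a model is only penalized for regressions it introduces.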

Merge Conflict Resolution

Resolve merge conflicts without introducing semantic errors

Model	Score
Claude Opus 4.8	97.0
Claude Sonnet 4.6	97.0
GPT-5.5	96.0
Grok 4.5	95.7

Dependency Upgrade Safety

Upgrade a dependency and adapt code without breaking changes

Model	Score
Claude Opus 4.8	97.0
Claude Sonnet 4.6	97.0
GPT-5.5	96.0
Grok 4.5	95.7