Category Weight: 1.0x

Bug Introduction Rate

Measures how often the model introduces new bugs while writing or modifying code. A lower bug rate is better; the rate is inverted for scoring, so a higher score means fewer introduced bugs.
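The page does not state the inversion formula. As a minimal sketch, assuming the bug rate is a fraction between 0 and 1 and the score is reported on a 0 to 100 scale, the inversion could look like this. The function name `inverted_score` and the linear mapping are assumptions for illustration, not the dashboard's documented method.

```python
def inverted_score(bug_rate: float) -> float:
    """Map a bug introduction rate in [0, 1] to a 0-100 score (higher is better).

    Assumption: a simple linear inversion; the real scoring rule is not given.
    """
    if not 0.0 <= bug_rate <= 1.0:
        raise ValueError("bug_rate must be between 0 and 1")
    return round(100.0 * (1.0 - bug_rate), 1)


if __name__ == "__main__":
    # A model that never introduces a bug gets the maximum score.
    print(inverted_score(0.0))
    # A hypothetical 2.7% bug rate maps to a score just below the maximum.
    print(inverted_score(0.027))
```

Under this assumed mapping, a rate of 0 yields the top score and any introduced bug lowers it proportionally.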

Best Score: 0.0

Avg Score: 0.0

Tests: 3

[Chart: Performance Over Time, All Models]

Model Rankings

Rank  Model              Category score  vs. best  Tokens  Total
1     Claude Opus 4.8    100.0           BEST      4.5k    4.5k
2     Claude Sonnet 4.6  97.3            -2.7 pts  18.5k   18.5k
3     Grok               97.0            -3.0 pts  99.9k   99.9k
4     GPT-5.5            96.3            -3.7 pts  42.9k   42.9k
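The point differences shown next to each score in the rankings above are consistent with each model's category score minus the best category score. A small sketch of that arithmetic, using the scores listed on this page (the function name `gaps_to_best` is hypothetical):

```python
# Category scores as listed in the Model Rankings section.
scores = {
    "Claude Opus 4.8": 100.0,
    "Claude Sonnet 4.6": 97.3,
    "Grok": 97.0,
    "GPT-5.5": 96.3,
}


def gaps_to_best(scores: dict[str, float]) -> dict[str, float]:
    """Return each model's score minus the best score, rounded to one decimal."""
    best = max(scores.values())
    return {name: round(score - best, 1) for name, score in scores.items()}


if __name__ == "__main__":
    for name, gap in gaps_to_best(scores).items():
        print(f"{name}: {gap:+.1f} pts")
```

Rounding to one decimal avoids floating-point noise such as -2.700000000000003 in the printed gaps.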

Test Breakdown

Refactor Without Regression

Refactor a function without introducing new failures in existing tests

Claude Opus 4.8: 100.0
Claude Sonnet 4.6: 97.3
Grok: 97.0
GPT-5.5: 96.3
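The page does not describe how regressions are detected for the Refactor Without Regression test. One plausible approach, sketched here under the assumption that test results are available as name-to-pass/fail mappings before and after the refactor, is to flag every test that passed before and no longer passes. The function `new_failures` and the sample test names are hypothetical.

```python
def new_failures(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """List tests that passed before the change but fail or are missing after it."""
    return sorted(
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)
    )


if __name__ == "__main__":
    # Hypothetical test results around a refactor.
    before = {"test_parse": True, "test_render": True, "test_legacy": False}
    after = {"test_parse": True, "test_render": False, "test_legacy": False}
    # test_legacy was already failing, so only test_render counts as a regression.
    print(new_failures(before, after))
```

Treating a missing test as a failure is a deliberate choice in this sketch: a refactor that deletes a passing test should not count as regression-free.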

Merge Conflict Resolution

Resolve merge conflicts without introducing semantic errors

Claude Opus 4.8: 100.0
Claude Sonnet 4.6: 97.3
Grok: 97.0
GPT-5.5: 96.3

Dependency Upgrade Safety

Upgrade a dependency and adapt code without breaking changes

Claude Opus 4.8: 100.0
Claude Sonnet 4.6: 97.3
Grok: 97.0
GPT-5.5: 96.3