Category Weight: 1.0x

Code Thoroughness

Evaluates completeness of generated code: edge case handling, input validation, error paths, and test coverage.

Best Score

0.0

Avg Score

0.0

Tests

[Chart: Performance Over Time, All Models]

Model Rankings

Rank	Model	Score	Tokens	vs. Best
1	GPT-5.5	92.7	73.2k	BEST
2	Claude Sonnet 4.6	92.3	110.0k	-0.4 pts
3	Grok 4.5	91.8	241.6k	-0.9 pts
4	Claude Opus 4.8	89.9	48.2k	-2.8 pts
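
The "vs. Best" column appears to be each model's score minus the top score in the category. A minimal sketch of that arithmetic, assuming simple subtraction rounded to one decimal place (the `vs_best` helper and the `scores` dictionary are illustrative names, not part of the benchmark):

```python
# Scores copied from the ranking table above.
scores = {
    "GPT-5.5": 92.7,
    "Claude Sonnet 4.6": 92.3,
    "Grok 4.5": 91.8,
    "Claude Opus 4.8": 89.9,
}


def vs_best(score: float, best: float) -> str:
    """Format the gap to the best score (hypothetical helper)."""
    # Round to one decimal so float noise such as -0.4000000000000057 is hidden.
    delta = round(score - best, 1)
    return "BEST" if delta == 0 else f"{delta:+.1f} pts"


best = max(scores.values())
# Sort descending by score to reproduce the rank order.
rows = [
    (rank, model, score, vs_best(score, best))
    for rank, (model, score) in enumerate(
        sorted(scores.items(), key=lambda item: item[1], reverse=True), start=1
    )
]

for row in rows:
    print(*row, sep="\t")
```

Under that assumption the computed gaps match the table: -0.4, -0.9, and -2.8 points.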

Test Breakdown

Edge Case Coverage

Generate code handling null, empty, unicode, and overflow inputs

Model	Score
GPT-5.5	92.7
Claude Sonnet 4.6	92.3
Grok 4.5	91.8
Claude Opus 4.8	89.9
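
To make the Edge Case Coverage criterion concrete, here is a minimal sketch of the kind of input handling it describes. The `parse_quantity` function and the 32-bit limit are assumptions chosen for illustration; the benchmark's actual prompts and grading rules are not shown on this page.

```python
from typing import Optional

INT32_MAX = 2**31 - 1  # assumed upper bound, for illustration only


def parse_quantity(raw: Optional[str]) -> int:
    """Parse a non-negative whole number from user text (hypothetical helper)."""
    # Null input: fail loudly instead of silently returning a default.
    if raw is None:
        raise ValueError("quantity is missing")

    # Empty or whitespace-only input.
    text = raw.strip()
    if not text:
        raise ValueError("quantity is empty")

    # Unicode input: str.isdecimal() accepts decimal digits from any script
    # (for example Arabic-Indic digits), and int() can parse them. It rejects
    # signs, letters, and characters such as superscript digits.
    if not text.isdecimal():
        raise ValueError(f"quantity is not a whole number: {raw!r}")

    # Overflow: reject very long digit strings before converting, then
    # check the numeric bound.
    digits = text.lstrip("0") or "0"
    if len(digits) > 10:
        raise OverflowError("quantity exceeds the 32-bit range")
    value = int(digits)
    if value > INT32_MAX:
        raise OverflowError("quantity exceeds the 32-bit range")
    return value
```

Each of the four input classes named in the test description (null, empty, unicode, overflow) gets its own explicit branch here, which is roughly what a thoroughness check would look for.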

Error Path Completeness

Ensure all failure modes have proper error handling and logging

Model	Score
GPT-5.5	92.7
Claude Sonnet 4.6	92.3
Grok 4.5	91.8
Claude Opus 4.8	89.9
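
As an illustration of what Error Path Completeness might reward, the sketch below handles each failure mode of a small JSON config loader and logs it before raising. The names `load_config` and `ConfigError` are hypothetical, and the benchmark's own tasks may look quite different.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("config_loader")


class ConfigError(Exception):
    """Single error type callers can catch (illustrative)."""


def load_config(path: str) -> dict:
    """Load a JSON object from `path`, logging every failure mode."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError as exc:
        logger.error("config file not found: %s", path)
        raise ConfigError(f"missing config: {path}") from exc
    except (OSError, UnicodeDecodeError) as exc:
        # Permission problems, directories, bad encodings, and similar.
        logger.error("could not read %s: %s", path, exc)
        raise ConfigError(f"unreadable config: {path}") from exc

    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        logger.error("invalid JSON in %s at line %d", path, exc.lineno)
        raise ConfigError(f"malformed config: {path}") from exc

    # Valid JSON can still be the wrong shape (a list, a number, ...).
    if not isinstance(data, dict):
        logger.error("config root in %s is %s, expected object",
                     path, type(data).__name__)
        raise ConfigError(f"config root must be an object: {path}")
    return data
```

Chaining with `raise ... from exc` keeps the original traceback available, so the log line and the raised error both point back to the underlying cause.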

Test Suite Completeness

Generate tests covering happy path, edge cases, and integration

Model	Score
GPT-5.5	92.7
Claude Sonnet 4.6	92.3
Grok 4.5	91.8
Claude Opus 4.8	89.9
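
For Test Suite Completeness, a small sketch of a suite that touches all three areas named in the description (happy path, edge cases, integration) may help. The functions under test, `word_count` and `count_file`, are invented for this example and are not taken from the benchmark.

```python
import os
import tempfile
import unittest


def word_count(text: str) -> int:
    """Count whitespace-separated words (hypothetical function under test)."""
    if not isinstance(text, str):
        raise TypeError("text must be a string")
    return len(text.split())


def count_file(path: str) -> int:
    """Count words in a UTF-8 text file (hypothetical)."""
    with open(path, encoding="utf-8") as handle:
        return word_count(handle.read())


class WordCountTests(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(word_count("one two three"), 3)

    def test_edge_cases(self):
        self.assertEqual(word_count(""), 0)            # empty input
        self.assertEqual(word_count("   \n\t"), 0)     # whitespace only
        self.assertEqual(word_count("naïve café"), 2)  # non-ASCII text
        with self.assertRaises(TypeError):
            word_count(None)                           # null input

    def test_integration_with_filesystem(self):
        # Exercises count_file and word_count together against a real file.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "sample.txt")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("alpha beta\ngamma")
            self.assertEqual(count_file(path), 3)
```

Saved as a module, the suite can be run with `python -m unittest`.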