Category
Weight: 1.0x
Code Thoroughness
Evaluates completeness of generated code: edge case handling, input validation, error paths, and test coverage.
Best Score: 0.0
Avg Score: 0.0
Tests: 3
Performance Over Time — All Models
Model Rankings
Test Breakdown
Edge Case Coverage
Generate code handling null, empty, unicode, and overflow inputs
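
As a rough illustration of what this test looks for, the sketch below parses a quantity while treating null, empty, Unicode, and overflow inputs explicitly. The function name `parse_quantity` and the 32-bit upper bound `INT32_MAX` are assumptions made for this example; they are not part of the benchmark itself.

```python
import unicodedata

INT32_MAX = 2**31 - 1  # assumed upper bound, chosen only for illustration


def parse_quantity(raw):
    """Parse a non-negative integer from text, rejecting edge cases explicitly."""
    # Null input: fail with a clear message instead of an AttributeError later.
    if raw is None:
        raise ValueError("input is None")
    # Unicode input: NFKC folds compatibility forms (e.g. fullwidth digits) to ASCII.
    text = unicodedata.normalize("NFKC", raw).strip()
    # Empty or whitespace-only input.
    if not text:
        raise ValueError("input is empty")
    # Anything that is not purely decimal digits (signs, letters) is rejected.
    if not text.isdecimal():
        raise ValueError(f"not a decimal number: {raw!r}")
    value = int(text)
    # Overflow: Python integers are unbounded, so the limit is enforced by hand.
    if value > INT32_MAX:
        raise OverflowError(f"value exceeds {INT32_MAX}")
    return value
```

Each guard maps to one input class named in the test description, which is one plausible way a grader could check coverage.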
Grok: 92.0
GPT-5.5: 85.0
Claude Sonnet 4.6: 84.0
Claude Opus 4.8: 74.0
Error Path Completeness
Ensure all failure modes have proper error handling and logging
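
A minimal sketch of the kind of code this test rewards, assuming a JSON config loader as the scenario: every failure mode gets its own handler and a log entry. The function `load_config` and the logger name `config_loader` are hypothetical and used only for illustration.

```python
import json
import logging

logger = logging.getLogger("config_loader")  # hypothetical logger name


def load_config(path):
    """Return the parsed JSON config, or None after logging the failure."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        logger.error("config file not found: %s", path)
    except PermissionError:
        logger.error("config file not readable: %s", path)
    except json.JSONDecodeError as exc:
        logger.error("invalid JSON in %s at line %d: %s", path, exc.lineno, exc.msg)
    except UnicodeDecodeError:
        logger.error("config file is not valid UTF-8: %s", path)
    except OSError as exc:
        # Remaining I/O failures, e.g. the path is a directory.
        logger.error("could not read %s: %s", path, exc)
    return None
```

Returning `None` on failure is a design choice for this sketch; re-raising a domain-specific exception after logging would be an equally reasonable approach.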
Grok: 92.0
GPT-5.5: 85.0
Claude Sonnet 4.6: 84.0
Claude Opus 4.8: 74.0
Test Suite Completeness
Generate tests covering happy path, edge cases, and integration
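
The three categories named above could look roughly like this for a small word-counting helper. The functions `word_counts` and `top_word` are hypothetical code under test, and the tests are written as plain assert-based functions that a runner such as pytest can collect.

```python
from collections import Counter


def word_counts(text):
    """Count case-insensitive, whitespace-separated words (hypothetical code under test)."""
    return Counter(text.lower().split())


def top_word(text):
    """Return the most frequent word, or None when the text has no words."""
    counts = word_counts(text)
    return counts.most_common(1)[0][0] if counts else None


def test_happy_path():
    # Typical input with a repeated word.
    assert word_counts("a b a") == {"a": 2, "b": 1}


def test_edge_cases():
    # Empty text, whitespace-only text, and non-ASCII text with mixed case.
    assert word_counts("") == {}
    assert word_counts("   \n\t") == {}
    assert word_counts("Café café") == {"café": 2}


def test_integration():
    # top_word relies on word_counts, so this exercises both together.
    assert top_word("red blue red") == "red"
    assert top_word("") is None
```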
Grok: 92.0
GPT-5.5: 85.0
Claude Sonnet 4.6: 84.0
Claude Opus 4.8: 74.0