Category weight: 1.0x

Instruction Following

Measures how closely the model adheres to explicit constraints in the prompt, including formatting, language, and output structure.

Best Score: 0.0
Avg Score: 0.0
Tests: 3

Performance Over Time — All Models

Model Rankings

1. Claude Opus 4.8
Category score: 100.0 (BEST)
Tokens: 2.8k
Total: 2.8k

2. GPT-5.5
Category score: 100.0 (BEST)
Tokens: 35.6k
Total: 35.6k

3. Grok
Category score: 100.0 (BEST)
Tokens: 59.8k
Total: 59.8k

4. Claude Sonnet 4.6
Category score: 96.7 (-3.3 pts)
Tokens: 2.5k
Total: 2.5k

Test Breakdown

Structured Output Compliance

Produce JSON matching an exact schema with no extra fields

Claude Opus 4.8: 100.0
GPT-5.5: 100.0
Grok: 100.0
Claude Sonnet 4.6: 96.7
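
The dashboard does not show how Structured Output Compliance is scored. As a rough sketch of what an "exact schema, no extra fields" check could look like, the snippet below compares a model's JSON output against a set of expected fields. The field names in EXPECTED_FIELDS and the function matches_exact_schema are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical schema (an assumption, not taken from the benchmark):
# field name -> required Python type after JSON decoding.
EXPECTED_FIELDS = {"name": str, "score": float, "tags": list}


def matches_exact_schema(raw: str, expected: dict = EXPECTED_FIELDS) -> bool:
    """Return True only if raw is a JSON object with exactly the expected fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # Comparing key sets rejects both missing fields and extra fields.
    if set(data) != set(expected):
        return False
    # Every field must also decode to the expected type.
    return all(isinstance(data[key], typ) for key, typ in expected.items())
```

Under these assumptions, an output with one additional key fails the check even when every required field is present and well typed.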

Constraint Adherence

Follow explicit constraints like max line length and naming conventions

Claude Opus 4.8: 100.0
GPT-5.5: 100.0
Grok: 100.0
Claude Sonnet 4.6: 96.7
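
The grading logic for Constraint Adherence is not shown here. One plausible way to check the two constraints named in the description, a maximum line length and a naming convention, is sketched below. The 80-character limit, the snake_case rule, and the function find_violations are all assumptions made for the example.

```python
import re

# Assumed naming convention for this sketch: lowercase snake_case identifiers.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def find_violations(source: str, names: list[str], max_line_length: int = 80) -> list[str]:
    """Collect human-readable descriptions of every constraint violation."""
    violations = []
    # Check each line of the output against the length limit.
    for number, line in enumerate(source.splitlines(), start=1):
        if len(line) > max_line_length:
            violations.append(f"line {number}: {len(line)} chars exceeds {max_line_length}")
    # Check each identifier against the naming convention.
    for name in names:
        if not SNAKE_CASE.match(name):
            violations.append(f"name {name!r} is not snake_case")
    return violations
```

An empty result means the output satisfied both constraints; a scorer could deduct points per reported violation, though the weighting used by this benchmark is unknown.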

Multi-step Instruction Chain

Execute a 6-step instruction sequence without skipping or reordering

Claude Opus 4.8: 100.0
GPT-5.5: 100.0
Grok: 100.0
Claude Sonnet 4.6: 96.7
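
How the Multi-step Instruction Chain test detects skipped or reordered steps is not documented on this page. A minimal sketch, assuming each step leaves an identifiable label in the model's output, compares the executed sequence against the required one. The function check_step_sequence and the step labels are hypothetical.

```python
def check_step_sequence(executed: list[str], required: list[str]):
    """Return None if steps ran exactly as required, else describe the first problem."""
    for position, expected in enumerate(required):
        # Running out of executed steps means a required step never happened.
        if position >= len(executed):
            return f"step {position + 1} ({expected}) was skipped or never reached"
        # A mismatch at this position means a skip or a reorder.
        if executed[position] != expected:
            return f"step {position + 1}: expected {expected}, got {executed[position]}"
    if len(executed) > len(required):
        return f"{len(executed) - len(required)} unexpected extra step(s)"
    return None


# Hypothetical labels for a 6-step chain.
REQUIRED_STEPS = ["s1", "s2", "s3", "s4", "s5", "s6"]
```

Reporting the first mismatch rather than a bare pass or fail makes it easier to see whether a model skipped a step or merely swapped two adjacent ones.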