CategoryWeight: 1.0x

Instruction Following

Measures how closely the model adheres to explicit constraints in the prompt, including formatting, language, and output structure.

Best Score

0.0

Avg Score

0.0

Tests

Performance Over Time — All Models

Model Rankings

Claude Opus 4.8

Category score

View

100.0BEST

Tokens9.2k

Total9.2k

Claude Sonnet 4.6

Category score

View

100.0BEST

Tokens3.4k

Total3.4k

GPT-5.5

Category score

View

100.0BEST

Tokens36.9k

Total36.9k

Grok 4.5

Category score

View

100.0BEST

Tokens48.2k

Total48.2k

Rank	Model	Score	Tokens	vs. Best	Details
1	Claude Opus 4.8	100.0	9.2k	BEST	View
2	Claude Sonnet 4.6	100.0	3.4k	BEST	View
3	GPT-5.5	100.0	36.9k	BEST	View
4	Grok 4.5	100.0	48.2k	BEST	View

Test Breakdown

Structured Output Compliance

Produce JSON matching an exact schema with no extra fields

Claude Opus 4.8

100.0

Claude Sonnet 4.6

100.0

GPT-5.5

100.0

Grok 4.5

100.0

Constraint Adherence

Follow explicit constraints like max line length and naming conventions

Claude Opus 4.8

100.0

Claude Sonnet 4.6

100.0

GPT-5.5

100.0

Grok 4.5

100.0

Multi-step Instruction Chain

Execute a 6-step instruction sequence without skipping or reordering

Claude Opus 4.8

100.0

Claude Sonnet 4.6

100.0

GPT-5.5

100.0

Grok 4.5

100.0