Compare Models

Side-by-side performance comparison across all benchmark categories.

[Figure: Category Radar Comparison]

[Figure: Performance Over Time]

Detailed Score Comparison

| Metric          | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5 | Grok   |
|-----------------|-----------------|-------------------|---------|--------|
| Composite       | 89.8            | 93.4              | 86.2    | 85.5   |
| Token Benchmark | 68.3            | 100.0             | 36.8    | 14.1   |
| Avg Tokens/Test | 7.1k            | 4.8k              | 13.2k   | 34.2k  |
| Total Tokens    | 134.7k          | 121.0k            | 368.2k  | 991.9k |
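The page does not say how the Token Benchmark score is derived. As a rough sanity check, the sketch below tests one possible reading: that the score is the lowest Avg Tokens/Test among the compared models divided by each model's own Avg Tokens/Test, scaled to 100. This rule and the function name `relative_efficiency` are assumptions made for illustration, not the benchmark's documented method; the inputs are the rounded values listed above.

```python
# Avg Tokens/Test from the comparison above, in thousands of tokens (rounded as shown).
avg_tokens_k = {
    "Claude Opus 4.8": 7.1,
    "Claude Sonnet 4.6": 4.8,
    "GPT-5.5": 13.2,
    "Grok": 34.2,
}

# Token Benchmark scores as reported in the comparison above.
reported = {
    "Claude Opus 4.8": 68.3,
    "Claude Sonnet 4.6": 100.0,
    "GPT-5.5": 36.8,
    "Grok": 14.1,
}


def relative_efficiency(avg_tokens):
    """Hypothetical rule: 100 * (lowest avg tokens) / (this model's avg tokens)."""
    best = min(avg_tokens.values())
    return {model: 100.0 * best / tokens for model, tokens in avg_tokens.items()}


estimated = relative_efficiency(avg_tokens_k)

for model, score in estimated.items():
    gap = abs(score - reported[model])
    print(f"{model}: estimated {score:.1f}, reported {reported[model]:.1f}, gap {gap:.1f}")
```

Under this assumed rule, each estimate lands within one point of the reported score, which is about what rounding the token counts to one decimal place would allow. That is consistent with a min-normalized ratio but does not confirm it.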

| Category                 | Claude Opus 4.8 | Claude Sonnet 4.6 | GPT-5.5      | Grok         |
|--------------------------|-----------------|-------------------|--------------|--------------|
| Token Efficiency         | 68.3            | 100.0 (best)      | 36.8         | 14.1         |
| Long Reasoning           | 77.8 (best)     | 68.9              | 66.6         | 70.1         |
| Coding Tasks             | 99.0 (best)     | 97.3              | 96.7         | 98.7         |
| Bug Fixes                | 90.0            | 96.7 (best)       | 93.3         | 96.3         |
| Feature Implementation   | 96.7            | 97.3              | 97.7         | 99.0 (best)  |
| Code Thoroughness        | 74.0            | 84.0              | 85.0         | 92.0 (best)  |
| Bug Introduction Rate    | 100.0 (best)    | 97.3              | 96.3         | 97.0         |
| Security Awareness       | 94.5            | 95.7 (best)       | 90.9         | 84.3         |
| Instruction Following    | 100.0 (best)    | 96.7              | 100.0 (best) | 100.0 (best) |
| Code Quality             | 93.5            | 99.0 (best)       | 93.0         | 94.3         |
| Performance & Efficiency | 94.0            | 94.5              | 92.0         | 94.7 (best)  |