xAI

Grok

Comprehensive benchmark performance across 11 evaluation categories

Composite Score

0.0/100

Rank

#4

Token Benchmark

14.1

Lower burn, higher score

Total Tokens

991.9k

~34.2k/test

Category Radar

Historical Composite Score

Category Breakdown

3 tests · 0.8x weight
3 tests · 1.0x weight
3 tests · 1.2x weight
3 tests · 1.1x weight
96.3
3 tests · 1.2x weight
3 tests · 1.0x weight
3 tests · 0.9x weight
2 tests · 0.9x weight
3 tests · 1.1x weight
3 tests · 1.0x weight
29 tests · 1.0x weight

Individual Test Results

Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
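
The scoring rule described above can be sketched as follows. This is only an illustration under stated assumptions: the page does not give the exact formula, so the sketch assumes "penalized proportionally" means a score of 100 times the lowest average divided by the model's own average. The function names, the `ok` and `tokens` fields, and the sample models are all hypothetical.

```python
def average_tokens(task_results):
    """Average token count over successful tasks only (failed tasks are skipped)."""
    successful = [t["tokens"] for t in task_results if t["ok"]]
    if not successful:
        raise ValueError("no successful tasks to average")
    return sum(successful) / len(successful)


def token_efficiency_scores(avg_tokens_by_model):
    """Assumed rule: the lowest average gets 100, others scale as best / own average."""
    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Hypothetical runs; these are not values from the dashboard.
runs = {
    "model_a": [{"tokens": 10_000, "ok": True}, {"tokens": 10_000, "ok": True}],
    "model_b": [{"tokens": 20_000, "ok": True}, {"tokens": 90_000, "ok": False}],
    "model_c": [{"tokens": 30_000, "ok": True}, {"tokens": 50_000, "ok": True}],
}
averages = {model: average_tokens(results) for model, results in runs.items()}
scores = token_efficiency_scores(averages)
print(scores)  # {'model_a': 100.0, 'model_b': 50.0, 'model_c': 25.0}
```

Under this assumed rule, a model that burns twice as many tokens per test as the leanest model scores 50, and one that burns four times as many scores 25. The dashboard's real penalty curve may differ.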

Avg Tokens/Test

34.2k

Test · Tokens · Time · Score
Mathematical Proof · 13.3k tok · 16.8s · 78.1
Legal Reasoning Chain · 41.0k tok · 1m 55s · 89.0
Multi-step Logic Puzzle · 37.9k tok · 3m 37s · 43.2
Graph Algorithm Implementation · 28.7k tok · 1m 23s · 97.0
REST API Design · 21.6k tok · 1m 23s · 100.0
Concurrent Data Pipeline · 17.5k tok · 32.1s · 99.0
Race Condition Detection · 15.6k tok · 43.4s · 100.0
Off-by-One Boundary Fix · 20.8k tok · 56.9s · 88.8
Memory Leak Fix · 21.6k tok · 1m 20s · 100.0
Webhook System · 14.4k tok · 25.1s · 100.0
OAuth2 Integration · 28.5k tok · 1m 21s · 97.0
Search Autocomplete · 33.5k tok · 2m 7s · 100.0
Test Suite Completeness · 47.6k tok · 3m 17s · 85.0
Edge Case Coverage · 67.5k tok · 4m 58s · 99.0
Refactor Without Regression · 38.1k tok · 2m 18s · 100.0
Merge Conflict Resolution · 25.6k tok · 1m 33s · 97.0
Dependency Upgrade Safety · 36.1k tok · 1m 32s · 94.0
XSS Mitigation · 34.4k tok · 1m 48s · 100.0
SQL Injection Prevention · 55.9k tok · 2m 46s · 88.0
Secret Management · 76.9k tok · 3m 47s · 64.8
Structured Output Compliance · 31.9k tok · 45.0s · 100.0
Multi-step Instruction Chain · 14.4k tok · 23.5s · 100.0
Constraint Adherence · 13.5k tok · 20.4s · 100.0
Idiomatic Python · 43.7k tok · 2m 17s · 98.0
TypeScript Best Practices · 73.2k tok · 4m 46s · 87.0
Clean Architecture Patterns · 51.7k tok · 2m 49s · 98.0
Algorithm Complexity · 21.2k tok · 1m 13s · 98.0
Memory-efficient Processing · 40.3k tok · 2m 25s · 98.0
Query Optimization · 25.3k tok · 1m 38s · 88.0

Regression History

5
Long Reasoning (major)

Score dropped 19.1%, from 86.7 to 70.1

Detected Jun 5, 2026
Bug Fixes (minor)

Score dropped 3.7%, from 100.0 to 96.3

Detected Jun 5, 2026
Overall Composite (moderate)

Score dropped 12.0%, from 91.8 to 80.8

Detected Jun 5, 2026
Token Efficiency (major)

Score dropped 62.7%, from 73.0 to 27.2

Detected Jun 5, 2026
Feature Implementation (minor, resolved)

Score dropped 3.4%, from 98.3 to 95.0

Detected Jun 5, 2026 · Resolved Jun 5, 2026
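
The percentage in each regression entry is consistent with a plain relative change between the previous and current score. A minimal sketch of that arithmetic, assuming one-decimal rounding; the function name is hypothetical, and the page does not state the thresholds behind the minor, moderate, and major labels, so none are modeled here:

```python
def relative_change_pct(previous, current):
    """Signed percentage change from previous to current, rounded to one decimal."""
    return round((current - previous) / previous * 100.0, 1)


# (previous, current) score pairs taken from the regression entries above.
regressions = {
    "Long Reasoning": (86.7, 70.1),
    "Bug Fixes": (100.0, 96.3),
    "Overall Composite": (91.8, 80.8),
    "Token Efficiency": (73.0, 27.2),
    "Feature Implementation": (98.3, 95.0),
}
for category, (previous, current) in regressions.items():
    print(f"{category}: {relative_change_pct(previous, current)}%")
# Long Reasoning: -19.1%
# Bug Fixes: -3.7%
# Overall Composite: -12.0%
# Token Efficiency: -62.7%
# Feature Implementation: -3.4%
```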

Outage History

1
error (ongoing)

Started Jun 5, 10:00 PM · checks affected