xAI
Grok
Comprehensive benchmark performance across 11 evaluation categories
Composite Score
0.0/100
Rank
#4
Token Benchmark
14.1
Lower burn, higher score
Total Tokens
991.9k
~34.2k/test
Category Breakdown
| # | Category | Score | Tests | Weight |
|---|---|---|---|---|
| 1 | Instruction Following | 100.0 | 3 | 0.8x |
| 2 | Feature Implementation | 99.0 | 3 | 1.0x |
| 3 | Coding Tasks | 98.7 | 3 | 1.2x |
| 4 | Bug Introduction Rate | 97.0 | 3 | 1.1x |
| 5 | Bug Fixes | 96.3 | 3 | 1.2x |
| 6 | Performance & Efficiency | 94.7 | 3 | 1.0x |
| 7 | Code Quality | 94.3 | 3 | 0.9x |
| 8 | Code Thoroughness | 92.0 | 2 | 0.9x |
| 9 | Security Awareness | 84.3 | 3 | 1.1x |
| 10 | Long Reasoning | 70.1 | 3 | 1.0x |
| 11 | Token Efficiency | 14.1 | 29 | 1.0x |
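The page lists a weight per category but does not say how the weights are combined into the composite score. One plausible reading is a weighted mean, sketched below. The function name `weighted_composite` and the formula itself are assumptions, so the result should not be expected to match the composite the dashboard reports.

```python
def weighted_composite(categories):
    """Weighted mean of (score, weight) pairs.

    Assumption: the composite is a weighted arithmetic mean of the
    category scores. The dashboard does not document its real formula.
    """
    total_weight = sum(weight for _, weight in categories)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(score * weight for score, weight in categories) / total_weight


# (score, weight) pairs copied from the Category Breakdown table.
grok_categories = [
    (100.0, 0.8), (99.0, 1.0), (98.7, 1.2), (97.0, 1.1),
    (96.3, 1.2), (94.7, 1.0), (94.3, 0.9), (92.0, 0.9),
    (84.3, 1.1), (70.1, 1.0), (14.1, 1.0),
]

if __name__ == "__main__":
    print(f"{weighted_composite(grok_categories):.1f}")
```

Under this assumption, a category with weight 1.2x moves the composite 1.5 times as much as one with weight 0.8x for the same change in score.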
Individual Test Results
Token Efficiency · 14.1
Token Efficiency is computed from every successful task in the run. The model with the lowest average token burn receives 100, and heavier token usage is penalized proportionally.
Avg Tokens/Test
34.2k
Total Tokens
991.9k
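The description above says only that heavier token usage is "penalized proportionally". A minimal sketch of one way to read that, assuming the score is inversely proportional to a model's average tokens per test, follows. The function name `token_efficiency_scores` and the model names are hypothetical, and the exact penalty curve used by the dashboard is not documented.

```python
def token_efficiency_scores(avg_tokens_by_model):
    """Score each model from 0 to 100 by average tokens per test.

    Assumption: score = 100 * (lowest average) / (this model's average),
    so the leanest model gets 100 and a model that burns four times as
    many tokens gets 25. All averages are expected to be positive.
    """
    if not avg_tokens_by_model:
        return {}
    best = min(avg_tokens_by_model.values())
    return {
        model: round(100.0 * best / avg, 1)
        for model, avg in avg_tokens_by_model.items()
    }


# Hypothetical averages, not taken from the dashboard.
example_scores = token_efficiency_scores({"model_a": 10_000, "model_b": 40_000})

# The per-test average shown on the page follows from the totals:
# 991.9k tokens over 29 tests.
avg_k_tokens_per_test = round(991.9 / 29, 1)
```

The last line reproduces the 34.2k figure shown above from the total of 991.9k tokens across the 29 tests counted for this category.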
Long Reasoning · 70.1
Mathematical Proof
13.3k tok · 16.8s · 78.1
Legal Reasoning Chain
41.0k tok · 1m 55s · 89.0
Multi-step Logic Puzzle
37.9k tok · 3m 37s · 43.2
Coding Tasks · 98.7
Graph Algorithm Implementation
28.7k tok · 1m 23s · 97.0
REST API Design
21.6k tok · 1m 23s · 100.0
Concurrent Data Pipeline
17.5k tok · 32.1s · 99.0
Bug Fixes · 96.3
Race Condition Detection
15.6k tok · 43.4s · 100.0
Off-by-One Boundary Fix
20.8k tok · 56.9s · 88.8
Memory Leak Fix
21.6k tok · 1m 20s · 100.0
Feature Implementation · 99.0
Webhook System
14.4k tok · 25.1s · 100.0
OAuth2 Integration
28.5k tok · 1m 21s · 97.0
Search Autocomplete
33.5k tok · 2m 7s · 100.0
Code Thoroughness · 92.0
Test Suite Completeness
47.6k tok · 3m 17s · 85.0
Edge Case Coverage
67.5k tok · 4m 58s · 99.0
Bug Introduction Rate · 97.0
Refactor Without Regression
38.1k tok · 2m 18s · 100.0
Merge Conflict Resolution
25.6k tok · 1m 33s · 97.0
Dependency Upgrade Safety
36.1k tok · 1m 32s · 94.0
Security Awareness · 84.3
XSS Mitigation
34.4k tok · 1m 48s · 100.0
SQL Injection Prevention
55.9k tok · 2m 46s · 88.0
Secret Management
76.9k tok · 3m 47s · 64.8
Instruction Following · 100.0
Structured Output Compliance
31.9k tok · 45.0s · 100.0
Multi-step Instruction Chain
14.4k tok · 23.5s · 100.0
Constraint Adherence
13.5k tok · 20.4s · 100.0
Code Quality · 94.3
Idiomatic Python
43.7k tok · 2m 17s · 98.0
TypeScript Best Practices
73.2k tok · 4m 46s · 87.0
Clean Architecture Patterns
51.7k tok · 2m 49s · 98.0
Performance & Efficiency · 94.7
Algorithm Complexity
21.2k tok · 1m 13s · 98.0
Memory-efficient Processing
40.3k tok · 2m 25s · 98.0
Query Optimization
25.3k tok · 1m 38s · 88.0
Regression History
5 regressions
Long Reasoning · major
Score dropped 19.1% from 86.7 to 70.1
Detected Jun 5, 2026
Bug Fixes · minor
Score dropped 3.7% from 100.0 to 96.3
Detected Jun 5, 2026
Overall Composite · moderate
Score dropped 12.0% from 91.8 to 80.8
Detected Jun 5, 2026
Token Efficiency · major
Score dropped 62.7% from 73.0 to 27.2
Detected Jun 5, 2026
Feature Implementation · minor · resolved
Score dropped 3.4% from 98.3 to 95.0
Detected Jun 5, 2026 · Resolved Jun 5, 2026
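The percentages in the regression entries are consistent with a relative drop measured against the earlier score. A small sketch of that arithmetic follows. The helper name `percent_drop` is hypothetical, and the thresholds behind the minor, moderate, and major labels are not stated on the page, so they are not modeled here.

```python
def percent_drop(previous, current):
    """Relative drop from previous to current, as a percentage to one decimal."""
    if previous <= 0:
        raise ValueError("previous score must be positive")
    return round(100.0 * (previous - current) / previous, 1)


# (category, previous score, current score) as listed in Regression History.
regressions = [
    ("Long Reasoning", 86.7, 70.1),
    ("Bug Fixes", 100.0, 96.3),
    ("Overall Composite", 91.8, 80.8),
    ("Token Efficiency", 73.0, 27.2),
    ("Feature Implementation", 98.3, 95.0),
]

drops = {name: percent_drop(prev, cur) for name, prev, cur in regressions}
```

Because the drop is relative to the earlier score, the 3.3-point fall in Feature Implementation and the 3.7-point fall in Bug Fixes both land in the minor band, while the 45.8-point fall in Token Efficiency is a 62.7% drop.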
Outage History
1 outage
error · ongoing
Started Jun 5, 10:00 PM · checks affected