Leaderboard
Explore how AI models perform across our five core evaluation categories. Rankings are based on real-world conversations and human evaluations, measuring what truly matters in an AI assistant.
Rank | Model | Overall |  |  |  |  |  |  | Price | Context | Max output |
|---|---|---|---|---|---|---|---|---|---|---|---|
#1 | Gemini 3 Pro (New) gemini-3-pro-preview | 93.18 | 96.0 | 93.0 | 94.0 | 88.7 | 94.3 | 80.0 | $2.00 / $12.00 | 2M | 65.5K |
#2 | Claude Opus 4.5 (New) claude-opus-4-5-202511... | 91.38 | 90.5 | 87.7 | 95.3 | 89.6 | 93.8 | 60.0 | $5.00 / $25.00 | 200K | 64K |
#3 | Claude Sonnet 4.5 claude-sonnet-4-5 | 91.22 | 91.5 | 88.8 | 94.0 | 89.5 | 92.3 | 68.5 | $3.00 / $15.00 | 200K | 64K |
#4 | GPT-5.1 gpt-5.1 | 90.92 | 92.5 | 90.5 | 95.0 | 82.1 | 94.5 | 80.0 | $2.50 / $10.00 | 272K | 8.2K |
#5 | Grok 4.1 Thinking (New) grok-4.1-thinking | 89.96 | 91.5 | 91.5 | 94.0 | 79.8 | 93.0 | 62.0 | $3.00 / $15.00 | 256K | 8.2K |
#6 | GPT-5 gpt-5 | 89.29 | 91.0 | 87.7 | 94.0 | 80.3 | 93.5 | 72.0 | $1.25 / $10.00 | 400K | 128K |
#7 | Grok 4.1 (New) grok-4.1 | 88.85 | 90.5 | 91.0 | 92.5 | 78.3 | 92.0 | 78.0 | $3.00 / $15.00 | 256K | 8.2K |
#8 | o3 o3-2025-04-16 | 88.84 | 89.5 | 90.0 | 93.2 | 77.5 | 94.0 | 58.0 | $2.00 / $8.00 | 200K | 100K |
#9 | Claude Opus 4.1 claude-opus-4-1 | 88.30 | 90.0 | 83.5 | 92.0 | 85.5 | 90.5 | 54.0 | $15.00 / $75.00 | 200K | 8.2K |
#10 | ChatGPT-4o chatgpt-4o | 87.30 | 88.5 | 90.5 | 89.0 | 81.5 | 87.0 | 82.2 | $5.00 / $15.00 | 128K | 16.4K |
#11 | Claude Haiku 4.5 claude-haiku-4-5 | 86.80 | 87.5 | 82.0 | 90.0 | 86.5 | 88.0 | 93.0 | $1.00 / $5.00 | 200K | 64K |
#12 | Grok 4 Fast Reasoning grok-4-fast-reasoning | 86.70 | 89.0 | 86.5 | 91.5 | 76.0 | 90.5 | 88.0 | $0.20 / $0.50 | 2M | 8.2K |
#13 | Gemini 2.5 Pro gemini-2.5-pro | 86.47 | 88.3 | 85.3 | 91.3 | 78.3 | 89.2 | 60.4 | $2.00 / $12.00 | 1M | 65.5K |
#14 | o4-mini o4-mini | 86.30 | 89.5 | 84.5 | 94.5 | 72.0 | 91.0 | 95.0 | $1.10 / $4.40 | 200K | 100K |
#15 | DeepSeek V3.2 Thinking deepseek-reasoner-v3.2 | 86.03 | 89.0 | 74.0 | 93.5 | 79.7 | 94.0 | 45.0 | $0.14 / $0.28 | 160K | 32.8K |
#16 | Grok 4 Fast grok-4-fast-non-reason... | 85.20 | 87.0 | 86.0 | 89.0 | 75.5 | 88.5 | 93.0 | $0.20 / $0.50 | 2M | 8.2K |
#17 | o1 o1 | 84.80 | 87.5 | 81.0 | 88.5 | 78.0 | 89.0 | 65.0 | $15.00 / $60.00 | 200K | 100K |
#18 | DeepSeek V3.1 Thinking deepseek-reasoner | 84.21 | 87.0 | 71.0 | 90.5 | 82.5 | 90.0 | 40.0 | $0.07 / $1.68 | 128K | 32.8K |
#19 | o3-mini o3-mini | 83.60 | 88.0 | 75.0 | 94.0 | 72.0 | 89.0 | 92.0 | $1.10 / $4.40 | 200K | 100K |
#20 | DeepSeek V3.2 deepseek-v3.2-exp | 83.40 | 86.5 | 79.0 | 88.5 | 76.0 | 87.0 | 95.0 | $0.07 / $0.14 | 160K | 8.2K |
#21 | Grok 3 grok-3 | 82.40 | 83.5 | 82.0 | 86.0 | 76.0 | 84.5 | 68.0 | $3.00 / $15.00 | 131.1K | 8.2K |
#22 | Grok 4 grok-4-0709 | 82.21 | 86.5 | 82.0 | 89.0 | 65.5 | 88.0 | 70.0 | $3.00 / $15.00 | 256K | 8.2K |
#23 | DeepSeek R1 deepseek-reasoner | 82.10 | 88.0 | 65.0 | 92.0 | 72.0 | 93.5 | 25.0 | $0.55 / $2.19 | 128K | 32.8K |
#24 | Llama 4 Maverick llama-4-maverick-17b-1... | 80.60 | 80.5 | 79.5 | 86.5 | 74.5 | 82.0 | 95.0 | $0.20 / $0.80 | 1M | 8.2K |
#25 | Grok 3 Mini grok-3-mini | 79.60 | 80.0 | 80.0 | 83.0 | 73.0 | 82.0 | 82.0 | $0.30 / $0.50 | 131.1K | 8.2K |
Showing 25 of 31 models
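
The table does not say how the first score column (the one the ranking is sorted by) is derived. The figures are consistent with it being the unweighted mean of the five score columns that follow it, with the sixth score column left out. That reading is an inference from the numbers, not something the page states. The sketch below checks it on a few rows copied from the table; the names `ROWS`, `mean_of_categories`, and `max_deviation` are illustrative.

```python
# Each entry: (model, displayed first score column, the next five score columns),
# copied from the table. The sixth score column is deliberately left out.
ROWS = [
    ("Gemini 3 Pro", 93.18, [96.0, 93.0, 94.0, 88.7, 94.3]),
    ("Claude Opus 4.5", 91.38, [90.5, 87.7, 95.3, 89.6, 93.8]),
    ("GPT-5.1", 90.92, [92.5, 90.5, 95.0, 82.1, 94.5]),
    ("Grok 3 Mini", 79.60, [80.0, 80.0, 83.0, 73.0, 82.0]),
]


def mean_of_categories(scores):
    """Unweighted mean of the category scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


def max_deviation(rows):
    """Largest absolute gap between the displayed score and the computed mean."""
    return max(abs(mean_of_categories(s) - shown) for _, shown, s in rows)


if __name__ == "__main__":
    for name, shown, scores in ROWS:
        print(f"{name}: displayed {shown:.2f}, mean of five {mean_of_categories(scores):.2f}")
    print(f"largest gap: {max_deviation(ROWS):.2f}")
```

Across the 25 rows shown, the computed mean and the displayed value differ by at most about 0.02, which is what one would expect if the category scores are rounded to one decimal place before display.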