We stopped asking “Is it smart?” and started asking “Is it useful?”

Current language models are powerful, but academic benchmarks often focus on rote memorization or Math Olympiad-style problems. That tells you little about how a model will perform as your daily driver.

We built Chatio to capture the nuance of human interaction. We don't care if a model can recite the digits of pi. We care if it can de-escalate a stressful situation, write a creative email that doesn't sound robotic, and follow your formatting instructions exactly.

LMArena

#1 Gemini 3 Pro (gemini-3-pro)
#2 Grok 4.1 Thinking (grok-4-1-thinking)
#3 Grok 4.1 (grok-4-1)
#4 Gemini 2.5 Pro (gemini-2-5-pro)
#5 Claude Sonnet 4.0-20240229-Thinking-32k (claude-sonnet-4-0-2024...)

Chatio

#1 Gemini 2.5 Pro (gemini-2.5-pro)
#2 Claude Opus 4.1 (claude-3-opus-20240229)
#3 GPT-5 (gpt-5)
#4 o3 (o3)
#5 ChatGPT 4o (gpt-4o)

Our five evaluation metrics

What we test

Everyday tasks, from fixing a leaky faucet to planning a schedule. We look for advice that a layperson can actually act on.

The Judge

A mix of fact-checkers and human reviewers.

Helpfulness
Instruction Following
Comprehension
Empathy
Creative Writing