Benchmarks
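How the models in the catalog compare on five widely used benchmarks covering general knowledge, graduate-level science, software engineering, code generation, and competition math. The Smart Router uses these signals (plus latency and price) when picking a model for your request. Best score per column is highlighted.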
| Model | Provider | MMLU-Pro | GPQA Diamond | SWE-bench Verified | HumanEval | AIME 2025 |
|---|---|---|---|---|---|---|
| Claude Fable 5 | Anthropic | **92.4** | **87.1** | **82.3** | **98.8** | **96.2** |
| GPT-5.5 | OpenAI | 91.8 | 85.9 | 78.6 | 98.1 | 94.7 |
| Claude Opus 4.8 | Anthropic | 90.6 | 84.2 | 79.4 | 98.0 | 92.1 |
| Gemini 3 Pro | Google | 90.1 | 84.8 | 74.2 | 96.9 | 93.5 |
| GPT-5.4 | OpenAI | 89.3 | 82.7 | 74.9 | 97.2 | 91.8 |
| Claude Sonnet 5 | Anthropic | 88.7 | 81.4 | 77.1 | 97.5 | 89.3 |
| DeepSeek V4 | DeepSeek | 87.9 | 80.6 | 71.8 | 96.4 | 90.6 |
| Llama 4 405B | Meta | 86.2 | 77.3 | 65.4 | 94.1 | 84.9 |
| GPT-5.4 Mini | OpenAI | 82.5 | 72.8 | 58.7 | 93.3 | 79.4 |
Scores are aggregated from public leaderboards and provider model cards (demo data; higher is better). For coding agents, weigh SWE-bench Verified most heavily; for research and analysis workloads, GPQA Diamond is the closest proxy.
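
To make the weighting advice concrete, here is a minimal sketch of ranking models by a workload-specific weighted average of the scores above. It is not the Smart Router's actual logic: the `WEIGHTS` profiles, the workload names, and the `rank_models` helper are all hypothetical, and latency and price are left out because this section gives no figures for them.

```python
# Benchmark columns, in the same order as the table above.
BENCHMARKS = ["MMLU-Pro", "GPQA Diamond", "SWE-bench Verified", "HumanEval", "AIME 2025"]

# Scores copied from the table (demo data; higher is better).
SCORES = {
    "Claude Fable 5":  (92.4, 87.1, 82.3, 98.8, 96.2),
    "GPT-5.5":         (91.8, 85.9, 78.6, 98.1, 94.7),
    "Claude Opus 4.8": (90.6, 84.2, 79.4, 98.0, 92.1),
    "Gemini 3 Pro":    (90.1, 84.8, 74.2, 96.9, 93.5),
    "GPT-5.4":         (89.3, 82.7, 74.9, 97.2, 91.8),
    "Claude Sonnet 5": (88.7, 81.4, 77.1, 97.5, 89.3),
    "DeepSeek V4":     (87.9, 80.6, 71.8, 96.4, 90.6),
    "Llama 4 405B":    (86.2, 77.3, 65.4, 94.1, 84.9),
    "GPT-5.4 Mini":    (82.5, 72.8, 58.7, 93.3, 79.4),
}

# Hypothetical weight profiles (assumptions for illustration only).
# Benchmarks missing from a profile get weight 0.
WEIGHTS = {
    "coding_agent": {"SWE-bench Verified": 0.6, "HumanEval": 0.2,
                     "MMLU-Pro": 0.1, "GPQA Diamond": 0.1},
    "research":     {"GPQA Diamond": 0.6, "MMLU-Pro": 0.2, "AIME 2025": 0.2},
}


def rank_models(workload: str) -> list[tuple[str, float]]:
    """Return (model, weighted score) pairs, best first, for a workload profile."""
    profile = WEIGHTS[workload]
    total_weight = sum(profile.values())
    ranked = []
    for model, row in SCORES.items():
        by_name = dict(zip(BENCHMARKS, row))
        # Weighted average over only the benchmarks named in the profile.
        score = sum(by_name[name] * w for name, w in profile.items()) / total_weight
        ranked.append((model, round(score, 2)))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    for workload in WEIGHTS:
        top3 = ", ".join(f"{m} ({s})" for m, s in rank_models(workload)[:3])
        print(f"{workload}: {top3}")
```

Because Claude Fable 5 leads every column, it ranks first under any weighting. The profiles only change the order below it: with these illustrative weights, Claude Opus 4.8 comes second for `coding_agent`, while GPT-5.5 comes second for `research`.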