Benchmarks

How the models in the catalog compare on five public benchmarks. The Smart Router uses these scores, along with latency and price, when picking a model for your request. The best score in each column is highlighted.

Model	Provider	MMLU-Pro	GPQA Diamond	SWE-bench Verified	HumanEval	AIME 2025
Claude Fable 5	Anthropic	92.4	87.1	82.3	98.8	96.2
GPT-5.5	OpenAI	91.8	85.9	78.6	98.1	94.7
Claude Opus 4.8	Anthropic	90.6	84.2	79.4	98.0	92.1
Gemini 3 Pro	Google	90.1	84.8	74.2	96.9	93.5
GPT-5.4	OpenAI	89.3	82.7	74.9	97.2	91.8
Claude Sonnet 5	Anthropic	88.7	81.4	77.1	97.5	89.3
DeepSeek V4	DeepSeek	87.9	80.6	71.8	96.4	90.6
Llama 4 405B	Meta	86.2	77.3	65.4	94.1	84.9
GPT-5.4 Mini	OpenAI	82.5	72.8	58.7	93.3	79.4

Scores are aggregated from public leaderboards and provider model cards (demo data; higher is better). Coding agents should weigh SWE-bench Verified most heavily; for research and analysis workloads, GPQA Diamond is the better guide.
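
To make the weighting advice concrete, here is a minimal Python sketch that ranks models by a weighted sum of the scores in the table above. It is an illustration only, not the Smart Router's actual logic: the `WEIGHTS` values and the `rank_models` helper are hypothetical, and latency and price are left out because this page gives no figures for them.

```python
# Benchmark columns, in the same order as the table above.
BENCHMARKS = ("MMLU-Pro", "GPQA Diamond", "SWE-bench Verified", "HumanEval", "AIME 2025")

# Scores copied from the table above (demo data; higher is better).
ROWS = {
    "Claude Fable 5": (92.4, 87.1, 82.3, 98.8, 96.2),
    "GPT-5.5": (91.8, 85.9, 78.6, 98.1, 94.7),
    "Claude Opus 4.8": (90.6, 84.2, 79.4, 98.0, 92.1),
    "Gemini 3 Pro": (90.1, 84.8, 74.2, 96.9, 93.5),
    "GPT-5.4": (89.3, 82.7, 74.9, 97.2, 91.8),
    "Claude Sonnet 5": (88.7, 81.4, 77.1, 97.5, 89.3),
    "DeepSeek V4": (87.9, 80.6, 71.8, 96.4, 90.6),
    "Llama 4 405B": (86.2, 77.3, 65.4, 94.1, 84.9),
    "GPT-5.4 Mini": (82.5, 72.8, 58.7, 93.3, 79.4),
}
SCORES = {model: dict(zip(BENCHMARKS, row)) for model, row in ROWS.items()}

# Hypothetical weights (an assumption for illustration): coding leans on
# SWE-bench Verified, research leans on GPQA Diamond.
WEIGHTS = {
    "coding": {"SWE-bench Verified": 0.6, "HumanEval": 0.2, "MMLU-Pro": 0.2},
    "research": {"GPQA Diamond": 0.6, "MMLU-Pro": 0.2, "AIME 2025": 0.2},
}


def rank_models(workload, candidates=None):
    """Return model names sorted by weighted benchmark score, best first."""
    weights = WEIGHTS[workload]
    names = list(SCORES) if candidates is None else list(candidates)

    def weighted_score(name):
        return sum(w * SCORES[name][bench] for bench, w in weights.items())

    return sorted(names, key=weighted_score, reverse=True)


if __name__ == "__main__":
    pair = ["GPT-5.5", "Claude Opus 4.8"]
    print(rank_models("coding", pair))    # ['Claude Opus 4.8', 'GPT-5.5']
    print(rank_models("research", pair))  # ['GPT-5.5', 'Claude Opus 4.8']
```

With these assumed weights, the order of two closely matched models flips with the workload: Claude Opus 4.8 edges out GPT-5.5 for coding on the strength of its SWE-bench Verified score, while GPT-5.5 comes first for research because of its higher GPQA Diamond score.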