Benchmarks

How the catalog compares on the benchmarks that matter. The Smart Router uses these signals (plus latency and price) when picking a model for your request. Best score per column is highlighted.

Model            Provider   MMLU-Pro  GPQA Diamond  SWE-bench Verified  HumanEval  AIME 2025
Claude Fable 5   Anthropic  92.4      87.1          82.3                98.8       96.2
GPT-5.5          OpenAI     91.8      85.9          78.6                98.1       94.7
Claude Opus 4.8  Anthropic  90.6      84.2          79.4                98.0       92.1
Gemini 3 Pro     Google     90.1      84.8          74.2                96.9       93.5
GPT-5.4          OpenAI     89.3      82.7          74.9                97.2       91.8
Claude Sonnet 5  Anthropic  88.7      81.4          77.1                97.5       89.3
DeepSeek V4      DeepSeek   87.9      80.6          71.8                96.4       90.6
Llama 4 405B     Meta       86.2      77.3          65.4                94.1       84.9
GPT-5.4 Mini     OpenAI     82.5      72.8          58.7                93.3       79.4

Scores are aggregated from public leaderboards and provider model cards (demo data; higher is better). For coding agents, weigh SWE-bench Verified most heavily; performance on research and analysis workloads tracks GPQA Diamond most closely.
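To make the weighting guidance concrete, here is a minimal Python sketch that ranks the catalog by a weighted benchmark average for a given workload. The scores are copied from the table above; the `WEIGHTS` values and the `rank_models` function are hypothetical and chosen only for illustration. This is not the Smart Router's actual logic, which also factors in latency and price.

```python
# Benchmark columns, in the same order as the table.
BENCHMARKS = ["MMLU-Pro", "GPQA Diamond", "SWE-bench Verified", "HumanEval", "AIME 2025"]

# Scores copied from the table above (demo data; higher is better).
SCORES = {
    "Claude Fable 5":  (92.4, 87.1, 82.3, 98.8, 96.2),
    "GPT-5.5":         (91.8, 85.9, 78.6, 98.1, 94.7),
    "Claude Opus 4.8": (90.6, 84.2, 79.4, 98.0, 92.1),
    "Gemini 3 Pro":    (90.1, 84.8, 74.2, 96.9, 93.5),
    "GPT-5.4":         (89.3, 82.7, 74.9, 97.2, 91.8),
    "Claude Sonnet 5": (88.7, 81.4, 77.1, 97.5, 89.3),
    "DeepSeek V4":     (87.9, 80.6, 71.8, 96.4, 90.6),
    "Llama 4 405B":    (86.2, 77.3, 65.4, 94.1, 84.9),
    "GPT-5.4 Mini":    (82.5, 72.8, 58.7, 93.3, 79.4),
}

# Hypothetical per-workload weights (assumption, not documented values).
# Each set sums to 1.0 and puts most weight on the benchmark named in the guidance.
WEIGHTS = {
    "coding": {
        "SWE-bench Verified": 0.60, "HumanEval": 0.20, "MMLU-Pro": 0.10,
        "GPQA Diamond": 0.05, "AIME 2025": 0.05,
    },
    "research": {
        "GPQA Diamond": 0.60, "MMLU-Pro": 0.20, "AIME 2025": 0.10,
        "SWE-bench Verified": 0.05, "HumanEval": 0.05,
    },
}


def weighted_score(model: str, workload: str) -> float:
    """Return the weighted average of a model's benchmark scores for a workload."""
    weights = WEIGHTS[workload]
    by_name = dict(zip(BENCHMARKS, SCORES[model]))
    return sum(weights[name] * by_name[name] for name in BENCHMARKS)


def rank_models(workload: str) -> list[str]:
    """Return model names sorted from highest to lowest weighted score."""
    return sorted(SCORES, key=lambda m: weighted_score(m, workload), reverse=True)


if __name__ == "__main__":
    for workload in WEIGHTS:
        print(workload, "->", rank_models(workload)[:3])
```

Because the weights are illustrative, the exact ordering of mid-table models will shift with different values. A production router would presumably normalize these scores against latency and price before choosing a model.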