1,842 language models · 140 providers · 247 labs
Pick the right LLM for what you're shipping.
Every week brings new models, prices, and benchmarks. LLMReference tracks the field so you ship with the right model and provider, fast.
Default view: coding task, balanced budget, fresh research.
Editors' picks · featured
Where most teams start
Coding
Anthropic's new flagship: 80.3% SWE-bench Pro, 96% SWE-bench Verified on Vals.ai, and 85.0% OSWorld-Verified make it the best production coding pick for non-trivial engineering tasks.
Agents
Best τ-bench score (87.5) among generally available models; stays on task across long tool loops and self-corrects without prompting.
Writing
Tops Chatbot Arena (1503) and writes paragraphs you'd ship; understands tone notes and edits like a copy chief.
Research
GDPval-AA Elo 1932 and Anthropic-reported finance, trading, and analytics wins make it the strongest general knowledge-work pick; do not use Mythos-only HLE rows as Fable evidence.
Image
The current photoreal leader: brand-consistent, with the best text rendering and hands in the open ecosystem.
Video
Best overall video quality in the catalog: 30-second clips, native audio, and up to 4K through Vertex AI.
Pulse · this week
What changed in the model market
North Mini Code 1.0 · Claude Fable 5 · Claude Mythos 5
Verified provider price reductions
1,200 scores tracked across major suites
top-lab output · $/1M
Cheat sheet