Compare the latest Large Language Models across multiple benchmarks and performance metrics
Score is the composite ranking score; benchmark columns are percentages; Cost/1K is USD per 1,000 tokens; TPS is tokens per second; Context is the maximum context window.

| # | Model | Date | Organization | Category | Highlights | Score | MMLU | GPQA | MMMU | HellaSwag | HumanEval | BBH | GSM8K | MATH | Cost/1K | TPS | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 2026-02 | Google | multimodal | Top GPQA, Best ARC-AGI-2, Best Value Frontier | 92.2% | 93.1% | 94.3% | 85.2% | N/A | N/A | N/A | N/A | 96.1% | $0.002 | 60 | 2M |
| 2 | GPT-5.4 | 2026-03 | OpenAI | reasoning | Best Computer Use, Top OSWorld | 91.6% | 93% | 92.8% | 83.5% | N/A | N/A | N/A | N/A | 97% | $0.0025 | 70 | 1M |
| 3 | Claude Opus 4.6 | 2026-02 | Anthropic | coding | Best SWE-Bench, Top Coding, 128K Output | 91.3% | 92.4% | 91.3% | 85.1% | N/A | N/A | N/A | N/A | 95.2% | $0.005 | 45 | 1M |
| 4 | Grok 4 | 2026-01 | xAI | reasoning | Best HLE, Multi-Agent, Real-time X Data | 90.2% | 92.7% | 84.6% | N/A | N/A | N/A | N/A | N/A | 93.3% | $0.002 | 75 | 128K |
| 5 | Kimi K2.5 | 2026-01 | Moonshot AI | coding | Open Source, Best HumanEval Open, Top SWE-Bench Open | 89.5% | 92% | 87.6% | N/A | N/A | 99% | N/A | N/A | 98% | $0.0015 | 85 | 262K |
| 6 | Claude Sonnet 4.6 | 2026-02 | Anthropic | reasoning | Best GDPval-AA, Near-Opus Performance | 88.8% | 91% | 88.5% | 82% | N/A | N/A | N/A | N/A | 93.5% | $0.003 | 80 | 1M |
| 7 | o1 | 2024-09 | OpenAI | reasoning | Top MMLU, Premium | 88.4% | 92.3% | 78% | N/A | N/A | N/A | N/A | N/A | 94.8% | $0.06 | 15 | 128K |
| 8 | GLM-5 | 2026-01 | Zhipu AI | coding | Open Source, MIT License, Best Chatbot Arena Open | 88% | 92% | 86% | N/A | N/A | 96.5% | N/A | N/A | 94.5% | $0.001 | 80 | 200K |
| 9 | Qwen 3.5 397B | 2026-02 | Alibaba | reasoning | Open Source, Apache 2.0, Top Open GPQA | 87.7% | 91.8% | 88.4% | N/A | N/A | 94.2% | N/A | N/A | 96.5% | $0.001 | 90 | 256K |
| 10 | MiniMax M2.5 | 2026-01 | MiniMax | coding | Open Source, Best SWE-Bench Open | 87.5% | 90.8% | 85.2% | N/A | N/A | 94.5% | N/A | N/A | 96% | $0.0012 | 75 | 128K |
| 11 | DeepSeek R1 | 2025-01 | DeepSeek | reasoning | Best MATH, Best Reasoning | 86.5% | 90.8% | 71.5% | N/A | N/A | N/A | N/A | N/A | 97.3% | $0.008 | 65 | 64K |
| 12 | o3-mini | 2024-12 | OpenAI | coding | Fastest TPS, Best HumanEval | 86% | 86% | 75% | N/A | N/A | 97% | N/A | N/A | N/A | $0.02 | 189 | 128K |
| 13 | Claude 3.7 Sonnet | 2024-11 | Anthropic | reasoning | Best MATH | 85.5% | 86.1% | 84.8% | 75% | N/A | N/A | N/A | N/A | 96.2% | $0.02 | 65 | 200K |
| 14 | o4-mini | 2024-12 | OpenAI | reasoning | Best MATH | 85.2% | N/A | 81.4% | 81.6% | N/A | N/A | N/A | N/A | 92.7% | $0.02 | 85 | 128K |
| 15 | Gemini 2.5 Pro | 2024-12 | Google | multimodal | Latest, High GPQA | 85.2% | 89.8% | 84% | 81.7% | N/A | N/A | N/A | N/A | N/A | $0.02 | 55 | 1M |
| 16 | o3 | 2024-12 | OpenAI | reasoning | Top MATH | 85% | N/A | 83.3% | 82.9% | N/A | N/A | N/A | N/A | 88.9% | $0.06 | 18 | 128K |
| 17 | o1-preview | 2024-09 | OpenAI | reasoning | Preview | 84.9% | 90.8% | 78.3% | N/A | N/A | N/A | N/A | N/A | 85.5% | $0.045 | 20 | 128K |
| 18 | DeepSeek V3.2 | 2025-12 | DeepSeek | coding | Open Source, MIT License, Best Budget | 84.2% | 90.5% | 72.1% | N/A | N/A | 91.5% | N/A | N/A | 95% | $0.00014 | 95 | 64K |
| 19 | DeepSeek V3 (0324) | 2025-03 | DeepSeek | coding | Open Source, Updated, MIT License | 84.2% | 89% | 68.4% | N/A | N/A | 87.5% | N/A | N/A | 92% | $0.004 | 95 | 64K |
| 20 | DeepSeek R1 Zero | 2025-01 | DeepSeek | reasoning | Open Source, RL-Only Training, MIT License | 83.8% | 88.4% | 67% | N/A | N/A | N/A | N/A | N/A | 95.9% | $0.005 | 60 | 64K |
| 21 | Llama 4 Behemoth | 2025-07 | Meta | reasoning | Open Source, Largest Llama, Top MMLU Open | 83.4% | 91.5% | 74.2% | 78.3% | N/A | N/A | N/A | N/A | 89.5% | $0.01 | 25 | 256K |
| 22 | Claude 3.5 Sonnet | 2024-06 | Anthropic | reasoning | User's Choice, Best GSM8K | 82.3% | 88.7% | 59.4% | 68.3% | 89% | 92% | 93.1% | 96.4% | 71.1% | $0.015 | 170 | 200K |
| 23 | GPT-4o | 2024-05 | OpenAI | multimodal | Least Latency, Multimodal | 82.2% | 88.7% | 53.6% | 69.1% | 94.2% | 90.2% | 91.3% | 89.8% | 76.6% | $0.015 | 85 | 128K |
| 24 | o1-mini | 2024-09 | OpenAI | coding | Good Coding | 81.9% | 85.2% | 60% | N/A | N/A | 92.4% | N/A | N/A | 90% | $0.025 | 45 | 128K |
| 25 | Gemini 2.0 Flash | 2024-12 | Google | multimodal | Fast, Good Performance | 81.8% | 87% | 59% | N/A | N/A | 91% | N/A | N/A | 90% | $0.01 | 110 | 1M |
| 26 | Claude Opus 4 | 2024-12 | Anthropic | reasoning | Latest, Premium | 81% | 88.8% | 83.3% | 76.5% | N/A | N/A | N/A | N/A | 75.5% | $0.045 | 45 | 200K |
| 27 | Gemini 2.0 Flash Thinking | 2025-02 | Google | reasoning | Extended Thinking, Best Budget Reasoning | 80.7% | 85% | 70.3% | 73.8% | N/A | N/A | N/A | N/A | 93.5% | $0.0035 | 50 | 1M |
| 28 | Grok 3 Mini | 2025-02 | xAI | reasoning | Extended Thinking, Cost-Effective | 80.7% | 83% | 69.7% | N/A | N/A | N/A | N/A | N/A | 89.5% | $0.003 | 100 | 128K |
| 29 | DeepSeek V3 | 2024-12 | DeepSeek | coding | Open Source, Good MATH | 80.1% | 88.5% | 59.1% | N/A | N/A | 82.6% | N/A | N/A | 90.2% | $0.004 | 95 | 64K |
| 30 | Claude 3.5 Sonnet v2 | 2025-02 | Anthropic | reasoning | Computer Use, Top SWE-Bench, Upgraded | 79.3% | 88.7% | 65% | 70.7% | N/A | 93.7% | N/A | N/A | 78.3% | $0.015 | 75 | 200K |
| 31 | Llama 3.1 405B | 2024-07 | Meta | reasoning | Open Source, Largest Open | 78.9% | 88.6% | 51.1% | 64.5% | 87% | 89% | 81.3% | 96.8% | 73.8% | $0.015 | 35 | 128K |
| 32 | Claude Sonnet 4 | 2024-12 | Anthropic | reasoning | Latest, High GPQA | 78.8% | 86.5% | 83.8% | 74.4% | N/A | N/A | N/A | N/A | 70.5% | $0.025 | 60 | 200K |
| 33 | Qwen 2.5-Max | 2025-02 | Alibaba | reasoning | MoE Architecture, Top Chinese Open | 78.1% | 87% | 52.5% | N/A | N/A | 88% | N/A | N/A | 85% | $0.0016 | 95 | 128K |
| 34 | GPT-4 Turbo | 2024-04 | OpenAI | reasoning | Highly Preferred, Balanced | 77.6% | 86.5% | 48% | 63.1% | 94.2% | 90.2% | 87.6% | 91% | 72.2% | $0.03 | 45 | 128K |
| 35 | Llama 3.1 Nemotron 70B | 2025-01 | NVIDIA | reasoning | Open Source, RLHF Tuned, Top Arena | 77.5% | 85% | 55.8% | N/A | N/A | 90% | N/A | N/A | 79% | $0.0035 | 70 | 128K |
| 36 | GPT-4o (2025) | 2025-05 | OpenAI | multimodal | Updated, Best Voice, Image Gen | 77.2% | 89.5% | 55% | 70.2% | N/A | 91.5% | N/A | N/A | 80% | $0.0125 | 90 | 128K |
| 37 | Claude 3 Opus | 2024-03 | Anthropic | reasoning | Premium, Complete Benchmarks | 77.2% | 86.8% | 50.4% | 59.4% | 95.4% | 84.9% | 86.8% | 95% | 60.1% | $0.045 | 45 | 200K |
| 38 | GPT-4.1 | 2024-11 | OpenAI | reasoning | Latest GPT | 77.1% | 90.2% | 66.3% | 74.8% | N/A | N/A | N/A | N/A | N/A | $0.04 | 50 | 128K |
| 39 | Gemini 2.0 Pro Experimental | 2024-12 | Google | multimodal | Experimental, Good MATH | 77.1% | 79.1% | 64.7% | 72.7% | N/A | N/A | N/A | N/A | 91.8% | $0.015 | 60 | 1M |
| 40 | Qwen 2.5 72B | 2025-01 | Alibaba | coding | Open Source, Apache 2.0, Best Open Coding | 76.4% | 86.1% | 49% | N/A | N/A | 87.2% | N/A | N/A | 83.1% | $0.0009 | 88 | 128K |
| 41 | Claude 3.7 Sonnet (Normal) | 2024-11 | Anthropic | reasoning | Balanced | 76.3% | 83.2% | 68% | 71.8% | N/A | N/A | N/A | N/A | 82.2% | $0.015 | 85 | 200K |
| 42 | Phi-4 | 2025-01 | Microsoft | reasoning | Open Source, Best-in-Class 14B, STEM Strong | 75.9% | 84.8% | 56.1% | N/A | N/A | 82.6% | N/A | N/A | 80.4% | $0.0007 | 120 | 16K |
| 43 | Llama 4 Maverick | 2024-12 | Meta | reasoning | Open Source | 75.9% | 84.6% | 69.8% | 73.4% | N/A | N/A | N/A | N/A | N/A | $0.005 | 85 | 128K |
| 44 | Llama 3.3 70B | 2024-10 | Meta | coding | Open Source, Good Coding | 75.5% | 86% | 50.5% | N/A | N/A | 88.4% | N/A | N/A | 77% | $0.006 | 90 | 128K |
| 45 | Mistral Large 2 | 2025-01 | Mistral AI | reasoning | Open Weights, Multilingual, Function Calling | 75.4% | 84% | 49.6% | N/A | N/A | 92% | N/A | N/A | 76% | $0.006 | 80 | 128K |
| 46 | Mistral Medium 3 | 2025-05 | Mistral AI | reasoning | Enterprise, Multilingual, New | 75.2% | 83.5% | 51% | N/A | N/A | 90% | N/A | N/A | 76.5% | $0.004 | 95 | 128K |
| 47 | GPT-4.1 mini | 2024-11 | OpenAI | reasoning | Cost-Effective | 75.1% | 87.5% | 65% | 72.7% | N/A | N/A | N/A | N/A | N/A | $0.015 | 95 | 128K |
| 48 | Grok-2 | 2024-08 | xAI | coding | Good Coding | 74.8% | 87.5% | 56% | 66.1% | N/A | 88.4% | N/A | N/A | 76.1% | $0.01 | 75 | 128K |
| 49 | Grok 3 | 2024-12 | xAI | reasoning | Latest | 74.3% | N/A | 75.4% | 73.2% | N/A | N/A | N/A | N/A | N/A | $0.012 | 70 | 128K |
| 50 | Gemini 1.5 Pro | 2024-02 | Google | multimodal | Largest Context, Complete Benchmarks | 73.6% | 81.9% | 46.2% | 62.2% | 92.5% | 71.9% | 84% | 91.7% | 58.5% | $0.0125 | 38 | 2M |
| 51 | Gemini 2.5 Flash Lite | 2024-12 | Google | multimodal | Latest | 71.8% | 84.5% | 66.7% | 72.9% | N/A | N/A | N/A | N/A | 63.1% | $0.01 | 75 | 1M |
| 52 | GPT-4 | 2023-03 | OpenAI | reasoning | Most Expensive, Classic | 71.4% | 86.4% | 35.7% | 56.8% | 95.3% | 67% | 83.1% | 92% | 52.9% | $0.18 | 25 | 8K |
| 53 | Claude 3.5 Haiku (2025) | 2025-04 | Anthropic | conversation | Updated, Fastest Claude, Computer Use | 70.3% | 73.5% | 43.2% | N/A | N/A | 90.5% | N/A | N/A | 74% | $0.004 | 150 | 200K |
| 54 | Llama 3.2 90B | 2024-09 | Meta | reasoning | Open Source | 69.6% | 86% | 46.7% | 60.3% | N/A | N/A | N/A | 86.9% | 68% | $0.008 | 80 | 128K |
| 55 | Command R+ (2025) | 2025-03 | Cohere | reasoning | RAG Optimized, Tool Use, Enterprise | 69.4% | 82.3% | 46% | N/A | N/A | 80.5% | N/A | N/A | 68.9% | $0.0025 | 85 | 128K |
| 56 | Claude 3 Sonnet | 2024-03 | Anthropic | reasoning | Balanced, Complete Benchmarks | 69.1% | 79% | 40.4% | 53.1% | 89% | 73% | 82.9% | 92.3% | 43.1% | $0.012 | 90 | 200K |
| 57 | Gemini 1.5 Flash | 2024-05 | Google | multimodal | Fast, Complete Benchmarks | 68.6% | 78.9% | 39.5% | 56.1% | 81.3% | 67.5% | 89.2% | 68.8% | 67.7% | $0.008 | 95 | 1M |
| 58 | Mistral Small 3 | 2025-03 | Mistral AI | conversation | Open Weights, Ultra Efficient, Apache 2.0 | 68.1% | 81.5% | 42% | N/A | N/A | 83% | N/A | N/A | 66% | $0.001 | 130 | 32K |
| 59 | Gemma 3 27B | 2025-03 | Google | multimodal | Open Source, Multimodal, Apache 2.0 | 67.9% | 78.4% | 42% | 68.5% | N/A | 79% | N/A | N/A | 71.5% | $0.0003 | 110 | 128K |
| 60 | GPT-4o mini | 2024-07 | OpenAI | conversation | Cost-Effective | 67.8% | 82% | 40.2% | 59.4% | N/A | 87.2% | N/A | N/A | 70.2% | $0.007 | 120 | 128K |
| 61 | Llama 4 Scout | 2024-12 | Meta | conversation | Least Expensive, Open Source | 67% | 74.3% | 57.2% | 69.4% | N/A | N/A | N/A | N/A | N/A | $0.0003 | 120 | 128K |
| 62 | Amazon Nova Pro | 2025-01 | Amazon | multimodal | AWS Native, Multimodal, Cost-Effective | 66.9% | 80% | 44% | 63.5% | N/A | 79% | N/A | N/A | 68% | $0.0008 | 100 | 300K |
| 63 | Claude 3.5 Haiku | 2024-11 | Anthropic | conversation | Fast, Cost-Effective | 66% | 65% | 41.6% | N/A | N/A | 88.1% | N/A | N/A | 69.2% | $0.005 | 140 | 200K |
| 64 | Claude 3 Haiku | 2024-03 | Anthropic | conversation | Fast, Complete Benchmarks | 65.3% | 75.2% | 33.3% | 50.2% | 85.9% | 75.9% | 73.7% | 88.9% | 38.9% | $0.004 | 160 | 200K |
| 65 | Phi-4 Mini | 2025-04 | Microsoft | conversation | Open Source, Edge Deployable, 3.8B Params | 63.4% | 75.6% | 37.3% | N/A | N/A | 73% | N/A | N/A | 67.5% | $0.0001 | 200 | 16K |
| 66 | GPT-4.1 nano | 2024-11 | OpenAI | conversation | Ultra Fast | 61.9% | 80.1% | 50.3% | 55.4% | N/A | N/A | N/A | N/A | N/A | $0.005 | 150 | 32K |
| 67 | Gemma 3 12B | 2025-03 | Google | conversation | Open Source, Lightweight, On-Device | 60.2% | 74.2% | 37% | 59.8% | N/A | 72% | N/A | N/A | 58% | $0.0001 | 160 | 128K |
| 68 | Amazon Nova Lite | 2025-01 | Amazon | conversation | AWS Native, Ultra Fast, Cheapest Multimodal | 56.6% | 73% | 33% | 55% | N/A | 68% | N/A | N/A | 54% | $0.00006 | 180 | 300K |
| 69 | o3-pro | 2025-01 | OpenAI | reasoning | Upcoming | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.08 | 20 | 128K |
| 70 | GPT-4o Realtime | 2024-10 | OpenAI | conversation | Realtime, Voice | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.02 | 200 | 128K |
| 71 | GPT-4o mini Realtime | 2024-10 | OpenAI | conversation | Realtime, Voice, Cost-Effective | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.008 | 250 | 128K |
How to Use the LLM Leaderboard
Choosing the right AI model matters because cost, speed, and accuracy vary significantly across providers. This leaderboard compares 71 large language models across eight industry-standard benchmarks so you can make evidence-based decisions.
Step 1: Browse the main leaderboard to review overall rankings by composite score.
Step 2: Open the Performance Charts tab to visualize strengths across benchmarks like MMLU, GPQA, and HumanEval.
Step 3: Use Model Comparison to evaluate 2-3 models side by side on metrics relevant to your use case.
Step 4: Review Benchmark Details for scoring methodology and context.
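The side-by-side comparison in Step 3 can also be done programmatically. A minimal sketch in Python, using figures copied from the table above; the field names and the MMLU threshold are illustrative choices, not part of the leaderboard:

```python
# Minimal sketch: shortlist leaderboard rows programmatically.
# Figures are copied from the table above; field names are illustrative.
models = [
    {"name": "GPT-5.4", "mmlu": 93.0, "cost_per_1k": 0.0025, "tps": 70},
    {"name": "Claude Opus 4.6", "mmlu": 92.4, "cost_per_1k": 0.005, "tps": 45},
    {"name": "Kimi K2.5", "mmlu": 92.0, "cost_per_1k": 0.0015, "tps": 85},
    {"name": "DeepSeek V3.2", "mmlu": 90.5, "cost_per_1k": 0.00014, "tps": 95},
]

# Keep models that clear an accuracy bar, then rank the survivors by price.
shortlist = sorted(
    (m for m in models if m["mmlu"] >= 92.0),
    key=lambda m: m["cost_per_1k"],
)
for m in shortlist:
    print(f'{m["name"]}: MMLU {m["mmlu"]}%, ${m["cost_per_1k"]}/1K tokens, {m["tps"]} TPS')
```

Swap in whichever benchmark column matters for your workload; the filter-then-sort pattern stays the same.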
Whether you are building customer support automation, coding assistants, or content generation workflows, selecting the right model can save substantial API spend. Use cost and throughput columns to identify your performance-budget sweet spot. A model that scores 5% lower but costs 80% less may be the practical winner.
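That trade-off is easy to put in concrete terms. A rough sketch using the composite scores and Cost/1K prices of two models from the table; the 50M-tokens-per-month workload is an assumed figure for illustration:

```python
# Back-of-the-envelope cost comparison between two table entries.
# The 50M tokens/month workload is an assumption; the scores and prices
# are the composite Score and Cost/1K figures from the leaderboard.
MONTHLY_TOKENS = 50_000_000

def monthly_cost(cost_per_1k_usd: float) -> float:
    """USD per month at the assumed token volume."""
    return MONTHLY_TOKENS / 1_000 * cost_per_1k_usd

premium = {"name": "Claude Opus 4.6", "score": 91.3, "cost_per_1k": 0.005}
budget = {"name": "DeepSeek V3.2", "score": 84.2, "cost_per_1k": 0.00014}

score_gap = (premium["score"] - budget["score"]) / premium["score"] * 100
savings = monthly_cost(premium["cost_per_1k"]) - monthly_cost(budget["cost_per_1k"])
print(f"{score_gap:.1f}% lower score for ${savings:,.0f}/month saved")
```

The point of the exercise: quality deltas and cost deltas live on very different scales, so always compute both before deciding.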
The leaderboard is updated regularly as new models release and benchmarks evolve. Recheck it before major infrastructure decisions so your stack reflects current capabilities and pricing realities.
For practical evaluation, shortlist models from the leaderboard and run your own task-specific test set before rollout. Benchmarks provide directional guidance, but domain prompts, latency expectations, and compliance constraints can change final selection. Combining public rankings with internal testing gives the most reliable model choice.
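A task-specific test set can start very small. A minimal sketch of such a harness, where `call_model` is a hypothetical stand-in for your provider's SDK call and the two test cases are placeholders for real domain prompts:

```python
# Minimal task-specific eval harness. `call_model` is a hypothetical
# stand-in for a real provider SDK call; replace its body with an API
# request. The test cases are placeholders for real domain prompts.
def call_model(model: str, prompt: str) -> str:
    # Stub so the sketch runs end to end: return canned answers.
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "")

test_set = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "Capital of France?", "expected": "Paris"},
]

def accuracy(model: str) -> float:
    """Fraction of cases whose expected answer appears in the reply."""
    hits = sum(
        case["expected"].lower() in call_model(model, case["prompt"]).lower()
        for case in test_set
    )
    return hits / len(test_set)

for model_id in ["candidate-a", "candidate-b"]:  # placeholder model IDs
    print(f"{model_id}: {accuracy(model_id):.0%} on {len(test_set)} cases")
```

Substring matching is a crude scoring rule; for free-form answers you would swap in an exact-match, regex, or rubric-based grader, but the loop structure is the same.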