🌐 LLM Leaderboard Update 🌐
LiveBench: Claude 4.6 Sonnet's entry switches to Medium Effort and holds #3 with 75.47, up from High Effort's 75.32!
SWE-Bench Verified: Big shakeup: live-SWE-agent + Claude 4.5 Opus medium hits 79.20% to tie #1 with Sonar Foundation Agent + Claude 4.5 Opus! TRAE drops to #3.
New Results:
=== LiveBench Leaderboard ===
1. Claude 4.6 Opus Thinking High Effort - 76.33
2. Claude 4.5 Opus Thinking High Effort - 75.96
3. Claude 4.6 Sonnet Thinking Medium Effort - 75.47
4. GPT-5.2 High - 74.84
5. GPT-5.2 Codex - 74.30
6. GPT-5.1 Codex Max High - 73.98
7. Gemini 3 Pro Preview High - 73.39
8. Gemini 3 Flash Preview High - 72.40
9. GPT-5.1 High - 72.04
10. GPT-5 Pro - 70.48
11. Kimi K2.5 Thinking - 69.07
12. GLM 5 - 68.85
13. GPT-5.1 Codex - 68.61
14. Claude Sonnet 4.5 Thinking - 68.19
15. GPT-5 Mini High - 65.91
16. DeepSeek V3.2 Thinking - 62.20
17. Grok 4 - 62.02
18. Claude 4.1 Opus Thinking - 61.81
19. Kimi K2 Thinking - 61.59
20. Claude Haiku 4.5 Thinking - 61.32
=== SWE-Bench Verified Leaderboard ===
1. live-SWE-agent + Claude 4.5 Opus medium (20251101) - 79.20
2. Sonar Foundation Agent + Claude 4.5 Opus - 79.20
3. TRAE + Doubao-Seed-Code - 78.80
4. live-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 77.40
5. Atlassian Rovo Dev (2025-09-02) - 76.80
6. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
7. mini-SWE-agent + Claude 4.5 Opus (high reasoning) - 76.80
8. ACoder - 76.40
9. mini-SWE-agent + Gemini 3 Flash (high reasoning) - 75.80
10. mini-SWE-agent + MiniMax M2.5 (high reasoning) - 75.80
11. Warp - 75.60
12. mini-SWE-agent + Claude Opus 4.6 - 75.60
13. TRAE + Claude Sonnet 4 + Opus 4 + Sonnet 3.7 + Gemini 2.5 Pro - 75.20
14. Harness AI - 74.80
15. Sonar Foundation Agent + Claude 4.5 Sonnet - 74.80
16. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
17. JoyCode + Claude 4 Sonnet + GPT-4.1 - 74.60
18. Refact.ai Agent + Claude 4 Sonnet + o4-mini - 74.40
19. Prometheus-v1.2.1 + GPT-5 - 74.40
20. mini-SWE-agent + Claude 4.5 Opus medium (20251101) - 74.40
#ai #LLM #LiveBench #SWE-Bench