Claude Sonnet 4.6 Benchmarks: Beating Opus at Half the Price
On February 17, 2026, Anthropic released Claude Sonnet 4.6 and quietly disrupted its own product line. Sonnet 4.6 beats Claude Opus 4.6 on the GDPval-AA benchmark for real-world office tasks, matches it within 0.2 points on computer use, and delivers 97–99% of its coding capability, all at $3 per million input tokens versus Opus's $5. For most developers and businesses, the most expensive Claude model is no longer the best choice.
This is not marginal. Sonnet 4.6 jumped from 13.6% to 58.3% on ARC-AGI-2 (novel problem-solving). It leads all models on GDPval-AA. It scores 79.6% on SWE-bench Verified, within striking distance of Opus's 80.8%. And it is now the default model for free and Pro tier users. Here is the full breakdown.
Complete Benchmark Comparison
Every number below comes from Anthropic's published benchmarks, verified by independent outlets including ZDNet, CNET, Forbes, eWeek, and Tom's Guide. This table covers every major evaluation.
| Metric | Sonnet 4.6 | Opus 4.6 | GPT-5.2 | What It Measures |
|---|---|---|---|---|
| GDPval-AA (Elo) | 1633 | 1606 | 1462 | 220 professional tasks across 44 occupations |
| SWE-bench Verified | 79.6% | 80.8% | 80.0% | Real GitHub issue resolution |
| OSWorld-Verified | 72.5% | 72.7% | 38.2% | Operating real software autonomously |
| Finance Agent v1.1 | 63.3% | 60.1% | 59.0% | Agentic financial analysis |
| ARC-AGI-2 | 58.3% | 68.8% | — | Novel problem-solving, IQ-test style |
| Context Window | 1M tokens (beta) | 1M tokens (beta) | — | Maximum input capacity |
| Input Price | $3/M | $5/M | $1.75/M | Cost per million tokens |
| Output Price | $15/M | $25/M | $14/M | Cost per million tokens |
Three things stand out immediately:
GDPval-AA is the most important benchmark here. It evaluates 220 real professional tasks across 44 occupations — finance, legal, healthcare, engineering, and more. These are not synthetic puzzles. Sonnet 4.6's Elo 1633 makes it the #1 model on this leaderboard, beating both Opus and GPT-5.2 by substantial margins. For organizations deploying AI for knowledge work, this is the benchmark that matters most.
OSWorld near-parity is remarkable. Sonnet 4.6 scores 72.5% versus Opus's 72.7% on OSWorld-Verified, which tests AI's ability to navigate spreadsheets, fill multi-step web forms, and interact with real operating systems. Both models have reached human-baseline performance — the approximate 72% threshold where AI matches average human accuracy on routine digital tasks. For context, GPT-5.2 scored 38.2% on the same test, and Sonnet 4.5 scored 61.4%.
ARC-AGI-2 is where Opus still wins convincingly. Opus 4.6 scores 68.8% versus Sonnet 4.6's 58.3% on novel abstract reasoning. This benchmark tests genuine generalization — solving problems the model has never seen in any form. The 10.5-point gap indicates Opus still has a meaningful edge in deep, creative reasoning. But Sonnet 4.6's leap from Sonnet 4.5's 13.6% to 58.3% (a 328% improvement) signals rapid convergence.
The Pricing Equation
Claude Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Claude Opus 4.6 costs $5 per million input tokens and $25 per million output tokens.
The practical impact depends on workload:
For 1 million tokens of input + 100K tokens of output:
- Sonnet 4.6: $3.00 + $1.50 = $4.50
- Opus 4.6: $5.00 + $2.50 = $7.50
That is 40% cheaper per request. At enterprise scale — millions of API calls per month — the savings compound into six figures annually. And for most tasks, Sonnet 4.6's slightly lower capability is invisible in practice.
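To make the arithmetic concrete, here is a minimal sketch of the per-request cost calculation, using the published per-million rates from the table above; the token counts mirror the worked example and are otherwise illustrative.

```python
# Published per-million-token prices in USD (from the benchmark table above).
PRICES = {
    "sonnet-4.6": {"input": 3.00, "output": 15.00},
    "opus-4.6": {"input": 5.00, "output": 25.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """USD cost of a single request at the published per-million rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# The worked example above: 1M input tokens + 100K output tokens.
sonnet = request_cost("sonnet-4.6", 1_000_000, 100_000)  # 4.50
opus = request_cost("opus-4.6", 1_000_000, 100_000)      # 7.50
print(f"Sonnet: ${sonnet:.2f}  Opus: ${opus:.2f}  savings: {1 - sonnet/opus:.0%}")
# Sonnet: $4.50  Opus: $7.50  savings: 40%
```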
Sonnet 4.6 is also available for free. It is now the default model for free-tier and Claude Pro users on Anthropic's platform. API access runs through Claude API, Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry.
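For API users, a minimal request sketch with the Anthropic Python SDK is shown below. The model ID string `claude-sonnet-4-6` is an assumption for illustration; check Anthropic's model documentation for the exact identifier.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# NOTE: the model ID is assumed for illustration; confirm the exact
# string in Anthropic's model list before use.
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the attached earnings call."}],
)
print(message.content[0].text)
```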
Developer Preference Data
Anthropic released internal user preference data that reinforces the benchmarks:
- 70% of developers preferred Sonnet 4.6 over Sonnet 4.5 for coding tasks
- 59% of developers preferred Sonnet 4.6 over Claude Opus 4.5, the previous flagship
- Key reasons cited: reduced overengineering, less "laziness" (refusing to complete tasks), and better instruction following
The "reduced overengineering" point matters more than it sounds. Opus-class models sometimes produce unnecessarily complex solutions — adding abstraction layers, creating interfaces that are not needed, or refactoring code that was not asked to be refactored. Sonnet 4.6 follows instructions more literally, which most production use cases prefer.
Technical Architecture: What Changed
Sonnet 4.6 features a 1 million token context window in beta — the same capacity as Opus 4.6. This allows processing entire codebases, lengthy legal documents, or dozens of research papers in a single request. Previous Sonnet models had significantly smaller context windows.
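Because the 1M window is a beta feature, long-context requests typically have to opt in explicitly. The sketch below assumes the SDK's `extra_headers` mechanism with a beta flag; both the flag value and the model ID are assumptions, so confirm the current strings in the API documentation.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical file: an entire codebase concatenated into one text blob.
with open("codebase_dump.txt") as f:
    codebase = f.read()

# ASSUMPTION: the beta flag and model ID below are illustrative; the
# exact values that enable the 1M-token window are in the API docs.
message = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    extra_headers={"anthropic-beta": "context-1m-2025-08-07"},
    messages=[{"role": "user", "content": f"{codebase}\n\nLocate the authentication bug."}],
)
print(message.content[0].text)
```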
Key architectural upgrades include improvements in:
- Coding: 79.6% SWE-bench Verified, near-parity with the best models
- Computer use: Human-baseline performance on real software interfaces
- Long-context reasoning: Processing vast inputs without degradation
- Agent planning: Multi-step task decomposition and execution
- Knowledge work: Real-world office task performance (GDPval-AA)
- Design: UI/UX generation and visual reasoning
The model's 58.3% on ARC-AGI-2 (up from 13.6% on Sonnet 4.5) points to fundamental improvements in abstract reasoning rather than benchmark optimization. A more-than-fourfold gain on a genuine generalization test suggests architectural changes, not just scaling.
When to Use Sonnet vs Opus
The decision framework is straightforward:
Use Sonnet 4.6 when:
- Building production applications where cost efficiency matters
- Deploying agentic systems for office work, financial analysis, or coding
- Requiring consistent instruction following over creative reasoning
- Processing high volumes of API requests
- Needing computer use capabilities (nearly identical to Opus)
Use Opus 4.6 when:
- Solving novel, never-before-seen problems (ARC-AGI-2 gap)
- Performing agentic terminal coding (shell commands, system administration)
- Conducting multi-disciplinary research requiring deep creative reasoning
- Running agentic search tasks with complex query decomposition
- Treating budget as secondary to peak performance
For most organizations, the answer is Sonnet 4.6 for 80–90% of workloads, with Opus 4.6 reserved for the hardest 10–20%.
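One way to operationalize that split is a thin routing layer that defaults to Sonnet and escalates only for the task types where the benchmarks show a real gap. A minimal sketch follows; the task categories and model IDs are illustrative assumptions, not confirmed API strings.

```python
# Task types where the benchmarks above give Opus a meaningful edge.
OPUS_TASKS = {"novel_reasoning", "terminal_coding", "deep_research", "agentic_search"}

def pick_model(task_type: str) -> str:
    """Default to Sonnet; escalate to Opus only where the gap is real.

    Model IDs are illustrative assumptions, not confirmed API strings.
    """
    return "claude-opus-4-6" if task_type in OPUS_TASKS else "claude-sonnet-4-6"

assert pick_model("code_review") == "claude-sonnet-4-6"
assert pick_model("novel_reasoning") == "claude-opus-4-6"
```

In practice the routing signal can be anything from a static task label to a cheap classifier; the point is that the default flips from "largest model" to "cheapest model that clears the bar".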
Competitive Landscape
Sonnet 4.6 competes directly with GPT-5.2 (OpenAI), GPT-5.3 Codex (coding-focused), and Gemini 2.0 (Google).
Against GPT-5.2, Sonnet wins on GDPval-AA (1633 vs 1462 Elo), Finance Agent (63.3% vs 59.0%), and OSWorld (72.5% vs 38.2%). GPT-5.2 is cheaper at $1.75/$14 per million tokens but scores dramatically lower on computer use.
Against GPT-5.3 Codex, the comparison is apples-to-oranges. Codex targets coding-specific workflows with 77.3% on Terminal-Bench 2.0. Sonnet 4.6 is a general-purpose model that happens to code very well.
Gemini 2.0 pushes multimodal agents with strong vision capabilities but lags on text-based reasoning benchmarks. No direct Elo comparison exists yet on GDPval-AA.
xAI's Grok 4.2 focuses on multi-agent debate for reasoning accuracy. It excels in math (ranked #1 on LMArena) but is not a coding-focused model.
What This Means for the AI Industry
Sonnet 4.6 demonstrates a trend that will reshape AI pricing: mid-tier models are closing the gap to flagship models faster than flagship models can advance. When the second-best model matches 97–99% of the best model's capability at 40% lower cost, the economic argument for the flagship becomes difficult to sustain.
This has implications for:
Anthropic's own business: Opus 4.6 exists to compete at the absolute frontier, but most revenue will flow through Sonnet usage. Anthropic may need to differentiate Opus more aggressively — perhaps through exclusive features, not just raw capability.
Enterprise AI budgets: The era of defaulting to the largest model is ending. Teams that switch from Opus to Sonnet save 40% with negligible quality loss for most applications. Budget-conscious organizations can deploy more agents, process more data, and experiment more freely.
Competitors: OpenAI and Google face a pricing war. Anthropic demonstrated that you can offer near-flagship performance at mid-tier prices. Expect matching price cuts or capability improvements across the board.
Frequently Asked Questions
What is Claude Sonnet 4.6?
Claude Sonnet 4.6 is Anthropic's latest mid-tier model, released February 17, 2026. It beats Claude Opus 4.6 on GDPval-AA (Elo 1633 vs 1606) and offers 97–99% of Opus's coding capability at $3 per million input tokens.
How does Sonnet 4.6 compare to Opus 4.6 on benchmarks?
Sonnet leads on GDPval-AA and Finance Agent. Opus leads on ARC-AGI-2 and agentic terminal coding. They are virtually tied on OSWorld (72.5% vs 72.7%) and SWE-bench Verified (79.6% vs 80.8%).
What does Claude Sonnet 4.6 cost?
$3 per million input tokens and $15 per million output tokens. Opus 4.6 costs $5/$25. Free-tier users get Sonnet 4.6 as the default model.
Is Claude Sonnet 4.6 free?
Yes, it is the default model for free and Pro plan users on Anthropic's platform. API access is billed per token.
What is the context window?
1 million tokens in beta — the same as Opus 4.6. This supports processing entire codebases or document collections in a single request.
Where is Sonnet 4.6 available?
Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and directly through claude.ai for free and paid users.
Sonnet 4.6 is the most important Claude release of 2026, not because it is the most capable, but because it proves that near-flagship performance at mid-tier pricing is achievable. Watch for an Opus 4.7 counter-release, for moves toward cross-provider benchmark standardization, and for whether the 40% price gap forces GPT and Gemini to respond.
