Sonnet 4.6 Explained: Anthropic’s New Mid-Tier Model Is Here
Claude Sonnet 4.6 dropped today, and the headline isn’t just “it’s better.” It’s that developers with early access preferred it over Anthropic’s own top-tier Opus model 59% of the time. That’s the cheaper model beating the expensive one.
First up, the tl;dr
If you only have two minutes, here’s what you need to know. Sonnet 4.6 is a full upgrade across coding, computer use, long-context reasoning, agent planning, and design. But here’s what actually matters for your day-to-day:
- It can use your computer like a person.
  - Anthropic first introduced computer use in October 2024 and called it “experimental.”
  - Sixteen months later, early users report human-level capability on tasks like navigating complex spreadsheets and filling out multi-step web forms across multiple browser tabs.
  - The OSWorld benchmark (which tests real software tasks on a simulated computer) shows steady, significant gains with each Sonnet release.
- 1M token context window (in beta).
  - That’s enough to hold an entire codebase, a stack of legal contracts, or dozens of research papers in a single request.
  - And unlike some models that lose the plot halfway through a long document, Sonnet 4.6 actually reasons across all of it.
- Claude Code users love it.
  - Testers preferred it over the previous Sonnet 70% of the time, reporting fewer hallucinations, less overengineering, and better follow-through on multi-step tasks.
  - The thing developers hated most (the model confidently claiming it finished something it didn’t) happens way less.
- Excel gets MCP connectors. Claude in Excel now connects to S&P Global, PitchBook, Moody’s, FactSet, and others, so you can pull external data into your spreadsheet without leaving it. If you work in finance, this is a big deal.
One detail caught our eye: in a simulated business competition called Vending-Bench Arena, Sonnet 4.6 developed its own strategy. It spent heavily on capacity for 10 months, then pivoted sharply to profitability and crushed the competition. Nobody told it to do that.
The details: Pricing stays the same as Sonnet 4.5 ($3/$15 per million input/output tokens), and it’s already the default model for free and Pro users on claude.ai. If you’ve been paying for Opus to get reliable results, it might be worth testing whether Sonnet 4.6 gets you 90% of the way there at a fraction of the cost.
Now, let’s dive deeper into the details.
Anthropic’s Sonnet 4.6 is the AI model built for the ‘age of agents’
There’s a recurring pattern in AI: a company releases its best, most expensive model. Everyone agrees it’s incredible. Then, a few months later, the same company packages that same level of intelligence into something faster and cheaper, and that’s the one that actually changes how people work.
Anthropic just did exactly that. On Feb. 17, Claude Sonnet 4.6 arrived as the new default model across Claude’s free and Pro plans. On paper, it’s “just” a Sonnet (Anthropic’s mid-tier model class, sitting below the flagship Opus), and in our livestream on Tuesday, we pretty much felt it was just another Sonnet. But in practice, on agentic tasks specifically, Anthropic’s benchmarks show it matches or beats Opus 4.6 on the things that matter most to people using AI as a daily work tool: computer use, office tasks, financial analysis, browser automation, and long-horizon planning.
As discussed above, the pricing remains the same as Sonnet 4.5: $3 per million input tokens and $15 per million output tokens. That’s one-fifth the cost of Opus 4.6. And for anyone who’s been watching their API bills climb into the hundreds of dollars per day running agentic workflows, that’s not just a nice discount. It’s the difference between “cool experiment” and “viable business tool.”
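To make that pricing concrete, here’s a minimal sketch using the Anthropic Python SDK. The model identifier string is an assumption for illustration (check Anthropic’s model list for the exact ID), and the cost math simply restates the published $3/$15 per-million-token rates.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Model ID is assumed for illustration; confirm the exact string in Anthropic's docs.
MODEL = "claude-sonnet-4-6"

response = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize the main revenue drivers in this 10-Q excerpt."}],
)

# Back-of-the-envelope cost at Sonnet pricing: $3 per 1M input tokens, $15 per 1M output tokens.
usage = response.usage
cost = usage.input_tokens * 3 / 1_000_000 + usage.output_tokens * 15 / 1_000_000
print(f"{usage.input_tokens} in / {usage.output_tokens} out ≈ ${cost:.4f}")
```

Run the same math at Opus rates and the one-fifth cost difference is exactly why agentic workflows that loop through thousands of calls a day migrate to Sonnet first.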
As Will Brown put it on X: “Sonnet 4.6 is the first flagship LLM since BloombergGPT to be targeted primarily at the finance crowd.” He’s half-joking, but only half. This model was clearly trained with agents in mind, as the benchmarks show.
Let’s break down everything that makes Sonnet 4.6 significant, from the headline numbers to the weird stuff buried 90 pages deep in its 134-page system card.
The benchmark breakdown: Where it wins, where it doesn’t
Let’s be precise about what Sonnet 4.6 actually does well and where Opus still has an edge. The numbers matter here because “it’s basically the same” is true for some tasks and misleading for others.
Where Sonnet 4.6 matches or beats Opus 4.6
Computer use (OSWorld-Verified): 72.5% vs. 72.7%. Essentially tied.
Real-world office tasks (GDPval-AA): Sonnet 4.6 hit an ELO of 1633, actually slightly ahead of Opus 4.6’s 1606. This benchmark, run by Artificial Analysis, tests models on 220 professional tasks across 44 occupations (accountants, analysts, designers, editors) and 9 industries. Think tasks like “prepare a detailed amortization schedule in Excel for prepaid expenses” or “create a pitch deck analyzing market trends.” Sonnet 4.6 is now the #1 model on this leaderboard.
Financial analysis (Finance Agent by Vals AI): 63.3% with max thinking, beating Opus 4.6 (60.05%) and GPT-5.2 (58.53%). This measures research on SEC filings of public companies.
Web automation (WebArena-Verified): Sonnet 4.6 scored state-of-the-art on the full set, exceeding Opus 4.6 among single-agent systems.
Agentic search (BrowseComp): 74.72%, above Opus 4.5, and with a multi-agent setup reached 82.62%.
Deep research (DeepSearchQA): State-of-the-art results across all models tested.
Customer service (τ²-bench): 97.9% on Telecom, 91.7% on Retail. Near-perfect.
Long-context graph reasoning (GraphWalks): Sonnet 4.6 is actually Anthropic’s best model for this, beating even Opus 4.6.
Scientific chart understanding (CharXiv Reasoning): 77.4% with tools, matching Opus 4.6.
Medical calculations (MedCalc-Bench): 86.24%, slightly above Opus 4.6 (85.24%).
Cybersecurity (CyberGym): 65.2%, nearly matching Opus 4.6’s 66.6%.
Reasoning benchmarks (SimpleBench): Now on par with Opus 4, per independent testing by LM Council.
Context engineering (Letta Context-Bench): 70% improvement in token efficiency and 38% improvement in accuracy over Sonnet 4.5.
Where Opus 4.6 still leads
Pure coding (SWE-bench Verified): 79.6% vs. 80.8%. Close, but Opus retains a small edge on complex software engineering tasks.
Terminal tasks (Terminal-Bench 2.0): 59.1% vs. 65.4%. Opus has a clearer advantage here.
Deepest reasoning (GPQA Diamond): 89.9% vs. 91.3%. For graduate-level science questions, Opus still pulls ahead.
Root cause analysis (OpenRCA): 27.9% vs. 34.9%. Opus is significantly better at diagnosing complex software failures across enterprise systems.
ARC-AGI-2 fluid intelligence: 58.3% vs. 68.8%. For novel pattern reasoning, Opus keeps a healthy lead.
Codebase refactoring and multi-agent coordination: Anthropic specifically notes Opus 4.6 remains the stronger choice for tasks demanding “the deepest reasoning.”
For tasks that look like work (spreadsheets, presentations, data analysis, browser automation, tool use, financial research), Sonnet 4.6 is functionally interchangeable with Opus. For tasks that look like hard computer science (complex debugging, novel reasoning, large-scale code refactoring), Opus still has an edge.
As Alex Finn put it in his breakdown video: “Sonnet is not better than Opus at any specific thing, but it is just as good as Opus 4.6 when it comes to agentic tasks specifically. This is massive because it means it’s just as good as a brain for tools like OpenClaw and Claude Code… at a fifth of the price.”
What developers are actually doing with it
The response from the developer community was immediate and telling.
OpenClaw released a same-day update to support Sonnet 4.6, and users are reporting it as the new default model for their AI agent workflows. The logic is simple: if computer use and tool use performance is essentially the same as Opus, but the cost is one-fifth, you run Sonnet for everything except the hardest coding tasks.
Alex Finn’s breakdown laid out the practical decision framework: use Sonnet 4.6 as your main agent model, use Opus only for planning or one-shot implementations of complex components, and use Codex for pure coding tasks inside agent frameworks.
Meta Alchemist captured the consensus view: “Sonnet 4.6 feels like it was made for OpenClaw… with how much emphasis they put on running the apps on your computer, and tool usage. Almost the same levels there as Opus 4.6. If you are using Claude with OpenClaw, using Sonnet 4.6 will be faster and cheaper compared to Opus.”
Letta, the agent framework company, integrated Sonnet 4.6 and reported near-Opus-level performance on context engineering tasks with 70% better token efficiency. They did note one behavioral difference: Sonnet 4.6 is less likely to delegate work to sub-agents or explicitly trigger plan modes, so prompt tuning may be needed for complex multi-agent setups.
Cline 3.64.0 launched with Sonnet 4.6 support, highlighting clearer communication with the coding assistant, better framework integration, and improved codebase search.
The Firecrawl team identified what they called “the perfect web automation stack”: Sonnet 4.6 plus Agent Browser plus Firecrawl’s browser sandbox.
And Wes Winder offered the obligatory reality check in meme form: “Sonnet 4.6 just refactored my entire codebase in one call. 64 tool invocations. 1M+ new lines. 17 brand new files. It modularized everything. Broke up monoliths. Cleaned up spaghetti. None of it worked. But boy was it beautiful.”
The competitive landscape: How it stacks up against GPT-5.2 and Gemini 3 Pro
It’s worth putting Sonnet 4.6 in the broader competitive context, because this isn’t just an Anthropic-vs-Anthropic story.
On GDPval-AA (real-world knowledge work), Sonnet 4.6’s ELO of 1633 puts it ahead of GPT-5.2 (1462) and Gemini 3 Pro (1201) by meaningful margins. On the Finance Agent benchmark, it beats GPT-5.2 by nearly 5 percentage points. On DeepSearchQA (multi-step research tasks), it’s state-of-the-art across all models tested.
On traditional reasoning benchmarks, the picture is more competitive. GPT-5.2 leads on GPQA Diamond (93.2% vs. 89.9%) and MMMU-Pro with tools (80.4% vs. 75.6%). Gemini 3 Pro leads on MMMLU multilingual understanding (91.8% vs. 89.3%). But these are the kinds of academic benchmarks that, increasingly, don’t predict which model will be most useful in practice.
Where Sonnet 4.6 has a more distinct advantage is in the infrastructure surrounding it. Programmatic tool calling, context compaction, adaptive thinking, and the computer use API are all capabilities that GPT-5.2 and Gemini 3 Pro either don’t offer or implement differently. For developers building agentic systems, these features often matter more than a few percentage points on a multiple-choice test.
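As a rough illustration of what tool calling looks like on the Claude API, here’s a minimal sketch using the standard tools parameter on the Messages API. The get_stock_price tool and the model ID are hypothetical, and Sonnet 4.6’s “programmatic tool calling” may go beyond the basic request shape shown here; this is just the documented tool-use format.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tool for illustration; the schema follows Anthropic's standard tool-use shape.
tools = [
    {
        "name": "get_stock_price",
        "description": "Look up the latest closing price for a ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string", "description": "e.g. AAPL"}},
            "required": ["ticker"],
        },
    }
]

response = client.messages.create(
    model="claude-sonnet-4-6",  # assumed model ID
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What is Apple trading at right now?"}],
)

# If the model decides to call the tool, the response includes a tool_use block;
# your code executes it and sends the result back in a follow-up message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```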
One interesting data point from the system card: Anthropic measured how much “thinking” each model does on multilingual questions. Gemini 3 Pro used 1,078 tokens per question. Sonnet 4.5 used 437. Sonnet 4.6 used 246. Opus 4.6 used 191. GPT-5.2 Pro used 127. The models achieve comparable accuracy at wildly different levels of computational effort, which means efficiency (and therefore cost and speed) varies enormously even when benchmark scores look similar.
On the Petri open-source safety audit, which enables apples-to-apples comparison across model providers, Sonnet 4.6 showed stronger safety properties than every other provider’s API model tested, including GPT-5.2, Gemini 3 Pro, Grok 4.1 Fast, and Kimi K2.5.
How to think about this if you actually use AI for work
Here’s the practical takeaway, stripped of benchmark jargon.
- If you use Claude through the website or app: Sonnet 4.6 is now your default model. You don’t need to do anything. It’s faster and more capable than what you had yesterday, especially for file creation, data analysis, and any task that involves working with spreadsheets, documents, or web research.
- If you use Claude Code or agentic coding tools: Sonnet 4.6 should be your default for most tasks. Save Opus for complex architecture decisions, large refactors, or situations where you need the absolute best code quality on the first try. The 1M token context window means it can hold your entire codebase in memory.
- If you build applications on the Claude API: The combination of Sonnet 4.6 performance and programmatic tool calling is a meaningful cost reduction. Tasks that required Opus-class models (and Opus-class pricing) can now run on Sonnet. The 1M context window plus context compaction means you can build much longer-running agents without hitting limits (see the sketch after this list).
- If you work in finance: This is genuinely the strongest AI model available for financial analysis tasks, including SEC filing research, financial modeling, and structured document generation. The benchmarks support this, and it’s rare for a Sonnet-class model to beat every Opus and GPT variant on a finance-specific evaluation.
- If you’re evaluating AI models for your organization: The GDPval-AA results are probably the most relevant benchmark to look at. It tests real professional tasks across real occupations, and Sonnet 4.6 is currently #1, slightly ahead of Opus 4.6. For most “knowledge work” use cases, this is the best value in AI right now.
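Here’s the sketch referenced in the API bullet above: routing routine agent steps to Sonnet, reserving Opus for the hardest reasoning, and opting into the long-context beta. The model IDs and the beta header value are assumptions (Anthropic has shipped 1M-context betas for earlier Sonnets behind a dated anthropic-beta header); confirm the exact strings in the current docs before relying on them.

```python
import anthropic

client = anthropic.Anthropic()

# All identifiers below are assumptions for illustration; confirm against Anthropic's docs.
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"
LONG_CONTEXT_BETA = "context-1m-2025-08-07"  # header used for earlier Sonnet 1M betas

def run_task(prompt: str, hard: bool = False) -> str:
    """Send routine agent steps to Sonnet; reserve Opus for the hardest reasoning."""
    response = client.messages.create(
        model=OPUS if hard else SONNET,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        extra_headers={"anthropic-beta": LONG_CONTEXT_BETA},  # opt into the 1M-token window
    )
    return response.content[0].text

print(run_task("Draft an amortization schedule for these prepaid expenses: ..."))
```

The design choice mirrors the consensus framework from developers above: Sonnet by default, Opus only when the task clearly demands the deepest reasoning.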
Sonnet 4.6 is what happens when Opus-level intelligence meets Sonnet-level pricing, and it’s built from the ground up for the thing that will define the next year of AI: agents that actually do work on your behalf.
The age of AI as a “chatbot you type questions into” is rapidly giving way to the age of AI as “a coworker that uses your computer, reads your documents, and gets things done while you sleep.” Sonnet 4.6 is the model that makes that transition economically viable for everyone, not just the companies willing to burn thousands per day on API costs.
For most people, this should just be the model you use. No asterisks, no caveats, no “but wait for Opus.” Just use it.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.
