Summary
Anthropic has released Claude Sonnet 4.6 – a beta version with significant performance improvements in coding, computer use, reasoning, and agent-based tasks. For the first time, the model offers a one-million-token context window, and it becomes the default model for free users and Pro subscribers. Although Sonnet remains the mid-tier model of the Claude family, it sometimes outperforms Opus 4.5 in benchmarks – at significantly lower cost. New token-saving features such as context compression address cost control for extensive tasks.
People
- Eva-Maria Weiss (Author)
Topics
- AI model families and benchmarking
- Large Language Models (LLMs)
- Computer vision and automation
- Safety in AI applications
Clarus Lead
Claude Sonnet 4.6 sets new standards for cost efficiency. The mid-tier model of the Anthropic family achieves benchmark performance between Opus 4.5 and Opus 4.6 while remaining significantly cheaper. This matters for decision-makers in development and data processing: Sonnet 4.6 becomes the default model for millions of users. The computer use function shows a leap of over 11 percentage points compared to the previous version, reaching a 72.5% success rate in the OSWorld benchmark.
Detailed Summary
Claude Sonnet 4.6 offers comprehensive improvements across multiple dimensions. The performance increase spans coding capabilities, autonomous agent coordination, logical reasoning, and professional design tasks. The new one million token context window enables processing of significantly longer documents and conversation histories – a critical advantage for document-intensive scenarios.
The positioning within the product portfolio remains clear: Haiku is the fastest and most cost-effective model, Sonnet the balanced mid-range offering, Opus the performance pinnacle for highly complex problems. However, the benchmark results complicate this hierarchy: Sonnet 4.6 sometimes rivals Opus 4.5, particularly in standardized tests. Practical performance varies with the specific task.
One focus is computer use – the ability to operate regular software such as LibreOffice, Chrome, and VS Code much as a human would, without explicit API integration. With a 72.5% success rate in the OSWorld benchmark, Sonnet 4.6 demonstrates considerable progress. At the same time, Anthropic identifies a critical security gap: prompt injections – hidden instructions embedded in websites – remain an attack vector. The new version is intended to detect and defend against these more reliably, but the fundamental problem remains unsolved.
Cost control is a central selling point. New features such as context compression condense older conversation history to reduce token consumption. This is necessary: deep reasoning tasks or multi-agent scenarios can quickly become prohibitively expensive. Opus 4.6 remains the tool of choice for such edge cases.
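The idea behind context compression can be illustrated with a minimal sketch: older conversation turns are collapsed into a single compact summary message so that only recent turns consume the full token budget. The function `compress_context`, the role/content message format, and the thresholds below are illustrative assumptions, not Anthropic's actual API or mechanism.

```python
def compress_context(messages, keep_recent=4, summary_chars=200):
    """Replace all but the most recent turns with one summary message.

    `messages` is a list of {"role": ..., "content": ...} dicts, a common
    chat-history shape. This is a naive sketch: a real system would use a
    model-generated summary rather than simple truncation.
    """
    if len(messages) <= keep_recent:
        return list(messages)  # nothing to compress
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Naive "summary": concatenate older content and truncate it.
    joined = " ".join(m["content"] for m in older)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + joined[:summary_chars],
    }
    return [summary] + recent
```

A long history of, say, ten turns would shrink to five messages (one summary plus the four most recent turns), trading fidelity of old context for a smaller token footprint on every subsequent request.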
Key Findings
- Claude Sonnet 4.6 becomes the standard model for free and pro users; for the first time with a 1-million-token context window
- Performance: Benchmark level between Opus 4.5 and 4.6, at 30–50% lower costs
- Computer use improves by 11 percentage points (61.4% → 72.5% OSWorld success rate)
- Security risks (prompt injections) are addressed but not completely solved
- Token-saving mechanisms (context compression) are necessary for cost management in large tasks
Critical Questions
Evidence/Data Quality: How representative are the benchmark metrics (OSWorld 72.5%) for real production scenarios? Are test tasks regularly recalibrated to prevent overfitting?
Conflicts of Interest: Anthropic publishes both the model and the benchmarks. Is there independent third-party validation of performance comparisons with OpenAI models or other competitors?
Causality/Alternatives: To what extent do performance gains result from architectural innovations versus better training? Could these improvements have been achieved with a larger Haiku variant?
Security/Implementation: The statement that prompt injections are "detected and avoided" – how is this defense specifically implemented, and has Anthropic conducted external penetration testing?
Feasibility: What concrete cost savings does the context compression function deliver in typical production scenarios (e.g., 1M-token window)?
Competitive Context: How does Sonnet 4.6 position itself against GPT-4 variants or other open models in computer-use scenarios?
Sources
Primary Source: Anthropic releases Claude Sonnet 4.6 – it can do everything better – heise.de, Eva-Maria Weiss
Verification Status: ✓ 2025
This text was created with the support of an AI model. Editorial responsibility: clarus.news