Summary
Anthropic has released Claude Sonnet 4.6 – a beta version with significant performance improvements in coding, computer use, reasoning, and agent-based tasks. For the first time, the model offers a one-million-token context window, and it becomes the default model for free users and Pro subscribers. Although Sonnet remains the mid-tier model of the Claude family, it sometimes outperforms Opus 4.5 in benchmarks – at significantly lower cost. New token-saving features such as context compression address cost control for extensive tasks.
People
- Eva-Maria Weiss (Author)
Topics
- AI model families and benchmarking
- Large Language Models (LLMs)
- Computer vision and automation
- Safety in AI applications
Clarus Lead
Claude Sonnet 4.6 sets new standards for cost efficiency. The mid-tier model of the Anthropic family achieves benchmark performance between Opus 4.5 and Opus 4.6 while remaining significantly cheaper. This matters for decision-makers in development and data processing: Sonnet 4.6 becomes the default model for millions of users. The computer use function shows a leap of over 11 percentage points compared to the previous version, reaching a 72.5% success rate in the OSWorld benchmark.
Detailed Summary
Claude Sonnet 4.6 offers comprehensive improvements across multiple dimensions. The performance increase spans coding capabilities, autonomous agent coordination, logical reasoning, and professional design tasks. The new one million token context window enables processing of significantly longer documents and conversation histories – a critical advantage for document-intensive scenarios.
The positioning within the product portfolio remains clear: Haiku is the fastest and most cost-effective model, Sonnet the balanced mid-range offering, Opus the performance pinnacle for highly complex problems. However, the benchmark results complicate this hierarchy: Sonnet 4.6 sometimes rivals Opus 4.5, particularly in standardized tests. Practical performance varies with the specific task.
One focus is computer use – the ability to operate regular software such as LibreOffice, Chrome, and VS Code much as a human would, without explicit API integration. With a 72.5% success rate in the OSWorld benchmark, Sonnet 4.6 demonstrates considerable progress. At the same time, Anthropic identifies a critical security gap: prompt injections – hidden instructions embedded in websites – remain an attack vector. The new version is intended to detect and defend against these more reliably, but the fundamental problem remains unsolved.
Cost control is a central selling point. New features such as context compression condense older conversation history to reduce token consumption. This is necessary: deep reasoning tasks or multi-agent scenarios can quickly become prohibitively expensive. Opus 4.6 remains the tool of choice for such edge cases.
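The idea behind context compression can be illustrated with a minimal sketch: older conversation turns are collapsed into a single compact summary message so that only recent turns consume the full token budget. The function `compress_context`, the role/content message format, and the thresholds below are illustrative assumptions, not Anthropic's actual API or mechanism.

```python
def compress_context(messages, keep_recent=4, summary_chars=200):
    """Replace all but the most recent turns with one summary message.

    `messages` is a list of {"role": ..., "content": ...} dicts, a common
    chat-history shape. This is a naive sketch: a real system would use a
    model-generated summary rather than simple truncation.
    """
    if len(messages) <= keep_recent:
        return list(messages)  # nothing to compress
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    # Naive "summary": concatenate older content and truncate it.
    joined = " ".join(m["content"] for m in older)
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + joined[:summary_chars],
    }
    return [summary] + recent
```

A long history of, say, ten turns would shrink to five messages (one summary plus the four most recent turns), trading fidelity of old context for a smaller token footprint on every subsequent request.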
Key Findings
- Claude Sonnet 4.6 becomes the standard model for free and pro users; for the first time with a 1-million-token context window
- Performance: Benchmark level between Opus 4.5 and 4.6, at 30–50% lower costs
- Computer use improves by 11 percentage points (61.4% → 72.5% OSWorld success rate)
- Security risks (prompt injections) are addressed but not completely solved
- Token-saving mechanisms (context compression) are necessary for cost management in large tasks
Critical Questions
Evidence/Data Quality: How representative are the benchmark metrics (OSWorld 72.5%) for real production scenarios? Are test tasks regularly recalibrated to prevent overfitting?
Conflicts of Interest: Anthropic publishes both the model and the benchmarks. Is there independent third-party validation of performance comparisons with OpenAI models or other competitors?
Causality/Alternatives: To what extent do performance gains result from architectural innovations versus better training? Could these improvements have been achieved with a larger Haiku variant?
Security/Implementation: The statement that prompt injections are "detected and avoided" – how is this defense specifically implemented, and has Anthropic conducted external penetration testing?
Feasibility: What concrete cost savings does the context compression function deliver in typical production scenarios (e.g., 1M-token window)?
Competitive Context: How does Sonnet 4.6 position itself against GPT-4 variants or other open models in computer-use scenarios?
Sources
Primary Source: Anthropic releases Claude Sonnet 4.6 – it can do everything better – heise.de, Eva-Maria Weiss
Verification Status: ✓ 2025
This text was created with the support of an AI model. Editorial responsibility: clarus.news