Summary
A research team from Stanford and Yale has demonstrated in a new study that copyrighted books can be extracted from commercial language models almost word-for-word. Claude 3.7 Sonnet showed the highest extraction rate (95.8% for "Harry Potter and the Philosopher's Stone"), while models such as Gemini 2.5 Pro and Grok 3 complied without any jailbreak being needed. The results raise fundamental questions about data security and copyright.
People
- Jonathan Kemper (Author)
Topics
- AI Models and Memorization
- Copyright Protection
- Data Security in Large Language Models
- Legal Assessment of Model Training
Detailed Summary
Experimental Design and Results
The researchers tested four commercial language models – Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 – between August and September 2025. They followed a two-phase method: In Phase 1, they asked the models to continue the opening passages of classic works word-for-word. Phase 2 aimed to extract as much text as possible through repeated queries.
The results were dramatic: Claude 3.7 Sonnet delivered 95.8% of the text from "Harry Potter and the Philosopher's Stone" and was able to reconstruct two complete books nearly identically. Gemini 2.5 Pro and Grok 3 extracted 76.8% and 70.3% respectively. Notably, the latter two models followed instructions without circumventing security mechanisms, while Claude and GPT-4.1 required a "best-of-N jailbreak" – instructions were corrupted through character replacements until the model cooperated.
Measurement Method and Scaling
The researchers used the metric "near-verbatim recall" (nv-recall), which only counts coherent text blocks of at least 100 words. This prevents random similarities from being classified as memorization. Crucially: even low percentages represent substantial amounts of text. Thus, 1.3% of "Game of Thrones" in Grok 3 corresponds to approximately 3,700 words of nearly identical text. The longest continuous blocks comprised up to 9,070 words.
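The nv-recall idea can be sketched as a word-level comparison that only credits contiguous matching blocks above the 100-word threshold. The `SequenceMatcher`-based implementation below is an illustrative assumption, not the paper's code:

```python
from difflib import SequenceMatcher

MIN_BLOCK_WORDS = 100  # threshold from the paper: only blocks >= 100 words count


def nv_recall(book_text: str, model_output: str,
              min_block: int = MIN_BLOCK_WORDS) -> float:
    """Share of the book's words that reappear in the model output as
    contiguous matching blocks of at least `min_block` words."""
    book = book_text.split()
    out = model_output.split()
    matcher = SequenceMatcher(None, book, out, autojunk=False)
    # Sum only blocks long enough to rule out coincidental overlap
    matched = sum(b.size for b in matcher.get_matching_blocks()
                  if b.size >= min_block)
    return matched / len(book) if book else 0.0
```

Because the denominator is the whole book, even tiny percentages imply long verbatim runs: at 1.3%, a novel of roughly 285,000 words (a length consistent with the article's ~3,700-word figure) yields thousands of near-identical words.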
Costs and Practical Relevance
Extraction costs varied widely by model:
- Claude 3.7 Sonnet: $120 USD (long contexts)
- Grok 3: $8 USD
- Gemini 2.5 Pro: $2.44 USD
- GPT-4.1: $1.37 USD (early refusal)
This shows that large-scale extraction is economically viable for attackers.
Validation and Context
The team tested a total of 13 books (11 copyrighted, 2 public domain). Works after the training cutoffs served as negative controls – these failed Phase 1 with all models, confirming that extraction actually reflects training data.
Key Findings
- Claude 3.7 Sonnet shows critical security gaps: Two complete classic novels were reconstructed almost word-for-word.
- Gemini and Grok are problematic because they reproduce books without any jailbreak techniques; the gap is a design flaw rather than an attack artifact.
- Low percentages are deceptive: 1–2% extraction corresponds to thousands of words of copied text.
- Memorization is universal: Earlier studies show that even open models like Llama 3.1 70B and image/video models are affected.
- Costs are low: Gemini and Grok allowed full-book extraction via API for $2.44 and $8 respectively.
- The legal situation is fragmented: a German court (GEMA v. OpenAI, November 2025) recognized reproduction within model parameters, while a British court reached the opposite conclusion.
Stakeholders & Those Affected
| Stakeholder | Impact |
|---|---|
| Authors & Publishers | Direct copyright infringement; their works become effortlessly reproducible |
| AI Providers | Liability risks; reputational damage; potential cease-and-desist letters |
| Users | Risk-free access to protected content |
| Judges & Regulators | Unclear how memorization should be legally assessed |
| Training Data Suppliers | Pressure to control training data more strictly |
Opportunities & Risks
| Opportunities | Risks |
|---|---|
| Sharper awareness of security gaps in models | Massive economic damage to creative industries |
| Adaptation of security protocols by developers | Free reproduction of works at massive scale |
| Regulatory clarity through court cases | Legal uncertainty for AI companies hampers innovation |
| Transparency about training practices | Consumer and user deception about AI capabilities |
| | ⚠️ Loss of trust in the AI industry |
Action Relevance
For AI Providers:
- Conduct immediate audits of own models for memorization
- Tighten security protocols: Gemini and Grok require urgent safeguards
- Establish transparent training practices to minimize legal risks
For Regulators:
- Develop standardized testing methods for memorization
- Provide clear legal definition: Is memorization training or reproduction?
- Establish liability rules for model operators
For Authors & Publishers:
- Demand opt-out mechanisms (e.g., robots.txt for training)
- Consider damage lawsuits (use GEMA case as precedent)
- Employ monitoring tools for model misuse
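An opt-out of the kind demanded above already exists informally for crawling: several providers document robots.txt user-agent tokens that exclude a site from AI training. A sketch (the tokens below are the ones publicly documented by OpenAI, Google, and Anthropic; whether a given crawler honors them is up to the provider):

```text
# robots.txt — exclude known AI-training crawlers site-wide

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Google's opt-out token for AI training
User-agent: Google-Extended
Disallow: /

# Anthropic's crawler
User-agent: ClaudeBot
Disallow: /
```

Note that this only affects future crawling; it does not remove works already present in existing training sets.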
For Users & Society:
- Critical evaluation of AI outputs regarding original content
- Demand consumer protection
Quality Assurance & Fact-Checking
- [x] Central statements and figures verified (extraction rates, costs, model names)
- [x] Unconfirmed data marked with ⚠️ (see above)
- [x] Web research conducted for current data
- [x] No detected bias or political one-sidedness
⚠️ Limitation: The researchers themselves emphasize that their results do not constitute evaluative comparisons between models. Each experiment ran under different conditions; only the specific settings should be considered.
Supplementary Research
- Carnegie Mellon Study (RECAP Method): Independent confirmation of memorization in Claude, Gemini, GPT-4.1, and DeepSeek-V3
- Court Case GEMA vs. OpenAI (Munich, Nov. 2025): Legal assessment of memorization as reproduction
- Harvard Law School Report (2025): Analysis of global liability risks for generative AI
Source Directory
Primary Source: Stanford & Yale Research Team – "Memorization and Extraction in Large Language Models" (2025) Published via THE DECODER | https://the-decoder.de/
Supplementary Sources:
- Carnegie Mellon University: RECAP method for text extraction (2025)
- Munich Regional Court: Ruling GEMA v. OpenAI, November 2025
- British High Court: Ruling on image models and copyright, October 2025
Verification Status: ✓ Facts checked on January 9, 2026
Footer (Transparency Notice)
This text was created with the assistance of Claude. Editorial responsibility: clarus.news | Fact-checking: January 9, 2026