Summary

A research team from Stanford and Yale has demonstrated in a new study that copyrighted books can be extracted from commercial language models almost word-for-word. Claude 3.7 Sonnet showed the highest extraction rates – 95.8% for "Harry Potter and the Philosopher's Stone" – while other models such as Gemini 2.5 Pro and Grok 3 complied with extraction requests without any jailbreak at all. The results raise fundamental questions about data security and copyright.

People

  • Jonathan Kemper (Author)

Topics

  • AI Models and Memorization
  • Copyright Protection
  • Data Security in Large Language Models
  • Legal Assessment of Model Training

Detailed Summary

Experimental Design and Results

The researchers tested four commercial language models – Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 – between August and September 2025. They followed a two-phase method: In Phase 1, they asked the models to continue the opening passages of classic works word-for-word. Phase 2 aimed to extract as much text as possible through repeated queries.

The results were dramatic: Claude 3.7 Sonnet delivered 95.8% of the text from "Harry Potter and the Philosopher's Stone" and was able to reconstruct two complete books nearly identically. Gemini 2.5 Pro and Grok 3 extracted 76.8% and 70.3% respectively. Notably, the latter two models complied with the requests directly, with no safety mechanisms to circumvent, while Claude and GPT-4.1 required a "best-of-N jailbreak": the instructions were repeatedly corrupted through random character replacements until one variant got the model to cooperate.

Measurement Method and Scaling

The researchers used the metric "near-verbatim recall" (nv-recall), which only counts coherent text blocks of at least 100 words. This prevents random similarities from being classified as memorization. Crucially: even low percentages represent substantial amounts of text. Thus, 1.3% of "Game of Thrones" in Grok 3 corresponds to approximately 3,700 words of nearly identical text. The longest continuous blocks comprised up to 9,070 words.
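As a rough illustration of the nv-recall idea (the paper's exact matching procedure may differ), here is a sketch that credits only contiguous verbatim word-level matches of at least 100 words, using Python's standard difflib:

```python
from difflib import SequenceMatcher

def nv_recall(book: str, output: str, min_block: int = 100) -> float:
    """Share of the book's words covered by contiguous verbatim
    matches of at least `min_block` words in the model output."""
    book_words = book.split()
    out_words = output.split()
    matcher = SequenceMatcher(None, book_words, out_words, autojunk=False)
    covered = sum(m.size for m in matcher.get_matching_blocks()
                  if m.size >= min_block)
    return covered / len(book_words) if book_words else 0.0
```

With a threshold of 100 words, a model output that merely echoes common short phrases scores zero, while a single reproduced chapter registers in full – which is why even a 1.3% score on a long novel implies thousands of copied words.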

Costs and Practical Relevance

Extraction costs varied dramatically:

  • Claude 3.7 Sonnet: $120 (long contexts)
  • Grok 3: $8
  • Gemini 2.5 Pro: $2.44
  • GPT-4.1: $1.37 (early refusal)

This shows that extraction is cheap enough to make copyright violations economically viable for attackers.

Validation and Context

The team tested a total of 13 books (11 copyrighted, 2 public domain). Works after the training cutoffs served as negative controls – these failed Phase 1 with all models, confirming that extraction actually reflects training data.


Key Findings

  • Claude 3.7 Sonnet shows critical security gaps: Two complete classic novels were reconstructed almost word-for-word.
  • Gemini and Grok are problematic because they reproduce books without any jailbreak techniques – a design issue rather than a circumvention problem.
  • Low percentages are deceptive: 1–2% extraction corresponds to thousands of words of copied text.
  • Memorization is universal: Earlier studies show that even open models like Llama 3.1 70B and image/video models are affected.
  • Costs are low: Entire books can be extracted via API for between roughly $1 and $120, depending on the model.
  • Legal situation fragmented: German court (November 2025, GEMA vs. OpenAI) recognized reproduction in model parameters; British court reached opposite conclusion.

Stakeholders & Those Affected

  • Authors & Publishers: Direct copyright infringement; their works become effortlessly reproducible
  • AI Providers: Liability risks; reputational damage; potential cease-and-desist letters
  • Users: Risk-free access to protected content
  • Judges & Regulators: Unclear how memorization should be legally assessed
  • Training Data Suppliers: Pressure to control training data more strictly

Opportunities & Risks

Opportunities:

  • Sharper awareness of security gaps in models
  • Adaptation of security protocols by developers
  • Regulatory clarity through court cases
  • Transparency about training practices

Risks:

  • Massive economic damage to creative industries
  • Free reproduction of works at massive scale
  • Legal uncertainty for AI companies hampers innovation
  • Consumer and user deception about AI capabilities
  • ⚠️ Loss of trust in AI industry

Action Relevance

For AI Providers:

  • Conduct immediate audits of their own models for memorization
  • Tighten security protocols: Gemini and Grok require urgent safeguards
  • Establish transparent training practices to minimize legal risks

For Regulators:

  • Develop standardized testing methods for memorization
  • Provide clear legal definition: Is memorization training or reproduction?
  • Establish liability rules for model operators

For Authors & Publishers:

  • Demand opt-out mechanisms (e.g., robots.txt for training)
  • Consider damage lawsuits (use GEMA case as precedent)
  • Employ monitoring tools for model misuse

For Users & Society:

  • Critically evaluate AI outputs for reproduced original content
  • Demand consumer protection

Quality Assurance & Fact-Checking

  • [x] Central statements and figures verified (extraction rates, costs, model names)
  • [x] Unconfirmed data marked with ⚠️ (see above)
  • [x] Web research conducted for current data
  • [x] No detected bias or political one-sidedness

⚠️ Limitation: The researchers themselves emphasize that their results do not constitute evaluative comparisons between models. Each experiment ran under different conditions; only the specific settings should be considered.


Supplementary Research

  1. Carnegie Mellon Study (RECAP Method): Independent confirmation of memorization in Claude, Gemini, GPT-4.1, and DeepSeek-V3
  2. Court Case GEMA vs. OpenAI (Munich, Nov. 2025): Legal assessment of memorization as reproduction
  3. Harvard Law School Report (2025): Analysis of global liability risks for generative AI

Source Directory

Primary Source: Stanford & Yale Research Team – "Memorization and Extraction in Large Language Models" (2025) Published via THE DECODER | https://the-decoder.de/

Supplementary Sources:

  1. Carnegie Mellon University: RECAP method for text extraction (2025)
  2. Munich Regional Court: Ruling GEMA v. OpenAI, November 2025
  3. British High Court: Ruling on image models and copyright, October 2025

Verification Status: ✓ Facts checked on January 9, 2026


Footer (Transparency Notice)


This text was created with the assistance of Claude. Editorial responsibility: clarus.news | Fact-checking: January 9, 2026