Summary

A research team from Stanford and Yale has demonstrated in a new study that copyrighted books can be extracted from commercial language models almost word-for-word. Claude 3.7 Sonnet showed the highest extraction rates – 95.8% for "Harry Potter and the Philosopher's Stone" – while other models such as Gemini 2.5 Pro and Grok 3 complied with extraction requests without any jailbreak at all. The results raise fundamental questions about data security and copyright.

People

  • Jonathan Kemper (Author)

Topics

  • AI Models and Memorization
  • Copyright Protection
  • Data Security in Large Language Models
  • Legal Assessment of Model Training

Detailed Summary

Experimental Design and Results

The researchers tested four commercial language models – Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3 – between August and September 2025. They followed a two-phase method: In Phase 1, they asked the models to continue the opening passages of classic works word-for-word. Phase 2 aimed to extract as much text as possible through repeated queries.

The results were dramatic: Claude 3.7 Sonnet delivered 95.8% of the text from "Harry Potter and the Philosopher's Stone" and was able to reconstruct two complete books nearly identically. Gemini 2.5 Pro and Grok 3 extracted 76.8% and 70.3% respectively. Notably, the latter two models complied with the requests directly, with no safety mechanisms to circumvent, while Claude and GPT-4.1 required a "best-of-N jailbreak": the instructions were repeatedly corrupted through random character replacements until one variant got the model to cooperate.

Measurement Method and Scaling

The researchers used the metric "near-verbatim recall" (nv-recall), which only counts coherent text blocks of at least 100 words. This prevents random similarities from being classified as memorization. Crucially: even low percentages represent substantial amounts of text. Thus, 1.3% of "Game of Thrones" in Grok 3 corresponds to approximately 3,700 words of nearly identical text. The longest continuous blocks comprised up to 9,070 words.
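As a rough illustration of the nv-recall idea (the paper's exact matching procedure may differ), here is a sketch that credits only contiguous verbatim word-level matches of at least 100 words, using Python's standard difflib:

```python
from difflib import SequenceMatcher

def nv_recall(book: str, output: str, min_block: int = 100) -> float:
    """Share of the book's words covered by contiguous verbatim
    matches of at least `min_block` words in the model output."""
    book_words = book.split()
    out_words = output.split()
    matcher = SequenceMatcher(None, book_words, out_words, autojunk=False)
    covered = sum(m.size for m in matcher.get_matching_blocks()
                  if m.size >= min_block)
    return covered / len(book_words) if book_words else 0.0
```

With a threshold of 100 words, a model output that merely echoes common short phrases scores zero, while a single reproduced chapter registers in full – which is why even a 1.3% score on a long novel implies thousands of copied words.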

Costs and Practical Relevance

Extraction costs varied dramatically:

  • Claude 3.7 Sonnet: $120 (long contexts)
  • Grok 3: $8
  • Gemini 2.5 Pro: $2.44
  • GPT-4.1: $1.37 (early refusal)

This shows that extraction is cheap enough to make copyright violations economically viable for attackers.

Validation and Context

The team tested a total of 13 books (11 copyrighted, 2 public domain). Works after the training cutoffs served as negative controls – these failed Phase 1 with all models, confirming that extraction actually reflects training data.


Key Findings

  • Claude 3.7 Sonnet shows critical security gaps: Two complete classic novels were reconstructed almost word-for-word.
  • Gemini and Grok are problematic because they reproduce books without any jailbreak techniques – a design issue rather than a circumvention problem.
  • Low percentages are deceptive: 1–2% extraction corresponds to thousands of words of copied text.
  • Memorization is universal: Earlier studies show that even open models like Llama 3.1 70B and image/video models are affected.
  • Costs are low: Entire books can be extracted via API for between roughly $1 and $120, depending on the model.
  • Legal situation fragmented: German court (November 2025, GEMA vs. OpenAI) recognized reproduction in model parameters; British court reached opposite conclusion.

Stakeholders & Those Affected

  • Authors & Publishers: Direct copyright infringement; their works become effortlessly reproducible
  • AI Providers: Liability risks; reputational damage; potential cease-and-desist letters
  • Users: Risk-free access to protected content
  • Judges & Regulators: Unclear how memorization should be legally assessed
  • Training Data Suppliers: Pressure to control training data more strictly

Opportunities & Risks

Opportunities:

  • Sharper awareness of security gaps in models
  • Adaptation of security protocols by developers
  • Regulatory clarity through court cases
  • Transparency about training practices

Risks:

  • Massive economic damage to creative industries
  • Free reproduction of works at massive scale
  • Legal uncertainty for AI companies hampers innovation
  • Consumer and user deception about AI capabilities
  • ⚠️ Loss of trust in AI industry

Action Relevance

For AI Providers:

  • Conduct immediate audits of their own models for memorization
  • Tighten security protocols: Gemini and Grok require urgent safeguards
  • Establish transparent training practices to minimize legal risks

For Regulators:

  • Develop standardized testing methods for memorization
  • Provide clear legal definition: Is memorization training or reproduction?
  • Establish liability rules for model operators

For Authors & Publishers:

  • Demand opt-out mechanisms (e.g., robots.txt for training)
  • Consider damage lawsuits (use GEMA case as precedent)
  • Employ monitoring tools for model misuse

For Users & Society:

  • Critically evaluate AI outputs for reproduced original content
  • Demand consumer protection

Quality Assurance & Fact-Checking

  • [x] Central statements and figures verified (extraction rates, costs, model names)
  • [x] Unconfirmed data marked with ⚠️ (see above)
  • [x] Web research conducted for current data
  • [x] No detected bias or political one-sidedness

⚠️ Limitation: The researchers themselves emphasize that their results do not constitute evaluative comparisons between models. Each experiment ran under different conditions; only the specific settings should be considered.


Supplementary Research

  1. Carnegie Mellon Study (RECAP Method): Independent confirmation of memorization in Claude, Gemini, GPT-4.1, and DeepSeek-V3
  2. Court Case GEMA vs. OpenAI (Munich, Nov. 2025): Legal assessment of memorization as reproduction
  3. Harvard Law School Report (2025): Analysis of global liability risks for generative AI

Source Directory

Primary Source: Stanford & Yale Research Team – "Memorization and Extraction in Large Language Models" (2025) Published via THE DECODER | https://the-decoder.de/

Supplementary Sources:

  1. Carnegie Mellon University: RECAP method for text extraction (2025)
  2. Munich Regional Court: Ruling GEMA v. OpenAI, November 2025
  3. British High Court: Ruling on image models and copyright, October 2025

Verification Status: ✓ Facts checked on January 9, 2026


Footer (Transparency Notice)


This text was created with the assistance of Claude. Editorial responsibility: clarus.news | Fact-checking: January 9, 2026