Court Papers Reveal Anthropic Bought, Scanned and Destroyed Millions of Books to Train Its AI — And Tried to Keep It Quiet

🇺🇸United States

TechnologyArtificial IntelligencePublishingMediaMusic

Sat, Jan 31, 2026

InsightsWire News2026

Internal filings made public in a recent court order depict a deliberate, large‑scale effort by Anthropic to feed its Claude models with high‑quality written material by purchasing used books, converting them into digital files and recycling the paper originals. Company planners favored physical books as a dependable source of polished prose that could elevate model outputs above noisier web‑scraped text, and the operation relied on industrial scanners and mechanized cutting equipment to process volumes efficiently and at low cost. Anthropic’s legal team anchored its defense in the resale owner’s prerogative to repurpose purchased goods, a justification that featured prominently in court exchanges as judges weighed transformative‑use arguments. The records also disclose earlier use of automated downloads from shadow libraries, a separate procurement channel that executives later characterized as distinct from the book‑scanning program. The book revelations culminated in a significant settlement with authors — reported in filings as roughly $1.5 billion — underscoring both commercial exposure and reputational harm. Parallel litigation made public during discovery has widened the controversy beyond books: music publishers have filed a complaint alleging tens of thousands of songs, lyrics and sheet music were incorporated into training corpora and are seeking more than $3 billion in damages. At the same time, several major publishers have begun blocking automated access to repositories such as the Internet Archive, arguing that organized, machine‑readable archives are being used as indirect routes for mass ingestion by AI firms. Those publisher measures trade reduced archival openness for a tactic aimed at cutting off convenient access to copyrighted reporting and other materials, but they also raise concerns about fragmenting public preservation resources. Taken together, the disclosures and the new complaints illustrate a pattern of aggressive, sometimes opaque data acquisition and prompt questions about when legal defensibility diverges from ethical or reputational risk. For creators, publishers and model builders alike, the episode intensifies pressure for clearer provenance, licensing frameworks, and technical or contractual controls that can prevent unconsented reuse. The affair is likely to push the industry toward auditable supply chains, negotiated licensing deals, and possibly narrower judicial rules on permissible training practices.

PREMIUM ANALYSIS

Read Our Expert Analysis

Create an account or login for free to unlock our expert analysis and key takeaways for this development.

By continuing, you agree to receive marketing communications and our weekly newsletter. You can opt-out at any time.

Free Access

No Payment Needed

Join Thousands of Readers

Recommended for you

Policy & Geopolitics

Anthropic Settlement and Landmark Rulings Force AI Labs to Rework Training Data

Anthropic agreed to a $1.5 billion settlement after courts scrutinized how large language models handle copyrighted material, and parallel lawsuits by music publishers and creators broaden the exposure—pushing AI firms to reassess training-data provenance, licensing and acquisition channels.

Markets & Economy

Major music publishers sue Anthropic, seek $3B+ over alleged mass copyright copying

A coalition led by Concord and Universal alleges Anthropic copied and used more than 20,000 copyrighted musical works to train its Claude models and is seeking in excess of $3 billion, relying in part on discovery from prior litigation to show patterns of bulk acquisition. The filing is part of a broader wave of creator and publisher suits testing how AI builders source training data and could force licensing, provenance controls, or injunctive limits on dataset procurement.

Policy & Geopolitics

Anthropic Accuses DeepSeek, MiniMax and Moonshot of Distillation Mining of Claude

Anthropic alleges three mainland-China labs used over 24,000 fake accounts to record roughly 16 million exchanges from its Claude model to perform large-scale distillation; OpenAI and other industry disclosures show similar extraction tactics but have not independently verified Anthropic’s full counts, deepening policy and legal debates over export controls, telemetry, and model-protection measures.

AI & Technology

OpenAI faces copyright and trademark suit from Encyclopaedia Britannica

Encyclopaedia Britannica sued OpenAI claiming the company ingested proprietary encyclopedia text during model training and that outputs sometimes repeat or misattribute that material; the complaint seeks injunctive relief and trademark remedies. The filing comes amid a broader wave of litigation—including multi‑billion‑dollar demands and a reported $1.5 billion authors’ settlement—that is forcing publishers, archivists and model builders to reassess data sourcing, provenance and licensing practices.

Markets & Economy

Publishers Restrict Internet Archive Access as AI Scraping Risks Rise

Several major news organizations are blocking the Internet Archive’s crawlers amid worries that AI companies could use the Archive as a conduit to collect paywalled journalism. The change intensifies legal and commercial conflicts over training data and raises short-term risks to public access and long-term questions about how journalistic content will be governed for AI use.

Policy & Geopolitics

OpenAI alleges Chinese rival DeepSeek covertly siphoned outputs to train R1

OpenAI told U.S. lawmakers that DeepSeek used sophisticated, evasive querying and model-distillation techniques to harvest outputs from leading U.S. AI models and accelerate its R1 chatbot development. The claim sits alongside similar industry reports — including Google warnings about mass-query cloning attempts — underscoring a wider pattern that challenges existing defenses and pushes policymakers to consider provenance, watermarking and access controls.

Cybersecurity

Amazon Reported More Than One Million AI-Related CSAM Alerts to NCMEC but Refuses to Disclose Sources

Amazon told U.S. authorities it flagged over one million instances of AI-linked child sexual abuse material in 2025, driven largely by content it says was found in external data sets used for model training. The company says it removed the material before training and intentionally over-reported to avoid missing cases, but offered no specifics on where the material originated, leaving many reports unusable for law enforcement.

Policy & Geopolitics

Anthropic Blacklisting Triggers AI Market Shock

A White House‑led supply‑chain designation and de‑facto U.S. blacklist of Anthropic accelerated a broad market repricing across tech and catalyzed a high‑stakes political fight over AI procurement rules. The episode has already prompted roughly $125M in investor‑led pro‑industry political funding, a separate $20M company payment tied to Anthropic, and imperils a roughly $200M defense program with a six‑month migration window.