The Ledger: A sourced historical record of AI

US Copyright Office: AI Training Not Categorically Fair Use

A ledger entry in the policy archive, dated 2025-05-09.

Summary

The US Copyright Office released a pre-publication version of Part 3 of its AI and copyright series, concluding that the use of copyrighted works to train AI systems does not categorically qualify as fair use. The report found that commerciality and market harm weigh heavily against AI developers in a standard four-factor fair use analysis, and that training on pirated datasets substantially undermines a fair use defense. The Office declined to recommend a compulsory licensing regime, instead calling for scalable voluntary and collective licensing solutions.

What Happened

Part 3 completed the Office's core analytical work on AI and copyright, addressing the question that had been generating the most litigation: does ingesting copyrighted material to train an AI model constitute fair use? The Office applied the standard four-factor framework and found no categorical answer, but the weight of its analysis cut against the industry's broad fair use claims.

On the first factor (purpose and character of the use), the Office found that the commercial nature of frontier AI development weighed against fair use, and it rejected the argument that training is inherently transformative in a legally meaningful sense. On the fourth factor (effect on the potential market for the original work), the Office found substantial market harm where AI outputs could substitute for licensed content, undermining existing and emerging licensing markets. The report identified the use of openly pirated datasets such as Books3 and LibGen as particularly damaging to fair use arguments, noting that knowingly training on infringing source material should weigh against a defense designed to protect good-faith uses.

The report explicitly rejected a compulsory licensing model, which some AI developers had proposed as a compromise. Instead, it recommended voluntary licensing frameworks and encouraged the development of collective licensing mechanisms analogous to those used in the music industry, where blanket licenses enable large-scale lawful use at manageable transaction costs.

Why It Matters

Part 3 is the most consequential policy statement yet in the AI training data debate. Although it carries no binding legal authority, the Copyright Office's analysis is highly persuasive to courts and Congress alike. Its conclusion that training is not categorically fair use, combined with its specific analysis of how piracy in training datasets undermines a fair use defense, substantially narrows the legal protection AI developers had been assuming. The voluntary licensing recommendation shifts the policy debate from whether AI companies must pay rights holders to how those payments should be structured and collected.

References

  1. Copyright and Artificial Intelligence, US Copyright Office (official)
  2. US Copyright Office Releases Third Report on AI and Copyright: Addressing Training AI Models with Copyrighted Materials, Crowell & Moring, 2025-05-12 (expert interpretation)
  3. US Copyright Office AI Report Part 3: What Authors Should Know, Authors Guild, 2025-05-10 (secondary reporting)