Reclaiming Fair Use: A Technical and Legal Refutation of the U.S. Copyright Office’s Generative AI Training Report

Introduction
In reviewing the U.S. Copyright Office’s Part 3 Report on Generative AI Training, I see a cautious stance that mostly favors copyright owners. The Report concludes that copying works to train AI models is usually prima facie infringement, and that only a narrow range of uses might qualify as fair use—especially given claims of market harm and lost licensing opportunities. In this response, I argue that the Report’s analysis rests on flawed technical assumptions and misapplications of copyright law.
First, I examine the Report’s technical descriptions of machine learning, generative models, data scraping, and training processes—identifying inaccuracies and misleading framings. I show how these misperceptions lead to incorrect legal conclusions. In particular, I contend the Report: (1) overstates the extent to which AI training “copies” and stores expressive content, conflating AI’s statistical learning with human-like acts of reading and learning; (2) undervalues the transformative nature of using copyrighted works as informational inputs to develop new AI capabilities; and (3) embraces an expansive “market harm” theory—including a novel “market dilution” concept—that finds infringement in competition and innovation, which copyright law is not meant to forbid.
I will first correct the technical record, explaining how generative AI training actually works and why it is not equivalent to traditional copying. I then critically analyze the Report’s legal reasoning, factor by factor, showing that, when the technology is properly understood, the use of copyrighted materials in AI training often fits within well-established fair use precedent. Drawing on case law (from Google Books to Sega v. Accolade) and scholarly commentary, I demonstrate that many of the Report’s conclusions—from treating intermediate AI training copies as infringing, to endorsing speculative harms like market dilution—are unsupported by law or precedent. Finally, I address the policy implications: the Report’s recommendation to rely on voluntary licensing (and even hints at collective licensing) is built on an unduly pessimistic view of fair use and an overestimation of licensing feasibility. For Congress and policymakers, I believe the better approach is to recognize that generative AI training can coexist with copyright’s goals—fostering innovation and new creativity without undermining the market for original works—just as prior technologies (search engines, data mining tools, etc.) have done under fair use.
I. Technical Background: Correcting Misconceptions in the Report
A. Machine Learning Is Analytical, Not Literal Copying
When I look at the Report’s description of machine learning, I notice that it sometimes blurs the line between “using” a work as informational input and “copying” that work in the ordinary sense. In machine learning, an algorithm ingests training examples (texts or images) and adjusts numeric model parameters to capture general patterns—grammar, style, factual associations, and more—from those examples. The end product is a mathematical model, not a repository of verbatim passages.
Indeed, even the Report notes that generative models do not store training data as literal text or images. They convert language into tokens (numerical representations) and statistically weight those tokens. For instance, an AI model trained on the phrase “It was the best of times, it was the worst of times” doesn’t keep a textual copy. Instead, it tunes its internal matrices so that, when prompted with “It was the ___ of times,” it will likely predict “best” or “worst.” This process is analogous to how I absorb knowledge as a reader—it’s learning, not duplication.
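To make the point concrete, here is a deliberately toy sketch in Python (my illustration, not any real system’s code) of statistical next-word learning. What survives “training” is a table of probabilities derived from the text, not the text itself:

```python
from collections import Counter, defaultdict

# Toy next-word "model" trained on the Dickens line. All it retains are
# co-occurrence counts converted into probabilities: numbers, not prose.
corpus = "it was the best of times it was the worst of times".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word: str) -> dict:
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("the"))  # {'best': 0.5, 'worst': 0.5}
```

A real model does the same thing at vastly greater scale and with far richer context, but the principle is identical: the training text is consumed to set numerical parameters, and only the parameters remain.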
Technical clarification: Generative AI models like large language models (LLMs) generate outputs one token at a time, predicting the next token given context, rather than retrieving stored content. For example, given “Twinkle, twinkle, little ___,” an LLM predicts “star” because that word has the highest probability, not because it’s retrieving a stored rhyme file. The model has generalized from training, not copied a file.
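As a sketch of this mechanism (assuming the open-source Hugging Face transformers library and the small public gpt2 checkpoint; production systems differ in scale, not in kind), one can inspect a model’s next-token distribution directly:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small public model purely for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Twinkle, twinkle, little", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # a score for every vocabulary token
probs = torch.softmax(logits, dim=-1)        # normalize scores into probabilities

top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(idx)!r}: {p:.3f}")   # ' star' should rank at or near the top
```

Whether “star” in fact tops the list depends on the checkpoint, but the mechanism is the point: the model computes a probability distribution from its weights; it does not look up a stored rhyme.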
By focusing on AI’s ingestion of “massive troves of data,” the Report correctly observes the scale but risks implying that the technology depends on wholesale copying of protected expression. In reality, machine learning treats those works as data, not as creative works to republish. While the Report sometimes acknowledges this in footnotes, I think it doesn’t consistently maintain this crucial distinction in its analysis, leading to shaky infringement reasoning.
B. Generative Models Abstract and Compress—They Do Not Retain Full Copies
The Report devotes a section to “Memorization” and highlights debate over how much AI models “remember” specific training examples. In my view, this is a critical technical point. Modern AI models are incredibly compressive: a large language model may ingest hundreds of gigabytes of text, but distill that into billions of numerical weights that encode general knowledge. Except for rare anomalies, the model cannot reconstruct any particular document from those weights; the information is transformed and entangled.
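Back-of-the-envelope arithmetic makes the compression vivid. Using publicly reported figures (the Pile corpus is roughly 825 GB of text, and a 6-billion-parameter model of the kind studied in the memorization literature stores each weight in 16 bits; the exact numbers are illustrative):

```python
# Illustrative compression arithmetic; figures are approximate.
corpus_bytes = 825e9   # the Pile training corpus: ~825 GB of text
n_params     = 6e9     # a 6-billion-parameter model
bytes_per_wt = 2       # 16-bit (fp16/bf16) storage per weight

model_bytes = n_params * bytes_per_wt  # ~12 GB of weights
print(f"~{corpus_bytes / model_bytes:.0f}x smaller than the training text")  # ~69x
```

A 12 GB set of weights simply cannot contain a verbatim archive of 825 GB of text; what it contains is a lossy, entangled statistical summary.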
Research shows that, while memorization of exact phrases can occur, it is rare. The Report cites a study finding that a 6-billion-parameter model memorized only about 1% of its training data. In practical terms, that means roughly ninety-nine percent of the dataset left no extractable verbatim trace; memorized sequences are outliers, not the norm. Developers also have strong incentives to minimize memorization, because a model that merely parrots training data is less useful and more legally risky.
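The measurement behind findings like these can be sketched simply. In the style of the extraction tests in the literature (a simplified version; the transformers library and the parameter choices k and m are my assumptions for illustration), a sequence counts as “memorized” only if the model, prompted with a prefix drawn from a training document, reproduces the true continuation verbatim:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def is_memorized(text: str, k: int = 32, m: int = 32) -> bool:
    """Prompt with the first k tokens of a document; count it as memorized
    only if greedy decoding reproduces the next m tokens exactly."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    if ids.numel() < k + m:
        return False
    prompt = ids[:k].unsqueeze(0)
    out = model.generate(prompt, max_new_tokens=m, do_sample=False)
    return torch.equal(out[0, k:k + m], ids[k:k + m])
```

Studies applying tests of this kind find that only a small fraction of training sequences pass; for everything else, the model has generalized rather than recorded.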
While the Report lists factors influencing memorization and notes ongoing research on mitigation, its legal analysis often treats any retention or ability to output original text as equivalent to copying. I find this misleading. If a human author memorizes a favorite line and later uses it, context and extent matter. With AI, the “memory” is fragmentary and diffuse. Calling AI training an act of copying a work in full is like saying a student who can recite one line from a novel has “copied the novel.” The vast majority of AI output is new and never appears verbatim in the training set.
C. Data Acquisition and “Scraping”: Scale Versus Feasibility
The Report correctly notes that developers assemble training datasets by crawling and scraping large portions of the internet, implicating millions of works. I agree that downloading or scraping a webpage to include in a training corpus makes an unauthorized copy—but so does a search engine when it caches web pages, and courts have long found that kind of copying can be fair use.
The Report blurs the distinction between access and use, treating “data collection and curation” as infringing before ever reaching fair use. If data is acquired unlawfully (hacking, piracy), I agree that weighs against fair use. But the typical case involves scraping publicly available material, which I see as analogous to a library copying works to build an index, something courts have upheld as fair use.
Comprehensive data collection is essential for state-of-the-art AI. The Report’s own evidence shows that it’s not practically possible to license content at the scale needed for modern AI. Instead of treating this as an argument for broad fair use, the Report views large-scale use skeptically. I believe this overlooks the public interest in enabling transformative technologies, just as search engines were allowed to index the web for public benefit.
D. AI Outputs Are New Creations, Not Unlawful Derivatives
Crucially, for the overwhelming majority of inputs, there is no one-to-one correspondence in outputs. A generative AI model does not simply spit out “copies” of training works except in rare, anomalous cases. It creates new text, images, or music that may reflect the influence of many works but typically contains no substantial portion of any particular one.
If an output were substantially similar to a specific copyrighted work (e.g., if ChatGPT reproduced half a novel verbatim), that output itself would infringe; but that is a misuse of the tool, not an inherent result of training. I view this as analogous to a word processor: it can be used to infringe, but building and distributing it is not inherently infringing.
Style imitation is not copyright infringement. AI can write “in the style of Jane Austen” just as a human author can, producing new expression rather than copying old. Copyright does not protect style or a general creative “voice.” I believe the Report’s conflation of stylistic influence with infringement overreaches, ignoring both the law and the technical reality.
Summary of the Technical Points:
In sum, I see generative AI training as an analytic, transformative process much more akin to search indexing or student learning than to reprinting books. While there is intermediate copying at the input stage, the output is new creative content. The Report, by treating AI development as another form of content appropriation, applies copyright doctrines too restrictively. As I turn to the legal analysis, I’ll show how a more accurate technical understanding leads to a far more permissive outcome under copyright law.
II. Legal Analysis: Why the Report’s Conclusions Overreach
A. Prima Facie Infringement vs. Intermediate Copying
The Report concludes that several stages in generative AI development implicate owners’ exclusive rights and are thus prima facie infringement, absent a defense. Technically, reproducing a work (even into RAM or a model’s memory) can qualify as making a copy. But I believe the analysis shouldn’t stop there.
There is a long line of cases holding that intermediate copying for a transformative purpose is not actionable infringement when justified by fair use. In Sega v. Accolade, the Ninth Circuit held that making verbatim intermediate copies of game code in order to reverse-engineer its unprotected functional elements was fair use. In Authors Guild v. Google, Google’s scanning of millions of books was prima facie copying, yet fair use applied because the copying served a transformative search tool.
The Copyright Office Report acknowledges these cases but then treats the entire AI pipeline—from data scraping to model deployment—as a single, exploitative act. I see this as a misapplication of copyright law. Copying thousands of books into an AI training set should be evaluated for what it is: an intermediate, non-consumptive use aimed at developing new technology, not a straightforward reproduction or adaptation.
Any prima facie copying must, of course, be justified, and the transformative nature of AI training supplies that justification: the purpose is to extract knowledge, not to enjoy or redistribute expressive content. The Report itself agrees that training on a large and diverse dataset is often transformative, but it then hedges where the resulting model is used commercially or in ways that affect authors. Those concerns, I believe, are properly examined under the fourth factor’s market-effect inquiry, not treated as reasons to negate the use’s transformative character at the outset.
B. Factor One: Purpose and Character of the Use
Under Factor One, the question is whether the secondary use adds something new, with a further purpose or different character. AI training is, in my view, highly transformative. The use is to extract knowledge, not to market the original expression.
The Report downplays this by insisting that the end use of the AI model matters—if the model outputs expressive works that “compete,” the use is less transformative. I think this reasoning confuses purpose with market effect. Even if the output serves a similar purpose (e.g., entertainment), the use of the original in training was to develop new creative capabilities, not to republish the original. I’m also skeptical of the Report’s effort to distinguish AI learning from human learning for copyright purposes; the key legal issue is what’s output, not how perfect the memory is.
Commercial nature is relevant, but courts from Campbell onward have repeatedly held that even fully commercial uses can be fair when the other factors are strong. In most AI training, the purpose and character of the use remain fundamentally transformative.
C. Factor Two: Nature of the Copyrighted Works Used
Factor Two typically asks whether the originals are factual or creative, and published or unpublished. Most AI training data is published and includes both creative and factual material. While using creative works might tilt this factor slightly against fair use, courts rarely treat Factor Two as decisive—especially where the use is transformative and involves published works. Even training on creative content, I believe, is more about extracting functional or informational value than appropriating the work’s core aesthetic.
D. Factor Three: Amount and Substantiality of the Portion Used
AI systems often ingest entire works. However, copyright law’s analysis here is context-sensitive: the question is whether the amount used was reasonable for the transformative purpose. In AI training, using entire works is often essential for effective model performance. As in Google Books and other fair use cases, full copying can be justified by the need to capture comprehensive patterns. AI training does not target the “heart” of works but treats all content as data.
E. Factor Four: Market Effects and the Misuse of “Market Dilution” Theory
The Report treats Factor Four, the effect of the use upon the potential market for or value of the copyrighted work, as paramount. It claims that AI training and outputs threaten creators by causing lost sales, market dilution, and lost licensing opportunities.
I agree that direct substitution (e.g., outputting verbatim works) can harm markets and is not defensible as fair use. But such misuse is rare and addressable by output moderation and existing law.
Where I strongly disagree is with the Report’s “market dilution” theory: that any AI output—even without copying—can harm authors simply by increasing competition in the marketplace. No court has ever recognized this as actionable market harm under copyright law. Copyright protects against unauthorized copying of expression, not against competition or the rise of new genres.
Similarly, I’m not persuaded by the claim that emerging licensing markets for AI training mean that every unlicensed use is market harm. Courts have long held that hypothetical or new licensing markets for transformative uses should not count against fair use. Otherwise, any new use would always weigh against fair use, undermining the doctrine.
The public benefit of unlicensed training—enabling new tools, democratizing creation, advancing science—should weigh in favor of fair use, especially where no direct market for the original is harmed.
III. Policy Implications and Conclusion: Toward a Balanced Approach
The Copyright Office Report urges the development of licensing solutions but stops short of recommending new legislation. While I support both innovation and artists’ livelihoods, overstating the legal risks of AI training, or implying that it generally requires a license, could chill innovation and benefit only the large players able to strike major deals.
Fair use is a flexible tool that has historically enabled new technologies—photocopiers, VCRs, search engines, and machine learning. Generative AI is the next step in this evolution. While occasional verbatim outputs or extreme style mimicry can raise issues, I believe these are best addressed by targeted technical or legal solutions, not by broadly treating training as infringing.
I am skeptical that licensing markets can cover the full diversity of internet content, especially for smaller or individual creators. Fair use, by contrast, allows all sorts of works—big and small—to be used, as long as the use is transformative and not harmful to the market for those works.
Other countries, like Japan and those in the EU, have recognized the importance of enabling text and data mining, with exceptions that explicitly permit AI training. The U.S. has relied on fair use for similar results, supporting leadership in search and AI.
In conclusion, I urge policymakers and courts to recognize AI training as generally transformative and akin to indexing or analysis, especially where outputs serve research, education, or creative augmentation. Copyright should recognize harm only where there is actual appropriation of protected expression or clear market substitution. Industry best practices can further minimize memorization and verbatim outputs. Licensing is valuable where feasible, but it should not be a precondition for fair use where no practical alternative exists. Expanding copyright to cover style or market displacement would upset copyright’s fundamental balance and stifle innovation.
The Report’s caution, while well-intended, rests on flawed technical and legal premises. A more nuanced understanding, grounded in both technical reality and legal precedent, supports my view that generative AI training is the kind of innovative, transformative activity fair use is meant to protect. By correcting the Report’s mischaracterizations, I urge a balanced perspective—one that safeguards creators from true misappropriation, but allows machines to learn from human culture and fuel the next generation of creativity and knowledge.
Sources Cited
U.S. Copyright Office, Copyright and AI, Part 3: Generative AI Training (Pre-Pub. May 2025).
Authors Guild v. Google, Inc. (Google Books), 804 F.3d 202 (2d Cir. 2015).
Authors Guild v. HathiTrust, 755 F.3d 87 (2d Cir. 2014).
Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994).
Andy Warhol Found. v. Goldsmith, 143 S. Ct. 1258 (2023).
Sega Enters. Ltd. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992).
Sony Computer Entm’t v. Connectix, 203 F.3d 596 (9th Cir. 2000).
Kelly v. Arriba Soft Corp., 336 F.3d 811 (9th Cir. 2003).
Perfect 10, Inc. v. Amazon.com, Inc. (Google Image Search), 508 F.3d 1146 (9th Cir. 2007).
A.V. v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009).
Lemley & Casey, Fair Learning, 99 Tex. L. Rev. 743 (2021).
Cooper & Grimmelmann, The Files Are in the Computer: Copyright, Memorization and Generative AI (2024).
AI Technical Sources: Holtzman et al., The Curious Case of Neural Text Degeneration (ICLR 2020); Carlini et al., Quantifying Memorization Across Neural Language Models (arXiv 2023); Somepalli et al., Diffusion Art or Digital Forgery? (arXiv 2022); OpenAI, Models: Default Keywords & RAG (2023).
Policy Sources: 17 U.S.C. §102(b) (no protection for ideas/styles); EU DSM Directive 2019/790, arts. 3 & 4 (text and data mining exceptions); Regional Ct. of Hamburg (Sept. 2024) – LAION case (applying German TDM exception).