Theft and Data Mining (TDM) in AI
Requiring proof of harm after irreversible ingestion is a mechanism for preventing enforcement.
This one is for copyright owners and creators, and is the third in a four/five part series on AI and Copyright. It’s a long one. I’ll start with some jokes:
First joke: ChatGPT’s app developer guidelines actually state that developers must: “Only use intellectual property that you own or have permission to use.”
Second joke: Google is suing SerpApi for “unlawful scraping”, for “circumventing security measures protecting others’ copyrighted content that appears in Google search results”.
What, their hypocrisy isn’t funny? I don’t think so either.
Scraping copyrighted content is a crime, except when AI does it?
Let’s look at the lawsuit a little more closely. Google says that while it “follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content”, “Stealthy scrapers like SerpApi override those directives and give sites no choice at all.”
Clearly, Big Tech AI already knows how to define unlawful scraping. It simply chooses not to apply that definition to itself. Incumbents often push for regulation because they’ve already benefited from violating the rules that they want enforced on others. It reduces competition.
Before deciding to follow robots.txt directives, they had, like they accuse SerpApi of doing, “use[d] automated means to scrape these other services.” It’s ironic that Google says that “SerpApi likewise does not get permission from or compensate the copyright holders whose content it grabs and redistributes.”
Google and OpenAI had already scraped the open web before adopting those “industry standard crawling protocols”: OpenAI on August 7, 2023, and Google in July 2023.
Google and OpenAI aren’t compensating websites for scraping for AI either, except in some cases. Reddit and Twitter both had to restrict their APIs, following which Reddit signed a $60 million deal with Google, allowing it to scrape Reddit’s content. What about the rest of the web?
From the lawsuit, it appears that according to Big Tech AI, scraping is theft when it harms them, and innovation when they benefit. Looks like hypocrisy is becoming policy.
Reasoned is an ongoing attempt to make sense of how AI is rewiring the Internet.
Each piece focuses on a specific issue, but the larger goal is to understand how these changes accumulate, and what they mean for people whose work and lives depend on the Internet.
What a TDM Exception Actually Permits
The phrase I’ve heard often at closed-door roundtables, and in conversations with Big Tech AI execs, is that you don’t want to restrict innovation. Their lobbyists are arguing for a “Text and Data Mining exemption”, saying that AI models need copyrighted data for training, and that this is research and hence should be allowed.
India’s DPIIT Committee on AI and Copyright explains how Text and Data Mining works:
It is supposed to be a “research technique to collect information from large amounts of digital data through automated software tools”
It involves identifying the input materials to be analysed, and copying substantial quantities of those materials
The copying includes “pre-processing materials by turning them into a machine-readable format” so that “structured data can be extracted”, possibly “uploading the pre-processed materials on a platform”, “extracting the data”, and recombining it to identify patterns in the final output. A minimal sketch of this pipeline follows below.
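Here, purely as an illustration, is what that pipeline reduces to in code: a minimal Python sketch, where the example.com URLs are hypothetical inputs and word-pair counting stands in for whatever “patterns” a miner extracts. It is not any company’s actual system.

```python
import re
from collections import Counter
from urllib.request import urlopen

# Hypothetical input materials; any list of URLs would do.
SOURCES = [
    "https://example.com/article-1",
    "https://example.com/article-2",
]

def preprocess(raw_html: str) -> str:
    """Turn copied material into a machine-readable format:
    strip markup, normalise whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", text).strip().lower()

corpus = []
for url in SOURCES:
    # Steps 1-2: identify the input materials and copy them in substantial quantities.
    raw = urlopen(url).read().decode("utf-8", errors="ignore")
    # Step 3: pre-process into machine-readable form.
    corpus.append(preprocess(raw))

# Steps 4-5: extract structured data and recombine it to identify patterns.
# Word-pair frequencies stand in here for whatever "patterns" the miner wants.
tokens = [token for doc in corpus for token in doc.split()]
patterns = Counter(zip(tokens, tokens[1:]))
print(patterns.most_common(10))
```

Note what even this toy makes plain: every step after the first depends on a full copy of the source material sitting in memory or on disk.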
Why “Research” is a smokescreen
To call AI training “research” is a smokescreen, like the phrases “Large Language Models” and “learning”.
The TDM exception applies when mining is a means to fresh knowledge, and the outputs are papers and insights that do not replace the underlying works. Foundational models involve indiscriminate collection and ingestion of copyrighted material (and public domain information), permanent retention of this material for future training cycles, monetised access to outputs via APIs, subscriptions and enterprise deals, and, as I’ve explained previously, outputs that clearly substitute for original creators and copyright owners.
Copyright as a law doesn’t just regulate how copying is done; it regulates what copying does. Its core function is to prevent uncompensated market substitution, not merely unauthorised public distribution.
Framing it as research is a deliberately misleading tactic to reduce scrutiny and seek exemptions from copyright enforcement.
It is also argued that:
AI training requires accessing and, in some form, copying and storing data, but without showing, distributing or transmitting it to people.
These are purely machine-only steps to help models learn patterns.
The training use is non-expressive and computational, distinct from communication to the public, and the copies are merely incidental.
As I’ve mentioned earlier, the machines aren’t just learning patterns of language, but patterns of knowledge, and it is that knowledge they’re replacing. A copying process that results in a permanent, monetised system capable of substituting the original works cannot be treated as a neutral or incidental intermediate step, regardless of whether a human reads the copied material.
Copyright applies even when copying is non-expressive or internal: what matters is whether the copying enables uncompensated market substitution.
Copying at the moment of ingestion may be non-consumptive, but it is permanently and irreversibly substitutive and monetisable, which makes it economically destructive and extractive of copyrighted work.
Research exceptions are attached to outputs and intent. Their extractive business models, and outputs, are not research:
Paid APIs are not research.
Subscription models are not research.
Enterprise contracts are not research.
Trillions of dollars in funding and valuations are not for research.
We have to take into account the commercial intent of Big Tech AI before we consider stripping copyright owners of their rights and giving unrestricted access to what is a commercial input for a commercial service.
The research framing is strategic misclassification, intended to apply academic exceptions to the building of commercial infrastructure.
Text and Data Mining is Theft and Data Mining here: the misclassification is designed to convert commercial extraction into protected research.
Reasoned now has an index page, where all my writing on AI is aggregated in a structured manner. Bear with me. I’m winging it.
They’re Large Knowledge Models, not Large Language Models
As we know, AI Large Language Models (LLMs) are next-word prediction machines, and predicting the next word involves pattern recognition.
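To make “pattern recognition” concrete, here is a toy next-word predictor: a bigram model over a made-up corpus. This is a deliberately minimal sketch; real LLMs learn far richer statistical patterns with neural networks, but the objective, predicting a continuation from previously observed patterns, is the same in spirit.

```python
from collections import Counter, defaultdict

# A made-up training corpus; real models train on trillions of words.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

# Record which word follows which: the "patterns" being learned.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation most frequently observed in training."""
    candidates = follows.get(word)
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))   # predicts from observed word-pair frequencies
print(predict_next("cat"))   # "sat" and "slept" tie; the first seen wins
```

The point of the toy: nothing in the mechanism cares whether it is predicting grammar or facts. Feed it knowledge, and it reproduces patterns of knowledge.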
I think the nomenclature “LLM” is misleading and directs attention away from the real problem: they’re actually being built as “Large Knowledge Models”.
They’re expected to contain accurate information, not merely to predict language. In that way, they are replicating proprietary information and knowledge that people otherwise own and charge for access to. The distinction between LLMs and LKMs matters, because it determines whether training is just about language, or is actually about market substitution.
Piracy as an input strategy
A few days ago, bad actors uploaded Spotify’s entire library of music - all 300 TB of it - to Anna’s Archive, a piracy website.
Chaitanya Chokkareddy, the co-founder and CTO of Ozonetel, sent me a comment about this:
“In this age of AI, with respect to copyright, looks like AI has more freedoms than humans. Sites like Anna’s Archive are blocked for humans and if you download and read a book, its called piracy. But if the same book is fed to an AI for training, apparently its fine and dandy. So Artificial Intelligence has more freedoms than actual Intelligence.”
A Text and Data Mining exemption will actually encourage theft of content so that AI companies can use it. Once content is leaked, it becomes a free-for-all: AI companies would have immunity because, hey, they didn’t steal and publish Spotify’s library: they only copied and “learned” from the stolen content.
This makes piracy an input strategy for AI companies.
Go back to the lawsuit against Meta: it allegedly copied around 81.7 TB through Anna’s Archive, and at least 35.7 TB from Z-Library and LibGen. The lawsuit has redacted transcripts of alleged messages between executives that clearly illustrate concerns about using pirated material, and about torrenting from a corporate laptop using Meta IP addresses. If Text and Data Mining of pirated material were legally kosher, surely they wouldn’t have been concerned.
The demand for a Text and Data Mining exception is a legal strategy to legitimise past and future usage of pirated material to build AI models.
Kiran Jonnalagadda, co-founder of HasGeek, messaged to say that he got a lot of hate some 25 years ago when he “wrote an article referring to open source and piracy as two sides of the same coin. If you want some functionality and can’t justify the cost, either pirate it, or rebuild as open source.”
AI cannot “rebuild” the datasets, so it must pirate. There’s a clear incentive to pirate: clean datasets are expensive and subject to negotiation. Dirty datasets are not just free, but also risk-free.
A Text and Data Mining exception for AI won’t reduce piracy. It will industrialise it. It will grant AI systems the right to steal at scale.
The Zero-Cost Creation Fallacy
The fifth story in my opening piece on AI and Copyright referenced a computer scientist telling me that I “have no right to copyright when the cost of content creation tends towards zero. Copyright needs to end.”
This argument is fundamentally flawed. What has declined is the marginal cost of generating outputs, which is tending towards zero.
If the true cost of creation were genuinely approaching zero, AI companies would not need to scrape, copy, and retain other people’s work at all. They would train models entirely on synthetic or self-generated data. This is not the case.
AI models also do not generate text from nothing: they train on copyrighted material. The fact that OpenAI said in its UK Parliament filing that “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens” means that copyright hasn’t outlived its purpose.
The cost of creation of valuable data isn’t zero, and owners of this data should thus be compensated for the value that they’ve created.
The cost of creating those copyrighted inputs is not zero, and that’s what gives me, and every copyright owner, the right to payment for stolen work.
The only way the marginal cost of creation becomes zero is if the cost of the most expensive raw material for building these trillion-dollar businesses is borne by copyright owners.
AI companies are merely externalising the cost of building their models.
Collateral Damage: The false binary of Innovation versus Copyright
An anonymous reader messaged to say: “So what is the alternative? A lot of artists and their allies think they have an answer: they say we should extend copyright to cover the activities associated with training a model. And I’m here to tell you they are wrong: wrong because this would inflict terrible collateral damage on socially beneficial activities, and it would represent a massive expansion of copyright over activities that are currently permitted – for good reason!”
First, this is a flawed argument. No one is expecting a radical expansion of copyright: only the enforcement of copyright where it actually exists, and preventing the expansion of the Text and Data Mining exception to large-scale copying for commercial training.
Second, the collateral damage claim made in defence of a TDM exception is simple: if AI companies are required to pay for the content they use, only large companies will be able to afford it, innovation will slow, competition will suffer, and only a handful of players will survive.
This is flawed. A TDM exemption entrenches incumbents because it rewards scale over efficiency.
The largest firms benefit disproportionately from unrestricted access to copyrighted content across the entire internet, while smaller players are locked out by compute costs, infrastructure requirements, and distribution power.
The result is not more competition, but fewer, more dominant models whose advantages compound over time. It has already become an arms race. A TDM exemption actually raises the barrier to entry, and concentrates power with the powerful. The “collateral damage” argument merely reveals whose costs are being prioritised here: those of Big Tech AI firms.
Requiring paid access does the opposite: it rewards precision, curation, and smaller domain-specific models. When inputs are not free, firms are forced to make choices: which data actually matters, which domains justify investment, which models can be smaller, narrower, or more specialised.
Enforcing copyright would not ban innovation. As I ranted at one session on AI and Copyright 4-5 months ago: innovation is not the exclusive preserve of technology and AI companies.
*
There’s also a false binary in saying that you can’t make LLMs without violating copyright. Viswam.ai’s Akshara Telugu LLM project* has been created using public domain data and copyrighted material used with permission, though not with public domain data alone. Chaitanya Chokkareddy, who is also with the Viswam.ai project, confirms:
The Telugu stories are in public domain as that is synthetic data. The news articles have a copyright, but we have taken permission. We are also in talks with them for releasing those under a copyleft license and most of them have agreed in principle to do so next year. As you know, we have been working with SFLC to come up with a new license and all the new data used for training will be released under the new license. But we wanted to work on the model before the license was released.
The models, data, everything will be open with no copyright. That’s our ultimate goal at Viswam. And we will make it happen, at least to make your statement [Ed: about using only public domain data] true :)
For now, to achieve the goals faster we are working with content owners and taking their permissions. Nothing is being done without the content owners’ permission. Consent is important.
I’m told that an argument was made in Delhi High Court that if models don’t have access to copyrighted information, they will produce misinformation.
Two things here: First, as I mentioned earlier, these are not Large Language Models: they’re Large Knowledge Models, and that knowledge is derived from copyrighted sources. If it were only about language, they wouldn’t need this much data, or need to be factually accurate.
Second, this is a red herring. Fact-checking, verification, and trustworthy knowledge production are not automatic by-products of scraping at scale: they are labour-intensive activities sustained by incentives, and without adequate incentives, journalism will collapse. You won’t reduce misinformation; you will accelerate it.
When the sources you steal from run out of money, all you’ll be left with is conspiracy theories and “Community Notes”. To respond to another point that the anonymous reader made above: this is the actual socially beneficial activity that gets impacted by a TDM exception. That’s the real collateral damage.
Why is it assumed that the only way to preserve socially beneficial activities is to exempt trillion-dollar companies from paying for their inputs? I know Big Tech AI, their lobbyists in NASSCOM and BSA, and the law firms and consultants on their payroll will try and work their way around this, but the fact is that copyright is meant to protect against market substitution.
Writing in, Srijan Rai agrees: “Even if the approach suggested by the expert committee is not very convincing, what has stood out to me more were some of the arguments suggesting that everyone should simply give up their data to ensure the success and profitability of AI industry. Claims that copyright is rent-seeking, or that India must give up its data because others will anyway, were really puzzling, especially given that these models are likely to replace or significantly reshape digital economy.”
Also, to respond to a point made by the anonymous reader: rejecting a TDM exception won’t stop the training of models: it will only stop the copying of copyrighted content without permission.
No proven harm? Really?
What NASSCOM claims: NASSCOM, in its submission to the committee, states that there is no proven harm (from AI companies stealing content) so far, and that there is no clear evidence that generative AI has materially undercut creators’ revenues at a market-wide level.
Why regulators will like this: It’s a smart tactic: regulators need to see harm and market failure before acting, and, as India keeps saying, it doesn’t want to stand in the way of “innovation”; it wants to intervene only when there’s a problem.
Why this is a problem: This demand for proof of harm is a means of preventing enforcement.
I’ll explain why: the harm from enabling copyright violation and a TDM exemption for training is irreversible. The copying and the ingestion of knowledge are irreversible.
Sabari Raju, Head of AI Research and CTO, Clairva.ai wrote in to say that “It’s like asking wheat to be traced to a farmer while making biscuits in a factory.”
Unlike most industries, AI leaves no audit trail, no untraining mechanism, and no meaningful way to isolate the copyrighted content, and hence there is no remedy available to a copyright owner. Once ingestion occurs, there’s no mechanism to reverse it: even if a copyright owner later opts out, withdrawal of consent cannot be implemented in practice. A TDM exception destroys consent management in copyright.
The fact is that the harm from the copyright violation that has already been done will be delayed, but it’s not unpredictable: websites are reporting reduced traffic, and AI-generated music is impacting musicians on Spotify. I’ve given enough examples in my first piece to indicate substitution and signs of harm. A short-term revenue assessment showing a lack of impact in Japan is insufficient proof that there won’t be long-term damage, especially given the size of India’s content industry.
When there is no possibility of undoing the damage, we need to be extra cautious, especially if the signs are already there.
Regulators should not wait for market-wide harm to be conclusively proven. Like with pollutants in the Ganga, they need to act upstream, because no remedies are going to be effective downstream.
I repeat: Requiring proof of harm after irreversible ingestion is a mechanism for preventing enforcement.
Finally, what a TDM Exception actually does
It’s important to realise that the debate over Text and Data Mining - or Theft and Data Mining, as I call it (text is data anyway) - is not about copyright, but about power, scale and irreversibility. It is politics.
NASSCOM in its submission (and I’ll go into more detail at MediaNama about this, with a counter) pushes for a machine-readable opt-out at the moment of availability, saying that given the automated and large-scale nature of collection, seeking case-by-case consent is not practical at scale, and that when content is publicly available online and no opt-out has been expressed, training should be allowed. It’s essentially saying that opt-out is a balanced approach.
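Mechanically, such an opt-out is a robots.txt-style directive. Here is a minimal sketch of the check a crawler performs, using Python’s standard library; the crawler name “ExampleAIBot” and the example.com URLs are hypothetical stand-ins:

```python
from urllib.robotparser import RobotFileParser

# The site owner must publish and maintain the directive; the crawler only reads it.
parser = RobotFileParser("https://example.com/robots.txt")  # hypothetical site
parser.read()

page = "https://example.com/an-article"
if parser.can_fetch("ExampleAIBot", page):
    # No opt-out expressed for this bot: silence is treated as consent.
    print("Scrape and ingest:", page)
else:
    print("Opt-out found, skipping:", page)
```

The default path is ingestion: every rightsholder has to discover every crawler’s name and publish a directive against it, while a directive published after ingestion changes nothing.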
However, not all content is taken by them without permission. Quite a bit of it is being licensed: deals with the largest copyright owners, whether the FT, Axel Springer, Disney and the like. If they’re already doing voluntary licensing deals, why restrict those deals to those with power? Big Tech AI is the most powerful force in AI today, and laws need to protect those with less power, not enable the powerful.
A TDM exception, especially with an opt-out rather than an opt-in provision, irrevocably shifts the balance of power in favour of Big Tech AI firms with no remedies for copyright owners for copying without their consent.
Essentially, they can copy everything, use it forever, and will never be liable. It’s a transfer of cost and effort to rightsholders, saying that if you’ve left your door unlocked and we’ve come and stolen your valuables, you’re at fault. One should lock the door, but that’s not where it should end: switching from opt-in to opt-out shifts the responsibility to the victim and absolves the thief of liability. It’s a deft liability shift.
Silence is not consent. Copyright is an opt-in property right by design. Reversing that default is not a technical fix but a legal inversion of an established principle.
Opt-out scales for AI companies precisely because it does not scale for rightsholders.
A TDM exception is a compulsory copying right that enables large-scale extractive behaviour, and the transfer of proprietary knowledge without compensation.
Training is irreversible, and training on copyrighted data wreaks irreversible damage on an entire industry that depends on copyright, because its work gets substituted. Together, a TDM exception and an opt-out default authorise the permanent extraction of value from copyright owners without remedy.
Big Tech AI firms are trillion-dollar companies, much larger than the median copyright owner, and the responsibility should be theirs to bear. This is not balance. Restoring balance would mean ensuring that property-right restrictions apply to AI training, with maximum restriction exactly at the point of ingestion of data.
A Text and Data Mining exception legalises irreversible extraction and calls it progress. This is Theft and Data Mining.
To enable this would be regulatory abdication.
*- Corrigendum: An earlier version of this article said that Akshara LLM uses only public domain data. The article has been edited to clarify that they used both public domain and copyrighted content (with permission).
Reasoned is where I track what AI changes for people who build and create online. Subscribe here.