AI and the right to say no
AI models' respect for copyright depends on the power that the copyright owner wields
The first part of this series triggered quite a response, and some of it was adversarial (but off the record, so I’ll keep it that way). That tells me that I’ve touched a few raw nerves. I understand what I’m saying is not going to be liked by AI firms, but like I said earlier, that’s the way I roll. Always have.
A few comments I received:
There is a piece of magic thinking in the VC / Tech approach to AI: that somehow cheap production automatically results in better outcomes. The magic thinking is the marketplace of ideas. Our rules, assumptions and safeguards for the marketplace depended on humans competing with other humans, not humans competing with mass content production systems. I'm not anti-progress, far from it. I am, however, pro-market, pro-fairness and anti-monopoly. - Nikhil Bhasin, someone who works in the trust and safety space
The development of AI the world over is predicated on "Responsible AI". Is it still "Responsible AI" if LLMs can extract our data - that of citizens or content creators - so easily, or even with a token fee as the committee recommended, and still call it "Responsible"?
The Prime Minister has always been firm that we have to be self-reliant. If we lose this self-reliance and control over our data to foreign players, we are putting ourselves at a great disadvantage and, I may dare say, placing our future at peril. - Saikat Datta, co-founder of Deepstrat
One other way to think about this is that it is a large-scale wealth transfer from millions of individuals to a few companies. Copyright is a property right; imagine if tomorrow the government said that it'll just transfer all land to a few entities because they're "better" at land management than the average individual, and everyone has to pay rent to them. Shorn of all the jargon, demanding the end of copyright is simply this. - Alok Prasanna Kumar (on LinkedIn)
The immigration backlash happened in the US because H1B workers were seen as replacements for natives, who were forced to train them as their replacements. Now if this happens at scale to everyone everywhere by AI immi-agents imported by AI companies? Social Unrest! - Anand Venkatanarayanan, co-founder of Deepstrat (on X)
Chaitanya Chokkareddy, the co-founder and CTO of Ozonetel, who was instrumental in Swecha's small language models, points me towards his post likening AI to communism. Some great sharp takes there. I particularly liked his take on incentives:
Lack of Incentive → The “Why Bother?” Paradox
In Communism: If the state takes 90% of your extra harvest, you stop trying to grow extra food.
In AI: If a creator knows their unique style will be instantly “cloned” and devalued by a thousand AI bots, the financial incentive to innovate disappears. If you can’t make a living being an original artist, you might choose a “safer” job, leading to a stagnation of human culture.
Do read the full post.
Two public comments disagreeing with me, which I'll address in this post:
So if AI can learn from all the doctors in the world and work as the best doctor in the world, then what’s the issue? Every new tech has some issues and some benefits. I think AI has infinite benefits in the future - Anonymous user (on Reddit)
On the positive side, it has also led to democratisation of access to information and knowledge far beyond the internet's silos. I think one way around this, for India, could be to develop our own LLM backed by the State and make it a part of DPI. But with this approach we run the risk of a state-censored model like DeepSeek. - Siddharth Rathore (LinkedIn)
*
Let's get back to the arguments that AI companies make. Thematically, this is a follow-up to:
"AI and the quiet rewiring of the Internet", about the future of the web
"It's not substituting. It's transformative use", about AI and copyright.
Second: "If it's publicly available, I can use it; and if you don't want it to be copied, why make it available online?"
Also, “If you’ve made it available for Google Search, what’s wrong with Google using it for AI?”
First, publicly available is not public domain. Even when we post something on social media, it’s not public domain - it’s just publicly available.
Instagram's Terms of Use: "We do not claim ownership of your content, but you grant us a license to use it. Nothing is changing about your rights in your content."
Facebook's Terms of Use: "You retain ownership of the intellectual property rights (things such as copyright or trademarks) in any such content that you create and share on Facebook and the other Meta Products that you use."
This means that if OpenAI, Google or Meta scraped your content from your website, or from any platform they don't have the right to copy from, they violated your copyright.
As I see it, you can look at the flowerpot in my courtyard, but you can't take it. Copyright is the lock on the gate. Just because it's visible online doesn't mean it's free to take.
OpenAI acknowledged the distinction between publicly available and public domain in a filing with the UK House of Lords last year:
"OpenAI's large language models, including the models that power ChatGPT, are developed using three primary sources of training data: (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or our human trainers provide. Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
“Nevertheless, although we believe that legally copyright law does not forbid training, we also recognize that there is still work to be done to support and empower creators.”
It’s clear then that publicly available is not public domain. Secondly, if they believe that copyright law doesn’t forbid training, then why have they done deals with Financial Times, The Associated Press, Conde Nast, Axel Springer, Le Monde and Prisa Media, and more recently with Disney?
AI models' respect for copyright depends on the power that the copyright owner wields, and how litigious they can be. From the rest, they steal.
On Siddharth Rathore's comment above: to call the Internet a silo is odd. The Internet is diverse. Apps are silos. AI is more siloed than apps. By allowing unrestricted access to copyrighted content, you're allowing it to replace a diverse and open web with a restricted silo.
Taking content without consent and compensation is extraction, not democratisation. If you read Chaitanya’s post likening AI to communism - we will end up with disincentivisation of creation. Like I said earlier, if there isn’t enough incentive for reporters to report, and for publishers to pay reporters, who will AI steal from? Democratisation that rests on unpaid human labour and uncompensated creativity is not sustainable.
"That's a bit harsh, to call it stolen" is a response I got today, for using the words stealing and theft when it comes to AI companies.
This is a strange situation where “make information free” activists find common ground with companies trying to replace the Internet with AI.
The same logic that criminalised Aaron Swartz for downloading academic research from JSTOR for no personal gain, which eventually led to his suicide, is being ignored for trillion-dollar companies doing the same, at an industrial scale, and for profit. That is hypocrisy. He was punished. They got billions in funding. To me it just feels like Big Tech, VCs, and PE funds have made this such a high-stakes game that lawmakers feel bullied into compliance - because if it unravels, the economic fallout could be massive.
This is also a good point to remind ourselves that Meta (allegedly) knowingly pirated copyrighted books, "at least 81.7 terabytes of data across multiple shadow libraries through the site Anna's Archive, including at least 35.7 terabytes of data from Z-Library and LibGen." Here are some nuggets from court evidence. Three snippets:
Joelle Pineau responds to Eleonora Presani’s statement that “using pirated material should be beyond our ethical threshold.” Ms. Pineau then asks, “You think it’s problematic to use even for this phase?” followed by a redacted sentence. Presani then says “SciHub, ResearchGate, LibGen are basically like PirateBay or something like that, they are distributing content that is protected by copyright and they’re infringing it.”
In an internal message, Nikolay Bashlykov expresses concern about using Meta IP addresses "to load through torrents pirate content," and says, "torrenting from a corporate laptop doesn't feel right."
Jelmer Van Der Linde admits, between redacted messages, that Bashlykov said he torrented the scimag data and downloaded the rest of LibGen.
Third, indexing for search, which sends traffic to websites, is different from indexing for AI Mode, AI summaries, or Perplexity results.
Some platforms do respect robots.txt, a file you upload to your server that tells web crawlers, among other things, whether or not to crawl your website. When MediaNama switched servers, we were inundated with hundreds of crawlers, probably because they detected "fresh content". Our traffic went through the roof, and our servers kept crashing. They didn't respect our robots.txt directives. It's an honour code, not a legally binding document.
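For illustration, here's roughly what such a file can look like - a minimal sketch, not MediaNama's actual robots.txt, using crawler tokens that OpenAI, Google and Common Crawl have themselves documented. It disallows AI training crawlers while allowing standard search indexing; note, again, that any bot can simply ignore it:

# Block OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Block Google's AI training token (separate from Googlebot's search indexing)
User-agent: Google-Extended
Disallow: /

# Block Common Crawl, whose dumps feed many AI training datasets
User-agent: CCBot
Disallow: /

# Allow standard search engine indexing
User-agent: Googlebot
Allow: /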
So AI companies violated that honour code and crawled the web. To use a Hindi phrase, aap chronology samajhiye - understand the chronology:
OpenAI launched ChatGPT in November 2022. Google launched Bard in March 2023.
I raised the issue of AI crawlers not adhering to robots.txt exclusions in July 2023.
I'm not claiming that they introduced robots.txt opt-outs because of me; the point is that they had already scraped content from copyrighted websites well before launching.
It’s a classic Silicon Valley move: “ask for forgiveness, not permission”.
This applies to the Meta situation above as well.
Btw, there are still bots that do not respect robots.txt directives. What robots ignore, the law must enforce.
Before you read further, do consider supporting my work by making a payment here (if you're in India) and here (if you're not in India).
Content owners need to build legal grounds for restrictions: In response to a Parliamentary question about scraping for AI training, India’s Minister of State for IT, Jitin Prasada, said that “Section 43 of IT Act provides penalty for unauthorised access to computer system and provides compensation for damages to affected parties.”
After this, I promptly added a whole series of AI-specific items to MediaNama's Terms of Use, including restrictions on AI model training and data mining, on unauthorised use by AI systems for commercial or non-commercial purposes without explicit consent, and on circumvention, including of robots.txt directives. I created an exception for scraping for search engine results, adding that
“any use of the Website’s content beyond indexing for standard search engine results (e.g., repurposing, republishing, training AI models, summarization) is strictly prohibited without explicit, written consent”.
I'm not sure how this would play out if the Copyright Act gets amended to enable compulsory licensing (I'll write about that later): could a publisher still claim unauthorised access in court? For the lawyers reading this, do let me know.
Most importantly, it’s about consent
In January 2025, I went for an AI-based health checkup. The consent form required me to allow my data to be used for AI training.
I’ve said this before: With AI, even if you’re paying, you’re still the product.
I declined to tick this box, and they refused to do the test. Then a friend told them not to mess with me; they made an exception, allowing me to opt out of this clause, and I asked for my data to be deleted after the report was generated.
When consent is forced, it’s not consent.
To the anonymous user above, just because something can be done doesn’t mean it should be done without rules, permission, or accountability.
AI isn’t exempt from trade-offs just because it’s new. If we don’t ask whose work it’s built on, who gave permission, and who benefits, we’re not innovating - we’re extracting.
At a MediaNama discussion in Bangalore many years ago, Rahul Matthan, a Partner at Trilegal, made a strong case for doing away with consent because we make our personal data available for free online anyway. We need to regulate for harms, not regulate access to data, he said.
Kiran Jonnalagadda, the co-founder of Hasgeek, standing right at the back, ended that debate by saying that "consent is your right to say no."
Copyright is my right to say that you can't take my work without asking me. That barrier allows me to monetise it, or to choose whom to give it to for free.
Copyright is the enforcement of consent.
*
P.S.: I know these two posts on copyright have been a bit dense. It's intense for me to write this. I'll mix it up in the coming week. I'm still trying to find my writing cadence.
Also, please feel free to send me your comments (especially points of disagreement).
To the new readers, please do consider subscribing.
Two more comments that came in:
Enough studies have been conducted to show that scientific publishing is a scam, with authors paying the publisher to publish and readers paying the publisher to read the article. These prices are extortionate. Since science is a public good, at the very least LLMs should be able to crawl through scientific papers without any further payment.
The bigger challenge is that a very small minority is debating or thinking of the consequences, while all the major stakeholders - AI companies, investors, and policymakers - are on the bandwagon without any guardrails. The worst lot is the policymakers, who are hoping that this will pan out on its own; by the time they wake up, it may be too late.