AI and the splitting of the open web
What AI cannot extract from the web
I sat down with my team to find a line of defense against AI, almost three years after I wrote a thesis statement about the impact of AI on news media. AI removes the cognitive load of having to read and parse information yourself: people want answers, not links. They’re not clicking.
When ChatGPT launched, it wasn’t an accurate source of facts because its training data was outdated. GPT-3.5, in November 2022, had a knowledge cutoff of September 2021, and the world moved faster than its training data could be updated. That advantage for publishers, which we didn’t even recognise as an advantage then, went out the window the moment RAG began to be deployed.
“Retrieval-augmented generation (RAG) is a technique that enables large language models (LLMs) to retrieve and incorporate new information from external data sources”…
“Unlike LLMs that rely on static training data, RAG pulls relevant text from databases, uploaded documents, or web sources”
RAG reduces hallucinations, a fundamental problem for AI models, and addresses a core user need for up-to-date information. It’s also cheaper than regularly re-training on fresh data. For publishers, it makes AI more extractive, and makes AI models an aggregated source of truth for readers.
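To make the mechanics concrete, here is a minimal sketch of a RAG flow in Python. The Document class, the toy retriever and the prompt template are hypothetical stand-ins; a real deployment would retrieve from a web index or vector database before calling the model.

```python
# Minimal RAG sketch (illustrative only): retrieve fresh documents,
# then put them into the prompt so the model answers from them
# rather than from its stale training data.

from dataclasses import dataclass

@dataclass
class Document:
    url: str
    text: str

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Toy retriever: rank documents by word overlap with the query.
    A real system would use a search index or vector similarity."""
    q_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda d: len(q_words & set(d.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str, docs: list[Document]) -> str:
    """Augment the user's question with the retrieved context."""
    context = "\n\n".join(f"[{d.url}]\n{d.text}" for d in docs)
    return (
        "Answer the question using only the context below, citing sources.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The augmented prompt is then sent to the LLM. Nothing had to be
# re-trained for the model to answer from today's articles.
```

The economics follow from that last comment: the publisher’s article is fetched and summarised at answer time, and the value of having published it accrues to the model, not to the site.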
Therefore, there’s a need to recalibrate our approach, just as everyone needs to: do the things that AI cannot.
The archive belongs to AI, the present doesn’t
Over the past two decades, publishers have focused on building archives (including how-tos and explainers) to attract search traffic, which ranged from 40-80% of all traffic. In 2010, Yahoo bought Associated Content, a content farm, merely for its search traffic. Query Deserves Freshness (QDF), a Google algorithm launched in 2011, actually benefited news publishers because they had updated information, but hurt content farms by prioritising fresh information over archives. That didn’t mean there was no value in unique, archival content.
Now even that value has, to use a mining phrase, been depleted. Continuous extraction by AI means that archives are dead, and there’s nothing left on a website that AI cannot steal or hasn’t stolen already.
Anything that can be indexed is already gone.
The gap in the market lies in the inversion of this: AI cannot tokenise and commodify what doesn’t exist yet. We thus need to switch focus to the new, the fresh and the unique.
There are hard limits that make this gap permanent:
On-ground reporting: An AI agent can persistently check the US President’s website for announcements, but it cannot replicate what is happening on the ground: AI cannot simulate or synthesise the observations of reporters in the field, or cover all the ground.
Curation: If multiple entities are reporting on something, then you have to have something that others don’t. Culling out information and spotlighting and prioritising it in a world of abundance is what journalists do well.
Digging: There are corners of the web that AI doesn’t know exist, or doesn’t care about, from which journalists dig out information.
Opinion: AI cannot authentically generate interpretive editorial opinion from a trusted identity.
Community: the live content that brings a community together, as in the case of sports or even elections, or a moment that brings people into the same space for the same shared experience.
Much of this is more expensive to produce than just rewriting and contextualising press releases.
Real-time is the last remaining monopoly
The value of freshness is historical: Paul Julius Reuter, the man behind the Reuters news agency, cemented his reputation by delivering news using carrier pigeons and eventually the telegraph, faster than others, with the knowledge that speed has implications for stock markets. To quote:
..when in the following year he produced in London an hour after its delivery a report of the Emperor Napoleon’s threatening speech to the Austrian Ambassador which led to the Italian War, his reputation was at once established as by a coup de théâtre.
That logic has never gone away, but it has been somewhat dwarfed by the advertising-funded publishing era, which made information free and made traffic the primary measure of value.
While in Reuter’s time that delay might have been measured in days and weeks, it is now measured in microseconds. This is why two shifts have taken place:
The advent of proprietary access to information, whether market quotes, news, or analysis (essentially Bloomberg), and
The advent of quick access to information, especially with robo-trading, where the cost of a delay can run into millions.
For Reuters and Bloomberg, being correct is essential, but being fast and correct is the real product. People pay for exclusive access to accurate information that others might not have, especially when it impacts significant buying decisions.
Live sports offers another proof of the real-time premium: on-demand viewing allowed for binge-watching, but it largely killed appointment viewing, and reduced an advertiser’s ability to own a “moment”.
Today appointment-based viewing survives largely in live sports, where being there at that point in time, witnessing something live, brings people together, and gives advertisers maximum diffusion for their dollars: we see this with the Super Bowl, the Football World Cup final, and cricket’s Indian Premier League and ICC trophy finals. There is clear value in real-time, simultaneous content that brings a community together in a shared live moment.
AI needs new information in order to remain relevant and current. Every platform is built to keep users on it, and switching costs in AI are low: if someone does something better, it starts becoming the default. The need for new information is not going to die. You just need to make it harder for AI to access it.
It’s clear that the solution lies in gating access to bots. Not in SEO or GEO.
Harder than it sounds
Gating access has largely been a technology play, but it has also been an arms race.
First, it’s tricky to enable access for humans while restricting it for bots. For businesses that enable this, like Cloudflare and Tollbit, the key challenge is distinguishing human traffic (which you want) from bot traffic (which you don’t). Google made things trickier by using the same bot for search indexing (which you want) and scraping (which you don’t).
Bots frequently change form, often masquerading as users, in order to bypass restrictions built to block only bots. Amazon sued Perplexity because it “chose to disguise an automated ‘agentic’ browser as a human user, to evade Amazon’s technological barriers” and to access private customer accounts without Amazon’s permission. Cloudflare and Perplexity are also in the middle of a similar face-off. It’s going to be an arms race to prevent scraping of real-time information.
Another issue is that we still don’t know how to distinguish a user-initiated agent, fetching a page on behalf of a single reader, from a bulk scraper, or whether we want to allow the former at all.
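As a rough sketch of what that first line of defence looks like in practice, here is a hypothetical user-agent gate in Python. The crawler tokens (GPTBot, ClaudeBot, CCBot, PerplexityBot) are publicly documented ones, but the gate itself is illustrative: a declared User-Agent check only stops well-behaved bots, which is why services like Cloudflare layer network-level and behavioural verification on top.

```python
# Illustrative first-pass bot gate (a sketch, not a production defence).
# It checks the declared User-Agent against known AI crawler tokens,
# which stops well-behaved crawlers but not agents masquerading as browsers.

AI_CRAWLER_TOKENS = (
    "GPTBot",         # OpenAI's crawler
    "ClaudeBot",      # Anthropic's crawler
    "CCBot",          # Common Crawl
    "PerplexityBot",  # Perplexity's declared crawler
)

def is_declared_ai_crawler(user_agent: str) -> bool:
    """True if the request self-identifies as a known AI crawler."""
    ua = (user_agent or "").lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

def gate(headers: dict) -> int:
    """Return an HTTP status: 403 for declared AI bots, 200 otherwise.
    A real gate would also verify published IP ranges, rate-limit, and
    fingerprint behaviour, because the User-Agent header is trivially spoofed."""
    if is_declared_ai_crawler(headers.get("User-Agent", "")):
        return 403
    return 200

if __name__ == "__main__":
    print(gate({"User-Agent": "Mozilla/5.0; compatible; GPTBot/1.0; +https://openai.com/gptbot"}))  # 403
    print(gate({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/130.0"}))           # 200
```

The same token list is what a robots.txt disallow rule would name, and the same weakness applies there: both depend entirely on the bot telling the truth about who it is.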
Second, fresh information is a time-limited monopoly: it lasts only as long as the information isn’t copied by someone else or accessed by a scraping bot, and once it is trained into an LLM the loss is permanent. You’re selling a single extraction, not a subscription. Importantly, the value lies in the moment, not just the information. All this has to be factored into pricing, and the pricing that emerges will vary depending on what value you bring to the table and how your costs change.
Third, there’s a clear market constraint, in terms of how many people the content is useful to and what they are willing to pay. While stock markets offer a historical precedent for gating access to data and opinion, the audience for political news, live regulatory updates, a breaking court judgment, or funding news is probably much smaller, because readers have a longer window in which to act on that information. We don’t know what the market equilibrium will look like, in terms of pricing.
***
Gating access means that the web is going to split into two parts: that which is freely available to bots, and that which isn’t. When access is gated, the openness of the web suffers: it is the opposite of interoperability, which I wrote about here.
What happens when most user access is for summarisation, via their own agents? I couldn’t help but notice that several websites blocked jina.ai when I tried using it with my picoclaw agent.
There’s a clear tension between AI automation and the human-centered web, between creation and extraction, because value is draining away from the creator to the extractor. The only solace is that there will always be value in the period prior to extraction.
The contest for the present will never end, because it is being generated every moment.



