Publishers Get New Tools to Block AI Bots, Leveling Playing Field with Search Engines

AI crawlers have caused significant disruption for website owners, many of whom have installed Robots Exclusion Protocol files and robots meta tags yet still failed to stop the scraping.

Now, new standards are being drafted that would give publishers a way to stop AI crawlers from using their public content for training purposes.

The news follows a draft proposal led by Microsoft. The stated goal is to stop AI training crawlers from getting unrestricted access to online content. For publishers whose material is continuously scraped without consent or any form of compensation, it’s a long-awaited development: despite their objections, their content keeps getting used for AI training.

The IETF, or Internet Engineering Task Force, was formed in 1986 as the body that develops internet standards, and it’s about time such rules were put in place. It is behind the Robots Exclusion Protocol, which dates back to 1994, among many other standards. Let’s take a look at what the draft proposes.

The current draft proposal outlines three ways to block these stubborn AI training bots. First, there are new robots.txt rules for blocking AI robots. These add directives aimed specifically at AI training crawlers and give publishers control over what can and cannot be crawled from their pages.

The change is that these rules govern how data can be used for training generative AI foundation models. Developers must honor the tags, though compliance with them does not amount to authorization to access the content. The key point of the draft is that all AI training crawlers are expected to follow the protocol, which simplifies bot blocking considerably.

The rule also covers explicitly allowing or disallowing AI training for language models, as sketched below.
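To illustrate, here is a minimal robots.txt sketch. The DisallowAITraining and AllowAITraining rule names come from the draft, but the bot name (examplebot) and the paths are hypothetical placeholders, and the exact rule syntax could still change before the draft is finalized:

    User-Agent: examplebot
    DisallowAITraining: /private/
    AllowAITraining: /blog/

Under this sketch, a compliant crawler could still fetch and index both sections for search, but would be told not to use pages under /private/ when training a model.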

The second rule is the HTML element, or robots meta tag. The proposed meta tag directives are:
  • <meta name="robots" content="DisallowAITraining">
  • <meta name="examplebot" content="AllowAITraining">
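The first tag tells every compliant crawler not to use the page for AI training, while the second shows how a specific bot, identified here by the placeholder name examplebot, could be explicitly permitted to do so.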
Lastly, the third rule introduces an application layer response header, sent by the server in reply to a browser’s request for a page. The proposal adds new AI training rules to the response headers already used for robots directives.
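As a rough sketch, such a response might look like this; the DisallowAITraining header name mirrors the draft’s robots.txt rule, but the value shown is an assumption, since the draft’s exact header syntax isn’t quoted here:

    HTTP/1.1 200 OK
    Content-Type: text/html
    DisallowAITraining: /

The advantage of a header is that a server can state its AI training policy on each response, even for files where editing robots.txt or the HTML isn’t practical.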

When AI training is disallowed, the instructions tell crawlers not to use the data for training language models; when it is allowed, they spell out what may be used for that purpose. For a long time now, AI tech giants have faced legal action, so far unsuccessful, over scraping publicly available data online. They keep justifying the practice by claiming they have the right to crawl websites open to the public.

Furthermore, they liken the behavior to what search engines have done for years. These protocols can now clearly define limits and boundaries, giving publishers full control over crawlers that want to obtain training data. With the proposals in place, search and AI crawlers would finally play by the same rules.

Image: DIW-Aigen

Read next: YouTube Adopts ‘Swipe Up’ Gesture in Android Trial, Mimicking TikTok’s Navigation Style