Perplexity, the company behind one of the world’s first free AI-powered search engines, has been accused of stealing content from news websites.
The company has come under fire from several leading media and publishing outlets, which accuse it of lifting stories by scraping data online and republishing it across different platforms.
As Wired reported, Perplexity has continued to ignore the Robots Exclusion Protocol, giving it a free hand to harvest data online for the benefit of its own systems.
Forbes and The Shortcut were among the websites on the growing list of accusers. Reuters joined shortly afterwards, reporting that its content was being used without permission to help train the company’s technology.
Reuters went further, revealing that it had obtained a letter sent to publishers by a startup called TollBit, which warned that AI agents were bypassing the protocol to scrape content from their websites.
The robots.txt file contains instructions that tell web crawlers which pages they can and cannot access. The protocol has been around for a long time, but compliance is voluntary, and as this case shows, not everyone honors it.
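For readers unfamiliar with the mechanism, here is a minimal, hypothetical sketch of how it works: a publisher lists disallowed paths in its robots.txt file, and a well-behaved crawler checks those rules before fetching a page. The site address, bot name, and rules below are illustrative assumptions, not details taken from any of the publishers involved.

    # A hypothetical robots.txt published at https://example-news-site.com/robots.txt
    # might contain:
    #
    #   User-agent: *
    #   Disallow: /articles/
    #
    # A compliant crawler checks those rules before fetching, for example with
    # Python's standard urllib.robotparser module:

    from urllib.robotparser import RobotFileParser

    parser = RobotFileParser()
    parser.set_url("https://example-news-site.com/robots.txt")  # hypothetical site
    parser.read()  # downloads and parses the robots.txt rules

    # can_fetch() returns False for paths the publisher has disallowed.
    # Honoring that answer is voluntary, which is the heart of this dispute.
    url = "https://example-news-site.com/articles/some-story"
    if parser.can_fetch("ExampleBot", url):
        print("Allowed to crawl this page")
    else:
        print("Disallowed by robots.txt - a compliant crawler stops here")

Nothing in the protocol technically prevents a crawler from skipping that check, which is why the accusations hinge on good faith rather than enforcement.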
The letter did not single out any company by name, but Business Insider has reported that OpenAI and Anthropic are among those bypassing robots.txt signals. Both firms have said in the past that they respect such instructions, yet they have reportedly opted to ignore them.
Wired’s own investigation traced a machine hosted on Amazon’s servers that appeared to be operated by Perplexity; it was bypassing the publication’s robots.txt instructions and, as Wired confirmed, scraping its content.
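In practical terms, that kind of test amounts to cross-referencing a site’s access logs against its robots.txt rules. A rough sketch of such a check appears below; the IP address, log file name, and log format are assumptions for illustration, not details from Wired’s investigation.

    from urllib.robotparser import RobotFileParser

    # Hypothetical values for illustration only
    SUSPECT_IP = "203.0.113.42"   # example/documentation IP range
    ACCESS_LOG = "access.log"     # assumes the common combined log format
    SITE = "https://example-news-site.com"

    parser = RobotFileParser()
    parser.set_url(SITE + "/robots.txt")
    parser.read()

    # Flag any request from the suspect machine that hits a path robots.txt disallows.
    with open(ACCESS_LOG) as log:
        for line in log:
            parts = line.split()
            if len(parts) < 7 or parts[0] != SUSPECT_IP:
                continue
            path = parts[6]  # request path from the 'GET /path HTTP/1.1' field
            if not parser.can_fetch("*", SITE + path):
                print(f"Disallowed path fetched by {SUSPECT_IP}: {path}")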
Wired also tested the company’s tool by feeding it headlines from its own articles and prompts based on its stories. The tool returned results that closely paraphrased those articles while offering little attribution.
At times it also produced inaccurate summaries of the stories, and Wired has noted that the chatbot has even made false claims about the publication.
In a separate interview with Fast Company, Perplexity’s chief was unwilling to admit that the company ignores the Robots Exclusion Protocol, and for many, his refusal to be straight about it was a matter of serious concern.
That denial does not mean the company gains no advantage from crawlers that ignore the protocol. As noted earlier, the firm relies on third-party crawlers in addition to its own. When pressed further, Perplexity said the matter is more complicated than it appears.
Perplexity has stood by its practices, going as far as to tell the publication that the Robots Exclusion Protocol is not a legal framework. The company’s chief did acknowledge, however, that a new kind of agreement between publishers and AI firms will be needed for things to work smoothly from here on.
The firm, in turn, accused Wired of deliberately crafting prompts to make the chatbot behave the way it did, arguing that ordinary users would not get similar results.
Image: DIW-Aigen