OpenAI’s training crawler recently got stuck on an unusual website billed as “the world’s lamest content farm,” causing a significant spike in traffic to the site. The site was created as an experiment by John Levine, the author of “The Internet for Dummies.”
It features billions of single-page sites that are all linked together. Each page looks almost the same but changes slightly each time someone clicks a link. Levine used a simple program to create a system where each click generates a new page name from a set of first names stored in a database.
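The naming scheme described above can be sketched roughly as follows. This is a minimal illustration, not Levine's actual program: the table name `first_names`, the three-part hyphenated names, the `example.invalid` domain, and every function name here are assumptions made for the example.

```python
import random
import sqlite3


def build_name_db(names):
    """Create an in-memory SQLite table holding the first names (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE first_names (name TEXT NOT NULL)")
    conn.executemany("INSERT INTO first_names (name) VALUES (?)", [(n,) for n in names])
    conn.commit()
    return conn


def new_page_name(conn, rng, parts=3):
    """Join randomly chosen first names into a fresh page name."""
    rows = conn.execute("SELECT name FROM first_names ORDER BY name").fetchall()
    names = [row[0] for row in rows]
    return "-".join(rng.choice(names) for _ in range(parts))


def render_links(conn, rng, count=5, base_domain="example.invalid"):
    """Return the outgoing links one generated page would carry.

    Each link points at a different single-page site, so following any
    of them leads to yet another page full of new links.
    """
    return [f"https://{new_page_name(conn, rng)}.{base_domain}/" for _ in range(count)]


if __name__ == "__main__":
    conn = build_name_db(["alice", "bob", "carol", "dave"])
    for link in render_links(conn, random.Random()):
        print(link)
```

Under these assumptions, a table of 1,000 first names combined three at a time already yields a billion possible names, which is why a crawler that follows every link never runs out of new pages.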
Levine’s site is built in a way that easily traps web crawlers, the programs that scan the internet by following links from page to page. The OpenAI bot, for instance, got so caught in the loop that it hit the site almost 150 times per second during a single day.
Levine found this amusing and shared the issue on a professional listserv for web developers and IT experts, seeking a contact at OpenAI to report the crawler's behavior.
This incident highlights a broader issue with how AI models are trained by indiscriminately gathering data from the internet, sometimes capturing nonsensical or irrelevant information. The problem was notable enough that Levine commented on the nature of the data potentially being used to train future versions of AI.
He humorously suggested that if anyone was curious about what data was training the next AI models, they now had an example.
The issue was resolved when the bot stopped accessing the site after Levine posted about it. His website is unusual in its structure: instead of being one site with billions of pages, it is billions of tiny websites, each with a single page.
This setup confuses many web crawlers, not just OpenAI's. In the past, similar issues have occurred with bots from Bing and Amazon.
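One reason a setup like this can trip up a crawler is that a page limit counted per hostname never triggers when every page lives on its own hostname. The sketch below shows one possible guard: counting fetches per parent domain instead. It is an illustration only, not a description of how OpenAI's, Bing's, or Amazon's crawlers work. The class name `DomainBudget` and the default limit are hypothetical, and taking the last two labels of the hostname as the parent domain is a deliberate simplification (a real crawler would need a public suffix list to handle domains such as `example.co.uk`).

```python
from collections import Counter
from urllib.parse import urlsplit


class DomainBudget:
    """Cap how many pages a crawler fetches under one parent domain."""

    def __init__(self, max_pages_per_domain=1000):
        self.max_pages = max_pages_per_domain
        self.fetched = Counter()

    @staticmethod
    def parent_domain(url):
        """Naive assumption: the parent domain is the last two hostname labels."""
        host = urlsplit(url).hostname or ""
        labels = host.lower().split(".")
        return ".".join(labels[-2:])

    def allow(self, url):
        """Return True and count the fetch if the domain still has budget left."""
        key = self.parent_domain(url)
        if self.fetched[key] >= self.max_pages:
            return False
        self.fetched[key] += 1
        return True
```

With a budget of two pages, `allow()` accepts `https://a.example.com/` and `https://b.example.com/` but rejects `https://c.example.com/`, because all three count against `example.com` even though each is a different hostname.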
Levine's website also serves a less serious purpose. It hosts ads for a couple of his books and a carton of fake eggs, which Levine described as just being cute.
Despite the commercial aspect, he noted that the sales for his books are not what they used to be, humorously adding that everyone knows how to use the internet now, unlike the early days of his popular book.
Image: DIW-AIgen