A New Study Shows That Many Websites Are Restricting the Data Used for AI Training

A study from the Data Provenance Initiative examined 14,000 web domains used in AI training and found that a shortage of training data may be approaching. The cause is a consent problem: many publishers and online platforms are restricting access to their data. The researchers say that 25% of the data from the highest-quality sources has been restricted, while 5% of the sources have placed complete restrictions on all of their data. In the data set known as C4, 45% of the data has been restricted by websites' terms of service.

Websites are using the Robots Exclusion Protocol to block automated access to their pages. This decline in available data will affect AI companies as well as academics and researchers. Many AI models, such as ChatGPT, Gemini and Claude, rely on data from these sites to write, code and create videos and images. Many websites say that their data is used for AI training without their permission and without any compensation. Some publishers have also gone to court, saying AI companies are using their data for free. Reddit and Stack Overflow now charge AI companies that want to use their data.
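For readers curious how the Robots Exclusion Protocol mentioned above works in practice, here is a minimal sketch in Python. It uses the standard library's `urllib.robotparser` to check whether a crawler may fetch a page. The robots.txt content, the crawler name `ExampleAIBot`, and the URL are all made up for illustration and do not come from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is blocked from the whole site,
# while every other crawler is allowed. "ExampleAIBot" is an invented name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))  # False: blocked crawler
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", page))  # True: falls under "*"
```

A robots.txt file only states the site's wishes. It works as a block only when the crawler chooses to read and follow it.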

This data shortage could change how AI models are trained. AI companies should consider paying websites and publishers for the use of their data. If they do not, many more websites are likely to block access soon, which will affect AI training.

Image: DIW-Aigen



Arooj Ahmed
