A new study published in Nature shows that AI models trained on AI-generated data tend to produce progressively worse output. A computer scientist from the University of Oxford compares it to copying a picture over and over, which eventually produces a poor result. In the same way, AI models trained on their own output end up generating incoherent, nonsensical content, a phenomenon known as “model collapse”.
The study refers to several AI models, including large ones such as GPT-3, which was trained on data from Common Crawl, an open repository of more than 3 billion web pages. As AI-generated junk websites make their way into the data that models are trained on, the problem is likely to get worse. The effects of this cluttered data are expected to show up as poorer and slower performance from AI models.
To find out how the performance of these models is affected, the researchers fine-tuned a large language model (LLM) on data from Wikipedia and then fine-tuned successive generations of that LLM, each on the output of the generation before it. The results showed that LLMs tuned on the output of another LLM had higher perplexity, meaning they were worse at predicting text. The output of the first generation was coherent, with well-structured sentences. By the final generation, the LLM was producing incoherent, nonsensical sentences.
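The generational setup described above can be illustrated with a toy simulation. This is only a sketch under loud assumptions, not the researchers' method: instead of an LLM, each "model" is a table of token frequencies, and "training" means re-estimating those frequencies from a sample drawn from the previous generation. The function name `simulate_collapse` and all of its parameters are hypothetical.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=50, sample_size=200, generations=30, seed=0):
    """Toy stand-in for recursive training (assumption: not the study's setup).

    Each generation re-estimates token frequencies from text sampled
    from the previous generation. Returns the number of tokens that
    still have a non-zero probability after each generation.
    """
    rng = random.Random(seed)
    tokens = list(range(vocab_size))
    # Generation 0 plays the role of human-written data: Zipf-like
    # weights, so a few tokens are common and many are rare.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    support_sizes = [vocab_size]
    for _ in range(generations):
        # "Generate" synthetic text from the current model.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        # "Train" the next model on that synthetic text only.
        counts = Counter(sample)
        weights = [counts[t] for t in tokens]  # unseen tokens get weight 0
        support_sizes.append(sum(1 for w in weights if w > 0))
    return support_sizes


if __name__ == "__main__":
    print(simulate_collapse())
```

Once a rare token fails to appear in one generation's sample, its weight becomes zero and no later generation can produce it again, so the vocabulary can only shrink. That is a simplified picture of how information in the original data can be lost when models learn only from other models' output.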
The researchers say that training AI models on the output of other AI models will still be necessary because the amount of data on the internet is limited. Such synthetic data will have to be used for training under controlled conditions.
Image: DIW-Aigen