A new study suggests that OpenAI trained some of its AI models on copyrighted material, parts of which the models appear to have memorized.
The company is currently entangled in a number of lawsuits brought by authors and programmers, most of whom accuse OpenAI of using their material, including books and codebases, without consent to build its models.
OpenAI has faced such allegations for a long time and maintains it has done nothing wrong, arguing that training its models on this material qualifies as fair use. The plaintiffs disagree, countering that US copyright law contains no carve-out for training data.
The study was co-authored by researchers at the University of Washington, Stanford, and the University of Copenhagen. It proposes a new method for identifying training data that has been memorized by models sitting behind an API, such as OpenAI's.
In simple terms, these models are trained on enormous amounts of data and learn patterns from it, which is how they can generate essays, images, and more. Most of their output is not a verbatim copy of the training data, but some of it is: image models have been caught regurgitating stills from films they were trained on, and large language models have effectively copied news articles.
The study's method relies on words the co-authors call high-surprisal, meaning words that stand out as statistically unusual in the context of a larger body of text. For example, "radar" would count as high-surprisal in a sentence ending in "humming", because it is less likely than words such as "engine" or "radio" to appear before that word.
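To make the idea concrete, here is a minimal sketch of how a word's surprisal can be measured with a language model. It uses GPT-2 via the Hugging Face transformers library purely for illustration; the study itself probed OpenAI models through their API, and its exact scoring procedure is not reproduced here.

```python
# Minimal sketch of word "surprisal" under a language model.
# GPT-2 is used only for illustration; it is not one of the models the study probed.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(context: str, word: str) -> float:
    """Negative log-probability (in nats) of `word` following `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    word_ids = tokenizer(" " + word, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, word_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probabilities of each token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    word_positions = range(ctx_ids.shape[1] - 1, ids.shape[1] - 1)
    return -sum(log_probs[i, targets[i]].item() for i in word_positions)

context = "Jack and I sat perfectly still with the"
for candidate in ["radar", "radio", "engine"]:
    print(candidate, round(surprisal(context, candidate), 2))
# A rarer, more specific word like "radar" will typically score a higher
# surprisal in this slot than "radio" or "engine".
```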
The co-authors tested several OpenAI models, including GPT-4 and GPT-3.5, for signs of memorization. They removed the high-surprisal words from snippets of fiction books and New York Times articles and had the models guess which words had been masked. If a model guessed correctly, it likely memorized the snippet during training, the co-authors say.
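Below is a rough sketch of what such a mask-and-guess probe could look like against a model behind an API. The prompt wording, model name, word-selection step, and scoring here are illustrative assumptions, not the study's actual implementation.

```python
# Rough sketch of a mask-and-guess memorization probe against an API-hosted model.
# Prompt text, model name, and threshold are illustrative, not the study's own.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def probe_snippet(snippet: str, masked_words: list[str]) -> float:
    """Mask the given high-surprisal words, ask the model to restore them,
    and return the fraction it guesses exactly right."""
    masked_text = snippet
    for word in masked_words:
        masked_text = masked_text.replace(word, "[MASK]", 1)

    guesses = []
    for i, _ in enumerate(masked_words):
        prompt = (
            "The following passage has words replaced by [MASK]. "
            f"Reply with only the single word that belongs in [MASK] number {i + 1}.\n\n"
            + masked_text
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        guesses.append(resp.choices[0].message.content.strip().strip(".").lower())

    hits = sum(g == w.lower() for g, w in zip(guesses, masked_words))
    return hits / len(masked_words)

# A high exact-guess rate across many snippets from the same book or article
# would suggest the model saw (and memorized) that text during training.
score = probe_snippet(
    "Jack and I sat perfectly still with the radar humming.",
    ["radar"],
)
print(f"exact-guess rate: {score:.0%}")
```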
According to the test results, GPT-4 showed signs of having memorized portions of popular fiction books, including books in BookMIA, a dataset containing samples of copyrighted ebooks. The results also suggest the model memorized portions of New York Times articles, though at a much lower rate.
The study's co-authors say the findings shed light on the contentious data these models may have been trained on. To judge whether such systems are trustworthy, they argue, we need models that can be probed, audited, and examined scientifically.
The work provides a useful tool for probing LLMs, but the need for transparency is greater than ever. OpenAI has long advocated for looser restrictions on training models with copyrighted data. While the company holds a number of content licensing deals, it continues to lobby governments over the rules that should govern AI training.
Image: DIW-Aigen
Read next: Canada Vows to Review and Cancel Government Starlink Accounts Amid US Trade War