Google DeepMind has developed a new benchmark called Gecko to better evaluate AI systems that turn text into images. These text-to-image models, such as DALL-E, Midjourney, and Stable Diffusion, are widely used to create images from written prompts.
However, current methods for evaluating how well these models work may not give the full picture. Tests typically rely on small-scale human evaluations or automated checks, which can miss subtle details or disagree with human judgment.
Gecko tackles this issue by introducing a challenging set of 2,000 prompts that cover a variety of skills and complexities. This helps identify specific areas where AI models struggle.
The Gecko system breaks these prompts down into detailed skill categories, showing not only where models fail but also the level of complexity at which they begin to struggle.
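To make that idea concrete, here is a minimal sketch of how a benchmark might tag prompts by skill and complexity level and tally failures per bucket. The skill names, complexity levels, and helper code are illustrative assumptions, not Gecko's actual taxonomy or implementation.

```python
# Illustrative sketch only: made-up skills and complexity levels,
# not Gecko's real categories or code.
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Prompt:
    text: str
    skill: str        # e.g. "counting", "spatial relations", "text rendering"
    complexity: int   # e.g. 1 = simple, 3 = highly compositional

def failure_breakdown(results):
    """Count failed prompts per (skill, complexity) bucket."""
    return Counter((p.skill, p.complexity) for p, passed in results if not passed)

if __name__ == "__main__":
    # (prompt, did the generated image pass a human/auto check?)
    results = [
        (Prompt("three cats on a sofa", "counting", 1), False),
        (Prompt("a street sign that reads 'OPEN'", "text rendering", 2), False),
        (Prompt("a dog to the left of a red car", "spatial relations", 2), True),
    ]
    for bucket, n in failure_breakdown(results).items():
        print(bucket, n)
```

A breakdown like this is what lets a benchmark report, for example, that a model handles simple counting prompts but fails once prompts combine several skills.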
The research team also collected more than 100,000 human ratings of the images produced by these models on Gecko's prompts. This volume of feedback helps distinguish whether problems stem from the prompts themselves, from differences between evaluation methods, or from the models' actual performance.
Gecko also introduces a new automatic metric based on a question-answering format, which aligns more closely with human judgment. This method revealed differences in model strengths and weaknesses that earlier metrics had not captured.
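The sketch below shows the general shape of such a question-answering score, under stated assumptions: the two helper functions are hypothetical stand-ins for an LLM that generates questions from the prompt and a visual question-answering model that answers them from the image; they are not Gecko's actual components.

```python
# Minimal sketch of a QA-style alignment score; helpers are placeholders.
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    expected_answer: str  # the answer implied by the prompt, e.g. "yes"

def generate_questions(prompt: str):
    """Hypothetical stand-in for an LLM that turns the prompt into
    verifiable questions about the requested image."""
    # Toy behaviour: one yes/no question per comma-separated clause.
    return [QAItem(f"Does the image show {clause.strip()}?", "yes")
            for clause in prompt.split(",") if clause.strip()]

def answer_question(image, question: str) -> str:
    """Hypothetical stand-in for a VQA model that inspects the image."""
    return "yes"  # placeholder answer, for illustration only

def qa_alignment_score(image, prompt: str) -> float:
    """Fraction of prompt-derived questions answered as the prompt implies."""
    items = generate_questions(prompt)
    if not items:
        return 0.0
    correct = sum(answer_question(image, q.question) == q.expected_answer
                  for q in items)
    return correct / len(items)

if __name__ == "__main__":
    prompt = "a red bicycle leaning against a blue wall, at sunset"
    print(qa_alignment_score(image=None, prompt=prompt))  # 1.0 with these stubs
```

Because the score is built from answers to concrete questions about the prompt, it tends to track human judgments more closely than a single similarity number between the prompt and the image.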
DeepMind's Muse model performed best in tests on the Gecko benchmark. The researchers argue that using multiple benchmarks and evaluation methods is crucial for understanding the true abilities of AI image generators before they are deployed in real-world applications.
They plan to release the Gecko code and data openly to encourage further progress in this area. The effort underscores the importance of rigorous testing to identify the best AI models, moving beyond impressive-looking results to genuinely reliable technology.
Image: DIW-Aigen