Sometimes, when you ask an AI a question that the system restricts, it replies that it cannot answer that kind of question. A new study, however, shows that users can quite easily convince AI models to answer harmful questions even when the underlying LLMs have gone through safety training. Large language models can be manipulated into spreading misinformation, producing toxic or harmful content, or supporting harmful activities. The new research from EPFL found that even LLMs that have received the most recent safety training can still be coaxed into generating prohibited answers after a few persuasive prompts.
The researchers found that even the most recent LLMs are prone to jailbreaking attacks, reporting a 100% attack success rate on several models, including Claude 3.5 Sonnet and GPT-4o. Attacks that probe an LLM's defenses are easy to construct and can convince or manipulate the model into giving out information it is not supposed to share. The researchers used a dataset of 50 harmful requests and, after running experiments against different LLMs, achieved a perfect jailbreaking score of 100%. They also found that different LLMs are vulnerable to different prompts, and that some LLM Application Programming Interfaces (APIs) expose options that can weaken safety behavior unless they are restricted in the settings.
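To make the reported number concrete, the sketch below shows one common way an attack success rate like the 100% figure can be computed over a set of harmful requests. This is not the EPFL team's code; query_model, is_refusal, and jailbreak_template are hypothetical placeholders for an LLM API call, a refusal judge, and an adversarial prompt template.

```python
# Illustrative sketch (not the study's actual code): measuring attack success
# rate over a list of harmful requests. All helper names are hypothetical.
from typing import Callable, List


def attack_success_rate(
    requests: List[str],
    jailbreak_template: str,
    query_model: Callable[[str], str],
    is_refusal: Callable[[str], bool],
) -> float:
    """Return the fraction of harmful requests the model does not refuse."""
    successes = 0
    for request in requests:
        # Wrap the raw harmful request in the adversarial template.
        prompt = jailbreak_template.format(request=request)
        response = query_model(prompt)
        if not is_refusal(response):
            successes += 1
    return successes / len(requests)

# A result of 1.0 would correspond to the paper's 100% rate: every one of
# the 50 requests in the dataset received a non-refusing answer.
```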
The researchers said that it is important to test both adaptive and static techniques to find out how easily an LLM can be manipulated, noting that simply replaying existing attacks against a model may not give accurate results. The results of the study have been shared with the companies behind the affected AI models. The thesis of researcher Maksym Andriushchenko, on which this work is based, earned him the Patrick Denantes Award, as the research is important for the safety of users as well as AI agents.
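The distinction between static and adaptive testing can be illustrated with a minimal sketch: a static test sends one fixed jailbreak prompt, while an adaptive attack keeps adjusting the prompt based on the model's responses. The loop below, a simplified random-search idea and not the researchers' method, mutates an appended suffix and keeps only changes that make the answer look more compliant; query_model and compliance_score are hypothetical helpers.

```python
# Illustrative sketch of an adaptive attack loop, assuming hypothetical
# query_model (prompt -> response) and compliance_score (response -> float).
import random
import string


def adaptive_attack(request: str, query_model, compliance_score, iterations: int = 100) -> str:
    """Randomly mutate an appended suffix, keeping mutations that make the
    model's answer score as more compliant."""
    suffix = list("!" * 20)  # simple initial adversarial suffix
    best_score = compliance_score(query_model(request + "".join(suffix)))
    for _ in range(iterations):
        candidate = suffix.copy()
        # Replace one random character of the suffix with a random printable one.
        candidate[random.randrange(len(candidate))] = random.choice(string.printable.strip())
        score = compliance_score(query_model(request + "".join(candidate)))
        if score > best_score:  # keep the mutation only if it helps
            suffix, best_score = candidate, score
    return request + "".join(suffix)
```

A static evaluation, by contrast, would stop after the first query, which is why the researchers argue it can understate how easily a model is manipulated.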
Read next: Generative AI Awareness and Usage Soar, But Premium Smartphone Adoption Faces Challenges