ANEMONE: Analysis and improvement of LLM robustness

Date 01/04/2024 - 31/12/2024
Type Privacy Protection & Cryptography, Machine Learning
Partner armasuisse
Partner contact Ljiljana Dolamic, Gerome Bovet
EPFL Laboratory Signal Processing Laboratory 4

Large Language Models (LLMs) have gained widespread adoption for their ability to generate coherent text and perform complex tasks. However, concerns about their safety, such as biases, misinformation, and user data privacy, have emerged. Using LLMs to automate red-teaming has become a growing area of research. These attacks typically involve generating inputs designed to exploit weaknesses in target models, such as inducing biased or harmful outputs or prompting the model to leak sensitive information. In this project, we aim to use techniques such as prompt engineering and adversarial paraphrasing to force the victim LLM to produce drastically different, often undesirable, responses.
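
The sketch below illustrates the kind of robustness probe this describes: querying a victim model with surface-level paraphrases of the same prompt and measuring how much its answers diverge. It is a minimal illustration only; the model checkpoint ("gpt2"), the hand-written paraphrase rules, and the token-overlap metric are placeholder assumptions, not the project's actual attack pipeline, which would rely on learned or LLM-driven adversarial paraphrasing.

```python
# Minimal robustness probe: compare a victim LLM's answers across paraphrases
# of the same prompt. Model name, paraphrase rules, and the overlap metric are
# illustrative placeholders, not the project's actual setup.
from transformers import pipeline

# Victim model under test (placeholder checkpoint).
victim = pipeline("text-generation", model="gpt2")


def query(prompt: str) -> str:
    """Return the victim model's greedy continuation for a single prompt."""
    out = victim(prompt, max_new_tokens=64, do_sample=False)
    return out[0]["generated_text"][len(prompt):]


def paraphrase_variants(prompt: str) -> list[str]:
    """Hand-written surface rewrites standing in for an automated adversarial
    paraphraser (e.g. another LLM or a learned rewriting model)."""
    return [
        prompt,
        prompt.lower(),
        "Please answer briefly: " + prompt,
        prompt.rstrip("?") + ", if you had to say?",
    ]


def response_divergence(prompt: str) -> None:
    """Print a crude token-overlap score between the baseline answer and the
    answer to each paraphrase; low overlap flags the brittleness the project
    aims to analyse."""
    answers = {p: query(p) for p in paraphrase_variants(prompt)}
    baseline_tokens = set(answers[prompt].split())
    for p, a in answers.items():
        overlap = len(set(a.split()) & baseline_tokens) / max(len(baseline_tokens), 1)
        print(f"paraphrase: {p!r}\n  token overlap with baseline: {overlap:.2f}")


if __name__ == "__main__":
    response_divergence("What are the side effects of aspirin?")
```

In practice the hand-written rewrites would be replaced by an attacker model that searches for paraphrases maximising the divergence, and the overlap score by a semantic similarity or harmfulness classifier.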