Large Language Models (LLMs) have gained widespread adoption for their ability to generate coherent text and perform complex tasks. However, concerns about their safety, such as bias, misinformation, and user data privacy, have emerged. Using LLMs to automatically red-team other models has become a growing area of research. These attacks typically involve generating inputs designed to exploit weaknesses in target models, for example by inducing biased or harmful outputs or by prompting the model to leak sensitive information. In this project, we aim to use techniques such as prompt engineering and adversarial paraphrasing to force a victim LLM to generate drastically different, often undesirable responses.
ANEMONE: Analysis and improvement of LLM robustness
| Date | 01/04/2024 - 31/12/2024 |
| Type | Privacy Protection & Cryptography, Machine Learning |
| Partner | armasuisse |
| Partner contact | Ljiljana Dolamic, Gerome Bovet |
| EPFL Laboratory | Signal Processing Laboratory 4 |
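As a rough illustration of the adversarial-paraphrasing loop described above (a minimal sketch, not the project's actual implementation), the example below paraphrases a prompt with an attacker model, queries the victim model, and flags responses that diverge sharply from the baseline answer. The `attacker` and `victim` callables and the token-overlap scorer are placeholder assumptions introduced only to keep the example self-contained and runnable.

```python
"""Minimal sketch of an adversarial-paraphrasing red-teaming loop.

Assumptions (not from the project description): `attacker` and `victim`
are placeholder callables standing in for LLM endpoints, and similarity
is scored with a crude token-overlap heuristic chosen only to keep the
example self-contained.
"""

from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Crude similarity score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def red_team(
    prompt: str,
    attacker: Callable[[str], str],   # paraphrases the prompt (hypothetical LLM call)
    victim: Callable[[str], str],     # answers prompts (hypothetical LLM call)
    rounds: int = 5,
    divergence_threshold: float = 0.3,
) -> List[Tuple[str, str, float]]:
    """Paraphrase `prompt` repeatedly and flag victim responses that
    diverge sharply from the response to the original prompt."""
    baseline = victim(prompt)
    flagged = []
    for _ in range(rounds):
        paraphrase = attacker(prompt)
        response = victim(paraphrase)
        similarity = token_overlap(baseline, response)
        if similarity < divergence_threshold:
            # A large divergence suggests the paraphrase exploited a weakness.
            flagged.append((paraphrase, response, similarity))
    return flagged


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    attacker = lambda p: p + " (rephrased)"
    victim = lambda p: "refused" if "rephrased" in p else "helpful detailed answer"
    for paraphrase, response, score in red_team("Explain how X works.", attacker, victim):
        print(f"divergent response (sim={score:.2f}): {paraphrase!r} -> {response!r}")
```

In practice the placeholder callables would be replaced by calls to the attacker and victim models under test, and the overlap heuristic by a more robust divergence or harmfulness metric.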