Large Language Models (LLMs) have gained widespread adoption for their ability to generate coherent text and perform complex tasks. However, concerns about their safety, such as bias, misinformation, and user data privacy, have emerged. Using LLMs to automatically red-team other models has become a growing area of research. These attacks typically involve generating inputs designed to exploit weaknesses in target models, for example by inducing biased or harmful outputs or by prompting the model to leak sensitive information. In this project, we aim to use techniques such as prompt engineering and adversarial paraphrasing to force a victim LLM to generate drastically different, often undesirable responses.
ANEMONE: Analysis and improvement of LLM robustness
| Date | 01/04/2024 - 31/12/2024 |
| Type | Privacy Protection & Cryptography, Machine Learning |
| Partner | armasuisse |
| Partner contact | Ljiljana Dolamic, Gerome Bovet |
| EPFL Laboratory | Signal Processing Laboratory 4 |
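As a rough illustration of the adversarial-paraphrasing loop described above (a minimal sketch, not the project's actual implementation), the example below paraphrases a prompt with an attacker model, queries the victim model, and flags responses that diverge sharply from the baseline answer. The `attacker` and `victim` callables and the token-overlap scorer are placeholder assumptions introduced only to keep the example self-contained and runnable.

```python
"""Minimal sketch of an adversarial-paraphrasing red-teaming loop.

Assumptions (not from the project description): `attacker` and `victim`
are placeholder callables standing in for LLM endpoints, and similarity
is scored with a crude token-overlap heuristic chosen only to keep the
example self-contained.
"""

from typing import Callable, List, Tuple


def token_overlap(a: str, b: str) -> float:
    """Crude similarity score: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def red_team(
    prompt: str,
    attacker: Callable[[str], str],   # paraphrases the prompt (hypothetical LLM call)
    victim: Callable[[str], str],     # answers prompts (hypothetical LLM call)
    rounds: int = 5,
    divergence_threshold: float = 0.3,
) -> List[Tuple[str, str, float]]:
    """Paraphrase `prompt` repeatedly and flag victim responses that
    diverge sharply from the response to the original prompt."""
    baseline = victim(prompt)
    flagged = []
    for _ in range(rounds):
        paraphrase = attacker(prompt)
        response = victim(paraphrase)
        similarity = token_overlap(baseline, response)
        if similarity < divergence_threshold:
            # A large divergence suggests the paraphrase exploited a weakness.
            flagged.append((paraphrase, response, similarity))
    return flagged


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model access.
    attacker = lambda p: p + " (rephrased)"
    victim = lambda p: "refused" if "rephrased" in p else "helpful detailed answer"
    for paraphrase, response, score in red_team("Explain how X works.", attacker, victim):
        print(f"divergent response (sim={score:.2f}): {paraphrase!r} -> {response!r}")
```

In practice the placeholder callables would be replaced by calls to the attacker and victim models under test, and the overlap heuristic by a more robust divergence or harmfulness metric.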