Figure 1: Image hijacks of LLaVA, a VLM based on CLIP and LLaMA-2. These attacks are automated, barely perceptible to humans, and control the model's output.
Are foundation models secure against malicious actors?
In this work, we focus on the image input to vision-language models (VLMs). We discover that this image input channel is vulnerable to attack by way of image hijacks: adversarial images that control generative models at runtime.
We introduce the general behaviour-matching algorithm for training image hijacks. From this, we derive the prompt-matching algorithm, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt, using a generic, off-the-shelf dataset unrelated to our choice of prompt.
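To make the behaviour-matching setup concrete, below is a minimal sketch (not the paper's released code) of how such an optimisation could look in PyTorch: a small perturbation over a base image is trained by gradient descent so that the frozen VLM's output matches the target behaviour, and is repeatedly projected back into an L-infinity ball so the hijack stays barely perceptible. The `loss_fn` callable, which would compute the teacher-forced loss of the target behaviour under the VLM, and all hyperparameter values are assumptions for illustration.

```python
# Minimal sketch of behaviour matching (illustrative, not the authors' implementation).
# We optimise a perturbation `delta` over a base image so that a frozen VLM,
# given (base_image + delta, prompt), assigns high likelihood to the target behaviour.
import torch

def train_hijack(loss_fn, base_image, prompts, eps=8 / 255, lr=1e-2, steps=1000):
    """Gradient descent on the image input channel, projected into an eps-ball.

    loss_fn(image, prompt) -> scalar loss of the target behaviour under the VLM (assumed).
    base_image: float tensor in [0, 1], shape (3, H, W).
    prompts: generic user prompts unrelated to the target behaviour.
    """
    delta = torch.zeros_like(base_image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for step in range(steps):
        prompt = prompts[step % len(prompts)]
        image = (base_image + delta).clamp(0.0, 1.0)   # keep a valid image
        loss = loss_fn(image, prompt)                  # distance from the target behaviour
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                          # project back into the L-infinity ball
            delta.clamp_(-eps, eps)

    return (base_image + delta).clamp(0.0, 1.0).detach()
```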
We use behaviour-matching to craft hijacks for four types of attack: forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements.
We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all attack types achieve a success rate of over 80%. Moreover, our attacks are automated and require only small image perturbations.
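As a hedged illustration of how such a success rate could be measured for a specific-string attack, the snippet below counts how often the model's reply to held-out prompts, with the hijack image attached, begins with the adversary's target string; the `generate` callable is a hypothetical wrapper around the VLM's decoding loop, not part of the paper's code.

```python
# Illustrative evaluation sketch: fraction of held-out prompts for which the
# hijacked model's reply starts with the adversary's target string.
def success_rate(generate, hijack_image, eval_prompts, target_text):
    hits = sum(
        generate(hijack_image, prompt).strip().startswith(target_text)
        for prompt in eval_prompts
    )
    return hits / len(eval_prompts)
```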
@misc{bailey2023image,
  title={Image Hijacks: Adversarial Images can Control Generative Models at Runtime},
  author={Luke Bailey and Euan Ong and Stuart Russell and Scott Emmons},
  year={2023},
  eprint={2309.00236},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}