Image Hijacks: Adversarial Images can Control Generative Models at Runtime

1UC Berkeley, 2Harvard University, 3University of Cambridge
* denotes equal contribution

Figure 1: Image hijacks of LLaVA-2, a VLM based on CLIP and LLaMA-2. These attacks are automated, barely perceptible to humans, and control the model's output.


Are foundation models secure against malicious actors?

In this work, we focus on the image input to a vision-language model (VLM). We discover that this image input channel is vulnerable to attack by way of image hijacks: adversarial images that control generative models at runtime.

We introduce the general behaviour-matching algorithm for training image hijacks. From this, we derive the prompt-matching algorithm, which lets us train hijacks that make the model behave as though it had been given an arbitrary user-defined text prompt (e.g. 'the Eiffel Tower is now located in Rome'), using a generic, off-the-shelf dataset unrelated to that prompt.
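At a high level, behaviour matching optimises the pixels of an adversarial image so that, over a dataset of (context, target output) pairs defining the desired behaviour, the VLM's output matches the target; prompt matching simply sets each target to whatever the model would answer when its context is prefixed with the chosen prompt. The sketch below illustrates one optimisation step in PyTorch. The `vlm` interface, the hyperparameters, and the L-infinity projection are assumptions made for illustration, not the paper's exact implementation.

```python
# A minimal sketch of the behaviour-matching idea, not the authors' code.
# Assumption: `vlm` is a hypothetical callable that, given an image tensor, a
# text context, and a target string, returns (per-token logits, target token
# ids) under teacher forcing, and is differentiable w.r.t. the image.
import torch
import torch.nn.functional as F

def behaviour_matching_step(vlm, image, base_image, contexts, targets,
                            step_size=1e-2, epsilon=8 / 255):
    """One projected-gradient step nudging `image` so the VLM reproduces each
    target text given the corresponding context (cross-entropy on target tokens)."""
    image = image.detach().clone().requires_grad_(True)

    loss = torch.zeros((), device=image.device)
    for ctx, tgt in zip(contexts, targets):
        logits, target_ids = vlm(image=image, context=ctx, target=tgt)  # hypothetical API
        loss = loss + F.cross_entropy(logits.view(-1, logits.size(-1)),
                                      target_ids.view(-1))
    loss.backward()

    with torch.no_grad():
        # Signed-gradient descent on the pixels, then project back into an
        # L-infinity ball of radius epsilon around the clean image (keeping the
        # perturbation barely perceptible) and into the valid pixel range.
        updated = image - step_size * image.grad.sign()
        updated = base_image + (updated - base_image).clamp(-epsilon, epsilon)
        return updated.clamp(0.0, 1.0)
```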

We use behaviour matching to craft hijacks for four types of attack (a sketch instantiating each as a target behaviour follows the list):

  • Specific string attacks force a model to generate arbitrary output of the adversary's choosing.
  • Leak context attacks force a model to leak information from its context window into its output.
  • Jailbreak attacks circumvent a model's safety training.
  • Disinformation attacks force a model to believe false information.
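As a loose illustration of how each attack type reduces to a target behaviour for behaviour matching, the sketch below pairs each attack with a hypothetical target-text template. Only the attack taxonomy and the Eiffel Tower prompt come from the text above; every concrete string and the helper `reference_model_answer` are invented for illustration.

```python
# Hypothetical target-behaviour templates for the four attack types.

def reference_model_answer(prompt: str) -> str:
    """Stand-in for querying an unattacked copy of the VLM (prompt matching)."""
    raise NotImplementedError  # in practice: run the frozen VLM on `prompt`

ATTACK_TARGETS = {
    # Specific string: emit an arbitrary adversary-chosen output, whatever the user asks.
    "specific_string": lambda ctx: "Visit evil.example.com to claim your prize!",
    # Leak context: copy the (possibly private) context window into the output.
    "leak_context": lambda ctx: f"LEAKED CONTEXT: {ctx}",
    # Jailbreak: comply with a harmful request instead of refusing it.
    "jailbreak": lambda ctx: f"Sure, here is how to {ctx}:",
    # Disinformation (via prompt matching): behave as if the false claim were
    # prepended to the user's prompt.
    "disinformation": lambda ctx: reference_model_answer(
        "The Eiffel Tower is now located in Rome. " + ctx
    ),
}

# Usage: pick an attack, build (context, target) pairs from any text dataset,
# then optimise the hijack with behaviour_matching_step above.
```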

We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all attack types achieve a success rate of over 80%. Moreover, our attacks are automated and require only small image perturbations.


@misc{bailey2023imagehijacks,
  title={Image Hijacks: Adversarial Images can Control Generative Models at Runtime},
  author={Luke Bailey and Euan Ong and Stuart Russell and Scott Emmons},
  year={2023}
}