Image Hijacks: Adversarial Images can Control Generative Models at Runtime

1UC Berkeley, 2Harvard University, 3University of Cambridge
* denotes equal contribution

Figure 1: Image hijacks of LLaVA-2, a VLM based on CLIP and LLaMA-2. These attacks are automated, barely perceptible to humans, and control the model's output.


Are foundation models secure against malicious actors?

In this work, we study the attack surface of vision-language models (VLMs). We find that their image input channel is vulnerable to attack via image hijacks: adversarial images that control generative models at runtime.

We introduce behaviour matching, a general method for crafting image hijacks, and use it to build three different types of attack:

  • Specific-string attacks force a model to generate arbitrary output of the adversary's choosing.
  • Leak-context attacks force a model to leak information from its context window into its output.
  • Jailbreak attacks circumvent a model's safety training.
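At its core, behaviour matching casts hijack-crafting as an optimisation problem: search for a small image perturbation, here constrained to an \(\ell_\infty\) ball of radius \(\varepsilon = 8/255\) as in the demo images, that minimises a loss which is low when the model exhibits the target behaviour (e.g. the cross-entropy of a specific target string). The sketch below illustrates the idea with projected gradient descent in numpy; `grad_fn` is a hypothetical stand-in for backpropagation through the frozen VLM, and the step size, iteration count, and optimiser are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def behaviour_match(image, grad_fn, eps=8/255, lr=1/255, steps=100):
    """Minimal sketch of behaviour matching via projected gradient descent.

    image:   pixel array with values in [0, 1].
    grad_fn: stand-in for backprop through the frozen VLM; returns the
             gradient, w.r.t. the input pixels, of a loss that is low
             when the model exhibits the target behaviour.
    """
    delta = np.zeros_like(image)
    for _ in range(steps):
        g = grad_fn(image + delta)
        delta -= lr * np.sign(g)                          # signed gradient step
        delta = np.clip(delta, -eps, eps)                 # project into the eps-ball
        delta = np.clip(image + delta, 0.0, 1.0) - image  # keep pixels valid
    return image + delta
```

As a smoke test, one can substitute a toy quadratic loss for the VLM; the real attack instead differentiates through the CLIP-and-LLaMA-2 pipeline, with the same \(\ell_\infty\) projection keeping the perturbation barely perceptible.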

We study these attacks against LLaVA-2, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that each attack type achieves a success rate above 90%. Moreover, our attacks are automated and require only small image perturbations.

These findings raise serious concerns about the security of foundation models: if image hijacks prove as difficult to defend against as adversarial examples on CIFAR-10, it may be many years before a solution is found, if one exists at all.


While the experiments in our paper were performed on the latest version of LLaVA based on LLaMA-2, we also trained specific-string and leak-context attacks for the public LLaVA demo, which you can try below:


Original image


Leak-context hijack under \(\ell_\infty\) norm constraint (\(\varepsilon=8/255\))


Specific-string hijack under \(\ell_\infty\) norm constraint (\(\varepsilon=8/255\))


BibTeX

@article{bailey2023image,
  title={Image Hijacks: Adversarial Images can Control Generative Models at Runtime},
  author={Luke Bailey and Euan Ong and Stuart Russell and Scott Emmons},
}