Figure 1: Image hijacks of LLaVA, a VLM based on CLIP and LLaMA-2. These attacks are automated, barely perceptible to humans, and control the model's output.
Are foundation models secure against malicious actors?
In this work, we focus on the image input to vision-language models (VLMs). We discover that this image input channel is vulnerable to attack by way of image hijacks: adversarial images that control generative models at runtime.
We introduce the general behaviour-matching algorithm for training image hijacks. From this, we derive the prompt-matching algorithm, allowing us to train hijacks matching the behaviour of an arbitrary user-defined text prompt, using a generic, off-the-shelf dataset unrelated to our choice of prompt.
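To make the behaviour-matching setup concrete, below is a minimal sketch (not the paper's released code) of how such an optimisation could look in PyTorch: a small perturbation over a base image is trained by gradient descent so that the frozen VLM's output matches the target behaviour, and is repeatedly projected back into an L-infinity ball so the hijack stays barely perceptible. The `loss_fn` callable, which would compute the teacher-forced loss of the target behaviour under the VLM, and all hyperparameter values are assumptions for illustration.

```python
# Minimal sketch of behaviour matching (illustrative, not the authors' implementation).
# We optimise a perturbation `delta` over a base image so that a frozen VLM,
# given (base_image + delta, prompt), assigns high likelihood to the target behaviour.
import torch

def train_hijack(loss_fn, base_image, prompts, eps=8 / 255, lr=1e-2, steps=1000):
    """Gradient descent on the image input channel, projected into an eps-ball.

    loss_fn(image, prompt) -> scalar loss of the target behaviour under the VLM (assumed).
    base_image: float tensor in [0, 1], shape (3, H, W).
    prompts: generic user prompts unrelated to the target behaviour.
    """
    delta = torch.zeros_like(base_image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for step in range(steps):
        prompt = prompts[step % len(prompts)]
        image = (base_image + delta).clamp(0.0, 1.0)   # keep a valid image
        loss = loss_fn(image, prompt)                  # distance from the target behaviour
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                          # project back into the L-infinity ball
            delta.clamp_(-eps, eps)

    return (base_image + delta).clamp(0.0, 1.0).detach()
```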
We use behaviour-matching to craft hijacks for four types of attack: forcing VLMs to generate outputs of the adversary's choice, leak information from their context window, override their safety training, and believe false statements.
We study these attacks against LLaVA, a state-of-the-art VLM based on CLIP and LLaMA-2, and find that all attack types achieve a success rate of over 80%. Moreover, our attacks are automated and require only small image perturbations.
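As a hedged illustration of how such a success rate could be measured for a specific-string attack, the snippet below counts how often the model's reply to held-out prompts, with the hijack image attached, begins with the adversary's target string; the `generate` callable is a hypothetical wrapper around the VLM's decoding loop, not part of the paper's code.

```python
# Illustrative evaluation sketch: fraction of held-out prompts for which the
# hijacked model's reply starts with the adversary's target string.
def success_rate(generate, hijack_image, eval_prompts, target_text):
    hits = sum(
        generate(hijack_image, prompt).strip().startswith(target_text)
        for prompt in eval_prompts
    )
    return hits / len(eval_prompts)
```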
@misc{bailey2023image,
  title={Image Hijacks: Adversarial Images can Control Generative Models at Runtime},
  author={Luke Bailey and Euan Ong and Stuart Russell and Scott Emmons},
  year={2023},
  eprint={2309.00236},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}