
LLaVAGuard

PyTorch implementation of the paper "LLaVAGuard: Safety Guardrails for Multimodal Large Language Models against Jailbreak Attacks"

LLaVAGuard is a novel framework that applies multimodal safety guardrails to any input prompt. The guardrails are specifically optimized to minimize the likelihood of generating harmful responses from the LLaVA-v1.5 model, and we have also demonstrated their transferability to other prominent MLLMs, including GPT-4V, MiniGPT-4, and InstructBLIP, thereby broadening the scope of our solution.
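
As a minimal sketch of the general idea (the helper function and guardrail file path below are illustrative assumptions, not the repository's API): a text guardrail is simply prepended to the user prompt before it is passed to the model.

    # Hypothetical helper: prepend an optimized text guardrail to a user prompt.
    # The guardrail file path is an assumption for illustration only.
    def apply_text_guardrail(prompt: str, guardrail_path: str) -> str:
        with open(guardrail_path, "r") as f:
            guardrail = f.read().strip()
        return f"{guardrail}\n{prompt}"

    guarded_prompt = apply_text_guardrail(
        "Describe this image.",            # user prompt
        "text_patch_optimized/patch.txt",  # hypothetical guardrail file
    )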

Project Structure

  • cal_metrics.py: Script for summarizing the perplexity metrics over all examples.
  • get_metric.py: Script for calculating detoxify and Perspective API metrics.
  • eval_configs: Configuration files for model evaluations, including settings for LLaMA and MiniGPT-4.
  • image_safety_patch.py, text_safety_patch.py: Scripts for generating safety patches from images and text; a sketch of how such a patch might be applied follows this list.
  • instructblip_*.py: Scripts related to the InstructBLIP model, including defense strategies against constrained and unconstrained attacks, and question answering.
  • lavis: Submodule for the InstructBLIP model, which contains the dataset builders, models, processors, projects, runners, and tasks for various multimodal learning purposes.
  • metric: Implementations of metrics such as detoxify and Perspective API.
  • minigpt_*.py: Scripts related to the MiniGPT-4 model, including constrained and unconstrained inference, and question answering.
  • requirements.txt: Required Python packages for setting up the project.
  • scripts: Shell scripts for running all experiments.
  • utils.py: Utility functions supporting various operations across the project, such as image loading and preprocessing.
  • visual: Scripts for visualizing the overall toxicity results from InstructBLIP and MiniGPT-4 evaluations.
  • text_patch_heuristic: Pre-defined text guardrails.
  • text_patch_optimized: Optimized text guardrails.
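
As a rough illustration of how an optimized image safety patch could be combined with an input image before inference (a sketch only; the file names, tensor format, and clamping step are assumptions, not the repository's actual code from image_safety_patch.py or utils.py):

    # Sketch: apply a (hypothetical) pre-computed image safety patch to an input
    # image as a bounded perturbation, keeping pixel values in [0, 1].
    import torch
    from PIL import Image
    from torchvision import transforms

    to_tensor = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])

    image = to_tensor(Image.open("unconstrained_attack_images/example.png").convert("RGB"))
    patch = torch.load("image_safety_patch.pt")  # hypothetical patch tensor of the same shape

    patched_image = torch.clamp(image + patch, 0.0, 1.0)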

Setup

To get started with llavaguard, follow these setup steps:

  1. Clone the Repository:

    git clone <repository-url> llavaguard
    cd llavaguard
    
  2. Install Dependencies: Make sure you have Python 3.10+ installed, then run:

    pip install -r requirements.txt
    
  3. Dataset Preparation: Download the two archives (adversarial_qna_images.tar.gz and unconstrained_attack_images.tar.gz) from Google Drive and place them in the project root. Then run:

    tar -xzvf adversarial_qna_images.tar.gz
    tar -xzvf unconstrained_attack_images.tar.gz
    

Usage

The project includes several scripts and shell commands designed to perform specific tasks. Here are some examples:

  • Running the constrained and unconstrained attacks, as well as the QnA task, for the InstructBLIP model:

    bash scripts/run_instructblip_attack.sh
    

    This step collects the responses from the models and then computes the toxicity metrics (Detoxify and Perspective API); a metric-scoring sketch follows this list.

    The procedure for running MiniGPT-4 is similar.

  • Running experiments for the baseline defense methods:

    bash scripts/run_instructblip_baseline.sh
    
  • Running our LLaVAGuard defense methods:

    bash scripts/run_instructblip_safety_patch.sh
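
Since the attack and defense scripts end by scoring the generated responses, the following sketch shows the kind of toxicity scoring performed in get_metric.py using the Detoxify library (the responses below are placeholders; in the actual pipeline they come from the saved model outputs, and Perspective API scoring is not shown):

    # Sketch of Detoxify-based toxicity scoring over a list of model responses.
    from detoxify import Detoxify

    responses = [
        "placeholder model response one",
        "placeholder model response two",
    ]

    scores = Detoxify("original").predict(responses)  # dict of per-response score lists
    mean_toxicity = sum(scores["toxicity"]) / len(scores["toxicity"])
    print(f"Mean toxicity: {mean_toxicity:.4f}")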
    

Contributing

Contributions to llavaguard are welcome. Please submit pull requests to the repository with a clear description of the changes and their purpose.

License

This project is released under the Apache 2.0 License.