---
base_model:
- SherryXTChen/LatentDiffusionDINOv2
datasets:
- timbrooks/instructpix2pix-clip-filtered
- SherryXTChen/InstructCLIP-InstructPix2Pix-Data
language:
- en
license: apache-2.0
pipeline_tag: image-to-image
library_name: diffusers
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
---

# Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning (CVPR 2025)

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
The model is based on the paper [Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning](https://huggingface.co/papers/2503.18406).

[arXiv](http://arxiv.org/abs/2503.18406) | [Image Editing Model](https://huggingface.co/SherryXTChen/InstructCLIP-InstructPix2Pix) | [Data Refinement Model](https://huggingface.co/SherryXTChen/Instruct-CLIP) | [Data](https://huggingface.co/datasets/SherryXTChen/InstructCLIP-InstructPix2Pix-Data)

## Capabilities

<p align="center">
<img src="https://raw.githubusercontent.com/SherryXTChen/Instruct-CLIP/refs/heads/main/assets/teaser_2.png" alt="Figure 2" width="50%">
</p>

## Installation
```
pip install -r requirements.txt
```
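
After installing, a quick import check can confirm the environment is usable. The sketch below is only an illustration: the package list is an assumption inferred from the imports in the inference example and the `library_name` in this card's metadata, not from `requirements.txt` itself, and `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Assumed package list (not read from requirements.txt): the modules imported
# by the inference example, plus diffusers from the card metadata.
REQUIRED = ["torch", "torchvision", "PIL", "diffusers"]


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All assumed dependencies are importable.")
```

If anything is reported missing, re-running the install command above is the first thing to try.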
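
## Edit Instruction Refinement Inference

The script below estimates the edit instruction that turns an input image into an output image. It imports `model` and `utils` as local modules, so it presumably needs to be run from a checkout of the project repository.
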
```python
import argparse

from PIL import Image
import torch
from torchvision import transforms

from model import InstructCLIP
from utils import get_sd_components, normalize

parser = argparse.ArgumentParser(description="Simple example of estimating edit instruction from image pair")
parser.add_argument(
    "--pretrained_instructclip_name_or_path",
    type=str,
    default="SherryXTChen/Instruct-CLIP",
    help="Instruct-CLIP pretrained checkpoint",
)
parser.add_argument(
    "--pretrained_model_name_or_path",
    type=str,
    default="runwayml/stable-diffusion-v1-5",
    help="Stable Diffusion pretrained checkpoint",
)
parser.add_argument(
    "--input_path",
    type=str,
    default="assets/1_input.jpg",
    help="Input image path",
)
parser.add_argument(
    "--output_path",
    type=str,
    default="assets/1_output.jpg",
    help="Output image path",
)
args = parser.parse_args()
device = "cuda"

# load model for edit instruction estimation
model = InstructCLIP.from_pretrained(args.pretrained_instructclip_name_or_path)
model = model.to(device).eval()

# load model to preprocess/encode image to latent space
tokenizer, _, vae, _, _ = get_sd_components(args, device, torch.float32)

# prepare image input: resize to 512x512 and scale pixel values to [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5], std=[0.5]),
])
image_list = [args.input_path, args.output_path]
image_list = [
    # convert to RGB so the VAE always receives 3 channels
    transform(Image.open(f).convert("RGB").resize((512, 512))).unsqueeze(0).to(device)
    for f in image_list
]

with torch.no_grad():
    # encode both images into the VAE latent space
    image_list = [vae.encode(x).latent_dist.sample() * vae.config.scaling_factor for x in image_list]

    # get image feature
    zero_timesteps = torch.zeros_like(torch.tensor([0])).to(device)
    img_feat = model.get_image_features(
        inp=image_list[0], out=image_list[1], inp_t=zero_timesteps, out_t=zero_timesteps)
    img_feat = normalize(img_feat)

    # get edit instruction
    pred_instruct_input_ids = model.text_decoder.infer(img_feat[:1])[0]
    pred_instruct = tokenizer.decode(pred_instruct_input_ids, skip_special_tokens=True)

print(pred_instruct)  # as a 3 d sculpture
```
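
The preprocessing step in the example maps 8-bit pixel values into the [-1, 1] range before VAE encoding: `ToTensor()` scales values to [0, 1], and `Normalize(mean=[0.5], std=[0.5])` then computes `(x - 0.5) / 0.5`. A minimal, dependency-free sketch of that arithmetic for a single pixel value follows; the helper name `to_model_range` is hypothetical and not part of the repository.

```python
def to_model_range(pixel: int) -> float:
    """Mimic ToTensor() followed by Normalize(mean=0.5, std=0.5) for one 8-bit value."""
    if not 0 <= pixel <= 255:
        raise ValueError("expected an 8-bit pixel value in [0, 255]")
    scaled = pixel / 255.0       # ToTensor(): [0, 255] -> [0.0, 1.0]
    return (scaled - 0.5) / 0.5  # Normalize(): [0.0, 1.0] -> [-1.0, 1.0]


for value in (0, 128, 255):
    print(value, round(to_model_range(value), 4))
```

Black (0) maps to -1.0 and white (255) maps to 1.0; mid-grey lands just above zero because 128/255 is slightly more than one half.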

## Citation
```bibtex
@misc{chen2025instructclipimprovinginstructionguidedimage,
      title={Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning},
      author={Sherry X. Chen and Misha Sra and Pradeep Sen},
      year={2025},
      eprint={2503.18406},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.18406},
}
```