smarques committed on
Commit fd0db3a · 1 Parent(s): 6022933

checkout InstantDrag

InstDrag/.gitignore ADDED
@@ -0,0 +1,163 @@
1
+ demo/results
2
+ demo/checkpoints
3
+
4
+ # Byte-compiled / optimized / DLL files
5
+ __pycache__/
6
+ *.py[cod]
7
+ *$py.class
8
+
9
+ # C extensions
10
+ *.so
11
+
12
+ # Distribution / packaging
13
+ .Python
14
+ build/
15
+ develop-eggs/
16
+ dist/
17
+ downloads/
18
+ eggs/
19
+ .eggs/
20
+ lib/
21
+ lib64/
22
+ parts/
23
+ sdist/
24
+ var/
25
+ wheels/
26
+ share/python-wheels/
27
+ *.egg-info/
28
+ .installed.cfg
29
+ *.egg
30
+ MANIFEST
31
+
32
+ # PyInstaller
33
+ # Usually these files are written by a python script from a template
34
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
35
+ *.manifest
36
+ *.spec
37
+
38
+ # Installer logs
39
+ pip-log.txt
40
+ pip-delete-this-directory.txt
41
+
42
+ # Unit test / coverage reports
43
+ htmlcov/
44
+ .tox/
45
+ .nox/
46
+ .coverage
47
+ .coverage.*
48
+ .cache
49
+ nosetests.xml
50
+ coverage.xml
51
+ *.cover
52
+ *.py,cover
53
+ .hypothesis/
54
+ .pytest_cache/
55
+ cover/
56
+
57
+ # Translations
58
+ *.mo
59
+ *.pot
60
+
61
+ # Django stuff:
62
+ *.log
63
+ local_settings.py
64
+ db.sqlite3
65
+ db.sqlite3-journal
66
+
67
+ # Flask stuff:
68
+ instance/
69
+ .webassets-cache
70
+
71
+ # Scrapy stuff:
72
+ .scrapy
73
+
74
+ # Sphinx documentation
75
+ docs/_build/
76
+
77
+ # PyBuilder
78
+ .pybuilder/
79
+ target/
80
+
81
+ # Jupyter Notebook
82
+ .ipynb_checkpoints
83
+
84
+ # IPython
85
+ profile_default/
86
+ ipython_config.py
87
+
88
+ # pyenv
89
+ # For a library or package, you might want to ignore these files since the code is
90
+ # intended to run in multiple environments; otherwise, check them in:
91
+ # .python-version
92
+
93
+ # pipenv
94
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
95
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
96
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
97
+ # install all needed dependencies.
98
+ #Pipfile.lock
99
+
100
+ # poetry
101
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
102
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
103
+ # commonly ignored for libraries.
104
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
105
+ #poetry.lock
106
+
107
+ # pdm
108
+ # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
109
+ #pdm.lock
110
+ # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
111
+ # in version control.
112
+ # https://pdm.fming.dev/#use-with-ide
113
+ .pdm.toml
114
+
115
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
116
+ __pypackages__/
117
+
118
+ # Celery stuff
119
+ celerybeat-schedule
120
+ celerybeat.pid
121
+
122
+ # SageMath parsed files
123
+ *.sage.py
124
+
125
+ # Environments
126
+ .env
127
+ .venv
128
+ env/
129
+ venv/
130
+ ENV/
131
+ env.bak/
132
+ venv.bak/
133
+
134
+ # Spyder project settings
135
+ .spyderproject
136
+ .spyproject
137
+
138
+ # Rope project settings
139
+ .ropeproject
140
+
141
+ # mkdocs documentation
142
+ /site
143
+
144
+ # mypy
145
+ .mypy_cache/
146
+ .dmypy.json
147
+ dmypy.json
148
+
149
+ # Pyre type checker
150
+ .pyre/
151
+
152
+ # pytype static type analyzer
153
+ .pytype/
154
+
155
+ # Cython debug symbols
156
+ cython_debug/
157
+
158
+ # PyCharm
159
+ # JetBrains specific template is maintained in a separate JetBrains.gitignore that can
160
+ # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
161
+ # and can be added to the global gitignore or merged into this file. For a more nuclear
162
+ # option (not recommended) you can uncomment the following to ignore the entire idea folder.
163
+ #.idea/
InstDrag/README.md ADDED
@@ -0,0 +1,76 @@
1
+ # InstantDrag
2
+
3
+ <p align="center">
4
+ <img src="assets/demo.gif" alt="Demo video">
5
+ </p>
6
+
7
+ <br/>
8
+
9
+ Official implementation of the paper **"InstantDrag: Improving Interactivity in Drag-based Image Editing"** (SIGGRAPH Asia 2024).
10
+
11
+ <p align="center">
12
+ <a href="https://arxiv.org/abs/2409.08857"><img src="https://img.shields.io/badge/arxiv-2409.08857-b31b1b"></a>
13
+ <a href="https://joonghyuk.com/instantdrag-web/"><img src="https://img.shields.io/badge/Project%20Page-InstantDrag-blue"></a>
14
+ <a href="https://huggingface.co/alex4727/InstantDrag"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-forestgreen"></a>
15
+ </p>
16
+
17
+ ---
18
+
19
+ ## Setup
20
+
21
+ 1. Create and activate a conda environment:
22
+ ```bash
23
+ conda create -n instantdrag python=3.10 -y
24
+ conda activate instantdrag
25
+ ```
26
+
27
+ 2. Install PyTorch:
28
+ ```bash
29
+ pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
30
+ ```
31
+
32
+ 3. Install other dependencies:
33
+ ```bash
34
+ pip install transformers==4.44.2 diffusers==0.30.1 accelerate==0.33.0 gradio==4.44.0 opencv-python
35
+ ```
36
+ **Note:** Exact version matching may not be necessary for all dependencies.
37
+
38
+ ## Demo
39
+
40
+ To run the demo:
41
+ ```bash
42
+ cd demo/
43
+ CUDA_VISIBLE_DEVICES=0 python run_demo.py
44
+ ```
45
+ ### Disclaimer
46
+
47
+ - Our **base** models are trained **solely** on real-world talking-head (facial) videos, with a focus on achieving **fast, fine-grained facial editing without metadata**. The preliminary signs of generalization to other types of scenes, without fine-tuning, should be considered an experimental byproduct and may not hold in many cases. Please see Appendix A of our paper for more information.
48
+ - This is a research project, **NOT** a commercial product. Use at your own risk.
49
+
50
+ ### Usage Instructions & Tips
51
+
52
+ - Upload and preprocess an image using the Gradio interface.
53
+ - Click to define source and target point pairs on the image.
54
+ - Adjust settings in the "Configs" tab.
55
+ - We provide two checkpoints for FlowGen: config-2 (default, used for most figures in the paper) and config-3 (used for the benchmark table in the paper). We generally recommend config-2 for most cases, including drags based on a few keypoints. For extremely fine-grained editing with many drags (e.g., the 68-keypoint drags used in the benchmark), config-3 may be better suited as it produces more local movements.
56
+ - If the image moves too much or too little, try adjusting the image or flow guidance scales (values of 1-2 usually work well; the flow guidance scale can be set higher).
57
+ - If you observe loss of identity or noisy artifacts, increasing the image guidance scale or the number of sampling steps can help (a [1.75, 1.5] guidance setting is also a good choice for facial images).
58
+ - Click `Run` to perform the editing.
59
+ - We recommend first viewing the example videos (on the project page or in the demo .gif) and the paper figures to understand the model's capabilities. Then begin with facial images and fine-grained keypoint drags before progressing to more complex motions.
60
+ - As noted in the paper, our model may struggle with large motions that exceed the capabilities of the optical flow estimation networks used for training data extraction.
61
+ - Notes on FlowGen Output Scale
62
+ - In many cases, especially for unseen domains, FlowGen's output does not precisely span the -1 to 1 range expected by FlowDiffusion's fixed-size normalization process. For all figures and benchmarks in our paper, we applied a static multiplier of 2, chosen from observations, to bring FlowGen's output into the expected range. However, we found that forcefully rescaling the output to -1 to 1 also works well, so we set this as the default behavior (when the value is -1). While not recommended, you can manually modify this value to scale FlowGen's output before it is fed to FlowDiffusion, producing larger or smaller motions.
63
+
64
+ **Note:** The initial run may take longer as the models are loaded onto the GPU.
65
+
66
+ ## BibTeX
67
+ If you find this work useful, please cite it as below!
68
+ ```
69
+ @inproceedings{shin2024instantdrag,
70
+ title = {{InstantDrag: Improving Interactivity in Drag-based Image Editing}},
71
+ author = {Shin, Joonghyuk and Choi, Daehyeon and Park, Jaesik},
72
+ booktitle = {ACM SIGGRAPH Asia 2024 Conference Proceedings},
73
+ year = {2024},
74
+ pages = {1--10},
75
+ }
76
+ ```
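For readers who prefer to script the editing step instead of clicking through Gradio, the `InstantDragPipeline` class added in `demo/demo_utils.py` below can be driven directly. A minimal sketch, run from the `demo/` directory: the checkpoint names and the click coordinates here are placeholders (the real filenames are whatever `snapshot_download` places under `demo/checkpoints/`), not values defined by this commit.

```python
import numpy as np
import torch
from PIL import Image

from demo_utils import InstantDragPipeline, LENGTH  # LENGTH == 512, the demo's working resolution

pipeline = InstantDragPipeline(seed=42, device="cuda", dtype=torch.float16)

# Square RGB input, resized to the 512x512 canvas the demo operates on.
image = np.array(Image.open("samples/monalisa.jpg").convert("RGB").resize((LENGTH, LENGTH)))

# Points alternate source, target in (x, y) pixel coordinates; one drag is shown here.
selected_points = [[256, 300], [256, 260]]  # hypothetical coordinates

edited = pipeline.run(
    image,
    selected_points,
    flowgen_ckpt="flowgen-config2.safetensors",  # placeholder file name inside demo/checkpoints/
    flowdiffusion_ckpt="flowdiffusion",          # placeholder directory name inside demo/checkpoints/
    image_guidance=1.5,
    flow_guidance=1.5,
    flowgen_output_scale=-1.0,
    num_steps=20,
    save_results=False,
)
Image.fromarray(edited).save("edited.png")
```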
InstDrag/demo/demo_utils.py ADDED
@@ -0,0 +1,242 @@
1
+ import sys
2
+ sys.path.append("../")
3
+
4
+ import os
5
+ import re
6
+ import time
7
+ import datetime
8
+ from copy import deepcopy
9
+
10
+ import numpy as np
11
+ import cv2
12
+ import torch
13
+ import torch.nn.functional as F
14
+ import gradio as gr
15
+ from PIL import Image
16
+ from PIL.ImageOps import exif_transpose
17
+ from safetensors.torch import load_file
18
+
19
+ from utils.flow_utils import flow_to_image, resize_flow
20
+ from flowgen.models import UnetGenerator
21
+ from flowdiffusion.pipeline import FlowDiffusionPipeline
22
+
23
+ LENGTH = 512
24
+ FLOWGAN_RESOLUTION = [256, 256] # HxW
25
+ FLOWDIFFUSION_RESOLUTION = [512, 512] # HxW
26
+
27
+ def process_img(image):
28
+ if image["composite"] is not None and not np.all(image["composite"] == 0):
29
+ original_image = Image.fromarray(image["composite"]).resize((LENGTH, LENGTH), Image.BICUBIC)
30
+ original_image = np.array(exif_transpose(original_image))
31
+ return original_image, [], gr.Image(value=deepcopy(original_image), interactive=False)
32
+ else:
33
+ return (
34
+ gr.Image(value=None, interactive=False),
35
+ [],
36
+ gr.Image(value=None, interactive=False),
37
+ )
38
+
39
+ def get_points(img, sel_pix, evt: gr.SelectData):
40
+ sel_pix.append(evt.index)
41
+ print(sel_pix)
42
+ points = []
43
+ for idx, point in enumerate(sel_pix):
44
+ if idx % 2 == 0:
45
+ cv2.circle(img, tuple(point), 4, (255, 0, 0), -1)
46
+ else:
47
+ cv2.circle(img, tuple(point), 4, (0, 0, 255), -1)
48
+ points.append(tuple(point))
49
+ if len(points) == 2:
50
+ cv2.arrowedLine(img, points[0], points[1], (255, 255, 255), 2, tipLength=0.5)
51
+ points = []
52
+ img = img if isinstance(img, np.ndarray) else np.array(img)
53
+ return img
54
+
55
+ def display_points(img, predefined_points, save_results):
56
+ selected_points = []  # fall back to an empty list when no predefined points are given
+ if predefined_points != "":
57
+ predefined_points = predefined_points.split()
58
+ predefined_points = [int(re.sub(r'[^0-9]', '', point)) for point in predefined_points]
59
+ processed_points = []
60
+ for i, point in enumerate(predefined_points):
61
+ if i % 2 == 0:
62
+ processed_points.append([point, predefined_points[i+1]])
63
+ selected_points = processed_points
64
+
65
+ print(selected_points)
66
+ points = []
67
+ for idx, point in enumerate(selected_points):
68
+ if idx % 2 == 0:
69
+ cv2.circle(img, tuple(point), 4, (255, 0, 0), -1)
70
+ else:
71
+ cv2.circle(img, tuple(point), 4, (0, 0, 255), -1)
72
+ points.append(tuple(point))
73
+ if len(points) == 2:
74
+ cv2.arrowedLine(img, points[0], points[1], (255, 255, 255), 2, tipLength=0.5)
75
+ points = []
76
+ img = img if isinstance(img, np.ndarray) else np.array(img)
77
+
78
+ if save_results:
79
+ if not os.path.isdir("results/drag_inst_viz"):
80
+ os.makedirs("results/drag_inst_viz")
81
+ save_prefix = datetime.datetime.now().strftime("%Y-%m-%d-%H%M-%S")
82
+ to_save_img = Image.fromarray(img)
83
+ to_save_img.save(f"results/drag_inst_viz/{save_prefix}.png")
84
+
85
+ return img
86
+
87
+ def undo_points_image(original_image):
88
+ if original_image is not None:
89
+ return original_image, []
90
+ else:
91
+ return gr.Image(value=None, interactive=False), []
92
+
93
+ def clear_all():
94
+ return (
95
+ gr.Image(value=None, interactive=True),
96
+ gr.Image(value=None, interactive=False),
97
+ gr.Image(value=None, interactive=False),
98
+ [],
99
+ None
100
+ )
101
+
102
+ class InstantDragPipeline:
103
+ def __init__(self, seed=9999, device="cuda", dtype=torch.float16):
104
+ self.seed = seed
105
+ self.device = device
106
+ self.dtype = dtype
107
+ self.generator = torch.Generator(device=device).manual_seed(seed)
108
+ self.flowgen_ckpt, self.flowdiffusion_ckpt = None, None
109
+ self.model_config = dict()
110
+
111
+ def build_model(self):
112
+ print("Building model...")
113
+ if self.flowgen_ckpt != self.model_config["flowgen_ckpt"]:
114
+ self.flowgen = UnetGenerator(input_nc=5, output_nc=2)
115
+ self.flowgen.load_state_dict(
116
+ load_file(os.path.join("checkpoints/", self.model_config["flowgen_ckpt"]), device="cpu")
117
+ )
118
+ self.flowgen.to(self.device)
119
+ self.flowgen.eval()
120
+ self.flowgen_ckpt = self.model_config["flowgen_ckpt"]
121
+
122
+ if self.flowdiffusion_ckpt != self.model_config["flowdiffusion_ckpt"]:
123
+ self.flowdiffusion = FlowDiffusionPipeline.from_pretrained(
124
+ os.path.join("checkpoints/", self.model_config["flowdiffusion_ckpt"]),
125
+ torch_dtype=self.dtype,
126
+ safety_checker=None
127
+ )
128
+ self.flowdiffusion.to(self.device)
129
+ self.flowdiffusion_ckpt = self.model_config["flowdiffusion_ckpt"]
130
+
131
+ def drag(self, original_image, selected_points, save_results):
132
+ scale = self.model_config["flowgen_output_scale"]
133
+ original_image = torch.tensor(original_image).permute(2, 0, 1).unsqueeze(0).float() # 1, 3, 512, 512
134
+ original_image = 2 * (original_image / 255.) - 1 # Normalize to [-1, 1]
135
+ original_image = original_image.to(self.device)
136
+
137
+ source_points = []
138
+ target_points = []
139
+ for idx, point in enumerate(selected_points):
140
+ cur_point = torch.tensor([point[0], point[1]]) # x, y
141
+ if idx % 2 == 0:
142
+ source_points.append(cur_point)
143
+ else:
144
+ target_points.append(cur_point)
145
+
146
+ torch.cuda.synchronize()
147
+ start_time = time.time()
148
+
149
+ # Generate sparse flow vectors
150
+ point_vector_map = torch.zeros((1, 2, LENGTH, LENGTH))
151
+ for source_point, target_point in zip(source_points, target_points):
152
+ cur_x, cur_y = source_point[0], source_point[1]
153
+ target_x, target_y = target_point[0], target_point[1]
154
+ vec_x = target_x - cur_x
155
+ vec_y = target_y - cur_y
156
+ point_vector_map[0, 0, int(cur_y), int(cur_x)] = vec_x
157
+ point_vector_map[0, 1, int(cur_y), int(cur_x)] = vec_y
158
+ point_vector_map = point_vector_map.to(self.device)
159
+
160
+ # Sample-wise normalize the flow vectors
161
+ factor_x = torch.amax(torch.abs(point_vector_map[:, 0, :, :]), dim=(1, 2)).view(-1, 1, 1).to(self.device)
162
+ factor_y = torch.amax(torch.abs(point_vector_map[:, 1, :, :]), dim=(1, 2)).view(-1, 1, 1).to(self.device)
163
+ if factor_x >= 1e-8: # Avoid division by zero
164
+ point_vector_map[:, 0, :, :] /= factor_x
165
+ if factor_y >= 1e-8: # Avoid division by zero
166
+ point_vector_map[:, 1, :, :] /= factor_y
167
+
168
+ with torch.inference_mode():
169
+ gan_input_image = F.interpolate(original_image, size=FLOWGAN_RESOLUTION, mode="bicubic") # 256 x 256
170
+ point_vector_map = F.interpolate(point_vector_map, size=FLOWGAN_RESOLUTION, mode="bicubic") # 256 x 256
171
+ gan_input = torch.cat([gan_input_image, point_vector_map], dim=1)
172
+ flow = self.flowgen(gan_input) # -1 ~ 1
173
+
174
+ if scale == -1.0:
175
+ flow[:, 0, :, :] *= 1.0 / torch.amax(torch.abs(flow[:, 0, :, :]), dim=(1, 2)).view(-1, 1, 1) # force the range to be [-1 ~ 1]
176
+ flow[:, 1, :, :] *= 1.0 / torch.amax(torch.abs(flow[:, 1, :, :]), dim=(1, 2)).view(-1, 1, 1) # force the range to be [-1 ~ 1]
177
+ else:
178
+ flow[:, 0, :, :] *= scale # manually adjust the scale
179
+ flow[:, 1, :, :] *= scale # manually adjust the scale
180
+
181
+ if factor_x >= 1e-8:
182
+ flow[:, 0, :, :] *= factor_x * (FLOWGAN_RESOLUTION[1] / original_image.shape[3]) # width
183
+ else:
184
+ flow[:, 0, :, :] *= 0
185
+ if factor_y >= 1e-8:
186
+ flow[:, 1, :, :] *= factor_y * (FLOWGAN_RESOLUTION[0] / original_image.shape[2]) # height
187
+ else:
188
+ flow[:, 1, :, :] *= 0
189
+
190
+ resized_flow = resize_flow(flow, (FLOWDIFFUSION_RESOLUTION[0]//8, FLOWDIFFUSION_RESOLUTION[1]//8), scale_type="normalize_fixed")
191
+
192
+ kwargs = {
193
+ "image": original_image.to(self.dtype),
194
+ "flow": resized_flow.to(self.dtype),
195
+ "num_inference_steps": self.model_config['n_inference_step'],
196
+ "image_guidance_scale": self.model_config['image_guidance'],
197
+ "flow_guidance_scale": self.model_config['flow_guidance'],
198
+ "generator": self.generator,
199
+ }
200
+ edited_image = self.flowdiffusion(**kwargs).images[0]
201
+
202
+ end_time = time.time()
203
+ inference_time = end_time - start_time
204
+ print(f"Inference Time: {inference_time} seconds")
205
+
206
+ if save_results:
207
+ save_prefix = datetime.datetime.now().strftime("%Y-%m-%d-%H%M-%S")
208
+ if not os.path.isdir("results/flows"):
209
+ os.makedirs("results/flows")
210
+ np.save(f"results/flows/{save_prefix}.npy", flow[0].detach().cpu().numpy())
211
+ if not os.path.isdir("results/flow_visualized"):
212
+ os.makedirs("results/flow_visualized")
213
+ flow_to_image(flow[0].detach()).save(f"results/flow_visualized/{save_prefix}.png")
214
+ if not os.path.isdir("results/edited_images"):
215
+ os.makedirs("results/edited_images")
216
+ edited_image.save(f"results/edited_images/{save_prefix}.png")
217
+ if not os.path.isdir("results/drag_instructions"):
218
+ os.makedirs("results/drag_instructions")
219
+ with open(f"results/drag_instructions/{save_prefix}.txt", "w") as f:
220
+ f.write(str(selected_points))
221
+
222
+ edited_image = np.array(edited_image)
223
+ return edited_image
224
+
225
+ def run(self, original_image, selected_points,
226
+ flowgen_ckpt, flowdiffusion_ckpt, image_guidance, flow_guidance, flowgen_output_scale,
227
+ num_steps, save_results):
228
+
229
+ self.model_config = {
230
+ "flowgen_ckpt": flowgen_ckpt,
231
+ "flowdiffusion_ckpt": flowdiffusion_ckpt,
232
+ "image_guidance": image_guidance,
233
+ "flow_guidance": flow_guidance,
234
+ "flowgen_output_scale": flowgen_output_scale,
235
+ "n_inference_step": num_steps
236
+ }
237
+
238
+ self.build_model()
239
+
240
+ edited_image = self.drag(original_image, selected_points, save_results)
241
+
242
+ return edited_image
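The heart of `drag()` above is the conversion of clicked point pairs into a sparse optical-flow map that FlowGen consumes. A compact, standalone restatement of that step (same alternating source/target convention; the function name and example coordinates are ours, not part of the repo) may help when preparing drag instructions programmatically:

```python
import torch

def sparse_flow_map(points, size=512):
    """Mirror of the sparse-flow construction in InstantDragPipeline.drag(): a 1x2xHxW map
    that is zero everywhere except at each source pixel, which stores the (dx, dy)
    displacement toward its paired target, normalized per axis to [-1, 1]."""
    flow = torch.zeros((1, 2, size, size))
    for src, tgt in zip(points[0::2], points[1::2]):   # points alternate source, target in (x, y)
        flow[0, 0, src[1], src[0]] = tgt[0] - src[0]   # horizontal displacement stored at the source pixel
        flow[0, 1, src[1], src[0]] = tgt[1] - src[1]   # vertical displacement stored at the source pixel
    for c in range(2):                                 # sample-wise, per-axis normalization
        m = flow[:, c].abs().amax(dim=(1, 2)).view(-1, 1, 1)
        if m >= 1e-8:                                  # skip all-zero channels, as drag() does
            flow[:, c] /= m
    return flow

# One drag from (100, 200) to (140, 200): a single +1 entry in the x-channel at the source pixel.
print(sparse_flow_map([[100, 200], [140, 200]]).nonzero())
```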
InstDrag/demo/run_demo.py ADDED
@@ -0,0 +1,226 @@
1
+ import os
2
+ import torch
3
+ import gradio as gr
4
+ from huggingface_hub import snapshot_download
5
+ os.makedirs("checkpoints", exist_ok=True)
6
+ snapshot_download("alex4727/InstantDrag", local_dir="./checkpoints")
7
+
8
+ from demo_utils import (
9
+ process_img,
10
+ get_points,
11
+ undo_points_image,
12
+ clear_all,
13
+ InstantDragPipeline,
14
+ )
15
+
16
+ LENGTH = 480 # Length of the square area displaying/editing images
17
+
18
+ with gr.Blocks() as demo:
19
+ pipeline = InstantDragPipeline(seed=42, device="cuda", dtype=torch.float16)
20
+
21
+ # Layout definition
22
+ with gr.Row():
23
+ gr.Markdown(
24
+ """
25
+ # InstantDrag: Improving Interactivity in Drag-based Image Editing
26
+ """
27
+ )
28
+
29
+ with gr.Tab(label="InstantDrag Demo"):
30
+ selected_points = gr.State([]) # Store points
31
+ original_image = gr.State(value=None) # Store original input image
32
+
33
+ with gr.Row():
34
+ # Upload & Preprocess Image Column
35
+ with gr.Column():
36
+ gr.Markdown(
37
+ """<p style="text-align: center; font-size: 20px">Upload & Preprocess Image</p>"""
38
+ )
39
+ canvas = gr.ImageEditor(
40
+ height=LENGTH,
41
+ width=LENGTH,
42
+ type="numpy",
43
+ image_mode="RGB",
44
+ label="Preprocess Image",
45
+ show_label=True,
46
+ interactive=True,
47
+ )
48
+ with gr.Row():
49
+ save_results = gr.Checkbox(
50
+ value=False,
51
+ label="Save Results",
52
+ scale=1,
53
+ )
54
+ undo_button = gr.Button("Undo Clicked Points", scale=3)
55
+
56
+ # Click Points Column
57
+ with gr.Column():
58
+ gr.Markdown(
59
+ """<p style="text-align: center; font-size: 20px">Click Points</p>"""
60
+ )
61
+ input_image = gr.Image(
62
+ type="numpy",
63
+ label="Click Points",
64
+ show_label=True,
65
+ height=LENGTH,
66
+ width=LENGTH,
67
+ interactive=False,
68
+ show_fullscreen_button=False,
69
+ )
70
+ with gr.Row():
71
+ run_button = gr.Button("Run")
72
+
73
+ # Editing Results Column
74
+ with gr.Column():
75
+ gr.Markdown(
76
+ """<p style="text-align: center; font-size: 20px">Editing Results</p>"""
77
+ )
78
+ edited_image = gr.Image(
79
+ type="numpy",
80
+ label="Editing Results",
81
+ show_label=True,
82
+ height=LENGTH,
83
+ width=LENGTH,
84
+ interactive=False,
85
+ show_fullscreen_button=False,
86
+ )
87
+ with gr.Row():
88
+ clear_all_button = gr.Button("Clear All")
89
+
90
+ with gr.Tab("Configs - make sure to check README for details"):
91
+ with gr.Row():
92
+ with gr.Column():
93
+ with gr.Row():
94
+ flowgen_choices = sorted(
95
+ [model for model in os.listdir("checkpoints/") if "flowgen" in model]
96
+ )
97
+ flowgen_ckpt = gr.Dropdown(
98
+ value=flowgen_choices[0],
99
+ label="Select FlowGen to use",
100
+ choices=flowgen_choices,
101
+ info="config2 for most cases, config3 for more fine-grained dragging",
102
+ scale=2,
103
+ )
104
+ flowdiffusion_choices = sorted(
105
+ [model for model in os.listdir("checkpoints/") if "flowdiffusion" in model]
106
+ )
107
+ flowdiffusion_ckpt = gr.Dropdown(
108
+ value=flowdiffusion_choices[0],
109
+ label="Select FlowDiffusion to use",
110
+ choices=flowdiffusion_choices,
111
+ info="single model for all cases",
112
+ scale=1,
113
+ )
114
+ image_guidance = gr.Number(
115
+ value=1.5,
116
+ label="Image Guidance Scale",
117
+ precision=2,
118
+ step=0.1,
119
+ scale=1,
120
+ info="typically between 1.0-2.0.",
121
+ )
122
+ flow_guidance = gr.Number(
123
+ value=1.5,
124
+ label="Flow Guidance Scale",
125
+ precision=2,
126
+ step=0.1,
127
+ scale=1,
128
+ info="typically between 1.0-5.0",
129
+ )
130
+ num_steps = gr.Number(
131
+ value=20,
132
+ label="Inference Steps",
133
+ precision=0,
134
+ step=1,
135
+ scale=1,
136
+ info="typically between 20-50, 20 is usually enough",
137
+ )
138
+ flowgen_output_scale = gr.Number(
139
+ value=-1.0,
140
+ label="FlowGen Output Scale",
141
+ precision=1,
142
+ step=0.1,
143
+ scale=2,
144
+ info="-1.0, by default, forces flowgen's output to [-1, 1], could be adjusted to [0, ∞] for stronger/weaker effects",
145
+ )
146
+
147
+ gr.Markdown(
148
+ """
149
+ <p style="text-align: center; font-size: 18px;">Examples</p>
150
+ """
151
+ )
152
+ with gr.Row():
153
+ gr.Examples(
154
+ examples=[
155
+ "samples/airplane.jpg",
156
+ "samples/anime.jpg",
157
+ "samples/caligraphy.jpg",
158
+ "samples/crocodile.jpg",
159
+ "samples/elephant.jpg",
160
+ "samples/meteor.jpg",
161
+ "samples/monalisa.jpg",
162
+ "samples/portrait.jpg",
163
+ "samples/sketch.jpg",
164
+ "samples/surreal.jpg",
165
+ ],
166
+ inputs=[canvas],
167
+ outputs=[original_image, selected_points, input_image],
168
+ fn=process_img,
169
+ cache_examples=False,
170
+ examples_per_page=10,
171
+ )
172
+ gr.Markdown(
173
+ """
174
+ <p style="text-align: center; font-size: 9">[Important] Our base models are solely trained on real-world talking head (facial) videos, with a focus on achieving fine-grained facial editing. <br>
175
+ Their application to other types of scenes, without fine-tuning, should be considered more of an experimental byproduct and may not perform well in many cases (we currently support only square images).</p>
176
+ """
177
+ )
178
+
179
+ # Event Handlers
180
+ canvas.change(
181
+ process_img,
182
+ [canvas],
183
+ [original_image, selected_points, input_image],
184
+ )
185
+
186
+ input_image.select(
187
+ get_points,
188
+ [input_image, selected_points],
189
+ [input_image],
190
+ )
191
+
192
+ undo_button.click(
193
+ undo_points_image,
194
+ [original_image],
195
+ [input_image, selected_points],
196
+ )
197
+
198
+ run_button.click(
199
+ pipeline.run,
200
+ [
201
+ original_image,
202
+ selected_points,
203
+ flowgen_ckpt,
204
+ flowdiffusion_ckpt,
205
+ image_guidance,
206
+ flow_guidance,
207
+ flowgen_output_scale,
208
+ num_steps,
209
+ save_results,
210
+ ],
211
+ [edited_image],
212
+ )
213
+
214
+ clear_all_button.click(
215
+ clear_all,
216
+ [],
217
+ [
218
+ canvas,
219
+ input_image,
220
+ edited_image,
221
+ selected_points,
222
+ original_image,
223
+ ],
224
+ )
225
+
226
+ demo.queue().launch(share=False, debug=True)
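If the demo needs to be reachable from another machine, Gradio's `launch()` accepts the usual networking options; a hedged variant of the last line above (the port and bind address are arbitrary choices, not project defaults):

```python
# Replace the final line of run_demo.py with something like this to expose the demo on the local network.
demo.queue().launch(server_name="0.0.0.0", server_port=7860, share=False, debug=True)
```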
InstDrag/demo/samples/airplane.jpg ADDED
InstDrag/demo/samples/anime.jpg ADDED
InstDrag/demo/samples/caligraphy.jpg ADDED
InstDrag/demo/samples/crocodile.jpg ADDED
InstDrag/demo/samples/elephant.jpg ADDED
InstDrag/demo/samples/meteor.jpg ADDED
InstDrag/demo/samples/monalisa.jpg ADDED
InstDrag/demo/samples/portrait.jpg ADDED
InstDrag/demo/samples/sketch.jpg ADDED
InstDrag/demo/samples/surreal.jpg ADDED
InstDrag/flowdiffusion/pipeline.py ADDED
@@ -0,0 +1,495 @@
1
+ # This file is partially based on the diffusers library, which licensed the code under the following license:
2
+
3
+ # Copyright 2024 The HuggingFace Team. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+
17
+ import inspect
18
+ from typing import Any, Callable, Dict, List, Optional, Union
19
+ import os
20
+ from pathlib import Path
21
+
22
+ import PIL.Image
23
+ import torch
24
+ from transformers import CLIPImageProcessor
25
+
26
+ from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
27
+ from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
28
+ from diffusers.loaders import StableDiffusionLoraLoaderMixin
29
+ from diffusers.models import AutoencoderKL, UNet2DConditionModel
30
+ from diffusers.schedulers import KarrasDiffusionSchedulers
31
+ from diffusers.utils import deprecate, logging
32
+ from diffusers.utils.torch_utils import randn_tensor
33
+ from diffusers.pipelines.pipeline_utils import DiffusionPipeline, StableDiffusionMixin
34
+ from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
35
+ from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker
36
+
37
+
38
+ logger = logging.get_logger(__name__) # pylint: disable=invalid-name
39
+
40
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_img2img.retrieve_latents
41
+ def retrieve_latents(
42
+ encoder_output: torch.Tensor, generator: Optional[torch.Generator] = None, sample_mode: str = "sample"
43
+ ):
44
+ if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
45
+ return encoder_output.latent_dist.sample(generator)
46
+ elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
47
+ return encoder_output.latent_dist.mode()
48
+ elif hasattr(encoder_output, "latents"):
49
+ return encoder_output.latents
50
+ else:
51
+ raise AttributeError("Could not access latents of provided encoder_output")
52
+
53
+
54
+ class FlowDiffusionPipeline(
55
+ DiffusionPipeline,
56
+ StableDiffusionMixin,
57
+ StableDiffusionLoraLoaderMixin,
58
+ ):
59
+ r"""
60
+ Pipeline for pixel-level image editing given optical flow as condition.
61
+
62
+ This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
63
+ implemented for all pipelines (downloading, saving, running on a particular device, etc.).
64
+
65
+ The pipeline also inherits the following loading methods:
66
+ - [`~loaders.StableDiffusionLoraLoaderMixin.load_lora_weights`] for loading LoRA weights
67
+ - [`~loaders.StableDiffusionLoraLoaderMixin.save_lora_weights`] for saving LoRA weights
68
+
69
+ Args:
70
+ vae ([`AutoencoderKL`]):
71
+ Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
72
+ unet ([`UNet2DConditionModel`]):
73
+ A `UNet2DConditionModel` to denoise the encoded image latents.
74
+ scheduler ([`SchedulerMixin`]):
75
+ A scheduler to be used in combination with `unet` to denoise the encoded image latents.
76
+ safety_checker ([`StableDiffusionSafetyChecker`]):
77
+ Classification module that estimates whether generated images could be considered offensive or harmful.
78
+ Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details
79
+ about a model's potential harms.
80
+ feature_extractor ([`~transformers.CLIPImageProcessor`]):
81
+ A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`.
82
+ """
83
+
84
+ model_cpu_offload_seq = "unet->vae"
85
+ _optional_components = ["safety_checker", "feature_extractor"]
86
+ _exclude_from_cpu_offload = ["safety_checker"]
87
+ _callback_tensor_inputs = ["latents", "image_latents"]
88
+
89
+ def __init__(
90
+ self,
91
+ vae: AutoencoderKL,
92
+ unet: UNet2DConditionModel,
93
+ scheduler: KarrasDiffusionSchedulers,
94
+ safety_checker: StableDiffusionSafetyChecker,
95
+ feature_extractor: CLIPImageProcessor,
96
+ requires_safety_checker: bool = False,
97
+ null_prompt: str = "../utils/null_prompt.pt"
98
+ ):
99
+ super().__init__()
100
+
101
+ if safety_checker is None and requires_safety_checker:
102
+ logger.warning(
103
+ f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure"
104
+ " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered"
105
+ " results in services or applications open to the public. Both the diffusers team and Hugging Face"
106
+ " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling"
107
+ " it only for use-cases that involve analyzing network behavior or auditing its results. For more"
108
+ " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ."
109
+ )
110
+
111
+ if safety_checker is not None and feature_extractor is None:
112
+ raise ValueError(
113
+ "Make sure to define a feature extractor when loading {self.__class__} if you want to use the safety"
114
+ " checker. If you do not want to use the safety checker, you can pass `'safety_checker=None'` instead."
115
+ )
116
+
117
+ self.register_modules(
118
+ vae=vae,
119
+ unet=unet,
120
+ scheduler=scheduler,
121
+ safety_checker=safety_checker,
122
+ feature_extractor=feature_extractor,
123
+ )
124
+ self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
125
+ self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)
126
+ self.register_to_config(requires_safety_checker=requires_safety_checker)
127
+ self.null_prompt_embeds = torch.load(os.path.join(Path(__file__).parent.absolute(), null_prompt), map_location="cpu")
128
+
129
+ @torch.no_grad()
130
+ def __call__(
131
+ self,
132
+ image: PipelineImageInput = None,
133
+ flow: torch.Tensor = None,
134
+ num_inference_steps: int = 20,
135
+ image_guidance_scale: float = 1.5,
136
+ flow_guidance_scale: float = 1.5,
137
+ eta: float = 0.0,
138
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
139
+ latents: Optional[torch.Tensor] = None,
140
+ output_type: Optional[str] = "pil",
141
+ return_dict: bool = True,
142
+ callback_on_step_end: Optional[
143
+ Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
144
+ ] = None,
145
+ callback_on_step_end_tensor_inputs: List[str] = ["latents"],
146
+ cross_attention_kwargs: Optional[Dict[str, Any]] = None,
147
+ **kwargs,
148
+ ):
149
+ r"""
150
+ The call function to the pipeline for generation.
151
+
152
+ Args:
153
+ image (`torch.Tensor` `np.ndarray`, `PIL.Image.Image`, `List[torch.Tensor]`, `List[PIL.Image.Image]`, or `List[np.ndarray]`):
154
+ `Image` or tensor representing an image batch to be repainted according to `prompt`. Can also accept
155
+ image latents as `image`, but if passing latents directly it is not encoded again. We only support batch size of 1 for now.
156
+ flow (`torch.Tensor`):
157
+ Optical flow tensor to be used as a condition for the image generation. We only support batch size of 1 for now.
158
+ num_inference_steps (`int`, *optional*, defaults to 20):
159
+ The number of denoising steps. More denoising steps usually lead to a higher quality image at the
160
+ expense of slower inference.
161
+ image_guidance_scale (`float`, *optional*, defaults to 1.5):
162
+ Push the generated image towards the initial `image`. Image guidance scale is enabled by setting
163
+ `image_guidance_scale > 1`. Higher image guidance scale encourages generated images that are closely
164
+ linked to the source `image`, usually at the expense of lower image quality. This pipeline requires a
165
+ value of at least `1`.
166
+ flow_guidance_scale (`float`, *optional*, defaults to 1.5):
167
+ Apply the flow guidance to the image generation. Higher values of `flow_guidance_scale` encourage
168
+ the model to follow the flow stronger.
169
+ eta (`float`, *optional*, defaults to 0.0):
170
+ Corresponds to parameter eta (η) from the [DDIM](https://arxiv.org/abs/2010.02502) paper. Only applies
171
+ to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
172
+ generator (`torch.Generator`, *optional*):
173
+ A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
174
+ generation deterministic.
175
+ latents (`torch.Tensor`, *optional*):
176
+ Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image
177
+ generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
178
+ tensor is generated by sampling using the supplied random `generator`.
179
+ output_type (`str`, *optional*, defaults to `"pil"`):
180
+ The output format of the generated image. Choose between `PIL.Image` or `np.array`.
181
+ return_dict (`bool`, *optional*, defaults to `True`):
182
+ Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
183
+ plain tuple.
184
+ callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
185
+ A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
186
+ each denoising step during the inference. with the following arguments: `callback_on_step_end(self:
187
+ DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
188
+ list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
189
+ callback_on_step_end_tensor_inputs (`List`, *optional*):
190
+ The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
191
+ will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
192
+ `._callback_tensor_inputs` attribute of your pipeline class.
193
+ cross_attention_kwargs (`dict`, *optional*):
194
+ A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
195
+ [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
196
+ """
197
+
198
+ callback = kwargs.pop("callback", None)
199
+ callback_steps = kwargs.pop("callback_steps", None)
200
+
201
+ if callback is not None:
202
+ deprecate(
203
+ "callback",
204
+ "1.0.0",
205
+ "Passing `callback` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
206
+ )
207
+ if callback_steps is not None:
208
+ deprecate(
209
+ "callback_steps",
210
+ "1.0.0",
211
+ "Passing `callback_steps` as an input argument to `__call__` is deprecated, consider use `callback_on_step_end`",
212
+ )
213
+
214
+ if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
215
+ callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs
216
+
217
+ # 0. Check inputs
218
+ self.check_inputs(
219
+ callback_steps,
220
+ callback_on_step_end_tensor_inputs,
221
+ )
222
+ self._image_guidance_scale = image_guidance_scale
223
+ self._flow_guidance_scale = flow_guidance_scale
224
+
225
+ device = self._execution_device
226
+
227
+ if image is None or flow is None:
228
+ raise ValueError("`image` or `flow` input cannot be undefined.")
229
+
230
+ # 1. Define call parameters
231
+
232
+ # 2. Encode input prompt
233
+ prompt_embeds = self._encode_prompt(
234
+ device,
235
+ self.do_classifier_free_guidance,
236
+ )
237
+
238
+ # 3. Preprocess image
239
+ image = self.image_processor.preprocess(image)
240
+ assert image.shape[0] == 1 and flow.shape[0] == 1, "Batch size must be 1 for now."
241
+
242
+ # 4. set timesteps
243
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
244
+ timesteps = self.scheduler.timesteps
245
+
246
+ # 5. Prepare Image latents
247
+ image_latents = self.prepare_image_latents(
248
+ image,
249
+ flow,
250
+ prompt_embeds.dtype,
251
+ device,
252
+ self.do_classifier_free_guidance,
253
+ )
254
+
255
+ height, width = image_latents.shape[-2:]
256
+ height = height * self.vae_scale_factor
257
+ width = width * self.vae_scale_factor
258
+
259
+ # 6. Prepare latent variables
260
+ num_channels_latents = self.vae.config.latent_channels
261
+ latents = self.prepare_latents(
262
+ num_channels_latents,
263
+ height,
264
+ width,
265
+ prompt_embeds.dtype,
266
+ device,
267
+ generator,
268
+ latents,
269
+ )
270
+
271
+ # 7. Check that shapes of latents and image match the UNet channels
272
+ num_channels_image = image_latents.shape[1]
273
+ if num_channels_latents + num_channels_image != self.unet.config.in_channels:
274
+ raise ValueError(
275
+ f"Incorrect configuration settings! The config of `pipeline.unet`: {self.unet.config} expects"
276
+ f" {self.unet.config.in_channels} but received `num_channels_latents`: {num_channels_latents} +"
277
+ f" `num_channels_image`: {num_channels_image} "
278
+ f" = {num_channels_latents+num_channels_image}. Please verify the config of"
279
+ " `pipeline.unet` or your `image` input."
280
+ )
281
+
282
+ # 8. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
283
+ extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
284
+
285
+ # 9. Denoising loop
286
+ num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
287
+ self._num_timesteps = len(timesteps)
288
+ with self.progress_bar(total=num_inference_steps) as progress_bar:
289
+ for i, t in enumerate(timesteps):
290
+ # Expand the latents if we are doing classifier free guidance.
291
+ # The latents are expanded 3 times because for image / flow guidance
292
+ latent_model_input = torch.cat([latents] * 3) if self.do_classifier_free_guidance else latents
293
+
294
+ # concat latents, image_latents in the channel dimension
295
+ scaled_latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
296
+ scaled_latent_model_input = torch.cat([scaled_latent_model_input, image_latents], dim=1)
297
+
298
+ # predict the noise residual
299
+ noise_pred = self.unet(
300
+ scaled_latent_model_input,
301
+ t,
302
+ encoder_hidden_states=prompt_embeds,
303
+ added_cond_kwargs=None,
304
+ cross_attention_kwargs=cross_attention_kwargs,
305
+ return_dict=False,
306
+ )[0]
307
+
308
+ # perform guidance
309
+ if self.do_classifier_free_guidance:
310
+ noise_pred_flow, noise_pred_image, noise_pred_uncond = noise_pred.chunk(3)
311
+ noise_pred = (
312
+ noise_pred_uncond
313
+ + self._image_guidance_scale * (noise_pred_image - noise_pred_uncond)
314
+ + self._flow_guidance_scale * (noise_pred_flow - noise_pred_image)
315
+ )
316
+
317
+ # compute the previous noisy sample x_t -> x_t-1
318
+ latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]
319
+
320
+ if callback_on_step_end is not None:
321
+ callback_kwargs = {}
322
+ for k in callback_on_step_end_tensor_inputs:
323
+ callback_kwargs[k] = locals()[k]
324
+ callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)
325
+
326
+ latents = callback_outputs.pop("latents", latents)
327
+ prompt_embeds = callback_outputs.pop("prompt_embeds", prompt_embeds)
328
+ image_latents = callback_outputs.pop("image_latents", image_latents)
329
+
330
+ # call the callback, if provided
331
+ if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
332
+ progress_bar.update()
333
+ if callback is not None and i % callback_steps == 0:
334
+ step_idx = i // getattr(self.scheduler, "order", 1)
335
+ callback(step_idx, t, latents)
336
+
337
+ if not output_type == "latent":
338
+ image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
339
+ image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
340
+ else:
341
+ image = latents
342
+ has_nsfw_concept = None
343
+
344
+ if has_nsfw_concept is None:
345
+ do_denormalize = [True] * image.shape[0]
346
+ else:
347
+ do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
348
+
349
+ image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)
350
+
351
+ # Offload all models
352
+ self.maybe_free_model_hooks()
353
+
354
+ if not return_dict:
355
+ return (image, has_nsfw_concept)
356
+
357
+ return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
358
+
359
+ def _encode_prompt(
360
+ self,
361
+ device,
362
+ do_classifier_free_guidance,
363
+ ):
364
+ prompt_embeds = self.null_prompt_embeds.to(dtype=torch.float16, device=device) # 1 77 512
365
+
366
+ if do_classifier_free_guidance: # We are only doing cfg for image and flow
367
+ prompt_embeds = torch.cat([prompt_embeds, prompt_embeds, prompt_embeds]) # 3 77 512
368
+
369
+ return prompt_embeds
370
+
371
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.run_safety_checker
372
+ def run_safety_checker(self, image, device, dtype):
373
+ if self.safety_checker is None:
374
+ has_nsfw_concept = None
375
+ else:
376
+ if torch.is_tensor(image):
377
+ feature_extractor_input = self.image_processor.postprocess(image, output_type="pil")
378
+ else:
379
+ feature_extractor_input = self.image_processor.numpy_to_pil(image)
380
+ safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="pt").to(device)
381
+ image, has_nsfw_concept = self.safety_checker(
382
+ images=image, clip_input=safety_checker_input.pixel_values.to(dtype)
383
+ )
384
+ return image, has_nsfw_concept
385
+
386
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs
387
+ def prepare_extra_step_kwargs(self, generator, eta):
388
+ # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
389
+ # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
390
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
391
+ # and should be between [0, 1]
392
+
393
+ accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
394
+ extra_step_kwargs = {}
395
+ if accepts_eta:
396
+ extra_step_kwargs["eta"] = eta
397
+
398
+ # check if the scheduler accepts generator
399
+ accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
400
+ if accepts_generator:
401
+ extra_step_kwargs["generator"] = generator
402
+ return extra_step_kwargs
403
+
404
+ # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.decode_latents
405
+ def decode_latents(self, latents):
406
+ deprecation_message = "The decode_latents method is deprecated and will be removed in 1.0.0. Please use VaeImageProcessor.postprocess(...) instead"
407
+ deprecate("decode_latents", "1.0.0", deprecation_message, standard_warn=False)
408
+
409
+ latents = 1 / self.vae.config.scaling_factor * latents
410
+ image = self.vae.decode(latents, return_dict=False)[0]
411
+ image = (image / 2 + 0.5).clamp(0, 1)
412
+ # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
413
+ image = image.cpu().permute(0, 2, 3, 1).float().numpy()
414
+ return image
415
+
416
+ def check_inputs(
417
+ self,
418
+ callback_steps,
419
+ callback_on_step_end_tensor_inputs=None,
420
+ ):
421
+ if callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0):
422
+ raise ValueError(
423
+ f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
424
+ f" {type(callback_steps)}."
425
+ )
426
+
427
+ if callback_on_step_end_tensor_inputs is not None and not all(
428
+ k in self._callback_tensor_inputs for k in callback_on_step_end_tensor_inputs
429
+ ):
430
+ raise ValueError(
431
+ f"`callback_on_step_end_tensor_inputs` has to be in {self._callback_tensor_inputs}, but found {[k for k in callback_on_step_end_tensor_inputs if k not in self._callback_tensor_inputs]}"
432
+ )
433
+
434
+ def prepare_latents(self, num_channels_latents, height, width, dtype, device, generator, latents=None):
435
+ shape = (
436
+ 1,
437
+ num_channels_latents,
438
+ int(height) // self.vae_scale_factor,
439
+ int(width) // self.vae_scale_factor,
440
+ )
441
+ if isinstance(generator, list) and len(generator) != 1:
442
+ raise ValueError(
443
+ f"You have passed a list of generators of length {len(generator)}, but we only support a single batch for now."
444
+ )
445
+
446
+ if latents is None:
447
+ latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
448
+ else:
449
+ latents = latents.to(device)
450
+
451
+ # scale the initial noise by the standard deviation required by the scheduler
452
+ latents = latents * self.scheduler.init_noise_sigma
453
+ return latents
454
+
455
+ def prepare_image_latents(
456
+ self, image, flow, dtype, device, do_classifier_free_guidance, generator=None
457
+ ):
458
+ if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)):
459
+ raise ValueError(
460
+ f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}"
461
+ )
462
+
463
+ image = image.to(device=device, dtype=dtype)
464
+
465
+ if image.shape[1] == 4:
466
+ image_latents = image
467
+ else:
468
+ image_latents = retrieve_latents(self.vae.encode(image), sample_mode="argmax")
469
+
470
+ image_latents_flow_cond = torch.cat([image_latents, flow.to(device)], dim=1)
471
+
472
+ if do_classifier_free_guidance:
473
+ image_latents_flow_uncond = torch.cat([image_latents, torch.zeros_like(flow).to(device)], dim=1)
474
+ image_latents_uncond = torch.zeros_like(image_latents_flow_cond)
475
+ image_latents_final = torch.cat([image_latents_flow_cond, image_latents_flow_uncond, image_latents_uncond], dim=0)
476
+ else:
477
+ image_latents_final = image_latents_flow_cond
478
+
479
+ return image_latents_final
480
+
481
+ @property
482
+ def image_guidance_scale(self):
483
+ return self._image_guidance_scale
484
+
485
+ @property
486
+ def flow_guidance_scale(self):
487
+ return self._flow_guidance_scale
488
+
489
+ @property
490
+ def num_timesteps(self):
491
+ return self._num_timesteps
492
+
493
+ @property
494
+ def do_classifier_free_guidance(self):
495
+ return self._image_guidance_scale > 1 or self._flow_guidance_scale > 1
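The denoising loop in `__call__` combines three UNet predictions: flow-and-image conditioned, image-only, and fully unconditional, matching the ordering built in `prepare_image_latents`. A small sketch of just that guidance arithmetic (the helper name is ours; tensor shapes are illustrative):

```python
import torch

def combine_guidance(noise_flow, noise_image, noise_uncond,
                     image_guidance_scale=1.5, flow_guidance_scale=1.5):
    """Same arithmetic as the do_classifier_free_guidance branch above: start from the
    unconditional prediction, push toward the image-conditioned one, then push further
    toward the flow-conditioned one."""
    return (noise_uncond
            + image_guidance_scale * (noise_image - noise_uncond)
            + flow_guidance_scale * (noise_flow - noise_image))

# Latent-space noise predictions for a 512x512 image are 1x4x64x64 with the 8x VAE downsampling.
preds = [torch.randn(1, 4, 64, 64) for _ in range(3)]
print(combine_guidance(*preds).shape)  # torch.Size([1, 4, 64, 64])
```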
InstDrag/flowgen/models.py ADDED
@@ -0,0 +1,161 @@
1
+ # Modified from https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix/blob/master/models/networks.py
2
+
3
+ import torch
4
+ import torch.nn as nn
5
+ import functools
6
+
7
+ class UnetSkipConnectionBlock(nn.Module):
8
+ """Defines the Unet submodule with skip connection.
9
+ X -------------------identity----------------------
10
+ |-- downsampling -- |submodule| -- upsampling --|
11
+ """
12
+
13
+ def __init__(self, outer_nc, inner_nc, input_nc=None,
14
+ submodule=None, outermost=False, innermost=False, norm_layer=nn.BatchNorm2d, use_dropout=False):
15
+ """Construct a Unet submodule with skip connections.
16
+
17
+ Parameters:
18
+ outer_nc (int) -- the number of filters in the outer conv layer
19
+ inner_nc (int) -- the number of filters in the inner conv layer
20
+ input_nc (int) -- the number of channels in input images/features
21
+ submodule (UnetSkipConnectionBlock) -- previously defined submodules
22
+ outermost (bool) -- if this module is the outermost module
23
+ innermost (bool) -- if this module is the innermost module
24
+ norm_layer -- normalization layer
25
+ use_dropout (bool) -- if use dropout layers.
26
+ """
27
+ super(UnetSkipConnectionBlock, self).__init__()
28
+ self.outermost = outermost
29
+ if type(norm_layer) == functools.partial:
30
+ use_bias = norm_layer.func != nn.BatchNorm2d
31
+ else:
32
+ use_bias = norm_layer != nn.BatchNorm2d
33
+ if input_nc is None:
34
+ input_nc = outer_nc
35
+ downconv = nn.Conv2d(input_nc, inner_nc, kernel_size=4,
36
+ stride=2, padding=1, bias=use_bias)
37
+ downrelu = nn.LeakyReLU(0.2, True)
38
+
39
+ if norm_layer == nn.GroupNorm:
40
+ downnorm = norm_layer(32, inner_nc)
41
+ else: downnorm = norm_layer(inner_nc)
42
+ uprelu = nn.ReLU(True)
43
+ if norm_layer == nn.GroupNorm:
44
+ if outer_nc % 32 != 0:
45
+ upnorm = norm_layer(outer_nc, outer_nc) # one group per channel (InstanceNorm-like) when outer_nc is not divisible by 32
46
+ else:
47
+ upnorm = norm_layer(32, outer_nc)
48
+ else:
49
+ upnorm = norm_layer(outer_nc)
50
+
51
+ if outermost:
52
+ upconv = nn.ConvTranspose2d(inner_nc * 2, outer_nc,
53
+ kernel_size=4, stride=2,
54
+ padding=1)
55
+ down = [downconv]
56
+ up = [uprelu, upconv, nn.Tanh()]
57
+ model = down + [submodule] + up
58
+ elif innermost:
59
+ upconv = nn.ConvTranspose2d(inner_nc, outer_nc,
60
+ kernel_size=4, stride=2,
61
+ padding=1, bias=use_bias)
62
+ down = [downrelu, downconv]
63
+ up = [uprelu, upconv, upnorm]
64
+ model = down + up
65
+ else:
66
+ upconv = nn.ConvTranspose2d(inner_nc * 2, outer_nc,
67
+ kernel_size=4, stride=2,
68
+ padding=1, bias=use_bias)
69
+ down = [downrelu, downconv, downnorm]
70
+ up = [uprelu, upconv, upnorm]
71
+
72
+ if use_dropout:
73
+ model = down + [submodule] + up + [nn.Dropout(0.5)]
74
+ else:
75
+ model = down + [submodule] + up
76
+
77
+ self.model = nn.Sequential(*model)
78
+
79
+ def forward(self, x):
80
+ if self.outermost:
81
+ return self.model(x)
82
+ else: # add skip connections
83
+ return torch.cat([x, self.model(x)], 1)
84
+
85
+ class UnetGenerator(nn.Module):
86
+ """Create a Unet-based generator"""
87
+
88
+ def __init__(self, input_nc, output_nc=2, num_downs=8, ngf=64, norm_layer=nn.GroupNorm, use_dropout=True):
89
+ """Construct a Unet generator
90
+ Parameters:
91
+ input_nc (int) -- the number of channels in input images
92
+ output_nc (int) -- the number of channels in output images
93
+ num_downs (int) -- the number of downsamplings in the UNet. For example, if |num_downs| == 7,
94
+ an image of size 128x128 will become of size 1x1 at the bottleneck
95
+ ngf (int) -- the number of filters in the last conv layer
96
+ norm_layer -- normalization layer
97
+
98
+ We construct the U-Net from the innermost layer to the outermost layer.
99
+ It is a recursive process.
100
+ """
101
+ super(UnetGenerator, self).__init__()
102
+ # construct unet structure
103
+ unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=None, norm_layer=norm_layer, innermost=True) # add the innermost layer
104
+ for i in range(num_downs - 5): # add intermediate layers with ngf * 8 filters
105
+ unet_block = UnetSkipConnectionBlock(ngf * 8, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer, use_dropout=use_dropout)
106
+ # gradually reduce the number of filters from ngf * 8 to ngf
107
+ unet_block = UnetSkipConnectionBlock(ngf * 4, ngf * 8, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
108
+ unet_block = UnetSkipConnectionBlock(ngf * 2, ngf * 4, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
109
+ unet_block = UnetSkipConnectionBlock(ngf, ngf * 2, input_nc=None, submodule=unet_block, norm_layer=norm_layer)
110
+ self.model = UnetSkipConnectionBlock(output_nc, ngf, input_nc=input_nc, submodule=unet_block, outermost=True, norm_layer=norm_layer) # add the outermost layer
111
+
112
+ def forward(self, input):
113
+ """Standard forward"""
114
+ return self.model(input)
115
+
116
+ class NLayerDiscriminator(nn.Module):
117
+ """Defines a PatchGAN discriminator"""
118
+
119
+ def __init__(self, input_nc, ndf=64, n_layers=6, norm_layer=nn.GroupNorm):
120
+ """Construct a PatchGAN discriminator
121
+
122
+ Parameters:
123
+ input_nc (int) -- the number of channels in input images
124
+ ndf (int) -- the number of filters in the last conv layer
125
+ n_layers (int) -- the number of conv layers in the discriminator
126
+ norm_layer -- normalization layer
127
+ """
128
+ super(NLayerDiscriminator, self).__init__()
129
+ if type(norm_layer) == functools.partial: # no need to use bias as BatchNorm2d has affine parameters
130
+ use_bias = norm_layer.func != nn.BatchNorm2d
131
+ else:
132
+ use_bias = norm_layer != nn.BatchNorm2d
133
+
134
+ kw = 4
135
+ padw = 1
136
+ sequence = [nn.Conv2d(input_nc, ndf, kernel_size=kw, stride=2, padding=padw), nn.LeakyReLU(0.2, True)]
137
+ nf_mult = 1
138
+ nf_mult_prev = 1
139
+ for n in range(1, n_layers): # gradually increase the number of filters
140
+ nf_mult_prev = nf_mult
141
+ nf_mult = min(2 ** n, 8)
142
+ sequence += [
143
+ nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=2, padding=padw, bias=use_bias),
144
+ norm_layer(32, ndf * nf_mult) if norm_layer == nn.GroupNorm else norm_layer(ndf * nf_mult),
145
+ nn.LeakyReLU(0.2, True)
146
+ ]
147
+
148
+ nf_mult_prev = nf_mult
149
+ nf_mult = min(2 ** n_layers, 8)
150
+ sequence += [
151
+ nn.Conv2d(ndf * nf_mult_prev, ndf * nf_mult, kernel_size=kw, stride=1, padding=padw, bias=use_bias),
152
+ norm_layer(32, ndf * nf_mult) if norm_layer == nn.GroupNorm else norm_layer(ndf * nf_mult),
153
+ nn.LeakyReLU(0.2, True)
154
+ ]
155
+
156
+ sequence += [nn.Conv2d(ndf * nf_mult, 1, kernel_size=kw, stride=1, padding=padw)] # output 1 channel prediction map
157
+ self.model = nn.Sequential(*sequence)
158
+
159
+ def forward(self, input):
160
+ """Standard forward."""
161
+ return self.model(input)
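FlowGen in the demo is this `UnetGenerator` with 5 input channels (3 RGB channels plus the 2-channel sparse flow map) and 2 output channels (dense flow). A quick shape check with a random tensor standing in for a real input, run from the repository root:

```python
import torch
from flowgen.models import UnetGenerator

# 5 input channels = RGB image + 2-channel sparse point-vector map; 2 output channels = dense (dx, dy) flow.
model = UnetGenerator(input_nc=5, output_nc=2).eval()

with torch.inference_mode():
    dummy = torch.randn(1, 5, 256, 256)  # FlowGen runs at 256x256 in the demo
    out = model(dummy)
print(out.shape)  # torch.Size([1, 2, 256, 256]); values lie in [-1, 1] because of the outermost Tanh
```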
InstDrag/utils/flow_utils.py ADDED
@@ -0,0 +1,143 @@
1
+ import numpy as np
2
+ from PIL import Image
3
+ import torch
4
+ import torch.nn.functional as F
5
+
6
+ def make_colorwheel():
7
+ """
8
+ Generates a color wheel for optical flow visualization as presented in:
9
+ Baker et al. "A Database and Evaluation Methodology for Optical Flow" (ICCV, 2007)
10
+ URL: http://vision.middlebury.edu/flow/flowEval-iccv07.pdf
11
+
12
+ Code follows the original C++ source code of Daniel Scharstein.
13
+ Code follows the Matlab source code of Deqing Sun.
14
+
15
+ Returns:
16
+ np.ndarray: Color wheel
17
+ """
18
+
19
+ RY = 15
20
+ YG = 6
21
+ GC = 4
22
+ CB = 11
23
+ BM = 13
24
+ MR = 6
25
+
26
+ ncols = RY + YG + GC + CB + BM + MR
27
+ colorwheel = np.zeros((ncols, 3))
28
+ col = 0
29
+
30
+ # RY
31
+ colorwheel[0:RY, 0] = 255
32
+ colorwheel[0:RY, 1] = np.floor(255*np.arange(0,RY)/RY)
33
+ col = col+RY
34
+ # YG
35
+ colorwheel[col:col+YG, 0] = 255 - np.floor(255*np.arange(0,YG)/YG)
36
+ colorwheel[col:col+YG, 1] = 255
37
+ col = col+YG
38
+ # GC
39
+ colorwheel[col:col+GC, 1] = 255
40
+ colorwheel[col:col+GC, 2] = np.floor(255*np.arange(0,GC)/GC)
41
+ col = col+GC
42
+ # CB
43
+ colorwheel[col:col+CB, 1] = 255 - np.floor(255*np.arange(CB)/CB)
44
+ colorwheel[col:col+CB, 2] = 255
45
+ col = col+CB
46
+ # BM
47
+ colorwheel[col:col+BM, 2] = 255
48
+ colorwheel[col:col+BM, 0] = np.floor(255*np.arange(0,BM)/BM)
49
+ col = col+BM
50
+ # MR
51
+ colorwheel[col:col+MR, 2] = 255 - np.floor(255*np.arange(MR)/MR)
52
+ colorwheel[col:col+MR, 0] = 255
53
+ return colorwheel
54
+
55
+ def flow_uv_to_colors(u, v, convert_to_bgr=False):
56
+ """
57
+ Applies the flow color wheel to (possibly clipped) flow components u and v.
58
+
59
+ According to the C++ source code of Daniel Scharstein
60
+ According to the Matlab source code of Deqing Sun
61
+
62
+ Args:
63
+ u (np.ndarray): Input horizontal flow of shape [H,W]
64
+ v (np.ndarray): Input vertical flow of shape [H,W]
65
+ convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
66
+
67
+ Returns:
68
+ np.ndarray: Flow visualization image of shape [H,W,3]
69
+ """
70
+ flow_image = np.zeros((u.shape[0], u.shape[1], 3), np.uint8)
71
+ colorwheel = make_colorwheel() # shape [55x3]
72
+ ncols = colorwheel.shape[0]
73
+ rad = np.sqrt(np.square(u) + np.square(v))
74
+ a = np.arctan2(-v, -u)/np.pi
75
+ fk = (a+1) / 2*(ncols-1)
76
+ k0 = np.floor(fk).astype(np.int32)
77
+ k1 = k0 + 1
78
+ k1[k1 == ncols] = 0
79
+ f = fk - k0
80
+ for i in range(colorwheel.shape[1]):
81
+ tmp = colorwheel[:,i]
82
+ col0 = tmp[k0] / 255.0
83
+ col1 = tmp[k1] / 255.0
84
+ col = (1-f)*col0 + f*col1
85
+ idx = (rad <= 1)
86
+ col[idx] = 1 - rad[idx] * (1-col[idx])
87
+ col[~idx] = col[~idx] * 0.75 # out of range
88
+ # Note the 2-i => BGR instead of RGB
89
+ ch_idx = 2-i if convert_to_bgr else i
90
+ flow_image[:,:,ch_idx] = np.floor(255 * col)
91
+ return flow_image
92
+
93
+ def flow_to_image(flow_uv, clip_flow=None, convert_to_bgr=False, max_flow=None):
94
+ """
95
+ Expects a flow tensor of shape [2,H,W] and returns a color visualization.
96
+
97
+ Args:
98
+ flow_uv (torch.Tensor): Flow UV image of shape [2,H,W]
99
+ clip_flow (float, optional): Clip maximum of flow values. Defaults to None.
100
+ convert_to_bgr (bool, optional): Convert output image to BGR. Defaults to False.
101
+
102
+ Returns:
103
+ PIL Image: Flow visualization image
104
+ """
105
+ flow_uv = flow_uv.permute(1, 2, 0).cpu().numpy() # change to [H,W,2] and convert to numpy
106
+
107
+ if clip_flow is not None:
108
+ flow_uv = np.clip(flow_uv, 0, clip_flow)
109
+ u = flow_uv[:,:,0]
110
+ v = flow_uv[:,:,1]
111
+ if max_flow is None:
112
+ rad = np.sqrt(np.square(u) + np.square(v))
113
+ rad_max = np.max(rad)
114
+ else:
115
+ rad_max = max_flow
116
+ epsilon = 1e-5
117
+ u = u / (rad_max + epsilon)
118
+ v = v / (rad_max + epsilon)
119
+ flow_image = flow_uv_to_colors(u, v, convert_to_bgr)
120
+
121
+ return Image.fromarray(flow_image)
122
+
123
+ def resize_flow(flow, size, scale_type="none", mode="bicubic"):
124
+ """
125
+ Resize the flow tensor (Bx2xHxW) to the given size (HxW).
126
+ flow tensor is in range of [-ori_w, ori_w] and [-ori_h, ori_h]
127
+ Size should be a tuple (H, W).
128
+ """
129
+ ori_h, ori_w = flow.shape[2:]
130
+ flow = F.interpolate(flow, size=size, mode=mode, align_corners=False)
131
+
132
+ if scale_type == "scale" and (ori_h != size[0] or ori_w != size[1]):
133
+ flow[:,0,:,:] *= size[1] / ori_w
134
+ flow[:,1,:,:] *= size[0] / ori_h
135
+ elif scale_type == "normalize_fixed": # normalize to -1 ~ 1
136
+ flow[:,0,:,:] /= ori_w
137
+ flow[:,1,:,:] /= ori_h
138
+ elif scale_type == "normalize_max":
139
+ max_flow_x = torch.amax(torch.abs(flow[:, 0, :, :]), dim=(1, 2))
140
+ max_flow_y = torch.amax(torch.abs(flow[:, 1, :, :]), dim=(1, 2))
141
+ flow[:, 0, :, :] /= max_flow_x.view(-1, 1, 1)
142
+ flow[:, 1, :, :] /= max_flow_y.view(-1, 1, 1)
143
+ return flow
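`flow_to_image` and `resize_flow` can be exercised on their own; a small sketch with a synthetic flow field (the constant rightward flow is just an arbitrary test pattern), run from the repository root:

```python
import torch
from utils.flow_utils import flow_to_image, resize_flow

# A constant rightward flow of 10 pixels over a 256x256 grid.
flow = torch.zeros(1, 2, 256, 256)
flow[:, 0] = 10.0

# Downsample to the 64x64 latent grid and normalize by the original width/height, as the demo does.
latent_flow = resize_flow(flow.clone(), (64, 64), scale_type="normalize_fixed")
print(latent_flow.shape, latent_flow[:, 0].max().item())  # torch.Size([1, 2, 64, 64]), ~0.039

flow_to_image(flow[0]).save("flow_vis.png")  # uniform color, since direction and magnitude are constant
```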
InstDrag/utils/null_prompt.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7eb3e5fc1308277b9288aa665562eb688e4aa36e6bcbc422083b707468e84d2a
3
+ size 237655