{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "c3afa009-cc7a-436b-be99-8f1f403af84a",
   "metadata": {},
   "source": [
    "# Image to Video Generation with Stable Video Diffusion\n",
    "\n",
    "Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it. In this tutorial we consider how to convert and run Stable Video Diffusion using OpenVINO.\n",
    "We will use [stable-video-diffusion-img2video-xt](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) model as example. Additionally, to speedup video generation process we apply [AnimateLCM](https://arxiv.org/abs/2402.00769) LoRA weights and run optimization with [NNCF](https://github.com/openvinotoolkit/nncf/).\n",
    "\n",
    "## Table of contents:\n",
    "\n",
    "- [Prerequisites](#Prerequisites)\n",
    "- [Download PyTorch Model](#Download-PyTorch-Model)\n",
    "- [Convert Model to OpenVINO Intermediate Representation](#Convert-Model-to-OpenVINO-Intermediate-Representation)\n",
    "    - [Image Encoder](#Image-Encoder)\n",
    "    - [U-net](#U-net)\n",
    "    - [VAE Encoder and Decoder](#VAE-Encoder-and-Decoder)\n",
    "- [Prepare Inference Pipeline](#Prepare-Inference-Pipeline)\n",
    "- [Run Video Generation](#Run-Video-Generation)\n",
    "    - [Select Inference Device](#Select-Inference-Device)\n",
    "- [Quantization](#Quantization)\n",
    "    - [Prepare calibration dataset](#Prepare-calibration-dataset)\n",
    "    - [Run Hybrid Model Quantization](#Run-Hybrid-Model-Quantization)\n",
    "    - [Run Weight Compression](#Run-Weight-Compression)\n",
    "    - [Compare model file sizes](#Compare-model-file-sizes)\n",
    "    - [Compare inference time of the FP16 and INT8 pipelines](#Compare-inference-time-of-the-FP16-and-INT8-pipelines)\n",
    "- [Interactive Demo](#Interactive-Demo)\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "117deb8f-bae0-4623-98b6-a2409d6eb0cc",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "21709ff7-a138-4256-9d2c-ba789a897162",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Note: you may need to restart the kernel to use updated packages.\n",
      "Note: you may need to restart the kernel to use updated packages.\n"
     ]
    }
   ],
   "source": [
    "%pip install -q \"torch>=2.1\" \"diffusers>=0.25\" \"peft==0.6.2\" \"transformers\" \"openvino>=2024.1.0\" Pillow opencv-python tqdm  \"gradio>=4.19\" safetensors --extra-index-url https://download.pytorch.org/whl/cpu\n",
    "%pip install -q datasets \"nncf>=2.10.0\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "972423c0-b6a0-45cd-a652-71c367eeb010",
   "metadata": {},
   "source": [
    "## Download PyTorch Model\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "The code below load Stable Video Diffusion XT model using [Diffusers](https://huggingface.co/docs/diffusers/index) library and apply Consistency Distilled AnimateLCM weights. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "6e29229d-821d-4367-8f91-ad8375a38895",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from pathlib import Path\n",
    "from diffusers import StableVideoDiffusionPipeline\n",
    "from diffusers.utils import load_image, export_to_video\n",
    "from diffusers.models.attention_processor import AttnProcessor\n",
    "from safetensors import safe_open\n",
    "import gc\n",
    "import requests\n",
    "\n",
    "lcm_scheduler_url = \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/raw/main/lcm_scheduler.py\"\n",
    "\n",
    "r = requests.get(lcm_scheduler_url)\n",
    "\n",
    "with open(\"lcm_scheduler.py\", \"w\") as f:\n",
    "    f.write(r.text)\n",
    "\n",
    "from lcm_scheduler import AnimateLCMSVDStochasticIterativeScheduler\n",
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "MODEL_DIR = Path(\"model\")\n",
    "\n",
    "IMAGE_ENCODER_PATH = MODEL_DIR / \"image_encoder.xml\"\n",
    "VAE_ENCODER_PATH = MODEL_DIR / \"vae_encoder.xml\"\n",
    "VAE_DECODER_PATH = MODEL_DIR / \"vae_decoder.xml\"\n",
    "UNET_PATH = MODEL_DIR / \"unet.xml\"\n",
    "\n",
    "\n",
    "load_pt_pipeline = not (VAE_ENCODER_PATH.exists() and VAE_DECODER_PATH.exists() and UNET_PATH.exists() and IMAGE_ENCODER_PATH.exists())\n",
    "\n",
    "unet, vae, image_encoder = None, None, None\n",
    "if load_pt_pipeline:\n",
    "    noise_scheduler = AnimateLCMSVDStochasticIterativeScheduler(\n",
    "        num_train_timesteps=40,\n",
    "        sigma_min=0.002,\n",
    "        sigma_max=700.0,\n",
    "        sigma_data=1.0,\n",
    "        s_noise=1.0,\n",
    "        rho=7,\n",
    "        clip_denoised=False,\n",
    "    )\n",
    "    pipe = StableVideoDiffusionPipeline.from_pretrained(\n",
    "        \"stabilityai/stable-video-diffusion-img2vid-xt\",\n",
    "        variant=\"fp16\",\n",
    "        scheduler=noise_scheduler,\n",
    "    )\n",
    "    pipe.unet.set_attn_processor(AttnProcessor())\n",
    "    hf_hub_download(\n",
    "        repo_id=\"wangfuyun/AnimateLCM-SVD-xt\",\n",
    "        filename=\"AnimateLCM-SVD-xt.safetensors\",\n",
    "        local_dir=\"./checkpoints\",\n",
    "    )\n",
    "    state_dict = {}\n",
    "    LCM_LORA_PATH = Path(\n",
    "        \"checkpoints/AnimateLCM-SVD-xt.safetensors\",\n",
    "    )\n",
    "    with safe_open(LCM_LORA_PATH, framework=\"pt\", device=\"cpu\") as f:\n",
    "        for key in f.keys():\n",
    "            state_dict[key] = f.get_tensor(key)\n",
    "    missing, unexpected = pipe.unet.load_state_dict(state_dict, strict=True)\n",
    "\n",
    "    pipe.scheduler.save_pretrained(MODEL_DIR / \"scheduler\")\n",
    "    pipe.feature_extractor.save_pretrained(MODEL_DIR / \"feature_extractor\")\n",
    "    unet = pipe.unet\n",
    "    unet.eval()\n",
    "    vae = pipe.vae\n",
    "    vae.eval()\n",
    "    image_encoder = pipe.image_encoder\n",
    "    image_encoder.eval()\n",
    "    del pipe\n",
    "    gc.collect()\n",
    "\n",
    "# Load the conditioning image\n",
    "image = load_image(\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true\")\n",
    "image = image.resize((512, 256))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "92eb1bd8-d933-4e44-8167-6690ff11dfbc",
   "metadata": {},
   "source": [
    "## Convert Model to OpenVINO Intermediate Representation\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "OpenVINO supports PyTorch models via conversion into Intermediate Representation (IR) format. We need to provide a model object, input data for model tracing to `ov.convert_model` function to obtain OpenVINO `ov.Model` object instance. Model can be saved on disk for next deployment using `ov.save_model` function.\n",
    "\n",
    "Stable Video Diffusion consists of 3 parts:\n",
    "\n",
    "* **Image Encoder** for extraction embeddings from the input image.\n",
    "* **U-Net** for step-by-step denoising video clip.\n",
    "* **VAE** for encoding input image into latent space and decoding generated video.\n",
    "\n",
    "Let's convert each part.\n",
    "\n",
    "### Image Encoder\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "60e89c71-4cf6-4e87-b788-8fa5265bca71",
   "metadata": {},
   "outputs": [],
   "source": [
    "import openvino as ov\n",
    "\n",
    "\n",
    "def cleanup_torchscript_cache():\n",
    "    \"\"\"\n",
    "    Helper for removing cached model representation\n",
    "    \"\"\"\n",
    "    torch._C._jit_clear_class_registry()\n",
    "    torch.jit._recursive.concrete_type_store = torch.jit._recursive.ConcreteTypeStore()\n",
    "    torch.jit._state._clear_class_state()\n",
    "\n",
    "\n",
    "if not IMAGE_ENCODER_PATH.exists():\n",
    "    with torch.no_grad():\n",
    "        ov_model = ov.convert_model(\n",
    "            image_encoder,\n",
    "            example_input=torch.zeros((1, 3, 224, 224)),\n",
    "            input=[-1, 3, 224, 224],\n",
    "        )\n",
    "    ov.save_model(ov_model, IMAGE_ENCODER_PATH)\n",
    "    del ov_model\n",
    "    cleanup_torchscript_cache()\n",
    "    print(f\"Image Encoder successfully converted to IR and saved to {IMAGE_ENCODER_PATH}\")\n",
    "del image_encoder\n",
    "gc.collect();"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "a81d2949-343a-40ee-a279-92186f7eb624",
   "metadata": {},
   "source": [
    "### U-net\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ffe83a70-125b-4ceb-9c14-184ff882b28e",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import openvino as ov\n",
    "\n",
    "if not UNET_PATH.exists():\n",
    "    unet_inputs = {\n",
    "        \"sample\": torch.ones([2, 2, 8, 32, 32]),\n",
    "        \"timestep\": torch.tensor(1.256),\n",
    "        \"encoder_hidden_states\": torch.zeros([2, 1, 1024]),\n",
    "        \"added_time_ids\": torch.ones([2, 3]),\n",
    "    }\n",
    "    with torch.no_grad():\n",
    "        ov_model = ov.convert_model(unet, example_input=unet_inputs)\n",
    "    ov.save_model(ov_model, UNET_PATH)\n",
    "    del ov_model\n",
    "    cleanup_torchscript_cache()\n",
    "    print(f\"UNet successfully converted to IR and saved to {UNET_PATH}\")\n",
    "\n",
    "del unet\n",
    "gc.collect();"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "568e5f42-1ccc-48d4-ba47-d4fcd1fad865",
   "metadata": {},
   "source": [
    "### VAE Encoder and Decoder\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "As discussed above VAE model used for encoding initial image and decoding generated video. Encoding and Decoding happen on different pipeline stages, so for convenient usage we separate VAE on 2 parts: Encoder and Decoder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "996339ea-c674-4581-baf3-cb6a0443c74c",
   "metadata": {},
   "outputs": [],
   "source": [
    "class VAEEncoderWrapper(torch.nn.Module):\n",
    "    def __init__(self, vae):\n",
    "        super().__init__()\n",
    "        self.vae = vae\n",
    "\n",
    "    def forward(self, image):\n",
    "        return self.vae.encode(x=image)[\"latent_dist\"].sample()\n",
    "\n",
    "\n",
    "class VAEDecoderWrapper(torch.nn.Module):\n",
    "    def __init__(self, vae):\n",
    "        super().__init__()\n",
    "        self.vae = vae\n",
    "\n",
    "    def forward(self, latents, num_frames: int):\n",
    "        return self.vae.decode(latents, num_frames=num_frames)\n",
    "\n",
    "\n",
    "if not VAE_ENCODER_PATH.exists():\n",
    "    vae_encoder = VAEEncoderWrapper(vae)\n",
    "    with torch.no_grad():\n",
    "        ov_model = ov.convert_model(vae_encoder, example_input=torch.zeros((1, 3, 576, 1024)))\n",
    "    ov.save_model(ov_model, VAE_ENCODER_PATH)\n",
    "    cleanup_torchscript_cache()\n",
    "    print(f\"VAE Encoder successfully converted to IR and saved to {VAE_ENCODER_PATH}\")\n",
    "    del vae_encoder\n",
    "    gc.collect()\n",
    "\n",
    "if not VAE_DECODER_PATH.exists():\n",
    "    vae_decoder = VAEDecoderWrapper(vae)\n",
    "    with torch.no_grad():\n",
    "        ov_model = ov.convert_model(vae_decoder, example_input=(torch.zeros((8, 4, 72, 128)), torch.tensor(8)))\n",
    "    ov.save_model(ov_model, VAE_DECODER_PATH)\n",
    "    cleanup_torchscript_cache()\n",
    "    print(f\"VAE Decoder successfully converted to IR and saved to {VAE_ENCODER_PATH}\")\n",
    "    del vae_decoder\n",
    "    gc.collect()\n",
    "\n",
    "del vae\n",
    "gc.collect();"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "48ecf8f3-ffc4-4b97-ac7d-d3ea9bb3fd75",
   "metadata": {},
   "source": [
    "## Prepare Inference Pipeline\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "The code bellow implements `OVStableVideoDiffusionPipeline` class for running video generation using OpenVINO. The pipeline accepts input image and returns the sequence of generated frames\n",
    "The diagram below represents a simplified pipeline workflow.\n",
    "\n",
    "![svd](https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a5671c5b-415b-4ae0-be82-9bf36527d452)\n",
    "\n",
    "The pipeline is very similar to [Stable Diffusion Image to Image Generation pipeline](../stable-diffusion-text-to-image/stable-diffusion-text-to-image.ipynb) with the only difference that Image Encoder is used instead of Text Encoder. Model takes input image and random seed as initial prompt. Then image encoded into embeddings space using Image Encoder and into latent space using VAE Encoder and passed as input to U-Net model. Next, the U-Net iteratively *denoises* the random latent video representations while being conditioned on the image embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm for next iteration in generation cycle. This process repeats the given number of times and, finally, VAE decoder converts denoised latents into sequence of video frames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "1b073909-aa5c-4252-bff4-0a7e34a6c983",
   "metadata": {},
   "outputs": [],
   "source": [
    "from diffusers.pipelines.pipeline_utils import DiffusionPipeline\n",
    "import PIL.Image\n",
    "from diffusers.image_processor import VaeImageProcessor\n",
    "from diffusers.utils.torch_utils import randn_tensor\n",
    "from typing import Callable, Dict, List, Optional, Union\n",
    "from diffusers.pipelines.stable_video_diffusion import (\n",
    "    StableVideoDiffusionPipelineOutput,\n",
    ")\n",
    "\n",
    "\n",
    "def _append_dims(x, target_dims):\n",
    "    \"\"\"Appends dimensions to the end of a tensor until it has target_dims dimensions.\"\"\"\n",
    "    dims_to_append = target_dims - x.ndim\n",
    "    if dims_to_append < 0:\n",
    "        raise ValueError(f\"input has {x.ndim} dims but target_dims is {target_dims}, which is less\")\n",
    "    return x[(...,) + (None,) * dims_to_append]\n",
    "\n",
    "\n",
    "def tensor2vid(video: torch.Tensor, processor, output_type=\"np\"):\n",
    "    # Based on:\n",
    "    # https://github.com/modelscope/modelscope/blob/1509fdb973e5871f37148a4b5e5964cafd43e64d/modelscope/pipelines/multi_modal/text_to_video_synthesis_pipeline.py#L78\n",
    "\n",
    "    batch_size, channels, num_frames, height, width = video.shape\n",
    "    outputs = []\n",
    "    for batch_idx in range(batch_size):\n",
    "        batch_vid = video[batch_idx].permute(1, 0, 2, 3)\n",
    "        batch_output = processor.postprocess(batch_vid, output_type)\n",
    "\n",
    "        outputs.append(batch_output)\n",
    "\n",
    "    return outputs\n",
    "\n",
    "\n",
    "class OVStableVideoDiffusionPipeline(DiffusionPipeline):\n",
    "    r\"\"\"\n",
    "    Pipeline to generate video from an input image using Stable Video Diffusion.\n",
    "\n",
    "    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods\n",
    "    implemented for all pipelines (downloading, saving, running on a particular device, etc.).\n",
    "\n",
    "    Args:\n",
    "        vae ([`AutoencoderKL`]):\n",
    "            Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.\n",
    "        image_encoder ([`~transformers.CLIPVisionModelWithProjection`]):\n",
    "            Frozen CLIP image-encoder ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K)).\n",
    "        unet ([`UNetSpatioTemporalConditionModel`]):\n",
    "            A `UNetSpatioTemporalConditionModel` to denoise the encoded image latents.\n",
    "        scheduler ([`EulerDiscreteScheduler`]):\n",
    "            A scheduler to be used in combination with `unet` to denoise the encoded image latents.\n",
    "        feature_extractor ([`~transformers.CLIPImageProcessor`]):\n",
    "            A `CLIPImageProcessor` to extract features from generated images.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        vae_encoder,\n",
    "        image_encoder,\n",
    "        unet,\n",
    "        vae_decoder,\n",
    "        scheduler,\n",
    "        feature_extractor,\n",
    "    ):\n",
    "        super().__init__()\n",
    "        self.vae_encoder = vae_encoder\n",
    "        self.vae_decoder = vae_decoder\n",
    "        self.image_encoder = image_encoder\n",
    "        self.register_to_config(unet=unet)\n",
    "        self.scheduler = scheduler\n",
    "        self.feature_extractor = feature_extractor\n",
    "        self.vae_scale_factor = 2 ** (4 - 1)\n",
    "        self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)\n",
    "\n",
    "    def _encode_image(self, image, device, num_videos_per_prompt, do_classifier_free_guidance):\n",
    "        dtype = torch.float32\n",
    "\n",
    "        if not isinstance(image, torch.Tensor):\n",
    "            image = self.image_processor.pil_to_numpy(image)\n",
    "            image = self.image_processor.numpy_to_pt(image)\n",
    "\n",
    "            # We normalize the image before resizing to match with the original implementation.\n",
    "            # Then we unnormalize it after resizing.\n",
    "            image = image * 2.0 - 1.0\n",
    "            image = _resize_with_antialiasing(image, (224, 224))\n",
    "            image = (image + 1.0) / 2.0\n",
    "\n",
    "            # Normalize the image with for CLIP input\n",
    "            image = self.feature_extractor(\n",
    "                images=image,\n",
    "                do_normalize=True,\n",
    "                do_center_crop=False,\n",
    "                do_resize=False,\n",
    "                do_rescale=False,\n",
    "                return_tensors=\"pt\",\n",
    "            ).pixel_values\n",
    "\n",
    "        image = image.to(device=device, dtype=dtype)\n",
    "        image_embeddings = torch.from_numpy(self.image_encoder(image)[0])\n",
    "        image_embeddings = image_embeddings.unsqueeze(1)\n",
    "\n",
    "        # duplicate image embeddings for each generation per prompt, using mps friendly method\n",
    "        bs_embed, seq_len, _ = image_embeddings.shape\n",
    "        image_embeddings = image_embeddings.repeat(1, num_videos_per_prompt, 1)\n",
    "        image_embeddings = image_embeddings.view(bs_embed * num_videos_per_prompt, seq_len, -1)\n",
    "\n",
    "        if do_classifier_free_guidance:\n",
    "            negative_image_embeddings = torch.zeros_like(image_embeddings)\n",
    "\n",
    "            # For classifier free guidance, we need to do two forward passes.\n",
    "            # Here we concatenate the unconditional and text embeddings into a single batch\n",
    "            # to avoid doing two forward passes\n",
    "            image_embeddings = torch.cat([negative_image_embeddings, image_embeddings])\n",
    "        return image_embeddings\n",
    "\n",
    "    def _encode_vae_image(\n",
    "        self,\n",
    "        image: torch.Tensor,\n",
    "        device,\n",
    "        num_videos_per_prompt,\n",
    "        do_classifier_free_guidance,\n",
    "    ):\n",
    "        image_latents = torch.from_numpy(self.vae_encoder(image)[0])\n",
    "\n",
    "        if do_classifier_free_guidance:\n",
    "            negative_image_latents = torch.zeros_like(image_latents)\n",
    "\n",
    "            # For classifier free guidance, we need to do two forward passes.\n",
    "            # Here we concatenate the unconditional and text embeddings into a single batch\n",
    "            # to avoid doing two forward passes\n",
    "            image_latents = torch.cat([negative_image_latents, image_latents])\n",
    "\n",
    "        # duplicate image_latents for each generation per prompt, using mps friendly method\n",
    "        image_latents = image_latents.repeat(num_videos_per_prompt, 1, 1, 1)\n",
    "\n",
    "        return image_latents\n",
    "\n",
    "    def _get_add_time_ids(\n",
    "        self,\n",
    "        fps,\n",
    "        motion_bucket_id,\n",
    "        noise_aug_strength,\n",
    "        dtype,\n",
    "        batch_size,\n",
    "        num_videos_per_prompt,\n",
    "        do_classifier_free_guidance,\n",
    "    ):\n",
    "        add_time_ids = [fps, motion_bucket_id, noise_aug_strength]\n",
    "\n",
    "        passed_add_embed_dim = 256 * len(add_time_ids)\n",
    "        expected_add_embed_dim = 3 * 256\n",
    "\n",
    "        if expected_add_embed_dim != passed_add_embed_dim:\n",
    "            raise ValueError(\n",
    "                f\"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created. The model has an incorrect config. Please check `unet.config.time_embedding_type` and `text_encoder_2.config.projection_dim`.\"\n",
    "            )\n",
    "\n",
    "        add_time_ids = torch.tensor([add_time_ids], dtype=dtype)\n",
    "        add_time_ids = add_time_ids.repeat(batch_size * num_videos_per_prompt, 1)\n",
    "\n",
    "        if do_classifier_free_guidance:\n",
    "            add_time_ids = torch.cat([add_time_ids, add_time_ids])\n",
    "\n",
    "        return add_time_ids\n",
    "\n",
    "    def decode_latents(self, latents, num_frames, decode_chunk_size=14):\n",
    "        # [batch, frames, channels, height, width] -> [batch*frames, channels, height, width]\n",
    "        latents = latents.flatten(0, 1)\n",
    "\n",
    "        latents = 1 / 0.18215 * latents\n",
    "\n",
    "        # decode decode_chunk_size frames at a time to avoid OOM\n",
    "        frames = []\n",
    "        for i in range(0, latents.shape[0], decode_chunk_size):\n",
    "            frame = torch.from_numpy(self.vae_decoder([latents[i : i + decode_chunk_size], num_frames])[0])\n",
    "            frames.append(frame)\n",
    "        frames = torch.cat(frames, dim=0)\n",
    "\n",
    "        # [batch*frames, channels, height, width] -> [batch, channels, frames, height, width]\n",
    "        frames = frames.reshape(-1, num_frames, *frames.shape[1:]).permute(0, 2, 1, 3, 4)\n",
    "\n",
    "        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16\n",
    "        frames = frames.float()\n",
    "        return frames\n",
    "\n",
    "    def check_inputs(self, image, height, width):\n",
    "        if not isinstance(image, torch.Tensor) and not isinstance(image, PIL.Image.Image) and not isinstance(image, list):\n",
    "            raise ValueError(\"`image` has to be of type `torch.FloatTensor` or `PIL.Image.Image` or `List[PIL.Image.Image]` but is\" f\" {type(image)}\")\n",
    "\n",
    "        if height % 8 != 0 or width % 8 != 0:\n",
    "            raise ValueError(f\"`height` and `width` have to be divisible by 8 but are {height} and {width}.\")\n",
    "\n",
    "    def prepare_latents(\n",
    "        self,\n",
    "        batch_size,\n",
    "        num_frames,\n",
    "        num_channels_latents,\n",
    "        height,\n",
    "        width,\n",
    "        dtype,\n",
    "        device,\n",
    "        generator,\n",
    "        latents=None,\n",
    "    ):\n",
    "        shape = (\n",
    "            batch_size,\n",
    "            num_frames,\n",
    "            num_channels_latents // 2,\n",
    "            height // self.vae_scale_factor,\n",
    "            width // self.vae_scale_factor,\n",
    "        )\n",
    "        if isinstance(generator, list) and len(generator) != batch_size:\n",
    "            raise ValueError(\n",
    "                f\"You have passed a list of generators of length {len(generator)}, but requested an effective batch\"\n",
    "                f\" size of {batch_size}. Make sure the batch size matches the length of the generators.\"\n",
    "            )\n",
    "\n",
    "        if latents is None:\n",
    "            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)\n",
    "        else:\n",
    "            latents = latents.to(device)\n",
    "\n",
    "        # scale the initial noise by the standard deviation required by the scheduler\n",
    "        latents = latents * self.scheduler.init_noise_sigma\n",
    "        return latents\n",
    "\n",
    "    @torch.no_grad()\n",
    "    def __call__(\n",
    "        self,\n",
    "        image: Union[PIL.Image.Image, List[PIL.Image.Image], torch.FloatTensor],\n",
    "        height: int = 320,\n",
    "        width: int = 512,\n",
    "        num_frames: Optional[int] = 8,\n",
    "        num_inference_steps: int = 4,\n",
    "        min_guidance_scale: float = 1.0,\n",
    "        max_guidance_scale: float = 1.2,\n",
    "        fps: int = 7,\n",
    "        motion_bucket_id: int = 80,\n",
    "        noise_aug_strength: int = 0.01,\n",
    "        decode_chunk_size: Optional[int] = None,\n",
    "        num_videos_per_prompt: Optional[int] = 1,\n",
    "        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,\n",
    "        latents: Optional[torch.FloatTensor] = None,\n",
    "        output_type: Optional[str] = \"pil\",\n",
    "        callback_on_step_end: Optional[Callable[[int, int, Dict], None]] = None,\n",
    "        callback_on_step_end_tensor_inputs: List[str] = [\"latents\"],\n",
    "        return_dict: bool = True,\n",
    "    ):\n",
    "        r\"\"\"\n",
    "        The call function to the pipeline for generation.\n",
    "\n",
    "        Args:discussed\n",
    "            image (`PIL.Image.Image` or `List[PIL.Image.Image]` or `torch.FloatTensor`):\n",
    "                Image or images to guide image generation. If you provide a tensor, it needs to be compatible with\n",
    "                [`CLIPImageProcessor`](https://huggingface.co/lambdalabs/sd-image-variations-diffusers/blob/main/feature_extractor/preprocessor_config.json).\n",
    "            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):\n",
    "                The height in pixels of the generated image.\n",
    "            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):\n",
    "                The width in pixels of the generated image.\n",
    "            num_frames (`int`, *optional*):\n",
    "                The number of video frames to generate. Defaults to 14 for `stable-video-diffusion-img2vid` and to 25 for `stable-video-diffusion-img2vid-xt`\n",
    "            num_inference_steps (`int`, *optional*, defaults to 25):\n",
    "\n",
    "\n",
    "                The number of denoising steps. More denoising steps usually lead to a higher quality image at the\n",
    "                expense of slower inference. This parameter is modulated by `strength`.\n",
    "            min_guidance_scale (`float`, *optional*, defaults to 1.0):\n",
    "                The minimum guidance scale. Used for the classifier free guidance with first frame.\n",
    "            max_guidance_scale (`float`, *optional*, defaults to 3.0):\n",
    "                The maximum guidance scale. Used for the classifier free guidance with last frame.\n",
    "            fps (`int`, *optional*, defaults to 7):\n",
    "                Frames per second. The rate at which the generated images shall be exported to a video after generation.\n",
    "                Note that Stable Diffusion Video's UNet was micro-conditioned on fps-1 during training.\n",
    "            motion_bucket_id (`int`, *optional*, defaults to 127):\n",
    "                The motion bucket ID. Used as conditioning for the generation. The higher the number the more motion will be in the video.\n",
    "            noise_aug_strength (`int`, *optional*, defaults to 0.02):\n",
    "                The amount of noise added to the init image, the higher it is the less the video will look like the init image. Increase it for more motion.\n",
    "            decode_chunk_size (`int`, *optional*):\n",
    "                The number of frames to decode at a time. The higher the chunk size, the higher the temporal consistency\n",
    "                between frames, but also the higher the memory consumption. By default, the decoder will decode all frames at once\n",
    "                for maximal quality. Reduce `decode_chunk_size` to reduce memory usage.\n",
    "            num_videos_per_prompt (`int`, *optional*, defaults to 1):\n",
    "                The number of images to generate per prompt.\n",
    "            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):\n",
    "                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make\n",
    "                generation deterministic.\n",
    "            latents (`torch.FloatTensor`, *optional*):\n",
    "                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image\n",
    "                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents\n",
    "                tensor is generated by sampling using the supplied random `generator`.\n",
    "            output_type (`str`, *optional*, defaults to `\"pil\"`):\n",
    "                The output format of the generated image. Choose between `PIL.Image` or `np.array`.\n",
    "            callback_on_step_end (`Callable`, *optional*):\n",
    "                A function that calls at the end of each denoising steps during the inference. The function is called\n",
    "                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,\n",
    "                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by\n",
    "                `callback_on_step_end_tensor_inputs`.\n",
    "            callback_on_step_end_tensor_inputs (`List`, *optional*):\n",
    "                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list\n",
    "                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the\n",
    "                `._callback_tensor_inputs` attribute of your pipeline class.\n",
    "            return_dict (`bool`, *optional*, defaults to `True`):\n",
    "                Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a\n",
    "                plain tuple.\n",
    "\n",
    "        Returns:\n",
    "            [`~pipelines.stable_diffusion.StableVideoDiffusionPipelineOutput`] or `tuple`:\n",
    "                If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableVideoDiffusionPipelineOutput`] is returned,\n",
    "                otherwise a `tuple` is returned where the first element is a list of list with the generated frames.\n",
    "\n",
    "        Examples:\n",
    "\n",
    "        ```py\n",
    "        from diffusers import StableVideoDiffusionPipeline\n",
    "        from diffusers.utils import load_image, export_to_video\n",
    "\n",
    "        pipe = StableVideoDiffusionPipeline.from_pretrained(\"stabilityai/stable-video-diffusion-img2vid-xt\", torch_dtype=torch.float16, variant=\"fp16\")\n",
    "        pipe.to(\"cuda\")\n",
    "\n",
    "        image = load_image(\"https://lh3.googleusercontent.com/y-iFOHfLTwkuQSUegpwDdgKmOjRSTvPxat63dQLB25xkTs4lhIbRUFeNBWZzYf370g=s1200\")\n",
    "        image = image.resize((1024, 576))\n",
    "\n",
    "        frames = pipe(image, num_frames=25, decode_chunk_size=8).frames[0]\n",
    "        export_to_video(frames, \"generated.mp4\", fps=7)\n",
    "        ```\n",
    "        \"\"\"\n",
    "        # 0. Default height and width to unet\n",
    "        height = height or 96 * self.vae_scale_factor\n",
    "        width = width or 96 * self.vae_scale_factor\n",
    "\n",
    "        num_frames = num_frames if num_frames is not None else 25\n",
    "        decode_chunk_size = decode_chunk_size if decode_chunk_size is not None else num_frames\n",
    "\n",
    "        # 1. Check inputs. Raise error if not correct\n",
    "        self.check_inputs(image, height, width)\n",
    "\n",
    "        # 2. Define call parameters\n",
    "        if isinstance(image, PIL.Image.Image):\n",
    "            batch_size = 1\n",
    "        elif isinstance(image, list):\n",
    "            batch_size = len(image)\n",
    "        else:\n",
    "            batch_size = image.shape[0]\n",
    "        device = torch.device(\"cpu\")\n",
    "\n",
    "        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)\n",
    "        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`\n",
    "        # corresponds to doing no classifier free guidance.\n",
    "        do_classifier_free_guidance = max_guidance_scale > 1.0\n",
    "\n",
    "        # 3. Encode input image\n",
    "        image_embeddings = self._encode_image(image, device, num_videos_per_prompt, do_classifier_free_guidance)\n",
    "\n",
    "        # NOTE: Stable Diffusion Video was conditioned on fps - 1, which\n",
    "        # is why it is reduced here.\n",
    "        # See: https://github.com/Stability-AI/generative-models/blob/ed0997173f98eaf8f4edf7ba5fe8f15c6b877fd3/scripts/sampling/simple_video_sample.py#L188\n",
    "        fps = fps - 1\n",
    "\n",
    "        # 4. Encode input image using VAE\n",
    "        image = self.image_processor.preprocess(image, height=height, width=width)\n",
    "        noise = randn_tensor(image.shape, generator=generator, device=image.device, dtype=image.dtype)\n",
    "        image = image + noise_aug_strength * noise\n",
    "\n",
    "        image_latents = self._encode_vae_image(image, device, num_videos_per_prompt, do_classifier_free_guidance)\n",
    "        image_latents = image_latents.to(image_embeddings.dtype)\n",
    "\n",
    "        # Repeat the image latents for each frame so we can concatenate them with the noise\n",
    "        # image_latents [batch, channels, height, width] ->[batch, num_frames, channels, height, width]\n",
    "        image_latents = image_latents.unsqueeze(1).repeat(1, num_frames, 1, 1, 1)\n",
    "\n",
    "        # 5. Get Added Time IDs\n",
    "        added_time_ids = self._get_add_time_ids(\n",
    "            fps,\n",
    "            motion_bucket_id,\n",
    "            noise_aug_strength,\n",
    "            image_embeddings.dtype,\n",
    "            batch_size,\n",
    "            num_videos_per_prompt,\n",
    "            do_classifier_free_guidance,\n",
    "        )\n",
    "        added_time_ids = added_time_ids\n",
    "\n",
    "        # 4. Prepare timesteps\n",
    "        self.scheduler.set_timesteps(num_inference_steps, device=device)\n",
    "        timesteps = self.scheduler.timesteps\n",
    "        # 5. Prepare latent variables\n",
    "        num_channels_latents = 8\n",
    "        latents = self.prepare_latents(\n",
    "            batch_size * num_videos_per_prompt,\n",
    "            num_frames,\n",
    "            num_channels_latents,\n",
    "            height,\n",
    "            width,\n",
    "            image_embeddings.dtype,\n",
    "            device,\n",
    "            generator,\n",
    "            latents,\n",
    "        )\n",
    "\n",
    "        # 7. Prepare guidance scale\n",
    "        guidance_scale = torch.linspace(min_guidance_scale, max_guidance_scale, num_frames).unsqueeze(0)\n",
    "        guidance_scale = guidance_scale.to(device, latents.dtype)\n",
    "        guidance_scale = guidance_scale.repeat(batch_size * num_videos_per_prompt, 1)\n",
    "        guidance_scale = _append_dims(guidance_scale, latents.ndim)\n",
    "\n",
    "        # 8. Denoising loop\n",
    "        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order\n",
    "        num_timesteps = len(timesteps)\n",
    "        with self.progress_bar(total=num_inference_steps) as progress_bar:\n",
    "            for i, t in enumerate(timesteps):\n",
    "                # expand the latents if we are doing classifier free guidance\n",
    "                latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents\n",
    "                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)\n",
    "\n",
    "                # Concatenate image_latents over channels dimention\n",
    "                latent_model_input = torch.cat([latent_model_input, image_latents], dim=2)\n",
    "                # predict the noise residual\n",
    "                noise_pred = torch.from_numpy(\n",
    "                    self.unet(\n",
    "                        [\n",
    "                            latent_model_input,\n",
    "                            t,\n",
    "                            image_embeddings,\n",
    "                            added_time_ids,\n",
    "                        ]\n",
    "                    )[0]\n",
    "                )\n",
    "                # perform guidance\n",
    "                if do_classifier_free_guidance:\n",
    "                    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)\n",
    "                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)\n",
    "\n",
    "                # compute the previous noisy sample x_t -> x_t-1\n",
    "                latents = self.scheduler.step(noise_pred, t, latents).prev_sample\n",
    "\n",
    "                if callback_on_step_end is not None:\n",
    "                    callback_kwargs = {}\n",
    "                    for k in callback_on_step_end_tensor_inputs:\n",
    "                        callback_kwargs[k] = locals()[k]\n",
    "                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)\n",
    "\n",
    "                    latents = callback_outputs.pop(\"latents\", latents)\n",
    "\n",
    "                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):\n",
    "                    progress_bar.update()\n",
    "\n",
    "        if not output_type == \"latent\":\n",
    "            frames = self.decode_latents(latents, num_frames, decode_chunk_size)\n",
    "            frames = tensor2vid(frames, self.image_processor, output_type=output_type)\n",
    "        else:\n",
    "            frames = latents\n",
    "\n",
    "        if not return_dict:\n",
    "            return frames\n",
    "\n",
    "        return StableVideoDiffusionPipelineOutput(frames=frames)\n",
    "\n",
    "\n",
    "# resizing utils\n",
    "def _resize_with_antialiasing(input, size, interpolation=\"bicubic\", align_corners=True):\n",
    "    h, w = input.shape[-2:]\n",
    "    factors = (h / size[0], w / size[1])\n",
    "\n",
    "    # First, we have to determine sigma\n",
    "    # Taken from skimage: https://github.com/scikit-image/scikit-image/blob/v0.19.2/skimage/transform/_warps.py#L171\n",
    "    sigmas = (\n",
    "        max((factors[0] - 1.0) / 2.0, 0.001),\n",
    "        max((factors[1] - 1.0) / 2.0, 0.001),\n",
    "    )\n",
    "    # Now kernel size. Good results are for 3 sigma, but that is kind of slow. Pillow uses 1 sigma\n",
    "    # https://github.com/python-pillow/Pillow/blob/master/src/libImaging/Resample.c#L206\n",
    "    # But they do it in the 2 passes, which gives better results. Let's try 2 sigmas for now\n",
    "    ks = int(max(2.0 * 2 * sigmas[0], 3)), int(max(2.0 * 2 * sigmas[1], 3))\n",
    "\n",
    "    # Make sure it is odd\n",
    "    if (ks[0] % 2) == 0:\n",
    "        ks = ks[0] + 1, ks[1]\n",
    "\n",
    "    if (ks[1] % 2) == 0:\n",
    "\n",
    "        ks = ks[0], ks[1] + 1\n",
    "\n",
    "    input = _gaussian_blur2d(input, ks, sigmas)\n",
    "\n",
    "    output = torch.nn.functional.interpolate(input, size=size, mode=interpolation, align_corners=align_corners)\n",
    "    return output\n",
    "\n",
    "\n",
    "def _compute_padding(kernel_size):\n",
    "    \"\"\"Compute padding tuple.\"\"\"\n",
    "    # 4 or 6 ints:  (padding_left, padding_right,padding_top,padding_bottom)\n",
    "    # https://pytorch.org/docs/stable/nn.html#torch.nn.functional.pad\n",
    "    if len(kernel_size) < 2:\n",
    "        raise AssertionError(kernel_size)\n",
    "    computed = [k - 1 for k in kernel_size]\n",
    "\n",
    "    # for even kernels we need to do asymmetric padding :(\n",
    "    out_padding = 2 * len(kernel_size) * [0]\n",
    "\n",
    "    for i in range(len(kernel_size)):\n",
    "        computed_tmp = computed[-(i + 1)]\n",
    "\n",
    "        pad_front = computed_tmp // 2\n",
    "        pad_rear = computed_tmp - pad_front\n",
    "\n",
    "        out_padding[2 * i + 0] = pad_front\n",
    "        out_padding[2 * i + 1] = pad_rear\n",
    "\n",
    "    return out_padding\n",
    "\n",
    "\n",
    "def _filter2d(input, kernel):\n",
    "    # prepare kernel\n",
    "    b, c, h, w = input.shape\n",
    "    tmp_kernel = kernel[:, None, ...].to(device=input.device, dtype=input.dtype)\n",
    "\n",
    "    tmp_kernel = tmp_kernel.expand(-1, c, -1, -1)\n",
    "\n",
    "    height, width = tmp_kernel.shape[-2:]\n",
    "\n",
    "    padding_shape: list[int] = _compute_padding([height, width])\n",
    "    input = torch.nn.functional.pad(input, padding_shape, mode=\"reflect\")\n",
    "\n",
    "    # kernel and input tensor reshape to align element-wise or batch-wise params\n",
    "    tmp_kernel = tmp_kernel.reshape(-1, 1, height, width)\n",
    "    input = input.view(-1, tmp_kernel.size(0), input.size(-2), input.size(-1))\n",
    "\n",
    "    # convolve the tensor with the kernel.\n",
    "    output = torch.nn.functional.conv2d(input, tmp_kernel, groups=tmp_kernel.size(0), padding=0, stride=1)\n",
    "\n",
    "    out = output.view(b, c, h, w)\n",
    "    return out\n",
    "\n",
    "\n",
    "def _gaussian(window_size: int, sigma):\n",
    "    if isinstance(sigma, float):\n",
    "        sigma = torch.tensor([[sigma]])\n",
    "\n",
    "    batch_size = sigma.shape[0]\n",
    "\n",
    "    x = (torch.arange(window_size, device=sigma.device, dtype=sigma.dtype) - window_size // 2).expand(batch_size, -1)\n",
    "\n",
    "    if window_size % 2 == 0:\n",
    "\n",
    "        x = x + 0.5\n",
    "\n",
    "    gauss = torch.exp(-x.pow(2.0) / (2 * sigma.pow(2.0)))\n",
    "\n",
    "    return gauss / gauss.sum(-1, keepdim=True)\n",
    "\n",
    "\n",
    "def _gaussian_blur2d(input, kernel_size, sigma):\n",
    "    if isinstance(sigma, tuple):\n",
    "        sigma = torch.tensor([sigma], dtype=input.dtype)\n",
    "    else:\n",
    "        sigma = sigma.to(dtype=input.dtype)\n",
    "\n",
    "    ky, kx = int(kernel_size[0]), int(kernel_size[1])\n",
    "    bs = sigma.shape[0]\n",
    "    kernel_x = _gaussian(kx, sigma[:, 1].view(bs, 1))\n",
    "    kernel_y = _gaussian(ky, sigma[:, 0].view(bs, 1))\n",
    "    out_x = _filter2d(input, kernel_x[..., None, :])\n",
    "    out = _filter2d(out_x, kernel_y[..., None])\n",
    "\n",
    "    return out"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "72845107-4b68-4f36-9ffb-9a9c31cca63c",
   "metadata": {},
   "source": [
    "## Run Video Generation\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "### Select Inference Device\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8a9c9ba7-8234-44ac-bbb3-708e3bea5640",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "b3a026ee8cad4ce7bdbf9c155dc4d9e7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Dropdown(description='Device:', index=4, options=('CPU', 'GPU.0', 'GPU.1', 'GPU.2', 'AUTO'), value='AUTO')"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import ipywidgets as widgets\n",
    "\n",
    "core = ov.Core()\n",
    "\n",
    "device = widgets.Dropdown(\n",
    "    options=core.available_devices + [\"AUTO\"],\n",
    "    value=\"AUTO\",\n",
    "    description=\"Device:\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "device"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "bbaf9418-bbff-4275-b43b-b1c14cc92aca",
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import CLIPImageProcessor\n",
    "\n",
    "\n",
    "vae_encoder = core.compile_model(VAE_ENCODER_PATH, device.value)\n",
    "image_encoder = core.compile_model(IMAGE_ENCODER_PATH, device.value)\n",
    "unet = core.compile_model(UNET_PATH, device.value)\n",
    "vae_decoder = core.compile_model(VAE_DECODER_PATH, device.value)\n",
    "scheduler = AnimateLCMSVDStochasticIterativeScheduler.from_pretrained(MODEL_DIR / \"scheduler\")\n",
    "feature_extractor = CLIPImageProcessor.from_pretrained(MODEL_DIR / \"feature_extractor\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3ba0d612-2ba4-4297-94cb-9ca54f5c14a3",
   "metadata": {},
   "source": [
    "Now, let's see model in action.\n",
    "> Please, note, video generation is memory and time consuming process. For reducing memory consumption, we decreased input video resolution to 576x320 and number of generated frames that may affect quality of generated video. You can change these settings manually providing `height`, `width` and `num_frames` parameters into pipeline. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "0c722800-7800-4a81-8a39-369dd182237e",
   "metadata": {},
   "outputs": [],
   "source": [
    "ov_pipe = OVStableVideoDiffusionPipeline(vae_encoder, image_encoder, unet, vae_decoder, scheduler, feature_extractor)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "02b62761-35d4-46be-a7eb-bdc8774de7cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "96d6580418db44f6a36967f7b8c50b9a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/4 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "denoise currently\n",
      "tensor(128.5637)\n",
      "denoise currently\n",
      "tensor(13.6784)\n",
      "denoise currently\n",
      "tensor(0.4969)\n",
      "denoise currently\n",
      "tensor(0.)\n"
     ]
    }
   ],
   "source": [
    "frames = ov_pipe(\n",
    "    image,\n",
    "    num_inference_steps=4,\n",
    "    motion_bucket_id=60,\n",
    "    num_frames=8,\n",
    "    height=320,\n",
    "    width=512,\n",
    "    generator=torch.manual_seed(12342),\n",
    ").frames[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "5e55dee5-fbb9-4616-a4d1-14f411093bb2",
   "metadata": {},
   "outputs": [],
   "source": [
    "out_path = Path(\"generated.mp4\")\n",
    "\n",
    "export_to_video(frames, str(out_path), fps=7)\n",
    "frames[0].save(\n",
    "    \"generated.gif\",\n",
    "    save_all=True,\n",
    "    append_images=frames[1:],\n",
    "    optimize=False,\n",
    "    duration=120,\n",
    "    loop=0,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "abf5294e-d76a-496d-a5d1-0b3f7e5eafc3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"generated.gif\">"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import HTML\n",
    "\n",
    "HTML('<img src=\"generated.gif\">')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c042cf53",
   "metadata": {},
   "source": [
    "## Quantization\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "[NNCF](https://github.com/openvinotoolkit/nncf/) enables post-training quantization by adding quantization layers into model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers. Quantized operations are executed in `INT8` instead of `FP32`/`FP16` making model inference faster.\n",
    "\n",
    "According to `OVStableVideoDiffusionPipeline` structure, the diffusion model takes up significant portion of the overall pipeline execution time. Now we will show you how to optimize the UNet part using [NNCF](https://github.com/openvinotoolkit/nncf/) to reduce computation cost and speed up the pipeline. Quantizing the rest of the pipeline does not significantly improve inference performance but can lead to a substantial degradation of accuracy. That's why we use only weight compression for the `vae encoder` and `vae decoder` to reduce the memory footprint.\n",
    "\n",
    "For the UNet model we apply quantization in hybrid mode which means that we quantize: (1) weights of MatMul and Embedding layers and (2) activations of other layers. The steps are the following:\n",
    "\n",
    "1. Create a calibration dataset for quantization.\n",
    "2. Collect operations with weights.\n",
    "3. Run `nncf.compress_model()` to compress only the model weights.\n",
    "4. Run `nncf.quantize()` on the compressed model with weighted operations ignored by providing `ignored_scope` parameter.\n",
    "5. Save the `INT8` model using `openvino.save_model()` function.\n",
    "\n",
    "\n",
    "Please select below whether you would like to run quantization to improve model inference speed.\n",
    "\n",
 **NOTE**: Quantization is time">
    "> **NOTE**: Quantization is a time- and memory-consuming operation. Running the quantization code below may take some time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "cb033895",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "81a05f76b1eb4e7f8f199f39b4bdb9f7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Checkbox(value=True, description='Quantization')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "to_quantize = widgets.Checkbox(\n",
    "    value=True,\n",
    "    description=\"Quantization\",\n",
    "    disabled=False,\n",
    ")\n",
    "\n",
    "to_quantize"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a44c3174",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fetch `skip_kernel_extension` module\n",
    "import requests\n",
    "\n",
    "r = requests.get(\n",
    "    url=\"https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py\",\n",
    ")\n",
    "open(\"skip_kernel_extension.py\", \"w\").write(r.text)\n",
    "\n",
    "ov_int8_pipeline = None\n",
    "OV_INT8_UNET_PATH = MODEL_DIR / \"unet_int8.xml\"\n",
    "OV_INT8_VAE_ENCODER_PATH = MODEL_DIR / \"vae_encoder_int8.xml\"\n",
    "OV_INT8_VAE_DECODER_PATH = MODEL_DIR / \"vae_decoder_int8.xml\"\n",
    "\n",
    "%load_ext skip_kernel_extension"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a0bfb34e",
   "metadata": {},
   "source": [
    "### Prepare calibration dataset\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "We use a portion of [`fusing/instructpix2pix-1000-samples`](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) dataset from Hugging Face as calibration data.\n",
    "To collect intermediate model inputs for UNet optimization we should customize `CompiledModel`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "3f6093ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "import datasets\n",
    "import numpy as np\n",
    "from tqdm.notebook import tqdm\n",
    "from IPython.utils import io\n",
    "\n",
    "\n",
    "class CompiledModelDecorator(ov.CompiledModel):\n",
    "    def __init__(self, compiled_model: ov.CompiledModel, data_cache: List[Any] = None, keep_prob: float = 0.5):\n",
    "        super().__init__(compiled_model)\n",
    "        self.data_cache = data_cache if data_cache is not None else []\n",
    "        self.keep_prob = keep_prob\n",
    "\n",
    "    def __call__(self, *args, **kwargs):\n",
    "        if np.random.rand() <= self.keep_prob:\n",
    "            self.data_cache.append(*args)\n",
    "        return super().__call__(*args, **kwargs)\n",
    "\n",
    "\n",
    "def collect_calibration_data(ov_pipe, calibration_dataset_size: int, num_inference_steps: int = 50) -> List[Dict]:\n",
    "    original_unet = ov_pipe.unet\n",
    "    calibration_data = []\n",
    "    ov_pipe.unet = CompiledModelDecorator(original_unet, calibration_data, keep_prob=1)\n",
    "\n",
    "    dataset = datasets.load_dataset(\"fusing/instructpix2pix-1000-samples\", split=\"train\", streaming=False).shuffle(seed=42)\n",
    "    # Run inference for data collection\n",
    "    pbar = tqdm(total=calibration_dataset_size)\n",
    "    for batch in dataset:\n",
    "        image = batch[\"input_image\"]\n",
    "\n",
    "        with io.capture_output() as captured:\n",
    "            ov_pipe(\n",
    "                image,\n",
    "                num_inference_steps=4,\n",
    "                motion_bucket_id=60,\n",
    "                num_frames=8,\n",
    "                height=256,\n",
    "                width=256,\n",
    "                generator=torch.manual_seed(12342),\n",
    "            )\n",
    "        pbar.update(len(calibration_data) - pbar.n)\n",
    "        if len(calibration_data) >= calibration_dataset_size:\n",
    "            break\n",
    "\n",
    "    ov_pipe.unet = original_unet\n",
    "    return calibration_data[:calibration_dataset_size]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bfdee6ad",
   "metadata": {
    "test_replace": {
     "subset_size = 200": "subset_size = 4"
    }
   },
   "outputs": [],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "if not OV_INT8_UNET_PATH.exists():\n",
    "    subset_size = 200\n",
    "    calibration_data = collect_calibration_data(ov_pipe, calibration_dataset_size=subset_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a054e0fa",
   "metadata": {},
   "source": [
    "### Run Hybrid Model Quantization\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "2a7434b0",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "from collections import deque\n",
    "\n",
    "def get_operation_const_op(operation, const_port_id: int):\n",
    "    node = operation.input_value(const_port_id).get_node()\n",
    "    queue = deque([node])\n",
    "    constant_node = None\n",
    "    allowed_propagation_types_list = [\"Convert\", \"FakeQuantize\", \"Reshape\"]\n",
    "\n",
    "    while len(queue) != 0:\n",
    "        curr_node = queue.popleft()\n",
    "        if curr_node.get_type_name() == \"Constant\":\n",
    "            constant_node = curr_node\n",
    "            break\n",
    "        if len(curr_node.inputs()) == 0:\n",
    "            break\n",
    "        if curr_node.get_type_name() in allowed_propagation_types_list:\n",
    "            queue.append(curr_node.input_value(0).get_node())\n",
    "\n",
    "    return constant_node\n",
    "\n",
    "\n",
    "def is_embedding(node) -> bool:\n",
    "    allowed_types_list = [\"f16\", \"f32\", \"f64\"]\n",
    "    const_port_id = 0\n",
    "    input_tensor = node.input_value(const_port_id)\n",
    "    if input_tensor.get_element_type().get_type_name() in allowed_types_list:\n",
    "        const_node = get_operation_const_op(node, const_port_id)\n",
    "        if const_node is not None:\n",
    "            return True\n",
    "\n",
    "    return False\n",
    "\n",
    "\n",
    "def collect_ops_with_weights(model):\n",
    "    ops_with_weights = []\n",
    "    for op in model.get_ops():\n",
    "        if op.get_type_name() == \"MatMul\":\n",
    "            constant_node_0 = get_operation_const_op(op, const_port_id=0)\n",
    "            constant_node_1 = get_operation_const_op(op, const_port_id=1)\n",
    "            if constant_node_0 or constant_node_1:\n",
    "                ops_with_weights.append(op.get_friendly_name())\n",
    "        if op.get_type_name() == \"Gather\" and is_embedding(op):\n",
    "            ops_with_weights.append(op.get_friendly_name())\n",
    "\n",
    "    return ops_with_weights"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1ef9c787",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "import nncf\n",
    "import logging\n",
    "from nncf.quantization.advanced_parameters import AdvancedSmoothQuantParameters\n",
    "\n",
    "nncf.set_log_level(logging.ERROR)\n",
    "\n",
    "if not OV_INT8_UNET_PATH.exists():\n",
    "    diffusion_model = core.read_model(UNET_PATH)\n",
    "    unet_ignored_scope = collect_ops_with_weights(diffusion_model)\n",
    "    compressed_diffusion_model = nncf.compress_weights(diffusion_model, ignored_scope=nncf.IgnoredScope(types=['Convolution']))\n",
    "    quantized_diffusion_model = nncf.quantize(\n",
    "        model=diffusion_model,\n",
    "        calibration_dataset=nncf.Dataset(calibration_data),\n",
    "        subset_size=subset_size,\n",
    "        model_type=nncf.ModelType.TRANSFORMER,\n",
    "        # We additionally ignore the first convolution to improve the quality of generations\n",
    "        ignored_scope=nncf.IgnoredScope(names=unet_ignored_scope + [\"__module.conv_in/aten::_convolution/Convolution\"]),\n",
    "        advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alphas=AdvancedSmoothQuantParameters(matmul=-1))\n",
    "    )\n",
    "    ov.save_model(quantized_diffusion_model, OV_INT8_UNET_PATH)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17705d82",
   "metadata": {},
   "source": [
    "### Run Weight Compression\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Quantizing of the `vae encoder` and `vae decoder` does not significantly improve inference performance but can lead to a substantial degradation of accuracy. Only weight compression will be applied for footprint reduction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "id": "f9f4a468",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:Statistics of the bitwidth distribution:\n",
      "┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑\n",
      "│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │\n",
      "┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥\n",
      "│              8 │ 98% (29 / 32)               │ 0% (0 / 3)                             │\n",
      "├────────────────┼─────────────────────────────┼────────────────────────────────────────┤\n",
      "│              4 │ 2% (3 / 32)                 │ 100% (3 / 3)                           │\n",
      "┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "25122a77cddd48acbf3d12eee8a59cf1",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Output()"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO:nncf:Statistics of the bitwidth distribution:\n",
      "┍━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┯━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┑\n",
      "│   Num bits (N) │ % all parameters (layers)   │ % ratio-defining parameters (layers)   │\n",
      "┝━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥\n",
      "│              8 │ 99% (65 / 68)               │ 0% (0 / 3)                             │\n",
      "├────────────────┼─────────────────────────────┼────────────────────────────────────────┤\n",
      "│              4 │ 1% (3 / 68)                 │ 100% (3 / 3)                           │\n",
      "┕━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┷━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "00594c7aaf7c4ffaaaac4b09e32899fc",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Output()"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
       "</pre>\n"
      ],
      "text/plain": [
       "\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "nncf.set_log_level(logging.INFO)\n",
    "\n",
    "if not OV_INT8_VAE_ENCODER_PATH.exists():\n",
    "    text_encoder_model = core.read_model(VAE_ENCODER_PATH)\n",
    "    compressed_text_encoder_model = nncf.compress_weights(text_encoder_model, mode=nncf.CompressWeightsMode.INT4_SYM, group_size=64)\n",
    "    ov.save_model(compressed_text_encoder_model, OV_INT8_VAE_ENCODER_PATH)\n",
    "\n",
    "if not OV_INT8_VAE_DECODER_PATH.exists():\n",
    "    decoder_model = core.read_model(VAE_DECODER_PATH)\n",
    "    compressed_decoder_model = nncf.compress_weights(decoder_model, mode=nncf.CompressWeightsMode.INT4_SYM, group_size=64)\n",
    "    ov.save_model(compressed_decoder_model, OV_INT8_VAE_DECODER_PATH)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9026878f",
   "metadata": {},
   "source": [
    "Let's compare the video generated by the original and optimized pipelines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "id": "b3156d0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "df6f355996444314a0d8619df1edfb3b",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "  0%|          | 0/4 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ltalamanova/env_ci/lib/python3.8/site-packages/diffusers/configuration_utils.py:139: FutureWarning: Accessing config attribute `unet` directly via 'OVStableVideoDiffusionPipeline' object attribute is deprecated. Please access 'unet' over 'OVStableVideoDiffusionPipeline's config object instead, e.g. 'scheduler.config.unet'.\n",
      "  deprecate(\"direct config name access\", \"1.0.0\", deprecation_message, standard_warn=False)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "denoise currently\n",
      "tensor(128.5637)\n",
      "denoise currently\n",
      "tensor(13.6784)\n",
      "denoise currently\n",
      "tensor(0.4969)\n",
      "denoise currently\n",
      "tensor(0.)\n"
     ]
    }
   ],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "ov_int8_vae_encoder = core.compile_model(OV_INT8_VAE_ENCODER_PATH, device.value)\n",
    "ov_int8_unet = core.compile_model(OV_INT8_UNET_PATH, device.value)\n",
    "ov_int8_decoder = core.compile_model(OV_INT8_VAE_DECODER_PATH, device.value)\n",
    "\n",
    "ov_int8_pipeline = OVStableVideoDiffusionPipeline(\n",
    "    ov_int8_vae_encoder, image_encoder, ov_int8_unet, ov_int8_decoder, scheduler, feature_extractor\n",
    ")\n",
    "\n",
    "int8_frames = ov_int8_pipeline(\n",
    "    image,\n",
    "    num_inference_steps=4,\n",
    "    motion_bucket_id=60,\n",
    "    num_frames=8,\n",
    "    height=320,\n",
    "    width=512,\n",
    "    generator=torch.manual_seed(12342),\n",
    ").frames[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "id": "902036a4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<img src=\"generated_int8.gif\">"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "int8_out_path = Path(\"generated_int8.mp4\")\n",
    "\n",
    "export_to_video(frames, str(out_path), fps=7)\n",
    "int8_frames[0].save(\n",
    "    \"generated_int8.gif\",\n",
    "    save_all=True,\n",
    "    append_images=int8_frames[1:],\n",
    "    optimize=False,\n",
    "    duration=120,\n",
    "    loop=0,\n",
    ")\n",
    "HTML('<img src=\"generated_int8.gif\">')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b223a0a7",
   "metadata": {},
   "source": [
    "### Compare model file sizes\n",
    "\n",
    "[back to top ⬆️](#Table-of-contents:)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "id": "7099c21b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vae_encoder compression rate: 2.018\n",
      "unet compression rate: 1.996\n",
      "vae_decoder compression rate: 2.007\n"
     ]
    }
   ],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "fp16_model_paths = [VAE_ENCODER_PATH, UNET_PATH, VAE_DECODER_PATH]\n",
    "int8_model_paths = [OV_INT8_VAE_ENCODER_PATH, OV_INT8_UNET_PATH, OV_INT8_VAE_DECODER_PATH]\n",
    "\n",
    "for fp16_path, int8_path in zip(fp16_model_paths, int8_model_paths):\n",
    "    fp16_ir_model_size = fp16_path.with_suffix(\".bin\").stat().st_size\n",
    "    int8_model_size = int8_path.with_suffix(\".bin\").stat().st_size\n",
    "    print(f\"{fp16_path.stem} compression rate: {fp16_ir_model_size / int8_model_size:.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "65cec1b7",
   "metadata": {},
   "source": [
    "### Compare inference time of the FP16 and INT8 pipelines\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "To measure the inference performance of the `FP16` and `INT8` pipelines, we use median inference time on calibration subset.\n",
    "\n",
    "> **NOTE**: For the most accurate performance estimation, it is recommended to run `benchmark_app` in a terminal/command prompt after closing other applications."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "id": "80d1b146",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "import time\n",
    "\n",
    "def calculate_inference_time(pipeline, validation_data):\n",
    "    inference_time = []\n",
    "    for prompt in validation_data:\n",
    "        start = time.perf_counter()\n",
    "        with io.capture_output() as captured:\n",
    "            _ = pipeline(\n",
    "                image,\n",
    "                num_inference_steps=4,\n",
    "                motion_bucket_id=60,\n",
    "                num_frames=8,\n",
    "                height=320,\n",
    "                width=512,\n",
    "                generator=torch.manual_seed(12342),\n",
    "            )\n",
    "        end = time.perf_counter()\n",
    "        delta = end - start\n",
    "        inference_time.append(delta)\n",
    "    return np.median(inference_time)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "id": "438d896c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Performance speed-up: 1.243\n"
     ]
    }
   ],
   "source": [
    "%%skip not $to_quantize.value\n",
    "\n",
    "validation_size = 3\n",
    "validation_dataset = datasets.load_dataset(\"fusing/instructpix2pix-1000-samples\", split=\"train\", streaming=True).shuffle(seed=42).take(validation_size)\n",
    "validation_data = [data[\"input_image\"] for data in validation_dataset]\n",
    "\n",
    "fp_latency = calculate_inference_time(ov_pipe, validation_data)\n",
    "int8_latency = calculate_inference_time(ov_int8_pipeline, validation_data)\n",
    "print(f\"Performance speed-up: {fp_latency / int8_latency:.3f}\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "f85c9cf6-8b88-462f-86bf-d5df450d82c2",
   "metadata": {},
   "source": [
    "## Interactive Demo\n",
    "[back to top ⬆️](#Table-of-contents:)\n",
    "\n",
    "Please select below whether you would like to use the quantized model to launch the interactive demo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "id": "840decf8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6db1d459b7a04e98b473f23175a3bb2c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Checkbox(value=True, description='Use quantized model')"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quantized_model_present = ov_int8_pipeline is not None\n",
    "\n",
    "use_quantized_model = widgets.Checkbox(\n",
    "    value=quantized_model_present,\n",
    "    description=\"Use quantized model\",\n",
    "    disabled=not quantized_model_present,\n",
    ")\n",
    "\n",
    "use_quantized_model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1fe35f3-4f07-4ebd-9a1e-ae0431450c07",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import gradio as gr\n",
    "import random\n",
    "\n",
    "max_64_bit_int = 2**63 - 1\n",
    "pipeline = ov_int8_pipeline if use_quantized_model.value else ov_pipe\n",
    "\n",
    "example_images_urls = [\n",
    "    \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/resolve/main/test_imgs/ship-7833921_1280.jpg?download=true\",\n",
    "    \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/resolve/main/test_imgs/ai-generated-8476858_1280.png?download=true\",\n",
    "    \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/resolve/main/test_imgs/ai-generated-8481641_1280.jpg?download=true\",\n",
    "    \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/resolve/main/test_imgs/dog-7396912_1280.jpg?download=true\",\n",
    "    \"https://huggingface.co/spaces/wangfuyun/AnimateLCM-SVD/resolve/main/test_imgs/cupcakes-380178_1280.jpg?download=true\",\n",
    "]\n",
    "\n",
    "example_images_dir = Path(\"example_images\")\n",
    "example_images_dir.mkdir(exist_ok=True)\n",
    "example_imgs = []\n",
    "\n",
    "for image_id, url in enumerate(example_images_urls):\n",
    "    img = load_image(url)\n",
    "    image_path = example_images_dir / f\"{image_id}.png\"\n",
    "    img.save(image_path)\n",
    "    example_imgs.append([image_path])\n",
    "\n",
    "\n",
    "def sample(\n",
    "    image: PIL.Image,\n",
    "    seed: Optional[int] = 42,\n",
    "    randomize_seed: bool = True,\n",
    "    motion_bucket_id: int = 127,\n",
    "    fps_id: int = 6,\n",
    "    num_inference_steps: int = 15,\n",
    "    num_frames: int = 4,\n",
    "    max_guidance_scale=1.0,\n",
    "    min_guidance_scale=1.0,\n",
    "    decoding_t: int = 8,  # Number of frames decoded at a time! This eats most VRAM. Reduce if necessary.\n",
    "    output_folder: str = \"outputs\",\n",
    "    progress=gr.Progress(track_tqdm=True),\n",
    "):\n",
    "    if image.mode == \"RGBA\":\n",
    "        image = image.convert(\"RGB\")\n",
    "\n",
    "    if randomize_seed:\n",
    "        seed = random.randint(0, max_64_bit_int)\n",
    "    generator = torch.manual_seed(seed)\n",
    "\n",
    "    output_folder = Path(output_folder)\n",
    "    output_folder.mkdir(exist_ok=True)\n",
    "    base_count = len(list(output_folder.glob(\"*.mp4\")))\n",
    "    video_path = output_folder / f\"{base_count:06d}.mp4\"\n",
    "\n",
    "    frames = pipeline(\n",
    "        image,\n",
    "        decode_chunk_size=decoding_t,\n",
    "        generator=generator,\n",
    "        motion_bucket_id=motion_bucket_id,\n",
    "        noise_aug_strength=0.1,\n",
    "        num_frames=num_frames,\n",
    "        num_inference_steps=num_inference_steps,\n",
    "        max_guidance_scale=max_guidance_scale,\n",
    "        min_guidance_scale=min_guidance_scale,\n",
    "    ).frames[0]\n",
    "    export_to_video(frames, str(video_path), fps=fps_id)\n",
    "\n",
    "    return video_path, seed\n",
    "\n",
    "\n",
    "def resize_image(image, output_size=(512, 320)):\n",
    "    # Calculate aspect ratios\n",
    "    target_aspect = output_size[0] / output_size[1]  # Aspect ratio of the desired size\n",
    "    image_aspect = image.width / image.height  # Aspect ratio of the original image\n",
    "\n",
    "    # Resize then crop if the original image is larger\n",
    "    if image_aspect > target_aspect:\n",
    "        # Resize the image to match the target height, maintaining aspect ratio\n",
    "        new_height = output_size[1]\n",
    "        new_width = int(new_height * image_aspect)\n",
    "        resized_image = image.resize((new_width, new_height), PIL.Image.LANCZOS)\n",
    "        # Calculate coordinates for cropping\n",
    "        left = (new_width - output_size[0]) / 2\n",
    "        top = 0\n",
    "        right = (new_width + output_size[0]) / 2\n",
    "        bottom = output_size[1]\n",
    "    else:\n",
    "        # Resize the image to match the target width, maintaining aspect ratio\n",
    "        new_width = output_size[0]\n",
    "        new_height = int(new_width / image_aspect)\n",
    "        resized_image = image.resize((new_width, new_height), PIL.Image.LANCZOS)\n",
    "        # Calculate coordinates for cropping\n",
    "        left = 0\n",
    "        top = (new_height - output_size[1]) / 2\n",
    "        right = output_size[0]\n",
    "        bottom = (new_height + output_size[1]) / 2\n",
    "\n",
    "    # Crop the image\n",
    "    cropped_image = resized_image.crop((left, top, right, bottom))\n",
    "    return cropped_image\n",
    "\n",
    "\n",
    "with gr.Blocks() as demo:\n",
    "    gr.Markdown(\n",
    "        \"\"\"# Stable Video Diffusion: Image to Video Generation with OpenVINO.\n",
    "  \"\"\"\n",
    "    )\n",
    "    with gr.Row():\n",
    "        with gr.Column():\n",
    "            image_in = gr.Image(label=\"Upload your image\", type=\"pil\")\n",
    "            generate_btn = gr.Button(\"Generate\")\n",
    "        video = gr.Video()\n",
    "    with gr.Accordion(\"Advanced options\", open=False):\n",
    "        seed = gr.Slider(\n",
    "            label=\"Seed\",\n",
    "            value=42,\n",
    "            randomize=True,\n",
    "            minimum=0,\n",
    "            maximum=max_64_bit_int,\n",
    "            step=1,\n",
    "        )\n",
    "        randomize_seed = gr.Checkbox(label=\"Randomize seed\", value=True)\n",
    "        motion_bucket_id = gr.Slider(\n",
    "            label=\"Motion bucket id\",\n",
    "            info=\"Controls how much motion to add/remove from the image\",\n",
    "            value=127,\n",
    "            minimum=1,\n",
    "            maximum=255,\n",
    "        )\n",
    "        fps_id = gr.Slider(\n",
    "            label=\"Frames per second\",\n",
    "            info=\"The length of your video in seconds will be num_frames / fps\",\n",
    "            value=6,\n",
    "            minimum=5,\n",
    "            maximum=30,\n",
    "            step=1,\n",
    "        )\n",
    "        num_frames = gr.Slider(label=\"Number of Frames\", value=8, minimum=2, maximum=25, step=1)\n",
    "        num_steps = gr.Slider(label=\"Number of generation steps\", value=4, minimum=1, maximum=8, step=1)\n",
    "        max_guidance_scale = gr.Slider(\n",
    "            label=\"Max guidance scale\",\n",
    "            info=\"classifier-free guidance strength\",\n",
    "            value=1.2,\n",
    "            minimum=1,\n",
    "            maximum=2,\n",
    "        )\n",
    "        min_guidance_scale = gr.Slider(\n",
    "            label=\"Min guidance scale\",\n",
    "            info=\"classifier-free guidance strength\",\n",
    "            value=1,\n",
    "            minimum=1,\n",
    "            maximum=1.5,\n",
    "        )\n",
    "    examples = gr.Examples(\n",
    "        examples=example_imgs,\n",
    "        inputs=[image_in],\n",
    "        outputs=[video, seed],\n",
    "    )\n",
    "\n",
    "    image_in.upload(fn=resize_image, inputs=image_in, outputs=image_in)\n",
    "    generate_btn.click(\n",
    "        fn=sample,\n",
    "        inputs=[\n",
    "            image_in,\n",
    "            seed,\n",
    "            randomize_seed,\n",
    "            motion_bucket_id,\n",
    "            fps_id,\n",
    "            num_steps,\n",
    "            num_frames,\n",
    "            max_guidance_scale,\n",
    "            min_guidance_scale,\n",
    "        ],\n",
    "        outputs=[video, seed],\n",
    "        api_name=\"video\",\n",
    "    )\n",
    "\n",
    "\n",
    "try:\n",
    "    demo.queue().launch(debug=True)\n",
    "except Exception:\n",
    "    demo.queue().launch(debug=True, share=True)\n",
    "# if you are launching remotely, specify server_name and server_port\n",
    "# demo.launch(server_name='your server name', server_port='server port in int')\n",
    "# Read more in the docs: https://gradio.app/docs/"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "openvino_notebooks": {
   "imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/ae8a77b2-b5c9-45c5-a103-6e46c686739f",
   "tags": {
    "categories": [
     "Model Demos",
     "AI Trends"
    ],
    "libraries": [],
    "other": [],
    "tasks": [
     "Image-to-Video"
    ]
   }
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {
     "0bdd9345da2247c3a441414c86e1382a": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "layout": "IPY_MODEL_498134252036464aaf83057db4c54f85",
       "style": "IPY_MODEL_1e204e15e5324ed0acac6bf86b64e1a6",
       "value": "100%"
      }
     },
     "1e204e15e5324ed0acac6bf86b64e1a6": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "240f52aed13246c089b238d45043f41c": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     },
     "24fb38d8b6484895bfc517616d53c602": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     },
     "32199ccd03d44e3f95206d89a4f83076": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     },
     "498134252036464aaf83057db4c54f85": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     },
     "531dd5ed15d041889f98bbb5bd4ca9c9": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "description_width": ""
      }
     },
     "5c156b007fc5442a8ab4d3e5819420f2": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "layout": "IPY_MODEL_24fb38d8b6484895bfc517616d53c602",
       "style": "IPY_MODEL_74a59d97a5ab428fbd9935694d489967",
       "value": " 4/4 [00:47&lt;00:00, 11.53s/it]"
      }
     },
     "74a59d97a5ab428fbd9935694d489967": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "9a7f2447f844430d898e0aa2f41a5482": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DescriptionStyleModel",
      "state": {
       "description_width": ""
      }
     },
     "e0d3b977bd3a4847b253ab82add501af": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "children": [
        "IPY_MODEL_0bdd9345da2247c3a441414c86e1382a",
        "IPY_MODEL_f393046ecfd14d47bdbc7276622c2e61",
        "IPY_MODEL_5c156b007fc5442a8ab4d3e5819420f2"
       ],
       "layout": "IPY_MODEL_32199ccd03d44e3f95206d89a4f83076"
      }
     },
     "e2bdcfa8e15248289fa439cd3e97ebf1": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "DropdownModel",
      "state": {
       "_options_labels": [
        "CPU",
        "GPU.0",
        "GPU.1",
        "AUTO"
       ],
       "description": "Device:",
       "index": 3,
       "layout": "IPY_MODEL_f8519f3046d143879336ae6a70dc184e",
       "style": "IPY_MODEL_9a7f2447f844430d898e0aa2f41a5482"
      }
     },
     "f393046ecfd14d47bdbc7276622c2e61": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "bar_style": "success",
       "layout": "IPY_MODEL_240f52aed13246c089b238d45043f41c",
       "max": 4,
       "style": "IPY_MODEL_531dd5ed15d041889f98bbb5bd4ca9c9",
       "value": 4
      }
     },
     "f8519f3046d143879336ae6a70dc184e": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {}
     }
    },
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}