{ "cells": [ { "cell_type": "markdown", "id": "2990976f-be06-4068-bf49-56d39a3f93a8", "metadata": {}, "source": [ "# Controllable Music Generation with MusicGen and OpenVINO\n", "\n", "MusicGen is a single-stage auto-regressive Transformer model capable of generating high-quality music samples conditioned on text descriptions or audio prompts. The text prompt is passed to a text encoder model (T5) to obtain a sequence of hidden-state representations. These hidden states are fed to MusicGen, which predicts discrete audio tokens (audio codes). Finally, audio tokens are then decoded using an audio compression model (EnCodec) to recover the audio waveform.\n", "\n", "![pipeline](https://user-images.githubusercontent.com/76463150/260439306-81c81c8d-1f9c-41d0-b881-9491766def8e.png)\n", "\n", "[The MusicGen model](https://arxiv.org/abs/2306.05284) does not require a self-supervised semantic representation of the text/audio prompts; it operates over several streams of compressed discrete music representation with efficient token interleaving patterns, thus eliminating the need to cascade multiple models to predict a set of codebooks (e.g. hierarchically or upsampling). Unlike prior models addressing music generation, it is able to generate all the codebooks in a single forward pass.\n", "\n", "In this tutorial, we consider how to run the MusicGen model using OpenVINO.\n", "\n", "We will use a model implementation from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library. " ] }, { "cell_type": "markdown", "id": "c3c318e0-3828-4b23-aefc-9ccd9b093be2", "metadata": {}, "source": [ "\n", "#### Table of contents:\n", "\n", "- [Prerequisites](#Prerequisites)\n", " - [Install requirements](#Install-requirements)\n", " - [Imports](#Imports)\n", "- [MusicGen in HF Transformers](#MusicGen-in-HF-Transformers)\n", " - [Original Pipeline Inference](#Original-Pipeline-Inference)\n", "- [Convert models to OpenVINO Intermediate representation (IR) format](#Convert-models-to-OpenVINO-Intermediate-representation-(IR)-format)\n", " - [0. Set Up Variables](#0.-Set-Up-Variables)\n", " - [1. Convert Text Encoder](#1.-Convert-Text-Encoder)\n", " - [2. Convert MusicGen Language Model](#2.-Convert-MusicGen-Language-Model)\n", " - [3. 
Convert Audio Decoder](#3.-Convert-Audio-Decoder)\n", "- [Embedding the converted models into the original pipeline](#Embedding-the-converted-models-into-the-original-pipeline)\n", " - [Select inference device](#Select-inference-device)\n", " - [Adapt OpenVINO models to the original pipeline](#Adapt-OpenVINO-models-to-the-original-pipeline)\n", "- [Try out the converted pipeline](#Try-out-the-converted-pipeline)\n", "\n" ] }, { "cell_type": "markdown", "id": "501e64a4-62a8-4de6-b1c1-a8f41199ac07", "metadata": {}, "source": [ "## Prerequisites\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "markdown", "id": "af1463e7-68f4-40e2-a151-4d6d7678f1e0", "metadata": {}, "source": [ "### Install requirements\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "40d29d27-919b-407b-846d-f025b5e99878", "metadata": {}, "outputs": [], "source": [ "%pip install -q \"openvino>=2023.3.0\"\n", "%pip install -q \"torch>=2.1\" \"gradio>=4.19\" \"transformers\" packaging --extra-index-url https://download.pytorch.org/whl/cpu" ] }, { "cell_type": "markdown", "id": "dfe2e282-19f1-47c4-bbac-03177db00243", "metadata": {}, "source": [ "### Imports\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "d8494fc4-78f6-4f43-a4d3-f2ec18210355", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-05-17 15:00:26.783507: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.\n", "2024-05-17 15:00:26.785229: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", "2024-05-17 15:00:26.820125: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.\n", "2024-05-17 15:00:26.821207: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.\n", "To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.\n", "2024-05-17 15:00:27.544961: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n" ] } ], "source": [ "from collections import namedtuple\n", "from functools import partial\n", "import gc\n", "from pathlib import Path\n", "from typing import Optional, Tuple\n", "import warnings\n", "\n", "from IPython.display import Audio\n", "import openvino as ov\n", "import numpy as np\n", "import torch\n", "from torch.jit import TracerWarning\n", "from transformers import AutoProcessor, MusicgenForConditionalGeneration\n", "from transformers.modeling_outputs import (\n", " BaseModelOutputWithPastAndCrossAttentions,\n", " CausalLMOutputWithCrossAttentions,\n", ")\n", "\n", "# Ignore tracing warnings\n", "warnings.filterwarnings(\"ignore\", category=TracerWarning)" ] }, { "cell_type": "markdown", "id": "523545c2-3e07-466e-bce8-85152ed2f1f4", "metadata": {}, "source": [ "## MusicGen in HF Transformers\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "To work with [MusicGen](https://huggingface.co/facebook/musicgen-small) by Meta AI, we will use [Hugging Face Transformers package](https://github.com/huggingface/transformers). 
The Transformers package exposes the `MusicgenForConditionalGeneration` class, simplifying model instantiation and weight loading. The code below demonstrates how to create a `MusicgenForConditionalGeneration` and generate a text-conditioned music sample." ] }, { "cell_type": "code", "execution_count": 3, "id": "c3d1128f-d892-49dc-8525-f757c34cc33a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.\n", " warnings.warn(\"torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.\")\n", "/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/models/encodec/modeling_encodec.py:123: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).\n", " self.register_buffer(\"padding_total\", torch.tensor(kernel_size - stride, dtype=torch.int64), persistent=False)\n" ] } ], "source": [ "import sys\n", "from packaging.version import parse\n", "\n", "\n", "if sys.version_info < (3, 8):\n", " import importlib_metadata\n", "else:\n", " import importlib.metadata as importlib_metadata\n", "loading_kwargs = {}\n", "\n", "if parse(importlib_metadata.version(\"transformers\")) >= parse(\"4.40.0\"):\n", " loading_kwargs[\"attn_implementation\"] = \"eager\"\n", "\n", "\n", "# Load the pipeline\n", "model = MusicgenForConditionalGeneration.from_pretrained(\"facebook/musicgen-small\", torchscript=True, return_dict=False, **loading_kwargs)" ] }, { "cell_type": "markdown", "id": "18a36ade-6ab4-4540-8766-54b29dfb2dc6", "metadata": {}, "source": [ "In the cell below, you can change the desired length of the generated music sample." ] }, { "cell_type": "code", "execution_count": 4, "id": "ae8f6270-e745-4adb-b65d-c1d8dc44d7fc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sampling rate is 32000 Hz\n" ] } ], "source": [ "sample_length = 8 # seconds\n", "\n", "n_tokens = sample_length * model.config.audio_encoder.frame_rate + 3\n", "sampling_rate = model.config.audio_encoder.sampling_rate\n", "print(\"Sampling rate is\", sampling_rate, \"Hz\")\n", "\n", "model.to(\"cpu\")\n", "model.eval();" ] }, { "cell_type": "markdown", "id": "83117816-3ec2-43a0-8fcd-1eb09c6e21e9", "metadata": {}, "source": [ "### Original Pipeline Inference\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Text preprocessing prepares the text prompt to be fed into the model; the `processor` object abstracts this step for us. Text tokenization is performed under the hood: it assigns tokens, or IDs, to the words; in other words, token IDs are just indices of the words in the model vocabulary. This helps the model understand the context of a sentence."
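, "\n", "To get a feel for what the tokenizer produces, you can inspect a processed prompt. This is an optional, illustrative sketch; it assumes the `processor` instantiated in the next cell, and the exact token IDs may differ:\n", "\n", "```python\n", "# Illustrative only: look at the token IDs the processor assigns to a prompt.\n", "sample = processor(text=[\"80s pop track with bassy drums and synth\"], return_tensors=\"pt\")\n", "print(sample[\"input_ids\"])  # tensor of token indices, shape (1, sequence_length)\n", "print(processor.tokenizer.convert_ids_to_tokens(sample[\"input_ids\"][0]))  # corresponding sub-word tokens\n", "```\n"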
] }, { "cell_type": "code", "execution_count": 5, "id": "f3101955-6c20-4fc6-b675-1237260972bd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processor = AutoProcessor.from_pretrained(\"facebook/musicgen-small\")\n", "\n", "inputs = processor(\n", " text=[\"80s pop track with bassy drums and synth\"],\n", " return_tensors=\"pt\",\n", ")\n", "\n", "audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)\n", "\n", "Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)" ] }, { "cell_type": "markdown", "id": "698f4db8-759d-47f8-912c-b11d7ba9b632", "metadata": {}, "source": [ "## Convert models to OpenVINO Intermediate representation (IR) format\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The model conversion API enables direct conversion of PyTorch models. We will utilize the `openvino.convert_model` method to acquire OpenVINO IR versions of the models. The method requires a model object and example input for model tracing. Under the hood, the converter will use the PyTorch JIT compiler to build a frozen model graph.\n", "\n", "The pipeline consists of three important parts:\n", "\n", " - The [T5 text encoder](https://huggingface.co/google/flan-t5-base) that translates user prompts into vectors in the latent space that the next model - the MusicGen decoder - can utilize.\n", " - The [MusicGen Language Model](https://huggingface.co/docs/transformers/model_doc/musicgen#transformers.MusicgenForCausalLM) that auto-regressively generates audio tokens (codes).\n", " - The [EnCodec model](https://huggingface.co/facebook/encodec_24khz) (we will use only its decoder part) that decodes the audio waveform from the audio tokens predicted by the MusicGen Language Model.\n", "\n", "Let us convert each model step by step." ] }, { "cell_type": "markdown", "id": "019d43b6-5a9f-41ae-aad5-c66e832ddd84", "metadata": {}, "source": [ "### 0. Set Up Variables\n", "[back to top ⬆️](#Table-of-contents:)\n" ] }, { "cell_type": "code", "execution_count": 6, "id": "867f36ab-6cd6-4f50-a23c-e5069091cdbc", "metadata": {}, "outputs": [], "source": [ "models_dir = Path(\"./models\")\n", "t5_ir_path = models_dir / \"t5.xml\"\n", "musicgen_0_ir_path = models_dir / \"mg_0.xml\"\n", "musicgen_ir_path = models_dir / \"mg.xml\"\n", "audio_decoder_ir_path = models_dir / \"encodec.xml\"" ] }, { "cell_type": "markdown", "id": "5312460c-fb6f-471f-b2ff-4caf13047866", "metadata": {}, "source": [ "### 1. Convert Text Encoder\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The text encoder is responsible for converting the input prompt, such as \"90s rock song with loud guitars and heavy drums\", into an embedding space that can be fed to the next model. 
Typically, it is a transformer-based encoder that maps a sequence of input tokens to a sequence of text embeddings.\n", "\n", "The input for the text encoder consists of a tensor `input_ids`, which contains the token indices from the text processed by the tokenizer, and an `attention_mask` that we will ignore, since we process one prompt at a time and this vector will consist entirely of ones.\n", "\n", "We use OpenVINO Converter (OVC) below to convert the PyTorch model to the OpenVINO Intermediate Representation format (IR), which you can infer later with [OpenVINO runtime](https://docs.openvino.ai/2024/openvino-workflow/running-inference.html)." ] }, { "cell_type": "code", "execution_count": 7, "id": "5dadad3c-06ce-43e5-b059-b16238f3963f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING:tensorflow:Please fix your imports. Module tensorflow.python.training.tracking.base has been moved to tensorflow.python.trackable.base. The old module will be deleted in version 2.11.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[ WARNING ] Please fix your imports. Module %s has been moved to %s. The old module will be deleted in version %s.\n", "/home/ea/work/my_optimum_intel/optimum_env/lib/python3.8/site-packages/transformers/modeling_utils.py:4371: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead\n", " warnings.warn(\n" ] } ], "source": [ "if not t5_ir_path.exists():\n", " t5_ov = ov.convert_model(model.text_encoder, example_input={\"input_ids\": inputs[\"input_ids\"]})\n", "\n", " ov.save_model(t5_ov, t5_ir_path)\n", " del t5_ov\n", " gc.collect()" ] }, { "cell_type": "markdown", "id": "98019b03-8c03-47e4-b311-8bb6af45b70c", "metadata": {}, "source": [ "### 2. Convert MusicGen Language Model\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "This model is the central part of the whole pipeline: it takes the embedded text representation and generates audio codes that can then be decoded into actual music. The model outputs several streams of audio codes - tokens sampled from the pre-trained codebooks that represent music efficiently at a lower frame rate. The model employs an innovative code interleaving strategy that makes single-stage generation possible.\n", "\n", "On the 0th generation step, the model accepts `input_ids` representing the indices of audio codes, along with the `encoder_hidden_states` and `encoder_attention_mask` provided by the text encoder."
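, "\n", "The example input shapes used for tracing in the next cell can be related to the model configuration; here is a hedged sketch of where they come from (it assumes the musicgen-small checkpoint, the 12-token example prompt processed earlier, and the batch doubling used for classifier-free guidance):\n", "\n", "```python\n", "# Illustrative sketch: relate the hard-coded tracing shapes to the loaded model.\n", "cfg_batch = 2 * 1                                 # one prompt, doubled for classifier-free guidance\n", "num_codebooks = model.decoder.num_codebooks       # 4 for musicgen-small\n", "hidden_size = model.decoder.config.hidden_size    # 1024 for musicgen-small\n", "prompt_len = inputs[\"input_ids\"].shape[1]         # 12 for the example prompt\n", "print(cfg_batch * num_codebooks, prompt_len, hidden_size)  # 8 12 1024 -> matches the example tensors below\n", "```\n"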
] }, { "cell_type": "code", "execution_count": 8, "id": "791f3259-ea4c-459a-b4b0-a19e61e79c6f", "metadata": {}, "outputs": [], "source": [ "# Set model config `torchscript` to True, so the model returns a tuple as output\n", "model.decoder.config.torchscript = True\n", "\n", "if not musicgen_0_ir_path.exists():\n", " decoder_input = {\n", " \"input_ids\": torch.ones(8, 1, dtype=torch.int64),\n", " \"encoder_hidden_states\": torch.ones(2, 12, 1024, dtype=torch.float32),\n", " \"encoder_attention_mask\": torch.ones(2, 12, dtype=torch.int64),\n", " }\n", " mg_ov_0_step = ov.convert_model(model.decoder, example_input=decoder_input)\n", "\n", " ov.save_model(mg_ov_0_step, musicgen_0_ir_path)\n", " del mg_ov_0_step\n", " gc.collect()" ] }, { "cell_type": "markdown", "id": "1882240a-3300-4c72-9a6b-fe72fd6820b3", "metadata": {}, "source": [ "On further iterations, the model is also provided with a `past_key_values` argument that contains the previous outputs of the attention blocks, which allows us to save on computation.\n", "For us, however, this means that the signature of the model's `forward` method changes. Models in OpenVINO IR have frozen computation graphs and do not allow optional arguments, which is why the MusicGen model must be converted a second time, with an increased number of inputs." ] }, { "cell_type": "code", "execution_count": 10, "id": "5cd21b7d-2026-4781-849f-a7bff22e63a2", "metadata": {}, "outputs": [], "source": [ "# Add additional argument to the example_input dict\n", "if not musicgen_ir_path.exists():\n", " # Add `past_key_values` to the converted model signature\n", " decoder_input[\"past_key_values\"] = tuple(\n", " [\n", " (\n", " torch.ones(2, 16, 1, 64, dtype=torch.float32),\n", " torch.ones(2, 16, 1, 64, dtype=torch.float32),\n", " torch.ones(2, 16, 12, 64, dtype=torch.float32),\n", " torch.ones(2, 16, 12, 64, dtype=torch.float32),\n", " )\n", " ]\n", " * 24\n", " )\n", "\n", " mg_ov = ov.convert_model(model.decoder, example_input=decoder_input)\n", " for input in mg_ov.inputs[3:]:\n", " input.get_node().set_partial_shape(ov.PartialShape([-1, 16, -1, 64]))\n", " input.get_node().set_element_type(ov.Type.f32)\n", "\n", " mg_ov.validate_nodes_and_infer_types()\n", "\n", " ov.save_model(mg_ov, musicgen_ir_path)\n", " del mg_ov\n", " gc.collect()" ] }, { "cell_type": "markdown", "id": "dc93a8a2-312b-46c3-85b9-df0cd96feeb6", "metadata": {}, "source": [ "### 3. Convert Audio Decoder\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The audio decoder, which is a part of the EnCodec model, is used to recover the audio waveform from the audio tokens predicted by the MusicGen decoder. To learn more about the model, please refer to the corresponding [OpenVINO example](../encodec-audio-compression)."
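, "\n", "Before converting, it can be helpful to run the PyTorch audio decoder once on dummy codes and confirm the expected input and output shapes. A minimal, optional sketch (it assumes the 4 codebooks of musicgen-small and `n_tokens - 3` audio frames, matching the example input used for conversion in the next cell):\n", "\n", "```python\n", "# Optional sanity check (illustrative): decode dummy audio codes with the PyTorch EnCodec decoder.\n", "dummy_codes = torch.ones((1, 1, 4, n_tokens - 3), dtype=torch.int64)  # (batch, chunks, codebooks, frames)\n", "with torch.no_grad():\n", "    decoded = model.audio_encoder.decode(dummy_codes, [None])\n", "print(decoded[0].shape)  # waveform of roughly sample_length * sampling_rate samples\n", "```\n"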
] }, { "cell_type": "code", "execution_count": 11, "id": "5ca233ad-d8a2-46ca-ba99-a054647df626", "metadata": {}, "outputs": [], "source": [ "if not audio_decoder_ir_path.exists():\n", "\n", " class AudioDecoder(torch.nn.Module):\n", " def __init__(self, model):\n", " super().__init__()\n", " self.model = model\n", "\n", " def forward(self, output_ids):\n", " return self.model.decode(output_ids, [None])\n", "\n", " audio_decoder_input = {\"output_ids\": torch.ones((1, 1, 4, n_tokens - 3), dtype=torch.int64)}\n", "\n", " with torch.no_grad():\n", " audio_decoder_ov = ov.convert_model(AudioDecoder(model.audio_encoder), example_input=audio_decoder_input)\n", " ov.save_model(audio_decoder_ov, audio_decoder_ir_path)\n", " del audio_decoder_ov\n", " gc.collect()" ] }, { "cell_type": "markdown", "id": "d1172fd5-ebc8-4cec-a42e-1a69ec684226", "metadata": {}, "source": [ "## Embedding the converted models into the original pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The OpenVINO™ Runtime Python API is used to compile the model in OpenVINO IR format. The [Core](https://docs.openvino.ai/2024/api/ie_python_api/_autosummary/openvino.runtime.Core.html) class provides access to the OpenVINO Runtime API. The `core` object, which is an instance of the `Core` class, represents the API and is used to compile the model." ] }, { "cell_type": "code", "execution_count": 12, "id": "0231dea8-24dd-46bc-b5d6-399e72b3c11d", "metadata": {}, "outputs": [], "source": [ "core = ov.Core()" ] }, { "cell_type": "markdown", "id": "134b4437-653e-4cb1-90fc-0028fd68083b", "metadata": {}, "source": [ "#### Select inference device\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Select the device for model inference with OpenVINO from the dropdown list:" ] }, { "cell_type": "code", "execution_count": 13, "id": "fde35263-6a70-45d3-8ddc-8b5cf9c1fc35", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c56f9a2f20704363bd062f0c6f274e65", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Dropdown(description='Device:', index=3, options=('CPU', 'GPU.0', 'GPU.1', 'AUTO'), value='AUTO')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ipywidgets as widgets\n", "\n", "device = widgets.Dropdown(\n", " options=core.available_devices + [\"AUTO\"],\n", " value=\"AUTO\",\n", " description=\"Device:\",\n", " disabled=False,\n", ")\n", "\n", "device" ] }, { "cell_type": "markdown", "id": "13f35868-70a2-40b9-a23a-32823883ba49", "metadata": {}, "source": [ "### Adapt OpenVINO models to the original pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "Here we create wrapper classes for all three OpenVINO models that we want to embed in the original inference pipeline.\n", "Here are some of the things to consider when adapting an OV model:\n", " - Make sure that parameters passed by the original pipeline are forwarded to the compiled OV model properly; sometimes the OV model uses only a portion of the input arguments while others are ignored, and sometimes you need to convert an argument to another data type or unwrap data structures such as tuples or dictionaries.\n", " - Guarantee that the wrapper class returns results to the pipeline in the expected format. In the example below you can see how we pack OV model outputs into special classes declared in the HF repo.\n", " - Pay attention to the method the original pipeline uses to call the model - it may not be the `forward` method! 
Refer to the `AudioDecoderWrapper` to see how we wrap OV model inference into the `decode` method." ] }, { "cell_type": "code", "execution_count": 14, "id": "aaee5ead-f0ce-40be-b678-99518e1b9a98", "metadata": {}, "outputs": [], "source": [ "class TextEncoderWrapper(torch.nn.Module):\n", " def __init__(self, encoder_ir, config):\n", " super().__init__()\n", " self.encoder = core.compile_model(encoder_ir, device.value)\n", " self.config = config\n", "\n", " def forward(self, input_ids, **kwargs):\n", " last_hidden_state = self.encoder(input_ids)[self.encoder.outputs[0]]\n", " last_hidden_state = torch.tensor(last_hidden_state)\n", " return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=last_hidden_state)\n", "\n", "\n", "class MusicGenWrapper(torch.nn.Module):\n", " def __init__(\n", " self,\n", " music_gen_lm_0_ir,\n", " music_gen_lm_ir,\n", " config,\n", " num_codebooks,\n", " build_delay_pattern_mask,\n", " apply_delay_pattern_mask,\n", " ):\n", " super().__init__()\n", " self.music_gen_lm_0 = core.compile_model(music_gen_lm_0_ir, device.value)\n", " self.music_gen_lm = core.compile_model(music_gen_lm_ir, device.value)\n", " self.config = config\n", " self.num_codebooks = num_codebooks\n", " self.build_delay_pattern_mask = build_delay_pattern_mask\n", " self.apply_delay_pattern_mask = apply_delay_pattern_mask\n", "\n", " def forward(\n", " self,\n", " input_ids: torch.LongTensor = None,\n", " encoder_hidden_states: torch.FloatTensor = None,\n", " encoder_attention_mask: torch.LongTensor = None,\n", " past_key_values: Optional[Tuple[torch.FloatTensor]] = None,\n", " **kwargs\n", " ):\n", " if past_key_values is None:\n", " model = self.music_gen_lm_0\n", " arguments = (input_ids, encoder_hidden_states, encoder_attention_mask)\n", " else:\n", " model = self.music_gen_lm\n", " arguments = (\n", " input_ids,\n", " encoder_hidden_states,\n", " encoder_attention_mask,\n", " *past_key_values,\n", " )\n", "\n", " output = model(arguments)\n", " return CausalLMOutputWithCrossAttentions(\n", " logits=torch.tensor(output[model.outputs[0]]),\n", " past_key_values=tuple([output[model.outputs[i]] for i in range(1, 97)]),\n", " )\n", "\n", "\n", "class AudioDecoderWrapper(torch.nn.Module):\n", " def __init__(self, decoder_ir, config):\n", " super().__init__()\n", " self.decoder = core.compile_model(decoder_ir, device.value)\n", " self.config = config\n", " self.output_type = namedtuple(\"AudioDecoderOutput\", [\"audio_values\"])\n", "\n", " def decode(self, output_ids, audio_scales):\n", " output = self.decoder(output_ids)[self.decoder.outputs[0]]\n", " return self.output_type(audio_values=torch.tensor(output))" ] }, { "cell_type": "markdown", "id": "846c9c7e-1fda-47e7-b0ca-47cb5c7f5c9d", "metadata": {}, "source": [ "Now we initialize the wrapper objects and load them to the HF pipeline" ] }, { "cell_type": "code", "execution_count": 15, "id": "3a4ee8dd-1d6f-47c0-a7e2-42c8cb5ef714", "metadata": {}, "outputs": [], "source": [ "text_encode_ov = TextEncoderWrapper(t5_ir_path, model.text_encoder.config)\n", "musicgen_decoder_ov = MusicGenWrapper(\n", " musicgen_0_ir_path,\n", " musicgen_ir_path,\n", " model.decoder.config,\n", " model.decoder.num_codebooks,\n", " model.decoder.build_delay_pattern_mask,\n", " model.decoder.apply_delay_pattern_mask,\n", ")\n", "audio_encoder_ov = AudioDecoderWrapper(audio_decoder_ir_path, model.audio_encoder.config)\n", "\n", "del model.text_encoder\n", "del model.decoder\n", "del model.audio_encoder\n", "gc.collect()\n", "\n", "model.text_encoder = 
text_encode_ov\n", "model.decoder = musicgen_decoder_ov\n", "model.audio_encoder = audio_encoder_ov\n", "\n", "\n", "def prepare_inputs_for_generation(\n", " self,\n", " decoder_input_ids,\n", " past_key_values=None,\n", " attention_mask=None,\n", " head_mask=None,\n", " decoder_attention_mask=None,\n", " decoder_head_mask=None,\n", " cross_attn_head_mask=None,\n", " use_cache=None,\n", " encoder_outputs=None,\n", " decoder_delay_pattern_mask=None,\n", " guidance_scale=None,\n", " **kwargs,\n", "):\n", " if decoder_delay_pattern_mask is None:\n", " (\n", " decoder_input_ids,\n", " decoder_delay_pattern_mask,\n", " ) = self.decoder.build_delay_pattern_mask(\n", " decoder_input_ids,\n", " self.generation_config.pad_token_id,\n", " max_length=self.generation_config.max_length,\n", " )\n", "\n", " # apply the delay pattern mask\n", " decoder_input_ids = self.decoder.apply_delay_pattern_mask(decoder_input_ids, decoder_delay_pattern_mask)\n", "\n", " if guidance_scale is not None and guidance_scale > 1:\n", " # for classifier free guidance we need to replicate the decoder args across the batch dim (we'll split these\n", " # before sampling)\n", " decoder_input_ids = decoder_input_ids.repeat((2, 1))\n", " if decoder_attention_mask is not None:\n", " decoder_attention_mask = decoder_attention_mask.repeat((2, 1))\n", "\n", " if past_key_values is not None:\n", " # cut decoder_input_ids if past is used\n", " decoder_input_ids = decoder_input_ids[:, -1:]\n", "\n", " return {\n", " \"input_ids\": None, # encoder_outputs is defined. input_ids not needed\n", " \"encoder_outputs\": encoder_outputs,\n", " \"past_key_values\": past_key_values,\n", " \"decoder_input_ids\": decoder_input_ids,\n", " \"attention_mask\": attention_mask,\n", " \"decoder_attention_mask\": decoder_attention_mask,\n", " \"head_mask\": head_mask,\n", " \"decoder_head_mask\": decoder_head_mask,\n", " \"cross_attn_head_mask\": cross_attn_head_mask,\n", " \"use_cache\": use_cache,\n", " }\n", "\n", "\n", "model.prepare_inputs_for_generation = partial(prepare_inputs_for_generation, model)" ] }, { "cell_type": "markdown", "id": "48d8462f", "metadata": {}, "source": [ "We can now infer the pipeline backed by OpenVINO models." 
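, "\n", "The cell below repeats the original text-conditioned generation, this time on top of the compiled OpenVINO models. If you also want to keep the result as a file instead of only playing it inline, here is a small optional sketch (it assumes `scipy` is available in the environment, which may need to be installed separately):\n", "\n", "```python\n", "# Optional: save the waveform produced by the generation cell below to disk (assumes scipy is installed).\n", "import scipy.io.wavfile\n", "\n", "scipy.io.wavfile.write(\"musicgen_ov.wav\", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())\n", "```\n"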
] }, { "cell_type": "code", "execution_count": 16, "id": "11fa475e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "processor = AutoProcessor.from_pretrained(\"facebook/musicgen-small\")\n", "\n", "inputs = processor(\n", " text=[\"80s pop track with bassy drums and synth\"],\n", " return_tensors=\"pt\",\n", ")\n", "\n", "audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)\n", "\n", "Audio(audio_values[0].cpu().numpy(), rate=sampling_rate)" ] }, { "cell_type": "markdown", "id": "549e0891-01fc-41d9-9ba5-44bea855cced", "metadata": {}, "source": [ "## Try out the converted pipeline\n", "[back to top ⬆️](#Table-of-contents:)\n", "\n", "The demo app below is created using the [Gradio package](https://www.gradio.app/docs/interface)." ] }, { "cell_type": "code", "execution_count": 17, "id": "bd7b791a-f29a-4eb4-b426-f5d4e0b8e1de", "metadata": {}, "outputs": [], "source": [ "def _generate(prompt):\n", " inputs = processor(\n", " text=[\n", " prompt,\n", " ],\n", " padding=True,\n", " return_tensors=\"pt\",\n", " )\n", " audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=n_tokens)\n", " waveform = audio_values[0].cpu().squeeze() * 2**15\n", " return (sampling_rate, waveform.numpy().astype(np.int16))" ] }, { "cell_type": "code", "execution_count": 18, "id": "04059aa3-8549-4635-a51a-74c026f1d740", "metadata": { "tags": [], "test_replace": { " demo.launch(debug=True)": " demo.launch()", " demo.launch(share=True, debug=True)": " demo.launch(share=True)" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Running on local URL: http://127.0.0.1:7860\n", "\n", "To create a public link, set `share=True` in `launch()`.\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Keyboard interruption in main thread... closing server.\n" ] } ], "source": [ "import gradio as gr\n", "\n", "demo = gr.Interface(\n", " _generate,\n", " inputs=[\n", " gr.Textbox(label=\"Text Prompt\"),\n", " ],\n", " outputs=[\"audio\"],\n", " examples=[\n", " [\"80s pop track with bassy drums and synth\"],\n", " [\"Earthy tones, environmentally conscious, ukulele-infused, harmonic, breezy, easygoing, organic instrumentation, gentle grooves\"],\n", " [\"90s rock song with loud guitars and heavy drums\"],\n", " [\"Heartful EDM with beautiful synths and chords\"],\n", " ],\n", " allow_flagging=\"never\",\n", ")\n", "try:\n", " demo.launch(debug=True)\n", "except Exception:\n", " demo.launch(share=True, debug=True)\n", "\n", "# If you are launching remotely, specify server_name and server_port\n", "# EXAMPLE: `demo.launch(server_name='your server name', server_port='server port in int')`\n", "# To learn more please refer to the Gradio docs: https://gradio.app/docs/" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" }, "openvino_notebooks": { "imageUrl": "https://github.com/openvinotoolkit/openvino_notebooks/blob/latest/notebooks/music-generation/music-generation.png?raw=true", "tags": { "categories": [ "Model Demos", "AI Trends" ], "libraries": [], "other": [], "tasks": [ "Text-to-Audio" ] } }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }