---
title: Mini-dalle Video-clip Maker
emoji: 🐃
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# DALL·E Video Clip Maker

This Python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video from a song.
Given a YouTube URL, the program extracts the audio and transcript of the video and uses the lyrics in the transcript as text prompts for DALL·E Mini.
## Usage

The whole pipeline runs with a single command:

`python3 main.py <youtube url> --token <your replicate API token>`
Example outputs for "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcript.
# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service for running open-source machine learning models in the cloud. The Replicate API lets you use any Replicate model from inside a Python script, which is the core of this project.
All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all interaction with Replicate.
Let's have a look at the code it runs in order to generate images from text.
To create an API object and specify the model we'd like to use, we first need an API token, which is available [here](https://replicate.com/docs/api) after signing up for Replicate.
```
import os
import replicate

# Authenticate with Replicate via an environment variable
os.environ["REPLICATE_API_TOKEN"] = "<your Replicate API token>"

# Get a handle to the DALL·E Mini model and run it
dalle = replicate.models.get("kuprel/min-dalle")
urls = dalle.predict(text="<your prompt>",
                     grid_size="<how many images to generate per grid side>",
                     log2_supercondition_factor="<controls how closely the output follows the text>")
```
In this case, the model returns a list of URLs to all the intermediate images generated by DALL·E Mini.
We want the final output, so we call

```
get_image(list(urls)[-1])
```

to download the last one using Python's urllib library.
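For illustration, a minimal `get_image` could look like the sketch below (assuming the image is decoded with PIL; the actual helper in `dall_e.py` may differ):

```
from io import BytesIO
from urllib.request import urlopen

from PIL import Image

def get_image(url):
    # Download the image bytes and decode them into a PIL image
    with urlopen(url) as response:
        return Image.open(BytesIO(response.read()))
```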
## 2. Downloading content from YouTube

All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool Python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.
The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we first extract it from the video URL using urllib; the function **get_video_id** in the file does exactly that (a minimal sketch follows the next snippet).
The main lines of code to get the transcript are:
```
id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(id, languages=['en'])
```
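As for **get_video_id**, a minimal sketch (assuming standard `watch?v=` URLs; the real function may handle more URL formats) could be:

```
from urllib.parse import urlparse, parse_qs

def get_video_id(url):
    # e.g. "https://www.youtube.com/watch?v=abc123" -> "abc123"
    return parse_qs(urlparse(url).query)['v'][0]
```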
Each entry in **transcript** is a Python dictionary with keys 'text', 'start', and 'duration', giving a line of the lyrics, its starting time, and how long it lasts.
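For example, a single entry might look like this (illustrative values):

```
{'text': 'Here comes the sun', 'start': 12.4, 'duration': 3.1}
```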
### Downloading the audio

I used a library called **youtube_dl** that can download the sound of a YouTube video as an .mp3 file.
The usage is fairly simple and is wrapped in the **download_mp3** function in the file:
```
import youtube_dl

ydl_opts = {
    'outtmpl': <specify output file path>,
    'format': 'bestaudio/best',        # pick the best available audio stream
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',   # convert the download to audio via ffmpeg
        'preferredcodec': 'mp3',
        'preferredquality': '192',     # bitrate in kbps
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```
## 3. Making a video clip

The rest of the code is conceptually simple: using the transcript lines as prompts for DALL·E Mini, we get images and combine them with the .mp3 into a video clip.
In practice, there are a few things to pay attention to so that the timing of the lyrics, sound, and visuals plays together.
Let's go through the code.
We loop over the transcript entries we previously downloaded:
```
for line in transcript:
    text, start, duration = line['text'], line['start'], line['duration']
```
Given the duration of the current line and the input argument **args.sec_per_img**, we calculate how many images we need.
Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to ask for a grid of side length $\sqrt{N}$. The calculation is:
```
grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```
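Here **get_sqrt** is a project helper; one plausible implementation (an assumption, rounding up so the grid holds at least the requested number of images) is:

```
import math

def get_sqrt(n):
    # Round up: a grid of get_sqrt(n) x get_sqrt(n) holds at least n images
    return math.ceil(math.sqrt(n))
```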
Now we ask Replicate for images from DALL·E Mini:

```
images = dalle.generate_images(text, grid_size, text_adherence=3)
```
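Presumably, **generate_images** wraps the Replicate call from section 1; a rough sketch (the signature, the mapping of text_adherence onto log2_supercondition_factor, and the split_grid helper are all assumptions for illustration) might look like:

```
def generate_images(self, text, grid_size, text_adherence=3):
    # Ask DALL·E Mini for a grid_size x grid_size grid of images
    urls = self.dalle.predict(
        text=text,
        grid_size=grid_size,
        log2_supercondition_factor=text_adherence,
    )
    # Download the final grid and cut it into individual images
    # (split_grid is a hypothetical helper, not from the project)
    return split_grid(get_image(list(urls)[-1]), grid_size)
```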
If we want to generate the clip at a specific fps (a higher fps means more accurate timing, because we can switch images more frequently), we usually need to write each image to multiple consecutive frames.
The calculation I did is:

```
frames_per_image = int(duration * args.fps) // len(images)
```
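For example, a 6-second line rendered at 24 fps with 4 generated images gives int(6 * 24) // 4 = 36 frames per image.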
Now we use the **opencv** package to write the lyrics as subtitles on each frame:

```
frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
frame = put_subtitles_on_frame(frame, text, resize_factor)
frames.append(frame)
```

where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function.
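A minimal version of such a function might look like this sketch (the centering, scale, and color choices are assumptions, not the project's exact implementation):

```
import cv2

def put_subtitles_on_frame(frame, text, resize_factor=1.0):
    font = cv2.FONT_HERSHEY_SIMPLEX
    scale = 0.6 * resize_factor
    thickness = 2
    (text_w, text_h), _ = cv2.getTextSize(text, font, scale, thickness)
    h, w = frame.shape[:2]
    x = max((w - text_w) // 2, 0)  # center the text horizontally
    y = h - 2 * text_h             # place it a little above the bottom edge
    cv2.putText(frame, text, (x, y), font, scale,
                (255, 255, 255), thickness, cv2.LINE_AA)
    return frame
```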
Finally, we can write all the aggregated frames into a file:

```
video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
for frame in frames:
    video.write(frame)
video.release()
cv2.destroyAllWindows()
```
The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the parts of the song where there are no lyrics with images prompted by the last sentence or the song's name.
## 4. Sound and video mixing

Now that we have a video, we only need to mix it with the downloaded .mp3 file.
We'll use FFmpeg for this, with shell commands executed from Python.
The first of the two commands below trims the mp3 file to the length of the generated video, for cases where the lyrics don't cover the whole song. The second command muxes the two into a new file with both video and sound:
```
tmp_mp3 = f"data/{args.song_name}/tmp.mp3"
# Trim the audio to the video's length
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{tmp_mp3}'")
# Mux the video stream with the trimmed audio
os.system(f"ffmpeg -i '{vid_path}' -i '{tmp_mp3}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```
# TODO

- [ ] Fix missing-whitespace problems in subtitles
- [ ] Allow working on raw .mp3 and .srt files instead of URLs only
- [ ] Support automatically generated YouTube transcriptions
- [ ] Better timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate