---
title: Mini-dalle Video-clip Maker
emoji: 🐃
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 4.19.2
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# DALL·E Video Clip Maker

This Python project uses DALL·E Mini, through Replicate's API, to generate a photo-montage video
from a song.

Given a YouTube URL, the program extracts the audio and transcript of the video and uses the lyrics
in the transcript as text prompts for DALL·E Mini.

## Usage

The project is run with a single command:

`python3 main.py <youtube url> --token <your replicate API token>`

Example output for the video of "Here Comes the Sun" by The Beatles:

<img src="misc/frame-432.png" width="250"/> <img src="misc/frame-177.png" width="250"/>
<img src="misc/frame-316.png" width="250"/> <img src="misc/frame-633.png" width="250"/>
<img src="misc/frame-1724.png" width="250"/> <img src="misc/frame-1328.png" width="250"/>

Note that the project only works with YouTube videos that have a transcription.

# Blog Post

## 1. Interacting with the Replicate API to run DALL·E Mini

[Replicate](https://replicate.com) is a service for running open-source machine learning models in the cloud. The Replicate API lets you use any Replicate model from inside a Python script, which is the core of this project.

All of the machinery is wrapped in the `DalleImageGenerator` class in `dall_e.py`, which handles all interaction with Replicate.

Let's have a look at the code it runs in order to generate images from text.

To create an API object and specify the model we'd like to use, we first need an API token,
which is available [here](https://replicate.com/docs/api) after subscribing to Replicate.

```
import os
import replicate

os.environ["REPLICATE_API_TOKEN"] = "<your Replicate API token>"
dalle = replicate.models.get("kuprel/min-dalle")
urls = dalle.predict(
    text="<your prompt>",
    grid_size="<how many images to generate per grid side>",
    log2_supercondition_factor="<controls how closely the output follows the text>",
)
```

In this case, the model returns a list of URLs of all intermediate images generated by DALL·E Mini.

We want the final output, so we call
```
get_image(list(urls)[-1])
```
to download the last one using Python's urllib library.
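The helper itself can be very small; here is a minimal sketch (the actual `get_image` in `dall_e.py` may differ, e.g. by returning a decoded image instead of a file path):

```python
import urllib.request

def get_image(url, out_path="frame.png"):
    # Download the image bytes at `url` and write them to `out_path`.
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```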

## 2. Downloading content from YouTube
All the code in this section appears in **download_from_youtube.py**.

### Downloading the transcript

There is a very cool Python package called **YouTubeTranscriptApi** and, as its name implies, it's going to be very useful.

The **YouTubeTranscriptApi.get_transcript** function needs a YouTube video ID, so we first extract it from the video URL using urllib; the **get_video_id** function in the file does exactly that.
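A minimal version of such a helper could look like this (the real **get_video_id** may handle more URL shapes):

```python
from urllib.parse import urlparse, parse_qs

def get_video_id(url):
    parsed = urlparse(url)
    # Short links like https://youtu.be/<id> carry the ID in the path...
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    # ...while standard watch URLs carry it in the `v` query parameter.
    return parse_qs(parsed.query)["v"][0]
```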

The main lines of code to get the transcript are:

```
id = get_video_id(url)
transcript = YouTubeTranscriptApi.get_transcript(id, languages=['en'])
```
Each entry in the returned transcript is a Python dictionary with keys 'text', 'start', and 'duration',
giving a line of the lyrics, its starting time, and its duration.
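For example, iterating over a made-up transcript in that shape looks like:

```python
# Hypothetical data in the shape returned by YouTubeTranscriptApi.get_transcript
transcript = [
    {"text": "Here comes the sun", "start": 12.4, "duration": 3.1},
    {"text": "Little darling", "start": 15.5, "duration": 2.0},
]

for entry in transcript:
    # Each lyric line spans [start, start + duration] seconds of the song.
    line_end = entry["start"] + entry["duration"]
    print(f"{entry['start']:5.1f}s - {line_end:5.1f}s  {entry['text']}")
```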

### Downloading the audio

I used a library called **youtube_dl** that can download an .mp3 file with the sound of a YouTube video.

The usage is fairly simple and is wrapped in the **download_mp3** function in the file:
```
import youtube_dl
ydl_opts = {
    'outtmpl': <specify output file path>,
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
```

## 3. Making a video clip
The rest of the code is conceptually simple: using the transcript lines as prompts for DALL·E Mini, we get images and combine
them with the .mp3 into a video clip.

In practice, there are some things to pay attention to so that the timing of the lyrics, sound, and visuals play together.

Let's go through the code:

We loop over the transcript entries we previously downloaded:

```
for (text, start, end) in transcript:
```

Given the duration of the current line and the input argument args.sec_per_img, we calculate how many images we need.
Also, DALL·E Mini generates a square grid of images, so if we want N images, we need to tell it to generate a grid of
side <pre xml:lang="latex">\sqrt{N}</pre>. The calculation is:

```
    grid_size = max(get_sqrt(duration / args.sec_per_img), 1)
```
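Here, `get_sqrt` presumably rounds up to the smallest integer grid side that fits the requested number of images; a plausible sketch (the actual helper may differ):

```python
import math

def get_sqrt(n):
    # Smallest integer g with g * g >= n, so a g x g grid
    # holds at least n images.
    return math.ceil(math.sqrt(n))
```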

Now we ask Replicate for images from DALL·E Mini:
```
    images = dalle.generate_images(text, grid_size, text_adherence=3)
```
If we want to generate the movie clip at a specific fps (a higher fps means more accurate timing, because we can
change the image more frequently), we usually need to write each image for multiple frames.

The calculation I did is:
```
    frames_per_image = int(duration * args.fps) // len(images)
```
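Plugging in concrete (illustrative) numbers:

```python
fps = 24
duration = 4.0        # seconds of audio for this lyric line
num_images = 4        # images DALL-E Mini returned for this line

# 96 total frames spread over 4 images -> each image is shown for 24 frames
frames_per_image = int(duration * fps) // num_images
```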

Now, we use the **opencv** package to write the lyrics as subtitles on each frame:
```
    frame = cv2.cvtColor(images[j], cv2.COLOR_RGBA2BGR)
    frame = put_subtitles_on_frame(frame, text, resize_factor)
    frames.append(frame)
```
where **put_subtitles_on_frame** is a function in utils.py that makes use of the **cv2.putText** function.

Finally, we can write all the aggregated frames into a file:
```
    # fourcc 0 writes uncompressed frames
    video = cv2.VideoWriter(vid_path, 0, args.fps, (img_dim, img_dim))
    for frame in frames:
        video.write(frame)
    video.release()
    cv2.destroyAllWindows()
```

The code itself is in the **get_frames** function in **main.py** and is a little more elaborate. It also fills the
gaps in the song where there are no lyrics with images prompted by the last sentence or the song's name.

## 4. Sound and video mixing

Now that we have video, we only need to mix it with the downloaded .mp3 file.

We'll use FFMPEG for this with Shell commands executed from python.

The first of the two commands below cuts the .mp3 file to fit the length of the generated video in cases where the lyrics
don't cover the whole song. The second command mixes the two into a new file with video and audio:

```
os.system(f"ffmpeg -ss 00:00:00 -t {video_duration} -i '{mp3_path}' -map 0:a -acodec libmp3lame '{f'data/{args.song_name}/tmp.mp3'}'")
os.system(f"ffmpeg -i '{vid_path}' -i '{f'data/{args.song_name}/tmp.mp3'}' -map 0 -map 1:a -c:v copy -shortest '{final_vid_path}'")
```
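Since the file paths are interpolated into a shell string, quoting can be fragile; one alternative (a sketch, not the project's code) is to build argument lists and run them with subprocess.run:

```python
import subprocess

def build_trim_cmd(mp3_path, video_duration, tmp_mp3):
    # Cut the audio to the length of the generated video.
    return ["ffmpeg", "-y", "-ss", "00:00:00", "-t", str(video_duration),
            "-i", mp3_path, "-map", "0:a", "-acodec", "libmp3lame", tmp_mp3]

def build_mux_cmd(vid_path, tmp_mp3, final_vid_path):
    # Mux the trimmed audio with the generated video, copying the video stream.
    return ["ffmpeg", "-y", "-i", vid_path, "-i", tmp_mp3,
            "-map", "0", "-map", "1:a", "-c:v", "copy", "-shortest",
            final_vid_path]

# Running with subprocess.run(cmd, check=True) avoids the shell-quoting
# issues os.system has with spaces or quotes in file paths.
```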

# TODO
- [ ] Fix missing whitespace in subtitles
- [ ] Allow working on raw .mp3 and .srt files instead of URLs only
- [ ] Support automatically generated YouTube transcriptions
- [ ] Better timing of subtitles and sound
- [ ] Find a way to upload videos without copyright infringement
- [ ] Use other text-to-image models from Replicate